###1. Be able to import a dataset into R
It’s great to be able to make data frames, but it would be even better if we could import data into R. We can! R handles many types of information. I created a csv (comma separated value) file for us to use as practice data.
First, remember to add comments at the top of your script that give your name, the date, and the title of this week’s work.
Second, save your R script file often.
Third, think about what you are typing (and telling the program to do). This is the best way to learn, and remember quickly. Copy and paste is fine, but you have to understand what you are doing.
Now we are ready to begin. Normally, we would need to download the data file and put this file into the same folder in which you made your working directory. However, with Rstudio.cloud, I can do this for you. Thus, there is a file called “Finches_Dataset_BIO205Class.csv” for this week’s lab already.
1a. Load the dataset using the read.csv() function and assign it the object name “finch_data”. Note that the name of the file is in quotes, and must include the .csv file type. Remember, syntax matters in R.
finch_data <- read.csv("Finches_Dataset_BIO205Class.csv")
Great. We assigned this dataset to the object name finch_data. First, before we use the dataset, let’s look at it with a traditional “sheet” view, using the View() function. Type “View(finch_data)”
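If read.csv() complains that it “cannot open file”, the usual cause is that the file is not in your working directory, or the name is spelled slightly differently. Here is a minimal sketch of how you might check, assuming the file name given above. None of these lines change anything; they only report what R can see.

```r
# File name as given in this lab
csv_name <- "Finches_Dataset_BIO205Class.csv"

# Where is R currently looking for files?
getwd()

# Is the file there? TRUE means read.csv(csv_name) should be able to find it
file.exists(csv_name)

# List every .csv file in the working directory, to spot spelling differences
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, compare the spelling (including capital letters) against what list.files() shows.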
1b. We can find out about our dataset with a couple of functions. First, let’s use the str() function, for structure.
str(finch_data)
## 'data.frame': 100 obs. of 12 variables:
## $ Band : int 9 12 276 278 283 288 293 294 298 307 ...
## $ Species : chr "Geospiza fortis" "Geospiza fortis" "Geospiza fortis" "Geospiza fortis" ...
## $ Sex : chr "unknown" "female" "unknown" "unknown" ...
## $ First.adult.year: int 1975 1975 1976 1976 1976 1976 1976 1976 1976 1975 ...
## $ Last.Year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
## $ Survivor : chr "No" "No" "No" "No" ...
## $ Weight..g. : num 14.5 13.5 16.4 18.5 17.4 ...
## $ Wing..mm. : num 67 66 64.2 67.2 70.2 ...
## $ Tarsus..mm. : num 18 18.3 18.5 19.3 19.3 ...
## $ Beak.Length..mm.: num 9.2 9.5 9.93 11.13 12.13 ...
## $ Beak.Depth..mm. : num 8.3 7.5 8 10.6 11.2 9.1 9.5 10.5 8.4 8.6 ...
## $ Beak.Width..mm. : num 8.1 7.5 7.6 9.4 9.5 8.8 8.9 9.1 8.2 8.4 ...
This gives you a lot of information. First, the top line tells you that there are 100 obs and 12 variables. This means that there were 100 finches, and 12 types of data were collected on these finches. The 12 types of data are variables.
Some of the variables are independent variables, including Species, Sex, First.adult.year, Last.Year, and Survivor. Others are dependent variables, including Weight..g., Wing..mm., Tarsus..mm., Beak.Length..mm., Beak.Depth..mm., and Beak.Width..mm.
Note, the Weight..g. should be read as Weight (g), but you need to type it as it is shown.
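Why the extra dots? By default, read.csv() converts column headers into names that are valid in R, and spaces and parentheses become dots. Here is a small sketch that shows this using a tiny made-up csv file (two columns, two rows; this is not the class dataset) written to a temporary location.

```r
# Write a tiny, made-up csv to a temporary file (not the class dataset)
tmp <- tempfile(fileext = ".csv")
writeLines(c('Band,"Weight (g)"', "9,14.5", "12,13.5"), tmp)

# Default behaviour: headers are converted to syntactically valid R names
mini <- read.csv(tmp)
names(mini)
# [1] "Band"       "Weight..g."

# With check.names = FALSE the original header text is kept as-is
mini_raw <- read.csv(tmp, check.names = FALSE)
names(mini_raw)
# [1] "Band"       "Weight (g)"
```

For this lab, stick with the default names (with the dots), since they work directly with the $ symbol.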
1c. Using the head() function, we can find out the first 6 rows of our data. Note, tail() gives us the last six.
head(finch_data)
## Band Species Sex First.adult.year Last.Year Survivor Weight..g.
## 1 9 Geospiza fortis unknown 1975 1977 No 14.50
## 2 12 Geospiza fortis female 1975 1977 No 13.50
## 3 276 Geospiza fortis unknown 1976 1977 No 16.44
## 4 278 Geospiza fortis unknown 1976 1977 No 18.54
## 5 283 Geospiza fortis male 1976 1977 No 17.44
## 6 288 Geospiza fortis unknown 1976 1977 No 16.34
## Wing..mm. Tarsus..mm. Beak.Length..mm. Beak.Depth..mm. Beak.Width..mm.
## 1 67.00 18.00 9.20 8.3 8.1
## 2 66.00 18.30 9.50 7.5 7.5
## 3 64.19 18.47 9.93 8.0 7.6
## 4 67.19 19.27 11.13 10.6 9.4
## 5 70.19 19.27 12.13 11.2 9.5
## 6 71.19 20.27 10.63 9.1 8.8
tail(finch_data)
## Band Species Sex First.adult.year Last.Year Survivor Weight..g.
## 95 1372 Geospiza fortis female 1976 1981 Yes 16.64
## 96 1797 Geospiza fortis male 1976 1982 Yes 16.67
## 97 2378 Geospiza fortis male 1976 1981 Yes 18.07
## 98 8190 Geospiza fortis unknown 1976 1981 Yes 15.60
## 99 316 Geospiza fortis male 1973 1982 Yes 17.55
## 100 710 Geospiza fortis male 1975 1982 Yes 15.00
## Wing..mm. Tarsus..mm. Beak.Length..mm. Beak.Depth..mm. Beak.Width..mm.
## 95 69.01 18.16 10.43 9.48 8.54
## 96 69.45 19.21 10.53 9.31 8.37
## 97 70.95 21.06 11.23 9.86 8.67
## 98 69.47 18.36 11.23 9.28 8.24
## 99 67.50 19.55 10.90 9.85 9.20
## 100 69.00 19.00 10.50 10.00 8.70
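Both head() and tail() accept an n argument if you want a different number of rows, and nrow(), ncol() and dim() report the size of the table. The sketch below uses a small stand-in data frame called finch_mini (a name I made up), built from the first six finches and four of the columns shown in the head() output above, so that it runs on its own.

```r
# First six finches, copied from the head() output above (four columns only)
finch_mini <- data.frame(
  Band = c(9L, 12L, 276L, 278L, 283L, 288L),
  Sex = c("unknown", "female", "unknown", "unknown", "male", "unknown"),
  Weight..g. = c(14.50, 13.50, 16.44, 18.54, 17.44, 16.34),
  Beak.Depth..mm. = c(8.3, 7.5, 8.0, 10.6, 11.2, 9.1),
  stringsAsFactors = FALSE
)

head(finch_mini, n = 2)   # first two rows only
tail(finch_mini, n = 2)   # last two rows only
nrow(finch_mini)          # number of rows (finches): 6
ncol(finch_mini)          # number of columns (variables): 4
dim(finch_mini)           # rows and columns together: 6 4
```

The same calls work on finch_data; just swap in the object name.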
1d. Remember that we talked about variable types in lecture. Let’s see what type of variables we have in this dataset. For this, we use the class() function.
class(finch_data$Weight..g.)
## [1] "numeric"
The result you got should be [1] “numeric”
If you recall from data types, this is measurement data, and is numeric.
This is important. Notice that I used the dollar sign to indicate that I am specifically interested in the “Weight..g.” column of data within the “finch_data” object. You will be using the dollar sign symbol a lot to call up specific columns. Also, as you start typing “[object name]$” (for example, “finch_data$”), the columns will start appearing as possible options.
Try finding the variable type of the “Sex” column using the class function and add the results as a comment on your script.
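If you ever want the class of every column at once, rather than one column at a time, one option is sapply(), which applies a function to each column of a data frame. This is a sketch on a small stand-in data frame (finch_mini is a made-up name; the values are the first three finches from the head() output above).

```r
# Three finches and three columns copied from the head() output above
finch_mini <- data.frame(
  Band = c(9L, 12L, 276L),
  Sex = c("unknown", "female", "unknown"),
  Weight..g. = c(14.50, 13.50, 16.44),
  stringsAsFactors = FALSE
)

# Apply class() to every column and collect the results
sapply(finch_mini, class)
#        Band         Sex  Weight..g. 
#   "integer" "character"   "numeric"
```

This gives the same information as the chr / int / num labels in the str() output, just in a different layout.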
1e. Using the summary() function, we can find out some stats on our data, column by column. Each column is one of the variables in your dataset.
summary(finch_data)
## Band Species Sex First.adult.year
## Min. : 9.0 Length:100 Length:100 Min. :1973
## 1st Qu.: 421.5 Class :character Class :character 1st Qu.:1975
## Median : 613.5 Mode :character Mode :character Median :1975
## Mean :1174.0 Mean :1975
## 3rd Qu.:1588.2 3rd Qu.:1976
## Max. :8191.0 Max. :1976
## Last.Year Survivor Weight..g. Wing..mm.
## Min. :1977 Length:100 Min. :13.00 Min. :64.00
## 1st Qu.:1977 Class :character 1st Qu.:15.00 1st Qu.:67.00
## Median :1978 Mode :character Median :16.24 Median :68.19
## Mean :1978 Mean :16.35 Mean :68.54
## 3rd Qu.:1978 3rd Qu.:17.44 3rd Qu.:70.25
## Max. :1982 Max. :21.24 Max. :74.01
## Tarsus..mm. Beak.Length..mm. Beak.Depth..mm. Beak.Width..mm.
## Min. :17.05 Min. : 8.70 Min. : 7.500 Min. : 7.400
## 1st Qu.:18.49 1st Qu.:10.20 1st Qu.: 8.795 1st Qu.: 8.200
## Median :19.13 Median :10.80 Median : 9.305 Median : 8.600
## Mean :19.19 Mean :10.79 Mean : 9.392 Mean : 8.641
## 3rd Qu.:20.00 3rd Qu.:11.25 3rd Qu.:10.100 3rd Qu.: 9.055
## Max. :21.06 Max. :12.73 Max. :11.210 Max. :10.070
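Notice that for the character columns (Species, Sex, Survivor), summary() only reports the length and class. For categories like these, counting how many times each value appears is usually more informative, and table() does that. Here is a small sketch using the Sex values of the first six finches from the head() output above.

```r
# Sex of the first six finches, copied from the head() output above
sex <- c("unknown", "female", "unknown", "unknown", "male", "unknown")

# summary() on text only gives length, class, and mode
summary(sex)

# table() counts how many times each category appears
table(sex)
# sex
#  female    male unknown 
#       1       1       4
```

On the full dataset, table(finch_data$Sex) would give the counts for all 100 finches.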
QUESTIONS Try to answer the following questions and leave answers as comments on your R script.
1. How might you get all the numbers of a column, such as the weight of the finches (column is Weight..g.)?
2. How do you get the mean of Weight..g. using the mean() function?
3. Using the subsetting we learned in the R tutorial, how do you get the values greater than the mean?
4. How would you assign those values in question 3 to a new object name? Run the object.
###2. Manipulating your dataset
I have included a dataset that is pretty “clean.” This means we don’t have to do much to make it ready for analysis. However, you usually get data (from the CDC, national databases, etc.) that isn’t clean and is filled with missing values or partial information. This requires cleaning, but that will not be the focus today. Thus, the finch dataset is pre-cleaned, and ready to use.
Let’s begin to work with the dataset.
Based on your variables (columns) and your background reading, we know that this data examines physical traits of finches under a set of conditions.
#Display the eleventh column by using [ ,column number]
finch_data[,11]
## [1] 8.30 7.50 8.00 10.60 11.20 9.10 9.50 10.50 8.40 8.60 9.20 8.80
## [13] 8.50 8.00 9.70 8.40 7.90 9.30 7.70 8.50 8.20 9.70 10.30 10.20
## [25] 8.90 9.60 7.85 9.60 9.80 8.80 9.00 9.10 9.20 8.80 9.20 8.80
## [37] 9.40 8.30 8.40 10.20 9.30 10.20 10.50 9.00 9.80 9.30 7.60 10.50
## [49] 9.70 8.60 9.80 8.50 10.30 9.90 8.80 10.10 8.20 8.00 8.90 9.10
## [61] 9.80 10.10 8.55 9.30 10.00 10.70 9.10 8.80 10.40 10.70 9.15 11.20
## [73] 10.50 9.70 8.90 10.10 8.90 9.60 8.50 10.08 9.45 8.31 9.80 9.70
## [85] 10.38 10.61 8.38 10.78 11.01 10.68 8.78 10.28 10.86 11.21 9.48 9.31
## [97] 9.86 9.28 9.85 10.00
To call a single column by name, we can either use brackets or the “$” symbol, as mentioned earlier. This was shown to you before briefly. This is actually the easiest way to call up columns, and you should use this method.
# Display a column by name using the $ symbol. Here, let's get the eleventh column again
finch_data$Beak.Depth..mm.
## [1] 8.30 7.50 8.00 10.60 11.20 9.10 9.50 10.50 8.40 8.60 9.20 8.80
## [13] 8.50 8.00 9.70 8.40 7.90 9.30 7.70 8.50 8.20 9.70 10.30 10.20
## [25] 8.90 9.60 7.85 9.60 9.80 8.80 9.00 9.10 9.20 8.80 9.20 8.80
## [37] 9.40 8.30 8.40 10.20 9.30 10.20 10.50 9.00 9.80 9.30 7.60 10.50
## [49] 9.70 8.60 9.80 8.50 10.30 9.90 8.80 10.10 8.20 8.00 8.90 9.10
## [61] 9.80 10.10 8.55 9.30 10.00 10.70 9.10 8.80 10.40 10.70 9.15 11.20
## [73] 10.50 9.70 8.90 10.10 8.90 9.60 8.50 10.08 9.45 8.31 9.80 9.70
## [85] 10.38 10.61 8.38 10.78 11.01 10.68 8.78 10.28 10.86 11.21 9.48 9.31
## [97] 9.86 9.28 9.85 10.00
This shows all the values in the 11th column, which was the beak depth (mm) of all the finches in the dataset. Notice that it might be easier to call up columns by name rather than by number.
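Two related tricks: names() tells you which column sits at which position, and brackets also accept a column name in quotes instead of a number. Below is a sketch on a small stand-in data frame (finch_mini is a made-up name; values are the first three finches from the head() output earlier).

```r
# Three finches and three columns copied from the head() output earlier
finch_mini <- data.frame(
  Band = c(9L, 12L, 276L),
  Weight..g. = c(14.50, 13.50, 16.44),
  Beak.Depth..mm. = c(8.3, 7.5, 8.0)
)

# All column names, in order
names(finch_mini)

# Position of one column (3 in this small table; 11 in finch_data)
which(names(finch_mini) == "Beak.Depth..mm.")

# Brackets with a quoted column name give the same values as $
finch_mini[, "Beak.Depth..mm."]
finch_mini$Beak.Depth..mm.
```

The quoted-name form is handy later on, when a column name is stored in another object.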
We can also get individual rows. In this case, it is the information of individual finches.
#Display the twentieth row by using [number, ]
finch_data[20,]
## Band Species Sex First.adult.year Last.Year Survivor Weight..g.
## 20 356 Geospiza fortis female 1973 1977 No 16
## Wing..mm. Tarsus..mm. Beak.Length..mm. Beak.Depth..mm. Beak.Width..mm.
## 20 69 18.5 10.1 8.5 8.1
As you can see, we get all the information of the 20th row, which is finch 356.
As you can imagine, we can also combine row and column information.
# We can combine rows and columns.
# Display just the 20th element of the 11th column. both commands below work!
finch_data[20,11]
## [1] 8.5
finch_data$Beak.Depth..mm.[20]
## [1] 8.5
From this, we can see that finch 356 (20th row) has a beak depth of 8.5mm.
All of these commands lead to the same answer. Brackets might be quicker for short tables, but you can imagine that if you had thousands of rows, counting positions could become cumbersome. Usually, using the $ is better, because the column name says exactly what you want.
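Brackets are not limited to one row or one column at a time. You can ask for a range with the colon, or several specific positions or names with c(). This is a sketch on a small stand-in data frame (finch_mini is a made-up name; values are the first six finches from the head() output earlier).

```r
# First six finches, three columns, copied from the head() output earlier
finch_mini <- data.frame(
  Band = c(9L, 12L, 276L, 278L, 283L, 288L),
  Sex = c("unknown", "female", "unknown", "unknown", "male", "unknown"),
  Beak.Depth..mm. = c(8.3, 7.5, 8.0, 10.6, 11.2, 9.1),
  stringsAsFactors = FALSE
)

# Rows 1 through 3, all columns
finch_mini[1:3, ]

# Rows 2 and 5, but only the Band and Sex columns
finch_mini[c(2, 5), c("Band", "Sex")]
```

Leaving the space before or after the comma empty means “all rows” or “all columns”, just as in the single-row example above.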
You can also use a combination of information to call up a specific number. For example, what if your research advisor comes in and is specifically interested in the beak depth of bird 356 (Band ID of 356)? Sure, you can look at the table and scroll down to the band number (if it is in numerical order, that’s easier; if it’s not in numerical order, good luck!). So, the power of this is to call up the information you want, easily.
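So, what we can do is call up the column of the data we want (beak depth), then we subset using [] and say, only when the Band ID is equal to 356. To do that, we say finch_data$Band == 356. Note the double equals sign, which asks “is this equal to?”. Because Band is stored as whole numbers, no quotes are needed around 356.
finch_data$Beak.Depth..mm.[finch_data$Band == 356]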
## [1] 8.5
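It helps to see what is happening inside the brackets. The comparison produces one TRUE or FALSE per finch, and R keeps only the values where it is TRUE. Here is a minimal sketch using plain vectors with the first six finches from the head() output earlier (the object names band, weight, depth, is_278 and heavy_and_deep are just ones I picked for illustration).

```r
# First six finches, copied from the head() output earlier
band   <- c(9L, 12L, 276L, 278L, 283L, 288L)
weight <- c(14.50, 13.50, 16.44, 18.54, 17.44, 16.34)
depth  <- c(8.3, 7.5, 8.0, 10.6, 11.2, 9.1)

# The comparison gives one TRUE/FALSE per finch
is_278 <- band == 278
is_278
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE

# Using it inside brackets keeps only the TRUE positions
depth[is_278]
# [1] 10.6

# which() reports the row position of the TRUE values
which(is_278)
# [1] 4

# Two conditions can be combined with &: both must be TRUE
heavy_and_deep <- weight > 16 & depth > 9
band[heavy_and_deep]
# [1] 278 283 288
```

The same pattern works with text columns; there the value you compare against does need quotes.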
QUESTIONS Try to answer the following questions and leave answers as comments on your R script.
1. Can you call up the beak depth of finch 687?
2. How about just female finches?
3. Challenge: Can you get the Band ID of only the finches that are female AND survived? (Hint: you can use the & to combine operations within the subsetting brackets.)
Review
Let’s summarize real quick what you should have learned (and should be able to do).
* know how to open a csv file in R, using the read.csv() function
* know how to quickly find summary information about the dataset, using the str(), head() and summary() functions
* know how to call up specific rows, columns, or individual values in a dataset
It’s nice to see some graphs, and have a visual image of the data before we try to do any statistics. Let’s do some simple graphing. One of R’s most powerful features is its plotting ability and options. The default options are a great place to start, and they can be modified readily to produce exactly the plots that you envision. Right now, we will use the plots that are part of R’s default package. We will use some advanced graphing later.
Last week, we already made scatterplots using the plot() function. Let’s try this again using some simple relationships, where we are examining how the change in one variable affects the other variable. Let’s say we are interested in whether wing length changes as the weight of the finch changes.
QUESTIONS Try to answer the following questions and leave answers as comments on your R script.
1. Come up with a) biological null and alternative hypotheses and b) statistical null and alternative hypotheses.
2. What are the independent and dependent variables?
3. Try using the plot() function from last week to graph wing length by weight.
Put your answer in the comments, and save the graph. Submit the graph to CANVAS.
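In RStudio, the Export button in the Plots pane is the simplest way to save a graph. If you would rather save it from your script, one option is to open a file “device”, draw the plot, and then close the device. This is only a sketch: the x and y values are made up, and the file is written to a temporary folder under a file name I chose for illustration.

```r
# Made-up file location for illustration; use your own file name in practice
out_file <- file.path(tempdir(), "example_plot.pdf")

# Open a PDF file to draw into
pdf(out_file)

# Draw a plot (made-up values, not the finch data)
plot(c(1, 2, 3, 4), c(2, 4, 3, 5),
     xlab = "x (made-up values)",
     ylab = "y (made-up values)")

# Close the file; the plot is not saved until you do this
dev.off()

# Check that the file was created
file.exists(out_file)
```

Forgetting dev.off() is the most common mistake here, because later plots then keep going into the file instead of the Plots pane.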
Now, let’s visualize data using a slightly different type of graph, called a boxplot. Boxplots, also called box and whisker plots, are nice because they tell you some information about the data. Specifically, they tell you the spread of the data, which we will talk more about next time. To use the boxplot() function, we simply put our vector of data into the parentheses. Try doing the following:
# Plot the beak depth data using a boxplot
boxplot(finch_data$Beak.Depth..mm.)
Boxplots allow us to see several properties of the data. This is shown in the figure below. The box refers to the middle part, where the middle 50% of the data lie, or the region between the lower (25%) quartile and the upper (75%) quartile. The middle line is the median. Note that the mean is not shown in the default boxplot. The whiskers often represent the minimum and maximum values in the dataset. Note that in some programs, the whiskers represent other percentages, such as 5% and 95%. In R, by default, each whisker reaches the most extreme data point that lies within 1.5 times the length of the box from the edge of the box; any points beyond that are drawn as separate dots (outliers). When there are no such points, the whiskers are the min and max. In the figure, each datapoint is shown by the small red circles. You can see there are 5 dots in each quartile.
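If you want to see the numbers R uses to draw a boxplot, the boxplot.stats() function returns them without making a plot. Here is a sketch with ten made-up values, one of which (30) is far from the rest. By default, R’s whiskers stop at the most extreme point within 1.5 times the box length of the box, so this example shows a case where the upper whisker is not the maximum.

```r
# Ten made-up values; 30 is far away from the others
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Numbers used to draw the boxplot (no plot is produced)
stats <- boxplot.stats(x)

# Lower whisker, bottom of box, median, top of box, upper whisker
stats$stats
# [1] 1.0 3.0 5.5 8.0 9.0

# Points drawn as separate dots beyond the whiskers
stats$out
# [1] 30

# The true maximum is still 30, even though the whisker stops at 9
max(x)
# [1] 30
```

Running boxplot(x) on the same values draws the whisker at 9 and a lone dot at 30.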
Great! But, we probably want to split this data up by an independent variable, or different groups of finches. Otherwise, we are just looking at all the data. For example, we can look at whether finches of different sex have different beak depths. How might we do this? To do this we will use the operator “~” (a tilde, usually above the tab key) to denote that we are entering a formula. That is, we are telling R to consider what we input as Dependent Variable ~ Independent Variable (also sometimes called the Response Variable ~ Explanatory Variable).
Here, the dependent or response variable of interest is Beak Depth. The independent or explanatory variable is Sex. So what we want is to know beak depths, but depths dependent on the independent variable sex. What would this command look like?
# Plot the beak depth data using a boxplot, but separate the data by Sex
boxplot(finch_data$Beak.Depth..mm. ~ finch_data$Sex)
# you can also do it this way: boxplot(Beak.Depth..mm. ~ Sex, data = finch_data)
Look, you made your first boxplot with interesting data. Congrats!! This is quite an accomplishment.
Making graphs prettier. Let’s make this a little easier on the eyes! To do this, you add arguments to the function. For more, remember you can always go to ?boxplot for the documentation file. The arguments we will add are a title, a better x axis label, a better y axis label, and a color. See below.
boxplot(finch_data$Beak.Depth..mm. ~ finch_data$Sex,
main = "Differences in beaks in male and female finches",
xlab = "Sex",
ylab = "Beak depth (mm)",
col = "red")
Above, we talked about what the parts of a boxplot refer to. Essentially, a boxplot is split into quartiles. The middle line is the median, and 50% of your data around the median is found inside the box. The summary() function gives you some of the information contained in the boxplot. The numbers match up to the boxplot values.
# what are the quartile values of the boxplot
summary(finch_data$Beak.Depth..mm.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.500 8.795 9.305 9.392 10.100 11.210
# for the data specific to each Sex, you have to use the subsetting options we discussed previously. try it...
summary(finch_data$Beak.Depth..mm.[finch_data$Sex=="female"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.500 8.450 9.480 9.239 9.850 10.610
summary(finch_data$Beak.Depth..mm.[finch_data$Sex=="male"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.600 9.000 9.600 9.639 10.440 11.210
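Subsetting one group at a time works well for two or three groups. As an alternative, tapply() applies a function to a column separately for every group in one line. This is a sketch using the Sex and beak depth values of the first six finches from the head() output earlier, so the numbers will not match the full-dataset summaries above.

```r
# Sex and beak depth of the first six finches, copied from the head() output
sex   <- c("unknown", "female", "unknown", "unknown", "male", "unknown")
depth <- c(8.3, 7.5, 8.0, 10.6, 11.2, 9.1)

# Median beak depth for each sex (the middle line of each box)
med <- tapply(depth, sex, median)
med
#  female    male unknown 
#     7.5    11.2     8.7

# Full summary() for each sex, returned as a list with one entry per group
tapply(depth, sex, summary)
```

The pattern is tapply(values, groups, function), so on the full dataset the first two arguments would be columns of finch_data.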
We haven’t discussed statistical testing yet (and won’t for the first few weeks). So far, we have done some basic data analysis and graphing. These are a useful first step, to get a visual of your data. In the future, we will examine how statistics, p values, etc., help support our hypothesis.
1. Complete the questions in red above.
2. Complete the prompt or answer the following questions on the dataset. Write your answers as comments.
Graph, using a boxplot, the beak depths of survivor and non-survivor finches. The Grants thought that beak size might be a favorable trait for survival, since it might help finches eat diverse foods. Did their data show this? If we look at beak depth in survivors and non-survivors, then we might start answering this question.
2a. Write biological null and alternative hypotheses for the Grants’ question.
2b. Write statistical null and alternative hypotheses for the Grants’ question.
2c. In this experiment, what are the independent and dependent variables.
2d. For your independent and dependent variables, identify what TYPE OF DATA they are.
2e. Make a boxplot, splitting up the data by survival status.
2f. Get the actual quartile values of beak depth from both survivors and non-survivors.
If you got this far, you’ve been amazing. If you need help, that’s ok! I am available to help.