Goal: learn how to perform data analysis on morphology data in R.
R is a language and environment for statistical computing and graphics. R Markdown is a file format for making dynamic documents with R (comparable to typesetting systems like LaTeX). Be sure to Knit frequently so that you catch errors as soon as they appear, rather than waiting until the very end and not knowing where the problem is.
Top left-hand corner = the editor where your script is written and saved (use Cmd+Enter to run a line of code in the console). Top right-hand corner = the environment where variables and data are stored. Bottom left-hand corner = the console where commands are run (use Ctrl+L to clear it). Bottom right-hand corner = the file system where files are stored.
Click on the sheet symbol in the right-hand corner. You can create an R script or an R Markdown file. Give your new file a title and press “OK”. Then save the file in the folder you want and give it a filename. Now you’re ready to start writing code!
Packages are downloaded and installed from the Comprehensive R Archive Network (CRAN), the main repository for R packages. A package is a unit of shareable code: it bundles together code, data, documentation, and tests, and is easy to share with others. You can search online for R packages to read their documentation and see what functions they provide. Or you can use help(package = [write the package name here without brackets or quotes]) to get more information on the package under “Help”. For example,
help(package = dplyr)
Now let’s install dplyr, a package that provides a flexible grammar of data manipulation and tools for working with data frames (e.g. finding missing data). dplyr is the next iteration of another package called plyr. To install a package, use install.packages() and put the package name inside the parentheses in quotes.
#install.packages("dplyr")
If you get a “package not found” error, you will need to follow online instructions for installing packages. If not, let’s go on to install the other packages we might need to analyze our morphology data. Once you have installed your packages, you won’t need to reinstall them unless they need to be updated. To install them all at once, put them in a character vector with c() and separate them with commas.
# install.packages(c("MASS", "gridExtra", "tidyverse", "mosaic", "broom", "readr", "kableExtra"))
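Installing a package only downloads it. Each new R session (and each R Markdown document) still has to load the package with library() before its functions can be called. A minimal sketch, assuming the packages are already installed; knitr is not in the list above but normally comes with R Markdown, and which packages you load depends on which functions you use:

```r
# Load packages for this session (assumes they were installed beforehand).
library(dplyr)   # filter(), select(), and the %>% pipe
library(readr)   # read_csv()
library(knitr)   # kable(); assumed to be present because R Markdown depends on it
```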
If an option needs to be set to the same value in multiple code chunks, consider setting it globally in the first code chunk of your document. Give that chunk the header {r setup, include=FALSE} and call knitr::opts_chunk$set() inside it.
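As a sketch of what could go inside that setup chunk: knitr's opts_chunk$set() changes the default options for every chunk that follows. The option values below are illustrative choices, not settings taken from this tutorial.

```r
# Contents of the first chunk, whose header is {r setup, include=FALSE}.
# The option values are illustrative assumptions; pick the ones your document needs.
knitr::opts_chunk$set(
  echo = TRUE,      # show the code in the knitted output
  message = FALSE,  # hide package start-up and parsing messages
  warning = FALSE   # hide warnings
)
```

Any single chunk can still override these defaults in its own header.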
Let’s make a folder in the file system and place all graphs, files, and data related to the project in there. Now, let’s read the data!
Upload the data. Click on the “Upload” icon in the Files pane and upload the file into the directory you are working in. In my case, I am working in the ‘/cloud/project/Morphology/’ directory. The directory path points to a file system location.
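The folder can also be created and checked from the console instead of the Files pane. A minimal sketch using base R only; the folder name "Morphology" simply mirrors the path above, and it is created relative to whatever your current working directory happens to be:

```r
# "Morphology" mirrors the folder name in the path above; change it to suit your project.
project_dir <- "Morphology"

# Create the folder only if it does not exist yet.
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

getwd()                  # the directory R is currently working in
list.files(project_dir)  # files currently inside the project folder
```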
Read the data using read_csv() from the readr package and store it in a variable. To see the data, write the variable name and run the code. See what happens when you Knit it.
morphology_data <- read_csv("mate_trials_summer_2019.csv")
Parsed with column specification:
cols(
ID_num = col_double(),
TgroupID = col_character(),
GgroupID = col_character(),
sex = col_character(),
beak = col_double(),
thorax = col_double(),
wing = col_double(),
body = col_double(),
w_morph = col_character(),
recorder = col_character(),
computer = col_character(),
date_recorded_by_hand = col_character(),
data_recorded_on_excel = col_character(),
notes = col_character()
)
head(morphology_data)
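The “Parsed with column specification” message is read_csv() reporting the column types it guessed. If you would rather state the types yourself, read_csv() accepts a col_types argument. The sketch below is an illustration only: so that it can run on its own, it writes a three-row sample file to a temporary location (the values are the first ID_num, sex, and beak entries of this dataset) and reads that instead of the real file.

```r
library(readr)

# Write a tiny sample CSV so the example is self-contained.
# The three rows are the first values of these columns in the dataset.
sample_file <- tempfile(fileext = ".csv")
writeLines(c(
  "ID_num,sex,beak",
  "475,F,6.29",
  "268,F,8.44",
  "261,F,8.44"
), sample_file)

# Declare the column types up front instead of letting read_csv() guess.
sample_data <- read_csv(
  sample_file,
  col_types = cols(
    ID_num = col_double(),
    sex    = col_character(),
    beak   = col_double()
  )
)

head(sample_data, n = 2)  # show only the first two rows
```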
You’ll see when you Knit it that ALL the data will be printed out, which is neither visually helpful nor efficient. So, let’s use some data inspection functions like head(), summary(), glimpse(), and str() to find out more about the data. Use names() to see the column names. What do you see?
glimpse(morphology_data)
Observations: 230
Variables: 14
$ ID_num <dbl> 475, 268, 261, 261, 284, 327, 247, 2…
$ TgroupID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ GgroupID <chr> "T1", "T2", "T3", "T3", "T4", "T5", …
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "…
$ beak <dbl> 6.29, 8.44, 8.44, 8.55, 8.42, 8.82, …
$ thorax <dbl> 3.69, 3.73, 3.68, 3.56, 3.83, 3.79, …
$ wing <dbl> 7.81, 10.20, 9.57, 9.49, 9.54, 10.20…
$ body <dbl> 11.22, 14.00, 13.21, 13.13, 13.15, 1…
$ w_morph <chr> "S", "L", "S", "S", "L", "L", "L", "…
$ recorder <chr> "A", "A", "A", "A", "A", "A", "A", "…
$ computer <chr> "yes", "yes", "yes", "no", "yes", "y…
$ date_recorded_by_hand <chr> "09.11.19", "09.11.19", "09.11.19", …
$ data_recorded_on_excel <chr> "09.11.19", "09.11.19", "09.11.19", …
$ notes <chr> NA, NA, "too big for microscope", NA…
names(morphology_data)
[1] "ID_num" "TgroupID"
[3] "GgroupID" "sex"
[5] "beak" "thorax"
[7] "wing" "body"
[9] "w_morph" "recorder"
[11] "computer" "date_recorded_by_hand"
[13] "data_recorded_on_excel" "notes"
summary(morphology_data)
ID_num TgroupID GgroupID
Min. :200 Length:230 Length:230
1st Qu.:273 Class :character Class :character
Median :370 Mode :character Mode :character
Mean :366
3rd Qu.:447
Max. :559
sex beak thorax wing
Length:230 Min. :4.69 Min. :2.60 Min. : 2.35
Class :character 1st Qu.:5.68 1st Qu.:3.19 1st Qu.: 7.85
Mode :character Median :6.26 Median :3.38 Median : 8.82
Mean :6.54 Mean :3.42 Mean : 8.46
3rd Qu.:7.29 3rd Qu.:3.68 3rd Qu.: 9.51
Max. :9.34 Max. :4.11 Max. :11.47
NA's :1 NA's :1 NA's :3
body w_morph recorder
Min. : 6.84 Length:230 Length:230
1st Qu.:11.04 Class :character Class :character
Median :11.87 Mode :character Mode :character
Mean :11.69
3rd Qu.:12.96
Max. :14.74
NA's :3
computer date_recorded_by_hand data_recorded_on_excel
Length:230 Length:230 Length:230
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
notes
Length:230
Class :character
Mode :character
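Note that summary() reports NA counts only for the numeric columns. A quick way to count missing values in every column, including character ones, is colSums(is.na(...)). The sketch below uses a three-row stand-in (the name first_rows is made up) built from the first values shown by glimpse(); with the real data you would pass morphology_data instead.

```r
# Stand-in data: first three rows of four columns, copied from the glimpse() output.
first_rows <- data.frame(
  ID_num = c(475, 268, 261),
  sex    = c("F", "F", "F"),
  beak   = c(6.29, 8.44, 8.44),
  notes  = c(NA, NA, "too big for microscope"),
  stringsAsFactors = FALSE
)

str(first_rows)                          # compact structure: column types and first values
na_counts <- colSums(is.na(first_rows))  # number of missing values per column
na_counts
```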
Use select() to pick out the columns and filter() to pick out the rows that meet the conditions you’re looking for. Let’s say we want to compare beak length between males and females. We can create a filtered dataset using a technique in R called ‘piping’: the pipe operator %>% takes the result of the code on its left and passes it as the first argument to the function on the next line.
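To make the pipe concrete, here is a small sketch. The vector name beak_lengths is made up for illustration, and its six values are the first beak measurements shown by glimpse() above. Writing x %>% f(y) is the same as writing f(x, y):

```r
library(dplyr)  # provides the %>% pipe

# First six beak measurements from the glimpse() output above.
beak_lengths <- c(6.29, 8.44, 8.44, 8.55, 8.42, 8.82)

# Nested call: read from the inside out.
nested <- round(mean(beak_lengths), 2)

# Piped call: read from left to right, one step at a time.
piped <- beak_lengths %>% mean() %>% round(2)

identical(nested, piped)  # TRUE: both forms compute the same value
```

The same pattern is applied to the morphology data next.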
female_data <- morphology_data %>%
  select(sex, beak) %>%
  # Keep females and drop rows with a missing beak measurement
  filter(sex == "F", !is.na(beak))
f_mean <- mean(female_data[["beak"]])
f_sd <- sd(female_data[["beak"]])
f_max <- max(female_data[["beak"]])
f_min <- min(female_data[["beak"]])
# Same for males: keep males and drop rows with a missing beak measurement
male_data <- morphology_data %>%
  select(sex, beak) %>%
  filter(sex == "M", !is.na(beak))
m_mean <- mean(male_data[["beak"]])
m_sd <- sd(male_data[["beak"]])
m_max <- max(male_data[["beak"]])
m_min <- min(male_data[["beak"]])
# Let's build a summary table (a matrix) from scratch with these statistics.
summary_table <- matrix(
  c(f_mean, f_sd, f_max, f_min,
    m_mean, m_sd, m_max, m_min),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("female", "male"),
                  c("mean", "sd", "max", "min"))
)
kable(summary_table)
sex | mean | sd | max | min |
---|---|---|---|---|
female | 7.641 | 0.9773 | 9.34 | 5.77 |
male | 5.781 | 0.4585 | 6.91 | 4.69 |
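The same table can be produced in one step with dplyr's group_by() and summarise(), which avoids repeating the code for each sex. This is an alternative sketch, not the approach used above; summarise_beak_by_sex() and the beak_* column names are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical helper: summarise beak length separately for each sex.
# Expects a data frame with columns `sex` and `beak`.
summarise_beak_by_sex <- function(data) {
  data %>%
    filter(!is.na(beak)) %>%   # drop rows with a missing beak measurement
    group_by(sex) %>%          # one group per value of sex
    summarise(
      beak_mean = mean(beak),
      beak_sd   = sd(beak),
      beak_max  = max(beak),
      beak_min  = min(beak)
    )
}

# Usage with the data read earlier:
# summarise_beak_by_sex(morphology_data)
```

The result is a data frame with one row per sex, so it can be passed straight to kable().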
To help visualize the data, we can plot histograms and boxplots, which let us see the distribution of the measurements. To do so, use hist() and boxplot():
hist(female_data[["beak"]])
hist(male_data[["beak"]])
boxplot(female_data[["beak"]])
boxplot(male_data[["beak"]])
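The separate plots above each get their own axis, which makes the sexes hard to compare. One option is base R's formula interface, which draws both groups on a shared axis. A sketch under the assumption that the data frame has the sex and beak columns shown earlier; plot_beak_by_sex() is a hypothetical helper name, and the axis labels are my own wording.

```r
# plot_beak_by_sex() is a hypothetical helper name, not part of any package.
plot_beak_by_sex <- function(data) {
  # The formula beak ~ sex draws one box per value of sex on a shared axis.
  boxplot(beak ~ sex, data = data,
          xlab = "Sex", ylab = "Beak length",
          main = "Beak length by sex")
}

# Usage with the data read earlier:
# plot_beak_by_sex(morphology_data)
```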