Lesson 1 | 17 Jan 2020

This lesson’s goal.

To learn how to perform data analysis for morphology data in R.

What are R, RStudio, and R Markdown?

R is a language and environment for statistical computing and graphics. RStudio is an integrated development environment (IDE) for writing and running R code. R Markdown is a file format for making dynamic documents with R (similar to typesetting software like LaTeX). Be sure to Knit frequently: that way you catch errors as soon as they appear, rather than waiting until the very end and not knowing where the problem is.

What are you looking at when you open RStudio.cloud?

Top left-hand corner = the editor, where your script is written and saved (use Cmd+Enter to run the current line of code in the console). Top right-hand corner = the environment, where variables and data are stored. Bottom left-hand corner = the console, where commands are run (use Ctrl+L to clear it). Bottom right-hand corner = the file system, where files are stored.

Creating a new file.

Click on the sheet symbol in the top left-hand corner of the toolbar. You can create an R script or an R Markdown file. Give your new file a title and press “OK”. Then save the file in the folder you want and give it a filename. Now you’re ready to start writing code!

Installing packages.

You can download and install packages from the Comprehensive R Archive Network (CRAN), the main repository for R packages. A package is a unit of shareable code: it bundles together code, data, documentation, and tests, and is easy to share with others. You can search online for R packages to read their documentation and see what functions they provide. Or, for a package that is already installed, you can use the help(package = [write the package name here without brackets or quotes]) function to bring up more information on the package under “Help”. For example,

help(package = dplyr)

Now let’s install dplyr, which provides a flexible grammar of data manipulation and tools for working with data frames (e.g. finding missing data). dplyr is the next iteration of another package called plyr. To install a package, use install.packages() and put the package name inside the parentheses in quotes.

#install.packages("dplyr")

If you get a “package not found” error, you will need to follow online instructions for installing packages. If not, let’s go on to install the other packages we might need to analyze our morphology data. Once you have installed your packages, you won’t need to reinstall them unless they need to be updated. To install them all at once, put their names in a character vector with c() and separate them with commas.

  1. ‘MASS’ has functions and datasets to support the statistics book “Modern Applied Statistics with S” by Venables and Ripley.
  2. ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. In other words, you can make pretty graphs and tables.
  3. ‘gridExtra’ provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
  4. ‘mosaic’ has datasets and utilities used to teach mathematics, statistics, computation and modeling.
  5. ‘broom’ takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
  6. ‘readr’ provides a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’).
  7. ‘knitr’ provides a general-purpose tool for dynamic report generation in R using Literate Programming techniques. In other words, it turns your code and text into a polished report.
  8. ‘kableExtra’ builds on the lightweight table generator ‘kable()’ from ‘knitr’. This package simplifies the way to manipulate the HTML or ‘LaTeX’ code generated by ‘kable()’ and allows users to construct complex tables and customize styles using a readable syntax. In other words, it helps make pretty tables.

# install.packages(c("MASS", "gridExtra", "tidyverse", "mosaic", "broom", "readr", "knitr", "kableExtra"))
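
Installing a package does not make its functions available; each new R session also has to load it with library(). The sketch below is one possible way to check which of the packages listed above are installed and then load them. The vector name pkgs and the loop are illustrative assumptions, not part of the original lesson.

```r
# Packages named in the list above (pkgs is a hypothetical variable name).
pkgs <- c("MASS", "gridExtra", "tidyverse", "mosaic",
          "broom", "readr", "knitr", "kableExtra")

# requireNamespace() reports whether a package is installed without attaching it.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages().
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Attach every available package so its functions can be called directly.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

MASS is loaded before tidyverse here on purpose: both MASS and dplyr define a function called select(), and the package loaded last is the one R uses by default.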

If an option needs to be set to the same value in multiple code chunks, consider setting it globally in the first code chunk of your document. Start that chunk with {r setup, include=FALSE} and set the options inside it with knitr::opts_chunk$set().
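
As a rough sketch, the body of such a setup chunk might look like the following. The particular options chosen here (echo, message, warning) are only an example, not settings taken from the lesson.

```r
# Body of a first chunk opened with {r setup, include=FALSE}.
# opts_chunk$set() changes the defaults for every chunk that follows.
knitr::opts_chunk$set(
  echo = TRUE,      # show the code in the knitted document
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE   # hide warnings
)
```

Any single chunk can still override these defaults in its own header, for example {r, echo=FALSE}.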

Let’s make a folder in the file system and place all graphs, files, and data related to the project in there. Now, let’s read the data!

Uploading, importing, and reading the data.

Upload the data. Click on the “Upload” icon in the Files pane and upload the file into the directory you are working in. In my case, I am working in the ‘/cloud/project/Morphology/’ directory. The directory path points to a location in the file system.

Read the data using read_csv() (from the readr package, which is loaded by library(readr) or library(tidyverse)) and store it in a variable. To see the data, type the variable name and run the code. See what happens when you Knit it.

morphology_data <- read_csv("mate_trials_summer_2019.csv")
Parsed with column specification:
cols(
  ID_num = col_double(),
  TgroupID = col_character(),
  GgroupID = col_character(),
  sex = col_character(),
  beak = col_double(),
  thorax = col_double(),
  wing = col_double(),
  body = col_double(),
  w_morph = col_character(),
  recorder = col_character(),
  computer = col_character(),
  date_recorded_by_hand = col_character(),
  data_recorded_on_excel = col_character(),
  notes = col_character()
)
head(morphology_data)
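
read_csv() guesses the type of each column and prints that guess as the column specification shown above. A minimal sketch of stating the types yourself follows. The three-row file and the name demo are made up for illustration and only stand in for the real data file.

```r
library(readr)

# Hypothetical three-row file standing in for mate_trials_summer_2019.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID_num,sex,beak",
             "1,F,6.29",
             "2,M,5.10",
             "3,F,NA"), tmp)

# Supplying col_types fixes the column types instead of letting readr guess.
demo <- read_csv(tmp, col_types = cols(
  ID_num = col_double(),
  sex    = col_character(),
  beak   = col_double()
))

# head() takes an optional second argument: the number of rows to show.
head(demo, 2)
```

By default read_csv() treats empty cells and the text NA as missing values, which is why the last beak value is read as a missing number rather than as text.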

Data exploring.

Printing the whole dataset when you Knit is neither visually helpful nor efficient. So, let’s use some data-inspection functions like head(), summary(), glimpse(), and str() to find out more about the data. Use names() to see the column names. What do you see?

glimpse(morphology_data)
Observations: 230
Variables: 14
$ ID_num                 <dbl> 475, 268, 261, 261, 284, 327, 247, 2…
$ TgroupID               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ GgroupID               <chr> "T1", "T2", "T3", "T3", "T4", "T5", …
$ sex                    <chr> "F", "F", "F", "F", "F", "F", "F", "…
$ beak                   <dbl> 6.29, 8.44, 8.44, 8.55, 8.42, 8.82, …
$ thorax                 <dbl> 3.69, 3.73, 3.68, 3.56, 3.83, 3.79, …
$ wing                   <dbl> 7.81, 10.20, 9.57, 9.49, 9.54, 10.20…
$ body                   <dbl> 11.22, 14.00, 13.21, 13.13, 13.15, 1…
$ w_morph                <chr> "S", "L", "S", "S", "L", "L", "L", "…
$ recorder               <chr> "A", "A", "A", "A", "A", "A", "A", "…
$ computer               <chr> "yes", "yes", "yes", "no", "yes", "y…
$ date_recorded_by_hand  <chr> "09.11.19", "09.11.19", "09.11.19", …
$ data_recorded_on_excel <chr> "09.11.19", "09.11.19", "09.11.19", …
$ notes                  <chr> NA, NA, "too big for microscope", NA…
names(morphology_data)
 [1] "ID_num"                 "TgroupID"              
 [3] "GgroupID"               "sex"                   
 [5] "beak"                   "thorax"                
 [7] "wing"                   "body"                  
 [9] "w_morph"                "recorder"              
[11] "computer"               "date_recorded_by_hand" 
[13] "data_recorded_on_excel" "notes"                 
summary(morphology_data)
     ID_num      TgroupID           GgroupID        
 Min.   :200   Length:230         Length:230        
 1st Qu.:273   Class :character   Class :character  
 Median :370   Mode  :character   Mode  :character  
 Mean   :366                                        
 3rd Qu.:447                                        
 Max.   :559                                        
                                                    
     sex                 beak          thorax          wing      
 Length:230         Min.   :4.69   Min.   :2.60   Min.   : 2.35  
 Class :character   1st Qu.:5.68   1st Qu.:3.19   1st Qu.: 7.85  
 Mode  :character   Median :6.26   Median :3.38   Median : 8.82  
                    Mean   :6.54   Mean   :3.42   Mean   : 8.46  
                    3rd Qu.:7.29   3rd Qu.:3.68   3rd Qu.: 9.51  
                    Max.   :9.34   Max.   :4.11   Max.   :11.47  
                    NA's   :1      NA's   :1      NA's   :3      
      body         w_morph            recorder        
 Min.   : 6.84   Length:230         Length:230        
 1st Qu.:11.04   Class :character   Class :character  
 Median :11.87   Mode  :character   Mode  :character  
 Mean   :11.69                                        
 3rd Qu.:12.96                                        
 Max.   :14.74                                        
 NA's   :3                                            
   computer         date_recorded_by_hand data_recorded_on_excel
 Length:230         Length:230            Length:230            
 Class :character   Class :character      Class :character      
 Mode  :character   Mode  :character      Mode  :character      
                                                                
                                                                
                                                                
                                                                
    notes          
 Length:230        
 Class :character  
 Mode  :character  
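
The NA's rows in the summary() output count the missing values in each numeric column. A small sketch of getting the same kind of counts directly is shown here, using a made-up data frame (toy and its values are hypothetical, not the lesson's data).

```r
# Hypothetical miniature version of the morphology data.
toy <- data.frame(
  sex  = c("F", "M", "F", "M"),
  beak = c(6.29, 5.10, NA, 5.80),
  wing = c(7.81, NA, 9.57, NA)
)

# Number of rows and columns.
dim(toy)

# is.na() marks each missing cell; colSums() adds the marks up per column.
na_counts <- colSums(is.na(toy))
na_counts

# How many rows fall into each category of a character column.
table(toy$sex)
```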
                   
                   
                   
                   

Filtering the data.

Use select() to pick out the columns you want and filter() to keep only the rows that meet specific conditions. Let’s say we want to compare beak length between males and females. We can create a filtered dataset using a method in R called ‘piping’: the pipe operator %>% takes the result on its left and passes it as the first argument to the function on the next line.

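# Keep the sex and beak columns, then keep only females with a recorded beak length.
# is.na() is the reliable test for missing values; comparing against the text "NA" is not.
female_data <- morphology_data %>%
  select(sex, beak) %>%
  filter(sex == "F", !is.na(beak))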
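
To see what %>% is doing, the sketch below writes the same two steps once with the pipe and once as nested function calls. The data frame toy is a hypothetical stand-in for morphology_data.

```r
library(dplyr)

# Hypothetical stand-in for morphology_data.
toy <- data.frame(
  sex  = c("F", "M", "F", "M"),
  beak = c(6.29, 5.10, NA, 5.80),
  wing = c(7.81, 8.10, 9.57, 8.40)
)

# Piped version: read it top to bottom.
piped <- toy %>%
  select(sex, beak) %>%
  filter(sex == "F", !is.na(beak))

# Nested version: the same calls, read from the inside out.
nested <- filter(select(toy, sex, beak), sex == "F", !is.na(beak))

# Check that both versions give the same result.
identical(piped, nested)
```

The piped form is usually easier to read once there are more than two steps, because the order on the page matches the order in which the steps run.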

f_mean <- mean(female_data[["beak"]])
f_sd <- sd(female_data[["beak"]])
f_max <- max(female_data[["beak"]])
f_min <- min(female_data[["beak"]])
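
These statistics only work because the rows with a missing beak length were filtered out first. A short sketch of what happens otherwise, and of the na.rm argument as an alternative, using a made-up vector (beak here is hypothetical, not the lesson's column):

```r
# Hypothetical beak measurements with one missing value.
beak <- c(6.29, 8.44, NA, 8.55)

# A single NA makes the result NA.
mean(beak)                  # NA

# na.rm = TRUE drops missing values before computing the statistic.
mean(beak, na.rm = TRUE)    # 7.76

# Equivalent: remove the missing values yourself with is.na().
mean(beak[!is.na(beak)])    # 7.76
```

sd(), max(), and min() accept na.rm = TRUE in the same way.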

# Repeat for males, again getting rid of rows with NA values.

male_data <- morphology_data %>%
  select(sex, beak) %>%
  filter(sex == "M", !is.na(beak))

m_mean <- mean(male_data[["beak"]])
m_sd <- sd(male_data[["beak"]])
m_max <- max(male_data[["beak"]])
m_min <- min(male_data[["beak"]])
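
The female and male blocks repeat the same four calculations. One possible alternative, sketched here on a made-up data frame, is to let dplyr's group_by() and summarise() compute the statistics for every group in one step. The names toy and beak_summary and the values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for morphology_data.
toy <- data.frame(
  sex  = c("F", "F", "M", "M", "F"),
  beak = c(8.0, 9.0, 5.0, 6.0, NA)
)

beak_summary <- toy %>%
  filter(!is.na(beak)) %>%   # drop rows with a missing beak length
  group_by(sex) %>%          # one group per sex
  summarise(mean = mean(beak),
            sd   = sd(beak),
            max  = max(beak),
            min  = min(beak))

beak_summary
```

The result has one row per sex and one column per statistic, so it can be passed straight to kable().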

# Let's build a matrix from scratch summarizing the statistics.
# kable() comes from the knitr package.

summary_table <- matrix(
  c(f_mean, f_sd, f_max, f_min,
    m_mean, m_sd, m_max, m_min),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("female", "male"),
                  c("mean", "sd", "max", "min"))
)
kable(summary_table)
         mean      sd   max   min
female  7.641  0.9773  9.34  5.77
male    5.781  0.4585  6.91  4.69
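
The byrow = TRUE argument matters in the matrix() call: without it, R fills the matrix column by column and the statistics end up in the wrong cells. A small sketch of the difference, reusing the rounded values from the table printed above (the names stats, by_row, by_col, and summary_df are hypothetical):

```r
# Values copied (rounded) from the printed summary table.
stats <- c(7.641, 0.9773, 9.34, 5.77,
           5.781, 0.4585, 6.91, 4.69)

# Filled row by row: the first four values form the female row.
by_row <- matrix(stats, nrow = 2, byrow = TRUE,
                 dimnames = list(c("female", "male"),
                                 c("mean", "sd", "max", "min")))
by_row["male", "mean"]   # 5.781

# Filled column by column (the default): the second value lands in row 2.
by_col <- matrix(stats, nrow = 2)
by_col[2, 1]             # 0.9773

# A matrix can be converted when a true data frame is needed.
summary_df <- as.data.frame(by_row)
```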

Graphing the data.

To help visualize the data, we can graph histograms, which show the distribution of a variable, and boxplots, which summarize its median, quartiles, and outliers. To do so, just use hist() and boxplot().

hist(female_data[["beak"]])

hist(male_data[["beak"]])

boxplot(female_data[["beak"]])

boxplot(male_data[["beak"]])
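
Separate plots are hard to compare because each one gets its own axis. As a sketch of one alternative, boxplot() also accepts a formula, which draws one box per group on a shared axis. The data frame toy and its values are made up for illustration.

```r
# Hypothetical stand-in for the sex and beak columns.
toy <- data.frame(
  sex  = c("F", "F", "F", "M", "M", "M"),
  beak = c(7.2, 8.4, 9.1, 5.3, 5.8, 6.2)
)

# main and xlab replace the default title and axis label.
hist(toy$beak, main = "Beak length", xlab = "beak")

# beak ~ sex reads as "beak split by sex": one box per sex, same y-axis.
bp <- boxplot(beak ~ sex, data = toy,
              main = "Beak length by sex", xlab = "sex", ylab = "beak")

# boxplot() invisibly returns the numbers it drew, including the group names.
bp$names
```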