Activity 1: “Mixophyes” Dataset

Mixophyes is a genus of frogs. In this activity we will explore the mixo-simplified.csv dataset, which is available from the course website in Blackboard. Download the CSV file to the same directory as this Rmd file; we will then set the working directory to the source file location with the setwd command.

We can also set it manually from the menu: Session -> Set Working Directory -> To Source File Location.

We also clear the environment and set the default theme for ggplot2:

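library('ggplot2')

# Clear all objects from the global environment
rm(list = ls())

# Set the working directory to this file's location (interactive RStudio only)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# Default theme and the "Dark2" ColorBrewer palette for discrete scales
theme_set(theme_bw())

scale_colour_brewer_stat6020 <- function(...) {
  scale_colour_brewer(..., palette = "Dark2")
}

scale_fill_brewer_stat6020 <- function(...) {
  scale_fill_brewer(..., palette = "Dark2")
}

options(
  ggplot2.discrete.colour = scale_colour_brewer_stat6020,
  ggplot2.discrete.fill = scale_fill_brewer_stat6020
)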

1.1. Read the data in:

mixo <- read.csv('mixo-simplified.csv')

1.2. Show a summary:

summary(mixo)
##     Gender             Recap                Mass             SVL        
##  Length:1312        Length:1312        Min.   :  4.00   Min.   : 28.60  
##  Class :character   Class :character   1st Qu.: 28.00   1st Qu.: 63.60  
##  Mode  :character   Mode  :character   Median : 59.00   Median : 79.70  
##                                        Mean   : 62.97   Mean   : 76.38  
##                                        3rd Qu.: 71.00   3rd Qu.: 84.80  
##                                        Max.   :220.00   Max.   :115.00  
##                                        NA's   :206      NA's   :206     
##    Righ.Tibia      Head.Width     Head.Length      Survey.no   
##  Min.   :17.90   Min.   :11.00   Min.   :10.50   Min.   : 1.0  
##  1st Qu.:42.65   1st Qu.:25.75   1st Qu.:22.00   1st Qu.: 7.0  
##  Median :53.60   Median :32.90   Median :29.50   Median :22.0  
##  Mean   :51.16   Mean   :31.74   Mean   :28.13   Mean   :17.9  
##  3rd Qu.:56.15   3rd Qu.:35.00   3rd Qu.:32.40   3rd Qu.:27.0  
##  Max.   :74.80   Max.   :50.60   Max.   :46.30   Max.   :32.0  
##  NA's   :205     NA's   :205     NA's   :691     NA's   :1

Notice that there are many observations (rows) for which some of the variables (columns) are missing; these are indicated as NA values in R. They may produce warnings in some of the subsequent items of this activity, because some functions in R automatically ignore NA values, while others discard the entire corresponding rows when performing the analysis.
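The behaviour described above can be tried out on a small example. The sketch below uses a toy data frame with made-up values (not the mixo data) to show how to count NA values, how mean() reacts to them, and how to keep only the complete rows.

```r
# Toy data frame with hypothetical values, standing in for mixo
toy <- data.frame(
  Mass = c(10, NA, 30, 40),
  SVL  = c(40, 50, NA, 70)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# mean() returns NA unless missing values are dropped explicitly
mean_with_na <- mean(toy$Mass)
mean_mass    <- mean(toy$Mass, na.rm = TRUE)

# Keep only the rows with no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

The same calls should work on mixo itself, for instance colSums(is.na(mixo)).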

Length (SVL) and Mass

SVL (Snout-Vent Length) is a measure of the length of the frog.

1.3. Produce a scatterplot of Mass as predicted by SVL, and add a linear model to the plot (see the example with the “Strength” dataset):

ggplot(mixo,aes(x=SVL, y=Mass))+
      geom_point(size=1)+
      geom_smooth(method='lm')+
      ggtitle('Mass by SVL')
## `geom_smooth()` using formula 'y ~ x'

There is clearly a non-linear relationship between these variables, so the linear model does not fit the data properly. If we change the method from ‘lm’ to ‘loess’, we can see how non-linear the relationship is:

ggplot(mixo,aes(x=SVL, y=Mass))+
      geom_point(size=1)+
      geom_smooth(method='loess')+
      ggtitle('Mass by SVL')
## `geom_smooth()` using formula 'y ~ x'

Thinking about frogs, perhaps the mass of the frog would be related to its volume, which would be related to the cube of its length:

  • \(\text{Mass} \approx \beta_1 \cdot \text{SVL}^3\)

where \(\beta_1\) is some positive constant. Recalling that:

  • \(\log(a^b) = b \cdot \log(a)\), and

  • \(\log(a \cdot b) = \log(a) + \log(b)\),

if we apply the log to both sides we have:

  • \(\log(\text{Mass}) \approx \log(\beta_1) + 3\log(\text{SVL})\)

Since \(\log(\beta_1)\) is just a constant, a plot with both Mass and SVL log transformed should produce an approximately linear relationship, with a slope of about 3.
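The argument above can be checked numerically before looking at the real data. The following is a minimal sketch on simulated values, not on the mixo data: the sample size, the range of lengths, the value of beta1 and the noise level are all arbitrary assumptions chosen for illustration.

```r
set.seed(1)  # make the simulation reproducible

# Simulated lengths and masses following Mass = beta1 * SVL^3 with
# multiplicative noise; beta1 = 1e-4 is an arbitrary illustrative value
n     <- 200
svl   <- runif(n, min = 30, max = 110)
beta1 <- 1e-4
mass  <- beta1 * svl^3 * exp(rnorm(n, sd = 0.1))

# Fit a straight line on the log-log scale
fit <- lm(log(mass) ~ log(svl))

# The slope should be close to 3 and the intercept close to log(beta1)
coef(fit)
log(beta1)
```

If the cube relationship holds, the fitted slope recovers the exponent and the fitted intercept recovers the log of the constant.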

1.4. Try it and see for yourself:

ggplot(mixo,aes(x=SVL, y=Mass))+
      geom_point(size=2)+
      geom_smooth(method='lm')+
      scale_x_log10()+
      scale_y_log10()+
      ggtitle('Mass by SVL - log scale')
## `geom_smooth()` using formula 'y ~ x'

Yes, there is a very strong linear relationship!

Gender and Other Lengths

1.5. Modify the plot in item 1.4 so that the data points are colour coded by Gender, but the linear model is kept the same (global, for the whole data set). You can achieve this by mapping the colour aesthetic to Gender locally, in the geom_point() function only (the code below also maps shape to Gender):

ggplot(mixo,aes(x=SVL, y=Mass))+
      geom_point(size=1, aes(color=Gender, shape=Gender))+
      geom_smooth(method='lm')+
      scale_x_log10()+
      scale_y_log10()+
      ggtitle('Mass by SVL - log scale')
## `geom_smooth()` using formula 'y ~ x'

1.6. Now try to produce and plot separate models for each Gender by mapping the colour aesthetic to Gender again, but now globally in the ggplot() function:

ggplot(mixo,aes(x=SVL, y=Mass, color=Gender))+
      geom_point()+
      geom_smooth(method='lm')+
      scale_x_log10()+
      scale_y_log10()+
      ggtitle('Mass by SVL - log scale')
## `geom_smooth()` using formula 'y ~ x'
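The difference between items 1.5 and 1.6 comes down to which layers inherit the colour mapping. The sketch below illustrates this on a toy data frame with made-up values and hypothetical Gender labels ("F" and "M"); it uses layer_data() from ggplot2 to count how many groups the geom_smooth() layer was fitted for in each case.

```r
library(ggplot2)

# Toy data with hypothetical values and Gender labels, standing in for mixo
toy <- data.frame(
  SVL    = c(40, 50, 60, 70, 45, 55, 65, 75),
  Mass   = c(10, 19, 33, 52, 12, 23, 38, 60),
  Gender = rep(c("F", "M"), each = 4)
)

# Colour mapped locally: the smooth layer sees no grouping variable
p_local <- ggplot(toy, aes(x = SVL, y = Mass)) +
  geom_point(aes(colour = Gender)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Colour mapped globally: every layer inherits it, so the smooth is split
p_global <- ggplot(toy, aes(x = SVL, y = Mass, colour = Gender)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is geom_smooth(); count the distinct groups it was fitted for
n_local  <- length(unique(layer_data(p_local, 2)$group))
n_global <- length(unique(layer_data(p_global, 2)$group))
c(local = n_local, global = n_global)
```

A discrete aesthetic set in ggplot() defines groups for every layer, whereas one set inside geom_point() affects only the points.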

1.7. Produce a scatterplot of the length of the right tibia by SVL, grouped by Gender, and fit a separate linear model to each group:

ggplot(mixo,aes(x=SVL, y=Righ.Tibia,color=Gender))+
      geom_point()+
      geom_smooth(method='lm')+
      ggtitle('Right Tibia by SVL')
## `geom_smooth()` using formula 'y ~ x'
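The separate lines drawn by geom_smooth() can also be obtained as numbers by fitting one lm() per group. This is a minimal sketch on a toy data frame with made-up values and hypothetical Gender labels; the column names follow the mixo dataset, but the values do not come from it.

```r
# Toy data with hypothetical values and Gender labels, standing in for mixo
toy <- data.frame(
  SVL        = c(40, 50, 60, 70, 45, 55, 65, 75),
  Righ.Tibia = c(26, 33, 40, 47, 27, 33, 39, 45),
  Gender     = rep(c("F", "M"), each = 4)
)

# Split the data by Gender and fit one linear model per group
fits <- lapply(split(toy, toy$Gender), function(d) {
  lm(Righ.Tibia ~ SVL, data = d)
})

# One row per Gender, with the intercept and the SVL slope as columns
coefs <- t(sapply(fits, coef))
coefs
```

Replacing toy with mixo should give the intercept and slope of each fitted line; lm() drops rows with NA values by default.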