This first week will focus on some of the most fundamental statistical techniques: the t-test, ANOVA, and basic linear regression. This lab will introduce you to cases where you can apply these techniques, the assumptions of each approach, and the interpretation of the outputs.

You should read through this markdown and try the code. You should be able to copy and paste it into R. There are some questions in here that are just food for thought and other bolded sections where it says “your turn.” You will need to create your own R script and turn in the work needed in the “your turn” sections. Please upload the script to your personal folder. You are more than welcome to use the code I have included in this markdown, but if you know a better (or just different) way, feel free to use that! I should be able to run your R scripts, so remember to include the libraries you use at the top and have code to import the data. Please leave comments in your code where an explanation is asked for.

Before diving into these techniques, let’s have a quick look at the data you will be using today. These are the same data that were presented in class.

Set working directory.

setwd("c:/current/")

Import data.

dat <- read.csv("research_design_lab1.csv")

Examine data.

colnames(dat) #Returns column names
## [1] "elevation" "genus"     "area"      "herbivory" "diversity" "plotID"
head(dat) #Returns first 6 entries for each column
##   elevation genus     area herbivory diversity plotID
## 1      high Piper 400.0000        76         4      1
## 2      high Piper 133.3000         5         2      2
## 3      high Piper 171.5278         3         2      3
## 4      high Piper  38.0000         0         0      4
## 5      high Piper 825.0000         1         2      5
## 6      high Piper 900.0000         5         3      6

Summarize data.

summary(dat) #Returns summary of each column.
##  elevation        genus         area          herbivory       diversity    
##  high:89   Miconia   :61   Min.   :  12.5   Min.   : 0.00   Min.   :0.000  
##  low :90   Piper     :59   1st Qu.: 105.5   1st Qu.: 1.00   1st Qu.:1.000  
##            Psychotria:59   Median : 192.0   Median : 5.00   Median :2.000  
##                            Mean   : 247.4   Mean   :12.16   Mean   :2.095  
##                            3rd Qu.: 355.6   3rd Qu.:15.00   3rd Qu.:3.000  
##                            Max.   :1093.8   Max.   :87.00   Max.   :6.000  
##      plotID     
##  Min.   : 1.00  
##  1st Qu.: 8.00  
##  Median :15.00  
##  Mean   :15.42  
##  3rd Qu.:23.00  
##  Max.   :31.00

Why are some columns summarized with means and quantiles and others not?

t-test

The t-test is a technique used for comparing the means of two groups, i.e. the two levels of a categorical variable. The t-test therefore requires a continuous response variable and a categorical predictor variable with two levels.

Which column(s) in these data are categorical with two levels?

Before running statistical tests I always find it useful to plot the data first. Let’s take a look at a plot of the raw data to see if the results of this t-test make sense.

boxplot(herbivory~elevation, data = dat)

We see two groups (high and low elevation) with different means.

Okay let’s do a t-test! Here we will examine the differences between herbivory in the high elevation and low elevation group.

mod1 <- t.test(herbivory~elevation, data = dat) #We save the results of the t-test as "mod1."
mod1 #Examine the results of the t-test.
## 
##  Welch Two Sample t-test
## 
## data:  herbivory by elevation
## t = 1.3249, df = 170, p-value = 0.187
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.730763  8.796057
## sample estimates:
## mean in group high  mean in group low 
##           13.93820           10.40556

Here we see that the mean herbivory is 13.9 in the high group and 10.4 in the low group. This means the “high” group has about 3.5 cm^2 more herbivory on average than the “low” group. This difference is the effect size. It is up to the researcher(s) to decide whether this is biologically meaningful or not.
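
You can also pull that effect size straight out of the model object instead of reading it off the printout. Here is a minimal sketch using the estimate component that t.test() stores in its result:

mod1$estimate #Named vector holding the two group means.
mod1$estimate[1] - mod1$estimate[2] #Difference in group means, i.e. the effect size (about 3.5).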

Speaking of significance, the p-value is 0.187, meaning that if there were truly no difference between the groups, we would expect to find a difference at least this extreme 18.7% of the time.

We can see that the difference in means from the boxplot matches the t-test, but there is something else you should notice. Both the high and low samples are skewed, with many small values and a few large values. This brings us to the assumptions of the t-test: the data were randomly sampled, the residuals are normally distributed, and the variances of the two groups are homogeneous.

The residuals are the variation that is left over after accounting for the main variables in the model. First let’s plot them, then see if they are normally distributed (often, if the response variable is normally distributed, so are the residuals). You will also notice that I am using glm() here instead of t.test(). That is because I want to look at the residuals, and the resid() function works well with glm(). glm() and t.test() give the same results (with slightly different output).
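
If you want to convince yourself that the two functions agree, here is a quick sketch. The glm() coefficient for elevationlow should match the difference in group means from mod1 (with the sign flipped, because it is expressed relative to the “high” intercept):

glm_mod <- glm(herbivory~elevation, data = dat)
summary(glm_mod)$coefficients #Intercept = mean of "high"; elevationlow = low minus high.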

Residuals. We want this to look like a random cloud of points. It doesn’t.

plot(resid(glm(herbivory~elevation, data = dat)))

Density plot of residuals.

plot(density(resid(glm(herbivory~elevation, data = dat))))

Let’s check for homogeneity of variance.

var.test(herbivory~elevation, data = dat, alternative = "two.sided") #Higher p-values indicate homogeneous variance, because the null hypothesis is that the variances are equal.
## 
##  F test to compare two variances
## 
## data:  herbivory by elevation
## F = 1.4749, num df = 88, denom df = 89, p-value = 0.06908
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9698925 2.2438722
## sample estimates:
## ratio of variances 
##           1.474863

We can see that it isn’t great, so what can we do? Let’s try a transformation. We will take the square root of the response variable and see if this improves the normality and variance. It is important to remember that after doing this our effect size is on a square-root scale, which can make interpretation of the results more difficult.

dat$herb_sqrt <- sqrt(dat$herbivory)

plot(resid(glm(herb_sqrt~elevation, data = dat)))

plot(density(resid(glm(herb_sqrt~elevation, data = dat))))

var.test(herb_sqrt~elevation, data = dat, alternative = "two.sided") #Higher p-values indicate homogeneous variance.
## 
##  F test to compare two variances
## 
## data:  herb_sqrt by elevation
## F = 1.4773, num df = 88, denom df = 89, p-value = 0.0679
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9715099 2.2476140
## sample estimates:
## ratio of variances 
##           1.477322

That’s a little better.

mod2 <- t.test(herb_sqrt~elevation, data = dat) #We save the results of the t-test as "mod2."
mod2 #Examine the results of the t-test.
## 
##  Welch Two Sample t-test
## 
## data:  herb_sqrt by elevation
## t = 0.91792, df = 169.95, p-value = 0.36
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3515798  0.9627352
## sample estimates:
## mean in group high  mean in group low 
##           2.845379           2.539801

Why are the means different? How does our interpretation change from the first t-test?
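
One way to relate the square-root-scale means back to the original units is to square them. This is only a rough sketch: the square of a mean of square roots is not exactly the mean of the raw values, so treat these as approximations.

mod2$estimate   #Group means on the square-root scale.
mod2$estimate^2 #Squared to get back to (approximately) the original cm^2 scale.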

Your turn: Perform a t-test to compare herbivory between Piper and the other plants. Hint: you will need to make a new column to separate “piper” and “not piper.” The ifelse() function might come in handy.

ANOVA

ANOVA stands for analysis of variance and is used for comparing the means of a continuous response across the groups of a categorical variable with three or more levels.

Which column(s) in these data are categorical with three levels?

Let’s look at the data.

boxplot(herbivory~genus, data = dat)

Time for an ANOVA. Here we will examine the differences between the three genera of plants.

mod3 <- aov(herbivory~genus, data = dat) #We save the results of the ANOVA as "mod3."
summary(mod3) #Examine the results of the ANOVA.
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## genus         2   3511  1755.6   5.804 0.00362 **
## Residuals   176  53236   302.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unlike the t.test() function, the aov() function is a bit shortsighted and does not print effect sizes, which are the coefficients of the model. We can use the coef() function for this.

coef(mod3)
##     (Intercept)      genusPiper genusPsychotria 
##       12.303279        5.239094       -5.667685

With the ANOVA we are essentially comparing the means of all of the groups. To do this, one group becomes the intercept (usually the first level alphabetically), and the other coefficients are expressed relative to the intercept. So in this example the intercept is the mean herbivory on Miconia, and genusPiper and genusPsychotria are how much greater/less herbivory each genus has compared to Miconia.
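
If you want to check which level R is using as the intercept, or recover the raw group means from the coefficients, something like this works (a sketch assuming genus is treated as a factor with the default alphabetical ordering):

levels(factor(dat$genus))     #The first level listed becomes the intercept.
coef(mod3)[1]                 #Mean herbivory for Miconia (the intercept).
coef(mod3)[1] + coef(mod3)[2] #Mean herbivory for Piper.
coef(mod3)[1] + coef(mod3)[3] #Mean herbivory for Psychotria.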

People care about p values, so let’s look at those too.

You can see that the ANOVA (mod3) is “significant”, but what does that mean? Remember that a p-value tells us the probability of obtaining group differences at least this extreme if the null hypothesis is true. The null hypothesis in this case is that all group means are equal. How can we examine which groups are different? We can use a Tukey post-hoc test.

TukeyHSD(mod3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = herbivory ~ genus, data = dat)
## 
## $genus
##                          diff        lwr       upr     p adj
## Piper-Miconia        5.239094  -2.267465 12.745653 0.2277095
## Psychotria-Miconia  -5.667685 -13.174245  1.838874 0.1776925
## Psychotria-Piper   -10.906780 -18.475635 -3.337924 0.0023435

We can see that the main difference is between Psychotria and Piper.

Like the t-test, the ANOVA has assumptions. These are: the data are normally distributed, each group has equal variance, and the observations are independent.

We should view the data to look at normality. The gvlma() function in the gvlma package can also be used to test for normality.
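
If you have the gvlma package installed, a minimal check looks something like this (a sketch; gvlma() expects an lm object and reports whether each assumption appears acceptable):

# install.packages("gvlma") #Uncomment if the package is not installed.
library(gvlma)
gvlma(lm(herbivory~genus, data = dat)) #Summarizes whether the model assumptions are satisfied.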

plot(density(resid(glm(herbivory~genus, data = dat))))

We can use a Bartlett test to check for homogeneity of variance.

bartlett.test(herbivory~genus, data = dat)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  herbivory by genus
## Bartlett's K-squared = 13.576, df = 2, p-value = 0.001127

The data are not normal and the variance is not homogeneous. Like before, you can transform the data to try and meet the assumptions of the test.

Your turn: Transform these data, re-run the ANOVA, and interpret the results. Keep in mind that there are many types of transformations. Try a few.
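
As a starting point only, here is a sketch of one common transformation you could try (note the + 1 inside log() because herbivory contains zeros, and remember that each transformation changes the scale on which you interpret the effect sizes):

dat$herb_log <- log(dat$herbivory + 1) #Log transformation; the + 1 avoids log(0).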

The ANOVAs you have run so far are called one-way ANOVAs. This is because you are looking at the effect of one predictor variable (with more than two levels) on a response variable. A two-way ANOVA examines the relationship between two categorical predictor variables and a response variable. Here is an example:

mod4 <- aov(herbivory~genus*elevation, data = dat) #We save the results of the ANOVA as "mod4."
summary(mod4) #Examine the results of the ANOVA.
##                  Df Sum Sq Mean Sq F value  Pr(>F)   
## genus             2   3511  1755.6   5.879 0.00339 **
## elevation         1    558   557.6   1.867 0.17356   
## genus:elevation   2   1021   510.6   1.710 0.18389   
## Residuals       173  51657   298.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(mod4)
##                  (Intercept)                   genusPiper 
##                    13.564516                     8.987208 
##              genusPsychotria                 elevationlow 
##                    -7.840378                    -2.564516 
##      genusPiper:elevationlow genusPsychotria:elevationlow 
##                    -7.287208                     4.357045

Plot:

boxplot(herbivory~genus*elevation, data = dat)

Basic linear regression

A linear regression is a technique used for quantifying the linear relationship between a continuous response variable and one or more continuous predictor variables.

Which columns have continuous data? What is weird about the diversity column? We will be able to work with this column better next week when discussing GLMs. For now we will treat it like the other continuous variables.

Why isn’t plotID continuous despite being made up of numbers?

Here are those two variables plotted against each other:

plot(herbivory~area, data = dat)
abline(lm(herbivory ~ area, data = dat), col="red")

What is the relationship between area and herbivory? Let’s quantify it!

mod5 <- lm(herbivory~area + diversity, data = dat)
summary(mod5)
## 
## Call:
## lm(formula = herbivory ~ area + diversity, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.544  -8.543  -3.041   3.332  68.538 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.57502    2.24489  -1.593    0.113    
## area         0.00169    0.00562   0.301    0.764    
## diversity    7.31228    0.80787   9.051 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.7 on 176 degrees of freedom
## Multiple R-squared:  0.3302, Adjusted R-squared:  0.3226 
## F-statistic: 43.39 on 2 and 176 DF,  p-value: 4.805e-16

Like before, the coefficient is the effect size. This is what we care about most. It tells us the amount of change predicted in our response variable (herbivory) for every 1-unit change in a predictor variable. In this case we expect a change of about 0.002 in herbivory for every 1 cm^2 of leaf area. This seems small, but remember the scale of leaf area: this variable ranges from 12 to over 1000. On the other hand, diversity only ranges from 0 to 6. Its effect size is bigger, but is that just because these variables are measured on vastly different scales? How can we compare them?
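
One quick way to put the two raw effect sizes in context is to multiply each coefficient by the range of its predictor (a rough sketch; this gives the predicted change in herbivory across the full observed range of each variable):

coef(mod5)["area"] * diff(range(dat$area))           #Predicted change in herbivory across the range of area.
coef(mod5)["diversity"] * diff(range(dat$diversity)) #Predicted change in herbivory across the range of diversity.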

This is where standardization comes in. By standardizing variables we put them on the same scale: the scale of standard deviations. Standardizing takes all of the data in a column and transforms it to have a mean of zero and a standard deviation of 1. This is very important when it comes to comparing variables measured on different scales.
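
Under the hood, scale() simply subtracts the column mean and divides by the column standard deviation. A minimal sketch of the same calculation by hand:

(dat$area - mean(dat$area)) / sd(dat$area) #Equivalent to scale(dat$area), but returned as a plain vector instead of a matrix.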

We can standardize area and diversity like this:

dat$scale_div <- scale(dat$diversity)
dat$scale_area <- scale(dat$area)

Let’s take a quick look at what you just did:

hist(dat$area, main = "Area before scaling")

hist(dat$scale_area, main = "Area after scaling")

You can see that the distribution is identical in shape, but the values have changed (because they are now standardized). You should also notice that the data are not normal; do you think that is important?

Linear regression also has assumptions. It assumes a linear relationship between the variables, no multicollinearity, independent samples, normally distributed residuals, and homoscedasticity of variance. Most of these are pretty self-explanatory (and overlap with other methods), so I will focus on multicollinearity and homoscedasticity.

Multicollinearity occurs when you run a multiple regression (which has multiple predictor variables) and the predictor variables are too correlated with each other. This can cause the model to have trouble separating their effects. Checking correlations among variables or using variance inflation factors is a good idea; see the sketch below. We will learn more about this in a future lab.
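
Here is what a quick collinearity check might look like (a sketch; the vif() function is in the car package, which you may need to install):

cor(dat$area, dat$diversity) #Pairwise correlation between the two predictors.
# install.packages("car")    #Uncomment if the package is not installed.
library(car)
vif(mod5)                    #Variance inflation factors; large values suggest multicollinearity.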

Homoscedasticity of variance means that the variance around the regression line is the same for all predicted values of the model. That is kind of hard to explain, so let’s look at some figures.

You can see that when data are homoscedastic, the variance does not change along the line. When data are heteroscedastic, the variance changes along the line. You can check for this by plotting the data, checking the residuals of your models, or using functions like gvlma().
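
A standard way to eyeball this is a residuals-versus-fitted plot (a sketch using mod5; a fan or funnel shape suggests heteroscedasticity):

plot(fitted(mod5), resid(mod5), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) #Reference line at zero; look for an even scatter above and below it.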

Your turn: Use these newly created columns to re-run the linear regression. Which variable has a larger effect? Also, what is the p-value telling you here? Use my explanation from before and some research on your own to properly interpret it. P-values are tricky and often do not mean what the researcher thinks they do.