The Basics

Shapiro-Wilk Normality Test

The Shapiro-Wilk normality test checks whether a sample is consistent with having come from a normally distributed population.

Usage

shapiro.test(x)

Arguments

x → numeric vector of data values

Notes

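  • Null hypothesis: the sample came from a normally distributed population
    • p < 0.05 → reject the null hypothesis; the data are unlikely to be normally distributed
    • p > 0.05 → fail to reject the null hypothesis; the data are consistent with a normal distribution (this does not prove normality)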

Examples

library(datasets)

# mtcars contains data about the design and performance of 32 different cars (1973 models)

# histogram of MPG
hist(mtcars$mpg, breaks = 10)

# Shapiro-Wilk test
shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229
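
If you want to use the result in later code rather than read it off the console, the object returned by shapiro.test() can be stored and its pieces pulled out by name. This is a minimal sketch using the same mtcars example; the variable name sw is just my choice.

```r
# store the test result instead of only printing it
sw = shapiro.test(mtcars$mpg)

# the test statistic (W) and the p-value are stored by name
sw$statistic
sw$p.value

# TRUE when we fail to reject the null hypothesis at the 0.05 level
sw$p.value > 0.05
```

Here the p-value is above 0.05, so we fail to reject the null hypothesis and treat MPG as consistent with a normal distribution.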


Variance

Variance is the average squared distance from the mean; var() computes the sample variance, which divides by n - 1 rather than n

Usage

var(x, na.rm = TRUE)

Arguments

x → numeric vector of data values

na.rm = TRUE → argument needed to exclude NA values when calculating variance

  • If NAs are present, leaving out this argument will return a variance of NA

Examples

library(palmerpenguins)

# isolate adelie penguins 
adelie = penguins[which(penguins$species == "Adelie"), ]

# variance of bill depth
var(adelie$bill_depth_mm, na.rm = TRUE)
## [1] 1.480237
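
To see what var() is doing, the calculation can be written out by hand. This is a rough sketch on a small made-up vector (the values in x are hypothetical, not taken from the penguins data), assuming the usual sample variance formula with n - 1 in the denominator.

```r
# hypothetical measurements with one missing value
x = c(18.7, 17.4, NA, 19.3, 20.6)

# without na.rm = TRUE the missing value makes the result NA
var(x)

# drop the NA by hand, mirroring na.rm = TRUE
x_complete = x[!is.na(x)]
n = length(x_complete)

# sum of squared deviations from the mean, divided by n - 1
manual_var = sum((x_complete - mean(x_complete))^2) / (n - 1)

# TRUE: the hand calculation agrees with the built-in function
all.equal(manual_var, var(x, na.rm = TRUE))
```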


Standard Deviation

Standard deviation is the square root of the variance, so it is expressed in the same units as the data

Usage

sd(x, na.rm = TRUE)

Arguments

x → numeric vector of data values

na.rm = TRUE → argument needed to exclude NA values when calculating standard deviation

  • If NAs are present, leaving out this argument will return a standard deviation of NA

Examples

# standard deviation of adelie bill depth
sd(adelie$bill_depth_mm, na.rm = TRUE)
## [1] 1.21665
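
A quick way to convince yourself of the relationship between the two functions is to compare sd() against the square root of var(). This is a small sketch on a made-up vector (the values in x are hypothetical).

```r
# hypothetical measurements with one missing value
x = c(18.7, 17.4, NA, 19.3, 20.6)

# standard deviation from the built-in function
sd_builtin = sd(x, na.rm = TRUE)

# standard deviation as the square root of the variance
sd_from_var = sqrt(var(x, na.rm = TRUE))

# TRUE: both routes give the same number
all.equal(sd_builtin, sd_from_var)

# without na.rm = TRUE the result is NA
is.na(sd(x))
```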


Covariance

Covariance measures how two variables vary together

Usage

cov(x, y, use = "complete.obs")

Arguments

x → numeric vector of data values

y → numeric vector of data values with the same length as x

use = "complete.obs" → tells R to use only complete observations, i.e. pairs where neither value is missing (NA)

Examples

# covariance between bill depth and bill length
cov(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs")
## [1] -2.534234
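
For intuition, covariance can also be computed by hand. The sketch below uses mtcars (which ships with R and has no missing values) rather than the penguins data, and assumes the standard sample covariance formula with n - 1 in the denominator.

```r
# fuel economy and weight from the built-in mtcars data
x = mtcars$mpg
y = mtcars$wt
n = length(x)

# sum of the products of paired deviations, divided by n - 1
manual_cov = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# TRUE: the hand calculation agrees with cov()
all.equal(manual_cov, cov(x, y))

# a negative value means that when one variable is above its mean,
# the other tends to be below its mean
manual_cov < 0
```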


Correlation

Correlation is covariance normalized by the standard deviations of both variables, so it always falls between -1 and 1

Usage

cor(x, y, use = "complete.obs")

Arguments

x → numeric vector of data values

y → numeric vector of data values with the same length as x

use = "complete.obs" → tells R to use only complete observations, i.e. pairs where neither value is missing (NA)

Examples

# correlation between bill depth and bill length
cor(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs")
## [1] -0.2350529
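
The "normalized covariance" idea can be checked directly: divide the covariance by the product of the two standard deviations. This sketch uses the built-in airquality data because its Ozone and Solar.R columns contain NAs; the variable names are my own, and the manual filtering is my approximation of what use = "complete.obs" does.

```r
# airquality ships with R; both columns contain missing values
x = airquality$Ozone
y = airquality$Solar.R

# keep only the pairs where both values are present
keep = complete.cases(x, y)

# covariance divided by the product of the standard deviations
manual_cor = cov(x[keep], y[keep]) / (sd(x[keep]) * sd(y[keep]))

# TRUE: matches cor() with use = "complete.obs"
all.equal(manual_cor, cor(x, y, use = "complete.obs"))
```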

T-Tests

The T-test Function

There is one function that can be used to perform either a paired or unpaired t-test

Usage

t.test(x, y, paired = TRUE)

Arguments

x → numeric vector of data values

y → numeric vector of data values; it must be the same length as x only for a paired test

paired = TRUE → runs a paired t-test

  • use paired = FALSE (the default) for an unpaired t-test

Examples

Unpaired T-Test

# isolate adelie penguins
adelie = penguins[which(penguins$species == "Adelie"), ]

# isolate male adelie penguins
adelie_males = adelie[which(adelie$sex == "male"), ]

# repeat with gentoo penguins
gentoo = penguins[which(penguins$species == "Gentoo"), ]
gentoo_males = gentoo[which(gentoo$sex == "male"), ]

# unpaired t-test between male adelie and male gentoo body mass
t.test(adelie_males$body_mass_g, gentoo_males$body_mass_g, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  adelie_males$body_mass_g and gentoo_males$body_mass_g
## t = -25.262, df = 131.18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1554.211 -1328.475
## sample estimates:
## mean of x mean of y 
##  4043.493  5484.836
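
Notice that the output above is labelled "Welch Two Sample t-test". By default t.test() does not assume the two groups have equal variances; passing var.equal = TRUE requests the classic Student's t-test instead. The sketch below illustrates the difference on mtcars (automatic vs. manual transmission), which is my choice of example rather than the penguins data.

```r
# split MPG by transmission type (0 = automatic, 1 = manual)
auto_mpg = mtcars$mpg[mtcars$am == 0]
manual_mpg = mtcars$mpg[mtcars$am == 1]

# default: Welch's t-test, which allows unequal variances
welch = t.test(auto_mpg, manual_mpg, paired = FALSE)

# Student's t-test, which assumes equal variances
student = t.test(auto_mpg, manual_mpg, paired = FALSE, var.equal = TRUE)

# Welch adjusts the degrees of freedom; Student uses n1 + n2 - 2
welch$parameter
student$parameter
```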

Paired T-Test

This example compares the average weight of the males in a penguin colony across two years

# set up data
male_colony_weights_2008 = c(4850, 5300, 4400, 5000, 4900, 5050, 4200, 5300, 4400, 5650, 4700, 4450, 3950, 5700)
male_colony_weights_2009 =  c(4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 5000, 4725, 5350)

# paired t-test
t.test(male_colony_weights_2008, male_colony_weights_2009, paired = TRUE)
## 
##  Paired t-test
## 
## data:  male_colony_weights_2008 and male_colony_weights_2009
## t = -2.3166, df = 13, p-value = 0.03748
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -452.07743  -15.77971
## sample estimates:
## mean of the differences 
##               -233.9286
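
One way to think about the paired test is as a one-sample t-test on the year-to-year differences. The sketch below repeats the colony data from the example above and checks that both approaches give the same answer; the names paired_result and diff_result are my own.

```r
# colony weights copied from the paired example
male_colony_weights_2008 = c(4850, 5300, 4400, 5000, 4900, 5050, 4200, 5300, 4400, 5650, 4700, 4450, 3950, 5700)
male_colony_weights_2009 = c(4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 5000, 4725, 5350)

# paired t-test, as before
paired_result = t.test(male_colony_weights_2008, male_colony_weights_2009, paired = TRUE)

# one-sample t-test on the differences, testing against a mean of 0
differences = male_colony_weights_2008 - male_colony_weights_2009
diff_result = t.test(differences, mu = 0)

# TRUE: same t statistic and same p-value
all.equal(unname(paired_result$statistic), unname(diff_result$statistic))
all.equal(paired_result$p.value, diff_result$p.value)
```

This is also why the two vectors must be the same length and in the same order for a paired test: each element of one is matched with the element in the same position of the other.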

ANOVA

One-Way ANOVA

ANOVA compares the means of several groups to determine whether at least one differs significantly from the others. Performing an ANOVA in R takes a few steps, all of which revolve around the aov() function.

Usage

aov(var1 ~ var2, data = data_frame)

Arguments

var1 and var2 → column names of the variables of interest; var1 is the numeric response and var2 is the grouping variable

data_frame → dataframe that contains the columns used in the model

Example

Step 1: Create an ANOVA variable in R

Similarly to how we had to create a linear model variable, we have to create an ANOVA variable in R. For this example I will be using a dataset of insect sprays, where the dataframe contains counts of insects in agricultural experimental units treated with different insecticides.

# create ANOVA object
Insect_ANOVA = aov(count ~ spray, data = InsectSprays)

# plain console view of ANOVA object 
Insect_ANOVA
## Call:
##    aov(formula = count ~ spray, data = InsectSprays)
## 
## Terms:
##                    spray Residuals
## Sum of Squares  2668.833  1015.167
## Deg. of Freedom        5        66
## 
## Residual standard error: 3.921902
## Estimated effects may be unbalanced

# summary of ANOVA object
summary(Insect_ANOVA)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
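
The numbers in the summary table can also be pulled out for use in code. This is a minimal sketch that assumes the first element of the summary is the ANOVA table shown above; the names anova_table, f_value and p_value are my own.

```r
# refit the model so this chunk runs on its own
Insect_ANOVA = aov(count ~ spray, data = InsectSprays)

# the summary is a list; its first element is the ANOVA table
anova_table = summary(Insect_ANOVA)[[1]]

# row 1 is spray, row 2 is Residuals
f_value = anova_table[["F value"]][1]
p_value = anova_table[["Pr(>F)"]][1]

# the F value is the spray mean square divided by the residual mean square
mean_squares = anova_table[["Mean Sq"]]
all.equal(f_value, mean_squares[1] / mean_squares[2])

# TRUE: at least one spray mean differs at the 0.05 level
p_value < 0.05
```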

Step 1.1: Visualize the data

Visualizing the data helps to interpret the p-value returned by the ANOVA. We can do this using a boxplot. This example uses the ggplot2 package.

# load ggplot2 package 
library(ggplot2)

# boxplot of sprays
ggplot(data = InsectSprays, aes(x = spray, y = count, color = spray)) +
  geom_boxplot()

Step 2: Analyze the ANOVA model

The null hypothesis was clearly rejected, but which sprays differ from each other? There are a few ways to find out, but the best is probably Tukey’s Honest Significant Difference test, which compares every pair of groups.

Usage

TukeyHSD(ANOVA_variable)

Arguments

ANOVA_variable → variable containing the ANOVA model

TukeyHSD(Insect_ANOVA)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##            diff        lwr       upr     p adj
## B-A   0.8333333  -3.866075  5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A  -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A   2.1666667  -2.532742  6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B   1.3333333  -3.366075  6.032742 0.9603075
## D-C   2.8333333  -1.866075  7.532742 0.4920707
## E-C   1.4166667  -3.282742  6.116075 0.9488669
## F-C  14.5833333   9.883925 19.282742 0.0000000
## E-D  -1.4166667  -6.116075  3.282742 0.9488669
## F-D  11.7500000   7.050591 16.449409 0.0000000
## F-E  13.1666667   8.467258 17.866075 0.0000000
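
With fifteen comparisons it can be easier to keep only the pairs that differ significantly. The sketch below is one possible way to do that, assuming a 0.05 cutoff on the adjusted p-value; tukey_table and significant are names I made up for illustration.

```r
# refit the model so this chunk runs on its own
Insect_ANOVA = aov(count ~ spray, data = InsectSprays)

# the comparisons for spray are stored as a matrix
tukey_table = TukeyHSD(Insect_ANOVA)$spray

# keep only the rows with an adjusted p-value below 0.05
significant = tukey_table[tukey_table[, "p adj"] < 0.05, ]

# names of the spray pairs that differ
rownames(significant)
```

In the output above, every significant pair matches one of sprays A, B or F against one of sprays C, D or E, which suggests the sprays fall into two groups with different insect counts.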