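The Shapiro-Wilk normality test checks whether a sample is consistent with having come from a normally distributed population. The null hypothesis is that the data are normal, so a small p-value (for example, below 0.05) is evidence against normality.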
Usage
shapiro.test(x)
Arguments
x → numeric vector of data values
Examples
library(datasets)
# mtcars contains data about the design and performance of 32 different cars (1973 models)
# histogram of MPG
hist(mtcars$mpg, breaks = 10)
# Shapiro-Wilk test
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
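To turn the printed result into a decision, the p-value can be pulled out of the object that shapiro.test() returns. This is a minimal sketch that assumes a significance level of 0.05; the names `sw` and `alpha` are illustrative and not part of the original example.

```r
# Shapiro-Wilk test on MPG, stored instead of printed
sw <- shapiro.test(mtcars$mpg)

# assumed significance level; pick one that suits your analysis
alpha <- 0.05

# TRUE means we fail to reject the null hypothesis of normality
sw$p.value > alpha
```

Here the p-value (0.1229) is above 0.05, so the test gives no evidence that MPG departs from normality.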
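Variance is the average squared distance from the mean. var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n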
Usage
var(x, na.rm = TRUE)
Arguments
x → numeric vector of data values
na.rm = TRUE → excludes NA values when calculating the variance
If NAs are present and this argument is omitted, var() returns NA
Examples
library(palmerpenguins)
# isolate adelie penguins
adelie = penguins[which(penguins$species == "Adelie"), ]
# variance of bill depth
var(adelie$bill_depth_mm, na.rm = TRUE)
## [1] 1.480237
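The definition can be checked by computing the sample variance by hand and comparing it with var(). This sketch uses the built-in mtcars data so that it runs without extra packages; `manual_var` is an illustrative name.

```r
# example data from the built-in mtcars dataset
x <- mtcars$mpg
n <- length(x)

# sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# should agree with the built-in function
all.equal(manual_var, var(x))
```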
Standard deviation is the square root of the variance
Usage
sd(x, na.rm = TRUE)
Arguments
x → numeric vector of data values
na.rm = TRUE → excludes NA values when calculating the standard deviation
If NAs are present and this argument is omitted, sd() returns NA
Examples
# standard deviation of adelie bill depth
sd(adelie$bill_depth_mm, na.rm = TRUE)
## [1] 1.21665
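The effect of na.rm is easiest to see on a small vector. The values below are made up for illustration (they are not taken from the penguins data), and `depths` is a hypothetical name.

```r
# small made-up vector with one missing value
depths <- c(18.7, 17.4, NA, 19.3)

# without na.rm the missing value propagates and the result is NA
sd(depths)

# with na.rm = TRUE the NA is dropped before the calculation
sd(depths, na.rm = TRUE)

# the standard deviation is the square root of the variance
all.equal(sd(depths, na.rm = TRUE), sqrt(var(depths, na.rm = TRUE)))
```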
Covariance measures how two variables vary together
Usage
cov(x, y, use = "complete.obs")
Arguments
x → numeric vector of data values
y → numeric vector of data values of the same length as x
use = "complete.obs" → tells R to only calculate results with complete observations, aka no missing values or NAs
Examples
# covariance between bill depth and bill length
cov(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs")
## [1] -2.534234
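As a rough check on what cov() computes, the sample covariance can be written out by hand. This sketch uses the built-in mtcars data (which has no missing values) rather than the penguins data; `manual_cov` is an illustrative name.

```r
# two complete numeric columns from the built-in mtcars dataset
x <- mtcars$mpg
y <- mtcars$wt
n <- length(x)

# sum of the products of paired deviations, divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# should agree with the built-in function
all.equal(manual_cov, cov(x, y))
```

A negative covariance, as in both this sketch and the penguin example, means that one variable tends to decrease as the other increases.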
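Correlation is covariance normalized by the standard deviations of both variables, which gives a unitless value between -1 and 1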
Usage
cor(x, y, use = "complete.obs")
Arguments
x → numeric vector of data values
y → numeric vector of data values of the same length as x
use = "complete.obs" → tells R to only calculate results with complete observations, aka no missing values or NAs
Examples
# correlation between bill depth and bill length
cor(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs")
## [1] -0.2350529
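The relationship between covariance and correlation can be sketched directly: dividing the covariance by the product of the two standard deviations should reproduce cor(). The example again uses the built-in mtcars data, and `manual_cor` is an illustrative name.

```r
x <- mtcars$mpg
y <- mtcars$wt

# covariance divided by the product of the standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))

# should agree with the built-in function
all.equal(manual_cor, cor(x, y))

# rescaling a variable changes the covariance but not the correlation
all.equal(cor(x * 1000, y), cor(x, y))
```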
The t.test() function can be used to perform either a paired or an unpaired t-test
Usage
t.test(x, y, paired = TRUE)
Arguments
x → numeric vector of data values
y → numeric vector of data values; it must be the same length as x for a paired test
paired = TRUE → determines if t-test is paired or unpaired
paired = FALSE (the default) for an unpaired t-test
Examples
Unpaired T-Test
# isolate adelie penguins
adelie = penguins[which(penguins$species == "Adelie"), ]
# isolate male adelie penguins
adelie_males = adelie[which(adelie$sex == "male"), ]
# repeat with gentoo penguins
gentoo = penguins[which(penguins$species == "Gentoo"), ]
gentoo_males = gentoo[which(gentoo$sex == "male"), ]
# unpaired t-test between male adelie and male gentoo body mass
t.test(adelie_males$body_mass_g, gentoo_males$body_mass_g, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: adelie_males$body_mass_g and gentoo_males$body_mass_g
## t = -25.262, df = 131.18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1554.211 -1328.475
## sample estimates:
## mean of x mean of y
## 4043.493 5484.836
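The output above is labelled "Welch Two Sample t-test" and reports fractional degrees of freedom because t.test() does not assume equal variances by default. The sketch below compares that default with the classic Student's t-test, requested with var.equal = TRUE. It uses the built-in mtcars data so that it runs without the penguins package; the variable names are illustrative.

```r
# MPG for automatic (am == 0) and manual (am == 1) cars
auto_mpg <- mtcars$mpg[mtcars$am == 0]
manual_mpg <- mtcars$mpg[mtcars$am == 1]

# default: Welch's test, which does not assume equal variances
welch <- t.test(auto_mpg, manual_mpg)

# Student's test, which pools the variances of the two groups
student <- t.test(auto_mpg, manual_mpg, var.equal = TRUE)

# Welch degrees of freedom are usually fractional;
# Student's are n1 + n2 - 2
welch$parameter
student$parameter
```

Whether the equal-variance assumption is reasonable depends on the data, so the Welch default is often the safer choice.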
Paired T-Test
This example compares the average weight of the males in a penguin colony across two years
# set up data
male_colony_weights_2008 = c(4850, 5300, 4400, 5000, 4900, 5050, 4200, 5300, 4400, 5650, 4700, 4450, 3950, 5700)
male_colony_weights_2009 = c(4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 5000, 4725, 5350)
# paired t-test
t.test(male_colony_weights_2008, male_colony_weights_2009, paired = TRUE)
##
## Paired t-test
##
## data: male_colony_weights_2008 and male_colony_weights_2009
## t = -2.3166, df = 13, p-value = 0.03748
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -452.07743 -15.77971
## sample estimates:
## mean of the differences
## -233.9286
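A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which is why the output reports the "mean of the differences". A short sketch using the same colony weights as above; `weight_diffs`, `paired_result` and `one_sample_result` are illustrative names.

```r
# same data as the paired example
male_colony_weights_2008 = c(4850, 5300, 4400, 5000, 4900, 5050, 4200, 5300, 4400, 5650, 4700, 4450, 3950, 5700)
male_colony_weights_2009 = c(4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 5000, 4725, 5350)

# difference within each pair (2008 minus 2009)
weight_diffs <- male_colony_weights_2008 - male_colony_weights_2009

# paired test on the two vectors
paired_result <- t.test(male_colony_weights_2008, male_colony_weights_2009, paired = TRUE)

# one-sample test of whether the mean difference is 0
one_sample_result <- t.test(weight_diffs)

# both approaches give the same t statistic and p-value
all.equal(unname(paired_result$statistic), unname(one_sample_result$statistic))
all.equal(paired_result$p.value, one_sample_result$p.value)
```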
ANOVA is used to compare the means of several groups to determine whether at least one differs significantly from the others. The process of performing an ANOVA in R has a few steps, but all revolve around the use of the aov() function.
Usage
aov(var1 ~ var2, data = data_frame)
Arguments
var1 and var2 → column names of the variables of interest; var1 is the numeric response and var2 is the grouping variable
data_frame → dataframe that contains the columns used in the model
Example
Step 1: Create an ANOVA variable in R
Similarly to how we had to create a linear model variable, we have to create an ANOVA variable in R. For this example I will be using a dataset of insect sprays, where the dataframe contains counts of insects in agricultural experimental units treated with different insecticides.
# create ANOVA object
Insect_ANOVA = aov(count ~ spray, data = InsectSprays)
# plain console view of ANOVA object
Insect_ANOVA
## Call:
## aov(formula = count ~ spray, data = InsectSprays)
##
## Terms:
## spray Residuals
## Sum of Squares 2668.833 1015.167
## Deg. of Freedom 5 66
##
## Residual standard error: 3.921902
## Estimated effects may be unbalanced
# summary of ANOVA object
summary(Insect_ANOVA)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
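The F value in the summary table is the ratio of the between-group mean square to the residual mean square. The following sketch rebuilds it from the table's own columns; `fit`, `tab`, `mean_sq`, `f_manual` and `p_manual` are illustrative names, and the ANOVA is refitted so that the block runs on its own.

```r
# refit the model and pull out the summary table
fit <- aov(count ~ spray, data = InsectSprays)
tab <- summary(fit)[[1]]

# mean square = sum of squares / degrees of freedom (spray, then residuals)
mean_sq <- tab[["Sum Sq"]] / tab[["Df"]]

# F statistic = between-group mean square / residual mean square
f_manual <- mean_sq[1] / mean_sq[2]

# p-value from the upper tail of the F distribution
p_manual <- pf(f_manual, tab[["Df"]][1], tab[["Df"]][2], lower.tail = FALSE)

# both should match the values reported by summary()
all.equal(f_manual, tab[["F value"]][1])
all.equal(p_manual, tab[["Pr(>F)"]][1])
```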
Step 1.1: Visualize the data
Visualizing the data helps when interpreting the p-values returned by the ANOVA. We can do this using a boxplot. This example uses the ggplot2 package.
# load ggplot2 package
library(ggplot2)
# boxplot of sprays
ggplot(data = InsectSprays, aes(x = spray, y = count, color = spray)) +
  geom_boxplot()
Step 2: Analyze the ANOVA model
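The ANOVA clearly rejects the null hypothesis that all spray means are equal, but it does not say which sprays differ from each other. There are a few ways to find out; a common choice is Tukey’s Honest Significant Difference test, which compares every pair of groups.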
Usage
TukeyHSD(ANOVA_variable)
Arguments
ANOVA_variable → variable containing the ANOVA model
Example
TukeyHSD(Insect_ANOVA)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = InsectSprays)
##
## $spray
## diff lwr upr p adj
## B-A 0.8333333 -3.866075 5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A 2.1666667 -2.532742 6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B 1.3333333 -3.366075 6.032742 0.9603075
## D-C 2.8333333 -1.866075 7.532742 0.4920707
## E-C 1.4166667 -3.282742 6.116075 0.9488669
## F-C 14.5833333 9.883925 19.282742 0.0000000
## E-D -1.4166667 -6.116075 3.282742 0.9488669
## F-D 11.7500000 7.050591 16.449409 0.0000000
## F-E 13.1666667 8.467258 17.866075 0.0000000
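With fifteen pairwise comparisons it can help to filter the table down to the pairs that differ. This is a minimal sketch that assumes a 0.05 cutoff on the adjusted p-value; `tukey`, `pairs_table` and `significant` are illustrative names, and the model is refitted so that the block runs on its own.

```r
# refit the ANOVA and run Tukey's HSD
fit <- aov(count ~ spray, data = InsectSprays)
tukey <- TukeyHSD(fit)

# the comparisons for the spray factor are stored as a matrix
pairs_table <- tukey$spray

# keep only the rows whose adjusted p-value is below the assumed cutoff
significant <- pairs_table[pairs_table[, "p adj"] < 0.05, , drop = FALSE]

# names of the pairs with a significant difference in mean count
rownames(significant)
```

Read against the table above, this separates sprays A, B and F from sprays C, D and E: every significant pair has one spray from each set, and no pair within a set differs significantly.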