By the end of this tutorial, you should be able to:
Select the correct statistical test for your dataset
Run simple linear models in R
Interpret model outputs and diagnostic plots
Calculate F-ratios and degrees of freedom (DF)
In your groups, answer the following questions based on your assigned data set (see sections below). Discuss your findings in the group and prepare a short summary for class discussion
Load the ‘mtcars’ data set using this code:
{data(mtcars)}
data("mtcars")
# Check the distrbution of your data
hist(mtcars$mpg)
# Linear regression: mpg vs hp
lm_model <- lm(mpg ~ hp, data = mtcars)
par(mfrow=c(2,2))
plot(lm_model) # diagnostic plots
anova(lm_model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.37 678.37 45.46 1.788e-07 ***
## Residuals 30 447.67 14.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA: mpg vs cyl
lm_model_cyl <- lm(mpg ~ factor(cyl), data = mtcars)
anova(lm_model_cyl)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 2 824.78 412.39 39.697 4.979e-09 ***
## Residuals 29 301.26 10.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Suggested answers:
Linear regression shows a significant negative slope: more horsepower → lower mpg.
Use ANOVA because you’re comparing mpg across 3 groups (cyl=4,6,8).
Load the ‘toothGrowth’ data set using this code:
{data(ToothGrowth)}
data(ToothGrowth)
# Check the distribution of your data
hist(ToothGrowth$len)
# t-test: OJ vs VC
t.test(len ~ supp, data = ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
# ANOVA: len vs dose
lm_model_dose <- lm(len ~ factor(dose), data = ToothGrowth)
anova(lm_model_dose)
## Analysis of Variance Table
##
## Response: len
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(dose) 2 2426.4 1213.2 67.416 9.533e-16 ***
## Residuals 57 1025.8 18.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Suggested answers
Use a t-test for two groups (OJ vs VC).
Results: OJ tends to have longer teeth than VC; higher doses increase tooth length.
For dose (3 groups: 0.5, 1, 2), use ANOVA.
Load the ‘PlantGrowth’ data set using this code:
{data(PlantGrowth)}
data(PlantGrowth)
# Check the distribution of your data
hist(PlantGrowth$weight)
# One-way ANOVA
lm_model_pg <- lm(weight ~ factor(group), data = PlantGrowth)
anova(lm_model_pg)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(group) 2 3.7663 1.8832 4.8461 0.01591 *
## Residuals 27 10.4921 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Suggested answers
Load the ‘USArrests’ data set using this code:
{data(USArrests)}
data(USArrests)
# Correlation test
cor.test(USArrests$Murder, USArrests$Assault)
##
## Pearson's product-moment correlation
##
## data: USArrests$Murder and USArrests$Assault
## t = 9.2981, df = 48, p-value = 2.596e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6739512 0.8831110
## sample estimates:
## cor
## 0.8018733
# Linear regression: Assault vs UrbanPop
lm_usa <- lm(Assault ~ UrbanPop, data = USArrests)
summary(lm_usa)
##
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.78 -61.85 -18.68 58.05 196.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.0766 53.8508 1.357 0.1811
## UrbanPop 1.4904 0.8027 1.857 0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared: 0.06701, Adjusted R-squared: 0.04758
## F-statistic: 3.448 on 1 and 48 DF, p-value: 0.06948
par(mfrow=c(2,2))
plot(lm_usa)
Suggested answers
Load the ‘iris’ data set using this code:
{data(iris)}
data(iris)
# Check the distribution of your data
hist(iris$Sepal.Length)
# ANOVA: Sepal.Length by Species
lm_model_sp <- lm(Sepal.Length ~ factor(Species), data = iris)
par(mfrow=c(2,2))
plot(lm_model_sp)
anova(lm_model_sp)
## Analysis of Variance Table
##
## Response: Sepal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Linear regression: Petal.Width ~ Petal.Length
lm_iris <- lm(Petal.Width ~ Petal.Length, data = iris)
anova(lm_iris)
## Analysis of Variance Table
##
## Response: Petal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Petal.Length 1 80.26 80.260 1882.5 < 2.2e-16 ***
## Residuals 148 6.31 0.043
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(lm_iris)
Suggested answers