Learning outcomes

By the end of this tutorial, you should be able to:

  • Select the correct statistical test for your dataset

  • Run simple linear models in R

  • Interpret model outputs and diagnostic plots

  • Calculate F-ratios and degrees of freedom (DF)

Instructions

In your groups, answer the following questions based on your assigned data set (see the sections below). Discuss your findings within your group and prepare a short summary for the class discussion.

Groupings

Group 1 (Motor Trend Car Road Tests)

Load the ‘mtcars’ data set using this code:

{data(mtcars)}

  1. Is there a linear relationship between horsepower (hp) and mpg?
  2. Which test would you use to compare mpg across 3+ cylinder groups?
data("mtcars")

# Check the distribution of your data
hist(mtcars$mpg)

# Linear regression: mpg vs hp
lm_model <- lm(mpg ~ hp, data = mtcars)
par(mfrow=c(2,2))
plot(lm_model) # diagnostic plots

anova(lm_model)
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hp         1 678.37  678.37   45.46 1.788e-07 ***
## Residuals 30 447.67   14.92                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA: mpg vs cyl
lm_model_cyl <- lm(mpg ~ factor(cyl), data = mtcars)
anova(lm_model_cyl)
## Analysis of Variance Table
## 
## Response: mpg
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(cyl)  2 824.78  412.39  39.697 4.979e-09 ***
## Residuals   29 301.26   10.39                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Suggested answers

  1. Linear regression shows a significant negative slope: more horsepower → lower mpg.

  2. Use ANOVA, because you’re comparing mpg across three groups (cyl = 4, 6, 8); see the sketch below for one way to follow up with pairwise comparisons.
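
As an optional follow-up to these answers, here is a minimal sketch (assuming the lm_model object fitted above) that inspects the sign of the hp slope and runs a Tukey post-hoc test on the cylinder groups:

# Coefficient table for the hp regression; the hp estimate should be negative
summary(lm_model)$coefficients

# Pairwise comparisons of the cylinder groups (refit with aov() so TukeyHSD() can be used)
TukeyHSD(aov(mpg ~ factor(cyl), data = mtcars))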

Group 2 (ToothGrowth)

Load the ‘ToothGrowth’ data set using this code:

{data(ToothGrowth)}

  1. Does supplement type (OJ vs VC) affect tooth length?
  2. Does increasing dose increase tooth length?
  3. Which test would you use to compare tooth length between two supplement types?
data(ToothGrowth)

# Check the distribution of your data
hist(ToothGrowth$len)

# t-test: OJ vs VC
t.test(len ~ supp, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
# ANOVA: len vs dose
lm_model_dose <- lm(len ~ factor(dose), data = ToothGrowth)
anova(lm_model_dose)
## Analysis of Variance Table
## 
## Response: len
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(dose)  2 2426.4  1213.2  67.416 9.533e-16 ***
## Residuals    57 1025.8    18.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Suggested answers

  1. Supplement type: OJ tends to give longer teeth than VC, but the Welch t-test p-value (0.061) is just above 0.05, so the evidence is marginal.

  2. Dose: yes, higher doses increase tooth length; the ANOVA on dose is highly significant (p < 0.001). An optional follow-up is sketched below.

  3. To compare the two supplement types (OJ vs VC), use a t-test; for dose (3 groups: 0.5, 1, 2), use ANOVA.
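
As a minimal sketch of one optional way to visualise and follow up the dose effect:

# Tooth length by dose group
boxplot(len ~ factor(dose), data = ToothGrowth,
        xlab = "Dose (mg/day)", ylab = "Tooth length")

# Pairwise comparisons of the three dose groups
TukeyHSD(aov(len ~ factor(dose), data = ToothGrowth))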

Group 3 (PlantGrowth)

Load the ‘PlantGrowth’ data set using this code:

{data(PlantGrowth)}

  1. Which test is appropriate for comparing three groups?
  2. Which groups are statistically different from one another?
  3. Interpret the F-ratio and p-value.
data(PlantGrowth)

# Check the distribution of your data
hist(PlantGrowth$weight)

# One-way ANOVA
lm_model_pg <- lm(weight ~ factor(group), data = PlantGrowth)
anova(lm_model_pg)
## Analysis of Variance Table
## 
## Response: weight
##               Df  Sum Sq Mean Sq F value  Pr(>F)  
## factor(group)  2  3.7663  1.8832  4.8461 0.01591 *
## Residuals     27 10.4921  0.3886                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Suggested answers

  1. Use ANOVA for 3 groups.
  2. Compare the group means and confidence intervals to see whether they overlap; a post-hoc test (e.g., Tukey HSD, sketched below) identifies which pairs of groups differ.
  3. The F-ratio indicates whether the group means differ. A low F-ratio means the group means are close together relative to the variability within each group, whereas a high F-ratio means the variability between group means is large relative to the within-group variability. Our F-ratio of 4.85 indicates that the between-group variance is about 4.85 times the within-group variance. Under the null hypothesis that all group means are equal, the between- and within-group variance estimates should be similar, giving an F-ratio close to 1; with p = 0.016, we reject the null hypothesis.
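
To tie this back to the “calculate F-ratios and degrees of freedom” learning outcome, here is a minimal sketch that recomputes the F-ratio from the table above and runs an optional Tukey post-hoc test:

# Recompute the F-ratio by hand from the ANOVA table printed above
ms_group    <- 3.7663 / 2     # between-group mean square = Sum Sq / Df (Df = 3 groups - 1)
ms_residual <- 10.4921 / 27   # within-group mean square (Df = 30 observations - 3 groups)
ms_group / ms_residual        # ~ 4.85, matching the F value reported by anova()

# Which pairs of groups differ?
TukeyHSD(aov(weight ~ factor(group), data = PlantGrowth))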

Group 4 (USArrests)

Load the ‘USArrests’ data set using this code:

{data(USArrests)}

  1. Which test is appropriate for checking correlation?
  2. Is murder rate correlated with assault rate?
  3. How would you predict assault rate from urban population?
data(USArrests)

# Correlation test
cor.test(USArrests$Murder, USArrests$Assault)
## 
##  Pearson's product-moment correlation
## 
## data:  USArrests$Murder and USArrests$Assault
## t = 9.2981, df = 48, p-value = 2.596e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6739512 0.8831110
## sample estimates:
##       cor 
## 0.8018733
# Linear regression: Assault vs UrbanPop
lm_usa <- lm(Assault ~ UrbanPop, data = USArrests)
summary(lm_usa)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948
par(mfrow=c(2,2))
plot(lm_usa)

Suggested answers

  1. Use a correlation test (Pearson’s).
  2. Yes, murder and assault rates are strongly positively correlated (r = 0.80, p < 0.001).
  3. Fit a linear regression of Assault on UrbanPop; here the relationship is weak and not significant at the 5% level (p = 0.069, R-squared = 0.067). A prediction sketch follows below.
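
As a minimal sketch of how the fitted model could be used for prediction (the UrbanPop values below are purely illustrative):

# Predict assault rate for hypothetical urban population percentages
new_pop <- data.frame(UrbanPop = c(50, 70, 90))
predict(lm_usa, newdata = new_pop, interval = "prediction")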

Group 5 (Iris Data)

Load the ‘iris’ data set using this code:

{data(iris)}

  1. Which test compares sepal length across 3 groups?
  2. Can petal length predict petal width?
data(iris)

# Check the distribution of your data
hist(iris$Sepal.Length)

# ANOVA: Sepal.Length by Species
lm_model_sp <- lm(Sepal.Length ~ factor(Species), data = iris)
par(mfrow=c(2,2))
plot(lm_model_sp)

anova(lm_model_sp)
## Analysis of Variance Table
## 
## Response: Sepal.Length
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(Species)   2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals       147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Linear regression: Petal.Width ~ Petal.Length
lm_iris <- lm(Petal.Width ~ Petal.Length, data = iris)
anova(lm_iris)
## Analysis of Variance Table
## 
## Response: Petal.Width
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## Petal.Length   1  80.26  80.260  1882.5 < 2.2e-16 ***
## Residuals    148   6.31   0.043                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(lm_iris)

Suggested answers

  1. Use ANOVA for comparing sepal length across 3 species.
  2. Linear regression shows a strong positive relationship between petal length and petal width (see the sketch below for one way to quantify it).
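
A minimal sketch of quantifying and visualising that relationship, assuming the lm_iris object fitted above:

# Proportion of variation in petal width explained by petal length
summary(lm_iris)$r.squared   # about 0.93, as the sums of squares above also imply

# Scatterplot with the fitted regression line
plot(Petal.Width ~ Petal.Length, data = iris)
abline(lm_iris)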

Wrap-up discussion

  1. How do you decide whether to use a t-test, ANOVA, correlation, or regression?
  2. What do the F-ratio and degrees of freedom represent?
  3. Why are diagnostic plots important when interpreting models?