Regression and anova

Author

Caden Elias

Complete these problems using Quarto. Add code chunks and text as appropriate to this template, render your work as an html file, and submit that online.

The following problems refer to the iris_sm data set, which is just the standard iris set without flowers of the setosa variety.

library(tidyverse)
iris_sm <- iris %>% 
  filter(Species != "setosa")

Problem 1

Fit a linear model explaining Petal.Length with Sepal.Width.

What is the equation of this line?

model<- lm(Petal.Length ~ Sepal.Width, data = iris_sm)
summary(model)


Call:
lm(formula = Petal.Length ~ Sepal.Width, data = iris_sm)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.42624 -0.55918 -0.03418  0.42892  2.34479 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2021     0.6190   1.942    0.055 .  
Sepal.Width   1.2897     0.2141   6.023 2.99e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7089 on 98 degrees of freedom
Multiple R-squared:  0.2702,    Adjusted R-squared:  0.2627 
F-statistic: 36.28 on 1 and 98 DF,  p-value: 2.989e-08

So, \(\beta_0 = 1.202\) and \(\beta_1 = 1.29\) therefore, the equation is: \[\hat{Petal.Length} = 1.202 + Sepal.Width \times 1.29 \]

Using the fitted.values command on the model from part (a), manually compute the residual standard error of the model. Confirm that your value matches the output of the summary command.

\[RSE = \sqrt{1/(n-2)\sum(y_i - \hat{y_i})^2} \]

residuals <- iris_sm$Petal.Length - fitted.values(model)
n <- nrow(iris_sm)
rse <- sqrt(1/(n-2)*(sum(residuals^2)))
rse

[1] 0.7088698

This matches the RSE from the model summary.

Problem 2

Fit a linear model explaining Petal.Length with the other 3 quantitative variables in the set.

What is the equation of the fit model?

model_2 <- lm(Petal.Length ~ Sepal.Width +
                Sepal.Length +
                Petal.Width, data=iris_sm)
summary(model_2)


Call:
lm(formula = Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width, 
    data = iris_sm)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8583 -0.2232 -0.0200  0.2337  1.0334 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.52212    0.34420  -1.517   0.1326    
Sepal.Width  -0.24789    0.12077  -2.053   0.0428 *  
Sepal.Length  0.69518    0.06210  11.194   <2e-16 ***
Petal.Width   1.06615    0.09789  10.892   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3113 on 96 degrees of freedom
Multiple R-squared:  0.8621,    Adjusted R-squared:  0.8578 
F-statistic: 200.1 on 3 and 96 DF,  p-value: < 2.2e-16

So the equation is: \[\hat{Petal.Length} = -0.522 - 0.248 \times Sepal.Width + 0.695\times Sepal.Length + 1.066 \times Petal.Width \]

What happened to the p-value of the Sepal.Width coefficient? Compare with problem 1 and explain.

summary(model)$coefficients["Sepal.Width", "Pr(>|t|)"]

[1] 2.988765e-08

summary(model_2)$coefficients["Sepal.Width", "Pr(>|t|)"]

[1] 0.04282381

In the second regression model, the p-value is much higher for Sepal.Width. This shows that when we know sepal length and petal width, sepal width doesn’t provide much more information.

Test the null hypothesis that this model is no better than the model from problem 1, that is, that both new regression coefficients are actually zero in the population from which this data is drawn.

anova(model, model_2)

Analysis of Variance Table

Model 1: Petal.Length ~ Sepal.Width
Model 2: Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     98 49.245                                  
2     96  9.304  2    39.941 206.06 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova command used in part (c) gives an F-statistic of 206.06. Use the given sums of squares to confirm this value. Also confirm the p-value using an appropriate pf command.

ss_1 <- sum(residuals(model)^2)
ss_2 <- sum(residuals(model_2)^2)
df_1 <- 4-2
df_2 <- n - 4
f_stat <- ((ss_1-ss_2)/df_1)/(ss_2/df_2)
f_stat

[1] 206.0581

We have confirmed these F-stats are equal

p_value <- 1 - pf(f_stat, df_1, df_2)
p_value

[1] 0

This shows the p-values are also equal since the anova returned a p-value very very small almost equal to \(0\).