library(tidyverse)
iris_sm <- iris %>%
filter(Species != "setosa")Regression and anova
Complete these problems using Quarto. Add code chunks and text as appropriate to this template, render your work as an html file, and submit that online.
The following problems refer to the iris_sm data set, which is just the standard iris set without flowers of the setosa variety.
Problem 1
Fit a linear model explaining Petal.Length with Sepal.Width.
- What is the equation of this line?
model<- lm(Petal.Length ~ Sepal.Width, data = iris_sm)
summary(model)
Call:
lm(formula = Petal.Length ~ Sepal.Width, data = iris_sm)
Residuals:
Min 1Q Median 3Q Max
-1.42624 -0.55918 -0.03418 0.42892 2.34479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2021 0.6190 1.942 0.055 .
Sepal.Width 1.2897 0.2141 6.023 2.99e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7089 on 98 degrees of freedom
Multiple R-squared: 0.2702, Adjusted R-squared: 0.2627
F-statistic: 36.28 on 1 and 98 DF, p-value: 2.989e-08
So, \(\beta_0 = 1.202\) and \(\beta_1 = 1.29\) therefore, the equation is: \[\hat{Petal.Length} = 1.202 + Sepal.Width \times 1.29 \]
- Using the
fitted.valuescommand on the model from part (a), manually compute the residual standard error of the model. Confirm that your value matches the output of thesummarycommand.
\[RSE = \sqrt{1/(n-2)\sum(y_i - \hat{y_i})^2} \]
residuals <- iris_sm$Petal.Length - fitted.values(model)
n <- nrow(iris_sm)
rse <- sqrt(1/(n-2)*(sum(residuals^2)))
rse[1] 0.7088698
This matches the RSE from the model summary.
Problem 2
Fit a linear model explaining Petal.Length with the other 3 quantitative variables in the set.
- What is the equation of the fit model?
model_2 <- lm(Petal.Length ~ Sepal.Width +
Sepal.Length +
Petal.Width, data=iris_sm)
summary(model_2)
Call:
lm(formula = Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width,
data = iris_sm)
Residuals:
Min 1Q Median 3Q Max
-0.8583 -0.2232 -0.0200 0.2337 1.0334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.52212 0.34420 -1.517 0.1326
Sepal.Width -0.24789 0.12077 -2.053 0.0428 *
Sepal.Length 0.69518 0.06210 11.194 <2e-16 ***
Petal.Width 1.06615 0.09789 10.892 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3113 on 96 degrees of freedom
Multiple R-squared: 0.8621, Adjusted R-squared: 0.8578
F-statistic: 200.1 on 3 and 96 DF, p-value: < 2.2e-16
So the equation is: \[\hat{Petal.Length} = -0.522 - 0.248 \times Sepal.Width + 0.695\times Sepal.Length + 1.066 \times Petal.Width \]
- What happened to the p-value of the
Sepal.Widthcoefficient? Compare with problem 1 and explain.
summary(model)$coefficients["Sepal.Width", "Pr(>|t|)"][1] 2.988765e-08
summary(model_2)$coefficients["Sepal.Width", "Pr(>|t|)"][1] 0.04282381
In the second regression model, the p-value is much higher for Sepal.Width. This shows that when we know sepal length and petal width, sepal width doesn’t provide much more information.
- Test the null hypothesis that this model is no better than the model from problem 1, that is, that both new regression coefficients are actually zero in the population from which this data is drawn.
anova(model, model_2)Analysis of Variance Table
Model 1: Petal.Length ~ Sepal.Width
Model 2: Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 49.245
2 96 9.304 2 39.941 206.06 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The
anovacommand used in part (c) gives an F-statistic of 206.06. Use the given sums of squares to confirm this value. Also confirm the p-value using an appropriatepfcommand.
ss_1 <- sum(residuals(model)^2)
ss_2 <- sum(residuals(model_2)^2)
df_1 <- 4-2
df_2 <- n - 4
f_stat <- ((ss_1-ss_2)/df_1)/(ss_2/df_2)
f_stat[1] 206.0581
We have confirmed these F-stats are equal
p_value <- 1 - pf(f_stat, df_1, df_2)
p_value[1] 0
This shows the p-values are also equal since the anova returned a p-value very very small almost equal to \(0\).