In this assignment you will take a deep dive into your data for your project! This assignment should be submitted via R Markdown.
The goal of this homework assignment is to fit models to your project dataset.
A)
Identify your response variable, a categorical predictor, and a numeric predictor (that you suspect might be related to your response). Describe the units for these variables and for the categorical variable describe the levels.
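Response variable:
Flipper length (flipper_length_mm), measured in millimeters
Categorical predictor:
Species (species), with three levels: Adelie, Chinstrap, and Gentoo
Numeric predictor:
Bill length (bill_length_mm), measured in millimeters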
B)
Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant? Make sure to also include a graphic.
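Yes, the relationship appears to be significant. The slope for bill length is 1.6901 (t = 16.03, p < 2e-16), so flipper length increases by about 1.69 mm for each additional mm of bill length, and the model explains about 43% of the variation in flipper length (R-squared = 0.4306).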
library(palmerpenguins)  # provides the penguins data set
penmod1 <- lm(flipper_length_mm ~ bill_length_mm, data = penguins)
summary(penmod1)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.708 -7.896 0.664 8.650 21.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 126.6844 4.6651 27.16 <2e-16 ***
## bill_length_mm 1.6901 0.1054 16.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.63 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4306, Adjusted R-squared: 0.4289
## F-statistic: 257.1 on 1 and 340 DF, p-value: < 2.2e-16
library(ggplot2)
ggplot(data = penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
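To make the fitted line from summary(penmod1) concrete, here is a small sketch that plugs the printed estimates into the equation by hand. The 45 mm bill length is only an illustrative value chosen for this example, not a measurement from the data.

```r
# Estimates copied from the summary(penmod1) output above
b0 <- 126.6844    # intercept (mm)
b1 <- 1.6901      # slope: mm of flipper per mm of bill
se_b1 <- 0.1054   # standard error of the slope

# Predicted flipper length for a hypothetical 45 mm bill
bill_mm <- 45
pred_flipper <- b0 + b1 * bill_mm

# The t value is the estimate divided by its standard error
t_slope <- b1 / se_b1

print(pred_flipper)
print(t_slope)
```

The hand-computed t value can differ slightly from the printed 16.03 because the inputs used here are already rounded.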
C)
Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).
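The contrasts() output shows that Adelie is the reference group: both indicator variables are 0 for Adelie, so its mean is the intercept b_0.
X2 = 1 if the species is Chinstrap, 0 otherwise
X3 = 1 if the species is Gentoo, 0 otherwise
(X1 is kept for bill length in the later models.)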
contrasts(penguins$species)
## Chinstrap Gentoo
## Adelie 0 0
## Chinstrap 1 0
## Gentoo 0 1
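As a rough illustration of how R turns this coding into model columns, the sketch below builds a toy factor with the same three level names and looks at its design matrix. The toy vector is made up for the example (one observation per level) and assumes R's default treatment contrasts; relevel() is shown only as one way the reference level could be changed.

```r
# A toy factor with the same three levels as penguins$species
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

# Default treatment coding: the first level (Adelie) is the reference,
# so it gets only the intercept column
X <- model.matrix(~ species)
print(X)

# relevel() moves a different level into the reference position
species_g <- relevel(species, ref = "Gentoo")
print(contrasts(species_g))
```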
D)
Fit a linear model with a response variable and a categorical explanatory variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (e.g. a side-by-side boxplot).
Yes, there are differences among the means of the levels of the categorical variable.
The ANOVA F-test for species is highly significant (F = 594.8 on 2 and 339 degrees of freedom, p < 2.2e-16), so at least one species has a mean flipper length that differs from the others.
penmod2 <- lm(flipper_length_mm~species, data = penguins)
anova(penmod2)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 52473 26236.6 594.8 < 2.2e-16 ***
## Residuals 339 14953 44.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(penguins, aes(species,flipper_length_mm))+
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
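The F value in the ANOVA table can be rebuilt from the sums of squares and degrees of freedom. This is a minimal sketch using only the numbers printed by anova(penmod2); the variable names are my own.

```r
# Values copied from the anova(penmod2) table above
ss_species <- 52473
df_species <- 2
ss_resid <- 14953
df_resid <- 339

# Mean square = sum of squares / degrees of freedom
ms_species <- ss_species / df_species
ms_resid <- ss_resid / df_resid

# F = between-species mean square / residual mean square
f_stat <- ms_species / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_species, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```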
E)
Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels?
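y = b_0 + b_1 * X1 + b_2 * X2 + b_3 * X3 + e, where X1 is bill length (mm), X2 = 1 for Chinstrap, and X3 = 1 for Gentoo (0 otherwise).
Chinstrap (X2 = 1, X3 = 0): y = (b_0 + b_2) + b_1 * X1 + e
Gentoo (X2 = 0, X3 = 1): y = (b_0 + b_3) + b_1 * X1 + e
Adelie (X2 = 0, X3 = 0): y = b_0 + b_1 * X1 + e
The three lines share the slope b_1 and differ only in their intercepts, so they are parallel.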
Include a graphic of the scatter plot with lines overlaid for each level.
penmod3 <- lm(flipper_length_mm~bill_length_mm + species, data = penguins)
anova(penmod3)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 29032 29032.1 855.42 < 2.2e-16 ***
## species 2 26923 13461.6 396.64 < 2.2e-16 ***
## Residuals 338 11471 33.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(broom)  # for augment()
ggplot(data = penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  geom_line(data = augment(penmod3), aes(y = .fitted, color = species))
## Warning: Removed 2 rows containing missing values (geom_point).
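The estimated line for each species can be read off the coefficients of the parallel-lines model. The sketch below is one possible way to do it, assuming the penguins data comes from the palmerpenguins package and that R's default treatment coding is in use (so the coefficient names are speciesChinstrap and speciesGentoo). The table name lines_e is my own.

```r
library(palmerpenguins)  # assumed source of the penguins data

penmod3 <- lm(flipper_length_mm ~ bill_length_mm + species, data = penguins)
b <- coef(penmod3)

# Adelie is the reference level, so its intercept is b["(Intercept)"];
# the other species shift the intercept but keep the same slope
lines_e <- data.frame(
  species = levels(penguins$species),
  intercept = unname(c(b["(Intercept)"],
                       b["(Intercept)"] + b["speciesChinstrap"],
                       b["(Intercept)"] + b["speciesGentoo"])),
  slope = unname(b["bill_length_mm"])
)
print(lines_e)
```

Each row gives one species' estimated model in the form flipper length = intercept + slope * bill length.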
F)
Finally, fit a multiple linear model that also includes the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels?
Include a graphic of the scatter plot with lines overlaid for each level.
penmod4 <- lm(flipper_length_mm~bill_length_mm * species, data = penguins)
anova(penmod4)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 29032.1 29032.1 865.4260 < 2e-16 ***
## species 2 26923.1 13461.6 401.2790 < 2e-16 ***
## bill_length_mm:species 2 199.7 99.8 2.9759 0.05235 .
## Residuals 336 11271.7 33.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data = penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  geom_line(data = augment(penmod4), aes(y = .fitted, color = species))
## Warning: Removed 2 rows containing missing values (geom_point).
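With the interaction included, each species gets its own intercept and its own slope. The sketch below shows one way to assemble them from the coefficients, again assuming the palmerpenguins package and default treatment coding; lines_f and gentoo_fit are names made up for this example. As a cross-check, the Gentoo line is compared with a separate regression fitted to the Gentoo rows only.

```r
library(palmerpenguins)  # assumed source of the penguins data

penmod4 <- lm(flipper_length_mm ~ bill_length_mm * species, data = penguins)
b <- coef(penmod4)

# Interaction terms adjust the slope for each non-reference species
lines_f <- data.frame(
  species = levels(penguins$species),
  intercept = unname(c(b["(Intercept)"],
                       b["(Intercept)"] + b["speciesChinstrap"],
                       b["(Intercept)"] + b["speciesGentoo"])),
  slope = unname(c(b["bill_length_mm"],
                   b["bill_length_mm"] + b["bill_length_mm:speciesChinstrap"],
                   b["bill_length_mm"] + b["bill_length_mm:speciesGentoo"]))
)
print(lines_f)

# Cross-check: a regression on the Gentoo penguins alone
gentoo_fit <- lm(flipper_length_mm ~ bill_length_mm,
                 data = subset(penguins, species == "Gentoo"))
print(coef(gentoo_fit))
```

The Gentoo row of lines_f should agree with coef(gentoo_fit), because the interaction model fits a separate line to each species.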
G)
Compare the models from parts (B), (D), (E), and (F).
MSE = SS(residual) / DF(residual)
penmod1 = 38394/340 = 112.9
penmod2 = 14953/339 = 44.1
penmod3 = 11471/338 = 33.9
penmod4 = 11271.7/336 = 33.5
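In penmod4 the interaction between bill length and species is not significant at the 0.05 level (F = 2.9759, p = 0.05235), so there is only weak evidence that the slope of flipper length on bill length changes from species to species.
Across the models, the MSE drops from 112.9 (penmod1) to 44.1 (penmod2) and 33.9 (penmod3), but adding the interaction in penmod4 only lowers it to 33.5. I would want to use the model with the most explanatory variables, as long as they improve the fit.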
anova(penmod1)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 29032 29032.1 257.09 < 2.2e-16 ***
## Residuals 340 38394 112.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(penmod2)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 52473 26236.6 594.8 < 2.2e-16 ***
## Residuals 339 14953 44.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(penmod3)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 29032 29032.1 855.42 < 2.2e-16 ***
## species 2 26923 13461.6 396.64 < 2.2e-16 ***
## Residuals 338 11471 33.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(penmod4)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 29032.1 29032.1 865.4260 < 2e-16 ***
## species 2 26923.1 13461.6 401.2790 < 2e-16 ***
## bill_length_mm:species 2 199.7 99.8 2.9759 0.05235 .
## Residuals 336 11271.7 33.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
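The comparison can also be tabulated directly from the four ANOVA tables. This is a small sketch that uses only the printed residual sums of squares and degrees of freedom; the data frame and its column names are my own. R-squared is computed as 1 minus residual SS over total SS, where the total SS is taken from anova(penmod1).

```r
# Residual SS and residual df copied from the four ANOVA tables above
fits <- data.frame(
  model = c("penmod1", "penmod2", "penmod3", "penmod4"),
  ss_resid = c(38394, 14953, 11471, 11271.7),
  df_resid = c(340, 339, 338, 336)
)

# Total SS = regression SS + residual SS from anova(penmod1)
ss_total <- 29032 + 38394

fits$mse <- fits$ss_resid / fits$df_resid
fits$r_squared <- 1 - fits$ss_resid / ss_total
print(fits)

# p-value for the interaction row of anova(penmod4)
p_interaction <- pf(2.9759, df1 = 2, df2 = 336, lower.tail = FALSE)
print(p_interaction)
```

The R-squared for penmod1 computed this way agrees with the Multiple R-squared of 0.4306 printed in part B.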
H)
Conclusion:
What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)
I learned more about parallel-lines models, and it was interesting to see whether the response depends on an explanatory variable or not. I will be doing more of this in my final project. I am still trying to figure out how to make more sense of my data, even when the p-value is small. I am not sure whether these models actually show that bill lengths differ between species, or how I can tell other than from the graph. I hope to discuss this more with you after break. I think I do not yet fully understand the smaller details, and I want to!

I liked this exercise because I also tried my other categorical variable (sex). It had a small p-value too, but not as small as the one for species, so I am wondering whether flipper length depends more on species than on sex. I will also look into island and whether the lengths differ significantly by island as well as by species. As I said before, I love this data because I can really see how this program could give me insight into the things I hypothesize: a backboard that tells me either that my hypothesis holds, or that I was wrong and maybe something else is going on.