Project Milestone #4

Goal:

The goal of this homework assignment is to fit models to your project dataset.

Identify your response variable, a categorical predictor, and a numeric predictor (that you suspect might be related to your response). Describe the units for these variables and for the categorical variable describe the levels.

Response variable:

Flipper length (flipper_length_mm), measured in millimeters.

Categorical predictor:

Species (species), with three levels: Adelie, Chinstrap, and Gentoo.

Numeric predictor:

Bill length (bill_length_mm), measured in millimeters.

Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant? Make sure to also include a graphic.

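Yes, the relationship appears significant. The slope for bill_length_mm is 1.6901 (t = 16.03, p < 2e-16), so flipper length increases by about 1.69 mm for each additional millimeter of bill length, and the model explains about 43% of the variation in flipper length (R-squared = 0.4306).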

penmod1 <- lm(flipper_length_mm~bill_length_mm, data = penguins)
summary(penmod1)

## 
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.708  -7.896   0.664   8.650  21.179 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    126.6844     4.6651   27.16   <2e-16 ***
## bill_length_mm   1.6901     0.1054   16.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.63 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4306, Adjusted R-squared:  0.4289 
## F-statistic: 257.1 on 1 and 340 DF,  p-value: < 2.2e-16

ggplot(data= penguins,aes(bill_length_mm,flipper_length_mm))+
  geom_point()+
  stat_smooth(method="lm", se =FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).
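As a quick check on the summary() output above, the t value and a 95% confidence interval for the slope can be recomputed from the printed estimate and standard error. This is a minimal sketch that uses only the numbers shown in the coefficient table; the variable names (est, se, df) are my own.

```r
# Values copied from the summary(penmod1) coefficient table
est <- 1.6901   # slope estimate for bill_length_mm
se  <- 0.1054   # standard error of the slope
df  <- 340      # residual degrees of freedom

# t statistic: the estimate divided by its standard error
t_value <- est / se

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)

round(t_value, 2)
round(ci, 3)
```

The interval does not contain 0, which agrees with the very small p-value. With the fitted model available, confint(penmod1) should give the same interval up to rounding.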

Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).

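We can see that Adelie is the reference group: both dummy variables are 0 for an Adelie penguin.

X2 = 1 if the species is Chinstrap, 0 otherwise

X3 = 1 if the species is Gentoo, 0 otherwise

In a model with only species, b_0 is the Adelie mean, b_2 is the Chinstrap difference from Adelie, and b_3 is the Gentoo difference from Adelie.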

contrasts(penguins$species)

##           Chinstrap Gentoo
## Adelie            0      0
## Chinstrap         1      0
## Gentoo            0      1
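The same coding can be seen in the design matrix that lm() builds internally. A minimal sketch, assuming a small stand-in factor with one value per species rather than the full penguins data:

```r
# One value per species, only to display the coding (not the real data)
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

# model.matrix() returns the columns lm() would use for this factor:
# an intercept column plus one 0/1 indicator per non-reference level
X <- model.matrix(~ species)
X
```

The Adelie row has 0 in both indicator columns, which is what makes it the reference level.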

Fit a linear model with a response variable and a categorical explanatory variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (i.e. side-by-side boxplot)

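Yes, there are differences among the mean flipper lengths of the species.

The ANOVA F-test gives F = 594.8 on 2 and 339 degrees of freedom with p < 2.2e-16, so we reject the hypothesis that all three species have the same mean flipper length. The F-test says that at least one mean differs; it does not say which ones.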

penmod2 <- lm(flipper_length_mm~species, data = penguins)
anova(penmod2)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## species     2  52473 26236.6   594.8 < 2.2e-16 ***
## Residuals 339  14953    44.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(penguins, aes(species,flipper_length_mm))+
  geom_boxplot()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
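The F value in the ANOVA table can be rebuilt from its sums of squares and degrees of freedom. A short sketch using only the numbers printed above (variable names are my own):

```r
# Values copied from the anova(penmod2) table
ss_species <- 52473; df_species <- 2
ss_resid   <- 14953; df_resid   <- 339

# Mean squares are sums of squares divided by their degrees of freedom
ms_species <- ss_species / df_species
ms_resid   <- ss_resid / df_resid

# F is the ratio of the two mean squares
f_value <- ms_species / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_species, df_resid, lower.tail = FALSE)

round(f_value, 1)
p_value
```

A large F means the variation between species means is much bigger than the variation within species.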

Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels?

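y = b_0 + b_1 * X1 + b_2 * X2 + b_3 * X3 + e

Here X1 is bill_length_mm, X2 = 1 for Chinstrap (0 otherwise), and X3 = 1 for Gentoo (0 otherwise). The three lines share the slope b_1 and differ only in their intercepts:

Adelie: y = b_0 + b_1 * X1 + e

Chinstrap: y = (b_0 + b_2) + b_1 * X1 + e

Gentoo: y = (b_0 + b_3) + b_1 * X1 + e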
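To turn the fitted coefficients of a parallel-lines model into one line per level, add each level's dummy coefficient to the baseline intercept and keep the common slope. The sketch below uses made-up toy data (the names toy, x, g, and y are hypothetical, and the numbers are not from the penguins data) so that the answer is easy to verify by eye.

```r
# Toy data, made up for illustration: three groups with the same slope (2)
# and intercepts 1, 4, and 11
toy <- data.frame(
  x = rep(c(1, 2, 3), times = 3),
  g = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(3, 5, 7,  6, 8, 10,  13, 15, 17)
)

# Additive model: one numeric predictor plus one categorical predictor
fit <- lm(y ~ x + g, data = toy)
b <- coef(fit)

# Level A is the reference, so its intercept is the model intercept;
# the other levels add their dummy coefficient to it
intercepts <- c(
  A = b[["(Intercept)"]],
  B = b[["(Intercept)"]] + b[["gB"]],
  C = b[["(Intercept)"]] + b[["gC"]]
)
slope <- b[["x"]]   # shared by all three levels

intercepts
slope
```

For the penguin model, the same arithmetic would apply to coef(penmod3), whose coefficient names should be (Intercept), bill_length_mm, speciesChinstrap, and speciesGentoo.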

Include a graphic of the scatter plot with lines overlaid for each level.

penmod3 <- lm(flipper_length_mm~bill_length_mm + species, data = penguins)
anova(penmod3)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## bill_length_mm   1  29032 29032.1  855.42 < 2.2e-16 ***
## species          2  26923 13461.6  396.64 < 2.2e-16 ***
## Residuals      338  11471    33.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# augment() comes from the broom package and adds a .fitted column
ggplot(data = penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  geom_line(data = augment(penmod3), aes(y = .fitted, color = species))

## Warning: Removed 2 rows containing missing values (geom_point).

Finally, fit a multiple linear model that also includes the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels?

Include a graphic of the scatter plot with lines overlaid for each level.

penmod4 <- lm(flipper_length_mm~bill_length_mm * species, data = penguins)
anova(penmod4)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##                         Df  Sum Sq Mean Sq  F value  Pr(>F)    
## bill_length_mm           1 29032.1 29032.1 865.4260 < 2e-16 ***
## species                  2 26923.1 13461.6 401.2790 < 2e-16 ***
## bill_length_mm:species   2   199.7    99.8   2.9759 0.05235 .  
## Residuals              336 11271.7    33.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# augment() comes from the broom package and adds a .fitted column
ggplot(data = penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_point() +
  geom_line(data = augment(penmod4), aes(y = .fitted, color = species))

## Warning: Removed 2 rows containing missing values (geom_point).
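With the interaction term, each level gets its own intercept and its own slope: the slope for a non-reference level is the numeric predictor's coefficient plus that level's interaction coefficient. The sketch below shows the arithmetic on made-up toy data (toy, x, g, and y are hypothetical names, and the values are not from the penguins data).

```r
# Toy data, made up for illustration: three groups with different lines
#   A: y = 1 + 2x,  B: y = 4 + 3x,  C: y = 2 + 0.5x
toy <- data.frame(
  x = rep(c(1, 2, 3), times = 3),
  g = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(3, 5, 7,  7, 10, 13,  2.5, 3, 3.5)
)

# x * g expands to x + g + x:g, so slopes may differ by level
fit <- lm(y ~ x * g, data = toy)
b <- coef(fit)

# Intercepts: baseline intercept plus each level's dummy coefficient
intercepts <- c(
  A = b[["(Intercept)"]],
  B = b[["(Intercept)"]] + b[["gB"]],
  C = b[["(Intercept)"]] + b[["gC"]]
)

# Slopes: baseline slope plus each level's interaction coefficient
slopes <- c(
  A = b[["x"]],
  B = b[["x"]] + b[["x:gB"]],
  C = b[["x"]] + b[["x:gC"]]
)

intercepts
slopes
```

For the penguin model, the matching names in coef(penmod4) should be bill_length_mm:speciesChinstrap and bill_length_mm:speciesGentoo for the slope adjustments.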

Compare the models from parts (B), (D), (E), and (F).

Calculate the MSEs

MSE = SS(residual) / DF(residual)

penmod1 = 38394/340 = 112.9

penmod2 = 14953/339 = 44.1

penmod3 = 11471/338 = 33.9

penmod4 = 11271.7/336 = 33.5
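These MSE values, along with the R-squared for each model, can be computed in one step from the residual rows of the ANOVA tables. A minimal sketch using only numbers printed in this report; the total sum of squares is taken as the two rows of the anova(penmod1) table added together.

```r
# Residual sums of squares and degrees of freedom from the anova() tables
ss_resid <- c(penmod1 = 38394, penmod2 = 14953, penmod3 = 11471, penmod4 = 11271.7)
df_resid <- c(penmod1 = 340,   penmod2 = 339,   penmod3 = 338,   penmod4 = 336)

# MSE = residual sum of squares / residual degrees of freedom
mse <- ss_resid / df_resid

# Total SS = regression SS + residual SS from the penmod1 table;
# it is the same for every model fit to the same response
ss_total <- 29032 + 38394

# R-squared = 1 - residual SS / total SS
r_squared <- 1 - ss_resid / ss_total

round(mse, 1)
round(r_squared, 4)
```

A smaller MSE and a larger R-squared both point to a better fit, but neither penalizes extra terms, so they are best read alongside the F-tests.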

Discuss model differences

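In penmod4, the interaction term bill_length_mm:species is not significant at the 0.05 level (F = 2.9759, p = 0.05235), so there is only weak evidence that the slope of flipper length on bill length differs among species. It is still interesting that this model lets the slopes change from species to species.

For penmod1 through penmod3, the MSE falls from 112.9 to 44.1 to 33.9, so adding species to bill length clearly improves the fit. Going from penmod3 to penmod4 changes the MSE very little (33.9 to 33.5). I would want to use a model with both of my explanatory variables, and penmod3 does that without the extra interaction terms.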

anova(penmod1)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## bill_length_mm   1  29032 29032.1  257.09 < 2.2e-16 ***
## Residuals      340  38394   112.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(penmod2)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## species     2  52473 26236.6   594.8 < 2.2e-16 ***
## Residuals 339  14953    44.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(penmod3)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## bill_length_mm   1  29032 29032.1  855.42 < 2.2e-16 ***
## species          2  26923 13461.6  396.64 < 2.2e-16 ***
## Residuals      338  11471    33.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(penmod4)

## Analysis of Variance Table
## 
## Response: flipper_length_mm
##                         Df  Sum Sq Mean Sq  F value  Pr(>F)    
## bill_length_mm           1 29032.1 29032.1 865.4260 < 2e-16 ***
## species                  2 26923.1 13461.6 401.2790 < 2e-16 ***
## bill_length_mm:species   2   199.7    99.8   2.9759 0.05235 .  
## Residuals              336 11271.7    33.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)

I learned more about parallel-lines models. It is interesting to see whether the response depends on an explanatory variable or not. I will be doing more of this in my final project. I am still trying to figure out how to make more sense of my data, even when the p-value is small. I am not sure whether this actually shows that the bill lengths differ among the species, or how I can tell other than from the graph. I hope to discuss this more with you after break. I think I do not really understand the smaller details yet, and I want to!

I liked this exercise because I also tried my other categorical variable (sex). It looked like it had a small p-value too, though not as small as for species, so I am wondering whether flipper length depends more on species than on sex. I will also look into island, to see whether the lengths differ significantly by island as well as by species. Like I said before, I love this data because I can really see how this program can offer me insight into the things I hypothesize. It gives me a backboard to say "hey, it does," or "actually, I was wrong, it doesn't! Wow, maybe there is something else!"

Rebecca Barbanell

11/18/2021
