# 1. load the library "tidyverse"
library(tidyverse)
library(broom)
library(table1)
library(car)
# 2. use the read_csv file to read the dataset
penguins <- read_rds("data/penguins.rds")Question: Use the table1 package to replicate the table below. Use the examples provided in the following vignette. What do you notice from the table? Which species of penguin is the heaviest? Which one has the largest flippers? Which have the largest bill depth? How are the different species distributed across the three islands?
penguins_2 <- penguins
# Change Labels Names
label(penguins_2$sex) <- "Sex"
label(penguins_2$body_mass_g) <- "Body mass (g)"
label(penguins_2$flipper_length_mm) <- "Flipper length (mm)"
label(penguins_2$bill_depth_mm) <- "Bill depth (mm)"
label(penguins_2$island) <- "Island)"
# Create Tittle of Table
caption <- "Measurements for three penguin species in the Palmer Archipelago"
table1(~ sex + body_mass_g + flipper_length_mm + bill_depth_mm + island | species, overall = FALSE, caption=caption, data=penguins_2)| Adelie (N=152) |
Chinstrap (N=68) |
Gentoo (N=124) |
|
|---|---|---|---|
| Sex | |||
| female | 73 (48.0%) | 34 (50.0%) | 58 (46.8%) |
| male | 73 (48.0%) | 34 (50.0%) | 61 (49.2%) |
| Missing | 6 (3.9%) | 0 (0%) | 5 (4.0%) |
| Body mass (g) | |||
| Mean (SD) | 3700 (459) | 3730 (384) | 5080 (504) |
| Median [Min, Max] | 3700 [2850, 4780] | 3700 [2700, 4800] | 5000 [3950, 6300] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Flipper length (mm) | |||
| Mean (SD) | 190 (6.54) | 196 (7.13) | 217 (6.48) |
| Median [Min, Max] | 190 [172, 210] | 196 [178, 212] | 216 [203, 231] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Bill depth (mm) | |||
| Mean (SD) | 18.3 (1.22) | 18.4 (1.14) | 15.0 (0.981) |
| Median [Min, Max] | 18.4 [15.5, 21.5] | 18.5 [16.4, 20.8] | 15.0 [13.1, 17.3] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Island) | |||
| Biscoe | 44 (28.9%) | 0 (0%) | 124 (100%) |
| Dream | 56 (36.8%) | 68 (100%) | 0 (0%) |
| Torgersen | 52 (34.2%) | 0 (0%) | 0 (0%) |
Answer: Which species of penguin is the heaviest? Gentoo is the heaviest among 3 species, with the mean of weight is 5080g. While the one in Adelie and Chinstrap species are 3700g and 3730g, respectively.
Which one has the largest flippers? Gentoo penguins also have the largest flippers, which is 217mm in mean of length. Mean of Adelie and Chinstrap’s fipper are 190mm and 196mm, respectively.
Which have the largest bill depth? Adelie species have the largest bill depth, at the mean of 18.3mm, compared to the other 2 species.
How are the different species distributed across the three islands? Gentoo are only in Biscoe island. Chinstrap are only in dream island. Lastly, Adelie penguines can be seen across 3 islands.
Question: Replicate the scatter plot below, coloring the points according to the species. What can you tell about the relationship between bill depth and penguin weight? Comment on the intercepts and the slopes, associated with the three species (are they the same? similar? different). Does the outcome variable appear normally distributed? Why does that matter?
penguins %>%
ggplot(aes( x = bill_depth_mm, y = body_mass_g, color = species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()Answer: In general higher bill depth is associated with higher body mass. But Gentoo tend to have lower bill depth but higher body mass compare to the other two species.
Adelie and Chinstrap have very similar slopes and intercepts, whereas Gentoo have steeper slope and higher intercept compare two the other two groups.
For each group the outcome variable seems normally distributed, this is important because we fulfill the normality assumption.
Question: Now replicate the plot below, describing the relationship between flipper length and body mass, and coloring the data according to the associated species. Facet the plot by island (search online if necessary). Finally, tell a story about the relationship between flipper length and weight in these three penguin species, and the distribution of penguins across the three islands.
penguins %>%
ggplot(aes( x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
facet_wrap(vars(island), nrow = 2, scales = "free") +
theme_minimal()Answer: Increasing flipper length is associated with a higher body weight. Gentoo is the heaviest species and has the longest flipper length compared to the other two groups. Gentoo are only in Biscoe island. Chinstrap are only in dream island. Lastly, Adelie are evenly distributed on the 3 islands.
Question: Replicate the scatter plot below, adding vertical and horizontal lines to indicate the mean of the predictor and of the outcome variable. Show that the linear regression line passes at the intersection of the two means.
mean_flipper_length_mm <- mean(penguins$flipper_length_mm, na.rm = TRUE)
mean_body_mass_g <- mean(penguins$body_mass_g, na.rm = TRUE)
penguins %>%
ggplot(aes( x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species)) +
geom_smooth(method = "lm", se = FALSE) +
geom_vline(xintercept = mean_flipper_length_mm) +
geom_hline(yintercept = mean_body_mass_g) Question: In R, run a hypothesis test to see whether the flipper length and the body mass are correlated. Write down your null hypothesis, the p-value and the confidence interval, describe and interpret your findings.
##
## Pearson's product-moment correlation
##
## data: flipper_length_mm and body_mass_g
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.843041 0.894599
## sample estimates:
## cor
## 0.8712018
Answer: Null hypothesis: Flipper length and body mass are not correlated, p-value < 0.001 and the 95 percent confidence interval [0.843041 0.894599] Interpretation: We reject the null hypothesis, flipper length and body mass have a high positive correlation coefficient of 0.8712018. Therefore, if the flipper length increases the body mass also increases.
Question: Use the lm function in R to fit a simple linear regression with one predictor. Write up your findings using the template below, and filling in the gaps in the template with the result of your model.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.80 -259.27 -26.88 247.33 1288.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394.3 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
Answer: Linear regression was used to test the association between body mass (in grams) and flipper length (in mm) using data from n = 342 Penguins. 76% of the variation in body mass was explained by flipper length (R2= 0.76). There was a significant positive association between body mass and flipper length (B = 49.69; 95% CI = 46.65, 52.72; p < .001). On average, for every 1-mm difference in flipper length, penguins differ in mean body mass by 49.69 grams.
Question: We now want to fit a model with parallel slopes, one slope for each of the three species of penguins. Run the regression as you have done above, but this time, add the species variable as a predictor in your model. (1) Write the equation for the linear model (you will need to replace the coefficients in the formula b0,b1,b2 with their estimates). (2) Write up your findings in a similar manner to the template shown in the previous question. (3) Write a separate model for each of the three species. (4) Use the augment(M2) to see the list of residuals and show that the mean of the residuals is almost zero and its standard deviation is the residual standard error.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
## speciesGentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
Answer: (1) Equation for the linear model: Y= -4031 + 40.7·X_flipper - 206.5· X_chinstrap + 266.81 · X_Gentoo s^2 = 375.5^2
Linear regression was used to test the association between body mass (in grams) and flipper length (in mm) and species, using data from n = 342 Penguins. 78% of the variation in body mass was explained by flipper length and species (R2= 0.78). There was a significant positive association between body mass and flipper length for every species (B = 40.71; 95% CI = 34.56, 46.85; p < .001), for Chinstrap (B = -206.51; 95% CI = -321.97, -91.05; p < 0.001) and for Gentoo (B = 266.81; 95% CI = 76.28, 457.34; p < 0.01). On average, for every 1-mm difference in flipper length, penguins differ in mean body mass by 40.71 grams after adjusting for species.
Write a separate model for each of the three species. For Adelie Y= -4031 + 40.71·X_flipper + e e∼N(0 , 375.5^2) For Chinstrap Y = -4237.99 + 40.71 x flipper + e e∼N(0 , 375.5^2) For Gentoo Y = -3764.67 + 40.71 x flipper + e e∼N(0 , 375.5^2)
Use the augment(M2) to see the list of residuals and show that the mean of the residuals is almost zero and its standard deviation is the residual standard error. As we can see in mean_resid the the mean of the residuals is almost zero (1.060549e-11). The standard deviation of the residuals = 374 and the residual standard error is 375.5 are similar. They are slightly different because the degrees of freedom are not included in the calculation of the standard deviation of the residuals.
Question: Reproduce the parallel slopes image below using the coefficients from the second model M2.
## (Intercept) flipper_length_mm speciesChinstrap speciesGentoo
## -4031.4769 40.7054 -206.5101 266.8096
penguins |>
ggplot(
aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
# Use the intercept for the reference category (Adelie)
geom_abline(
intercept = coefs[-4031.4769], slope = 40.7054, color = "orange"
) +
# Use the appropriate intercept for Chinstrap
geom_abline(
intercept = -4031.4769 - 206.5101, slope = 40.7054, color = "pink"
) +
# Use the appropriate intercept for Gentoo
geom_abline(
intercept = -4031.4769 + 266.8096, slope = 40.7054, color = "brown"
) +
scale_color_manual(
values = c("orange", "pink", "brown")
) Question: Search online for an explanation of the various plots (for example, this is a good resource, but you may find others). Create a plot for each of the four diagnostic tests, and describe in words what is the purpose of each diagnostic plot. Then interpret the plot.
Answer: Residuals vs fitted: this plot is employed to test the linearity in the residuals of the model. As can be seen from the plot, the residuals are scattered approximately evenly around the variance regression line and the line is also roughly horizontal, which indicates the linearity in the residual model. Our model fulfills the assumption of Linearity.
QQ Residuals: This plot shows if residuals are normally distributed. Since the residuals of our model follow a straight line we can assume they are normally distributed, therefore we fulfill the normality assumption.
Scale Location: This plot shows if residuals are spread equally along the ranges of predictors so it is applied to examine the heteroscedasticity in the regression model. As can be seen from the chart, the points are roughly scattered around the horizontal line, which shows the homoscedasticity of the model.
Cook’s distance: This plot helps us to find influential cases if there are any. In our model there are no data points with a cooks distance larger than 1, therefore we can say there are no influential cases.
Question: Run the type 3 Anova, discuss and interpret the output you see and link it to the results of the fitted model summary(M2).
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
## speciesGentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
Answer: Looking at the Anova table, we can see that both length of flipper and species have statistically significant effect on the body mass, after controlling other variables (p-value of length of flipper and species variables are smaller than 0.001). This also can be seen from linear regression model as p value of flipper length and species variable also below 0.01, indicating that the fliiper length and type of penguins have linear relationship with body mass of penguin.