27 Oct 2023# 1. load the appropriate libraries
library(tidyverse)
library(broom)
library(table1)
library(car)
# 2. use the read_rds file to read the dataset
penguins <- read_rds("data/penguins.rds")Question: Use the table1 package to replicate the table below. Use the examples provided in the following vignette. What do you notice from the table? Which species of penguin is the heaviest? Which one has the largest flippers? Which have the largest bill depth? How are the different species distributed across the three islands?
# Write your code to create a histogram of delays
table1(~ sex + body_mass_g + flipper_length_mm + bill_depth_mm + island | species, data = penguins)| Adelie (N=152) |
Chinstrap (N=68) |
Gentoo (N=124) |
Overall (N=344) |
|
|---|---|---|---|---|
| sex | ||||
| female | 73 (48.0%) | 34 (50.0%) | 58 (46.8%) | 165 (48.0%) |
| male | 73 (48.0%) | 34 (50.0%) | 61 (49.2%) | 168 (48.8%) |
| Missing | 6 (3.9%) | 0 (0%) | 5 (4.0%) | 11 (3.2%) |
| body_mass_g | ||||
| Mean (SD) | 3700 (459) | 3730 (384) | 5080 (504) | 4200 (802) |
| Median [Min, Max] | 3700 [2850, 4780] | 3700 [2700, 4800] | 5000 [3950, 6300] | 4050 [2700, 6300] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) | 2 (0.6%) |
| flipper_length_mm | ||||
| Mean (SD) | 190 (6.54) | 196 (7.13) | 217 (6.48) | 201 (14.1) |
| Median [Min, Max] | 190 [172, 210] | 196 [178, 212] | 216 [203, 231] | 197 [172, 231] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) | 2 (0.6%) |
| bill_depth_mm | ||||
| Mean (SD) | 18.3 (1.22) | 18.4 (1.14) | 15.0 (0.981) | 17.2 (1.97) |
| Median [Min, Max] | 18.4 [15.5, 21.5] | 18.5 [16.4, 20.8] | 15.0 [13.1, 17.3] | 17.3 [13.1, 21.5] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) | 2 (0.6%) |
| island | ||||
| Biscoe | 44 (28.9%) | 0 (0%) | 124 (100%) | 168 (48.8%) |
| Dream | 56 (36.8%) | 68 (100%) | 0 (0%) | 124 (36.0%) |
| Torgersen | 52 (34.2%) | 0 (0%) | 0 (0%) | 52 (15.1%) |
Answer: As can be seen from the table 1: - Gentoo penguins are the heaviest among 3 species, with the mean of weight is 5080g. While the one in Adelie and Chinstrap species are 3700g and 3730g, respectively. - Gentoo penguins also have largest flippers, which is 217mm in mean of length. Mean of Adelie and Chinstrap’s fipper are 190mm and 196mm, respectively. - Adelie species have the largest bill depth, at the mean of 18.3mm. - Adelie penguines can be seen across 3 islands. While Chinstrap and Gentoo ones only can be seen at Dream and Biscoe island, repspectively.
Question: Replicate the scatter plot below, coloring the points according to the species. What can you tell about the relationship between bill depth and penguin weight? Comment on the intercepts and the slopes, associated with the three species (are they the same? similar? different). Does the outcome variable appear normally distributed? Why does that matter?
penguins |>
ggplot(
aes(x=bill_depth_mm, y=body_mass_g, color = species)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Answer: From the scatter plot, a linear relationship can be seen between bill depth and penguin weight. - The intercept of of Gentoo is significantly higher than Chinstrap and Adelie. - In term of slopes, among three species, Gentoo ranks the first place, while that of Adelie and Chinstrap are aproximately the same.
Question: Now replicate the plot below, describing the relationship between flipper length and body mass, and coloring the data according to the associated species. Facet the plot by island (search online if necessary). Finally, tell a story about the relationship between flipper length and weight in these three penguin species, and the distribution of penguins across the three islands.
penguins |>
ggplot(
aes(x = flipper_length_mm, y=body_mass_g, color = species)) +
geom_point() +
facet_wrap(vars(island), nrow = 2, scales = "free")## Warning: Removed 2 rows containing missing values (`geom_point()`).
Answer: As from the facet plots above, we cannot tell about the relationship between flipper length and weight in these three penguin species. However, it can be osberved that while Adelie live at all 3 islands, Chinstrap and Gentoo can only be seen at Dream and Biscoe island, repectively.
Question: Replicate the scatter plot below, adding vertical and horizontal lines to indicate the mean of the predictor and of the outcome variable. Show that the linear regression line passes at the intersection of the two means.
mn.body_mass <- mean(penguins$body_mass_g,na.rm = TRUE)
mn.flipper_length <- mean(penguins$flipper_length_mm, na.rm = TRUE)
penguins |>
ggplot(aes(x = flipper_length_mm, y= body_mass_g)) +
geom_point(aes(color = species)) +
geom_smooth(method = "lm", se = FALSE) +
geom_vline(xintercept = mn.flipper_length) +
geom_hline(yintercept = mn.body_mass)## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Question: In R, run a hypothesis test to see whether the flipper length and the body mass are correlated. Write down your null hypothesis, the p-value and the confidence interval, describe and interpret your findings.
##
## Pearson's product-moment correlation
##
## data: penguins$flipper_length_mm and penguins$body_mass_g
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.843041 0.894599
## sample estimates:
## cor
## 0.8712018
Answer: - Null hypothesis: There is no correlation between flipper length and the body mass in penguins. - p value < 0.05 - 95% CI (0.843041; 0.894599) - Interpretation: The result is shown that there is a high positive correlation between body mass and flipper length in penguins (with correlation coefficient is 0.87 and p value < 0.05)
Question: Use the lm function in R to fit a simple linear regression with one predictor. Write up your findings using the template below, and filling in the gaps in the template with the result of your model.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.80 -259.27 -26.88 247.33 1288.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394.3 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
Answer: Linear regression was used to test the association between body mass (in gram) and length of flipper(in mm) using data from n = 342 penguins. 75.9% of the variation in outcome variable was explained by the model (R2= 75.9%). There was a significant positive association between body mass and length of flipper (B = 49.686; 95% CI = 46.65, 52.722; p < .001). On average, for every 1-mm difference in flipper length, penguins differ in mean of body mass by 49.6 gam.
Question: We now want to fit a model with parallel slopes, one slope for each of the three species of penguins. Run the regression as you have done above, but this time, add the species variable as a predictor in your model. (1) Write the equation for the linear model (you will need to replace the coefficients in the formula b0,b1,b2 with their estimates). (2) Write up your findings in a similar manner to the template shown in the previous question. (3) Write a separate model for each of the three species. (4) Use the augment(M2) to see the list of residuals and show that the mean of the residuals is almost zero and its standard deviation is the residual standard error.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
## speciesGentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
Answer: (1) Y= -4031.477 + 40.705⋅Xflipper - 206.510.Xchinstrap + 266.810. XGentoo (2) Linear regression was used to test the association between body mass (in gram) and length of flipper(in mm), stratified by 3 type of penguines using data from n = 340 penguins. - 78.3% of the variation in outcome variable was explained by the model (R2= 78.3%). - There was a significant positive association between body mass and length of flipper of Gentoo (B = 266.810; 95% CI = 76.282, 457.338; p < .01). On average, for every 1-mm difference in flipper length, penguins differ in mean of body mass by 266.810 gam. - On the other hand, there was a significant negative association between body mass and length of flipper of Chinstrap (B = -206.510; 95% CI = -321.972, -91.048; p < .001). On average, for every 1-mm difference in flipper length, penguins differ in mean of body mass by 206.510 gam.
Question: Reproduce the parallel slopes image below using the coefficients from the second model M2
coefs <- coef(M2)
penguins |>
ggplot(
aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
# Use the intercept for the reference category (Adelie)
geom_abline(
intercept = coefs[1], slope = coefs[2], color = "orange"
) +
# Use the appropriate intercept for Chinstrap
geom_abline(
intercept = coefs[1] + coefs[3], slope = coefs[2], color = "red"
) +
# Use the appropriate intercept for Gentoo
geom_abline(
intercept = coefs[1] + coefs[4], slope = coefs[2], color = "black"
) +
scale_color_manual(
values = c("orange", "red", "black")
) ## Warning: Removed 2 rows containing missing values (`geom_point()`).
Question:Search online for an explanation of the various plots (for example, this is a good resource, but you may find others). Create a plot for each of the four diagnostic tests, and describe in words what is the purpose of each diagnostic plot. Then interpret the plot: what can we learn from it?
Answer: (1) Residuals and Fitted Plot: this plot is employed to test the linearity in the residuals of the model. As can be seen from the plot, the residuals are scattered approximately evenly around the variance regression line and the line is also roughly horizontal, which indicates the linearity in the residual model (2) QQ Residuals Plot: this plot is used to determine the distribution of residual. It can be oserved that all the dots are mainly follow the straight line y=x, which implies that the sample distribution is similar to theoretical one. So residuals is followed normal distribution and it meets the assumption of regression model (3) Scale-Location Plot is applied to examine the heteroscedasticity in the regression model. As can be seen from the chart, the points roughly scattered around the horizontal line, which is shown the homoscedasticity in the model (4)Cook’s distance is measure of the influence of outlier on the regression coefficients and the overall fit of model. Most of the point are under threshold of considering outlier, we can see 3 points which can influence the regression coefficient.
Question: Run the type 3 Anova, discuss and interpret the output you see and link it to the results of the fitted model summary(M2).
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
## speciesGentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
Answer: Looking at the Anova table, we can see that both length of flipper and species have statistically significant effect on the body mass, after controlling other variables (p-value of length of flipper and species variables are smaller than 0.001). This also can be seen from linear regression model as p value of flipper length and species variable also below 0.01, indicating that the fliiper length and type of penguins have linear relationship with body mass of penguin.