# 1. load the appropriate libraries
library(tidyverse)
library(broom)
library(table1)
library(car)
# 2. use the read_rds file to read the dataset
penguins <- read_rds("data/penguins.rds")Question: Use the table1 package to replicate the table below. Use the examples provided in the following vignette. What do you notice from the table? Which species of penguin is the heaviest? Which one has the largest flippers? Which have the largest bill depth? How are the different species distributed across the three islands?
# Write your code to create a histogram of delays
label(penguins$sex) <- "Sex"
label(penguins$body_mass_g) <- "Body mass (g)"
label(penguins$flipper_length_mm) <- "Flipper length (mm)"
label(penguins$bill_depth_mm) <- "Bill depth (mm)"
label(penguins$island) <- "Island"
caption <- "Table 1: Measurements for three penguin species in the Palmer Archipelago"
table1(~ sex + body_mass_g + flipper_length_mm + bill_depth_mm + island | species, data=penguins, overall = FALSE, caption = caption)| Adelie (N=152) |
Chinstrap (N=68) |
Gentoo (N=124) |
|
|---|---|---|---|
| Sex | |||
| female | 73 (48.0%) | 34 (50.0%) | 58 (46.8%) |
| male | 73 (48.0%) | 34 (50.0%) | 61 (49.2%) |
| Missing | 6 (3.9%) | 0 (0%) | 5 (4.0%) |
| Body mass (g) | |||
| Mean (SD) | 3700 (459) | 3730 (384) | 5080 (504) |
| Median [Min, Max] | 3700 [2850, 4780] | 3700 [2700, 4800] | 5000 [3950, 6300] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Flipper length (mm) | |||
| Mean (SD) | 190 (6.54) | 196 (7.13) | 217 (6.48) |
| Median [Min, Max] | 190 [172, 210] | 196 [178, 212] | 216 [203, 231] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Bill depth (mm) | |||
| Mean (SD) | 18.3 (1.22) | 18.4 (1.14) | 15.0 (0.981) |
| Median [Min, Max] | 18.4 [15.5, 21.5] | 18.5 [16.4, 20.8] | 15.0 [13.1, 17.3] |
| Missing | 1 (0.7%) | 0 (0%) | 1 (0.8%) |
| Island | |||
| Biscoe | 44 (28.9%) | 0 (0%) | 124 (100%) |
| Dream | 56 (36.8%) | 68 (100%) | 0 (0%) |
| Torgersen | 52 (34.2%) | 0 (0%) | 0 (0%) |
Answer: Gentoo is the heaviest penguin species and also has the biggest flipper length. Adelie and Chinstrap have about equal bill depths. Interestingly, Gentoo has the smallest bill depth. Chinstrap penguins live only on Dream Island, Gentoo penguins only on Biscoe Island, while Adelie penguins are found on all three Islands.
Question: Replicate the scatter plot below, coloring the points according to the species. What can you tell about the relationship between bill depth and penguin weight? Comment on the intercepts and the slopes, associated with the three species (are they the same? similar? different). Does the outcome variable appear normally distributed? Why does that matter?
## `geom_smooth()` using formula = 'y ~ x'
Answer: Overall, in all three penguin species there seems to be a positive relationship between body mass (in g) and bill depth (in mm). This can be seen in the slopes of the individual regression lines. All three are positive and steep. It is interesting that while the slope is almost identical for Adeline and Chinstrap penguins, it is much steeper for Gentoo penguins. As for the intercepts, the same pattern applies. It is roughly the same for Adeline and Chinstrap penguins, while it is much higher for Gentoo penguins. This is because Gentoo penguins are on average much larger. Also, the scatter for them is closer to the y-axis because of a smaller bill depth compared to the other two species which, again, show a similar scatter. Within the penguin species the outcome variable does appear to be normally distributed around the regression line. This can be seen in the relatively equal and evenly distributed spread of data points around the regression lines. This matters because of an assumption that an analysis using linear regression makes. The assumption entails that there are no extreme outliers that could exert strong leverage on teh regression line and thus distort it.
Question: Now replicate the plot below, describing the relationship between flipper length and body mass, and coloring the data according to the associated species. Facet the plot by island (search online if necessary). Finally, tell a story about the relationship between flipper length and weight in these three penguin species, and the distribution of penguins across the three islands.
Answer: The relationship between flipper length and weight seems to be very similar across the penguin populations on each island. This can be seen by looking at the positive overall slope in the point clouds, hinting at a positive association. Chinstrap penguins live only on Dream Island, Gentoo penguins only on Biscoe Island, and Adelie penguins are found on all three Islands.The Gentoo species differs a lot from the other penguin species because they have larger flippers and are fatter. The species living on Biscoe island thus differ a lot from each other. The species on Dream island are more similar to each other in terms of flipper legth and body mass.
Question: Replicate the scatter plot below, adding vertical and horizontal lines to indicate the mean of the predictor and of the outcome variable. Show that the linear regression line passes at the intersection of the two means.
## `geom_smooth()` using formula = 'y ~ x'
Question: In R, run a hypothesis test to see whether the flipper length and the body mass are correlated. Write down your null hypothesis, the p-value and the confidence interval, describe and interpret your findings.
##
## Pearson's product-moment correlation
##
## data: penguins$flipper_length_mm and penguins$body_mass_g
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.843041 0.894599
## sample estimates:
## cor
## 0.8712018
Answer: H0: In the penguin population, there is no association between flipper length and body mass. The p-value associated with the test is 2.2e-16. P = <0.05 with a 95% confidence interval corresponding to [0.843041 ; 0.894599]). Based on this, we reject the null hypothesis and can conclude that flipper length and body mass are correlated in the penguin population. The confidence interval exemplifies that in 95% of cases, we will find the true correlation within the range of roughly 0.843 and 0.895. This implies a very strong positive relationship between flipper length and body mass.
Question: Use the lm function in R to fit a simple linear regression with one predictor. Write up your findings using the template below, and filling in the gaps in the template with the result of your model.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.80 -259.27 -26.88 247.33 1288.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394.3 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
Answer: Linear regression was used to test the association between body mass (g) and flipper length (mm) using data from n = 342 Penguins. 75.9% of the variation in body mass was explained by flipper length (R2= 0.759). There was a significant positive association between flipper length and body mass (B = 49.686; 95% CI = (49.686 +/- 1.96*1.518) = 46.69892, 52.67221; p < .001). On average, for every 1-mm difference in flipper length, penguins differ in mean body mass by 49.686 g.
Question: We now want to fit a model with parallel slopes, one slope for each of the three species of penguins. Run the regression as you have done above, but this time, add the species variable as a predictor in your model. (1) Write the equation for the linear model (you will need to replace the coefficients in the formula b0,b1,b2 with their estimates). (2) Write up your findings in a similar manner to the template shown in the previous question. (3) Write a separate model for each of the three species. (4) Use the augment(M2) to see the list of residuals and show that the mean of the residuals is almost zero and its standard deviation is the residual standard error.
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + factor(species),
## data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## factor(species)Chinstrap -206.510 57.731 -3.577 0.000398 ***
## factor(species)Gentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
## [1] 1.060549e-11
## [1] 373.8795
Answer: (1) Y = -4031.477 + 40.705* x flipper -
206.510 * x Chinstrap + 266.810 * X Gentoo (2) Linear regression was
used to test the association between flipper length (in mm), species (as
a categorical dummy variable) and body mass (in g) using data from n =
342 Penguins. 78.3 % of the variation in body mass was explained by
flipper length and species (R2=0.783). There was a significant positive
association between flipper length and body mass (B = 40.705; 95% CI =
34.66468, 46.74612; p < .001). On average, for every 1-mm difference
in flipper length, penguins differ in mean body mass by 40.705 gramms.
There was also a significant grouping effect between species and body
mass. Because we’re dealing with a categorical variable with three
levels, describing the relationship cannot be done by calling it simply
positive or negtaive. What we’re seeing is that across the species there
appears to be a different “starting point” for the regression line for
which the slope is determined by flipper length and body mass (slope is
equal across groups, intercept is not: it is 206.510 lower for the
Chinstrap species compared to Adelie (the reference), and 266.810 higher
for Gentoo compared to Adelie). (3)
Chinstrap: Y = -4237.987 + 40.705* x flipper Adelie: Y = -4031.477 +
40.705* x flipper Gentoo: Y = -3764.667 + 40.705* x flipper (4) As for
the mean of residuals, the value 1.060549e-11 shows that it is very very
close to 0. The standard deviation of residuals, also called the
standard error of residual is 373.8795. This means that on average, our
predictions made with the M2 model will be 373.8795 gramms off.
Question: Reproduce the parallel slopes image below using the coefficients from the second model M2.
Question: Search online for an explanation of the various plots (for example, this is a good resource, but you may find others). Create a plot for each of the four diagnostic tests, and describe in words what is the purpose of each diagnostic plot. Then interpret the plot: what can we learn from it?
Answer: (1) Residuals vs fitted plot(M2, which = 1) In this plot we can see that the residuals are distributed equally around the horizontal line. We can interpret this plot as support for the linearity assumption. That is a good indication that we don’t have a non-linear relationship between the variables.
QQ Residuals plot(M2, which = 2) We can interpret this plot as support for the fit of our model, as it follows a straight line. This lends support for the residuals being normally distributed. This correpsonds to the multivariate normality assumption of linear regression.
Scale Location plot(M2, which = 3) As we’re seeing a line that is horizontal, this means that the residuals are equally spread around the fitted line. We can interpret this plot as support for equal variance (homoscedasticity) in the residuals.
Cook’s distance plot(M2, which = 4) Cook’s distance is the sclaed change in fitted values, given that the value is removed from the model. The graph thus shows the influence of each observation on the fitted response values. As the cook’s distances in our plot are all very small (below 0.25, as compared to the threshold of 2) we do not have influential outliers in our data.
Question: Run the type 3 Anova, discuss and interpret the output you see and link it to the results of the fitted model summary(M2).
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -927.70 -254.82 -23.92 241.16 1191.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4031.477 584.151 -6.901 2.55e-11 ***
## flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
## speciesChinstrap -206.510 57.731 -3.577 0.000398 ***
## speciesGentoo 266.810 95.264 2.801 0.005392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 375.5 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
## F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
Answer: In the following ANOVA analysis, we are testing two null hypotheses. Firstly, we are testing whether the means in body mass are equal for all three penguin species. Secondly, we are testing that there flipper length is not associated with body mass. We are not testing for any interaction effects between species and flipper length.
The results of our ANOVA are similar to those in our linear regression model in the sense that we that we are observing two main effects (species and flipper length). In our linear regression model, all coefficients showed a significant effect. In the ANOVA output we see a significant effect of species, meaning that there is a significant difference in body mass between at least one of the species, but we cannot see that from this analysis as no follow-up tests were done. In addition to this group effect, we see a significant main effect of flipper length. This means that flipper length, independently of species, predicts body mass in penguin populations. Both methods thus overall yield the same conclusions.