Exercise 1

Question: Use the table1 package to replicate the table below. Use the examples provided in the following vignette. What do you notice from the table? Which species of penguin is the heaviest? Which one has the largest flippers? Which have the largest bill depth? How are the different species distributed across the three islands?

Table 1: Measurements for three penguin species in the Palmer Archipelago

                        Adelie               Chinstrap            Gentoo
                        (N=152)              (N=68)               (N=124)
Sex
  female                73 (48.0%)           34 (50.0%)           58 (46.8%)
  male                  73 (48.0%)           34 (50.0%)           61 (49.2%)
  Missing               6 (3.9%)             0 (0%)               5 (4.0%)
Body Mass (g)
  Mean (SD)             3700 (459)           3730 (384)           5080 (504)
  Median [Min, Max]     3700 [2850, 4780]    3700 [2700, 4800]    5000 [3950, 6300]
  Missing               1 (0.7%)             0 (0%)               1 (0.8%)
Flipper length (mm)
  Mean (SD)             190 (6.54)           196 (7.13)           217 (6.48)
  Median [Min, Max]     190 [172, 210]       196 [178, 212]       216 [203, 231]
  Missing               1 (0.7%)             0 (0%)               1 (0.8%)
Bill depth (mm)
  Mean (SD)             38.8 (2.66)          48.8 (3.34)          47.5 (3.08)
  Median [Min, Max]     38.8 [32.1, 46.0]    49.6 [40.9, 58.0]    47.3 [40.9, 59.6]
  Missing               1 (0.7%)             0 (0%)               1 (0.8%)
Island
  Biscoe                44 (28.9%)           0 (0%)               124 (100%)
  Dream                 56 (36.8%)           68 (100%)            0 (0%)
  Torgersen             52 (34.2%)           0 (0%)               0 (0%)

Answer: The table summarizes the characteristics of the three penguin species. Gentoo is the heaviest species and has the longest flippers, while Chinstrap has the largest mean value in the bill depth row (48.8 mm). Chinstrap penguins are found only on Dream Island, and Gentoo penguins only on Biscoe Island. Adelie penguins, by contrast, are spread fairly evenly across Biscoe, Dream and Torgersen islands (between 29% and 37% on each).
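
The island distribution described above can be checked directly from the counts in Table 1. The following is a minimal sketch in base R; the object names (`island_counts`, `island_pct`, `islands_per_species`) are only illustrative and the counts are copied by hand from the table.

```r
# Island counts copied from Table 1 (rows: islands, columns: species).
island_counts <- matrix(
  c(44,  0, 124,
    56, 68,   0,
    52,  0,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(island  = c("Biscoe", "Dream", "Torgersen"),
                  species = c("Adelie", "Chinstrap", "Gentoo"))
)

# Column percentages: the share of each species living on each island.
island_pct <- round(100 * prop.table(island_counts, margin = 2), 1)
print(island_pct)

# Number of islands on which each species occurs.
islands_per_species <- colSums(island_counts > 0)
print(islands_per_species)
```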

Exercise 2

Question: Replicate the scatter plot below, coloring the points according to the species. What can you tell about the relationship between bill depth and penguin weight? Comment on the intercepts and the slopes, associated with the three species (are they the same? similar? different). Does the outcome variable appear normally distributed? Why does that matter?

Answer: Bill depth and body mass are positively related within each of the three species. However, Gentoo, the heaviest species, has a much higher intercept. It also has a steeper slope (a larger increase in body mass per unit increase in bill depth). The outcome variable is not normally distributed over the whole data set; it is bimodal. Within each species, however, body mass is approximately normally distributed around the regression line. This matters because linear regression assumes normally distributed residuals, not a normally distributed outcome overall.
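
A scatter plot of this kind could be drawn roughly as follows. This is a sketch only: it assumes the `penguins` data come from the palmerpenguins package and that ggplot2 is installed, and the object name `p_depth` is illustrative.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Body mass against bill depth, coloured by species, with one
# least-squares line per species so slopes and intercepts can be compared.
p_depth <- ggplot(penguins,
                  aes(x = bill_depth_mm, y = body_mass_g, colour = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(x = "Bill depth (mm)", y = "Body mass (g)", colour = "Species")

print(p_depth)
```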

Exercise 3

Question: Now replicate the plot below, describing the relationship between flipper length and body mass, and coloring the data according to the associated species. Facet the plot by island (search online if necessary). Finally, tell a story about the relationship between flipper length and weight in these three penguin species, and the distribution of penguins across the three islands.

Answer: The figure above plots flipper length against body mass, with one panel per island. In all species, greater flipper length is associated with greater body mass. Biscoe Island is home to the Adelie and Gentoo penguins, while Dream Island is home to the Chinstrap and Adelie penguins. Torgersen Island is inhabited by Adelie penguins only. This means that Adelie penguins are found on all three islands, while Chinstrap and Gentoo penguins are each confined to a single island.
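
Faceting by island can be sketched with `facet_wrap()`. As before, this assumes the palmerpenguins and ggplot2 packages; `p_flipper` is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Body mass against flipper length, coloured by species,
# with one panel per island.
p_flipper <- ggplot(penguins,
                    aes(x = flipper_length_mm, y = body_mass_g,
                        colour = species)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ island) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)", colour = "Species")

print(p_flipper)
```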

Exercise 4

Question: Replicate the scatter plot below, adding vertical and horizontal lines to indicate the mean of the predictor and of the outcome variable. Show that the linear regression line passes through the intersection of the two means.
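
The property can be illustrated on any data set, because the least-squares intercept satisfies b0 = mean(y) - b1 * mean(x). The sketch below uses simulated data (not the penguin measurements) and base R only; all object names and the simulation parameters are made up for illustration.

```r
# Simulated predictor and outcome (hypothetical values, not the penguin data).
set.seed(1)
x <- rnorm(50, mean = 200, sd = 14)
y <- -5800 + 50 * x + rnorm(50, sd = 400)

fit <- lm(y ~ x)

# Fitted value at the mean of the predictor ...
pred_at_mean <- unname(predict(fit, newdata = data.frame(x = mean(x))))

# ... equals the mean of the outcome (up to floating-point error).
print(all.equal(pred_at_mean, mean(y)))

# Scatter plot with the regression line and dashed lines at the two means.
plot(x, y)
abline(fit)
abline(v = mean(x), lty = 2)
abline(h = mean(y), lty = 2)
```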

Exercise 5

Question: In R, run a hypothesis test to see whether the flipper length and the body mass are correlated. Write down your null hypothesis, the p-value and the confidence interval, describe and interpret your findings.

## 
##  Pearson's product-moment correlation
## 
## data:  penguins$flipper_length_mm and penguins$body_mass_g
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.843041 0.894599
## sample estimates:
##       cor 
## 0.8712018

Answer: The null hypothesis (H0) is that there is no correlation between flipper length and body mass in the penguins; in other words, the true correlation coefficient is zero. The alternative hypothesis is that the true correlation is not zero. The estimated correlation coefficient is 0.871, with a p-value < 0.001 and a 95% confidence interval of [0.843, 0.895]. Because the p-value is far below 0.05 and the confidence interval excludes zero, we reject the null hypothesis and conclude that there is a strong positive correlation between flipper length and body mass.
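
The t statistic and the confidence interval in the output can be reproduced from the correlation coefficient and the sample size alone. The sketch below uses the values printed above (r = 0.8712018, df = 340, so n = 342) and the Fisher z-transformation for the interval; the variable names are illustrative.

```r
r <- 0.8712018   # sample correlation from the output above
n <- 340 + 2     # df = n - 2 for a Pearson correlation test

# t statistic for H0: true correlation = 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

print(round(t_stat, 3))
print(round(ci, 6))
```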

Exercise 6

Question: Use the lm function in R to fit a simple linear regression with one predictor. Write up your findings using the template below, and filling in the gaps in the template with the result of your model.

## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1058.80  -259.27   -26.88   247.33  1288.69 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5780.831    305.815  -18.90   <2e-16 ***
## flipper_length_mm    49.686      1.518   32.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394.3 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.759,  Adjusted R-squared:  0.7583 
## F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

Answer: Linear regression was used to test the association between body mass (in grams) and flipper length (in millimeters) using data from n = 342 penguins. 76% of the variation in body mass was explained by flipper length (adjusted R2 = 0.7583). There was a significant positive association between body mass and flipper length (B = 49.69; 95% CI = 46.70, 52.67; p < .001). On average, for every 1-mm difference in flipper length, penguins differ in mean body mass by 49.69 grams.
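
A 95% confidence interval for the slope can be sketched from the estimate, its standard error and the residual degrees of freedom shown in the summary. Because the inputs here are the rounded values from the printout, the bounds may differ in the last digit from an interval computed on the fitted model (for example with `confint()`); the variable names are illustrative.

```r
est      <- 49.686   # slope estimate from the summary
se       <- 1.518    # its standard error
df_resid <- 340      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit   <- qt(0.975, df_resid)
ci_slope <- est + c(-1, 1) * t_crit * se

print(round(ci_slope, 2))
```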

Exercise 7

Question: We now want to fit a model with parallel slopes, one slope for each of the three species of penguins. Run the regression as you have done above, but this time, add the species variable as a predictor in your model. (1) Write the equation for the linear model (you will need to replace the coefficients in the formula b0,b1,b2 with their estimates). (2) Write up your findings in a similar manner to the template shown in the previous question. (3) Write a separate model for each of the three species. (4) Use the augment(M2) to see the list of residuals and show that the mean of the residuals is almost zero and its standard deviation is the residual standard error.

## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -927.70 -254.82  -23.92  241.16 1191.68 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4031.477    584.151  -6.901 2.55e-11 ***
## flipper_length_mm    40.705      3.071  13.255  < 2e-16 ***
## speciesChinstrap   -206.510     57.731  -3.577 0.000398 ***
## speciesGentoo       266.810     95.264   2.801 0.005392 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 375.5 on 338 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7826, Adjusted R-squared:  0.7807 
## F-statistic: 405.7 on 3 and 338 DF,  p-value: < 2.2e-16

Answer:

  1. The equation for our model can be written as:

    Y = -4031.5 + 40.7 * flipper_length_mm - 206.5 * speciesChinstrap + 266.8 * speciesGentoo + e,  e ~ N(0, 375.5^2)

  2. Linear regression was used to test the association between body mass (in grams) and flipper length (in mm) and species, using data from n = 342 penguins. 78% of the variation in body mass was explained by flipper length and species (adjusted R2 = 0.7807). After adjusting for species, there was a significant positive association between body mass and flipper length (B = 40.71; 95% CI = 34.66, 46.75; p < 0.001). On average, for every 1-mm difference in flipper length, penguins of the same species differ in mean body mass by 40.71 grams.

  3. Adelie: Y = -4031.5 + 40.7 * flipper_length_mm

    Chinstrap: Y = -4238.0 + 40.7 * flipper_length_mm

    Gentoo: Y = -3764.7 + 40.7 * flipper_length_mm

  4. The residuals returned by augment(M2) have a mean of almost zero (1.060549e-11). Their standard deviation is 373.9, close to the residual standard error of 375.5. The residual standard error is slightly larger because it divides the residual sum of squares by the residual degrees of freedom (338 = 342 observations minus 4 estimated coefficients), whereas the sample standard deviation divides by n - 1 = 341. The two observations with missing values are excluded from both calculations.
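
The per-species intercepts and the relationship between the residual standard error and the standard deviation of the residuals can both be derived from the numbers in the summary. This is a sketch using the printed coefficients; the names `b`, `intercepts` and `sd_resid` are illustrative.

```r
# Coefficients copied from summary(M2); Adelie is the reference species.
b <- c(intercept = -4031.477, flipper = 40.705,
       chinstrap = -206.510, gentoo = 266.810)

# Each species shares the flipper slope and gets its own intercept.
intercepts <- c(
  Adelie    = b[["intercept"]],
  Chinstrap = b[["intercept"]] + b[["chinstrap"]],
  Gentoo    = b[["intercept"]] + b[["gentoo"]]
)
print(round(intercepts, 1))

# RSE divides the residual sum of squares by 338 (residual df),
# sd() divides it by n - 1 = 341, so the two differ by a known factor.
rse      <- 375.5
n        <- 342
df_resid <- 338
sd_resid <- rse * sqrt(df_resid / (n - 1))
print(round(sd_resid, 1))
```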

Exercise 8

Question: Reproduce the parallel slopes image below using the coefficients from the second model M2.
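
One way to draw the parallel lines is to pass the three intercepts and the common slope to `geom_abline()`. The sketch below uses the coefficients printed for M2 in Exercise 7 and assumes the palmerpenguins and ggplot2 packages; `lines_df` and `p_parallel` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# One row per species: its own intercept, the shared flipper-length slope.
lines_df <- data.frame(
  species   = c("Adelie", "Chinstrap", "Gentoo"),
  intercept = c(-4031.477, -4031.477 - 206.510, -4031.477 + 266.810),
  slope     = 40.705
)

p_parallel <- ggplot(penguins,
                     aes(x = flipper_length_mm, y = body_mass_g,
                         colour = species)) +
  geom_point(na.rm = TRUE, alpha = 0.6) +
  geom_abline(data = lines_df,
              aes(intercept = intercept, slope = slope, colour = species)) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)", colour = "Species")

print(p_parallel)
```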

Exercise 9

Question: Search online for an explanation of the various plots (for example, this is a good resource, but you may find others). Create a plot for each of the four diagnostic tests, and describe in words what is the purpose of each diagnostic plot. Then interpret the plot: what can we learn from it?

Answer: The first plot, Residuals vs Fitted, is used to detect non-linearity, unequal error variances and outliers. The residuals in this case are spread more or less randomly around the zero line, which supports a linear relationship between our variables.

The Q-Q plot of the residuals is used to check whether the residuals are normally distributed. In this case, the residuals are approximately normally distributed.

The Scale-Location plot is similar to the Residuals vs Fitted plot, but instead of the raw residuals it shows the square root of the absolute standardized residuals. It is used to check for equal variance (homoscedasticity). In our case, the residuals appear to be randomly spread with roughly equal variance, resulting in an approximately horizontal line.

The last plot, Residuals vs Leverage, is used to identify influential observations. In this case, there appear to be no influential cases: no point falls outside the Cook’s distance lines.
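
The four plots described above are the defaults produced by calling `plot()` on a fitted `lm` object. A minimal sketch, assuming the `penguins` data come from the palmerpenguins package and refitting M2 with the formula shown in Exercise 7:

```r
library(palmerpenguins)  # assumed source of the `penguins` data frame

M2 <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)

# Arrange the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage.
op <- par(mfrow = c(2, 2))
plot(M2)
par(op)  # restore the previous graphics settings
```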

Exercise 10

Question: Run the type 3 Anova, discuss and interpret the output you see and link it to the results of the fitted model summary(M2).

## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -927.70 -254.82  -23.92  241.16 1191.68 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4031.477    584.151  -6.901 2.55e-11 ***
## flipper_length_mm    40.705      3.071  13.255  < 2e-16 ***
## speciesChinstrap   -206.510     57.731  -3.577 0.000398 ***
## speciesGentoo       266.810     95.264   2.801 0.005392 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 375.5 on 338 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7826, Adjusted R-squared:  0.7807 
## F-statistic: 405.7 on 3 and 338 DF,  p-value: < 2.2e-16

Answer: In the type 3 ANOVA, the p-values for both flipper length and species are statistically significant. For flipper length, this means the slope differs from zero after adjusting for species; for species, it means that mean body mass differs between at least two species after adjusting for flipper length. Both predictors are therefore associated with mean body mass. The two outputs are linked: summary(M2) reports a separate t-test for each species coefficient, comparing Chinstrap and Gentoo with the reference species Adelie, while the ANOVA tests the species term as a whole with a single F-test. For flipper length, which has one degree of freedom, the ANOVA F statistic is the square of the t statistic in summary(M2).
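
For a term with a single degree of freedom, the F statistic in the ANOVA table equals the squared t statistic from the model summary, and the two tests give the same p-value. The sketch below checks this for flipper length using the t value printed in summary(M2); the variable names are illustrative.

```r
t_flipper <- 13.255   # t value for flipper_length_mm in summary(M2)
df_resid  <- 338      # residual degrees of freedom

# Squared t statistic: the F statistic for the same 1-df term.
f_flipper <- t_flipper^2
print(round(f_flipper, 1))

# Two-sided p-value from the t distribution and upper-tail p-value
# from the F distribution with (1, 338) degrees of freedom.
p_from_t <- 2 * pt(abs(t_flipper), df_resid, lower.tail = FALSE)
p_from_f <- pf(f_flipper, 1, df_resid, lower.tail = FALSE)
print(c(t_test = p_from_t, f_test = p_from_f))
```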

Exercise 11 a

Question: As an optional, bonus exercise, repeat exercises 6 - 10 but this time analyze the association between the bill depth and the body mass: body_mass_g ~ bill_depth_mm + species.

## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1607.38  -510.10   -66.96   462.43  1819.28 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7488.65     335.22   22.34   <2e-16 ***
## bill_depth_mm  -191.64      19.42   -9.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 708.1 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2227, Adjusted R-squared:  0.2204 
## F-statistic: 97.41 on 1 and 340 DF,  p-value: < 2.2e-16

Answer: Linear regression was used to test the association between bill depth (mm) and body mass (g) using data from n = 342 penguins. 22.3% of the variation in body mass was explained by bill depth (R2 = 0.2227). There was a significant negative association between bill depth and body mass (B = -191.64; 95% CI = -229.84, -153.44; p < .001). On average, for every 1-mm increase in bill depth, mean body mass is 191.64 grams lower.

Exercise 11 b

## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm + species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -867.73 -255.61  -27.41  242.41 1190.86 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1007.28     323.56  -3.113  0.00201 ** 
## bill_depth_mm      256.61      17.56  14.611  < 2e-16 ***
## speciesChinstrap    13.38      52.95   0.253  0.80069    
## speciesGentoo     2238.67      73.68  30.383  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 362.4 on 338 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.7957 
## F-statistic: 443.8 on 3 and 338 DF,  p-value: < 2.2e-16

Answer: When species was added to the model, the association reversed: within species, bill depth and body mass are positively related. This is an instance of Simpson’s paradox.

  1. The equation for the model is:

    Y = -1007.3 + 256.6 * bill_depth_mm + 13.4 * speciesChinstrap + 2238.7 * speciesGentoo + e,  e ~ N(0, 362.4^2)

  2. The equations for each species are:

    Adelie: Y = -1007.3 + 256.6 * bill_depth_mm

    Chinstrap: Y = -993.9 + 256.6 * bill_depth_mm

    Gentoo: Y = 1231.4 + 256.6 * bill_depth_mm

  3. The mean of the residuals is almost zero (4.531533e-12), and the standard deviation of the residuals (360.8) is similar to the residual standard error (362.4). The residual standard error is slightly higher because it divides the residual sum of squares by the residual degrees of freedom (338 = 342 observations minus 4 estimated coefficients), whereas the sample standard deviation divides by n - 1 = 341.
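
The sign reversal of the bill depth coefficient can be checked by fitting both models and comparing the slopes. This is a sketch that assumes the `penguins` data come from the palmerpenguins package; the model names `M_simple` and `M_adjusted` are illustrative.

```r
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Bill depth alone versus bill depth adjusted for species.
M_simple   <- lm(body_mass_g ~ bill_depth_mm, data = penguins)
M_adjusted <- lm(body_mass_g ~ bill_depth_mm + species, data = penguins)

slopes <- c(
  unadjusted = coef(M_simple)[["bill_depth_mm"]],
  adjusted   = coef(M_adjusted)[["bill_depth_mm"]]
)

# The unadjusted slope is negative, the species-adjusted slope positive.
print(round(slopes, 2))
```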

Exercise 11 c

Exercise 11 d

Answer: The diagnostic plots look similar to those in Exercise 9: the residuals are approximately normally distributed, their variance is roughly constant, and there are no highly influential data points.

Exercise 11 e

## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm + species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -867.73 -255.61  -27.41  242.41 1190.86 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1007.28     323.56  -3.113  0.00201 ** 
## bill_depth_mm      256.61      17.56  14.611  < 2e-16 ***
## speciesChinstrap    13.38      52.95   0.253  0.80069    
## speciesGentoo     2238.67      73.68  30.383  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 362.4 on 338 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.7957 
## F-statistic: 443.8 on 3 and 338 DF,  p-value: < 2.2e-16

Answer: Both bill depth and species type are significantly associated with body mass.