Exercise 1

Question: Create a table1 and use GGally::ggpairs to explore your data.

Table 1: Darwin's finches by species
conirostris
(N=185)
difficilis
(N=102)
fortis
(N=510)
fuliginosa
(N=318)
magnirostris
(N=90)
scandens
(N=229)
Beak Height (mm)
Mean (SD) 14.9 (1.63) 8.17 (0.494) 12.3 (1.29) 8.00 (0.508) 19.9 (1.83) 9.56 (0.584)
Median [Min, Max] 15.0 [10.5, 18.7] 8.10 [7.10, 9.70] 12.0 [8.60, 16.6] 8.00 [6.70, 10.2] 20.0 [11.5, 23.2] 9.60 [8.20, 11.9]
Wing Length (mm)
Mean (SD) 77.3 (2.94) 64.7 (3.35) 69.9 (3.45) 61.9 (2.39) 81.4 (3.92) 70.6 (2.20)
Median [Min, Max] 77.0 [68.0, 84.0] 64.0 [58.0, 71.0] 70.0 [60.0, 80.0] 62.0 [55.0, 70.0] 81.0 [70.0, 89.0] 71.0 [65.0, 76.0]
Upper beak
Mean (SD) 21.0 (1.24) 13.8 (0.720) 16.5 (1.21) 12.5 (0.649) 22.8 (1.64) 19.6 (1.09)
Median [Min, Max] 21.0 [18.1, 24.1] 13.8 [12.3, 15.5] 16.3 [13.4, 20.2] 12.5 [11.0, 15.0] 22.9 [15.0, 25.9] 19.6 [17.1, 22.5]
Sex
F 59 (31.9%) 29 (28.4%) 155 (30.4%) 128 (40.3%) 29 (32.2%) 72 (31.4%)
M 126 (68.1%) 73 (71.6%) 355 (69.6%) 190 (59.7%) 61 (67.8%) 157 (68.6%)

Exercise 2

Question: Try to replicate as many graphs as you can from those shown above, and try to explain what could be the limitations of a fixed effects models we created?

## 
## Call:
## lm(formula = beakh ~ wingl, data = finches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9257 -1.0104  0.0236  1.0507  6.1625 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.397361   0.503009  -40.55   <2e-16 ***
## wingl         0.457641   0.007204   63.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.734 on 1432 degrees of freedom
## Multiple R-squared:  0.7381, Adjusted R-squared:  0.7379 
## F-statistic:  4035 on 1 and 1432 DF,  p-value: < 2.2e-16

Answer: The limitations of a fixed effect model involves that this approach requires an increased number of parameter estimations. The fixed effects approach allows each specie to have its own intercept, which leads to a decrease in the degrees of freedom. The fact that for all species an intercept is estimated will lead to reduce the model’s power. Moreover, there is an increased chance of a Type II error. The model also becomes more complex and more difficult to generalize. Finally, interpreting the results can be more challenging than in simpler models.

Exercise 3

Question:
Try to replicate as many figures above as you can. What do you notice about the intercepts? What do you notice about the slopes? How does this compare to the previous models we fit?

Answer: In the plots above we identified the following: The larger the sizes of the beaks and the wings of the exact species the more the intercept shifts up vertically and to the right.

Also, the intercepts are normally distributed, and information is partially shared across species. Our intercept and slope are negatively correlated.

The slopes have the same increasing tendency when the intercepts and the slopes are calculated separately for each species. The slopes for all the different species can either be the same, meaning the species share the same slope. Alternatively, they can differ for each specific specie.

In contrast to the earlier model, the intercept estimate for each species comprises both the “fixed effect” intercept and the additional “random” deviation. In the random effect approach we allow information to be shared accross species. In previous models with a fixed intercept, on the other hand, each specie has their own intercept that is isolated from the other species.

Exercise 4

Question:
How might we make the intercept more meaningful if we were to refit this model?

## $species
##              (Intercept)
## conirostris    1.3417493
## difficilis    -2.5605229
## fortis         0.4037595
## fuliginosa    -2.0911466
## magnirostris   5.3831675
## scandens      -2.4770068
## 
## with conditional variances for "species"
## [1]  1.3417493 -2.5605229  0.4037595 -2.0911466  5.3831675 -2.4770068
## [1] 0
## (Intercept)       wingl 
##  -3.8188397   0.2247786
## [1] -3.81884
## [1] 0.2247786
## numeric(0)
## $species
##              (Intercept)     wingl
## conirostris    -2.477090 0.2247786
## difficilis     -6.379363 0.2247786
## fortis         -3.415080 0.2247786
## fuliginosa     -5.909986 0.2247786
## magnirostris    1.564328 0.2247786
## scandens       -6.295846 0.2247786
## 
## attr(,"class")
## [1] "coef.mer"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: beakh ~ wingl + (1 | species)
##    Data: finches
## 
## REML criterion at convergence: 3830.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.1590 -0.5683 -0.0215  0.5570  3.4892 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  species  (Intercept) 9.6096   3.0999  
##  Residual             0.8176   0.9042  
## Number of obs: 1434, groups:  species, 6
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept) -3.819e+00  1.385e+00  7.131e+00  -2.758   0.0277 *  
## wingl        2.248e-01  7.912e-03  1.431e+03  28.409   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## wingl -0.405
## lmer(formula = beakh ~ wingl + (1 | species), data = finches)
##             coef.est coef.se
## (Intercept) -3.82     1.38  
## wingl        0.22     0.01  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  species  (Intercept) 3.10    
##  Residual             0.90    
## ---
## number of obs: 1434, groups: species, 6
## AIC = 3838.8, DIC = 3819.5
## deviance = 3825.1
## [1] 0.2

Answer: In this case, we can see that the mean of the intercept is very close to 0. We need to center all the variables of the model in such a way that the mean = 0. Besides, we need to scale all variables in such a way that SD = 1. In this way we ensure that all variables are on the same scale, which makes it easier to for insatnce compare effect sizes

Addendum

George Box, a famous statistician, once wrote “…all models are wrong, but some are useful”.

It’s important to know what our models are ignoring and what assumptions we are making. Discuss with your neighbour what is wrong with this model and how it could be done better. How big a problem do you think these issues might be?

Answer:

A model does not not always reflect the reality as a whole. Nevertheless, it is still useful to make inferences and help us understand the best we can. It is very complex to account for all possible factors that are present in the real world and include them in a model, meaning we ignore certain factors that could be influential.

In this dataset, we assume that we have enough information about the features of the species, but there could be some other significant variables that affect the association tested. There may thus be other variables that affect the association we are interested in.

We should be confident that the predictors variables we include, will to a maximum extent, explain the association that is being investigated.