Excercise

  1. The file gapminderData5.csv contains a subset of the GapMinder dataset for 142 countries over 12 time steps (every 5 years between 1952 and 2007).
gapminder <- read.csv("gapminderData5.csv")
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
mlifeExp <- tapply(gapminder$lifeExp, gapminder$country, mean)

Making a scatter plot of life Expectancy vs year

plot(gapminder$year, gapminder$lifeExp,
     col = "blue",
     pch = 16,
     xlab = "Year",
     ylab = "Life Expectancy",
     main = "Life Expectancy vs Year (All countries")

## OR use ggplot

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(gapminder, aes(x = year, y = lifeExp, colour = country)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Liife Expectancy vs year of all countries")

Line plot for two countries

head(gapminder, 10)
##        country year      pop continent lifeExp gdpPercap
## 1  Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2  Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3  Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4  Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5  Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6  Afghanistan 1977 14880372      Asia  38.438  786.1134
## 7  Afghanistan 1982 12881816      Asia  39.854  978.0114
## 8  Afghanistan 1987 13867957      Asia  40.822  852.3959
## 9  Afghanistan 1992 16317921      Asia  41.674  649.3414
## 10 Afghanistan 1997 22227415      Asia  41.763  635.3414
gapmind_sub <- subset(gapminder, country %in% c("Japan", "Canada"))

Jp <- subset(gapmind_sub, country == "Japan")
Ca <- subset(gapmind_sub, country == "Canada")

plot(Jp$year, Jp$lifeExp,
     xlab = "Year", ylab = "Life Expectancy (yr)",
     main = "Life expectancy between Japan and Canada",
     type = 'l', col = "blue")
lines(Ca$year, Ca$lifeExp,
      type = 'l', col = "red")

legend("topleft", legend = c("Japan", "Canada"),
       lty = 1,
       col = c('blue', 'red'))

Use these data to build a simple linear model (lm()) of lifeExp as a function of year. Before making the model, center the year variable by subtracting 1952 from all values (0 now represents the year 1952)

gapminder$year_c = gapminder$year - 1952

gap_m1 <- lm(lifeExp ~ year_c, data = gapminder)
summary(gap_m1)
## 
## Call:
## lm(formula = lifeExp ~ year_c, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.949  -9.651   1.697  10.335  22.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.51208    0.53000   95.31   <2e-16 ***
## year_c       0.32590    0.01632   19.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.63 on 1702 degrees of freedom
## Multiple R-squared:  0.1898, Adjusted R-squared:  0.1893 
## F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16

Now build a random intercept model using the lme() function, using country as the grouping variable for the random effect (the random effect will look something like random = ~ 1 | country).

gapminder$country = factor(gapminder$country)

library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
gap_m2 <- lme(lifeExp ~ year_c, 
              random = ~ 1| country,
              data = gapminder)
summary(gap_m2)
## Linear mixed-effects model fit by REML
##   Data: gapminder 
##        AIC      BIC    logLik
##   9874.263 9896.022 -4933.132
## 
## Random effects:
##  Formula: ~1 | country
##         (Intercept) Residual
## StdDev:    11.09716 3.584196
## 
## Fixed effects:  lifeExp ~ year_c 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 50.51208 0.9454671 1561 53.42553       0
## year_c       0.32590 0.0050305 1561 64.78581       0
##  Correlation: 
##        (Intr)
## year_c -0.146
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -6.1691513 -0.4910339  0.1216455  0.6030002  2.4194109 
## 
## Number of Observations: 1704
## Number of Groups: 142

Calculate the ICC for the random intercept in this model. You will need to use VarCorr() to find the variance of the intercept and residual term (quick reminder, the intercept ICC is the variance of the intercept, divided by the sum of variances).

VarCorr(gap_m2)
## country = pdLogChol(1) 
##             Variance  StdDev   
## (Intercept) 123.14688 11.097156
## Residual     12.84646  3.584196

ICC = 123.15 + 12.15 = 135.95 = 123.15/135.95

ICC (0.91), 91% of variation in the life expectancy is differences between countries. therefore, since the ICC is high or large then the variation of the coefficient between countries is important and worth retaining in the model.

Now build a third model with a random intercept and slope. The random effect will look something like random = ~ year2 | country. Show the output of summary() for this model.

gap_m3 <- lme(lifeExp ~ year_c,
              random = ~ year_c | country,
              data = gapminder)
summary(gap_m3)
## Linear mixed-effects model fit by REML
##   Data: gapminder 
##        AIC      BIC    logLik
##   8747.277 8779.915 -4367.639
## 
## Random effects:
##  Formula: ~year_c | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 12.3027957 (Intr)
## year_c       0.1582867 -0.434
## Residual     2.1807917       
## 
## Fixed effects:  lifeExp ~ year_c 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 50.51208 1.0371995 1561 48.70045       0
## year_c       0.32590 0.0136312 1561 23.90865       0
##  Correlation: 
##        (Intr)
## year_c -0.439
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -8.08192929 -0.34891405  0.02690066  0.39742349  4.67388085 
## 
## Number of Observations: 1704
## Number of Groups: 142

Finally, use the AIC() function to get the AIC for the three models (linear, random intercept and random intercept and slope). As a reminder: the AIC is a good criterion of the quality of the model, as it penalizes when extra predictors (overfitting). The models with the lowest AIC values are best. Which of your models is best?

AIC(gap_m1, gap_m2, gap_m3)
## Warning in AIC.default(gap_m1, gap_m2, gap_m3): models are not all fitted to
## the same number of observations
##        df       AIC
## gap_m1  3 13201.732
## gap_m2  4  9874.263
## gap_m3  6  8747.277