Exercise 1

Instructions: The file gapminderData5.csv contains a subset of the GapMinder dataset for 142 countries over 12 time steps (every 5 years between 1952 and 2007).

gap <- read.csv("gapminderData5.csv")

m_le <- tapply(gap$lifeExp, gap$country, mean)
head(m_le)
## Afghanistan     Albania     Algeria      Angola   Argentina   Australia 
##    37.47883    68.43292    59.03017    37.88350    69.06042    74.66292
plt_lifeExp_year <- ggplot(gap, aes(year, lifeExp, color=continent)) +
  geom_point(alpha = 0.3, cex = 2) +
  labs(y = "Life Expectancy (y)", x = "Year") +
  ggtitle("Life Expectancy for All Countries by Year") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  scale_colour_brewer(palette = "Set1")
plt_lifeExp_year

gap_subset <- gap[which(gap$country == "Montenegro" | gap$country == "Morocco"),]

plt_line <- ggplot(gap_subset, aes(x = year, y = lifeExp,
                       colour = country)) +
  geom_line(size = 0.75) +
  theme_bw() +
  labs(x = "Year", y = "Life Expectancy (y)") +
  ggtitle("Life Expectancy Comparison Between \nMontenegro and Morocco") +
  scale_colour_brewer(palette = "Dark2")

plt_line

gap$year_c_s <- (gap$year - 1952) / 10

summary(life_lm <- lm(lifeExp ~ year_c_s,
               data = gap))
## 
## Call:
## lm(formula = lifeExp ~ year_c_s, data = gap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.949  -9.651   1.697  10.335  22.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.5121     0.5300   95.31   <2e-16 ***
## year_c_s      3.2590     0.1632   19.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.63 on 1702 degrees of freedom
## Multiple R-squared:  0.1898, Adjusted R-squared:  0.1893 
## F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16

Answer: The intercept is 50.5121 (p<2e-16), meaning that in the year 1952 (t = 0), the expected value for life expectancy globally was 50.5 years or so. The coefficient estimate for year is 3.259 (p < 2e-16) — every decade after 1952, the life expectancy increased by about 3 years and 3 months. Our F-statistic is 398.6 (p < 2.2e-16).

summary(life_lme <- lme(lifeExp ~ year_c_s,
                random = ~ 1 | country,
                data = gap))
## Linear mixed-effects model fit by REML
##  Data: gap 
##        AIC      BIC    logLik
##   9869.658 9891.417 -4930.829
## 
## Random effects:
##  Formula: ~1 | country
##         (Intercept) Residual
## StdDev:    11.09716 3.584196
## 
## Fixed effects: lifeExp ~ year_c_s 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 50.51208 0.9454671 1561 53.42553       0
## year_c_s     3.25904 0.0503048 1561 64.78581       0
##  Correlation: 
##          (Intr)
## year_c_s -0.146
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -6.1691513 -0.4910339  0.1216455  0.6030002  2.4194109 
## 
## Number of Observations: 1704
## Number of Groups: 142

Answer: The fixed effect coefficients are: Intercept: 50.512 (p = 0), Year: 3.259 (p = 0). The standard deviations are 11.097 (Intercept) and 3.584 (Residual).

icc <- VarCorr(life_lme)
icc
## country = pdLogChol(1) 
##             Variance  StdDev   
## (Intercept) 123.14688 11.097156
## Residual     12.84646  3.584196
123.14688 + 12.84646 #Total Variance in the Model: 135.993
## [1] 135.9933
123.14688 / 135.993 # Proportion of Variation Due to Slope (ICC): 0.9055
## [1] 0.9055384
12.84646 + 3.584196 # Residual Variation: 16.43
## [1] 16.43066

Answer: With an ICC of 90.55%, there is support for using the random intercept model.

summary(life_rand <- lme(lifeExp ~ year_c_s,
                random = ~ year_c_s | country,
                data = gap))
## Linear mixed-effects model fit by REML
##  Data: gap 
##        AIC     BIC    logLik
##   8742.672 8775.31 -4365.336
## 
## Random effects:
##  Formula: ~year_c_s | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 12.302796 (Intr)
## year_c_s     1.582867 -0.434
## Residual     2.180792       
## 
## Fixed effects: lifeExp ~ year_c_s 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 50.51208 1.0371995 1561 48.70045       0
## year_c_s     3.25904 0.1363121 1561 23.90865       0
##  Correlation: 
##          (Intr)
## year_c_s -0.439
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -8.08192929 -0.34891405  0.02690066  0.39742349  4.67388085 
## 
## Number of Observations: 1704
## Number of Groups: 142

Answer:

AIC(life_lm, life_lme, life_rand)
## Warning in AIC.default(life_lm, life_lme, life_rand): models are not all fitted
## to the same number of observations
##           df       AIC
## life_lm    3 13201.732
## life_lme   4  9869.658
## life_rand  6  8742.672

Answer: With an AIC of 8742.672, the model with the best fit is the model with a random intercept and slope. This makes intuitive sense, since there was a lot of variation in life expectancy between countries in 1952 — we would expect the intercept to vary by country.

While global average life expectancy increased, we would also expect countries to have different trends — for instance, the famine in Somalia caused life expectancy to drop around 1990, but at the same time the India saw a relatively steep increase in life expectancy. Zimbabwe had, on average, a decreased life expectancy of -3.97 years per decade — the lowest slope of the datasets. Oman had the highest increase in life expectancy overall, with a slope of 3.26.

x <- data.frame(life_rand[["coefficients"]][["random"]][["country"]])
max(x$year_c_s)
## [1] 4.236716
min(x$year_c_s)
## [1] -3.972096
max(life_lme[["coefficients"]][["fixed"]][["year_c_s"]])
## [1] 3.259038
min(life_lme[["coefficients"]][["fixed"]][["year_c_s"]])
## [1] 3.259038
gap_subset_2 <- gap[which(gap$country == "India" | gap$country == "Somalia"),]
gap_subset_3 <- gap[which(gap$country == "Zimbabwe" | gap$country == "Oman"),]

ggplot(gap_subset_2, aes(x = year, y = lifeExp,
                       colour = country)) +
  geom_line(size = 0.75) +
  theme_bw() +
  labs(x = "Year", y = "Life Expectancy (y)") +
  ggtitle("Highs and Lows in 1992") +
  scale_colour_brewer(palette = "Dark2") +
  geom_vline(xintercept = c(1992, 1992), linetype = "dotted")

ggplot(gap_subset_3, aes(x = year, y = lifeExp,
                       colour = country)) +
  geom_line(size = 0.75) +
  theme_bw() +
  labs(x = "Year", y = "Life Expectancy (y)") +
  ggtitle("Lowest Slope vs. Highest Slope") +
  scale_colour_brewer(palette = "Dark2") +
  geom_vline(xintercept = c(1992, 1992), linetype = "dotted")