Exercise 1:

The file gapminderData5.csv contains a subset of the GapMinder dataset for 142 countries over 12 time steps (every 5 years between 1952 and 2007). Read the data in and use the tapply() function to estimate the mean life expectancy by country (tapply(gap\(lifeExp, gap\)country, mean))

# read in data
gap <- read.csv("../datafiles/gapminderData5.csv")

# set factors for country, year, and continent data
gap$country <- factor(gap$country)
gap$year <- factor(gap$year)
gap$continent <- factor(gap$continent)

# estimate mean life expectancy by country
country_mean_life <- data.frame(mean_life = tapply(gap$lifeExp, 
                                                   gap$country, mean))

head(country_mean_life)
##             mean_life
## Afghanistan  37.47883
## Albania      68.43292
## Algeria      59.03017
## Angola       37.88350
## Argentina    69.06042
## Australia    74.66292

Make a scatter plot of lifeExp vs. year for all countries

library(ggplot2)

ggplot(gap) +
  geom_point(aes(x = year, y = lifeExp, col = continent)) +
  labs(title = "Life Expectancy 1962 - 2007",
       x = "Year",
       y = "Life Expectancy (Years)",
       col = "Continent") +
  theme_bw()

The good news is that life expectancy for all of us has increased over the last 50+ years.

Make plots for two countries (not the United States!):

My two countries are Finland and Nepal.

gap2 <- subset(gap, (country == "Finland") | (country == "Nepal"))

ggplot(gap2) +
  geom_point(aes(x = as.numeric(as.character(year)), y = lifeExp, col = country)) +
  labs(title = "",
       x = "Year",
       y = "Life Expectancy (years)",
       col = "Country") +
  scale_color_manual(values = c("Finland" = "blue4",
                                    "Nepal" = "darkgreen")) +
  theme_bw()

Finland started out with a life expectancy of 66.5 years that gradually increased to 79.3 years. Nepal had a life expectancy of only 36.2 years in 1952. Fortunately this increased to 63.8 years by 2007.

Use these data to build a simple linear model (lm()) of lifeExp as a function of year. Before making the model, center the year variable by subtracting 1952 from all values (0 now represents the year 1952)

# unfactor year for now
gap2$year <- as.numeric(as.character(gap2$year))
gap2$year52 <- gap2$year - 1952

# change factor levels for only 2 countries
gap2$country <- factor(as.character(gap2$country))
gap2$continent <- factor(as.character(gap2$continent))

lm_life <- lm(lifeExp ~ year52, gap2)

summary(lm_life)
## 
## Call:
## lm(formula = lifeExp ~ year52, data = gap2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8832 -12.5861   0.0111  12.5179  16.1097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.4403     4.9248  10.242  7.8e-10 ***
## year52        0.3836     0.1517   2.529   0.0191 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.83 on 22 degrees of freedom
## Multiple R-squared:  0.2252, Adjusted R-squared:   0.19 
## F-statistic: 6.395 on 1 and 22 DF,  p-value: 0.01912

Report the coefficients, their significance and a brief explanation of what each one means

The (Intercept) is 50.44 years, which is the average life expectancy for both countries combined. The slope coefficient (year52) is 0.3836 years/year which means for every additional year the life expectancy for both countries increases by 0.3836 years. Both of these coefficients have low p-values, suggesting that the coefficient values are significant.

Give the \(F\)-statistic and associated \(p\)-value

The \(F\)-statistic for the model is 6.392 with 1 and 22 degrees of freedom, resulting in a \(p\)-value of 0.019, which is good(ish).

ggplot(gap2) +
  geom_point(aes(x = year52, y = lifeExp, col = country)) +
  scale_color_manual(values = c("Finland" = "blue4",
                                    "Nepal" = "darkgreen")) +
  geom_abline(slope = coef(lm_life)[["year52"]],
              intercept = coef(lm_life)[["(Intercept)"]],
              col = "darkred") +
  labs(title = "",
       x = "Year",
       y = "Life Expectancy (years)",
       col = "Country") +
  theme_bw()

The model is a good fit, but doesn’t really capture the different life expectancy rates between the two countries.

Now build a random intercept model using the lme() function, using country as the grouping variable for the random effect (the random effect will look something like random = ~ 1 | country).

library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
lme_life2 <- lme(lifeExp ~ year52,
                        random = ~ 1 | country,
                        data = gap2)

summary(lme_life2)
## Linear mixed-effects model fit by REML
##   Data: gap2 
##        AIC      BIC    logLik
##   133.4761 137.8403 -62.73805
## 
## Random effects:
##  Formula: ~1 | country
##         (Intercept) Residual
## StdDev:    16.95562 2.773297
## 
## Fixed effects:  lifeExp ~ year52 
##                Value Std.Error DF   t-value p-value
## (Intercept) 50.44031 12.036630 21  4.190567   4e-04
## year52       0.38359  0.032798 21 11.695741   0e+00
##  Correlation: 
##        (Intr)
## year52 -0.075
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.51481611 -0.85365430  0.02447448  0.74413398  1.52279051 
## 
## Number of Observations: 24
## Number of Groups: 2

Report the fixed effect coefficients and associated \(p\)-values (see under Fixed effects in the summary() output)

The fixed effect coefficients (and \(p\)-values) are the same as in the previous model, 50.44 years for average life expectancy and 0.3836 year average increase in life expectancy for each year from 1952.

Report the standard deviation for the two random effects ((Intercept) and Residual)

The standard deviation for the random effect are (Intercept) = 16.96 and Residual = 2.77.

Calculate the ICC for the random intercept in this model. You will need to use VarCorr() to find the variance of the intercept and residual term (quick reminder, the intercept ICC is the variance of the intercept, divided by the sum of variances). Given the value you obtain, does this support the use of a random intercept model?

VarCorr(lme_life2)
## country = pdLogChol(1) 
##             Variance   StdDev   
## (Intercept) 287.493021 16.955619
## Residual      7.691175  2.773297

The total variance is \(287.49 + 7.69 = 295.18\). The ICC for the intercept is \(287.49/295.18 = 97.4%\), the residuals are \(7.69/295.18 = 2.6%\). This suggests that almost all of the variance can be explained by the variation between the intercepts of the two countries different life expectancies, with only a small percentage explained by random effects.

Now build a third model with a random intercept and slope. The random effect will look something like random = ~ year2 | country. Show the output of summary() for this model.

lme_life3 <- lme(lifeExp ~ year52 * country,
                        random = ~ year52 | country,
                        data = gap2)

summary(lme_life3)
## Warning in pt(-abs(tVal), fDF): NaNs produced
## Linear mixed-effects model fit by REML
##   Data: gap2 
##        AIC      BIC   logLik
##   79.81821 87.78407 -31.9091
## 
## Random effects:
##  Formula: ~year52 | country
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.53878497 (Intr)
## year52      0.01659437 0     
## Residual    0.69990221       
## 
## Fixed effects:  lifeExp ~ year52 * country 
##                         Value Std.Error DF   t-value p-value
## (Intercept)          66.44897 0.6593452 20 100.78025       0
## year52                0.23793 0.0203076 20  11.71608       0
## countryNepal        -32.01733 0.9324549  0 -34.33660     NaN
## year52:countryNepal   0.29134 0.0287193 20  10.14429       0
##  Correlation: 
##                     (Intr) year52 cntryN
## year52              -0.281              
## countryNepal        -0.707  0.199       
## year52:countryNepal  0.199 -0.707 -0.281
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.49431132 -0.47551182 -0.07417703  0.22098325  2.46514292 
## 
## Number of Observations: 24
## Number of Groups: 2

The model now takes country into account. There are now four coefficients: (Intercept) = 66.45 years for the life expectancy (grand mean intercept) with respect to Finland, with an average rate of increase (grand mean slope) of 0.24 years for each year after 1952 (again, with respect to Finland). The countryNepal coefficient is the average offset for life expectancy compared to Finland’s (or Nepal’s life expectancy is, on average, 32.02 years lower than that of Finland). The year52:countryNepal is the coefficient for the average change in life expectancy compared to Finland (or Nepal’s average change in yearly life expectancy is \(0.291/0.238 = 122%\) higher than that of Finland’s).

gap2 <- gap2 %>%
          mutate(fit3 = predict(lme_life3, re.form = NULL))

ggplot(gap2) +
  geom_point(aes(x = year52, y = lifeExp, col = country)) +
  scale_color_manual(values = c("Finland" = "blue4",
                                "Nepal" = "darkgreen")) +
  geom_line(aes(x = year52, y = fit3, col = country)) +
  labs(title = "",
       x = "Year",
       y = "Life Expectancy (years)",
       col = "Country") +
  theme_bw()

Using this method, there is now a way to model both country’s life expectancies.

Finally, use the AIC() function to get the AIC for the three models (linear, random intercept and random intercept and slope). As a reminder: the AIC is a good criterion of the quality of the model, as it penalizes when extra predictors (overfitting). The models with the lowest AIC values are best. Which of your models is best?

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
model_compare <- data.frame(linear = AIC(lm_life),
                            randomInt = AIC(lme_life2),
                            rndIntSlop = AIC(lme_life3))

colnames(model_compare) <- c("Linear", "Random Intercept", "Random Intercept & Slope")
rownames(model_compare) <- "AIC"

model_compare %>% kbl( ) %>% kable_styling()
Linear Random Intercept Random Intercept & Slope
AIC 194.4914 133.4761 79.81821

The AIC started out at 194 for the simplest model. With each increase in model complexity the AIC went down. The final model with both random intercept and random slope had an AIC of 80. The AIC is an estimate of prediction error. As the increasing model complexity better explain the data, the AIC value decreases. The final model that accounts for the different countries is the best model for the data.


end