The file gapminderData5.csv contains a subset of the GapMinder dataset for 142 countries over 12 time steps (every 5 years between 1952 and 2007). Read the data in and use the tapply() function to estimate the mean life expectancy by country (tapply(gap\(lifeExp, gap\)country, mean))
# read in data
gap <- read.csv("../datafiles/gapminderData5.csv")
# set factors for country, year, and continent data
gap$country <- factor(gap$country)
gap$year <- factor(gap$year)
gap$continent <- factor(gap$continent)
# estimate mean life expectancy by country
country_mean_life <- data.frame(mean_life = tapply(gap$lifeExp,
gap$country, mean))
head(country_mean_life)
## mean_life
## Afghanistan 37.47883
## Albania 68.43292
## Algeria 59.03017
## Angola 37.88350
## Argentina 69.06042
## Australia 74.66292
Make a scatter plot of lifeExp vs. year for all countries
library(ggplot2)
ggplot(gap) +
geom_point(aes(x = year, y = lifeExp, col = continent)) +
labs(title = "Life Expectancy 1962 - 2007",
x = "Year",
y = "Life Expectancy (Years)",
col = "Continent") +
theme_bw()
The good news is that life expectancy for all of us has increased over the last 50+ years.
Make plots for two countries (not the United States!):
My two countries are Finland and Nepal.
gap2 <- subset(gap, (country == "Finland") | (country == "Nepal"))
ggplot(gap2) +
geom_point(aes(x = as.numeric(as.character(year)), y = lifeExp, col = country)) +
labs(title = "",
x = "Year",
y = "Life Expectancy (years)",
col = "Country") +
scale_color_manual(values = c("Finland" = "blue4",
"Nepal" = "darkgreen")) +
theme_bw()
Finland started out with a life expectancy of 66.5 years that gradually increased to 79.3 years. Nepal had a life expectancy of only 36.2 years in 1952. Fortunately this increased to 63.8 years by 2007.
Use these data to build a simple linear model (lm()) of lifeExp as a function of year. Before making the model, center the year variable by subtracting 1952 from all values (0 now represents the year 1952)
# unfactor year for now
gap2$year <- as.numeric(as.character(gap2$year))
gap2$year52 <- gap2$year - 1952
# change factor levels for only 2 countries
gap2$country <- factor(as.character(gap2$country))
gap2$continent <- factor(as.character(gap2$continent))
lm_life <- lm(lifeExp ~ year52, gap2)
summary(lm_life)
##
## Call:
## lm(formula = lifeExp ~ year52, data = gap2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8832 -12.5861 0.0111 12.5179 16.1097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.4403 4.9248 10.242 7.8e-10 ***
## year52 0.3836 0.1517 2.529 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.83 on 22 degrees of freedom
## Multiple R-squared: 0.2252, Adjusted R-squared: 0.19
## F-statistic: 6.395 on 1 and 22 DF, p-value: 0.01912
Report the coefficients, their significance and a brief explanation of what each one means
The (Intercept) is 50.44 years, which is the average life expectancy for both countries combined. The slope coefficient (year52) is 0.3836 years/year which means for every additional year the life expectancy for both countries increases by 0.3836 years. Both of these coefficients have low p-values, suggesting that the coefficient values are significant.
Give the \(F\)-statistic and associated \(p\)-value
The \(F\)-statistic for the model is 6.392 with 1 and 22 degrees of freedom, resulting in a \(p\)-value of 0.019, which is good(ish).
ggplot(gap2) +
geom_point(aes(x = year52, y = lifeExp, col = country)) +
scale_color_manual(values = c("Finland" = "blue4",
"Nepal" = "darkgreen")) +
geom_abline(slope = coef(lm_life)[["year52"]],
intercept = coef(lm_life)[["(Intercept)"]],
col = "darkred") +
labs(title = "",
x = "Year",
y = "Life Expectancy (years)",
col = "Country") +
theme_bw()
The model is a good fit, but doesn’t really capture the different life expectancy rates between the two countries.
Now build a random intercept model using the lme() function, using country as the grouping variable for the random effect (the random effect will look something like random = ~ 1 | country).
library(nlme)
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
lme_life2 <- lme(lifeExp ~ year52,
random = ~ 1 | country,
data = gap2)
summary(lme_life2)
## Linear mixed-effects model fit by REML
## Data: gap2
## AIC BIC logLik
## 133.4761 137.8403 -62.73805
##
## Random effects:
## Formula: ~1 | country
## (Intercept) Residual
## StdDev: 16.95562 2.773297
##
## Fixed effects: lifeExp ~ year52
## Value Std.Error DF t-value p-value
## (Intercept) 50.44031 12.036630 21 4.190567 4e-04
## year52 0.38359 0.032798 21 11.695741 0e+00
## Correlation:
## (Intr)
## year52 -0.075
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.51481611 -0.85365430 0.02447448 0.74413398 1.52279051
##
## Number of Observations: 24
## Number of Groups: 2
Report the fixed effect coefficients and associated \(p\)-values (see under Fixed effects in the summary() output)
The fixed effect coefficients (and \(p\)-values) are the same as in the previous model, 50.44 years for average life expectancy and 0.3836 year average increase in life expectancy for each year from 1952.
Report the standard deviation for the two random effects ((Intercept) and Residual)
The standard deviation for the random effect are (Intercept) = 16.96 and Residual = 2.77.
Calculate the ICC for the random intercept in this model. You will need to use VarCorr() to find the variance of the intercept and residual term (quick reminder, the intercept ICC is the variance of the intercept, divided by the sum of variances). Given the value you obtain, does this support the use of a random intercept model?
VarCorr(lme_life2)
## country = pdLogChol(1)
## Variance StdDev
## (Intercept) 287.493021 16.955619
## Residual 7.691175 2.773297
The total variance is \(287.49 + 7.69 = 295.18\). The ICC for the intercept is \(287.49/295.18 = 97.4%\), the residuals are \(7.69/295.18 = 2.6%\). This suggests that almost all of the variance can be explained by the variation between the intercepts of the two countries different life expectancies, with only a small percentage explained by random effects.
Now build a third model with a random intercept and slope. The random effect will look something like random = ~ year2 | country. Show the output of summary() for this model.
lme_life3 <- lme(lifeExp ~ year52 * country,
random = ~ year52 | country,
data = gap2)
summary(lme_life3)
## Warning in pt(-abs(tVal), fDF): NaNs produced
## Linear mixed-effects model fit by REML
## Data: gap2
## AIC BIC logLik
## 79.81821 87.78407 -31.9091
##
## Random effects:
## Formula: ~year52 | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.53878497 (Intr)
## year52 0.01659437 0
## Residual 0.69990221
##
## Fixed effects: lifeExp ~ year52 * country
## Value Std.Error DF t-value p-value
## (Intercept) 66.44897 0.6593452 20 100.78025 0
## year52 0.23793 0.0203076 20 11.71608 0
## countryNepal -32.01733 0.9324549 0 -34.33660 NaN
## year52:countryNepal 0.29134 0.0287193 20 10.14429 0
## Correlation:
## (Intr) year52 cntryN
## year52 -0.281
## countryNepal -0.707 0.199
## year52:countryNepal 0.199 -0.707 -0.281
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.49431132 -0.47551182 -0.07417703 0.22098325 2.46514292
##
## Number of Observations: 24
## Number of Groups: 2
The model now takes country into account. There are now four coefficients: (Intercept) = 66.45 years for the life expectancy (grand mean intercept) with respect to Finland, with an average rate of increase (grand mean slope) of 0.24 years for each year after 1952 (again, with respect to Finland). The countryNepal coefficient is the average offset for life expectancy compared to Finland’s (or Nepal’s life expectancy is, on average, 32.02 years lower than that of Finland). The year52:countryNepal is the coefficient for the average change in life expectancy compared to Finland (or Nepal’s average change in yearly life expectancy is \(0.291/0.238 = 122%\) higher than that of Finland’s).
gap2 <- gap2 %>%
mutate(fit3 = predict(lme_life3, re.form = NULL))
ggplot(gap2) +
geom_point(aes(x = year52, y = lifeExp, col = country)) +
scale_color_manual(values = c("Finland" = "blue4",
"Nepal" = "darkgreen")) +
geom_line(aes(x = year52, y = fit3, col = country)) +
labs(title = "",
x = "Year",
y = "Life Expectancy (years)",
col = "Country") +
theme_bw()
Using this method, there is now a way to model both country’s life expectancies.
Finally, use the AIC() function to get the AIC for the three models (linear, random intercept and random intercept and slope). As a reminder: the AIC is a good criterion of the quality of the model, as it penalizes when extra predictors (overfitting). The models with the lowest AIC values are best. Which of your models is best?
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
model_compare <- data.frame(linear = AIC(lm_life),
randomInt = AIC(lme_life2),
rndIntSlop = AIC(lme_life3))
colnames(model_compare) <- c("Linear", "Random Intercept", "Random Intercept & Slope")
rownames(model_compare) <- "AIC"
model_compare %>% kbl( ) %>% kable_styling()
| Linear | Random Intercept | Random Intercept & Slope | |
|---|---|---|---|
| AIC | 194.4914 | 133.4761 | 79.81821 |
The AIC started out at 194 for the simplest model. With each increase in model complexity the AIC went down. The final model with both random intercept and random slope had an AIC of 80. The AIC is an estimate of prediction error. As the increasing model complexity better explain the data, the AIC value decreases. The final model that accounts for the different countries is the best model for the data.
end