tapply() function to
estimate the mean life expectancy by country
(tapply(gap$lifeExp, gap$country, mean))lifeExp vs. year
for all countrieslifeExp
vs. year for the United States. Use this to make plots for
two countries (not the United States!):gapminder <- read.csv("gapminderData5.csv")
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
mlifeExp <- tapply(gapminder$lifeExp, gapminder$country, mean)
Making a scatter plot of life Expectancy vs year
plot(gapminder$year, gapminder$lifeExp,
col = "blue",
pch = 16,
xlab = "Year",
ylab = "Life Expectancy",
main = "Life Expectancy vs Year (All countries")
## OR use ggplot
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(gapminder, aes(x = year, y = lifeExp, colour = country)) +
geom_point() +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Liife Expectancy vs year of all countries")
Line plot for two countries
head(gapminder, 10)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## 7 Afghanistan 1982 12881816 Asia 39.854 978.0114
## 8 Afghanistan 1987 13867957 Asia 40.822 852.3959
## 9 Afghanistan 1992 16317921 Asia 41.674 649.3414
## 10 Afghanistan 1997 22227415 Asia 41.763 635.3414
gapmind_sub <- subset(gapminder, country %in% c("Japan", "Canada"))
Jp <- subset(gapmind_sub, country == "Japan")
Ca <- subset(gapmind_sub, country == "Canada")
plot(Jp$year, Jp$lifeExp,
xlab = "Year", ylab = "Life Expectancy (yr)",
main = "Life expectancy between Japan and Canada",
type = 'l', col = "blue")
lines(Ca$year, Ca$lifeExp,
type = 'l', col = "red")
legend("topleft", legend = c("Japan", "Canada"),
lty = 1,
col = c('blue', 'red'))
Use these data to build a simple linear model (lm()) of lifeExp as a function of year. Before making the model, center the year variable by subtracting 1952 from all values (0 now represents the year 1952)
gapminder$year_c = gapminder$year - 1952
gap_m1 <- lm(lifeExp ~ year_c, data = gapminder)
summary(gap_m1)
##
## Call:
## lm(formula = lifeExp ~ year_c, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.949 -9.651 1.697 10.335 22.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.51208 0.53000 95.31 <2e-16 ***
## year_c 0.32590 0.01632 19.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.63 on 1702 degrees of freedom
## Multiple R-squared: 0.1898, Adjusted R-squared: 0.1893
## F-statistic: 398.6 on 1 and 1702 DF, p-value: < 2.2e-16
Report the coefficients, their significance and a brief explanation of what each one mean -The life expectancy [intercept] in year 0 [1952] was 50.5, p-value < 2.2x10-16 (p<.001). This means that life expectancy will increase every year after year 0 [1952] by 0.326, p<.001
Give the F-statistic and associated p-value
Now build a random intercept model using the lme() function, using country as the grouping variable for the random effect (the random effect will look something like random = ~ 1 | country).
gapminder$country = factor(gapminder$country)
library(nlme)
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
gap_m2 <- lme(lifeExp ~ year_c,
random = ~ 1| country,
data = gapminder)
summary(gap_m2)
## Linear mixed-effects model fit by REML
## Data: gapminder
## AIC BIC logLik
## 9874.263 9896.022 -4933.132
##
## Random effects:
## Formula: ~1 | country
## (Intercept) Residual
## StdDev: 11.09716 3.584196
##
## Fixed effects: lifeExp ~ year_c
## Value Std.Error DF t-value p-value
## (Intercept) 50.51208 0.9454671 1561 53.42553 0
## year_c 0.32590 0.0050305 1561 64.78581 0
## Correlation:
## (Intr)
## year_c -0.146
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -6.1691513 -0.4910339 0.1216455 0.6030002 2.4194109
##
## Number of Observations: 1704
## Number of Groups: 142
summary() output)
Calculate the ICC for the random intercept in this model. You will
need to use VarCorr() to find the variance of the intercept
and residual term (quick reminder, the intercept ICC is the variance of
the intercept, divided by the sum of variances).
VarCorr(gap_m2)
## country = pdLogChol(1)
## Variance StdDev
## (Intercept) 123.14688 11.097156
## Residual 12.84646 3.584196
ICC = 123.15 + 12.15 = 135.95 = 123.15/135.95
ICC (0.91), 91% of variation in the life expectancy is differences between countries. therefore, since the ICC is high or large then the variation of the coefficient between countries is important and worth retaining in the model.
Now build a third model with a random intercept and slope. The random
effect will look something like random = ~ year2 | country.
Show the output of summary() for this model.
gap_m3 <- lme(lifeExp ~ year_c,
random = ~ year_c | country,
data = gapminder)
summary(gap_m3)
## Linear mixed-effects model fit by REML
## Data: gapminder
## AIC BIC logLik
## 8747.277 8779.915 -4367.639
##
## Random effects:
## Formula: ~year_c | country
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 12.3027957 (Intr)
## year_c 0.1582867 -0.434
## Residual 2.1807917
##
## Fixed effects: lifeExp ~ year_c
## Value Std.Error DF t-value p-value
## (Intercept) 50.51208 1.0371995 1561 48.70045 0
## year_c 0.32590 0.0136312 1561 23.90865 0
## Correlation:
## (Intr)
## year_c -0.439
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -8.08192929 -0.34891405 0.02690066 0.39742349 4.67388085
##
## Number of Observations: 1704
## Number of Groups: 142
Finally, use the AIC() function to get the AIC for the three models (linear, random intercept and random intercept and slope). As a reminder: the AIC is a good criterion of the quality of the model, as it penalizes when extra predictors (overfitting). The models with the lowest AIC values are best. Which of your models is best?
AIC(gap_m1, gap_m2, gap_m3)
## Warning in AIC.default(gap_m1, gap_m2, gap_m3): models are not all fitted to
## the same number of observations
## df AIC
## gap_m1 3 13201.732
## gap_m2 4 9874.263
## gap_m3 6 8747.277