I demonstrate a variety of machine learning modeling techniques to predict life expectancy across countries.
We have 51 years of life expectancy data grouped by country. Explanatory variables include year, infant mortality, fertility, population, and GDP/Capita. Quartiling the data will give us an idea of the data distribution and whether we should expect outliers.
# Explore gapminder
head(gapminder)
## # A tibble: 6 x 7
## country year infant_mortality life_expectancy fertility population gdpPercap
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Algeria 1960 148. 47.5 7.65 11124892 1242
## 2 Algeria 1961 148. 48.0 7.65 11404859 1047
## 3 Algeria 1962 148. 48.6 7.65 11690152 820
## 4 Algeria 1963 148. 49.1 7.65 11985130 1075
## 5 Algeria 1964 149. 49.6 7.65 12295973 1109
## 6 Algeria 1965 149. 50.1 7.66 12626953 1147
summary(gapminder)
## country year infant_mortality life_expectancy
## Algeria : 52 Min. :1960 Min. : 1.80 Min. :13.20
## Argentina : 52 1st Qu.:1973 1st Qu.: 17.10 1st Qu.:55.46
## Australia : 52 Median :1986 Median : 50.45 Median :65.95
## Austria : 52 Mean :1986 Mean : 60.64 Mean :64.06
## Bangladesh: 52 3rd Qu.:1998 3rd Qu.: 95.12 3rd Qu.:73.52
## Belgium : 52 Max. :2011 Max. :237.20 Max. :82.90
## (Other) :3692
## fertility population gdpPercap
## Min. :1.200 Min. :4.154e+04 Min. : 54.0
## 1st Qu.:2.380 1st Qu.:3.823e+06 1st Qu.: 494.8
## Median :4.085 Median :9.425e+06 Median : 1580.5
## Mean :4.251 Mean :3.574e+07 Mean : 5754.2
## 3rd Qu.:6.070 3rd Qu.:2.896e+07 3rd Qu.: 7214.0
## Max. :8.450 Max. :1.247e+09 Max. :41838.0
##
1. I’m starting with two “kitchen sink” models to demonstrate the ease of building linear regression and random forest models. Please note, later in the demonstration I will cover how these two kitchen sink models OVERFIT and their accuracy is misleading.
2. I show the power of nesting models and how I can test multiple machine learning models at once and extract their accuracy in a single table.
3. Lastly, I take a more industry standard approach to model building. I start by randomizing the order of the data, cross validate the data in order to make sure the model accuracy holds across random slices of the data, and hypertune the models to test for superior accuracy. I use nesting again to see model results across my training, validation, and test datasets all in a single table.
Here I show how simple it is to build a multiple linear regression model. From the summary of results I can see an adjusted R-squared of 95.84%. I can also see my p values (highlighted by multiple asterisks) to identify my most important variables.
lm_model <- lm(life_expectancy ~ ., gapminder)
summary(lm_model)
##
## Call:
## lm(formula = life_expectancy ~ ., data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.021 -0.870 0.051 1.124 8.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.154e+00 1.329e+01 0.388 0.698099
## countryArgentina -1.697e+00 4.992e-01 -3.400 0.000681 ***
## countryAustralia -3.170e+00 6.051e-01 -5.240 1.69e-07 ***
## countryAustria -4.174e+00 6.187e-01 -6.747 1.73e-11 ***
## countryBangladesh -2.460e+00 4.658e-01 -5.281 1.35e-07 ***
## countryBelgium -4.120e+00 6.128e-01 -6.722 2.05e-11 ***
## countryBenin -5.474e+00 4.627e-01 -11.830 < 2e-16 ***
## countryBolivia -2.436e+00 4.492e-01 -5.423 6.23e-08 ***
## countryBotswana -1.149e+01 4.486e-01 -25.608 < 2e-16 ***
## countryBrazil -1.355e+00 4.790e-01 -2.828 0.004710 **
## countryBurkina Faso -9.617e+00 4.621e-01 -20.812 < 2e-16 ***
## countryBurundi -1.101e+01 4.678e-01 -23.529 < 2e-16 ***
## countryCameroon -6.041e+00 4.521e-01 -13.363 < 2e-16 ***
## countryCanada -3.157e+00 6.218e-01 -5.078 3.99e-07 ***
## countryCentral African Republic -1.037e+01 4.619e-01 -22.453 < 2e-16 ***
## countryChile -1.399e+00 4.808e-01 -2.910 0.003637 **
## countryColombia -1.345e+00 4.603e-01 -2.922 0.003494 **
## countryCongo, Rep. -1.070e+01 4.465e-01 -23.968 < 2e-16 ***
## countryCosta Rica 7.407e-01 4.754e-01 1.558 0.119285
## countryCote d'Ivoire -5.954e+00 4.644e-01 -12.822 < 2e-16 ***
## countryDenmark -5.509e+00 6.715e-01 -8.204 3.13e-16 ***
## countryDominican Republic -3.043e-01 4.524e-01 -0.673 0.501137
## countryEcuador -2.387e-01 4.502e-01 -0.530 0.596009
## countryEgypt 2.305e-01 4.518e-01 0.510 0.609922
## countryEl Salvador -2.674e+00 4.500e-01 -5.942 3.05e-09 ***
## countryFiji -9.430e+00 4.690e-01 -20.108 < 2e-16 ***
## countryFinland -5.196e+00 6.218e-01 -8.357 < 2e-16 ***
## countryFrance -3.370e+00 6.098e-01 -5.525 3.50e-08 ***
## countryGhana -6.557e+00 4.476e-01 -14.650 < 2e-16 ***
## countryGreece -5.580e-01 5.492e-01 -1.016 0.309700
## countryGuatemala -3.153e+00 4.454e-01 -7.078 1.72e-12 ***
## countryHonduras -2.776e+00 4.477e-01 -6.201 6.18e-10 ***
## countryIceland -3.573e+00 6.699e-01 -5.334 1.02e-07 ***
## countryIndia -4.373e+00 1.098e+00 -3.984 6.91e-05 ***
## countryIndonesia -4.044e+00 4.855e-01 -8.329 < 2e-16 ***
## countryItaly -1.917e+00 5.886e-01 -3.256 0.001138 **
## countryJapan -5.044e+00 7.299e-01 -6.911 5.57e-12 ***
## countryKenya -7.427e+00 4.531e-01 -16.389 < 2e-16 ***
## countryLesotho -8.544e+00 4.507e-01 -18.956 < 2e-16 ***
## countryLiberia -3.159e+00 4.896e-01 -6.452 1.24e-10 ***
## countryMalawi -6.971e+00 4.851e-01 -14.371 < 2e-16 ***
## countryMalaysia -3.769e+00 4.749e-01 -7.936 2.71e-15 ***
## countryMauritania -4.456e+00 4.492e-01 -9.920 < 2e-16 ***
## countryMexico -1.327e+00 4.643e-01 -2.859 0.004271 **
## countryMorocco -5.073e-01 4.464e-01 -1.136 0.255846
## countryNepal -2.260e+00 4.612e-01 -4.901 9.90e-07 ***
## countryNetherlands -3.241e+00 6.232e-01 -5.201 2.08e-07 ***
## countryNicaragua 2.332e+00 4.461e-01 5.227 1.81e-07 ***
## countryNorway -5.062e+00 7.045e-01 -7.184 8.04e-13 ***
## countryOman -2.398e-01 4.527e-01 -0.530 0.596388
## countryPakistan 1.071e+00 4.697e-01 2.281 0.022577 *
## countryPanama 1.727e-01 4.719e-01 0.366 0.714403
## countryPapua New Guinea -1.161e+01 4.456e-01 -26.061 < 2e-16 ***
## countryParaguay 1.885e+00 4.562e-01 4.132 3.67e-05 ***
## countryPeru -7.149e-01 4.487e-01 -1.593 0.111153
## countryPhilippines -3.967e+00 4.540e-01 -8.738 < 2e-16 ***
## countryPortugal -1.616e+00 5.263e-01 -3.072 0.002144 **
## countryRwanda -9.986e+00 4.659e-01 -21.434 < 2e-16 ***
## countrySenegal -1.064e+01 4.525e-01 -23.516 < 2e-16 ***
## countrySeychelles -4.832e+00 4.828e-01 -10.008 < 2e-16 ***
## countrySierra Leone -9.267e-01 5.005e-01 -1.852 0.064133 .
## countrySingapore -5.134e+00 5.832e-01 -8.803 < 2e-16 ***
## countrySouth Korea -6.197e+00 5.106e-01 -12.138 < 2e-16 ***
## countrySpain -6.608e-01 5.528e-01 -1.195 0.231987
## countrySri Lanka -2.492e+00 4.713e-01 -5.286 1.32e-07 ***
## countrySudan -5.120e+00 4.487e-01 -11.411 < 2e-16 ***
## countrySweden -3.791e+00 6.660e-01 -5.693 1.34e-08 ***
## countryThailand -3.263e+00 4.666e-01 -6.992 3.16e-12 ***
## countryTogo -6.595e+00 4.535e-01 -14.542 < 2e-16 ***
## countryTrinidad and Tobago -4.514e+00 4.975e-01 -9.073 < 2e-16 ***
## countryTurkey -2.961e-01 4.569e-01 -0.648 0.516980
## countryUnited Kingdom -4.390e+00 6.240e-01 -7.035 2.34e-12 ***
## countryUnited States -6.575e+00 7.371e-01 -8.920 < 2e-16 ***
## countryUruguay -1.105e+00 5.032e-01 -2.196 0.028167 *
## countryVenezuela -2.917e+00 4.767e-01 -6.119 1.03e-09 ***
## countryZambia -8.939e+00 4.558e-01 -19.611 < 2e-16 ***
## countryZimbabwe -9.954e+00 4.484e-01 -22.197 < 2e-16 ***
## year 3.602e-02 6.546e-03 5.502 3.98e-08 ***
## infant_mortality -1.649e-01 2.678e-03 -61.567 < 2e-16 ***
## fertility 1.752e-03 6.434e-02 0.027 0.978279
## population 1.039e-10 1.263e-09 0.082 0.934470
## gdpPercap 2.472e-04 1.528e-05 16.177 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.268 on 3922 degrees of freedom
## Multiple R-squared: 0.9592, Adjusted R-squared: 0.9584
## F-statistic: 1138 on 81 and 3922 DF, p-value: < 2.2e-16
In this example, a random forest model has superior accuracy of an R^2 of 99.2%. I also export a variable importance table that confirms the linear model variables of significant importance are:
1. Infant Mortality
2. GDP/Capita
(rf_model <- ranger(life_expectancy ~ ., gapminder, importance = "impurity"))
## Ranger result
##
## Call:
## ranger(life_expectancy ~ ., gapminder, importance = "impurity")
##
## Type: Regression
## Number of trees: 500
## Sample size: 4004
## Number of independent variables: 6
## Mtry: 2
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.9647985
## R squared (OOB): 0.9921866
importance(rf_model)
## country year infant_mortality fertility
## 12528.53 27805.87 220149.81 112042.84
## population gdpPercap
## 21656.76 99247.72
So far we have built two models that provide a line of best of fit for the ENTIRE dataset. In reality, models will have differing accuracies country to country. Instead, we can build linear regression and random forest models for EACH country to better understand model strengths and weaknesses and potentially increase our total accuracy by building more accurate, local models. For this example I will focus just just on linear models but create a single table where each row (country) contains:
1. Raw data (grouped by country)
2. The model specificly built for each country
3. Coefficients, error & p-values, for each specific model
4. The adjusted R^2 values indicating model accuracy along with key metrics for determining model validity
# Build a linear model for each country
(lm_nest <- gapminder %>%
group_by(country) %>%
nest() %>%
mutate(model = map(data, ~lm(formula = life_expectancy ~ ., data = .x)),
coef = map(model, ~tidy(.x)),
results = map(model, ~glance(.x))) %>%
#unnest(coef, .drop = FALSE) %>%
unnest(results, .drop = FALSE))
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once per session.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 77 x 15
## # Groups: country [185]
## country data model coef r.squared adj.r.squared sigma statistic p.value
## <fct> <lis> <lis> <lis> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria <tib~ <lm> <tib~ 0.999 0.999 0.316 9917. 1.56e-68
## 2 Argent~ <tib~ <lm> <tib~ 0.999 0.999 0.115 8881. 1.97e-67
## 3 Austra~ <tib~ <lm> <tib~ 0.995 0.994 0.300 1709. 5.16e-51
## 4 Austria <tib~ <lm> <tib~ 0.996 0.996 0.249 2317. 4.86e-54
## 5 Bangla~ <tib~ <lm> <tib~ 0.983 0.981 1.11 520. 2.88e-39
## 6 Belgium <tib~ <lm> <tib~ 0.997 0.996 0.206 2663. 1.99e-55
## 7 Benin <tib~ <lm> <tib~ 0.996 0.996 0.465 2264. 8.21e-54
## 8 Bolivia <tib~ <lm> <tib~ 0.999 0.999 0.285 11085. 1.21e-69
## 9 Botswa~ <tib~ <lm> <tib~ 0.859 0.844 2.02 56.1 1.91e-18
## 10 Brazil <tib~ <lm> <tib~ 0.999 0.999 0.133 17765. 2.37e-74
## # ... with 67 more rows, and 6 more variables: df <int>, logLik <dbl>,
## # AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
To further demonstrate how easy it is to work with this data. I will quickly identify the top 5 and bottom 5 performing models by country.
# Extract the 4 best fitting models
lm_nest %>%
select(adj.r.squared) %>%
as.data.frame() %>%
top_n(n = 5, wt = adj.r.squared)
## country adj.r.squared
## 1 Brazil 0.9994261
## 2 Greece 0.9994407
## 3 Mexico 0.9995427
## 4 Morocco 0.9997960
## 5 Peru 0.9995782
# Extract the 4 models with the worst fit
lm_nest %>%
select(adj.r.squared) %>%
as.data.frame() %>%
top_n(n = 5, wt = -adj.r.squared)
## country adj.r.squared
## 1 Botswana 0.8438327
## 2 Congo, Rep. 0.8753946
## 3 Lesotho 0.9077870
## 4 Rwanda 0.7673115
## 5 Zambia 0.7060580
After randomizing and splitting the dataset into training, validation, and test sets, I cross validate the data to make sure the model holds across randomized groups of data. Instead of using R^2 I’m going to use a different model accuracy measurement called MAE (Mean Absolute Error) that measures the average magnitude of absolute errors. Although hard to read, the first table contains linear regression model performance on 5 randomized folds of data. The number following the table represents the average MAE from those 5 folds which is an average error of 1.5 years of life expectancy.The following table is the same for the random forest models and still outperforms the linear regression models with an average MAE of .8 years of life expectancy.
The power of these tables are that they contain all the inputs and outputs to recreate the results of each fold and the model as a whole.
# Prepare the initial split object and randomize data order
set.seed(42)
gap_split <- initial_split(gapminder[sample(nrow(gapminder)),], prop = .75)
# Extract the training dataframe
training_data <- training(gap_split)
# Extract the testing dataframe
testing_data <- testing(gap_split)
# Prepare the dataframe containing the cross validation partitions
set.seed(42)
cv_split <- vfold_cv(training_data, v = 5)
cv_data <- cv_split %>%
mutate(
# Extract the train dataframe for each split
train = map(splits, ~training(.x)),
# Extract the validate dataframe for each split
validate = map(splits, ~testing(.x))
)
cv_models_lm <- cv_data %>%
mutate(model = map(train, ~lm(formula = life_expectancy ~ ., data = .x)))
cv_prep_lm <- cv_models_lm %>%
mutate(
# Extract the recorded life expectancy for the records in the validate dataframes
validate_actual = map(validate, ~.x$life_expectancy),
# Predict life expectancy for each validate set using its corresponding model
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
)
# Calculate the mean absolute error for each validate fold
(cv_eval_lm <- cv_prep_lm %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y))))
## # 5-fold cross-validation
## # A tibble: 5 x 8
## splits id train validate model validate_actual validate_predic~
## * <name> <chr> <nam> <named > <nam> <named list> <named list>
## 1 <spli~ Fold1 <tib~ <tibble~ <lm> <dbl [601]> <dbl [601]>
## 2 <spli~ Fold2 <tib~ <tibble~ <lm> <dbl [601]> <dbl [601]>
## 3 <spli~ Fold3 <tib~ <tibble~ <lm> <dbl [601]> <dbl [601]>
## 4 <spli~ Fold4 <tib~ <tibble~ <lm> <dbl [600]> <dbl [600]>
## 5 <spli~ Fold5 <tib~ <tibble~ <lm> <dbl [600]> <dbl [600]>
## # ... with 1 more variable: validate_mae <dbl>
# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)
## [1] 1.501987
# Build a random forest model for each fold
cv_models_rf <- cv_data %>%
mutate(model = map(train, ~ranger(formula = life_expectancy ~ ., data = .x,
num.trees = 100, seed = 42)))
# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>%
mutate(validate_actual = map(validate, ~.x$life_expectancy),
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
# Calculate validate MAE for each fold
(cv_eval_rf <- cv_prep_rf %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y))))
## # 5-fold cross-validation
## # A tibble: 5 x 8
## splits id train validate model validate_actual validate_predic~
## * <name> <chr> <nam> <named > <nam> <named list> <named list>
## 1 <spli~ Fold1 <tib~ <tibble~ <ran~ <dbl [601]> <dbl [601]>
## 2 <spli~ Fold2 <tib~ <tibble~ <ran~ <dbl [601]> <dbl [601]>
## 3 <spli~ Fold3 <tib~ <tibble~ <ran~ <dbl [601]> <dbl [601]>
## 4 <spli~ Fold4 <tib~ <tibble~ <ran~ <dbl [600]> <dbl [600]>
## 5 <spli~ Fold5 <tib~ <tibble~ <ran~ <dbl [600]> <dbl [600]>
## # ... with 1 more variable: validate_mae <dbl>
# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)
## [1] 0.7956071
Now that we have credible proof of the random forest’s outperformance over the linear regression model we can attempt to drive further accuracy improvements by hypertuning the model. For example, we can reduce the number of trees used and the number of predictors used at each node in the analysis. We test a few different combinations to see what performs best and retest for the MAE to confirm it does in fact provide our lowest average MAE yet of .6756 years of life expectancy.
# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>%
crossing(mtry = 2:5)
# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>%
mutate(model = map2(.x = train, .y = mtry, ~ranger(formula = life_expectancy~.,
data = .x, mtry = .y,
num.trees = 100, seed = 42)))
# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>%
mutate(validate_actual = map(validate, ~.x$life_expectancy),
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>%
mutate(validate_mae = map2_dbl(.x = validate_actual, .y = validate_predicted, ~mae(actual = .x, predicted = .y)))
# Calculate the mean validate_mae for each mtry used
cv_eval_tunerf %>%
group_by(mtry) %>%
summarise(mean_mae = mean(validate_mae))
## # A tibble: 4 x 2
## mtry mean_mae
## <int> <dbl>
## 1 2 0.796
## 2 3 0.793
## 3 4 0.795
## 4 5 0.790
# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = life_expectancy ~ ., data = training_data,
mtry = 5, num.trees = 100, seed = 42)
# Prepare the test_actual vector
test_actual <- testing_data$life_expectancy
# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions
# Calculate the test MAE
(mae(test_actual, test_predicted))
## [1] 0.6756046