Objective

I demonstrate a variety of machine learning modeling techniques to predict life expectancy across countries.

Exploratory Analysis

We have 51 years of life expectancy data grouped by country. Explanatory variables include year, infant mortality, fertility, population, and GDP/Capita. Quartiling the data will give us an idea of the data distribution and whether we should expect outliers.

# Explore gapminder
head(gapminder)
## # A tibble: 6 x 7
##   country  year infant_mortality life_expectancy fertility population gdpPercap
##   <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>     <int>
## 1 Algeria  1960             148.            47.5      7.65   11124892      1242
## 2 Algeria  1961             148.            48.0      7.65   11404859      1047
## 3 Algeria  1962             148.            48.6      7.65   11690152       820
## 4 Algeria  1963             148.            49.1      7.65   11985130      1075
## 5 Algeria  1964             149.            49.6      7.65   12295973      1109
## 6 Algeria  1965             149.            50.1      7.66   12626953      1147
summary(gapminder)
##        country          year      infant_mortality life_expectancy
##  Algeria   :  52   Min.   :1960   Min.   :  1.80   Min.   :13.20  
##  Argentina :  52   1st Qu.:1973   1st Qu.: 17.10   1st Qu.:55.46  
##  Australia :  52   Median :1986   Median : 50.45   Median :65.95  
##  Austria   :  52   Mean   :1986   Mean   : 60.64   Mean   :64.06  
##  Bangladesh:  52   3rd Qu.:1998   3rd Qu.: 95.12   3rd Qu.:73.52  
##  Belgium   :  52   Max.   :2011   Max.   :237.20   Max.   :82.90  
##  (Other)   :3692                                                  
##    fertility       population          gdpPercap      
##  Min.   :1.200   Min.   :4.154e+04   Min.   :   54.0  
##  1st Qu.:2.380   1st Qu.:3.823e+06   1st Qu.:  494.8  
##  Median :4.085   Median :9.425e+06   Median : 1580.5  
##  Mean   :4.251   Mean   :3.574e+07   Mean   : 5754.2  
##  3rd Qu.:6.070   3rd Qu.:2.896e+07   3rd Qu.: 7214.0  
##  Max.   :8.450   Max.   :1.247e+09   Max.   :41838.0  
## 

Process


1. I’m starting with two “kitchen sink” models to demonstrate the ease of building linear regression and random forest models. Please note, later in the demonstration I will cover how these two kitchen sink models OVERFIT and their accuracy is misleading.
2. I show the power of nesting models and how I can test multiple machine learning models at once and extract their accuracy in a single table.
3. Lastly, I take a more industry standard approach to model building. I start by randomizing the order of the data, cross validate the data in order to make sure the model accuracy holds across random slices of the data, and hypertune the models to test for superior accuracy. I use nesting again to see model results across my training, validation, and test datasets all in a single table.

Linear and Random Forest “Kitchen Sink” Models

Here I show how simple it is to build a multiple linear regression model. From the summary of results I can see an adjusted R-squared of 95.84%. I can also see my p values (highlighted by multiple asterisks) to identify my most important variables.

lm_model <- lm(life_expectancy ~ ., gapminder)
summary(lm_model)
## 
## Call:
## lm(formula = life_expectancy ~ ., data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.021  -0.870   0.051   1.124   8.883 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.154e+00  1.329e+01   0.388 0.698099    
## countryArgentina                -1.697e+00  4.992e-01  -3.400 0.000681 ***
## countryAustralia                -3.170e+00  6.051e-01  -5.240 1.69e-07 ***
## countryAustria                  -4.174e+00  6.187e-01  -6.747 1.73e-11 ***
## countryBangladesh               -2.460e+00  4.658e-01  -5.281 1.35e-07 ***
## countryBelgium                  -4.120e+00  6.128e-01  -6.722 2.05e-11 ***
## countryBenin                    -5.474e+00  4.627e-01 -11.830  < 2e-16 ***
## countryBolivia                  -2.436e+00  4.492e-01  -5.423 6.23e-08 ***
## countryBotswana                 -1.149e+01  4.486e-01 -25.608  < 2e-16 ***
## countryBrazil                   -1.355e+00  4.790e-01  -2.828 0.004710 ** 
## countryBurkina Faso             -9.617e+00  4.621e-01 -20.812  < 2e-16 ***
## countryBurundi                  -1.101e+01  4.678e-01 -23.529  < 2e-16 ***
## countryCameroon                 -6.041e+00  4.521e-01 -13.363  < 2e-16 ***
## countryCanada                   -3.157e+00  6.218e-01  -5.078 3.99e-07 ***
## countryCentral African Republic -1.037e+01  4.619e-01 -22.453  < 2e-16 ***
## countryChile                    -1.399e+00  4.808e-01  -2.910 0.003637 ** 
## countryColombia                 -1.345e+00  4.603e-01  -2.922 0.003494 ** 
## countryCongo, Rep.              -1.070e+01  4.465e-01 -23.968  < 2e-16 ***
## countryCosta Rica                7.407e-01  4.754e-01   1.558 0.119285    
## countryCote d'Ivoire            -5.954e+00  4.644e-01 -12.822  < 2e-16 ***
## countryDenmark                  -5.509e+00  6.715e-01  -8.204 3.13e-16 ***
## countryDominican Republic       -3.043e-01  4.524e-01  -0.673 0.501137    
## countryEcuador                  -2.387e-01  4.502e-01  -0.530 0.596009    
## countryEgypt                     2.305e-01  4.518e-01   0.510 0.609922    
## countryEl Salvador              -2.674e+00  4.500e-01  -5.942 3.05e-09 ***
## countryFiji                     -9.430e+00  4.690e-01 -20.108  < 2e-16 ***
## countryFinland                  -5.196e+00  6.218e-01  -8.357  < 2e-16 ***
## countryFrance                   -3.370e+00  6.098e-01  -5.525 3.50e-08 ***
## countryGhana                    -6.557e+00  4.476e-01 -14.650  < 2e-16 ***
## countryGreece                   -5.580e-01  5.492e-01  -1.016 0.309700    
## countryGuatemala                -3.153e+00  4.454e-01  -7.078 1.72e-12 ***
## countryHonduras                 -2.776e+00  4.477e-01  -6.201 6.18e-10 ***
## countryIceland                  -3.573e+00  6.699e-01  -5.334 1.02e-07 ***
## countryIndia                    -4.373e+00  1.098e+00  -3.984 6.91e-05 ***
## countryIndonesia                -4.044e+00  4.855e-01  -8.329  < 2e-16 ***
## countryItaly                    -1.917e+00  5.886e-01  -3.256 0.001138 ** 
## countryJapan                    -5.044e+00  7.299e-01  -6.911 5.57e-12 ***
## countryKenya                    -7.427e+00  4.531e-01 -16.389  < 2e-16 ***
## countryLesotho                  -8.544e+00  4.507e-01 -18.956  < 2e-16 ***
## countryLiberia                  -3.159e+00  4.896e-01  -6.452 1.24e-10 ***
## countryMalawi                   -6.971e+00  4.851e-01 -14.371  < 2e-16 ***
## countryMalaysia                 -3.769e+00  4.749e-01  -7.936 2.71e-15 ***
## countryMauritania               -4.456e+00  4.492e-01  -9.920  < 2e-16 ***
## countryMexico                   -1.327e+00  4.643e-01  -2.859 0.004271 ** 
## countryMorocco                  -5.073e-01  4.464e-01  -1.136 0.255846    
## countryNepal                    -2.260e+00  4.612e-01  -4.901 9.90e-07 ***
## countryNetherlands              -3.241e+00  6.232e-01  -5.201 2.08e-07 ***
## countryNicaragua                 2.332e+00  4.461e-01   5.227 1.81e-07 ***
## countryNorway                   -5.062e+00  7.045e-01  -7.184 8.04e-13 ***
## countryOman                     -2.398e-01  4.527e-01  -0.530 0.596388    
## countryPakistan                  1.071e+00  4.697e-01   2.281 0.022577 *  
## countryPanama                    1.727e-01  4.719e-01   0.366 0.714403    
## countryPapua New Guinea         -1.161e+01  4.456e-01 -26.061  < 2e-16 ***
## countryParaguay                  1.885e+00  4.562e-01   4.132 3.67e-05 ***
## countryPeru                     -7.149e-01  4.487e-01  -1.593 0.111153    
## countryPhilippines              -3.967e+00  4.540e-01  -8.738  < 2e-16 ***
## countryPortugal                 -1.616e+00  5.263e-01  -3.072 0.002144 ** 
## countryRwanda                   -9.986e+00  4.659e-01 -21.434  < 2e-16 ***
## countrySenegal                  -1.064e+01  4.525e-01 -23.516  < 2e-16 ***
## countrySeychelles               -4.832e+00  4.828e-01 -10.008  < 2e-16 ***
## countrySierra Leone             -9.267e-01  5.005e-01  -1.852 0.064133 .  
## countrySingapore                -5.134e+00  5.832e-01  -8.803  < 2e-16 ***
## countrySouth Korea              -6.197e+00  5.106e-01 -12.138  < 2e-16 ***
## countrySpain                    -6.608e-01  5.528e-01  -1.195 0.231987    
## countrySri Lanka                -2.492e+00  4.713e-01  -5.286 1.32e-07 ***
## countrySudan                    -5.120e+00  4.487e-01 -11.411  < 2e-16 ***
## countrySweden                   -3.791e+00  6.660e-01  -5.693 1.34e-08 ***
## countryThailand                 -3.263e+00  4.666e-01  -6.992 3.16e-12 ***
## countryTogo                     -6.595e+00  4.535e-01 -14.542  < 2e-16 ***
## countryTrinidad and Tobago      -4.514e+00  4.975e-01  -9.073  < 2e-16 ***
## countryTurkey                   -2.961e-01  4.569e-01  -0.648 0.516980    
## countryUnited Kingdom           -4.390e+00  6.240e-01  -7.035 2.34e-12 ***
## countryUnited States            -6.575e+00  7.371e-01  -8.920  < 2e-16 ***
## countryUruguay                  -1.105e+00  5.032e-01  -2.196 0.028167 *  
## countryVenezuela                -2.917e+00  4.767e-01  -6.119 1.03e-09 ***
## countryZambia                   -8.939e+00  4.558e-01 -19.611  < 2e-16 ***
## countryZimbabwe                 -9.954e+00  4.484e-01 -22.197  < 2e-16 ***
## year                             3.602e-02  6.546e-03   5.502 3.98e-08 ***
## infant_mortality                -1.649e-01  2.678e-03 -61.567  < 2e-16 ***
## fertility                        1.752e-03  6.434e-02   0.027 0.978279    
## population                       1.039e-10  1.263e-09   0.082 0.934470    
## gdpPercap                        2.472e-04  1.528e-05  16.177  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.268 on 3922 degrees of freedom
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9584 
## F-statistic:  1138 on 81 and 3922 DF,  p-value: < 2.2e-16

In this example, a random forest model has superior accuracy of an R^2 of 99.2%. I also export a variable importance table that confirms the linear model variables of significant importance are:
1. Infant Mortality
2. GDP/Capita

(rf_model <- ranger(life_expectancy ~ ., gapminder, importance = "impurity"))
## Ranger result
## 
## Call:
##  ranger(life_expectancy ~ ., gapminder, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      4004 
## Number of independent variables:  6 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.9647985 
## R squared (OOB):                  0.9921866
importance(rf_model)
##          country             year infant_mortality        fertility 
##         12528.53         27805.87        220149.81        112042.84 
##       population        gdpPercap 
##         21656.76         99247.72

Nested Models

So far we have built two models that provide a line of best of fit for the ENTIRE dataset. In reality, models will have differing accuracies country to country. Instead, we can build linear regression and random forest models for EACH country to better understand model strengths and weaknesses and potentially increase our total accuracy by building more accurate, local models. For this example I will focus just just on linear models but create a single table where each row (country) contains:
1. Raw data (grouped by country)
2. The model specificly built for each country
3. Coefficients, error & p-values, for each specific model
4. The adjusted R^2 values indicating model accuracy along with key metrics for determining model validity

# Build a linear model for each country
(lm_nest <- gapminder %>% 
            group_by(country) %>% 
            nest() %>%
            mutate(model = map(data, ~lm(formula = life_expectancy ~ ., data = .x)),
                   coef = map(model, ~tidy(.x)),
                   results = map(model, ~glance(.x))) %>%
            #unnest(coef, .drop = FALSE) %>% 
            unnest(results, .drop = FALSE))
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once per session.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 77 x 15
## # Groups:   country [185]
##    country data  model coef  r.squared adj.r.squared sigma statistic  p.value
##    <fct>   <lis> <lis> <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl>
##  1 Algeria <tib~ <lm>  <tib~     0.999         0.999 0.316    9917.  1.56e-68
##  2 Argent~ <tib~ <lm>  <tib~     0.999         0.999 0.115    8881.  1.97e-67
##  3 Austra~ <tib~ <lm>  <tib~     0.995         0.994 0.300    1709.  5.16e-51
##  4 Austria <tib~ <lm>  <tib~     0.996         0.996 0.249    2317.  4.86e-54
##  5 Bangla~ <tib~ <lm>  <tib~     0.983         0.981 1.11      520.  2.88e-39
##  6 Belgium <tib~ <lm>  <tib~     0.997         0.996 0.206    2663.  1.99e-55
##  7 Benin   <tib~ <lm>  <tib~     0.996         0.996 0.465    2264.  8.21e-54
##  8 Bolivia <tib~ <lm>  <tib~     0.999         0.999 0.285   11085.  1.21e-69
##  9 Botswa~ <tib~ <lm>  <tib~     0.859         0.844 2.02       56.1 1.91e-18
## 10 Brazil  <tib~ <lm>  <tib~     0.999         0.999 0.133   17765.  2.37e-74
## # ... with 67 more rows, and 6 more variables: df <int>, logLik <dbl>,
## #   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>

To further demonstrate how easy it is to work with this data. I will quickly identify the top 5 and bottom 5 performing models by country.

# Extract the 4 best fitting models
lm_nest %>% 
  select(adj.r.squared) %>% 
  as.data.frame() %>% 
  top_n(n = 5, wt = adj.r.squared)
##   country adj.r.squared
## 1  Brazil     0.9994261
## 2  Greece     0.9994407
## 3  Mexico     0.9995427
## 4 Morocco     0.9997960
## 5    Peru     0.9995782
# Extract the 4 models with the worst fit
lm_nest %>% 
  select(adj.r.squared) %>% 
  as.data.frame() %>% 
  top_n(n = 5, wt = -adj.r.squared)
##       country adj.r.squared
## 1    Botswana     0.8438327
## 2 Congo, Rep.     0.8753946
## 3     Lesotho     0.9077870
## 4      Rwanda     0.7673115
## 5      Zambia     0.7060580

Industry Standard Approach to Model Building (with nesting)

After randomizing and splitting the dataset into training, validation, and test sets, I cross validate the data to make sure the model holds across randomized groups of data. Instead of using R^2 I’m going to use a different model accuracy measurement called MAE (Mean Absolute Error) that measures the average magnitude of absolute errors. Although hard to read, the first table contains linear regression model performance on 5 randomized folds of data. The number following the table represents the average MAE from those 5 folds which is an average error of 1.5 years of life expectancy.The following table is the same for the random forest models and still outperforms the linear regression models with an average MAE of .8 years of life expectancy.

The power of these tables are that they contain all the inputs and outputs to recreate the results of each fold and the model as a whole.

# Prepare the initial split object and randomize data order
set.seed(42)
gap_split <- initial_split(gapminder[sample(nrow(gapminder)),], prop = .75)

# Extract the training dataframe
training_data <- training(gap_split)

# Extract the testing dataframe
testing_data <- testing(gap_split)

# Prepare the dataframe containing the cross validation partitions
set.seed(42)
cv_split <- vfold_cv(training_data, v = 5)

cv_data <- cv_split %>%
  mutate(
    # Extract the train dataframe for each split
    train = map(splits, ~training(.x)),
    # Extract the validate dataframe for each split
    validate = map(splits, ~testing(.x))
  )

cv_models_lm <- cv_data %>%
  mutate(model = map(train, ~lm(formula = life_expectancy ~ ., data = .x)))

cv_prep_lm <- cv_models_lm %>%
  mutate(
    # Extract the recorded life expectancy for the records in the validate dataframes
    validate_actual = map(validate, ~.x$life_expectancy),
    # Predict life expectancy for each validate set using its corresponding model
    validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
  )

# Calculate the mean absolute error for each validate fold
(cv_eval_lm <- cv_prep_lm %>%
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y))))
## #  5-fold cross-validation 
## # A tibble: 5 x 8
##   splits id    train validate model validate_actual validate_predic~
## * <name> <chr> <nam> <named > <nam> <named list>    <named list>    
## 1 <spli~ Fold1 <tib~ <tibble~ <lm>  <dbl [601]>     <dbl [601]>     
## 2 <spli~ Fold2 <tib~ <tibble~ <lm>  <dbl [601]>     <dbl [601]>     
## 3 <spli~ Fold3 <tib~ <tibble~ <lm>  <dbl [601]>     <dbl [601]>     
## 4 <spli~ Fold4 <tib~ <tibble~ <lm>  <dbl [600]>     <dbl [600]>     
## 5 <spli~ Fold5 <tib~ <tibble~ <lm>  <dbl [600]>     <dbl [600]>     
## # ... with 1 more variable: validate_mae <dbl>
# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)
## [1] 1.501987
# Build a random forest model for each fold
cv_models_rf <- cv_data %>%
  mutate(model = map(train, ~ranger(formula = life_expectancy ~ ., data = .x,
                                    num.trees = 100, seed = 42)))

# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>%
                mutate(validate_actual = map(validate, ~.x$life_expectancy),
                validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))

# Calculate validate MAE for each fold
(cv_eval_rf <- cv_prep_rf %>%
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y))))
## #  5-fold cross-validation 
## # A tibble: 5 x 8
##   splits id    train validate model validate_actual validate_predic~
## * <name> <chr> <nam> <named > <nam> <named list>    <named list>    
## 1 <spli~ Fold1 <tib~ <tibble~ <ran~ <dbl [601]>     <dbl [601]>     
## 2 <spli~ Fold2 <tib~ <tibble~ <ran~ <dbl [601]>     <dbl [601]>     
## 3 <spli~ Fold3 <tib~ <tibble~ <ran~ <dbl [601]>     <dbl [601]>     
## 4 <spli~ Fold4 <tib~ <tibble~ <ran~ <dbl [600]>     <dbl [600]>     
## 5 <spli~ Fold5 <tib~ <tibble~ <ran~ <dbl [600]>     <dbl [600]>     
## # ... with 1 more variable: validate_mae <dbl>
# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)
## [1] 0.7956071

Hypertuning

Now that we have credible proof of the random forest’s outperformance over the linear regression model we can attempt to drive further accuracy improvements by hypertuning the model. For example, we can reduce the number of trees used and the number of predictors used at each node in the analysis. We test a few different combinations to see what performs best and retest for the MAE to confirm it does in fact provide our lowest average MAE yet of .6756 years of life expectancy.

# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>%
  crossing(mtry = 2:5)

# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>%
  mutate(model = map2(.x = train, .y = mtry, ~ranger(formula = life_expectancy~.,
                                           data = .x, mtry = .y,
                                           num.trees = 100, seed = 42)))

# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>%
  mutate(validate_actual = map(validate, ~.x$life_expectancy),
         validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))

# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>%
  mutate(validate_mae = map2_dbl(.x = validate_actual, .y = validate_predicted, ~mae(actual = .x, predicted = .y)))

# Calculate the mean validate_mae for each mtry used
cv_eval_tunerf %>%
  group_by(mtry) %>%
  summarise(mean_mae = mean(validate_mae))
## # A tibble: 4 x 2
##    mtry mean_mae
##   <int>    <dbl>
## 1     2    0.796
## 2     3    0.793
## 3     4    0.795
## 4     5    0.790
# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = life_expectancy ~ ., data = training_data,
                     mtry = 5, num.trees = 100, seed = 42)

# Prepare the test_actual vector
test_actual <- testing_data$life_expectancy

# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions

# Calculate the test MAE
(mae(test_actual, test_predicted))
## [1] 0.6756046