Summary

Every year since 2011 the United Nations Sustainable Development Solutions Network has produced The Happiness Report. This report ranks each country based on 10 variables.

The purpose of this paper is to create a machine learning algorithm that can accurately predict the happiness score based on the other parameters.

I will try several models that are strong in regression prediction and choose the best model based on Root Mean Squared Error (RMSE) and Rsquared.

The final model was able to predict results on the test set with a very low RMSE of 0.137 and Rsquared of 0.974. The Rsquared score indicates that 97.4% of the variance in the data is explained by the model.

There are 9 predictor variables in the data:
[1] “perceived_well_being” “GDP_per_capita” “social_support”
[4] “life_expectancy” “freedom_of_life_choices” “generosity”
[7] “perception_corruption” “confidence_in_national_govt” “gini_household_income”

The final model only used two of these predictors:
1. Perceived well-being
2. GDP per capita

News story about The Happiness Report

Objective:

Produce a model that minimizes RMSE and performs well against the test set.
Determine the key drivers of a high happiness score.

Approach:

OECD countries

The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.

In doing so, it collects a great deal of data from its 35 member countries and from other countries that make their information available.

The Data

The data is provided by the United Nations Sustainable Development Solutions Network, not the OECD. OECD countries span the geopolitical and socio-economic spectrum and are a subset of the full country list.

The training set will be 87 observations of non-OECD countries.
The test set will be 39 observations of OECD countries.

The outcome and all predictors are continuous, which is uncommon in my experience, so I’ll use a matrix instead of a data frame for processing.

Preparing the Data

Load libraries

library(tidyverse)
library(caret) 
library(GGally)
library(yardstick)

OECD countries

I got this list from the OECD website, but for some reason there are 41 countries in it while the website says there are 35 members (a quick count follows the list). The extras appear to be large partner economies such as Brazil, China, India, Indonesia, Russia, and South Africa, which are not OECD members, so for our purposes it doesn’t matter.

OECD_countries <- c(
  "Australia",
  "Austria",
  "Belgium",
  "Brazil",
  "Canada",
  "Chile",
  "China",
  "Czech Republic",
  "Denmark",
  "Estonia",
  "Finland",
  "France",
  "Germany",
  "Greece",
  "Hungary",
  "Iceland",
  "India",
  "Indonesia",
  "Ireland",
  "Israel",
  "Italy",
  "Japan",
  "Latvia",
  "Luxembourg",
  "Mexico",
  "Netherlands",
  "New Zealand",
  "Norway",
  "Poland",
  "Portugal",
  "Russia",
  "Slovakia",
  "Slovenia",
  "South Africa",
  "South Korea",
  "Spain",
  "Sweden",
  "Switzerland",
  "Turkey",
  "United Kingdom",
  "United States"
)
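
A quick count confirms the 41 entries:

length(OECD_countries)
## [1] 41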

Happiness scores by country (2017)

The scores and the variables that make up the scores are stored in two separate files, so I’ll import each file separately and then join them together.

happiness_scores_2017 <- read_csv("happiness_scores.csv") %>%
  select(country, happiness_score) %>%
  arrange(country)

glimpse(happiness_scores_2017)
## Observations: 157
## Variables: 2
## $ country         <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ happiness_score <dbl> 3.632, 4.586, 5.295, 3.795, 6.388, 4.321, 7.27...

Happiness factors by country (2017)

This file has records for each country for each year since 2011. We are only interested in the factors that influenced this year’s scores, so we’ll filter for 2017 only.
Foreshadowing: there were some NA’s in the 2017 data that had to be filled before moving forward. I used previous years’ observations to fill in these blanks.

happiness_factors_2017 <- read_csv("happiness_factors.csv") %>%
  select(-sd_life_ladder, -sd_by_mean_life_ladder, 
         -democratic_quality, -delivery_quality, -gini_WB, 
         -gini_2000_2015_avg) %>%
  filter(year == 2017) %>%
  rename(perceived_well_being = life_ladder) %>%
  rename(GDP_per_capita = log_GDP_per_capita) %>%
  select(-year, -positive_effect, -`Negative affect`) %>%
  arrange(country) 

glimpse(happiness_factors_2017)
## Observations: 141
## Variables: 10
## $ country                     <chr> "Afghanistan", "Albania", "Algeria...
## $ perceived_well_being        <dbl> 2.661718, 4.639548, 5.248912, 6.03...
## $ GDP_per_capita              <dbl> 7.460144, 9.373718, 9.540244, 9.84...
## $ social_support              <dbl> 0.4908801, 0.6376983, 0.8067539, 0...
## $ life_expectancy             <dbl> 52.33953, 69.05166, 65.69919, 67.5...
## $ freedom_of_life_choices     <dbl> 0.4270109, 0.7496110, 0.4366705, 0...
## $ generosity                  <dbl> -0.106340349, -0.035140377, -0.194...
## $ perception_corruption       <dbl> 0.9543926, 0.8761346, 0.6997742, 0...
## $ confidence_in_national_govt <dbl> 0.2611785, 0.4577375, NA, 0.305430...
## $ gini_household_income       <dbl> 0.2865989, 0.4104880, 0.5275558, 0...

Look at summary

There are a few NA’s in most of the variables, and they will cause the machine learning models to fail.

summary(happiness_factors_2017)
##    country          perceived_well_being GDP_per_capita   social_support  
##  Length:141         Min.   :2.662        Min.   : 6.625   Min.   :0.3196  
##  Class :character   1st Qu.:4.628        1st Qu.: 8.549   1st Qu.:0.7335  
##  Mode  :character   Median :5.579        Median : 9.544   Median :0.8286  
##                     Mean   :5.486        Mean   : 9.341   Mean   :0.8060  
##                     3rd Qu.:6.273        3rd Qu.:10.307   3rd Qu.:0.9056  
##                     Max.   :7.788        Max.   :11.465   Max.   :0.9668  
##                                          NA's   :7        NA's   :1       
##  life_expectancy freedom_of_life_choices   generosity      
##  Min.   :44.39   Min.   :0.4270          Min.   :-0.29674  
##  1st Qu.:57.98   1st Qu.:0.7121          1st Qu.:-0.13926  
##  Median :65.13   Median :0.8122          Median :-0.03514  
##  Mean   :63.40   Mean   :0.7787          Mean   :-0.01095  
##  3rd Qu.:69.05   3rd Qu.:0.8803          3rd Qu.: 0.09579  
##  Max.   :76.54   Max.   :0.9852          Max.   : 0.62871  
##                  NA's   :1               NA's   :8         
##  perception_corruption confidence_in_national_govt gini_household_income
##  Min.   :0.1618        Min.   :0.1109              Min.   :0.2456       
##  1st Qu.:0.6812        1st Qu.:0.3378              1st Qu.:0.3777       
##  Median :0.7808        Median :0.4760              Median :0.4388       
##  Mean   :0.7348        Mean   :0.4964              Mean   :0.4585       
##  3rd Qu.:0.8557        3rd Qu.:0.6210              3rd Qu.:0.5522       
##  Max.   :0.9544        Max.   :0.9647              Max.   :0.8520       
##  NA's   :12            NA's   :13
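
For a more compact view of the missingness, a per-column count can be produced directly (a quick sketch; the totals match the summary above):

colSums(is.na(happiness_factors_2017[, -1]))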

Export to csv and open in Excel

Excel is an excellent tool for looking at and modifying a small table.
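
The export itself is a one-liner; something like this (a sketch, and the file name is illustrative):

write_csv(happiness_factors_2017, "happiness_factors_2017_for_editing.csv")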

Fill in the blanks

Wherever possible I used the last recorded value for the variable in question. Several countries had no data at all for confidence_in_national_govt, so I deleted those records entirely.
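
For readers who prefer to stay in R, the same last-observation-carried-forward idea can be sketched with tidyr::fill(), working from the full multi-year file (this is an illustration of the logic, not the exact steps taken in Excel):

filled_2017 <- read_csv("happiness_factors.csv") %>%
  arrange(country, year) %>%
  group_by(country) %>%
  fill(everything(), .direction = "down") %>%   # carry each country's last recorded value forward
  ungroup() %>%
  filter(year == 2017) %>%
  drop_na(confidence_in_national_govt)          # drop countries with no data at all for this variable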

Import clean dataset

Notice that all the NA’s are gone.

happiness_factors_2017 <- read_csv("happiness_factors_2017.csv") %>%
  rename(perceived_well_being = life_ladder) %>%
  rename(GDP_per_capita = log_GDP_per_capita) %>%
  select(-year) %>%
  arrange(country) 

summary(happiness_factors_2017)
##    country          perceived_well_being GDP_per_capita   social_support  
##  Length:128         Min.   :2.662        Min.   : 6.474   Min.   :0.3196  
##  Class :character   1st Qu.:4.608        1st Qu.: 8.209   1st Qu.:0.7335  
##  Mode  :character   Median :5.600        Median : 9.452   Median :0.8295  
##                     Mean   :5.489        Mean   : 9.243   Mean   :0.8072  
##                     3rd Qu.:6.321        3rd Qu.:10.283   3rd Qu.:0.9073  
##                     Max.   :7.788        Max.   :11.465   Max.   :0.9668  
##  life_expectancy freedom_of_life_choices   generosity       
##  Min.   :44.39   Min.   :0.4270          Min.   :-0.296735  
##  1st Qu.:57.38   1st Qu.:0.7090          1st Qu.:-0.133247  
##  Median :64.73   Median :0.8122          Median :-0.030061  
##  Mean   :63.24   Mean   :0.7792          Mean   :-0.007489  
##  3rd Qu.:69.65   3rd Qu.:0.8803          3rd Qu.: 0.091140  
##  Max.   :76.54   Max.   :0.9852          Max.   : 0.628706  
##  perception_corruption confidence_in_national_govt gini_household_income
##  Min.   :0.1618        Min.   :0.1109              Min.   :0.2456       
##  1st Qu.:0.6819        1st Qu.:0.3378              1st Qu.:0.3741       
##  Median :0.7918        Median :0.4760              Median :0.4308       
##  Mean   :0.7381        Mean   :0.4964              Mean   :0.4527       
##  3rd Qu.:0.8619        3rd Qu.:0.6210              3rd Qu.:0.5496       
##  Max.   :0.9544        Max.   :0.9647              Max.   :0.7154

Put variables on the same scale

Each variable is on a different scale. For the machine learning algorithms to work well, the data should be standardized so that each variable has a mean of 0 and a standard deviation of 1. Note that the preProcess() call below excludes the country column and, as written, also leaves generosity (column 7) on its original scale.

happy_2017_pp <- preProcess(happiness_factors_2017[,-c(1, 7)],
                            method = c("center", "scale"))

happy_2017_processed <- predict(happy_2017_pp, happiness_factors_2017)

summary(happy_2017_processed)
##    country          perceived_well_being GDP_per_capita   
##  Length:128         Min.   :-2.45923     Min.   :-2.2667  
##  Class :character   1st Qu.:-0.76635     1st Qu.:-0.8461  
##  Mode  :character   Median : 0.09598     Median : 0.1711  
##                     Mean   : 0.00000     Mean   : 0.0000  
##                     3rd Qu.: 0.72359     3rd Qu.: 0.8515  
##                     Max.   : 1.99934     Max.   : 1.8188  
##  social_support    life_expectancy   freedom_of_life_choices
##  Min.   :-3.9357   Min.   :-2.3858   Min.   :-2.7772        
##  1st Qu.:-0.5944   1st Qu.:-0.7419   1st Qu.:-0.5532        
##  Median : 0.1806   Median : 0.1880   Median : 0.2606        
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000        
##  3rd Qu.: 0.8086   3rd Qu.: 0.8107   3rd Qu.: 0.7978        
##  Max.   : 1.2882   Max.   : 1.6821   Max.   : 1.6248        
##    generosity        perception_corruption confidence_in_national_govt
##  Min.   :-0.296735   Min.   :-3.1784       Min.   :-1.9292            
##  1st Qu.:-0.133247   1st Qu.:-0.3102       1st Qu.:-0.7939            
##  Median :-0.030061   Median : 0.2961       Median :-0.1022            
##  Mean   :-0.007489   Mean   : 0.0000       Mean   : 0.0000            
##  3rd Qu.: 0.091140   3rd Qu.: 0.6823       3rd Qu.: 0.6235            
##  Max.   : 0.628706   Max.   : 1.1926       Max.   : 2.3436            
##  gini_household_income
##  Min.   :-1.9882      
##  1st Qu.:-0.7548      
##  Median :-0.2103      
##  Mean   : 0.0000      
##  3rd Qu.: 0.9299      
##  Max.   : 2.5208
sapply(happy_2017_processed, sd)
##                     country        perceived_well_being 
##                          NA                   1.0000000 
##              GDP_per_capita              social_support 
##                   1.0000000                   1.0000000 
##             life_expectancy     freedom_of_life_choices 
##                   1.0000000                   1.0000000 
##                  generosity       perception_corruption 
##                   0.1583641                   1.0000000 
## confidence_in_national_govt       gini_household_income 
##                   1.0000000                   1.0000000
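
As the sd check shows, generosity kept its original scale (sd ≈ 0.16) because column 7 was excluded along with country. If generosity were meant to be standardized as well, dropping only the country column would do it; this is a sketch and is not run here, since the rest of the analysis uses the data as processed above:

happy_2017_pp_all <- preProcess(happiness_factors_2017[, -1],
                                method = c("center", "scale"))
happy_2017_processed_all <- predict(happy_2017_pp_all, happiness_factors_2017)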

Bind the result to the predictors

This will give us one big dataset with country, score, and predictors that will be split into train and test sets.

happy_2017_w_result <- happy_2017_processed %>%
  left_join(happiness_scores_2017, by = "country") %>%
  select(country, happiness_score, everything()) 

summary(happy_2017_w_result)
##    country          happiness_score perceived_well_being GDP_per_capita   
##  Length:128         Min.   :3.083   Min.   :-2.45923     Min.   :-2.2667  
##  Class :character   1st Qu.:4.449   1st Qu.:-0.76635     1st Qu.:-0.8461  
##  Mode  :character   Median :5.404   Median : 0.09598     Median : 0.1711  
##                     Mean   :5.408   Mean   : 0.00000     Mean   : 0.0000  
##                     3rd Qu.:6.238   3rd Qu.: 0.72359     3rd Qu.: 0.8515  
##                     Max.   :7.632   Max.   : 1.99934     Max.   : 1.8188  
##                     NA's   :2                                             
##  social_support    life_expectancy   freedom_of_life_choices
##  Min.   :-3.9357   Min.   :-2.3858   Min.   :-2.7772        
##  1st Qu.:-0.5944   1st Qu.:-0.7419   1st Qu.:-0.5532        
##  Median : 0.1806   Median : 0.1880   Median : 0.2606        
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000        
##  3rd Qu.: 0.8086   3rd Qu.: 0.8107   3rd Qu.: 0.7978        
##  Max.   : 1.2882   Max.   : 1.6821   Max.   : 1.6248        
##                                                             
##    generosity        perception_corruption confidence_in_national_govt
##  Min.   :-0.296735   Min.   :-3.1784       Min.   :-1.9292            
##  1st Qu.:-0.133247   1st Qu.:-0.3102       1st Qu.:-0.7939            
##  Median :-0.030061   Median : 0.2961       Median :-0.1022            
##  Mean   :-0.007489   Mean   : 0.0000       Mean   : 0.0000            
##  3rd Qu.: 0.091140   3rd Qu.: 0.6823       3rd Qu.: 0.6235            
##  Max.   : 0.628706   Max.   : 1.1926       Max.   : 2.3436            
##                                                                       
##  gini_household_income
##  Min.   :-1.9882      
##  1st Qu.:-0.7548      
##  Median :-0.2103      
##  Mean   : 0.0000      
##  3rd Qu.: 0.9299      
##  Max.   : 2.5208      
## 

Remove NA records caused by join

Two NA’s resulted from the join; delete those records

happy_NAs <- which(is.na(happy_2017_w_result$happiness_score))
happy_2017_wo_na <- happy_2017_w_result[-happy_NAs, ] 

summary(happy_2017_wo_na)
##    country          happiness_score perceived_well_being GDP_per_capita   
##  Length:126         Min.   :3.083   Min.   :-2.459230    Min.   :-2.2667  
##  Class :character   1st Qu.:4.449   1st Qu.:-0.779324    1st Qu.:-0.8498  
##  Mode  :character   Median :5.404   Median : 0.095978    Median : 0.1360  
##                     Mean   :5.408   Mean   :-0.003973    Mean   :-0.0179  
##                     3rd Qu.:6.238   3rd Qu.: 0.725759    3rd Qu.: 0.8080  
##                     Max.   :7.632   Max.   : 1.999340    Max.   : 1.8188  
##  social_support      life_expectancy    freedom_of_life_choices
##  Min.   :-3.935735   Min.   :-2.38582   Min.   :-2.77725       
##  1st Qu.:-0.594734   1st Qu.:-0.76322   1st Qu.:-0.57829       
##  Median : 0.173399   Median : 0.18803   Median : 0.25141       
##  Mean   :-0.008506   Mean   :-0.01184   Mean   :-0.00823       
##  3rd Qu.: 0.802478   3rd Qu.: 0.78936   3rd Qu.: 0.81668       
##  Max.   : 1.288206   Max.   : 1.59048   Max.   : 1.62480       
##    generosity        perception_corruption confidence_in_national_govt
##  Min.   :-0.296735   Min.   :-3.178372     Min.   :-1.92918           
##  1st Qu.:-0.133472   1st Qu.:-0.297393     1st Qu.:-0.77162           
##  Median :-0.032265   Median : 0.296125     Median :-0.10224           
##  Mean   :-0.008467   Mean   : 0.006527     Mean   : 0.00696           
##  3rd Qu.: 0.089498   3rd Qu.: 0.672713     3rd Qu.: 0.62999           
##  Max.   : 0.628706   Max.   : 1.192618     Max.   : 2.34364           
##  gini_household_income
##  Min.   :-1.988242    
##  1st Qu.:-0.762806    
##  Median :-0.210271    
##  Mean   : 0.001428    
##  3rd Qu.: 0.934656    
##  Max.   : 2.520808

Checkpoint

We’ve got a clean set of data to train and test on but there are a couple of standard tests that should be done just to ensure that all bases are covered.

Near-zero variance

This occurs when a variable contains the same value in all, or almost all, observations.
I have already looked at the data and there are no near-zero-variance variables, but this is how to test for it.

nzv <- nearZeroVar(happy_2017_wo_na[, 3:11], saveMetrics= TRUE)
sum(nzv$nzv)
## [1] 0

Highly-correlated predictors

Highly correlated predictors can cause problems during training, in which case one predictor from each highly correlated pair should be removed.

One of the best ways to visually see correlations between predictors is with the ggpairs() function in the GGally package.

You can also see the skewness of each distribution, which can help in deciding whether additional feature engineering is needed.

ggpairs(happy_2017_wo_na[, 3:11])

Note that the density plots for perceived_well_being, gini_household_income, and possibly a couple of others show two populations that might need to be separated through clustering if the models don’t perform well.

It looks like there may be some strong correlations between predictors. Let’s get the formal determination.

First loop: High-corr variable removal

# Select predictor variables
pred_vars <- happy_2017_wo_na[, 3:11]

# Create correlation matrix
corr_matrix <-  cor(pred_vars)

# Identify vars with a correlation greater than 0.80 in absolute value (so negatives count)
(high_corr_vars <- findCorrelation(corr_matrix, cutoff = .80))
## [1] 2

Interesting, this function indicates that I should drop the GDP_per_capita variable.
Let’s see what happens…

# Remove the variables
happy_wo_hc_vars <- pred_vars[,-high_corr_vars]

glimpse(happy_wo_hc_vars)
## Observations: 126
## Variables: 8
## $ perceived_well_being        <dbl> -2.45922945, -0.73910180, 0.478294...
## $ social_support              <dbl> -2.5530650, -1.3679404, 0.8034490,...
## $ life_expectancy             <dbl> -1.37952691, 0.73505142, 0.5436179...
## $ freedom_of_life_choices     <dbl> -2.77724677, -0.23302656, 0.416475...
## $ generosity                  <dbl> -0.106340349, -0.035140377, -0.186...
## $ perception_corruption       <dbl> 1.19261817, 0.76104643, 0.56757758...
## $ confidence_in_national_govt <dbl> -1.17725451, -0.19352638, -0.95578...
## $ gini_household_income       <dbl> -1.5947299, -0.4055899, -0.5621611...

First loop: Ensure that no high_corr vars still exist

corr_matrix_2 <- cor(happy_wo_hc_vars)
(high_corr_vars_2 <- findCorrelation(corr_matrix_2, cutoff = .80))
## integer(0)

OK, we’re good.

Find Linear combinations

All our variables are numeric so it’s a good idea to look for linear combinations that can also confuse the models.

var_matrix <- as.matrix(happy_wo_hc_vars)

(linear_combos <- findLinearCombos(var_matrix))
## $linearCombos
## list()
## 
## $remove
## NULL

There are none so we can continue to work with happy_wo_hc_vars

Look for outliers

If you look at the summary above, scaling gives us another advantage: the min and max of each variable now tell you how many standard deviations the most extreme observations lie from the mean.

Ignore happiness_score (and generosity), since those columns were not transformed.

social_support has an observation that is almost 4 standard deviations below the mean, and generosity, judged against its own mean and standard deviation, has one about 4 above.
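
A quick lookup identifies the countries behind those extreme values (a sketch; output omitted):

happy_2017_wo_na %>%
  filter(social_support == min(social_support) |
         generosity == max(generosity)) %>%
  select(country, social_support, generosity)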

Let’s see if they qualify as outliers…

myboxplot <- boxplot(happy_wo_hc_vars) # circles beyond the whiskers are flagged as outliers

All of the circles represent outliers. How many are there altogether?

myboxplot$out # it will print the values of the outliers
##  [1] -2.9970367 -3.9357348 -2.7772468  0.4729739  0.6287057 -1.8021372
##  [7] -3.0716237 -3.0094990 -2.2116714 -2.2497846 -2.0680186 -2.8469538
## [13] -2.6935124 -3.1783718 -2.7505593 -2.3269363
min(abs(myboxplot$out))
## [1] 0.4729739

Most of the outliers occur on variable 6, perception_corruption. These outliers may in fact be a separate population.
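
That claim is easy to check: boxplot() also returns a group component that records which column each flagged point came from (a quick sketch):

# Tally the flagged points by the variable they belong to
table(names(happy_wo_hc_vars)[myboxplot$group])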

There are 16 flagged outliers. The boxplot uses the standard 1.5 × IQR rule, and on the scaled variables the smallest flagged value sits only about 1.8 standard deviations from the mean, which seems a little low to me.

I really don’t want to speculate on why the outliers exist but since the score is based on this data I am not going to remove any of these values.

Done with pre-processing the data.

Bind Result back to remaining predictor set

result <- happy_2017_wo_na[, 1:2]

happy_data <- happy_wo_hc_vars %>%
  bind_cols(result) %>%
  select(country, happiness_score, everything())

glimpse(happy_data)
## Observations: 126
## Variables: 10
## $ country                     <chr> "Afghanistan", "Albania", "Argenti...
## $ happiness_score             <dbl> 3.632, 4.586, 6.388, 4.321, 7.272,...
## $ perceived_well_being        <dbl> -2.45922945, -0.73910180, 0.478294...
## $ social_support              <dbl> -2.5530650, -1.3679404, 0.8034490,...
## $ life_expectancy             <dbl> -1.37952691, 0.73505142, 0.5436179...
## $ freedom_of_life_choices     <dbl> -2.77724677, -0.23302656, 0.416475...
## $ generosity                  <dbl> -0.106340349, -0.035140377, -0.186...
## $ perception_corruption       <dbl> 1.19261817, 0.76104643, 0.56757758...
## $ confidence_in_national_govt <dbl> -1.17725451, -0.19352638, -0.95578...
## $ gini_household_income       <dbl> -1.5947299, -0.4055899, -0.5621611...

Modelling

Set random number seed

Modelling uses a lot of randomization. For reproducibility I’ll set the random seed so the results will always be the same.

set.seed(317)

Partition data into train & test

Caret has an excellent function called createDataPartition() that I usually use, but in this case I’ve already decided to split the data by whether or not the country is part of the OECD.
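
For reference, a random split with createDataPartition() would look something like this (a sketch of the usual approach, not what is done here):

# 70/30 random split, stratified on the outcome (illustration only)
in_train     <- createDataPartition(happy_data$happiness_score, p = 0.7, list = FALSE)[, 1]
train_random <- happy_data[in_train, ]
test_random  <- happy_data[-in_train, ]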

train <- happy_data %>%
  filter(!country %in% OECD_countries) %>%
  select(-country)

test <- happy_data %>%
  filter(country %in% OECD_countries) 

dim(train) 
## [1] 87  9
dim(test)
## [1] 39 10

Train has 87 records and test has 39.
Both contain the result variable and 8 predictors; test also keeps the country column, which is why it shows 10 columns.

Configure training controller

The controller can be used with multiple models so you can compare apples-to-apples without a lot of extra typing.

We have a small set of data so I’m not worried about running a lot of iterations.

I’m using repeated cross-validation. Each repeat will have 9 folds (about 10% of data in train).

I’ll set the number of repeats to 20 initially and raise it later if the results look unstable.

Note: Normally I set verboseIter to TRUE so I can monitor the progress. For this paper I’ll leave it off but set a timer so you can see how long each model takes to run.

# Create controller for all models
fit_controller <- trainControl(
  method = "repeatedcv",
  number = 9,
  repeats = 20, 
  verboseIter = FALSE)

Model 1: Linear Regression

For regression I always do a linear model (lm) first. This is usually the simplest model to fit but is rarely the winner. It has a better chance with this data since all predictors and the result are continuous numeric values.

modelLookup(model = 'lm')
##   model parameter     label forReg forClass probModel
## 1    lm intercept intercept   TRUE    FALSE     FALSE

The intercept is the only parameter and it will be determined by the model so no need to set any tuning parameters.

Fit the model

Since all predictors are numeric I’m going to convert the data frame to a matrix for better processing.

predictors <- as.matrix(train[, 2:9])
actuals <- as.numeric(train$happiness_score)

start_time <- Sys.time()
model_lm <- train(predictors,
                   actuals, 
                   method = 'lm', 
                  trControl = fit_controller)

end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 1.585442 secs

Evaluate results

summary(model_lm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77316 -0.13789  0.00832  0.11484  0.72798 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.37056    0.03412 157.385   <2e-16 ***
## perceived_well_being         0.83763    0.05569  15.040   <2e-16 ***
## social_support               0.10355    0.04284   2.417   0.0180 *  
## life_expectancy              0.08856    0.05279   1.678   0.0974 .  
## freedom_of_life_choices      0.04157    0.04651   0.894   0.3742    
## generosity                  -0.21613    0.20872  -1.036   0.3036    
## perception_corruption       -0.02367    0.05868  -0.403   0.6877    
## confidence_in_national_govt -0.02476    0.04728  -0.524   0.6020    
## gini_household_income       -0.04482    0.04042  -1.109   0.2708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2576 on 78 degrees of freedom
## Multiple R-squared:  0.9287, Adjusted R-squared:  0.9214 
## F-statistic:   127 on 8 and 78 DF,  p-value: < 2.2e-16

The key metric here is the adjusted R-squared. At 0.9214, it indicates that about 92% of the variability in the data is explained by the model.
NOTE: I wonder if the other 8% could be accounted for by the GDP_per_capita variable that was dropped earlier.

The stars next to each variable indicate its statistical significance in the model.

Plot variable importance

plot(varImp(model_lm))

Plot residuals

plot(resid(model_lm))

In this case a messy plot is a good thing. This is the noise the model could not explain … the other 8%. There is no discernible pattern, so I am confident that this model has done the best it can with the data it has.

Save results

After each model runs I will save the results to a dataframe for final model evaluation at the end.

this_model <- tibble(model = model_lm$method)
this_fit_run_time <- tibble(run_time = run_time)

this_controller <-
  tibble(
    control_method = model_lm$control$method,
    num_folds = model_lm$control$number,
    repeats = model_lm$control$repeats
  )

this_best_tune <- model_lm$bestTune
this_model_metrics <- model_lm$results 
  #filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)

this_model_record <- cbind(this_model,
                           this_model_metrics,
                           this_fit_run_time,
                           this_controller
                           )

# model_stats will store the results for all the models. 
model_stats <- this_model_record %>%
  mutate(nprune = NA_integer_) %>%
  mutate(degree = NA_integer_) %>%
  mutate(direction = NA_character_) %>%
  mutate(maxvar = NA_integer_) %>%
  mutate(nu = NA_integer_) %>%
  mutate(mstop = NA_integer_)
  
#(model_stats <- rbind(model_stats, this_model_record))

Model 2: Linear regression with stepwise selection; method = lmStepAIC

From the MASS package.

What tuning parameters are there?

modelLookup(model = "lmStepAIC")
##       model parameter     label forReg forClass probModel
## 1 lmStepAIC parameter parameter   TRUE    FALSE     FALSE

The only listed tuning parameter is a placeholder called parameter, but I would expect it to accept maxvar and direction, as stepLDA does.

modelLookup(model = "stepLDA")
##     model parameter              label forReg forClass probModel
## 1 stepLDA    maxvar Maximum #Variables  FALSE     TRUE      TRUE
## 2 stepLDA direction   Search Direction  FALSE     TRUE      TRUE

I’m adding maxvar and direction to see if they will be used. The direction indicates whether variable selection should be forward (adding variables), backward (removing them), or both, which tries each direction.

start_time <- Sys.time()
model_lmStepAIC <- train(
  predictors,
  actuals,
  method = 'lmStepAIC',
  trControl = fit_controller, 
  maxvar = 8, 
  direction = "both", 
  trace = 0
  )
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 7.246598 secs

Show results

model_lmStepAIC$results
##   parameter      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1      none 0.2577185 0.9323786 0.1993049 0.07313809 0.04131507 0.05217305

We see a slight improvement in both RMSE and Rsquared.

Save results

this_model <- tibble(model = model_lmStepAIC$method)
this_fit_run_time <- tibble(run_time = run_time)

this_controller <-
  tibble(
    control_method = model_lmStepAIC$control$method,
    num_folds = 0,
    repeats = model_lmStepAIC$control$repeats
  )

#this_best_tune <- model_lmStepAIC$bestTune
this_best_tune <- tibble(
  maxvar = 8, 
  direction = "both"
)

this_model_metrics <- model_lmStepAIC$results %>%
  select(-parameter)
  #filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)

this_model_record <- cbind(this_model,
                           this_model_metrics,
                           this_fit_run_time,
                           this_controller, 
                           this_best_tune) %>%
  mutate(nprune = NA_integer_) %>%
  mutate(degree = NA_integer_) %>%
  mutate(intercept = NA) %>%
  mutate(nu = NA_integer_) %>%
  mutate(mstop = NA_integer_)

# model_stats will store the results for all the models. 

(model_stats <- rbind(model_stats, this_model_record)) %>%
  select(model, RMSE, Rsquared, control_method, repeats, everything()) %>%
  arrange(RMSE)
##       model      RMSE  Rsquared control_method repeats intercept       MAE
## 1 lmStepAIC 0.2577185 0.9323786     repeatedcv      20        NA 0.1993049
## 2        lm 0.2659572 0.9318729     repeatedcv      20      TRUE 0.2081860
##       RMSESD RsquaredSD      MAESD      run_time num_folds nprune degree
## 1 0.07313809 0.04131507 0.05217305 7.246598 secs         0     NA     NA
## 2 0.07949886 0.04022326 0.05644595 1.585442 secs         9     NA     NA
##   direction maxvar nu mstop
## 1      both      8 NA    NA
## 2      <NA>     NA NA    NA

Model 3: Multivariate Adaptive Regression Splines (MARS)

From the earth package.

What tuning parameters are there?

Caret provides a function called modelLookup()

modelLookup(model = 'earth')
##   model parameter          label forReg forClass probModel
## 1 earth    nprune         #Terms   TRUE     TRUE      TRUE
## 2 earth    degree Product Degree   TRUE     TRUE      TRUE

There are 2 tuning parameters:
nprune = the maximum number of terms to retain in the pruned model.
degree = the maximum degree of interaction; it defaults to 1, which creates an additive model. Don’t change it unless you have a very good reason.

Create parameter grid

nprune sets the maximum number of terms retained in the model; I try every value from 1 to 8.

grid <- expand.grid(nprune = 1:8, 
                    degree = 1)
start_time <- Sys.time()
model_earth <- train(
  predictors,
  actuals,
  method = 'earth',
  trControl = fit_controller, 
  tuneGrid = grid
  )
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 9.476251 secs

What were the best tuning parameters?

(this_best_tune <- model_earth$bestTune)
##   nprune degree
## 5      5      1

Show final model

model_earth$finalModel$coefficients
##                                            y
## (Intercept)                        4.6758538
## h(perceived_well_being--0.553305)  0.9898294
## h(-0.553305-perceived_well_being) -0.6707775
## h(life_expectancy--1.37953)        0.1512215
## h(0.824039-social_support)        -0.1134321

Plot variable importance

plot(varImp(model_earth))

In this model only three variables are considered important.

Save results

this_model <- tibble(model = model_earth$method)

this_controller <-
  tibble(
    control_method = model_earth$control$method,
    num_folds = 0,
    repeats = model_earth$control$repeats
  )

this_fit_run_time <- tibble(run_time = run_time)

# Pull the metrics for the best tuning parameters
this_model_metrics <- model_earth$results %>%
  filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)

this_model_record <- cbind(this_model,
                           this_controller,
                           this_fit_run_time,
                           this_model_metrics) %>%
  mutate(intercept = 0.00) %>%
  mutate(direction = NA_character_) %>%
  mutate(maxvar = NA_integer_) %>%
  mutate(nu = NA_integer_) %>%
  mutate(mstop = NA_integer_)

# model_stats stores the results for all the models. 
model_stats <- rbind(model_stats, this_model_record) %>%
  select(model, RMSE, Rsquared, everything()) %>%
  arrange(RMSE)

model_stats
##       model      RMSE  Rsquared intercept       MAE     RMSESD RsquaredSD
## 1 lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809 0.04131507
## 2     earth 0.2594503 0.9301155         0 0.1977023 0.07162420 0.03620401
## 3        lm 0.2659572 0.9318729         1 0.2081860 0.07949886 0.04022326
##        MAESD      run_time control_method num_folds repeats nprune degree
## 1 0.05217305 7.246598 secs     repeatedcv         0      20     NA     NA
## 2 0.05335324 9.476251 secs     repeatedcv         0      20      5      1
## 3 0.05644595 1.585442 secs     repeatedcv         9      20     NA     NA
##   direction maxvar nu mstop
## 1      both      8 NA    NA
## 2      <NA>     NA NA    NA
## 3      <NA>     NA NA    NA

Model 4: Boosted Smoothing Spline (bstSm)

Tuning parameters: the number of boosting iterations (mstop) and the shrinkage (nu), which is the learning rate.

Generally, 50 is a good starting point for mstop. When you plot the results you will see where performance levels off, and that is where to set mstop for the final model.

nu is the learning rate and ranges from 0 to 1. The default is 0.1, but you may get better results by experimenting with different values.

mstop <- seq(10, 50, 10) 
nu <- c( .01, .05, .1, .25) 
grid <- expand.grid(mstop = mstop, nu = nu)
start_time <- Sys.time()
model_bstSm <- train(predictors,
                     actuals, 
                     method = 'bstSm', 
                     trControl = fit_controller, 
                     tuneGrid = grid)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 9.792036 mins

What are the best settings for the tuning parameters?

(this_best_tune <- model_bstSm$bestTune)
##    mstop  nu
## 14    40 0.1
this_model_metrics <- model_bstSm$results %>%
  filter(mstop == this_best_tune[1,1] & nu == this_best_tune[1,2])

this_model_metrics
##    nu mstop      RMSE  Rsquared      MAE     RMSESD RsquaredSD      MAESD
## 1 0.1    40 0.2530175 0.9345729 0.200745 0.06216054 0.03563159 0.04779683

Plot results

plot(model_bstSm)

Save results

this_model <- tibble(model = model_bstSm$method)

this_controller <-
  tibble(
    control_method = model_bstSm$control$method,
    num_folds = 0,
    repeats = model_bstSm$control$repeats
  )

this_fit_run_time <- tibble(run_time = run_time)

this_model_record <- cbind(this_model,
                           this_controller,
                           this_fit_run_time,
                           this_model_metrics) %>%
  mutate(intercept = 0.00) %>%
  mutate(direction = NA_character_) %>%
  mutate(maxvar = NA_integer_) %>%
  mutate(nprune = NA_integer_) %>%
  mutate(degree = NA_integer_)

# model_stats stores the results for all the models. 
model_stats <- rbind(model_stats, this_model_record) %>%
  select(model, RMSE, Rsquared, everything()) %>%
  arrange(RMSE)

model_stats
##       model      RMSE  Rsquared intercept       MAE     RMSESD RsquaredSD
## 1     bstSm 0.2530175 0.9345729         0 0.2007450 0.06216054 0.03563159
## 2 lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809 0.04131507
## 3     earth 0.2594503 0.9301155         0 0.1977023 0.07162420 0.03620401
## 4        lm 0.2659572 0.9318729         1 0.2081860 0.07949886 0.04022326
##        MAESD      run_time control_method num_folds repeats nprune degree
## 1 0.04779683 9.792036 secs     repeatedcv         0      20     NA     NA
## 2 0.05217305 7.246598 secs     repeatedcv         0      20     NA     NA
## 3 0.05335324 9.476251 secs     repeatedcv         0      20      5      1
## 4 0.05644595 1.585442 secs     repeatedcv         9      20     NA     NA
##   direction maxvar  nu mstop
## 1      <NA>     NA 0.1    40
## 2      both      8  NA    NA
## 3      <NA>     NA  NA    NA
## 4      <NA>     NA  NA    NA

Look at model performance

Every one of these models had basically the same accuracy, and about 7% of the variability remains unexplained. I have a feeling that the missing 7% lives in the variable we dropped earlier: GDP_per_capita.

model_stats
##       model      RMSE  Rsquared intercept       MAE     RMSESD RsquaredSD
## 1     bstSm 0.2530175 0.9345729         0 0.2007450 0.06216054 0.03563159
## 2 lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809 0.04131507
## 3     earth 0.2594503 0.9301155         0 0.1977023 0.07162420 0.03620401
## 4        lm 0.2659572 0.9318729         1 0.2081860 0.07949886 0.04022326
##        MAESD      run_time control_method num_folds repeats nprune degree
## 1 0.04779683 9.792036 secs     repeatedcv         0      20     NA     NA
## 2 0.05217305 7.246598 secs     repeatedcv         0      20     NA     NA
## 3 0.05335324 9.476251 secs     repeatedcv         0      20      5      1
## 4 0.05644595 1.585442 secs     repeatedcv         9      20     NA     NA
##   direction maxvar  nu mstop
## 1      <NA>     NA 0.1    40
## 2      both      8  NA    NA
## 3      <NA>     NA  NA    NA
## 4      <NA>     NA  NA    NA

Apply winning model to the test set

This is where we can see whether the training set was overfit. If performance on the test set is significantly worse than on the training set, then we probably have overfitting. I don’t expect that here because there is a formula behind the final score, so overfitting on this data should be unlikely.

Fit the earth model to the test set, using all remaining variables

test_predictions <- predict(model_earth, test)

test_results <- test %>%
  mutate(pred = test_predictions) %>%
  select(country, happiness_score, pred) %>%
  rename(obs = happiness_score) 

(test_rmse <- rmse(test_results, truth = obs, estimate = pred))
## [1] 0.1683864
(test_rsq <- rsq(test_results, truth = obs, estimate = pred))
## [1] 0.962229

The model actually fit the OECD countries better than the non-OECD countries.

It explains 96.2% of the variance with an RMSE of 0.168.

Backtracking

I am still suspicious about that remaining 7% from the training models. If these variables are all that go into the final score, then the models should have dialed it in to nearly 99% accuracy. I am going to add GDP_per_capita back and run lmStepAIC again to see if there is an improvement.

Select all variables, including GDP_per_capita

predictors <- happy_2017_wo_na[, 3:11]
actuals <- happy_2017_wo_na[, 2]

glimpse(predictors)
## Observations: 126
## Variables: 9
## $ perceived_well_being        <dbl> -2.45922945, -0.73910180, 0.478294...
## $ GDP_per_capita              <dbl> -1.4593121, 0.1069263, 0.4914529, ...
## $ social_support              <dbl> -2.5530650, -1.3679404, 0.8034490,...
## $ life_expectancy             <dbl> -1.37952691, 0.73505142, 0.5436179...
## $ freedom_of_life_choices     <dbl> -2.77724677, -0.23302656, 0.416475...
## $ generosity                  <dbl> -0.106340349, -0.035140377, -0.186...
## $ perception_corruption       <dbl> 1.19261817, 0.76104643, 0.56757758...
## $ confidence_in_national_govt <dbl> -1.17725451, -0.19352638, -0.95578...
## $ gini_household_income       <dbl> -1.5947299, -0.4055899, -0.5621611...

Setup new controller

I’m going to really ramp up the number of repeats to see if that helps dial in the algorithm.

fit_controller <- trainControl(
  method = "repeatedcv",
  number = 9,
  repeats = 200, 
  verboseIter = FALSE)

Fit best performing model with all variables

The lmStepAIC model was among the best performers (and much faster than bstSm), so I’ll start with that one.

start_time <- Sys.time()
model_lmStepAIC_2 <- train(
  predictors,
  actuals$happiness_score,
  method = 'lmStepAIC',
  trControl = fit_controller, 
  maxvar = 8, 
  direction = "both", 
  trace = 0
  )
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 1.311215 mins

Show results

(this_model_metrics <- model_lmStepAIC_2$results %>% select(-parameter))
##        RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1 0.2415366 0.9586275 0.1886687 0.05659547   0.020312 0.03880572

Yes, there is a marked improvement in both RMSE and Rsquared.

Show final model

model_lmStepAIC_2$finalModel$coefficients
##           (Intercept)  perceived_well_being        GDP_per_capita 
##            5.41509435            0.90955552            0.14227269 
##        social_support perception_corruption 
##            0.07856861           -0.03596966

Well look at that. Now GDP_per_capita is the 2nd most important variable.

Plot residuals

plot(resid(model_lmStepAIC_2))

Hmmm, I see an oscillating pattern that converges at ~65 and ~90. Then again, maybe I’m just seeing things.

Perform white noise test

The normwhn.test package has a whitenoise.test() function whose null hypothesis is that the data are white noise, so a p-value of less than 0.05 would indicate that there is still some ‘signal’ in the residuals.

library(normwhn.test)

whitenoise.test(resid(model_lmStepAIC_2))

## [1] "no. of observations"
## [1] 126
## [1] "T"
## [1] 63
## [1] "CVM stat MN"
## [1] 1.032482
## [1] "tMN"
## [1] 0.2578211
## [1] "test value"
## [1] 0.8974284

The test value is well above 0.05, so we cannot reject the null hypothesis: the residuals appear to be white noise.

Save results

this_model <- tibble(model = paste0(model_lmStepAIC_2$method, "_2"))
this_fit_run_time <- tibble(run_time = run_time)

this_controller <-
  tibble(
    control_method = model_lmStepAIC_2$control$method,
    num_folds = 0,
    repeats = model_lmStepAIC_2$control$repeats
  )

#this_best_tune <- model_lmStepAIC$bestTune
this_best_tune <- tibble(
  maxvar = 8, 
  direction = "both"
)

this_model_record <- cbind(this_model,
                           this_fit_run_time,
                           this_model_metrics, 
                           this_best_tune,
                           this_controller
                           ) %>%
  mutate(nprune = NA_integer_) %>%
  mutate(degree = NA_integer_) %>%
  mutate(intercept = NA) %>%
  mutate(nu = NA) %>%
  mutate(mstop = NA_integer_)

# model_stats will store the results for all the models. 

(model_stats <- rbind(model_stats, this_model_record)) %>%
  select(model, RMSE, Rsquared, everything()) %>%
  arrange(RMSE)
##         model      RMSE  Rsquared intercept       MAE     RMSESD
## 1 lmStepAIC_2 0.2415366 0.9586275        NA 0.1886687 0.05659547
## 2       bstSm 0.2530175 0.9345729         0 0.2007450 0.06216054
## 3   lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809
## 4       earth 0.2594503 0.9301155         0 0.1977023 0.07162420
## 5          lm 0.2659572 0.9318729         1 0.2081860 0.07949886
##   RsquaredSD      MAESD      run_time control_method num_folds repeats
## 1 0.02031200 0.03880572 1.311215 secs     repeatedcv         0     200
## 2 0.03563159 0.04779683 9.792036 secs     repeatedcv         0      20
## 3 0.04131507 0.05217305 7.246598 secs     repeatedcv         0      20
## 4 0.03620401 0.05335324 9.476251 secs     repeatedcv         0      20
## 5 0.04022326 0.05644595 1.585442 secs     repeatedcv         9      20
##   nprune degree direction maxvar  nu mstop
## 1     NA     NA      both      8  NA    NA
## 2     NA     NA      <NA>     NA 0.1    40
## 3     NA     NA      both      8  NA    NA
## 4      5      1      <NA>     NA  NA    NA
## 5     NA     NA      <NA>     NA  NA    NA

Fit earth model to all variables

The earth model was also near the top in the first round. Let’s give it another chance with GDP_per_capita.

Create parameter grid

grid <- expand.grid(nprune = 1:8, 
                    degree = 1)
start_time <- Sys.time()
model_earth_2 <- train(
  predictors,
  actuals$happiness_score,
  method = 'earth',
  trControl = fit_controller, 
  tuneGrid = grid
  )
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 2.354251 mins

What were the best tuning parameters?

(this_best_tune <- model_earth_2$bestTune)
##   nprune degree
## 4      4      1

Previously nprune was optimized at 5 and now it is 4.

Show final model

model_earth_2$finalModel$coefficients
##                                            y
## (Intercept)                        4.7241066
## h(perceived_well_being--0.650129)  1.1121285
## h(-0.650129-perceived_well_being) -0.6567519
## h(0.106926-GDP_per_capita)        -0.2289394

This model only uses 2 of the 9 available variables.

Plot variable importance

plot(varImp(model_earth_2))

Save results

this_model <- tibble(model = paste0(model_earth_2$method, "_2"))

this_controller <-
  tibble(
    control_method = model_earth_2$control$method,
    num_folds = 0,
    repeats = model_earth_2$control$repeats
  )

this_fit_run_time <- tibble(run_time = run_time)

# Pull the metrics for the best tuning parameters
this_model_metrics <- model_earth_2$results %>%
  filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)

this_model_record <- cbind(this_model,
                           this_fit_run_time,
                           this_controller,
                           this_model_metrics) %>%
  mutate(intercept = 0.00) %>%
  mutate(direction = NA_character_) %>%
  mutate(maxvar = NA_integer_) %>%
  mutate(nu = NA) %>%
  mutate(mstop = NA_integer_)

# model_stats stores the results for all the models. 
model_stats <- rbind(model_stats, this_model_record) %>%
  select(model, RMSE, Rsquared, everything()) %>%
  arrange(RMSE)

model_stats
##         model      RMSE  Rsquared intercept       MAE     RMSESD
## 1     earth_2 0.2350535 0.9595860         0 0.1807705 0.05439138
## 2 lmStepAIC_2 0.2415366 0.9586275        NA 0.1886687 0.05659547
## 3       bstSm 0.2530175 0.9345729         0 0.2007450 0.06216054
## 4   lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809
## 5       earth 0.2594503 0.9301155         0 0.1977023 0.07162420
## 6          lm 0.2659572 0.9318729         1 0.2081860 0.07949886
##   RsquaredSD      MAESD      run_time control_method num_folds repeats
## 1 0.02007829 0.04122297 2.354251 secs     repeatedcv         0     200
## 2 0.02031200 0.03880572 1.311215 secs     repeatedcv         0     200
## 3 0.03563159 0.04779683 9.792036 secs     repeatedcv         0      20
## 4 0.04131507 0.05217305 7.246598 secs     repeatedcv         0      20
## 5 0.03620401 0.05335324 9.476251 secs     repeatedcv         0      20
## 6 0.04022326 0.05644595 1.585442 secs     repeatedcv         9      20
##   nprune degree direction maxvar  nu mstop
## 1      4      1      <NA>     NA  NA    NA
## 2     NA     NA      both      8  NA    NA
## 3     NA     NA      <NA>     NA 0.1    40
## 4     NA     NA      both      8  NA    NA
## 5      5      1      <NA>     NA  NA    NA
## 6     NA     NA      <NA>     NA  NA    NA

Boosted Smoothing Splines

We might as well run bstSm against all predictors, since it is supposed to be an enhanced version of earth. Let’s see what we get.

Set parm grid

In the previous fitting the best tune was mstop = 40 and nu = 0.1.
I’ll test a couple other values around those anchors to see if there is any change.

mstop <- c(30, 40, 50) 
nu <- c(.05, .1, .15) 
grid <- expand.grid(mstop = mstop, nu = nu)
start_time <- Sys.time()
model_bstSm_1 <- train(predictors,
                     actuals$happiness_score, 
                     method = 'bstSm', 
                     trControl = fit_controller, 
                     tuneGrid = grid)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 24.33777 mins

What are the best settings for the tuning parameters?

(this_best_tune <- model_bstSm_1$bestTune)
##   mstop   nu
## 7    30 0.15
this_model_metrics <- model_bstSm_1$results %>%
  filter(mstop == this_best_tune[1,1] & nu == this_best_tune[1,2])

Plot results

plot(model_bstSm_1)

Save results

this_model <- tibble(model = paste0(model_bstSm_1$method, "_2"))

this_controller <-
  tibble(
    control_method = model_bstSm_1$control$method,
    num_folds = 0,
    repeats = model_bstSm_1$control$repeats
  )

this_fit_run_time <- tibble(run_time = run_time)

this_model_record <- cbind(this_model,
                           this_controller,
                           this_fit_run_time,
                           this_model_metrics) %>%
  mutate(intercept = 0.00) %>%
  mutate(direction = NA_character_) %>%
  mutate(maxvar = NA_integer_) %>%
  mutate(degree = NA_integer_) %>%
  mutate(nprune = NA_integer_)

# model_stats stores the results for all the models. 
model_stats <- rbind(model_stats, this_model_record) %>%
  select(model, RMSE, Rsquared, everything()) %>%
  arrange(RMSE)

model_stats
##         model      RMSE  Rsquared intercept       MAE     RMSESD
## 1     bstSm_2 0.2266729 0.9636103         0 0.1752025 0.04994372
## 2     earth_2 0.2350535 0.9595860         0 0.1807705 0.05439138
## 3 lmStepAIC_2 0.2415366 0.9586275        NA 0.1886687 0.05659547
## 4       bstSm 0.2530175 0.9345729         0 0.2007450 0.06216054
## 5   lmStepAIC 0.2577185 0.9323786        NA 0.1993049 0.07313809
## 6       earth 0.2594503 0.9301155         0 0.1977023 0.07162420
## 7          lm 0.2659572 0.9318729         1 0.2081860 0.07949886
##   RsquaredSD      MAESD       run_time control_method num_folds repeats
## 1 0.01725562 0.03833247 24.337775 secs     repeatedcv         0     200
## 2 0.02007829 0.04122297  2.354251 secs     repeatedcv         0     200
## 3 0.02031200 0.03880572  1.311215 secs     repeatedcv         0     200
## 4 0.03563159 0.04779683  9.792036 secs     repeatedcv         0      20
## 5 0.04131507 0.05217305  7.246598 secs     repeatedcv         0      20
## 6 0.03620401 0.05335324  9.476251 secs     repeatedcv         0      20
## 7 0.04022326 0.05644595  1.585442 secs     repeatedcv         9      20
##   nprune degree direction maxvar   nu mstop
## 1     NA     NA      <NA>     NA 0.15    30
## 2      4      1      <NA>     NA   NA    NA
## 3     NA     NA      both      8   NA    NA
## 4     NA     NA      <NA>     NA 0.10    40
## 5     NA     NA      both      8   NA    NA
## 6      5      1      <NA>     NA   NA    NA
## 7     NA     NA      <NA>     NA   NA    NA

The bstSm model with all variables performed the best on the training set. Let’s see how each model performs against the test set.

Add GDP_per_capita to test set

test <- happy_2017_wo_na %>%
  filter(country %in% OECD_countries)

glimpse(test)
## Observations: 39
## Variables: 11
## $ country                     <chr> "Australia", "Austria", "Belgium",...
## $ happiness_score             <dbl> 7.272, 7.139, 6.927, 6.419, 6.476,...
## $ perceived_well_being        <dbl> 1.5373403, 1.5692499, 1.2514771, 0...
## $ GDP_per_capita              <dbl> 1.20215279, 1.20575278, 1.15781245...
## $ social_support              <dbl> 1.15263575, 0.79956381, 0.92404516...
## $ life_expectancy             <dbl> 1.2072194, 1.1536177, 1.1261885, 0...
## $ freedom_of_life_choices     <dbl> 1.03623710, 0.87440700, 0.61234632...
## $ generosity                  <dbl> 0.30169326, 0.12499674, 0.04548147...
## $ perception_corruption       <dbl> -1.80213723, -1.21229604, -1.07584...
## $ confidence_in_national_govt <dbl> -0.2151998, -0.3027780, -0.2335925...
## $ gini_household_income       <dbl> -0.23422989, -1.24589669, -0.94419...

Predict values on test with lmStepAIC

test_prediction <- predict(model_lmStepAIC_2, test)

test_results <- test %>%
  mutate(pred = test_prediction) %>%
  select(country, happiness_score, pred) %>%
  rename(obs = happiness_score) 

this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)

final_results <- tibble(
  model = "lmStepAIC", 
  rmse = this_rmse, 
  Rsquared = this_rsquared
)

final_results
## # A tibble: 1 x 3
##   model      rmse Rsquared
##   <chr>     <dbl>    <dbl>
## 1 lmStepAIC 0.161    0.965

Predict values on test with earth

test_prediction <- predict(model_earth_2, test)

test_results <- test %>%
  mutate(pred = test_prediction) %>%
  select(country, happiness_score, pred) %>%
  rename(obs = happiness_score) 

this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)

this_final_results <- tibble(
  model = "earth", 
  rmse = this_rmse, 
  Rsquared = this_rsquared
)

final_results <- rbind(final_results, this_final_results) %>%
  arrange(rmse)

final_results
## # A tibble: 2 x 3
##   model      rmse Rsquared
##   <chr>     <dbl>    <dbl>
## 1 earth     0.137    0.974
## 2 lmStepAIC 0.161    0.965

Predict values on test with bstSm

test_prediction <- predict(model_bstSm_1, test)

test_results <- test %>%
  mutate(pred = test_prediction) %>%
  select(country, happiness_score, pred) %>%
  rename(obs = happiness_score) 

this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)

this_final_results <- tibble(
  model = "bstSm", 
  rmse = this_rmse, 
  Rsquared = this_rsquared
)

final_results <- rbind(final_results, this_final_results) %>%
  arrange(rmse)

final_results
## # A tibble: 3 x 3
##   model      rmse Rsquared
##   <chr>     <dbl>    <dbl>
## 1 earth     0.137    0.974
## 2 bstSm     0.151    0.970
## 3 lmStepAIC 0.161    0.965

In the end the earth model, which uses Multivariate Adaptive Regression Splines (MARS), wins.

Plot actuals vs predictions on test set using earth

test_prediction <- predict(model_earth_2, test)

test_results <- test %>%
  mutate(pred = test_prediction) %>%
  select(country, happiness_score, pred) %>%
  rename(obs = happiness_score) 

ggplot(test_results, aes(x = obs, y = pred)) +
  geom_point()

The points fall very nearly on a straight line from lower-left to upper-right, which is exactly what we want.
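
Adding a 45-degree reference line makes the comparison even easier to read (a small variation on the plot above):

ggplot(test_results, aes(x = obs, y = pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed happiness score", y = "Predicted happiness score")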

Plot variable importance

plot(varImp(model_earth_2))

Final model

The final model ended up using only 2 of the 9 original predictors. One of them, GDP_per_capita, is the very predictor that was removed early on because it was highly correlated with another variable.

model_earth_2$finalModel$coefficients
##                                            y
## (Intercept)                        4.7241066
## h(perceived_well_being--0.650129)  1.1121285
## h(-0.650129-perceived_well_being) -0.6567519
## h(0.106926-GDP_per_capita)        -0.2289394
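
To make the hinge notation concrete, h(x) is just max(0, x), so the fitted model can be written out as a small prediction function. This is a sketch built from the coefficients above; inputs are on the standardized scale used for training.

# h(x) = max(0, x): the MARS hinge function
h <- function(x) pmax(0, x)

predict_happiness <- function(perceived_well_being, GDP_per_capita) {
  4.7241066 +
    1.1121285 * h(perceived_well_being - (-0.650129)) -
    0.6567519 * h(-0.650129 - perceived_well_being) -
    0.2289394 * h(0.106926 - GDP_per_capita)
}

# An "average" country (both predictors at their standardized mean of 0)
predict_happiness(0, 0)

That works out to roughly 5.4, which lines up with the mean happiness score of about 5.4 in the data.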

Conclusion

The most surprising thing in this machine learning exercise was that a standard function to remove correlated predictors recommended removing the 2nd most important predictor in the final model. The modeller should not blindly accept the recommendations of the different tests.

I also made a conscious choice to keep the outliers because they are the actual values used by The Happiness Report to calculate the scores.

So did I crack the code to happiness? No, I just reverse-engineered the algorithm used to produce the score. I’m sure that if I contacted the United Nations Sustainable Development Solutions Network and showed them the model, they would say I wasn’t even close to their actual method. A model is just a simplified representation of the data, so it may not be how they calculate the score, but they could use my model and no one would know the difference.

What’s next?

In 2017 the US dropped 4 spots to #18. What drove the decreased rank, and what would it take for the US to get into the top 10?

END