Every year since 2011 the United Nations Sustainable Development Solutions Network has produced the World Happiness Report. This report ranks each country based on 10 variables.
The purpose of this paper is to create a machine learning algorithm that can accurately predict the happiness score based on the other parameters.
I will try several models that are strong at regression prediction and choose the best model based on Root Mean Squared Error (RMSE) and R-squared.
The final model predicted the test set with a very low RMSE of 0.137 and an R-squared of 0.974, meaning that 97.4% of the variance in the test-set scores is explained by the model.
There are 9 predictor variables in the data:
[1] “perceived_well_being” “GDP_per_capita” “social_support”
[4] “life_expectancy” “freedom_of_life_choices” “generosity”
[7] “perception_corruption” “confidence_in_national_govt” “gini_household_income”
The final model only used two of these predictors:
1. Perceived well-being
2. GDP per capita
News story about the World Happiness Report
1. Produce a model that minimizes RMSE and performs well against the test set.
2. Determine the key drivers of a high happiness score.
The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.
In doing this, the OECD collects a great deal of data from its 35 member countries and from other countries that make their information available.
The happiness data itself is provided by the United Nations Sustainable Development Solutions Network, not the OECD. OECD countries span the geo-political and socio-economic spectrum and are a subset of the full country list.
The training set will be 87 observations of non-OECD countries.
The test set will be 39 observations of OECD countries.
The outcome and all predictors are continuous, which is uncommon in my experience, so I’ll use a matrix instead of a dataframe for processing.
library(tidyverse)
library(caret)
library(GGally)
library(yardstick)
I got this list from the OECD website, but there are 41 countries in it while the OECD says it has 35 members. The six extra entries (Brazil, China, India, Indonesia, Russia, and South Africa) appear to be OECD key partners rather than members, but for our purposes the distinction doesn’t matter.
OECD_countries <- c(
"Australia",
"Austria",
"Belgium",
"Brazil",
"Canada",
"Chile",
"China",
"Czech Republic",
"Denmark",
"Estonia",
"Finland",
"France",
"Germany",
"Greece",
"Hungary",
"Iceland",
"India",
"Indonesia",
"Ireland",
"Israel",
"Italy",
"Japan",
"Latvia",
"Luxembourg",
"Mexico",
"Netherlands",
"New Zealand",
"Norway",
"Poland",
"Portugal",
"Russia",
"Slovakia",
"Slovenia",
"South Africa",
"South Korea",
"Spain",
"Sweden",
"Switzerland",
"Turkey",
"United Kingdom",
"United States"
)
The scores and the variables that make up the scores are stored in two separate files, so I’ll import each file separately and then join them together.
happiness_scores_2017 <- read_csv("happiness_scores.csv") %>%
select(country, happiness_score) %>%
arrange(country)
glimpse(happiness_scores_2017)
## Observations: 157
## Variables: 2
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ happiness_score <dbl> 3.632, 4.586, 5.295, 3.795, 6.388, 4.321, 7.27...
This file has records for each country for each year since 2011. We are only interested in the factors that influenced this year’s scores, so we’ll filter for 2017 only.
Foreshadowing: there were some NA’s in the 2017 data that had to be filled in order to move forward. I used each country’s previous observations to fill in these blanks.
happiness_factors_2017 <- read_csv("happiness_factors.csv") %>%
select(-sd_life_ladder, -sd_by_mean_life_ladder,
-democratic_quality, -delivery_quality, -gini_WB,
-gini_2000_2015_avg) %>%
filter(year == 2017) %>%
rename(perceived_well_being = life_ladder) %>%
rename(GDP_per_capita = log_GDP_per_capita) %>%
select(-year, -positive_effect, -`Negative affect`) %>%
arrange(country)
glimpse(happiness_factors_2017)
## Observations: 141
## Variables: 10
## $ country <chr> "Afghanistan", "Albania", "Algeria...
## $ perceived_well_being <dbl> 2.661718, 4.639548, 5.248912, 6.03...
## $ GDP_per_capita <dbl> 7.460144, 9.373718, 9.540244, 9.84...
## $ social_support <dbl> 0.4908801, 0.6376983, 0.8067539, 0...
## $ life_expectancy <dbl> 52.33953, 69.05166, 65.69919, 67.5...
## $ freedom_of_life_choices <dbl> 0.4270109, 0.7496110, 0.4366705, 0...
## $ generosity <dbl> -0.106340349, -0.035140377, -0.194...
## $ perception_corruption <dbl> 0.9543926, 0.8761346, 0.6997742, 0...
## $ confidence_in_national_govt <dbl> 0.2611785, 0.4577375, NA, 0.305430...
## $ gini_household_income <dbl> 0.2865989, 0.4104880, 0.5275558, 0...
There are a few NA’s in most categories that will cause the machine learning models to fail.
summary(happiness_factors_2017)
## country perceived_well_being GDP_per_capita social_support
## Length:141 Min. :2.662 Min. : 6.625 Min. :0.3196
## Class :character 1st Qu.:4.628 1st Qu.: 8.549 1st Qu.:0.7335
## Mode :character Median :5.579 Median : 9.544 Median :0.8286
## Mean :5.486 Mean : 9.341 Mean :0.8060
## 3rd Qu.:6.273 3rd Qu.:10.307 3rd Qu.:0.9056
## Max. :7.788 Max. :11.465 Max. :0.9668
## NA's :7 NA's :1
## life_expectancy freedom_of_life_choices generosity
## Min. :44.39 Min. :0.4270 Min. :-0.29674
## 1st Qu.:57.98 1st Qu.:0.7121 1st Qu.:-0.13926
## Median :65.13 Median :0.8122 Median :-0.03514
## Mean :63.40 Mean :0.7787 Mean :-0.01095
## 3rd Qu.:69.05 3rd Qu.:0.8803 3rd Qu.: 0.09579
## Max. :76.54 Max. :0.9852 Max. : 0.62871
## NA's :1 NA's :8
## perception_corruption confidence_in_national_govt gini_household_income
## Min. :0.1618 Min. :0.1109 Min. :0.2456
## 1st Qu.:0.6812 1st Qu.:0.3378 1st Qu.:0.3777
## Median :0.7808 Median :0.4760 Median :0.4388
## Mean :0.7348 Mean :0.4964 Mean :0.4585
## 3rd Qu.:0.8557 3rd Qu.:0.6210 3rd Qu.:0.5522
## Max. :0.9544 Max. :0.9647 Max. :0.8520
## NA's :12 NA's :13
Excel is an excellent tool for looking at and modifying a small table.
Wherever possible I used the last recorded value for the particular variable. There were several countries with no data at all for confidence_in_national_govt, so I deleted those records entirely.
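The cleanup itself was done by hand in Excel, but a rough R equivalent using tidyr::fill() would look something like the sketch below (an illustration only; the column names and the final filter are my assumptions based on the processing shown above).
# Assumed sketch, not the original code: carry each country's most recent
# non-missing value forward, then drop countries still missing confidence data.
happiness_factors_filled <- read_csv("happiness_factors.csv") %>%
  arrange(country, year) %>%
  group_by(country) %>%
  fill(everything(), .direction = "down") %>%
  ungroup() %>%
  filter(year == 2017, !is.na(confidence_in_national_govt))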
Notice that all the NA’s are gone.
happiness_factors_2017 <- read_csv("happiness_factors_2017.csv") %>%
rename(perceived_well_being = life_ladder) %>%
rename(GDP_per_capita = log_GDP_per_capita) %>%
select(-year) %>%
arrange(country)
summary(happiness_factors_2017)
## country perceived_well_being GDP_per_capita social_support
## Length:128 Min. :2.662 Min. : 6.474 Min. :0.3196
## Class :character 1st Qu.:4.608 1st Qu.: 8.209 1st Qu.:0.7335
## Mode :character Median :5.600 Median : 9.452 Median :0.8295
## Mean :5.489 Mean : 9.243 Mean :0.8072
## 3rd Qu.:6.321 3rd Qu.:10.283 3rd Qu.:0.9073
## Max. :7.788 Max. :11.465 Max. :0.9668
## life_expectancy freedom_of_life_choices generosity
## Min. :44.39 Min. :0.4270 Min. :-0.296735
## 1st Qu.:57.38 1st Qu.:0.7090 1st Qu.:-0.133247
## Median :64.73 Median :0.8122 Median :-0.030061
## Mean :63.24 Mean :0.7792 Mean :-0.007489
## 3rd Qu.:69.65 3rd Qu.:0.8803 3rd Qu.: 0.091140
## Max. :76.54 Max. :0.9852 Max. : 0.628706
## perception_corruption confidence_in_national_govt gini_household_income
## Min. :0.1618 Min. :0.1109 Min. :0.2456
## 1st Qu.:0.6819 1st Qu.:0.3378 1st Qu.:0.3741
## Median :0.7918 Median :0.4760 Median :0.4308
## Mean :0.7381 Mean :0.4964 Mean :0.4527
## 3rd Qu.:0.8619 3rd Qu.:0.6210 3rd Qu.:0.5496
## Max. :0.9544 Max. :0.9647 Max. :0.7154
It looks like each variable has a different scale. For the machine learning algorithms to work well, the data should be centered and scaled so that each variable has a mean of 0 and a standard deviation of 1. (Note that the preProcess call below skips columns 1 and 7, the country column and generosity, so generosity keeps its original scale in the summaries that follow.)
happy_2017_pp <- preProcess(happiness_factors_2017[,-c(1, 7)],
method = c("center", "scale"))
happy_2017_processed <- predict(happy_2017_pp, happiness_factors_2017)
summary(happy_2017_processed)
## country perceived_well_being GDP_per_capita
## Length:128 Min. :-2.45923 Min. :-2.2667
## Class :character 1st Qu.:-0.76635 1st Qu.:-0.8461
## Mode :character Median : 0.09598 Median : 0.1711
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.72359 3rd Qu.: 0.8515
## Max. : 1.99934 Max. : 1.8188
## social_support life_expectancy freedom_of_life_choices
## Min. :-3.9357 Min. :-2.3858 Min. :-2.7772
## 1st Qu.:-0.5944 1st Qu.:-0.7419 1st Qu.:-0.5532
## Median : 0.1806 Median : 0.1880 Median : 0.2606
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8086 3rd Qu.: 0.8107 3rd Qu.: 0.7978
## Max. : 1.2882 Max. : 1.6821 Max. : 1.6248
## generosity perception_corruption confidence_in_national_govt
## Min. :-0.296735 Min. :-3.1784 Min. :-1.9292
## 1st Qu.:-0.133247 1st Qu.:-0.3102 1st Qu.:-0.7939
## Median :-0.030061 Median : 0.2961 Median :-0.1022
## Mean :-0.007489 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.091140 3rd Qu.: 0.6823 3rd Qu.: 0.6235
## Max. : 0.628706 Max. : 1.1926 Max. : 2.3436
## gini_household_income
## Min. :-1.9882
## 1st Qu.:-0.7548
## Median :-0.2103
## Mean : 0.0000
## 3rd Qu.: 0.9299
## Max. : 2.5208
sapply(happy_2017_processed, sd)
## country perceived_well_being
## NA 1.0000000
## GDP_per_capita social_support
## 1.0000000 1.0000000
## life_expectancy freedom_of_life_choices
## 1.0000000 1.0000000
## generosity perception_corruption
## 0.1583641 1.0000000
## confidence_in_national_govt gini_household_income
## 1.0000000 1.0000000
This will give us one big dataset with country, score, and predictors that will be split into train and test sets.
happy_2017_w_result <- happy_2017_processed %>%
left_join(happiness_scores_2017, by = "country") %>%
select(country, happiness_score, everything())
summary(happy_2017_w_result)
## country happiness_score perceived_well_being GDP_per_capita
## Length:128 Min. :3.083 Min. :-2.45923 Min. :-2.2667
## Class :character 1st Qu.:4.449 1st Qu.:-0.76635 1st Qu.:-0.8461
## Mode :character Median :5.404 Median : 0.09598 Median : 0.1711
## Mean :5.408 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:6.238 3rd Qu.: 0.72359 3rd Qu.: 0.8515
## Max. :7.632 Max. : 1.99934 Max. : 1.8188
## NA's :2
## social_support life_expectancy freedom_of_life_choices
## Min. :-3.9357 Min. :-2.3858 Min. :-2.7772
## 1st Qu.:-0.5944 1st Qu.:-0.7419 1st Qu.:-0.5532
## Median : 0.1806 Median : 0.1880 Median : 0.2606
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8086 3rd Qu.: 0.8107 3rd Qu.: 0.7978
## Max. : 1.2882 Max. : 1.6821 Max. : 1.6248
##
## generosity perception_corruption confidence_in_national_govt
## Min. :-0.296735 Min. :-3.1784 Min. :-1.9292
## 1st Qu.:-0.133247 1st Qu.:-0.3102 1st Qu.:-0.7939
## Median :-0.030061 Median : 0.2961 Median :-0.1022
## Mean :-0.007489 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.091140 3rd Qu.: 0.6823 3rd Qu.: 0.6235
## Max. : 0.628706 Max. : 1.1926 Max. : 2.3436
##
## gini_household_income
## Min. :-1.9882
## 1st Qu.:-0.7548
## Median :-0.2103
## Mean : 0.0000
## 3rd Qu.: 0.9299
## Max. : 2.5208
##
Two NA’s resulted from the join, so delete those records.
happy_NAs <- which(is.na(happy_2017_w_result$happiness_score))
happy_2017_wo_na <- happy_2017_w_result[-happy_NAs, ]
summary(happy_2017_wo_na)
## country happiness_score perceived_well_being GDP_per_capita
## Length:126 Min. :3.083 Min. :-2.459230 Min. :-2.2667
## Class :character 1st Qu.:4.449 1st Qu.:-0.779324 1st Qu.:-0.8498
## Mode :character Median :5.404 Median : 0.095978 Median : 0.1360
## Mean :5.408 Mean :-0.003973 Mean :-0.0179
## 3rd Qu.:6.238 3rd Qu.: 0.725759 3rd Qu.: 0.8080
## Max. :7.632 Max. : 1.999340 Max. : 1.8188
## social_support life_expectancy freedom_of_life_choices
## Min. :-3.935735 Min. :-2.38582 Min. :-2.77725
## 1st Qu.:-0.594734 1st Qu.:-0.76322 1st Qu.:-0.57829
## Median : 0.173399 Median : 0.18803 Median : 0.25141
## Mean :-0.008506 Mean :-0.01184 Mean :-0.00823
## 3rd Qu.: 0.802478 3rd Qu.: 0.78936 3rd Qu.: 0.81668
## Max. : 1.288206 Max. : 1.59048 Max. : 1.62480
## generosity perception_corruption confidence_in_national_govt
## Min. :-0.296735 Min. :-3.178372 Min. :-1.92918
## 1st Qu.:-0.133472 1st Qu.:-0.297393 1st Qu.:-0.77162
## Median :-0.032265 Median : 0.296125 Median :-0.10224
## Mean :-0.008467 Mean : 0.006527 Mean : 0.00696
## 3rd Qu.: 0.089498 3rd Qu.: 0.672713 3rd Qu.: 0.62999
## Max. : 0.628706 Max. : 1.192618 Max. : 2.34364
## gini_household_income
## Min. :-1.988242
## 1st Qu.:-0.762806
## Median :-0.210271
## Mean : 0.001428
## 3rd Qu.: 0.934656
## Max. : 2.520808
We’ve got a clean set of data to train and test on, but there are a couple of standard checks that should be done just to ensure that all bases are covered.
The first is near-zero variance (nzv): a variable that contains the same value in all, or almost all, observations.
I have already looked at the data and there are no nzv predictors, but this is how to test for it.
nzv <- nearZeroVar(happy_2017_wo_na[, 3:11], saveMetrics= TRUE)
sum(nzv$nzv)
## [1] 0
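One step is missing from the text above: the creation of happy_wo_hc_vars, the predictor set with highly correlated variables removed (the conclusion notes that GDP_per_capita was the variable dropped here). A minimal sketch of how that check is typically done with caret’s findCorrelation() follows; the 0.75 cutoff is my assumption, not a value from the original analysis.
# Assumed reconstruction of the "remove highly correlated predictors" step.
# findCorrelation() suggests columns to drop; the variable actually removed
# in this analysis was GDP_per_capita (see the conclusion).
predictor_cors <- cor(happy_2017_wo_na[, 3:11])
findCorrelation(predictor_cors, cutoff = 0.75, names = TRUE)
happy_wo_hc_vars <- happy_2017_wo_na[, 3:11] %>%
  select(-GDP_per_capita)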
All our variables are numeric so it’s a good idea to look for linear combinations that can also confuse the models.
var_matrix <- as.matrix(happy_wo_hc_vars)
(linear_combos <- findLinearCombos(var_matrix))
## $linearCombos
## list()
##
## $remove
## NULL
There are none, so we can continue to work with happy_wo_hc_vars.
Looking at the summary above, scaling gives us another advantage: the min and max of each variable now tell us how many standard deviations the most extreme observation in that variable sits from the mean.
Ignore happiness_score and generosity, since those variables were not transformed.
social_support has an observation that is almost 4 standard deviations below the mean, and a few other variables have observations close to 3 standard deviations out.
Let’s see if they qualify as outliers…
myboxplot <- boxplot(happy_wo_hc_vars) # circles beyond the whiskers are the outliers
Each circle represents an outlier. How many are there altogether?
myboxplot$out # print the values of the outliers
## [1] -2.9970367 -3.9357348 -2.7772468 0.4729739 0.6287057 -1.8021372
## [7] -3.0716237 -3.0094990 -2.2116714 -2.2497846 -2.0680186 -2.8469538
## [13] -2.6935124 -3.1783718 -2.7505593 -2.3269363
min(abs(myboxplot$out))
## [1] 0.4729739
Most of the outliers occur in variable 6, perception_corruption; these observations may in fact be a separate population.
There are 16 outliers in all, and the smallest flagged scaled value is only about 1.80 standard deviations from the mean, which seems like a low cutoff to me.
I really don’t want to speculate on why the outliers exist, but since the score is based on this data I am not going to remove any of these values.
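For reference, boxplot() flags a point when it falls more than 1.5 * IQR beyond the quartiles. A quick illustration of that rule on one variable (my own check, not part of the original analysis):
# boxplot()'s default outlier rule: more than 1.5 * IQR beyond the quartiles
x <- happy_wo_hc_vars$perception_corruption
q <- quantile(x, c(0.25, 0.75))
c(lower = q[[1]] - 1.5 * (q[[2]] - q[[1]]),
  upper = q[[2]] + 1.5 * (q[[2]] - q[[1]]))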
result <- happy_2017_wo_na[, 1:2]
happy_data <- happy_wo_hc_vars %>%
bind_cols(result) %>%
select(country, happiness_score, everything())
glimpse(happy_data)
## Observations: 126
## Variables: 10
## $ country <chr> "Afghanistan", "Albania", "Argenti...
## $ happiness_score <dbl> 3.632, 4.586, 6.388, 4.321, 7.272,...
## $ perceived_well_being <dbl> -2.45922945, -0.73910180, 0.478294...
## $ social_support <dbl> -2.5530650, -1.3679404, 0.8034490,...
## $ life_expectancy <dbl> -1.37952691, 0.73505142, 0.5436179...
## $ freedom_of_life_choices <dbl> -2.77724677, -0.23302656, 0.416475...
## $ generosity <dbl> -0.106340349, -0.035140377, -0.186...
## $ perception_corruption <dbl> 1.19261817, 0.76104643, 0.56757758...
## $ confidence_in_national_govt <dbl> -1.17725451, -0.19352638, -0.95578...
## $ gini_household_income <dbl> -1.5947299, -0.4055899, -0.5621611...
Modelling uses a lot of randomization. For reproducibility I’ll set the random seed so the results will always be the same.
set.seed(317)
Caret has an excellent function called createDataPartition() that I usually use, but in this case I’ve already decided to split the data by whether or not the country is part of the OECD.
train <- happy_data %>%
filter(!country %in% OECD_countries) %>%
select(-country)
test <- happy_data %>%
filter(country %in% OECD_countries)
dim(train)
## [1] 87 9
dim(test)
## [1] 39 10
Train has 87 records and test has 39.
Both have one outcome variable and eight predictors; the test set also keeps the country column.
The controller can be used with multiple models so you can compare apples-to-apples without a lot of extra typing.
We have a small set of data so I’m not worried about running a lot of iterations.
I’m using repeated cross-validation; each repeat will have 9 folds, so each fold holds roughly 10% of the training data.
I’ll set the number of repeats to 20 initially; performance usually stabilizes before then, but I’ll raise it if needed.
Note: Normally I set verboseIter to TRUE so I can monitor the progress. For this paper I’ll leave it off but set a timer so you can see how long each model takes to run.
# Create controller for all models
fit_controller <- trainControl(
method = "repeatedcv",
number = 9,
repeats = 20,
verboseIter = FALSE)
For regression I always do a linear model (lm) first. This is usually the simplest model to fit but is rarely the winner. It has a better chance with this data since all predictors and the result are continuous numeric values.
modelLookup(model = 'lm')
## model parameter label forReg forClass probModel
## 1 lm intercept intercept TRUE FALSE FALSE
The intercept is the only parameter and it will be determined by the model so no need to set any tuning parameters.
Since all predictors are numeric I’m going to convert the dataframe to a matrix for better processing.
predictors <- as.matrix(train[, 2:9])
actuals <- as.numeric(train$happiness_score)
start_time <- Sys.time()
model_lm <- train(predictors,
actuals,
method = 'lm',
trControl = fit_controller)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 1.585442 secs
summary(model_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77316 -0.13789 0.00832 0.11484 0.72798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.37056 0.03412 157.385 <2e-16 ***
## perceived_well_being 0.83763 0.05569 15.040 <2e-16 ***
## social_support 0.10355 0.04284 2.417 0.0180 *
## life_expectancy 0.08856 0.05279 1.678 0.0974 .
## freedom_of_life_choices 0.04157 0.04651 0.894 0.3742
## generosity -0.21613 0.20872 -1.036 0.3036
## perception_corruption -0.02367 0.05868 -0.403 0.6877
## confidence_in_national_govt -0.02476 0.04728 -0.524 0.6020
## gini_household_income -0.04482 0.04042 -1.109 0.2708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2576 on 78 degrees of freedom
## Multiple R-squared: 0.9287, Adjusted R-squared: 0.9214
## F-statistic: 127 on 8 and 78 DF, p-value: < 2.2e-16
The key metric in this output is the Adjusted R-squared. It is 0.9214, which indicates that about 92% of the variability in the data is explained by the model.
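As a quick sanity check (my own calculation, not part of the original output), the adjusted R-squared can be recomputed from the multiple R-squared, the 87 training observations, and the 8 predictors:
# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.9287   # multiple R-squared from summary(model_lm)
n  <- 87       # training observations
p  <- 8        # predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)   # ~0.9214, matching the value above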
NOTE: I wonder if the other 8% could be accounted for by the GDP_per_capita variable.
The stars next to each variable indicate its statistical significance in the model.
model_lm$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.2659572 0.9318729 0.208186 0.07949886 0.04022326 0.05644595
The cross-validated RMSE is 0.266, which we’ll use to compare against the results of the other models.
plot(varImp(model_lm))
plot(resid(model_lm))
In this case a messy plot is a good thing: this is the noise the model could not explain, the remaining 8%. There is no discernible pattern, so I am confident that this model has done the best it can with the data it has.
After each model runs I will save the results to a dataframe for final model evaluation at the end.
this_model <- tibble(model = model_lm$method)
this_fit_run_time <- tibble(run_time = run_time)
this_controller <-
tibble(
control_method = model_lm$control$method,
num_folds = model_lm$control$number,
repeats = model_lm$control$repeats
)
this_best_tune <- model_lm$bestTune
this_model_metrics <- model_lm$results
#filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)
this_model_record <- cbind(this_model,
this_model_metrics,
this_fit_run_time,
this_controller
)
# model_stats will store the results for all the models.
model_stats <- this_model_record %>%
mutate(nprune = NA_integer_) %>%
mutate(degree = NA_integer_) %>%
mutate(direction = NA_character_) %>%
mutate(maxvar = NA_integer_) %>%
mutate(nu = NA_integer_) %>%
mutate(mstop = NA_integer_)
#(model_stats <- rbind(model_stats, this_model_record))
From the MASS package.
modelLookup(model = "lmStepAIC")
## model parameter label forReg forClass probModel
## 1 lmStepAIC parameter parameter TRUE FALSE FALSE
The only listed tuning parameter is a placeholder called parameter, but I would expect it to expose maxvar and direction, like stepLDA does.
modelLookup(model = "stepLDA")
## model parameter label forReg forClass probModel
## 1 stepLDA maxvar Maximum #Variables FALSE TRUE TRUE
## 2 stepLDA direction Search Direction FALSE TRUE TRUE
I’m adding maxvar and direction to see if they will be used. The direction controls whether variables are added (forward), removed (backward), or both; using “both” tries each way at every step.
start_time <- Sys.time()
model_lmStepAIC <- train(
predictors,
actuals,
method = 'lmStepAIC',
trControl = fit_controller,
maxvar = 8,
direction = "both",
trace = 0
)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 7.246598 secs
model_lmStepAIC$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 0.2577185 0.9323786 0.1993049 0.07313809 0.04131507 0.05217305
We see a slight improvement in both RMSE and Rsquared.
this_model <- tibble(model = model_lmStepAIC$method)
this_fit_run_time <- tibble(run_time = run_time)
this_controller <-
tibble(
control_method = model_lmStepAIC$control$method,
num_folds = 0,
repeats = model_lmStepAIC$control$repeats
)
#this_best_tune <- model_lmStepAIC$bestTune
this_best_tune <- tibble(
maxvar = 8,
direction = "both"
)
this_model_metrics <- model_lmStepAIC$results %>%
select(-parameter)
#filter(nprune == this_best_tune$nprune & degree == this_best_tune$degree)
this_model_record <- cbind(this_model,
this_model_metrics,
this_fit_run_time,
this_controller,
this_best_tune) %>%
mutate(nprune = NA_integer_) %>%
mutate(degree = NA_integer_) %>%
mutate(intercept = NA) %>%
mutate(nu = NA_integer_) %>%
mutate(mstop = NA_integer_)
# model_stats will store the results for all the models.
(model_stats <- rbind(model_stats, this_model_record)) %>%
select(model, RMSE, Rsquared, control_method, repeats, everything()) %>%
arrange(RMSE)
## model RMSE Rsquared control_method repeats intercept MAE
## 1 lmStepAIC 0.2577185 0.9323786 repeatedcv 20 NA 0.1993049
## 2 lm 0.2659572 0.9318729 repeatedcv 20 TRUE 0.2081860
## RMSESD RsquaredSD MAESD run_time num_folds nprune degree
## 1 0.07313809 0.04131507 0.05217305 7.246598 secs 0 NA NA
## 2 0.07949886 0.04022326 0.05644595 1.585442 secs 9 NA NA
## direction maxvar nu mstop
## 1 both 8 NA NA
## 2 <NA> NA NA NA
From the earth package.
Caret provides a function called modelLookup() that lists a model’s tuning parameters.
modelLookup(model = 'earth')
## model parameter label forReg forClass probModel
## 1 earth nprune #Terms TRUE TRUE TRUE
## 2 earth degree Product Degree TRUE TRUE TRUE
There are 2 tuning parameters:
nprune = the maximum number of model terms to keep after pruning.
degree = the maximum degree of interaction; it defaults to 1, which gives an additive model, and shouldn’t be changed without a very good reason.
I’ll try every value of nprune from 1 to 8.
grid <- expand.grid(nprune = 1:8,
degree = 1)
start_time <- Sys.time()
model_earth <- train(
predictors,
actuals,
method = 'earth',
trControl = fit_controller,
tuneGrid = grid
)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 9.476251 secs
(this_best_tune <- model_earth$bestTune)
## nprune degree
## 5 5 1
model_earth$finalModel$coefficients
## y
## (Intercept) 4.6758538
## h(perceived_well_being--0.553305) 0.9898294
## h(-0.553305-perceived_well_being) -0.6707775
## h(life_expectancy--1.37953) 0.1512215
## h(0.824039-social_support) -0.1134321
this_model_metrics <- model_earth$results %>%
filter(nprune == 5 & degree == 1)
this_model_metrics
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 5 0.2594503 0.9301155 0.1977023 0.0716242 0.03620401
## MAESD
## 1 0.05335324
plot(varImp(model_earth))
In this model only three predictors are considered important.
this_model <- tibble(model = model_earth$method)
this_controller <-
tibble(
control_method = model_earth$control$method,
num_folds = 0,
repeats = model_earth$control$repeats
)
this_fit_run_time <- tibble(run_time = run_time)
this_model_record <- cbind(this_model,
this_controller,
this_fit_run_time,
this_model_metrics) %>%
mutate(intercept = 0.00) %>%
mutate(direction = NA_character_) %>%
mutate(maxvar = NA_integer_) %>%
mutate(nu = NA_integer_) %>%
mutate(mstop = NA_integer_)
# model_stats stores the results for all the models.
model_stats <- rbind(model_stats, this_model_record) %>%
select(model, RMSE, Rsquared, everything()) %>%
arrange(RMSE)
model_stats
## model RMSE Rsquared intercept MAE RMSESD RsquaredSD
## 1 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809 0.04131507
## 2 earth 0.2594503 0.9301155 0 0.1977023 0.07162420 0.03620401
## 3 lm 0.2659572 0.9318729 1 0.2081860 0.07949886 0.04022326
## MAESD run_time control_method num_folds repeats nprune degree
## 1 0.05217305 7.246598 secs repeatedcv 0 20 NA NA
## 2 0.05335324 9.476251 secs repeatedcv 0 20 5 1
## 3 0.05644595 1.585442 secs repeatedcv 9 20 NA NA
## direction maxvar nu mstop
## 1 both 8 NA NA
## 2 <NA> NA NA NA
## 3 <NA> NA NA NA
From the bst package.
Tuning parameters: mstop (number of boosting iterations) and nu (shrinkage, i.e. the learning rate).
Generally, 50 is a good starting point for mstop; when you plot the results you can see where performance levels off, and that is where to set mstop for the final model.
nu is a learning rate between 0 and 1. The default is 0.1, but you may get better results by experimenting with other values.
mstop <- seq(10, 50, 10)
nu <- c( .01, .05, .1, .25)
grid <- expand.grid(mstop = mstop, nu = nu)
start_time <- Sys.time()
model_bstSm <- train(predictors,
actuals,
method = 'bstSm',
trControl = fit_controller,
tuneGrid = grid)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 9.792036 mins
(this_best_tune <- model_bstSm$bestTune)
## mstop nu
## 14 40 0.1
this_model_metrics <- model_bstSm$results %>%
filter(mstop == this_best_tune[1,1] & nu == this_best_tune[1,2])
this_model_metrics
## nu mstop RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.1 40 0.2530175 0.9345729 0.200745 0.06216054 0.03563159 0.04779683
plot(model_bstSm)
this_model <- tibble(model = model_bstSm$method)
this_controller <-
tibble(
control_method = model_bstSm$control$method,
num_folds = 0,
repeats = model_bstSm$control$repeats
)
this_fit_run_time <- tibble(run_time = run_time)
this_model_record <- cbind(this_model,
this_controller,
this_fit_run_time,
this_model_metrics) %>%
mutate(intercept = 0.00) %>%
mutate(direction = NA_character_) %>%
mutate(maxvar = NA_integer_) %>%
mutate(nprune = NA_integer_) %>%
mutate(degree = NA_integer_)
# model_stats stores the results for all the models.
model_stats <- rbind(model_stats, this_model_record) %>%
select(model, RMSE, Rsquared, everything()) %>%
arrange(RMSE)
model_stats
## model RMSE Rsquared intercept MAE RMSESD RsquaredSD
## 1 bstSm 0.2530175 0.9345729 0 0.2007450 0.06216054 0.03563159
## 2 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809 0.04131507
## 3 earth 0.2594503 0.9301155 0 0.1977023 0.07162420 0.03620401
## 4 lm 0.2659572 0.9318729 1 0.2081860 0.07949886 0.04022326
## MAESD run_time control_method num_folds repeats nprune degree
## 1 0.04779683 9.792036 secs repeatedcv 0 20 NA NA
## 2 0.05217305 7.246598 secs repeatedcv 0 20 NA NA
## 3 0.05335324 9.476251 secs repeatedcv 0 20 5 1
## 4 0.05644595 1.585442 secs repeatedcv 9 20 NA NA
## direction maxvar nu mstop
## 1 <NA> NA 0.1 40
## 2 both 8 NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
Every one of these models had basically the same accuracy, leaving about 7% of the variability unexplained. I have a feeling that the missing 7% sits in the variable we dropped earlier: GDP_per_capita.
model_stats
## model RMSE Rsquared intercept MAE RMSESD RsquaredSD
## 1 bstSm 0.2530175 0.9345729 0 0.2007450 0.06216054 0.03563159
## 2 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809 0.04131507
## 3 earth 0.2594503 0.9301155 0 0.1977023 0.07162420 0.03620401
## 4 lm 0.2659572 0.9318729 1 0.2081860 0.07949886 0.04022326
## MAESD run_time control_method num_folds repeats nprune degree
## 1 0.04779683 9.792036 secs repeatedcv 0 20 NA NA
## 2 0.05217305 7.246598 secs repeatedcv 0 20 NA NA
## 3 0.05335324 9.476251 secs repeatedcv 0 20 5 1
## 4 0.05644595 1.585442 secs repeatedcv 9 20 NA NA
## direction maxvar nu mstop
## 1 <NA> NA 0.1 40
## 2 both 8 NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
This is where we can see if there was any overfitting to the training set. If performance on the test set is significantly worse than on the training set, then we probably have overfitting. I don’t expect that here, because there is a formula behind the final score, so overfitting on this data should be unlikely.
### Fit model to test set, with all variables
test_predictions <- predict(model_earth, test)
test_results <- test %>%
mutate(pred = test_predictions) %>%
select(country, happiness_score, pred) %>%
rename(obs = happiness_score)
(test_rmse <- rmse(test_results, truth = obs, estimate = pred))
## [1] 0.1683864
(test_rsq <- rsq(test_results, truth = obs, estimate = pred))
## [1] 0.962229
The model actually fit the OECD countries better than the non-OECD countries.
It explains 96.2% of the variance with an RMSE of 0.168.
I am really suspicious about that remaining 7% from the training model. If these variables are all that are used to create the final score, then these models should have dialed it in to 99% accuracy. I am going to add GDP_per_capita back, then run lmStepAIC again to see if there is an improvement.
predictors <- happy_2017_wo_na[, 3:11]
actuals <- happy_2017_wo_na[, 2]
glimpse(predictors)
## Observations: 126
## Variables: 9
## $ perceived_well_being <dbl> -2.45922945, -0.73910180, 0.478294...
## $ GDP_per_capita <dbl> -1.4593121, 0.1069263, 0.4914529, ...
## $ social_support <dbl> -2.5530650, -1.3679404, 0.8034490,...
## $ life_expectancy <dbl> -1.37952691, 0.73505142, 0.5436179...
## $ freedom_of_life_choices <dbl> -2.77724677, -0.23302656, 0.416475...
## $ generosity <dbl> -0.106340349, -0.035140377, -0.186...
## $ perception_corruption <dbl> 1.19261817, 0.76104643, 0.56757758...
## $ confidence_in_national_govt <dbl> -1.17725451, -0.19352638, -0.95578...
## $ gini_household_income <dbl> -1.5947299, -0.4055899, -0.5621611...
I’m going to really ramp up the number of repeats to see if that helps dial in the algorithm.
fit_controller <- trainControl(
method = "repeatedcv",
number = 9,
repeats = 200,
verboseIter = FALSE)
The lmStepAIC model performed best so I’ll start with that one.
start_time <- Sys.time()
model_lmStepAIC_2 <- train(
predictors,
actuals$happiness_score,
method = 'lmStepAIC',
trControl = fit_controller,
maxvar = 8,
direction = "both",
trace = 0
)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 1.311215 mins
(this_model_metrics <- model_lmStepAIC_2$results %>% select(-parameter))
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.2415366 0.9586275 0.1886687 0.05659547 0.020312 0.03880572
Yes, there is a marked improvement in both RMSE and R-squared.
model_lmStepAIC_2$finalModel$coefficients
## (Intercept) perceived_well_being GDP_per_capita
## 5.41509435 0.90955552 0.14227269
## social_support perception_corruption
## 0.07856861 -0.03596966
Well, look at that: GDP_per_capita is now the 2nd most important variable.
plot(resid(model_lmStepAIC_2))
Hmmm, I see an oscillating pattern that converges at ~65 and ~90. Then again, maybe I’m just seeing things.
The normwhn.test package has a whitenoise.test() function whose null hypothesis is that the data are white noise, so a p-value of less than 0.05 would indicate that there is still some ‘signal’ left in the residuals.
library(normwhn.test)
whitenoise.test(resid(model_lmStepAIC_2))
## [1] "no. of observations"
## [1] 126
## [1] "T"
## [1] 63
## [1] "CVM stat MN"
## [1] 1.032482
## [1] "tMN"
## [1] 0.2578211
## [1] "test value"
## [1] 0.8974284
The test value is well above 0.05, so we fail to reject the null hypothesis; the residuals are consistent with white noise.
this_model <- tibble(model = paste0(model_lmStepAIC_2$method, "_2"))
this_fit_run_time <- tibble(run_time = run_time)
this_controller <-
tibble(
control_method = model_lmStepAIC_2$control$method,
num_folds = 0,
repeats = model_lmStepAIC_2$control$repeats
)
#this_best_tune <- model_lmStepAIC$bestTune
this_best_tune <- tibble(
maxvar = 8,
direction = "both"
)
this_model_record <- cbind(this_model,
this_fit_run_time,
this_model_metrics,
this_best_tune,
this_controller
) %>%
mutate(nprune = NA_integer_) %>%
mutate(degree = NA_integer_) %>%
mutate(intercept = NA) %>%
mutate(nu = NA) %>%
mutate(mstop = NA_integer_)
# model_stats will store the results for all the models.
(model_stats <- rbind(model_stats, this_model_record)) %>%
select(model, RMSE, Rsquared, everything()) %>%
arrange(RMSE)
## model RMSE Rsquared intercept MAE RMSESD
## 1 lmStepAIC_2 0.2415366 0.9586275 NA 0.1886687 0.05659547
## 2 bstSm 0.2530175 0.9345729 0 0.2007450 0.06216054
## 3 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809
## 4 earth 0.2594503 0.9301155 0 0.1977023 0.07162420
## 5 lm 0.2659572 0.9318729 1 0.2081860 0.07949886
## RsquaredSD MAESD run_time control_method num_folds repeats
## 1 0.02031200 0.03880572 1.311215 secs repeatedcv 0 200
## 2 0.03563159 0.04779683 9.792036 secs repeatedcv 0 20
## 3 0.04131507 0.05217305 7.246598 secs repeatedcv 0 20
## 4 0.03620401 0.05335324 9.476251 secs repeatedcv 0 20
## 5 0.04022326 0.05644595 1.585442 secs repeatedcv 9 20
## nprune degree direction maxvar nu mstop
## 1 NA NA both 8 NA NA
## 2 NA NA <NA> NA 0.1 40
## 3 NA NA both 8 NA NA
## 4 5 1 <NA> NA NA NA
## 5 NA NA <NA> NA NA NA
The earth model previously performed 2nd best. Let’s give it another chance with GDP_per_capita included.
### Create parameter grid
grid <- expand.grid(nprune = 1:8,
degree = 1)
start_time <- Sys.time()
model_earth_2 <- train(
predictors,
actuals$happiness_score,
method = 'earth',
trControl = fit_controller,
tuneGrid = grid
)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 2.354251 mins
(this_best_tune <- model_earth_2$bestTune)
## nprune degree
## 4 4 1
Previously nprune was optimized at 5 and now it is 4.
model_earth_2$finalModel$coefficients
## y
## (Intercept) 4.7241066
## h(perceived_well_being--0.650129) 1.1121285
## h(-0.650129-perceived_well_being) -0.6567519
## h(0.106926-GDP_per_capita) -0.2289394
This model only uses 2 of the 9 available variables.
this_model_metrics <- model_earth_2$results %>%
filter(nprune == 4 & degree == 1)
this_model_metrics
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 4 0.2350535 0.959586 0.1807705 0.05439138 0.02007829
## MAESD
## 1 0.04122297
Amazing! Those 2 variables were able to explain 96% of the variation with an RMSE of 0.235. This is the best performer so far.
plot(varImp(model_earth_2))
this_model <- tibble(model = paste0(model_earth_2$method, "_2"))
this_controller <-
tibble(
control_method = model_earth_2$control$method,
num_folds = 0,
repeats = model_earth_2$control$repeats
)
this_fit_run_time <- tibble(run_time = run_time)
this_model_record <- cbind(this_model,
this_fit_run_time,
this_controller,
this_model_metrics) %>%
mutate(intercept = 0.00) %>%
mutate(direction = NA_character_) %>%
mutate(maxvar = NA_integer_) %>%
mutate(nu = NA) %>%
mutate(mstop = NA_integer_)
# model_stats stores the results for all the models.
model_stats <- rbind(model_stats, this_model_record) %>%
select(model, RMSE, Rsquared, everything()) %>%
arrange(RMSE)
model_stats
## model RMSE Rsquared intercept MAE RMSESD
## 1 earth_2 0.2350535 0.9595860 0 0.1807705 0.05439138
## 2 lmStepAIC_2 0.2415366 0.9586275 NA 0.1886687 0.05659547
## 3 bstSm 0.2530175 0.9345729 0 0.2007450 0.06216054
## 4 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809
## 5 earth 0.2594503 0.9301155 0 0.1977023 0.07162420
## 6 lm 0.2659572 0.9318729 1 0.2081860 0.07949886
## RsquaredSD MAESD run_time control_method num_folds repeats
## 1 0.02007829 0.04122297 2.354251 secs repeatedcv 0 200
## 2 0.02031200 0.03880572 1.311215 secs repeatedcv 0 200
## 3 0.03563159 0.04779683 9.792036 secs repeatedcv 0 20
## 4 0.04131507 0.05217305 7.246598 secs repeatedcv 0 20
## 5 0.03620401 0.05335324 9.476251 secs repeatedcv 0 20
## 6 0.04022326 0.05644595 1.585442 secs repeatedcv 9 20
## nprune degree direction maxvar nu mstop
## 1 4 1 <NA> NA NA NA
## 2 NA NA both 8 NA NA
## 3 NA NA <NA> NA 0.1 40
## 4 NA NA both 8 NA NA
## 5 5 1 <NA> NA NA NA
## 6 NA NA <NA> NA NA NA
We might as well run bstSm against all predictors too, since it was the best performer on the reduced predictor set. Let’s see what we get.
In the previous fitting the best tune was mstop = 40 and nu = 0.1.
I’ll test a couple other values around those anchors to see if there is any change.
mstop <- c(30, 40, 50)
nu <- c(.05, .1, .15)
grid <- expand.grid(mstop = mstop, nu = nu)
start_time <- Sys.time()
model_bstSm_1 <- train(predictors,
actuals$happiness_score,
method = 'bstSm',
trControl = fit_controller,
tuneGrid = grid)
end_time <- Sys.time()
(run_time <- end_time - start_time)
## Time difference of 24.33777 mins
(this_best_tune <- model_bstSm_1$bestTune)
## mstop nu
## 7 30 0.15
this_model_metrics <- model_bstSm_1$results %>%
filter(mstop == this_best_tune[1,1] & nu == this_best_tune[1,2])
plot(model_bstSm_1)
this_model <- tibble(model = paste0(model_bstSm_1$method, "_2"))
this_controller <-
tibble(
control_method = model_bstSm_1$control$method,
num_folds = 0,
repeats = model_bstSm_1$control$repeats
)
this_fit_run_time <- tibble(run_time = run_time)
this_model_record <- cbind(this_model,
this_controller,
this_fit_run_time,
this_model_metrics) %>%
mutate(intercept = 0.00) %>%
mutate(direction = NA_character_) %>%
mutate(maxvar = NA_integer_) %>%
mutate(degree = NA_integer_) %>%
mutate(nprune = NA_integer_)
# model_stats stores the results for all the models.
model_stats <- rbind(model_stats, this_model_record) %>%
select(model, RMSE, Rsquared, everything()) %>%
arrange(RMSE)
model_stats
## model RMSE Rsquared intercept MAE RMSESD
## 1 bstSm_2 0.2266729 0.9636103 0 0.1752025 0.04994372
## 2 earth_2 0.2350535 0.9595860 0 0.1807705 0.05439138
## 3 lmStepAIC_2 0.2415366 0.9586275 NA 0.1886687 0.05659547
## 4 bstSm 0.2530175 0.9345729 0 0.2007450 0.06216054
## 5 lmStepAIC 0.2577185 0.9323786 NA 0.1993049 0.07313809
## 6 earth 0.2594503 0.9301155 0 0.1977023 0.07162420
## 7 lm 0.2659572 0.9318729 1 0.2081860 0.07949886
## RsquaredSD MAESD run_time control_method num_folds repeats
## 1 0.01725562 0.03833247 24.337775 secs repeatedcv 0 200
## 2 0.02007829 0.04122297 2.354251 secs repeatedcv 0 200
## 3 0.02031200 0.03880572 1.311215 secs repeatedcv 0 200
## 4 0.03563159 0.04779683 9.792036 secs repeatedcv 0 20
## 5 0.04131507 0.05217305 7.246598 secs repeatedcv 0 20
## 6 0.03620401 0.05335324 9.476251 secs repeatedcv 0 20
## 7 0.04022326 0.05644595 1.585442 secs repeatedcv 9 20
## nprune degree direction maxvar nu mstop
## 1 NA NA <NA> NA 0.15 30
## 2 4 1 <NA> NA NA NA
## 3 NA NA both 8 NA NA
## 4 NA NA <NA> NA 0.10 40
## 5 NA NA both 8 NA NA
## 6 5 1 <NA> NA NA NA
## 7 NA NA <NA> NA NA NA
The bstSm model with all variables performed the best on the training set. Let’s see how each model performs against the test set.
First, rebuild the test set from the full dataset so that it includes GDP_per_capita.
test <- happy_2017_wo_na %>%
filter(country %in% OECD_countries)
glimpse(test)
## Observations: 39
## Variables: 11
## $ country <chr> "Australia", "Austria", "Belgium",...
## $ happiness_score <dbl> 7.272, 7.139, 6.927, 6.419, 6.476,...
## $ perceived_well_being <dbl> 1.5373403, 1.5692499, 1.2514771, 0...
## $ GDP_per_capita <dbl> 1.20215279, 1.20575278, 1.15781245...
## $ social_support <dbl> 1.15263575, 0.79956381, 0.92404516...
## $ life_expectancy <dbl> 1.2072194, 1.1536177, 1.1261885, 0...
## $ freedom_of_life_choices <dbl> 1.03623710, 0.87440700, 0.61234632...
## $ generosity <dbl> 0.30169326, 0.12499674, 0.04548147...
## $ perception_corruption <dbl> -1.80213723, -1.21229604, -1.07584...
## $ confidence_in_national_govt <dbl> -0.2151998, -0.3027780, -0.2335925...
## $ gini_household_income <dbl> -0.23422989, -1.24589669, -0.94419...
test_prediction <- predict(model_lmStepAIC_2, test)
test_results <- test %>%
mutate(pred = test_prediction) %>%
select(country, happiness_score, pred) %>%
rename(obs = happiness_score)
this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)
final_results <- tibble(
model = "lmStepAIC",
rmse = this_rmse,
Rsquared = this_rsquared
)
final_results
## # A tibble: 1 x 3
## model rmse Rsquared
## <chr> <dbl> <dbl>
## 1 lmStepAIC 0.161 0.965
test_prediction <- predict(model_earth_2, test)
test_results <- test %>%
mutate(pred = test_prediction) %>%
select(country, happiness_score, pred) %>%
rename(obs = happiness_score)
this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)
this_final_results <- tibble(
model = "earth",
rmse = this_rmse,
Rsquared = this_rsquared
)
final_results <- rbind(final_results, this_final_results) %>%
arrange(rmse)
final_results
## # A tibble: 2 x 3
## model rmse Rsquared
## <chr> <dbl> <dbl>
## 1 earth 0.137 0.974
## 2 lmStepAIC 0.161 0.965
test_prediction <- predict(model_bstSm_1, test)
test_results <- test %>%
mutate(pred = test_prediction) %>%
select(country, happiness_score, pred) %>%
rename(obs = happiness_score)
this_rmse <- rmse(test_results, truth = obs, estimate = pred)
this_rsquared <- rsq(test_results, truth = obs, estimate = pred)
this_final_results <- tibble(
model = "bstSm",
rmse = this_rmse,
Rsquared = this_rsquared
)
final_results <- rbind(final_results, this_final_results) %>%
arrange(rmse)
final_results
## # A tibble: 3 x 3
## model rmse Rsquared
## <chr> <dbl> <dbl>
## 1 earth 0.137 0.974
## 2 bstSm 0.151 0.970
## 3 lmStepAIC 0.161 0.965
In the end the earth model, which uses Multivariate Adaptive Regression Splines (MARS), wins.
test_prediction <- predict(model_earth_2, test)
test_results <- test %>%
mutate(pred = test_prediction) %>%
select(country, happiness_score, pred) %>%
rename(obs = happiness_score)
ggplot(test_results, aes(x = obs, y = pred)) +
geom_point()
The points fall very nearly on a straight line from lower-left to upper-right. This is a good thing.
plot(varImp(model_earth_2))
The final model ended up using 2 of the 9 original predictors. One of them, GDP_per_capita, was the predictor that had originally been removed because it was highly correlated with another predictor.
model_earth_2$finalModel$coefficients
## y
## (Intercept) 4.7241066
## h(perceived_well_being--0.650129) 1.1121285
## h(-0.650129-perceived_well_being) -0.6567519
## h(0.106926-GDP_per_capita) -0.2289394
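To make the hinge terms concrete, here is a hand-rolled illustration (my own sketch, not code from the analysis) of how the final model maps the two scaled predictors to a predicted score; h(x) is the hinge function max(0, x):
# Hand-rolled version of the final MARS model, coefficients copied from above
h <- function(x) pmax(0, x)
predict_happiness <- function(pwb, gdp) {
  4.7241066 +
    1.1121285 * h(pwb - (-0.650129)) -
    0.6567519 * h(-0.650129 - pwb) -
    0.2289394 * h(0.106926 - gdp)
}
# A country at the mean of both scaled predictors (value 0) lands near the
# mean happiness score
predict_happiness(0, 0)   # ~5.42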
The most surprising thing in this machine learning exercise was that a standard function to remove correlated predictors recommended removing the 2nd most important predictor in the final model. The modeller should not blindly accept the recommendations of the different tests.
I also made a conscious choice to keep the outliers because they were the actual values used by the World Happiness Report to calculate the scores.
So did I crack the code to happiness? No, I just reverse-engineered the process used to produce the score. I’m sure that if I contacted the United Nations Sustainable Development Solutions Network and showed them the model, they would say I wasn’t even close to their actual method. A model is just a simplified representation of the data, so it may not be how they calculate the score, but they could use my model and no one would know the difference.
In 2017 the US dropped 4 spots to #18. What drove the decreased rank, and what would it take for the US to get into the top 10?