I. Background and Data Processing

As the World Happiness Report’s website states, “The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The World Happiness Report 2020 for the first time ranks cities around the world by their subjective well-being and digs more deeply into how the social, urban and natural environments combine to affect our happiness.”

The World Happiness Report is a publication of the Sustainable Development Solutions Network. The happiness scores and rankings use data from the Gallup World Poll, and the scores are based on answers to the main life evaluation question asked in the poll. The report was written by independent experts and does not necessarily reflect the views of the United Nations. Each of the 156 observations in the data represents a different country.

I.A Research Question and Background

I.A.1 Research Question

The research question we examine is: what are the crucial determinants of happiness in a country? To answer it, we explore the 26 variables recorded for each country: the country’s name, year, life ladder, log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, positive affect, negative affect, confidence in national government, democratic quality, delivery quality, standard deviation of ladder by country-year, standard deviation/mean of ladder by country-year, GINI index (World Bank estimate), GINI index (World Bank estimate) average 2000-2017 unbalanced panel, GINI of household income reported in Gallup by wp5-year, Most people can be trusted, Most people can be trusted WVS round 1981-1984, Most people can be trusted WVS round 1989-1993, Most people can be trusted WVS round 1994-98, Most people can be trusted WVS round 1999-2004, Most people can be trusted WVS round 2005-2009, and Most people can be trusted WVS round 2010-2014. These variables represent an estimate of the extent to which each factor contributes to a country’s happiness score.

I.A.2 Relevant Literature

We also examined previous literature on country-level happiness. For instance, the website for the Gross National Happiness (GNH) Index used in Bhutan describes the index as a holistic approach to measuring the happiness and wellbeing of the Bhutanese population. The GNH index is a measurement tool used in policy making to increase GNH. It comprises nine domains, which are supported by 33 indicators. The index assesses the nation’s wellbeing from each person’s achievements in each indicator. In addition to analyzing the happiness and wellbeing of the people, it guides how policies may be designed to create enabling conditions in the areas where the survey results are weakest.

The New York Times published an article about the results of the 2020 World Happiness Report, with special consideration of the ongoing COVID-19 pandemic. According to John F. Helliwell, an editor of the annual happiness report, happiness isn’t a function of how well positive emotions are expressed; rather, it is a measure of general satisfaction with life and of confidence in living a secure life. Happy people “wouldn’t have the highest smile factor,” he said. “They do trust each other and care about each other, and that’s what fundamentally makes for a better life.” - NYT.

II. Exploratory Data Analysis

II.A Initial Summary Statistics

II.A.1 Count of Countries per Year

Before cleaning, the number of countries listed in this report varies from year to year between 2005 and 2019. 2005 has the fewest participating countries with 27, and 2017 the most with 147. After restricting the data to variables that are at least 70% populated and to the years 2014-2019, 117 countries are left for analysis for each of the 5 years.

Count of Countries by Year
Year Count_of_Countries
2005 27
2006 89
2007 102
2008 110
2009 114
2010 124
2011 146
2012 142
2013 137
2014 145
2015 143
2016 142
2017 147
2018 142
2019 138
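
The per-year counts above can be produced by counting distinct countries within each year. The following is a minimal Python sketch of that tally, not the code used for this report; the (country, year) row layout and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict

def countries_per_year(rows):
    """Count distinct countries observed in each year.

    `rows` is an iterable of (country, year) pairs; this layout is an
    assumption for the sketch, not the report's actual schema.
    """
    seen = defaultdict(set)
    for country, year in rows:
        seen[year].add(country)
    return {year: len(names) for year, names in sorted(seen.items())}

# Toy rows only; the real counts are the ones tabulated above.
toy_rows = [
    ("Denmark", 2005), ("Canada", 2005),
    ("Denmark", 2006), ("Canada", 2006), ("Brazil", 2006),
    ("Brazil", 2006),  # a repeated row is counted once
]
counts = countries_per_year(toy_rows)
print(counts)
```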

II.A.2 Distribution of Happiness Values

One reason we are able to move forward with our method of cleaning the data (explained in II.B) is that the remaining 117 countries are a representative sample of the population. This can be seen by the Happiness Index distribution on the histograms below, before and after cleaning.

II.A.3 Average Happiness Over Time

II.B Justify Data Processing Decisions

We collected all six years of data from the Happiness Dataset, from 2015 to 2020, and assigned the outcome variables to the year for which they were determined in the original dataset. To process and clean the data, we first read in and combined all sheets from the workbook. Next, we kept only the independent variables that were at least 70% populated and removed the rest. Finally, we removed rows where the Happiness Index (HI) was NA or missing, and, because we did not want to impute missing values for the decision tree models, dropped any observations that did not have all variables populated. After cleaning and processing, we were left with 615 of the original 1,026 observations.
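
The two filtering rules described above (keep predictors that are at least 70% populated, then drop incomplete rows) can be sketched as follows. This is an illustrative Python version under stated assumptions: rows are dictionaries, missing values are represented by None, and the column names in the toy data are hypothetical.

```python
def clean(rows, outcome="HI", min_populated=0.70):
    """Keep sufficiently populated columns, then drop incomplete rows.

    `rows` is a list of dicts with None marking a missing value. The outcome
    column is always kept; the 70% rule applies to the predictors only.
    """
    columns = list(rows[0].keys())
    n = len(rows)
    kept = [
        c for c in columns
        if c == outcome
        or sum(r[c] is not None for r in rows) / n >= min_populated
    ]
    # Drop any row that is missing the outcome or any retained predictor.
    return [
        {c: r[c] for c in kept}
        for r in rows
        if all(r[c] is not None for c in kept)
    ]

# Hypothetical toy data: "gini" is only 25% populated, "gdp" is 75% populated.
toy = [
    {"HI": 7.5, "gdp": 10.9, "gini": None},
    {"HI": 5.1, "gdp": 9.2, "gini": 0.41},
    {"HI": None, "gdp": 8.0, "gini": None},
    {"HI": 4.2, "gdp": None, "gini": None},
]
cleaned = clean(toy)
print(cleaned)
```

Dropping the sparse column before dropping incomplete rows matters: doing it in the opposite order would discard rows that are only missing a variable that is removed anyway.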

II.B.1 Data Characteristics

The data is a sample, as not all countries are present in the data set. A poll was used to gather data on the variables of interest, and this data from the Gallup World Poll was used to determine the influence of each variable and to calculate the happiness score and rank for each country. While the data set does not state how the specific countries were selected, there are observations for all of the larger and more prominent countries of the world. The countries that tend to be missing are the smaller ones, where polling was possibly not conducted or not deemed suitable. The qualitative variables in the data set are the name and region of each country. All of the other variables are quantitative.

II.B.2 Data Issues

There are a couple of issues with the data set. The first is that it only has 157 countries, while there are 195 countries recognized by the United Nations. This affects the statistical calculations and graphics. Summary statistics such as the average and median would most likely change if all countries were represented, as would the graphs: the countries with the lowest happiness score could change, the boxplots representing each region could change, and trends would be seen more thoroughly. Another potential issue is that polls were used to determine the value of each factor with respect to the happiness score. We don’t know how these surveys were conducted in each country, whether countries took them seriously, whether polling was consistent across countries, or whether the answers are representative of each country’s entire population. Any of these could lead to inaccurate representation in the data.

II.B.3 Correlation Matrix

We’ve created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, Log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix shows how each variable is correlated with every other quantitative variable and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder (0.98). Similarly, perceptions of corruption are negatively correlated with the happiness index (-0.48), as expected.

HI year Life.Ladder Log.GDP.per.capita Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity Perceptions.of.corruption Positive.affect Negative.affect Confidence.in.national.government Democratic.Quality Delivery.Quality
HI 1.00 0.02 0.98 0.81 0.75 0.81 0.52 0.16 -0.48 0.53 -0.45 -0.14 0.68 0.75
year 0.02 1.00 0.04 0.00 0.01 0.04 0.14 -0.10 -0.03 0.00 0.07 0.07 0.00 0.00
Life.Ladder 0.98 0.04 1.00 0.80 0.74 0.79 0.53 0.15 -0.47 0.52 -0.44 -0.12 0.68 0.75
Log.GDP.per.capita 0.81 0.00 0.80 1.00 0.73 0.87 0.36 -0.03 -0.39 0.32 -0.45 -0.19 0.72 0.82
Social.support 0.75 0.01 0.74 0.73 1.00 0.69 0.39 0.04 -0.27 0.45 -0.58 -0.17 0.58 0.60
Healthy.life.expectancy.at.birth 0.81 0.04 0.79 0.87 0.69 1.00 0.37 0.03 -0.36 0.35 -0.43 -0.24 0.69 0.76
Freedom.to.make.life.choices 0.52 0.14 0.53 0.36 0.39 0.37 1.00 0.31 -0.50 0.63 -0.31 0.44 0.48 0.49
Generosity 0.16 -0.10 0.15 -0.03 0.04 0.03 0.31 1.00 -0.32 0.29 -0.07 0.38 0.07 0.15
Perceptions.of.corruption -0.48 -0.03 -0.47 -0.39 -0.27 -0.36 -0.50 -0.32 1.00 -0.35 0.35 -0.44 -0.42 -0.60
Positive.affect 0.53 0.00 0.52 0.32 0.45 0.35 0.63 0.29 -0.35 1.00 -0.38 0.17 0.44 0.39
Negative.affect -0.45 0.07 -0.44 -0.45 -0.58 -0.43 -0.31 -0.07 0.35 -0.38 1.00 -0.04 -0.42 -0.50
Confidence.in.national.government -0.14 0.07 -0.12 -0.19 -0.17 -0.24 0.44 0.38 -0.44 0.17 -0.04 1.00 -0.10 -0.01
Democratic.Quality 0.68 0.00 0.68 0.72 0.58 0.69 0.48 0.07 -0.42 0.44 -0.42 -0.10 1.00 0.88
Delivery.Quality 0.75 0.00 0.75 0.82 0.60 0.76 0.49 0.15 -0.60 0.39 -0.50 -0.01 0.88 1.00
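
Each entry in the matrix above is a pairwise Pearson correlation rounded to two decimals. As a hedged illustration of that computation (a Python sketch with made-up toy columns, not the code that produced the table):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, rounded to two decimals."""
    names = list(columns)
    return {
        a: {b: round(pearson(columns[a], columns[b]), 2) for b in names}
        for a in names
    }

# Toy columns only (perfectly inversely related), not values from the data set.
toy_columns = {"HI": [1, 2, 3, 4], "corruption": [4, 3, 2, 1]}
matrix = correlation_matrix(toy_columns)
print(matrix)
```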

III. Clustering Analysis

In analyzing our data, we have defined the variable of interest to be the quartile in which the happiness index falls for a given country and year. Accordingly the base rate would be 25% of our observations for each quartile, meaning the probability that we correctly assign a country-year observation to the correct quartile (by random chance) is 0.25, by construction. In this section, we instead try to cluster the data using our explanatory variables to group together countries with similar characteristics. Using both the elbow method and the results of the NbClust function, we’ll determine the optimal number of clusters into which the data will be grouped. Once countries with similar characteristics have been sorted into this optimal number of groups, we will examine the happiness index scores associated with each grouping with the expectation that countries within a grouping would likely have similar happiness index scores.

III.A. Clustering k-means

In order to apply the k-means algorithm, the data required a few modifications. First, we removed the variable of interest from the data (both the factorized quartile and raw happiness index measure). The goal of this exercise is to group together countries with similar characteristics under the hypothesis that these similarities would imply similar measures of happiness. Accordingly, Country was also removed as we don’t want that factor variable to provide explanatory power (lest our advice for an unhappy country be to try not being that country). The other factor variable in our dataset, the year of the observation, does seem relevant as global events in a particular year can certainly explain shifts in a country’s happiness, e.g. a global pandemic. Rather than utilizing dummy coding as we would for nominal factor variables, we instead allowed year to be treated as a numeric variable and applied the same standardization as the rest of the variables. Having created a final data set consisting only of numeric variables, we then standardized the entire dataset.
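
Standardization here means rescaling each column to mean zero and unit standard deviation, so that no variable dominates the distance calculation simply because of its units. A minimal Python sketch follows; using the sample standard deviation (n - 1 denominator) is an assumption of this sketch.

```python
from statistics import mean, stdev

def standardize(column):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # n - 1 denominator; an assumption of this sketch
    return [(value - mu) / sigma for value in column]

# Year treated as a plain numeric column, as described above.
z_year = standardize([2014, 2015, 2016])
print(z_year)
```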

III.A.1 Optimal Number of Clusters

Applying the k-means function to assign observations to between one and ten clusters, we seek the number of clusters that maximizes the inter-cluster variance (i.e. the spread between points from different clusters) while minimizing the intra-cluster variance (the spread between points within the same cluster).

III.A.1.1 Elbow Method

The elbow graph shown below plots the number of clusters against the fraction of the total variance accounted for by that number of clusters. The latter is the ratio of inter-cluster variance to the total variance in the data (i.e. the spread between all the points in the data set). The ‘elbow’, the point beyond which adding clusters yields only small gains in inter-cluster variance relative to the reduction in intra-cluster variance, appears to be at around 3 clusters.
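
The quantity on the vertical axis of the elbow graph can be written as 1 - (within-cluster sum of squares / total sum of squares). The following Python sketch computes it with a plain Lloyd's k-means on toy points. The deterministic farthest-point seeding and the toy data are assumptions made so the example is reproducible; they are not a description of the k-means settings used for this report.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers, groups

def variance_explained(points, k):
    """Between-cluster share of the total sum of squares for k clusters."""
    centers, groups = kmeans(points, k)
    overall = centroid(points)
    total_ss = sum(sq_dist(p, overall) for p in points)
    within_ss = sum(sq_dist(p, centers[i]) for i, g in enumerate(groups) for p in g)
    return 1 - within_ss / total_ss

# Three well-separated toy blobs; not the happiness data.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
ratios = {k: variance_explained(points, k) for k in range(1, 5)}
for k, ratio in ratios.items():
    print(k, round(ratio, 3))
```

On data like these the ratio jumps sharply up to the true number of groups and then flattens, which is the bend the elbow method looks for.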

III.A.1.2 NbClust Majority Rule

Using the NbClust function, we find that the optimal number of clusters is 3; as shown on the graph below, the majority of indices (12) voted for 3 clusters. This reaffirms the conclusion drawn from our initial inspection of the elbow chart.

III.A.2 Assigning Optimal Number of Clusters

Using the recommended number of clusters, we find that 3 clusters explain 44% of the total variance. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model does extremely well at assigning countries to clusters reflecting overall happiness.

As the graphs below illustrate, our clusters do quite well at identifying countries on the lower and higher end of the happiness index range. However, as mentioned earlier, the self-reported happiness variable (Life Ladder) appears to be too tightly correlated with our dependent variable. In the following subsection we will explore what the results of our clustering analysis would be without this explanatory variable.

[Figure panels: Life Ladder, GDP-per-Capita, Life Expectancy]

III.B. Revised Clustering Analysis

As discussed in the section above, the self-reported happiness variable (Life Ladder) is very highly correlated with the happiness index, our dependent variable. In this section we remove the variable from our data and redo our clustering analysis. The goal of this analysis is to identify countries whose happiness index scores diverge from what we would expect based on all the other factors (GDP per capita, confidence in government, life expectancy, etc.).

III.B.1 Removing Life Ladder

As before, we now re-apply the k-means function to assign observations to as few as one or up to ten different clusters, seeking to identify the number of clusters that will maximize the inter-cluster variance and minimize the intra-cluster variance.

III.B.1.1 Elbow Method

The elbow graph shown below plots the number of clusters against the fraction of the total variance accounted for by that number of clusters. The ‘elbow’, the point beyond which adding clusters yields only small gains in inter-cluster variance relative to the reduction in intra-cluster variance, still appears to be at around 3 clusters.

III.B.1.2 NbClust Majority Rule

Using the NbClust function, we find that the optimal number of clusters is 3; as shown on the graph below, the majority of indices (12) voted for 3 clusters. This reaffirms the conclusion drawn from our initial inspection of the elbow chart and is consistent with our findings prior to removing Life Ladder from our data.

III.B.2 Assigning Optimal Number of Clusters

Using the recommended number of clusters on the revised data set, we find that 3 clusters explain 42.6% of the variance, only slightly less than the 44% accounted for when applying the same number of clusters with Life Ladder included. Assigning the predicted clusters to the actual data, we can then visualize the output to show that our model still does quite well at assigning countries to clusters reflecting overall happiness.

As before, we now plot our revised clusters against the observed happiness index and variables of interest. Note that even without training the clusters using the self-reported variable Life Ladder, our clusters still do well at identifying countries on the high and low end of the happiness index spectrum.

[Figure panels: Life Ladder, GDP-per-Capita, Life Expectancy]

III.C Evaluating Clusters

Having chosen the optimal number of clusters and set of explanatory variables, we can assess the results of the k-means clustering analysis using two approaches. First, we will compare the distribution of our clusters against the initially designated quartiles of the happiness index distribution. Then we will examine a series of visualizations of the data to glean insights into our clusters.

III.C.1 Pseudo-Confusion Matrix

The following table compares the actual happiness quartile assignments with the cluster assigned by our model. Note that no observations in the lowest quartile of the HI were assigned to the cluster associated with happier countries, and vice versa. The majority of observations in the happier and unhappier clusters are concentrated in the 4th and 1st quartiles, respectively. This is a good indication that there are strong similarities in the characteristics of the happiest countries, and likewise of the least happy ones. The cluster that spans all four quartiles of the happiness index does not necessarily indicate a good or bad fit. Rather, it reflects that quartiles are cut from a continuous distribution, so there may be no significant difference between countries whose happiness index falls just on either side of a quartile threshold.

Cluster            Q1    Q2    Q3    Q4
Happy Cluster       0     2    12    98
Average Cluster    31    83   136    56
Unhappy Cluster   123    69     5     0
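
To make the concentration claims concrete, the counts in the table can be converted to within-cluster shares. This short Python sketch uses the counts exactly as tabulated above.

```python
# Rows are clusters; columns are happiness quartiles Q1..Q4 (counts from the table).
table = {
    "Happy Cluster": [0, 2, 12, 98],
    "Average Cluster": [31, 83, 136, 56],
    "Unhappy Cluster": [123, 69, 5, 0],
}

def row_shares(counts):
    """Fraction of a cluster's observations that fall in each quartile."""
    total = sum(counts)
    return [round(c / total, 3) for c in counts]

shares = {name: row_shares(counts) for name, counts in table.items()}
for name, values in shares.items():
    print(name, values)
```

Read this way, 87.5% of the Happy Cluster sits in Q4 and about 62% of the Unhappy Cluster sits in Q1, while the Average Cluster is spread across all four quartiles.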

III.C.2 Visualizing Final Clusters

Having seen that our clusters fit well when plotted against the happiness index itself, we can begin to explore how particular variables influenced a given cluster assignment by plotting explanatory variables and contrasting the assigned cluster and observed happiness index. The graphs below plot pairs of explanatory variables against our predicted happiness clustering (the shape) and the actual assigned happiness index (color scale).

[Figure panels: Social Support, GDP-per-Capita, Democratic Quality, Perceived Corruption, Generosity]

III.D Conclusion of k-means

In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.

IV. Decision Tree

The purpose of this decision tree is to classify each country into happiness quartiles based on variables such as life ladder, social support, and democratic quality. The decision tree model will first be built using default settings; the threshold will then be adjusted to optimize for the highest and lowest quartiles, allowing us to glean insights into which factors contribute the most to countries’ happiness.

Methods

The Happiness Index was converted into quartiles, where the quartile distribution is as follows:

##      0%     25%     50%     75%    100% 
## 2.69300 4.46280 5.32250 6.30595 7.76890

Any rows with NA were removed from the data frame in order to perform the decision tree analysis.

Base rate calculation

The base rate for this classifier is the individual percentages for each quartile. Quartile 1 has a base rate of 25.04%, Quartile 2: 25.04%, Quartile 3: 24.9%, and Quartile 4: 25.04%. This base rate is as expected when distributing data into quartiles.
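
The quartile assignment and base rates can be reproduced with a few lines of arithmetic. The Python sketch below uses the quartile breaks printed above; the per-quartile counts (154, 154, 153, 154) are taken from the column totals of the confusion matrix reported later in this section, and sending a value that equals a break to the lower quartile is an assumption of this sketch.

```python
from bisect import bisect_left

# Interior quartile breaks (25%, 50%, 75%) from the quantile output above.
BREAKS = (4.46280, 5.32250, 6.30595)

def assign_quartile(hi):
    """Map a happiness index value to quartile 1..4.

    A value equal to a break goes to the lower quartile (an assumption here).
    """
    return bisect_left(BREAKS, hi) + 1

# Actual observations per quartile (column totals of the confusion matrix).
quartile_counts = {1: 154, 2: 154, 3: 153, 4: 154}
n = sum(quartile_counts.values())
base_rates = {q: round(100 * c / n, 2) for q, c in quartile_counts.items()}
print(n, base_rates)
```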

Build model using default settings

The most important variable for the tree is Life.Ladder. The first split in the tree is created using this variable, as seen below, with one branch for life ladder less than 5.5 and the other for life ladder greater than or equal to 5.5. Life ladder is where people rate their own lives on a 0 to 10 scale, with 10 being the best possible life. Thus, it seems that the most important variable for a country’s happiness is how its people rate their lives, or their perception of how good their life is.

Life ladder is the only variable that matters in this classifier. As seen below, the life ladder split points line up almost perfectly with the quartile distribution from above, the only discrepancy being a life ladder split at 5.5 as opposed to a quartile break of 5.3.

Optimal number of splits

The rel error column is the error of the tree on the data used to build it, relative to the error at the root node. The xerror column is the cross-validated error, and xstd is the standard deviation of the cross-validated errors. Defining opt = rel error + xstd, these quantities lead to the inequality rel error + xstd < xerror: the split should be chosen at the lowest level where opt < xerror.

The graph below plots the cross-validated relative error (xerror) on the y-axis against the complexity parameter on the x-axis. From the table, xerror exceeds opt after the 3rd split, where xerror is 0.2234 and opt is 0.2131. In the graph, the threshold appears to be crossed after the third split, which would indicate an optimal split at the fourth level. Thus, the plot and the table comparing opt and xerror do not agree, and we take 4 splits as optimal because we prefer the tree to line up with the quartile designation. The optimal cp, the cp at four splits, is 0.01, as seen in the table below.

CP          nsplit   rel error   xerror      xstd        opt
0.3340564   0        1.0000000   1.0694143   0.0214519   1.0214519
0.2516269   1        0.6659436   0.6659436   0.0268971   0.6928407
0.2212581   2        0.4143167   0.4273319   0.0251005   0.4394172
0.0100000   3        0.1930586   0.2234273   0.0200870   0.2131456
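
The opt column is simply rel error plus xstd, and the selection rule compares it with xerror row by row. The Python sketch below recomputes both from the values in the table; it is an arithmetic check rather than the code used to build the tree.

```python
# (CP, nsplit, rel_error, xerror, xstd), copied from the table above.
cp_table = [
    (0.3340564, 0, 1.0000000, 1.0694143, 0.0214519),
    (0.2516269, 1, 0.6659436, 0.6659436, 0.0268971),
    (0.2212581, 2, 0.4143167, 0.4273319, 0.0251005),
    (0.0100000, 3, 0.1930586, 0.2234273, 0.0200870),
]

results = []
for cp, nsplit, rel_error, xerror, xstd in cp_table:
    opt = rel_error + xstd       # the threshold column in the table
    exceeds = xerror > opt       # True where rel error + xstd < xerror
    results.append((nsplit, round(opt, 7), exceeds))
    print(nsplit, round(opt, 7), exceeds)
```

Among the rows with at least one split, the inequality holds only at nsplit = 3; the nsplit = 0 row is the unsplit root node.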

Model Evaluation

Confusion Matrix
##                     
## happy_data1_fitted_t   1   2   3   4
##                    1 128  17   0   0
##                    2  26 128  16   0
##                    3   0   9 127  11
##                    4   0   0  10 143

The confusion matrix above compares the predicted values (Q1-4) of the quartile generated by the model to the actual quartile. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 85.5% of observations. Our model seems to only misclassify with one quartile up or down (i.e. quartile 2 either being misclassified as 1 or 3 but never as 4).

Hit and Detection Rate

The error rate is defined as the number of misclassified observations (e.g. predicting Q1 when the true class is Q2) divided by the total number of data points. The error rate for this model is 14.5%, which is fairly low. The hit rate is the proportion of predictions that were correct, or 1 - error rate = 85.5%.
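
The hit and error rates follow directly from the confusion matrix above: the diagonal holds the correct predictions. The Python sketch below recomputes them from the reported counts and also checks the observation that errors only occur between adjacent quartiles.

```python
# Rows are predicted quartiles, columns are actual quartiles (counts from above).
confusion = [
    [128, 17, 0, 0],
    [26, 128, 16, 0],
    [0, 9, 127, 11],
    [0, 0, 10, 143],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(4))
hit_rate = correct / total
error_rate = 1 - hit_rate

# Largest distance between predicted and actual quartile among non-empty cells.
max_gap = max(
    abs(i - j) for i in range(4) for j in range(4) if confusion[i][j] > 0
)

print(correct, total, round(100 * hit_rate, 1), round(100 * error_rate, 1), max_gap)
```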

Comparison of Results to Base Rates

The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted values (Q1-4) of the quartile generated by the model to the actual happiness quartile. From the confusion matrix, we are correctly identifying 526 of 615, or 85.5%, of observations, with all errors falling in the next highest or lowest quartile.

1. For Q1, we've correctly identified 83.1% of observations, which is considerably better than our base rate of 25%. Our model misidentified 17 Q2 observations as Q1.

2. For Q2, our model is performing quite well, correctly identifying 83.1% of observations, considerably better than our base rate of 25%. Our model misidentified 26 Q1 observations and 16 Q3 observations as Q2.

3. For Q3, we've correctly identified 83.0% of observations, considerably better than our base rate of 25%. The model misidentified 9 Q2 observations and 11 Q4 observations as Q3.

4. For Q4, we've correctly identified 92.9% of observations, considerably better than our base rate of 25%. Our model misidentified 10 Q3 observations as Q4.
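
The per-quartile percentages in the list above are the share of each actual quartile (each column of the confusion matrix) that the model labeled correctly. A short Python sketch of that calculation, using the counts reported earlier:

```python
# Rows are predicted quartiles, columns are actual quartiles.
confusion = [
    [128, 17, 0, 0],
    [26, 128, 16, 0],
    [0, 9, 127, 11],
    [0, 0, 10, 143],
]

def per_class_rate(matrix):
    """Percent of each actual class (column) that was predicted correctly."""
    rates = {}
    for j in range(len(matrix)):
        actual_total = sum(row[j] for row in matrix)
        rates[f"Q{j + 1}"] = round(100 * matrix[j][j] / actual_total, 1)
    return rates

rates = per_class_rate(confusion)
print(rates)
```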

ROC and AUC Score

The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with a relatively high overall Multi-class Area Under the Curve (AUC) of 0.9527. The main conclusion is that this decision tree model is significantly better than guessing; while the model might not be perfect, it is a large step up from using nothing.

Metric to optimize

If our primary goal is to identify the highest quartile of happiness, Q4, we would lower the probability threshold for assigning observations to Q4 while trying to preserve our high degree of accuracy for the other quartiles. The results of lowering the Q4 probability threshold to 0.07 are shown in the confusion matrix below. At this threshold we correctly identify all 154 Q4 observations, as intended. However, we no longer identify any Q3 observations: 137 of the 153 actual Q3 observations are now classified as Q4 (the remaining 16 are still classified as Q2). Because the threshold must be this low to capture every actual Q4 observation, and doing so sweeps all Q3 predictions into Q4, the model has a hard time distinguishing between Q3 and Q4.

##    
##       1   2   3   4
##   1 128  17   0   0
##   2  26 128  16   0
##   4   0   9 137 154
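
One way to see why a threshold of 0.07 absorbs every Q3 prediction is to treat each row of the default confusion matrix as a single leaf of the tree. That is an assumption of this sketch (a tree with three splits has four leaves), as are the function names; under it, the leaf that predicts Q3 has a Q4 proportion of 11/147, roughly 0.075, which is just above the threshold.

```python
# Class counts (actual Q1..Q4) in each leaf, keyed by the leaf's default label.
# Assumes one leaf per row of the default confusion matrix.
leaves = {
    1: [128, 17, 0, 0],
    2: [26, 128, 16, 0],
    3: [0, 9, 127, 11],
    4: [0, 0, 10, 143],
}

def predict(counts, q4_threshold=None):
    """Label a leaf: Q4 if its Q4 share meets the threshold, else the majority class."""
    total = sum(counts)
    probs = [c / total for c in counts]
    if q4_threshold is not None and probs[3] >= q4_threshold:
        return 4
    return probs.index(max(probs)) + 1

def confusion_at(q4_threshold=None):
    """Rebuild the predicted-by-actual table for a given Q4 threshold."""
    rows = {}
    for counts in leaves.values():
        label = predict(counts, q4_threshold)
        merged = rows.setdefault(label, [0, 0, 0, 0])
        for j, c in enumerate(counts):
            merged[j] += c
    return rows

adjusted = confusion_at(0.07)
for label in sorted(adjusted):
    print(label, adjusted[label])
```

Under this assumption the sketch reproduces the adjusted matrix above, including the disappearance of the Q3 row.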

Hyperparameter adjustment

The optimal cp found earlier was 0.01. Rerunning the decision tree model with this cp yields a model identical to the one before: misclassifications still fall only one quartile above or below the true quartile, and accuracy remains the same for each class and therefore overall. This could be because the initial decision tree already had three splits.

Decision Tree Model, no life ladder

Next, we investigate what the decision tree looks like without the life ladder variable, which functioned as an almost perfect classifier. To prevent overfitting, we set minsplit to 93, where minsplit is the minimum number of observations that must exist in a node for a split to be attempted. Thus, just over 15% of the data (93 of 615 observations) must be in a node for a split to be attempted.

The most important variable of this tree is Healthy life expectancy at birth, and the next most important is Log GDP per capita. The decision tree can be seen below and is evidently significantly more complicated than the first tree, which was based solely on life ladder.

Variable Importance

Delving more into variable importance, the table below shows the variable importance value for each variable. Note that healthy life expectancy at birth is the most important variable while generosity is the least important variable in predicting the quartile of happiness a country is in.

Variable                            Importance
Healthy.life.expectancy.at.birth    138.0836675
Log.GDP.per.capita                  121.4351116
Delivery.Quality                    103.5952352
Democratic.Quality                  70.0539273
Social.support                      66.7701964
Perceptions.of.corruption           44.0360384
Negative.affect                     19.2264843
Freedom.to.make.life.choices        17.1940553
Confidence.in.national.government   4.3479511
Positive.affect                     1.5486990
Generosity                          0.3697141
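
Raw importance values are easier to compare once normalized to shares of the total. The Python sketch below does this for the values tabulated above; normalizing to shares is a presentation choice of this sketch, not a step described in the analysis.

```python
# Variable importance values copied from the table above.
importance = {
    "Healthy.life.expectancy.at.birth": 138.0836675,
    "Log.GDP.per.capita": 121.4351116,
    "Delivery.Quality": 103.5952352,
    "Democratic.Quality": 70.0539273,
    "Social.support": 66.7701964,
    "Perceptions.of.corruption": 44.0360384,
    "Negative.affect": 19.2264843,
    "Freedom.to.make.life.choices": 17.1940553,
    "Confidence.in.national.government": 4.3479511,
    "Positive.affect": 1.5486990,
    "Generosity": 0.3697141,
}

total_importance = sum(importance.values())
shares = {name: value / total_importance for name, value in importance.items()}

# Print variables from most to least important with their share of the total.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {100 * share:.1f}%")
```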

The graph below depicts this variable importance visually. Healthy life expectancy at birth and log GDP per capita rank above the other variables, with importance values roughly 1.3 and 1.2 times that of the next most important variable, delivery quality. Thus, if a country wanted to increase its happiness ranking without taking into account people’s perception of their life quality (life ladder), it could focus on improving population health, as reflected in healthy life expectancy, and GDP per capita.

Model Evaluation

Let’s evaluate this model in comparison with the first decision tree model we generated.

Confusion Matrix
##                     
## happy_data2_fitted_t   1   2   3   4
##                    1 135  62   2   0
##                    2   7  35   2   4
##                    3  12  52 117  19
##                    4   0   5  32 131

The confusion matrix above compares the predicted values (Q1-4) of the quartile generated by the model to the actual quartile. Here, rows are predicted quartiles while columns are actual quartiles. From the confusion matrix it appears that we are correctly identifying 72.2% of observations, significantly lower than the first model, which correctly identified 85.5% of observations. Unlike the first model, which only misclassified one quartile up or down, this model's errors span all other quartiles, as seen in the predicted Q3 row.

Hit and Detection Rate

The error rate is defined as the number of misclassified observations (e.g. predicting Q1 when the true class is Q4) divided by the total number of data points. The error rate for this model is 27.8%, which is fairly high. The hit rate is the proportion of predictions that were correct, or 1 - error rate = 72.2%.

Comparison of Results to Base Rates

The following offers a deeper analysis of the performance of our model based on the results of our confusion matrix, which compares the predicted values (Q1-4) of the quartile generated by the model to the actual happiness quartile. From the confusion matrix it appears that we are correctly identifying [fraction], or 72.2% of observations; unlike in the first model, the errors are not confined to the next highest or lowest quartile.

1. For Q1, we've correctly identified 87.3% of observations, which is considerably better than our base rate of 25%. This is higher than the first model's correct Q1 identification rate of 83.1%.

2. For Q2, our model correctly identifies 46.4% of observations, better than our base rate of 25%. This is lower than the first model's correct Q2 identification rate of 83.1%.

3. For Q3, we've correctly identified 51.8% of observations, better than our base rate of 25%. This is lower than the first model's correct Q3 identification rate of 83.0%.

4. For Q4, we've correctly identified 96.2% of observations, considerably better than our base rate of 25%. This is higher than the first model's correct Q4 identification rate of 92.9%.

ROC and AUC Score

The ROC curve is above the y = x baseline, meaning the tree is better than randomly guessing. The line plotting Sensitivity against Specificity shows a decent model, with an overall relatively high Multi-class Area Under the Curve (AUC): .9015.

Recommendations

This model has very serious real-world implications. The most important variable of the first decision tree was life ladder. When it was taken out of the equation, the most important variable became healthy life expectancy at birth. It is important to note that optimizing only the variables described in this analysis, as opposed to taking a holistic approach, could harm a country's actual happiness while improving its score, in a mechanism similar to optimizing for the variables in the U.S. News & World Report college rankings. One could argue that receiving a false classification in a lower quartile is less harmful than receiving a false classification in a higher quartile, as the former could make that country (if it pays attention to the scores) work harder to increase the happiness of its citizens.

Conclusions

We’ve created a correlation matrix between all of the numeric variables, such as Happiness Index (HI), year, Life Ladder, Log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, generosity, perceptions of corruption, and positive/negative affect. The matrix shows how each variable is correlated with every other quantitative variable and the strength of that correlation. For instance, HI has a strongly positive correlation with Life Ladder. Similarly, perceptions of corruption are negatively correlated with the happiness index, as expected.

In sum, we implemented the k-means algorithm and determined that the variance in the data is best explained by grouping countries into three distinct clusters. This finding proved robust both when we included and removed the self-reported happiness rating, Life Ladder. Upon evaluating our model’s clustering assignments against the continuous happiness index, we found that there were significant similarities in characteristics across countries at the higher and lower end of the happiness index spectrum. The model also adequately identified countries lying toward the center of the happiness index distribution (i.e. the average or moderately happy countries). However, the broad range of characteristics (higher intra-cluster variance) of this cluster revealed that the k-means algorithm is not as well suited for distinguishing between the moderately unhappy and moderately happy countries.

The decision tree model further identified life ladder as the most important variable in determining a country’s happiness, with higher life ladder scores corresponding to a higher quartile of country happiness. In conclusion, efforts to improve a country’s happiness should take special care to ensure that people perceive their lives as great, as that seems to be the number one determinant.