Introduction

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, but it gained global recognition at a United Nations event held on the International Day of Happiness (March 20th) in 2017. Many governments, organizations, and civil society groups use happiness indicators to inform policy decisions. So, what makes a country a happy one?

In this Learning By Building assessment, I will try to analyze what makes a country happy, using Naive Bayes and Decision Tree models to see which of them predicts the happiness class more accurately. I use the 2019 data from https://www.kaggle.com/unsdsn/world-happiness#2019.csv. The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll.

Data Wrangling

## Observations: 156
## Variables: 9
## $ Overall.rank                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ Country.or.region            <fct> Finland, Denmark, Norway, Iceland, Nethe…
## $ Score                        <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.480…
## $ GDP.per.capita               <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.452…
## $ Social.support               <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.526…
## $ Healthy.life.expectancy      <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.052…
## $ Freedom.to.make.life.choices <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.572…
## $ Generosity                   <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.263…
## $ Perceptions.of.corruption    <dbl> 0.393, 0.410, 0.341, 0.118, 0.298, 0.343…

Overall.rank: The Happiness Rank
Country.or.region: Name of the country or region
Score: The Happiness Score
GDP.per.capita: GDP per capita of each country
Social.support: having friends and other people, including family, to turn to in times of need
Healthy.life.expectancy: the average number of years that a newborn can expect to live in “full health”
Freedom.to.make.life.choices: how free people feel to make their own life choices
Generosity: how generous the people of the country are
Perceptions.of.corruption: perception of whether the government of the country is corrupt

Overall.rank and Country.or.region do not carry useful information for prediction, so I will drop them. In the end, I will use only numerical variables as predictors. The Score column is our target, so I will move it to the last column and decide later which countries count as happy and which as lesshappy.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.853   4.545   5.380   5.407   6.184   7.769


Here, I want to classify each country as happy or lesshappy based on its Score. I label a country lesshappy if its Score falls below the 3rd quartile of the Score distribution, which is 6.184, and happy otherwise.
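The thresholding rule described above can be sketched in a few lines. This is only an illustration in plain Python, not the code that produced the output in this report; the function name `label_score` is hypothetical, and treating a score exactly equal to the threshold as happy is an assumption, since the text does not say on which side the boundary value falls.

```python
# Threshold taken from the summary output above (3rd quartile of Score).
THIRD_QUARTILE = 6.184


def label_score(score, threshold=THIRD_QUARTILE):
    """Return 'happy' or 'lesshappy' for a single Score value.

    Assumption: a score exactly equal to the threshold counts as 'happy'.
    """
    return "happy" if score >= threshold else "lesshappy"


# The first six scores from the data preview, plus the minimum from the summary.
scores = [7.769, 7.600, 7.554, 7.494, 7.488, 7.480, 2.853]
labels = [label_score(s) for s in scores]

for score, label in zip(scores, labels):
    print(score, label)
```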

Exploratory Data Analysis

Target Variable Proportion

## 
##     happy lesshappy 
##      0.25      0.75

Since I deliberately split the target variable at the 3rd quartile, the classes are unbalanced by construction, so I will upsample the data before modelling.

  • let’s see if the predictor variables are normally distributed


The predictors are roughly normally distributed, except for Perceptions.of.corruption and Social.support.

  • let’s see the correlation between each pair of numerical predictors


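GDP.per.capita has a relatively high correlation with Healthy.life.expectancy and with Social.support, but neither is close to a perfect correlation (1). Naive Bayes assumes that the predictors are independent of each other within each class, so strongly correlated predictors can hurt it. Since no pair is near 1, the Naive Bayes model will hopefully still perform well.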
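For reference, the correlation discussed above is the Pearson coefficient, which can be computed by hand as in the sketch below. It is written in plain Python for illustration and is not the code behind the plot. The two lists contain only the six rows visible in the data preview, which is far too few to reproduce the values in the correlation plot; the helper name `pearson` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First six rows of GDP.per.capita and Healthy.life.expectancy from the preview.
gdp = [1.340, 1.383, 1.488, 1.380, 1.396, 1.452]
life = [0.986, 0.996, 1.028, 1.026, 0.999, 1.052]

r = pearson(gdp, life)
print(round(r, 3))
```

A value near 1 or -1 would mean the two predictors carry almost the same information, which is the situation the independence assumption of Naive Bayes handles worst.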

Cross Validation

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
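The output above shows only the random number generator warning, so the steps of this stage are not visible. A minimal sketch of a hold-out split followed by upsampling of the training set is given below, in plain Python rather than the R used for this report. Everything in it is an assumption made for illustration: the seed, the 20% test fraction, the stand-in rows, and the choice to upsample only the training data (which is at least consistent with the test set in the confusion matrices below still being unbalanced).

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def upsample(rows, rng):
    """Resample smaller classes with replacement until all classes are equal in size."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row["label"], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows at random from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rng = random.Random(100)  # hypothetical seed

# Hypothetical stand-in data: 156 rows with a 25% / 75% class split.
rows = [{"id": i, "label": "happy" if i < 39 else "lesshappy"} for i in range(156)]

train, test = train_test_split(rows, test_fraction=0.2, rng=rng)
train_up = upsample(train, rng)

counts = {c: sum(r["label"] == c for r in train_up) for c in ("happy", "lesshappy")}
print(len(train), len(test), counts)
```

Upsampling after the split keeps duplicated rows out of the test set, so the evaluation is not inflated by rows the model has already seen.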

Modelling

Naive Bayes

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  happy lesshappy
##   happy         6         4
##   lesshappy     1        19
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6528, 0.9436)
##     No Information Rate : 0.7667          
##     P-Value [Acc > NIR] : 0.2666          
##                                           
##                   Kappa : 0.5946          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.8261          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.9500          
##              Prevalence : 0.2333          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.8416          
##                                           
##        'Positive' Class : happy           
## 


From the confusion matrix above, we can see that the Sensitivity of the Naive Bayes model for the happy class is 85.71%. I focus on Sensitivity because we want to detect as many happy countries as possible (that is, to reduce False Negatives).
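The statistics in the output above follow directly from the four cells of the confusion matrix. The short Python sketch below redoes that arithmetic for illustration (the report's own numbers come from the R output); the variable names are hypothetical.

```python
# Counts read off the Naive Bayes confusion matrix above.
# 'happy' is the positive class.
tp, fp = 6, 4    # predicted happy:     truly happy / truly lesshappy
fn, tn = 1, 19   # predicted lesshappy: truly happy / truly lesshappy

sensitivity = tp / (tp + fn)          # share of happy countries that were found
specificity = tn / (tn + fp)          # share of lesshappy countries that were found
pos_pred_value = tp / (tp + fp)       # share of 'happy' predictions that are right
accuracy = (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity       = {sensitivity:.4f}")
print(f"specificity       = {specificity:.4f}")
print(f"pos pred value    = {pos_pred_value:.4f}")
print(f"accuracy          = {accuracy:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

Only one happy country out of seven is missed, which is why Sensitivity is high even though four of the ten happy predictions are wrong.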

Let’s look at the ROC curve to check whether the model performs well.


The ROC curve looks excellent: it bends far toward the top-left corner. Let’s also check the AUC.

## [1] 0.9254658


The AUC is 92.55% (a perfect value would be 100%), which is a great result! It means our happy_naive model separates the positive and negative classes pretty well.
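One way to read the AUC is as the share of (happy, lesshappy) pairs in which the happy country receives the higher predicted probability. The Python sketch below illustrates that definition; the probabilities in it are hypothetical and are not the model's predictions, and the function name `auc` is chosen for illustration only. The last lines check that the reported value is consistent with 149 correctly ordered pairs out of the 7 × 23 = 161 pairs in the test set.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of 'happy', for illustration only.
happy_probs = [0.9, 0.8, 0.4]
lesshappy_probs = [0.7, 0.3, 0.2, 0.1]
example_auc = auc(happy_probs, lesshappy_probs)
print(example_auc)

# Test set from the confusion matrix: 7 happy and 23 lesshappy countries.
n_pairs = 7 * 23
print(n_pairs, round(149 / n_pairs, 7))
```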

Decision Tree Model


From the plot above, we can see that the Decision Tree model picks GDP.per.capita for the first split, as the variable that best separates the nodes into yes (happy) and no (lesshappy, but still heterogeneous). The model treats GDP.per.capita, Social.support, and Freedom.to.make.life.choices as the most significant variables for separating the happy countries from the lesshappy ones. However, a Decision Tree tends to overfit because it splits the data aggressively.
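To make the idea of an entropy-based split concrete, the Python sketch below computes the entropy of a node and the information gain of one candidate split. It is an illustration only: the class counts are hypothetical, the function names are made up for this example, and whether the tree in this report was actually grown with entropy or with another impurity measure is not stated here.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical balanced node: [happy, lesshappy] counts.
parent = [50, 50]
# Hypothetical split on a GDP.per.capita threshold.
children = [[40, 10], [10, 40]]

gain = information_gain(parent, children)
print(round(entropy(parent), 4), round(gain, 4))
```

At each node the tree compares candidate splits in this way and keeps the one with the largest gain, which is why the most informative variable ends up at the root.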

Model Fitting & Performance

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  happy lesshappy
##   happy         4         3
##   lesshappy     3        20
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6143, 0.9229)
##     No Information Rate : 0.7667          
##     P-Value [Acc > NIR] : 0.4296          
##                                           
##                   Kappa : 0.441           
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5714          
##             Specificity : 0.8696          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.8696          
##              Prevalence : 0.2333          
##          Detection Rate : 0.1333          
##    Detection Prevalence : 0.2333          
##       Balanced Accuracy : 0.7205          
##                                           
##        'Positive' Class : happy           
## 


From the confusion matrix above, we can see that the Sensitivity of the Decision Tree model for the happy class is 57.14%. As before, I focus on Sensitivity because we want to detect as many happy countries as possible (that is, to reduce False Negatives).

Let’s look at the ROC curve to check whether the model performs well.


The ROC curve looks good: it bends toward the top-left corner. Let’s also check the AUC.

## [1] 0.7950311


The AUC is 79.50%. Since it is above 75%, the model separates the two classes reasonably well, although the separation is not perfect.

Conclusion

##      Sensitivity       AUC
## [1,]   0.8571429 0.9254658
## [2,]   0.5714286 0.7950311


Comparing the Sensitivity and AUC of Naive Bayes (row [1,]) and Decision Tree (row [2,]), we can see that Naive Bayes performs better on both measures. I therefore suggest using Naive Bayes to predict whether a country is happy or lesshappy.