Introduction

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, but it gained global recognition during an International Day of Happiness (March 20th) event at the United Nations in 2017. Many governments, organizations, and civil society groups use happiness indicators to inform policy decisions. So, what makes a country a happy one?

In this Learning By Building assessment, I will try to analyze what makes a country happy, using Naive Bayes and Decision Tree models to see which one predicts the happiness class more accurately. I use the 2019 data from https://www.kaggle.com/unsdsn/world-happiness#2019.csv. The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll.

Data Wrangling

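# assumed setup: the original does not show its library() calls
library(tidyverse)  # provides glimpse(), mutate(), %>%, keep(), gather(), ggplot()
happiness19 <- read.csv("2019.csv")
glimpse(happiness19)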

## Observations: 156
## Variables: 9
## $ Overall.rank                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ Country.or.region            <fct> Finland, Denmark, Norway, Iceland, Nethe…
## $ Score                        <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.480…
## $ GDP.per.capita               <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.452…
## $ Social.support               <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.526…
## $ Healthy.life.expectancy      <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.052…
## $ Freedom.to.make.life.choices <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.572…
## $ Generosity                   <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.263…
## $ Perceptions.of.corruption    <dbl> 0.393, 0.410, 0.341, 0.118, 0.298, 0.343…

Overall.rank: The happiness rank
Country.or.region: Name of the country or region
Score: The happiness score
GDP.per.capita: GDP per capita of each country
Social.support: Having friends and other people, including family, to turn to in times of need
Healthy.life.expectancy: The average number of years that a newborn can expect to live in “full health”
Freedom.to.make.life.choices: How free people feel to make their own life choices
Generosity: How generous the people of the country are
Perceptions.of.corruption: Perception of whether the government of the country is corrupt

The Overall.rank and Country.or.region columns don’t give valuable information for prediction, so I will drop them. In the end, I will only use numerical variables as the predictors. The Score column is our target class, so I will move it to the last column and decide later which countries count as happy and which as lesshappy.

happiness19 <- happiness19[,3:9]

# reorder the column
happiness19 <- happiness19[c(2,3,4,5,6,7,1)]
summary(happiness19$Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.853   4.545   5.380   5.407   6.184   7.769

happiness <- happiness19 %>% mutate(Score = as.factor(ifelse(Score >= 6.184,"happy", "lesshappy")))
head(happiness)

Here, I classify each country as happy or lesshappy based on its Score. I use the third quartile of Score, 6.184, as the cutoff: countries scoring at or above it are labelled happy, and countries scoring below it are labelled lesshappy.
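As a minimal sketch of the same labelling step, the cutoff can be computed with quantile() instead of being typed in from the rounded summary() output. The scores below are made-up values, not the 2019 data, and toy_scores, cutoff and toy_label are names introduced only for illustration.

```r
# Hypothetical scores for illustration only (not the real 2019 data)
toy_scores <- c(2.9, 4.1, 4.6, 5.2, 5.4, 5.9, 6.2, 7.5)

# Third quartile computed directly, so no rounded value is hard-coded
cutoff <- unname(quantile(toy_scores, probs = 0.75))

# Same rule as in the text: at or above the cutoff is "happy"
toy_label <- factor(ifelse(toy_scores >= cutoff, "happy", "lesshappy"))

prop.table(table(toy_label))
```

Computing the cutoff from the data keeps the rule valid if the data set is updated or another year is used.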

Exploratory Data Analysis

Target Variable Proportion

prop.table(table(happiness$Score))

## 
##     happy lesshappy 
##      0.25      0.75

Since I purposely split the target variable at the third quartile, the classes are clearly unbalanced (25% happy versus 75% lesshappy), so I will upsample the minority class before modelling. Let’s see if the predictor variables are normally distributed.

happiness %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +   
    geom_density()

The predictors are roughly normally distributed, except for Perceptions.of.corruption and Social.support.

Let’s see the correlation between each of the numerical predictors.
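The density plots give a visual impression only. As a rough numerical cross-check, a Shapiro-Wilk test (shapiro.test() from base R) can be run on each column. The sketch below uses simulated columns, not the happiness data; toy_normal and toy_skewed are hypothetical names. A small p-value suggests a departure from normality.

```r
set.seed(1)

# Simulated columns for illustration only (not the happiness data)
toy <- data.frame(
  toy_normal = rnorm(100, mean = 1, sd = 0.3),  # drawn from a normal distribution
  toy_skewed = rexp(100, rate = 5)              # drawn from a right-skewed distribution
)

# One Shapiro-Wilk p-value per column
p_values <- sapply(toy, function(col) shapiro.test(col)$p.value)
round(p_values, 4)
```

On the real data, the same sapply() call could be applied to the numeric predictor columns, assuming they are stored as plain numeric vectors.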

ggcorr(happiness[,-7], label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

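GDP.per.capita has a relatively high correlation with Healthy.life.expectancy and Social.support, but it is not a perfect correlation (1). Since Naive Bayes assumes that the predictors are independent of each other, strongly correlated predictors can hurt it. Hopefully, the Naive Bayes model can still perform well.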
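To read the correlations as numbers rather than from the plot, base R's cor() can be used. This is a minimal sketch on simulated data: the columns only borrow the names of the real predictors, and the 0.7 threshold is an arbitrary choice of mine, not a rule from the analysis above.

```r
set.seed(1)

# Simulated predictors for illustration only (not the happiness data)
gdp <- runif(50, min = 0, max = 1.7)
toy <- data.frame(
  GDP.per.capita          = gdp,
  Healthy.life.expectancy = 0.6 * gdp + rnorm(50, sd = 0.1),  # built to track GDP
  Generosity              = runif(50, min = 0, max = 0.5)     # unrelated to GDP
)

cor_mat <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.7
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```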

Cross Validation

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(222)
split <- initial_split(data = happiness, prop = 0.8, strata = "Score")

train <- training(split)
test <- testing(split)
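The strata argument above asks for a split that keeps the class proportions similar in both parts. The idea can be sketched in base R by sampling 80% within each class separately. This is an illustration on a made-up data frame, not a description of how initial_split() works internally; toy, toy_train and toy_test are hypothetical names.

```r
set.seed(222)

# Made-up data: 5 happy and 15 lesshappy rows (25% / 75%, as in the analysis)
toy <- data.frame(
  id    = 1:20,
  Score = factor(rep(c("happy", "lesshappy"), times = c(5, 15)))
)

# Row indices grouped by class, then 80% sampled inside each group
idx_by_class <- split(seq_len(nrow(toy)), toy$Score)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

prop.table(table(toy_train$Score))
prop.table(table(toy_test$Score))
```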

Data Pre-processing

happy_recipe <- recipe(Score~., train) %>% 
  step_upsample(Score, seed = 222) %>%
  step_nzv(all_predictors()) %>% 
  prep()

happy_train <- juice(happy_recipe)
happy_test <- bake(happy_recipe, test)
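The upsampling step in the recipe balances the classes in the training data by repeating rows of the smaller class. A minimal base R sketch of that idea, on a made-up data frame, is shown below. It is only an illustration of random upsampling with replacement, not the recipe step's actual implementation; toy and upsampled are hypothetical names.

```r
set.seed(222)

# Made-up training data: 2 happy rows versus 6 lesshappy rows
toy <- data.frame(
  x     = 1:8,
  Score = factor(c("happy", "happy", rep("lesshappy", 6)))
)

# Every class is grown to the size of the largest class
target_n <- max(table(toy$Score))

upsampled <- do.call(rbind, lapply(split(toy, toy$Score), function(d) {
  # Draw the missing rows with replacement from the class itself
  extra <- sample(seq_len(nrow(d)), size = target_n - nrow(d), replace = TRUE)
  rbind(d, d[extra, , drop = FALSE])
}))

table(upsampled$Score)
```

Only the training set is balanced this way; the test set keeps its original class proportions so that the evaluation reflects the real distribution.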

Modelling

Naive Bayes

# model building
happy_naive <- naiveBayes(happy_train[-7], happy_train$Score, laplace = 1)

# prediction on the test set
naive_pred <- predict(happy_naive, happy_test, type = "class")

# performance evaluation
metric_n <- confusionMatrix(naive_pred, test$Score, positive = "happy")
metric_n

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  happy lesshappy
##   happy         6         4
##   lesshappy     1        19
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6528, 0.9436)
##     No Information Rate : 0.7667          
##     P-Value [Acc > NIR] : 0.2666          
##                                           
##                   Kappa : 0.5946          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.8261          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.9500          
##              Prevalence : 0.2333          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.8416          
##                                           
##        'Positive' Class : happy           
##

From the confusion matrix above, we can see that the Sensitivity of the Naive Bayes model for the happy class is 85.71%. I choose Sensitivity as the main metric since we want to identify as many of the happy countries as possible (that is, to reduce False Negatives).

Let’s look at the ROC curve to check if the model performs well.
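The reported metrics can be recomputed by hand from the four cells of the Naive Bayes confusion matrix printed above. The sketch below retypes those counts into a matrix; cm and the metric variable names are mine, introduced for illustration.

```r
# Counts copied from the Naive Bayes confusion matrix above
cm <- matrix(
  c(6, 1, 4, 19), nrow = 2,
  dimnames = list(Prediction = c("happy", "lesshappy"),
                  Reference  = c("happy", "lesshappy"))
)

tp <- cm["happy", "happy"]          # happy countries predicted happy
fn <- cm["lesshappy", "happy"]      # happy countries predicted lesshappy
fp <- cm["happy", "lesshappy"]      # lesshappy countries predicted happy
tn <- cm["lesshappy", "lesshappy"]  # lesshappy countries predicted lesshappy

sensitivity <- tp / (tp + fn)   # 6 / 7
specificity <- tn / (tn + fp)   # 19 / 23
precision   <- tp / (tp + fp)   # 6 / 10, shown as Pos Pred Value
accuracy    <- (tp + tn) / sum(cm)

round(c(sensitivity, specificity, precision, accuracy), 4)
```

Sensitivity only looks at the Reference = happy column, which is why it is the metric that penalises False Negatives.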

# pre-process the prediction into probability
naive_prob <- predict(happy_naive, happy_test, type = "raw")
happy_df <- data.frame("prediction"=naive_prob[,1], "trueclass"=as.numeric(test$Score=="happy"))

# ROC
happy_roc <- ROCR::prediction(happy_df$prediction, happy_df$trueclass)  
plot(performance(happy_roc, "tpr", "fpr"))

The ROC curve looks excellent, since it bends far toward the top-left corner. But let’s see what the AUC is.

auc <- performance(happy_roc, "auc")
auc@y.values[[1]]

## [1] 0.9254658

The AUC is 92.55% (a perfect value is 100%), which is a great result! It means our happy_naive model separates the positive and negative classes pretty well.
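One way to read the AUC is as the probability that a randomly chosen happy country receives a higher predicted probability than a randomly chosen lesshappy one. The sketch below computes it that way on made-up probabilities (not the model output above); prob_happy, is_happy and auc_manual are hypothetical names.

```r
# Made-up predicted probabilities of "happy" and true labels (1 = happy)
prob_happy <- c(0.90, 0.80, 0.35, 0.60, 0.30, 0.20, 0.10)
is_happy   <- c(1,    1,    1,    0,    0,    0,    0)

pos <- prob_happy[is_happy == 1]
neg <- prob_happy[is_happy == 0]

# Compare every happy/lesshappy pair; ties count as half a win
diffs <- outer(pos, neg, "-")
auc_manual <- mean((diffs > 0) + 0.5 * (diffs == 0))

auc_manual  # 11 of the 12 pairs are ordered correctly, so 11/12
```

In this toy example the only wrongly ordered pair is the happy country scored 0.35 against the lesshappy country scored 0.60.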

Decision Tree Model

happy_dt <- rpart(formula = Score ~ ., data = happy_train, method = "class")

fancyRpartPlot(happy_dt, sub = NULL)

From the plot above, we can see that the Decision Tree model uses GDP.per.capita for the first split, meaning it is the variable that best separates the nodes into yes (happy) and no (lesshappy, but still heterogeneous). The model treats GDP.per.capita, Social.support, and Freedom.to.make.life.choices as the most significant variables in separating the happy countries from the lesshappy ones. However, a Decision Tree model tends to overfit because of its aggressive splitting algorithm.
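A tree picks each split by how much it reduces node impurity. As far as I know, rpart uses the Gini index by default for classification, so the sketch below uses Gini. It is a simplified illustration on made-up values, not rpart's actual code; gini(), gini_gain(), toy_gdp and toy_score are hypothetical names.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity reduction from splitting x at a threshold
gini_gain <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  gini(y) -
    (length(left)  / length(y)) * gini(left) -
    (length(right) / length(y)) * gini(right)
}

# Made-up GDP values and labels for illustration only
toy_gdp   <- c(0.3, 0.5, 0.8, 0.9, 1.2, 1.3, 1.4, 1.5)
toy_score <- factor(c(rep("lesshappy", 4), rep("happy", 4)))

gini_gain(toy_gdp, toy_score, threshold = 1.0)  # separates the classes completely
gini_gain(toy_gdp, toy_score, threshold = 0.6)  # leaves a mixed right-hand node
```

The threshold with the larger gain would be preferred, which is the sense in which GDP.per.capita is the best first split in the fitted tree.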

Model Fitting & Performance

# prediction on the test set
dt_pred <- predict(happy_dt, happy_test, type = "class")

# performance evaluation
metric_dt <- confusionMatrix(dt_pred, test$Score, positive = "happy")
metric_dt

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  happy lesshappy
##   happy         4         3
##   lesshappy     3        20
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6143, 0.9229)
##     No Information Rate : 0.7667          
##     P-Value [Acc > NIR] : 0.4296          
##                                           
##                   Kappa : 0.441           
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5714          
##             Specificity : 0.8696          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.8696          
##              Prevalence : 0.2333          
##          Detection Rate : 0.1333          
##    Detection Prevalence : 0.2333          
##       Balanced Accuracy : 0.7205          
##                                           
##        'Positive' Class : happy           
##

From the confusion matrix above, we can see that the Sensitivity of the Decision Tree model for the happy class is 57.14%. As before, I focus on Sensitivity since we want to identify as many of the happy countries as possible (that is, to reduce False Negatives).

Let’s look at the ROC curve to check if the model performs well.

# pre-process the prediction into probability
dt_prob <- predict(happy_dt, happy_test, type = "prob")
happy_df_dt <- data.frame("prediction"=dt_prob[,1], "trueclass"=as.numeric(test$Score=="happy"))

# ROC
happy_roc_dt <- ROCR::prediction(happy_df_dt$prediction, happy_df_dt$trueclass)  
plot(performance(happy_roc_dt, "tpr", "fpr"))

The ROC curve looks fairly good, since it still bends toward the top-left corner. But let’s see what the AUC is.

auc_dt <- performance(happy_roc_dt, "auc")
auc_dt@y.values[[1]]

## [1] 0.7950311

The AUC is 79.50%. Since it is above 75%, the model still separates the classes reasonably well, although the separation is not perfect.

Conclusion

cbind(Sensitivity=c(metric_n$byClass[[1]], metric_dt$byClass[[1]]), AUC=c(auc@y.values[[1]], auc_dt@y.values[[1]]))

##      Sensitivity       AUC
## [1,]   0.8571429 0.9254658
## [2,]   0.5714286 0.7950311

Comparing the Sensitivity and AUC of the Naive Bayes model (row [1,]) and the Decision Tree model (row [2,]), we can see that Naive Bayes performs better on both. So, I suggest using Naive Bayes to classify countries as happy or lesshappy based on the Score.
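For readability, the same comparison can be stored with model names instead of row indices. The sketch below retypes the values printed above into a data frame; comparison and best are hypothetical names.

```r
# Values copied from the comparison output above
comparison <- data.frame(
  Model       = c("Naive Bayes", "Decision Tree"),
  Sensitivity = c(0.8571429, 0.5714286),
  AUC         = c(0.9254658, 0.7950311)
)

# Sort by AUC, best first
comparison[order(-comparison$AUC), ]

# Model with the highest Sensitivity
best <- comparison$Model[which.max(comparison$Sensitivity)]
best
```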

LBB - Classification 2 Red Wine

Andina Septia

4/11/2020