library(DT)         # datatable() for interactive table previews
library(gridExtra)  # arranging grid-based plots
library(caret)      # createDataPartition(), upSample(), confusionMatrix()


Introduction

In this blog, I will cover some basic concepts of binary logistic regression by applying it to a data set from Kaggle called “World Happiness Report”. The data set can be downloaded from https://www.kaggle.com/unsdsn/world-happiness/data.

The linked data set contains three csv files, one each for 2015, 2016, and 2017. In this exercise, I will be using the 2017 file, which has 155 observations.

This data set has the following variables.

Country: Name of the country.
Region: Region the country belongs to.
Happiness Rank: Rank of the country based on the Happiness Score.
Happiness Score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.”
Standard Error: The standard error of the happiness score.
Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.
Family: The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life Expectancy): The extent to which Life Expectancy contributes to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributes to the calculation of the Happiness Score.
Trust (Government Corruption): The extent to which Perception of Corruption contributes to the calculation of the Happiness Score.
Generosity: The extent to which Generosity contributes to the calculation of the Happiness Score.
Dystopia Residual: The extent to which the Dystopia Residual contributes to the calculation of the Happiness Score.


Summary of Exercise

This exercise is a binary logistic regression based on one explanatory variable. A binary response variable called Is.Happy is generated from the happiness metric, which is on a scale from 0 to 10: a score below 5 is assigned 0; otherwise, 1 is assigned. A single explanatory variable called Ave.Contribution is generated as the average of the six variables Economy, Family, Health, Freedom, Trust, and Generosity. The data set is split into a training set and a testing set, and the model is built on the training set after “up sampling” the minority class. Predicted classes are assigned using a cutoff probability of 0.5.


Response Variable

For this exercise, the binary response variable is based on the Happiness Score, a metric on a scale of 0 to 10 (happiest). The binary response variable Is.Happy is set to 0 if the Happiness Score is below 5 and to 1 otherwise. This is a very simplistic view of the happiness metric, but it tells us whether a country is below, or at/above, the midpoint of the happiness metric scale.


Explanatory Variable

For this simple exercise, I will predict Is.Happy using a single variable called Ave.Contribution, which is the simple average of the contributions of Economy, Family, Health, Freedom, Trust, and Generosity to the calculation of the Happiness Score.


Process Data

Read the 2017 csv file and split it into training and testing data sets. The caret package is used to split the data into a 70% training set and a 30% testing set. The function used to do the split is createDataPartition.

Columns 2, 4, 5, and 12 are removed from the data set. These columns are not needed for this exercise.

df <- read.csv("2017.csv")
df <- df[, -c(2,4,5,12)]
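As a quick check that the intended columns were dropped, the remaining column names can be listed (the exact names depend on how read.csv parses the csv headers):

names(df)  # expect Country, Happiness.Score, and the six contribution columns used below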

Preview data set.

datatable(df)

Generate Is.Happy response variable based on Happiness.Score value. If Happiness.Score is less than 5, set Is.Happy value to 0. If Happiness.Score is 5 or higher, set Is.Happy value to 1.

df$Is.Happy <- as.factor(ifelse(df$Happiness.Score < 5, 0, 1))

Preview Is.Happy binary response variable.

datatable(df[,c(1,2,9)])

Take a look at the distribution of Is.Happy.

57 observations have a value of 0 and 98 have a value of 1. The majority of observations are at or above the midpoint of the happiness metric scale.

table(df$Is.Happy)
## 
##  0  1 
## 57 98

Generate the Ave.Contribution explanatory variable.

This is generated by taking the simple average of the six variables Economy, Family, Health, Freedom, Trust, and Generosity.

df$Ave.Contribution <- (df$Economy..GDP.per.Capita. + df$Family + df$Health..Life.Expectancy. + df$Freedom + df$Trust..Government.Corruption. + df$Generosity)/6
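Equivalently, the same average can be computed with rowMeans over the six contribution columns. A small sanity-check sketch, assuming the column names used above:

contrib_cols <- c("Economy..GDP.per.Capita.", "Family", "Health..Life.Expectancy.",
                  "Freedom", "Trust..Government.Corruption.", "Generosity")
all.equal(unname(rowMeans(df[, contrib_cols])), df$Ave.Contribution)  # should be TRUE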

Preview Ave.Contribution.

datatable(df[, c(1,9,10)])

Plot of Is.Happy and Ave.Contribution.

plot(x=df$Ave.Contribution, y=df$Is.Happy, pch = 19, frame = FALSE)

As you can see, observations with a higher average contribution to happiness tend, in general, to be at or above the midpoint of the happiness metric scale.


Creating Training and Testing Set

The caret library is used to split the data into training and testing sets: 70% of the data goes to training and the remaining 30% is used for testing. createDataPartition performs stratified sampling on Is.Happy, so the proportion of 0’s and 1’s is approximately preserved in both sets.

'%ni%' <- Negate('%in%')  # define 'not in' func
options(scipen=999)  # prevents printing scientific notations.

# Prepare training and test data
set.seed(100)
train_index <- createDataPartition(df$Is.Happy, p=0.7, list = F)  # 70% training data
train <- df[train_index, ]
test <- df[-train_index, ]

Size of train set.

nrow(train)
## [1] 109

Size of test set.

nrow(test)
## [1] 46

The training set has more 1’s than 0’s; the majority of the training set consists of observations at or above the midpoint of the happiness metric scale.

table(train$Is.Happy)
## 
##  0  1 
## 40 69

To mitigate this imbalance in the training set, up sampling is applied: the minority class is repeatedly sampled (with replacement) until it reaches the same size as the majority class. The upSample function from the caret library is used for this.

Reference used for up sampling: https://www.machinelearningplus.com/machine-learning/logistic-regression-tutorial-examples-r/

set.seed(100)
up_train <- upSample(x = train[, colnames(train) %ni% "Class"],  # keep all columns (train has no "Class" column yet)
                     y = train$Is.Happy)                          # labels to balance; upSample also appends them as a "Class" column

Below is the new distribution of Is.Happy in the up sampled training set.

table(up_train$Is.Happy)
## 
##  0  1 
## 69 69
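For intuition, up sampling here amounts to drawing the minority-class rows with replacement until both classes have equal counts. Below is a minimal sketch of doing the same thing by hand (for illustration only; the model below uses up_train from upSample):

set.seed(100)
minority <- train[train$Is.Happy == 0, ]   # 40 rows
majority <- train[train$Is.Happy == 1, ]   # 69 rows
manual_up <- rbind(majority,
                   minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
table(manual_up$Is.Happy)                  # both classes now have 69 rows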


Build Binary Logistic Regression Model

The glm function, which stands for generalized linear model, is used to build the model. The binomial family is specified, which tells glm that a binary logistic regression model is to be fit. The up_train data set is used to build the model.

logitmod <- glm(Is.Happy ~ Ave.Contribution, family = "binomial", data=up_train)

Below is the summary of the logitmod model.

summary(logitmod)
## 
## Call:
## glm(formula = Is.Happy ~ Ave.Contribution, family = "binomial", 
##     data = up_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6494  -0.3049   0.0042   0.2220   3.2761  
## 
## Coefficients:
##                  Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)       -14.728      2.685  -5.486 0.0000000412 ***
## Ave.Contribution   27.631      5.009   5.516 0.0000000347 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 191.309  on 137  degrees of freedom
## Residual deviance:  60.789  on 136  degrees of freedom
## AIC: 64.789
## 
## Number of Fisher Scoring iterations: 7
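Given the coefficients above, the fitted model implies that the estimated probability of Is.Happy = 1 is 1 / (1 + exp(-(-14.728 + 27.631 * Ave.Contribution))). As a quick illustration, the Ave.Contribution value at which this probability crosses 0.5 (the cutoff used later) can be computed directly from the fitted coefficients:

coef(logitmod)
-coef(logitmod)[["(Intercept)"]] / coef(logitmod)[["Ave.Contribution"]]  # boundary where the linear predictor is 0; roughly 0.53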


Predicting Test Set

The code below generates the predicted probability that an observation has a happiness metric score of 5 or higher.

When using the predict function, you need to set type = "response" so that predict computes the predicted probabilities. Otherwise, the predictions will be on the logit (log-odds) scale.

pred <- predict(logitmod, newdata = test, type = "response")
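To make the two scales concrete, the default link-scale predictions can be converted to probabilities with the logistic function plogis, and should match pred. A small check, not part of the original workflow:

pred_logit <- predict(logitmod, newdata = test)  # log-odds (type = "link" is the default)
all.equal(plogis(pred_logit), pred)              # plogis maps log-odds to probabilities; should be TRUE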

The predicted probabilities need to be converted to 1 or 0. In this exercise, I will use a cutoff point of 0.5: if the predicted probability is greater than 0.5, a value of 1 is assigned; otherwise, a value of 0 is assigned.

Below, the predicted values (1 or 0) are assigned.

Pred.Is.Happy <- as.factor(ifelse(pred > 0.5, 1, 0))

Below is the confusion matrix (from the caret package) that compares the predicted values with the actual values of the testing set.

The model has an accuracy of about 85%.

confusionMatrix(data=Pred.Is.Happy, reference=factor(test$Is.Happy, levels=c('0', '1')))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 12  2
##          1  5 27
##                                           
##                Accuracy : 0.8478          
##                  95% CI : (0.7113, 0.9366)
##     No Information Rate : 0.6304          
##     P-Value [Acc > NIR] : 0.001076        
##                                           
##                   Kappa : 0.6611          
##                                           
##  Mcnemar's Test P-Value : 0.449692        
##                                           
##             Sensitivity : 0.7059          
##             Specificity : 0.9310          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.8438          
##              Prevalence : 0.3696          
##          Detection Rate : 0.2609          
##    Detection Prevalence : 0.3043          
##       Balanced Accuracy : 0.8185          
##                                           
##        'Positive' Class : 0               
##
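As a quick cross-check of the reported accuracy, the same figure can be computed directly from the predicted and actual labels:

mean(as.character(Pred.Is.Happy) == as.character(test$Is.Happy))  # (12 + 27) / 46, about 0.85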