library(DT)
library(gridExtra)
library(caret)
In this blog, I will cover some basic concepts about binary logistic regression by applying it to a data set from Kaggle called “World Happiness Report”. This data set can be downloaded from https://www.kaggle.com/unsdsn/world-happiness/data.
The linked data set contains three csv files, one each for 2015, 2016, and 2017. In this exercise, I will be using the data set from the 2017 file, which has 155 observations.
This data set has the following variables.
Column | Description |
---|---|
Country | Name of the country. |
Region | Region the country belongs to. |
Happiness Rank | Rank of the country based on the Happiness Score. |
Happiness Score | A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.” |
Standard Error | The standard error of the happiness score. |
Economy (GDP per Capita) | The extent to which GDP contributes to the calculation of the Happiness Score. |
Family | The extent to which Family contributes to the calculation of the Happiness Score. |
Health (Life Expectancy) | The extent to which Life Expectancy contributed to the calculation of the Happiness Score. |
Freedom | The extent to which Freedom contributed to the calculation of the Happiness Score. |
Trust (Government Corruption) | The extent to which Perception of Corruption contributes to the Happiness Score. |
Generosity | The extent to which Generosity contributed to the calculation of the Happiness Score. |
Dystopia Residual | The extent to which Dystopia Residual contributed to the calculation of the Happiness Score. |
This exercise is a binary logistic regression based on one explanatory variable. A binary response variable called Is.Happy is generated from the happiness metric, which is on a scale from 0 to 10: a score below 5 assigns 0 to the binary response; otherwise, 1 is assigned. A single explanatory variable called Ave.Contribution is generated as the average of the six variables Economy, Family, Health, Freedom, Trust, and Generosity. The data set is split into a training set and a testing set, and the model is built using the training set. The training set is adjusted by “up sampling” the minority class. The predicted values are set based on a cutoff probability of 0.5.
For this exercise, the binary response variable is based on the Happiness score, which is a metric on a scale of 0 to 10 (happiest). The binary response variable Is.Happy is set to 0 if the Happiness score is below 5 and is set to 1 otherwise. This is a very simplistic view of the Happiness metric, but it gives us an idea of whether a country is below or at/above the midpoint of the happiness metric scale.
For this simple exercise, I will be predicting Is.Happy using a variable called Ave.Contribution, which takes the simple average of the contributions of Economy, Family, Health, Freedom, Trust, and Generosity to the calculation of the happiness score.
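For reference, this is the standard binary logistic regression model that will be fit, where p denotes the probability that Is.Happy equals 1:

$$
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{Ave.Contribution},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Ave.Contribution})}}
$$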
Read the 2017 csv file and split it into training and testing data sets. The caret package is used to split the data into a 70% training set and a 30% testing set. The function used to do the data split is createDataPartition.
Columns 2, 4, 5, and 12 are removed from the data set; these columns are not necessary for this exercise.
df <- read.csv("2017.csv")
df <- df[, -c(2,4,5,12)]
Preview data set.
datatable(df)
Generate the Is.Happy response variable based on the Happiness.Score value. If Happiness.Score is less than 5, set Is.Happy to 0. If Happiness.Score is 5 or higher, set Is.Happy to 1.
df$Is.Happy <- as.factor(ifelse(df$Happiness.Score < 5, 0, 1))
Preview the Is.Happy binary response variable.
datatable(df[,c(1,2,9)])
Take a look at the distribution of Is.Happy. 57 observations have a value of 0 and 98 have a value of 1, so the majority of the observations are at or above the midpoint of the happiness metric scale.
table(df$Is.Happy)
##
## 0 1
## 57 98
Generate the Ave.Contribution explanatory variable. This is generated by taking the simple average of the six variables Economy, Family, Health, Freedom, Trust, and Generosity.
df$Ave.Contribution <- (df$Economy..GDP.per.Capita. + df$Family +
                        df$Health..Life.Expectancy. + df$Freedom +
                        df$Trust..Government.Corruption. + df$Generosity) / 6
Preview Ave.Contribution.
datatable(df[, c(1,9,10)])
Plot of Is.Happy against Ave.Contribution.
plot(x=df$Ave.Contribution, y=df$Is.Happy, pch = 19, frame = FALSE)
As you can see, observations with a higher average contribution to happiness generally tend to be at or above the midpoint of the happiness metric scale.
The caret library is used to split the data into training and testing sets. 70% of the data goes toward training and the remaining 30% is used for testing.
'%ni%' <- Negate('%in%') # define 'not in' func
options(scipen=999) # prevents printing scientific notations.
# Prepare training and test data
set.seed(100)
train_index <- createDataPartition(df$Is.Happy, p=0.7, list = F) # 70% training data
train <- df[train_index, ]
test <- df[-train_index, ]
Size of train set.
nrow(train)
## [1] 109
Size of test set.
nrow(test)
## [1] 46
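As a small aside (not in the original output), the class distribution of the test set can be inspected the same way as the full data set; it simply holds whatever is left over after the 70% training split.

# class distribution of the 30% test split
table(test$Is.Happy)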
The training set has more 1’s than 0’s. The majority of the training set is represented by observations at or above the midpoint of the happiness metric scale.
table(train$Is.Happy)
##
## 0 1
## 40 69
To try to mitigate this imbalance in the training set, up sampling is done: the minority class is repeatedly sampled until it reaches the same size as the majority class. The upSample function of the caret library is used for this.
Reference used for up sampling: https://www.machinelearningplus.com/machine-learning/logistic-regression-tutorial-examples-r/
set.seed(100)
# repeatedly sample the minority class so both classes end up the same size
up_train <- upSample(x = train[, colnames(train) %ni% "Class"],
                     y = train$Is.Happy)
Below is the new distribution of Is.Happy in the up sampled training set.
table(up_train$Is.Happy)
##
## 0 1
## 69 69
The glm function, which stands for generalized linear model, is used to build the model. The binomial family is specified, which informs the glm function that a binary logistic regression model is going to be used. The up_train data set is used to build the model.
logitmod <- glm(Is.Happy ~ Ave.Contribution, family = "binomial", data=up_train)
Below is the summary of the logitmod model.
summary(logitmod)
##
## Call:
## glm(formula = Is.Happy ~ Ave.Contribution, family = "binomial",
## data = up_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6494 -0.3049 0.0042 0.2220 3.2761
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.728 2.685 -5.486 0.0000000412 ***
## Ave.Contribution 27.631 5.009 5.516 0.0000000347 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 191.309 on 137 degrees of freedom
## Residual deviance: 60.789 on 136 degrees of freedom
## AIC: 64.789
##
## Number of Fisher Scoring iterations: 7
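As a side note, here is a minimal sketch (assuming the logitmod object fitted above) of one way to read these coefficients: exponentiating the slope gives the estimated multiplicative change in the odds of Is.Happy = 1 per unit increase in Ave.Contribution, and the ratio -intercept/slope gives the Ave.Contribution value at which the predicted probability crosses 0.5.

# estimated odds ratio per one-unit increase in Ave.Contribution
exp(coef(logitmod)["Ave.Contribution"])
# Ave.Contribution value where the predicted probability equals 0.5;
# with the estimates above this is roughly 14.728 / 27.631, or about 0.53
-coef(logitmod)["(Intercept)"] / coef(logitmod)["Ave.Contribution"]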
Below, the predicted probability that an observation has a happiness metric score of 5 or higher is generated.
When using the predict function, you need to set type to "response" (type = "response") so that predict computes the predicted probabilities. Otherwise, the predictions will be on the logit scale.
pred <- predict(logitmod, newdata = test, type = "response")
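As a quick check of the note above (a small sketch, assuming the objects defined so far), the default predict output is on the log-odds scale, and plogis converts it back to the probabilities stored in pred.

# default predict() output is on the logit (log-odds) scale
log_odds <- predict(logitmod, newdata = test)
# plogis() maps the log-odds back to probabilities; should match pred
all.equal(plogis(log_odds), pred)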
The predicted probabilities need to be converted to 1 or 0. In this exercise, I will use a cutoff point of 0.5. If the predicted probability is greater than 0.5, a value of 1 is assigned; otherwise, a value of 0 is assigned.
Below, the prediction values (1 or 0) are assigned.
Pred.Is.Happy <- as.factor(ifelse(pred > 0.5, 1, 0))
Below is the confusion matrix (from the caret package) that compares the predicted values with the actual values of the testing set.
The model has an accuracy of about 85%.
confusionMatrix(data=Pred.Is.Happy, reference=factor(test$Is.Happy, levels=c('0', '1')))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 12 2
## 1 5 27
##
## Accuracy : 0.8478
## 95% CI : (0.7113, 0.9366)
## No Information Rate : 0.6304
## P-Value [Acc > NIR] : 0.001076
##
## Kappa : 0.6611
##
## Mcnemar's Test P-Value : 0.449692
##
## Sensitivity : 0.7059
## Specificity : 0.9310
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.8438
## Prevalence : 0.3696
## Detection Rate : 0.2609
## Detection Prevalence : 0.3043
## Balanced Accuracy : 0.8185
##
## 'Positive' Class : 0
##
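As a quick sanity check, the reported accuracy can be recomputed by hand from the counts in the confusion matrix above: the correct predictions are the diagonal entries.

# correct predictions over total test observations: 39 / 46
(12 + 27) / (12 + 2 + 5 + 27)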