In this article we will use regression to analyze the question: "Does alcohol consumption make you happy?" The project is based on a dataset available on Kaggle. The dataset contains not only alcohol consumption figures (beer, wine, and spirits) but also social indicators such as GDP, HDI, region, and hemisphere for comparison. The HappinessScore itself is measured by asking sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest". So we are here to find out: is alcohol consumption linked to happiness?
We'll build a regression model to analyze whether alcohol consumption makes you happy, based on survey data covering 122 countries. We will try to find which variables are significant in predicting the HappinessScore and how well they describe it, and we will also evaluate the model with a statistical approach to determine its quality. This analysis might be used to challenge the habit of people who use alcohol as a 'stress relief'. Note: the dataset contains only 122 rows to be analyzed, each row representing a country. If we had more data about alcohol consumption from all around the world (I still believe bigger data = bigger impact), this kind of analysis could help governments worldwide justify alcohol-reduction campaigns with research.
You can load the required packages into your workspace using the library() function.
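The exact setup chunk is not shown in the article; based on the functions called later, a plausible sketch is as follows (the CSV file name and the use of stringsAsFactors are assumptions):

library(dplyr)      # data wrangling: select_if(), pipes, glimpse()
library(tidyr)      # gather() for reshaping before plotting
library(ggplot2)    # histograms and facets
library(GGally)     # ggcorr() correlation matrix plot
library(lmtest)     # bptest() for heteroscedasticity
library(car)        # vif() for multicollinearity
library(MLmetrics)  # MAE(), RMSE(), MAPE()

# read the Kaggle data; stringsAsFactors = TRUE reproduces the <fct> columns below
data.1 <- read.csv("HappinessAlcoholConsumption.csv", stringsAsFactors = TRUE)
glimpse(data.1)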
## Observations: 122
## Variables: 9
## $ Country <fct> Denmark, Switzerland, Iceland, Norway, Finland, Ca...
## $ Region <fct> Western Europe, Western Europe, Western Europe, We...
## $ Hemisphere <fct> north, north, north, north, north, north, north, s...
## $ HappinessScore <dbl> 7.526, 7.509, 7.501, 7.498, 7.413, 7.404, 7.339, 7...
## $ HDI <int> 928, 943, 933, 951, 918, 922, 928, 915, 938, 932, ...
## $ GDP_PerCapita <dbl> 53.579, 79.866, 60.530, 70.890, 43.433, 42.349, 45...
## $ Beer_PerCapita <int> 224, 185, 233, 169, 263, 240, 251, 203, 261, 152, ...
## $ Spirit_PerCapita <int> 81, 100, 61, 71, 133, 122, 88, 79, 72, 60, 69, 75,...
## $ Wine_PerCapita <int> 278, 280, 78, 129, 97, 100, 190, 175, 212, 186, 9,...
Here is a description of every variable to help you understand the data better:
- Country: the country name
- Region: the region the country belongs to
- Hemisphere: the hemisphere the country lies in
- HappinessScore: a metric measured in 2016 by asking the sampled people: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest"
- HDI: Human Development Index, taken from the United Nations Development Programme
- GDP_PerCapita: Gross Domestic Product index per capita
- Beer_PerCapita: liters of beer consumed per capita
- Spirit_PerCapita: consumption of spirits per capita
- Wine_PerCapita: wine consumption per capita
Each row represents a sampled country. Rather than dropping the Country column, we'll turn it into row names.
# make column country as rownames, then delete unused column
rownames(data.1) <- data.1$Country
data.1[1] <- NULL
head(data.1)
Before we start, let's make sure our data doesn't have any missing values.
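The check producing the counts below is not shown; a minimal sketch that matches the output format:

# count missing values in each column
colSums(is.na(data.1))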
## Region Hemisphere HappinessScore HDI
## 0 0 0 0
## GDP_PerCapita Beer_PerCapita Spirit_PerCapita Wine_PerCapita
## 0 0 0 0
To check correlations, we will use the ggcorr() function from the GGally package; ggcorr() draws a correlation matrix plot.
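A sketch of the call that could produce such a plot (the label argument is an assumption):

# correlation matrix plot of the numeric columns; non-numeric columns are ignored
ggcorr(data.1, label = TRUE)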
The plot above only shows the numeric variables. As we can see, some variables, like HDI, have a high positive correlation with HappinessScore; GDP_PerCapita, however, is negatively correlated with it.
Let's look at a histogram of every numeric variable; I want to know the distribution of each one.
ggplot(gather(data.1 %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
It looks like very few countries actually have a high GDP_PerCapita, and the same goes for Spirit_PerCapita and Wine_PerCapita; I'm afraid some of those values will be treated as outliers. The good news is that our target variable, HappinessScore, looks approximately normally distributed. Let's see how it looks in a boxplot.
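The boxplot itself is not reproduced here; a minimal sketch, assuming we focus on GDP_PerCapita as the text below does:

# boxplot of GDP_PerCapita to eyeball potential outliers
boxplot(data.1$GDP_PerCapita, main = "GDP_PerCapita")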
It looks like most countries have a low GDP_PerCapita and very few of them have a high one.
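The count below is plausibly obtained with the boxplot (1.5 * IQR) whisker rule; a sketch:

# number of GDP_PerCapita values beyond the boxplot whiskers
length(boxplot.stats(data.1$GDP_PerCapita)$out)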
## [1] 15
Only 15 countries have a high GDP. I'll continue building the model using all columns, outliers included; if things go badly, we'll remove the outliers and rebuild the model. Anyway, just for fun, let's see which countries have a high GDP_PerCapita.
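A hypothetical way to list those countries, reusing the boxplot rule above:

# countries flagged as GDP_PerCapita outliers
gdp.out <- boxplot.stats(data.1$GDP_PerCapita)$out
data.1[data.1$GDP_PerCapita %in% gdp.out, ]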
Countries with a high GDP_PerCapita in this dataset are mostly located in sub-Saharan Africa, have low HappinessScore values, and otherwise look like poor, developing countries.
Let's build a model using linear regression.
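The code that creates the train and test sets is not shown in the article; here is a minimal sketch, assuming a random split (the seed is hypothetical, and the 100-row training size is inferred from the 84 residual degrees of freedom plus 16 estimated coefficients in the summary below):

# hypothetical train/test split: the seed and split method are assumptions
set.seed(417)
idx <- sample(nrow(data.1), 100)  # the summary below implies 100 training rows
train <- data.1[idx, ]
test  <- data.1[-idx, ]

# fit the first model on all remaining predictors
mod.1 <- lm(HappinessScore ~ ., data = train)
summary(mod.1)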
##
## Call:
## lm(formula = HappinessScore ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39719 -0.27527 0.04117 0.35164 1.08263
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 1.72544682 1.00645132 1.714
## RegionCentral and Eastern Europe -1.41282934 0.62250726 -2.270
## RegionEastern Asia -1.41434122 0.77786205 -1.818
## RegionLatin America and Caribbean -0.35050866 0.62526574 -0.561
## RegionMiddle East and Northern Africa -1.47497574 0.65022404 -2.268
## RegionNorth America -0.30811604 0.72287201 -0.426
## RegionSoutheastern Asia -1.09263023 0.67383573 -1.622
## RegionSub-Saharan Africa -1.19824578 0.65661991 -1.825
## RegionWestern Europe -0.59360813 0.61381708 -0.967
## Hemispherenorth 0.23540987 0.31272295 0.753
## Hemispherenoth NA NA NA
## Hemispheresouth 0.07583468 0.34151324 0.222
## HDI 0.00654423 0.00092141 7.102
## GDP_PerCapita -0.00005535 0.00040697 -0.136
## Beer_PerCapita -0.00044532 0.00086591 -0.514
## Spirit_PerCapita -0.00092425 0.00113483 -0.814
## Wine_PerCapita -0.00208128 0.00108405 -1.920
## Pr(>|t|)
## (Intercept) 0.0901 .
## RegionCentral and Eastern Europe 0.0258 *
## RegionEastern Asia 0.0726 .
## RegionLatin America and Caribbean 0.5766
## RegionMiddle East and Northern Africa 0.0259 *
## RegionNorth America 0.6710
## RegionSoutheastern Asia 0.1087
## RegionSub-Saharan Africa 0.0716 .
## RegionWestern Europe 0.3363
## Hemispherenorth 0.4537
## Hemispherenoth NA
## Hemispheresouth 0.8248
## HDI 0.000000000367 ***
## GDP_PerCapita 0.8921
## Beer_PerCapita 0.6084
## Spirit_PerCapita 0.4177
## Wine_PerCapita 0.0583 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5593 on 84 degrees of freedom
## Multiple R-squared: 0.796, Adjusted R-squared: 0.7596
## F-statistic: 21.85 on 15 and 84 DF, p-value: < 0.00000000000000022
From the summary above we found that one coefficient is not defined because of singularities. That happens when two or more of the variables are perfectly collinear. To identify which variables are collinear, we can apply the alias() function to our model.
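Assuming the first model is named mod.1 (the name is not shown in the original), the check looks like:

# identify aliased (perfectly collinear) coefficients
alias(mod.1)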
## Model :
## HappinessScore ~ Region + Hemisphere + HDI + GDP_PerCapita +
## Beer_PerCapita + Spirit_PerCapita + Wine_PerCapita
##
## Complete :
## (Intercept) RegionCentral and Eastern Europe RegionEastern Asia
## Hemispherenoth 0 0 1
## RegionLatin America and Caribbean
## Hemispherenoth 0
## RegionMiddle East and Northern Africa RegionNorth America
## Hemispherenoth 0 0
## RegionSoutheastern Asia RegionSub-Saharan Africa
## Hemispherenoth 0 0
## RegionWestern Europe Hemispherenorth Hemispheresouth HDI
## Hemispherenoth 0 0 0 0
## GDP_PerCapita Beer_PerCapita Spirit_PerCapita Wine_PerCapita
## Hemispherenoth 0 0 0 0
Looks like Hemispherenoth (a mistyped level of Hemisphere in the data) is perfectly collinear with RegionEastern Asia. From the model summary, we also saw that the levels of the Hemisphere variable have high Pr(>|t|) values (meaning low significance in our model), so we will remove the Hemisphere variable, which is problematic with respect to multicollinearity, and rebuild the model.
# removing Hemisphere from both train and test data
train <- train %>% select(-c("Hemisphere"))
test <- test %>% select(-c("Hemisphere"))
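Then we refit, producing the summary below (the name mod.2 is confirmed by the step() call later on):

# refit the model without Hemisphere
mod.2 <- lm(HappinessScore ~ ., data = train)
summary(mod.2)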
##
## Call:
## lm(formula = HappinessScore ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39946 -0.27415 0.02537 0.33244 1.06382
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.922788660 0.940576202 2.044
## RegionCentral and Eastern Europe -1.282656978 0.593981561 -2.159
## RegionEastern Asia -1.524216386 0.712811709 -2.138
## RegionLatin America and Caribbean -0.300741828 0.610354647 -0.493
## RegionMiddle East and Northern Africa -1.356404195 0.628887472 -2.157
## RegionNorth America -0.164344496 0.693467361 -0.237
## RegionSoutheastern Asia -0.978083545 0.653585877 -1.496
## RegionSub-Saharan Africa -1.206130214 0.652031702 -1.850
## RegionWestern Europe -0.438989067 0.572625983 -0.767
## HDI 0.006436907 0.000908660 7.084
## GDP_PerCapita -0.000004637 0.000400876 -0.012
## Beer_PerCapita -0.000454719 0.000857808 -0.530
## Spirit_PerCapita -0.000905049 0.001127386 -0.803
## Wine_PerCapita -0.002186465 0.001049239 -2.084
## Pr(>|t|)
## (Intercept) 0.0440 *
## RegionCentral and Eastern Europe 0.0336 *
## RegionEastern Asia 0.0353 *
## RegionLatin America and Caribbean 0.6235
## RegionMiddle East and Northern Africa 0.0338 *
## RegionNorth America 0.8132
## RegionSoutheastern Asia 0.1382
## RegionSub-Saharan Africa 0.0678 .
## RegionWestern Europe 0.4454
## HDI 0.000000000362 ***
## GDP_PerCapita 0.9908
## Beer_PerCapita 0.5974
## Spirit_PerCapita 0.4243
## Wine_PerCapita 0.0401 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5557 on 86 degrees of freedom
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.7627
## F-statistic: 25.47 on 13 and 86 DF, p-value: < 0.00000000000000022
A variable with Pr(>|t|) > 0.05 does not contribute significantly to our model, so we are looking for variables with Pr(>|t|) below 0.05; it is generally safe to remove variables that don't meet that criterion. We can compare candidate models by looking at R-squared. Let's build the next model with selected variables only. To do that, we can use stepwise regression.
Stepwise regression is a family of methods that fit regression models by choosing the predictive variables through an automatic procedure: predictors are added or removed one at a time, keeping the change that yields the lowest AIC (here, with direction = "backward", predictors are only removed). AIC is not reported by summary(), but roughly speaking it estimates the amount of "information loss" of one model relative to another, so a lower AIC is better.
# rebuild model with stepwise
mod.3 <- step(mod.2, direction = "backward", trace = 0)
summary(mod.3)
##
## Call:
## lm(formula = HappinessScore ~ Region + HDI + Wine_PerCapita,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38805 -0.29101 0.03532 0.35718 1.03663
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.0489400 0.8752720 2.341
## RegionCentral and Eastern Europe -1.3650709 0.5804843 -2.352
## RegionEastern Asia -1.5116504 0.7031927 -2.150
## RegionLatin America and Caribbean -0.3444090 0.6014020 -0.573
## RegionMiddle East and Northern Africa -1.2660613 0.6152631 -2.058
## RegionNorth America -0.2170035 0.6847090 -0.317
## RegionSoutheastern Asia -0.9851934 0.6438780 -1.530
## RegionSub-Saharan Africa -1.2056568 0.6411261 -1.881
## RegionWestern Europe -0.4526532 0.5653238 -0.801
## HDI 0.0060932 0.0007575 8.044
## Wine_PerCapita -0.0021291 0.0010139 -2.100
## Pr(>|t|)
## (Intercept) 0.0215 *
## RegionCentral and Eastern Europe 0.0209 *
## RegionEastern Asia 0.0343 *
## RegionLatin America and Caribbean 0.5683
## RegionMiddle East and Northern Africa 0.0425 *
## RegionNorth America 0.7520
## RegionSoutheastern Asia 0.1295
## RegionSub-Saharan Africa 0.0633 .
## RegionWestern Europe 0.4254
## HDI 0.00000000000353 ***
## Wine_PerCapita 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5506 on 89 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.767
## F-statistic: 33.59 on 10 and 89 DF, p-value: < 0.00000000000000022
Now we have the variables selected by the stepwise procedure. We'll build a new model from the stepwise result for later use.
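The object mod.3x used for prediction and the diagnostics below is never shown being created. Judging from the later VIF and Breusch-Pagan output, which list Spirit_PerCapita alongside Region, HDI, and Wine_PerCapita, a plausible reconstruction is:

# mod.3x: the reduced model kept for later use; the formula is inferred from
# the VIF output further below, which also includes Spirit_PerCapita
mod.3x <- lm(HappinessScore ~ Region + HDI + Spirit_PerCapita + Wine_PerCapita,
             data = train)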
Keep in mind that R-squared does not by itself indicate the quality of the model; rather, it tells us how well our model describes the data, measuring the extent to which the variance in our dependent variable can be explained by the independent variables. Both models show fairly high R-squared values, explaining roughly 79% of the variance in the data. The two models differ only slightly in R-squared (0.7938 vs. 0.7905), so it is safe to remove those variables: the reduced model still explains almost as much variance as before.
However, let's evaluate both models on unseen data to see how they differ.
Note: personally, I doubt our model is good enough for prediction; we only have 122 observations to begin with. Still, it's a fun experience to find out whether alcohol consumption is related to people's happiness scores.
# Prediction with all variable
pred.1 <- predict(mod.2, newdata = test)
# prediction with stepwise
pred.2 <- predict(mod.3x, newdata = test)
# calculate MAE, RMSE, and MAPE for both models
all.var <- data.frame(MAE = MAE(pred.1, test$HappinessScore),
RMSE = RMSE(pred.1, test$HappinessScore),
MAPE = MAPE(pred.1, test$HappinessScore))
with.step <- data.frame(MAE = MAE(pred.2, test$HappinessScore),
RMSE = RMSE(pred.2, test$HappinessScore),
MAPE = MAPE(pred.2, test$HappinessScore))
rbind(all.var = all.var, with.step = with.step)
The top row shows the MAE, RMSE, and MAPE of the first model (all variables) and the bottom row those of the second model (stepwise). MAE and RMSE are hard to interpret on their own, but they are good metrics for comparing models. From the MAPE, our model has only about 9% error when predicting HappinessScore on unseen data; that's actually good enough, even though I doubted it before. However, the stepwise model is not better than the first one. Since we're dealing with linear regression, we need a more statistical approach to determine the quality of our model.
Note: from here on, I'll continue using the stepwise model for the statistical analysis.
A residual plot places the residual values on the y-axis against the fitted values (or, as below, the observed target values) on the x-axis. A residual plot whose points are randomly scattered around the x-axis is the one we want. It doesn't mean the model is perfect, but it does mean the regression model appropriately describes the variability in our dependent variable.
plot(train$HappinessScore, residuals(mod.3x),
     xlab = "Happiness Score", ylab = "residuals")
abline(h = 0, col = "maroon", lwd = 2)
There's no pattern, exactly what we want! It means our model appropriately describes the variability in the dependent variable.
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors varies across the range of the target variable, heteroscedasticity is present, which would show up as the kind of non-random residual pattern discussed above.
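The output below is a studentized Breusch-Pagan test, which is what lmtest::bptest() produces; a minimal sketch of the call:

# Breusch-Pagan test for heteroscedasticity of the residuals
bptest(mod.3x)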
##
## studentized Breusch-Pagan test
##
## data: mod.3x
## BP = 12.368, df = 11, p-value = 0.3366
The test has a p-value above 0.05, therefore we can conclude that the residuals have constant variance (heteroscedasticity is not present).
The normality assumption means that the residuals from the linear regression model should be normally distributed, centered around zero. We will apply the Shapiro-Wilk test to the residuals.
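The call itself is not shown; given the "data: mod.3x$residuals" line in the output, it is presumably:

# Shapiro-Wilk normality test on the model residuals
shapiro.test(mod.3x$residuals)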
##
## Shapiro-Wilk normality test
##
## data: mod.3x$residuals
## W = 0.98557, p-value = 0.3489
The test has a p-value above 0.05, so we cannot reject normality; we can treat our residuals as normally distributed.
We're using the Variance Inflation Factor (VIF) to check for multicollinearity among the variables in our model. A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection.
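The GVIF table below matches the output of car::vif(); a sketch of the call:

# variance inflation factors; car::vif() reports GVIF when factor predictors are present
vif(mod.3x)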
## GVIF Df GVIF^(1/(2*Df))
## Region 9.728609 8 1.152798
## HDI 3.977431 1 1.994350
## Spirit_PerCapita 1.880714 1 1.371391
## Wine_PerCapita 2.893909 1 1.701149
From the test we know that our model doesn't suffer from problematic multicollinearity: no VIF exceeds 10, although Region (GVIF 9.73) and HDI (3.98) show relatively high values.
To wrap up:
- Our stepwise model passes all four diagnostic checks: the residual plot, heteroscedasticity, normality, and multicollinearity.
- On the test set, RMSE is 0.6239352, MAE is 0.4941122, and MAPE is 0.09416681, meaning our model predicts HappinessScore with only about 9% error.
- Region and HDI have the biggest influence on HappinessScore.
- Spirit and wine consumption have a slight influence on HappinessScore, and that influence is negative.
So, to the many alcohol-loving folks out there: based on my analysis, alcohol makes you (slightly) unhappy. Thank you!