In this article we will use regression to analyze the question: "Does alcohol consumption make you happy?" The project is based on a dataset available on Kaggle. The dataset contains not only alcohol consumption figures (beer, wine, and spirits) but also social indicators such as GDP, HDI, region, and hemisphere for comparison. The HappinessScore itself is measured by asking sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest". So we are here to find out: is alcohol consumption linked to happiness?
We'll build a regression model to analyze whether alcohol consumption makes you happy, based on survey data covering 122 countries. We will try to find which variables are significant in predicting the HappinessScore and how well they describe it, and we will also evaluate the model with a statistical approach to determine its quality. This analysis might be used to challenge the habit of people who use alcohol as a 'stress relief'. Note: the dataset contains only 122 rows to be analyzed, each row representing a country. If we had more data about alcohol consumption from all around the world (I still believe bigger data = bigger impact), this kind of analysis could help governments worldwide justify alcohol-reduction campaigns with research.
You can load the required packages into your workspace using the library() function.
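The exact setup chunk is not shown in the article; based on the functions called later, a plausible sketch is as follows (the CSV file name and the use of stringsAsFactors are assumptions):

library(dplyr)      # data wrangling: select_if(), pipes, glimpse()
library(tidyr)      # gather() for reshaping before plotting
library(ggplot2)    # histograms and facets
library(GGally)     # ggcorr() correlation matrix plot
library(lmtest)     # bptest() for heteroscedasticity
library(car)        # vif() for multicollinearity
library(MLmetrics)  # MAE(), RMSE(), MAPE()

# read the Kaggle data; stringsAsFactors = TRUE reproduces the <fct> columns below
data.1 <- read.csv("HappinessAlcoholConsumption.csv", stringsAsFactors = TRUE)
glimpse(data.1)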
## Observations: 122
## Variables: 9
## $ Country <fct> Denmark, Switzerland, Iceland, Norway, Finland, Ca...
## $ Region <fct> Western Europe, Western Europe, Western Europe, We...
## $ Hemisphere <fct> north, north, north, north, north, north, north, s...
## $ HappinessScore <dbl> 7.526, 7.509, 7.501, 7.498, 7.413, 7.404, 7.339, 7...
## $ HDI <int> 928, 943, 933, 951, 918, 922, 928, 915, 938, 932, ...
## $ GDP_PerCapita <dbl> 53.579, 79.866, 60.530, 70.890, 43.433, 42.349, 45...
## $ Beer_PerCapita <int> 224, 185, 233, 169, 263, 240, 251, 203, 261, 152, ...
## $ Spirit_PerCapita <int> 81, 100, 61, 71, 133, 122, 88, 79, 72, 60, 69, 75,...
## $ Wine_PerCapita <int> 278, 280, 78, 129, 97, 100, 190, 175, 212, 186, 9,...
Here is a description of every variable to help you understand the data better:
- Country: the country name
- Region: the region the country belongs to
- Hemisphere: the hemisphere the country lies in
- HappinessScore: a metric measured in 2016 by asking the sampled people: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest"
- HDI: Human Development Index, taken from the United Nations Development Programme
- GDP_PerCapita: Gross Domestic Product index per capita
- Beer_PerCapita: liters of beer consumed per capita
- Spirit_PerCapita: consumption of spirits per capita
- Wine_PerCapita: wine consumption per capita
Each row represents a sampled country. Rather than dropping the Country column, we'll turn it into row names.
# make column country as rownames, then delete unused column
rownames(data.1) <- data.1$Country
data.1[1] <- NULL
head(data.1)
Before we start, let's make sure our data doesn't have any missing values.
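The check producing the counts below is not shown; a minimal sketch that matches the output format:

# count missing values in each column
colSums(is.na(data.1))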
## Region Hemisphere HappinessScore HDI
## 0 0 0 0
## GDP_PerCapita Beer_PerCapita Spirit_PerCapita Wine_PerCapita
## 0 0 0 0
To check correlations, we will use the ggcorr() function from the GGally package; ggcorr() draws a correlation matrix plot.
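A sketch of the call that could produce such a plot (the label argument is an assumption):

# correlation matrix plot of the numeric columns; non-numeric columns are ignored
ggcorr(data.1, label = TRUE)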
The plot above only shows the numeric variables. As we can see, some variables, like HDI, have a high positive correlation with HappinessScore; GDP_PerCapita, however, is negatively correlated with it.
Let's look at a histogram of every numeric variable; I want to know the distribution of each one.
ggplot(gather(data.1 %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
It looks like very few countries actually have a high GDP_PerCapita, and the same goes for Spirit_PerCapita and Wine_PerCapita; I'm afraid some of those values will be treated as outliers. The good news is that our target variable, HappinessScore, looks approximately normally distributed. Let's see how it looks in a boxplot.
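The boxplot itself is not reproduced here; a minimal sketch, assuming we focus on GDP_PerCapita as the text below does:

# boxplot of GDP_PerCapita to eyeball potential outliers
boxplot(data.1$GDP_PerCapita, main = "GDP_PerCapita")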
It looks like most countries have a low GDP_PerCapita and very few of them have a high one.
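The count below is plausibly obtained with the boxplot (1.5 * IQR) whisker rule; a sketch:

# number of GDP_PerCapita values beyond the boxplot whiskers
length(boxplot.stats(data.1$GDP_PerCapita)$out)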
## [1] 15
Only 15 countries have a high GDP. I'll continue building the model using all columns, outliers included; if things go badly, we'll remove the outliers and rebuild the model. Anyway, just for fun, let's see which countries have a high GDP_PerCapita.
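A hypothetical way to list those countries, reusing the boxplot rule above:

# countries flagged as GDP_PerCapita outliers
gdp.out <- boxplot.stats(data.1$GDP_PerCapita)$out
data.1[data.1$GDP_PerCapita %in% gdp.out, ]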
Countries with a high GDP_PerCapita in this dataset are mostly located in sub-Saharan Africa, have low HappinessScore values, and otherwise look like poor, developing countries.
Let's build a model using linear regression.
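The code that creates the train and test sets is not shown in the article; here is a minimal sketch, assuming a random split (the seed is hypothetical, and the 100-row training size is inferred from the 84 residual degrees of freedom plus 16 estimated coefficients in the summary below):

# hypothetical train/test split: the seed and split method are assumptions
set.seed(417)
idx <- sample(nrow(data.1), 100)  # the summary below implies 100 training rows
train <- data.1[idx, ]
test  <- data.1[-idx, ]

# fit the first model on all remaining predictors
mod.1 <- lm(HappinessScore ~ ., data = train)
summary(mod.1)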
##
## Call:
## lm(formula = HappinessScore ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39719 -0.27527 0.04117 0.35164 1.08263
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 1.72544682 1.00645132 1.714
## RegionCentral and Eastern Europe -1.41282934 0.62250726 -2.270
## RegionEastern Asia -1.41434122 0.77786205 -1.818
## RegionLatin America and Caribbean -0.35050866 0.62526574 -0.561
## RegionMiddle East and Northern Africa -1.47497574 0.65022404 -2.268
## RegionNorth America -0.30811604 0.72287201 -0.426
## RegionSoutheastern Asia -1.09263023 0.67383573 -1.622
## RegionSub-Saharan Africa -1.19824578 0.65661991 -1.825
## RegionWestern Europe -0.59360813 0.61381708 -0.967
## Hemispherenorth 0.23540987 0.31272295 0.753
## Hemispherenoth NA NA NA
## Hemispheresouth 0.07583468 0.34151324 0.222
## HDI 0.00654423 0.00092141 7.102
## GDP_PerCapita -0.00005535 0.00040697 -0.136
## Beer_PerCapita -0.00044532 0.00086591 -0.514
## Spirit_PerCapita -0.00092425 0.00113483 -0.814
## Wine_PerCapita -0.00208128 0.00108405 -1.920
## Pr(>|t|)
## (Intercept) 0.0901 .
## RegionCentral and Eastern Europe 0.0258 *
## RegionEastern Asia 0.0726 .
## RegionLatin America and Caribbean 0.5766
## RegionMiddle East and Northern Africa 0.0259 *
## RegionNorth America 0.6710
## RegionSoutheastern Asia 0.1087
## RegionSub-Saharan Africa 0.0716 .
## RegionWestern Europe 0.3363
## Hemispherenorth 0.4537
## Hemispherenoth NA
## Hemispheresouth 0.8248
## HDI 0.000000000367 ***
## GDP_PerCapita 0.8921
## Beer_PerCapita 0.6084
## Spirit_PerCapita 0.4177
## Wine_PerCapita 0.0583 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5593 on 84 degrees of freedom
## Multiple R-squared: 0.796, Adjusted R-squared: 0.7596
## F-statistic: 21.85 on 15 and 84 DF, p-value: < 0.00000000000000022
From the summary above we found that one coefficient is not defined because of singularities. That happens when two or more of the variables are perfectly collinear. To identify which variables are collinear, we can apply the alias() function to our model.
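Assuming the first model is named mod.1 (the name is not shown in the original), the check looks like:

# identify aliased (perfectly collinear) coefficients
alias(mod.1)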
## Model :
## HappinessScore ~ Region + Hemisphere + HDI + GDP_PerCapita +
## Beer_PerCapita + Spirit_PerCapita + Wine_PerCapita
##
## Complete :
## (Intercept) RegionCentral and Eastern Europe RegionEastern Asia
## Hemispherenoth 0 0 1
## RegionLatin America and Caribbean
## Hemispherenoth 0
## RegionMiddle East and Northern Africa RegionNorth America
## Hemispherenoth 0 0
## RegionSoutheastern Asia RegionSub-Saharan Africa
## Hemispherenoth 0 0
## RegionWestern Europe Hemispherenorth Hemispheresouth HDI
## Hemispherenoth 0 0 0 0
## GDP_PerCapita Beer_PerCapita Spirit_PerCapita Wine_PerCapita
## Hemispherenoth 0 0 0 0
Looks like Hemispherenoth (a mistyped level of Hemisphere in the data) is perfectly collinear with RegionEastern Asia. From the model summary, we also saw that the levels of the Hemisphere variable have high Pr(>|t|) values (meaning low significance in our model), so we will remove the Hemisphere variable, which is problematic with respect to multicollinearity, and rebuild the model.
# removing Hemisphere from both train and test data
train <- train %>% select(-c("Hemisphere"))
test <- test %>% select(-c("Hemisphere"))
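Then we refit, producing the summary below (the name mod.2 is confirmed by the step() call later on):

# refit the model without Hemisphere
mod.2 <- lm(HappinessScore ~ ., data = train)
summary(mod.2)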
##
## Call:
## lm(formula = HappinessScore ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39946 -0.27415 0.02537 0.33244 1.06382
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.922788660 0.940576202 2.044
## RegionCentral and Eastern Europe -1.282656978 0.593981561 -2.159
## RegionEastern Asia -1.524216386 0.712811709 -2.138
## RegionLatin America and Caribbean -0.300741828 0.610354647 -0.493
## RegionMiddle East and Northern Africa -1.356404195 0.628887472 -2.157
## RegionNorth America -0.164344496 0.693467361 -0.237
## RegionSoutheastern Asia -0.978083545 0.653585877 -1.496
## RegionSub-Saharan Africa -1.206130214 0.652031702 -1.850
## RegionWestern Europe -0.438989067 0.572625983 -0.767
## HDI 0.006436907 0.000908660 7.084
## GDP_PerCapita -0.000004637 0.000400876 -0.012
## Beer_PerCapita -0.000454719 0.000857808 -0.530
## Spirit_PerCapita -0.000905049 0.001127386 -0.803
## Wine_PerCapita -0.002186465 0.001049239 -2.084
## Pr(>|t|)
## (Intercept) 0.0440 *
## RegionCentral and Eastern Europe 0.0336 *
## RegionEastern Asia 0.0353 *
## RegionLatin America and Caribbean 0.6235
## RegionMiddle East and Northern Africa 0.0338 *
## RegionNorth America 0.8132
## RegionSoutheastern Asia 0.1382
## RegionSub-Saharan Africa 0.0678 .
## RegionWestern Europe 0.4454
## HDI 0.000000000362 ***
## GDP_PerCapita 0.9908
## Beer_PerCapita 0.5974
## Spirit_PerCapita 0.4243
## Wine_PerCapita 0.0401 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5557 on 86 degrees of freedom
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.7627
## F-statistic: 25.47 on 13 and 86 DF, p-value: < 0.00000000000000022
A variable with Pr(>|t|) > 0.05 does not contribute significantly to our model, so we are looking for variables with Pr(>|t|) below 0.05; it is generally safe to remove variables that don't meet that criterion. We can compare candidate models by looking at R-squared. Let's build the next model with selected variables only. To do that, we can use stepwise regression.
Stepwise regression is a family of methods that fit regression models by choosing the predictive variables through an automatic procedure: predictors are added or removed one at a time, keeping the change that yields the lowest AIC (here, with direction = "backward", predictors are only removed). AIC is not reported by summary(), but roughly speaking it estimates the amount of "information loss" of one model relative to another, so a lower AIC is better.
# rebuild model with stepwise
mod.3 <- step(mod.2, direction = "backward", trace = 0)
summary(mod.3)
##
## Call:
## lm(formula = HappinessScore ~ Region + HDI + Wine_PerCapita,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38805 -0.29101 0.03532 0.35718 1.03663
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.0489400 0.8752720 2.341
## RegionCentral and Eastern Europe -1.3650709 0.5804843 -2.352
## RegionEastern Asia -1.5116504 0.7031927 -2.150
## RegionLatin America and Caribbean -0.3444090 0.6014020 -0.573
## RegionMiddle East and Northern Africa -1.2660613 0.6152631 -2.058
## RegionNorth America -0.2170035 0.6847090 -0.317
## RegionSoutheastern Asia -0.9851934 0.6438780 -1.530
## RegionSub-Saharan Africa -1.2056568 0.6411261 -1.881
## RegionWestern Europe -0.4526532 0.5653238 -0.801
## HDI 0.0060932 0.0007575 8.044
## Wine_PerCapita -0.0021291 0.0010139 -2.100
## Pr(>|t|)
## (Intercept) 0.0215 *
## RegionCentral and Eastern Europe 0.0209 *
## RegionEastern Asia 0.0343 *
## RegionLatin America and Caribbean 0.5683
## RegionMiddle East and Northern Africa 0.0425 *
## RegionNorth America 0.7520
## RegionSoutheastern Asia 0.1295
## RegionSub-Saharan Africa 0.0633 .
## RegionWestern Europe 0.4254
## HDI 0.00000000000353 ***
## Wine_PerCapita 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5506 on 89 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.767
## F-statistic: 33.59 on 10 and 89 DF, p-value: < 0.00000000000000022
Now we have the variables selected by the stepwise procedure. We'll build a new model from the stepwise result for later use.
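The object mod.3x used for prediction and the diagnostics below is never shown being created. Judging from the later VIF and Breusch-Pagan output, which list Spirit_PerCapita alongside Region, HDI, and Wine_PerCapita, a plausible reconstruction is:

# mod.3x: the reduced model kept for later use; the formula is inferred from
# the VIF output further below, which also includes Spirit_PerCapita
mod.3x <- lm(HappinessScore ~ Region + HDI + Spirit_PerCapita + Wine_PerCapita,
             data = train)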
Keep in mind that R-squared does not by itself indicate the quality of the model; rather, it tells us how well our model describes the data, measuring the extent to which the variance in our dependent variable can be explained by the independent variables. Both models show fairly high R-squared values, explaining roughly 79% of the variance in the data. The two models differ only slightly in R-squared (0.7938 vs. 0.7905), so it is safe to remove those variables: the reduced model still explains almost as much variance as before.
However, let's evaluate both models on unseen data to see how they differ.
Note: personally, I doubt our model is good enough for prediction; we only have 122 observations to begin with. Still, it's a fun experience to find out whether alcohol consumption is related to people's happiness scores.
# Prediction with all variable
pred.1 <- predict(mod.2, newdata = test)
# prediction with stepwise
pred.2 <- predict(mod.3x, newdata = test)
# calculate MAE, RMSE, and MAPE for both models
all.var <- data.frame(MAE = MAE(pred.1, test$HappinessScore),
RMSE = RMSE(pred.1, test$HappinessScore),
MAPE = MAPE(pred.1, test$HappinessScore))
with.step <- data.frame(MAE = MAE(pred.2, test$HappinessScore),
RMSE = RMSE(pred.2, test$HappinessScore),
MAPE = MAPE(pred.2, test$HappinessScore))
rbind(all.var = all.var, with.step = with.step)
The top row shows the MAE, RMSE, and MAPE of the first model (all variables) and the bottom row those of the second model (stepwise). MAE and RMSE are hard to interpret on their own, but they are good metrics for comparing models. From the MAPE, our model has only about 9% error when predicting HappinessScore on unseen data; that's actually good enough, even though I doubted it before. However, the stepwise model is not better than the first one. Since we're dealing with linear regression, we need a more statistical approach to determine the quality of our model.
Note: from here on, I'll continue using the stepwise model for the statistical analysis.
A residual plot places the residual values on the y-axis against the fitted values (or, as below, the observed target values) on the x-axis. A residual plot whose points are randomly scattered around the x-axis is the one we want. It doesn't mean the model is perfect, but it does mean the regression model appropriately describes the variability in our dependent variable.
plot(train$HappinessScore, residuals(mod.3x),
     xlab = "Happiness Score", ylab = "residuals")
abline(h = 0, col = "maroon", lwd = 2)
There's no pattern, exactly what we want! It means our model appropriately describes the variability in the dependent variable.
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors varies across the range of the target variable, heteroscedasticity is present, which would show up as the kind of non-random residual pattern discussed above.
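The output below is a studentized Breusch-Pagan test, which is what lmtest::bptest() produces; a minimal sketch of the call:

# Breusch-Pagan test for heteroscedasticity of the residuals
bptest(mod.3x)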
##
## studentized Breusch-Pagan test
##
## data: mod.3x
## BP = 12.368, df = 11, p-value = 0.3366
The test has a p-value above 0.05, therefore we can conclude that the residuals have constant variance (heteroscedasticity is not present).
The normality assumption means that the residuals from the linear regression model should be normally distributed, centered around zero. We will apply the Shapiro-Wilk test to the residuals.
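The call itself is not shown; given the "data: mod.3x$residuals" line in the output, it is presumably:

# Shapiro-Wilk normality test on the model residuals
shapiro.test(mod.3x$residuals)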
##
## Shapiro-Wilk normality test
##
## data: mod.3x$residuals
## W = 0.98557, p-value = 0.3489
The test has a p-value above 0.05, so we cannot reject normality; we can treat our residuals as normally distributed.
We're using the Variance Inflation Factor (VIF) to check for multicollinearity among the variables in our model. A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection.
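The GVIF table below matches the output of car::vif(); a sketch of the call:

# variance inflation factors; car::vif() reports GVIF when factor predictors are present
vif(mod.3x)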
## GVIF Df GVIF^(1/(2*Df))
## Region 9.728609 8 1.152798
## HDI 3.977431 1 1.994350
## Spirit_PerCapita 1.880714 1 1.371391
## Wine_PerCapita 2.893909 1 1.701149
From the test we know that our model doesn't suffer from problematic multicollinearity: no VIF exceeds 10, although Region (GVIF 9.73) and HDI (3.98) show relatively high values.
To wrap up:
- Our stepwise model passes all four diagnostic checks: the residual plot, heteroscedasticity, normality, and multicollinearity.
- On the test set, RMSE is 0.6239352, MAE is 0.4941122, and MAPE is 0.09416681, meaning our model predicts HappinessScore with only about 9% error.
- Region and HDI have the biggest influence on HappinessScore.
- Spirit and wine consumption have a slight influence on HappinessScore, and that influence is negative.
So, to the many alcohol-loving folks out there: based on my analysis, alcohol makes you (slightly) unhappy. Thank you!