An individual’s happiness is a complex concept that is difficult to measure and to predict. First, individuals attach their own subjective meaning to the concept of happiness. Second, an individual’s feeling of happiness can be influenced by many factors, which are often not obvious.
I chose Germany as the country for the analysis of happiness. I would like to explore how several factors are connected with individuals’ happiness in this country. Overall, the number of observations for Germany is 1528.
In this study the variable “Happiness” is the mean score of three variables that describe an individual’s well-being. These variables are the answers to the questions “How much freedom of choice and control do you feel you have over the way your life turns out?”, “How satisfied are you with your life as a whole these days?” and “How satisfied are you with the financial situation of your household?”. Together they cover the individual’s control over his/her life, general life satisfaction and satisfaction with his/her current financial situation. All three variables are measured on a 10-point scale, and in combination they give a fuller picture of how happy an individual is.
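The construction of the composite score can be illustrated with a minimal sketch. The data frame `toy` and its values are hypothetical and are not taken from the survey; the assignment of the three questions to the columns `Q48`, `Q49` and `Q50` follows the order in which they are listed above and is an assumption.

```r
# Hypothetical toy answers on the three 10-point items (not survey data)
toy <- data.frame(
  Q48 = c(8, 5, 10, NA),  # freedom of choice and control (assumed mapping)
  Q49 = c(7, 6, 9, 8),    # satisfaction with life as a whole (assumed mapping)
  Q50 = c(9, 4, 10, 7)    # satisfaction with household finances (assumed mapping)
)

# Row mean of the three items; a row with a missing item gets NA
toy$Happy <- rowMeans(toy[, c("Q48", "Q49", "Q50")])
print(toy)
```

Because `rowMeans()` is called without `na.rm = TRUE`, a respondent who skipped any of the three questions has a missing composite score.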
The following variables were chosen as predictors:
“Subjective security” - the answer to the question “How secure do you feel these days?”, i.e. a subjective estimation of the feeling of safety and security. This variable will be treated as a factor with four levels: “Not at all secure”, “Not very secure”, “Quite secure” and “Very secure”.
“Income group” - the individual’s estimation of which income group his/her household belongs to (on a scale from 1 to 10).
“Perceived democracy” - the individual’s subjective estimation of how democratically the country where he/she currently lives is governed (on a scale from 1 to 10).
“Health state” - in the “World Values Survey” the variable “Health” distinguishes five health groups. However, I would like to explore the issue of health and happiness in more general terms. That is why I made the binary variable “Health state”, which combines respondents who described their health as “Very Good”, “Good” and “Fair” into the general category “Good”, while respondents who described their health as “Very poor” and “Poor” were combined into the category “Poor”.
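As a minimal sketch of how a coded answer can be turned into a labelled factor with a fixed level order, the example below uses `factor()` with explicit `levels` and `labels`. The vector `security_codes` is hypothetical, and the mapping of codes 1 to 4 onto the four labels is an assumption about the inverted scale.

```r
# Hypothetical coded answers (1 = least secure, 4 = most secure is assumed)
security_codes <- c(3, 4, 2, 1, 3, NA)

# Attach labels and fix the level order in one step
security <- factor(
  security_codes,
  levels = 1:4,
  labels = c("Not at all secure", "Not very secure",
             "Quite secure", "Very secure")
)

# Frequency table, including missing answers if there are any
print(table(security, useNA = "ifany"))
```

Fixing the level order explicitly means that the first level, “Not at all secure”, is the reference category in a regression, independent of alphabetical sorting.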
RQ: Which factor best predicts the happiness of individuals in Germany: health state, subjective security, perceived democracy in the country or income group?
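library(readr)   # read_csv()
library(dplyr)   # filter(), select(), rename()
library(ggplot2) # plots
library(psych)   # describe()
df <- read_csv("~/DA_data/World_Values_Survey_Wave_7_Inverted_csv_v1_6_2.csv") # data set with already inverted scales
dfGer <- filter(df, B_COUNTRY == "276") # keep respondents from Germany only
dfGer_1 <- dplyr::select(dfGer, c("B_COUNTRY", "Q48", "Q49", "Q50", "Q131P", "Q47P", "Q288", "Q251"))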
dfGer_1 <- rename(dfGer_1,
  "country" = "B_COUNTRY",
  "Security" = "Q131P",
  "Health" = "Q47P",
  "inc_group" = "Q288",
  "Democracy" = "Q251"
)
dfGer_1$inc_group <- as.numeric(dfGer_1$inc_group)
dfGer_1$Democracy <- as.numeric(dfGer_1$Democracy)
dfGer_1$Happy <- rowMeans(dfGer_1[, c("Q48", "Q49", "Q50")]) # Happiness
dfGer_1$Security <- as.factor(dfGer_1$Security)
dfGer_1$Security <- gsub("4", "Very secure", dfGer_1$Security) #data already with inverted scales
dfGer_1$Security <- gsub("3", "Quite secure", dfGer_1$Security)
dfGer_1$Security <- gsub("2", "Not very secure", dfGer_1$Security)
dfGer_1$Security <- gsub("1", "Not at all secure", dfGer_1$Security)
dfGer_1$Health_state <- ifelse(dfGer_1$Health <= 3, 0, 1) # Health state: codes 1-3 -> 0, codes 4-5 -> 1
dfGer_1$Health_state <- gsub("0", "Poor", dfGer_1$Health_state)
dfGer_1$Health_state <- gsub("1", "Good", dfGer_1$Health_state)
dfGer_1$Health_state <- as.factor(dfGer_1$Health_state)
Hypothesis 1: All predictors are expected to have a statistically significant positive influence on the variable “Happiness” for respondents in Germany.
Hypothesis 2: The variable “Income group” is expected to have the strongest influence on “Happiness” among all predictors.
Hypothesis 3: The variable “Perceived democracy” is expected to have the smallest effect on “Happiness” compared with the other predictors.
Hypothesis 4: A non-linear effect is expected for the variable “Income group”.
dfGer_1$country <- as.character(dfGer_1$country)
dfGer_1 = subset(dfGer_1, select = -c(Health, Q48, Q49, Q50)) #removing variables which were used to construct other variables
describe(dfGer_1)
## vars n mean sd median trimmed mad min max range skew
## country* 1 1528 1.00 0.00 1.00 1.00 0.00 1.00 1 0.00 NaN
## Security* 2 1526 3.13 0.65 3.00 3.17 0.00 1.00 4 3.00 -0.29
## inc_group 3 1474 5.19 1.69 5.00 5.26 1.48 1.00 10 9.00 -0.34
## Democracy 4 1502 7.58 1.94 8.00 7.77 1.48 1.00 10 9.00 -0.93
## Happy 5 1506 7.33 1.48 7.67 7.42 1.48 1.67 10 8.33 -0.61
## Health_state* 6 1525 1.35 0.48 1.00 1.31 0.00 1.00 2 1.00 0.62
## kurtosis se
## country* NaN 0.00
## Security* -0.03 0.02
## inc_group -0.19 0.04
## Democracy 0.82 0.05
## Happy 0.35 0.04
## Health_state* -1.61 0.01
The maximum value of the variable “Happiness” is 10. The mean value of happiness is 7.33 points, the median is 7.67 and the minimum is 1.67. There are 22 observations with NAs (1506 valid answers out of 1528). The distribution is not normal: it is moderately skewed to the left (skew = -0.61). The mean value of the variable “Perceived democracy” is 7.58, the median is 8, the maximum is 10 and the minimum is 1. The number of NAs is 26. The distribution is not normal: it is skewed to the left (skew = -0.93).
The mean value of the variable “Income group” is 5.19 and the median is 5. The maximum is 10 and the minimum is 1. The number of NAs is 54. The distribution is slightly skewed to the left (skew = -0.34).
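The direction of skew can be read from the sign of the skewness coefficient: a negative value means a longer tail on the left, a positive value a longer tail on the right. The sketch below uses one common moment-based definition; the function `sample_skewness` and both vectors are hypothetical, and packages may apply small-sample corrections that change the value slightly but not the sign.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical scores with a long tail towards low values
left_tailed <- c(2, 6, 7, 7, 8, 8, 8, 9, 9, 10)
# Mirror image of the same scores: long tail towards high values
right_tailed <- 12 - left_tailed

skew_left <- sample_skewness(left_tailed)    # negative
skew_right <- sample_skewness(right_tailed)  # positive, same magnitude
print(c(left = skew_left, right = skew_right))
```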
To explore the variable “Health state”, a bar plot was built. Among the respondents there are more individuals with a “Good” state of health than with a “Poor” one. The number of NAs for this variable is 3.
ggplot(data = dfGer_1, aes(Health_state, color = Health_state, fill = Health_state)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Health state groups", x = "Health group", y = "Number of respondents") +
  theme_bw()
As for the variable “Security”, it has four levels, as mentioned above. According to the plot below, the most frequent answer is “Quite secure”, followed by “Very secure”, which means that in general respondents in Germany perceive their environment as secure. Only a few answers fall into the category “Not at all secure”. Besides, there are only 2 NAs in this variable.
ggplot(data = dfGer_1, aes(Security, color = Security, fill = Security)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Groups of perceived security levels", x = "Security", y = "Number of respondents") +
  theme_bw()
There are not many NAs in the dataset, so removing them will not strongly affect the results. After removing NAs, the dataset contains 1439 observations.
## # A tibble: 1 x 1
## n
## <int>
## 1 1439
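Listwise removal of missing values can be sketched in base R as follows. The data frame `toy` is hypothetical, and this is only one possible way to drop incomplete rows, not necessarily the call used for the output above.

```r
# Hypothetical data with missing values in different columns
toy <- data.frame(
  Happy = c(7.3, NA, 8.0, 6.7, 9.0),
  inc_group = c(5, 4, NA, 6, 7),
  Security = c("Quite secure", "Very secure", "Quite secure", NA, "Very secure")
)

# Number of missing values per column
n_missing <- colSums(is.na(toy))

# Keep only rows without any missing value
toy_complete <- toy[complete.cases(toy), ]

print(n_missing)
print(nrow(toy_complete))
```

A row is dropped as soon as one of its variables is missing, so the final sample can be smaller than the smallest per-variable count of valid answers.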
To explore the relations between the numeric predictors and the outcome variable, a correlation plot was built.
From the correlation plot it is seen that:
- All variables have small but statistically significant positive correlations with each other.
- The highest correlation is between the variables “Happiness” and “Income group”; it is quite close to medium, but it is still defined as weak.
To explore the relationship between the factor variable “Health state” and the numeric variable “Happiness” I am going to apply a t-test, because the variable “Health state” has only two groups of respondents (“Good” and “Poor”). The distribution of the variable “Happiness” is not normal, however the sample is big enough to conduct the test.
ggplot(dfGer_1, aes(Happy, color = Health_state, fill = Health_state)) +
geom_density(alpha = 0.5) +
labs(title = "Happiness by health state group",x = "Happiness", y = "Density")+
theme_bw()
##
## Bartlett test of homogeneity of variances
##
## data: dfGer_1$Happy by dfGer_1$Health_state
## Bartlett's K-squared = 38.052, df = 1, p-value = 6.889e-10
According to the Bartlett test, the variances of the two groups are not equal, so Welch’s correction is needed in the t-test for these two variables that follows.
##
## Welch Two Sample t-test
##
## data: dfGer_1$Happy by dfGer_1$Health_state
## t = 11.323, df = 838.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7856022 1.1150797
## sample estimates:
## mean in group Good mean in group Poor
## 7.649609 6.699268
The p-value is < 0.05, which means that there is a statistically significant difference in the mean of “Happiness” between the “Good” and “Poor” health state groups. The mean value of happiness for the good health state group is 7.65, while for the poor health state group it is 6.70.
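The two-step procedure (variance check, then Welch’s t-test) can be sketched with base R functions. The data below are simulated and hypothetical; the group sizes, means and standard deviations are assumptions chosen only to make the example run.

```r
set.seed(42)

# Simulated (hypothetical) happiness scores for two health groups
toy <- data.frame(
  Health_state = factor(rep(c("Good", "Poor"), times = c(300, 150))),
  Happy = c(rnorm(300, mean = 7.6, sd = 1.2),
            rnorm(150, mean = 6.7, sd = 1.8))
)

# Step 1: Bartlett test of homogeneity of variances
bt <- bartlett.test(Happy ~ Health_state, data = toy)

# Step 2: t-test without assuming equal variances (Welch's correction)
welch <- t.test(Happy ~ Health_state, data = toy, var.equal = FALSE)

print(bt$p.value)
print(welch)
```

With `var.equal = FALSE` the degrees of freedom are estimated from the group variances, which is why they are usually not a whole number.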
To explore the relationship between the factor variable “Subjective security” and the numeric variable “Happiness” I am going to apply ANOVA in order to identify whether the mean of “Happiness” differs across the four groups.
ggplot(dfGer_1, aes(Happy, color = Security, fill = Security)) +
  geom_density(alpha = 0.5) +
  labs(title = "Happiness by groups of perceived security", x = "Happiness", y = "Density") +
  theme_bw()
The plot shows that the distribution of the variable “Happiness” is not normal within the groups. However, as mentioned above, the size of the sample allows comparison tests to be conducted.
oneway.test(dfGer_1$Happy ~ dfGer_1$Security, var.equal = FALSE) # Welch's correction: equal variances are not assumed
##
## One-way analysis of means (not assuming equal variances)
##
## data: dfGer_1$Happy and dfGer_1$Security
## F = 19.246, num df = 3.000, denom df = 33.857, p-value = 1.845e-07
According to the ANOVA, there are statistically significant differences in the mean of “Happiness” across the groups of “Subjective security”.
To search for the best model I chose the forward method. First, I add the predictor “Health state” to the model.
Then I add the variable “Subjective security”.
## Analysis of Variance Table
##
## Model 1: Happy ~ Health_state
## Model 2: Happy ~ Health_state + Security
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1437 2869.3
## 2 1434 2749.3 3 119.95 20.855 3.165e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test gave a p-value < 0.05, which means that the result is statistically significant and that adding the variable “Subjective security” made the model better.
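The F statistic in such a comparison of nested models can be recomputed by hand from the table: the reduction in the residual sum of squares per added parameter is divided by the residual variance of the larger model. The sketch below uses the rounded values printed in the table above, so the result agrees with the reported F only up to rounding; the variable names are hypothetical.

```r
# Values copied from the Analysis of Variance Table (rounded as printed)
sum_sq_added   <- 119.95  # reduction in RSS after adding Security
df_added       <- 3       # Security adds three dummy coefficients
rss_large      <- 2749.3  # RSS of the larger model
df_resid_large <- 1434    # residual degrees of freedom of the larger model

# F = (reduction in RSS per added parameter) / (residual variance)
f_stat <- (sum_sq_added / df_added) / (rss_large / df_resid_large)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_added, df_resid_large, lower.tail = FALSE)

print(round(f_stat, 3))
print(p_value)
```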
In the next step I add another predictor, the variable “Income group”.
## Analysis of Variance Table
##
## Model 1: Happy ~ Health_state + Security
## Model 2: Happy ~ Health_state + Security + inc_group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1434 2749.3
## 2 1433 2324.2 1 425.19 262.16 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is smaller than 0.05, which means that the result of the ANOVA is statistically significant and the newly added predictor improved the model.
Next, I add the predictor “Perceived democracy” to the third model and compare the new model with the previous one.
## Analysis of Variance Table
##
## Model 1: Happy ~ Health_state + Security + inc_group
## Model 2: Happy ~ Health_state + Security + inc_group + Democracy
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1433 2324.2
## 2 1432 2277.1 1 47.085 29.611 6.204e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result is statistically significant (p-value < 0.05), which means that the new predictor made the model better.
So, the fourth model is the best among the models obtained.
##
## Call:
## lm(formula = Happy ~ Health_state + Security + inc_group + Democracy,
## data = dfGer_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7233 -0.8067 0.0805 0.8102 3.6504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.67671 0.46278 7.945 3.9e-15 ***
## Health_statePoor -0.61393 0.07208 -8.517 < 2e-16 ***
## SecurityNot very secure 1.12322 0.45695 2.458 0.014087 *
## SecurityQuite secure 1.44777 0.45164 3.206 0.001378 **
## SecurityVery secure 1.67105 0.45483 3.674 0.000248 ***
## inc_group 0.31866 0.02030 15.695 < 2e-16 ***
## Democracy 0.09812 0.01803 5.442 6.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261 on 1432 degrees of freedom
## Multiple R-squared: 0.2804, Adjusted R-squared: 0.2774
## F-statistic: 92.99 on 6 and 1432 DF, p-value: < 2.2e-16
The model is statistically significant, and so are all predictors. The intercept (3.68) is the predicted happiness for a person in good health who feels “Not at all secure”, with “Income group” and “Perceived democracy” both equal to 0. Since both scales start at 1, the intercept lies outside the observed range; for the lowest observed values (income group 1 and perceived democracy 1) the predicted happiness is 3.68 + 0.32 + 0.10 = 4.09 points. With the other predictors held constant, a poor state of health decreases predicted happiness by 0.61 points. If the income group increases by 1 point, “Happiness” increases by 0.32 points. An increase of “Perceived democracy” by 1 point adds 0.098 points to “Happiness”. If a person thinks that his/her level of security is “Not very secure” rather than “Not at all secure”, i.e. the situation starts to improve, the predicted value of happiness increases by 1.12 points. If the perceived security is “Quite secure”, the predicted value of happiness is 1.45 points higher than for “Not at all secure”, and if the individual perceives the environment around him/her as “Very secure”, it is 1.67 points higher. So people who perceive their environment as secure are happier than individuals who perceive it as not secure.
Overall, the model explains 27.74% of the variance of the variable “Happiness” among respondents in Germany (adjusted R-squared).
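To make the interpretation of the coefficients concrete, the sketch below recomputes predicted values by hand from the estimates in the summary table. The helper `predict_happy` and the example respondents are hypothetical; only the coefficients are taken from the output above.

```r
# Security effects relative to the reference level "Not at all secure"
security_effect <- c(
  "Not at all secure" = 0,
  "Not very secure"   = 1.12322,
  "Quite secure"      = 1.44777,
  "Very secure"       = 1.67105
)

# Hypothetical helper: linear prediction from the reported coefficients
predict_happy <- function(poor_health, security, inc_group, democracy) {
  3.67671 -
    0.61393 * poor_health +
    security_effect[[security]] +
    0.31866 * inc_group +
    0.09812 * democracy
}

# Lowest observed categories: good health, not at all secure, scales at 1
lowest <- predict_happy(FALSE, "Not at all secure", inc_group = 1, democracy = 1)

# A hypothetical respondent: good health, quite secure, income 5, democracy 8
typical <- predict_happy(FALSE, "Quite secure", inc_group = 5, democracy = 8)

print(round(c(lowest = lowest, typical = typical), 2))
```

The first value shows that the prediction for the lowest observed categories is higher than the intercept, because both numeric scales start at 1 rather than 0.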
## GVIF Df GVIF^(1/(2*Df))
## Health_state 1.067015 1 1.032964
## Security 1.084400 3 1.013596
## inc_group 1.065029 1 1.032002
## Democracy 1.105492 1 1.051424
There are no values above 5, which means that the model does not suffer from multicollinearity, i.e. no predictors are strongly correlated with each other.
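The last column of the table rescales the generalized VIF by the number of coefficients of each term, which makes a factor with several levels comparable to a single numeric predictor. A short sketch that recomputes this column from the two columns printed above (the object names are hypothetical):

```r
# GVIF and degrees of freedom copied from the table above
gvif <- c(Health_state = 1.067015, Security = 1.084400,
          inc_group = 1.065029, Democracy = 1.105492)
df_term <- c(Health_state = 1, Security = 3, inc_group = 1, Democracy = 1)

# Adjusted value: GVIF^(1 / (2 * Df)); for Df = 1 this is the square root
adjusted <- gvif^(1 / (2 * df_term))

print(round(adjusted, 6))
```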
Looking at the distribution of the residuals on the “Q-Q plot”, it can be said that in general the residuals follow the theoretical line quite well. However, the model is not able to explain very high and very low values of “Happiness”, where the residuals deviate from the line. According to the leverage plot, there is also a small set of outliers that influence the regression model. The plot “Residuals vs. Fitted” shows that not all residuals are distributed along the red line and that many residuals lie far from 0, so it can be assumed that there is heteroscedasticity, which also corresponds with the many “check marks” on the plot “Scale-Location”.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 56.44234, Df = 1, p = 5.7871e-14
##
## studentized Breusch-Pagan test
##
## data: m4
## BP = 53.918, df = 6, p-value = 7.664e-10
Both tests of homoscedasticity have a p-value < 0.05, which means that the residuals of the model are heteroscedastic. I am going to try a Box-Cox transformation in order to correct the heteroscedasticity in the model and check whether it helps.
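The idea behind the studentized Breusch-Pagan test can be sketched in base R: the squared residuals are regressed on the predictors, and the sample size multiplied by the R-squared of this auxiliary regression is compared with a chi-squared distribution. The data below are simulated and hypothetical, with an error spread that deliberately grows with `x`.

```r
set.seed(1)

# Simulated (hypothetical) data: the error spread increases with x
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary model
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1  # number of predictors in the auxiliary model
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p = bp_p))
```

A small p-value indicates that the residual variance depends on the predictors, i.e. heteroscedasticity.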
## Box-Cox Transformation
##
## 1439 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.667 6.333 7.667 7.319 8.333 10.000
##
## Largest/Smallest: 6
## Sample Skewness: -0.619
##
## Estimated Lambda: 1.9
dfGer_1$Happy1 <- predict(happy_dfG_mod, dfGer_1$Happy)
m4_1 <- lm(Happy1 ~ Health_state + Security + Democracy + inc_group, data = dfGer_1)
car::ncvTest(m4_1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.147618, Df = 1, p = 0.0075064
The p-value is still < 0.05, which means that the transformation did not solve the problem of heteroscedasticity. That is why I will continue to work with the original model.
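For reference, the plain Box-Cox transformation for a given lambda can be written in a few lines. This is a sketch of the textbook formula only: the helper `box_cox` is hypothetical, and the fitted transformation object used above may additionally standardize the result. The example values are the summary statistics of “Happiness” printed in the Box-Cox output.

```r
# Textbook Box-Cox transformation for strictly positive values
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) {
    log(y)                     # limiting case lambda = 0
  } else {
    (y^lambda - 1) / lambda
  }
}

# Minimum, quartiles and maximum of Happiness from the output above
happy_scores <- c(1.667, 6.333, 7.667, 8.333, 10)
transformed <- box_cox(happy_scores, lambda = 1.9)

print(round(transformed, 2))
```

A lambda above 1 stretches the upper end of the scale more than the lower end, which pulls a left-skewed distribution towards symmetry.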
The variable “Income group” was measured on a 10-point scale and treated in the model as numeric, so there are quite a lot of groups that lie in the “middle income” range, and it could be hard for respondents to distinguish between these categories. That is why I decided to check whether the variable “Income group” has a non-linear effect on happiness.
library(mgcv)
m4_nl <- gam(Happy ~ Health_state + Security + s(inc_group) + Democracy, data = dfGer_1)
summary(m4_nl)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Happy ~ Health_state + Security + s(inc_group) + Democracy
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.29868 0.46012 11.516 < 2e-16 ***
## Health_statePoor -0.61012 0.07169 -8.511 < 2e-16 ***
## SecurityNot very secure 1.16478 0.45642 2.552 0.010814 *
## SecurityQuite secure 1.48838 0.45078 3.302 0.000985 ***
## SecurityVery secure 1.72798 0.45368 3.809 0.000146 ***
## Democracy 0.09643 0.01793 5.377 8.83e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(inc_group) 4.549 5.632 47.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.286 Deviance explained = 29.1%
## GCV = 1.5817 Scale est. = 1.5701 n = 1439
According to the model, the smooth term for “Income group” is statistically significant (p-value < 0.05) and has about 4.5 effective degrees of freedom, which points to a non-linear effect of “Income group” on “Happiness”. Besides, judging by the adjusted R-squared, the new model with the non-linear effect explains 28.6% of the variance of the variable “Happiness” in the sample, which is higher than the 27.74% explained by the model without non-linearity.
## Analysis of Variance Table
##
## Model 1: Happy ~ Health_state + Security + inc_group + Democracy
## Model 2: Happy ~ Health_state + Security + s(inc_group) + Democracy
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1432.0 2277.1
## 2 1428.5 2242.8 3.5487 34.249 6.1468 0.0001449 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05. It means that the model with the non-linear effect of the variable “Income group” predicts “Happiness” better than the model without this effect.
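The comparison of a linear and a smooth term can be sketched on simulated data with `mgcv`. Everything below is hypothetical: the data are generated from a saturating curve, both models are fitted with `gam()` so that they are directly comparable, and `k = 5` is an assumption that keeps the basis dimension below the ten distinct values of the predictor.

```r
library(mgcv)
set.seed(123)

# Simulated (hypothetical) data: happiness rises quickly at low income
# groups and then levels off
n <- 600
inc <- sample(1:10, n, replace = TRUE)
happy <- 5 + 3 * (1 - exp(-0.6 * inc)) + rnorm(n, sd = 0.5)
toy <- data.frame(inc, happy)

# Linear term versus smooth term for the same predictor
m_lin <- gam(happy ~ inc, data = toy)
m_smooth <- gam(happy ~ s(inc, k = 5), data = toy)

# Effective degrees of freedom of the smooth: 1 would be a straight line
edf_smooth <- summary(m_smooth)$edf

# Lower AIC indicates the better trade-off between fit and complexity
aic_values <- c(linear = AIC(m_lin), smooth = AIC(m_smooth))

print(edf_smooth)
print(aic_values)
```

Effective degrees of freedom clearly above 1, together with a lower AIC for the smooth model, are the kind of evidence that supports a non-linear effect.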
To explore this in more detail, I built a plot.
ggplot(dfGer_1, aes(inc_group, Happy)) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))
According to the plot, “Happiness” increases with income group for individuals with lower income: for the lowest income groups (1-4 on the scale) “Happiness” rises from 5 to 7 points. Then, for groups 4-7, the changes in “Happiness” are quite small. For these “middle” groups, according to the plot, “Happiness” lies between 7 and 8 points. For the higher-income groups the plot shows a general upward trend in “Happiness”, however, because of the relatively broad confidence intervals, it can be assumed that the changes are not that strong.
From the course “Sociological theory” I know that an individual’s social class or income group can be connected with their health. That is why in this project I decided to examine whether there is an interaction effect between the variables “Income group” and “Health state” and to explore it.
##
## Call:
## lm(formula = Happy ~ Health_state * inc_group + Security + Democracy,
## data = dfGer_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6901 -0.7778 0.0616 0.8228 3.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.08076 0.47178 8.650 < 2e-16 ***
## Health_statePoor -1.42553 0.21850 -6.524 9.46e-11 ***
## inc_group 0.25741 0.02551 10.091 < 2e-16 ***
## SecurityNot very secure 1.06942 0.45487 2.351 0.01886 *
## SecurityQuite secure 1.39037 0.44961 3.092 0.00202 **
## SecurityVery secure 1.61292 0.45279 3.562 0.00038 ***
## Democracy 0.09636 0.01795 5.370 9.20e-08 ***
## Health_statePoor:inc_group 0.16191 0.04118 3.932 8.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.255 on 1431 degrees of freedom
## Multiple R-squared: 0.2881, Adjusted R-squared: 0.2846
## F-statistic: 82.72 on 7 and 1431 DF, p-value: < 2.2e-16
The model is statistically significant, as is the interaction effect between health state (category “Poor”) and income group. The intercept (4.08) is the predicted value of “Happiness” for people in a good state of health who feel “Not at all secure”, with “Income group” and “Perceived democracy” both equal to 0, which lies below the observed range of both scales. The coefficient for a poor state of health (-1.43) is the difference between the two health groups at income group 0; because of the interaction, this difference becomes smaller as the income group rises. An increase of “Perceived democracy” by 1 point adds 0.096 points to “Happiness”. For individuals who perceive their security as “Not very secure”, 1.07 points are added to “Happiness” compared with “Not at all secure”. If respondents think that they are “Quite secure”, “Happiness” increases by 1.39 points, and “Very secure” perceived security adds 1.61 points to predicted happiness.
For people in good health, an increase of “Income group” by 1 point increases “Happiness” by 0.26 points. The interaction coefficient is positive (0.16), so for a person in poor health the same increase adds 0.26 + 0.16 = 0.42 points to “Happiness”, i.e. the income group matters more for respondents in poor health.
Overall, the model explains 28.46% of the variance of the variable “Happiness” among respondents in Germany, which is more than the model without the interaction effect.
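The meaning of the interaction term can be checked with a few lines of arithmetic on the coefficients from the summary table. The variable names and the helper `gap_poor_vs_good` are hypothetical; the crossover point is only an extrapolation of the linear model and not a result reported above.

```r
# Coefficients copied from the model with the interaction
b_poor  <- -1.42553  # Health_statePoor (difference at income group 0)
b_inc   <-  0.25741  # inc_group (slope for the good health group)
b_inter <-  0.16191  # Health_statePoor:inc_group

# Income slope in each health group
slope_good <- b_inc
slope_poor <- b_inc + b_inter

# Predicted difference (poor minus good health) at a given income group
gap_poor_vs_good <- function(inc_group) b_poor + b_inter * inc_group

gap_low  <- gap_poor_vs_good(1)
gap_high <- gap_poor_vs_good(10)

# Income group at which the two predicted lines would cross
crossover <- -b_poor / b_inter

print(round(c(slope_good = slope_good, slope_poor = slope_poor,
              gap_low = gap_low, gap_high = gap_high,
              crossover = crossover), 3))
```

The slope is steeper in the poor health group, so the predicted gap between the two groups is largest in the lowest income group and closes towards the top of the scale.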
## Analysis of Variance Table
##
## Model 1: Happy ~ Health_state + Security + inc_group + Democracy
## Model 2: Happy ~ Health_state * inc_group + Security + Democracy
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1432 2277.1
## 2 1431 2252.7 1 24.342 15.463 8.818e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, which means that adding the interaction effect between “Health state” and “Income group” made the model better compared with the model without this effect.
To explore the interaction further, I made an interaction plot.
library(lattice)
interaction.plot(dfGer_1$inc_group, dfGer_1$Health_state, dfGer_1$Happy,
  type = "l", col = c("blue", "red"),
  ylab = "Predicted happiness",
  xlab = "Income group")
For both health groups, happiness increases as the income group increases. However, for people with a poor state of health in income groups 4 to 7 the predicted value of “Happiness” does not change and stays at around 7. After that, starting from income group 7, the predicted “Happiness” rises sharply, and for higher income groups it comes very close to the predicted level of “Happiness” for people with a good state of health. Finally, for people from lower income groups with a poor state of health the predicted increase in happiness is steeper than for people with a good state of health.
Overall, for the respondents in Germany the factors “Subjective security”, “Perceived democracy”, “Health state” and “Income group” are all statistically significant predictors of the level of an individual’s happiness. This confirms Hypothesis 1.
The income group appeared to be the strongest predictor of happiness among the numeric variables used in this research. It means that for respondents in Germany their own financial well-being is more important than the level of democracy in the state. This confirms Hypothesis 2. Hypothesis 3 was also confirmed: the perceived democracy of the state’s governance is a significant but weaker predictor of happiness. Besides, for respondents in Germany the level of perceived security appears to be quite an important factor affecting happiness. People who perceive their security as very high are in general happier than individuals who perceive their security as low. Overall, it means that for individuals in Germany democracy in politics is an important factor for happiness, but more “personal” factors such as their income and their own feeling of security are more valuable.
Hypothesis 4 was confirmed: a non-linear effect exists for the predictor “Income group”. The non-linear effect shows that for individuals from the middle-income groups, moving between these groups hardly changes the level of “Happiness”.
The interaction effect between “Income group” and “Health state” is statistically significant. Its coefficient is positive, so an increase in income group raises predicted happiness more for individuals with a poor state of health than for those with a good state of health, and the gap between the two health groups narrows in the higher income groups. At the same time, according to the interaction plot, for individuals with poor health in the middle-income groups the level of happiness stays roughly constant. It means that for this category of respondents health becomes the stronger factor affecting their happiness, so a slight increase in their income would not make their feeling of happiness higher.