Happiness in USA

Here I present our team’s collective work on happiness, to which everyone contributed, followed by individual interpretations.

The team’s members:

Anastasia Vlasenko
Artem Kulikov
Nadezhda Bykova
Anna Gorobtsova

Introduction

For this analysis we chose the USA because we were already familiar with its stratification and social structure from our sociological theory classes. This made it easier for us to think of variables that could be meaningful for explaining happiness.

American society is highly self-oriented, and the USA is often referred to as an individualistic country. We also put emphasis on the American Dream and the values it portrays. Throughout the process of rotating predictors, we tried to match the variables available to us to the ideas of individualism and the American Dream. Thus, we will try to explain happiness through having freedom of choice (individualism) and the belief that hard work brings its own rewards (American Dream).

We are interested in how the culture-specific variables enhance the model based on predictors that, basically, form a foundation for explaining happiness in any country: financial situation and health status.

The additional predictors will include a respondent’s age and gender.

Therefore, the main question we will try to answer is whether there is an association between individualistic values and happiness in the USA.

We think it is important to study happiness from a perspective other than the Russian one, to see how it varies across countries and to understand whether the differences can have a cultural explanation.

Data preparation

We start off as usual, by loading the libraries and the dataset. The data we use is the World Values Survey of 2005-2006, which covers various countries; from these we chose the USA.

library(foreign)
library(ggplot2)
library(knitr)
library(sjPlot)
library(corrplot)
library(gridExtra)
library(dplyr)
library(car)
library(psych)

wvs <- read.spss("wvs.sav", to.data.frame = TRUE, use.value.labels = TRUE)

Now we can start creating the happiness index. For the happiness index we will be using the satisfaction measure and the perceived happiness measure. We sum both and create a new column.

Recoding the satisfaction variable to match the happiness levels:

wvs$sat <- rep(NA, length(wvs$V22))

wvs$sat[wvs$V22 == "Dissatisfied" |
          wvs$V22 == "2"] <- "1"

wvs$sat[wvs$V22 == "3" |
          wvs$V22 == "4" |
          wvs$V22 == "5"] <- "2" 

wvs$sat[wvs$V22 == "6" |
          wvs$V22 == "7" |
          wvs$V22 == "8"] <- "3" 

wvs$sat[wvs$V22 == "Satisfied" |
          wvs$V22 == "9"] <- "4"

Recoding happiness variable:

wvs$V10 <- ifelse(wvs$V10 =="Not at all happy", 1,
                            ifelse(wvs$V10 =="Not very happy", 2,
                                   ifelse(wvs$V10 =="Quite happy", 3,
                                          ifelse(wvs$V10 =="Very happy", 4, NA))))

Creating index of happiness:

wvs$hapIND <- as.numeric(wvs$V10) + as.numeric(wvs$sat)
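
The nested ifelse() recoding above can also be written with named lookup vectors. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, not WVS records) and the hypothetical names `happy_levels` and `sat_levels`, and it shows how the two 1-4 components add up to an index that ranges from 2 to 8.

```r
# Toy answers for illustration only; these are not WVS records.
toy <- data.frame(
  V10 = c("Very happy", "Quite happy", "Not very happy", "Not at all happy"),
  V22 = c("Satisfied", "7", "4", "Dissatisfied"),
  stringsAsFactors = FALSE
)

# Named lookup vectors: answer label -> score (same mapping as the recoding above).
happy_levels <- c("Not at all happy" = 1, "Not very happy" = 2,
                  "Quite happy" = 3, "Very happy" = 4)
sat_levels <- c("Dissatisfied" = 1, "2" = 1,
                "3" = 2, "4" = 2, "5" = 2,
                "6" = 3, "7" = 3, "8" = 3,
                "9" = 4, "Satisfied" = 4)

# Indexing by a character label returns its score; unknown labels become NA.
# (A factor column would need as.character() first, otherwise it is
# indexed by its integer codes.)
toy$hap <- unname(happy_levels[toy$V10])
toy$sat <- unname(sat_levels[toy$V22])

# The index is the sum of both components, so it can run from 2 to 8.
toy$hapIND <- toy$hap + toy$sat
print(toy)
```

Because both components are numeric from the start, no as.numeric() conversion is needed before summing.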

NB! We subset the country only later, because subsetting earlier caused too many problems.

Now we finally subset our final dataset and recode the variables we will be using.

# Recode the two labelled end points; the middle answers fall through to wvs$V68
wvs$V68 <- ifelse(wvs$V68 == "Completely satisfied",10,
                        ifelse(wvs$V68 =="Completely dissatisfied",1,
                               wvs$V68))

wvs$V11 <- ifelse(wvs$V11=="Very good", 4,
                     ifelse(wvs$V11=="Good",3,
                            ifelse(wvs$V11=="Fair", 2,
                                   ifelse(wvs$V11=="Poor", 1, NA))))       
 
wvs$V46 <- ifelse (wvs$V46=="None at all",1,
                      ifelse(wvs$V46=="A great deal",10,
                             wvs$V46))

wvs$V120 <- ifelse(wvs$V120=="Hard work doesn't generally bring success - it's more a matter of luck and connections",10,
                      ifelse(wvs$V120=="In the long run, hard work usually brings a better life",1,
                             wvs$V120))

wvsUSA <- subset(wvs, V2 == "USA")
# Columns kept for the analysis
keep_vars <- c("V10", "V11", "V2", "V68", "V120", "V46", "V22", "hapIND", "V237","V239", "V235")
data1 <- wvsUSA[keep_vars]
data1 <- na.omit(data1)

wvsUSA1 <- data1

Let’s explore the data!

describe(wvsUSA1)

##        vars    n  mean    sd median trimmed   mad min max range  skew
## V10       1 3939  3.32  0.63      3    3.38  0.00   1   4     3 -0.56
## V11       2 3939  3.13  0.82      3    3.21  1.48   1   4     3 -0.66
## V2*       3 3939 11.00  0.00     11   11.00  0.00  11  11     0   NaN
## V68       4 3939  6.55  2.44      7    6.72  2.97   1  10     9 -0.55
## V120      5 3939  3.57  2.35      3    3.29  2.97   1  10     9  0.84
## V46       6 3939  7.63  1.89      8    7.81  1.48   1  10     9 -0.84
## V22*      7 3939  7.63  1.84      8    7.81  1.48   1  10     9 -0.95
## hapIND    8 3939  6.51  1.13      6    6.58  1.48   2   8     6 -0.47
## V237*     9 3939 33.47 17.71     31   32.78 20.76   3  80    77  0.31
## V239*    10 3939 20.87  7.92     19   19.65  2.97   1  95    94  3.70
## V235*    11 3939  1.50  0.50      2    1.50  0.00   1   2     1 -0.01
##        kurtosis   se
## V10        0.23 0.01
## V11       -0.19 0.01
## V2*         NaN 0.00
## V68       -0.38 0.04
## V120      -0.03 0.04
## V46        0.60 0.03
## V22*       0.82 0.03
## hapIND     0.07 0.02
## V237*     -0.84 0.28
## V239*     21.83 0.13
## V235*     -2.00 0.01

The data has 3939 observations and 11 variables (after deleting NAs).

Now let’s see how our index is distributed. Along the way we will also check the number of observations in each category.

ggplot(wvsUSA1, aes(x = hapIND)) +
  geom_bar(fill = "pink") +
  xlab("Happiness index, from low to high") +
  ylab("Number of observations") +
  ggtitle("Happiness index distribution for USA") +
  geom_vline(aes(xintercept = mean(wvsUSA1$hapIND), colour="Mean"), lwd=1.1 )

The distribution is slightly left-skewed (skew = -0.47), but it can still be worked with.

Let’s see what financial satisfaction looks like on the graph:

ggplot(wvsUSA1, aes(x=V68)) +
  geom_bar(fill = "pink") +
  xlab("Financial satisfaction, from low to high") +
  ylab("Number of observations") +
  ggtitle("Financial status distribution for USA") +
  geom_vline(aes(xintercept = mean(wvsUSA1$V68), colour="Mean"), lwd=1.1 )

The distribution is close enough to symmetric (skew = -0.55), so we can move on to health.

We have to be careful here because health was coded from high (1) to low (4). However, we changed that at the recoding stage so now it is from poor to very good.

ggplot(wvsUSA1) +
  geom_boxplot(aes(x = factor(V11), y = hapIND)) +
  xlab("Health state, from low to high") +
  ylab("Happiness index") +
  ggtitle("Health state and happiness level across USA") + 
  scale_x_discrete(labels = c('Poor', 'Fair', 'Good', "Very good")) +
  theme_bw()

Here we can clearly see that healthier people are, on average, happier.

ggplot(wvsUSA1, aes(x=V237)) +
  geom_histogram(fill = "pink") +
  xlab("Age in years") +
  ylab("Number of observations") +
  ggtitle("Age distribution in USA") +
  geom_vline(aes(xintercept = mean(wvsUSA1$V237), colour="Mean"), lwd=1.1 )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The age distribution graph seems normal.

And now let’s check what the culture-specific variables look like.

ggplot(wvsUSA1, aes(x=V46)) +
  geom_bar(fill = "pink") +
  xlab("Freedom of choice, from low to high") +
  ylab("Number of observations") +
  ggtitle("Freedom of choice distribution for USA") +
  geom_vline(aes(xintercept = mean(wvsUSA1$V46), colour="Mean"), lwd=1.1 )

ggplot(wvsUSA1, aes(x=V120)) +
  geom_bar(fill = "pink") +
  xlab("1 = hard work brings a better life, 10 = success is luck and connections") +
  ylab("Number of observations") +
  ggtitle("Views on hard work and success, distribution for USA") +
  geom_vline(aes(xintercept = mean(wvsUSA1$V120), colour="Mean"), lwd=1.1 )

Again, we see some skews for both graphs but these are not drastic so we conclude we could move on with them.

Last but not least, we have to check whether the mean happiness index differs between males and females. We will use a t-test for that.

H0: There is no difference between the means of happiness index between women and men.

H1: There is some difference between the means of happiness index between women and men.

tres <- t.test(hapIND ~ V235, wvsUSA1)
tres

## 
##  Welch Two Sample t-test
## 
## data:  hapIND by V235
## t = -0.16142, df = 3930.9, p-value = 0.8718
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07621165  0.06461653
## sample estimates:
##   mean in group Male mean in group Female 
##             6.502548             6.508346

Since the p-value is large (0.87), we fail to reject H0: we find no evidence of a difference between the mean happiness of men and women. However, we will later build a model with an interaction effect to check whether gender matters in combination with other variables.

Hypotheses

After careful consideration and a look through the available data, we derived the following hypotheses:

H1: A model that includes culture-specific features explains a bigger share of variance than a model that has only “main” variables.

H2: A model that includes culture-specific features does not explain a bigger share of variance than a model that has only “main” variables.

Analysis

In this analysis we build up our regression models with the forward method, adding one predictor at a time. First, let us start with the simplest one: explaining happiness by health.

Model 1: health

model1 <- lm(hapIND ~ V11, data = wvsUSA1)
summary(model1)

## 
## Call:
## lm(formula = hapIND ~ V11, data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8760 -0.7208  0.1240  0.9090  2.2792 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.72078    0.08696  65.788  < 2e-16 ***
## V112         0.37027    0.09674   3.827 0.000131 ***
## V113         0.69425    0.09086   7.641  2.7e-14 ***
## V114         1.15525    0.09143  12.635  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.079 on 3935 degrees of freedom
## Multiple R-squared:  0.08402,    Adjusted R-squared:  0.08332 
## F-statistic: 120.3 on 3 and 3935 DF,  p-value: < 2.2e-16

According to the p-values, all coefficients are significant. Compared to people with “poor” health, people with “fair” health score 0.37 higher on the happiness index, people with good health 0.69 higher, and people with very good health 1.16 higher. The first model explains 8% of the variance. Generally, the model shows that the healthier people are, the happier they are.

Model 2: health + financial stability

model2 <- lm(hapIND ~ V11 + V68, data = wvsUSA1) 
summary(model2)

## 
## Call:
## lm(formula = hapIND ~ V11 + V68, data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0599 -0.6962  0.0257  0.7302  2.9725 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.665639   0.088248  52.869  < 2e-16 ***
## V112        0.337487   0.088536   3.812  0.00014 ***
## V113        0.584823   0.083244   7.025  2.5e-12 ***
## V114        0.944866   0.084013  11.247  < 2e-16 ***
## V68         0.180948   0.006542  27.658  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9875 on 3934 degrees of freedom
## Multiple R-squared:  0.2331, Adjusted R-squared:  0.2324 
## F-statistic:   299 on 4 and 3934 DF,  p-value: < 2.2e-16

According to the p-value, the result is significant: with health held constant, a one-unit increase in financial satisfaction is associated with a 0.18 higher happiness index. The second model explains 23% of the variance.

anova(model1,model2)

## Analysis of Variance Table
## 
## Model 1: hapIND ~ V11
## Model 2: hapIND ~ V11 + V68
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3935 4582.3                                  
## 2   3934 3836.4  1    745.96 764.95 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Having checked whether the second model is better than the first, we see that the p-value is smaller than 0.05 and, thus, the model is indeed better.
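
To make the comparison less of a black box, the F statistic in the table can be recomputed by hand from the residual sums of squares. The sketch below copies the rounded values from the anova() output above, so the result matches the table only up to rounding; the variable names are ours.

```r
# Values copied from the anova(model1, model2) table above.
rss_small <- 4582.3   # residual sum of squares, model 1
rss_large <- 3836.4   # residual sum of squares, model 2
df_diff   <- 1        # number of parameters added in model 2
df_resid  <- 3934     # residual degrees of freedom of model 2

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_val <- pf(f_stat, df_diff, df_resid, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_val < 0.05)
```

The same formula applies to every nested comparison in this report; only the RSS values and degrees of freedom change.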

Model 3: health + financial stability + freedom of choice

model3 <- lm(hapIND ~ V11 + V68 + V46, data = wvsUSA1) 
summary(model3)

## 
## Call:
## lm(formula = hapIND ~ V11 + V68 + V46, data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5183 -0.6004  0.0035  0.6583  3.1759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.749668   0.096293  38.940  < 2e-16 ***
## V112        0.271655   0.084571   3.212  0.00133 ** 
## V113        0.471257   0.079664   5.916 3.59e-09 ***
## V114        0.789231   0.080578   9.795  < 2e-16 ***
## V68         0.147872   0.006468  22.862  < 2e-16 ***
## V46         0.163722   0.008341  19.629  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9425 on 3933 degrees of freedom
## Multiple R-squared:  0.3016, Adjusted R-squared:  0.3007 
## F-statistic: 339.6 on 5 and 3933 DF,  p-value: < 2.2e-16

According to the p-value, the result is significant: each one-unit increase in freedom of choice is associated with a 0.16 higher happiness index, holding health and financial satisfaction constant. The third model explains 30% of the variance.

anova(model2,model3)

## Analysis of Variance Table
## 
## Model 1: hapIND ~ V11 + V68
## Model 2: hapIND ~ V11 + V68 + V46
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3934 3836.4                                  
## 2   3933 3494.1  1    342.29 385.29 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Having checked whether the third model is better than the second, we see that the p-value is smaller than 0.05 and, thus, the model is indeed better.

Model 4: health + financial stability + freedom of choice + hard work

model4 <- lm(hapIND ~ V11 + V68 + V46 + V120, data = wvsUSA1) 
summary(model4)

## 
## Call:
## lm(formula = hapIND ~ V11 + V68 + V46 + V120, data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5073 -0.6154  0.0222  0.6546  3.3403 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.899906   0.101179  38.545  < 2e-16 ***
## V112         0.280544   0.084365   3.325 0.000891 ***
## V113         0.483387   0.079491   6.081 1.31e-09 ***
## V114         0.796288   0.080376   9.907  < 2e-16 ***
## V68          0.144671   0.006486  22.305  < 2e-16 ***
## V46          0.159841   0.008359  19.122  < 2e-16 ***
## V120        -0.030478   0.006462  -4.716 2.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.94 on 3932 degrees of freedom
## Multiple R-squared:  0.3055, Adjusted R-squared:  0.3044 
## F-statistic: 288.2 on 6 and 3932 DF,  p-value: < 2.2e-16

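According to the p-value, the result is significant. Because V120 was coded from 1 (hard work usually brings a better life) to 10 (success is more a matter of luck and connections), the negative coefficient means that each one-point move away from believing in hard work is associated with a 0.03 lower happiness index. In other words, people who believe more strongly that hard work pays off tend to be slightly happier. The fourth model explains about 31% of the variance.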

anova(model3,model4)

## Analysis of Variance Table
## 
## Model 1: hapIND ~ V11 + V68 + V46
## Model 2: hapIND ~ V11 + V68 + V46 + V120
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3933 3494.1                                  
## 2   3932 3474.4  1    19.656 22.244 2.484e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Having checked whether the fourth model is better than the third, we see that the p-value is smaller than 0.05 and, thus, the model is indeed better.

The fourth model seems to be fitting the first hypothesis and shows that American happiness and individualistic values are indeed associated. Let us now check the rest of the assumptions to make sure the model is not “lying”.

par(mfrow=c(2,2))
plot(model4)

The first plot flags three potential outliers. The normal Q-Q plot looks fine, which suggests that the residuals are approximately normal. In the fourth plot (residuals vs leverage) no observation crosses the Cook’s distance lines, so the flagged points do not appear to distort the fit.

We also have to check for the absence of multicollinearity:

vif(model4)

##          GVIF Df GVIF^(1/(2*Df))
## V11  1.044684  3        1.007312
## V68  1.112133  1        1.054577
## V46  1.113914  1        1.055421
## V120 1.031354  1        1.015556

Each number is less than five which means that the assumption holds.

Lastly, we should check for heteroscedasticity (whether the variance of the residuals is constant across the fitted values):

ncvTest(model4)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 79.08202, Df = 1, p = < 2.22e-16

Since p-value is smaller than 0.05, we conclude that the variance of the residuals is not constant and infer that heteroscedasticity is present.
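
When heteroscedasticity is present, the coefficient estimates stay unbiased but the usual standard errors can be off. One common remedy, which we did not apply in this project, is to report heteroscedasticity-consistent (White, HC0) standard errors. The sketch below shows the idea in base R on simulated data (the variables `x`, `y` and `fit` are made up for illustration); packages such as sandwich offer ready-made versions of this estimator.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
# Simulated outcome whose error spread grows with x (heteroscedastic by design).
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)                 # design matrix (intercept and x)
e <- residuals(fit)                    # ordinary residuals

bread <- solve(crossprod(X))           # (X'X)^-1
meat  <- crossprod(X * e)              # X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread   # White (HC0) sandwich estimator

se_classic <- sqrt(diag(vcov(fit)))    # assumes constant error variance
se_robust  <- sqrt(diag(vcov_hc0))     # valid under heteroscedasticity
print(cbind(se_classic, se_robust))
```

Applied to model4, the same steps would show whether the significance of the predictors survives once the non-constant variance is taken into account.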

Model 5: non-linearity

When looking at the variables and distributions, we came across the variable that indicates age of the respondents.

model5 <- lm(hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3, raw = TRUE), data = wvsUSA1)
summary(model5)

## 
## Call:
## lm(formula = hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3, 
##     raw = TRUE), data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3938 -0.6100  0.0251  0.6589  3.3392 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.411e+00  2.842e-01  15.521  < 2e-16 ***
## V112                        3.022e-01  8.426e-02   3.586  0.00034 ***
## V113                        5.397e-01  8.015e-02   6.733 1.91e-11 ***
## V114                        8.798e-01  8.196e-02  10.735  < 2e-16 ***
## V68                         1.362e-01  6.689e-03  20.368  < 2e-16 ***
## V46                         1.604e-01  8.338e-03  19.232  < 2e-16 ***
## V120                       -2.879e-02  6.455e-03  -4.460 8.42e-06 ***
## poly(V237, 3, raw = TRUE)1 -4.350e-02  1.764e-02  -2.465  0.01373 *  
## poly(V237, 3, raw = TRUE)2  9.439e-04  3.628e-04   2.601  0.00932 ** 
## poly(V237, 3, raw = TRUE)3 -5.753e-06  2.324e-06  -2.475  0.01336 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9373 on 3929 degrees of freedom
## Multiple R-squared:  0.3101, Adjusted R-squared:  0.3085 
## F-statistic: 196.2 on 9 and 3929 DF,  p-value: < 2.2e-16

We added a third-degree polynomial of age to see whether the model should contain a non-linear effect. The output shows three age terms: linear (-0.0435), quadratic (0.00094) and cubic (-0.0000058), all significant at the 5% level. These coefficients should not be read one by one as separate stages of life; together they describe a single curve in which predicted happiness first falls with age, then rises, and then falls again at older ages. The variance explained is about 31%.

Let’s see the plot.

ggplot(wvsUSA1, aes(V237, hapIND)) +
  geom_point() +
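  # stat_smooth() has no `model` argument, so we fit a cubic curve of happiness on age directly
  # (a bivariate fit, without the other predictors of model5)
  stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = TRUE)) +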
  xlab("Age") +
  ylab("Happiness index") +
  ggtitle("Scatterplot of age and happiness with a regression line")

The effect is non-linear: judging by the plot, people become less happy (on average) around the age of 35-37, and from about 40-50 happiness starts increasing again. This pattern is consistent with the popular notion of a midlife crisis, although our data cannot establish the cause. A cubic curve can change direction at most twice, so the two turning points split it into three segments with different directions.
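
The turning points implied by the model itself can be worked out from the three age coefficients. The sketch below copies the rounded estimates from the summary(model5) output and assumes that V237 is age in years and that all other predictors are held fixed; because the inputs are rounded, the resulting ages are approximate.

```r
# Age coefficients copied (rounded) from the summary(model5) output.
b1 <- -4.350e-02   # linear term
b2 <-  9.439e-04   # quadratic term
b3 <- -5.753e-06   # cubic term

# Age part of the prediction: f(a) = b1*a + b2*a^2 + b3*a^3.
# Its derivative is b1 + 2*b2*a + 3*b3*a^2; polyroot() expects the
# coefficients in increasing order of power.
roots <- polyroot(c(b1, 2 * b2, 3 * b3))
turning <- sort(Re(roots))
print(round(turning, 1))

# Sign of the slope at a few ages (negative = happiness falling with age).
slope <- function(a) b1 + 2 * b2 * a + 3 * b3 * a^2
print(sign(slope(c(25, 50, 85))))
```

With these rounded coefficients the curve bottoms out at roughly age 33 and peaks at roughly age 76, with happiness falling before the first point, rising between the two, and falling after the second.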

anova(model4, model5)

## Analysis of Variance Table
## 
## Model 1: hapIND ~ V11 + V68 + V46 + V120
## Model 2: hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3, raw = TRUE)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3932 3474.4                                  
## 2   3929 3451.5  3    22.881 8.6819 9.709e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value, the model with the cubic age term fits better than the fourth model without age.

Model 6: interaction effect

The sixth model takes the fourth model and adds gender together with an interaction between freedom of choice and gender.

model6 <- lm(hapIND ~ V11 + V68 + V120 + V46*V235, data = wvsUSA1)
summary(model6)

## 
## Call:
## lm(formula = hapIND ~ V11 + V68 + V120 + V46 * V235, data = wvsUSA1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4755 -0.6185  0.0226  0.6578  3.2946 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.019874   0.118721  33.860  < 2e-16 ***
## V112            0.279263   0.084326   3.312 0.000936 ***
## V113            0.481937   0.079458   6.065 1.44e-09 ***
## V114            0.795007   0.080339   9.896  < 2e-16 ***
## V68             0.144965   0.006491  22.335  < 2e-16 ***
## V120           -0.030774   0.006461  -4.763 1.97e-06 ***
## V46             0.142249   0.011453  12.420  < 2e-16 ***
## V235Female     -0.239516   0.124542  -1.923 0.054530 .  
## V46:V235Female  0.035074   0.015839   2.214 0.026858 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9396 on 3930 degrees of freedom
## Multiple R-squared:  0.3065, Adjusted R-squared:  0.3051 
## F-statistic: 217.1 on 8 and 3930 DF,  p-value: < 2.2e-16

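The model still explains only about 31% of the variance. What is interesting, however, is the interaction: each one-point increase in freedom of choice is associated with a 0.142 higher happiness index for men and a 0.177 higher index for women (0.142 + 0.035), and this difference in slopes is significant at the 5% level (p = 0.027).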

plot_model(model6, type = "int")

NB: male is red and female is blue. The plot illustrates this: the slopes of the two lines differ, and they cross at a freedom-of-choice score of roughly 7 (0.2395 / 0.0351 is about 6.8). Below that point men are predicted to be slightly happier; above it, women are predicted to be happier than men with the same freedom-of-choice score.
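
The slopes for each gender and the point where the two lines meet follow directly from the coefficients. The sketch below copies the estimates from the summary(model6) output; the variable names are ours.

```r
# Coefficients copied from the summary(model6) output.
b_v46        <- 0.142249    # slope of freedom of choice for men (reference group)
b_female     <- -0.239516   # shift of the intercept for women
b_v46_female <- 0.035074    # extra slope for women (interaction term)

slope_men   <- b_v46
slope_women <- b_v46 + b_v46_female

# The female-minus-male gap at a given V46 is b_female + b_v46_female * V46;
# the two lines cross where this gap equals zero.
crossing <- -b_female / b_v46_female

print(round(c(men = slope_men, women = slope_women, crossing = crossing), 3))
```

The crossing point does not depend on health, financial satisfaction or V120, because those terms enter the model identically for both genders.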

anova(model5, model6)

## Analysis of Variance Table
## 
## Model 1: hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3, raw = TRUE)
## Model 2: hapIND ~ V11 + V68 + V120 + V46 * V235
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3929 3451.5                                  
## 2   3930 3469.3 -1   -17.774 20.233 7.055e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

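The two models are not nested (model 5 adds age, model 6 adds gender and its interaction with freedom of choice), so the F-test above should be read with caution. Still, model 5 has the smaller residual sum of squares (3451.5 against 3469.3) and the higher adjusted R-squared (0.3085 against 0.3051), so we keep the non-linear model as the best fit.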
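
Since model 5 and model 6 are not nested, an information criterion such as AIC is a more suitable yardstick than the F-test. We did not compute AIC in the project; the sketch below reconstructs the AIC difference from the rounded RSS values in the table above, assuming both models were fitted to the same 3939 observations.

```r
# Values copied from the anova(model5, model6) table above.
n     <- 3939      # observations used in both models
rss_5 <- 3451.5    # residual sum of squares, model 5
rss_6 <- 3469.3    # residual sum of squares, model 6
k_5   <- 10        # estimated coefficients in model 5 (including intercept)
k_6   <- 9         # estimated coefficients in model 6 (including intercept)

# For Gaussian linear models fitted to the same data,
# AIC = n * log(RSS / n) + 2 * k + constant; the constant cancels in a difference.
aic_part <- function(rss, k) n * log(rss / n) + 2 * k
delta_aic <- aic_part(rss_5, k_5) - aic_part(rss_6, k_6)

print(round(delta_aic, 1))  # negative values favour model 5
```

With the fitted objects at hand, AIC(model5, model6) would give the same comparison without the rounding.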

Conclusion

The analysis showed an association between individualistic values and happiness in the USA based on 2005-2006 WVS data. The final model explains about 31% of the variance and includes health state, financial satisfaction, freedom of choice, belief in hard work and age. Thus, our results support the first hypothesis: the culture-specific variables improve the model beyond the “main” predictors.

Moreover, it is important to note that the USA is a large country, and data gathered from its different regions may not be homogeneous. This may partly account for the share of variance that remains unexplained. All in all, individualism and the American Dream as culture-specific features help to explain happiness in America.

Data Analysis: Project One

Anastasia Vlasenko

18/04/2020