Development of the research background and the preliminary exploratory analysis (checking the variables, significance of the models, significance of the interaction effect, etc., to make sure everything works properly) was conducted collectively. Because of that, our works are built on models with the same predictors and the same idea behind them. The individual files with detailed descriptions and analysis were written separately, which is why the final projects differ considerably from each other and should be assessed individually.
Team members:
Even before choosing a particular country to work with, we decided to start by thinking about the construction of the outcome variable - happiness. After investigating the descriptions of all the variables in the dataset, we came to the conclusion that there were two ways we could measure happiness. First, we could take variables related to income (e.g., satisfaction with the financial situation of the household, an estimate of the relative position of the household in terms of income deciles) and health, and by combining them measure happiness more or less objectively, through two other, more precise measures. As summarized in Ross’s (2005) article on health, happiness, and higher levels of social organisation, people with poorer health reported lower levels of happiness than those in good health. This can be explained by not being able to do some of the things a person wants to do, chronic pain, general discomfort, etc. While health, according to Ross, was found to have a lasting effect on happiness, the case with income was a bit different. Though it was concluded that, on average, higher-status groups were happier than the lowest-status groups when particular countries were considered, when average incomes and happiness were compared across countries there was no such trend (Ball & Chernova, 2008). One of the main hypotheses was that it is not absolute income but relative income that matters: within a country, if one’s income is higher in absolute terms it will also be higher in relative terms and the person will be happier; when average incomes between countries are compared, the distribution of relative incomes does not differ between countries as long as absolute incomes are distributed around their means in the same manner in all of them.
One of the variables in the dataset, V68, could in this case be taken as the income measurement, and conveniently it is subjective (and most likely relative): How satisfied are you with the financial situation of your household? To answer this question respondents do not need to provide actual data on their income but rather evaluate their own satisfaction with it; as I hypothesize, respondents are also likely to compare their situation to that of their friends or of other people in their country in general, which makes this measure slightly relative as well. For the measurement of health, variable V11 could be taken: All in all, how would you describe your state of health these days? As can be seen, it is also subjective.
Our general idea was that if happiness were measured by these two variables it would be a bit less subjective (in comparison to the second version, discussed later), simply because we wouldn’t be directly asking respondents to assess their own happiness but would rather use their subjective assessments of two more objective measures to assess their level of happiness indirectly.
However, we decided to use those variables as predictors and measure happiness itself subjectively, using variables V22 (life satisfaction) and V10 (subjective happiness).
Andrews (1974) argues that although both objective and subjective indicators are important for research, it is the perceptual indicators of well-being that allow researchers to gain deeper knowledge of how a social system is constructed, based on how the individuals who live in it perceive their own lives. He also mentions that perceptual measures let researchers compare indicators from different sectors, since the general question is still how satisfied a person is with the object of interest. We concluded that for the purposes of this research we wanted to measure and predict perceptual happiness rather than the more objective version of it.
Keeping the idea of more objective measures of happiness in mind, we came to the conclusion that while those variables seemed more international (since both health and income are universally understandable), there were also variables more focused on culture and distinct for particular countries. We decided to divide our predictor variables into two groups, internationally objective ones (health and income) and more culture-specific ones, in order to assess which of them would be more significant and whether the addition of the culture-specific variables would improve a model based solely on the internationally objective ones.
We decided to pick the United States of America as our country and to use variables V120 (belief in hard work) and V46 (free choice) as the culture-specific ones.
The USA is well known as one of the world’s most developed democracies, and its citizens value their free choice and ability to make a difference. As mentioned in the work of Inglehart, Foa, Peterson, and Welzel (2008), some studies show that democracy is in fact positively related to happiness, since it provides people with a wider range of free choice, freedom of expression, freedom to travel, etc. We wanted to see whether freedom of choice and control over one’s life would be a strong predictor of happiness in the USA and chose variable V46 as the measurement of free choice. Respondents were asked to choose a number from 1 to 10 indicating how much control over their life they have, with 1 meaning “no choice at all” and 10 meaning “a great deal of choice”.
The second culture-specific variable we chose assesses one’s belief in hard work. The famous concept of the American dream includes the belief that through hard work everyone can achieve success in life: since, ideally, American society is not supposed to have any barriers, there should be equal opportunity for prosperity and social mobility for everyone willing to invest their effort in something. To assess the effect of the belief in hard work on subjective happiness, we included variable V120, which consists of respondents’ answers to the question of whether they believe that, in the long run, hard work usually brings a better life, or whether it is luck and connections that matter more.
The final research question is: does adding culture-specific predictors (freedom of choice, belief in hard work) make the predictive model of happiness better in comparison to one built solely on international, more objective predictors (health and income)?
Our hypothesis is that these culture-specific predictors do improve the model. If so, in order to study the topic of happiness in a particular country it is extremely important to first gain enough knowledge about its culture and to improve the model by adding such culture-specific variables.
We begin by simply uploading the data and attaching required packages.
# Attaching the packages to work with
library(foreign)
library(ggplot2)
library(knitr)
library(sjPlot)
library(corrplot)
library(dplyr)
library(car)
library(psych)
library(ggpubr)
library(gridExtra)
# Uploading the data
data <- read.spss("WVS.sav", to.data.frame = TRUE, use.value.labels = TRUE)
# Choosing variables of interest
save <- c("V2", "V10", "V11", "V22", "V46", "V68", "V120", "V235", "V237")
data <- data[save]
# Deleting NA cases
data <- na.omit(data)
# Choosing a particular country - USA
dataUS <- subset(data, V2 == "USA")
str(dataUS)
## 'data.frame': 4368 obs. of 9 variables:
## $ V2 : Factor w/ 112 levels "France","Britain",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ V10 : Factor w/ 4 levels "Very happy","Quite happy",..: 1 2 1 1 2 2 2 2 1 3 ...
## $ V11 : Factor w/ 5 levels "Very good","Good",..: 3 3 4 1 2 1 2 1 1 4 ...
## $ V22 : Factor w/ 10 levels "Dissatisfied",..: 8 9 8 9 6 8 7 8 10 9 ...
## $ V46 : Factor w/ 10 levels "None at all",..: 8 8 4 8 7 8 7 6 8 8 ...
## $ V68 : Factor w/ 10 levels "Dissatisfied",..: 7 8 6 10 4 8 3 10 8 7 ...
## $ V120: Factor w/ 10 levels "In the long run, hard work usually brings a better life",..: 2 2 2 5 6 3 8 5 1 1 ...
## $ V235: Factor w/ 2 levels "Male","Female": 2 1 1 2 1 1 2 1 2 1 ...
## $ V237: Factor w/ 87 levels "100","101","15",..: 54 47 43 56 29 29 10 78 41 57 ...
There were 179854 observations of 9 variables (after we selected the ones we need) in the initial dataset and 4368 observations of the same variables in our new American dataset. All of them are coded as factors at this point, so some re-coding work needs to be done.
As stated previously, we decided to use variables V22 (life satisfaction) and V10 (subjective happiness) to measure happiness. One question we faced was how to combine the two, since they are coded on different scales (1-10 and 1-4). To solve this problem we decided to re-code variable V22 into a 4-level scale, just like the happiness variable, using the following scheme: V22 levels 1-2 become “1”, levels 3-5 become “2”, levels 6-8 become “3”, and levels 9-10 become “4”.
We believe that the levels “Quite happy” and “Not very happy” are quite vague, which is why three levels of the V22 variable correspond to each of them, while in the case of “Not at all happy” and “Very happy” respondents are quite confident in their answer, which is why only the two lowest and two highest values of V22 were ascribed to these two levels.
# Re-coding the V22 variable
dataUS$sat <- rep(NA, length(dataUS$V22))
dataUS$sat[dataUS$V22 == "Dissatisfied" |
dataUS$V22 == "2"] <- "1"
dataUS$sat[dataUS$V22 == "3" |
dataUS$V22 == "4" |
dataUS$V22 == "5"] <- "2"
dataUS$sat[dataUS$V22 == "6" |
dataUS$V22 == "7" |
dataUS$V22 == "8"] <- "3"
dataUS$sat[dataUS$V22 == "Satisfied" |
dataUS$V22 == "9"] <- "4"
dataUS$sat <- as.factor(dataUS$sat)
# Double-checking the result
table(dataUS$sat)
##
## 1 2 3 4
## 73 561 2249 1485
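The same mapping could be written more compactly with dplyr’s case_when. This is an equivalent alternative sketch (not part of the original pipeline): the hypothetical column sat_alt is used so nothing above changes.
# An equivalent, more compact re-coding using dplyr (attached above);
# stored in a separate column so the original pipeline is unchanged
dataUS$sat_alt <- factor(case_when(
  dataUS$V22 %in% c("Dissatisfied", "2") ~ "1",
  dataUS$V22 %in% c("3", "4", "5") ~ "2",
  dataUS$V22 %in% c("6", "7", "8") ~ "3",
  dataUS$V22 %in% c("Satisfied", "9") ~ "4"
))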
The variable V10 also had to be re-coded into a numeric one.
# Re-coding the V10 variable
dataUS$V10 <- ifelse(dataUS$V10=="Not at all happy",1,
ifelse(dataUS$V10=="Not very happy",2,
ifelse(dataUS$V10=="Quite happy",3,
ifelse(dataUS$V10=="Very happy",4, NA))))
# Double-checking the result
table(dataUS$V10)
##
## 1 2 3 4
## 36 309 2238 1785
In a similar way we have to re-code the other variables. Note that the V11 (health) variable is not transformed into a numeric variable in the end: it is used in the further analysis as a factor. While all the remaining variables use 1-10 scales and can be treated as numeric, five levels of the health variable are simply not enough for that. Since the scale for V11 initially ran in the opposite direction from the others (the best health was coded as “1”), we reverse the order of its levels for convenience. The order of the levels was also reversed for the V120 (belief in hard work) variable.
# Re-coding health (V11)
dataUS$V11 <- ifelse(dataUS$V11=="Very good", 5,
ifelse(dataUS$V11=="Good", 4,
ifelse(dataUS$V11=="Fair",3,
ifelse(dataUS$V11=="Poor",2,
ifelse(dataUS$V11=="Very poor", 1, NA)))))
# Re-coding choice (V46)
dataUS$V46 <- ifelse (dataUS$V46=="None at all", 1,
ifelse(dataUS$V46=="A great deal", 10,
dataUS$V46))
# Re-coding hard work (V120) with a change in the direction of the scale
dataUS$V120 <- ifelse(dataUS$V120=="Hard work doesn’t generally bring success—it’s more a matter of luck and connections", 1,
ifelse(dataUS$V120 == "2" ,9,
ifelse(dataUS$V120 == "3", 8,
ifelse(dataUS$V120 == "4", 7,
ifelse(dataUS$V120 == "5", 6,
ifelse(dataUS$V120 == "6", 5,
ifelse(dataUS$V120 == "7", 4,
ifelse(dataUS$V120 == "8",3,
ifelse(dataUS$V120 == "9", 2,
ifelse(dataUS$V120=="In the long run, hard work usually brings a better life", 10,
NA))))))))))
# Re-coding income satisfaction (V68)
dataUS$V68 <- ifelse(dataUS$V68 == "Completely satisfied", 10,
ifelse(dataUS$V68 =="Completely dissatisfied", 1,
dataUS$V68))
# Making sure all our numeric variables are actually numeric
dataUS$sat <- as.numeric(dataUS$sat)
dataUS$V10 <- as.numeric(dataUS$V10)
dataUS$V46 <- as.numeric(dataUS$V46)
dataUS$V68 <- as.numeric(dataUS$V68)
dataUS$V120 <- as.numeric(dataUS$V120)
# Note: as.numeric() on a factor returns the underlying level codes rather than
# the label values; converting the age labels themselves would require
# as.numeric(as.character(dataUS$V237))
dataUS$V237 <- as.numeric(dataUS$V237)
dataUS$V11 <- as.factor(dataUS$V11)
Finally, we decided to sum these two variables to obtain a new happiness index. Right before doing that, we also wanted to take a look at the distribution of the levels of happiness within each of the new life-satisfaction groups.
# Since some NAs were created during re-coding
dataUS <- na.omit(dataUS)
# Plotting happiness distribution inside each life-satisfaction level
set_theme(legend.pos = "top", legend.inside = TRUE, axis.textsize = 0.8)
plot1 <- sjp.xtab(dataUS$sat, dataUS$V10, bar.pos = "stack", legend.title = "Level of happiness",
axis.titles = "Level of life satisfaction", show.total = FALSE, margin = "row",
geom.colors = (palette = "Pastel2"))
plot1
We can observe that among respondents with the highest life satisfaction (level 4) more than half reported being “Very happy” (67.1%), more than double the share of those who reported being “Quite happy” (30.4%). Only 2.1% and 0.3% of these respondents reported being “Not very happy” and “Not at all happy” respectively.
For respondents with a life-satisfaction level of 3 the picture changes slightly: “Quite happy” is the dominant category with 63.8%, followed by “Very happy” with 30.9%. The remaining two categories are still extremely small in number, even though the share of people who are “Not very happy” is larger than in the previous group.
Interestingly, the lowest percentage of “Very happy” respondents (12.6%) is found among those who reported a life-satisfaction level of “2”; even in the life-satisfaction group “1” the percentage of “Very happy” respondents is 26.8%. This might be explained by the fact that as one’s satisfaction grows, greed and the desire for more grow as well, leaving the level of happiness or life satisfaction almost the same if not smaller. In other words, if we suppose that some respondents who reported their life satisfaction to be “2” were previously in group “1”, they now want even more and feel less happy simply because they know they can achieve an even higher level of income or life satisfaction. The percentage of people who are “Not at all happy” is much higher in the group that reported a life satisfaction of “1” (11.3%). However, it is still not the dominant category in that group: 36.6% of those people reported being “Not very happy”. We can only hypothesize that these people either learned to find happiness in small things even if their life conditions are not that great, or that they are simply hiding their true perception of their happiness.
We finally proceed to the creation of the new happiness index.
dataUS$hapIND <- dataUS$V10 + dataUS$sat
# Visualizing the distribution
ggplot(dataUS, aes(x=hapIND))+
geom_histogram(fill="#fdcdac", binwidth= 1)+
labs(title = "Distribution of hapIND variable", y="Frequency", x="Level of hapIND")+
theme_bw()
mean(dataUS$hapIND)
## [1] 6.500469
As can be seen, the distribution is left-skewed, which means that there are more observations with higher values of happiness than we would expect under a normal distribution. The mean value is 6.500469. Since our new measure of happiness has a minimum of 2 (1 on satisfaction + 1 on happiness) and a maximum of 8 (4 + 4), we can conclude that on average Americans tend to report a fairly high level of happiness.
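The direction of the skew can also be checked numerically with the psych package attached earlier (a small sketch; negative values indicate left skew):
# Skewness of the happiness index; a negative value confirms the left skew
skew(dataUS$hapIND)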
There are a total of 4 predictor variables (V11, V46, V68, V120) that we plan to include in our model, and before doing that we shall inspect them.
Before we start, it is important to briefly describe the scales these variables use: V46 (free choice), V68 (income satisfaction), and V120 (belief in hard work) are all measured on 10-point scales, while V11 (health) keeps its five ordered categories. The difference in scales can also be observed in the “range” metric below.
# Getting a table with general descriptive statistics for variables in the dataset
save1 <- c("V46", "V68", "V120")
dataUSPred <- dataUS[save1]
describe(dataUSPred)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## V46 1 4262 7.61 1.90 8 7.78 1.48 1 10 9 -0.84 0.61 0.03
## V68 2 4262 6.53 2.42 7 6.69 2.97 1 10 9 -0.53 -0.39 0.04
## V120 3 4262 7.56 2.17 8 7.80 2.97 2 10 8 -0.70 -0.40 0.03
As we observe, the distributions of all three numeric variables are left-skewed, judging by their negative skew values (-0.84, -0.53, -0.70). In terms of kurtosis, note that describe() reports excess kurtosis (0 for a normal distribution): V68 and V120 are slightly platykurtic (negative values), while V46 is slightly leptokurtic (0.61). If we take a look at the mean values, we can see that respondents on average report high levels of free choice (7.61) and belief in hard work (7.56), and somewhat lower satisfaction with their financial situation (6.53).
If we were to visualize the distribution of these variables:
p1 <- ggplot(dataUS, aes(x=V46))+
geom_histogram(fill="lightblue1", binwidth= 1)+
labs(title = "Distribution of V46 (choice) variable", y="Frequency", x="Level of free choice")+
theme_bw()
p2 <- ggplot(dataUS, aes(x=V68))+
geom_histogram(fill="steelblue2", binwidth= 1)+
labs(title = "Distribution of V68 (income satisfaction) variable", y="Frequency", x="Level of satisfaction with income")+
theme_bw()
p3 <- ggplot(dataUS, aes(x=V120))+
geom_histogram(fill="slateblue4", binwidth= 1)+
labs(title = "Distribution of V120 (hard work) variable", y="Frequency", x="Level of believe in hardwork")+
theme_bw()
ggarrange(p1, p2, p3 + rremove("x.text"),
labels = c("A", "B", "C"),
ncol = 2, nrow = 2)
While our expectation regarding skewness is borne out, we also observe that the distribution of V68 (satisfaction with financial situation) looks somewhat closer to normal than the remaining distributions.
Now we also need to visualize our health variable, which can be done with a bar plot.
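A sketch of the bar plot (the original figure code was not shown; the fill colour is an arbitrary choice):
# Bar plot of the health variable (V11 is a factor, so geom_bar is appropriate)
ggplot(dataUS, aes(x = V11)) +
  geom_bar(fill = "darkseagreen3") +
  labs(title = "Distribution of V11 (health) variable", y = "Frequency",
       x = "Self-reported health (1 = Very poor, 5 = Very good)") +
  theme_bw()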
We observe that the majority of respondents reported their health condition to be “Good” (4) - 42.5%. Slightly fewer reported it to be “Very good” - 36.5%. The three remaining categories were far less common. However, it should be noted that this does not necessarily mean that people in the US are all healthy: the variable is still subjective, and people might perceive their health as better than it actually is or simply hide the truth from the researchers. A very low percentage of people reported their health to be “Very poor”. As we will see further on, the distribution of respondents’ age was skewed - there were more young respondents in this study - which might also have contributed to their generally good reported health.
In what follows we will investigate the relationship between subjective health and happiness.
We begin by visualizing the relationship between the dependent variable (happiness) and the predictor variables with scatterplots and boxplots, and then take a look at the results of cor.test for each of the numeric variables.
# Visualizing the relationship between happiness and four predictors
p1 <- ggplot(dataUS, aes(x=V46, y = hapIND ))+
geom_point(position = "jitter")+
labs(title = "Relationship between V46 (choice) variable and happiness", y="Level of happiness", x="Level of free choice")+
theme_bw()+
geom_smooth(method = "lm")
p2 <- ggplot(dataUS, aes(x=V68, y = hapIND ))+
geom_point(position = "jitter")+
labs(title = "Relationship between V68 (income satisfaction) variable and happiness", y="Level of happiness", x="Level of satisfaction with income")+
theme_bw() +
geom_smooth(method = "lm")
p3 <- ggplot(dataUS, aes(x=V120, y = hapIND ))+
geom_point(position = "jitter")+
labs(title = "Relationship between V120 (hard work) variable and happiness", y="Level of happiness", x="Level of believe in hardwork")+
theme_bw()+
geom_smooth(method = "lm")
ggarrange(p1, p2, p3 + rremove("x.text"),
labels = c("A", "B", "C"),
ncol = 2, nrow = 2)
# Obtaining correlation coefficients
cor.test(dataUS$V46, dataUS$hapIND)
##
## Pearson's product-moment correlation
##
## data: dataUS$V46 and dataUS$hapIND
## t = 28.263, df = 4260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3717792 0.4223522
## sample estimates:
## cor
## 0.3973674
cor.test(dataUS$V68, dataUS$hapIND)
##
## Pearson's product-moment correlation
##
## data: dataUS$V68 and dataUS$hapIND
## t = 30.645, df = 4260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4000816 0.4492909
## sample estimates:
## cor
## 0.4250002
cor.test(dataUS$V120, dataUS$hapIND)
##
## Pearson's product-moment correlation
##
## data: dataUS$V120 and dataUS$hapIND
## t = 10.246, df = 4260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1256406 0.1842451
## sample estimates:
## cor
## 0.1550793
We observe that all correlation coefficients are positive, with the smallest one characterizing the relationship between V120 (hard work) and happiness - 0.1550793 - and the biggest one characterizing the relationship between V68 (financial situation satisfaction) and happiness - 0.4250002. Though there are multiple ways to assess whether a correlation coefficient indicates a weak, medium or strong relationship, I would conclude that two out of three coefficients (0.3973674 and 0.4250002) indicate a moderate positive relationship between happiness and the V46 and V68 variables. The coefficient of 0.1550793 still indicates a positive relationship between V120 and happiness, but a rather weak one.
Looking at the scatterplots, all three trend lines go upwards, indicating positive relationships, but the line on plot C is flatter than the rest, indicating a weaker relationship between the variables. We can also observe that the p-value of the correlation is small in all three cases (<0.05), so the null hypothesis (that there is no correlation) should be rejected.
In the case of the (ordered) health variable we have five categories, and to test the relationship between them and the numeric happiness variable we build a boxplot and perform ANOVA.
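A sketch of the boxplot (the original figure code was not shown):
# Boxplot of happiness across the five health groups
ggplot(dataUS, aes(x = V11, y = hapIND)) +
  geom_boxplot(fill = "lightblue1") +
  labs(title = "Happiness across health groups", y = "Level of happiness",
       x = "Self-reported health (1 = Very poor, 5 = Very good)") +
  theme_bw()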
We observe that for health groups “1”, “2”, “3” and “4” the median happiness is the same and equals 6, while for health group “5” it is higher and equals 7. Although the difference in medians seems apparent, in order to check whether the group differences in means are statistically significant we need to perform ANOVA.
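Judging by the row label dataUS$V11 in the output below, the ANOVA was presumably produced by a call of this form:
# One-way ANOVA of happiness across the health groups
summary(aov(dataUS$hapIND ~ dataUS$V11))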
## Df Sum Sq Mean Sq F value Pr(>F)
## dataUS$V11 4 504 125.90 107.6 <2e-16 ***
## Residuals 4257 4982 1.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value < .001 (<2e-16) means that the null hypothesis should be rejected (the probability to obtain such data that we have if H0 was true for the population is low), thus, the difference in the happiness rate across four health groups is statistically significant.
We will firstly construct models for both groups of variables only briefly analyzing them and then turn to the construction of the final model with predictors from both groups of variables. Detailed diagnostics will be provided for the final model.
Before we continue with the model construction, we want to mean-center our predictor variables so that the interpretation of the regression coefficients is more meaningful. After centering, the intercept is interpreted as the level of happiness predicted when all predictor variables take their mean values. Without centering we couldn’t interpret the intercept, since none of our predictor variables actually takes the value 0.
# Centering the predictor variables
dataUS$V68_cent <- dataUS$V68 - mean(dataUS$V68)
dataUS$V46_cent <- dataUS$V46 - mean(dataUS$V46)
dataUS$V120_cent <- dataUS$V120 - mean(dataUS$V120)
We begin by building a model based solely on the very first predictor - health (V11). Our hypothesis for this variable was that people who subjectively assess their health as good (the “4” and “5” health groups) are happier than those who report worse health.
model1 <-lm(hapIND ~ V11, data = dataUS)
summary(model1)
##
## Call:
## lm(formula = hapIND ~ V11, data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8838 -0.6242 0.1162 0.9291 2.3758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9130 0.2256 26.214 < 2e-16 ***
## V112 -0.2888 0.2408 -1.199 0.230
## V113 0.1579 0.2292 0.689 0.491
## V114 0.5125 0.2270 2.258 0.024 *
## V115 0.9707 0.2272 4.272 1.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.082 on 4257 degrees of freedom
## Multiple R-squared: 0.09181, Adjusted R-squared: 0.09096
## F-statistic: 107.6 on 4 and 4257 DF, p-value: < 2.2e-16
We observe that the model itself is better than the model based solely on the mean value (p-value < 2.2e-16), but only V114 and V115 are statistically significant. The model explains around 9% of the variance in the dependent variable, happiness.
The regression equation would look like this:
\[Predicted happiness = 5.9130 - 0.2888 * V112 + 0.1579 * V113 + 0.5125 * V114 + 0.9707 * V115\]
Since we should only interpret the statistically significant coefficients, the interpretation would be: when a respondent reports a “Very poor” (1) health condition, all dummy variables (V112, V113, V114, V115) take the value 0 and the predicted happiness level is 5.9130. When a person reports “Good” health, predicted happiness increases by 0.5125, and in the case of “Very good” health it increases by 0.9707.
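As a quick illustration (hypothetical helper code, not part of the original analysis), the same prediction can be obtained directly with predict():
# Predicted happiness for a respondent reporting "Very good" health (level 5);
# the result equals the intercept plus the V115 coefficient: 5.9130 + 0.9707
predict(model1, newdata = data.frame(V11 = factor("5", levels = levels(dataUS$V11))))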
It has to be noted that “Good” and “Very good” were the two categories with the largest numbers of observations, which probably contributed to their significance.
plot_model(model1)
If we take a look at the plot we can see how estimates of happiness rise with higher levels of health (“Good”, “Very good”).
The next step is to add another variable of the first, “universally objective” group - satisfaction with the financial situation of the household. We hypothesize a similar pattern here: a moderate, positive relationship, meaning that as satisfaction with the financial situation of the household increases, happiness increases as well.
model2 <-lm(hapIND ~ V11 + V68_cent, data = dataUS)
summary(model2)
##
## Call:
## lm(formula = hapIND ~ V11 + V68_cent, data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0719 -0.7058 0.0155 0.7401 3.0534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.945640 0.206770 28.755 < 2e-16 ***
## V112 -0.178535 0.220741 -0.809 0.419
## V113 0.228627 0.210127 1.088 0.277
## V114 0.496977 0.208075 2.388 0.017 *
## V115 0.855773 0.208327 4.108 4.07e-05 ***
## V68_cent 0.181230 0.006366 28.469 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9916 on 4256 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.2362
## F-statistic: 264.5 on 5 and 4256 DF, p-value: < 2.2e-16
mean(dataUS$V68)
## [1] 6.527687
A small p-value (<0.05) indicates that the model is better than the one based on the mean, while V115 and V114 are still the only levels of V11 that are statistically significant and should be interpreted. The increase in the percentage of variance explained is substantial: the second model explains around 24% of the variance in happiness.
The regression equation would look like this:
\[Predicted happiness = 5.945640 -0.178535 * V112 + 0.228627 * V113 + 0.496977 * V114 + 0.855773 * V115 + 0.181230 * V68 \]
The interpretation would be: when income satisfaction is at its mean (6.53), a respondent reporting “Very poor” health has a predicted happiness of 5.9456, with “Good” and “Very good” health adding 0.4970 and 0.8558 respectively; each one-point increase in satisfaction with the financial situation of the household increases predicted happiness by 0.1812, holding health constant.
Since we already have two models and they are nested, we can perform ANOVA to test whether the addition of satisfaction with the financial situation of the household significantly improved the model.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: hapIND ~ V11
## Model 2: hapIND ~ V11 + V68_cent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4257 4981.9
## 2 4256 4184.9 1 796.94 810.47 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small (<0.05), which means that the addition of the new predictor (Df = 1) did improve the model.
We construct the third model using the variable V46 (free choice). We hypothesize that as the level of free choice increases, happiness increases as well. This was also demonstrated on the scatterplots above.
model3 <-lm(hapIND ~ V46_cent, data = dataUS)
summary(model3)
##
## Call:
## lm(formula = hapIND ~ V46_cent, data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3564 -0.5936 -0.0679 0.9321 3.0665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.500469 0.015951 407.54 <2e-16 ***
## V46_cent 0.237157 0.008391 28.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.041 on 4260 degrees of freedom
## Multiple R-squared: 0.1579, Adjusted R-squared: 0.1577
## F-statistic: 798.8 on 1 and 4260 DF, p-value: < 2.2e-16
mean(dataUS$V46)
## [1] 7.607461
The predictor is statistically significant, and the model is better than the one based solely on the mean value (p-value < 2.2e-16). The model explains around 16% of the variance in happiness, which is in fact more than what the very first model (health only) explained.
The regression equation would look like this:
\[Predicted happiness = 6.500469 + 0.237157 * V46 \]
The interpretation would be:
With each one-point increase in the value of variable “V46” (free choice), happiness increases by 0.237157. When V46 is equal to its mean (7.61), the predicted level of happiness is 6.500469.
plot_model(model3)
The relationship is linear over the middle range of values and clearly not linear at both ends of the line. The deviation is quite extreme at the beginning, but luckily the rest of the data looks like it would be explained by the linear model quite well.
To construct the fourth model we add variable V120 (belief in hard work) as a predictor. We hypothesized that believing that hard work pays off and brings success and money would make people happier simply by giving them hope; on the scatterplot we observed a positive but weak relationship between these variables.
model4 <-lm(hapIND ~ V46_cent + V120_cent, data = dataUS)
summary(model4)
##
## Call:
## lm(formula = hapIND ~ V46_cent + V120_cent, data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1773 -0.6667 0.0354 0.8227 3.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.500469 0.015856 409.977 <2e-16 ***
## V46_cent 0.228720 0.008423 27.156 <2e-16 ***
## V120_cent 0.053186 0.007363 7.223 6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.035 on 4259 degrees of freedom
## Multiple R-squared: 0.1681, Adjusted R-squared: 0.1677
## F-statistic: 430.3 on 2 and 4259 DF, p-value: < 2.2e-16
mean(dataUS$V46)
## [1] 7.607461
mean(dataUS$V120)
## [1] 7.562412
Both predictors are statistically significant, and the model is better than the one based on the mean value (p-value < 2.2e-16). The model now explains around 17% of the variance in the happiness variable, which is less than what the second model, based on the internationally “objective” variables, explained.
The regression equation would look like this:
\[Predicted happiness = 6.500469 + 0.228720 * V46 + 0.053186 * V120 \]
The interpretation would be:
With each one-point increase in the value of variable “V46” (free choice), happiness increases by 0.228720. With each one-point increase in the value of variable “V120” (belief in hard work), happiness increases by 0.053186. When both predictor variables are equal to their mean values, the predicted level of happiness is 6.500469.
Since we now have two nested models, we can compare them using ANOVA.
anova(model3, model4)
## Analysis of Variance Table
##
## Model 1: hapIND ~ V46_cent
## Model 2: hapIND ~ V46_cent + V120_cent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4260 4619.3
## 2 4259 4563.4 1 55.899 52.17 6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small (<0.05), which means that the addition of the new predictor (Df = 1) did improve the model.
Now we can construct a new model by adding the two culture-specific predictors to model2 and check whether their addition significantly improves the model.
model5<-lm(hapIND ~ V11 + V68_cent + V46_cent + V120_cent, data = dataUS)
summary(model5)
##
## Call:
## lm(formula = hapIND ~ V11 + V68_cent + V46_cent + V120_cent,
## data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5115 -0.6174 0.0294 0.6625 2.9777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.037671 0.197036 30.643 < 2e-16 ***
## V112 -0.166136 0.210205 -0.790 0.429367
## V113 0.179369 0.200170 0.896 0.370261
## V114 0.410415 0.198291 2.070 0.038535 *
## V115 0.725582 0.198573 3.654 0.000261 ***
## V68_cent 0.142743 0.006335 22.532 < 2e-16 ***
## V46_cent 0.157769 0.008064 19.563 < 2e-16 ***
## V120_cent 0.037135 0.006760 5.493 4.18e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9442 on 4254 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.3075
## F-statistic: 271.3 on 7 and 4254 DF, p-value: < 2.2e-16
The final model explains around 31% of the variance in the outcome variable and is better than the one based on the mean value (p-value < 2.2e-16). Among the levels of V11 only V114 and V115 are significant, while all the numeric predictors are significant.
The regression equation would look like this:
\[Predicted happiness = 6.037671 - 0.166136 * V112 + 0.179369 * V113 + 0.410415 * V114 + 0.725582 * V115 + 0.142743 * V68 + 0.157769 * V46 + 0.037135 * V120 \]
The interpretation would be: with all numeric predictors at their means, a respondent reporting “Very poor” health has a predicted happiness of 6.0377, with “Good” and “Very good” health adding 0.4104 and 0.7256 respectively; each one-point increase in income satisfaction, free choice, and belief in hard work increases predicted happiness by 0.1427, 0.1578, and 0.0371 respectively, holding the other predictors constant.
In order to assess if the addition of culture-specific variables helped us to improve the model we perform ANOVA.
anova(model2, model5)
## Analysis of Variance Table
##
## Model 1: hapIND ~ V11 + V68_cent
## Model 2: hapIND ~ V11 + V68_cent + V46_cent + V120_cent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4256 4184.9
## 2 4254 3792.3 2 392.65 220.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small (<0.05), which means that the addition of the two new predictors (Df = 2) did improve the model.
anova(model4, model5)
## Analysis of Variance Table
##
## Model 1: hapIND ~ V46_cent + V120_cent
## Model 2: hapIND ~ V11 + V68_cent + V46_cent + V120_cent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4259 4563.4
## 2 4254 3792.3 5 771.14 173.01 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The same holds for the comparison between the final model and the model based solely on the culture-specific predictors. The conclusion to be drawn here is that it is beneficial to use both universal and culture-specific variables together.
We first check for outliers.
outlierTest(model5)
## rstudent unadjusted p-value Bonferroni p
## 316879 -4.795097 1.6816e-06 0.0071668
par(mfrow=c(1,1))
qqPlot(model5, main="QQ Plot")
## 316879 317482
## 2331 2885
We observe that observation 316879 is the most extreme in our data and is an outlier (since the Bonferroni-adjusted p-value is <0.05, we reject the null hypothesis). Observations 316879 and 317482 are flagged on the plot.
In order to check for multicollinearity we calculate variance inflation factors for all four variables.
vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## V11 1.046530 4 1.005701
## V68_cent 1.122507 1 1.059484
## V46_cent 1.123512 1 1.059959
## V120_cent 1.032894 1 1.016314
Values for all four variables (1.046530, 1.122507, 1.123512, 1.032894) are less than 5, which indicates that multicollinearity is not a problem. The smallest possible value of the variance inflation factor is 1, and the VIFs for all four variables are not much bigger than that. We would only be concerned if the values exceeded 5 or 10.
We then proceed to check for heteroscedasticity of residuals.
ncvTest(model5)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 83.05354, Df = 1, p = < 2.22e-16
Unfortunately, the test is statistically significant (p < 0.05), which is a sign of heteroscedasticity (the residuals do not have constant variance).
One possible explanation lies in the nature of the dataset: since we chose the USA for this research project, the data might contain both very low values for one state and extremely high values for another. Heteroscedasticity is a common phenomenon in cross-sectional studies. We will also observe it on the diagnostic plots below.
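Before turning to the plots, one common remedy worth noting (a sketch assuming the sandwich and lmtest packages, which were not attached above) is to report heteroscedasticity-consistent standard errors:
library(sandwich)
library(lmtest)
# Coefficient tests for model5 with HC3 robust standard errors;
# the point estimates stay the same, only the standard errors change
coeftest(model5, vcov = vcovHC(model5, type = "HC3"))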
par(mfrow=c(2,2))
plot(model5)
The “Residuals vs Fitted” graph is used to check the linearity of the relationship, and we are looking for a straight horizontal line. The line in our case is not particularly straight (it starts to fluctuate towards the end).
The second graph is used to check the normality of the residual distribution; in the ideal case the dots would follow the straight dashed line. Unfortunately, we observe that at the very beginning of the line there are observations that clearly stand out (316879, 317482, 314336).
The “Scale-Location” graph is used to check the homogeneity of variance of the residuals; the ideal case would be a horizontal line with points spread equally around it, which is not our case.
There are no points beyond the Cook’s distance lines, which means there are no high-influence outliers that should be deleted from the model.
Though we observed that the relationship between some of our predictors and the outcome was not perfectly linear, we wanted to add a clearly non-linear effect and see how the model changes. For that we chose variable V237 (age of the respondent).
Before we proceed to the model construction, we need to take a look at the variable V237 itself. Note that for the purposes of this part of the project we won’t be centering V237: later on we will take the logarithm of V237, and logarithms of negative values are undefined.
ggplot(dataUS, aes(x=V237))+
geom_histogram(fill="darkslategray3", binwidth= 1)+
labs(title = "Distribution of V237 (age) variable", y="Frequency", x="Age")+
theme_bw()
mean(dataUS$V237)
## [1] 35.55631
The mean age is 35.55631, and the distribution seems to be slightly right-skewed, which indicates that there were more younger respondents. Apart from that, the shape of the distribution looks quite normal - close to a bell shape.
Since age is a numeric variable we can also take a look at the scatterplot and correlation coefficient.
# Visualizing the relationship between V237 (age) and happiness
ggplot(dataUS, aes(x=V237, y = hapIND ))+
geom_point(position = "jitter")+
labs(title = "Relationship between V237 (age) variable and happiness", y="Level of happiness", x="Age")+
theme_bw()+
geom_smooth()+
xlim(-10,100)
cor.test(dataUS$hapIND,dataUS$V237)
##
## Pearson's product-moment correlation
##
## data: dataUS$hapIND and dataUS$V237
## t = 3.7736, df = 4260, p-value = 0.0001631
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02774440 0.08759186
## sample estimates:
## cor
## 0.05771999
We observe that the relationship does not resemble a straight line: we are dealing with a non-linear relationship in which the slope changes with the value of the predictor itself (not with another variable). The correlation coefficient indicates a very weak positive relationship between happiness and age.
modeltest <- lm(hapIND ~ V237, data = dataUS)
plot_model(modeltest)
As the model without a logarithm transformation illustrates, the level of happiness drops noticeably around age 30 (possibly explained by a midlife crisis or the burden of adult life) and then grows again until it reaches its peak around age 60 - presumably when people retire and can finally enjoy their life in peace, or at least start preparing to do so.
In order to build a non-linear model we take the natural logarithm of the V237 variable. By using the logarithm instead of the raw variable we make the effective relationship non-linear while still preserving a model that is linear in its parameters.
Here I won’t use a centered version of the variable, to avoid confusion, since this model is only a test one rather than our final model.
model6 <- lm(hapIND ~ log(V237) , data = dataUS)
summary(model6)
##
## Call:
## lm(formula = hapIND ~ log(V237), data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5647 -0.5395 -0.3665 0.5845 1.6475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.18960 0.10202 60.672 <2e-16 ***
## log(V237) 0.09089 0.02939 3.092 0.002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.133 on 4260 degrees of freedom
## Multiple R-squared: 0.00224, Adjusted R-squared: 0.002005
## F-statistic: 9.562 on 1 and 4260 DF, p-value: 0.001999
The coefficient in this case can be interpreted as follows: a 1% increase in age is expected to increase happiness by 0.09/100 = 0.0009. This is because, as long as the one-percent increase in V237 is fixed, we will see the same difference in the happiness score regardless of the baseline age.
This happens due to the use of the logarithm. If we take two values of the age variable, a1 and a2, we can write:
\[happiness(a_2) - happiness(a_1) = B(\log(a_2) - \log(a_1)) = B\log(a_2/a_1)\]
If age rises by 1%, then a2/a1 = 1.01, and the difference in expected mean happiness scores remains B * log(1.01) as long as we keep the percentage increase constant. In our case, for a 1% increase in age (V237), the difference in expected mean happiness scores is approximately 0.09089 * 0.01 = 0.0009089; the exact value is 0.09089 * ln(1.01) = 0.0009044. Note that R’s log() function computes the natural logarithm (ln).
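A quick numerical check of this approximation:
b <- 0.09089
b * 0.01 # linear approximation for a 1% increase: 0.0009089
b * log(1.01) # exact difference: 0.0009044 (log() is the natural logarithm in R)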
Such a model explains only about 0.2% of the variance (adjusted R-squared = 0.002). If we were to plot it, we would get the following line alongside the non-linear trend.
plot_model(model6)
If we add the logarithmic version of the variable to our final model, the effect of log(V237) is linear even though the effect of V237 itself is not. Note that I kept all the variables uncentered for uniformity; here the coefficients won’t be interpreted in detail.
model61<- lm(hapIND ~ V11 + V68 + V46 + V120 + log(V237), data = dataUS)
summary(model61)
##
## Call:
## lm(formula = hapIND ~ V11 + V68 + V46 + V120 + log(V237), data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4834 -0.6155 0.0330 0.6637 2.9893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.366372 0.227940 14.769 < 2e-16 ***
## V112 -0.163612 0.210020 -0.779 0.43600
## V113 0.193014 0.200047 0.965 0.33468
## V114 0.441185 0.198393 2.224 0.02621 *
## V115 0.770392 0.198986 3.872 0.00011 ***
## V68 0.138659 0.006481 21.394 < 2e-16 ***
## V46 0.158288 0.008059 19.640 < 2e-16 ***
## V120 0.035718 0.006771 5.275 1.4e-07 ***
## log(V237) 0.076044 0.025974 2.928 0.00343 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9433 on 4253 degrees of freedom
## Multiple R-squared: 0.3101, Adjusted R-squared: 0.3088
## F-statistic: 238.9 on 8 and 4253 DF, p-value: < 2.2e-16
The same logic as before applies here: a 1% increase in age leads to a 0.076044 * 0.01 = 0.00076044 increase in the outcome variable. The other coefficients can be interpreted just as before. This model explains around 31% of the variance in the outcome variable.
Another way to deal with non-linearity in this case is to use polynomials. As we observed on the graph, the line was wiggly: it first decreased (ages 0-30), then rose (30-60), and then dropped again. Such a trend can be captured by a third-degree (cubic) polynomial, since we have two extrema (a lowest point and a highest point). A cubic curve simply passes through our data points better than a quadratic curve or a straight line would.
model62<- lm(hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3, raw = TRUE), data = dataUS)
summary(model62)
##
## Call:
## lm(formula = hapIND ~ V11 + V68 + V46 + V120 + poly(V237, 3,
## raw = TRUE), data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4259 -0.6139 0.0328 0.6641 2.9985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.723e+00 2.341e-01 15.904 < 2e-16 ***
## V112 -1.675e-01 2.099e-01 -0.798 0.4248
## V113 1.946e-01 1.999e-01 0.973 0.3305
## V114 4.530e-01 1.983e-01 2.284 0.0224 *
## V115 7.886e-01 1.990e-01 3.963 7.52e-05 ***
## V68 1.362e-01 6.543e-03 20.811 < 2e-16 ***
## V46 1.583e-01 8.054e-03 19.655 < 2e-16 ***
## V120 3.517e-02 6.770e-03 5.195 2.15e-07 ***
## poly(V237, 3, raw = TRUE)1 -1.742e-02 9.772e-03 -1.783 0.0747 .
## poly(V237, 3, raw = TRUE)2 5.655e-04 2.721e-04 2.078 0.0378 *
## poly(V237, 3, raw = TRUE)3 -4.507e-06 2.259e-06 -1.995 0.0461 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 4251 degrees of freedom
## Multiple R-squared: 0.3113, Adjusted R-squared: 0.3097
## F-statistic: 192.2 on 10 and 4251 DF, p-value: < 2.2e-16
We won’t dive into details, but it has to be mentioned that we observe an interesting pattern in the output: because we applied a polynomial transformation, we can see how the trend differs across different parts of the age range. The coefficient for the first-degree term (note that it is not statistically significant at the 0.05 level) is negative since, just as we observe on the graph, happiness at first decreases as we move along the x axis. The fitted curve also turns downward in the last segment (after age 60), where happiness starts to decrease again, while over the middle section happiness rises with age. This complex relationship would not be captured by a simple linear model.
Lastly, we wanted to add an interaction effect to our final model. To do so, we picked V235 (gender) to check whether the strength of the relationship between one of the predictors and the outcome variable is the same for men and women; specifically, we check whether the strength of the relationship between happiness and free choice differs by gender. Before we proceed to the model construction, we need to take a look at the variable V235 itself.
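The medians below were presumably computed along these lines (the original calls were not shown):
# Median happiness for male and for female respondents
median(dataUS$hapIND[dataUS$V235 == "Male"])
median(dataUS$hapIND[dataUS$V235 == "Female"])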
## [1] 6
## [1] 7
The median happiness for men is 6, while for female respondents it is 7 (both observed on the graph and calculated). Both genders have outliers at lower values of happiness.
We can also plot the distribution of the happiness variable for both genders separately, to see how many respondents of each group chose each level; a sketch of such a plot follows.
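One way such graphs could be produced with the same sjPlot helper used earlier (a sketch; the exact calls were not shown):
# Share of female and male respondents within each level of happiness
sjp.xtab(dataUS$hapIND, dataUS$V235, bar.pos = "stack",
         legend.title = "Gender", axis.titles = "Level of happiness",
         show.total = FALSE, margin = "row", geom.colors = (palette = "Pastel2"))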
We observe that for happiness levels 3-8 the percentages of female and male respondents are almost equal, with a slight dominance of females at all levels but “6”. However, at the lower levels of happiness the percentage of female respondents is much bigger than the percentage of men. Even though these levels have almost no observations in comparison to “6”, “7” and “8” (second graph), the difference in percentages is apparent (first graph). We already observed that the median happiness of women is higher than that of men, but what about the difference in means?
Since in this case we are dealing with one numeric variable and one binary variable, it seems reasonable to perform a t-test to check whether mean happiness differs between the two genders. One of the assumptions we have to check before carrying out the test itself is the homogeneity of variances. The two hypotheses are: H0 - the variances of the two groups are equal; H1 - the variances differ.
# Test for homogeneity of variances
var.test(dataUS$hapIND ~ dataUS$V235)
##
## F test to compare two variances
##
## data: dataUS$hapIND by dataUS$V235
## F = 0.89048, num df = 2127, denom df = 2133, p-value = 0.007459
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8179494 0.9694382
## sample estimates:
## ratio of variances
## 0.8904759
As we can see, the p-value is small (0.007459 < 0.05), so we have to reject the null hypothesis and conclude that, unfortunately, the variances do differ. We will have to specify this in the t-test call so that Welch’s t-test is performed.
Hypotheses for the test itself: H0 - the mean happiness of male and female respondents is equal; H1 - the means differ.
t.test(dataUS$hapIND ~ dataUS$V235, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataUS$hapIND by dataUS$V235
## t = 3.801e-05, df = 4247.1, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06814830 0.06815094
## sample estimates:
## mean in group Male mean in group Female
## 6.500470 6.500469
Since the p-value is large (p = 1), the null hypothesis cannot be rejected. Therefore, we conclude that the means of these two groups do not detectably differ. As we can observe in the output, the mean for male respondents is 6.500470, while the mean for female respondents is 6.500469.
Now, we can finally move to the model construction.
model7 <- lm(hapIND ~ V11 + V68_cent + V46_cent*V235 + V120_cent, data = dataUS)
summary(model7)
##
## Call:
## lm(formula = hapIND ~ V11 + V68_cent + V46_cent * V235 + V120_cent,
## data = dataUS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4718 -0.6113 0.0308 0.6514 2.8613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.036585 0.197466 30.570 < 2e-16 ***
## V112 -0.174177 0.210101 -0.829 0.40714
## V113 0.168468 0.200093 0.842 0.39986
## V114 0.399241 0.198220 2.014 0.04406 *
## V115 0.714940 0.198496 3.602 0.00032 ***
## V68_cent 0.143034 0.006338 22.568 < 2e-16 ***
## V46_cent 0.138209 0.011061 12.495 < 2e-16 ***
## V235Female 0.021958 0.028969 0.758 0.44849
## V120_cent 0.037357 0.006760 5.526 3.46e-08 ***
## V46_cent:V235Female 0.038836 0.015218 2.552 0.01075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9436 on 4252 degrees of freedom
## Multiple R-squared: 0.3098, Adjusted R-squared: 0.3084
## F-statistic: 212.1 on 9 and 4252 DF, p-value: < 2.2e-16
We observe that the interaction effect is significant (0.01075 < 0.05), just as V114, V115, V68, V46 and V120 are. The model explains around 31% of the variance and is better than the model built solely on the mean (p-value < 2.2e-16).
The regression equation would look like this:
\[Predicted happiness = 6.036585 - 0.174177 * V112 + 0.168468 * V113 + 0.399241 * V114 + 0.714940 * V115 + 0.143034 * V68 + 0.138209 * V46 + 0.037357 * V120 + 0.021958 * V235(female) + 0.038836 * V46 * V235\]
The interpretation would be: for male respondents, each one-point increase in free choice (V46) increases predicted happiness by 0.138209; for female respondents the effect is 0.138209 + 0.038836 = 0.177045. Because of the interaction effect, a change in V46 or in the V235 variable also changes the value of the 0.038836 * V46 * V235 term.
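A minimal sketch (hypothetical helper code) of extracting the gender-specific slopes from the fitted model:
# Effect of free choice for men and for women, from model7's coefficients
b <- coef(model7)
b["V46_cent"] # slope for men: 0.1382
b["V46_cent"] + b["V46_cent:V235Female"] # slope for women: 0.1382 + 0.0388 = 0.1770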
If we plot the difference between women and men in the strength of the relationship between happiness and free choice, we observe the following.
plot_model(model7, type = "int")
We can see that the line for female respondents (colored blue) rises more steeply than the one for men. While the relationship between freedom of choice and happiness is positive for both groups, the effect is stronger for female respondents than for male ones.
anova(model5, model7)
## Analysis of Variance Table
##
## Model 1: hapIND ~ V11 + V68_cent + V46_cent + V120_cent
## Model 2: hapIND ~ V11 + V68_cent + V46_cent * V235 + V120_cent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4254 3792.3
## 2 4252 3786.0 2 6.3099 3.5433 0.029 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Running ANOVA does conclude that the interaction model is better (p = 0.029), though the significance code is only ‘*’. The interaction model explains 0.3084 of the variance (adjusted R-squared) compared to 0.3075 for the previous final model.
Just to recap:
Our general hypothesis about the importance of culture-specific variables was supported by the data: the full model, with the culture-specific variables added, was better than the model based solely on the universal variables. The same was true the other way around, and, in general, the final model with all the variables was better than either of the two preliminary models. We can conclude that both types of variables should be used in further research; since the universal ones are already the more commonly used, our main suggestion is to make sure one also uses culture-specific variables and studies a particular country’s culture before diving into the research itself.
Our hypotheses about the individual variables were also supported: there was a positive relationship between happiness and all four of our predictors, although it differed in strength (for the numeric variables). We also found that adding age to the model without transforming it could lead to problems: the relationship between age and happiness was not linear but closer to cubic. This finding shows how important it is to examine the nature of the relationship between the outcome and predictor variables before building a model.
One of the unexpected findings came from the addition of the gender variable to the model. We found that the strength of the relationship between happiness and freedom of choice differed between female and male respondents: for female respondents the relationship was stronger, as shown by the steeper line on the graph. America is famous for trying to fight racial, gender and other prejudice, but freedom of choice and control over one’s life might still be something American women struggle with: it is really important for them to gain independence, and, as we can see, the more freedom they have, the happier they are.
Andrews, F. M. (1974). Social indicators of perceived life quality. Social Indicators Research, 1, 279–299.
Ball, R., & Chernova, K. (2008). Absolute income, relative income, and happiness. Social Indicators Research, 88(3), 497–529.
Inglehart, R., Foa, R., Peterson, C., & Welzel, C. (2008). Development, freedom, and rising happiness: A global perspective (1981–2007). Perspectives on Psychological Science, 3(4), 264–285.
Ross, N. (2005). Health, happiness, and higher levels of social organisation. Journal of Epidemiology and Community Health, 59(8), 614.