Project 1

Introduction

Project author: Lamova Tamara

Project co-authors: Likhodievskaya Yulya, Matveeva Lada, Ryabova Anastasia.

The main idea of the research project is to predict the level of happiness of South Korean citizens. The motivation was a personal interest in this country, as well as a desire to learn more about the data in South Korea.

South Korea is a country that is constantly developing. In Korea, education and skilled work are highly valued; however, not everyone succeeds in reaching the top. Even students who successfully graduate from a university cannot always get a high position because of the enormous competition in the professions in demand. Korean citizens value good work and do everything they can to maximize their self-sufficiency. As a result, South Korea is one of the most technologically developed countries in the world, but at the same time it has one of the highest suicide rates. In the capital, Seoul, there is even a so-called “suicide bridge”, which, because of its accessibility, has long attracted desperate people. Moreover, the bridge is within walking distance of the city's financial district, which suggests why people jumped from it so often: Koreans who lost their job, failed a deal or did not reach the top wanted to take their own lives. Over time, the bridge was turned into a “bridge of life”: it was reinforced, and motivating phrases were hung along its entire length to prevent suicide. The question is: what really makes the people of South Korea happy? Is it true that success at work and a consistently high income matter most to them?

Research question: What determines the level of happiness of the people of South Korea?

Data description

The data we use come from the World Values Survey (WVS) database. Using data for the latest period (2010-2014), we select the country of interest and work with a subset of the data. We keep the entire time period, which gives us 3421 observations, enough to analyze and predict happiness.

The next step is to choose variables that will explain our level of happiness.

Since we want to predict the level of happiness, based on income, position in society and the prestige of work, we are going to take the following variables:

finsatisf - How satisfied are you with the financial situation of your household? Scale from 1 (completely dissatisfied) to 10 (completely satisfied).

hardsuccess - How would you place your views on this scale? 1 means you agree completely with the statement on the left; 10 means you agree completely with the statement on the right; and if your views fall somewhere in between, you can choose any number in between.

1 (left): In the long run, hard work usually brings a better life.

10 (right): Hard work doesn’t generally bring success; it’s more a matter of luck and connections.

employment - Are you employed now or not? If more than one job: only for the main job:

Yes, has paid employment: Full time employee (30 hours a week or more) 1 Part time employee (less than 30 hours a week) 2 Self employed 3

No, no paid employment: Retired/pensioned 4 Housewife not otherwise employed 5 Student 6 Unemployed 7 Other 8

class - People sometimes describe themselves as belonging to the working class, the middle class, or the upper or lower class. Would you describe yourself as belonging to the:

1 Upper class 2 Upper middle class 3 Lower middle class 4 Working class 5 Lower class

age - age of a participant

The variable which we explain:

happy - Taking all things together, would you say you are:

1 Very happy 2 Rather happy 3 Not very happy 4 Not at all happy

The variable for creating happy INDEX:

satisf - All things considered, how satisfied are you with your life as a whole these days? Using this card on which 1 means you are “completely dissatisfied” and 10 means you are “completely satisfied”, where would you put your satisfaction with your life as a whole?

Scale from 1 (completely dissatisfied) to 10 (completely satisfied).

Recoding variables

Our variables are factor variables (the exception is age, which is numeric). However, not all scales in the dataset are suitable for a correct analysis as they stand, so the next step was to recode the variables so that higher values mean more of the measured quality. The answers to each question are recoded this way:

Not at all happy = 1, Not very happy = 2, and so on, for happy, satisf, finsatisf and hardsuccess.
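The recoding step can be sketched in R as follows. This is a minimal illustration on made-up answers, not the code used for this report; the vector `happy_raw` and the helper `reverse_scale` are hypothetical names, and which variables actually needed reversing is not shown in the report.

```r
# Hypothetical raw answers in the original WVS coding:
# 1 = Very happy, 2 = Rather happy, 3 = Not very happy, 4 = Not at all happy
happy_raw <- c(1, 2, 2, 3, 4, NA, 1)

# Reverse a scale that runs from 1 to `max_value`, so that larger numbers
# mean "more" (for happy: 4 = Very happy, 1 = Not at all happy).
# Missing answers stay missing.
reverse_scale <- function(x, max_value) {
  (max_value + 1) - x
}

happy <- reverse_scale(happy_raw, max_value = 4)
print(happy)  # 4 3 3 2 1 NA 4
```

The same helper would work for a 10-point item by passing `max_value = 10`.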

Creating index of happiness - our outcome variable

Once the variables were recoded, we proceeded to create the index needed for the models that follow. The happiness index was created from two variables: happy and satisf. First, let us look at plots of these two variables.

Are you happy?

We see that a large share of participants are quite happy; however, some are not very happy. It is interesting to find out why.

How satisfied are you with your life as a whole?

Again we see that few people are completely dissatisfied with their life and few are completely satisfied. However, most participants are more or less satisfied.

After looking at the plots and creating the INDEX, let us look at its distribution with an ordinary histogram and a standardized one.

Here we can conclude that our INDEX is approximately normally distributed, and we can go further.
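The report does not show how the index was computed. One common way to combine two items measured on different scales, assumed here purely for illustration, is to standardize each item, average the z-scores and standardize the result. The sketch below uses simulated answers, and all object names are hypothetical.

```r
set.seed(1)
n <- 200

# Simulated answers (not WVS data): happy on a 1-4 scale, satisf on a 1-10 scale
happy  <- sample(1:4,  n, replace = TRUE)
satisf <- sample(1:10, n, replace = TRUE)

# Put both items on a common scale (mean 0, sd 1) before combining them
z_happy  <- as.numeric(scale(happy))
z_satisf <- as.numeric(scale(satisf))

# Average the two z-scores, then standardize the result again
index_raw   <- (z_happy + z_satisf) / 2
happy_index <- as.numeric(scale(index_raw))

summary(happy_index)
```

A standardized index has mean 0 and standard deviation 1, which would be consistent with the negative intercepts that appear in the regression output of this report.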

Descriptive statistics

Before constructing the analysis, it is important to look at the distribution of our predictive variables.

The distribution of finsatisf

Few Koreans are completely satisfied with their financial situation, but most are more or less satisfied. Still, as we see, those who are dissatisfied are also fairly numerous.

The distribution of employment

The graph shows that most Koreans work full time, and few are not employed.

The distribution of hardsucess

It is noteworthy that Koreans believe that hard work really brings success (answers 1, 2, 3).

The distribution of class

Respondents most often place themselves in the lower middle class or the upper middle class.

Let us look at the relationships between our outcome variable happy and the predictors.

The relationship between happy and employment

The graph shows the presence of outliers, but we also see a difference in the level of happiness across employment statuses. For example, people who are unemployed or have a part-time job are less happy on average.

The relationship between happy and class

People who consider themselves upper class seem happier on average than others. The graph shows that the working and lower classes are less happy than the upper classes.

The relationship between happy and hardsuccess

No clear pattern is visible, but the variable may still turn out to be significant in our model.

The relationship between happy and finsatisf

There are many extreme cases (outliers), which may affect the regression. The relationship is positive: the more satisfied a person is with the financial situation of their household, the happier they are.

Hypotheses

A higher level of satisfaction with the financial situation of the household corresponds to a higher level of happiness (Hagerty & Veenhoven, 2003). In their study, Michael R. Hagerty and Ruut Veenhoven use the theory of absolute utility, which predicts that extra income allows each person to satisfy additional needs, thereby increasing average long-term happiness; it was shown that satisfaction with one's financial situation positively affects the feeling of happiness only in the short term.
Individuals who believe that hard work brings success feel happier.
Full-time workers, part-time workers and retired individuals are likely to be happier than unemployed individuals (Lawrence et al., 2016). According to longitudinal research on happiness in the USA, about 30% of individuals who have a full-time job, have a part-time job or are retired are “very happy”, while only 18% of unemployed individuals feel the same.
Individuals feel much happier as they become older (Stone A. A. et al., 2010). In a study on psychological well-being and its relationship with age, the results showed that after the age of 50 people feel much happier than in their youth, when the level of happiness is on the decline.
Individuals belonging to higher classes feel happier (Paul Cameron, 2016). In his study, Paul Cameron concluded that people belonging to a higher class report more positive and happy moods than people with a lower position.

Regression analysis

To predict happiness, we first build a model containing all the predictors and look at its significance.

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$hardsuccess1 + 
##     wvsKR1$employment + wvsKR1$class)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9606 -0.4655  0.0111  0.4934  3.4490 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.224063   0.143430  -8.534  < 2e-16 ***
## wvsKR1$finsatisf12              0.514464   0.124068   4.147 3.62e-05 ***
## wvsKR1$finsatisf13              0.659641   0.119297   5.529 3.96e-08 ***
## wvsKR1$finsatisf14              0.918816   0.113766   8.076 1.65e-15 ***
## wvsKR1$finsatisf15              1.173729   0.111577  10.519  < 2e-16 ***
## wvsKR1$finsatisf16              1.517238   0.110906  13.680  < 2e-16 ***
## wvsKR1$finsatisf17              1.677895   0.116939  14.348  < 2e-16 ***
## wvsKR1$finsatisf18              2.115567   0.161752  13.079  < 2e-16 ***
## wvsKR1$finsatisf110             2.795932   0.238240  11.736  < 2e-16 ***
## wvsKR1$hardsuccess12            0.172406   0.079709   2.163 0.030747 *  
## wvsKR1$hardsuccess13            0.091913   0.094103   0.977 0.328907    
## wvsKR1$hardsuccess14            0.135191   0.089237   1.515 0.130054    
## wvsKR1$hardsuccess15            0.177270   0.097880   1.811 0.070383 .  
## wvsKR1$hardsuccess16            0.172175   0.100708   1.710 0.087597 .  
## wvsKR1$hardsuccess17            0.266182   0.123495   2.155 0.031334 *  
## wvsKR1$hardsuccess18            0.004769   0.136674   0.035 0.972173    
## wvsKR1$hardsuccess19           -0.049126   0.168479  -0.292 0.770654    
## wvsKR1$hardsuccess110           0.289107   0.083688   3.455 0.000571 ***
## wvsKR1$employmentHousewife     -0.164734   0.064706  -2.546 0.011028 *  
## wvsKR1$employmentOther         -0.256835   0.081735  -3.142 0.001718 ** 
## wvsKR1$employmentPart time     -0.326759   0.100251  -3.259 0.001149 ** 
## wvsKR1$employmentRetired       -0.592034   0.134066  -4.416 1.10e-05 ***
## wvsKR1$employmentSelf employed -0.095628   0.109554  -0.873 0.382909    
## wvsKR1$employmentStudents      -0.077495   0.080850  -0.958 0.338010    
## wvsKR1$employmentUnemployed    -0.171329   0.129388  -1.324 0.185711    
## wvsKR1$classLower middle class  0.138021   0.113062   1.221 0.222427    
## wvsKR1$classUpper class         0.470612   0.317285   1.483 0.138280    
## wvsKR1$classUpper middle class  0.065752   0.122877   0.535 0.592680    
## wvsKR1$classWorking class      -0.122818   0.122239  -1.005 0.315232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8116 on 1166 degrees of freedom
## Multiple R-squared:  0.3568, Adjusted R-squared:  0.3414 
## F-statistic:  23.1 on 28 and 1166 DF,  p-value: < 2.2e-16

The p-value of the F-test is less than 0.05, so the model as a whole is significant. R-squared is a statistical measure of how close the data are to the fitted regression line: the higher the R-squared, the better the model fits the data. This model explains 36% of the variability of the response around its mean, so its explanatory power is moderate rather than strong. Finsatisf is highly significant (***) at every level, and several employment categories are significant as well. The hardsuccess and class variables are mostly not significant.

The intercept is -1.22. Because all predictors in this model are factors, it is the expected value of the happiness index for a respondent in the reference category of every predictor (finsatisf = 1, hardsuccess = 1, full-time employment, lower class).
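To make the reading of such output concrete, here is a small sketch of how `lm` handles factor predictors: one coefficient per non-reference level, with the intercept equal to the prediction at the reference levels. The data are simulated and the variable names only mimic those of the report; this is not the original analysis.

```r
set.seed(42)
n <- 300

# Simulated data, not the WVS sample
d <- data.frame(
  employment = factor(sample(c("Full time", "Part time", "Retired"), n, replace = TRUE)),
  finsatisf  = factor(sample(1:4, n, replace = TRUE))
)
d$happy_index <- 0.4 * as.integer(d$finsatisf) -
  0.3 * (d$employment != "Full time") + rnorm(n)

fit <- lm(happy_index ~ finsatisf + employment, data = d)

# The first level of each factor is the reference category
levels(d$employment)[1]   # "Full time"
levels(d$finsatisf)[1]    # "1"

# The intercept equals the prediction for a respondent at both reference levels
ref <- data.frame(
  finsatisf  = factor("1", levels = levels(d$finsatisf)),
  employment = factor("Full time", levels = levels(d$employment))
)
predict(fit, newdata = ref)
coef(fit)["(Intercept)"]
```

Every other coefficient is then a difference from that reference respondent, which is why the employment coefficients in the output above are read relative to full-time employees.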

Backward method

The next step of our analysis is to find the best model. Here we use the backward method: removing insignificant factors until we find a good model. We do not have many predictors, so this step is quite simple.

Now we remove hardsuccess1, since the previous model showed that it is mostly not significant.

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$employment + 
##     wvsKR1$class)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8647 -0.4387 -0.0083  0.5063  3.5807 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.09065    0.13179  -8.276 3.44e-16 ***
## wvsKR1$finsatisf12              0.50082    0.12399   4.039 5.71e-05 ***
## wvsKR1$finsatisf13              0.66108    0.11871   5.569 3.18e-08 ***
## wvsKR1$finsatisf14              0.92209    0.11347   8.126 1.11e-15 ***
## wvsKR1$finsatisf15              1.16308    0.11132  10.448  < 2e-16 ***
## wvsKR1$finsatisf16              1.53027    0.11081  13.810  < 2e-16 ***
## wvsKR1$finsatisf17              1.69070    0.11677  14.479  < 2e-16 ***
## wvsKR1$finsatisf18              2.07441    0.16157  12.839  < 2e-16 ***
## wvsKR1$finsatisf110             2.79215    0.23835  11.714  < 2e-16 ***
## wvsKR1$employmentHousewife     -0.17554    0.06453  -2.720 0.006618 ** 
## wvsKR1$employmentOther         -0.24747    0.08156  -3.034 0.002465 ** 
## wvsKR1$employmentPart time     -0.32996    0.09966  -3.311 0.000958 ***
## wvsKR1$employmentRetired       -0.60818    0.13405  -4.537 6.29e-06 ***
## wvsKR1$employmentSelf employed -0.10120    0.10951  -0.924 0.355633    
## wvsKR1$employmentStudents      -0.09971    0.08038  -1.240 0.215067    
## wvsKR1$employmentUnemployed    -0.18816    0.12938  -1.454 0.146111    
## wvsKR1$classLower middle class  0.15269    0.11253   1.357 0.175085    
## wvsKR1$classUpper class         0.43503    0.31550   1.379 0.168204    
## wvsKR1$classUpper middle class  0.07795    0.12234   0.637 0.524138    
## wvsKR1$classWorking class      -0.11025    0.12209  -0.903 0.366715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8143 on 1175 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3369 
## F-statistic: 32.93 on 19 and 1175 DF,  p-value: < 2.2e-16

The p-value is less than 0.05, so the model is significant and we can conclude that some predictors explain the level of happiness. This model explains 35% of the variability of the response around its mean, so its explanatory power is still not very strong. Finsatisf is significant again, as are several employment categories, while the class variable is not significant. The intercept is -1.09, which is the expected happiness index for a respondent in the reference categories (finsatisf = 1, full-time employment, lower class). The finsatisf coefficients are positive and grow with the level of financial satisfaction; the class coefficients are positive except for the working class; all employment coefficients are negative relative to full-time employees.

Removing class variable

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$employment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8114 -0.4574 -0.0252  0.4961  3.6641 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.02927    0.10086 -10.205  < 2e-16 ***
## wvsKR1$finsatisf12              0.50252    0.12372   4.062 5.19e-05 ***
## wvsKR1$finsatisf13              0.67853    0.11708   5.796 8.73e-09 ***
## wvsKR1$finsatisf14              0.94735    0.11211   8.450  < 2e-16 ***
## wvsKR1$finsatisf15              1.18626    0.10885  10.898  < 2e-16 ***
## wvsKR1$finsatisf16              1.56030    0.10747  14.519  < 2e-16 ***
## wvsKR1$finsatisf17              1.72075    0.11300  15.227  < 2e-16 ***
## wvsKR1$finsatisf18              2.10526    0.15711  13.400  < 2e-16 ***
## wvsKR1$finsatisf110             2.84957    0.23253  12.255  < 2e-16 ***
## wvsKR1$employmentHousewife     -0.16754    0.06478  -2.586 0.009819 ** 
## wvsKR1$employmentOther         -0.23954    0.08187  -2.926 0.003500 ** 
## wvsKR1$employmentPart time     -0.38329    0.09899  -3.872 0.000114 ***
## wvsKR1$employmentRetired       -0.57970    0.13439  -4.314 1.74e-05 ***
## wvsKR1$employmentSelf employed -0.07190    0.10938  -0.657 0.511078    
## wvsKR1$employmentStudents      -0.07381    0.08010  -0.921 0.357008    
## wvsKR1$employmentUnemployed    -0.22635    0.12590  -1.798 0.072452 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8186 on 1179 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3299 
## F-statistic:  40.2 on 15 and 1179 DF,  p-value: < 2.2e-16

The p-value is less than 0.05. This model explains 33% of the variability of the response around its mean. R-squared decreased, which always happens when a predictor is removed; the adjusted R-squared also dropped slightly (from 0.337 to 0.330), so class adds a little explanatory power.

Removing the employment variable just to see how a single predictor explains happiness.

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ wvsKR1$finsatisf1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9254 -0.4085  0.0237  0.5326  3.6166 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.22137    0.09318 -13.108  < 2e-16 ***
## wvsKR1$finsatisf12   0.55201    0.12467   4.428 1.04e-05 ***
## wvsKR1$finsatisf13   0.72974    0.11700   6.237 6.19e-10 ***
## wvsKR1$finsatisf14   0.99941    0.11159   8.956  < 2e-16 ***
## wvsKR1$finsatisf15   1.26713    0.10831  11.699  < 2e-16 ***
## wvsKR1$finsatisf16   1.62935    0.10710  15.213  < 2e-16 ***
## wvsKR1$finsatisf17   1.78730    0.11298  15.819  < 2e-16 ***
## wvsKR1$finsatisf18   2.14900    0.15695  13.692  < 2e-16 ***
## wvsKR1$finsatisf110  2.92503    0.23326  12.540  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8282 on 1186 degrees of freedom
## Multiple R-squared:  0.3187, Adjusted R-squared:  0.3141 
## F-statistic: 69.34 on 8 and 1186 DF,  p-value: < 2.2e-16

The p-value is less than 0.05. This model explains 31% of the variability of the response around its mean. R-squared is smaller because only one predictor is left.

Removing finsatisf variable

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ wvsKR1$employment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91265 -0.54538  0.05894  0.49662  2.69888 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.175084   0.047957   3.651 0.000273 ***
## wvsKR1$employmentHousewife     -0.237500   0.077351  -3.070 0.002186 ** 
## wvsKR1$employmentOther         -0.312467   0.097062  -3.219 0.001320 ** 
## wvsKR1$employmentPart time     -0.715074   0.116096  -6.159 9.98e-10 ***
## wvsKR1$employmentRetired       -0.478726   0.160460  -2.983 0.002908 ** 
## wvsKR1$employmentSelf employed -0.054106   0.130728  -0.414 0.679036    
## wvsKR1$employmentStudents      -0.005434   0.095488  -0.057 0.954625    
## wvsKR1$employmentUnemployed    -0.409321   0.149426  -2.739 0.006249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9805 on 1187 degrees of freedom
## Multiple R-squared:  0.04429,    Adjusted R-squared:  0.03865 
## F-statistic: 7.858 on 7 and 1187 DF,  p-value: 2.483e-09

The p-value is less than 0.05, but this model explains only 4% of the variability of the response around its mean, so its explanatory power is weak. Only employment is in this model; it is more significant here than in the previous models (for example, the unemployed category is now significant at the 0.01 level). R-squared dropped sharply because finsatisf, the strongest predictor, was removed.

Comparing models

Now since we have several models, we need to compare them to find the best one.

Nested models are usually compared with an anova F-test, and non-nested models with AIC. Our models are nested, so we use anova.

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$employment + wvsKR1$class
## Model 2: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$hardsuccess1 + 
##     wvsKR1$employment + wvsKR1$class
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1175 779.10                              
## 2   1166 767.98  9     11.12 1.8758 0.05175 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

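The reduced model is preferred because the p-value (0.052) is greater than 0.05: adding hardsuccess does not significantly improve the fit, although the result is borderline.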

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$employment
## Model 2: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$hardsuccess1 + 
##     wvsKR1$employment + wvsKR1$class
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1   1179 790.00                                
## 2   1166 767.98 13    22.025 2.5724 0.001632 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the full model is better than the reduced one because the p-value is below 0.05.

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1
## Model 2: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$hardsuccess1 + 
##     wvsKR1$employment + wvsKR1$class
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1186 813.51                                  
## 2   1166 767.98 20    45.529 3.4563 4.353e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here again the full model is better than the reduced one because the p-value is below 0.05.

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ wvsKR1$employment
## Model 2: wvsKR1$happyIND2 ~ wvsKR1$finsatisf1 + wvsKR1$hardsuccess1 + 
##     wvsKR1$employment + wvsKR1$class
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   1187 1141.12                                  
## 2   1166  767.98 21    373.15 26.978 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

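The last anova test shows that the full model is better than the reduced one because the p-value is below 0.05.

Therefore, model0 (the full model) is better than models 2, 3 and 4, while model1 (without hardsuccess) is not significantly worse than model0 and is simpler. So model1 is our best model.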
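The comparisons in this section follow a standard pattern that can be sketched on simulated data. The variables `x1`, `x2` and `noise` and the three model objects are hypothetical and are not the models of this report.

```r
set.seed(7)
n <- 400

# Simulated data: y depends on x1 and x2, but not on noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.5 * d$x2 + rnorm(n)

reduced <- lm(y ~ x1, data = d)
full    <- lm(y ~ x1 + x2, data = d)
bigger  <- lm(y ~ x1 + x2 + noise, data = d)

# F-test for nested models: a small p-value favours the larger model,
# a large p-value means the extra terms do not significantly improve the fit
anova(reduced, full)
anova(full, bigger)

# AIC can be compared across models fitted to the same data (lower is better)
AIC(reduced, full, bigger)
```

When the F-test is not significant, the smaller model is usually kept on grounds of parsimony, which is the reasoning applied to model1 above.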

Model diagnostics

Model diagnostics should also be presented. First, we check the model for multicollinearity, that is, strong linear dependence among the predictors. If it is present, the estimates of the regression coefficients become unstable. We therefore use the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

##                       GVIF Df GVIF^(1/(2*Df))
## wvsKR1$finsatisf1 1.423944  8        1.022335
## wvsKR1$employment 1.276656  7        1.017599
## wvsKR1$class      1.491733  4        1.051263

The test shows that the VIF scores of all predictors are well below 5 (the adjusted values GVIF^(1/(2*Df)) are close to 1), so the predictors are at most weakly correlated. There is no multicollinearity problem in our model.
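For numeric predictors the VIF can be computed by hand, which shows what the statistic measures. The sketch below uses base R only and simulated predictors (`x1`, `x2`, `x3` are hypothetical); for factor terms the output above reports a generalized version (GVIF) instead, which this sketch does not reproduce.

```r
set.seed(3)
n <- 500

x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                                # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

In this simulation `x1` and `x2` get clearly inflated values, while `x3` stays close to 1, the value that indicates no inflation.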

The next step in model diagnostics is to look at residuals and leverage.

The first plot is a scatter plot of residuals on the y axis against fitted values (estimated responses) on the x axis. The residuals and the fitted values appear to be uncorrelated, as they should be in a homoscedastic linear model with normally distributed errors. So this plot shows no sign of heteroscedasticity.

The normal Q-Q plot shows that the residuals are approximately normally distributed, because the points follow the dotted line closely. The exceptions are observations 577, 1098 and 1165, which is acceptable. Visually, the model residuals pass the normality check.

The scale-location plot shows the spread of the residuals across the range of predicted values. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range. For our model the result is not as good.

The last plot shows that we have outliers but no influential high-leverage points. Outliers are data points whose response y does not follow the general trend of the rest of the data. A data point has high leverage if it has “extreme” predictor x values; leverage measures how unusual the x value of a point is. A high-leverage point is influential if it greatly affects the slope of the regression line, and such points are candidates for removal, although the decision also depends on how many observations we have. No points lie beyond the Cook’s distance lines, which means there are no influential observations, which is good.
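The three notions used here (outlier, leverage, influence) can also be computed directly in base R. The sketch below uses simulated data with one deliberately extreme point added; the cut-offs are common rules of thumb chosen for illustration, not tests used in this report.

```r
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

# Add one point that is extreme in x and far from the trend (illustration only)
x <- c(x, 6)
y <- c(y, -5)

fit <- lm(y ~ x)

lev  <- hatvalues(fit)        # leverage: how unusual each x value is
rst  <- rstudent(fit)         # studentized residuals: how unusual each y value is
cook <- cooks.distance(fit)   # influence: effect of each point on the fitted coefficients

# Rules of thumb (conventions, not strict tests)
high_leverage <- which(lev > 2 * length(coef(fit)) / length(y))
outliers      <- which(abs(rst) > 3)
influential   <- which(cook > 4 / length(y))

high_leverage
outliers
influential
```

The added point (number 101) shows up in all three lists, whereas an observation such as 1098 in our model is an outlier in y without being influential.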

Other tests to assess the adequacy of our model

##      rstudent unadjusted p-value Bonferonni p
## 1098 4.482532         8.0979e-06     0.009677

The Bonferroni p-value shows that observation 1098 is a statistically significant outlier, but it does not influence the regression line.

qqPlot(model1, main="QQ Plot")

## [1] 1098 1165

There is another way to present the Q-Q plot, which also shows a normal distribution.

And here is another way to show the leverage plots.

The distribution of studentized residuals is normal.

Adding non-linear effect

To add a non-linear effect and see whether it gives a better model, we will use three methods: polynomial, spline and GAM.

Polynomial

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ poly(finsatisf1, 3) + employment + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9187 -0.4491 -0.0169  0.4869  3.5483 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.04855    0.11391   0.426  0.67007    
## poly(finsatisf1, 3)1    18.50086    0.89505  20.670  < 2e-16 ***
## poly(finsatisf1, 3)2     0.60084    0.84270   0.713  0.47599    
## poly(finsatisf1, 3)3     1.66392    0.82843   2.009  0.04482 *  
## employmentHousewife     -0.17701    0.06450  -2.745  0.00615 ** 
## employmentOther         -0.25499    0.08104  -3.146  0.00169 ** 
## employmentPart time     -0.32664    0.09963  -3.279  0.00107 ** 
## employmentRetired       -0.61375    0.13388  -4.584 5.04e-06 ***
## employmentSelf employed -0.09531    0.10941  -0.871  0.38389    
## employmentStudents      -0.09626    0.08026  -1.199  0.23064    
## employmentUnemployed    -0.19563    0.12876  -1.519  0.12896    
## classLower middle class  0.15309    0.11226   1.364  0.17294    
## classUpper class         0.43735    0.31498   1.389  0.16524    
## classUpper middle class  0.08157    0.12193   0.669  0.50364    
## classWorking class      -0.11063    0.12175  -0.909  0.36372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8149 on 1180 degrees of freedom
## Multiple R-squared:  0.3438, Adjusted R-squared:  0.336 
## F-statistic: 44.16 on 14 and 1180 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ finsatisf1 + employment + class
## Model 2: wvsKR1$happyIND2 ~ poly(finsatisf1, 3) + employment + class
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1175 779.10                           
## 2   1180 783.52 -5   -4.4272 1.3354 0.2466

## [1] 2922.081

## [1] 2918.852

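The lower the AIC, the better. The AIC values (2922.08 for model1 and 2918.85 for the cubic polynomial model) differ by only about 3, and the anova test is not significant (p = 0.25), so adding the non-linear effect does not give a clearly better model.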

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ poly(finsatisf1, 4) + employment + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9149 -0.4527 -0.0205  0.4834  3.5422 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.04729    0.11420   0.414  0.67884    
## poly(finsatisf1, 4)1    18.50228    0.89546  20.662  < 2e-16 ***
## poly(finsatisf1, 4)2     0.60310    0.84315   0.715  0.47457    
## poly(finsatisf1, 4)3     1.66390    0.82877   2.008  0.04491 *  
## poly(finsatisf1, 4)4     0.14087    0.82266   0.171  0.86407    
## employmentHousewife     -0.17721    0.06453  -2.746  0.00612 ** 
## employmentOther         -0.25441    0.08115  -3.135  0.00176 ** 
## employmentPart time     -0.32623    0.09970  -3.272  0.00110 ** 
## employmentRetired       -0.61450    0.13401  -4.585 5.01e-06 ***
## employmentSelf employed -0.09492    0.10948  -0.867  0.38613    
## employmentStudents      -0.09617    0.08030  -1.198  0.23131    
## employmentUnemployed    -0.19414    0.12911  -1.504  0.13292    
## classLower middle class  0.15431    0.11254   1.371  0.17056    
## classUpper class         0.43500    0.31540   1.379  0.16810    
## classUpper middle class  0.08264    0.12214   0.677  0.49877    
## classWorking class      -0.10920    0.12209  -0.894  0.37127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8152 on 1179 degrees of freedom
## Multiple R-squared:  0.3438, Adjusted R-squared:  0.3355 
## F-statistic: 41.18 on 15 and 1179 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ finsatisf1 + employment + class
## Model 2: wvsKR1$happyIND2 ~ poly(finsatisf1, 4) + employment + class
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1   1175 779.1                           
## 2   1179 783.5 -4   -4.4077 1.6619 0.1565

## Analysis of Variance Table
## 
## Model 1: wvsKR1$happyIND2 ~ poly(finsatisf1, 3) + employment + class
## Model 2: wvsKR1$happyIND2 ~ poly(finsatisf1, 4) + employment + class
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1180 783.52                           
## 2   1179 783.50  1  0.019485 0.0293 0.8641

## [1] 2922.081

## [1] 2918.852

## [1] 2920.823

Comparing the models, the fourth-degree term is not significant (p = 0.86) and the AIC rises to 2920.82, so a higher-degree polynomial does not bring better results either.
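The polynomial check can be sketched as follows on simulated data in which the true relationship is linear. The names `x`, `y`, `linear` and `cubic` are hypothetical, and the predictor is treated as a numeric 1-10 score.

```r
set.seed(5)
n <- 300

x <- sample(1:10, n, replace = TRUE)       # numeric 1-10 score
y <- -1.2 + 0.3 * x + rnorm(n, sd = 0.8)   # the true relationship is linear here

linear <- lm(y ~ x)
cubic  <- lm(y ~ poly(x, 3))   # orthogonal polynomial of degree 3

# Nested comparison: does the cubic model fit significantly better?
anova(linear, cubic)

# AIC penalizes the two extra parameters; lower is better
AIC(linear, cubic)
```

`poly()` builds orthogonal polynomial terms, so the individual coefficients are not on the scale of the raw score; the comparison of whole models is what matters here.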

Spline

## 
## Call:
## lm(formula = happyIND2 ~ employment + bs(finsatisf1, knots = knots) + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9187 -0.4491 -0.0169  0.4869  3.5483 
## 
## Coefficients: (3 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.54505    0.20299   7.611 5.53e-14 ***
## employmentHousewife            -0.17701    0.06450  -2.745  0.00615 ** 
## employmentOther                -0.25499    0.08104  -3.146  0.00169 ** 
## employmentPart time            -0.32664    0.09963  -3.279  0.00107 ** 
## employmentRetired              -0.61375    0.13388  -4.584 5.04e-06 ***
## employmentSelf employed        -0.09531    0.10941  -0.871  0.38389    
## employmentStudents             -0.09626    0.08026  -1.199  0.23064    
## employmentUnemployed           -0.19563    0.12876  -1.519  0.12896    
## bs(finsatisf1, knots = knots)1       NA         NA      NA       NA    
## bs(finsatisf1, knots = knots)2       NA         NA      NA       NA    
## bs(finsatisf1, knots = knots)3 -2.59618    0.20116 -12.906  < 2e-16 ***
## bs(finsatisf1, knots = knots)4 -1.49246    0.18091  -8.250 4.19e-16 ***
## bs(finsatisf1, knots = knots)5 -1.50138    0.35694  -4.206 2.79e-05 ***
## bs(finsatisf1, knots = knots)6       NA         NA      NA       NA    
## classLower middle class         0.15309    0.11226   1.364  0.17294    
## classUpper class                0.43735    0.31498   1.389  0.16524    
## classUpper middle class         0.08157    0.12193   0.669  0.50364    
## classWorking class             -0.11063    0.12175  -0.909  0.36372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8149 on 1180 degrees of freedom
## Multiple R-squared:  0.3438, Adjusted R-squared:  0.336 
## F-statistic: 44.16 on 14 and 1180 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: wvsKR1$happyIND2
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## finsatisf1    8 380.49  47.562 71.7307 < 2.2e-16 ***
## employment    7  23.50   3.358  5.0640 1.119e-05 ***
## class         4  10.91   2.726  4.1119  0.002587 ** 
## Residuals  1175 779.10   0.663                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Analysis of Variance Table
## 
## Response: wvsKR1$happyIND2
##                       Df Sum Sq Mean Sq  F value    Pr(>F)    
## poly(finsatisf1, 3)    3 375.43 125.143 188.4676 < 2.2e-16 ***
## employment             7  24.10   3.442   5.1840  7.85e-06 ***
## class                  4  10.95   2.738   4.1237  0.002533 ** 
## Residuals           1180 783.52   0.664                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] 2918.852

## [1] 2918.852

## [1] 2922.081

Here again we see that the spline method did not improve our model.

GAM

AIC(modelgam)   # 2899.895
AIC(model1pl0)  # 2892.488

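So the GAM also did not improve our model: its AIC (2899.895) is higher than the AIC of model1pl0 (2892.488), and a lower AIC is preferred.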
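The AIC comparison used here can be sketched as follows. This is a minimal illustration on simulated data, not the WVS data: the data frame `d` and the model names `fit_lm` and `fit_gam` are hypothetical stand-ins, and the smooth term `s(finsatisf, k = 5)` is an assumption about how a GAM of this kind might be specified.

```r
library(mgcv)  # provides gam(); a recommended package shipped with R

set.seed(1)
n <- 500

# Simulated stand-ins for the survey variables (assumption, not the WVS data)
d <- data.frame(finsatisf = sample(1:10, n, replace = TRUE))
d$happy <- 0.3 * d$finsatisf + rnorm(n)

fit_lm  <- lm(happy ~ finsatisf, data = d)             # linear effect
fit_gam <- gam(happy ~ s(finsatisf, k = 5), data = d)  # smooth (non-linear) effect

# One row per model; the model with the lower AIC is preferred
aic_tab <- AIC(fit_lm, fit_gam)
print(aic_tab)
```

If the smooth term does not lower the AIC, the simpler linear specification is usually the one to keep.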

Therefore, we should continue to work with the model without non-linear effects.

Adding interaction effect

We should also try adding an interaction effect. To do this, we use the age variable.

Simple model with age

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ finsatisf1 + employment + age + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9612 -0.4590 -0.0265  0.4436  3.6296 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.882609   0.118417  -7.453 1.75e-13 ***
## finsatisf1   0.294515   0.012598  23.377  < 2e-16 ***
## employment  -0.026821   0.010531  -2.547 0.010992 *  
## age         -0.006273   0.001720  -3.647 0.000277 ***
## class       -0.060197   0.018903  -3.185 0.001487 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8221 on 1190 degrees of freedom
## Multiple R-squared:  0.3265, Adjusted R-squared:  0.3242 
## F-statistic: 144.2 on 4 and 1190 DF,  p-value: < 2.2e-16

The p-value of the F-test is smaller than 0.05, so the model as a whole is significant and we can conclude that at least some predictors explain the level of happiness. The model explains about 33% of the variance in the happiness index (R-squared = 0.3265). Finsatisf is significant again, as are employment, age and class. The intercept is -0.88, which means that if age, finsatisf, class and employment are all equal to 0, the predicted happiness index is -0.88. Holding the other variables constant, a one-unit increase in finsatisf changes the happiness index by 0.29, a one-unit increase in employment changes it by -0.03, a one-unit increase in age changes it by -0.006, and a one-unit increase in class changes it by -0.06.
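Written out, the summary output for this model corresponds to the fitted equation below. It only restates the estimated coefficients (rounded to three decimals); employment and class enter as the numeric codes used in this model.

```latex
\widehat{\text{happy}} = -0.883
  + 0.295 \cdot \text{finsatisf}
  - 0.027 \cdot \text{employment}
  - 0.006 \cdot \text{age}
  - 0.060 \cdot \text{class}
```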

Age * class

Hypothesis: The older a person is and the higher the class he considers himself to belong to, the happier he is.


Interpretation:

1 - Lower class
2 - Lower middle class
3 - Upper class
4 - Upper middle class
5 - Working class

The interaction effect is significant. The hypothesis is confirmed: with increasing age, people in the upper middle class feel happier, while the lower the class, the less happy a person becomes with age.

Age * finsatisf

Hypothesis: The older a person is and the more satisfied he is with his financial situation, the happier he is.

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ age * finsatisf1 + employment + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0387 -0.4421 -0.0155  0.4546  3.8209 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.5255270  0.2057840  -2.554 0.010780 *  
## age            -0.0147889  0.0043683  -3.386 0.000734 ***
## finsatisf1      0.2185959  0.0379533   5.760 1.07e-08 ***
## employment     -0.0255225  0.0105329  -2.423 0.015536 *  
## class          -0.0602325  0.0188751  -3.191 0.001454 ** 
## age:finsatisf1  0.0018072  0.0008524   2.120 0.034197 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8208 on 1189 degrees of freedom
## Multiple R-squared:  0.329,  Adjusted R-squared:  0.3262 
## F-statistic: 116.6 on 5 and 1189 DF,  p-value: < 2.2e-16

The p-value of the F-test is smaller than 0.05, so the model as a whole is significant and we can conclude that at least some predictors explain the level of happiness. The model explains about 33% of the variance in the happiness index (R-squared = 0.329). Finsatisf is significant again, as are employment, class and age, and the age:finsatisf1 interaction is significant at the 5% level (p = 0.034). The age coefficient is larger in magnitude than in the previous model (-0.015 versus -0.006), although its p-value is slightly larger. The intercept is -0.53, which means that if age, finsatisf, class and employment are all equal to 0, the predicted happiness index is -0.53. Because the model contains an interaction, the coefficients of age and finsatisf describe the effect of each variable when the other equals 0: a one-unit increase in finsatisf changes the happiness index by 0.22 at age 0, and a one-unit increase in age changes it by -0.01 when finsatisf is 0. A one-unit increase in employment changes the happiness index by -0.03, and a one-unit increase in class changes it by -0.06.
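With an interaction term in the model, the effect of finsatisf is no longer a single number but depends on age. As a worked illustration using the rounded coefficients from the output above (the ages 20 and 60 are picked only as examples):

```latex
\frac{\partial\,\widehat{\text{happy}}}{\partial\,\text{finsatisf}}
  = 0.2186 + 0.00181 \cdot \text{age}

\text{age} = 20:\quad 0.2186 + 0.00181 \cdot 20 \approx 0.255
\qquad
\text{age} = 60:\quad 0.2186 + 0.00181 \cdot 60 \approx 0.327
```

So, according to this model, a one-unit change in finsatisf is associated with a somewhat larger change in the happiness index for older respondents than for younger ones.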

Interpretation:

1 - completely satisfied, 9 - completely dissatisfied

The hypothesis is fully confirmed, since people who are satisfied with their financial condition are much happier than those who are not satisfied, and their happiness index increases with age.

Age * employment

Hypothesis: The older a person is, and if he has stable paid employment, the happier he is.

## 
## Call:
## lm(formula = wvsKR1$happyIND2 ~ finsatisf1 + age * employment + 
##     class, data = wvsKR1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9657 -0.4587 -0.0115  0.4574  3.6180 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.9627870  0.1599010  -6.021  2.3e-09 ***
## finsatisf1      0.2941031  0.0126129  23.318  < 2e-16 ***
## age            -0.0041720  0.0032995  -1.264  0.20633    
## employment     -0.0060472  0.0297596  -0.203  0.83901    
## class          -0.0600479  0.0189074  -3.176  0.00153 ** 
## age:employment -0.0005617  0.0007526  -0.746  0.45561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8222 on 1189 degrees of freedom
## Multiple R-squared:  0.3268, Adjusted R-squared:  0.324 
## F-statistic: 115.4 on 5 and 1189 DF,  p-value: < 2.2e-16

The p-value of the F-test is smaller than 0.05, so the model as a whole is significant and we can conclude that at least some predictors explain the level of happiness. The model explains about 33% of the variance in the happiness index (R-squared = 0.3268). Finsatisf is significant again, as is class, but employment, age and the age:employment interaction are not significant. The intercept is -0.96, which means that if age, finsatisf, class and employment are all equal to 0, the predicted happiness index is -0.96. A one-unit increase in finsatisf changes the happiness index by 0.29, and a one-unit increase in class changes it by -0.06. Because of the interaction, the coefficients of employment (-0.006) and age (-0.004) describe the effect of each variable when the other equals 0.

Interpretation:

1 - Full-time paid job, 8 - Other

The interaction effect is not significant (p = 0.456), so the hypothesis cannot be confirmed. We can only note that full-time workers are indeed happier, but their happiness index does not rise with age: it starts at a higher level and falls at almost the same rate as for other people.

Therefore, only the first and second interaction models (age * class and age * finsatisf) show a significant interaction effect in our analysis.
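One way to check whether an interaction term is worth keeping is to compare the model with and without it using an F-test on the nested models. The sketch below uses simulated data rather than the WVS data; the data frame `d`, the model names `fit_main` and `fit_int`, and the simulated coefficients are hypothetical and chosen only for illustration.

```r
set.seed(2)
n <- 400

# Simulated stand-ins for age and financial satisfaction (assumption, not the WVS data)
d <- data.frame(
  age       = sample(18:80, n, replace = TRUE),
  finsatisf = sample(1:10, n, replace = TRUE)
)
d$happy <- -0.5 - 0.015 * d$age + 0.22 * d$finsatisf +
  0.002 * d$age * d$finsatisf + rnorm(n, sd = 0.8)

fit_main <- lm(happy ~ age + finsatisf, data = d)  # main effects only
fit_int  <- lm(happy ~ age * finsatisf, data = d)  # adds the age:finsatisf term

# F-test for the nested models: does the interaction term improve the fit?
cmp <- anova(fit_main, fit_int)
print(cmp)

# p-value of the interaction; with a single added term it matches the
# p-value of the age:finsatisf row in summary(fit_int)
p_int <- cmp[["Pr(>F)"]][2]
```

A p-value below 0.05 would suggest keeping the interaction; otherwise the simpler main-effects model is usually preferred.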

Conclusion:

We conducted a data analysis and predicted the level of happiness of South Koreans. After describing the data, constructing regression models and choosing the best one, we tested it and added interaction effects to better understand the results and the relationships between the variables. The best model contains the variables finsatisf, class and employment. The hardsuccess variable was removed because it was not significant.

Back to our hypotheses.

A higher level of satisfaction with the financial situation of the household corresponds to a higher level of happiness: confirmed. The model showed a strong positive relationship between the happiness index and the financial satisfaction of Koreans.
Individuals who believe that their hard work brings success feel happier: not confirmed. The hardsuccess variable had only a weak relationship with the happiness index and was removed from the model.
Full-time workers, part-time workers and retired individuals are likely to be happier than unemployed individuals: partially confirmed. The models showed that employment is related to the level of happiness and that full-time workers are happier, although the age * employment interaction itself was not significant. We were unable to show that part-time workers become less happy with age.
Individuals belonging to higher classes feel happier: confirmed. The model showed a positive relationship with the level of happiness, and the interaction model showed that people in higher classes are happier with age.

Therefore, we can conclude that people in Korea are indeed happier when they are financially stable, have well-paid jobs, and consider themselves to belong to the upper class.

Further refinement may include improved models, as well as looking at the question of happiness from other perspectives.

Bibliography:

Hagerty M. R., Veenhoven R. Wealth and happiness revisited: growing national income does go with greater happiness. Social Indicators Research, 2003, Vol. 64, No. 1, pp. 1-27.

Lawrence E. M., Rogers R. G., Wadsworth T. Happiness and longevity in the United States. Social Science & Medicine, 2015, Vol. 145, pp. 115-119.

Piff P. K., Moskowitz J. P. Wealth, poverty, and happiness: social class is differentially associated with positive emotions. Emotion, 2018, Vol. 18, pp. 902-905.

Cameron P. Mood as an indicant of happiness: age, sex, social class, and situational differences. Journal of Gerontology, 1975, Vol. 30, No. 2, pp. 216-224.

Stone A. A. et al. A snapshot of the age distribution of psychological well-being in the United States. Proceedings of the National Academy of Sciences, 2010, Vol. 107, No. 22, pp. 9985-9990.