library(readxl)  # read_excel()
library(dplyr)   # %>% and select()
datach <- read_excel('F00011095-WVS_Wave_7_Ukraine_Excel_v2.0.xlsx')
head(datach)
## # A tibble: 6 × 416
##   `version: Version of Data File` doi: Digital Object I…¹ A_YEAR: Year of surv…²
##   <chr>                           <chr>                                    <dbl>
## 1 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## 2 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## 3 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## 4 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## 5 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## 6 1-5-0 (2020-11-16)              doi.org/10.14281/18241…                   2020
## # ℹ abbreviated names: ¹​`doi: Digital Object Identifier`,
## #   ²​`A_YEAR: Year of survey`
## # ℹ 413 more variables: `B_COUNTRY: ISO 3166-1 numeric country code` <dbl>,
## #   `B_COUNTRY_ALPHA: ISO 3166-1 alpha-3 country code` <chr>,
## #   `C_COW_NUM: CoW country code numeric` <dbl>,
## #   `C_COW_ALPHA: CoW country code alpha` <chr>,
## #   `D_INTERVIEW: Interview ID` <dbl>, `J_INTDATE: Date of interview` <dbl>, …

Literature review

We analyze life satisfaction in the World Values Survey Wave 7 data for Ukraine. Our independent variables are income, the importance of politics, and the importance of work, and we explore how each relates to life satisfaction.

The relationship between work and life satisfaction is widely discussed in the scientific literature, but most of the articles we found focus on the link between job satisfaction and life satisfaction. Our dataset has no job satisfaction variable, so we use the importance of work instead. Few papers examine job importance as a direct predictor of life satisfaction; Iris and Barrett (1972), however, found that job importance moderates the relationship between job satisfaction and life satisfaction. Tait et al. (1989) found that the relationship between job satisfaction and life satisfaction was stronger for men than for women, but also reported that this difference had disappeared in more recent studies. For this reason we examine the interaction between gender and the importance of work in our project.

In Selim's (2007) analysis of World Values Survey results for Turkey, life satisfaction (or happiness) was negatively associated with age, with the importance of work and of politics, and with being male, and positively associated with higher income.

We found the class and income variables to be substantially correlated (Pearson's r = -0.59, computed with listwise deletion). The sign is negative because class is coded from 1 (highest) to 5 (lowest), while income runs from 1 (lowest) to 10 (highest). We nevertheless keep class as a control variable alongside income, our variable of interest, to account for the social aspect of “class”; the model that includes class also has slightly better explanatory power.
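A correlation of this kind can be computed along the following lines. This is a minimal sketch on made-up values: the vectors `class_toy` and `income_toy` are hypothetical stand-ins, not the survey data. In `cor()`, the argument `use = "complete.obs"` is what performs the listwise deletion.

```r
# Hypothetical toy values (not the WVS data); NA marks a missing answer.
class_toy  <- c(5, 4, 4, 3, NA, 2, 3, 1)
income_toy <- c(1, 3, 2, 5, 6, NA, 6, 9)

# Listwise deletion: any pair with an NA in either variable is dropped
# before the Pearson correlation is computed.
r_listwise <- cor(class_toy, income_toy,
                  method = "pearson", use = "complete.obs")

# Equivalent manual version, to make the deletion step explicit.
keep     <- complete.cases(class_toy, income_toy)
r_manual <- cor(class_toy[keep], income_toy[keep])

print(c(listwise = r_listwise, manual = r_manual))
```

Both numbers agree, which shows that `use = "complete.obs"` simply removes incomplete pairs before the usual formula is applied.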

References:

  1. Iris, B., & Barrett, G. V. (1972). Some relations between job and life satisfaction and job importance. Journal of Applied Psychology, 56(4), 301–304.

  2. Tait, M., Padgett, M. Y., & Baldwin, T. T. (1989). Job and life satisfaction: A reevaluation of the strength of the relationship and gender effects as a function of the date of the study. Journal of Applied Psychology, 74(3), 502–507.

  3. Selim, S. (2007). Life Satisfaction and Happiness in Turkey. Social Indicators Research, 88(3), 531–562.

data = datach %>% dplyr::select("Q49: Satisfaction with your life", "Q4: Important in life: Politics", "Q5: Important in life: Work", "Q260: Sex", "Q262: Age", "Q287: Social class (subjective)", "Q288: Scale of incomes")
data[data<0] <- NA
data <- na.omit(data)
  • lifesat: 1 - Completely dissatisfied, 10 - Completely satisfied
  • politics, work: 1 - Very important, 2 - Rather important, 3 - Rather not important, 4 - Not important
  • sex: 1 - M, 2 - F
  • class: 1. Highest class, 2. Upper middle class, 3. Lower middle class, 4. Working class, 5. Lowest class
  • income: 1 - lowest income group, 10 - highest income group

Control variables: sex, age, class.
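The cleaning step above (negative codes set to NA, then incomplete rows dropped) can be illustrated on a tiny example. This is only a sketch: the data frame `toy` and its values are hypothetical, chosen to imitate negative missing-value codes.

```r
# Hypothetical toy data frame; negative values imitate missing-value codes.
toy <- data.frame(
  lifesat = c(7, -1, 5, 9),
  income  = c(4, 6, -2, 3)
)

toy[toy < 0] <- NA         # turn every negative code into NA
toy_clean <- na.omit(toy)  # listwise deletion: drop rows with any NA

print(toy_clean)
print(nrow(toy) - nrow(toy_clean))  # number of rows dropped
```

Because `na.omit()` removes a row if any selected column is missing, selecting only the seven analysis variables first keeps the loss of observations as small as possible.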

Renaming variables and cleaning data.

data$lifesat = data$`Q49: Satisfaction with your life`
data$politics = data$`Q4: Important in life: Politics`
data$work = data$`Q5: Important in life: Work`
data$sex = data$`Q260: Sex`
data$age = data$`Q262: Age`
data$class = data$`Q287: Social class (subjective)`
data$income = data$`Q288: Scale of incomes`
data = data %>% dplyr::select(lifesat, politics, work, sex, age, class, income)

Recoding sex from numeric codes to labels: 1 = M, 2 = F.

data$sex <- as.character(data$sex)
data$sex[data$sex == 1] <- "M"
data$sex[data$sex == 2] <- "F"
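The same recoding can be done in one step with `factor()`. This is a sketch on a hypothetical vector `sex_code`, not the approach used in the report. Listing level 2 first makes "F" the reference level, which matches the alphabetical ordering that `as.factor()` produces later on.

```r
# Hypothetical numeric codes as they appear in the raw column.
sex_code <- c(1, 2, 2, 1, 2)

# Map codes to labels in one step; the order of `labels` follows `levels`.
# Level 2 ("F") comes first, so "F" is the reference category in a model.
sex_fct <- factor(sex_code, levels = c(2, 1), labels = c("F", "M"))

print(levels(sex_fct))
print(table(sex_fct))
```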

Descriptive statistics

y <- c(data$lifesat)
x <- c(data$politics)
a <- c(data$work)
c <- c(data$sex)
f <- c(data$age)
e <- c(data$class)
g <- c(data$income)
t.test(y ~ c, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  y by c
## t = 0.24148, df = 948.15, p-value = 0.8092
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.2269385  0.2906226
## sample estimates:
## mean in group F mean in group M 
##        6.112805        6.080963

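The p-value is large (0.81) and the 95% confidence interval for the difference in means (-0.23 to 0.29) contains zero, so we cannot reject the null hypothesis that mean life satisfaction is the same for women and men. This is an absence of evidence for a difference, not proof that the difference is exactly zero.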
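To make the test statistic less of a black box, the Welch t value can be recomputed by hand. The sketch below uses simulated data (the vectors `grp_f` and `grp_m` are hypothetical, not the survey responses) and checks the manual formula against `t.test()`.

```r
set.seed(1)
# Simulated stand-in data (not the survey): two groups of unequal size.
grp_f <- rnorm(60, mean = 6.1, sd = 2)
grp_m <- rnorm(45, mean = 6.1, sd = 2)

res <- t.test(grp_f, grp_m)  # Welch test is the default (var.equal = FALSE)

# Welch statistic: difference in means over its unpooled standard error.
se_diff  <- sqrt(var(grp_f) / length(grp_f) + var(grp_m) / length(grp_m))
t_manual <- (mean(grp_f) - mean(grp_m)) / se_diff

print(res$conf.int)  # if this interval contains 0, H0 is not rejected at 5%
print(c(t_test = unname(res$statistic), manual = t_manual))
```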

cor.test(x, y, method = "spearman", use = "complete.obs", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 218348647, p-value = 0.09682
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.04979653

There is not sufficient evidence of a correlation between life satisfaction and the importance of politics (rho = 0.05, p = 0.097 > 0.05).
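Spearman's rho is the Pearson correlation of the ranks, which is why it suits ordinal survey items like these. A minimal sketch on simulated values (`importance` and `satisf` are hypothetical stand-ins, not the survey columns):

```r
set.seed(2)
# Simulated ordinal stand-ins (not the survey data).
importance <- sample(1:4, 200, replace = TRUE)
satisf     <- sample(1:10, 200, replace = TRUE)

rho_builtin <- cor(importance, satisf, method = "spearman")
# Same number by hand: Pearson correlation of the tie-averaged ranks.
rho_manual  <- cor(rank(importance), rank(satisf))

# exact = FALSE requests the asymptotic p-value and avoids the warning
# R gives when an exact p-value cannot be computed because of ties.
test <- cor.test(importance, satisf, method = "spearman", exact = FALSE)

print(c(builtin = rho_builtin, manual = rho_manual,
        from_test = unname(test$estimate)))
```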

cor.test(a, y, method = "spearman", use = "complete.obs", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  a and y
## S = 248231823, p-value = 0.007395
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.08024823

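There is a statistically significant but very weak negative correlation between life satisfaction and the importance of work (rho = -0.08, p = 0.007). Because work is coded from 1 (very important) to 4 (not important), the negative sign means that respondents who rate work as less important tend to report slightly lower life satisfaction.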

cor.test(f, y, method = "spearman", use = "complete.obs", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  f and y
## S = 282405128, p-value = 1.051e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2289627

There is a statistically significant, weak negative correlation between life satisfaction and age (rho = -0.23, p < 0.001): older respondents tend to be less satisfied.

cor.test(e, y, method = "spearman", use = "complete.obs", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  e and y
## S = 284232277, p-value = 1.154e-15
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.236914

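There is a statistically significant, weak negative correlation between life satisfaction and social class (rho = -0.24, p < 0.001). Because class is coded from 1 (highest) to 5 (lowest), respondents who place themselves in a lower class tend to be less satisfied.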

cor.test(g, y, method = "spearman", use = "complete.obs", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  g and y
## S = 161387779, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2976772

There is a statistically significant, weak-to-moderate positive correlation between life satisfaction and income (rho = 0.30, p < 0.001): respondents in higher income groups tend to be more satisfied.

Changing types of variables for a model

data$sex = as.factor(data$sex)
data$politics = as.factor(data$politics)
data$work = as.factor(data$work)
data$income = as.numeric(data$income)
data$class = as.numeric(data$class)
library(skimr)  # skim()
skim(data)
Data summary

Name                      data
Number of rows            1113
Number of columns         7
Column type frequency:
  factor                  3
  numeric                 4
Group variables           None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
politics               0              1  FALSE           4  4: 399, 3: 365, 2: 247, 1: 102
work                   0              1  FALSE           4  1: 460, 2: 444, 3: 107, 4: 102
sex                    0              1  FALSE           2  F: 656, M: 457

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
lifesat                0              1   6.10   2.14   1    5    6    8    10  ▂▃▇▇▃
age                    0              1  47.70  16.38  18   34   46   61    86  ▅▇▆▆▂
class                  0              1   3.48   0.94   1    3    3    4     5  ▁▃▇▇▃
income                 0              1   4.48   1.94   1    3    5    6    10  ▃▆▇▃▁
library(Hmisc)  # describe()
describe(data)
## data 
## 
##  7  Variables      1113  Observations
## --------------------------------------------------------------------------------
## lifesat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1113        0       10    0.978      6.1    2.394        2        3 
##      .25      .50      .75      .90      .95 
##        5        6        8        9       10 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency     42    25    57   105   194   182   229   146    57    76
## Proportion 0.038 0.022 0.051 0.094 0.174 0.164 0.206 0.131 0.051 0.068
## --------------------------------------------------------------------------------
## politics 
##        n  missing distinct 
##     1113        0        4 
##                                   
## Value          1     2     3     4
## Frequency    102   247   365   399
## Proportion 0.092 0.222 0.328 0.358
## --------------------------------------------------------------------------------
## work 
##        n  missing distinct 
##     1113        0        4 
##                                   
## Value          1     2     3     4
## Frequency    460   444   107   102
## Proportion 0.413 0.399 0.096 0.092
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##     1113        0        2 
##                       
## Value          F     M
## Frequency    656   457
## Proportion 0.589 0.411
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1113        0       69        1     47.7    18.82     23.0     26.2 
##      .25      .50      .75      .90      .95 
##     34.0     46.0     61.0     70.0     74.4 
## 
## lowest : 18 19 20 21 22, highest: 82 83 84 85 86
## --------------------------------------------------------------------------------
## class 
##        n  missing distinct     Info     Mean      Gmd 
##     1113        0        5    0.908    3.479    1.032 
## 
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##                                         
## Value          1     2     3     4     5
## Frequency      8   156   420   353   176
## Proportion 0.007 0.140 0.377 0.317 0.158
## --------------------------------------------------------------------------------
## income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1113        0       10    0.974    4.481    2.181        1        1 
##      .25      .50      .75      .90      .95 
##        3        5        6        7        8 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    115    75   133   198   262   165   108    41    13     3
## Proportion 0.103 0.067 0.119 0.178 0.235 0.148 0.097 0.037 0.012 0.003
## --------------------------------------------------------------------------------
summary(data)
##     lifesat     politics work    sex          age           class      
##  Min.   : 1.0   1:102    1:460   F:656   Min.   :18.0   Min.   :1.000  
##  1st Qu.: 5.0   2:247    2:444   M:457   1st Qu.:34.0   1st Qu.:3.000  
##  Median : 6.0   3:365    3:107           Median :46.0   Median :3.000  
##  Mean   : 6.1   4:399    4:102           Mean   :47.7   Mean   :3.479  
##  3rd Qu.: 8.0                            3rd Qu.:61.0   3rd Qu.:4.000  
##  Max.   :10.0                            Max.   :86.0   Max.   :5.000  
##      income      
##  Min.   : 1.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 4.481  
##  3rd Qu.: 6.000  
##  Max.   :10.000

Numeric variables:

library(ggplot2)
ggplot(data, aes(x = lifesat)) +
  geom_histogram(binwidth = 1, color = 1, fill = "white") +
  geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.5) +
  xlab('Satisfaction with life') + ylab('Count')

The distribution is non-normal, with a peak at 7 and a slight left skew.
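The visual impression of a left skew can be checked numerically. The sketch below rebuilds the lifesat values from the frequency table printed by `describe()` earlier and computes a moment-based skewness by hand; the formula is one common definition, written out here only for illustration.

```r
# Frequencies of lifesat = 1..10, copied from the describe() output.
freq    <- c(42, 25, 57, 105, 194, 182, 229, 146, 57, 76)
lifesat <- rep(1:10, times = freq)

m <- mean(lifesat)
# Moment-based skewness: third central moment over the cubed standard
# deviation (population version, dividing by n).
skew <- mean((lifesat - m)^3) / mean((lifesat - m)^2)^1.5

print(c(n = length(lifesat), mean = round(m, 2), skewness = round(skew, 2)))
```

A negative value corresponds to a longer tail on the left, that is, towards low satisfaction scores.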

ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 1, color = 1, fill = "brown") +
  geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.75) +
  xlab('Age') + ylab('Count')

This distribution is not normal and has no particularly clear pattern; it is concentrated around age 35 and has a slight right skew.

ggplot(data, aes(x = class)) +
  geom_histogram(binwidth = 1, color = 1, fill = "yellow") +
  geom_density(aes(y = after_stat(count)), colour = "black", adjust = 4) +
  xlab('Class') + ylab('Count')

The distribution is roughly bell-shaped with a thin left tail: only 8 respondents place themselves in the highest class (1).

ggplot(data, aes(x = income)) +
  geom_histogram(binwidth = 1, color = 1, fill = "pink") +
  geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.5) +
  xlab('Income') + ylab('Count')

The distribution is roughly bell-shaped around group 5, with a thin right tail: only 57 respondents (about 5%) place themselves in income groups 8 to 10.

Non-numeric variables:

ggplot(data) + 
  geom_bar(aes(sex, fill = sex)) + 
  guides(fill = "none") +
  xlab('Gender') + 
  ylab('Count')

The number of women in the sample (656) is about 1.4 times the number of men (457).

ggplot(data) + 
  geom_bar(aes(politics, fill=politics)) + 
  guides(fill = "none") +
  xlab('Importance of politics') + 
  ylab('Count')

Most respondents do not consider politics an important part of their lives. The number of respondents rises steadily from “Very important” (102) through “Rather important” (247) and “Rather not important” (365) to “Not important” (399), so the largest group regards politics as not important at all and the smallest regards it as very important.

ggplot(data) + 
  geom_bar(aes(work, fill=sex)) + 
  guides(fill = "none") +
  xlab('Importance of work with gender') + 
  ylab('Count')

The vast majority of respondents regard work as an important part of their lives. The “Very important” group (460) is the largest and almost the same size as the “Rather important” group (444). The “Rather not important” (107) and “Not important” (102) groups are much smaller and similar in size, with “Not important” the smallest. The bars show no obvious gender imbalance across the categories.

Conclusions:

  1. Numeric variables are not normally distributed.

  2. None of the non-numeric variables is evenly distributed across its categories.

Model

model1 <- lm(lifesat~age+politics+class+income+sex*work,data=data)
summary(model1)
## 
## Call:
## lm(formula = lifesat ~ age + politics + class + income + sex * 
##     work, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1148 -1.2819  0.0689  1.2685  5.7657 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.121036   0.490017  12.491  < 2e-16 ***
## age         -0.018834   0.004313  -4.367 1.38e-05 ***
## politics2    0.516628   0.239002   2.162  0.03086 *  
## politics3    0.539214   0.228545   2.359  0.01848 *  
## politics4    0.650491   0.227875   2.855  0.00439 ** 
## class       -0.165120   0.079914  -2.066  0.03904 *  
## income       0.248980   0.039898   6.240 6.22e-10 ***
## sexM        -0.468733   0.189384  -2.475  0.01347 *  
## work2       -0.232167   0.176334  -1.317  0.18824    
## work3        0.084863   0.291814   0.291  0.77125    
## work4       -0.600224   0.295913  -2.028  0.04276 *  
## sexM:work2   0.570237   0.272343   2.094  0.03650 *  
## sexM:work3   0.206967   0.435254   0.476  0.63452    
## sexM:work4   1.390538   0.452454   3.073  0.00217 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.004 on 1099 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.1258 
## F-statistic: 13.31 on 13 and 1099 DF,  p-value: < 2.2e-16
  • Intercept: 6.12 is the expected lifesat for the reference categories (a woman for whom both politics and work are very important) with age, class and income all equal to 0. Since age 0 and class 0 lie outside the observed range, this value has no direct substantive meaning.
  • age: each additional year of age is associated with a 0.02 lower lifesat on average; older respondents are less satisfied with life.
  • politics (reference category: 1, very important):
      • politics2: those for whom politics is rather important (2) are on average 0.52 more satisfied than those for whom it is very important (1).
      • politics3: those for whom politics is rather not important (3) are on average 0.54 more satisfied than those for whom it is very important (1).
      • politics4: those for whom politics is not important (4) are on average 0.65 more satisfied than those for whom it is very important (1).
  • class: a one-unit increase in class (one step towards the lowest class, since 1 is the highest) is associated with a 0.17 lower lifesat on average; the lower respondents place their class, the less satisfied they are.
  • income: a one-unit increase in income is associated with a 0.25 higher lifesat on average; the higher the income group respondents place themselves in, the more satisfied they are with life.
  • sex: because the model includes the sex*work interaction, the sexM coefficient (-0.47) is the difference between men and women only among those for whom work is very important (work = 1). In that group men are on average 0.47 less satisfied than women.
  • work: likewise, the work coefficients describe women (the reference sex). Women for whom work is not important (4) are on average 0.60 less satisfied than women for whom work is very important (1); the coefficients for work2 and work3 are not statistically significant.
  • sex:work: the interaction terms show how the work differences change for men. For men, the difference between work4 and work1 is -0.60 + 1.39 = 0.79, and the difference between work2 and work1 is -0.23 + 0.57 = 0.34.
library(effects)
plot(effect(term="sex:work",mod=model1,default.levels=20),multiline=TRUE, colors = c("red","blue"))

  • The effect plot shows that the association between the importance of work and life satisfaction differs by sex. Among respondents for whom work is very important (1), women are more satisfied than men (by 0.47 on average). Among those for whom work is not important (4), the pattern reverses and men are more satisfied than women (by 0.92), because women's satisfaction falls while men's rises. In the middle categories the differences are small: at “rather important” (2) men are slightly more satisfied than women (by 0.10), and at “rather not important” (3) women are more satisfied than men (by 0.26).
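With an interaction in the model, the difference between men and women at each level of work is the sexM coefficient plus the matching interaction coefficient. The sketch below does this arithmetic with the estimates copied from `summary(model1)`; these are point estimates only, without standard errors.

```r
# Coefficients copied from summary(model1).
b <- c(sexM         = -0.468733,
       "sexM:work2" =  0.570237,
       "sexM:work3" =  0.206967,
       "sexM:work4" =  1.390538)

# Male minus female difference in expected lifesat at each level of work,
# holding the other predictors fixed.
gap <- c(work1 = b[["sexM"]],
         work2 = b[["sexM"]] + b[["sexM:work2"]],
         work3 = b[["sexM"]] + b[["sexM:work3"]],
         work4 = b[["sexM"]] + b[["sexM:work4"]])

print(round(gap, 2))
```

Negative values mean that women are more satisfied than men at that level of work, positive values the opposite.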

Assumptions

1. Linearity of the data. The relationship between the predictors and the outcome is assumed to be linear.

plot(model1, 1)

The residuals-versus-fitted plot shows no systematic pattern, so a linear relationship between the predictors and the outcome variable is a reasonable assumption.

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.99788, p-value = 0.168

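The Shapiro-Wilk test does not reject normality of the residuals (p = 0.168 > 0.05). This result bears on assumption 2 (normality of residuals) rather than on linearity.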

2. Normality of residuals. The residual errors are assumed to be normally distributed.

plot(model1, 2)

The Q-Q plot indicates approximately normally distributed residuals: the points lie close to the reference line.

library(olsrr)  # ols_plot_resid_hist()
ols_plot_resid_hist(model1)

The histogram of residuals is also consistent with a normal distribution, which is good for our model.

library(MASS)
sresid <- studres(model1) 
shapiro.test(sresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.99787, p-value = 0.1645

The studentized residuals are also consistent with a normal distribution: the test does not reject normality (p = 0.16 > 0.05).

3. Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity)

plot(model1, 3)

library(lmtest)
library(car)  # ncvTest() and vif()
bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 42.519, df = 13, p-value = 5.385e-05
ncvTest(model1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 18.24124, Df = 1, p = 1.9462e-05

Heteroscedasticity is present.

The line in the scale-location plot is not horizontal, and both tests give p-values below the 0.05 significance level (Breusch-Pagan: p = 5.4e-05; non-constant variance score test: p = 1.9e-05). We therefore reject the null hypothesis that the variance of the residuals is constant. Neither taking the log of the dependent variable (life satisfaction) nor using weighted regression removed the problem, so heteroscedasticity remains a limitation of this model. As a consequence, the standard errors and p-values reported for the coefficients should be read with caution.
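Another common response to heteroscedasticity, not used in this analysis, is to keep the OLS coefficients and replace the usual standard errors with heteroscedasticity-consistent ones. The sketch below computes the basic White (HC0) sandwich estimator by hand on simulated data; the variables `x`, `y` and `fit` are hypothetical and unrelated to the survey model.

```r
set.seed(3)
# Simulated stand-in data whose error variance grows with x.
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))          # (X'X)^-1
meat  <- crossprod(X * e)             # X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread  # White (HC0) sandwich estimator

se_classic <- sqrt(diag(vcov(fit)))   # assumes constant error variance
se_robust  <- sqrt(diag(vcov_hc0))    # valid under heteroscedasticity
print(rbind(classic = se_classic, robust_HC0 = se_robust))
```

The coefficients are unchanged; only their standard errors, and therefore the t and p values, differ. In practice the `sandwich` package offers this and refined variants through `vcovHC()`.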

4. Multicollinearity

vif(model1)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##              GVIF Df GVIF^(1/(2*Df))
## age      1.381801  1        1.175500
## politics 1.120203  3        1.019098
## class    1.574006  1        1.254594
## income   1.662477  1        1.289371
## sex      2.405714  1        1.551036
## work     5.660725  3        1.334992
## sex:work 8.377689  3        1.425129

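All GVIF^(1/(2*Df)) values are below 2, so multicollinearity is not a concern. The raw GVIF values for work (5.66) and sex:work (8.38) exceed 5, but this is expected: the interaction term is built from sex and work, and for terms with more than one degree of freedom the adjusted value GVIF^(1/(2*Df)) is the one to compare across predictors.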
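The adjusted column can be reproduced from the raw GVIF and Df values, which also shows how it relates to the familiar VIF thresholds. The numbers below are copied from the `vif()` output.

```r
# GVIF and Df values copied from the vif() output.
gvif <- c(age = 1.381801, politics = 1.120203, class = 1.574006,
          income = 1.662477, sex = 2.405714, work = 5.660725,
          "sex:work" = 8.377689)
df   <- c(age = 1, politics = 3, class = 1, income = 1,
          sex = 1, work = 3, "sex:work" = 3)

adj <- gvif^(1 / (2 * df))  # comparable across terms with different Df
print(round(adj, 3))
print(round(adj^2, 2))      # squared value is on the scale of an ordinary VIF
```

Squaring the adjusted value puts it on the scale of an ordinary VIF, so a cutoff of 5 on the VIF corresponds to roughly 2.24 on the adjusted scale.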

5. Outliers.

plot(model1, 5)

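Nearly all standardized residuals lie within 3 standard deviations of zero. The largest negative raw residual (-6.11) is about 3.05 times the residual standard error (2.004), so at least one observation lies at or just beyond that threshold. The plot shows no influential points in our regression model.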
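The same check can be done numerically rather than by eye. The sketch below uses a stand-in model fitted to the built-in `mtcars` data, not the survey model, to show the two base R functions involved.

```r
# Stand-in model on a built-in dataset (mtcars), not the survey model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

std_res <- rstandard(fit)       # internally standardized residuals
cooks_d <- cooks.distance(fit)  # influence of each observation on the fit

n_large <- sum(abs(std_res) > 3)  # residuals beyond 3 standard deviations
print(n_large)
print(head(sort(cooks_d, decreasing = TRUE), 3))  # most influential rows
```

Applied to `model1`, the same two calls would list exactly which respondents, if any, exceed the threshold of 3 and how large their Cook's distances are.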

6. Independence

According to the documentation on the website from which we took the data, respondents answered independently of one another, so this criterion is met.

Conclusion:

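Overall, the model meets the assumptions we checked with the exception of homoscedasticity, but its explanatory power is modest: it explains about 13% of the variance in life satisfaction (adjusted R-squared = 0.126).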

  • Older people are less satisfied with their lives.

  • Men and women do not differ in mean life satisfaction overall (t-test), but among those for whom work is very important men are less satisfied than women.

  • Those for whom politics is very important are the least satisfied with their lives; each lower level of importance of politics goes with higher satisfaction.

  • The higher respondents place their social class, the more satisfied they are with life.

  • The higher the income group, the higher the life satisfaction.

  • Among women, those for whom work is not important are less satisfied than those for whom work is very important; among men the pattern is reversed.
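The model's explanatory power can be cross-checked with the adjusted R-squared formula, using only numbers printed in the model summary (R-squared as rounded there, 1113 observations, 13 estimated slopes).

```r
r2 <- 0.136  # Multiple R-squared, as rounded in the summary output
n  <- 1113   # number of observations
p  <- 13     # estimated slopes (model df of the F-statistic)

# The adjustment penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

Here n - p - 1 equals the 1099 residual degrees of freedom reported in the summary, and the result agrees with the printed adjusted R-squared up to rounding.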