library(readxl)  # provides read_excel()
library(dplyr)   # provides %>% and select(), used below
datach <- read_excel('F00011095-WVS_Wave_7_Ukraine_Excel_v2.0.xlsx')
head(datach)
## # A tibble: 6 × 416
## `version: Version of Data File` doi: Digital Object I…¹ A_YEAR: Year of surv…²
## <chr> <chr> <dbl>
## 1 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## 2 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## 3 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## 4 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## 5 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## 6 1-5-0 (2020-11-16) doi.org/10.14281/18241… 2020
## # ℹ abbreviated names: ¹`doi: Digital Object Identifier`,
## # ²`A_YEAR: Year of survey`
## # ℹ 413 more variables: `B_COUNTRY: ISO 3166-1 numeric country code` <dbl>,
## # `B_COUNTRY_ALPHA: ISO 3166-1 alpha-3 country code` <chr>,
## # `C_COW_NUM: CoW country code numeric` <dbl>,
## # `C_COW_ALPHA: CoW country code alpha` <chr>,
## # `D_INTERVIEW: Interview ID` <dbl>, `J_INTDATE: Date of interview` <dbl>, …
We analyze life satisfaction as the dependent variable. As independent variables we chose income, the importance of politics and the importance of work, and we explore how each of them relates to life satisfaction.
The relationship between work and life satisfaction is widely discussed in the scientific community, but most articles we found focus on the link between job satisfaction and life satisfaction. Our dataset has no job satisfaction measure, so we use the importance of work instead. Few papers examine job importance as a direct predictor of life satisfaction; however, Iris and Barrett (1972) found that job importance moderates the relationship between job satisfaction and life satisfaction. Tait et al. (1989) found the relationship between job satisfaction and life satisfaction to be stronger for men than for women, but also reported that this difference disappeared in more recent studies. This is why we examine the interaction between gender and job importance in our project.
In Selim's (2007) analysis of World Values Survey results for Turkey, life satisfaction (or happiness) was negatively associated with age, with the importance of work and of politics, and with being male, while higher income was positively associated with it.
We found the class and income variables to be substantially correlated (Pearson r = -0.59, listwise deletion). The sign is negative because class is coded from 1 (upper class) to 5 (lower class), while income runs from 1 (lowest group) to 10 (highest group). Nevertheless, we keep class as a control variable alongside income, our variable of focus, to account for the social aspect of “class”; the model that includes class also has slightly better explanatory power.
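The correlation check described above can be sketched as follows. This is only an illustration on a small made-up data frame: the name `toy` and its values are hypothetical and are not the survey data.

```r
# Toy data (hypothetical values, not the WVS sample); NA marks a missing answer
toy <- data.frame(
  class  = c(1, 2, 3, 4, 5, NA, 3),
  income = c(9, 7, 5, 4, 2, 6, NA)
)

# use = "complete.obs" drops every row that has a missing value (listwise deletion)
r <- cor(toy$class, toy$income, use = "complete.obs", method = "pearson")
round(r, 2)
```

With class coded so that larger numbers mean a lower class, a negative coefficient is the expected direction.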
References:
Iris, B., & Barrett, G. V. (1972). Some relations between job and life satisfaction and job importance. Journal of Applied Psychology, 56(4), 301–304.
Tait, M., Padgett, M. Y., & Baldwin, T. T. (1989). Job and life satisfaction: A reevaluation of the strength of the relationship and gender effects as a function of the date of the study. Journal of Applied Psychology, 74(3), 502–507.
Selim, S. (2007). Life Satisfaction and Happiness in Turkey. Social Indicators Research, 88(3), 531–562.
data = datach %>% dplyr::select("Q49: Satisfaction with your life", "Q4: Important in life: Politics", "Q5: Important in life: Work", "Q260: Sex", "Q262: Age", "Q287: Social class (subjective)", "Q288: Scale of incomes")
data[data<0] <- NA
data <- na.omit(data)
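The cleaning step above treats negative codes as missing answers and then keeps only complete rows. A minimal sketch of the same two steps on a made-up data frame (the name `toy` and its values are hypothetical) shows how many respondents this removes:

```r
# Toy data frame (hypothetical values); negative numbers stand for missing-answer codes
toy <- data.frame(lifesat = c(7, -1, 5, 8), income = c(4, 6, -2, 5))

toy[toy < 0] <- NA        # turn every negative code into NA
clean <- na.omit(toy)     # listwise deletion: keep only complete rows

nrow(toy) - nrow(clean)   # number of respondents dropped
```

Comparing the row counts before and after is a quick way to see how much of the sample listwise deletion costs.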
Control variables: sex, age, class.
Renaming variables and cleaning data.
data$lifesat = data$`Q49: Satisfaction with your life`
data$politics = data$`Q4: Important in life: Politics`
data$work = data$`Q5: Important in life: Work`
data$sex = data$`Q260: Sex`
data$age = data$`Q262: Age`
data$class = data$`Q287: Social class (subjective)`
data$income = data$`Q288: Scale of incomes`
data = data %>% dplyr::select(lifesat, politics, work, sex, age, class, income)
Recoding sex: 1 = male (M), 2 = female (F).
data$sex <- as.character(data$sex)
data$sex[data$sex == 1] <- "M"
data$sex[data$sex == 2] <- "F"
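The same recode could also be written in one step with `factor()`. The sketch below uses a hypothetical vector `sex_code` and assumes the codes 1 and 2 as stated above:

```r
sex_code <- c(1, 2, 2, 1, 2)   # hypothetical raw codes, not the survey data

# levels = codes in the data, labels = the text they should become
sex <- factor(sex_code, levels = c(1, 2), labels = c("M", "F"))
table(sex)
```

Note that the level order here is M, F, so M would be the reference category in a regression. Calling `as.factor()` on a character vector, as is done later in this report, orders the levels alphabetically and makes F the reference.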
y <- c(data$lifesat)
x <- c(data$politics)
a <- c(data$work)
c <- c(data$sex)
f <- c(data$age)
e <- c(data$class)
g <- c(data$income)
t.test(y ~ c, data = data)
##
## Welch Two Sample t-test
##
## data: y by c
## t = 0.24148, df = 948.15, p-value = 0.8092
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -0.2269385 0.2906226
## sample estimates:
## mean in group F mean in group M
## 6.112805 6.080963
The p-value (0.81) is large, so we cannot reject the null hypothesis of equal means: the t-test gives no evidence that mean life satisfaction differs between women (6.11) and men (6.08).
cor.test(x, y, method = "spearman", use = "complete.obs", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 218348647, p-value = 0.09682
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.04979653
There is insufficient evidence of a correlation between life satisfaction and the importance of politics (rho = 0.05, p = 0.097 > 0.05).
cor.test(a, y, method = "spearman", use = "complete.obs", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: a and y
## S = 248231823, p-value = 0.007395
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08024823
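There is a statistically significant but very weak negative correlation between life satisfaction and the importance of work (rho = -0.08, p = 0.007). Because work is coded from 1 (very important) to 4 (not at all important), a negative rho means that respondents who rate work as less important tend to report slightly lower life satisfaction.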
cor.test(f, y, method = "spearman", use = "complete.obs", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: f and y
## S = 282405128, p-value = 1.051e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2289627
There is a statistically significant, weak negative correlation between life satisfaction and age (rho = -0.23, p < 0.001): older respondents tend to be less satisfied.
cor.test(e, y, method = "spearman", use = "complete.obs", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: e and y
## S = 284232277, p-value = 1.154e-15
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.236914
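There is a statistically significant, weak negative correlation between life satisfaction and social class (rho = -0.24, p < 0.001). Class is coded from 1 (upper) to 5 (lower), so respondents who place themselves in a lower class tend to be less satisfied.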
cor.test(g, y, method = "spearman", use = "complete.obs", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: g and y
## S = 161387779, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2976772
There is a statistically significant positive correlation between life satisfaction and income (rho = 0.30, p < 0.001), the strongest of the bivariate associations examined here.
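The repeated `cor.test()` calls above could be collected into one table. The following is a sketch on simulated data, not the survey sample; the data frame `sim` and the way `lifesat` is generated are assumptions made only for illustration.

```r
set.seed(1)
n <- 200
# Simulated stand-ins for the survey variables (hypothetical, not the WVS data)
sim <- data.frame(
  age    = sample(18:86, n, replace = TRUE),
  income = sample(1:10, n, replace = TRUE)
)
sim$lifesat <- pmin(10, pmax(1, round(5 + 0.3 * sim$income - 0.02 * sim$age + rnorm(n))))

# Run cor.test() for each predictor against lifesat and keep rho and the p-value
predictors <- c("age", "income")
res <- t(sapply(predictors, function(v) {
  ct <- cor.test(sim[[v]], sim$lifesat, method = "spearman", exact = FALSE)
  c(rho = unname(ct$estimate), p = ct$p.value)
}))
round(res, 3)
```

Each row of `res` is one predictor, which makes it easier to compare the strength and sign of the associations side by side.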
Changing types of variables for a model
data$sex = as.factor(data$sex)
data$politics = as.factor(data$politics)
data$work = as.factor(data$work)
data$income = as.numeric(data$income)
data$class = as.numeric(data$class)
skim(data)
| Name | data |
| Number of rows | 1113 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| politics | 0 | 1 | FALSE | 4 | 4: 399, 3: 365, 2: 247, 1: 102 |
| work | 0 | 1 | FALSE | 4 | 1: 460, 2: 444, 3: 107, 4: 102 |
| sex | 0 | 1 | FALSE | 2 | F: 656, M: 457 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lifesat | 0 | 1 | 6.10 | 2.14 | 1 | 5 | 6 | 8 | 10 | ▂▃▇▇▃ |
| age | 0 | 1 | 47.70 | 16.38 | 18 | 34 | 46 | 61 | 86 | ▅▇▆▆▂ |
| class | 0 | 1 | 3.48 | 0.94 | 1 | 3 | 3 | 4 | 5 | ▁▃▇▇▃ |
| income | 0 | 1 | 4.48 | 1.94 | 1 | 3 | 5 | 6 | 10 | ▃▆▇▃▁ |
describe(data)
## data
##
## 7 Variables 1113 Observations
## --------------------------------------------------------------------------------
## lifesat
## n missing distinct Info Mean Gmd .05 .10
## 1113 0 10 0.978 6.1 2.394 2 3
## .25 .50 .75 .90 .95
## 5 6 8 9 10
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 42 25 57 105 194 182 229 146 57 76
## Proportion 0.038 0.022 0.051 0.094 0.174 0.164 0.206 0.131 0.051 0.068
## --------------------------------------------------------------------------------
## politics
## n missing distinct
## 1113 0 4
##
## Value 1 2 3 4
## Frequency 102 247 365 399
## Proportion 0.092 0.222 0.328 0.358
## --------------------------------------------------------------------------------
## work
## n missing distinct
## 1113 0 4
##
## Value 1 2 3 4
## Frequency 460 444 107 102
## Proportion 0.413 0.399 0.096 0.092
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 1113 0 2
##
## Value F M
## Frequency 656 457
## Proportion 0.589 0.411
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 1113 0 69 1 47.7 18.82 23.0 26.2
## .25 .50 .75 .90 .95
## 34.0 46.0 61.0 70.0 74.4
##
## lowest : 18 19 20 21 22, highest: 82 83 84 85 86
## --------------------------------------------------------------------------------
## class
## n missing distinct Info Mean Gmd
## 1113 0 5 0.908 3.479 1.032
##
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##
## Value 1 2 3 4 5
## Frequency 8 156 420 353 176
## Proportion 0.007 0.140 0.377 0.317 0.158
## --------------------------------------------------------------------------------
## income
## n missing distinct Info Mean Gmd .05 .10
## 1113 0 10 0.974 4.481 2.181 1 1
## .25 .50 .75 .90 .95
## 3 5 6 7 8
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 115 75 133 198 262 165 108 41 13 3
## Proportion 0.103 0.067 0.119 0.178 0.235 0.148 0.097 0.037 0.012 0.003
## --------------------------------------------------------------------------------
summary(data)
## lifesat politics work sex age class
## Min. : 1.0 1:102 1:460 F:656 Min. :18.0 Min. :1.000
## 1st Qu.: 5.0 2:247 2:444 M:457 1st Qu.:34.0 1st Qu.:3.000
## Median : 6.0 3:365 3:107 Median :46.0 Median :3.000
## Mean : 6.1 4:399 4:102 Mean :47.7 Mean :3.479
## 3rd Qu.: 8.0 3rd Qu.:61.0 3rd Qu.:4.000
## Max. :10.0 Max. :86.0 Max. :5.000
## income
## Min. : 1.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 4.481
## 3rd Qu.: 6.000
## Max. :10.000
ggplot(data, aes(x = lifesat)) +
geom_histogram(binwidth = 1, color = 1, fill = "white") +
xlab('Satisfaction with life') + ylab('Count') +
geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.5)
Distribution is non-normal with peak around 7 and with a slight left skew.
ggplot(data, aes(x = age)) +
geom_histogram(binwidth = 1, color = 1, fill = "brown") +
xlab('Age') + ylab('Count') +
geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.75)
This distribution is not normal and has no particularly clear pattern; it is concentrated around age 35 and is slightly right-skewed.
ggplot(data, aes(x = class)) +
geom_histogram(binwidth = 1, color = 1, fill = "yellow") +
xlab('Class') + ylab('Count') +
geom_density(aes(y = after_stat(count)), colour = "black", adjust = 4)
The distribution is roughly bell-shaped with its peak at category 3, and very few respondents place themselves in category 1.
ggplot(data, aes(x = income)) +
geom_histogram(binwidth = 1, color = 1, fill = "pink") +
xlab('Income') + ylab('Count') +
geom_density(aes(y = after_stat(count)), colour = "black", adjust = 1.5)
The distribution is roughly bell-shaped around 5, with a secondary peak at the lowest category (1) and a thin tail toward the highest income groups (8 to 10).
ggplot(data) +
geom_bar(aes(sex, fill=sex)) +
guides(fill="none") +
xlab('Gender') +
ylab('Count')
There are about 1.4 times as many women (656) as men (457) in the sample.
ggplot(data) +
geom_bar(aes(politics, fill=politics)) +
guides(fill="none") +
xlab('Importance of politics') +
ylab('Count')
Most of the sample does not consider politics an important part of life: the largest group rates politics as not important at all, and the smallest group rates it as very important. The group sizes increase steadily from “Very important” to “Not important at all”.
ggplot(data) +
geom_bar(aes(work, fill=sex)) +
labs(fill='Sex') +
xlab('Importance of work with gender') +
ylab('Count')
The vast majority of participants regard work as an important part of their lives. The largest group rates work as “Very important” (460), closely followed by “Mostly important” (444). The “Mostly not important” (107) and “Not important at all” (102) groups are much smaller and similar in size, the latter being the smallest. The plot shows no marked gender imbalance across the categories.
None of the numeric variables is normally distributed.
The group sizes of all categorical variables are unequal.
model1 <- lm(lifesat~age+politics+class+income+sex*work,data=data)
summary(model1)
##
## Call:
## lm(formula = lifesat ~ age + politics + class + income + sex *
## work, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1148 -1.2819 0.0689 1.2685 5.7657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.121036 0.490017 12.491 < 2e-16 ***
## age -0.018834 0.004313 -4.367 1.38e-05 ***
## politics2 0.516628 0.239002 2.162 0.03086 *
## politics3 0.539214 0.228545 2.359 0.01848 *
## politics4 0.650491 0.227875 2.855 0.00439 **
## class -0.165120 0.079914 -2.066 0.03904 *
## income 0.248980 0.039898 6.240 6.22e-10 ***
## sexM -0.468733 0.189384 -2.475 0.01347 *
## work2 -0.232167 0.176334 -1.317 0.18824
## work3 0.084863 0.291814 0.291 0.77125
## work4 -0.600224 0.295913 -2.028 0.04276 *
## sexM:work2 0.570237 0.272343 2.094 0.03650 *
## sexM:work3 0.206967 0.435254 0.476 0.63452
## sexM:work4 1.390538 0.452454 3.073 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.004 on 1099 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.1258
## F-statistic: 13.31 on 13 and 1099 DF, p-value: < 2.2e-16
library(effects)
plot(effect(term="sex:work",mod=model1,default.levels=20),multiline=TRUE, colors = c("red","blue"))
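Individual interaction coefficients only compare one cell with the reference cell. One common way to ask whether the interaction matters as a whole is to compare the model with and without it using an F-test. The sketch below does this on simulated data; the data frame `sim`, the effect sizes and the model names are hypothetical and only illustrate the technique.

```r
set.seed(42)
n <- 400
# Simulated data (hypothetical): work importance matters differently for men and women
sim <- data.frame(
  sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
  work = factor(sample(1:4, n, replace = TRUE))
)
sim$lifesat <- 6 + ifelse(sim$sex == "M" & sim$work == "4", 1.5, 0) + rnorm(n, sd = 2)

additive <- lm(lifesat ~ sex + work, data = sim)   # main effects only
with_int <- lm(lifesat ~ sex * work, data = sim)   # adds the sex:work terms

# F-test of the three interaction coefficients taken together
cmp <- anova(additive, with_int)
cmp
```

A small p-value in the second row of `cmp` would indicate that the interaction terms jointly improve the fit.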
plot(model1, 1)
The residuals-versus-fitted plot shows no clear pattern, which suggests that a linear relationship between the predictors and the outcome variable is a reasonable assumption.
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.99788, p-value = 0.168
The Shapiro-Wilk test does not reject normality of the residuals (W = 0.998, p = 0.168), so we can assume that they are normally distributed.
plot(model1, 2)
In the Q-Q plot above the points lie close to the line, which indicates that the residuals are approximately normally distributed.
ols_plot_resid_hist(model1)
We can see that residuals are normally distributed, which is good for our model.
library(MASS)
sresid <- studres(model1)
shapiro.test(sresid)
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.99787, p-value = 0.1645
The studentized residuals are likewise consistent with a normal distribution (p = 0.16 > 0.05).
plot(model1, 3)
library(lmtest)
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 42.519, df = 13, p-value = 5.385e-05
library(car)  # provides ncvTest() and vif()
ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 18.24124, Df = 1, p = 1.9462e-05
Heteroscedasticity is present.
The line in the scale-location plot is not horizontal, and both tests give p-values below the 0.05 significance level (Breusch-Pagan p = 5.4e-05, score test p = 1.9e-05), so we reject the null hypothesis that the variance of the residuals is constant. Neither taking the log of the dependent variable (life satisfaction) nor weighted regression resolved this, so heteroscedasticity remains a problem in this model, possibly because of the wide range of observed values. The coefficient estimates are still unbiased, but the standard errors and p-values reported above should be read with caution.
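A commonly used way to deal with heteroscedasticity that transformations do not remove is to keep the OLS coefficients and replace the standard errors with heteroscedasticity-consistent ones. This was not done in the analysis above; the sketch below only illustrates the idea in base R on simulated data (the variables `x`, `y` and `fit` are hypothetical). In practice `sandwich::vcovHC()` combined with `lmtest::coeftest()` is the usual packaged route.

```r
set.seed(7)
n <- 300
x <- runif(n, 1, 10)
# Simulated heteroscedastic data (hypothetical): the error spread grows with x
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)
XtX_inv <- solve(crossprod(X))

# HC3 sandwich estimator: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
meat   <- crossprod(X * (e / (1 - h)))
vc_hc3 <- XtX_inv %*% meat %*% XtX_inv

robust_se  <- sqrt(diag(vc_hc3))
classic_se <- sqrt(diag(vcov(fit)))
cbind(estimate = coef(fit), classic_se, robust_se)
```

The coefficients are unchanged; only the standard errors, and therefore the t-statistics and p-values, differ between the two columns.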
vif(model1)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## age 1.381801 1 1.175500
## politics 1.120203 3 1.019098
## class 1.574006 1 1.254594
## income 1.662477 1 1.289371
## sex 2.405714 1 1.551036
## work 5.660725 3 1.334992
## sex:work 8.377689 3 1.425129
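The adjusted values GVIF^(1/(2*Df)) are all below 1.6. On this scale the usual cut-off of VIF = 5 corresponds to √5 ≈ 2.24, so multicollinearity is not a concern. The raw GVIFs for work (5.66) and sex:work (8.38) exceed 5 mainly because the interaction term is built from sex and work themselves.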
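For a numeric predictor the variance inflation factor has a simple definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all other predictors. The sketch below computes it by hand on simulated data; the data frame `X` and the strength of the class-income relation are assumptions for illustration only.

```r
set.seed(3)
n <- 500
# Simulated predictors (hypothetical): income is built to correlate negatively with class
X <- data.frame(
  age   = sample(18:86, n, replace = TRUE),
  class = sample(1:5, n, replace = TRUE)
)
X$income <- round(pmin(10, pmax(1, 8 - 1.2 * X$class + rnorm(n, sd = 1.5))))

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

Two correlated predictors inflate each other's VIF, while an unrelated predictor such as the simulated `age` stays close to 1.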
plot(model1, 5)
No observation has a standardized residual beyond 3 standard deviations, which is good, and there are no influential points in our regression model.
As for the independence of observations, the website from which we took the data indicates that respondents answered independently of each other, so this criterion is met.
All in all, the model meets the regression assumptions except for homoscedasticity. It explains about 13% of the variance in life satisfaction (adjusted R-squared = 0.126).
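Life satisfaction decreases with age (by about 0.02 points per year).
Because the model includes a sex:work interaction, the sex coefficient applies only to respondents for whom work is very important: in that group men are about 0.47 points less satisfied than women.
Those for whom politics is very important are less satisfied with their lives than every other group; the less important politics is to a person, the higher his or her life satisfaction.
The higher respondents rate their own social class, the more satisfied they are with life (class is coded from 1 = upper to 5 = lower, hence the negative coefficient).
Higher income goes with higher life satisfaction (about 0.25 points per income step).
Among women, those for whom work is not important at all are less satisfied (by about 0.60 points) than those for whom work is very important.
Among men the pattern is reversed: the positive interaction terms more than offset the work coefficients, so men for whom work is very important report lower life satisfaction than men who attach less importance to work.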
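The interaction terms in the regression output can be combined by hand to read off the differences between groups. The sketch below copies the relevant estimates from the `summary(model1)` output; the vector names are chosen here for illustration. These are point estimates only, and no standard errors are computed for the combined terms.

```r
# Coefficients copied from the summary(model1) output above
b_sexM  <- -0.468733
b_work  <- c(work1 = 0, work2 = -0.232167, work3 = 0.084863, work4 = -0.600224)
b_inter <- c(work1 = 0, work2 =  0.570237, work3 = 0.206967, work4 =  1.390538)

# Male minus female difference in predicted life satisfaction at each level of work
gap_m_vs_f <- b_sexM + b_inter

# Effect of each work level relative to "very important" (work1), separately by sex
work_effect_f <- b_work
work_effect_m <- b_work + b_inter

round(rbind(gap_m_vs_f, work_effect_f, work_effect_m), 2)
```

For example, among respondents for whom work is not important at all (work4), the estimated male-female gap is -0.469 + 1.391 ≈ 0.92, whereas among those for whom work is very important it is -0.47.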