Group project #1

Our team and distribution of tasks:

Lukyanchikova Polina: напишите че вы делали
Chistyakova Alina: selection of the variables, data preparation, model explanation
Sharifullin Timur:
Ushakov Ivan:

Uploading data

Wave_7_Germany <- read_excel("C:/Users/Alina/Documents/data/Wave_7_Germany.xlsx")

Variables

germany <- Wave_7_Germany[,c(81,103,164,175,275,295,297,310,328)]
#rm(Germany)

names(germany)[1:9] = c("satisfaction", "conf_gov", "security",
                         "worry_job","pol_scale", "sex", "age", 
                         "marital", "class")
germany[germany < 0] <- NA
germany <- na.omit(germany)

Our dependent variable - Live satisfaction

Independent variables:

conf_gov: Could you tell me how much confidence you have in the government;
security: Could you tell me how secure do you feel these days;
worry_job: To what degree are you worried about losing a job or not finding one;
pol_scale: In political matters, people talk of “the left” and “the right.” How would you place your views on this scale, generally speaking? (1 - left, 10 - right);
sex
age
marital: Are you currently Married/ Living together as married/ Divorced/ Separated/ Widowed/ Single;
class: People sometimes describe themselves as belonging to the working class, the middle class, or the upper or lower class. Would you describe yourself as belonging to certain class [list of them].

**тут наверное надо гипотезы и теорию*

Recoding the variables

Scales

germany$worry_job <- abs(germany$worry_job - 5) #new: "Very much" - 5, "Not at all" - 1)
germany$security <- abs(germany$security - 5) #new: from 1 to 4 (1 - Not at all, 4 - Very  secure)
germany$conf_gov <- abs(germany$conf_gov - 5) #new: from 1 to 4 (1 - None at all, 4 - A great deal)

Factors

germany$class <- recode_factor(germany$class, `1` = "Upper class", `2` = "Upper middle class", `3` = "Lower middle class", `4` = "Working class", `5` = "Lower class")
germany$marital <- recode_factor(germany$marital, `1` = "Married", `2` = "Living together as married",`3` = "Divorced", `4` = "Separated", `5` = "Widowed", `6` = "Single")
germany$sex <- recode_factor(germany$sex, `1` = "Male", `2` = "Female")

#Let's have a look at our data
#hist(germany$satisfaction, main= "Distribution of life satisfaction level in Germany", xlab = "Life satisfaction", col = "lightblue")
#summary(germany)

Model

model <- lm(formula = satisfaction ~ sex + age + marital + class + pol_scale + worry_job + security  + conf_gov, data = germany)
summary(model)

## 
## Call:
## lm(formula = satisfaction ~ sex + age + marital + class + pol_scale + 
##     worry_job + security + conf_gov, data = germany)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2397 -0.7681  0.1539  0.9463  3.9825 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        7.4705272  0.5022059  14.875  < 2e-16 ***
## sexFemale                          0.2263323  0.0861548   2.627 0.008719 ** 
## age                               -0.0001458  0.0030997  -0.047 0.962483    
## maritalLiving together as married  0.0851302  0.1508956   0.564 0.572743    
## maritalDivorced                   -0.6889404  0.1653391  -4.167 3.30e-05 ***
## maritalSeparated                  -0.5753806  0.3485481  -1.651 0.099035 .  
## maritalWidowed                    -0.5932505  0.2045536  -2.900 0.003795 ** 
## maritalSingle                     -0.3119494  0.1267027  -2.462 0.013950 *  
## classUpper middle class           -0.8173926  0.3410831  -2.396 0.016702 *  
## classLower middle class           -1.0224243  0.3405822  -3.002 0.002736 ** 
## classWorking class                -1.5533729  0.3510123  -4.425 1.05e-05 ***
## classLower class                  -3.7955419  0.4950650  -7.667 3.55e-14 ***
## pol_scale                          0.0606112  0.0248199   2.442 0.014744 *  
## worry_job                         -0.2313458  0.0502359  -4.605 4.55e-06 ***
## security                           0.3170466  0.0712602   4.449 9.40e-06 ***
## conf_gov                           0.2048316  0.0598192   3.424 0.000637 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.488 on 1238 degrees of freedom
## Multiple R-squared:  0.1808, Adjusted R-squared:  0.1709 
## F-statistic: 18.21 on 15 and 1238 DF,  p-value: < 2.2e-16

For females, Life satisfaction predicted to be more than for males by 0.27.
Age appeared to be not significant in our model.
Life satisfaction foretold to be less for Divorced than for Married by [-]0.69, the same can be observed for Separated (-0.58), Widowed (-0.59), Single (-0.31). It seems that loneliness decreased life satisfaction significantly.
Upper middle class by -0.8173926 is less satisfied with their lives than Upper class, for Lower middle class, Working class, Lower class situation is the same with following estimated coefficients (-1.02, -1.55, -3.8 respectively).
The more “right” the respondent’s views on the political scale, the more satisfied he/she is with life. One increase in placements at scale where 10 is right political views leads to increase of Life satisfaction by 0.06.
Life satisfaction foretold to be less for people who are worried about losing a job, with estimated effect of -0.231.
The more secure the respondent feels and the more confident he/she is in the government of the country, the greater his/her satisfaction with life. (Effects are 0.31 & 0.2 respectively)

The model explains about 17% of the variance.

Model Diagnistics

4.1 Normality of residuals

plot(model, 2)

This graph tells us that the residuals are non-normally distributed due to presence of some outliers.

ols_plot_resid_hist(model)

Here we see that there is a bell-shaped distribution, but some outliers exist.

outlierTest(model, n.max = 100)

##      rstudent unadjusted p-value Bonferroni p
## 997 -4.247130         2.3274e-05     0.029186
## 628 -4.129893         3.8729e-05     0.048567

The outlier test doesn’t notify us of some significant outliers. Let’s dig further and deeper.

new: The outlier test notify us about 2 significant outliers. (997,628) Let’s dig further and deeper.

plot(model, 5)

The graph of standardized residuals vs. leverage points tells us that we do have some influental values that affect the normality of residuals badly.

plot(model, 4, id.n = 10)

These are the values that one may exclude.

Я добавила сюда (997,628) из предыдущего теста

datagood <- germany[-c(166, 454, 501, 495, 567, 620, 628, 980, 997, 1254), ]

Then this table was used to detect the values.

4.2 Heteroscedasticity

To check for heteroskedasticity, we run the ncv test.

ncvTest(model)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 59.73628, Df = 1, p = 1.0846e-14

The p-value is below the threshold, so one can be sure - the heteroskedascticity assumption is met.

4.3 Non-linearity check

boxcox(model)

The boxcox test suggests that we should subject our dependent variable to a logariphmic transformation.

4.4 Multicolinearity

vif(model)

##               GVIF Df GVIF^(1/(2*Df))
## sex       1.050684  1        1.025029
## age       1.668111  1        1.291554
## marital   1.695859  5        1.054239
## class     1.140919  4        1.016616
## pol_scale 1.059517  1        1.029329
## worry_job 1.196603  1        1.093893
## security  1.165030  1        1.079366
## conf_gov  1.117345  1        1.057045

Last but not least we check that none of the estimates exceed 4. Then we can be sure that none of the variables in our model are corellated.

Final model

modelbest <- lm(I(satisfaction)^2 ~ sex + age + marital + class + pol_scale + worry_job + security  + conf_gov, data = datagood)

summary(modelbest)

## 
## Call:
## lm(formula = I(satisfaction)^2 ~ sex + age + marital + class + 
##     pol_scale + worry_job + security + conf_gov, data = datagood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.539 -13.453   0.181  13.238  59.999 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        58.41372    7.07100   8.261 3.70e-16 ***
## sexFemale                           3.24730    1.21685   2.669 0.007717 ** 
## age                                 0.02616    0.04374   0.598 0.549883    
## maritalLiving together as married   1.30145    2.12562   0.612 0.540472    
## maritalDivorced                   -10.02031    2.33376  -4.294 1.90e-05 ***
## maritalSeparated                   -8.73819    5.48619  -1.593 0.111471    
## maritalWidowed                     -7.82864    2.89861  -2.701 0.007012 ** 
## maritalSingle                      -5.03413    1.78311  -2.823 0.004831 ** 
## classUpper middle class           -12.63556    4.79196  -2.637 0.008474 ** 
## classLower middle class           -16.01469    4.78481  -3.347 0.000842 ***
## classWorking class                -22.10098    4.93343  -4.480 8.17e-06 ***
## classLower class                  -51.74995    7.34680  -7.044 3.11e-12 ***
## pol_scale                           0.88377    0.35077   2.520 0.011878 *  
## worry_job                          -3.12745    0.70995  -4.405 1.15e-05 ***
## security                            4.63870    1.00341   4.623 4.18e-06 ***
## conf_gov                            2.62362    0.84404   3.108 0.001924 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 1228 degrees of freedom
## Multiple R-squared:  0.1684, Adjusted R-squared:  0.1582 
## F-statistic: 16.58 on 15 and 1228 DF,  p-value: < 2.2e-16

у меня чего-то не работает

#par(mfrow = c(2, 2))
#plot(modelbest)

Residuals vs Fitted graph tells us that the relations between the variables in the model are almost perfectly linear, the assumption is met.

Normal Q-Q graph identifies some outliers, but in general the standardized residuals are normally distributed, assumption is met.

Scale-Location plot show that squares of standardized residuals are almost flatly, equally distributed, so the assumption of heteroskedasticity is met.

Residuals vs Leverage plot doesn’t have outlying point in the right upper corner, but does have some highlighted by the text.

However, none of the points are present in the “outlierTest”, so we mat be confident with the model.