Our team and distribution of tasks:

Uploading data

Wave_7_Germany <- read_excel("C:/Users/Alina/Documents/data/Wave_7_Germany.xlsx")

Variables

germany <- Wave_7_Germany[,c(81,103,164,175,275,295,297,310,328)]
#rm(Germany)
names(germany)[1:9] = c("satisfaction", "conf_gov", "security",
                         "worry_job","pol_scale", "sex", "age", 
                         "marital", "class")
germany[germany < 0] <- NA
germany <- na.omit(germany)

Our dependent variable - Live satisfaction

Independent variables:

**тут наверное надо гипотезы и теорию*

Recoding the variables

Scales

germany$worry_job <- abs(germany$worry_job - 5) #new: "Very much" - 5, "Not at all" - 1)
germany$security <- abs(germany$security - 5) #new: from 1 to 4 (1 - Not at all, 4 - Very  secure)
germany$conf_gov <- abs(germany$conf_gov - 5) #new: from 1 to 4 (1 - None at all, 4 - A great deal)

Factors

germany$class <- recode_factor(germany$class, `1` = "Upper class", `2` = "Upper middle class", `3` = "Lower middle class", `4` = "Working class", `5` = "Lower class")
germany$marital <- recode_factor(germany$marital, `1` = "Married", `2` = "Living together as married",`3` = "Divorced", `4` = "Separated", `5` = "Widowed", `6` = "Single")
germany$sex <- recode_factor(germany$sex, `1` = "Male", `2` = "Female")
#Let's have a look at our data
#hist(germany$satisfaction, main= "Distribution of life satisfaction level in Germany", xlab = "Life satisfaction", col = "lightblue")
#summary(germany)

Model

model <- lm(formula = satisfaction ~ sex + age + marital + class + pol_scale + worry_job + security  + conf_gov, data = germany)
summary(model)
## 
## Call:
## lm(formula = satisfaction ~ sex + age + marital + class + pol_scale + 
##     worry_job + security + conf_gov, data = germany)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2397 -0.7681  0.1539  0.9463  3.9825 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        7.4705272  0.5022059  14.875  < 2e-16 ***
## sexFemale                          0.2263323  0.0861548   2.627 0.008719 ** 
## age                               -0.0001458  0.0030997  -0.047 0.962483    
## maritalLiving together as married  0.0851302  0.1508956   0.564 0.572743    
## maritalDivorced                   -0.6889404  0.1653391  -4.167 3.30e-05 ***
## maritalSeparated                  -0.5753806  0.3485481  -1.651 0.099035 .  
## maritalWidowed                    -0.5932505  0.2045536  -2.900 0.003795 ** 
## maritalSingle                     -0.3119494  0.1267027  -2.462 0.013950 *  
## classUpper middle class           -0.8173926  0.3410831  -2.396 0.016702 *  
## classLower middle class           -1.0224243  0.3405822  -3.002 0.002736 ** 
## classWorking class                -1.5533729  0.3510123  -4.425 1.05e-05 ***
## classLower class                  -3.7955419  0.4950650  -7.667 3.55e-14 ***
## pol_scale                          0.0606112  0.0248199   2.442 0.014744 *  
## worry_job                         -0.2313458  0.0502359  -4.605 4.55e-06 ***
## security                           0.3170466  0.0712602   4.449 9.40e-06 ***
## conf_gov                           0.2048316  0.0598192   3.424 0.000637 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.488 on 1238 degrees of freedom
## Multiple R-squared:  0.1808, Adjusted R-squared:  0.1709 
## F-statistic: 18.21 on 15 and 1238 DF,  p-value: < 2.2e-16

The model explains about 17% of the variance.

Model Diagnistics

4.1 Normality of residuals
plot(model, 2)

This graph tells us that the residuals are non-normally distributed due to presence of some outliers.

ols_plot_resid_hist(model)

Here we see that there is a bell-shaped distribution, but some outliers exist.

outlierTest(model, n.max = 100)
##      rstudent unadjusted p-value Bonferroni p
## 997 -4.247130         2.3274e-05     0.029186
## 628 -4.129893         3.8729e-05     0.048567

The outlier test doesn’t notify us of some significant outliers. Let’s dig further and deeper.

new: The outlier test notify us about 2 significant outliers. (997,628) Let’s dig further and deeper.

plot(model, 5)

The graph of standardized residuals vs. leverage points tells us that we do have some influental values that affect the normality of residuals badly.

plot(model, 4, id.n = 10)

These are the values that one may exclude.

Я добавила сюда (997,628) из предыдущего теста

datagood <- germany[-c(166, 454, 501, 495, 567, 620, 628, 980, 997, 1254), ]

Then this table was used to detect the values.

4.2 Heteroscedasticity

To check for heteroskedasticity, we run the ncv test.

ncvTest(model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 59.73628, Df = 1, p = 1.0846e-14

The p-value is below the threshold, so one can be sure - the heteroskedascticity assumption is met.

4.3 Non-linearity check

boxcox(model)

The boxcox test suggests that we should subject our dependent variable to a logariphmic transformation.

4.4 Multicolinearity
vif(model)
##               GVIF Df GVIF^(1/(2*Df))
## sex       1.050684  1        1.025029
## age       1.668111  1        1.291554
## marital   1.695859  5        1.054239
## class     1.140919  4        1.016616
## pol_scale 1.059517  1        1.029329
## worry_job 1.196603  1        1.093893
## security  1.165030  1        1.079366
## conf_gov  1.117345  1        1.057045

Last but not least we check that none of the estimates exceed 4. Then we can be sure that none of the variables in our model are corellated.

Final model

modelbest <- lm(I(satisfaction)^2 ~ sex + age + marital + class + pol_scale + worry_job + security  + conf_gov, data = datagood)
summary(modelbest)
## 
## Call:
## lm(formula = I(satisfaction)^2 ~ sex + age + marital + class + 
##     pol_scale + worry_job + security + conf_gov, data = datagood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.539 -13.453   0.181  13.238  59.999 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        58.41372    7.07100   8.261 3.70e-16 ***
## sexFemale                           3.24730    1.21685   2.669 0.007717 ** 
## age                                 0.02616    0.04374   0.598 0.549883    
## maritalLiving together as married   1.30145    2.12562   0.612 0.540472    
## maritalDivorced                   -10.02031    2.33376  -4.294 1.90e-05 ***
## maritalSeparated                   -8.73819    5.48619  -1.593 0.111471    
## maritalWidowed                     -7.82864    2.89861  -2.701 0.007012 ** 
## maritalSingle                      -5.03413    1.78311  -2.823 0.004831 ** 
## classUpper middle class           -12.63556    4.79196  -2.637 0.008474 ** 
## classLower middle class           -16.01469    4.78481  -3.347 0.000842 ***
## classWorking class                -22.10098    4.93343  -4.480 8.17e-06 ***
## classLower class                  -51.74995    7.34680  -7.044 3.11e-12 ***
## pol_scale                           0.88377    0.35077   2.520 0.011878 *  
## worry_job                          -3.12745    0.70995  -4.405 1.15e-05 ***
## security                            4.63870    1.00341   4.623 4.18e-06 ***
## conf_gov                            2.62362    0.84404   3.108 0.001924 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 1228 degrees of freedom
## Multiple R-squared:  0.1684, Adjusted R-squared:  0.1582 
## F-statistic: 16.58 on 15 and 1228 DF,  p-value: < 2.2e-16

у меня чего-то не работает

#par(mfrow = c(2, 2))
#plot(modelbest)

Residuals vs Fitted graph tells us that the relations between the variables in the model are almost perfectly linear, the assumption is met.

Normal Q-Q graph identifies some outliers, but in general the standardized residuals are normally distributed, assumption is met.

Scale-Location plot show that squares of standardized residuals are almost flatly, equally distributed, so the assumption of heteroskedasticity is met.

Residuals vs Leverage plot doesn’t have outlying point in the right upper corner, but does have some highlighted by the text.

However, none of the points are present in the “outlierTest”, so we mat be confident with the model.