https://ess-search.nsd.no/en/study/7ccf7f30-fd1a-470a-9b90-4c91b0bc7438 (You’ll need to create an account to do that.) Use the Russian data.
Perform linear regression to test which of the basic SES components (income, education, occupation) would better explain health inequalities in Russia. Dependent variable is Subjective general health (C7 in the ESS dataset). Control variables: age, gender, marital status, place of residence. To run 4 regressions is recommended. Each of 3 components separately and 1 regression with all three of them. You may use ISCO-08 for occupation categories, make sure that each category is not less than 10 percent of the sample.
Report the estimated results and whether they have a statistically signicant influence on the subjective health condition. Which model fits best?
Plot the results. UPD: if you haven’t study vizualization yet, just do the table.
Download files on the Assignment section
df <- import("/Users/DP/Downloads/ESS6e02_5.csv")
RU <- filter(df, cntry == "RU")
RU1 <- select(RU, c("isco08","hinctnta", "edlvdru", "health", "agea", "gndr", "marsts" , "domicil"))
RU1 <- RU1 %>%
mutate(isco08=ifelse(isco08 %in% c(66666, 77777, 99999), NA, isco08),
hinctnta =ifelse(hinctnta %in% c( 77, 88, 99), NA, hinctnta),
edlvdru=ifelse(edlvdru %in% c( 5555, 7777, 8888, 9999), NA, edlvdru),
health=ifelse(health %in% c( 7, 8, 9), NA, health),
agea =ifelse(agea %in% c( 999), NA, agea),
gndr=ifelse(gndr %in% c( 9), NA, gndr),
marsts=ifelse(marsts %in% c( 66, 77, 88, 99), NA, marsts) ,
domicil=ifelse(domicil %in% c( 7, 8, 9), NA, domicil))
skim(RU1)
| Name | RU1 |
| Number of rows | 2484 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| isco08 | 330 | 0.87 | 5099.21 | 2399.36 | 110 | 3132.25 | 5221 | 7212 | 9629 | ▁▇▇▃▅ |
| hinctnta | 534 | 0.79 | 6.11 | 2.79 | 1 | 4.00 | 6 | 9 | 10 | ▃▆▆▆▇ |
| edlvdru | 0 | 1.00 | 6.91 | 2.47 | 1 | 5.00 | 7 | 10 | 11 | ▂▃▇▁▆ |
| health | 18 | 0.99 | 2.76 | 0.81 | 1 | 2.00 | 3 | 3 | 5 | ▁▅▇▂▁ |
| agea | 3 | 1.00 | 46.00 | 18.10 | 15 | 30.00 | 45 | 60 | 90 | ▇▇▇▆▂ |
| gndr | 0 | 1.00 | 1.62 | 0.49 | 1 | 1.00 | 2 | 2 | 2 | ▅▁▁▁▇ |
| marsts | 1077 | 0.57 | 5.03 | 1.04 | 1 | 4.00 | 5 | 6 | 6 | ▁▁▆▅▇ |
| domicil | 3 | 1.00 | 2.37 | 1.22 | 1 | 1.00 | 3 | 3 | 5 | ▇▁▇▅▁ |
#basic SES components
isco08 - What is/was the name or title of your main job? In your main job, what kind of work do/did you do most of the time? What training or qualifications are/were needed for the job?
hinctnta - Using this card, please tell me which letter describes your household’s total income, after tax and compulsory deductions, from all sources? If you don’t know the exact figure, please give an estimate. Use the part of the card that you know best: weekly, monthly or annual income.
1 - J Less than 6000 rubles (or less than 72 th rubles a year) 2 - R 6’001-9’000 rubles (72-108 th rubles a year) 3 - C 9’001-12’000 rubles (108-144 th rubles a year) 4 - M 12’001-15’000 rubles (144-180 th rubles a year) 5 - F 15’001-18’000 rubles (180-216 th rubles a year) 6 - S 18’001-21’000 rubles (216-252 th rubles a year) 7 - K 21’001-25’000 rubles (252-300 th rubles a year) 8 - P 25’001-30’000 rubles (300-360 th rubles a year) 9 - D 30’001-40’000 rubles (360-480 th rubles a year) 10 - H More than 40’001 rubles (more than 480 th rubles a year)
edlvdru - What is the highest level of education you have successfully completed? (Russian Federation)
#Dependent variable
health - How is your health in general? Would you say it is …
#Control variables:
agea - Age of respondent, calculated
gndr - CODE SEX, respondent
1 Male 2 Female
marsts - This question is about your legal marital status not about who you may or may not be living with. Which one of the descriptions on this card describes your legal marital status now?
1 Legally married 2 In a legally registered civil union 3 Legally separated 4 Legally divorced/Civil union dissolved 5 Widowed/Civil partner died 6 None of these (NEVER married or in legally registered civil union)
domicil - Which phrase on this card best describes the area where you live? 1 A big city 2 Suburbs or outskirts of big city 3 Town or small city 4 Country village 5 Farm or home in countryside
RU1$isco08 <- ifelse(RU1$isco08 <= 310 & RU1$isco08 >= 0, 1, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 1439 & RU1$isco08 >= 1000, 2, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 2659 & RU1$isco08 >= 2000, 3, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 3522 & RU1$isco08 >= 3000, 4, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 4419 & RU1$isco08 >= 4000, 5, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 5419 & RU1$isco08 >= 5000, 6, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 6340 & RU1$isco08 >= 6000, 7, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 7549 & RU1$isco08 >= 7000, 8, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 8350 & RU1$isco08 >= 8000, 9, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 9629 & RU1$isco08 >= 9000, 10, RU1$isco08)
1 - Military 2 - Managers 3 - Professionals 4 - Technicians and associate professionals 5 - Clerical support workers 6 - Service and sales workers 7 - Skilled agricultural, forestry and fishery workers 8 - Craft and related trades workers 9 - Plant and machine operators, and assemblers 10 - Elementary occupations
RU1 %>%
group_by(isco08) %>%
count()
## # A tibble: 11 × 2
## # Groups: isco08 [11]
## isco08 n
## <dbl> <int>
## 1 1 13
## 2 2 90
## 3 3 375
## 4 4 312
## 5 5 161
## 6 6 506
## 7 7 33
## 8 8 263
## 9 9 211
## 10 10 190
## 11 NA 330
RU1 <- RU1 %>%
mutate(isco08= car::recode(isco08, recodes="c(1,2,7) = 8 ;
3 = 1; 4=2; 5=3; 6=4; 8=5; 9=6; 10=7 ", as.factor=TRUE))
RU1 %>%
group_by(edlvdru) %>%
count()
## # A tibble: 11 × 2
## # Groups: edlvdru [11]
## edlvdru n
## <int> <int>
## 1 1 7
## 2 2 75
## 3 3 169
## 4 4 350
## 5 5 61
## 6 6 224
## 7 7 806
## 8 8 60
## 9 9 9
## 10 10 714
## 11 11 9
RU1 <- RU1 %>%
mutate(edlvdru= car::recode(edlvdru, recodes="c(2,3,4) = 1;
c(5,6,7) = 2;
c(8,9,10,11) = 3", as.factor=TRUE))
1 - school ed. 2 - college 3 - high ed.
RU1 %>%
group_by(health) %>%
count()
## # A tibble: 6 × 2
## # Groups: health [6]
## health n
## <int> <int>
## 1 1 128
## 2 2 748
## 3 3 1225
## 4 4 321
## 5 5 44
## 6 NA 18
RU1 <- RU1 %>%
mutate(health= car::recode(health, recodes="1=5;
2=4;
3=3;
4=2;
2=1"))
RU1 %>%
group_by(edlvdru) %>%
count()
## # A tibble: 3 × 2
## # Groups: edlvdru [3]
## edlvdru n
## <fct> <int>
## 1 1 601
## 2 2 1091
## 3 3 792
RU1$gndr <- as.factor(RU1$gndr)
RU1$marsts <- as.factor(RU1$marsts)
note: In the process of processing the data, I excluded the marital status, since there was a large number of missing values in it, and there was also a problem in a small amount, for example, of data in which respondents said they were married (33), which could lead to errors
RU1 %>%
group_by(marsts) %>%
count()
## # A tibble: 5 × 2
## # Groups: marsts [5]
## marsts n
## <fct> <int>
## 1 1 33
## 2 4 410
## 3 5 377
## 4 6 587
## 5 <NA> 1077
RU1 %>%
group_by(domicil) %>%
count()
## # A tibble: 6 × 2
## # Groups: domicil [6]
## domicil n
## <int> <int>
## 1 1 999
## 2 2 94
## 3 3 866
## 4 4 509
## 5 5 13
## 6 NA 3
RU1 <- RU1 %>%
mutate(domicil= car::recode(domicil, recodes="c(1,2)=1;
3=2;
c(4,5)=3", as.factor=TRUE))
1- big city 2- town 3 - rural
RU2 <- drop_na(RU1)
RU1$marsts <- NULL
RU1 <- drop_na(RU1)
RU1 %>%
sjmisc::descr(show = c('n', "mean","sd", "md", "range")) %>%
rename("variable" = "var",
"Number of obs." = "n",
"Mean" = "mean",
"SD" = "sd",
"Median" = "md",
"Range" = "range")
##
## ## Basic descriptive statistics
##
## variable Number of obs. Mean SD Median Range
## isco08 1728 3.973380 2.1075285 4 7 (1-8)
## hinctnta 1728 6.184028 2.7879590 6 9 (1-10)
## edlvdru 1728 2.119213 0.7136043 2 2 (1-3)
## health 1728 3.263310 0.7779510 3 3 (2-5)
## agea 1728 48.129630 17.2997847 48 74 (15-89)
## gndr 1728 1.630787 0.4827314 2 1 (1-2)
## domicil 1728 1.781829 0.7792731 2 2 (1-3)
RU2 %>%
sjmisc::descr(show = c('n', "mean","sd", "md", "range")) %>%
rename("variable" = "var",
"Number of obs." = "n",
"Mean" = "mean",
"SD" = "sd",
"Median" = "md",
"Range" = "range")
##
## ## Basic descriptive statistics
##
## variable Number of obs. Mean SD Median Range
## isco08 960 3.934375 2.1246028 4 7 (1-8)
## hinctnta 960 5.355208 2.8526325 5 9 (1-10)
## edlvdru 960 2.094792 0.7380426 2 2 (1-3)
## health 960 3.228125 0.8073183 3 3 (2-5)
## agea 960 48.767708 19.1645419 48 74 (15-89)
## gndr 960 1.735417 0.4413410 2 1 (1-2)
## marsts 960 4.901042 1.0311109 5 5 (1-6)
## domicil 960 1.735417 0.7687834 2 2 (1-3)
agea, hinctnta, health
RU1 %>%
pivot_longer(c(agea, hinctnta, health),
names_to = 'Var', values_to = 'Score') %>%
ggplot(aes(y=Score)) +
geom_boxplot() +
ggtitle("Distribution of variables") +
xlab("Variable") +
ylab("Score") +
theme_bw()+
theme(legend.position="none") +
facet_wrap(~Var, scales = "free")
RU1 %>%
pivot_longer(c(agea, hinctnta, health),
names_to = 'Var', values_to = 'Score') %>%
ggplot(aes(x=Score, fill=Var)) +
geom_histogram(aes(y=..density.., fill = Var), bins = 10) +
geom_density(alpha = .5, color="blue")+
ggtitle("Distribution of variables") +
xlab("Variable") +
ylab("Score") +
theme_bw()+
theme(legend.position="none") +
facet_wrap(~Var, scales = "free" )
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
RU1 %>%
ggplot(aes(x = isco08 )) +
geom_bar(fill = "lightblue", color = "blue") +
xlab("Category") +
ylab("Frequency") +
theme_bw()
RU1 %>%
ggplot(aes(x = edlvdru )) +
geom_bar(fill = "lightblue", color = "blue") +
xlab("Category") +
ylab("Frequency") +
theme_bw()
“gndr”, “marsts” , “domicil”
RU1 %>%
ggplot(aes(x = gndr )) +
geom_bar(fill = "lightblue", color = "blue") +
xlab("Category") +
ylab("Frequency") +
theme_bw()
RU1 %>%
ggplot(aes(x = domicil )) +
geom_bar(fill = "lightblue", color = "blue") +
xlab("Category") +
ylab("Frequency") +
theme_bw()
RU1 %>%
ggplot(aes(x=agea, y=health)) +
geom_point(size=2) +
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
RU1 %>%
ggplot(aes(x=hinctnta, y=health)) +
geom_point(size=2) +
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
chart.Correlation(RU1[,c('agea', 'hinctnta', 'health')],
histogram = TRUE,
method = "spearman") # Spearman's method
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение
## Warning in par(usr): argument 1 does not name a graphical parameter
model1= lm(health ~ hinctnta + agea + gndr + domicil, data = RU1)
sjPlot::tab_model(model1, show.std = T)
| health | |||||
|---|---|---|---|---|---|
| Predictors | Estimates | std. Beta | CI | standardized CI | p |
| (Intercept) | 3.99 | 0.10 | 3.82 – 4.16 | 0.01 – 0.19 | <0.001 |
| hinctnta | 0.03 | 0.11 | 0.02 – 0.04 | 0.06 – 0.16 | <0.001 |
| agea | -0.02 | -0.39 | -0.02 – -0.02 | -0.43 – -0.34 | <0.001 |
| gndr [2] | -0.11 | -0.15 | -0.18 – -0.04 | -0.23 – -0.06 | 0.001 |
| domicil [2] | -0.05 | -0.06 | -0.12 – 0.03 | -0.16 – 0.04 | 0.226 |
| domicil [3] | 0.05 | 0.06 | -0.04 – 0.13 | -0.05 – 0.17 | 0.301 |
| Observations | 1728 | ||||
| R2 / R2 adjusted | 0.212 / 0.210 | ||||
All variables except the region of residence are statistically significant for the model. Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. Gender has a statistically significant impact on human health, so the health value for women is less by 0.11 compared to men Income has a statistically significant effect on the dependent variable with an increase in income by one point, human health increases by 0.03 This model adjusted the R-squared by 0.2 then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model.
model2= lm(health ~ isco08 + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model2, show.std = T)
| health | |||||
|---|---|---|---|---|---|
| Predictors | Estimates | std. Beta | CI | standardized CI | p |
| (Intercept) | 4.31 | 0.17 | 4.18 – 4.44 | 0.04 – 0.31 | <0.001 |
| isco08 [2] | -0.01 | -0.01 | -0.13 – 0.11 | -0.16 – 0.14 | 0.905 |
| isco08 [3] | 0.10 | 0.12 | -0.05 – 0.24 | -0.06 – 0.31 | 0.189 |
| isco08 [4] | -0.01 | -0.01 | -0.11 – 0.10 | -0.15 – 0.13 | 0.887 |
| isco08 [5] | -0.02 | -0.03 | -0.15 – 0.11 | -0.19 – 0.14 | 0.765 |
| isco08 [6] | -0.11 | -0.14 | -0.25 – 0.03 | -0.32 – 0.04 | 0.128 |
| isco08 [7] | -0.09 | -0.11 | -0.22 – 0.05 | -0.28 – 0.07 | 0.219 |
| isco08 [8] | 0.11 | 0.14 | -0.05 – 0.27 | -0.07 – 0.35 | 0.186 |
| agea | -0.02 | -0.42 | -0.02 – -0.02 | -0.47 – -0.38 | <0.001 |
| gndr [2] | -0.16 | -0.21 | -0.24 – -0.08 | -0.31 – -0.11 | <0.001 |
| domicil [2] | -0.06 | -0.08 | -0.14 – 0.01 | -0.18 – 0.02 | 0.109 |
| domicil [3] | 0.01 | 0.01 | -0.08 – 0.10 | -0.10 – 0.12 | 0.862 |
| Observations | 1728 | ||||
| R2 / R2 adjusted | 0.208 / 0.203 | ||||
In this model, occupation and place of residence are statistically insignificant, the other variables do not change. Gender has a statistically significant impact on human health, so the health value for women is less by 0.16 compared to men Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. R adjusted is 0.2, then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model
model3= lm(health ~ edlvdru + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model3, show.std = T)
| health | |||||
|---|---|---|---|---|---|
| Predictors | Estimates | std. Beta | CI | standardized CI | p |
| (Intercept) | 4.18 | 0.03 | 4.05 – 4.32 | -0.09 – 0.16 | <0.001 |
| edlvdru [2] | 0.07 | 0.09 | -0.02 – 0.16 | -0.03 – 0.20 | 0.128 |
| edlvdru [3] | 0.15 | 0.19 | 0.05 – 0.24 | 0.06 – 0.31 | 0.003 |
| agea | -0.02 | -0.41 | -0.02 – -0.02 | -0.46 – -0.37 | <0.001 |
| gndr [2] | -0.15 | -0.19 | -0.21 – -0.08 | -0.28 – -0.10 | <0.001 |
| domicil [2] | -0.06 | -0.07 | -0.13 – 0.02 | -0.17 – 0.02 | 0.145 |
| domicil [3] | 0.03 | 0.03 | -0.06 – 0.11 | -0.08 – 0.15 | 0.569 |
| Observations | 1728 | ||||
| R2 / R2 adjusted | 0.207 / 0.204 | ||||
Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. In this model, a statistically significant variable is that which refers to the number of people with higher education, thus, when receiving higher education, the health indicator increases by 0.15. Gender has a statistically significant impact on human health, so the health value for women is less by 0.15 compared to men R-adjusted is 0.2, then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model
model4= lm(health ~ edlvdru+ isco08 +hinctnta + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model4, show.std = T)
| health | |||||
|---|---|---|---|---|---|
| Predictors | Estimates | std. Beta | CI | standardized CI | p |
| (Intercept) | 3.92 | -0.01 | 3.70 – 4.14 | -0.20 – 0.18 | <0.001 |
| edlvdru [2] | 0.04 | 0.05 | -0.05 – 0.13 | -0.06 – 0.17 | 0.375 |
| edlvdru [3] | 0.11 | 0.14 | -0.00 – 0.22 | -0.01 – 0.28 | 0.061 |
| isco08 [2] | 0.02 | 0.03 | -0.10 – 0.14 | -0.13 – 0.18 | 0.746 |
| isco08 [3] | 0.13 | 0.16 | -0.02 – 0.27 | -0.03 – 0.35 | 0.092 |
| isco08 [4] | 0.05 | 0.06 | -0.07 – 0.16 | -0.09 – 0.21 | 0.418 |
| isco08 [5] | 0.04 | 0.06 | -0.10 – 0.18 | -0.12 – 0.23 | 0.541 |
| isco08 [6] | -0.03 | -0.04 | -0.18 – 0.12 | -0.23 – 0.15 | 0.688 |
| isco08 [7] | 0.03 | 0.03 | -0.12 – 0.18 | -0.16 – 0.23 | 0.736 |
| isco08 [8] | 0.12 | 0.16 | -0.04 – 0.29 | -0.05 – 0.37 | 0.137 |
| hinctnta | 0.03 | 0.10 | 0.01 – 0.04 | 0.05 – 0.14 | <0.001 |
| agea | -0.02 | -0.38 | -0.02 – -0.02 | -0.43 – -0.33 | <0.001 |
| gndr [2] | -0.13 | -0.17 | -0.21 – -0.05 | -0.27 – -0.07 | 0.001 |
| domicil [2] | -0.03 | -0.04 | -0.11 – 0.04 | -0.14 – 0.05 | 0.381 |
| domicil [3] | 0.06 | 0.08 | -0.03 – 0.15 | -0.04 – 0.19 | 0.185 |
| Observations | 1728 | ||||
| R2 / R2 adjusted | 0.217 / 0.211 | ||||
In this model, education, place of residence and occupation are statistically insignificant Gender has a statistically significant impact on human health, so the health value for women is less by 0.13 compared to men Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. R-adjusted is 0.21, then we can say that this is not a very good model. 21% of the human health assessment can be explained by the model
# model comparison
anova(model1, model2, model3, model4)
## Analysis of Variance Table
##
## Model 1: health ~ hinctnta + agea + gndr + domicil
## Model 2: health ~ isco08 + agea + gndr + domicil
## Model 3: health ~ edlvdru + agea + gndr + domicil
## Model 4: health ~ edlvdru + isco08 + hinctnta + agea + gndr + domicil
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1722 823.24
## 2 1716 827.83 6 -4.5890
## 3 1721 828.72 -5 -0.8887 0.3722 0.867926
## 4 1713 817.97 8 10.7451 2.8128 0.004234 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4 the model is the best based on this test
Conclusion: none of these variables can fully explain inequality in the context of health