For this assignment, use the dataset of the European Social Survey (ESS) 6th round (2012)

https://ess-search.nsd.no/en/study/7ccf7f30-fd1a-470a-9b90-4c91b0bc7438 (You’ll need to create an account to do that.) Use the Russian data.

Perform linear regression to test which of the basic SES components (income, education, occupation) would better explain health inequalities in Russia. Dependent variable is Subjective general health (C7 in the ESS dataset). Control variables: age, gender, marital status, place of residence. To run 4 regressions is recommended. Each of 3 components separately and 1 regression with all three of them. You may use ISCO-08 for occupation categories, make sure that each category is not less than 10 percent of the sample.
Report the estimated results and whether they have a statistically signicant influence on the subjective health condition. Which model fits best?
Plot the results. UPD: if you haven’t study vizualization yet, just do the table.
Download files on the Assignment section

df <- import("/Users/DP/Downloads/ESS6e02_5.csv")

RU <- filter(df, cntry == "RU")

RU1 <- select(RU, c("isco08","hinctnta", "edlvdru", "health", "agea", "gndr", "marsts" , "domicil"))
RU1 <- RU1 %>% 
  mutate(isco08=ifelse(isco08 %in% c(66666, 77777, 99999), NA, isco08), 
         hinctnta =ifelse(hinctnta %in% c( 77, 88, 99), NA, hinctnta), 
         edlvdru=ifelse(edlvdru %in% c( 5555, 7777, 8888, 9999), NA, edlvdru),
         health=ifelse(health %in% c( 7, 8, 9), NA, health),
         agea =ifelse(agea %in% c( 999), NA, agea),
         gndr=ifelse(gndr %in% c( 9), NA, gndr),
         marsts=ifelse(marsts %in% c( 66, 77, 88, 99), NA, marsts) ,
         domicil=ifelse(domicil %in% c( 7, 8, 9), NA, domicil))
skim(RU1)

Data summary
Name	RU1
Number of rows	2484
Number of columns	8
_______________________
Column type frequency:
numeric	8
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
isco08	330	0.87	5099.21	2399.36	110	3132.25	5221	7212	9629	▁▇▇▃▅
hinctnta	534	0.79	6.11	2.79	1	4.00	6	9	10	▃▆▆▆▇
edlvdru	0	1.00	6.91	2.47	1	5.00	7	10	11	▂▃▇▁▆
health	18	0.99	2.76	0.81	1	2.00	3	3	5	▁▅▇▂▁
agea	3	1.00	46.00	18.10	15	30.00	45	60	90	▇▇▇▆▂
gndr	0	1.00	1.62	0.49	1	1.00	2	2	2	▅▁▁▁▇
marsts	1077	0.57	5.03	1.04	1	4.00	5	6	6	▁▁▆▅▇
domicil	3	1.00	2.37	1.22	1	1.00	3	3	5	▇▁▇▅▁

#basic SES components

isco08 - What is/was the name or title of your main job? In your main job, what kind of work do/did you do most of the time? What training or qualifications are/were needed for the job?

hinctnta - Using this card, please tell me which letter describes your household’s total income, after tax and compulsory deductions, from all sources? If you don’t know the exact figure, please give an estimate. Use the part of the card that you know best: weekly, monthly or annual income.

1 - J Less than 6000 rubles (or less than 72 th rubles a year) 2 - R 6’001-9’000 rubles (72-108 th rubles a year) 3 - C 9’001-12’000 rubles (108-144 th rubles a year) 4 - M 12’001-15’000 rubles (144-180 th rubles a year) 5 - F 15’001-18’000 rubles (180-216 th rubles a year) 6 - S 18’001-21’000 rubles (216-252 th rubles a year) 7 - K 21’001-25’000 rubles (252-300 th rubles a year) 8 - P 25’001-30’000 rubles (300-360 th rubles a year) 9 - D 30’001-40’000 rubles (360-480 th rubles a year) 10 - H More than 40’001 rubles (more than 480 th rubles a year)

edlvdru - What is the highest level of education you have successfully completed? (Russian Federation)

#Dependent variable

health - How is your health in general? Would you say it is …

#Control variables:

agea - Age of respondent, calculated

gndr - CODE SEX, respondent

1 Male 2 Female

marsts - This question is about your legal marital status not about who you may or may not be living with. Which one of the descriptions on this card describes your legal marital status now?

1 Legally married 2 In a legally registered civil union 3 Legally separated 4 Legally divorced/Civil union dissolved 5 Widowed/Civil partner died 6 None of these (NEVER married or in legally registered civil union)

domicil - Which phrase on this card best describes the area where you live? 1 A big city 2 Suburbs or outskirts of big city 3 Town or small city 4 Country village 5 Farm or home in countryside

RU1$isco08 <- ifelse(RU1$isco08 <= 310 & RU1$isco08 >= 0, 1, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 1439 & RU1$isco08 >= 1000, 2, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 2659 & RU1$isco08 >= 2000, 3, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 3522 & RU1$isco08 >= 3000, 4, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 4419 & RU1$isco08 >= 4000, 5, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 5419 & RU1$isco08 >= 5000, 6, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 6340 & RU1$isco08 >= 6000, 7, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 7549 & RU1$isco08 >= 7000, 8, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 8350 & RU1$isco08 >= 8000, 9, RU1$isco08)
RU1$isco08 <- ifelse(RU1$isco08 <= 9629 & RU1$isco08 >= 9000, 10, RU1$isco08)

1 - Military 2 - Managers 3 - Professionals 4 - Technicians and associate professionals 5 - Clerical support workers 6 - Service and sales workers 7 - Skilled agricultural, forestry and fishery workers 8 - Craft and related trades workers 9 - Plant and machine operators, and assemblers 10 - Elementary occupations

RU1 %>%  
  group_by(isco08) %>% 
  count()

## # A tibble: 11 × 2
## # Groups:   isco08 [11]
##    isco08     n
##     <dbl> <int>
##  1      1    13
##  2      2    90
##  3      3   375
##  4      4   312
##  5      5   161
##  6      6   506
##  7      7    33
##  8      8   263
##  9      9   211
## 10     10   190
## 11     NA   330

RU1 <- RU1 %>%  
  mutate(isco08= car::recode(isco08, recodes="c(1,2,7) = 8 ; 
  3 = 1; 4=2; 5=3; 6=4; 8=5; 9=6; 10=7 ", as.factor=TRUE))

RU1 %>%  
  group_by(edlvdru) %>% 
  count()

## # A tibble: 11 × 2
## # Groups:   edlvdru [11]
##    edlvdru     n
##      <int> <int>
##  1       1     7
##  2       2    75
##  3       3   169
##  4       4   350
##  5       5    61
##  6       6   224
##  7       7   806
##  8       8    60
##  9       9     9
## 10      10   714
## 11      11     9

RU1 <- RU1 %>%  
  mutate(edlvdru= car::recode(edlvdru, recodes="c(2,3,4) = 1; 
  c(5,6,7) = 2;
  c(8,9,10,11) = 3", as.factor=TRUE))

1 - school ed. 2 - college 3 - high ed.

RU1 %>%  
  group_by(health) %>% 
  count()

## # A tibble: 6 × 2
## # Groups:   health [6]
##   health     n
##    <int> <int>
## 1      1   128
## 2      2   748
## 3      3  1225
## 4      4   321
## 5      5    44
## 6     NA    18

RU1 <- RU1 %>%  
  mutate(health= car::recode(health, recodes="1=5;
                              2=4; 
                              3=3;
                              4=2;
                              2=1"))

RU1 %>%  
  group_by(edlvdru) %>% 
  count()

## # A tibble: 3 × 2
## # Groups:   edlvdru [3]
##   edlvdru     n
##   <fct>   <int>
## 1 1         601
## 2 2        1091
## 3 3         792

RU1$gndr <- as.factor(RU1$gndr)
RU1$marsts <- as.factor(RU1$marsts)

note: In the process of processing the data, I excluded the marital status, since there was a large number of missing values in it, and there was also a problem in a small amount, for example, of data in which respondents said they were married (33), which could lead to errors

RU1 %>%  
  group_by(marsts) %>% 
  count()

## # A tibble: 5 × 2
## # Groups:   marsts [5]
##   marsts     n
##   <fct>  <int>
## 1 1         33
## 2 4        410
## 3 5        377
## 4 6        587
## 5 <NA>    1077

RU1 %>%  
  group_by(domicil) %>% 
  count()

## # A tibble: 6 × 2
## # Groups:   domicil [6]
##   domicil     n
##     <int> <int>
## 1       1   999
## 2       2    94
## 3       3   866
## 4       4   509
## 5       5    13
## 6      NA     3

RU1 <- RU1 %>%  
  mutate(domicil= car::recode(domicil, recodes="c(1,2)=1;
                              3=2; 
                              c(4,5)=3", as.factor=TRUE))

1- big city 2- town 3 - rural

RU2 <- drop_na(RU1)

RU1$marsts <- NULL

RU1 <- drop_na(RU1)

EDA

descriptive statistics

RU1 %>% 
sjmisc::descr(show = c('n', "mean","sd", "md", "range")) %>% 
  rename("variable" = "var",
         "Number of obs." = "n",
         "Mean" = "mean",
         "SD" = "sd",
         "Median" = "md",
         "Range" = "range")

## 
## ## Basic descriptive statistics
## 
##  variable Number of obs.      Mean         SD Median      Range
##    isco08           1728  3.973380  2.1075285      4    7 (1-8)
##  hinctnta           1728  6.184028  2.7879590      6   9 (1-10)
##   edlvdru           1728  2.119213  0.7136043      2    2 (1-3)
##    health           1728  3.263310  0.7779510      3    3 (2-5)
##      agea           1728 48.129630 17.2997847     48 74 (15-89)
##      gndr           1728  1.630787  0.4827314      2    1 (1-2)
##   domicil           1728  1.781829  0.7792731      2    2 (1-3)

RU2 %>% 
sjmisc::descr(show = c('n', "mean","sd", "md", "range")) %>% 
  rename("variable" = "var",
         "Number of obs." = "n",
         "Mean" = "mean",
         "SD" = "sd",
         "Median" = "md",
         "Range" = "range")

## 
## ## Basic descriptive statistics
## 
##  variable Number of obs.      Mean         SD Median      Range
##    isco08            960  3.934375  2.1246028      4    7 (1-8)
##  hinctnta            960  5.355208  2.8526325      5   9 (1-10)
##   edlvdru            960  2.094792  0.7380426      2    2 (1-3)
##    health            960  3.228125  0.8073183      3    3 (2-5)
##      agea            960 48.767708 19.1645419     48 74 (15-89)
##      gndr            960  1.735417  0.4413410      2    1 (1-2)
##    marsts            960  4.901042  1.0311109      5    5 (1-6)
##   domicil            960  1.735417  0.7687834      2    2 (1-3)

agea, hinctnta, health

RU1 %>% 
  pivot_longer(c(agea, hinctnta, health),
               names_to = 'Var', values_to = 'Score') %>% 
  ggplot(aes(y=Score)) + 
  geom_boxplot() +
  ggtitle("Distribution of variables") +
  xlab("Variable") + 
  ylab("Score") +
  theme_bw()+
  theme(legend.position="none") +
  facet_wrap(~Var, scales = "free")

RU1 %>% 
  pivot_longer(c(agea, hinctnta, health),
               names_to = 'Var', values_to = 'Score') %>% 
  ggplot(aes(x=Score, fill=Var)) + 
  geom_histogram(aes(y=..density.., fill = Var), bins = 10) +
  geom_density(alpha = .5, color="blue")+
  ggtitle("Distribution of variables") +
  xlab("Variable") + 
  ylab("Score") +
  theme_bw()+
  theme(legend.position="none") +
  facet_wrap(~Var, scales = "free" )

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

RU1 %>% 
  ggplot(aes(x = isco08 )) +
  geom_bar(fill = "lightblue", color = "blue") +
  xlab("Category") +
  ylab("Frequency") +
  theme_bw()

RU1 %>% 
  ggplot(aes(x = edlvdru )) +
  geom_bar(fill = "lightblue", color = "blue") +
  xlab("Category") +
  ylab("Frequency") +
  theme_bw()

“gndr”, “marsts” , “domicil”

RU1 %>% 
  ggplot(aes(x = gndr )) +
  geom_bar(fill = "lightblue", color = "blue") +
  xlab("Category") +
  ylab("Frequency") +
  theme_bw()

RU1 %>% 
  ggplot(aes(x = domicil )) +
  geom_bar(fill = "lightblue", color = "blue") +
  xlab("Category") +
  ylab("Frequency") +
  theme_bw()

RU1 %>% 
  ggplot(aes(x=agea, y=health)) +
  geom_point(size=2) +
  geom_smooth(method=lm)

## `geom_smooth()` using formula = 'y ~ x'

RU1 %>% 
  ggplot(aes(x=hinctnta, y=health)) +
  geom_point(size=2) +
  geom_smooth(method=lm)

## `geom_smooth()` using formula = 'y ~ x'

chart.Correlation(RU1[,c('agea', 'hinctnta', 'health')],
                  histogram = TRUE,
                  method = "spearman") # Spearman's method

## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Есть совпадающие значения: не могу высчитать точное p-значение

## Warning in par(usr): argument 1 does not name a graphical parameter

model1= lm(health ~ hinctnta + agea + gndr + domicil, data = RU1)
sjPlot::tab_model(model1, show.std = T)

	health
Predictors	Estimates	std. Beta	CI	standardized CI	p
(Intercept)	3.99	0.10	3.82 – 4.16	0.01 – 0.19	<0.001
hinctnta	0.03	0.11	0.02 – 0.04	0.06 – 0.16	<0.001
agea	-0.02	-0.39	-0.02 – -0.02	-0.43 – -0.34	<0.001
gndr [2]	-0.11	-0.15	-0.18 – -0.04	-0.23 – -0.06	0.001
domicil [2]	-0.05	-0.06	-0.12 – 0.03	-0.16 – 0.04	0.226
domicil [3]	0.05	0.06	-0.04 – 0.13	-0.05 – 0.17	0.301
Observations	1728
R² / R² adjusted	0.212 / 0.210

All variables except the region of residence are statistically significant for the model. Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. Gender has a statistically significant impact on human health, so the health value for women is less by 0.11 compared to men Income has a statistically significant effect on the dependent variable with an increase in income by one point, human health increases by 0.03 This model adjusted the R-squared by 0.2 then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model.

model2= lm(health ~ isco08 + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model2, show.std = T)

	health
Predictors	Estimates	std. Beta	CI	standardized CI	p
(Intercept)	4.31	0.17	4.18 – 4.44	0.04 – 0.31	<0.001
isco08 [2]	-0.01	-0.01	-0.13 – 0.11	-0.16 – 0.14	0.905
isco08 [3]	0.10	0.12	-0.05 – 0.24	-0.06 – 0.31	0.189
isco08 [4]	-0.01	-0.01	-0.11 – 0.10	-0.15 – 0.13	0.887
isco08 [5]	-0.02	-0.03	-0.15 – 0.11	-0.19 – 0.14	0.765
isco08 [6]	-0.11	-0.14	-0.25 – 0.03	-0.32 – 0.04	0.128
isco08 [7]	-0.09	-0.11	-0.22 – 0.05	-0.28 – 0.07	0.219
isco08 [8]	0.11	0.14	-0.05 – 0.27	-0.07 – 0.35	0.186
agea	-0.02	-0.42	-0.02 – -0.02	-0.47 – -0.38	<0.001
gndr [2]	-0.16	-0.21	-0.24 – -0.08	-0.31 – -0.11	<0.001
domicil [2]	-0.06	-0.08	-0.14 – 0.01	-0.18 – 0.02	0.109
domicil [3]	0.01	0.01	-0.08 – 0.10	-0.10 – 0.12	0.862
Observations	1728
R² / R² adjusted	0.208 / 0.203

In this model, occupation and place of residence are statistically insignificant, the other variables do not change. Gender has a statistically significant impact on human health, so the health value for women is less by 0.16 compared to men Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. R adjusted is 0.2, then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model

model3= lm(health ~ edlvdru + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model3, show.std = T)

	health
Predictors	Estimates	std. Beta	CI	standardized CI	p
(Intercept)	4.18	0.03	4.05 – 4.32	-0.09 – 0.16	<0.001
edlvdru [2]	0.07	0.09	-0.02 – 0.16	-0.03 – 0.20	0.128
edlvdru [3]	0.15	0.19	0.05 – 0.24	0.06 – 0.31	0.003
agea	-0.02	-0.41	-0.02 – -0.02	-0.46 – -0.37	<0.001
gndr [2]	-0.15	-0.19	-0.21 – -0.08	-0.28 – -0.10	<0.001
domicil [2]	-0.06	-0.07	-0.13 – 0.02	-0.17 – 0.02	0.145
domicil [3]	0.03	0.03	-0.06 – 0.11	-0.08 – 0.15	0.569
Observations	1728
R² / R² adjusted	0.207 / 0.204

Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. In this model, a statistically significant variable is that which refers to the number of people with higher education, thus, when receiving higher education, the health indicator increases by 0.15. Gender has a statistically significant impact on human health, so the health value for women is less by 0.15 compared to men R-adjusted is 0.2, then we can say that this is not a very good model. 20% of the human health assessment can be explained by the model

model4= lm(health ~ edlvdru+ isco08 +hinctnta + agea + gndr + domicil , data = RU1)
sjPlot::tab_model(model4, show.std = T)

	health
Predictors	Estimates	std. Beta	CI	standardized CI	p
(Intercept)	3.92	-0.01	3.70 – 4.14	-0.20 – 0.18	<0.001
edlvdru [2]	0.04	0.05	-0.05 – 0.13	-0.06 – 0.17	0.375
edlvdru [3]	0.11	0.14	-0.00 – 0.22	-0.01 – 0.28	0.061
isco08 [2]	0.02	0.03	-0.10 – 0.14	-0.13 – 0.18	0.746
isco08 [3]	0.13	0.16	-0.02 – 0.27	-0.03 – 0.35	0.092
isco08 [4]	0.05	0.06	-0.07 – 0.16	-0.09 – 0.21	0.418
isco08 [5]	0.04	0.06	-0.10 – 0.18	-0.12 – 0.23	0.541
isco08 [6]	-0.03	-0.04	-0.18 – 0.12	-0.23 – 0.15	0.688
isco08 [7]	0.03	0.03	-0.12 – 0.18	-0.16 – 0.23	0.736
isco08 [8]	0.12	0.16	-0.04 – 0.29	-0.05 – 0.37	0.137
hinctnta	0.03	0.10	0.01 – 0.04	0.05 – 0.14	<0.001
agea	-0.02	-0.38	-0.02 – -0.02	-0.43 – -0.33	<0.001
gndr [2]	-0.13	-0.17	-0.21 – -0.05	-0.27 – -0.07	0.001
domicil [2]	-0.03	-0.04	-0.11 – 0.04	-0.14 – 0.05	0.381
domicil [3]	0.06	0.08	-0.03 – 0.15	-0.04 – 0.19	0.185
Observations	1728
R² / R² adjusted	0.217 / 0.211

In this model, education, place of residence and occupation are statistically insignificant Gender has a statistically significant impact on human health, so the health value for women is less by 0.13 compared to men Age has a statistically significant effect on the dependent variable with each year of life, human health decreases by 0.02. R-adjusted is 0.21, then we can say that this is not a very good model. 21% of the human health assessment can be explained by the model

# model comparison
anova(model1, model2, model3, model4)

## Analysis of Variance Table
## 
## Model 1: health ~ hinctnta + agea + gndr + domicil
## Model 2: health ~ isco08 + agea + gndr + domicil
## Model 3: health ~ edlvdru + agea + gndr + domicil
## Model 4: health ~ edlvdru + isco08 + hinctnta + agea + gndr + domicil
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1   1722 823.24                                
## 2   1716 827.83  6   -4.5890                   
## 3   1721 828.72 -5   -0.8887 0.3722 0.867926   
## 4   1713 817.97  8   10.7451 2.8128 0.004234 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4 the model is the best based on this test

Conclusion: none of these variables can fully explain inequality in the context of health

Untitled

Diana Piskareva

2023-05-29

EDA

descriptive statistics