1.Please copy the following code to clean wave1 data of pairfam. How many variables are there in "wave1b"? How many observations are removed after the cleaning.
library(tidyverse) # see session 3.
library(haven) # Import data, see session 3.
library(janitor) #cleaning data
library(estimatr) #robust standard error OLS
library(texreg) #export regression result
library(ggplot2) # Allows us to create nice figures
library(oaxaca) # for Blinder-Oaxaca Decomposition
rm(list=ls())#clear all the dataset you created in your r project
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave1b <- wave1 %>%
transmute(
age,
isced,
sex=as_factor(sex_gen) %>% fct_drop(),
#treat sex as a categorical, and drop unused levels
nkids=case_when(nkids<0 ~ as.numeric(NA),
TRUE ~ as.numeric(nkids)),
#specify when nkids should be missing
sat6=case_when(sat6<0 ~ as.numeric(NA),
TRUE ~ as.numeric(sat6)),
#specify when sat should be considered missing
relstat=as_factor(relstat), #treat relationship status as categorical
relstat1=case_when(relstat=="-7 Incomplete data" ~ as.character(NA),
TRUE ~ as.character(relstat)) %>%
#specify when relstat1 should be missing
as_factor() %>% fct_drop(),
#make relstat1 as a factor again & drop unused levels
health=case_when(
hlt1<0 ~ NA,#specify when it should be missing
TRUE ~ hlt1)
) %>%
drop_na() #remove all observations that are missing
#after cleaning, in wave1b, there are 8 var;
#there are 6155 observations. 6201-6155=46 cases that are removed.
2.Focus on the variable "isced" in wave1b. The variable "isced" is not so familar to us. It refers to educational attainment. Now let us deal with this variable.Your taks: Please check the variable "isced". 1) However many categories does this variable have? How many cases are missing in this variable? 2) Remove individuals who are missing in isced.
tabyl(wave1b$isced)
# wave1b$isced n percent
# -7 16 0.0026
# 0 2353 0.3823
# 1 83 0.0135
# 2 230 0.0374
# 3 159 0.0258
# 4 1723 0.2799
# 5 70 0.0114
# 6 450 0.0731
# 7 1007 0.1636
# 8 64 0.0104
#there 10 categories
wave1b$isced %>% as_factor() %>% tabyl()
# . n percent
# -7 Incomplete data 16 0.0026
# -3 Does not apply 0 0.0000
# 0 currently enrolled 2353 0.3823
# 1 no degree (1b) 83 0.0135
# 2 lower secondary education (2b) 230 0.0374
# 3 lower secondary education (2a) 159 0.0258
# 4 upper secondary education vocational (3b) 1723 0.2799
# 5 upper secondary education general (3a) 70 0.0114
# 6 post-secondary non tertiary education general (4a) 450 0.0731
# 7 first stage of tertiary education (5) 1007 0.1636
# 8 second stage of tertiary education (6) 64 0.0104
#16 cases reporting "Incomplete data"
wave1b <- wave1b %>%
mutate(
isced_new1=case_when( #based on isced, I created a new var "isced_new1"
isced<0 ~ as.numeric(NA),
TRUE ~ isced)
) %>% drop_na(isced_new1) #dropping 6155-6139=16 cases
# 16 cases reporting missing, and noteworthy is that there're 2353 cases in the "current enrolled".
3.Please copy the following codes to create a variable named "marital". Then provide a frequency table for marital variable. How many respondents are divorced?
wave1b <- wave1b %>%
mutate(
marital=case_when(
relstat1 %in% c("1 Never married single",
"2 Never married LAT",
"3 Never married COHAB") ~ "Nevermarried",
# when relstat1 has any of the three situations, I assign "Nevermarried" to new variable "marital"
relstat1 %in% c("4 Married COHAB",
"5 Married noncohabiting") ~ 'Married',
# when relstat1 has any of the two situations, I assign "Married" to new variable "marital
relstat1 %in% c("6 Divorced/separated single",
"7 Divorced/separated LAT",
"8 Divorced/separated COHAB") ~ 'Divorced',
# when relstat1 has any of the three situations, I assign "Divorced" to new variable "marital"
relstat1 %in% c("9 Widowed single","10 Widowed LAT") ~ 'Widowed'
# when relstat1 has any of the two situations, I assign "Widow" to new variable "marital"
) %>% as_factor()
)
tabyl(wave1b$marital)
# wave1b$marital n percent
# Nevermarried 4103 0.66835
# Married 1749 0.28490
# Divorced 283 0.04610
# Widowed 4 0.00065
#there are 283 cases that are divorced
4.I want to compare the life satisfaction of married and divorced individuals.
wave1c <- wave1b %>%
filter(marital!= "Nevermarried" & marital!= "Widowed")
#wave1c is restricted to only married or divorced.
average <- wave1c%>%
group_by(marital) %>%
summarise(
ave_sat=mean(sat6),
ave_age=mean(age),
ave_nkids=mean(nkids),
ave_health=mean(health)
)
average
# # A tibble: 2 × 5
# marital ave_sat ave_age ave_nkids ave_health
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 Married 7.80 33.5 1.67 3.76
# 2 Divorced 6.67 34.3 1.41 3.53
#married group have higher satisfaction, more children, and better health.
wave1c$isced_new1 %>% tabyl() #27 cases reported 0
# . n percent
# 0 27 0.013
# 1 42 0.021
# 2 133 0.065
# 3 73 0.036
# 4 952 0.469
# 5 44 0.022
# 6 182 0.090
# 7 534 0.263
# 8 45 0.022
wave1c$isced_new1 %>% as_factor() %>% tabyl()
# . n percent
# -7 Incomplete data 0 0.000
# -3 Does not apply 0 0.000
# 0 currently enrolled 27 0.013
# 1 no degree (1b) 42 0.021
# 2 lower secondary education (2b) 133 0.065
# 3 lower secondary education (2a) 73 0.036
# 4 upper secondary education vocational (3b) 952 0.469
# 5 upper secondary education general (3a) 44 0.022
# 6 post-secondary non tertiary education general (4a) 182 0.090
# 7 first stage of tertiary education (5) 534 0.263
# 8 second stage of tertiary education (6) 45 0.022
#Note: those reported 0 meaning that they are still in education.
#Note: when isced_new1==7 or 8, meaning they are tertiary educated.
wave1c <- wave1c %>%
mutate(
tertedu=case_when( #create a tertedu to specify high education.
isced_new1%in%c(7,8) ~ 1, #1 means attained teritary education
isced_new1%in%c(0:6) ~ 0 #1 means attained below teritary education
)
)
tabyl(wave1c, tertedu, marital ) %>%
adorn_percentages("col")
# tertedu Nevermarried Married Divorced Widowed
# 0 NaN 0.7 0.81 NaN
# 1 NaN 0.3 0.19 NaN
# the married group have higher proportion of tertiary education.
5.Please run the OLS regression for the married and divorce group seperately. Use the life satisfaction as the outcome; use age, nkids, health, and tertedu as IVs. Please compare whether the effect of these variables on life satisfaction are different for the married and divorced groups
OLS_divorced <- lm_robust(formula = sat6 ~ age + nkids + health + tertedu, subset=(marital=="Divorced"), data = wave1c)
OLS_married <- lm_robust(formula = sat6 ~ age + nkids + health + tertedu, subset=(marital=="Married"), data = wave1c)
#subset=() here is to specify which sample for the OLS
screenreg(list(OLS_divorced,OLS_married),include.ci = FALSE, digits = 3,
custom.model.names = c("OLS for divorced", "OLS for married"),#rename the models
single.row =TRUE #to make the coef. and standard error in the same row
)
#
# ======================================================
# OLS for divorced OLS for married
# ------------------------------------------------------
# (Intercept) 5.818 (1.152) *** 5.892 (0.347) ***
# age -0.040 (0.032) -0.003 (0.010)
# nkids -0.052 (0.119) 0.044 (0.043)
# health 0.631 (0.134) *** 0.499 (0.044) ***
# tertedu 0.439 (0.278) 0.216 (0.082) **
# ------------------------------------------------------
# R^2 0.108 0.087
# Adj. R^2 0.096 0.085
# Num. obs. 283 1749
# RMSE 2.063 1.600
# ======================================================
# *** p < 0.001; ** p < 0.01; * p < 0.05
#you can see that the four factors have different effect on life satisfaction of the divorced and married groups
6.Why the life satisfaction of the divorced is lower than that of the married? Is it because different distribution of individuals in age, number of children, health or education attainment, or because these variables's effect on life satisfaction are different between the married and the divorced.
#first you have to specify the control and treated group
wave1c <- wave1c %>%
mutate(
marital_new1=case_when(
marital %in% c("Married") ~ TRUE, #oaxaca treat those TRUE as the control group,i.e. group B
marital %in% c("Divorced") ~ FALSE) #oaxaca treat those FALSE as the treated group, i.e. group A
)
result <- oaxaca(formula = sat6 ~ age + nkids + health + tertedu | marital_new1 , data = wave1c)
result$twofold$overall
# group.weight coef(explained) se(explained) coef(unexplained) se(unexplained) coef(unexplained A)
# [1,] 0.00 -0.16 0.044 -0.98 0.14 -9.8e-01
# [2,] 1.00 -0.22 0.081 -0.91 0.15 0.0e+00
# [3,] 0.50 -0.19 0.053 -0.94 0.14 -4.9e-01
# [4,] 0.14 -0.17 0.072 -0.97 0.15 -8.4e-01
# [5,] -1.00 -0.20 0.048 -0.93 0.14 -8.0e-01
# [6,] -2.00 -0.16 0.044 -0.97 0.14 4.7e-14
# se(unexplained A) coef(unexplained B) se(unexplained B)
# [1,] 1.4e-01 0.00 0.000
# [2,] 0.0e+00 -0.91 0.151
# [3,] 7.1e-02 -0.46 0.075
# [4,] 2.0e-02 -0.13 0.130
# [5,] 1.2e-01 -0.13 0.020
# [6,] 1.4e-14 -0.97 0.142
#finding: no matter which weight we are using, the unexplained is the major contributer.
#first you have to specify the control and treated group
result$n$n.A # the sample size of divorced
# [1] 283
result$n$n.B # the sample size of divorced
# [1] 1749
#In this way, better go with the pooled weighting. You can ask chatgpt to explain what weighting are good to use.
plot(result, decomposition = "twofold", type="overall", group.weight = -1, title = "Overall")