Exercise 6

1.Please copy the following code to clean wave1 data of pairfam. How many variables are there in "wave1b"? How many observations are removed after the cleaning.

library(tidyverse) # see session 3.
library(haven) # Import data, see session 3.
library(janitor) #cleaning data
library(estimatr) #robust standard error OLS
library(texreg) #export regression result
library(ggplot2) # Allows us to create nice figures
library(oaxaca) # for Blinder-Oaxaca Decomposition

rm(list=ls())#clear all the dataset you created in your r project

wave1 <- read_dta("anchor1_50percent_Eng.dta")

wave1b <- wave1 %>% 
  transmute(
           age,
           isced,
           sex=as_factor(sex_gen) %>% fct_drop(), 
           #treat sex as a categorical, and drop unused levels
           nkids=case_when(nkids<0 ~ as.numeric(NA),
                              TRUE ~ as.numeric(nkids)),
           #specify when nkids should be missing
           sat6=case_when(sat6<0 ~ as.numeric(NA), 
                          TRUE ~ as.numeric(sat6)), 
          #specify when sat should be considered missing
           relstat=as_factor(relstat), #treat relationship status as categorical
           relstat1=case_when(relstat=="-7 Incomplete data" ~ as.character(NA), 
                              TRUE ~ as.character(relstat)) %>%  
          #specify when relstat1 should be missing
             as_factor() %>% fct_drop(),
          #make relstat1 as a factor again & drop unused levels
          health=case_when(
                  hlt1<0 ~ NA,#specify when it should be missing
                  TRUE ~ hlt1)
           )  %>%  
  drop_na() #remove all observations that are missing

#after cleaning, in wave1b, there are 8 var; 
#there are 6155 observations. 6201-6155=46 cases that are removed.

2.Focus on the variable "isced" in wave1b. The variable "isced" is not so familar to us. It refers to educational attainment. Now let us deal with this variable.Your taks: Please check the variable "isced". 1) However many categories does this variable have? How many cases are missing in this variable? 2) Remove individuals who are missing in isced.

tabyl(wave1b$isced)
#  wave1b$isced    n percent
#            -7   16  0.0026
#             0 2353  0.3823
#             1   83  0.0135
#             2  230  0.0374
#             3  159  0.0258
#             4 1723  0.2799
#             5   70  0.0114
#             6  450  0.0731
#             7 1007  0.1636
#             8   64  0.0104
#there 10 categories

wave1b$isced %>% as_factor() %>% tabyl() 
#                                                     .    n percent
#                                    -7 Incomplete data   16  0.0026
#                                     -3 Does not apply    0  0.0000
#                                  0 currently enrolled 2353  0.3823
#                                      1 no degree (1b)   83  0.0135
#                      2 lower secondary education (2b)  230  0.0374
#                      3 lower secondary education (2a)  159  0.0258
#           4 upper secondary education vocational (3b) 1723  0.2799
#              5 upper secondary education general (3a)   70  0.0114
#  6 post-secondary non tertiary education general (4a)  450  0.0731
#               7 first stage of tertiary education (5) 1007  0.1636
#              8 second stage of tertiary education (6)   64  0.0104
#16 cases reporting "Incomplete data"

wave1b <- wave1b %>% 
  mutate(
    isced_new1=case_when( #based on isced, I created a new var "isced_new1"
      isced<0 ~ as.numeric(NA),
      TRUE ~ isced)
  ) %>% drop_na(isced_new1) #dropping 6155-6139=16 cases
# 16 cases reporting missing, and noteworthy is that there're 2353 cases in the "current enrolled".

3.Please copy the following codes to create a variable named "marital". Then provide a frequency table for marital variable. How many respondents are divorced?

wave1b <- wave1b %>% 
  mutate(
    marital=case_when(
      relstat1 %in% c("1 Never married single",
                     "2 Never married LAT",
                     "3 Never married COHAB") ~ "Nevermarried",
      # when relstat1 has any of the three situations, I assign "Nevermarried" to new variable "marital"
      relstat1 %in% c("4 Married COHAB",
                     "5 Married noncohabiting") ~ 'Married',
      # when relstat1 has any of the two situations, I assign "Married" to new variable "marital
      relstat1 %in% c("6 Divorced/separated single",
                     "7 Divorced/separated LAT",
                     "8 Divorced/separated COHAB") ~ 'Divorced',
      # when relstat1 has any of the three situations, I assign "Divorced" to new variable "marital"
      relstat1 %in% c("9 Widowed single","10 Widowed LAT") ~ 'Widowed'
      # when relstat1 has any of the two situations, I assign "Widow" to new variable "marital"
                     ) %>% as_factor()
        )

tabyl(wave1b$marital)
#  wave1b$marital    n percent
#    Nevermarried 4103 0.66835
#         Married 1749 0.28490
#        Divorced  283 0.04610
#         Widowed    4 0.00065
#there are 283 cases that are divorced

4.I want to compare the life satisfaction of married and divorced individuals.

Step 1: Please first remove those "Nevermarried" and "Widowed" from wave1b, and then estimate the average life satisfaction,average age, average number of children, average health conditions for the divorced and married groups. Which group have higher life satisfaction, more number of children, and better health?

wave1c  <-  wave1b %>% 
  filter(marital!= "Nevermarried" & marital!= "Widowed")
#wave1c is restricted to only married or divorced. 

average <-  wave1c%>% 
  group_by(marital) %>% 
  summarise(
    ave_sat=mean(sat6),
    ave_age=mean(age),
    ave_nkids=mean(nkids),
    ave_health=mean(health)
  )

average  
# # A tibble: 2 × 5
#   marital  ave_sat ave_age ave_nkids ave_health
#   <fct>      <dbl>   <dbl>     <dbl>      <dbl>
# 1 Married     7.80    33.5      1.67       3.76
# 2 Divorced    6.67    34.3      1.41       3.53
#married group have higher satisfaction, more children, and better health.

Step 2: which group, married or divorced have higher proportion of tertiary educated individuals

wave1c$isced_new1 %>% tabyl() #27 cases reported 0
#  .   n percent
#  0  27   0.013
#  1  42   0.021
#  2 133   0.065
#  3  73   0.036
#  4 952   0.469
#  5  44   0.022
#  6 182   0.090
#  7 534   0.263
#  8  45   0.022
wave1c$isced_new1 %>% as_factor() %>% tabyl()
#                                                     .   n percent
#                                    -7 Incomplete data   0   0.000
#                                     -3 Does not apply   0   0.000
#                                  0 currently enrolled  27   0.013
#                                      1 no degree (1b)  42   0.021
#                      2 lower secondary education (2b) 133   0.065
#                      3 lower secondary education (2a)  73   0.036
#           4 upper secondary education vocational (3b) 952   0.469
#              5 upper secondary education general (3a)  44   0.022
#  6 post-secondary non tertiary education general (4a) 182   0.090
#               7 first stage of tertiary education (5) 534   0.263
#              8 second stage of tertiary education (6)  45   0.022
#Note: those reported 0 meaning that they are still in education.
#Note: when isced_new1==7 or 8, meaning they are tertiary educated.

wave1c <- wave1c %>% 
  mutate(
    tertedu=case_when(          #create a tertedu to specify high education.
      isced_new1%in%c(7,8) ~ 1, #1 means attained teritary education
      isced_new1%in%c(0:6) ~ 0  #1 means attained below teritary education
    )
  )

tabyl(wave1c, tertedu, marital ) %>% 
  adorn_percentages("col")
#  tertedu Nevermarried Married Divorced Widowed
#        0          NaN     0.7     0.81     NaN
#        1          NaN     0.3     0.19     NaN
# the married group have higher proportion of tertiary education.

5.Please run the OLS regression for the married and divorce group seperately. Use the life satisfaction as the outcome; use age, nkids, health, and tertedu as IVs. Please compare whether the effect of these variables on life satisfaction are different for the married and divorced groups

OLS_divorced <- lm_robust(formula = sat6 ~ age + nkids + health + tertedu, subset=(marital=="Divorced"), data = wave1c)
OLS_married <- lm_robust(formula = sat6 ~ age + nkids + health + tertedu, subset=(marital=="Married"), data = wave1c)
#subset=() here is to specify which sample for the OLS
screenreg(list(OLS_divorced,OLS_married),include.ci = FALSE, digits = 3, 
                  custom.model.names = c("OLS for divorced", "OLS for married"),#rename the models
                  single.row =TRUE #to make the coef. and standard error in the same row
          )
# 
# ======================================================
#              OLS for divorced     OLS for married     
# ------------------------------------------------------
# (Intercept)    5.818 (1.152) ***     5.892 (0.347) ***
# age           -0.040 (0.032)        -0.003 (0.010)    
# nkids         -0.052 (0.119)         0.044 (0.043)    
# health         0.631 (0.134) ***     0.499 (0.044) ***
# tertedu        0.439 (0.278)         0.216 (0.082) ** 
# ------------------------------------------------------
# R^2            0.108                 0.087            
# Adj. R^2       0.096                 0.085            
# Num. obs.    283                  1749                
# RMSE           2.063                 1.600            
# ======================================================
# *** p < 0.001; ** p < 0.01; * p < 0.05
#you can see that the four factors have different effect on life satisfaction of the divorced and married groups

6.Why the life satisfaction of the divorced is lower than that of the married? Is it because different distribution of individuals in age, number of children, health or education attainment, or because these variables's effect on life satisfaction are different between the married and the divorced.

Step1: Please use the oaxaca decomposition to find out why the life satisfaction of the divorced is lower than that of the married. Do the difference mainly come from the explained part or from the unexplained part?

#first you have to specify the control and treated group
wave1c <- wave1c %>% 
  mutate(
    marital_new1=case_when(
      marital %in% c("Married") ~ TRUE, #oaxaca treat those TRUE as the control group,i.e. group B
      marital %in% c("Divorced") ~ FALSE) #oaxaca treat those FALSE as the treated group, i.e. group A
         )

result <- oaxaca(formula = sat6 ~ age +  nkids + health + tertedu | marital_new1 , data = wave1c)


result$twofold$overall
#      group.weight coef(explained) se(explained) coef(unexplained) se(unexplained) coef(unexplained A)
# [1,]         0.00           -0.16         0.044             -0.98            0.14            -9.8e-01
# [2,]         1.00           -0.22         0.081             -0.91            0.15             0.0e+00
# [3,]         0.50           -0.19         0.053             -0.94            0.14            -4.9e-01
# [4,]         0.14           -0.17         0.072             -0.97            0.15            -8.4e-01
# [5,]        -1.00           -0.20         0.048             -0.93            0.14            -8.0e-01
# [6,]        -2.00           -0.16         0.044             -0.97            0.14             4.7e-14
#      se(unexplained A) coef(unexplained B) se(unexplained B)
# [1,]           1.4e-01                0.00             0.000
# [2,]           0.0e+00               -0.91             0.151
# [3,]           7.1e-02               -0.46             0.075
# [4,]           2.0e-02               -0.13             0.130
# [5,]           1.2e-01               -0.13             0.020
# [6,]           1.4e-14               -0.97             0.142
#finding: no matter which weight we are using, the unexplained is the major contributer.

(optional)Step2: Which group.weight would you think it is good to use?

#first you have to specify the control and treated group
result$n$n.A # the sample size of divorced
# [1] 283
result$n$n.B # the sample size of divorced
# [1] 1749
#In this way, better go with the pooled weighting. You can ask chatgpt to explain what weighting are good to use.

(optional)Step3: plot the results where please show the overall explained and unexplained effect and use the group.weight=-1 (i.e. pooled weighting)?

plot(result, decomposition = "twofold", type="overall", group.weight = -1, title = "Overall")

Exercise 6

Mengni Chen

09-10-2024