Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Cross-sectional data analysis II
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
library(tidyverse) #for recoding
library(haven) #for importing data
library(janitor) #for tabulation
library(estimatr) #robust standard error OLS
library(texreg) #export regression result
library(ggplot2) #for plotting 
library(oaxaca) #for Blinder-Oaxaca Decomposition
```
Prepare a dataset of age, sex,  relationship status, **family satisfaction**
- DV: satisfaction with family (sat1i4)
- IV: 
  - age(age), 
  - sex(sex_gen), 
  - relationship status(relstat), 
  - number of children (nkids), 
  - health (hlt1)
  - religion (sd30)

- [Click here and copy the codes to prepare the data](https://rpubs.com/fancycmn/1351702)

---
#Recapture exercise: the relationship between having a partner and family satisfaction
First, we can estimate the average family satisfaction across relationship status to have a first look
.pull-left[

```r
tabyl(wave1c,relstat_new2) 
```

```
##  relstat_new2    n   percent
##        single 2430 0.4162384
##     partnered 3408 0.5837616
##     separated    0 0.0000000
##       widowed    0 0.0000000
```

```r
#check the frequency of each category
```
]
.pull-right[

```
## # A tibble: 2 × 2
##   relstat_new2 mean_famsat
##   <fct>              <dbl>
## 1 single              8.50
## 2 partnered           8.79
```
]

---
#Recapture exercise: why those partnered are happier about family life than those singles

```r
wave1c$relstat_new2 <- fct_drop(wave1c$relstat_new2) #drop unused levels
compare<- wave1c %>% 
  group_by(relstat_new2) %>% 
  dplyr::summarise(
    mean_famsat=mean(famsat), #calculate the mean of famsat by relstat_new2
    mean_age=mean(age), #calculate the mean of age by relstat_new2
    mean_nkids=mean(nkids),#calculate the mean of nkids by relstat_new2
    mean_health=mean(health),#calculate the mean of health by relstat_new2
    prop_religion=sum(religion=="Yes")/n() #the no. of Yes/the total size of the group
          ) 
compare
```

```
## # A tibble: 2 × 6
##   relstat_new2 mean_famsat mean_age mean_nkids mean_health prop_religion
##   <fct>              <dbl>    <dbl>      <dbl>       <dbl>         <dbl>
## 1 single              8.50     20.3     0.0580        3.9          0.745
## 2 partnered           8.79     29.1     0.968         3.79         0.724
```

---
#Recapture exercise: why those partnered are happier about family life than those singles

```r
OLS_single <- lm_robust(formula = famsat ~ age + nkids + health + religion, subset=(relstat_new2=="single"), data = wave1c)
OLS_partner <- lm_robust(formula = famsat ~ age + nkids + health + religion, subset=(relstat_new2=="partnered"), data = wave1c)
#subset=() here is to specify which sample for the OLS

screenreg(list(OLS_single,OLS_partner),include.ci = FALSE, digits = 3, 
                  custom.model.names = c("OLS for single","OLS for partnered"),#rename the models
                  single.row =TRUE ) #to make the coef. and standard error in the same row
```

```
## 
## =======================================================
##              OLS for single        OLS for partnered   
## -------------------------------------------------------
## (Intercept)     8.238 (0.238) ***     7.811 (0.191) ***
## age            -0.048 (0.007) ***    -0.013 (0.005) ** 
## nkids          -0.491 (0.185) **      0.249 (0.031) ***
## health          0.265 (0.044) ***     0.250 (0.033) ***
## religionYes     0.294 (0.093) **      0.248 (0.068) ***
## -------------------------------------------------------
## R^2             0.075                 0.044            
## Adj. R^2        0.073                 0.043            
## Num. obs.    2430                  3408                
## RMSE            1.841                 1.673            
## =======================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

```r
#screen reg is to show the result in your screen in a formatted way
```

---
#Blinder-Oaxaca Decomposition
The difference between the average family life satisfaction between those singles and partnered can be described:
`$$\overline{y^a}-\overline{y^b}= 
\alpha^a - \alpha^b +
\beta_1^a*\overline{age_i^a}- \beta_1^b*\overline{age_i^b} +
...+
\beta_4^a*\overline{religion_i^a}-\beta_4^b*\overline{religion_i^b}$$`
some transformation
`$$\overline{y^a}-\overline{y^b}=$$` 
`$$\alpha^a - \alpha^b + \color{red}\beta_\color{red}1^\color{red}b*(\overline{age_i^a}-\overline{age_i^b})+ \overline{age_i^a}*(\beta_1^a-\beta_1^b)+ ...+ \color{red}\beta_\color{red}4^\color{red}b*(\overline{religion_i^a}-\overline{religion_i^b})+ \overline{religion_i^a}*(\beta_4^a-\beta_4^b)$$` 
*note: `$a$` represents "partnered group"; `$b$` represents "single group"
`$$\color{red}\beta_\color{red}1^\color{red}b,\color{red}\beta_\color{red}2^\color{red}b,\color{red}\beta_\color{red}3^\color{red}b,\color{red}\beta_\color{red}4^\color{red}b$$` are the reference regression coefficients.

---
#Blinder-Oaxaca Decomposition: coding
- first step: identify group, here is "relstat_new3"

```r
wave1c <- wave1c %>% 
  mutate(
    relstat_new3=case_when(
      relstat_new2 %in% c("single") ~ TRUE, #oaxaca treat those TRUE as the control group,i.e. group B
      relstat_new2 %in% c("partnered") ~ FALSE) #oaxaca treat those FALSE as the treated group, i.e. group A
         ) 
```

- second step: decompose. Remember, to use regression coefficients of Group B (control,i.e. single) as reference, i.e. `$\color{red}\beta^\color{red}b$`, we need to specific group.weights=0

```r
#oaxaca(#formula = y ~ x1 + x2.. | your group variable, data=your data)
result <- oaxaca(formula = famsat ~ age +  nkids + health + religion | relstat_new3 , group.weights = 0, data = wave1c)
```

```
## oaxaca: oaxaca() performing analysis. Please wait.
```

```
## 
## Bootstrapping standard errors:
```

```
## 1 / 100 (1%)
```

```
## 10 / 100 (10%)
```

```
## 20 / 100 (20%)
```

```
## 30 / 100 (30%)
```

```
## 40 / 100 (40%)
```

```
## 50 / 100 (50%)
```

```
## 60 / 100 (60%)
```

```
## 70 / 100 (70%)
```

```
## 80 / 100 (80%)
```

```
## 90 / 100 (90%)
```

```
## 100 / 100 (100%)
```

---
#Recapture exercise: why those partnered are happier about family life than those singles

- please look at the results in the exercise.pptx in the absalon
- discuss with your classmates what is primary explanation for the difference in family satisfaction between the partnered and the single
  - whether is mainly due to the difference in characteristics between the groups, such as age, health, no. of kids ect.
  - whether is mainly due to the difference in the regression coefficients between the groups
- how to interpret the decomposition result of age and no of kids in the final table and graph

---
#BO Decomposition: alternatively?
The difference between the average family life satisfaction between those singles and partnered can be described:
`$$\overline{y^a}-\overline{y^b}= 
\alpha^a - \alpha^b +
\beta_1^a*\overline{age_i^a}- \beta_1^b*\overline{age_i^b} +
...+
\beta_4^a*\overline{religion_i^a}-\beta_4^b*\overline{religion_i^b}$$`
some transformation
`$$\overline{y^a}-\overline{y^b}=$$` 
`$$\alpha^a - \alpha^b + \color{red}\beta_\color{red}1^\color{red}b*(\overline{age_i^a}-\overline{age_i^b})+ \overline{age_i^a}*(\beta_1^a-\beta_1^b)+ ...+ \color{red}\beta_\color{red}4^\color{red}b*(\overline{religion_i^a}-\overline{religion_i^b})+ \overline{religion_i^a}*(\beta_4^a-\beta_4^b)$$`

you can also transform alternatively:
`$$\overline{y^a}-\overline{y^b}=$$` 
`$$\alpha^a - \alpha^b +\color{red}\beta_\color{red}1^\color{red}a*(\overline{age_i^a}-\overline{age_i^b})+ \overline{age_i^b}*(\beta_1^a-\beta_1^b)+...+ \color{red}\beta_\color{red}4^\color{red}a*(\overline{religion_i^a}-\overline{religion_i^b})+ \overline{religion_i^b}*(\beta_4^a-\beta_4^b)$$`

*note: `$a$` represents "partnered group"; `$b$` represents "single group"

---
#BO Decomposition: alternatively?
`$$\small\overline{y^a}-\overline{y^b}=$$` 
`$$\scriptsize\alpha^a - \alpha^b + \dfrac{\color{red}(\color{red}\beta_\color{red}1^\color{red}a\color{red}+\color{red}\beta_\color{red}1^\color{red}b\color{red})}{\color{red}{2}}*(\overline{age_i^a}-\overline{age_i^b})+ \dfrac{(\overline{age_i^a}+\overline{age_i^b})}2*(\beta_1^a-\beta_1^b)+...+\dfrac{\color{red}(\color{red}\beta_\color{red}4^\color{red}a\color{red}+\color{red}\beta_\color{red}4^\color{red}b\color{red})}{\color{red}{2}}*(\overline{religion_i^a}-\overline{religion_i^b})+ \dfrac{(\overline{religion_i^a}+\overline{religion_i^b})}2*(\beta_4^a-\beta_4^b)$$`

*note: `$a$` represents "partnered"; `$b$` prepresents "single"

---
#Blinder-Oaxaca Decomposition: group A minus group B 
which should be use as the reference set of regression coefficients
- Group B's regression coefficients used as reference, i.e. `$\beta^R=\color{red}\beta^\color{red}b$`. **In oaxaca package, group.weights=0**

- Group A's regression coefficients used as reference, i.e. `$\beta^R=\color{red}\beta^\color{red}a$`. **In oaxaca package, group.weights=1**

- Equally weighted average (each 0.5) of Group A and B coefficients used as reference, as in Reimers (1983). That is, `$\beta^R=\color{red}0\color{red}.\color{red}5\color{red}\beta^\color{red}a\color{red}+\color{red}0\color{red}.\color{red}5\color{red}\beta^\color{red}a$` .**In oaxaca package, group.weights=0.5**

- an average of Group A and B regression coefficients weighted by the number of observations in Group A and B, following Cotton (1988). `$\color{red}{\beta^R = \dfrac{n_A \beta_A + n_B \beta_B}{n_A + n_B}}$`

- Coefficients from a pooled regression (that does not include the group indicator variable) used as reference, as suggested by Neumark (1988). **In oaxaca package, group.weights=-1**

- Coefficients from a pooled regression (that includes the group indicator) used as reference. See Jann (2008). **In oaxaca package, group.weights=-2**

- In oaxaca package, all the situations are estimated.

---
#Look at the results
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f2.png?raw=true" width="80%" style="display: block; margin: auto;" >

---
#Blinder-Oaxaca Decomposition: extract the results
how to read the result? you need to know what is [list in R](https://www.youtube.com/watch?v=X8lNTDeiKiE&t=171s)

```r
#I want to see the resutl that uses single's regression coefficient as the reference 
variable_ref_single <- result$twofold$variables[[1]]
overall_ref_single <- result$twofold$overall[1,]
result_ref_single<- rbind(variable_ref_single, overall_ref_single)[,c(1:5)] 
#rbind is to bind rows of "variable" and "overall" dataset. [,c(1:5)] after rbind is to select column 1 to 5.
result_ref_single
```

```
##                    group.weight coef(explained) se(explained) coef(unexplained)
## (Intercept)                   0     0.000000000   0.000000000       -0.42701487
## age                           0    -0.422009527   0.060452941        1.00010887
## nkids                         0    -0.446704537   0.173173148        0.71583557
## health                        0    -0.029670894   0.008244338       -0.05760230
## religionYes                   0    -0.006087292   0.004226827       -0.03354226
## overall_ref_single            0    -0.904472251   0.179096610        1.19778501
##                    se(unexplained)
## (Intercept)             0.30470685
## age                     0.24932510
## nkids                   0.18594266
## health                  0.21793099
## religionYes             0.08286957
## overall_ref_single      0.18615111
```
---
#Blinder-Oaxaca Decomposition: extract the results

```r
#I want to see the resutl that uses partner's regression coefficient as the reference 
variable_ref_partner <- result$twofold$variables[[2]]
overall_ref_partner <- result$twofold$overall[2,]
result_ref_partner<- rbind(variable_ref_partner, overall_ref_partner)[,c(1:5)] 
result_ref_partner
```

```
##                     group.weight coef(explained) se(explained)
## (Intercept)                    1     0.000000000   0.000000000
## age                            1    -0.117777571   0.044626121
## nkids                          1     0.226209515   0.029682518
## health                         1    -0.027970043   0.006794473
## religionYes                    1    -0.005129556   0.003505165
## overall_ref_partner            1     0.075332346   0.032690396
##                     coef(unexplained) se(unexplained)
## (Intercept)               -0.42701487      0.30470685
## age                        0.69587691      0.17227983
## nkids                      0.04292152      0.01045162
## health                    -0.05930315      0.22401872
## religionYes               -0.03449999      0.08510618
## overall_ref_partner        0.21798041      0.05453479
```
---
#Blinder-Oaxaca Decomposition: visualize results
.pull-left[

```r
plot(result, decomposition = "twofold", group.weight = 0, 
     type="overall", title = "Overall：ref=partner")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f3.png?raw=true" width="80%" style="display: block; margin: auto;" >

]
.pull-right[

```r
plot(result, decomposition = "twofold", group.weight = 1, 
     type="overall", title = "Overall：ref=partner")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f4.png?raw=true" width="80%" style="display: block; margin: auto;" >

]

---
#Blinder-Oaxaca Decomposition: visualize results
.pull-left[

```r
plot(result, decomposition = "twofold", group.weight =0, 
     type="variables", title="by variable：ref=single")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f5.png?raw=true" width="80%" style="display: block; margin: auto;" >
]

.pull-right[

```r
plot(result, decomposition = "twofold", group.weight =1, 
     type="variables", title="by variable：ref=partner")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f6.png?raw=true" width="80%" style="display: block; margin: auto;" >

]

---
#Blinder-Oaxaca Decomposition: a common practice in the literature
- First, there is no only one correct answer about which group's regression coefficient should be chosen as the reference regression coefficient.

- Second, most of the literature use the pooled regression coefficient in Neumark (1988) approach, to ensure the decomposition's results for the explained and unexplained parts are invariant to the choice of the reference group's coefficients in the overall decomposition. That is, **in oaxaca package, group.weights=-1**

---
#Blinder-Oaxaca Decomposition: results when pooled regression coefficents are used as the reference.

```r
#I want to see the resutl that uses partner's regression coefficient as the reference 
variable_ref_pooled <- result$twofold$variables[[5]]
overall_ref_pooled  <- result$twofold$overall[5,]
result_ref_pooled <- rbind(variable_ref_pooled , overall_ref_pooled)[,c(1:5)] 
result_ref_pooled 
```

```
##                    group.weight coef(explained) se(explained) coef(unexplained)
## (Intercept)                  -1     0.000000000   0.000000000      -0.427014872
## age                          -1    -0.198333845   0.031118577       0.776433183
## nkids                        -1     0.278374389   0.027315180      -0.009243358
## health                       -1    -0.029835503   0.006858157      -0.057437689
## religionYes                  -1    -0.006015705   0.003742019      -0.033613843
## overall_ref_pooled           -1     0.044189334   0.023755313       0.249123423
##                    se(unexplained)
## (Intercept)             0.30470685
## age                     0.20236911
## nkids                   0.02353413
## health                  0.22149570
## religionYes             0.08423947
## overall_ref_pooled      0.04044684
```
---
#Blinder-Oaxaca Decomposition: results when pooled regression coefficents are used as the reference.
.pull-left[

```r
plot(result, decomposition = "twofold", group.weight = -1, 
     type="overall", 
     title = "Overall：ref=pooled without group dummy")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f7.png?raw=true" width="80%" style="display: block; margin: auto;" >

]

.pull-right[

```r
plot(result, decomposition = "twofold", group.weight = -1, 
     type="variables", 
     title = "by variable：ref=pooled without group dummy")
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s6-f8.png?raw=true" width="70%" style="display: block; margin: auto;" >

]

---
#Take home
Two-fold Oaxaca decomposition
- extract the result
- how to choose regression coefficent as the reference 
- a common practice: choose the pooled regression as the reference coefficent

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1352157)