Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Cross-sectional data analysis I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
#install two new packages
install.packages("estimatr") #for robust standard errors OLS
install.packages("texreg") #output the regression results
```

```r
library(tidyverse) # see session 3.
library(haven) # Import data, see session 3.
library(janitor) #cleaning data
library(estimatr) # Allows us to estimate robust standard errors OLS.
library(texreg) # Allows us to make nicely-formatted Html & Latex regression tables.
library(ggplot2) # Allows us to create nice figures
```

---
#Prepare data
a dataset of age, sex,  relationship status, life satisfaction
- DV: life satisfaction
- IV: age, sex, relationship status

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
# sample size =6201
```

```r
#have a initial look at the variables
wave1$sat6 %>% as_factor() %>% tabyl()
# 5 cases with "no answer" or "Don't know"

wave1$sex_gen %>% as_factor() %>% tabyl()
# no missing

wave1$relstat %>% as_factor() %>% tabyl()
# 34 cases with "incomplete data"
```

---
#Prepare data
a dataset of age, sex,  relationship status, life satisfaction
- DV: life satisfaction
- IV: age, sex, relationship status

```r
wave1b <- wave1 %>% 
  transmute(
    age,
    
    sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat should be considered missing
                   TRUE ~ as.numeric(sat6)),
    
    sex_gen=as_factor(sex_gen) %>% fct_drop(), #treat sex_gen as categorical, and drop unused level
    
    relstat=as_factor(relstat), #treat relationship status as categorical
    
    relstat_new1=case_when(
      relstat=="-7 Incomplete data" ~ as.character(NA),#specify when it should be missing
      TRUE ~ as.character(relstat)
                         ) %>% as_factor() %>% fct_drop() 
      #make relstat as a factor, and then drop unused levels in relstat_new1
            ) %>%   
  drop_na() #drop all observations with missing values in the sample
# sample size change from 6201 to 6162
```
---
#Prepare data
a dataset of age, sex,  relationship status, life satisfaction
- DV: life satisfaction
- IV: age, sex, relationship status

```r
tabyl(wave1b,relstat_new1)
```

```
##                 relstat_new1    n      percent
##       1 Never married single 2446 0.3969490425
##              4 Married COHAB 1734 0.2814021422
##          2 Never married LAT 1012 0.1642323921
##        3 Never married COHAB  659 0.1069457968
##   8 Divorced/separated COHAB   76 0.0123336579
##             9 Widowed single    3 0.0004868549
##  6 Divorced/separated single  145 0.0235313210
##     7 Divorced/separated LAT   63 0.0102239533
##      5 Married noncohabiting   23 0.0037325544
##               10 Widowed LAT    1 0.0001622850
```
---
#The relationship between having a partner and life satisfaction
- Generate a new variable for relationship status

```r
wave1c <- wave1b %>% 
  mutate(
    relstat_new2=case_when(
      relstat_new1 %in% c("1 Never married single") ~ "single",
      #treat 'never married single' as 'single'
      relstat_new1 %in% c("2 Never married LAT", 
                     "3 Never married COHAB",
                     "4 Married COHAB",
                     "5 Married noncohabiting") ~ 'partnered',
      #treat the 4 situations as "partnered"
      relstat_new1 %in% c("6 Divorced/separated single",
                     "7 Divorced/separated LAT",
                     "8 Divorced/separated COHAB") ~ 'separated',
      #treat the 3 situations as "separated"
      relstat_new1 %in% c("9 Widowed single",
                     "10 Widowed LAT") ~ 'widowed'
      #treat the 2 situations as "widowed"
                          ) %>% as_factor()# make relstat_new2 as factor
         ) %>% 
  filter(relstat_new2!= "widowed" &  relstat_new2!= "separated") #only 4 widowed and 284 separated, dropping.
# sample size change to 5874 after dropping widowed and separated
```

---
#Exploring the relationship between having a partner and life satisfaction
First, we can estimate the average sat6 across relationship status to have a first look
.pull-left[

```r
tabyl(wave1c,relstat_new2) 
```

```
##  relstat_new2    n   percent
##        single 2446 0.4164113
##     partnered 3428 0.5835887
##     separated    0 0.0000000
##       widowed    0 0.0000000
```

```r
#check the frequency of each category
```
]
.pull-right[

```r
wave1c$relstat_new2 <- fct_drop(wave1c$relstat_new2) #drop unused levels
rel_sat<- wave1c %>% 
  group_by(relstat_new2) %>% 
  dplyr::summarise(
    mean_sat6=mean(sat6), #calculate the mean of sat6 by marital1
          ) 
rel_sat
```

```
## # A tibble: 2 × 2
##   relstat_new2 mean_sat6
##   <fct>            <dbl>
## 1 single            7.50
## 2 partnered         7.78
```
]

---
#Model the relationship between relationship status and life satisfaction

Coming up with a model of life satisfaction: Can life satisfaction (i.e., `$y$`) be expressed as a function (i.e., `$f()$`) of another variable (i.e., `$x$`)?

`$$\hat{y} = f(x)$$`
The two critical questions are thus:

1. What is `$f()$`?
2. What is `$x$`?

---
# A linear model of `$y$`

x could be age, health conditions, relationship status, income

---
# Ordinary Least Squares
.pull-left[
We find `$\alpha$` and `$\beta$` by choosing the one line that minimizes the *residual sum of squares*:

`$$\begin{align*}
\ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ 
\ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\
\ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2}
    \end{align*}$$`
    ]
    
.pull-right[
<img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;">

]
---
#Assumptions of OLS
`$$y_i= \alpha+\beta*x_i  + e_i$$`
--

- About **the sample**
  - Assumption 1: The sample taken for the linear regression model must be drawn randomly from the population.
  
--

- About **the relationship between Y and X**
  - Assumption 2: The dependent variable (Y) is a linear function of independent variables (X) and the error term.
  
--

- About **the independent variables - X**
  - Assumption 3: There is no multi-collinearity (or perfect collinearity). In other words, there should be no linear relationship between the independent variables.
  
--

- About **the error term**
  - Assumption 4: The distribution of error terms has zero mean and doesn’t depend on the independent variables(Xs). There must be no relationship between the Xs and the error term.
  -  Assumption 5: There is homoscedasticity and no autocorrelation.
  -  Assumption 6: Error terms should be normally distributed.

---
# The `lm()` funcion

```r
# Estimate a linear OLS model and assign the results to object ols.
ols1 <- lm(formula = sat6 ~ age, data = wave1c) 
# sat6 is DV, should be written on the left side of ~ 
# age is IV, should be written on the right side of ~
summary(ols1)
```

```
## 
## Call:
## lm(formula = sat6 ~ age, data = wave1c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8540 -0.6709  0.3291  1.1827  2.5671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.128590   0.071795 113.220  < 2e-16 ***
## age         -0.018309   0.002684  -6.822 9.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.705 on 5872 degrees of freedom
## Multiple R-squared:  0.007863,   Adjusted R-squared:  0.007694 
## F-statistic: 46.54 on 1 and 5872 DF,  p-value: 9.88e-12
```

---
#The `lm_robust` function
- OLS based on a cross-sectional dataset often has issue of [heteroscedasticity](https://rpubs.com/cyobero/187387). Therefore,inference from the standard errors, t-values, p-values, and confidence intervals is likely to  be wrong.
- We thus need to estimate robust standard errors OLS, by using `estimatr::lm_robust()`.

```r
ols2_robust <- lm_robust(formula = sat6 ~ age, data = wave1c)
summary(ols2_robust)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)  8.12859    0.06827 119.063 0.000e+00  7.99475  8.26243 5872
## age         -0.01831    0.00263  -6.962 3.715e-12 -0.02346 -0.01315 5872
## 
## Multiple R-squared:  0.007863 ,  Adjusted R-squared:  0.007694 
## F-statistic: 48.47 on 1 and 5872 DF,  p-value: 3.715e-12
```

---
#Categorical predictors
Factors are automatically treated as dummies, with the first level taken as reference. Thus if you want to change the reference, you need to recode the levels of the factor using the `forcats` package.

```r
#as relstat_new2 is a factor in wave1c, it is automatically treated as dummies.
ols3_cat <- lm_robust(formula = sat6 ~ age +relstat_new2, data = wave1c)
summary(ols3_cat)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                       Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper
## (Intercept)            8.25606   0.069014  119.63 0.000e+00  8.12077  8.39135
## age                   -0.03715   0.003176  -11.70 2.982e-31 -0.04338 -0.03092
## relstat_new2partnered  0.60272   0.054838   10.99 7.879e-28  0.49522  0.71022
##                         DF
## (Intercept)           5871
## age                   5871
## relstat_new2partnered 5871
## 
## Multiple R-squared:  0.02967 ,   Adjusted R-squared:  0.02934 
## F-statistic: 82.07 on 2 and 5871 DF,  p-value: < 2.2e-16
```

---
#Categorical predictors
.pull-left[

```r
#Export to html
texreg::htmlreg(list(ols3_cat), 
        include.ci = FALSE, 
        file = "OLS3_cat(2024).html") #html
```

```
## The table was written to the file 'OLS3_cat(2024).html'.
```
]
.pull-right[
<img src="https://github.com/fancycmn/24-Session-5/blob/main/Figure1.JPG?raw=true" style="display: block; margin: auto;" >
]

- Intercept: when a person is aged 0 and single, his/her life satisfaction  is 8.26 on average.
- Relationship status: when age is controlled, a person who is partnered has a higher life satisfaction than a person who is single, by the amount of 0.602 on average.
- Age: when relationship status is controlled, as the age increases by 1 year, the life satisfaction decreases by 0.037 on average.
---
#Categorical predictors: change the reference level
recode the levels of relstat_new2

```r
wave1c$relstat_new2 <- fct_relevel(wave1c$relstat_new2, "partnered", "single")
#use fct_level function to re-level relstat_new2, now "partnered" is the 1st level and will be treated as the reference category
ols3_cat_relevel <- lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c)
summary(ols3_cat_relevel)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                    Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)         8.85878   0.095240   93.02 0.000e+00  8.67207  9.04549 5871
## age                -0.03715   0.003176  -11.70 2.982e-31 -0.04338 -0.03092 5871
## relstat_new2single -0.60272   0.054838  -10.99 7.879e-28 -0.71022 -0.49522 5871
## 
## Multiple R-squared:  0.02967 ,   Adjusted R-squared:  0.02934 
## F-statistic: 82.07 on 2 and 5871 DF,  p-value: < 2.2e-16
```

---

#Regression tables

Present sevearl regresion together; report standard errors, not confidence intervals.

```r
texreg::screenreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), 
                  include.ci = FALSE, digits = 4) # include.ci=FALSE means don't include CI, digit=4 to specify the digits of your result. 
```

```
## 
## =================================================================================
##                        Model 1        Model 2        Model 3        Model 4      
## ---------------------------------------------------------------------------------
## (Intercept)               8.1286 ***     8.1286 ***     8.2561 ***     8.8588 ***
##                          (0.0718)       (0.0683)       (0.0690)       (0.0952)   
## age                      -0.0183 ***    -0.0183 ***    -0.0371 ***    -0.0371 ***
##                          (0.0027)       (0.0026)       (0.0032)       (0.0032)   
## relstat_new2partnered                                   0.6027 ***               
##                                                        (0.0548)                  
## relstat_new2single                                                    -0.6027 ***
##                                                                       (0.0548)   
## ---------------------------------------------------------------------------------
## R^2                       0.0079         0.0079         0.0297         0.0297    
## Adj. R^2                  0.0077         0.0077         0.0293         0.0293    
## Num. obs.              5874           5874           5874           5874         
## RMSE                                     1.7053         1.6866         1.6866    
## =================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

---
#Export regression tables

```r
#Export to word
texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), 
        include.ci = FALSE, digits = 4,
        file = "MyOLSModels(2024).doc") # Word table
```

```
## The table was written to the file 'MyOLSModels(2024).doc'.
```

```r
#Export to html
texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), 
        include.ci = FALSE, 
        file = "MyOLSModels(2024).html") #html
```

```
## The table was written to the file 'MyOLSModels(2024).html'.
```

```r
#Export to excel
texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), 
        include.ci = FALSE, 
        file = "MyOLSModels(2024).xls") #excel
```

```
## The table was written to the file 'MyOLSModels(2024).xls'.
```
---
#Interaction effect
Baseline model: IV=age, gender, relationship status

```r
wave1c$relstat_new2 <- fct_relevel(wave1c$relstat_new2, "single")
ols4 <- lm_robust(formula = sat6 ~ age +relstat_new2+sex_gen, data = wave1c)
summary(ols4)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + relstat_new2 + sex_gen, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                       Estimate Std. Error  t value  Pr(>|t|) CI Lower CI Upper
## (Intercept)            8.24343    0.07218 114.2062 0.000e+00  8.10193  8.38493
## age                   -0.03708    0.00318 -11.6585 4.557e-31 -0.04331 -0.03084
## relstat_new2partnered  0.59868    0.05533  10.8199 4.980e-27  0.49021  0.70715
## sex_gen2 Female        0.02607    0.04444   0.5866 5.575e-01 -0.06104  0.11318
##                         DF
## (Intercept)           5870
## age                   5870
## relstat_new2partnered 5870
## sex_gen2 Female       5870
## 
## Multiple R-squared:  0.02972 ,   Adjusted R-squared:  0.02923 
## F-statistic: 54.86 on 3 and 5870 DF,  p-value: < 2.2e-16
```
---
#Interaction effect x1*x2
The effect of relationship status (a categorical var) on life satisfaction differs by gender

```r
ols4_inter1 <- lm_robust(formula = sat6 ~ age +relstat_new2*sex_gen, data = wave1c)#relstat_new2*sex_gen is the interaction term
summary(ols4_inter1)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + relstat_new2 * sex_gen, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                                         Estimate Std. Error    t value
## (Intercept)                            8.2541902   0.075707 109.027431
## age                                   -0.0370647   0.003181 -11.651397
## relstat_new2partnered                  0.5774176   0.069724   8.281511
## sex_gen2 Female                        0.0003469   0.071477   0.004853
## relstat_new2partnered:sex_gen2 Female  0.0439392   0.091082   0.482416
##                                        Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)                           0.000e+00   8.1058  8.40260 5869
## age                                   4.947e-31  -0.0433 -0.03083 5869
## relstat_new2partnered                 1.492e-16   0.4407  0.71410 5869
## sex_gen2 Female                       9.961e-01  -0.1398  0.14047 5869
## relstat_new2partnered:sex_gen2 Female 6.295e-01  -0.1346  0.22249 5869
## 
## Multiple R-squared:  0.02976 ,   Adjusted R-squared:  0.0291 
## F-statistic: 41.14 on 4 and 5869 DF,  p-value: < 2.2e-16
```
---
#Interaction effect x1*x2
The effect of age (a continous var) on life satisfaction differs by gender

```r
ols4_inter2 <- lm_robust(formula = sat6 ~ age*sex_gen +relstat_new2, data = wave1c)#age*sex_gen is the interaction term
summary(ols4_inter2)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age * sex_gen + relstat_new2, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                       Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper
## (Intercept)            8.49053   0.096717  87.787 0.000e+00  8.300929  8.68013
## age                   -0.04707   0.004187 -11.241 5.050e-29 -0.055273 -0.03886
## sex_gen2 Female       -0.46112   0.136611  -3.375 7.417e-04 -0.728930 -0.19332
## relstat_new2partnered  0.60666   0.055323  10.966 1.034e-27  0.498210  0.71512
## age:sex_gen2 Female    0.01913   0.005219   3.665 2.496e-04  0.008895  0.02936
##                         DF
## (Intercept)           5869
## age                   5869
## sex_gen2 Female       5869
## relstat_new2partnered 5869
## age:sex_gen2 Female   5869
## 
## Multiple R-squared:  0.03186 ,   Adjusted R-squared:  0.0312 
## F-statistic: 44.26 on 4 and 5869 DF,  p-value: < 2.2e-16
```

---
#Interaction effect: export results

```r
texreg::screenreg(list(ols4,ols4_inter1,ols4_inter2), 
                  include.ci = FALSE, digits = 4, single.row=TRUE,
                  custom.coef.names =c(
                    "Intercept",
                    "Age",
                    "Partnered(ref.=Single)",
                    "Female(ref.=male)",
                    "Partnered x Female",
                    "Age x Female"      )#you can create understandable names for the coef.
                  ) #single.row=TRUE is to place standard error the same line. 
```

```
## 
## ==============================================================================================
##                         Model 1                 Model 2                 Model 3               
## ----------------------------------------------------------------------------------------------
## Intercept                  8.2434 (0.0722) ***     8.2542 (0.0757) ***     8.4905 (0.0967) ***
## Age                       -0.0371 (0.0032) ***    -0.0371 (0.0032) ***    -0.0471 (0.0042) ***
## Partnered(ref.=Single)     0.5987 (0.0553) ***     0.5774 (0.0697) ***     0.6067 (0.0553) ***
## Female(ref.=male)          0.0261 (0.0444)         0.0003 (0.0715)        -0.4611 (0.1366) ***
## Partnered x Female                                 0.0439 (0.0911)                            
## Age x Female                                                               0.0191 (0.0052) ***
## ----------------------------------------------------------------------------------------------
## R^2                        0.0297                  0.0298                  0.0319             
## Adj. R^2                   0.0292                  0.0291                  0.0312             
## Num. obs.               5874                    5874                    5874                  
## RMSE                       1.6867                  1.6868                  1.6849             
## ==============================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```
---
#Interaction effect: interpretation

<font size="2">`$$Model1: sat6=8.2434 -0.0371age +0.5987partnered + 0.0261female$$`
`$$Model2: sat6=8.2542 -0.0371age +0.5774partnered + 0.0003female + 0.0439partnered*female$$`
`$$Model3: sat6=8.4905 -0.0471age +0.6067partnered - 0.4611 female + 0.0191age*female$$`
<img src="https://github.com/fancycmn/24-Session-5/blob/main/Figure2a.JPG?raw=true" style="display: block; margin: auto;" >

---
#Take home
1. Exploratory analysis: frequency, mean
2. Linear regression
    - Robust standard errors OLS
    - Interction terms
3. Generate regression tables
4. Important code
    - lm(formula = your DV ~ your IV, data = your dataset)
    - lm_robust(formula = your DV ~ your IV, data = your dataset)
    - lm_robust(formula = your DV ~ IV1 + IV2*IV3, data = your dataset)
    - texreg::screenreg()

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1224073)