Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## OLS
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
#install.packages("estimatr")
#install.packages("texreg")
# Add packages to library
library(tidyverse) # Add the tidyverse package to my current library.
library(haven) # Import data.
library(Hmisc) # Weighting
library(ggplot2) # Allows us to create nice figures.
library(estimatr) # Allows us to estimate (cluster-)robust standard errors.
library(texreg) # Allows us to make nicely-formatted Html & Latex regression tables.
```

---
#Prepare data
a dataset of age, sex,  relationship status, life satisfaction, weight
- DV: life satisfaction
- IV: age, sex, relationship status

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
# sample size =6201

#
wave1b <- wave1 %>% 
  transmute(
           age=zap_labels(age), #Independent variable
           
           sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat should be considered missing
                          TRUE ~ as.numeric(sat6)) %>% zap_label(), #remove labels of sat6
           
           cdweight=zap_label(cdweight), #cdweight is the variable telling the weight
           
           sex_gen=as_factor(sex_gen) %>% fct_drop(), #treat sex_gen as categorical, and drop unused level
           
           relstat=as_factor(relstat), #treat relationship status as categorical
           relstat=case_when(relstat=="-7 Incomplete data" ~ as.character(NA),
                             TRUE ~ as.character(relstat)) %>%  #specify when relstat should be considered missing
             as_factor() %>%  #make relstat as a factor
             fct_drop() #drop unused levels in relstat
) %>%   
  drop_na() #drop all observations with missing values in the sample
# sample size change from 6201 to 6162
```
---
#The relationship between marital status and life satisfaction
- Generate a new variable for marital status

```r
wave1c <- wave1b %>% 
  mutate(
    marital1=case_when(
      relstat %in% c("1 Never married single","2 Never married LAT","3 Never married COHAB") ~ "Nevermarried",
      # when relstat has any of the three situations, I assign "Nevermarried" to new variable "marital1"
      
      relstat %in% c("4 Married COHAB","5 Married noncohabiting") ~ 'Married',
      # when relstat has any of the two situations, I assign "Married" to new variable "marital1"
      
      relstat %in% c("6 Divorced/separated single","7 Divorced/separated LAT","8 Divorced/separated COHAB") ~ 'Divorced',
      # when relstat has any of the three situations, I assign "Divorced" to new variable "marital1"
      
      relstat %in% c("9 Widowed single","10 Widowed LAT") ~ 'Widow'
      # when relstat has any of the two situations, I assign "Widow" to new variable "marital1"
      
    ) %>% as_factor()# I treat marital1 as a categorical variable
  ) %>% filter(marital1!= "Widow") #I drop observations when he/she is widowed.
# sample size change to 6158 after dropping those widowed
```

---
#Averge life satisfaction across marital status
To study the relationship between marital status and life satisfaction, we can estimate the average sat6 across marital status to have a first look

```r
mar_sat<- wave1c %>% 
  group_by(marital1) %>% 
  dplyr::summarise(
    mean_sat6=mean(sat6)
    ) 
mar_sat
```

```
## # A tibble: 3 × 2
##   marital1     mean_sat6
##   <fct>            <dbl>
## 1 Nevermarried      7.60
## 2 Married           7.80
## 3 Divorced          6.68
```
---
#Averge life satisfaction across marital status, how to bring in confidence interval

The standard error tells you the standard deviation of your result (i.e., mean, percentage,β etc.), if we would conduct the same study over and over again.

[? difference between Standard deviation (SD) and Standard error (SE)](https://www.investopedia.com/ask/answers/042415/what-difference-between-standard-error-means-and-standard-deviation.asp)

Standard error of the mean：
`$\text{SE}(\bar{x}) = \frac{ \text{SD(x)}}{\sqrt{n}}$`

Standard error of a proportion:
`$\text{SE}(p) = \sqrt{\frac{p(1-p)}{n}}$`

The 95% Confidence Interval tries to capture the range in which the true value lies with an accuracy of 95%.
`$$\text{CI}_{95} = \bar{x} \pm \text{Critical Value} * \text{SE}(\bar{x})$$`
`$$\text{CI}_{95} = p \pm \text{Critical Value} * \text{SE}(p)$$`

To have critical values, you can use `qt(p=0.975, df=df)` where p=0.975 to specifiy 95%, df is degree of freedom, you can use sample size as its degree of freedom.

---
#Averge life satisfaction across marital status, how to bring in confidence interval

.pull-left[
When we don't consider weight

```r
mar_sat<- wave1c %>% 
  group_by(marital1) %>% 
  dplyr::summarise(
  n=n(), #sample size
  mean_sat6=mean(x = sat6),#mean of sat6
  sd_sat6=sd(x=sat6), #standard deviation 
  se_sat6=sd_sat6/sqrt(n), #standard error 
  sat6_lci=mean_sat6 - qt(p=0.975, df=n)*se_sat6, #low confidence interval
  sat6_hci=mean_sat6 + qt(p=0.975, df=n)*se_sat6 #high confidence interval
  ) %>% 
  select(marital1, mean_sat6, sat6_lci, sat6_hci)
#Note: qt(p=0.975, df=df) is to specify critial value at 95% Confidence interval
mar_sat
```

```
## # A tibble: 3 × 4
##   marital1     mean_sat6 sat6_lci sat6_hci
##   <fct>            <dbl>    <dbl>    <dbl>
## 1 Nevermarried      7.60     7.55     7.66
## 2 Married           7.80     7.73     7.88
## 3 Divorced          6.68     6.42     6.93
```
]
.pull-right[
When we consider weight

```r
mar_sat<- wave1c %>% 
  group_by(marital1) %>% 
  dplyr::summarise(
  wn=sum(cdweight), #weighted sample size
  m_wsat6=wtd.mean(x = sat6, weights = cdweight), #weighted mean
  var_wsat6=wtd.var(x = sat6, weights = cdweight), #weighted variance
  sd_wsat6=sqrt(var_wsat6), #standard deviation by take a square root of variance 
  se_wsat6=sd_wsat6/sqrt(wn), #standard error following the formular
  wsat6_lci=m_wsat6-qt(p=0.975, df=wn)*se_wsat6, #low confidence interval
  wsat6_hci=m_wsat6+qt(p=0.975, df=wn)*se_wsat6 #high confidence interval
  ) %>% select(marital1,m_wsat6,wsat6_lci,wsat6_hci )
mar_sat
```

```
## # A tibble: 3 × 4
##   marital1     m_wsat6 wsat6_lci wsat6_hci
##   <fct>          <dbl>     <dbl>     <dbl>
## 1 Nevermarried    7.51      7.46      7.57
## 2 Married         7.79      7.72      7.87
## 3 Divorced        6.50      6.22      6.78
```
]

---
#Avergea life satisfaction across marital status

.pull-left[
<img src="https://github.com/fancycmn/Slide5/blob/main/S5_Pic1.png?raw=true" width="90%" style="display: block; margin: auto;" >
]

.pull-right[

```r
ggplot(data = mar_sat, mapping = aes(y = m_wsat6, x = marital1)) +
  geom_bar(stat = "identity") +
  geom_errorbar(mapping = aes(ymin = wsat6_lci, ymax = wsat6_hci),
                width = 0.55) + # Add layer of errorbars.
  labs(y = "Life satisfaction",
       x = "Marital1")+
  scale_y_continuous(limits=c(0, 10),breaks=seq(0,10,2))
```
]

---
#Model the relationship between marital status and life satisfaction

Coming up with a model of life satisfaction: Can life satisfaction (i.e., `$y$`) be expressed as a function (i.e., `$f()$`) of another variable (i.e., `$x$`)?

`$$\hat{y} = f(x)$$`
The two critical questions are thus:

1. What is `$f()$`?
2. What is `$x$`?

---
# A linear model of `$y$`

x could be age, health conditions, marital status, income

---
# Ordinary Least Squares
.pull-left[
We find `$\alpha$` and `$\beta$` by choosing the one line that minimizes the *residual sum of squares*:

`$$\begin{align*}
\ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ 
\ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\
\ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2}
    \end{align*}$$`
    ]
    
.pull-right[
<img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;">

]

---
# The `lm()` funcion

```r
# Estimate a linear OLS model and assign the results to object ols.
ols1 <- lm(formula = sat6 ~ age, data = wave1c) 
# sat6 is DV, should be written on the left side of ~ 
# age is IV, should be written on the right side of ~
summary(ols1)
```

```
## 
## Call:
## lm(formula = sat6 ~ age, data = wave1c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8676 -0.8215  0.2016  1.1785  2.6630 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.213634   0.071954 114.151   <2e-16 ***
## age         -0.023069   0.002649  -8.708   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.737 on 6156 degrees of freedom
## Multiple R-squared:  0.01217,    Adjusted R-squared:  0.01201 
## F-statistic: 75.84 on 1 and 6156 DF,  p-value: < 2.2e-16
```

---
#Weighted linear OLS
- The weights argument allows you to use (post-stratification) weights.

```r
# Estimate a weighted linear OLS model.
ols2_weight <- lm(formula = sat6 ~ age, data = wave1c, weights = cdweight)#specify weight here
summary(ols2_weight)
```

```
## 
## Call:
## lm(formula = sat6 ~ age, data = wave1c, weights = cdweight)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7388  -0.6955   0.2364   1.1134   4.9252 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.227414   0.077310  106.42   <2e-16 ***
## age         -0.025337   0.002766   -9.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.787 on 6156 degrees of freedom
## Multiple R-squared:  0.01345,    Adjusted R-squared:  0.01329 
## F-statistic:  83.9 on 1 and 6156 DF,  p-value: < 2.2e-16
```
---
#Weighted linear OLS with `lm_robust`
- But beware, `lm()` inference (e.g., standard errors, t-values, p-values, and confidence intervals) will be wrong, because weights introduce [heteroscedasticity](.https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity)
- We thus need to estimate heteroscedasticity-robust standard errors, by using `estimatr::lm_robust()`.

```r
ols2_weight_robust <- lm_robust(formula = sat6 ~ age, data = wave1c, weights = cdweight)
summary(ols2_weight_robust)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age, data = wave1c, weights = cdweight)
## 
## Weighted, Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)  8.22741   0.076592 107.419 0.000e+00  8.07727  8.37756 6156
## age         -0.02534   0.003011  -8.415 4.833e-17 -0.03124 -0.01943 6156
## 
## Multiple R-squared:  0.01345 ,   Adjusted R-squared:  0.01329 
## F-statistic: 70.81 on 1 and 6156 DF,  p-value: < 2.2e-16
```

---
#Categorical predictors
Factors are automatically treated as dummies, with the first level taken as reference. Thus if you want to change the reference, you need to recode the levels of the factor using the `forcats` package.

```r
#as marital1 is a factor in wave1c, it is automatically treated as dummies.
ols3_cat <- lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight)
summary(ols3_cat)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight)
## 
## Weighted, Standard error type:  HC2 
## 
## Coefficients:
##                  Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)       8.54409   0.092680  92.189 0.000e+00  8.36241  8.72578 6154
## age              -0.04426   0.004283 -10.334 7.863e-25 -0.05266 -0.03587 6154
## marital1Married   0.73801   0.077592   9.511 2.635e-21  0.58590  0.89012 6154
## marital1Divorced -0.52564   0.168952  -3.111 1.872e-03 -0.85684 -0.19443 6154
## 
## Multiple R-squared:  0.04629 ,   Adjusted R-squared:  0.04583 
## F-statistic: 58.52 on 3 and 6154 DF,  p-value: < 2.2e-16
```
---
#Categorical predictors
recode the levels of marital1

```r
wave1c$marital1 <- fct_relevel(wave1c$marital1, "Divorced", "Nevermarried", "Married")
#use fct_level function to re-level marital1
ols3_cat_relevel <- lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight)
summary(ols3_cat_relevel)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight)
## 
## Weighted, Standard error type:  HC2 
## 
## Coefficients:
##                      Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper
## (Intercept)           8.01846   0.215844  37.149 9.858e-273  7.59532  8.44159
## age                  -0.04426   0.004283 -10.334  7.863e-25 -0.05266 -0.03587
## marital1Nevermarried  0.52564   0.168952   3.111  1.872e-03  0.19443  0.85684
## marital1Married       1.26364   0.161271   7.836  5.462e-15  0.94750  1.57979
##                        DF
## (Intercept)          6154
## age                  6154
## marital1Nevermarried 6154
## marital1Married      6154
## 
## Multiple R-squared:  0.04629 ,   Adjusted R-squared:  0.04583 
## F-statistic: 58.52 on 3 and 6154 DF,  p-value: < 2.2e-16
```
---

#Interaction effects
An example of interaction between marital status and sex.

```r
#as we re-level marital1 in last slides, we now re-level it back
wave1c$marital1 <- fct_relevel(wave1c$marital1,"Nevermarried",  "Married","Divorced") 
ols4_interactsex <- lm_robust(formula = sat6 ~ age +  marital1*sex_gen, data = wave1c, weights = cdweight)
# use * to specific the interaction
summary(ols4_interactsex)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age + marital1 * sex_gen, data = wave1c, 
##     weights = cdweight)
## 
## Weighted, Standard error type:  HC2 
## 
## Coefficients:
##                                  Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)                       8.51700   0.096867  87.9243 0.000e+00
## age                              -0.04414   0.004291 -10.2876 1.267e-24
## marital1Married                   0.77335   0.101589   7.6126 3.091e-14
## marital1Divorced                 -0.67740   0.292832  -2.3133 2.074e-02
## sex_gen2 Female                   0.05380   0.061767   0.8711 3.837e-01
## marital1Married:sex_gen2 Female  -0.07577   0.111197  -0.6814 4.956e-01
## marital1Divorced:sex_gen2 Female  0.24337   0.336766   0.7227 4.699e-01
##                                  CI Lower CI Upper   DF
## (Intercept)                       8.32710  8.70689 6151
## age                              -0.05255 -0.03573 6151
## marital1Married                   0.57420  0.97250 6151
## marital1Divorced                 -1.25146 -0.10335 6151
## sex_gen2 Female                  -0.06728  0.17489 6151
## marital1Married:sex_gen2 Female  -0.29376  0.14221 6151
## marital1Divorced:sex_gen2 Female -0.41681  0.90355 6151
## 
## Multiple R-squared:  0.04672 ,   Adjusted R-squared:  0.04579 
## F-statistic: 29.78 on 6 and 6151 DF,  p-value: < 2.2e-16
```

---
#Interaction effects
An example of interaction between age and sex

```r
ols5_interactage <- lm_robust(formula = sat6 ~ age*sex_gen +  marital1, data = wave1c, weights = cdweight)
summary(ols5_interactage)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ age * sex_gen + marital1, data = wave1c, 
##     weights = cdweight)
## 
## Weighted, Standard error type:  HC2 
## 
## Coefficients:
##                     Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper
## (Intercept)          8.71145   0.121206  71.873 0.000e+00  8.473839  8.94905
## age                 -0.05118   0.005264  -9.723 3.485e-22 -0.061501 -0.04086
## sex_gen2 Female     -0.36254   0.151283  -2.396 1.659e-02 -0.659106 -0.06597
## marital1Married      0.72461   0.078032   9.286 2.175e-20  0.571641  0.87758
## marital1Divorced    -0.54587   0.169630  -3.218 1.298e-03 -0.878403 -0.21333
## age:sex_gen2 Female  0.01517   0.005905   2.569 1.021e-02  0.003596  0.02675
##                       DF
## (Intercept)         6152
## age                 6152
## sex_gen2 Female     6152
## marital1Married     6152
## marital1Divorced    6152
## age:sex_gen2 Female 6152
## 
## Multiple R-squared:  0.04763 ,   Adjusted R-squared:  0.04686 
## F-statistic: 35.51 on 5 and 6152 DF,  p-value: < 2.2e-16
```

---

#Regression tables

Present sevearl regresion together; report standard errors, not confidence intervals.

```r
texreg::screenreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), 
                  include.ci = FALSE, digits = 3) # include.ci=FALSE means don't include CI, digit=3 to specify the digits of your result. 
```

```
## 
## ======================================================================================================
##                                   Model 1       Model 2       Model 3       Model 4       Model 5     
## ------------------------------------------------------------------------------------------------------
## (Intercept)                          8.214 ***     8.227 ***     8.544 ***     8.517 ***     8.711 ***
##                                     (0.072)       (0.077)       (0.093)       (0.097)       (0.121)   
## age                                 -0.023 ***    -0.025 ***    -0.044 ***    -0.044 ***    -0.051 ***
##                                     (0.003)       (0.003)       (0.004)       (0.004)       (0.005)   
## marital1Married                                                  0.738 ***     0.773 ***     0.725 ***
##                                                                 (0.078)       (0.102)       (0.078)   
## marital1Divorced                                                -0.526 **     -0.677 *      -0.546 ** 
##                                                                 (0.169)       (0.293)       (0.170)   
## sex_gen2 Female                                                                0.054        -0.363 *  
##                                                                               (0.062)       (0.151)   
## marital1Married:sex_gen2 Female                                               -0.076                  
##                                                                               (0.111)                 
## marital1Divorced:sex_gen2 Female                                               0.243                  
##                                                                               (0.337)                 
## age:sex_gen2 Female                                                                          0.015 *  
##                                                                                             (0.006)   
## ------------------------------------------------------------------------------------------------------
## R^2                                  0.012         0.013         0.046         0.047         0.048    
## Adj. R^2                             0.012         0.013         0.046         0.046         0.047    
## Num. obs.                         6158          6158          6158          6158          6158        
## RMSE                                               1.787         1.757         1.757         1.756    
## ======================================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```
---
#Export regression tables

```r
#Export to word
texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), 
        include.ci = FALSE, digits = 3,
        file = "MyOLSModels.doc") # Word table
```

```
## The table was written to the file 'MyOLSModels.doc'.
```

```r
#Export to html
texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), 
        include.ci = FALSE, 
        file = "MyOLSModels.html") #html
```

```
## The table was written to the file 'MyOLSModels.html'.
```

```r
#Export to excel
texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), 
        include.ci = FALSE, 
        file = "MyOLSModels.xls") #excel
```

```
## The table was written to the file 'MyOLSModels.xls'.
```
---
#Take home
1. How to calcuate confidence interval
2. Linear regression
    - With weight
    - Robust
    - interaction
3. Generate regression tables
4. Important code
    - `qt(p=0.975, df=n)`, to select critical values for 95% confidence interval
    - lm(formula = your DV ~ your IV, data = your dataset)
    - lm_robust(formula = your DV ~ your IV, data = your dataset, weights= your weight)
    - texreg::screenreg()

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/949336)