Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Cross-sectional data analysis I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Package for today

```r
#install four new packages
install.packages("ggplot2") #for visualzation
install.packages("estimatr") #estimate robust standard errors OLS.
install.packages("texreg") #make nicely-formatted Html & Latex regression tables.
```

```r
library(haven) #importing dataset of dta format 
library(tidyverse) #for data cleaning
library(janitor) #for data cleaning
library(ggplot2) #for visualzation
library(estimatr) #for robust standard errors OLS.
library(texreg) #for making nicely-formatted Html & Latex regression tables.
```
---
#Outline 
- descriptive statistics by group
  - mean
  - visualization
- OLS
- OLS with interaction

---
#Are individuals who have a partner happier?
- compare life satisfaction of individuals who have a partner with those who don't
  - individuals who don't have a partner including being single, being divorced/separated, widowed

- what one could do is to calculate the mean of life satisfaction across different partnership status
---
#Are individuals who have a partner happier?

- Prepare a dataset

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave1a <- wave1 %>% 
  transmute(
    id, 
    age, 
    gender=as_factor(sex_gen) %>% fct_drop(),#var for gender & drop unused levels
    cohort=as_factor(cohort)%>% fct_drop(),#var for cohort
    relstat=as_factor(relstat)%>% fct_drop(), #var for relationship status
    sat6#var for life satisfaction
            )
```

```r
tabyl(wave1a,gender)#no missing
tabyl(wave1a,cohort) #no missing
tabyl(wave1a,relstat) #34 missing
tabyl(wave1a,sat6) #5 missing
```

---
#Are individuals who have a partner happier?
- Prepare a dataset

```r
wave1b <- wave1a %>% 
  mutate(sat6=case_when(sat6<0 ~ NA,
                        TRUE ~ as.numeric(sat6)),
         relstat=case_when(
           relstat=="-7 Incomplete data" ~ NA,
           TRUE ~ as.character(relstat)) %>% as_factor()#make it back to factor 
      ) %>% 
  drop_na() #drop all missings
#now wave1b is clean without missing
# N changed from 6201 to 6162.
```

Note: `drop_na()` is a function to drop missing.
- if you just type `drop_na()`, R drops all cases that have missing info
- if you type `drop_na(sat6)`, R drops cases that have missing info of sat6.
---
#Are individuals who have a partner happier?

.pull-left[

```r
tabyl(wave1b,relstat)
```

```
##                      relstat    n      percent
##       1 Never married single 2446 0.3969490425
##              4 Married COHAB 1734 0.2814021422
##          2 Never married LAT 1012 0.1642323921
##        3 Never married COHAB  659 0.1069457968
##   8 Divorced/separated COHAB   76 0.0123336579
##             9 Widowed single    3 0.0004868549
##  6 Divorced/separated single  145 0.0235313210
##     7 Divorced/separated LAT   63 0.0102239533
##      5 Married noncohabiting   23 0.0037325544
##               10 Widowed LAT    1 0.0001622850
```
too many categories of partnership status
]

.pull-right[
- we can reclassify relstat to reduce categories

```r
wave1b <- wave1b %>%
 mutate(
 relstat_new=case_when(
 relstat %in% c("1 Never married single") ~ "single",
 #treat 'never married single' as 'single'
 relstat %in% c("2 Never married LAT", 
                "3 Never married COHAB",
                "4 Married COHAB",
                "5 Married noncohabiting") ~ 'partnered',
 #treat the 4 situations as "partnered"
 relstat %in% c("6 Divorced/separated single",
                "7 Divorced/separated LAT",
                "8 Divorced/separated COHAB") ~ 'separated',
 #treat the 3 situations as "separated"
 relstat %in% c("9 Widowed single",
                "10 Widowed LAT") ~ 'widowed'
 #treat the 2 situations as "widowed"
                      ) %>% 
   as_factor()# make relstat_new as factor
         )
```
you can have your own way of classification as long as it make sense
]
---
#Are individuals who have a partner happier?

```r
tabyl(wave1b,relstat_new)
```

```
##  relstat_new    n      percent
##       single 2446 0.3969490425
##    partnered 3428 0.5563128854
##    separated  284 0.0460889322
##      widowed    4 0.0006491399
```

To make a simple example, let us just compare partnered and single

```r
wave1c <- wave1b %>% 
  filter(relstat_new=='single'| relstat_new=='partnered')  #only keep single and partnered
wave1c$relstat_new <- wave1c$relstat_new %>% fct_drop() #drop unused level "widowed" and "separated"
# N changed from 6162 to 5874.
```
---
#Sumarize mean of life satisfaction by partnership status
`summarise()` allows you to calculate all kinds of statistics on the level of the groups you have specified.
.pull-left[

```r
wave1c %>% 
  filter(relstat_new== "single")%>% #get a data of singles
  summarise(sat_mean=mean(sat6)) #calculate the mean

wave1c %>% 
  filter(relstat_new== "partnered")%>% #a data of partnered
  summarise(sat_mean=mean(sat6))
```
a bit too repetitive coding!
]

.pull-right[

```
## # A tibble: 1 × 1
##   sat_mean
##      <dbl>
## 1     7.50
```

```
## # A tibble: 1 × 1
##   sat_mean
##      <dbl>
## 1     7.78
```
]

---
#or you can use grouping operations
`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1c %>% 
  group_by(relstat_new) %>% #make wave1c grouped by parntership status
  summarise(mean_sat=mean(sat6),#calculate mean of sat6
            sd_sat=sd(sat6)) #calculate standard deviation of sat6
```

```
## # A tibble: 2 × 3
##   relstat_new mean_sat sd_sat
##   <fct>          <dbl>  <dbl>
## 1 single          7.50   1.80
## 2 partnered       7.78   1.64
```

```r
#or you can save result to a dataset named meantable
meantable <- wave1c %>% 
  group_by(relstat_new) %>% 
  summarise(mean_sat=mean(sat6),
            sd_sat=sd(sat6))
```
---
#Are individuals who have a partner happier?
- we can check the detailed distribution of life satisfaction across partnership status

```r
distribute_table<- tabyl(wave1c,sat6, relstat_new) %>%
    adorn_totals("row") %>%#add row %
    adorn_percentages("col") %>% #add col
    adorn_pct_formatting()  #format the percentage
distribute_table
```

```
##   sat6 single partnered
##      0   0.4%      0.3%
##      1   0.4%      0.1%
##      2   0.8%      0.5%
##      3   2.0%      1.3%
##      4   2.6%      1.4%
##      5   6.3%      5.8%
##      6   9.6%      7.0%
##      7  19.1%     19.0%
##      8  29.5%     31.6%
##      9  18.7%     19.2%
##     10  10.5%     13.7%
##  Total 100.0%    100.0%
```
---
#Are individuals who have a partner happier?
- One can also plot the distribution of life satisfaction by partnership status: use ggplot2 t

```r
ggplot(data = , mapping = aes(x=, y=) +   # specify dataset, x, and y to ggplot
         <GEOM_FUNCTION>()+ # specify types of your chart, e.g. bar, point, line chart
         <COORDINATE_FUNCTION> # Change the default coordinate system, swap x and y axis
         #note: + is the symbol to connect different section of code
```
- ggplot2 contains many geom functions, which put layers of different types of geometric objects (e.g., points, bars, lines) over a coordinate system.
  - All geom functions depend on the mapping argument. It is paired with aes(), which stands for "aesthetic". Aesthetics are the visual properties of your plot.
       
  - The most important aesthetics of any graph are the y-axis and the x-axis. 
       
  - But of course, aesthetics also means, among others, color, shape, size, and so on.
       
---
 #Use ggplot2 to plot descriptive statistics

```r
       ggplot(data = wave1c) ## Create an empty coordinate system for the dataset "wave1bc".
```
<img src="https://github.com/fancycmn/slide6/blob/main/S6_Pic3.png?raw=true" width="60%" style="display: block; margin: auto;" >
         
---
#Use ggplot2 to plot descriptive statistics
.pull-left[

```r
figure1<- ggplot(data = wave1c,
                 mapping=aes(x = as.factor(sat6)))+
#as.factor(sat6) to make it x axis show 0 to 10, optional code
          geom_bar()#create a bar chart
figure1
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f1.png?raw=true" width="100%" style="display: block; margin: auto;" >
] 
.pull-right[

```r
figure2<- ggplot(data = wave1c,
                 mapping=aes(x = as.factor(sat6))
                 )+
          geom_bar()+
          facet_wrap(~ relstat_new)
          #facet_wrap(~ relstat_new) to plot by relstat_new
figure2 
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f2.png?raw=true" width="100%" style="display: block; margin: auto;" >
]
---
#Use ggplot2 to plot descriptive statistics

```r
figure3<- ggplot(data = wave1c,
                 mapping=aes(x = as.factor(sat6),
                             y = ..prop..,# ask R to show % rather than absolute count
                             group = relstat_new,#calculate the % of sat6 within each group of relstat_new
                             fill = relstat_new)#color the bars different by relstat_new
                 )+
          geom_bar()+
          facet_wrap(~ relstat_new)
figure3 
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f3.png?raw=true" width="55%" style="display: block; margin: auto;" >

---
#Model the relationship between relationship status and life satisfaction

Coming up with a model of life satisfaction: Can life satisfaction (i.e., `$y$`) be expressed as a function (i.e., `$f()$`) of another variable (i.e., `$x$`)?

`$$\hat{y} = f(x)$$`
The two critical questions are thus:

1. What is `$f()$`?
2. What is `$x$`?

---
# A linear model of `$y$`

x could be partnership status, age, health, education

---
# Ordinary Least Squares
.pull-left[
We find `$\alpha$` and `$\beta$` by choosing the one line that minimizes the *residual sum of squares*:

`$$\begin{align*}
\ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ 
\ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\
\ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2}
    \end{align*}$$`
    ]
    
.pull-right[
<img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;">

]
---

#Assumptions of OLS
`$$y_i= \alpha+\beta*x_i  + e_i$$`
--

- About **the sample**
  - Assumption 1: The sample taken for the linear regression model must be drawn randomly from the population.
  
--

- About **the relationship between Y and X**
  - Assumption 2: The dependent variable (Y) is a linear function of independent variables (X) and the error term.
  
--

- About **the independent variables - X**
  - Assumption 3: There is no multi-collinearity (or perfect collinearity). In other words, there should be no linear relationship between the independent variables.
  
--

- About **the error term**
  - Assumption 4: The distribution of error terms has zero mean and doesn’t depend on the independent variables(Xs). There must be no relationship between the Xs and the error term.
  -  Assumption 5: There is homoscedasticity and no autocorrelation.
  -  Assumption 6: Error terms should be normally distributed.

---
# The `lm()` funcion

```r
# Estimate a linear OLS model and assign the results to object ols.
ols1 <- lm(formula = sat6 ~ relstat_new + age, data = wave1c) 
# sat6 is DV, should be written on the left side of ~ 
# relstat_new are IV, should be written on the right side of ~
summary(ols1) #to look at the summary of the result
```

```
## 
## Call:
## lm(formula = sat6 ~ relstat_new + age, data = wave1c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9300 -0.6988  0.3012  1.3012  3.1185 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.25606    0.07187  114.88   <2e-16 ***
## relstat_newpartnered  0.60272    0.05248   11.49   <2e-16 ***
## age                  -0.03715    0.00312  -11.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.687 on 5871 degrees of freedom
## Multiple R-squared:  0.02967,    Adjusted R-squared:  0.02934 
## F-statistic: 89.75 on 2 and 5871 DF,  p-value: < 2.2e-16
```

---
# The `lm()` funcion
Factors are automatically treated as dummies, with the first level taken as reference. You can change by using `fct_relevel()`.

```r
levels(wave1c$relstat_new) #first level is single
```

```
## [1] "single"    "partnered"
```

```r
#you can change the reference level by using fct_recode
wave1c$relstat_new <- fct_relevel(wave1c$relstat_new,"partnered")#make "partnered" first level 
ols1a <- lm(formula = sat6 ~ relstat_new + age, data = wave1c)
summary(ols1a)
```

```
## 
## Call:
## lm(formula = sat6 ~ relstat_new + age, data = wave1c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9300 -0.6988  0.3012  1.3012  3.1185 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.85878    0.09531   92.95   <2e-16 ***
## relstat_newsingle -0.60272    0.05248  -11.49   <2e-16 ***
## age               -0.03715    0.00312  -11.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.687 on 5871 degrees of freedom
## Multiple R-squared:  0.02967,    Adjusted R-squared:  0.02934 
## F-statistic: 89.75 on 2 and 5871 DF,  p-value: < 2.2e-16
```

---
#The `lm_robust` function
- OLS based on a cross-sectional dataset often has issue of [heteroscedasticity](https://rpubs.com/cyobero/187387). Therefore,inference from the standard errors, t-values, p-values, and confidence intervals is likely to  be wrong.
- We thus need to estimate robust standard errors OLS, by using `estimatr::lm_robust()`.

```r
wave1c$relstat_new <- fct_relevel(wave1c$relstat_new,"single")#change single back as 1st level
ols2_robust <- lm_robust(formula = sat6 ~ relstat_new + age, data = wave1c)
summary(ols2_robust)
```

```
## 
## Call:
## lm_robust(formula = sat6 ~ relstat_new + age, data = wave1c)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                      Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper
## (Intercept)           8.25606   0.069014  119.63 0.000e+00  8.12077  8.39135
## relstat_newpartnered  0.60272   0.054838   10.99 7.879e-28  0.49522  0.71022
## age                  -0.03715   0.003176  -11.70 2.982e-31 -0.04338 -0.03092
##                        DF
## (Intercept)          5871
## relstat_newpartnered 5871
## age                  5871
## 
## Multiple R-squared:  0.02967 ,   Adjusted R-squared:  0.02934 
## F-statistic: 82.07 on 2 and 5871 DF,  p-value: < 2.2e-16
```
---
#Export results
.pull-left[

```r
#Export to html
texreg::htmlreg(list(ols2_robust), 
        include.ci = FALSE, 
        file = "regression1.html") #html
```

```
## The table was written to the file 'regression1.html'.
```

```r
#you can also export to doc or excel, just change file="regression.doc" or "regression.xls"
```
]
.pull-right[
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f4a.png?raw=true" style="display: block; margin: auto;" >
]

- Intercept: when a person is aged 0 and single, his/her life satisfaction  is 8.26 on average.
- Partnership status: when age is controlled, a person who is partnered has a higher life satisfaction than a person who is single, by the amount of 0.60 on average.
- Age: when partnership status is controlled, as the age increases by 1 year, the life satisfaction decreases by 0.04 on average.
---
#Show all regression tables

```r
texreg::htmlreg(list(ols1,ols2_robust), 
        include.ci = FALSE, 
        digits = 4, #keep 4 digits
        custom.model.names=c("OLS1", "OLS2_robust"),
        file = "regression2.html") #html
```

```
## The table was written to the file 'regression2.html'.
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f5a.png?raw=true" width="40%" style="display: block; margin: auto;" >

---
#Interaction effect
- Baseline model: IV=age, gender, relationship status

- Baseline model + interaction between two categorical var: the relationship between partnership status and life satisfaction differs by gender. 
  - For instance, having a partner may bring more happiness to women than to men. 
  
- Baseline model + interaction between one categorical and one continuous var: the relationship between age and life satisfaction differs by gender. 
  - For instance, getting older reduces men's happiness more than women's

```r
ols3 <- lm_robust(formula = sat6 ~ age +relstat_new+gender, data = wave1c)
ols3a <- lm_robust(formula = sat6 ~ age + relstat_new*gender, data = wave1c)
ols3b <- lm_robust(formula = sat6 ~ age*gender + relstat_new, data = wave1c)
texreg::htmlreg(list(ols3,ols3a,ols3b), 
        include.ci = FALSE, 
        custom.model.names=c("OLS3", "OLS3a","OLS3b"),
        file = "regression_int1.html") #html
```

```
## The table was written to the file 'regression_int1.html'.
```
---
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f6a.png?raw=true" width="50%" style="display: block; margin: auto;" >

---
#Interaction effect
- customize your regression results in a

```r
texreg::htmlreg(list(ols3,ols3a,ols3b), 
                include.ci = FALSE, #not including confidence interval
                digits = 4, #keep 4 digits
                single.row=TRUE,#single.row=TRUE is to place standard error the same line.
                custom.coef.names =c(
                "Intercept",
                "Age",
                "Partnered(ref.=Single)",
                "Female(ref.=male)",
                "Partnered x Female",
                "Age x Female"      ),#you can create understandable names for the coef.
                custom.model.names=c("OLS3", "OLS3a","OLS3b"),
                file = "regression_int2.html"  )  
```

```
## The table was written to the file 'regression_int2.html'.
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f7a.png?raw=true" width="40%" style="display: block; margin: auto;" >

---
#Interaction effect: interpretation
<font size="3">`$$Model1: sat6=8.2434 -0.0371age +0.5987partnered + 0.0261female$$`
`$$Model2: sat6=8.2542 -0.0371age +0.5774partnered + 0.0003female + 0.0439partnered*female$$`
`$$Model3: sat6=8.4905 -0.0471age +0.6067partnered - 0.4611 female + 0.0191age*female$$`
<img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f7a.png?raw=true" width="60%" style="display: block; margin: auto;" >
---
#Take home
1. `group_by()`: Subsets a tibble into groups. Certain functions will operate afterwards by each of the specified groups.
2. `summarize()`: Allows you to estimate any kind of aggregate statistic. Combined with group_by(), it estimates those statistic by specified group. 
  - mean
  - sd
3. `ggplot()`: to plot charts, often combined with GEOM_FUNCTION>() to specify the chart type(e.g. bar, line, etc.)
4. OLS regression
  - assumptions
  - `lm()`
  - `lm_robust`
  - `texreg::htmlreg()`
  - interactions and interpretations

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1346088)