Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Introduction to Panel Data
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
install.packages("skimr") #quick overview of the dataset
```

```r
library(skimr)
library(tidyverse) # Recoding and cleaning
library(haven) # Import data.
library(janitor) # Tabulation
library(ggplot2) # For plotting
```

---
#Portfolio 1
1: How does first childbearing affect women’s subjective wellbeing? 
**using "wave1_women.dta"**

2: How does entry into a partnership affect people’s subjective wellbeing?
**using "anchor1_percent_Eng.dta"**

3: How does partnership break-up affect people’s subjective wellbeing?
**using "anchor1_percent_Eng.dta"**

Choose one of the following measures of subjective wellbeing as your outcome variable:

- Subjective satisfaction with work (sat1i1)
- Subjective satisfaction with family (sat1i4)
- General life satisfaction (sat6)

---
#Portfolio 1

- How to identify first childbearing? --> Those who have one child or are pregnant for the first time
  - Number of all biological kids ever born (nkidsbio)
  - Are you expecting a child? (sex3)

- How to identify entry into a partnership? --> Those who have a partner (either cohabiting or married)
  - Relationship status (relstat)
  - Marital status (marstat, sd10)
  - Do you currently have a partner in this sense? (sd3)

--
- How to identify partnership break-up? --> Those whose status is separated or divorced
  - Relationship status (relstat)

- **These variables measure a status at the time of the survey; they cannot measure the event itself**
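
A minimal sketch of how such status indicators could be built. The toy data, the category labels of `relstat`, and the "Yes"/"No" coding of `sex3` are invented for illustration; check the codebook for the real codes before adapting it.

```r
# Toy data: all values and labels below are hypothetical
toy <- data.frame(
  id       = 1:6,
  nkidsbio = c(0, 0, 1, 2, 0, 1),
  sex3     = c("No", "Yes", "No", "No", "No", "Yes"),
  relstat  = c("Single", "Cohabiting", "Married",
               "Divorced/separated", "Single", "Married")
)

# First childbearing status: one biological child, or childless and pregnant
toy$first_child <- toy$nkidsbio == 1 | (toy$nkidsbio == 0 & toy$sex3 == "Yes")

# Partnership status: cohabiting or married
toy$partnered <- toy$relstat %in% c("Cohabiting", "Married")

# Break-up status: separated or divorced
toy$separated <- toy$relstat == "Divorced/separated"

toy[, c("id", "first_child", "partnered", "separated")]
```

Each indicator describes where a person stands at the interview, which is why it captures a status rather than the moment the event happened.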

---
#Portfolio 1
1) Write one paragraph introducing your topic and one paragraph introducing your data and analytical sample
- What is the sample size? Report both the original and the final sample size

- Which variables do you use? Are they continuous or categorical?

- How do you clean the data? E.g. how do you deal with missing values, and do you recode or re-categorize variables?
---
#Portfolio 1
2) Briefly explain what OLS is and what its assumptions are
- How does OLS find the best-fit line?

- The 6 assumptions

- You can write out the OLS equation and explain each element of it
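
To illustrate how OLS finds the best-fit line, the sketch below simulates data (all numbers are invented), fits the line with `lm()`, and reproduces the same coefficients from the normal equations. It then checks that shifting the line away from the OLS solution increases the sum of squared residuals.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)      # simulated outcome; true intercept 2, slope 0.5

fit <- lm(y ~ x)                 # OLS via lm()

# Same estimates from the normal equations: (X'X) b = X'y
X <- cbind(1, x)                 # design matrix with a column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# OLS minimizes the sum of squared residuals (SSR)
ssr <- function(b) sum((y - X %*% b)^2)
ssr_ols     <- ssr(beta_hat)
ssr_shifted <- ssr(beta_hat + c(0.1, 0))   # same slope, intercept moved by 0.1

cbind(lm = coef(fit), normal_equations = as.vector(beta_hat))
c(ssr_ols = ssr_ols, ssr_shifted = ssr_shifted)
```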
---
#Portfolio 1
3) Run an OLS regression with robust standard errors and interpret the results
- Present the results in tables, including significance levels

- The results table should be clear to readers. Especially for categorical variables, state the reference category

- Interpret the regression coefficients, the intercept, and the R-squared
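
As a sketch of what "robust" means here, the code below computes heteroskedasticity-consistent standard errors by hand on simulated data. It uses the HC1 variant, which is an assumption: the slides do not say which variant the course uses. In practice a package would normally do this, for example `sandwich::vcovHC()` together with `lmtest::coeftest()`.

```r
set.seed(2)
n <- 300
x <- runif(n, 0, 10)
y <- 1 + 0.3 * x + rnorm(n, sd = 0.5 + 0.2 * x)  # error variance grows with x

fit <- lm(y ~ x)
X <- model.matrix(fit)       # n x k design matrix
e <- residuals(fit)
k <- ncol(X)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1, with the HC1 scaling n / (n - k)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)    # each row of X is multiplied by its own residual
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vcov_hc1))
cbind(estimate = coef(fit), se_classic, se_robust)
```

The coefficients are identical under both approaches; only the standard errors, and therefore the significance tests, change.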

---
#Portfolio 1
4) Briefly explain what Oaxaca-Blinder (OB) decomposition is
- What does OB decomposition do?

- You can use an equation to show the decomposition process and explain each element of it

- What is the explained part (i.e. the composition effect)?

- What is the unexplained part (i.e. the coefficient effect)?
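
One common way to write the twofold decomposition is sketched here. The notation is an assumption and may differ from the lecture: $A$ and $B$ are the two groups, $\bar{X}$ are the mean covariates, and $\hat{\beta}_R$ are the reference coefficients.

$$\bar{Y}_A - \bar{Y}_B = (\bar{X}_A - \bar{X}_B)'\hat{\beta}_R + \bar{X}_A'(\hat{\beta}_A - \hat{\beta}_R) + \bar{X}_B'(\hat{\beta}_R - \hat{\beta}_B)$$

The first term is the explained part (differences in composition, valued at the reference coefficients). The last two terms together form the unexplained part (differences in coefficients).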

---
#Portfolio 1
5) Conduct an OB decomposition and interpret the results
- Be clear about which is your treated group and which is your control group
  - For instance, partnered vs single; first child vs no child; separated vs partnered

- Which reference regression coefficients do you use (i.e. group.weight=0,1,0.5...)?

- Show the overall result and the result by variable in tables and graphs

- Explain the major findings
  - e.g. is the gap mainly due to the composition effect or the coefficient effect?
  - e.g. which variable plays the biggest role in the composition effect, and which in the coefficient effect?
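
The logic of a twofold decomposition can be sketched by hand in base R. Everything here is hypothetical: the simulated data, the single covariate `educ`, and the weight `w`, which in this sketch is defined as the weight given to group A's coefficients. It is not a substitute for the decomposition tool used in class, and its weight convention may differ.

```r
set.seed(3)
n <- 500
group <- rep(c("A", "B"), each = n)
educ  <- c(rnorm(n, 13, 2), rnorm(n, 12, 2))          # group A has more education
sat   <- ifelse(group == "A", 3 + 0.40 * educ, 2.5 + 0.35 * educ) + rnorm(2 * n)
d <- data.frame(group, educ, sat)

# Separate regressions for each group
fit_a <- lm(sat ~ educ, data = d[d$group == "A", ])
fit_b <- lm(sat ~ educ, data = d[d$group == "B", ])

# Mean covariates, including the constant
xbar_a <- colMeans(model.matrix(fit_a))
xbar_b <- colMeans(model.matrix(fit_b))

# Reference coefficients: w = 1 uses group A's, w = 0 uses group B's (this sketch's convention)
w <- 0
beta_ref <- w * coef(fit_a) + (1 - w) * coef(fit_b)

gap         <- mean(d$sat[d$group == "A"]) - mean(d$sat[d$group == "B"])
explained   <- sum((xbar_a - xbar_b) * beta_ref)
unexplained <- sum(xbar_a * (coef(fit_a) - beta_ref)) +
               sum(xbar_b * (beta_ref - coef(fit_b)))

c(gap = gap, explained = explained, unexplained = unexplained)
```

Changing `w` moves part of the gap between the explained and unexplained components, which is why the choice of reference coefficients should be reported.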

---
class: center, middle
Questions for Portfolio 1

---
# Cross-sectional data vs longitudinal (panel) data
- Cross-sectional data: collect information on individuals at only one point in time
- Repeated cross-sectional data: collect data at multiple time points, but without following the same individuals
- Longitudinal data: collect information on the same individuals at multiple time points

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F1.png?raw=true" width="100%" style="display: block; margin: auto;" >
---
# Cross-sectional data vs longitudinal (panel) data
Example: cross-sectional data

---
# Cross-sectional data vs longitudinal (panel) data
Example: longitudinal (panel) data

---
# Longitudinal (panel) data
<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F5.png?raw=true" width="80%" style="display: block; margin: auto;" >

- Story A: A's health started to decline in month 6 and deteriorated to level 1 by the end of the study, implying that becoming unemployed negatively affected their health.

- Story B: B's health started to decline in month 2 and deteriorated to level 1 by the end of the study, implying that health affected their employment status.
---
#Panel data in long format and wide format
Data in **wide format**: each unit has one row, so values in the first column (the ID) do not repeat; repeated measurements are stored in separate columns

---
#Data in long format and wide format
Data in **long format**: each unit has one row per time point, so values in the first column (the ID) do repeat
<img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image002.jpg" width="80%" style="display: block; margin: 10px;">
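
A small sketch of moving between the two formats, using invented data and base R's `reshape()` to avoid extra dependencies. The column names `sat_1` to `sat_3` are hypothetical. In a tidyverse workflow, `tidyr::pivot_longer()` and `tidyr::pivot_wider()` serve the same purpose.

```r
# Wide format: one row per person, one column per wave
wide <- data.frame(
  id    = 1:3,
  sat_1 = c(7, 5, 8),
  sat_2 = c(8, 5, 6),
  sat_3 = c(8, 4, 7)
)

# Wide -> long: one row per person and wave
long <- reshape(wide, direction = "long",
                varying = c("sat_1", "sat_2", "sat_3"),
                v.names = "sat", timevar = "wave", times = 1:3,
                idvar = "id")
long <- long[order(long$id, long$wave), ]   # sort by person, then wave
rownames(long) <- NULL
long

# Long -> wide again
wide_again <- reshape(long, direction = "wide",
                      idvar = "id", timevar = "wave",
                      v.names = "sat", sep = "_")
wide_again
```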

---
#Typical panel data in research
.pull-left[
**Micro panel**
- N(persons) >>> T(time points)
- PAIRFAM, SOEP, BHPS, SHARE, HILDA, GGP, etc
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic2.PNG?raw=true" width="50%" style="display: block; margin: 10px;">

]

.pull-right[
**Macro panel**
- N countries >>> T(time points)
- OECD, World Bank, UNPD, etc.
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic3.PNG?raw=true" width="50%" style="display: block; margin: 50px 30px;">

]
---
#Unbalanced and balanced panel data
.pull-left[
**Unbalanced panel**
- Not every unit is observed at all time points
- The usual case in micro surveys
- Selection problem if whether a unit is observed is systematic
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic4.PNG?raw=true" width="50%" style="display: block; margin: 30px;">

]

.pull-right[
**Balanced panel**
- Every unit is observed at all time points
- Ideal-typical case, more common in macro panel data
- Never realized in surveys; forcing balance creates a selection problem
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic5.PNG?raw=true" width="50%" style="display: block; margin: 10px 30px;">

]
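
A quick way to check whether a long-format panel is balanced is to count observations per unit. The sketch below uses an invented toy panel; the object names are hypothetical.

```r
# Toy long-format panel: person 2 is missing in wave 3
toy_panel <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  wave = c(1, 2, 3, 1, 2, 1, 2, 3),
  sat  = c(7, 8, 8, 5, 6, 9, 9, 7)
)

obs_per_id <- table(toy_panel$id)              # number of waves observed per person
n_waves    <- length(unique(toy_panel$wave))   # number of waves in the data

is_balanced <- all(obs_per_id == n_waves)
is_balanced

# Forcing a balanced panel: keep only persons observed in every wave
complete_ids <- names(obs_per_id)[obs_per_id == n_waves]
toy_balanced <- toy_panel[toy_panel$id %in% complete_ids, ]
toy_balanced
```

Dropping incomplete cases like this is exactly the "forced" balance that can introduce selection, so it is worth comparing the dropped and retained persons first.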
---
# Benefits and problems of panel data
- Benefits
  - Temporal order of events: panel data > cross-sectional data
  - Causal inference: within-person comparison > between-person comparison
  - Identification of causal effects: compare the same person P at t0 and t1
- Both benefits and problems
  - Cost of data collection: 1) low sampling costs; 2) high costs of panel maintenance; 3) overall, lower costs compared to repeated cross-sections
  - Reliability and validity of constructs: higher reliability; assessment of stable and variable constructs (IQ, personality)
  - Respondents learn how to deal with the questionnaires
  - Questions may change over time
- Problems
  - At the start: similar to a cross-sectional survey
  - Over time: the sample becomes more selective due to attrition
  - Refreshment samples: new samples are added across waves
---
# What does a micro panel dataset usually contain?
- A micro panel dataset (a person-period dataset) has four types of variables
    - A subject identifier (e.g. an ID for the person)
    - A time indicator (e.g. the year of the survey)
    - Outcome variables 
    - Predictor variables
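
A toy person-period dataset with all four types of variables, plus a sketch of the within-person comparison that such data make possible. All values and the variable names `sat` and `partnered` are invented.

```r
# One row per person and wave, sorted by id and wave
pp <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2),   # subject identifier
  wave      = c(1, 2, 3, 1, 2, 3),   # time indicator
  sat       = c(6, 7, 9, 8, 8, 5),   # outcome
  partnered = c(0, 1, 1, 1, 1, 0)    # predictor
)

# Person-specific mean and each observation's deviation from it
pp$sat_mean   <- ave(pp$sat, pp$id)
pp$sat_within <- pp$sat - pp$sat_mean

# Change since the previous wave, computed within each person
pp$sat_change <- ave(pp$sat, pp$id, FUN = function(x) c(NA, diff(x)))

pp
```

The first wave of each person has no previous observation, so its change is missing.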

---
#Import data

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave2 <- read_dta("anchor2_50percent_Eng.dta")
wave3 <- read_dta("anchor3_50percent_Eng.dta")
wave4 <- read_dta("anchor4_50percent_Eng.dta")
wave5 <- read_dta("anchor5_50percent_Eng.dta")
wave6 <- read_dta("anchor6_50percent_Eng.dta")
```
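
The six calls follow one naming pattern, so the file names can also be generated and read in a loop. This is a sketch that assumes the files sit in the working directory; the list name `waves` is hypothetical, and the reading step is skipped when the files or the haven package are not available.

```r
# Build the six file names and label them wave1 ... wave6
files <- paste0("anchor", 1:6, "_50percent_Eng.dta")
names(files) <- paste0("wave", 1:6)
files

# Read all files into one named list (only if everything needed is present)
if (all(file.exists(files)) && requireNamespace("haven", quietly = TRUE)) {
  waves <- lapply(files, haven::read_dta)
  names(waves)   # "wave1" ... "wave6"
}
```

Keeping the datasets in one list makes it easy to apply the same check or cleaning function to every wave.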

---
#First, check data
- Think about what variables you want for analysis
- See whether the variables are coded and labelled in the same way across waves
- Some variables that are often used
  - ID (`id`)
  - Gender 
  - Age 
  - Marital status 
  - Labor force status 
  - Health 
  - Education 
  - No. of children 
  - Income 
  - Life satisfaction: the outcome variable

---
#First, check data
- In a simple case, I consider variables: id, age, sex_gen, relstat, hlt1, sat6

```r
wave1$sex_gen %>% as_factor() %>% tabyl()
wave2$sex_gen %>% as_factor() %>% tabyl()
wave3$sex_gen %>% as_factor() %>% tabyl()
wave4$sex_gen %>% as_factor() %>% tabyl()
wave5$sex_gen %>% as_factor() %>% tabyl()
wave6$sex_gen %>% as_factor() %>% tabyl()
```
Write similar code for the other variables to see their distributions and levels across the datasets

- Or you could write a function to run the repeated code on different datasets.

```r
sex_fun <- function(df) {
  table(as_factor(df$sex_gen))
} # define a function that tabulates the distribution of the factor variable "sex_gen"
sex_fun(wave1) # just pass your dataset to "sex_fun()"
```

```
## 
##               -10 not in demodiff                -7 Incomplete data 
##                                 0                                 0 
## -4 Filter error / Incorrect entry                 -3 Does not apply 
##                                 0                                 0 
##                            1 Male                          2 Female 
##                              3029                              3172
```

---
#First, check data
- Use [`sapply`](https://www.youtube.com/watch?v=ejVWRKidi9M) to run the repeated code for all six waves

```r
sapply(mget(paste0("wave", 1:6)), sex_fun)
# sapply: apply the function to each dataset in turn
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic9.PNG?raw=true" width="90%" style="display: block; margin: 30px;">

---
#First, check data

```r
# what is paste0()?
paste0("wave", 1)
```

```
## [1] "wave1"
```

```r
paste0("wave", 1:6)
```

```
## [1] "wave1" "wave2" "wave3" "wave4" "wave5" "wave6"
```

```r
whatisthis <- mget(paste0("wave", 1:6)) # mget() returns a list of the objects named wave1 to wave6
sapply(mget(paste0("wave", 1:6)), sex_fun)
```

```
##                                   wave1 wave2 wave3 wave4 wave5 wave6
## -10 not in demodiff                   0     0     0     0     0     0
## -7 Incomplete data                    0     0     0     0     0     0
## -4 Filter error / Incorrect entry     0     0     0     0     0     0
## -3 Does not apply                     0     0     0     0     0     0
## 1 Male                             3029  2197  1905  1668  1493  1342
## 2 Female                           3172  2339  2050  1813  1626  1477
```

```r
#sapply: loop over a list and evaluate a function on each element and show the result in a table
```
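
`sapply()` can only arrange the results in one table when every dataset returns a result of the same length. The sketch below uses invented factors (the labels are hypothetical) to show what happens when the levels differ across waves: the result falls back to a list.

```r
# Same levels in both waves -> results have equal length -> matrix
toy_same <- list(
  w1 = factor(c("Male", "Female", "Female"), levels = c("Male", "Female")),
  w2 = factor(c("Male", "Male"),             levels = c("Male", "Female"))
)
res_same <- sapply(toy_same, table)
res_same

# Different levels across waves -> results differ in length -> list
toy_diff <- list(
  w1 = factor(c("1 Bad", "2 Good"), levels = c("-1 Don't know", "1 Bad", "2 Good")),
  w2 = factor(c("1 Bad", "2 Good"), levels = c("1 Bad", "2 Good"))
)
res_diff <- sapply(toy_diff, table)
res_diff
```

A list instead of a table is therefore a useful warning sign that a variable is not coded identically in all waves.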

---
#First, check data
- You can write similar functions for the other variables

```r
relstat_fun <- function(df) {
  table(as_factor(df$relstat))
        }
sapply(mget(paste0("wave", 1:6)), relstat_fun)

sat_fun <- function(df) {
  table(as_factor(df$sat6))
        }
sapply(mget(paste0("wave", 1:6)), sat_fun)

health_fun <- function(df) {
  table(as_factor(df$hlt1))
        }
sapply(mget(paste0("wave", 1:6)), health_fun)
```
sex_gen, relstat and sat6 are coded in the same way across waves, while hlt1 is coded differently, particularly for negative values.
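
The three functions differ only in the variable name, so one function that takes the name as an argument could replace them. This is a sketch on invented data; `tab_fun`, `toy_waves`, and the toy variables are hypothetical names. With the labelled survey data, the table call would need to wrap the column in `as_factor()` as in the functions above.

```r
# One function for any variable: df[[var]] picks the column by its name
tab_fun <- function(df, var) {
  table(df[[var]])
}

# Two toy "waves" with identically coded factors
toy_waves <- list(
  toy_w1 = data.frame(
    sex = factor(c("Male", "Female", "Female"), levels = c("Male", "Female")),
    sat = factor(c(7, 8, 8), levels = 7:8)
  ),
  toy_w2 = data.frame(
    sex = factor(c("Male", "Male"), levels = c("Male", "Female")),
    sat = factor(c(8, 7), levels = 7:8)
  )
)

# One variable across all waves
sex_tab <- sapply(toy_waves, tab_fun, var = "sex")
sex_tab

# Several variables across all waves, collected in a named list
all_tabs <- lapply(c(sex = "sex", sat = "sat"),
                   function(v) sapply(toy_waves, tab_fun, var = v))
all_tabs
```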

---
#Second, clean data
- You can repeat the following code for all six waves

```r
wave1a <-  wave1 %>% 
  transmute(
    id, 
    age, 
    wave=as.numeric(wave),
    sex_gen=as_factor(sex_gen), # convert sex_gen to a factor
    relstat=as_factor(relstat), # convert relstat to a factor
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), # set incomplete data to missing
                      TRUE ~ as.character(relstat))%>%  
      as_factor(), # convert relstat back to a factor
    hlt1=case_when(hlt1<0 ~ as.numeric(NA),  # negative codes of hlt1 are missing
                   TRUE ~ as.numeric(hlt1)),
    sat6=case_when(sat6<0 ~ as.numeric(NA), # negative codes of sat6 are missing
                   TRUE ~ as.numeric(sat6))
                   )
```

---
#Second, clean data
- Or use a function

```r
clean_fun <- function(df) {
df %>% 
  transmute(
    id, # keep the person identifier as is
    age, # keep age as is
    wave=as.numeric(wave),
    sex=as_factor(sex_gen), # convert sex_gen to a factor and rename it
    relstat=as_factor(relstat), # convert relstat to a factor
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), # set incomplete data to missing
                      TRUE ~ as.character(relstat))%>%  
      as_factor(), # convert relstat back to a factor
    hlt=case_when(hlt1<0 ~ as.numeric(NA),  # negative codes of hlt1 are missing
                   TRUE ~ as.numeric(hlt1)),
    sat=case_when(sat6<0 ~ as.numeric(NA), # negative codes of sat6 are missing
                   TRUE ~ as.numeric(sat6))
                   )
        }
wave1a <- clean_fun(wave1)
wave2a <- clean_fun(wave2)
wave3a <- clean_fun(wave3)
wave4a <- clean_fun(wave4)
wave5a <- clean_fun(wave5)
wave6a <- clean_fun(wave6)
```
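
Once the waves are cleaned in the same way, they can be cleaned in one loop and stacked into a single long-format panel. The sketch below shows the idea on invented data with a simplified stand-in for the cleaning function; `toy_w1`, `toy_w2`, and `clean_toy` are hypothetical names. With dplyr loaded, `bind_rows()` could replace the `do.call(rbind, ...)` step.

```r
# Two toy waves; negative values stand for missing codes
toy_w1 <- data.frame(id = 1:3,     wave = 1, sat6 = c(7, -1, 9))
toy_w2 <- data.frame(id = c(1, 3), wave = 2, sat6 = c(8, -2))

# Simplified stand-in for a cleaning function
clean_toy <- function(df) {
  data.frame(
    id   = df$id,
    wave = df$wave,
    sat  = ifelse(df$sat6 < 0, NA, df$sat6)   # negative codes become missing
  )
}

# Clean every wave, then stack the results into one long panel
cleaned    <- lapply(list(toy_w1 = toy_w1, toy_w2 = toy_w2), clean_toy)
panel_long <- do.call(rbind, cleaned)
panel_long <- panel_long[order(panel_long$id, panel_long$wave), ]
rownames(panel_long) <- NULL
panel_long
```

The stacked result is a person-period dataset: one row per person and wave, with `id` and `wave` identifying each observation.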
---
#Second, clean data
**Now let us look at the cleaned data using the function skim() from the package "skimr"**

```r
skimr::skim(wave1a)
```
<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F6.png?raw=true" width="90%" style="display: block; margin: auto;" >
---
#Take home
1. Cross-sectional data, repeated cross-sectional data, panel data
2. Clean multiple datasets: 
    - define your functions
    - use `sapply()`
3. Have a quick overview of the data
    - `skimr::skim()`

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1357646)