Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Reshaping and joininig
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
install.packages("skimr") #provide a descriptive information about your data
install.packages("splitstackshape") #to transform a long-format data into a wide-format data
```

```r
library(tidyverse) # Add the tidyverse package to my current library.
library(haven) # Handle labelled data.
library(Hmisc) # Weighted statistics.
library(skimr) #provide summary for multiple variables
library(splitstackshape) #transform wide data (with stacked variables) to long data
```

---
#Data in long format and wide format
.pull-left[
Data in **wide format**
-   contains values that do not repeat in the first column
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic17.PNG?raw=true" width="100%" style="display: block; margin: 10px;">
<img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image001.jpg" width="100%" style="display: block; margin: 10px;">

]
.pull-right[
Data in **long format**
-   contains values that do repeat in the first column
<img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image002.jpg" width="80%" style="display: block; margin: 10px;">

]

---
#What is panel data
.pull-left[
**Micro panel**
- N(persons) >>> T(time points)
- PAIRFAM, SOEP, BHPS, SHARE, HILDA, GGP, etc
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic2.PNG?raw=true" width="50%" style="display: block; margin: 10px;">

]

.pull-right[
**Macro panel**
- N countries >>> T(time points)
- OECD, World bank, UNPD, etc.
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic3.PNG?raw=true" width="50%" style="display: block; margin: 50px 30px;">

]
---
#Unbalanced and balanced panel data
.pull-left[
**Unbalanced panel**
- Units observed at different points in time
- The usual case in micro surveys
- Selection problem if being observed is systematic
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic4.PNG?raw=true" width="50%" style="display: block; margin: 30px;">

]

.pull-right[
**Balanced panel**
- Every unit observed at all times
- Ideal-typical case, more often in macro panel data
- Never realized in surveys, selection problem if forced
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic5.PNG?raw=true" width="50%" style="display: block; margin: 10px 30px;">

]
---
#Benefit and problems with panel data
- Benefit
  - Temporal order of events: panel data > cross-sectional data
  - Causal inference: within-person comparison > between-person comparison
  - Identification of causal effects: compare the same person P at t0 to t1
- Both benefit and problem
  - cost of data collection: 1) few sampling costs; 2) high costs of panel maintenance; 3) overall, lower costs compared to repeated cross-sections
  - Reliability and validity of constructs: higher reliability; assessment of stable and variable constructs (IQ, personality);
  - Respondents learn to deal with the questionnaries
  - Question may change overtime
- Problem
  - At start: similar to a cross-sectional survey
  - Over time: becomes more selective during to attrition
  - Solution: weighting (designed weights + longitudinal weight)
  - Refreshment samples

---
#Import data

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave2 <- read_dta("anchor2_50percent_Eng.dta")
wave3 <- read_dta("anchor3_50percent_Eng.dta")
wave4 <- read_dta("anchor4_50percent_Eng.dta")
wave5 <- read_dta("anchor5_50percent_Eng.dta")
wave6 <- read_dta("anchor6_50percent_Eng.dta")
```

```r
#or you use loop to import to avoid repetitive coding, similar to forvalues in stata
for (i in 1:6) {
  assign(paste0("wave", i), #assign is similar to <-; paste0 is to combine wave and i into a name, i ranges from 1 to 6. 
         read_dta(paste0("anchor", i, "_50percent_Eng.dta"))
         )
} 
```

Wave1 to Wave 6 have different numbers of observations and variables. How we link the six waves' data together to make a panel data

---
#What a micro panel data often contains?
- A micro panel dataset (a person-period dataset) have four types of variables
    - A subjective identifier (e.g. an ID for the person)
    - A time indicator (e.g. the year of the survey)
    - Outcome variables 
    - Predictor variables

---
#First, check data
- Think about what variables you want for analysis
- See whether the variables are coded and labelled in the same way across waves
- Some variables that are often used
  - ID (`id`)
  - Gender (`sex_gen`)
  - Age (`age`)
  - Marital status (`relstat` or `marstat`)
  - Labor force status (`lfs`)
  - Health (`hlt1`)
  - Education (`yeduc`, `isced`)
  - No. of children (`nkidsbio`)
  - Income (e.g. household income, `hhincnet`)
  - Life satisfaction (`sat6`): the outcome variable

---
#First, check data
- In a simple case, I consider variables: id, age, sex_gen, relstat, hlt1, sat6

```r
table(wave1$sex_gen)
table(wave2$sex_gen)
table(wave3$sex_gen)
table(wave4$sex_gen)
table(wave5$sex_gen)
table(wave6$sex_gen)
```
Write similar codes for other variables to see the distribution and levels across different datasets

- Or you could write a function to run repeated codes for different dataset.

```r
sex_fun <- function(df) {
  table(as_factor(df$sex_gen))
        } #define a function to generate tables for the distribution of a factor variable "sex_gen"
sex_fun(wave1) #just enter your dataset in the function  "sex_fun()"
```

---
#First, check data
- use [`lapply` and `sapply`](https://www.youtube.com/watch?v=ejVWRKidi9M) to run the repeated code for six waves

.pull-left[

```r
lapply(mget(paste0("wave", 1:6)), sex_fun) 
#lapply: loop over a list and evaluate a function on each element;
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic8.PNG?raw=true" width="110%" style="display: block; margin: 30px;">
]

.pull-right[

```r
sapply(mget(paste0("wave", 1:6)), sex_fun)
#sapply: loop over a list and evaluate a function on each element; but different from lapply, the result is shown in a table
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic9.PNG?raw=true" width="90%" style="display: block; margin: 30px;">
]

---
#First, check data
- you can write the following function

```r
relstat_fun <- function(df) {
  table(as_factor(df$relstat))
        }
sapply(mget(paste0("wave", 1:6)), relstat_fun)

health_fun <- function(df) {
  table(as_factor(df$hlt1))
        }
sapply(mget(paste0("wave", 1:6)), health_fun)

sat_fun <- function(df) {
  table(as_factor(df$sat6))
        }
sapply(mget(paste0("wave", 1:6)), sat_fun)
```
sex_gen, relstat, sat6 are coded in the same way; while hlt1 are coded in different ways, particularly for negative values.

---
#Second, clean data
- you can repeat the following code for six waves

```r
wave1a <-  wave1 %>% 
  transmute(
    id=zap_label(id), #remove label of id
    age=zap_label(age), #remove label of age
    sex_gen=as_factor(sex_gen), #make sex_gen as a factor
    relstat=as_factor(relstat), #make relstat as a factor
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), #specify when is missing for relstat
                      TRUE ~ as.character(relstat))%>%  
      as_factor(), #make relstat as a factor again
    hlt1=case_when(hlt1<0 ~ as.numeric(NA),  #specify when hlt1 is missing 
                   TRUE ~ as.numeric(hlt1)),
    sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat6 is missing
                   TRUE ~ as.numeric(sat6))
                   )
```

---
#Second, clean data
- or use a function

```r
clean_fun <- function(df) {
df %>% 
  transmute(
    id=zap_label(id), #remove label of id
    age=zap_label(age), #remove label of age
    wave=as.numeric(wave),
    sex=as_factor(sex_gen), #make sex_gen as a factor
    relstat=as_factor(relstat), #make relstat as a factor
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), #specify when is missing for relstat
                      TRUE ~ as.character(relstat))%>%  
      as_factor(), #make relstat as a factor again
    hlt=case_when(hlt1<0 ~ as.numeric(NA),  #specify when hlt1 is missing 
                   TRUE ~ as.numeric(hlt1)),
    sat=case_when(sat6<0 ~ as.numeric(NA), #specify when sat6 is missing
                   TRUE ~ as.numeric(sat6))
                   )
        }
wave1a <- clean_fun(wave1)
wave2a <- clean_fun(wave2)
wave3a <- clean_fun(wave3)
wave4a <- clean_fun(wave4)
wave5a <- clean_fun(wave5)
wave6a <- clean_fun(wave6)
```

---
#Second, clean data
**Now let us look at the cleaned data by using the function `skim()` under package "skimr"**

```r
skimr::skim(wave1a)
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic10.PNG?raw=true" width="100%" style="display: block; margin: 30px;">

---
#Third, combine to generate a panel data
- Generate a long data, use `rbind()`

```r
panel_long <- rbind(wave1a, wave2a, wave3a, wave4a, wave5a, wave6a) #rbind is combine dataset by rows
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic11.PNG?raw=true" width="60%" style="display: block; margin: 30px;">

---
# Generate a wide data, use `join()`
.pull-left[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/inner-join.gif" width="50%" style="display: block; margin: auto;">
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/left-join.gif" width="50%" style="display: block; margin: auto;">
]

.pull-right[
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/right-join.gif" width="50%" style="display: block; margin: auto;">
<img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/full-join.gif" width="50%" style="display: block; margin: auto;">
]

---
# Generate a wide data, use `join()`

```r
wave1b <- wave1a %>% 
rename(wave.1=wave, age.1=age, sex.1=sex, relstat.1=relstat, hlt.1=hlt, sat.1=sat) #rename variables

wave2b <- wave2a %>% 
rename(wave.2=wave, age.2=age, sex.2=sex, relstat.2=relstat, hlt.2=hlt, sat.2=sat)

wave3b <- wave3a %>% 
rename(wave.3=wave, age.3=age, sex.3=sex, relstat.3=relstat, hlt.3=hlt, sat.3=sat)

wave4b <- wave4a %>% 
rename(wave.4=wave, age.4=age, sex.4=sex, relstat.4=relstat, hlt.4=hlt, sat.4=sat)

wave5b <- wave5a %>% 
rename(wave.5=wave, age.5=age, sex.5=sex, relstat.5=relstat, hlt.5=hlt, sat.5=sat)

wave6b <- wave6a %>% 
rename(wave.6=wave, age.6=age, sex.6=sex, relstat.6=relstat, hlt.6=hlt, sat6.6=sat)

panel_wide <- full_join(wave1b, wave2b, by = "id") %>%  # Full join wave1b and wave2b
  full_join(wave3b, by = "id") %>% # Full join with wave3b
  full_join(wave4b, by = "id") %>% # Full join with wave4b
  full_join(wave5b, by = "id") %>% # Full join with wave5b
  full_join(wave6b, by = "id") # Full join with wave6b
```

---
#Reshape data from long to wide
- Long to wide, using `pivot_wider()`in the package of "tidyr" (embedded in "tidyverse")

```r
panel_towide2 <- pivot_wider(
  panel_long, #dataset to transform
  id_cols = id, #to identify the id column, in this case, "id" is the column
  names_from = wave, #i want use "wave" to create columns that reflect variables in each wave
  values_from = c(age, sex, relstat, hlt, sat) #specify where the value comes from
)
```

.pull-left[
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic12.PNG?raw=true" width="100%" style="display: block; margin: 30px;">

]
.pull-right[
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic14.PNG?raw=true" width="100%" style="display: block; margin: 30px;">

]
---
#Reshape data from wide to long
- Wide to long, using `merged.stack` in the package "splitstackshape"

```r
panel_tolong<- merged.stack(panel_wide, #dataset for transfrom
                            var.stubs = c("age", "sex","relstat", "hlt", "sat", "wave"), 
#var.stubs is to specify the prefixes of the variable groups
                            sep = ".") 
#sep is to specify the character that separates the "variable name" from the "times" in the source
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic15.PNG?raw=true" width="50%" style="display: block; margin: 10px;">
---
#Reshape data from wide to long
- Wide to long, using `merged.stack` in the package "splitstackshape"

```r
panel_tolong<- merged.stack(panel_wide, #dataset for transfrom
                            var.stubs = c("age", "sex","relstat", "hlt", "sat", "wave"), 
#var.stubs is to specify the prefixes of the variable groups
                            sep = ".") %>%  
#sep is to specify the character that separates the "variable name" from the "times" in the source
                            drop_na(wave)
#drop the observations which did not join the wave
```
<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic16.PNG?raw=true" width="50%" style="display: block; margin: 10px;">

---
#Take home
- Understand what panel data looks like
- Make a long data (person-period data)
- Make a wide data (person level data)
- Three steps to generate a panel data
  - Check variables for its coding, level and class
  - Clean the dataset for all waves
  - combine to generate a panel data
- Important code:
  - using loop: e.g `for (i in 1:6){}`
  - define a customized function: e.g. `sex_fun()`, `clean_fun()`
  - `lapply`  and `sapply`: to execute functions in a loop
  - `skimr::skim` to view the descriptive statistics
  - `rbind()`to append data
  - join functions: e.g. `inner_join()`, `full_join()`
  - `pivot_wider()`to transform a long data to wide data
  - `merged.stack()` to transform a wide data to long data
---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/961581)