class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Introduction to Panel Data
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 12px;
}
</style>

#Let's get ready

```r
install.packages("skimr") # quick overview of the dataset
```

```r
library(skimr)
library(tidyverse) # Recoding and cleaning
library(haven)     # Import data
library(janitor)   # Tabulation
library(ggplot2)   # For plotting
```

---

#Portfolio 1

1: How does first childbearing affect women’s subjective wellbeing? **using "wave1_women.dta"**

2: How does entry into a partnership affect people’s subjective wellbeing? **using "anchor1_percent_Eng.dta"**

3: How does partnership break-up affect people’s subjective wellbeing? **using "anchor1_percent_Eng.dta"**

You can choose one of the following measurements as your outcome variable (subjective wellbeing):

- Subjective satisfaction with work (sat1i1)
- Subjective satisfaction with family (sat1i4)
- General life satisfaction (sat6)

---

#Portfolio 1

- how to identify first childbearing? --> those who have one child or are pregnant for the first time
  - Number of all biological kids ever born (nkidsbio)
  - Are you expecting a child? (sex3)

--

- how to identify entry into partnership? --> those who have a partner (either cohabiting or married)
  - Relationship status (relstat)
  - Marital status (marstat, sd10)
  - Do you currently have a partner in this sense? (sd3)

--

- how to identify partnership break-up?
--> those who are separated or divorced
  - Relationship status (relstat)
  - **it is possible to measure the status at the time of the survey, but not the event itself**

---

#Portfolio 1

1) 1 paragraph to introduce your topic, and 1 paragraph to introduce your data and analytical sample
   - sample size? original and final sample size
   - what variables do you use? continuous or categorical
   - how do you clean the data? e.g. how do you deal with missing values, do you recode or re-categorize variables?

---

#Portfolio 1

2) explain briefly what OLS is and what its assumptions are
   - how does OLS find the best-fit line?
   - 6 assumptions
   - you can use the OLS equation and explain each element of the equation

---

#Portfolio 1

3) conduct an OLS regression using the robust standard error option, and interpret the result
   - provide the result in tables, including significance levels
   - the result table should be clear to readers; especially for categorical variables, explain what the reference category is
   - interpret the regression coefficients, the intercept, and R-squared

---

#Portfolio 1

4) explain briefly what OB decomposition is
   - what does OB decomposition do?
   - you can use an equation to show the decomposition process and explain each element of the equation
   - what is the explained part (i.e. composition effect)?
   - what is the unexplained part (i.e. coefficient effect)?

---

#Portfolio 1

5) conduct OB decomposition and interpret the result
   - be clear which is your treated group and which is your control group, for instance partnered vs single; first child vs no child; separated vs partnered
   - what is your reference regression coefficient (i.e. group.weight=0,1,0.5...)
   - show the overall result and the result by variable in tables and graphs
   - explain the major findings
     - e.g. mainly due to composition effect or coefficient effect
     - e.g.
which variable plays the biggest role in the composition effect, and which variable plays the biggest role in the coefficient effect

---

class: center, middle

Questions for Portfolio 1

---

# Cross-sectional data vs longitudinal (panel) data

- Cross-sectional data: collect individuals' information at only one point in time
- Repeated cross-sectional data: collect data at multiple time points, but without following the same individuals
- Longitudinal data: collect the same individuals' information at multiple time points

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F1.png?raw=true" width="100%" style="display: block; margin: auto;" >

---

# Cross-sectional data vs longitudinal (panel) data

Example: cross-sectional data

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F4.png?raw=true" width="100%" style="display: block; margin: auto;" >

---

# Cross-sectional data vs longitudinal (panel) data

Example: longitudinal (panel) data

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F5.png?raw=true" width="100%" style="display: block; margin: auto;" >

---

# Longitudinal (panel) data

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F5.png?raw=true" width="80%" style="display: block; margin: auto;" >

- Story A: A's health started to decline in month 6 and deteriorated to level 1 by the end of the study, implying that becoming unemployed negatively affected their health.
- Story B: B's health started to decline in month 2 and deteriorated to level 1 by the end of the study, implying that health affected their employment status.
---

#Panel data in long format and wide format

Data in **wide format**: each unit has one row, so values in the first column (the ID) do not repeat

<img src="https://github.com/fancycmn/24-Session7/blob/main/F9.JPG?raw=true" width="120%" style="display: block; margin: auto;" >

---

#Panel data in long format and wide format

Data in **long format**: each unit has one row per time point, so values in the first column (the ID) do repeat

<img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image002.jpg" width="80%" style="display: block; margin: 10px;">

---

#Typical panel data in research

.pull-left[
**Micro panel**
- N(persons) >>> T(time points)
- PAIRFAM, SOEP, BHPS, SHARE, HILDA, GGP, etc.

<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic2.PNG?raw=true" width="50%" style="display: block; margin: 10px;">
]

.pull-right[
**Macro panel**
- N(countries) >>> T(time points)
- OECD, World Bank, UNPD, etc.

<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic3.PNG?raw=true" width="50%" style="display: block; margin: 50px 30px;">
]

---

#Unbalanced and balanced panel data

.pull-left[
**Unbalanced panel**
- Not every unit is observed at all times
- The usual case in micro surveys
- Selection problem if being observed is systematic

<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic4.PNG?raw=true" width="50%" style="display: block; margin: 30px;">
]

.pull-right[
**Balanced panel**
- Every unit is observed at all times
- Ideal-typical case, more common in macro panel data
- Never realized in surveys; selection problem if balance is forced

<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic5.PNG?raw=true" width="50%" style="display: block; margin: 10px 30px;">
]

---

#Benefits and problems of panel data

- Benefits
  - Temporal order of events: panel data > cross-sectional data
  - Causal inference: within-person comparison > between-person comparison
  - Identification of causal effects: compare the same person P at t0 and at t1
- Both benefit and problem
  - Cost of data collection: 1) few sampling costs; 2)
high costs of panel maintenance; 3) overall, lower costs compared to repeated cross-sections
  - Reliability and validity of constructs: higher reliability; assessment of stable and variable constructs (IQ, personality)
  - Respondents learn to deal with the questionnaires
  - Questions may change over time
- Problem
  - At start: similar to a cross-sectional survey
  - Over time: becomes more selective due to attrition
  - Refreshment samples: new samples are added across waves

---

#What does a micro panel dataset often contain?

- A micro panel dataset (a person-period dataset) has four types of variables
  - A subject identifier (e.g. an ID for the person)
  - A time indicator (e.g. the year of the survey)
  - Outcome variables
  - Predictor variables

---

#Import data

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave2 <- read_dta("anchor2_50percent_Eng.dta")
wave3 <- read_dta("anchor3_50percent_Eng.dta")
wave4 <- read_dta("anchor4_50percent_Eng.dta")
wave5 <- read_dta("anchor5_50percent_Eng.dta")
wave6 <- read_dta("anchor6_50percent_Eng.dta")
```

---

#First, check data

- Think about what variables you want for analysis
- See whether the variables are coded and labelled in the same way across waves
- Some variables that are often used
  - ID (`id`)
  - Gender
  - Age
  - Marital status
  - Labor force status
  - Health
  - Education
  - No. of children
  - Income
  - Life satisfaction: the outcome variable

---

#First, check data

- In a simple case, I consider variables: id, age, sex_gen, relstat, hlt1, sat6

```r
wave1$sex_gen %>% as_factor() %>% tabyl()
wave2$sex_gen %>% as_factor() %>% tabyl()
wave3$sex_gen %>% as_factor() %>% tabyl()
wave4$sex_gen %>% as_factor() %>% tabyl()
wave5$sex_gen %>% as_factor() %>% tabyl()
wave6$sex_gen %>% as_factor() %>% tabyl()
```

Write similar code for the other variables to see their distribution and levels across the datasets

- Or you could write a function to run the same code on different datasets.
```r
sex_fun <- function(df) {
  table(as_factor(df$sex_gen))
} # define a function that tabulates the distribution of the factor variable "sex_gen"
sex_fun(wave1) # just pass your dataset to "sex_fun()"
```

```
##
##               -10 not in demodiff                -7 Incomplete data
##                                 0                                 0
## -4 Filter error / Incorrect entry                 -3 Does not apply
##                                 0                                 0
##                            1 Male                          2 Female
##                              3029                              3172
```

---

#First, check data

- use [`sapply`](https://www.youtube.com/watch?v=ejVWRKidi9M) to run the repeated code for six waves

```r
sapply(mget(paste0("wave", 1:6)), sex_fun) # sapply: apply the function to each wave in turn
```

<img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic9.PNG?raw=true" width="90%" style="display: block; margin: 30px;">

---

#First, check data

```r
# what is paste0?
paste0("wave", 1)
```

```
## [1] "wave1"
```

```r
paste0("wave", 1:6)
```

```
## [1] "wave1" "wave2" "wave3" "wave4" "wave5" "wave6"
```

```r
whatisthis <- mget(paste0("wave", 1:6)) # mget() gets a list of the objects named wave1 to wave6
sapply(mget(paste0("wave", 1:6)), sex_fun)
```

```
##                                   wave1 wave2 wave3 wave4 wave5 wave6
## -10 not in demodiff                   0     0     0     0     0     0
## -7 Incomplete data                    0     0     0     0     0     0
## -4 Filter error / Incorrect entry     0     0     0     0     0     0
## -3 Does not apply                     0     0     0     0     0     0
## 1 Male                             3029  2197  1905  1668  1493  1342
## 2 Female                           3172  2339  2050  1813  1626  1477
```

```r
# sapply: loop over a list, evaluate a function on each element, and show the result in a table
```

---

#First, check data

- you can write the following functions

```r
relstat_fun <- function(df) {
  table(as_factor(df$relstat))
}
sapply(mget(paste0("wave", 1:6)), relstat_fun)

sat_fun <- function(df) {
  table(as_factor(df$sat6))
}
sapply(mget(paste0("wave", 1:6)), sat_fun)

health_fun <- function(df) {
  table(as_factor(df$hlt1))
}
sapply(mget(paste0("wave", 1:6)), health_fun)
```

sex_gen, relstat, and sat6 are coded in the same way across waves, while hlt1 is coded differently, particularly for negative values.
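---

#First, check data

The three functions above differ only in the variable name. A minimal sketch of one general helper, assuming toy data built with `haven::labelled()` in place of the pairfam waves; `tab_fun`, `toy_waves`, and the toy values are hypothetical and not part of the course code:

```r
library(haven)  # labelled() and as_factor()
library(tibble) # tibble()

# Toy stand-ins for two waves (hypothetical values, not the course data)
sex_labels <- c(Male = 1, Female = 2)
toy_waves <- list(
  wave1 = tibble(sex_gen = labelled(c(1, 2, 2), labels = sex_labels)),
  wave2 = tibble(sex_gen = labelled(c(1, 1, 2), labels = sex_labels))
)

# General helper: tabulate any labelled variable, given its name as a string
tab_fun <- function(df, var) {
  table(as_factor(df[[var]]))
}

# Extra arguments to sapply() are passed on to tab_fun()
res <- sapply(toy_waves, tab_fun, var = "sex_gen")
res
```

With the real data, a call such as `sapply(mget(paste0("wave", 1:6)), tab_fun, var = "relstat")` could stand in for `relstat_fun`, assuming the variable exists in every wave.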
---

#Second, clean data

- you can repeat the following code for six waves

```r
wave1a <- wave1 %>%
  transmute(
    id,
    age,
    wave = as.numeric(wave),
    sex_gen = as_factor(sex_gen), # make sex_gen a factor
    relstat = as_factor(relstat), # make relstat a factor
    relstat = case_when(relstat == "-7 Incomplete data" ~ as.character(NA), # specify when relstat is missing
                        TRUE ~ as.character(relstat)) %>% as_factor(), # make relstat a factor again
    hlt1 = case_when(hlt1 < 0 ~ as.numeric(NA), # specify when hlt1 is missing
                     TRUE ~ as.numeric(hlt1)),
    sat6 = case_when(sat6 < 0 ~ as.numeric(NA), # specify when sat6 is missing
                     TRUE ~ as.numeric(sat6))
  )
```

---

#Second, clean data

- or use a function

```r
clean_fun <- function(df) {
  df %>%
    transmute(
      id,  # keep id
      age, # keep age
      wave = as.numeric(wave),
      sex = as_factor(sex_gen), # make sex_gen a factor and rename it sex
      relstat = as_factor(relstat), # make relstat a factor
      relstat = case_when(relstat == "-7 Incomplete data" ~ as.character(NA), # specify when relstat is missing
                          TRUE ~ as.character(relstat)) %>% as_factor(), # make relstat a factor again
      hlt = case_when(hlt1 < 0 ~ as.numeric(NA), # specify when hlt1 is missing
                      TRUE ~ as.numeric(hlt1)),
      sat = case_when(sat6 < 0 ~ as.numeric(NA), # specify when sat6 is missing
                      TRUE ~ as.numeric(sat6))
    )
}
wave1a <- clean_fun(wave1)
wave2a <- clean_fun(wave2)
wave3a <- clean_fun(wave3)
wave4a <- clean_fun(wave4)
wave5a <- clean_fun(wave5)
wave6a <- clean_fun(wave6)
```

---

#Second, clean data

**Now let us look at the cleaned data using the function skim() from the package "skimr"**

```r
skimr::skim(wave1a)
```

<img src="https://github.com/fancycmn/25-Session3/blob/main/S7-F6.png?raw=true" width="90%" style="display: block; margin: auto;" >

---

#Take home

1. Cross-sectional data, repeated cross-sectional data, panel data
2. Clean multiple datasets:
  - define your functions
  - use `sapply()`
3. Have a quick overview of the data
  - `skimr::skim()`

---

class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1357646)
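---

#Second, clean data: all waves in one call

Instead of six separate `clean_fun()` calls, the waves can be cleaned in one step with `lapply()`. A minimal sketch, assuming toy data frames in place of the pairfam waves and a simplified `clean_fun()` that only handles `sat6`; the toy values and the object names `waves` and `cleaned` are hypothetical:

```r
library(dplyr) # transmute(), case_when(), %>%

# Toy stand-ins for two waves (hypothetical values, not the course data)
waves <- list(
  wave1 = data.frame(id = 1:3, wave = 1, sat6 = c(7, -1, 9)),
  wave2 = data.frame(id = 1:3, wave = 2, sat6 = c(8, 6, -2))
)

# Simplified version of clean_fun(): negative codes become NA
clean_fun <- function(df) {
  df %>%
    transmute(
      id,
      wave = as.numeric(wave),
      sat = case_when(sat6 < 0 ~ NA_real_,
                      TRUE ~ as.numeric(sat6))
    )
}

# lapply() returns a named list with one cleaned data frame per wave
cleaned <- lapply(waves, clean_fun)
cleaned$wave1
```

With the real data, `mget(paste0("wave", 1:6))` would build the input list, and `lapply()` is used rather than `sapply()` so that each element stays a data frame.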