class: center, middle, inverse, title-slide .title[ # Advanced quantitative data analysis ] .subtitle[ ## Reshaping and joininig ] .author[ ### Mengni Chen ] .institute[ ### Department of Sociology, University of Copenhagen ] --- <style type="text/css"> .remark-slide-content { font-size: 20px; padding: 20px 80px 20px 80px; } .remark-code, .remark-inline-code { background: #f0f0f0; } .remark-code { font-size: 14px; } </style> #Let's get ready ```r install.packages("skimr") #provide a descriptive information about your data install.packages("splitstackshape") #to transform a long-format data into a wide-format data ``` ```r library(tidyverse) # Add the tidyverse package to my current library. library(haven) # Handle labelled data. library(Hmisc) # Weighted statistics. library(skimr) #provide summary for multiple variables library(splitstackshape) #transform wide data (with stacked variables) to long data ``` --- #Data in long format and wide format .pull-left[ Data in **wide format** - contains values that do not repeat in the first column <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic17.PNG?raw=true" width="100%" style="display: block; margin: 10px;"> <img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image001.jpg" width="100%" style="display: block; margin: 10px;"> ] .pull-right[ Data in **long format** - contains values that do repeat in the first column <img src="https://www.theanalysisfactor.com/wp-content/uploads/2013/10/image002.jpg" width="80%" style="display: block; margin: 10px;"> ] --- #What is panel data .pull-left[ **Micro panel** - N(persons) >>> T(time points) - PAIRFAM, SOEP, BHPS, SHARE, HILDA, GGP, etc <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic2.PNG?raw=true" width="50%" style="display: block; margin: 10px;"> ] .pull-right[ **Macro panel** - N countries >>> T(time points) - OECD, World bank, UNPD, etc. <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic3.PNG?raw=true" width="50%" style="display: block; margin: 50px 30px;"> ] --- #Unbalanced and balanced panel data .pull-left[ **Unbalanced panel** - Units observed at different points in time - The usual case in micro surveys - Selection problem if being observed is systematic <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic4.PNG?raw=true" width="50%" style="display: block; margin: 30px;"> ] .pull-right[ **Balanced panel** - Every unit observed at all times - Ideal-typical case, more often in macro panel data - Never realized in surveys, selection problem if forced <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic5.PNG?raw=true" width="50%" style="display: block; margin: 10px 30px;"> ] --- #Benefit and problems with panel data - Benefit - Temporal order of events: panel data > cross-sectional data - Causal inference: within-person comparison > between-person comparison - Identification of causal effects: compare the same person P at t0 to t1 - Both benefit and problem - cost of data collection: 1) few sampling costs; 2) high costs of panel maintenance; 3) overall, lower costs compared to repeated cross-sections - Reliability and validity of constructs: higher reliability; assessment of stable and variable constructs (IQ, personality); - Respondents learn to deal with the questionnaries - Question may change overtime - Problem - At start: similar to a cross-sectional survey - Over time: becomes more selective during to attrition - Solution: weighting (designed weights + longitudinal weight) - Refreshment samples --- #Import data ```r wave1 <- read_dta("anchor1_50percent_Eng.dta") wave2 <- read_dta("anchor2_50percent_Eng.dta") wave3 <- read_dta("anchor3_50percent_Eng.dta") wave4 <- read_dta("anchor4_50percent_Eng.dta") wave5 <- read_dta("anchor5_50percent_Eng.dta") wave6 <- read_dta("anchor6_50percent_Eng.dta") ``` ```r #or you use loop to import to avoid repetitive coding, similar to forvalues in stata for (i in 1:6) { assign(paste0("wave", i), #assign is similar to <-; paste0 is to combine wave and i into a name, i ranges from 1 to 6. read_dta(paste0("anchor", i, "_50percent_Eng.dta")) ) } ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic6.PNG?raw=true" width="40%" style="display: block; margin: auto;"> Wave1 to Wave 6 have different numbers of observations and variables. How we link the six waves' data together to make a panel data --- #What a micro panel data often contains? - A micro panel dataset (a person-period dataset) have four types of variables - A subjective identifier (e.g. an ID for the person) - A time indicator (e.g. the year of the survey) - Outcome variables - Predictor variables --- #First, check data - Think about what variables you want for analysis - See whether the variables are coded and labelled in the same way across waves - Some variables that are often used - ID (`id`) - Gender (`sex_gen`) - Age (`age`) - Marital status (`relstat` or `marstat`) - Labor force status (`lfs`) - Health (`hlt1`) - Education (`yeduc`, `isced`) - No. of children (`nkidsbio`) - Income (e.g. household income, `hhincnet`) - Life satisfaction (`sat6`): the outcome variable --- #First, check data - In a simple case, I consider variables: id, age, sex_gen, relstat, hlt1, sat6 ```r table(wave1$sex_gen) table(wave2$sex_gen) table(wave3$sex_gen) table(wave4$sex_gen) table(wave5$sex_gen) table(wave6$sex_gen) ``` Write similar codes for other variables to see the distribution and levels across different datasets - Or you could write a function to run repeated codes for different dataset. ```r sex_fun <- function(df) { table(as_factor(df$sex_gen)) } #define a function to generate tables for the distribution of a factor variable "sex_gen" sex_fun(wave1) #just enter your dataset in the function "sex_fun()" ``` --- #First, check data - use [`lapply` and `sapply`](https://www.youtube.com/watch?v=ejVWRKidi9M) to run the repeated code for six waves .pull-left[ ```r lapply(mget(paste0("wave", 1:6)), sex_fun) #lapply: loop over a list and evaluate a function on each element; ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic8.PNG?raw=true" width="110%" style="display: block; margin: 30px;"> ] .pull-right[ ```r sapply(mget(paste0("wave", 1:6)), sex_fun) #sapply: loop over a list and evaluate a function on each element; but different from lapply, the result is shown in a table ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic9.PNG?raw=true" width="90%" style="display: block; margin: 30px;"> ] --- #First, check data - you can write the following function ```r relstat_fun <- function(df) { table(as_factor(df$relstat)) } sapply(mget(paste0("wave", 1:6)), relstat_fun) health_fun <- function(df) { table(as_factor(df$hlt1)) } sapply(mget(paste0("wave", 1:6)), health_fun) sat_fun <- function(df) { table(as_factor(df$sat6)) } sapply(mget(paste0("wave", 1:6)), sat_fun) ``` sex_gen, relstat, sat6 are coded in the same way; while hlt1 are coded in different ways, particularly for negative values. --- #Second, clean data - you can repeat the following code for six waves ```r wave1a <- wave1 %>% transmute( id=zap_label(id), #remove label of id age=zap_label(age), #remove label of age sex_gen=as_factor(sex_gen), #make sex_gen as a factor relstat=as_factor(relstat), #make relstat as a factor relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), #specify when is missing for relstat TRUE ~ as.character(relstat))%>% as_factor(), #make relstat as a factor again hlt1=case_when(hlt1<0 ~ as.numeric(NA), #specify when hlt1 is missing TRUE ~ as.numeric(hlt1)), sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat6 is missing TRUE ~ as.numeric(sat6)) ) ``` --- #Second, clean data - or use a function ```r clean_fun <- function(df) { df %>% transmute( id=zap_label(id), #remove label of id age=zap_label(age), #remove label of age wave=as.numeric(wave), sex=as_factor(sex_gen), #make sex_gen as a factor relstat=as_factor(relstat), #make relstat as a factor relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), #specify when is missing for relstat TRUE ~ as.character(relstat))%>% as_factor(), #make relstat as a factor again hlt=case_when(hlt1<0 ~ as.numeric(NA), #specify when hlt1 is missing TRUE ~ as.numeric(hlt1)), sat=case_when(sat6<0 ~ as.numeric(NA), #specify when sat6 is missing TRUE ~ as.numeric(sat6)) ) } wave1a <- clean_fun(wave1) wave2a <- clean_fun(wave2) wave3a <- clean_fun(wave3) wave4a <- clean_fun(wave4) wave5a <- clean_fun(wave5) wave6a <- clean_fun(wave6) ``` --- #Second, clean data **Now let us look at the cleaned data by using the function `skim()` under package "skimr"** ```r skimr::skim(wave1a) ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic10.PNG?raw=true" width="100%" style="display: block; margin: 30px;"> --- #Third, combine to generate a panel data - Generate a long data, use `rbind()` ```r panel_long <- rbind(wave1a, wave2a, wave3a, wave4a, wave5a, wave6a) #rbind is combine dataset by rows ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic11.PNG?raw=true" width="60%" style="display: block; margin: 30px;"> --- # Generate a wide data, use `join()` .pull-left[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/inner-join.gif" width="50%" style="display: block; margin: auto;"> <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/left-join.gif" width="50%" style="display: block; margin: auto;"> ] .pull-right[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/right-join.gif" width="50%" style="display: block; margin: auto;"> <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/full-join.gif" width="50%" style="display: block; margin: auto;"> ] --- # Generate a wide data, use `join()` ```r wave1b <- wave1a %>% rename(wave.1=wave, age.1=age, sex.1=sex, relstat.1=relstat, hlt.1=hlt, sat.1=sat) #rename variables wave2b <- wave2a %>% rename(wave.2=wave, age.2=age, sex.2=sex, relstat.2=relstat, hlt.2=hlt, sat.2=sat) wave3b <- wave3a %>% rename(wave.3=wave, age.3=age, sex.3=sex, relstat.3=relstat, hlt.3=hlt, sat.3=sat) wave4b <- wave4a %>% rename(wave.4=wave, age.4=age, sex.4=sex, relstat.4=relstat, hlt.4=hlt, sat.4=sat) wave5b <- wave5a %>% rename(wave.5=wave, age.5=age, sex.5=sex, relstat.5=relstat, hlt.5=hlt, sat.5=sat) wave6b <- wave6a %>% rename(wave.6=wave, age.6=age, sex.6=sex, relstat.6=relstat, hlt.6=hlt, sat6.6=sat) panel_wide <- full_join(wave1b, wave2b, by = "id") %>% # Full join wave1b and wave2b full_join(wave3b, by = "id") %>% # Full join with wave3b full_join(wave4b, by = "id") %>% # Full join with wave4b full_join(wave5b, by = "id") %>% # Full join with wave5b full_join(wave6b, by = "id") # Full join with wave6b ``` --- #Reshape data from long to wide - Long to wide, using `pivot_wider()`in the package of "tidyr" (embedded in "tidyverse") ```r panel_towide2 <- pivot_wider( panel_long, #dataset to transform id_cols = id, #to identify the id column, in this case, "id" is the column names_from = wave, #i want use "wave" to create columns that reflect variables in each wave values_from = c(age, sex, relstat, hlt, sat) #specify where the value comes from ) ``` .pull-left[ <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic12.PNG?raw=true" width="100%" style="display: block; margin: 30px;"> ] .pull-right[ <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic14.PNG?raw=true" width="100%" style="display: block; margin: 30px;"> ] --- #Reshape data from wide to long - Wide to long, using `merged.stack` in the package "splitstackshape" ```r panel_tolong<- merged.stack(panel_wide, #dataset for transfrom var.stubs = c("age", "sex","relstat", "hlt", "sat", "wave"), #var.stubs is to specify the prefixes of the variable groups sep = ".") #sep is to specify the character that separates the "variable name" from the "times" in the source ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic15.PNG?raw=true" width="50%" style="display: block; margin: 10px;"> --- #Reshape data from wide to long - Wide to long, using `merged.stack` in the package "splitstackshape" ```r panel_tolong<- merged.stack(panel_wide, #dataset for transfrom var.stubs = c("age", "sex","relstat", "hlt", "sat", "wave"), #var.stubs is to specify the prefixes of the variable groups sep = ".") %>% #sep is to specify the character that separates the "variable name" from the "times" in the source drop_na(wave) #drop the observations which did not join the wave ``` <img src="https://github.com/fancycmn/slide-7/blob/main/S7_Pic16.PNG?raw=true" width="50%" style="display: block; margin: 10px;"> --- #Take home - Understand what panel data looks like - Make a long data (person-period data) - Make a wide data (person level data) - Three steps to generate a panel data - Check variables for its coding, level and class - Clean the dataset for all waves - combine to generate a panel data - Important code: - using loop: e.g `for (i in 1:6){}` - define a customized function: e.g. `sex_fun()`, `clean_fun()` - `lapply` and `sapply`: to execute functions in a loop - `skimr::skim` to view the descriptive statistics - `rbind()`to append data - join functions: e.g. `inner_join()`, `full_join()` - `pivot_wider()`to transform a long data to wide data - `merged.stack()` to transform a wide data to long data --- class: center, middle #[Exercise](https://rpubs.com/fancycmn/961581)