class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Piping & Grouping
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 24px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 12px;
}
</style>

#Packages for today

```r
#install two new packages
install.packages("Hmisc")
install.packages("ggplot2")
```

```r
library(tidyverse)
library(haven)
library(Hmisc)
library(ggplot2)
```

---

#Nested code

We have learned that R code can be nested. But too much nesting becomes unintelligible: because R evaluates code from the inside out, we also have to read nested code from the inside out!

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--

Alternatively, we could write several lines of code in succession and read them from top to bottom. But this creates many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13)
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---

#The (forward) pipe ` %>% `

The `%>%` operator pipes the output of one function into the next function as input.

You can basically say: `function(argument1 = value)` can be written as `value %>% function()`.

Or even easier, think of it as: "then"

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>%
  sqrt() %>%
  mean()
```

```
## [1] 2.527274
```

---

# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="90%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="690%" style="display: block; margin: auto;">
]

---

#The (forward) pipe ` %>% `

- Shortcut for typing `%>%`
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
# Example 1
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1)
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.

---

#Advantages of piping & when not to pipe

- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from the inside out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.

- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, it is better to create intermediate objects. They help you debug (i.e., find mistakes) and make the code easier to read.

---

#Manipulating Pairfam using `%>%`

"Keep the variables id, age, sex_gen, cohort, yeduc, relstat, cdweight, as well as one variable that reflects the attitude towards family, and one variable that reflects subjective wellbeing."
What we would do without the pipe:

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave1a <- select(wave1, id, age, sex_gen, cohort, yeduc, relstat, cdweight, val1i7, sat6)
wave1a <- mutate(wave1a,
                 id=zap_labels(id),
                 age=zap_labels(age),
                 yeduc=zap_labels(yeduc),
                 sat6=zap_labels(sat6),
                 cdweight=zap_label(cdweight),
                 sex_gen=as_factor(sex_gen),
                 cohort=as_factor(cohort),
                 relstat=as_factor(relstat),
                 val1i7=as_factor(val1i7)
                 )
```

---

#Manipulating Pairfam using `%>%`

But now, with the pipe:

```r
wave1b <- wave1 %>%
  transmute( # Create new variables and keep only those
    id=zap_labels(id), #take off the label of id
    age=zap_labels(age), #take off the label of age as it is treated as a continuous variable
    cdweight=zap_label(cdweight), #take off the label of the variable "cdweight"
    cohort=as_factor(cohort), #treat cohort as a categorical variable
    sex_gen=as_factor(sex_gen), #treat sex as a categorical variable
    wave=as_factor(wave), #treat wave as a categorical variable
    yeduc=case_when(yeduc<0 ~ as.numeric(NA),
                    TRUE ~ as.numeric(yeduc)) %>%
      zap_label(), #when yeduc<0, make it NA; the rest keep their original value; then pipe yeduc into zap_label() to take off the label
    sat6=case_when(sat6<0 ~ as.numeric(NA),
                   TRUE ~ as.numeric(sat6)), #when sat6<0, make it NA; the rest keep their original value
    relstat=as_factor(relstat), #treat relstat as a categorical variable
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA),
                      TRUE ~ as.character(relstat)) %>%
      as_factor(), #when relstat has the value "-7 Incomplete data", make it NA; otherwise, keep it as it is;
                   #then pipe relstat into as_factor() to make it a categorical variable.
    val1i7=as_factor(val1i7), #treat val1i7 as a categorical variable.
    val1i7=case_when(val1i7=="-2 No answer" | # the symbol "|" means or
                       val1i7=="-1 Don't know" ~ as.character(NA),
                     TRUE ~ as.character(val1i7)) %>%
      as_factor() #when val1i7 is "-2 No answer" or "-1 Don't know", make it NA; otherwise keep it as it is;
                  #then pipe val1i7 into as_factor() to make it a factor variable
  )
#Note: why do we need to pipe val1i7 into as_factor() after case_when()?
#Because when val1i7 goes through "val1i7=="-2 No answer" | val1i7=="-1 Don't know" ~ as.character(NA)", it becomes a character rather than a factor.
```

---

#What is weight?

Please watch the video here:

[Click to see: Data weighting and representative samples](https://www.youtube.com/watch?v=KkqXbw43yxc)

---

#When there is weight

Statistically, we use weights by multiplication. Say in a small imagined patriarchal society, men's votes count twice as much. Here we have a ballot on whether women should be allowed to drive:

.pull-left[

| id|vote | voted_yes| weight|
|--:|:----|---------:|------:|
|  1|NO   |         0|      2|
|  2|YES  |         1|      2|
|  3|YES  |         1|      1|
|  4|YES  |         1|      1|
|  5|NO   |         0|      1|

```r
#without weight, the percentage of votes for yes is 60%
(3/5)*100
```

```
## [1] 60
```
]

.pull-right[

```r
#with weight, the percentage of votes for yes is about 57%
((0*2 + 1*2 + 1 + 1 + 0) / 7) * 100 # Way 1
```

```
## [1] 57.14286
```

```r
(sum(voted_yes * weight) / sum(weight)) * 100 #Way 2
```

```
## [1] 57.14286
```

```r
Hmisc::wtd.mean(x = voted_yes, weights = weight) * 100 #Way 3, use the "wtd.mean" function from the package "Hmisc" to get the mean directly.
```

```
## [1] 57.14286
```
]

---

#Grouped operations

`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1c<- wave1b %>% group_by(cohort) #pipe the dataset "wave1b" into group_by(cohort), so that the code that follows is executed by cohort.
wave1c #what wave1c looks like
```

```
## # A tibble: 6,201 × 10
## # Groups:   cohort [3]
##           id   age cdweight cohort      sex_gen wave  yeduc  sat6 relstat val1i7
##        <dbl> <dbl>    <dbl> <fct>       <fct>   <fct> <dbl> <dbl> <fct>   <fct>
##  1 267206000    16    1.10  1 1991-1993 2 Fema… 1 20…   0       7 1 Neve… 3
##  2 112963000    35    1.73  3 1971-1973 1 Male  1 20…  10.5     6 1 Neve… 2
##  3 327937000    16    0.774 1 1991-1993 2 Fema… 1 20…   0       8 <NA>    4
##  4 318656000    27    0.719 2 1981-1983 2 Fema… 1 20…  11.5     9 4 Marr… 5 Agr…
##  5 717889000    37    1.15  3 1971-1973 1 Male  1 20…  11.5     7 4 Marr… 4
##  6 222517000    15    0.900 1 1991-1993 1 Male  1 20…   0       9 1 Neve… 5 Agr…
##  7 144712000    16    0.981 1 1991-1993 2 Fema… 1 20…   0       8 1 Neve… 4
##  8 659357000    17    0.775 1 1991-1993 2 Fema… 1 20…   0       7 2 Neve… 5 Agr…
##  9 506367000    37    1.24  3 1971-1973 1 Male  1 20…  10.5     9 4 Marr… 1 Dis…
## 10  64044000    15    1.37  1 1991-1993 2 Fema… 1 20…   0       7 1 Neve… 1 Dis…
## # … with 6,191 more rows
```

---

#Group operations and summarize

`summarize()` allows you to calculate all kinds of statistics on the level of the groups you have specified. It is a function from the package "dplyr".
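
As a minimal sketch of what `summarise()` does on a grouped tibble, the toy ballot from the weighting slide can be summarised by group. The column `sex` and the object names `ballot` and `ballot_by_sex` are assumptions added only for illustration; they are not part of Pairfam.

```r
library(dplyr)

# Toy ballot from the weighting slide. The column `sex` is hypothetical:
# it assumes that the two voters with weight 2 are the men.
ballot <- tibble(
  id        = 1:5,
  voted_yes = c(0, 1, 1, 1, 0),
  weight    = c(2, 2, 1, 1, 1),
  sex       = c("male", "male", "female", "female", "female")
)

ballot_by_sex <- ballot %>%
  group_by(sex) %>%               # one group per value of sex
  summarise(
    n         = n(),              # number of voters in each group
    share_yes = mean(voted_yes)   # unweighted share of YES votes in each group
  )

ballot_by_sex # one row per group
```

The result has one row per group and one column per statistic, which is the same shape as the Pairfam summaries by cohort.
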

```r
#when you don't consider weight
wave1c<- wave1b %>%
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE ))
wave1c
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```

---

#Group operations and summarize: when there is weight

If we want to calculate the weighted mean satisfaction for the three cohorts

```r
wave1c<- wave1b %>% #wave1b goes through all the following steps and is assigned to a new dataset called "wave1c"
  group_by(cohort) %>% #wave1b is grouped by cohort
  dplyr::summarise( #use the function "summarise" from the package "dplyr" to calculate summary statistics
    n=n(), # get the unweighted sample size
    wn=sum(cdweight), # get the weighted sample size by taking the sum of the variable "cdweight"
    wsum_sat6=sum(sat6*cdweight,na.rm=TRUE), #get the weighted sum of life satisfaction, dropping observations with missing values from the calculation.
    m_wsat6=wsum_sat6/sum(cdweight[!is.na(sat6)]) #get the weighted mean by dividing the weighted sum by the total weight of the observations with non-missing sat6, so that missing values do not enter the denominator.
  )
wave1c # what wave1c looks like
```

or you can use `wtd.mean()` from the package "Hmisc"

```r
wave1c<- wave1b %>%
  group_by(cohort) %>%
  dplyr::summarise(wtd.mean(x = sat6, weights = cdweight)) #wtd.mean is a function from the Hmisc package
wave1c
```

```
## # A tibble: 3 × 2
##   cohort      `wtd.mean(x = sat6, weights = cdweight)`
##   <fct>                                          <dbl>
## 1 1 1991-1993                                     7.94
## 2 2 1981-1983                                     7.35
## 3 3 1971-1973                                     7.39
```

---

#Calculate the correlation coefficient by cohort

`cor()` allows you to calculate correlations.
It is a function from the package "stats". You can check whether it is installed with

```r
find.package("stats")
```

```r
#calculate the correlation coefficient by cohort
wave1b %>%
  group_by(cohort) %>%
  dplyr::summarise(cor(age, sat6,use = "complete.obs"))
```

```
## # A tibble: 3 × 2
##   cohort      `cor(age, sat6, use = "complete.obs")`
##   <fct>                                        <dbl>
## 1 1 1991-1993                               -0.0613
## 2 2 1981-1983                               -0.0249
## 3 3 1971-1973                                0.00467
```

```r
#If you want to drop observations with missing values, na.rm does not work in cor().
#You should use the argument "use" instead.
#To learn more about cor, run ?cor
```

---

#Take home

1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input.
2. `group_by()`: Splits a tibble into groups. Certain functions will afterwards operate on each of the specified groups.
3. `summarize()`: Allows you to calculate any kind of aggregate statistic. Combined with `group_by()`, it calculates those statistics for each specified group.
  - mean
  - sd
  - correlation coefficient

---

class: center, middle

#[Exercise](https://rpubs.com/fancycmn/948203)
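
---

#Take home: a toy example

A minimal sketch that puts the three take-home points together. It uses R's built-in `mtcars` data rather than Pairfam, so the variables (`cyl`, `mpg`, `wt`) and the output column names are assumptions chosen only for illustration.

```r
library(dplyr)

mtcars_summary <- mtcars %>%   # start from the built-in data, then ...
  group_by(cyl) %>%            # ... group the cars by number of cylinders, then ...
  summarise(
    mean_mpg   = mean(mpg, na.rm = TRUE),            # mean by group
    sd_mpg     = sd(mpg, na.rm = TRUE),              # standard deviation by group
    cor_mpg_wt = cor(mpg, wt, use = "complete.obs")  # correlation coefficient by group
  )

mtcars_summary # one row per value of cyl
```

Read the pipeline as "take the data, then group it, then summarise it"; replacing `mtcars`, `cyl`, `mpg` and `wt` with your own dataset and variables gives the same kind of table.
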