class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Piping & Grouping
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 24px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 12px;
}
</style>

#Packages for today

```r
#install two new packages
install.packages("Hmisc") #for weighted estimation of mean, sd, variance
install.packages("wCorr") #for weighted correlation coefficient
```

```r
library(tidyverse)
library(haven) #introduced in session 3 "dataframe & tibble"
library(Hmisc)
library(wCorr)
```

---

#Nested code

What does the following code mean?

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--

We have learned that R code may be nested. But too much nesting becomes unintelligible: because R evaluates code from the inside out, we need to read nested code from the inside out!

--

Alternatively, we could write several lines of code successively and read from top to bottom. But this creates many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13) # Intermediate object.
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---

#The (forward) pipe ` %>% `

The `%>%` operator pipes the output of one function into the next function as input. You can basically say: `function(argument1 = value)` can be written as `value %>% function()`. Or even easier, think of it as "then".

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>%
  sqrt() %>%
  mean()
```

```
## [1] 2.527274
```

---

# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="190%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="690%" style="display: block; margin: auto;">
]

---

#The (forward) pipe ` %>% `

- Keyboard shortcuts for typing `%>%`
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
# Example 1
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1)
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.

---

#Advantages of piping & when not to pipe

- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from the inside out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.

- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, it is better to create intermediate objects. They help you debug (i.e., find mistakes) and make the code easier to read.
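
The trade-off listed above can be sketched with the square-root example from the earlier slides. This is only an illustration: the extra `round()` step and the object names `piped`, `x_rounded` and `stepwise` are made up for this sketch, and only the magrittr package (which provides `%>%`) is assumed to be installed.

```r
library(magrittr) # provides %>%

# Piped version: read from top to bottom, no intermediate objects.
piped <- seq(from = 1, to = 13) %>%
  sqrt() %>%
  round(digits = 2) %>% # an extra step is easy to insert anywhere
  mean()

# Step-by-step version: every intermediate object can be inspected
# while debugging, at the cost of a more crowded environment.
x <- seq(from = 1, to = 13)
x_sqrt <- sqrt(x)
x_rounded <- round(x_sqrt, digits = 2)
stepwise <- mean(x_rounded)

# Both versions perform the same operations in the same order.
identical(piped, stepwise)
```

Both versions give the same result. The piped one is shorter, while the step-by-step one lets you look at `x_sqrt` or `x_rounded` if something goes wrong.
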
[Example video](https://www.youtube.com/watch?v=sohARFx6aTo)

---

#Manipulating Pairfam using `%>%`

Task: 1) Keep the variables id, cdweight, age, sex_gen, cohort, yeduc, relstat, as well as one variable that reflects subjective wellbeing (sat6); 2) replace yeduc and sat6 with NA when they are negative.

What we will do: way 1

```r
#library(haven) was already loaded at the beginning
wave1 <- read_dta("anchor1_50percent_Eng.dta")

wave1a <- select(wave1, id, age, sex_gen, cohort, yeduc, relstat, cdweight, sat6) #select chosen variables

wave1b <- mutate(wave1a,
                 id=zap_labels(id), #remove label of id
                 cdweight=zap_label(cdweight), #remove label of weight
                 age=zap_labels(age), #remove label of age
                 yeduc=zap_labels(yeduc), #remove label of education
                 sat6=zap_labels(sat6), #remove label of sat6
                 sex_gen=as_factor(sex_gen), #make sex_gen a factor
                 cohort=as_factor(cohort), #make cohort a factor
                 relstat=as_factor(relstat) #make relstat a factor
                 )

wave1c<- mutate(wave1b,
                yeduc=case_when(yeduc<0 ~ as.numeric(NA),
                                TRUE ~ as.numeric(yeduc)), #replace yeduc with NA when yeduc<0
                sat6=case_when(sat6<0 ~ as.numeric(NA),
                               TRUE ~ as.numeric(sat6)) #replace sat6 with NA when sat6<0
                )
```

---

#Manipulating Pairfam using `%>%`

But now: way 2 using ` %>%`, with even shorter code when using `transmute()`

--

Oooh, what is `transmute()`?
`transmute()` keeps only the variables you specify inside it.

[Difference between `mutate()` and `transmute()`](https://www.youtube.com/watch?v=vvIhRginelA)

```r
#compare data1 and data2
data1 <- mutate(wave1,
                sex_gen=as_factor(sex_gen), #make sex_gen a factor
                relstat=as_factor(relstat) #make relstat a factor
                )

data2 <- transmute(wave1,
                   sex_gen=as_factor(sex_gen), #make sex_gen a factor
                   relstat=as_factor(relstat) #make relstat a factor
                   )
```

---

#Manipulating Pairfam using `%>%`

But now: way 2 using ` %>%`, with even shorter code when using `transmute()`

```r
wave1_pipe <- wave1 %>%
  transmute( # Create new variables and keep only those
    id=zap_labels(id), #take off the label of id
    cdweight=zap_label(cdweight), #take off the label of the variable "cdweight"
    age=zap_labels(age), #take off the label of age, as it is treated as a continuous variable
    yeduc=case_when(yeduc<0 ~ as.numeric(NA),
                    TRUE ~ as.numeric(yeduc)) %>% zap_label(), #when yeduc<0, make it NA; the rest keep their original value; then pipe yeduc into zap_label()
    sat6=case_when(sat6<0 ~ as.numeric(NA),
                   TRUE ~ as.numeric(sat6)) %>% zap_label(), #when sat6<0, make it NA; the rest keep their original value; then pipe sat6 into zap_label()
    sex_gen=as_factor(sex_gen), #treat sex as a categorical variable
    cohort=as_factor(cohort), #treat cohort as a categorical variable
    relstat=as_factor(relstat) #treat relstat as a categorical variable
  )
```

---

#What is weight?

Please watch the video here:

[Click to see: Data weighting and representative samples](https://www.youtube.com/watch?v=KkqXbw43yxc)

---

#When there is weight

Statistically, we apply weights by multiplication. Say that in a small imagined patriarchal society, men's votes count twice as much.
Here we have a ballot on whether women should be allowed to drive:

.pull-left[

| id|vote | voted_yes| weight|
|--:|:----|---------:|------:|
|  1|NO   |         0|      2|
|  2|YES  |         1|      2|
|  3|YES  |         1|      1|
|  4|YES  |         1|      1|
|  5|NO   |         0|      1|

```r
#without weight, the percentage of votes for yes is 60%
(3/5)*100
```

```
## [1] 60
```
]

.pull-right[

```r
#with weight, the percentage of votes for yes is about 57%
((0*2 + 1*2 + 1*1 + 1*1 + 0*1) / 7) * 100 # Way 1
```

```
## [1] 57.14286
```

```r
voted_yes <- c(0, 1, 1, 1, 0); weight <- c(2, 2, 1, 1, 1) #the two columns of the table as vectors
(sum(voted_yes * weight) / sum(weight)) * 100 #Way 2
```

```
## [1] 57.14286
```

```r
Hmisc::wtd.mean(x = voted_yes, weights = weight) * 100 #Way 3: use wtd.mean() from the "Hmisc" package to get the mean directly.
```

```
## [1] 57.14286
```
]

---

#Grouped operations

`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1_cohort<- wave1_pipe %>% group_by(cohort) #pipe the dataset "wave1_pipe" into group_by(cohort), so that later code is executed by cohort group.
wave1_cohort #what wave1_cohort looks like
```

```
## # A tibble: 6,201 × 8
## # Groups:   cohort [3]
##           id cdweight   age yeduc  sat6 sex_gen  cohort      relstat
##        <dbl>    <dbl> <dbl> <dbl> <dbl> <fct>    <fct>       <fct>
##  1 267206000    1.10     16   0       7 2 Female 1 1991-1993 1 Never married si…
##  2 112963000    1.73     35  10.5     6 1 Male   3 1971-1973 1 Never married si…
##  3 327937000    0.774    16   0       8 2 Female 1 1991-1993 -7 Incomplete data
##  4 318656000    0.719    27  11.5     9 2 Female 2 1981-1983 4 Married COHAB
##  5 717889000    1.15     37  11.5     7 1 Male   3 1971-1973 4 Married COHAB
##  6 222517000    0.900    15   0       9 1 Male   1 1991-1993 1 Never married si…
##  7 144712000    0.981    16   0       8 2 Female 1 1991-1993 1 Never married si…
##  8 659357000    0.775    17   0       7 2 Female 1 1991-1993 2 Never married LAT
##  9 506367000    1.24     37  10.5     9 1 Male   3 1971-1973 4 Married COHAB
## 10  64044000    1.37     15   0       7 2 Female 1 1991-1993 1 Never married si…
## # … with 6,191 more rows
```

---

#Group operations and summarize

`summarize()` allows you to calculate all kinds of statistics on the level of the groups you have specified. It is a function from the "dplyr" package.
"dplyr" is one of the packages in the "tidyverse".

.pull-left[

```r
#when you don't consider weight
wave1_cohort<- wave1_pipe %>%
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6))
wave1_cohort
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6)`
##   <fct>              <dbl>
## 1 1 1991-1993           NA
## 2 2 1981-1983           NA
## 3 3 1971-1973           NA
```
]

.pull-right[

```r
#still without weight, but now ignoring missing values with na.rm=TRUE
wave1_cohort<- wave1_pipe %>%
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE ))
wave1_cohort
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```
]

---

#Group operations and summarize: when there is weight

If we want to calculate the weighted mean satisfaction for the three cohorts:

```r
wave1_cohort<- wave1_pipe %>% #wave1_pipe will go through all the following steps and be assigned to a new dataset called "wave1_cohort"
  group_by(cohort) %>% #wave1_pipe is grouped by cohort
  filter(!is.na(sat6)) %>% #filter out those whose sat6 is missing
  dplyr::summarise( #provide some summary calculations
    wn=sum(cdweight), #get the weighted sample size by taking the sum of variable "cdweight"
    wsum_sat6=sum(sat6*cdweight), #get the weighted sum of life satisfaction
    m_wsat6=wsum_sat6/wn #get the weighted mean, by dividing the weighted sum of life satisfaction by the weighted sample size
  )
wave1_cohort # what wave1_cohort looks like
```

```
## # A tibble: 3 × 4
##   cohort         wn wsum_sat6 m_wsat6
##   <fct>       <dbl>     <dbl>   <dbl>
## 1 1 1991-1993 1860.    14762.    7.94
## 2 2 1981-1983 2098.    15428.    7.35
## 3 3 1971-1973 2231.    16488.    7.39
```

---

#Group operations and summarize: when there is weight

Or you can use `wtd.mean()` from the "Hmisc" package:

```r
wave1_cohort<- wave1_pipe %>%
  group_by(cohort) %>%
  filter(!is.na(sat6)) %>% #filter out those whose sat6 is missing
  dplyr::summarise(wtd.mean(x = sat6, weights = cdweight)) #wtd.mean is a function of the Hmisc package
wave1_cohort
```

```
## # A tibble: 3 × 2
##   cohort      `wtd.mean(x = sat6, weights = cdweight)`
##   <fct>                                          <dbl>
## 1 1 1991-1993                                     7.94
## 2 2 1981-1983                                     7.35
## 3 3 1971-1973                                     7.39
```

---

#Calculate the correlation coefficient by cohort

`cor()` allows you to calculate a correlation.

```r
#calculate the correlation coefficient by cohort
wave1_pipe %>%
  group_by(cohort) %>%
  drop_na(sat6,yeduc) %>% #remove cases with missing sat6 or yeduc, using drop_na()
  dplyr::summarise(cor(x=yeduc, y=sat6)) #estimate the correlation coefficient between yeduc and sat6
```

```
## # A tibble: 3 × 2
##   cohort      `cor(x = yeduc, y = sat6)`
##   <fct>                            <dbl>
## 1 1 1991-1993                   0.000468
## 2 2 1981-1983                   0.157
## 3 3 1971-1973                   0.111
```

---

#Calculate the weighted correlation coefficient by cohort

`weightedCorr()` from the "wCorr" package will help you achieve this.

```r
#calculate the weighted correlation coefficient by cohort
wave1_pipe %>%
  group_by(cohort) %>%
  drop_na(sat6,yeduc) %>% #drop cases where information on sat6 or yeduc is missing
  summarise(
    weightedCorr(x=yeduc, y=sat6, method="Pearson", weights=cdweight)
  )
```

```
## # A tibble: 3 × 2
##   cohort      weightedCorr(x = yeduc, y = sat6, method = "Pearson", weights = …¹
##   <fct>                                                                    <dbl>
## 1 1 1991-1993                                                           -0.00734
## 2 2 1981-1983                                                            0.173
## 3 3 1971-1973                                                            0.0960
## # … with abbreviated variable name
## #   ¹`weightedCorr(x = yeduc, y = sat6, method = "Pearson", weights = cdweight)`
```

---

#Take home

1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input.

2. `group_by()`: Splits a tibble into groups. Certain functions will afterwards operate on each of the specified groups.

3. `summarize()`: Allows you to estimate any kind of aggregate statistic. Combined with `group_by()`, it estimates those statistics for each specified group.
  - mean
  - sd
  - correlation coefficient

4. Calculating weighted statistics, e.g. the weighted mean and the weighted correlation coefficient

---

class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1086419)
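
---

#Take home: a toy example

The take-home points can be tried out on the small ballot table from the weighting slide, without needing the Pairfam data. This is a minimal sketch and not part of the original analysis: the `sex` column is an assumption inferred from the weights (the two voters with weight 2 are taken to be men), and the object names `ballot`, `overall` and `by_sex` are made up for illustration. Only the dplyr package is assumed.

```r
library(dplyr)

# The ballot from the weighting slide as a tibble.
ballot <- tibble(
  id        = 1:5,
  sex       = c("male", "male", "female", "female", "female"), # assumed from the weights
  voted_yes = c(0, 1, 1, 1, 0),
  weight    = c(2, 2, 1, 1, 1)
)

# Unweighted and weighted percentage of yes votes for everyone.
overall <- ballot %>%
  summarise(
    pct_yes   = mean(voted_yes) * 100,                       # 3 of 5 votes
    w_pct_yes = sum(voted_yes * weight) / sum(weight) * 100  # weighted sum / sum of weights
  )

# The same weighted calculation, but separately for each group.
by_sex <- ballot %>%
  group_by(sex) %>%
  summarise(
    wn        = sum(weight),                     # weighted group size
    w_pct_yes = sum(voted_yes * weight) / wn * 100
  )

overall
by_sex
```

The pattern is the same one used for the cohorts: `group_by()` defines the groups, and `summarise()` then computes the weighted sum divided by the sum of the weights within each group.
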