Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Piping & Grouping
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

```r
#install two new packages
install.packages("Hmisc") #for weighted estimation of mean, sd, variance
install.packages("wCorr") #for weighted correlation coefficient
```

```r
library(tidyverse) 
library(haven) #introduced in session 3 "dataframe & tibble"
library(Hmisc)
library(wCorr)
```

---
#Nested code
What does the following code mean?

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--
We have learned that R code may be nested. But too much nesting becomes unintelligible: Because R evaluates code from the inside-out, we need to read nested code from the inside-out!

Alternatively, we could write several lines of code successively and read from top to bottom. But this leads to many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13) # Intermediate object.
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---
#The (forward) pipe ` %>% `
The `%>%` operator pipes output of one function as input to the next function. You can basically say: `function(argument1 = value)` can be written as `value %>% function()`.

Or even easier, think of it as: "then"

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>% sqrt() %>% mean()
```

```
## [1] 2.527274
```

---
# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="190%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="690%" style="display: block; margin: auto;">
]

---
#The (forward) pipe ` %>% `
- Keyboard shortcuts for typing `%>%` (in RStudio)
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
# Example 1
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1) 
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.
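
The placeholder rule can be tried out on a few tiny calls. This is only a minimal sketch: the object names (`sqrt_16`, `one_to_four`, `sentence`) are made up for illustration, and `library(magrittr)` is used here just to get `%>%` without loading the whole tidyverse.

```r
library(magrittr) # provides %>%; library(tidyverse) makes it available as well

# Default: the left-hand side becomes the first argument (here, x).
sqrt_16 <- 16 %>% sqrt()                    # same as sqrt(16)

# Placeholder: the left-hand side goes wherever the dot is (here, `to`).
one_to_four <- 4 %>% seq(from = 1, to = .)  # same as seq(from = 1, to = 4)

# The dot also works for an unnamed argument in the middle of a call.
sentence <- "b" %>% paste("a", ., "c")      # same as paste("a", "b", "c")
```

When the dot appears as an argument of the call, the piped value is placed there and is not additionally inserted as the first argument.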
---
#Advantages of piping & when not to pipe
- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from inside and out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.
  
- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, create intermediate objects instead.
  This helps you debug (i.e., find mistakes) and is simply easier to read.
  
  
[Example video](https://www.youtube.com/watch?v=sohARFx6aTo)

---
#Manipulating Pairfam using `%>%`
Task: 1) Keep the variables id, cdweight, age, sex_gen, cohort, yeduc, relstat, as well as one variable that reflects subjective wellbeing (sat6); 2) replace yeduc and sat6 with NA when they are negative

What we will do: way 1

```r
#we have library(haven) in the beginning
wave1 <- read_dta("anchor1_50percent_Eng.dta")

wave1a <- select(wave1, id, age, sex_gen, cohort, yeduc, 
               relstat, cdweight, sat6) #select chosen variables

wave1b <- mutate(wave1a,
                 id=zap_labels(id),     #remove value labels of id
                 cdweight=zap_label(cdweight), #remove variable label of cdweight
                 age=zap_labels(age),   #remove value labels of age
                 yeduc=zap_labels(yeduc), #remove value labels of education
                 sat6=zap_labels(sat6), #remove value labels of life satisfaction
                 sex_gen=as_factor(sex_gen), #make sex_gen as a factor
                 cohort=as_factor(cohort), #make cohort as a factor
                 relstat=as_factor(relstat) #make relstat as a factor
)

wave1c<- mutate(wave1b,
                yeduc=case_when(yeduc<0 ~ as.numeric(NA), TRUE ~ as.numeric(yeduc)),
                #replace yeduc with NA when yeduc<0
                sat6=case_when(sat6<0 ~ as.numeric(NA), TRUE ~ as.numeric(sat6))
                #replace sat6 with NA when sat6<0
                )
```

---
#Manipulating Pairfam using `%>%`
But now: way 2 using `%>%`, with even shorter code thanks to `transmute()`

What is `transmute()`? It works like `mutate()`, but keeps only the variables you specify inside `transmute()`.
[Difference between `mutate()` and `transmute()`](https://www.youtube.com/watch?v=vvIhRginelA)

```r
#compare data1 and data2
data1 <- mutate(wave1,
                sex_gen=as_factor(sex_gen), #make sex_gen as a factor
                relstat=as_factor(relstat) #make relstat as a factor
                )
data2 <- transmute(wave1,
                sex_gen=as_factor(sex_gen), #make sex_gen as a factor
                relstat=as_factor(relstat) #make relstat as a factor
                )
```
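
To see the difference without the pairfam file, here is a minimal sketch on a made-up tibble. The data and the names `toy`, `toy_mutate` and `toy_transmute` are hypothetical and only serve as illustration.

```r
library(dplyr)

# Hypothetical toy data, not the pairfam file.
toy <- tibble(id = 1:3, x = c(10, 20, 30), y = c("a", "b", "c"))

toy_mutate    <- mutate(toy, x2 = x * 2)     # keeps id, x, y and adds x2
toy_transmute <- transmute(toy, x2 = x * 2)  # keeps only x2

names(toy_mutate)     # all original variables plus the new one
names(toy_transmute)  # only the variable created inside transmute()
```

Both results have the same number of rows; only the set of columns differs.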

---
#Manipulating Pairfam using `%>%`
But now: way 2 using `%>%`, with even shorter code thanks to `transmute()`

```r
wave1_pipe <- wave1 %>% 
  transmute(              # Create new variables and keep only those
    id=zap_labels(id),    #take off the label of id
    
    cdweight=zap_label(cdweight), #take off the label of the variable "cdweight" 
    
    age=zap_labels(age),  #take off the label of age as it is treated as a continuous variable
    
    yeduc=case_when(yeduc<0 ~ as.numeric(NA), TRUE ~ as.numeric(yeduc)) %>% zap_label(), 
    #when yeduc<0, make it NA; otherwise keep the original value; then pipe the result into zap_label()
    
    sat6=case_when(sat6<0 ~ as.numeric(NA), TRUE ~ as.numeric(sat6))%>% zap_label(),
    #when sat6<0, make it NA; otherwise keep the original value; then pipe the result into zap_label()
    
    sex_gen=as_factor(sex_gen),   #treat sex as a categorical variable
    
    cohort=as_factor(cohort),     #treat cohort as a categorical variable
    
    relstat=as_factor(relstat) #treat relstat as a categorical variable
  )
```

---
#What is weight?
Please watch the video here:
[Click to see: Data weighting and representative samples](https://www.youtube.com/watch?v=KkqXbw43yxc)

---
#When there is weight
Statistically, we use weights by multiplication. Say in a small imagined patriarchal society, men's votes count twice as much. Here we have a ballot on whether women should be allowed to drive:

.pull-left[

| id|vote | voted_yes| weight|
|--:|:----|---------:|------:|
|  1|NO   |         0|      2|
|  2|YES  |         1|      2|
|  3|YES  |         1|      1|
|  4|YES  |         1|      1|
|  5|NO   |         0|      1|

```r
#without weight, the percentage of vote for yes is 60%
(3/5)*100
```

```
## [1] 60
```

]

.pull-right[

```r
#with weight, the percentage of vote for yes is about 57%
((0*2 + 1*2 + 1*1 + 1*1 + 0*1) / 7) * 100 # Way 1
```

```
## [1] 57.14286
```

```r
(sum(voted_yes * weight) / sum(weight)) * 100 #Way 2 
```

```
## [1] 57.14286
```

```r
Hmisc::wtd.mean(x = voted_yes, weights = weight) * 100 #Way 3, use the wtd.mean() function from the Hmisc package to get the mean directly.
```

```
## [1] 57.14286
```
]
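
Ways 2 and 3 above need the two table columns as R objects. A minimal sketch, assuming the ballot is entered as two plain vectors; the base R function `weighted.mean()` is shown here as an alternative to `Hmisc::wtd.mean()`, so the sketch runs without extra packages.

```r
# The ballot from the table, entered as two plain vectors.
voted_yes <- c(0, 1, 1, 1, 0)
weight    <- c(2, 2, 1, 1, 1)

# Unweighted share of yes votes, in percent.
unweighted_pct <- mean(voted_yes) * 100

# Weighted share: multiply each vote by its weight, divide by the sum of weights.
weighted_pct <- sum(voted_yes * weight) / sum(weight) * 100

# Base R alternative (stats package) that does the same calculation.
weighted_pct_base <- weighted.mean(voted_yes, w = weight) * 100
```

The two men (ids 1 and 2) count double, so the single male NO vote pulls the weighted share below the unweighted 60%.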

---
#Grouped operations
`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1_cohort<- wave1_pipe %>% group_by(cohort) #pipe dataset "wave1_pipe" into group_by(cohort), so that later code is executed separately for each cohort.
wave1_cohort #what wave1_cohort looks like
```

```
## # A tibble: 6,201 × 8
## # Groups:   cohort [3]
##           id cdweight   age yeduc  sat6 sex_gen  cohort      relstat            
##        <dbl>    <dbl> <dbl> <dbl> <dbl> <fct>    <fct>       <fct>              
##  1 267206000    1.10     16   0       7 2 Female 1 1991-1993 1 Never married si…
##  2 112963000    1.73     35  10.5     6 1 Male   3 1971-1973 1 Never married si…
##  3 327937000    0.774    16   0       8 2 Female 1 1991-1993 -7 Incomplete data 
##  4 318656000    0.719    27  11.5     9 2 Female 2 1981-1983 4 Married COHAB    
##  5 717889000    1.15     37  11.5     7 1 Male   3 1971-1973 4 Married COHAB    
##  6 222517000    0.900    15   0       9 1 Male   1 1991-1993 1 Never married si…
##  7 144712000    0.981    16   0       8 2 Female 1 1991-1993 1 Never married si…
##  8 659357000    0.775    17   0       7 2 Female 1 1991-1993 2 Never married LAT
##  9 506367000    1.24     37  10.5     9 1 Male   3 1971-1973 4 Married COHAB    
## 10  64044000    1.37     15   0       7 2 Female 1 1991-1993 1 Never married si…
## # … with 6,191 more rows
```
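
Grouping does not change the data themselves, only how later functions treat them. A minimal sketch on a made-up tibble (the data and object names are hypothetical) that inspects the grouping with dplyr's helper functions:

```r
library(dplyr)

# Hypothetical toy data with a cohort variable.
toy <- tibble(
  cohort = c("1991-1993", "1991-1993", "1981-1983", "1971-1973"),
  sat6   = c(8, 7, 9, 6)
)

toy_grouped <- toy %>% group_by(cohort)

group_vars(toy_grouped)   # which variable(s) define the groups
n_groups(toy_grouped)     # how many groups there are

# ungroup() removes the grouping again; rows and columns stay the same.
toy_ungrouped <- toy_grouped %>% ungroup()
```

Calling `ungroup()` after a grouped calculation is a common way to make sure later steps run on the whole dataset again.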

---
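#Group operations and summarize
`summarize()` allows you to calculate all kinds of statistics on the level of the groups you have specified. It is a function from the dplyr package, which is part of the tidyverse. Note that Hmisc also has a `summarize()` function, which masks dplyr's version when Hmisc is loaded later, so the code here calls `dplyr::summarise()` explicitly.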

.pull-left[

```r
#when you don't consider weight
wave1_cohort<- wave1_pipe %>% 
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6))
wave1_cohort  
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6)`
##   <fct>              <dbl>
## 1 1 1991-1993           NA
## 2 2 1981-1983           NA
## 3 3 1971-1973           NA
```
]

.pull-right[

```r
#when you don't consider weight, but remove missing values with na.rm=TRUE
wave1_cohort<- wave1_pipe %>% 
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE ))
wave1_cohort  
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```
]
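
The effect of `na.rm = TRUE` can be checked on a made-up tibble. This is a minimal sketch with hypothetical data. It also names the summary columns (`mean_with_na`, `mean_sat6`, `n_valid`), which is optional but gives shorter column names than the automatically generated ones.

```r
library(dplyr)

# Hypothetical toy data: cohort "a" has one missing answer.
toy <- tibble(
  cohort = c("a", "a", "b", "b"),
  sat6   = c(8, NA, 6, 4)
)

toy_means <- toy %>%
  group_by(cohort) %>%
  dplyr::summarise(
    mean_with_na = mean(sat6),               # NA as soon as one value is missing
    mean_sat6    = mean(sat6, na.rm = TRUE), # missing values are dropped first
    n_valid      = sum(!is.na(sat6))         # number of non-missing answers
  )
toy_means
```

Reporting the number of valid answers next to the mean shows how many observations each group mean is based on.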

---
#Group operations and summarize: when there is weight
If we want to calculate the weighted mean satisfaction for the three cohorts:

```r
wave1_cohort<- wave1_pipe %>% #wave1_pipe goes through all the following steps and the result is assigned to a new dataset called "wave1_cohort"
  group_by(cohort) %>% #wave1_pipe is grouped by cohort
  
  filter(!is.na(sat6)) %>% #drop respondents whose sat6 is missing
  
  dplyr::summarise( #calculate summary statistics
    wn=sum(cdweight), # get the weighted sample size by taking the sum of variable "cdweight"
    
    wsum_sat6=sum(sat6*cdweight), #get the weighted sum of life satisfaction 
    
    m_wsat6=wsum_sat6/wn, #get the weighted mean, by dividing the weighted sum of life satisfaction by the weighted sample size.
  )
wave1_cohort # what wave1_cohort looks like 
```

```
## # A tibble: 3 × 4
##   cohort         wn wsum_sat6 m_wsat6
##   <fct>       <dbl>     <dbl>   <dbl>
## 1 1 1991-1993 1860.    14762.    7.94
## 2 2 1981-1983 2098.    15428.    7.35
## 3 3 1971-1973 2231.    16488.    7.39
```
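
The three steps (weighted sample size, weighted sum, weighted mean) can be verified by hand on a made-up tibble. A minimal sketch with hypothetical data; the extra column `m_check` uses base R's `weighted.mean()` only as a cross-check.

```r
library(dplyr)

# Hypothetical toy data with small numbers that are easy to check by hand.
toy <- tibble(
  cohort   = c("a", "a", "b", "b"),
  sat6     = c(1, 3, 2, 4),
  cdweight = c(1, 3, 1, 1)
)

toy_wmeans <- toy %>%
  group_by(cohort) %>%
  dplyr::summarise(
    wn        = sum(cdweight),                    # a: 1 + 3 = 4,     b: 1 + 1 = 2
    wsum_sat6 = sum(sat6 * cdweight),             # a: 1*1 + 3*3 = 10, b: 2 + 4 = 6
    m_wsat6   = wsum_sat6 / wn,                   # a: 10 / 4 = 2.5,  b: 6 / 2 = 3
    m_check   = weighted.mean(sat6, w = cdweight) # same result in one call
  )
toy_wmeans
```

Inside `summarise()`, a column created on an earlier line (such as `wn`) can be used on a later line, which is what makes the `wsum_sat6 / wn` step possible.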

---
#Group operations and summarize: when there is weight

Or you can use `wtd.mean()` from the Hmisc package:

```r
wave1_cohort<- wave1_pipe %>% 
  group_by(cohort) %>%
  
  filter(!is.na(sat6)) %>% #drop respondents whose sat6 is missing

  dplyr::summarise(wtd.mean(x = sat6, weights = cdweight)) #wtd.mean is a function from the Hmisc package
wave1_cohort  
```

```
## # A tibble: 3 × 2
##   cohort      `wtd.mean(x = sat6, weights = cdweight)`
##   <fct>                                          <dbl>
## 1 1 1991-1993                                     7.94
## 2 2 1981-1983                                     7.35
## 3 3 1971-1973                                     7.39
```

---
#Calculate the correlation coefficient by cohort
`cor()` allows you to calculate the correlation coefficient.

```r
#calculate the correlation coefficient by cohort
wave1_pipe %>% 
  group_by(cohort) %>%
  drop_na(sat6,yeduc) %>% #remove cases with missing sat6 or yeduc, using drop_na()
  dplyr::summarise(cor(x=yeduc, y=sat6)) #estimate correlation coefficient between yeduc and sat6
```

```
## # A tibble: 3 × 2
##   cohort      `cor(x = yeduc, y = sat6)`
##   <fct>                            <dbl>
## 1 1 1991-1993                   0.000468
## 2 2 1981-1983                   0.157   
## 3 3 1971-1973                   0.111
```
---
#Calculate the weighted correlation coefficient by cohort
`weightedCorr()` from the wCorr package calculates the weighted correlation coefficient.

```r
#calculate the weighted correlation coefficient by cohort
wave1_pipe %>% 
  group_by(cohort) %>%
  drop_na(sat6,yeduc) %>% #drop cases where sat6 or yeduc is missing
  summarise(
    weightedCorr(x=yeduc, y=sat6, method="Pearson", weights=cdweight)
  )
```

```
## # A tibble: 3 × 2
##   cohort      weightedCorr(x = yeduc, y = sat6, method = "Pearson", weights = …¹
##   <fct>                                                                    <dbl>
## 1 1 1991-1993                                                           -0.00734
## 2 2 1981-1983                                                            0.173  
## 3 3 1971-1973                                                            0.0960 
## # … with abbreviated variable name
## #   ¹`weightedCorr(x = yeduc, y = sat6, method = "Pearson", weights = cdweight)`
```
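
To see what a weighted Pearson correlation does, the textbook formula can be written out in base R. This is only a sketch: `weighted_pearson()` is a hypothetical helper written for this slide, the data are made up, and it is an assumption (not checked against the wCorr source) that `weightedCorr(method = "Pearson")` follows the same formula.

```r
# Hypothetical helper: weighted Pearson correlation from its textbook definition.
weighted_pearson <- function(x, y, w) {
  mx <- weighted.mean(x, w)               # weighted mean of x
  my <- weighted.mean(y, w)               # weighted mean of y
  cov_xy <- sum(w * (x - mx) * (y - my))  # weighted co-variation
  var_x  <- sum(w * (x - mx)^2)           # weighted variation of x
  var_y  <- sum(w * (y - my)^2)           # weighted variation of y
  cov_xy / sqrt(var_x * var_y)
}

# Made-up values standing in for yeduc, sat6 and cdweight.
x <- c(9, 10.5, 11.5, 13, 18)
y <- c(6, 7, 7, 9, 8)
w <- c(1.1, 0.8, 1.3, 0.7, 1.0)

r_equal    <- weighted_pearson(x, y, w = rep(1, length(x))) # equal weights
r_weighted <- weighted_pearson(x, y, w = w)                 # unequal weights
```

With equal weights the helper gives the same value as `cor(x, y)`; with unequal weights, observations with larger weights contribute more to the result.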
---
#Take home
1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input. 
2. `group_by()`: Subsets a tibble into groups. Certain functions will operate afterwards by each of the specified groups.
3. `summarize()`: Allows you to estimate any kind of aggregate statistic. Combined with `group_by()`, it estimates those statistics by specified group. 
  - mean
  - sd
  - correlation coefficient
4. Weighted statistics: e.g. the weighted mean and the weighted correlation coefficient
  
---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1086419)