Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Piping & Grouping
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

```r
#install two new packages
install.packages("Hmisc")
install.packages("ggplot2")
```

```r
library(tidyverse) 
library(haven)
library(Hmisc)
library(ggplot2)
```

---
#Nested code
We have learned that R code may be nested. But too much nesting becomes unintelligible: Because R evaluates code from the inside-out, we need to read nested code from the inside-out!

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--
Alternatively, we could write several lines of code successively and read from top to bottom. But this leads to many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13)
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---
#The (forward) pipe ` %>% `
The `%>%` operator pipes output of one function as input to the next function. You can basically say: `function(argument1 = value)` can be written as `value %>% function()`.

Or even easier, think of it as: "then"

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>% sqrt() %>% mean()
```

```
## [1] 2.527274
```

---
# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="90%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="90%" style="display: block; margin: auto;">
]

---
#The (forward) pipe ` %>% `
- Shortcut for typing `%>%`
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
# Example 1
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1) 
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.
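
A common use of the `.` placeholder is piping a data frame into a function whose first argument is not the data. The sketch below assumes the built-in `mtcars` data and `lm()` purely for illustration; neither is part of the Pairfam example.

```r
library(magrittr)  # provides %>%; it is also attached by library(tidyverse)

# lm() takes a formula first, so the data frame is piped into `data` via the dot.
fit_piped <- mtcars %>% lm(mpg ~ wt, data = .)

# The equivalent call without the pipe.
fit_plain <- lm(mpg ~ wt, data = mtcars)

# Both calls estimate the same coefficients.
all.equal(coef(fit_piped), coef(fit_plain))
```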
---
#Advantages of piping & when not to pipe
- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from the inside out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.
  
- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, create intermediate objects instead.
  This helps you debug (i.e., find mistakes) and is simply easier to read.
  
---
#Manipulating Pairfam using `%>%`
"Keep the variables id, age, sex_gen, cohort, yeduc, relstat, cdweight, as well as one variable that reflects the attitude towards family, and one variable that reflects subjective wellbeing."

Without the pipe, we would write:

```r
wave1 <- read_dta("anchor1_50percent_Eng.dta")
wave1a <- select(wave1, id, age, sex_gen, cohort, yeduc, 
               relstat, cdweight, val1i7,sat6)
wave1a <- mutate(wave1a,
                 id=zap_labels(id),
                 age=zap_labels(age),
                 yeduc=zap_labels(yeduc),
                 sat6=zap_labels(sat6),
                 cdweight=zap_label(cdweight),
                 sex_gen=as_factor(sex_gen),
                 cohort=as_factor(cohort),
                 relstat=as_factor(relstat),
                 val1i7=as_factor(val1i7)
)
```

---
#Manipulating Pairfam using `%>%`
But now

```r
wave1b <- wave1 %>% 
  transmute( # create new variables and keep only those
    id=zap_labels(id), # remove the value labels of id
    age=zap_labels(age), # remove the value labels of age, as it is treated as a continuous variable
    cdweight=zap_label(cdweight), # remove the variable label of "cdweight"
    cohort=as_factor(cohort), # treat cohort as a categorical variable
    sex_gen=as_factor(sex_gen), # treat sex as a categorical variable
    wave=as_factor(wave), # treat wave as a categorical variable
    yeduc=case_when(yeduc<0 ~ as.numeric(NA), TRUE ~ as.numeric(yeduc)) %>% zap_label(), 
    # when yeduc<0, make it NA; otherwise keep the original value; then pipe the result into zap_label() to remove the label
    sat6=case_when(sat6<0 ~ as.numeric(NA), TRUE ~ as.numeric(sat6)),
    # when sat6<0, make it NA; otherwise keep the original value
    relstat=as_factor(relstat), # treat relstat as a categorical variable
    relstat=case_when(relstat== "-7 Incomplete data" ~ as.character(NA), TRUE ~ as.character(relstat)
      )%>% as_factor(), 
    # when relstat is "-7 Incomplete data", make it NA; otherwise keep it as it is;
    # then pipe relstat into as_factor() to make it a categorical variable again
    val1i7=as_factor(val1i7), # treat val1i7 as a categorical variable
    val1i7=case_when(val1i7=="-2 No answer" | # the symbol "|" means "or"
                       val1i7=="-1 Don't know" ~ as.character(NA),  
                     TRUE ~as.character(val1i7)) %>% as_factor() 
    # when val1i7 is "-2 No answer" or "-1 Don't know", make it NA; otherwise keep it as it is;
    # then pipe val1i7 into as_factor() to make it a factor variable
  )

# Note: why do we need to pipe val1i7 into as_factor() after case_when()?
# Because both branches of case_when() return character values (as.character(NA) and as.character(val1i7)),
# so the result is a character vector rather than a factor.
```
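
The recoding pattern above can be tried out on a small made-up dataset before applying it to Pairfam. The sketch below is only an illustration: the tibble `toy` and its values are hypothetical, and base `factor()` stands in for `as_factor()` so that only dplyr is needed.

```r
library(dplyr)

# Hypothetical toy data: negative values and "-7 Incomplete data" are missing-value codes.
toy <- tibble(
  sat6 = c(7, -1, 9, -2),
  relstat = factor(c("1 Never married", "-7 Incomplete data",
                     "4 Married", "1 Never married"))
)

toy_clean <- toy %>%
  mutate(
    # Negative codes become NA; all other values are kept.
    sat6 = case_when(sat6 < 0 ~ NA_real_, TRUE ~ sat6),
    # case_when() returns character here, so convert back to a factor afterwards.
    relstat = case_when(relstat == "-7 Incomplete data" ~ NA_character_,
                        TRUE ~ as.character(relstat)) %>% factor()
  )

toy_clean
```

After the recoding, "-7 Incomplete data" is no longer a level of `relstat`, because the factor is rebuilt from the cleaned character values.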

---
#What is weight?
Please watch the video here:
[Click to see: Data weighting and representative samples](https://www.youtube.com/watch?v=KkqXbw43yxc)

---
#When there is weight
Statistically, we use weights by multiplication. Say in a small imagined patriarchal society, men's votes count twice as much. Here we have a ballot on whether women should be allowed to drive:

.pull-left[

| id|vote | voted_yes| weight|
|--:|:----|---------:|------:|
|  1|NO   |         0|      2|
|  2|YES  |         1|      2|
|  3|YES  |         1|      1|
|  4|YES  |         1|      1|
|  5|NO   |         0|      1|

```r
#without weight, the percentage of vote for yes is 60%
(3/5)*100
```

```
## [1] 60
```

]

.pull-right[

```r
#with weight, the percentage of vote for yes is about 57%
((0*2 + 1*2 + 1 + 1 + 0) / 7) * 100 # Way 1
```

```
## [1] 57.14286
```

```r
(sum(voted_yes * weight) / sum(weight)) * 100 #Way 2 
```

```
## [1] 57.14286
```

```r
Hmisc::wtd.mean(x = voted_yes, weights = weight) * 100 #Way 3, use the function "wtd.mean" from the package "Hmisc" to get the weighted mean directly.
```

```
## [1] 57.14286
```
]
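
The calculations on this slide refer to `voted_yes` and `weight` without showing where they come from. A minimal self-contained sketch, assuming the ballot table is stored in a data frame called `votes` (a hypothetical name) and using base R's `weighted.mean()` as a cross-check instead of Hmisc:

```r
# The ballot table from the slide as a data frame (the name `votes` is an assumption).
votes <- data.frame(
  id = 1:5,
  vote = c("NO", "YES", "YES", "YES", "NO"),
  voted_yes = c(0, 1, 1, 1, 0),
  weight = c(2, 2, 1, 1, 1)
)

# Unweighted share of yes votes, in percent.
unweighted <- mean(votes$voted_yes) * 100

# Weighted share: each vote is multiplied by its weight, then divided by the sum of weights.
weighted_manual <- sum(votes$voted_yes * votes$weight) / sum(votes$weight) * 100

# The same weighted share using base R's weighted.mean().
weighted_base <- weighted.mean(votes$voted_yes, w = votes$weight) * 100

c(unweighted = unweighted, weighted_manual = weighted_manual, weighted_base = weighted_base)
```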

---
#Grouped operations
`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1c<- wave1b %>% group_by(cohort) # pipe the dataset "wave1b" into group_by(cohort), so that later code is executed separately for each cohort
wave1c # what wave1c looks like
```

```
## # A tibble: 6,201 × 10
## # Groups:   cohort [3]
##           id   age cdweight cohort      sex_gen wave  yeduc  sat6 relstat val1i7
##        <dbl> <dbl>    <dbl> <fct>       <fct>   <fct> <dbl> <dbl> <fct>   <fct> 
##  1 267206000    16    1.10  1 1991-1993 2 Fema… 1 20…   0       7 1 Neve… 3     
##  2 112963000    35    1.73  3 1971-1973 1 Male  1 20…  10.5     6 1 Neve… 2     
##  3 327937000    16    0.774 1 1991-1993 2 Fema… 1 20…   0       8 <NA>    4     
##  4 318656000    27    0.719 2 1981-1983 2 Fema… 1 20…  11.5     9 4 Marr… 5 Agr…
##  5 717889000    37    1.15  3 1971-1973 1 Male  1 20…  11.5     7 4 Marr… 4     
##  6 222517000    15    0.900 1 1991-1993 1 Male  1 20…   0       9 1 Neve… 5 Agr…
##  7 144712000    16    0.981 1 1991-1993 2 Fema… 1 20…   0       8 1 Neve… 4     
##  8 659357000    17    0.775 1 1991-1993 2 Fema… 1 20…   0       7 2 Neve… 5 Agr…
##  9 506367000    37    1.24  3 1971-1973 1 Male  1 20…  10.5     9 4 Marr… 1 Dis…
## 10  64044000    15    1.37  1 1991-1993 2 Fema… 1 20…   0       7 1 Neve… 1 Dis…
## # … with 6,191 more rows
```
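
Grouping does not change the rows or columns; it only changes how later functions behave. The sketch below illustrates this with a made-up tibble (`toy`, `g`, and `x` are hypothetical names): the same `mutate()` call centers `x` around the group means when the data are grouped, and around the overall mean after `ungroup()`.

```r
library(dplyr)

# Hypothetical toy data with two groups.
toy <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))

grouped <- toy %>% group_by(g)

# The grouping variable is stored as metadata; the number of rows is unchanged.
group_vars(grouped)
nrow(grouped) == nrow(toy)

# Centered within each group (group means are 2 for "a" and 10 for "b").
by_group <- grouped %>% mutate(x_centered = x - mean(x))

# Centered around the overall mean after removing the grouping.
overall <- grouped %>% ungroup() %>% mutate(x_centered = x - mean(x))

by_group
overall
```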

---
#Group operations and summarize
`summarize()` allows you to calculate all kinds of statistics on the level of the groups you have specified. It is a function from the package "dplyr".

```r
#when you don't consider weight
wave1c<- wave1b %>% 
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE ))
wave1c  
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```

---
#Group operations and summarize: when there is weight
If we want to calculate the weighted mean satisfaction for the three cohorts:

```r
wave1c<- wave1b %>% # wave1b goes through all the following steps and the result is assigned to a new dataset called "wave1c"
  group_by(cohort) %>% # wave1b is grouped by cohort
  dplyr::summarise( # use the function "summarise" from the package "dplyr" to calculate summary statistics
    n=n(), # get the unweighted sample size
    wn=sum(cdweight), # get the weighted sample size by taking the sum of the variable "cdweight"
    wn_sat6=sum(cdweight[!is.na(sat6)]), # get the weighted sample size of observations with non-missing sat6
    wsum_sat6=sum(sat6*cdweight,na.rm=TRUE), # get the weighted sum of life satisfaction, dropping observations with missing values
    m_wsat6=wsum_sat6/wn_sat6 # get the weighted mean by dividing the weighted sum by the weighted sample size of non-missing observations
  )
wave1c # what wave1c looks like 
```

Alternatively, you can use `wtd.mean()` from the package "Hmisc":

```r
wave1c<- wave1b %>% 
  group_by(cohort) %>%
  dplyr::summarise(wtd.mean(x = sat6, weights = cdweight)) #wtd.mean is a function of Hmisc package
wave1c  
```

```
## # A tibble: 3 × 2
##   cohort      `wtd.mean(x = sat6, weights = cdweight)`
##   <fct>                                          <dbl>
## 1 1 1991-1993                                     7.94
## 2 2 1981-1983                                     7.35
## 3 3 1971-1973                                     7.39
```
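
When some values are missing, the manual weighted mean only matches a weighted-mean function if the denominator sums the weights of the non-missing observations. The sketch below checks this on made-up data (`toy` and its values are hypothetical) and uses base R's `weighted.mean()` as the reference instead of Hmisc.

```r
library(dplyr)

# Hypothetical toy data with one missing satisfaction value.
toy <- tibble(
  cohort = c("old", "old", "old", "young", "young"),
  sat6 = c(8, NA, 6, 9, 7),
  cdweight = c(1, 2, 3, 1, 1)
)

toy_means <- toy %>%
  group_by(cohort) %>%
  summarise(
    # Sum of weights among observations with a valid sat6 value.
    w_valid = sum(cdweight[!is.na(sat6)]),
    # Manual weighted mean: weighted sum divided by the valid weights.
    m_manual = sum(sat6 * cdweight, na.rm = TRUE) / w_valid,
    # Reference: weighted.mean() drops missing x together with their weights.
    m_base = weighted.mean(sat6, w = cdweight, na.rm = TRUE)
  )

toy_means
```

For the "old" group the valid weights sum to 4, so the weighted mean is 26 / 4 = 6.5. Dividing by all weights (6) would give about 4.33 instead.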

---
#Calculate the correlation coefficient by cohort
`cor()` allows you to calculate correlations. It is a function from the package "stats", which is part of base R.
You can check whether it is installed with:

```r
find.package("stats")
```

```r
#calculate the correlation coefficient by cohort
wave1b %>% 
  group_by(cohort) %>%
  dplyr::summarise(cor(age, sat6,use = "complete.obs")) 
```

```
## # A tibble: 3 × 2
##   cohort      `cor(age, sat6, use = "complete.obs")`
##   <fct>                                        <dbl>
## 1 1 1991-1993                               -0.0613 
## 2 2 1981-1983                               -0.0249 
## 3 3 1971-1973                                0.00467
```

```r
#cor() has no na.rm argument, so na.rm = TRUE cannot be used to drop observations with missing values.
#Use the argument "use" instead, e.g., use = "complete.obs".
#To learn more about cor(), run ?cor
```
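
The effect of the `use` argument can be seen on two short made-up vectors (`x` and `y` are hypothetical). With the default setting a single missing value makes the result `NA`; with `use = "complete.obs"` only the complete pairs are used.

```r
# Hypothetical vectors; the last pair has a missing x value.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 8, 10)

# Default (use = "everything"): the missing value propagates.
r_default <- cor(x, y)

# Only pairs where both values are observed enter the calculation.
r_complete <- cor(x, y, use = "complete.obs")

# The same result as dropping the incomplete pair by hand.
r_by_hand <- cor(x[1:4], y[1:4])

c(r_default = r_default, r_complete = r_complete, r_by_hand = r_by_hand)
```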
---
#Take home
1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input. 
2. `group_by()`: Splits a tibble into groups. Certain functions will afterwards operate on each of the specified groups.
3. `summarize()`: Allows you to calculate any kind of aggregate statistic. Combined with `group_by()`, it calculates those statistics for each specified group. 
  - mean
  - sd
  - correlation coefficient
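
The three take-home points can be combined in one short pipeline. The sketch below assumes the built-in `mtcars` data as a stand-in for Pairfam, and names each summary column so the output is easier to read.

```r
library(dplyr)

cyl_summary <- mtcars %>%      # start with the data, then ...
  group_by(cyl) %>%            # ... group by number of cylinders, then ...
  summarise(                   # ... calculate statistics per group
    n = n(),                   # group size
    mean_mpg = mean(mpg),      # mean
    sd_mpg = sd(mpg),          # standard deviation
    cor_mpg_wt = cor(mpg, wt)  # correlation coefficient
  )

cyl_summary
```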

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/948203)