Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## R basic III
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Packages for today

```r
#install two new packages
install.packages("janitor") #for data cleaning
install.packages("ggplot2") #for visualization
```

```r
library(tidyverse) 
library(haven) #introduced in session 2 "R basics II"
library(janitor) #for data cleaning
library(ggplot2) #for visualization
```
---
#Outline 
- pipe
- group
- descriptive statistics
  - mean
  - correlation coefficient
  - chi-square test
  - visualize (optional)

---
#Nested code
What does the following code do?

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--
We have learned that R code may be nested. But too much nesting becomes unintelligible: because R evaluates code from the inside out, we also need to read nested code from the inside out!

Alternatively, we could write several lines of code successively and read from top to bottom. But this leads to many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13) # Intermediate object.
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---
#The (forward) pipe ` %>% `
The `%>%` operator pipes the output of one function into the next function as input. You can basically say: `function(argument1 = value)` can be written as `value %>% function()`.

Or even easier, think of it as: "then"

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>% sqrt() %>% mean()
```

```
## [1] 2.527274
```

---
# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="190%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="690%" style="display: block; margin: auto;">
]

---
#The (forward) pipe ` %>% `
- Keyboard shortcuts for typing `%>%`
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
#Example 1: round(x = ., digits = .)
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1) 
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.
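
The two behaviours can be sketched side by side. This is a minimal illustration with made-up numbers; `first_arg` and `other_arg` are hypothetical object names, and `library(magrittr)` is loaded only so the block runs on its own (the pipe is also available after `library(tidyverse)`).

```r
library(magrittr) # provides %>%

# Default: the left-hand side becomes the FIRST argument
16 %>% sqrt()                 # same as sqrt(16)

# Placeholder: the left-hand side goes wherever the dot is
3 %>% seq(from = 1, to = .)   # same as seq(from = 1, to = 3)

# Both routes lead to the same call, round(x = 5.882, digits = 1)
first_arg <- 5.882 %>% round(digits = 1)
other_arg <- 1 %>% round(x = 5.882, digits = .)
identical(first_arg, other_arg)
```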
---
#Advantages of piping & when not to pipe
- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from inside and out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.
  
- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, it is better to create intermediate objects.
  They help you debug (i.e., find mistakes) and make the code easier to read.
  
  
[Example video](https://www.youtube.com/watch?v=sohARFx6aTo)

---
#Manipulating Pairfam using `%>%`
Task: 1) Keep the variables id, age, sex_gen, cohort, sd10, and sat6; 2) replace negative values of sat6 with NA

Way 1: step by step, with intermediate objects

```r
#haven was already loaded at the beginning with library(haven)
wave1 <- read_dta("anchor1_50percent_Eng.dta")

wave1a <- select(wave1, 
                 id, 
                 age, 
                 sex_gen, 
                 cohort,
                 sd10, 
                 sat6) #select the chosen variables

wave1b <- mutate(wave1a,
                 gender=as_factor(sex_gen), #make sex_gen a factor, named gender
                 cohort=as_factor(cohort), #make cohort a factor
                 marital=as_factor(sd10) #make sd10 a factor, named marital
                )

wave1c<- mutate(wave1b,
                sat6=case_when(sat6<0 ~ as.numeric(NA), 
                               TRUE ~ as.numeric(sat6)
                               ) #replace sat6 with NA when sat6<0
                )
```

---
#Manipulating Pairfam using `%>%`
Way 2: using `%>%`, with even shorter code thanks to `transmute()`

What is `transmute()`? It works like `mutate()`, but keeps only the variables you specify inside `transmute()`.
[Difference between `mutate()` and `transmute()`](https://www.youtube.com/watch?v=vvIhRginelA)

```r
#compare data1 and data2
data1 <- mutate(wave1,
                gender=as_factor(sex_gen), #make sex_gen as a factor,named gender
                marital=as_factor(sd10)  #make sd10 as a factor, named marital
                )
data2 <- transmute(wave1,
                gender=as_factor(sex_gen), #make sex_gen as a factor,named gender
                marital=as_factor(sd10) #make sd10 as a factor, named marital
                )
```
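
The difference is easiest to see from the column names of the results. The following is a small sketch on a hypothetical toy data frame (`toy`, `kept_all`, and `kept_some` are illustrative names, not part of the Pairfam data):

```r
library(dplyr)

# Hypothetical toy data, not the Pairfam file
toy <- data.frame(id = 1:3, sex_gen = c(1, 2, 2), age = c(16, 35, 27))

# mutate() adds the new variable and keeps all existing ones
kept_all <- mutate(toy, female = sex_gen == 2)

# transmute() keeps only what is listed inside the call
kept_some <- transmute(toy, id, female = sex_gen == 2)

names(kept_all)  # id, sex_gen, age, female
names(kept_some) # id, female
```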

---
#Manipulating Pairfam using `%>%`
Way 2: using `%>%`, with even shorter code thanks to `transmute()`

```r
wave1_pipe <- wave1 %>% 
  transmute(# Create new variables and keep only those
    id,    
    age,  
    gender=as_factor(sex_gen),   #treat sex as a categorical variable
    cohort=as_factor(cohort),     #treat cohort as a categorical variable
    marital=as_factor(sd10),   #treat sd10 as a categorical variable
    sat6=case_when(sat6<0 ~ as.numeric(NA), 
                   TRUE ~ as.numeric(sat6))
            )
wave1_pipe
```

```
## # A tibble: 6,201 × 6
##           id age       gender   cohort      marital                         sat6
##        <dbl> <dbl+lbl> <fct>    <fct>       <fct>                          <dbl>
##  1 267206000 16        2 Female 1 1991-1993 1 Single (never married)           7
##  2 112963000 35        1 Male   3 1971-1973 1 Single (never married)           6
##  3 327937000 16        2 Female 1 1991-1993 -2 No answer                       8
##  4 318656000 27        2 Female 2 1981-1983 2 Married or in a civil union…     9
##  5 717889000 37        1 Male   3 1971-1973 2 Married or in a civil union…     7
##  6 222517000 15        1 Male   1 1991-1993 1 Single (never married)           9
##  7 144712000 16        2 Female 1 1991-1993 1 Single (never married)           8
##  8 659357000 17        2 Female 1 1991-1993 1 Single (never married)           7
##  9 506367000 37        1 Male   3 1971-1973 2 Married or in a civil union…     9
## 10  64044000 15        2 Female 1 1991-1993 1 Single (never married)           7
## # ℹ 6,191 more rows
```
---
#Manipulating Pairfam using `%>%`
Task: generate a nice two-way table on cohort and gender
.pull-left[

```r
tabyl(wave1_pipe, cohort, gender)
```

```
##                            cohort -10 not in demodiff -7 Incomplete data
##                -7 Incomplete data                   0                  0
##  0 former capikid first interview                   0                  0
##                       1 1991-1993                   0                  0
##                       2 1981-1983                   0                  0
##                       3 1971-1973                   0                  0
##                       4 2001-2003                   0                  0
##     9 former capikid re-interview                   0                  0
##  -4 Filter error / Incorrect entry -3 Does not apply 1 Male 2 Female
##                                  0                 0      0        0
##                                  0                 0      0        0
##                                  0                 0   1112     1061
##                                  0                 0   1000     1013
##                                  0                 0    917     1098
##                                  0                 0      0        0
##                                  0                 0      0        0
```
Oh no! This can hardly be called "nice": the unused factor levels clutter the table.
]

---
#Manipulating Pairfam using `%>%`
.pull-left[

```r
wave1_pipe %>% 
  mutate(
    gender=fct_drop(gender),
    cohort=fct_drop(cohort)
  )%>%
  tabyl(cohort, gender)
```

```
##       cohort 1 Male 2 Female
##  1 1991-1993   1112     1061
##  2 1981-1983   1000     1013
##  3 1971-1973    917     1098
```
]
.pull-right[

```r
wave1_pipe %>% 
  mutate(
    gender=fct_drop(gender),
    cohort=fct_drop(cohort)   )%>%
  tabyl(cohort, gender)%>%
  adorn_totals("row") %>% #add row total
  adorn_percentages("row") %>% #add row %
  adorn_pct_formatting() %>% #format the percentage
  adorn_ns(position="front")%>% #add absolute n in front
  adorn_title() %>% #add title
  knitr::kable() #generate a table
```

|            |gender        |              |
|:-----------|:-------------|:-------------|
|cohort      |1 Male        |2 Female      |
|1 1991-1993 |1,112 (51.2%) |1,061 (48.8%) |
|2 1981-1983 |1,000 (49.7%) |1,013 (50.3%) |
|3 1971-1973 |917 (45.5%)   |1,098 (54.5%) |
|Total       |3,029 (48.8%) |3,172 (51.2%) |
]
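
What `fct_drop()` does can be checked on a small factor. This is a sketch with a hypothetical factor `cohort_toy` that declares four levels but only uses two of them, similar in spirit to the labelled Pairfam variables:

```r
library(forcats)

# Hypothetical factor: four declared levels, only two actually observed
cohort_toy <- factor(c("1991-1993", "1971-1973", "1991-1993"),
                     levels = c("Incomplete data", "1991-1993",
                                "1981-1983", "1971-1973"))

table(cohort_toy)             # unused levels appear with a count of 0
table(fct_drop(cohort_toy))   # only the observed levels remain

nlevels(cohort_toy)           # 4
nlevels(fct_drop(cohort_toy)) # 2
```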
---
#Grouped operations
`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
#pipe the dataset "wave1_pipe" into group_by(cohort), so that later code is executed separately for each cohort
wave1_cohort <- wave1_pipe %>% group_by(cohort)
wave1_cohort #what wave1_cohort looks like
```

```
## # A tibble: 6,201 × 6
## # Groups:   cohort [3]
##           id age       gender   cohort      marital                         sat6
##        <dbl> <dbl+lbl> <fct>    <fct>       <fct>                          <dbl>
##  1 267206000 16        2 Female 1 1991-1993 1 Single (never married)           7
##  2 112963000 35        1 Male   3 1971-1973 1 Single (never married)           6
##  3 327937000 16        2 Female 1 1991-1993 -2 No answer                       8
##  4 318656000 27        2 Female 2 1981-1983 2 Married or in a civil union…     9
##  5 717889000 37        1 Male   3 1971-1973 2 Married or in a civil union…     7
##  6 222517000 15        1 Male   1 1991-1993 1 Single (never married)           9
##  7 144712000 16        2 Female 1 1991-1993 1 Single (never married)           8
##  8 659357000 17        2 Female 1 1991-1993 1 Single (never married)           7
##  9 506367000 37        1 Male   3 1971-1973 2 Married or in a civil union…     9
## 10  64044000 15        2 Female 1 1991-1993 1 Single (never married)           7
## # ℹ 6,191 more rows
```

---
#Summarize mean by groups
`summarize()` allows you to calculate all kinds of statistics at the level of the groups you have specified. It is a function from the dplyr package, which is one of the core packages of the tidyverse.

.pull-left[

```r
wave1_cohort1<- wave1_pipe %>% 
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6))
wave1_cohort1  
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6)`
##   <fct>              <dbl>
## 1 1 1991-1993           NA
## 2 2 1981-1983           NA
## 3 3 1971-1973           NA
```
**why?**
]

.pull-right[

```r
wave1_cohort2<- wave1_pipe %>% 
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE ))
wave1_cohort2  
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```
]
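
The reason for the `NA` results is that negative values of sat6 were recoded to `NA` earlier, and `mean()` returns `NA` as soon as a single value is missing. A minimal sketch with hypothetical values (`sat_toy` is an illustrative name):

```r
sat_toy <- c(7, 6, NA, 9)   # hypothetical values; NA marks a missing answer

mean(sat_toy)               # NA: one missing value makes the whole mean missing
mean(sat_toy, na.rm = TRUE) # mean of the three observed values
sum(is.na(sat_toy))         # how many values na.rm = TRUE leaves out
```

It is usually worth checking how many cases `na.rm = TRUE` drops before interpreting the result.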

---
#Calculate the correlation coefficient
`cor()` allows you to calculate the correlation between two variables.

```r
#calculate the correlation coefficient between x and y
cor(x =. , y =. , use =.,
    method =. )
#use="everything" is default, method = "pearson" is default.
```

```r
#calculate the correlation coefficient between age and sat6
cor(wave1_pipe$age, wave1_pipe$sat6,
    use="everything",
    method = c("pearson")
    )
```

```
## [1] NA
```

```r
cor(wave1_pipe$age, wave1_pipe$sat6,
    use="complete.obs",
    method = c("pearson")
    )
```

```
## [1] -0.1129498
```
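
The effect of the `use` argument can be reproduced on a few made-up numbers. In this sketch `x_toy`, `y_toy`, and `ok` are hypothetical names; `use = "complete.obs"` should give the same result as manually keeping only the pairs where both values are present:

```r
# Hypothetical vectors; the last observation lacks a value for y_toy
x_toy <- c(15, 16, 27, 35, 37)
y_toy <- c(9, 8, 7, 6, NA)

cor(x_toy, y_toy)                        # NA under the default use = "everything"
cor(x_toy, y_toy, use = "complete.obs")  # computed from the four complete pairs

# The same result by hand: keep only rows where both values are present
ok <- complete.cases(x_toy, y_toy)
cor(x_toy[ok], y_toy[ok])
```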

---
#Calculate the correlation coefficient by groups
We can drop cases with missing values on the two variables "age" and "sat6".

```r
#calculate the correlation coefficient by cohort
correlation <-   wave1_pipe %>% 
  group_by(cohort) %>%
  drop_na(sat6,age) %>% #removing missing cases of sat6 and age, using drop_na()
  dplyr::summarise(cor(x=age, y=sat6)) #estimate correlation coefficient between age and sat6

correlation
```

```
## # A tibble: 3 × 2
##   cohort      `cor(x = age, y = sat6)`
##   <fct>                          <dbl>
## 1 1 1991-1993                 -0.0613 
## 2 2 1981-1983                 -0.0249 
## 3 3 1971-1973                  0.00467
```

---
#Calculate the chi-square between two categorical variables
`chisq.test()` allows you to run a chi-square test of independence between two categorical variables.
Example: a chi-square test of marital status and cohort

```r
#first, check the distribution of each variable
tabyl(wave1_pipe$cohort)
```

```
##                 wave1_pipe$cohort    n   percent
##                -7 Incomplete data    0 0.0000000
##  0 former capikid first interview    0 0.0000000
##                       1 1991-1993 2173 0.3504274
##                       2 1981-1983 2013 0.3246251
##                       3 1971-1973 2015 0.3249476
##                       4 2001-2003    0 0.0000000
##     9 former capikid re-interview    0 0.0000000
```

```r
tabyl(wave1_pipe$marital)
```

```
##                                 wave1_pipe$marital    n      percent
##                              -5 Inconsistent value    0 0.0000000000
##                  -4 Filter error / Incorrect entry    0 0.0000000000
##                                  -3 Does not apply    0 0.0000000000
##                                       -2 No answer    6 0.0009675859
##                                      -1 Don't know    1 0.0001612643
##                           1 Single (never married) 4145 0.6684405741
##  2 Married or in a civil union (even if separated) 1815 0.2926947267
##                3 Divorced or dissolved civil union  230 0.0370907918
##    4 Widowed or surviving partner in a civil union    4 0.0006450572
```

---
#Calculate the chi-square between two categorical variables

```r
#create the two-way distribution table
tab <-   wave1_pipe %>% 
  transmute(
    cohort_a=fct_drop(cohort),
    marital_a=fct_drop(marital)
           ) %>%
  drop_na(marital_a,cohort_a) %>%
  tabyl(marital_a,cohort_a)  
tab
```

```
##                                          marital_a 1 1991-1993 2 1981-1983
##                                       -2 No answer           6           0
##                                      -1 Don't know           0           0
##                           1 Single (never married)        2165        1486
##  2 Married or in a civil union (even if separated)           1         493
##                3 Divorced or dissolved civil union           1          34
##    4 Widowed or surviving partner in a civil union           0           0
##  3 1971-1973
##            0
##            1
##          494
##         1321
##          195
##            4
```

---
#Calculate the chi-square between two categorical variables

```r
#calculate the chi-square
chisq.test(tab)   
```

```
## Warning in stats::chisq.test(., ...): Chi-squared approximation may be
## incorrect
```

```
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 2776.3, df = 10, p-value < 2.2e-16
```
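
R issues the warning above when some expected cell counts are small (below 5), which is plausible here because several cells of the table contain 0 or 1 cases. The expected counts can be inspected in the test result. A minimal sketch on a hypothetical 2 × 2 table (`tab_toy` and `res_toy` are illustrative names, and the counts are made up):

```r
# Hypothetical counts, not from the Pairfam data
tab_toy <- matrix(c(30, 20,
                    10, 40),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(group = c("a", "b"),
                                  answer = c("yes", "no")))

res_toy <- chisq.test(tab_toy)

res_toy$expected          # expected counts under independence
any(res_toy$expected < 5) # FALSE here, so no warning is issued
res_toy$parameter         # degrees of freedom: (2 - 1) * (2 - 1) = 1
res_toy$p.value
```

When small expected counts do occur, merging or dropping very rare categories before testing is one common way to handle them.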

---
#Use ggplot2 to plot descriptive statistics

```r
ggplot(data = <DATA>, mapping = aes(x = , y = )) +   # specify dataset, x, and y to ggplot
  <GEOM_FUNCTION>() +                                # specify the type of chart, e.g. bar, point, line chart
  <COORDINATE_FUNCTION>()                            # optional: change the default coordinate system, e.g. swap x and y axis

#note: + is the symbol that connects the different layers of the plot
```
ggplot2 contains many geom functions, which put layers of different types of geometric objects (e.g., points, bars, lines) over a coordinate system.
  - All geom functions depend on the `mapping` argument. It is paired with `aes()`, which stands for "aesthetics". Aesthetics are the visual properties of your plot.

  - The most important aesthetics of any graph are the x-axis and the y-axis. Therefore, `aes()` usually takes `x` and `y`, which specify which variable to map to the x-axis and which one to map to the y-axis.

  - But of course, aesthetics also include color, shape, size, and so on.
---
#Use ggplot2 to plot descriptive statistics

```r
ggplot(data = wave1_pipe) ## Create an empty coordinate system for the dataset "wave1_pipe".
```
<img src="https://github.com/fancycmn/slide6/blob/main/S6_Pic3.png?raw=true" width="60%" style="display: block; margin: auto;" >

---
#Use ggplot2 to plot descriptive statistics
.pull-left[

```r
figure1<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+ #map marital to the x-axis
  geom_bar()
figure1 #print figure1
```
<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f1.JPG?raw=true" width="100%" style="display: block; margin: auto;" >
]

.pull-right[

```r
figure2<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+ 
  geom_bar()+
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure2 #print figure2
```
<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f2.JPG?raw=true" width="100%" style="display: block; margin: auto;" >
]
---
#Use ggplot2 to plot descriptive statistics by group

```r
figure3<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+ 
  geom_bar()+
  facet_wrap(~cohort)+ #plot the bar chart by cohort
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure3 #print figure3
```
<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f3.JPG?raw=true" width="100%" style="display: block; margin: auto;" >
---
#Use ggplot2 to plot descriptive statistics by group

```r
figure4<- ggplot(data = wave1_pipe, mapping=aes(x=marital,fill=marital))+ #fill=marital colors the bars by marital status
  geom_bar()+
  facet_wrap(~cohort)+ #plot the bar chart by cohort
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure4 #print figure4
```
<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f4.JPG?raw=true" width="100%" style="display: block; margin: auto;" >

---
#Take home
1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input. 
2. `group_by()`: Splits a tibble into groups. Certain functions will afterwards operate on each of the specified groups.
3. `summarize()`: Allows you to calculate any kind of aggregate statistic. Combined with `group_by()`, it calculates those statistics for each specified group. 
  - mean
  - sd
  - correlation coefficient
4. `ggplot()`: to plot charts, usually combined with a `<GEOM_FUNCTION>()` to specify the chart type (e.g. bar, line, etc.)
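
Points 1 to 3 can be combined in a single pipeline. The sketch below uses R's built-in `mtcars` data as a stand-in for Pairfam; `by_cyl` and the column names on the left of `=` are hypothetical choices. Naming the summaries avoids column names such as `mean(sat6)`.

```r
library(dplyr)

# Built-in mtcars data as a stand-in for Pairfam
by_cyl <- mtcars %>%
  group_by(cyl) %>%                      # one group per number of cylinders
  summarise(
    n          = n(),                    # group size
    mean_mpg   = mean(mpg, na.rm = TRUE),
    sd_mpg     = sd(mpg, na.rm = TRUE),
    cor_mpg_wt = cor(mpg, wt)            # correlation within each group
  )
by_cyl
```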

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1221406)