class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## R basic III
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 24px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 13px;
}
</style>

#Packages for today

```r
#install new packages
install.packages("ggplot2") #for visualization
```

```r
library(tidyverse)
library(haven) #introduced in session 2 "R basics II"
library(janitor) #for data cleaning
library(ggplot2) #for visualization
```

---

#Outline

- pipe
- group
- descriptive statistics
  - mean
  - correlation coefficient
  - chi-square test
- visualize (optional)

---

#Nested code

What does the following code do?

```r
#Nested code
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

--

We have learned that R code can be nested. But too much nesting becomes unintelligible: because R evaluates code from the inside out, we have to read nested code from the inside out!

--

Alternatively, we could write several lines of code in succession and read them from top to bottom. But this creates many irrelevant intermediate objects that crowd our environment.

```r
x <- seq(from = 1, to = 13) # Intermediate object.
x_sqrt <- sqrt(x) # Intermediate object.
mean(x_sqrt)
```

```
## [1] 2.527274
```

---

#The (forward) pipe ` %>% `

The `%>%` operator pipes the output of one function into the next function as input.

You can basically say: `function(argument1 = value)` can be written as `value %>% function()`.

Or even easier, think of it as: "then"

```r
mean(sqrt(seq(from = 1, to = 13)))
```

```
## [1] 2.527274
```

```r
#use pipe
seq(from = 1, to = 13) %>%
  sqrt() %>%
  mean()
```

```
## [1] 2.527274
```

---

# What does a pipe look like?
.pull-left[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg.jpg" width="190%" style="display: block; margin: auto;" >
]

.pull-right[
<img src="https://merlin-intro-r.netlify.app/5-piping/img/PipedEgg2.png" width="690%" style="display: block; margin: auto;">
]

---

#The (forward) pipe ` %>% `

- Keyboard shortcuts for typing `%>%`
  - Windows: Ctrl+Shift+M
  - Mac: Cmd+Shift+M

```r
# Example 1: round(x=., digits =.)
#Both lines of code round 5.882 to one decimal place.
round(x = 5.882, digits = 1)
```

```
## [1] 5.9
```

```r
5.882 %>% round(digits = 1)
```

```
## [1] 5.9
```

```r
#Example 2
2 %>% round(x = 5.882, digits = .)
```

```
## [1] 5.88
```

By default, `%>%` pipes into the first argument of a function. The `.` placeholder allows us to pipe into another argument.

---

#Advantages of piping & when not to pipe

- The advantages are
  - Legible code: We can structure code from left to right, as opposed to from the inside out.
  - Shorter code: You minimize the need for local/intermediate variables.
  - Easily mutable code: You can easily add steps anywhere in the sequence of operations.
- When not to pipe
  - If you have more than one or two major inputs, don't pipe.
  - If you have more than ten steps, it is better to create intermediate objects. This helps you debug (i.e., find mistakes) and is simply easier to read.
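
The last point can be tried out on the toy pipeline from earlier. This is a minimal sketch, assuming only the magrittr package; the object names `result_pipe`, `x_sqrt`, and `result_steps` are made up for illustration. Breaking a pipe at a named intermediate object lets you inspect that step before moving on.

```r
library(magrittr) #provides %>%; library(tidyverse) makes it available too

#one single pipe: compact, but the intermediate values are never stored
result_pipe <- seq(from = 1, to = 13) %>%
  sqrt() %>%
  round(digits = 2) %>%
  mean()

#the same steps, broken at a named intermediate object
x_sqrt <- seq(from = 1, to = 13) %>%
  sqrt() %>%
  round(digits = 2)
x_sqrt #inspect the intermediate result while debugging

result_steps <- mean(x_sqrt)

identical(result_pipe, result_steps) #both routes give the same number
```
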
[Example video](https://www.youtube.com/watch?v=sohARFx6aTo)

---

#Manipulating Pairfam using `%>%`

Task: 1) Keep the variables id, age, sex_gen, cohort, sd10, and sat6; 2) replace negative values of sat6 with NA

What we will do: way 1

```r
#we loaded library(haven) at the beginning
wave1 <- read_dta("anchor1_50percent_Eng.dta")

wave1a <- select(wave1, id, age, sex_gen, cohort, sd10, sat6) #select chosen variables

wave1b <- mutate(wave1a,
                 gender=as_factor(sex_gen), #make sex_gen a factor, named gender
                 cohort=as_factor(cohort), #make cohort a factor
                 marital=as_factor(sd10) #make sd10 a factor, named marital
                 )

wave1c<- mutate(wave1b,
                sat6=case_when(sat6<0 ~ as.numeric(NA),
                               TRUE ~ as.numeric(sat6)
                               ) #replace sat6 with NA when sat6<0
                )
```

---

#Manipulating Pairfam using `%>%`

But now: way 2 using ` %>%`, with even shorter code when using `transmute()`

--

Oooh, what is `transmute()`? It keeps only the variables you specify inside `transmute()`.

[Difference between `mutate()` and `transmute()`](https://www.youtube.com/watch?v=vvIhRginelA)

```r
#compare data1 and data2
data1 <- mutate(wave1,
                gender=as_factor(sex_gen), #make sex_gen a factor, named gender
                marital=as_factor(sd10) #make sd10 a factor, named marital
                )

data2 <- transmute(wave1,
                   gender=as_factor(sex_gen), #make sex_gen a factor, named gender
                   marital=as_factor(sd10) #make sd10 a factor, named marital
                   )
```

---

#Manipulating Pairfam using `%>%`

But now: way 2 using ` %>%`, with even shorter code when using `transmute()`

```r
wave1_pipe <- wave1 %>%
  transmute(
    # Create new variables and keep only those
    id,
    age,
    gender=as_factor(sex_gen), #treat sex_gen as a categorical variable
    cohort=as_factor(cohort), #treat cohort as a categorical variable
    marital=as_factor(sd10), #treat sd10 as a categorical variable
    sat6=case_when(sat6<0 ~ as.numeric(NA),
                   TRUE ~ as.numeric(sat6))
  )
wave1_pipe
```

```
## # A tibble: 6,201 × 6
##           id age       gender   cohort      marital                         sat6
##        <dbl> <dbl+lbl> <fct>    <fct>       <fct>                          <dbl>
##  1 267206000 16        2 Female 1 1991-1993 1 Single (never married)           7
##  2 112963000 35        1 Male   3 1971-1973 1 Single (never married)           6
##  3 327937000 16        2 Female 1 1991-1993 -2 No answer                       8
##  4 318656000 27        2 Female 2 1981-1983 2 Married or in a civil union…     9
##  5 717889000 37        1 Male   3 1971-1973 2 Married or in a civil union…     7
##  6 222517000 15        1 Male   1 1991-1993 1 Single (never married)           9
##  7 144712000 16        2 Female 1 1991-1993 1 Single (never married)           8
##  8 659357000 17        2 Female 1 1991-1993 1 Single (never married)           7
##  9 506367000 37        1 Male   3 1971-1973 2 Married or in a civil union…     9
## 10  64044000 15        2 Female 1 1991-1993 1 Single (never married)           7
## # ℹ 6,191 more rows
```

---

#Manipulating Pairfam using `%>%`

Task: generate a nice two-way table of cohort and gender

.pull-left[

```r
tabyl(wave1_pipe, cohort, gender)
```

```
##                            cohort -10 not in demodiff -7 Incomplete data
##                -7 Incomplete data                   0                  0
##  0 former capikid first interview                   0                  0
##                       1 1991-1993                   0                  0
##                       2 1981-1983                   0                  0
##                       3 1971-1973                   0                  0
##                       4 2001-2003                   0                  0
##     9 former capikid re-interview                   0                  0
##  -4 Filter error / Incorrect entry -3 Does not apply 1 Male 2 Female
##                                  0                 0      0        0
##                                  0                 0      0        0
##                                  0                 0   1112     1061
##                                  0                 0   1000     1013
##                                  0                 0    917     1098
##                                  0                 0      0        0
##                                  0                 0      0        0
```

Oooh, no! That can hardly be called "nice"!
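
The empty rows and columns appear because every value label became a factor level, including labels that never occur in the data. Here is a minimal sketch of the same effect with a hypothetical toy factor (`toy_gender` is made up, not pairfam data), assuming the forcats package that comes with the tidyverse:

```r
library(forcats) #part of the tidyverse; provides fct_drop()

#hypothetical toy factor with one unused level, "No answer"
toy_gender <- factor(c("Male", "Female", "Female"),
                     levels = c("No answer", "Male", "Female"))

table(toy_gender) #the unused level is listed with a count of 0
table(fct_drop(toy_gender)) #fct_drop() removes levels that never occur
```
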
]

---

#Manipulating Pairfam using `%>%`

.pull-left[

```r
wave1_pipe %>%
  mutate(
    gender=fct_drop(gender), #drop unused levels of gender
    cohort=fct_drop(cohort) #drop unused levels of cohort
  )%>%
  tabyl(cohort, gender)
```

```
##       cohort 1 Male 2 Female
##  1 1991-1993   1112     1061
##  2 1981-1983   1000     1013
##  3 1971-1973    917     1098
```

]

.pull-right[

```r
wave1_pipe %>%
  mutate(
    gender=fct_drop(gender),
    cohort=fct_drop(cohort)
  )%>%
  tabyl(cohort, gender)%>%
  adorn_totals("row") %>% #add row total
  adorn_percentages("row") %>% #add row %
  adorn_pct_formatting() %>% #format the percentage
  adorn_ns(position="front")%>% #add absolute n in front
  adorn_title() %>% #add title
  knitr::kable() #generate a table
```

|            |gender        |              |
|:-----------|:-------------|:-------------|
|cohort      |1 Male        |2 Female      |
|1 1991-1993 |1,112 (51.2%) |1,061 (48.8%) |
|2 1981-1983 |1,000 (49.7%) |1,013 (50.3%) |
|3 1971-1973 |917 (45.5%)   |1,098 (54.5%) |
|Total       |3,029 (48.8%) |3,172 (51.2%) |

]

---

#Grouped operations

`group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups!

```r
wave1_cohort<- wave1_pipe %>% group_by(cohort) #pipe the dataset "wave1_pipe" into group_by(cohort), so that later code is executed separately for each cohort
wave1_cohort #what wave1_cohort looks like
```

```
## # A tibble: 6,201 × 6
## # Groups:   cohort [3]
##           id age       gender   cohort      marital                         sat6
##        <dbl> <dbl+lbl> <fct>    <fct>       <fct>                          <dbl>
##  1 267206000 16        2 Female 1 1991-1993 1 Single (never married)           7
##  2 112963000 35        1 Male   3 1971-1973 1 Single (never married)           6
##  3 327937000 16        2 Female 1 1991-1993 -2 No answer                       8
##  4 318656000 27        2 Female 2 1981-1983 2 Married or in a civil union…     9
##  5 717889000 37        1 Male   3 1971-1973 2 Married or in a civil union…     7
##  6 222517000 15        1 Male   1 1991-1993 1 Single (never married)           9
##  7 144712000 16        2 Female 1 1991-1993 1 Single (never married)           8
##  8 659357000 17        2 Female 1 1991-1993 1 Single (never married)           7
##  9 506367000 37        1 Male   3 1971-1973 2 Married or in a civil union…     9
## 10  64044000 15        2 Female 1 1991-1993 1 Single (never married)           7
## # ℹ 6,191 more rows
```

---

#Summarize mean by groups

`summarize()` allows you to calculate all kinds of statistics at the level of the groups you have specified. It is a function from the "dplyr" package, and "dplyr" is one of the packages in the "tidyverse".

.pull-left[

```r
wave1_cohort1<- wave1_pipe %>%
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6))
wave1_cohort1
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6)`
##   <fct>              <dbl>
## 1 1 1991-1993           NA
## 2 2 1981-1983           NA
## 3 3 1971-1973           NA
```

**Why?**

]

.pull-right[

```r
wave1_cohort2<- wave1_pipe %>%
  group_by(cohort) %>%
  dplyr::summarise(mean(sat6,na.rm=TRUE )) #na.rm=TRUE ignores missing values
wave1_cohort2
```

```
## # A tibble: 3 × 2
##   cohort      `mean(sat6, na.rm = TRUE)`
##   <fct>                            <dbl>
## 1 1 1991-1993                       7.94
## 2 2 1981-1983                       7.40
## 3 3 1971-1973                       7.47
```

]

---

#Calculate the correlation coefficient

`cor()` allows you to calculate correlation coefficients.

```r
#calculate the correlation coefficient between x and y
cor(x =. , y =. , use =., method =. ) #use="everything" is default, method = "pearson" is default.
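
#A toy illustration of the use argument, as a sketch only:
#a and b are hypothetical vectors, not pairfam variables.
a <- c(1, 2, 3, NA)
b <- c(2, 4, 7, 8)
cor(a, b) #default use="everything": one NA makes the result NA
cor(a, b, use = "complete.obs") #uses only the three complete pairs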
```

```r
#calculate the correlation coefficient between age and sat6
cor(wave1_pipe$age, wave1_pipe$sat6, use="everything", method = c("pearson") )
```

```
## [1] NA
```

```r
cor(wave1_pipe$age, wave1_pipe$sat6, use="complete.obs", method = c("pearson") )
```

```
## [1] -0.1129498
```

---

#Calculate the correlation coefficient by groups

We can drop the missing values of the two variables "age" and "sat6".

```r
#calculate the correlation coefficient by cohort
correlation <- wave1_pipe %>%
  group_by(cohort) %>%
  drop_na(sat6,age) %>% #remove cases with missing sat6 or age, using drop_na()
  dplyr::summarise(cor(x=age, y=sat6)) #estimate the correlation coefficient between age and sat6
correlation
```

```
## # A tibble: 3 × 2
##   cohort      `cor(x = age, y = sat6)`
##   <fct>                          <dbl>
## 1 1 1991-1993                 -0.0613
## 2 2 1981-1983                 -0.0249
## 3 3 1971-1973                  0.00467
```

---

#Calculate the chi-square between two categorical variables

`chisq.test()` allows you to test whether two categorical variables are associated.

Example: a chi-square test of marital and cohort

```r
#first, check the distribution of each variable
tabyl(wave1_pipe$cohort)
```

```
##                 wave1_pipe$cohort    n   percent
##                -7 Incomplete data    0 0.0000000
##  0 former capikid first interview    0 0.0000000
##                       1 1991-1993 2173 0.3504274
##                       2 1981-1983 2013 0.3246251
##                       3 1971-1973 2015 0.3249476
##                       4 2001-2003    0 0.0000000
##     9 former capikid re-interview    0 0.0000000
```

```r
tabyl(wave1_pipe$marital)
```

```
##                                 wave1_pipe$marital    n      percent
##                              -5 Inconsistent value    0 0.0000000000
##                  -4 Filter error / Incorrect entry    0 0.0000000000
##                                  -3 Does not apply    0 0.0000000000
##                                       -2 No answer    6 0.0009675859
##                                      -1 Don't know    1 0.0001612643
##                           1 Single (never married) 4145 0.6684405741
##  2 Married or in a civil union (even if separated) 1815 0.2926947267
##                3 Divorced or dissolved civil union  230 0.0370907918
##    4 Widowed or surviving partner in a civil union    4 0.0006450572
```

---

#Calculate the chi-square between two categorical variables

```r
#create the two-way distribution table
tab <- wave1_pipe %>%
  transmute(
    cohort_a=fct_drop(cohort), #drop unused levels of cohort
    marital_a=fct_drop(marital) #drop unused levels of marital
  ) %>%
  drop_na(marital_a,cohort_a) %>%
  tabyl(marital_a,cohort_a)
tab
```

```
##                                          marital_a 1 1991-1993 2 1981-1983
##                                       -2 No answer           6           0
##                                      -1 Don't know           0           0
##                           1 Single (never married)        2165        1486
##  2 Married or in a civil union (even if separated)           1         493
##                3 Divorced or dissolved civil union           1          34
##    4 Widowed or surviving partner in a civil union           0           0
##  3 1971-1973
##            0
##            1
##          494
##         1321
##          195
##            4
```

---

#Calculate the chi-square between two categorical variables

```r
#calculate the chi-square
chisq.test(tab)
```

```
## Warning in stats::chisq.test(., ...): Chi-squared approximation may be
## incorrect
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 2776.3, df = 10, p-value < 2.2e-16
```

---

#Use ggplot2 to plot descriptive statistics

```r
ggplot(data = <DATA>, mapping = aes(x=, y=)) + # specify the dataset, x, and y
  <GEOM_FUNCTION>()+ # specify the type of chart, e.g. bar, point, or line chart
  <COORDINATE_FUNCTION> # change the default coordinate system, e.g. swap the x and y axes
#note: + is the symbol that connects the different sections of code
```

ggplot2 contains many geom functions, which put layers of different types of geometric objects (e.g., points, bars, lines) over a coordinate system.

- All geom functions depend on the mapping argument. It is paired with `aes()`, which stands for "aesthetic". Aesthetics are the visual properties of your plot.
- The most important aesthetics of any graph are the y-axis and the x-axis. Therefore, `aes()` usually takes x and y, which specify the variable to map to the x-axis and the variable to map to the y-axis.
- But of course, aesthetics also include color, shape, size, and so on.

---

#Use ggplot2 to plot descriptive statistics

```r
ggplot(data = wave1_pipe) ## Create an empty coordinate system for the dataset "wave1_pipe".
```

<img src="https://github.com/fancycmn/slide6/blob/main/S6_Pic3.png?raw=true" width="60%" style="display: block; margin: auto;" >

---

#Use ggplot2 to plot descriptive statistics

.pull-left[

```r
figure1<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+ ## use the dataset "wave1_pipe" and map marital to the x-axis
  geom_bar() #draw a bar chart
figure1 #print figure1
```

<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f1.JPG?raw=true" width="100%" style="display: block; margin: auto;" >

]

.pull-right[

```r
figure2<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+
  geom_bar()+
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure2 #print figure2
```

<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f2.JPG?raw=true" width="100%" style="display: block; margin: auto;" >

]

---

#Use ggplot2 to plot descriptive statistics by group

```r
figure3<- ggplot(data = wave1_pipe, mapping=aes(x=marital))+
  geom_bar()+
  facet_wrap(~cohort)+ #plot the bar chart by cohort
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure3 #print figure3
```

<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f3.JPG?raw=true" width="100%" style="display: block; margin: auto;" >

---

#Use ggplot2 to plot descriptive statistics by group

```r
figure4<- ggplot(data = wave1_pipe, mapping=aes(x=marital,fill=marital))+ #fill=marital colors the bars by marital status
  geom_bar()+
  facet_wrap(~cohort)+ #plot the bar chart by cohort
  coord_flip() #flip the coordinate system to make a horizontal bar chart
figure4 #print figure4
```

<img src="https://github.com/fancycmn/2024Advancedquant_intro/blob/main/24-Session%204/f4.JPG?raw=true" width="100%" style="display: block; margin: auto;" >

---

#Take home

1. `%>%`: the (forward) pipe, allows you to pipe the output of one function into the next function as input.
2. `group_by()`: Groups a tibble by the specified variable(s). Certain functions will afterwards operate on each of the specified groups.
3. `summarize()`: Allows you to estimate any kind of aggregate statistic. Combined with `group_by()`, it estimates those statistics for each specified group.
   - mean
   - sd
   - correlation coefficient
4. `ggplot()`: to plot charts, often combined with `<GEOM_FUNCTION>()` to specify the chart type (e.g. bar, line, etc.)

---

class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1221406)
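
---

#Take home: a toy example

Take-home points 1 to 3 can be tried out together on a small data frame. This is a minimal sketch, assuming only the dplyr package; the data frame `toy` and its columns are hypothetical and are not pairfam variables. Naming each statistic inside `summarise()` also gives tidier column names than the defaults.

```r
library(dplyr) #provides %>%, group_by() and summarise()

#hypothetical toy data with one missing value
toy <- data.frame(
  cohort = c("A", "A", "A", "B", "B", "B"),
  age    = c(15, 16, 17, 35, 36, 37),
  sat    = c(8, NA, 9, 6, 7, 5)
)

toy_summary <- toy %>%
  group_by(cohort) %>%
  summarise(
    mean_sat    = mean(sat, na.rm = TRUE), #mean, ignoring NA
    sd_sat      = sd(sat, na.rm = TRUE), #standard deviation, ignoring NA
    cor_age_sat = cor(age, sat, use = "complete.obs") #correlation on complete pairs
  )
toy_summary
```
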