class: center, middle, inverse, title-slide .title[ # Advanced quantitative data analysis ] .subtitle[ ## Cross-sectional data analysis I ] .author[ ### Mengni Chen ] .institute[ ### Department of Sociology, University of Copenhagen ] --- <style type="text/css"> .remark-slide-content { font-size: 20px; padding: 20px 80px 20px 80px; } .remark-code, .remark-inline-code { background: #f0f0f0; } .remark-code { font-size: 13px; } </style> #Package for today ```r #install four new packages install.packages("ggplot2") #for visualzation install.packages("estimatr") #estimate robust standard errors OLS. install.packages("texreg") #make nicely-formatted Html & Latex regression tables. ``` ```r library(haven) #importing dataset of dta format library(tidyverse) #for data cleaning library(janitor) #for data cleaning library(ggplot2) #for visualzation library(estimatr) #for robust standard errors OLS. library(texreg) #for making nicely-formatted Html & Latex regression tables. ``` --- #Outline - descriptive statistics by group - mean - visualization - OLS - OLS with interaction --- #Are individuals who have a partner happier? - compare life satisfaction of individuals who have a partner with those who don't - individuals who don't have a partner including being single, being divorced/separated, widowed - what one could do is to calculate the mean of life satisfaction across different partnership status --- #Are individuals who have a partner happier? - Prepare a dataset ```r wave1 <- read_dta("anchor1_50percent_Eng.dta") wave1a <- wave1 %>% transmute( id, age, gender=as_factor(sex_gen) %>% fct_drop(),#var for gender & drop unused levels cohort=as_factor(cohort)%>% fct_drop(),#var for cohort relstat=as_factor(relstat)%>% fct_drop(), #var for relationship status sat6#var for life satisfaction ) ``` ```r tabyl(wave1a,gender)#no missing tabyl(wave1a,cohort) #no missing tabyl(wave1a,relstat) #34 missing tabyl(wave1a,sat6) #5 missing ``` --- #Are individuals who have a partner happier? - Prepare a dataset ```r wave1b <- wave1a %>% mutate(sat6=case_when(sat6<0 ~ NA, TRUE ~ as.numeric(sat6)), relstat=case_when( relstat=="-7 Incomplete data" ~ NA, TRUE ~ as.character(relstat)) %>% as_factor()#make it back to factor ) %>% drop_na() #drop all missings #now wave1b is clean without missing # N changed from 6201 to 6162. ``` Note: `drop_na()` is a function to drop missing. - if you just type `drop_na()`, R drops all cases that have missing info - if you type `drop_na(sat6)`, R drops cases that have missing info of sat6. --- #Are individuals who have a partner happier? .pull-left[ ```r tabyl(wave1b,relstat) ``` ``` ## relstat n percent ## 1 Never married single 2446 0.3969490425 ## 4 Married COHAB 1734 0.2814021422 ## 2 Never married LAT 1012 0.1642323921 ## 3 Never married COHAB 659 0.1069457968 ## 8 Divorced/separated COHAB 76 0.0123336579 ## 9 Widowed single 3 0.0004868549 ## 6 Divorced/separated single 145 0.0235313210 ## 7 Divorced/separated LAT 63 0.0102239533 ## 5 Married noncohabiting 23 0.0037325544 ## 10 Widowed LAT 1 0.0001622850 ``` too many categories of partnership status ] .pull-right[ - we can reclassify relstat to reduce categories ```r wave1b <- wave1b %>% mutate( relstat_new=case_when( relstat %in% c("1 Never married single") ~ "single", #treat 'never married single' as 'single' relstat %in% c("2 Never married LAT", "3 Never married COHAB", "4 Married COHAB", "5 Married noncohabiting") ~ 'partnered', #treat the 4 situations as "partnered" relstat %in% c("6 Divorced/separated single", "7 Divorced/separated LAT", "8 Divorced/separated COHAB") ~ 'separated', #treat the 3 situations as "separated" relstat %in% c("9 Widowed single", "10 Widowed LAT") ~ 'widowed' #treat the 2 situations as "widowed" ) %>% as_factor()# make relstat_new as factor ) ``` you can have your own way of classification as long as it make sense ] --- #Are individuals who have a partner happier? ```r tabyl(wave1b,relstat_new) ``` ``` ## relstat_new n percent ## single 2446 0.3969490425 ## partnered 3428 0.5563128854 ## separated 284 0.0460889322 ## widowed 4 0.0006491399 ``` To make a simple example, let us just compare partnered and single ```r wave1c <- wave1b %>% filter(relstat_new=='single'| relstat_new=='partnered') #only keep single and partnered wave1c$relstat_new <- wave1c$relstat_new %>% fct_drop() #drop unused level "widowed" and "separated" # N changed from 6162 to 5874. ``` --- #Sumarize mean of life satisfaction by partnership status `summarise()` allows you to calculate all kinds of statistics on the level of the groups you have specified. .pull-left[ ```r wave1c %>% filter(relstat_new== "single")%>% #get a data of singles summarise(sat_mean=mean(sat6)) #calculate the mean wave1c %>% filter(relstat_new== "partnered")%>% #a data of partnered summarise(sat_mean=mean(sat6)) ``` a bit too repetitive coding! ] .pull-right[ ``` ## # A tibble: 1 × 1 ## sat_mean ## <dbl> ## 1 7.50 ``` ``` ## # A tibble: 1 × 1 ## sat_mean ## <dbl> ## 1 7.78 ``` ] --- #or you can use grouping operations `group_by()` will transform your data into a grouped tibble. Afterwards, certain functions will operate on the level of those specified groups! ```r wave1c %>% group_by(relstat_new) %>% #make wave1c grouped by parntership status summarise(mean_sat=mean(sat6),#calculate mean of sat6 sd_sat=sd(sat6)) #calculate standard deviation of sat6 ``` ``` ## # A tibble: 2 × 3 ## relstat_new mean_sat sd_sat ## <fct> <dbl> <dbl> ## 1 single 7.50 1.80 ## 2 partnered 7.78 1.64 ``` ```r #or you can save result to a dataset named meantable meantable <- wave1c %>% group_by(relstat_new) %>% summarise(mean_sat=mean(sat6), sd_sat=sd(sat6)) ``` --- #Are individuals who have a partner happier? - we can check the detailed distribution of life satisfaction across partnership status ```r distribute_table<- tabyl(wave1c,sat6, relstat_new) %>% adorn_totals("row") %>%#add row % adorn_percentages("col") %>% #add col adorn_pct_formatting() #format the percentage distribute_table ``` ``` ## sat6 single partnered ## 0 0.4% 0.3% ## 1 0.4% 0.1% ## 2 0.8% 0.5% ## 3 2.0% 1.3% ## 4 2.6% 1.4% ## 5 6.3% 5.8% ## 6 9.6% 7.0% ## 7 19.1% 19.0% ## 8 29.5% 31.6% ## 9 18.7% 19.2% ## 10 10.5% 13.7% ## Total 100.0% 100.0% ``` --- #Are individuals who have a partner happier? - One can also plot the distribution of life satisfaction by partnership status: use ggplot2 t ```r ggplot(data = , mapping = aes(x=, y=) + # specify dataset, x, and y to ggplot <GEOM_FUNCTION>()+ # specify types of your chart, e.g. bar, point, line chart <COORDINATE_FUNCTION> # Change the default coordinate system, swap x and y axis #note: + is the symbol to connect different section of code ``` - ggplot2 contains many geom functions, which put layers of different types of geometric objects (e.g., points, bars, lines) over a coordinate system. - All geom functions depend on the mapping argument. It is paired with aes(), which stands for "aesthetic". Aesthetics are the visual properties of your plot. - The most important aesthetics of any graph are the y-axis and the x-axis. - But of course, aesthetics also means, among others, color, shape, size, and so on. --- #Use ggplot2 to plot descriptive statistics ```r ggplot(data = wave1c) ## Create an empty coordinate system for the dataset "wave1bc". ``` <img src="https://github.com/fancycmn/slide6/blob/main/S6_Pic3.png?raw=true" width="60%" style="display: block; margin: auto;" > --- #Use ggplot2 to plot descriptive statistics .pull-left[ ```r figure1<- ggplot(data = wave1c, mapping=aes(x = as.factor(sat6)))+ #as.factor(sat6) to make it x axis show 0 to 10, optional code geom_bar()#create a bar chart figure1 ``` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f1.png?raw=true" width="100%" style="display: block; margin: auto;" > ] .pull-right[ ```r figure2<- ggplot(data = wave1c, mapping=aes(x = as.factor(sat6)) )+ geom_bar()+ facet_wrap(~ relstat_new) #facet_wrap(~ relstat_new) to plot by relstat_new figure2 ``` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f2.png?raw=true" width="100%" style="display: block; margin: auto;" > ] --- #Use ggplot2 to plot descriptive statistics ```r figure3<- ggplot(data = wave1c, mapping=aes(x = as.factor(sat6), y = ..prop..,# ask R to show % rather than absolute count group = relstat_new,#calculate the % of sat6 within each group of relstat_new fill = relstat_new)#color the bars different by relstat_new )+ geom_bar()+ facet_wrap(~ relstat_new) figure3 ``` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f3.png?raw=true" width="55%" style="display: block; margin: auto;" > --- #Model the relationship between relationship status and life satisfaction Coming up with a model of life satisfaction: Can life satisfaction (i.e., `\(y\)`) be expressed as a function (i.e., `\(f()\)`) of another variable (i.e., `\(x\)`)? `$$\hat{y} = f(x)$$` The two critical questions are thus: 1. What is `\(f()\)`? 2. What is `\(x\)`? --- # A linear model of `\(y\)` <img src="https://merlin-intro-r.netlify.app/9-ols-i/img/LinearModel.png" width="50%" style="display: block; margin: auto;"> x could be partnership status, age, health, education --- # Ordinary Least Squares .pull-left[ We find `\(\alpha\)` and `\(\beta\)` by choosing the one line that minimizes the *residual sum of squares*: `$$\begin{align*} \ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ \ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\ \ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2} \end{align*}$$` ] .pull-right[ <img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;"> ] --- #Assumptions of OLS `$$y_i= \alpha+\beta*x_i + e_i$$` -- - About **the sample** - Assumption 1: The sample taken for the linear regression model must be drawn randomly from the population. -- - About **the relationship between Y and X** - Assumption 2: The dependent variable (Y) is a linear function of independent variables (X) and the error term. -- - About **the independent variables - X** - Assumption 3: There is no multi-collinearity (or perfect collinearity). In other words, there should be no linear relationship between the independent variables. -- - About **the error term** - Assumption 4: The distribution of error terms has zero mean and doesn’t depend on the independent variables(Xs). There must be no relationship between the Xs and the error term. - Assumption 5: There is homoscedasticity and no autocorrelation. - Assumption 6: Error terms should be normally distributed. --- # The `lm()` funcion ```r # Estimate a linear OLS model and assign the results to object ols. ols1 <- lm(formula = sat6 ~ relstat_new + age, data = wave1c) # sat6 is DV, should be written on the left side of ~ # relstat_new are IV, should be written on the right side of ~ summary(ols1) #to look at the summary of the result ``` ``` ## ## Call: ## lm(formula = sat6 ~ relstat_new + age, data = wave1c) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.9300 -0.6988 0.3012 1.3012 3.1185 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.25606 0.07187 114.88 <2e-16 *** ## relstat_newpartnered 0.60272 0.05248 11.49 <2e-16 *** ## age -0.03715 0.00312 -11.91 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.687 on 5871 degrees of freedom ## Multiple R-squared: 0.02967, Adjusted R-squared: 0.02934 ## F-statistic: 89.75 on 2 and 5871 DF, p-value: < 2.2e-16 ``` --- # The `lm()` funcion Factors are automatically treated as dummies, with the first level taken as reference. You can change by using `fct_relevel()`. ```r levels(wave1c$relstat_new) #first level is single ``` ``` ## [1] "single" "partnered" ``` ```r #you can change the reference level by using fct_recode wave1c$relstat_new <- fct_relevel(wave1c$relstat_new,"partnered")#make "partnered" first level ols1a <- lm(formula = sat6 ~ relstat_new + age, data = wave1c) summary(ols1a) ``` ``` ## ## Call: ## lm(formula = sat6 ~ relstat_new + age, data = wave1c) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.9300 -0.6988 0.3012 1.3012 3.1185 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.85878 0.09531 92.95 <2e-16 *** ## relstat_newsingle -0.60272 0.05248 -11.49 <2e-16 *** ## age -0.03715 0.00312 -11.91 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.687 on 5871 degrees of freedom ## Multiple R-squared: 0.02967, Adjusted R-squared: 0.02934 ## F-statistic: 89.75 on 2 and 5871 DF, p-value: < 2.2e-16 ``` --- #The `lm_robust` function - OLS based on a cross-sectional dataset often has issue of [heteroscedasticity](https://rpubs.com/cyobero/187387). Therefore,inference from the standard errors, t-values, p-values, and confidence intervals is likely to be wrong. - We thus need to estimate robust standard errors OLS, by using `estimatr::lm_robust()`. ```r wave1c$relstat_new <- fct_relevel(wave1c$relstat_new,"single")#change single back as 1st level ols2_robust <- lm_robust(formula = sat6 ~ relstat_new + age, data = wave1c) summary(ols2_robust) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ relstat_new + age, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.25606 0.069014 119.63 0.000e+00 8.12077 8.39135 ## relstat_newpartnered 0.60272 0.054838 10.99 7.879e-28 0.49522 0.71022 ## age -0.03715 0.003176 -11.70 2.982e-31 -0.04338 -0.03092 ## DF ## (Intercept) 5871 ## relstat_newpartnered 5871 ## age 5871 ## ## Multiple R-squared: 0.02967 , Adjusted R-squared: 0.02934 ## F-statistic: 82.07 on 2 and 5871 DF, p-value: < 2.2e-16 ``` --- #Export results .pull-left[ ```r #Export to html texreg::htmlreg(list(ols2_robust), include.ci = FALSE, file = "regression1.html") #html ``` ``` ## The table was written to the file 'regression1.html'. ``` ```r #you can also export to doc or excel, just change file="regression.doc" or "regression.xls" ``` ] .pull-right[ <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f4a.png?raw=true" style="display: block; margin: auto;" > ] - Intercept: when a person is aged 0 and single, his/her life satisfaction is 8.26 on average. - Partnership status: when age is controlled, a person who is partnered has a higher life satisfaction than a person who is single, by the amount of 0.60 on average. - Age: when partnership status is controlled, as the age increases by 1 year, the life satisfaction decreases by 0.04 on average. --- #Show all regression tables ```r texreg::htmlreg(list(ols1,ols2_robust), include.ci = FALSE, digits = 4, #keep 4 digits custom.model.names=c("OLS1", "OLS2_robust"), file = "regression2.html") #html ``` ``` ## The table was written to the file 'regression2.html'. ``` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f5a.png?raw=true" width="40%" style="display: block; margin: auto;" > --- #Interaction effect - Baseline model: IV=age, gender, relationship status - Baseline model + interaction between two categorical var: the relationship between partnership status and life satisfaction differs by gender. - For instance, having a partner may bring more happiness to women than to men. - Baseline model + interaction between one categorical and one continuous var: the relationship between age and life satisfaction differs by gender. - For instance, getting older reduces men's happiness more than women's ```r ols3 <- lm_robust(formula = sat6 ~ age +relstat_new+gender, data = wave1c) ols3a <- lm_robust(formula = sat6 ~ age + relstat_new*gender, data = wave1c) ols3b <- lm_robust(formula = sat6 ~ age*gender + relstat_new, data = wave1c) texreg::htmlreg(list(ols3,ols3a,ols3b), include.ci = FALSE, custom.model.names=c("OLS3", "OLS3a","OLS3b"), file = "regression_int1.html") #html ``` ``` ## The table was written to the file 'regression_int1.html'. ``` --- <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f6a.png?raw=true" width="50%" style="display: block; margin: auto;" > --- #Interaction effect - customize your regression results in a ```r texreg::htmlreg(list(ols3,ols3a,ols3b), include.ci = FALSE, #not including confidence interval digits = 4, #keep 4 digits single.row=TRUE,#single.row=TRUE is to place standard error the same line. custom.coef.names =c( "Intercept", "Age", "Partnered(ref.=Single)", "Female(ref.=male)", "Partnered x Female", "Age x Female" ),#you can create understandable names for the coef. custom.model.names=c("OLS3", "OLS3a","OLS3b"), file = "regression_int2.html" ) ``` ``` ## The table was written to the file 'regression_int2.html'. ``` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f7a.png?raw=true" width="40%" style="display: block; margin: auto;" > --- #Interaction effect: interpretation <font size="3">`$$Model1: sat6=8.2434 -0.0371age +0.5987partnered + 0.0261female$$` `$$Model2: sat6=8.2542 -0.0371age +0.5774partnered + 0.0003female + 0.0439partnered*female$$` `$$Model3: sat6=8.4905 -0.0471age +0.6067partnered - 0.4611 female + 0.0191age*female$$` <img src="https://github.com/fancycmn/25-Session3/blob/main/s4-f7a.png?raw=true" width="60%" style="display: block; margin: auto;" > --- #Take home 1. `group_by()`: Subsets a tibble into groups. Certain functions will operate afterwards by each of the specified groups. 2. `summarize()`: Allows you to estimate any kind of aggregate statistic. Combined with group_by(), it estimates those statistic by specified group. - mean - sd 3. `ggplot()`: to plot charts, often combined with GEOM_FUNCTION>() to specify the chart type(e.g. bar, line, etc.) 4. OLS regression - assumptions - `lm()` - `lm_robust` - `texreg::htmlreg()` - interactions and interpretations --- class: center, middle #[Exercise](https://rpubs.com/fancycmn/1346088)