class: center, middle, inverse, title-slide .title[ # Advanced quantitative data analysis ] .subtitle[ ## Cross-sectional data analysis I ] .author[ ### Mengni Chen ] .institute[ ### Department of Sociology, University of Copenhagen ] --- <style type="text/css"> .remark-slide-content { font-size: 20px; padding: 20px 80px 20px 80px; } .remark-code, .remark-inline-code { background: #f0f0f0; } .remark-code { font-size: 14px; } </style> #Let's get ready ```r #install two new packages install.packages("estimatr") #for robust standard errors OLS install.packages("texreg") #output the regression results ``` ```r library(tidyverse) # see session 3. library(haven) # Import data, see session 3. library(janitor) #cleaning data library(estimatr) # Allows us to estimate robust standard errors OLS. library(texreg) # Allows us to make nicely-formatted Html & Latex regression tables. library(ggplot2) # Allows us to create nice figures ``` --- #Prepare data a dataset of age, sex, relationship status, life satisfaction - DV: life satisfaction - IV: age, sex, relationship status ```r wave1 <- read_dta("anchor1_50percent_Eng.dta") # sample size =6201 ``` ```r #have a initial look at the variables wave1$sat6 %>% as_factor() %>% tabyl() # 5 cases with "no answer" or "Don't know" wave1$sex_gen %>% as_factor() %>% tabyl() # no missing wave1$relstat %>% as_factor() %>% tabyl() # 34 cases with "incomplete data" ``` --- #Prepare data a dataset of age, sex, relationship status, life satisfaction - DV: life satisfaction - IV: age, sex, relationship status ```r wave1b <- wave1 %>% transmute( age, sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat should be considered missing TRUE ~ as.numeric(sat6)), sex_gen=as_factor(sex_gen) %>% fct_drop(), #treat sex_gen as categorical, and drop unused level relstat=as_factor(relstat), #treat relationship status as categorical relstat_new1=case_when( relstat=="-7 Incomplete data" ~ as.character(NA),#specify when it should be missing TRUE ~ as.character(relstat) ) %>% as_factor() %>% fct_drop() #make relstat as a factor, and then drop unused levels in relstat_new1 ) %>% drop_na() #drop all observations with missing values in the sample # sample size change from 6201 to 6162 ``` --- #Prepare data a dataset of age, sex, relationship status, life satisfaction - DV: life satisfaction - IV: age, sex, relationship status ```r tabyl(wave1b,relstat_new1) ``` ``` ## relstat_new1 n percent ## 1 Never married single 2446 0.3969490425 ## 4 Married COHAB 1734 0.2814021422 ## 2 Never married LAT 1012 0.1642323921 ## 3 Never married COHAB 659 0.1069457968 ## 8 Divorced/separated COHAB 76 0.0123336579 ## 9 Widowed single 3 0.0004868549 ## 6 Divorced/separated single 145 0.0235313210 ## 7 Divorced/separated LAT 63 0.0102239533 ## 5 Married noncohabiting 23 0.0037325544 ## 10 Widowed LAT 1 0.0001622850 ``` --- #The relationship between having a partner and life satisfaction - Generate a new variable for relationship status ```r wave1c <- wave1b %>% mutate( relstat_new2=case_when( relstat_new1 %in% c("1 Never married single") ~ "single", #treat 'never married single' as 'single' relstat_new1 %in% c("2 Never married LAT", "3 Never married COHAB", "4 Married COHAB", "5 Married noncohabiting") ~ 'partnered', #treat the 4 situations as "partnered" relstat_new1 %in% c("6 Divorced/separated single", "7 Divorced/separated LAT", "8 Divorced/separated COHAB") ~ 'separated', #treat the 3 situations as "separated" relstat_new1 %in% c("9 Widowed single", "10 Widowed LAT") ~ 'widowed' #treat the 2 situations as "widowed" ) %>% as_factor()# make relstat_new2 as factor ) %>% filter(relstat_new2!= "widowed" & relstat_new2!= "separated") #only 4 widowed and 284 separated, dropping. # sample size change to 5874 after dropping widowed and separated ``` --- #Exploring the relationship between having a partner and life satisfaction First, we can estimate the average sat6 across relationship status to have a first look .pull-left[ ```r tabyl(wave1c,relstat_new2) ``` ``` ## relstat_new2 n percent ## single 2446 0.4164113 ## partnered 3428 0.5835887 ## separated 0 0.0000000 ## widowed 0 0.0000000 ``` ```r #check the frequency of each category ``` ] .pull-right[ ```r wave1c$relstat_new2 <- fct_drop(wave1c$relstat_new2) #drop unused levels rel_sat<- wave1c %>% group_by(relstat_new2) %>% dplyr::summarise( mean_sat6=mean(sat6), #calculate the mean of sat6 by marital1 ) rel_sat ``` ``` ## # A tibble: 2 × 2 ## relstat_new2 mean_sat6 ## <fct> <dbl> ## 1 single 7.50 ## 2 partnered 7.78 ``` ] --- #Model the relationship between relationship status and life satisfaction Coming up with a model of life satisfaction: Can life satisfaction (i.e., `\(y\)`) be expressed as a function (i.e., `\(f()\)`) of another variable (i.e., `\(x\)`)? `$$\hat{y} = f(x)$$` The two critical questions are thus: 1. What is `\(f()\)`? 2. What is `\(x\)`? --- # A linear model of `\(y\)` <img src="https://merlin-intro-r.netlify.app/9-ols-i/img/LinearModel.png" width="60%" style="display: block; margin: auto;"> x could be age, health conditions, relationship status, income --- # Ordinary Least Squares .pull-left[ We find `\(\alpha\)` and `\(\beta\)` by choosing the one line that minimizes the *residual sum of squares*: `$$\begin{align*} \ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ \ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\ \ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2} \end{align*}$$` ] .pull-right[ <img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;"> ] --- #Assumptions of OLS `$$y_i= \alpha+\beta*x_i + e_i$$` -- - About **the sample** - Assumption 1: The sample taken for the linear regression model must be drawn randomly from the population. -- - About **the relationship between Y and X** - Assumption 2: The dependent variable (Y) is a linear function of independent variables (X) and the error term. -- - About **the independent variables - X** - Assumption 3: There is no multi-collinearity (or perfect collinearity). In other words, there should be no linear relationship between the independent variables. -- - About **the error term** - Assumption 4: The distribution of error terms has zero mean and doesn’t depend on the independent variables(Xs). There must be no relationship between the Xs and the error term. - Assumption 5: There is homoscedasticity and no autocorrelation. - Assumption 6: Error terms should be normally distributed. --- # The `lm()` funcion ```r # Estimate a linear OLS model and assign the results to object ols. ols1 <- lm(formula = sat6 ~ age, data = wave1c) # sat6 is DV, should be written on the left side of ~ # age is IV, should be written on the right side of ~ summary(ols1) ``` ``` ## ## Call: ## lm(formula = sat6 ~ age, data = wave1c) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.8540 -0.6709 0.3291 1.1827 2.5671 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.128590 0.071795 113.220 < 2e-16 *** ## age -0.018309 0.002684 -6.822 9.88e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.705 on 5872 degrees of freedom ## Multiple R-squared: 0.007863, Adjusted R-squared: 0.007694 ## F-statistic: 46.54 on 1 and 5872 DF, p-value: 9.88e-12 ``` --- #The `lm_robust` function - OLS based on a cross-sectional dataset often has issue of [heteroscedasticity](https://rpubs.com/cyobero/187387). Therefore,inference from the standard errors, t-values, p-values, and confidence intervals is likely to be wrong. - We thus need to estimate robust standard errors OLS, by using `estimatr::lm_robust()`. ```r ols2_robust <- lm_robust(formula = sat6 ~ age, data = wave1c) summary(ols2_robust) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 8.12859 0.06827 119.063 0.000e+00 7.99475 8.26243 5872 ## age -0.01831 0.00263 -6.962 3.715e-12 -0.02346 -0.01315 5872 ## ## Multiple R-squared: 0.007863 , Adjusted R-squared: 0.007694 ## F-statistic: 48.47 on 1 and 5872 DF, p-value: 3.715e-12 ``` --- #Categorical predictors Factors are automatically treated as dummies, with the first level taken as reference. Thus if you want to change the reference, you need to recode the levels of the factor using the `forcats` package. ```r #as relstat_new2 is a factor in wave1c, it is automatically treated as dummies. ols3_cat <- lm_robust(formula = sat6 ~ age +relstat_new2, data = wave1c) summary(ols3_cat) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.25606 0.069014 119.63 0.000e+00 8.12077 8.39135 ## age -0.03715 0.003176 -11.70 2.982e-31 -0.04338 -0.03092 ## relstat_new2partnered 0.60272 0.054838 10.99 7.879e-28 0.49522 0.71022 ## DF ## (Intercept) 5871 ## age 5871 ## relstat_new2partnered 5871 ## ## Multiple R-squared: 0.02967 , Adjusted R-squared: 0.02934 ## F-statistic: 82.07 on 2 and 5871 DF, p-value: < 2.2e-16 ``` --- #Categorical predictors .pull-left[ ```r #Export to html texreg::htmlreg(list(ols3_cat), include.ci = FALSE, file = "OLS3_cat(2024).html") #html ``` ``` ## The table was written to the file 'OLS3_cat(2024).html'. ``` ] .pull-right[ <img src="https://github.com/fancycmn/24-Session-5/blob/main/Figure1.JPG?raw=true" style="display: block; margin: auto;" > ] - Intercept: when a person is aged 0 and single, his/her life satisfaction is 8.26 on average. - Relationship status: when age is controlled, a person who is partnered has a higher life satisfaction than a person who is single, by the amount of 0.602 on average. - Age: when relationship status is controlled, as the age increases by 1 year, the life satisfaction decreases by 0.037 on average. --- #Categorical predictors: change the reference level recode the levels of relstat_new2 ```r wave1c$relstat_new2 <- fct_relevel(wave1c$relstat_new2, "partnered", "single") #use fct_level function to re-level relstat_new2, now "partnered" is the 1st level and will be treated as the reference category ols3_cat_relevel <- lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c) summary(ols3_cat_relevel) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + relstat_new2, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 8.85878 0.095240 93.02 0.000e+00 8.67207 9.04549 5871 ## age -0.03715 0.003176 -11.70 2.982e-31 -0.04338 -0.03092 5871 ## relstat_new2single -0.60272 0.054838 -10.99 7.879e-28 -0.71022 -0.49522 5871 ## ## Multiple R-squared: 0.02967 , Adjusted R-squared: 0.02934 ## F-statistic: 82.07 on 2 and 5871 DF, p-value: < 2.2e-16 ``` --- #Regression tables Present sevearl regresion together; report standard errors, not confidence intervals. ```r texreg::screenreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), include.ci = FALSE, digits = 4) # include.ci=FALSE means don't include CI, digit=4 to specify the digits of your result. ``` ``` ## ## ================================================================================= ## Model 1 Model 2 Model 3 Model 4 ## --------------------------------------------------------------------------------- ## (Intercept) 8.1286 *** 8.1286 *** 8.2561 *** 8.8588 *** ## (0.0718) (0.0683) (0.0690) (0.0952) ## age -0.0183 *** -0.0183 *** -0.0371 *** -0.0371 *** ## (0.0027) (0.0026) (0.0032) (0.0032) ## relstat_new2partnered 0.6027 *** ## (0.0548) ## relstat_new2single -0.6027 *** ## (0.0548) ## --------------------------------------------------------------------------------- ## R^2 0.0079 0.0079 0.0297 0.0297 ## Adj. R^2 0.0077 0.0077 0.0293 0.0293 ## Num. obs. 5874 5874 5874 5874 ## RMSE 1.7053 1.6866 1.6866 ## ================================================================================= ## *** p < 0.001; ** p < 0.01; * p < 0.05 ``` --- #Export regression tables ```r #Export to word texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), include.ci = FALSE, digits = 4, file = "MyOLSModels(2024).doc") # Word table ``` ``` ## The table was written to the file 'MyOLSModels(2024).doc'. ``` ```r #Export to html texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), include.ci = FALSE, file = "MyOLSModels(2024).html") #html ``` ``` ## The table was written to the file 'MyOLSModels(2024).html'. ``` ```r #Export to excel texreg::htmlreg(list(ols1,ols2_robust,ols3_cat,ols3_cat_relevel), include.ci = FALSE, file = "MyOLSModels(2024).xls") #excel ``` ``` ## The table was written to the file 'MyOLSModels(2024).xls'. ``` --- #Interaction effect Baseline model: IV=age, gender, relationship status ```r wave1c$relstat_new2 <- fct_relevel(wave1c$relstat_new2, "single") ols4 <- lm_robust(formula = sat6 ~ age +relstat_new2+sex_gen, data = wave1c) summary(ols4) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + relstat_new2 + sex_gen, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.24343 0.07218 114.2062 0.000e+00 8.10193 8.38493 ## age -0.03708 0.00318 -11.6585 4.557e-31 -0.04331 -0.03084 ## relstat_new2partnered 0.59868 0.05533 10.8199 4.980e-27 0.49021 0.70715 ## sex_gen2 Female 0.02607 0.04444 0.5866 5.575e-01 -0.06104 0.11318 ## DF ## (Intercept) 5870 ## age 5870 ## relstat_new2partnered 5870 ## sex_gen2 Female 5870 ## ## Multiple R-squared: 0.02972 , Adjusted R-squared: 0.02923 ## F-statistic: 54.86 on 3 and 5870 DF, p-value: < 2.2e-16 ``` --- #Interaction effect x1*x2 The effect of relationship status (a categorical var) on life satisfaction differs by gender ```r ols4_inter1 <- lm_robust(formula = sat6 ~ age +relstat_new2*sex_gen, data = wave1c)#relstat_new2*sex_gen is the interaction term summary(ols4_inter1) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + relstat_new2 * sex_gen, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value ## (Intercept) 8.2541902 0.075707 109.027431 ## age -0.0370647 0.003181 -11.651397 ## relstat_new2partnered 0.5774176 0.069724 8.281511 ## sex_gen2 Female 0.0003469 0.071477 0.004853 ## relstat_new2partnered:sex_gen2 Female 0.0439392 0.091082 0.482416 ## Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 0.000e+00 8.1058 8.40260 5869 ## age 4.947e-31 -0.0433 -0.03083 5869 ## relstat_new2partnered 1.492e-16 0.4407 0.71410 5869 ## sex_gen2 Female 9.961e-01 -0.1398 0.14047 5869 ## relstat_new2partnered:sex_gen2 Female 6.295e-01 -0.1346 0.22249 5869 ## ## Multiple R-squared: 0.02976 , Adjusted R-squared: 0.0291 ## F-statistic: 41.14 on 4 and 5869 DF, p-value: < 2.2e-16 ``` --- #Interaction effect x1*x2 The effect of age (a continous var) on life satisfaction differs by gender ```r ols4_inter2 <- lm_robust(formula = sat6 ~ age*sex_gen +relstat_new2, data = wave1c)#age*sex_gen is the interaction term summary(ols4_inter2) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age * sex_gen + relstat_new2, data = wave1c) ## ## Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.49053 0.096717 87.787 0.000e+00 8.300929 8.68013 ## age -0.04707 0.004187 -11.241 5.050e-29 -0.055273 -0.03886 ## sex_gen2 Female -0.46112 0.136611 -3.375 7.417e-04 -0.728930 -0.19332 ## relstat_new2partnered 0.60666 0.055323 10.966 1.034e-27 0.498210 0.71512 ## age:sex_gen2 Female 0.01913 0.005219 3.665 2.496e-04 0.008895 0.02936 ## DF ## (Intercept) 5869 ## age 5869 ## sex_gen2 Female 5869 ## relstat_new2partnered 5869 ## age:sex_gen2 Female 5869 ## ## Multiple R-squared: 0.03186 , Adjusted R-squared: 0.0312 ## F-statistic: 44.26 on 4 and 5869 DF, p-value: < 2.2e-16 ``` --- #Interaction effect: export results ```r texreg::screenreg(list(ols4,ols4_inter1,ols4_inter2), include.ci = FALSE, digits = 4, single.row=TRUE, custom.coef.names =c( "Intercept", "Age", "Partnered(ref.=Single)", "Female(ref.=male)", "Partnered x Female", "Age x Female" )#you can create understandable names for the coef. ) #single.row=TRUE is to place standard error the same line. ``` ``` ## ## ============================================================================================== ## Model 1 Model 2 Model 3 ## ---------------------------------------------------------------------------------------------- ## Intercept 8.2434 (0.0722) *** 8.2542 (0.0757) *** 8.4905 (0.0967) *** ## Age -0.0371 (0.0032) *** -0.0371 (0.0032) *** -0.0471 (0.0042) *** ## Partnered(ref.=Single) 0.5987 (0.0553) *** 0.5774 (0.0697) *** 0.6067 (0.0553) *** ## Female(ref.=male) 0.0261 (0.0444) 0.0003 (0.0715) -0.4611 (0.1366) *** ## Partnered x Female 0.0439 (0.0911) ## Age x Female 0.0191 (0.0052) *** ## ---------------------------------------------------------------------------------------------- ## R^2 0.0297 0.0298 0.0319 ## Adj. R^2 0.0292 0.0291 0.0312 ## Num. obs. 5874 5874 5874 ## RMSE 1.6867 1.6868 1.6849 ## ============================================================================================== ## *** p < 0.001; ** p < 0.01; * p < 0.05 ``` --- #Interaction effect: interpretation <font size="2">`$$Model1: sat6=8.2434 -0.0371age +0.5987partnered + 0.0261female$$` `$$Model2: sat6=8.2542 -0.0371age +0.5774partnered + 0.0003female + 0.0439partnered*female$$` `$$Model3: sat6=8.4905 -0.0471age +0.6067partnered - 0.4611 female + 0.0191age*female$$` <img src="https://github.com/fancycmn/24-Session-5/blob/main/Figure2a.JPG?raw=true" style="display: block; margin: auto;" > --- #Take home 1. Exploratory analysis: frequency, mean 2. Linear regression - Robust standard errors OLS - Interction terms 3. Generate regression tables 4. Important code - lm(formula = your DV ~ your IV, data = your dataset) - lm_robust(formula = your DV ~ your IV, data = your dataset) - lm_robust(formula = your DV ~ IV1 + IV2*IV3, data = your dataset) - texreg::screenreg() --- class: center, middle #[Exercise](https://rpubs.com/fancycmn/1224073)