class: center, middle, inverse, title-slide .title[ # Advanced quantitative data analysis ] .subtitle[ ## OLS ] .author[ ### Mengni Chen ] .institute[ ### Department of Sociology, University of Copenhagen ] --- <style type="text/css"> .remark-slide-content { font-size: 20px; padding: 20px 80px 20px 80px; } .remark-code, .remark-inline-code { background: #f0f0f0; } .remark-code { font-size: 12px; } </style> #Let's get ready ```r #install.packages("estimatr") #install.packages("texreg") # Add packages to library library(tidyverse) # Add the tidyverse package to my current library. library(haven) # Import data. library(Hmisc) # Weighting library(ggplot2) # Allows us to create nice figures. library(estimatr) # Allows us to estimate (cluster-)robust standard errors. library(texreg) # Allows us to make nicely-formatted Html & Latex regression tables. ``` --- #Prepare data a dataset of age, sex, relationship status, life satisfaction, weight - DV: life satisfaction - IV: age, sex, relationship status ```r wave1 <- read_dta("anchor1_50percent_Eng.dta") # sample size =6201 # wave1b <- wave1 %>% transmute( age=zap_labels(age), #Independent variable sat6=case_when(sat6<0 ~ as.numeric(NA), #specify when sat should be considered missing TRUE ~ as.numeric(sat6)) %>% zap_label(), #remove labels of sat6 cdweight=zap_label(cdweight), #cdweight is the variable telling the weight sex_gen=as_factor(sex_gen) %>% fct_drop(), #treat sex_gen as categorical, and drop unused level relstat=as_factor(relstat), #treat relationship status as categorical relstat=case_when(relstat=="-7 Incomplete data" ~ as.character(NA), TRUE ~ as.character(relstat)) %>% #specify when relstat should be considered missing as_factor() %>% #make relstat as a factor fct_drop() #drop unused levels in relstat ) %>% drop_na() #drop all observations with missing values in the sample # sample size change from 6201 to 6162 ``` --- #The relationship between marital status and life satisfaction - Generate a new variable for marital status ```r wave1c <- wave1b %>% mutate( marital1=case_when( relstat %in% c("1 Never married single","2 Never married LAT","3 Never married COHAB") ~ "Nevermarried", # when relstat has any of the three situations, I assign "Nevermarried" to new variable "marital1" relstat %in% c("4 Married COHAB","5 Married noncohabiting") ~ 'Married', # when relstat has any of the two situations, I assign "Married" to new variable "marital1" relstat %in% c("6 Divorced/separated single","7 Divorced/separated LAT","8 Divorced/separated COHAB") ~ 'Divorced', # when relstat has any of the three situations, I assign "Divorced" to new variable "marital1" relstat %in% c("9 Widowed single","10 Widowed LAT") ~ 'Widow' # when relstat has any of the two situations, I assign "Widow" to new variable "marital1" ) %>% as_factor()# I treat marital1 as a categorical variable ) %>% filter(marital1!= "Widow") #I drop observations when he/she is widowed. # sample size change to 6158 after dropping those widowed ``` --- #Averge life satisfaction across marital status To study the relationship between marital status and life satisfaction, we can estimate the average sat6 across marital status to have a first look ```r mar_sat<- wave1c %>% group_by(marital1) %>% dplyr::summarise( mean_sat6=mean(sat6) ) mar_sat ``` ``` ## # A tibble: 3 × 2 ## marital1 mean_sat6 ## <fct> <dbl> ## 1 Nevermarried 7.60 ## 2 Married 7.80 ## 3 Divorced 6.68 ``` --- #Averge life satisfaction across marital status, how to bring in confidence interval The standard error tells you the standard deviation of your result (i.e., mean, percentage,β etc.), if we would conduct the same study over and over again. [? difference between Standard deviation (SD) and Standard error (SE)](https://www.investopedia.com/ask/answers/042415/what-difference-between-standard-error-means-and-standard-deviation.asp) Standard error of the mean: `\(\text{SE}(\bar{x}) = \frac{ \text{SD(x)}}{\sqrt{n}}\)` Standard error of a proportion: `\(\text{SE}(p) = \sqrt{\frac{p(1-p)}{n}}\)` The 95% Confidence Interval tries to capture the range in which the true value lies with an accuracy of 95%. `$$\text{CI}_{95} = \bar{x} \pm \text{Critical Value} * \text{SE}(\bar{x})$$` `$$\text{CI}_{95} = p \pm \text{Critical Value} * \text{SE}(p)$$` To have critical values, you can use `qt(p=0.975, df=df)` where p=0.975 to specifiy 95%, df is degree of freedom, you can use sample size as its degree of freedom. --- #Averge life satisfaction across marital status, how to bring in confidence interval .pull-left[ When we don't consider weight ```r mar_sat<- wave1c %>% group_by(marital1) %>% dplyr::summarise( n=n(), #sample size mean_sat6=mean(x = sat6),#mean of sat6 sd_sat6=sd(x=sat6), #standard deviation se_sat6=sd_sat6/sqrt(n), #standard error sat6_lci=mean_sat6 - qt(p=0.975, df=n)*se_sat6, #low confidence interval sat6_hci=mean_sat6 + qt(p=0.975, df=n)*se_sat6 #high confidence interval ) %>% select(marital1, mean_sat6, sat6_lci, sat6_hci) #Note: qt(p=0.975, df=df) is to specify critial value at 95% Confidence interval mar_sat ``` ``` ## # A tibble: 3 × 4 ## marital1 mean_sat6 sat6_lci sat6_hci ## <fct> <dbl> <dbl> <dbl> ## 1 Nevermarried 7.60 7.55 7.66 ## 2 Married 7.80 7.73 7.88 ## 3 Divorced 6.68 6.42 6.93 ``` ] .pull-right[ When we consider weight ```r mar_sat<- wave1c %>% group_by(marital1) %>% dplyr::summarise( wn=sum(cdweight), #weighted sample size m_wsat6=wtd.mean(x = sat6, weights = cdweight), #weighted mean var_wsat6=wtd.var(x = sat6, weights = cdweight), #weighted variance sd_wsat6=sqrt(var_wsat6), #standard deviation by take a square root of variance se_wsat6=sd_wsat6/sqrt(wn), #standard error following the formular wsat6_lci=m_wsat6-qt(p=0.975, df=wn)*se_wsat6, #low confidence interval wsat6_hci=m_wsat6+qt(p=0.975, df=wn)*se_wsat6 #high confidence interval ) %>% select(marital1,m_wsat6,wsat6_lci,wsat6_hci ) mar_sat ``` ``` ## # A tibble: 3 × 4 ## marital1 m_wsat6 wsat6_lci wsat6_hci ## <fct> <dbl> <dbl> <dbl> ## 1 Nevermarried 7.51 7.46 7.57 ## 2 Married 7.79 7.72 7.87 ## 3 Divorced 6.50 6.22 6.78 ``` ] --- #Avergea life satisfaction across marital status .pull-left[ <img src="https://github.com/fancycmn/Slide5/blob/main/S5_Pic1.png?raw=true" width="90%" style="display: block; margin: auto;" > ] .pull-right[ ```r ggplot(data = mar_sat, mapping = aes(y = m_wsat6, x = marital1)) + geom_bar(stat = "identity") + geom_errorbar(mapping = aes(ymin = wsat6_lci, ymax = wsat6_hci), width = 0.55) + # Add layer of errorbars. labs(y = "Life satisfaction", x = "Marital1")+ scale_y_continuous(limits=c(0, 10),breaks=seq(0,10,2)) ``` ] --- #Model the relationship between marital status and life satisfaction Coming up with a model of life satisfaction: Can life satisfaction (i.e., `\(y\)`) be expressed as a function (i.e., `\(f()\)`) of another variable (i.e., `\(x\)`)? `$$\hat{y} = f(x)$$` The two critical questions are thus: 1. What is `\(f()\)`? 2. What is `\(x\)`? --- # A linear model of `\(y\)` <img src="https://merlin-intro-r.netlify.app/9-ols-i/img/LinearModel.png" width="60%" style="display: block; margin: auto;"> x could be age, health conditions, marital status, income --- # Ordinary Least Squares .pull-left[ We find `\(\alpha\)` and `\(\beta\)` by choosing the one line that minimizes the *residual sum of squares*: `$$\begin{align*} \ & min \text{RSS} = \min \sum_{i=1}^{n} e_{i}^{2} \\ \ & = \min \sum_{i=1}^{n} y_{i} - \hat{y_{i}} \\ \ & = \min \sum_{i=1}^{n} (y_{i} - (\alpha + \beta x_{i}))^{2} \end{align*}$$` ] .pull-right[ <img src="https://miro.medium.com/max/625/0*gglavDlTUWKn4Loe" width="100%" style="display: block; margin: auto;"> ] --- # The `lm()` funcion ```r # Estimate a linear OLS model and assign the results to object ols. ols1 <- lm(formula = sat6 ~ age, data = wave1c) # sat6 is DV, should be written on the left side of ~ # age is IV, should be written on the right side of ~ summary(ols1) ``` ``` ## ## Call: ## lm(formula = sat6 ~ age, data = wave1c) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.8676 -0.8215 0.2016 1.1785 2.6630 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.213634 0.071954 114.151 <2e-16 *** ## age -0.023069 0.002649 -8.708 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.737 on 6156 degrees of freedom ## Multiple R-squared: 0.01217, Adjusted R-squared: 0.01201 ## F-statistic: 75.84 on 1 and 6156 DF, p-value: < 2.2e-16 ``` --- #Weighted linear OLS - The weights argument allows you to use (post-stratification) weights. ```r # Estimate a weighted linear OLS model. ols2_weight <- lm(formula = sat6 ~ age, data = wave1c, weights = cdweight)#specify weight here summary(ols2_weight) ``` ``` ## ## Call: ## lm(formula = sat6 ~ age, data = wave1c, weights = cdweight) ## ## Weighted Residuals: ## Min 1Q Median 3Q Max ## -12.7388 -0.6955 0.2364 1.1134 4.9252 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.227414 0.077310 106.42 <2e-16 *** ## age -0.025337 0.002766 -9.16 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.787 on 6156 degrees of freedom ## Multiple R-squared: 0.01345, Adjusted R-squared: 0.01329 ## F-statistic: 83.9 on 1 and 6156 DF, p-value: < 2.2e-16 ``` --- #Weighted linear OLS with `lm_robust` - But beware, `lm()` inference (e.g., standard errors, t-values, p-values, and confidence intervals) will be wrong, because weights introduce [heteroscedasticity](.https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity) - We thus need to estimate heteroscedasticity-robust standard errors, by using `estimatr::lm_robust()`. ```r ols2_weight_robust <- lm_robust(formula = sat6 ~ age, data = wave1c, weights = cdweight) summary(ols2_weight_robust) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age, data = wave1c, weights = cdweight) ## ## Weighted, Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 8.22741 0.076592 107.419 0.000e+00 8.07727 8.37756 6156 ## age -0.02534 0.003011 -8.415 4.833e-17 -0.03124 -0.01943 6156 ## ## Multiple R-squared: 0.01345 , Adjusted R-squared: 0.01329 ## F-statistic: 70.81 on 1 and 6156 DF, p-value: < 2.2e-16 ``` --- #Categorical predictors Factors are automatically treated as dummies, with the first level taken as reference. Thus if you want to change the reference, you need to recode the levels of the factor using the `forcats` package. ```r #as marital1 is a factor in wave1c, it is automatically treated as dummies. ols3_cat <- lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight) summary(ols3_cat) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight) ## ## Weighted, Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF ## (Intercept) 8.54409 0.092680 92.189 0.000e+00 8.36241 8.72578 6154 ## age -0.04426 0.004283 -10.334 7.863e-25 -0.05266 -0.03587 6154 ## marital1Married 0.73801 0.077592 9.511 2.635e-21 0.58590 0.89012 6154 ## marital1Divorced -0.52564 0.168952 -3.111 1.872e-03 -0.85684 -0.19443 6154 ## ## Multiple R-squared: 0.04629 , Adjusted R-squared: 0.04583 ## F-statistic: 58.52 on 3 and 6154 DF, p-value: < 2.2e-16 ``` --- #Categorical predictors recode the levels of marital1 ```r wave1c$marital1 <- fct_relevel(wave1c$marital1, "Divorced", "Nevermarried", "Married") #use fct_level function to re-level marital1 ols3_cat_relevel <- lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight) summary(ols3_cat_relevel) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + marital1, data = wave1c, weights = cdweight) ## ## Weighted, Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.01846 0.215844 37.149 9.858e-273 7.59532 8.44159 ## age -0.04426 0.004283 -10.334 7.863e-25 -0.05266 -0.03587 ## marital1Nevermarried 0.52564 0.168952 3.111 1.872e-03 0.19443 0.85684 ## marital1Married 1.26364 0.161271 7.836 5.462e-15 0.94750 1.57979 ## DF ## (Intercept) 6154 ## age 6154 ## marital1Nevermarried 6154 ## marital1Married 6154 ## ## Multiple R-squared: 0.04629 , Adjusted R-squared: 0.04583 ## F-statistic: 58.52 on 3 and 6154 DF, p-value: < 2.2e-16 ``` --- #Interaction effects An example of interaction between marital status and sex. ```r #as we re-level marital1 in last slides, we now re-level it back wave1c$marital1 <- fct_relevel(wave1c$marital1,"Nevermarried", "Married","Divorced") ols4_interactsex <- lm_robust(formula = sat6 ~ age + marital1*sex_gen, data = wave1c, weights = cdweight) # use * to specific the interaction summary(ols4_interactsex) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age + marital1 * sex_gen, data = wave1c, ## weights = cdweight) ## ## Weighted, Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.51700 0.096867 87.9243 0.000e+00 ## age -0.04414 0.004291 -10.2876 1.267e-24 ## marital1Married 0.77335 0.101589 7.6126 3.091e-14 ## marital1Divorced -0.67740 0.292832 -2.3133 2.074e-02 ## sex_gen2 Female 0.05380 0.061767 0.8711 3.837e-01 ## marital1Married:sex_gen2 Female -0.07577 0.111197 -0.6814 4.956e-01 ## marital1Divorced:sex_gen2 Female 0.24337 0.336766 0.7227 4.699e-01 ## CI Lower CI Upper DF ## (Intercept) 8.32710 8.70689 6151 ## age -0.05255 -0.03573 6151 ## marital1Married 0.57420 0.97250 6151 ## marital1Divorced -1.25146 -0.10335 6151 ## sex_gen2 Female -0.06728 0.17489 6151 ## marital1Married:sex_gen2 Female -0.29376 0.14221 6151 ## marital1Divorced:sex_gen2 Female -0.41681 0.90355 6151 ## ## Multiple R-squared: 0.04672 , Adjusted R-squared: 0.04579 ## F-statistic: 29.78 on 6 and 6151 DF, p-value: < 2.2e-16 ``` --- #Interaction effects An example of interaction between age and sex ```r ols5_interactage <- lm_robust(formula = sat6 ~ age*sex_gen + marital1, data = wave1c, weights = cdweight) summary(ols5_interactage) ``` ``` ## ## Call: ## lm_robust(formula = sat6 ~ age * sex_gen + marital1, data = wave1c, ## weights = cdweight) ## ## Weighted, Standard error type: HC2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 8.71145 0.121206 71.873 0.000e+00 8.473839 8.94905 ## age -0.05118 0.005264 -9.723 3.485e-22 -0.061501 -0.04086 ## sex_gen2 Female -0.36254 0.151283 -2.396 1.659e-02 -0.659106 -0.06597 ## marital1Married 0.72461 0.078032 9.286 2.175e-20 0.571641 0.87758 ## marital1Divorced -0.54587 0.169630 -3.218 1.298e-03 -0.878403 -0.21333 ## age:sex_gen2 Female 0.01517 0.005905 2.569 1.021e-02 0.003596 0.02675 ## DF ## (Intercept) 6152 ## age 6152 ## sex_gen2 Female 6152 ## marital1Married 6152 ## marital1Divorced 6152 ## age:sex_gen2 Female 6152 ## ## Multiple R-squared: 0.04763 , Adjusted R-squared: 0.04686 ## F-statistic: 35.51 on 5 and 6152 DF, p-value: < 2.2e-16 ``` --- #Regression tables Present sevearl regresion together; report standard errors, not confidence intervals. ```r texreg::screenreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), include.ci = FALSE, digits = 3) # include.ci=FALSE means don't include CI, digit=3 to specify the digits of your result. ``` ``` ## ## ====================================================================================================== ## Model 1 Model 2 Model 3 Model 4 Model 5 ## ------------------------------------------------------------------------------------------------------ ## (Intercept) 8.214 *** 8.227 *** 8.544 *** 8.517 *** 8.711 *** ## (0.072) (0.077) (0.093) (0.097) (0.121) ## age -0.023 *** -0.025 *** -0.044 *** -0.044 *** -0.051 *** ## (0.003) (0.003) (0.004) (0.004) (0.005) ## marital1Married 0.738 *** 0.773 *** 0.725 *** ## (0.078) (0.102) (0.078) ## marital1Divorced -0.526 ** -0.677 * -0.546 ** ## (0.169) (0.293) (0.170) ## sex_gen2 Female 0.054 -0.363 * ## (0.062) (0.151) ## marital1Married:sex_gen2 Female -0.076 ## (0.111) ## marital1Divorced:sex_gen2 Female 0.243 ## (0.337) ## age:sex_gen2 Female 0.015 * ## (0.006) ## ------------------------------------------------------------------------------------------------------ ## R^2 0.012 0.013 0.046 0.047 0.048 ## Adj. R^2 0.012 0.013 0.046 0.046 0.047 ## Num. obs. 6158 6158 6158 6158 6158 ## RMSE 1.787 1.757 1.757 1.756 ## ====================================================================================================== ## *** p < 0.001; ** p < 0.01; * p < 0.05 ``` --- #Export regression tables ```r #Export to word texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), include.ci = FALSE, digits = 3, file = "MyOLSModels.doc") # Word table ``` ``` ## The table was written to the file 'MyOLSModels.doc'. ``` ```r #Export to html texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), include.ci = FALSE, file = "MyOLSModels.html") #html ``` ``` ## The table was written to the file 'MyOLSModels.html'. ``` ```r #Export to excel texreg::htmlreg(list(ols1, ols2_weight_robust, ols3_cat, ols4_interactsex, ols5_interactage), include.ci = FALSE, file = "MyOLSModels.xls") #excel ``` ``` ## The table was written to the file 'MyOLSModels.xls'. ``` --- #Take home 1. How to calcuate confidence interval 2. Linear regression - With weight - Robust - interaction 3. Generate regression tables 4. Important code - `qt(p=0.975, df=n)`, to select critical values for 95% confidence interval - lm(formula = your DV ~ your IV, data = your dataset) - lm_robust(formula = your DV ~ your IV, data = your dataset, weights= your weight) - texreg::screenreg() --- class: center, middle #[Exercise](https://rpubs.com/fancycmn/949336)