Slides for Week 11 is available here
Before adding interaction effects to the model, test the model for potential violations of OLS assumptions. To correct for violations: center age, add a squared age term, and log transform income variable in a new temporary dataset.
Basic Set Up and install and load all necessary libraries
options(scipen=999, digits = 3)
library(ggplot2)
library(ggthemes)
library(ggfortify)
library(car)
# install.packages(c("sjPlot", "sjmisc"))
library(sjPlot)
library(sjmisc)
# install.packages("stargazer")
library(stargazer)
Step 0. Look at the Data Documentation Check the
CPS2015_Description.pdf
in your project folder.
Variables in the dataset:
cps_new <- read.csv("cps_lab7.csv",header=TRUE, sep=",")
cps_new1 <- cps_new # cps1 is now the spare copy shall we "mess up" our `cps`!
# Now in your global environment, you should see 'cps' and `cps1` stored.
# It has 7098 observations (i.e., 7098 rows) and 5 variables (i.e., 5 columns).
Two Steps: create a new age
variable centered around 0,
then square it
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
The mean value for age is about 29.63, but the age variable is a discrete variable (i.e. age is measured in whole year here). So, you can center it around the year that is closest to mean value, i.e., 30 years old, or the exact mean value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.00 -3.00 0.00 -0.37 2.00 4.00
ageC
to
create ageC2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 8.41 16.00 25.00
## [1] -0.337
\[\hat{ahe}= \alpha + \beta_1{ageC} + \beta_2{ageC2} + \beta_3{female} + \beta_4{bachelor} + \beta_5{female \cdot bachelor}\]
Notice that \(\beta_5{female \cdot bachelor}\) is the interaction term.
One way is to create separate interaction terms stored in the dataset. However, plotting might be difficult. So, I don’t recommend this way if you want to visualize the interaction effects down the road.
# Tips: do not make dummy to factor here, otherwise can't create an interaction.
cps_new$bachelor_female <- cps_new$bachelor * cps_new$female
cps_new$ageC_female <- cps_new$ageC * cps_new$female
# remove them for now
cps_new$bachelor_female <- NULL
cps_new$ageC_female <- NULL
Alternatively, we can include interaction terms in the regression model directly – see the following regression models.
Simple model (recall lm2 we did in previous lab)
##
## Call:
## lm(formula = ahe ~ female + age + bachelor, data = cps_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.91 -6.65 -1.86 4.25 83.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0342 1.3548 1.5 0.13
## female -4.1424 0.2659 -15.6 <0.0000000000000002 ***
## age 0.5316 0.0451 11.8 <0.0000000000000002 ***
## bachelor 9.8471 0.2624 37.5 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 553 on 3 and 7094 DF, p-value: <0.0000000000000002
## female age bachelor
## 1.02 1.00 1.02
sjPlot
and sjmisc
packagesSee the package documentations here for more explanations on syntaxes: http://www.strengejacke.de/sjPlot/reference/plot_model.html
# plot with no interaction: same slopes are parallel to each other!
plot_model(lm2, type = "pred", terms = c("bachelor","female"))
Reproduce the log-linear and log-log regression models from Lab 7
##
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7147 -0.2862 0.0129 0.3044 2.0602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.767359 0.011066 250.07 <0.0000000000000002 ***
## female -0.176939 0.011657 -15.18 <0.0000000000000002 ***
## ageC 0.022591 0.002098 10.77 <0.0000000000000002 ***
## ageC2 -0.001875 0.000776 -2.42 0.016 *
## bachelor 0.462164 0.011504 40.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.208
## F-statistic: 467 on 4 and 7093 DF, p-value: <0.0000000000000002
## female ageC ageC2 bachelor
## 1.02 1.13 1.13 1.02
##
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7147 -0.2862 0.0129 0.3044 2.0602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.767359 0.011066 250.07 <0.0000000000000002 ***
## female -0.176939 0.011657 -15.18 <0.0000000000000002 ***
## ageC 0.022591 0.002098 10.77 <0.0000000000000002 ***
## ageC2 -0.001875 0.000776 -2.42 0.016 *
## bachelor 0.462164 0.011504 40.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.208
## F-statistic: 467 on 4 and 7093 DF, p-value: <0.0000000000000002
## female log_age bachelor
## 1.02 1.00 1.02
In regression analysis, an interaction effect occurs when the effect of one independent variable on the dependent variable depends on the level of another independent variable. Including an interaction term in a regression model allows us to account for the fact that the relationship between the independent variables and the dependent variable may not be constant across all levels of the other independent variable. It can also help us to better understand the complex relationships between variables in our data.
For lm7, we want to examine if the relationship between
bachelor
and ahe
depends on
female
(or the effect of bachelor on ahe significantly vary
by sex). Let’s check if the interaction term is significant at the 0.05
level!
##
## Call:
## lm(formula = ahe ~ ageC + ageC2 + bachelor * female, data = cps_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.94 -6.69 -1.90 4.31 83.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8930 0.2725 65.67 <0.0000000000000002 ***
## ageC 0.5049 0.0479 10.55 <0.0000000000000002 ***
## ageC2 -0.0246 0.0177 -1.39 0.1654
## bachelor 10.4596 0.3402 30.74 <0.0000000000000002 ***
## female -3.3071 0.3972 -8.33 <0.0000000000000002 ***
## bachelor:female -1.5067 0.5346 -2.82 0.0048 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7092 degrees of freedom
## Multiple R-squared: 0.191, Adjusted R-squared: 0.19
## F-statistic: 334 on 5 and 7092 DF, p-value: <0.0000000000000002
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
## ageC 1.13 1 1.06 -- ageC2, bachelor, female
## ageC2 1.13 1 1.06 -- ageC, bachelor, female
## bachelor 1.00 3 1.00 female ageC, ageC2
## female 1.00 3 1.00 bachelor ageC, ageC2
Controlling other variables constant, the results show that the interaction term (female x bachelor) is statistically significant at the 0.05 level, meaning that there is a significant interaction effect between female and bachelor, i.e., the association between bachelor and average hourly earning significantly differs between female and male workers.
For female workers (female=1) who had bachelor’s degree, the coefficient (=slope) of bachelor for average hourly earning decreases by $1.51, compared to the coefficients of bachelor for male workers who had bachelor’s degree.
In short, for female workers (female=1), having a bachelor’s degree
is associated with an increase in the average hourly earning of
$8.95
(i.e., 10.46 - 1.51 = 8.95), holding other variables
constant. Whereas, for male workers (female=0 which is the reference
group), having a bachelor’s degree is associated with
$10.46
increase in the average hourly earning, holding
other variables constant.
# the order of your interaction terms will determine the x-axis and y-axis in your plot
# ageC = x-axis ; female = y-axis
lm8 <- lm(ahe ~ ageC * female + ageC2 + bachelor, data = cps_new)
summary(lm8)
##
## Call:
## lm(formula = ahe ~ ageC * female + ageC2 + bachelor, data = cps_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.96 -6.73 -1.97 4.33 83.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2282 0.2532 71.99 <0.0000000000000002 ***
## ageC 0.5930 0.0611 9.71 <0.0000000000000002 ***
## female -4.2168 0.2681 -15.73 <0.0000000000000002 ***
## ageC2 -0.0267 0.0177 -1.51 0.132
## bachelor 9.8291 0.2625 37.44 <0.0000000000000002 ***
## ageC:female -0.2030 0.0913 -2.22 0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7092 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.19
## F-statistic: 334 on 5 and 7092 DF, p-value: <0.0000000000000002
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
## ageC 1.16 3 1.02 female ageC2, bachelor
## female 1.16 3 1.02 ageC ageC2, bachelor
## ageC2 1.13 1 1.06 -- ageC, female, bachelor
## bachelor 1.02 1 1.01 -- ageC, female, ageC2
Stargazer
packageFor the full documentation: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf
cps_new_Mini <- (cps_new[c("ahe","age","bachelor","female")])
# install.packages("stargazer")
library(stargazer)
stargazer(cps_new_Mini, type= "html", out = "summary table.html",
title="Table 1. Summary Statistics", align=TRUE)
Statistic | N | Mean | St. Dev. | Min | Max |
ahe | 7,098 | 21.200 | 12.100 | 0.000 | 106.000 |
age | 7,098 | 29.600 | 2.880 | 25 | 34 |
bachelor | 7,098 | 0.526 | 0.499 | 0 | 1 |
female | 7,098 | 0.417 | 0.493 | 0 | 1 |
# Create a matrix by selecting some of the variables
cps_new$bachelor <- as.numeric(cps_new$bachelor)
# Correlation can't deal with factor structure, keep dummy in numerical form
cor.matrix <- cor(cps_new[c("ahe","age","bachelor","female")])
stargazer(cor.matrix, type= "html", out = "cor.table.html",
title="Table 2. Correlation Matrix")
ahe | age | bachelor | female | |
ahe | 1 | 0.131 | 0.380 | -0.112 |
age | 0.131 | 1 | -0.001 | -0.032 |
bachelor | 0.380 | -0.001 | 1 | 0.149 |
female | -0.112 | -0.032 | 0.149 | 1 |
# Remove dummy from your correlation matrix
# since correlation is only useful for continuous/numeric variables
cor.matrix1 <- cor(cps_new[c("ahe","age")])
stargazer(cor.matrix1, type= "html", out = "cor.table1.html",
title="Table 2. Correlation Matrix")
ahe | age | |
ahe | 1 | 0.131 |
age | 0.131 | 1 |
For more settings, see https://www.jakeruss.com/cheatsheets/stargazer/
lm2$AIC <- AIC(lm2)
lm5$AIC <- AIC(lm5)
lm6$AIC <- AIC(lm6)
lm7$AIC <- AIC(lm7)
lm8$AIC <- AIC(lm8)
lm2$BIC <- BIC(lm2)
lm5$BIC <- BIC(lm5)
lm6$BIC <- BIC(lm6)
lm7$BIC <- BIC(lm7)
lm8$BIC <- BIC(lm8)
stargazer(lm2, lm5, lm6, lm7, lm8, type= "html", out = "regression.table.html",
dep.var.labels = "Average Hourly Earnings",
column.labels = c("Main Effects","Log-linear Model", "Log-Log Model",
"Int: Bachelor x Female ", "Int: Age x Female"),
intercept.bottom = FALSE,
digits = 2 , digits.extra = 2,
single.row= FALSE,
keep.stat = c("n","adj.rsq","aic","bic"),
notes.append = FALSE,
star.char = c("+", "*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"),
title="Table 3. Regression Results", align=TRUE, no.space=FALSE)
Dependent variable: | |||||
Average Hourly Earnings | log_ahe1 | ahe | |||
Main Effects | Log-linear Model | Log-Log Model | Int: Bachelor x Female | Int: Age x Female | |
(1) | (2) | (3) | (4) | (5) | |
Constant | 2.03 | 2.77*** | 0.31 | 17.90*** | 18.20*** |
(1.35) | (0.01) | (0.20) | (0.27) | (0.25) | |
female | -4.14*** | -0.18*** | -0.18*** | -3.31*** | -4.22*** |
(0.27) | (0.01) | (0.01) | (0.40) | (0.27) | |
age | 0.53*** | ||||
(0.05) | |||||
bachelor:female | -1.51** | ||||
(0.53) | |||||
ageC | 0.02*** | 0.50*** | 0.59*** | ||
(0.002) | (0.05) | (0.06) | |||
ageC2 | -0.002* | -0.02 | -0.03 | ||
(0.001) | (0.02) | (0.02) | |||
log_age | 0.72*** | ||||
(0.06) | |||||
bachelor | 9.85*** | 0.46*** | 0.46*** | 10.50*** | 9.83*** |
(0.26) | (0.01) | (0.01) | (0.34) | (0.26) | |
ageC:female | -0.20* | ||||
(0.09) | |||||
Observations | 7,098 | 7,098 | 7,098 | 7,098 | 7,098 |
Adjusted R2 | 0.19 | 0.21 | 0.21 | 0.19 | 0.19 |
Akaike Inf. Crit. | 54,083.00 | 9,688.00 | 9,690.00 | 54,077.00 | 54,080.00 |
Bayesian Inf. Crit. | 54,118.00 | 9,729.00 | 9,724.00 | 54,125.00 | 54,128.00 |
Note: | (+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001) |
If you use the stargazer package in your research publications, please remember to include the following citation: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer