Let’s load in our cps08.csv data.
setwd("C:/Users/dvorakt/Google Drive/teaching/243")
data <- read.csv("cps08.csv")
We know the data has information on nearly 64 thousand individuals (63,787 observations): their salary, education, gender, etc. Let’s estimate two simple regressions, one for the effect of age on salary and the other for the effect of marital status on salary.
model1 <- lm(salary ~ age, data = data)
model2 <- lm(salary ~ married, data = data)
summary(model1)
##
## Call:
## lm(formula = salary ~ age, data = data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -67461 -23065 -10451   8950 653161
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26416.40     686.76   38.47   <2e-16 ***
## age           569.31      15.81   36.01   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47220 on 63785 degrees of freedom
## Multiple R-squared: 0.01993, Adjusted R-squared: 0.01991
## F-statistic: 1297 on 1 and 63785 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = salary ~ married, data = data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -56551 -23079 -10079   8921 666038
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  40078.5      299.8  133.66   <2e-16 ***
## married      16492.6      382.5   43.12   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47020 on 63785 degrees of freedom
## Multiple R-squared: 0.02832, Adjusted R-squared: 0.0283
## F-statistic: 1859 on 1 and 63785 DF, p-value: < 2.2e-16
We see that age has a positive and statistically significant effect on salary. For every additional year of age, expected salary goes up by about 569 dollars. In the second regression we see that married people earn, on average, about 16.5 thousand dollars more than unmarried people. Again, the effect is statistically significant.
The trouble with these two regressions is that marital status and age are related: married people tend to be older. We see that married people earn more, but we don’t know whether that is because they are older or because there is something about married people that makes them more productive (e.g. they work harder, they have better social skills, etc.). To disentangle these two effects we need to run a multiple regression, one in which both age and marital status are independent variables.
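To make these two interpretations concrete, here is a small sketch that plugs the rounded coefficients printed above into the fitted equations by hand. The variable names (`b0_age`, `b_age`, `b0_mar`, `b_mar`) are ours, introduced only for illustration, and we assume `married` is coded 0/1. Under that assumption, the intercept of the second regression is the average salary in the `married = 0` group, and the intercept plus the coefficient is the average in the `married = 1` group.

```r
# Coefficients copied (rounded) from summary(model1) and summary(model2)
b0_age <- 26416.40; b_age <- 569.31
b0_mar <- 40078.5;  b_mar <- 16492.6

# model1: predicted salary at ages 30 and 31
pred_age <- b0_age + b_age * c(30, 31)
pred_age
diff(pred_age)   # one extra year of age adds exactly the age coefficient

# model2: predicted salary for married = 0 and married = 1
pred_mar <- b0_mar + b_mar * c(0, 1)
pred_mar
```

With the real data loaded, `predict(model1, newdata = data.frame(age = c(30, 31)))` should give the same numbers up to rounding.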
model3 <- lm(salary ~ age + married, data)
summary(model3)
##
## Call:
## lm(formula = salary ~ age + married, data = data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -68435 -22425  -9650   8898 660527
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23174.82     685.73   33.80   <2e-16 ***
## age           439.51      16.06   27.37   <2e-16 ***
## married     14104.38     390.17   36.15   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46750 on 63784 degrees of freedom
## Multiple R-squared: 0.0396, Adjusted R-squared: 0.03957
## F-statistic: 1315 on 2 and 63784 DF, p-value: < 2.2e-16
The interpretation of the coefficient on married in the above regression is the effect of being married on salary holding age constant. Similarly, the interpretation of the coefficient on age is the effect of an extra year of age on salary holding marital status constant. This is exactly what we need if we want to disentangle the effect of marital status on salary from that of age on salary. We want to know whether two people of the same age but different marital status are expected to have the same salary. The coefficient on married tells us that even when we control for age, married people earn more money, about 14 thousand dollars more. That is less than the 16.5 thousand dollars from the simple regression, because part of the raw gap reflected married people being older.
In this exercise we estimated three regressions. When we need to present results from several regressions at once, we normally use a table in which each regression is a column, the variables are rows, and t-statistics or standard errors appear in parentheses underneath the coefficients. The stargazer package does a really good job of combining results from different models (the table below shows standard errors in parentheses). Take a look:
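The drop in the married coefficient from about 16.5 thousand to about 14.1 thousand can be tied back to age with a little arithmetic. For least squares, the coefficient on married in the simple regression equals the coefficient on married in the multiple regression plus the age coefficient times the slope from regressing age on married (the average age gap between the two groups). The sketch below backs out that implied age gap from the printed coefficients, which comes to roughly 5.4 years, and then checks the identity on simulated data. The simulated values and all variable names here are our own assumptions, not part of cps08.csv.

```r
# Coefficients copied (rounded) from the regression output above
b_married_short <- 16492.6    # model2: salary ~ married
b_married_long  <- 14104.38   # model3: salary ~ age + married
b_age_long      <- 439.51     # model3: coefficient on age

# Implied slope of age on married, i.e. the average age gap in years
implied_age_gap <- (b_married_short - b_married_long) / b_age_long
round(implied_age_gap, 2)

# Check the identity on simulated data (made-up values, not cps08.csv)
set.seed(1)
n <- 5000
age <- sample(20:65, n, replace = TRUE)
married <- rbinom(n, 1, plogis((age - 40) / 10))  # older people more often married
salary <- 20000 + 400 * age + 10000 * married + rnorm(n, sd = 5000)

short <- coef(lm(salary ~ married))
long  <- coef(lm(salary ~ age + married))
aux   <- coef(lm(age ~ married))   # auxiliary regression: age gap by marital status

lhs <- unname(short["married"])
rhs <- unname(long["married"] + long["age"] * aux["married"])
all.equal(lhs, rhs)
```

With the real data, `coef(lm(age ~ married, data))["married"]` would be the direct way to see the age gap; the back-of-the-envelope number above is only as precise as the rounded coefficients.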
library(stargazer)
stargazer(model1, model2, model3, type="text", digits = 2, intercept.bottom = FALSE)
##
## =======================================================================================================
##                                                     Dependent variable:
##                     -----------------------------------------------------------------------------------
##                                                           salary
##                                 (1)                         (2)                         (3)
## -------------------------------------------------------------------------------------------------------
## Constant                   26,416.40***                40,078.53***                23,174.82***
##                              (686.76)                    (299.85)                    (685.73)
##
## age                          569.31***                                               439.51***
##                               (15.81)                                                 (16.06)
##
## married                                                16,492.57***                14,104.38***
##                                                          (382.51)                    (390.17)
##
## -------------------------------------------------------------------------------------------------------
## Observations                  63,787                      63,787                      63,787
## R2                             0.02                        0.03                        0.04
## Adjusted R2                    0.02                        0.03                        0.04
## Residual Std. Error   47,223.29 (df = 63785)      47,020.62 (df = 63785)      46,747.22 (df = 63784)
## F Statistic         1,296.77*** (df = 1; 63785) 1,859.03*** (df = 1; 63785) 1,315.05*** (df = 2; 63784)
## =======================================================================================================
## Note:                                                                       *p<0.1; **p<0.05; ***p<0.01
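If stargazer is not installed, a rough version of the same side-by-side layout can be put together in base R. The sketch below is only an illustration: it uses simulated data as a stand-in for cps08.csv (the values in `sim` are made up, and `models`, `terms_all` and `coef_table` are our own names), so its numbers will not match the table above. It shows coefficients only, with `NA` where a model leaves a variable out.

```r
# Simulated stand-in for the CPS data (hypothetical values, not cps08.csv)
set.seed(243)
n <- 2000
age <- sample(20:65, n, replace = TRUE)
married <- rbinom(n, 1, plogis((age - 40) / 10))
salary <- 20000 + 400 * age + 10000 * married + rnorm(n, sd = 5000)
sim <- data.frame(salary, age, married)

# The same three specifications as in the text
models <- list(
  m1 = lm(salary ~ age, data = sim),
  m2 = lm(salary ~ married, data = sim),
  m3 = lm(salary ~ age + married, data = sim)
)

# Every term that appears in any model, in order of first appearance
terms_all <- unique(unlist(lapply(models, function(m) names(coef(m)))))

# One column per model; indexing by a missing name yields NA
coef_table <- sapply(models, function(m) coef(m)[terms_all])
rownames(coef_table) <- terms_all
round(coef_table, 2)
```

Standard errors could be added in the same way from `coef(summary(m))[, "Std. Error"]`, but for anything beyond a quick look a dedicated table package is the more practical choice.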