Computer Exercise 3: Multiple regression and omitted variable bias

Learning objectives:

running a multiple regression
interpreting multiple regression coefficients
presenting results from several regressions

1. Running a multiple regression

Let’s load in our cps08.csv data.

# Set the working directory to the folder that holds cps08.csv (adjust the path for your computer)
setwd("C:/Users/dvorakt/Google Drive/teaching/243")
data <- read.csv("cps08.csv")

We know the data has information on nearly 64 thousand individuals (63,787 observations): their salary, education, gender, etc. Let’s estimate two simple regressions, one for the effect of age on salary and the other for the effect of marital status on salary.

# Simple regressions: salary on age, and salary on marital status
model1 <- lm(salary ~ age, data = data)
model2 <- lm(salary ~ married, data = data)
summary(model1)

## 
## Call:
## lm(formula = salary ~ age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67461 -23065 -10451   8950 653161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26416.40     686.76   38.47   <2e-16 ***
## age           569.31      15.81   36.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47220 on 63785 degrees of freedom
## Multiple R-squared:  0.01993,    Adjusted R-squared:  0.01991 
## F-statistic:  1297 on 1 and 63785 DF,  p-value: < 2.2e-16

summary(model2)

## 
## Call:
## lm(formula = salary ~ married, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56551 -23079 -10079   8921 666038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40078.5      299.8  133.66   <2e-16 ***
## married      16492.6      382.5   43.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47020 on 63785 degrees of freedom
## Multiple R-squared:  0.02832,    Adjusted R-squared:  0.0283 
## F-statistic:  1859 on 1 and 63785 DF,  p-value: < 2.2e-16

We see that age has a positive and statistically significant effect on salary. For every additional year in age, salary is expected to go up by 569 dollars. In the second regression we see that married people earn over 16 thousand dollars more than single people. Again the effect is statistically significant.
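Assuming married is coded as a 0/1 indicator (the output above is consistent with that), the slope in the second regression is simply the difference in average salary between the two groups, and the intercept is the average salary of unmarried people. The sketch below checks this on a small made-up data set; the data frame toy and all of its numbers are invented for illustration and do not come from cps08.csv.

```r
# Toy data, invented for illustration only (not the CPS sample)
set.seed(243)
toy <- data.frame(married = rep(c(0, 1), each = 50))
toy$salary <- 40000 + 16000 * toy$married + rnorm(100, mean = 0, sd = 5000)

fit <- lm(salary ~ married, data = toy)

# Group averages computed directly
mean_unmarried <- mean(toy$salary[toy$married == 0])
mean_married   <- mean(toy$salary[toy$married == 1])

# The intercept matches the unmarried mean; the slope matches the gap in means
intercept_ok <- isTRUE(all.equal(unname(coef(fit)["(Intercept)"]), mean_unmarried))
slope_ok     <- isTRUE(all.equal(unname(coef(fit)["married"]), mean_married - mean_unmarried))
print(c(intercept_ok = intercept_ok, slope_ok = slope_ok))
```

Read this way, the output above implies an average salary of about 40,079 dollars for unmarried people and about 40,079 + 16,493 = 56,571 dollars for married people.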

The trouble with these two regressions is that marital status and age are related. Married people tend to be older. We see that married people earn more, but we don’t know whether it is because they are older or because there is something about married people that makes them more productive (e.g. they work harder, they have better social skills, etc.). In order to disentangle these two effects we need to run a multiple regression: one in which we have both age and marital status as independent variables.

model3 <- lm(salary ~ age + married, data = data)
summary(model3)

## 
## Call:
## lm(formula = salary ~ age + married, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68435 -22425  -9650   8898 660527 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23174.82     685.73   33.80   <2e-16 ***
## age           439.51      16.06   27.37   <2e-16 ***
## married     14104.38     390.17   36.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46750 on 63784 degrees of freedom
## Multiple R-squared:  0.0396, Adjusted R-squared:  0.03957 
## F-statistic:  1315 on 2 and 63784 DF,  p-value: < 2.2e-16

2. Interpreting multiple regression coefficients

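The interpretation of the coefficient on married in the above regression is the effect of being married on salary holding age constant. Similarly, the interpretation of the coefficient on age is the effect of an extra year of age on salary holding marital status constant. This is exactly what we need if we want to disentangle the effect of marital status on salary from that of age on salary. We want to know whether two people of the same age but different marital status are expected to have the same salary. The coefficient on married tells us that even when we control for age, married people earn more money: about 14 thousand dollars more. That is less than the 16.5 thousand dollars we found in the simple regression. Part of the simple-regression gap reflected the fact that married people tend to be older, which is the omitted variable bias in the title of this exercise. The coefficient on age also falls, from 569 to 440 dollars per year, once we control for marital status.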
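One way to see how much of the simple-regression gap is attributed to age is the omitted variable bias identity: the coefficient on married without age in the model equals the coefficient with age in the model, plus the age coefficient times the slope from regressing age on married. The sketch below just plugs in the estimates reported above. The variable names are made up for this illustration, and the implied age gap is backed out from the rounded coefficients, not estimated from the data.

```r
# Coefficients copied from the regression output above
b_married_short <- 16492.57  # salary ~ married
b_married_long  <- 14104.38  # salary ~ age + married
b_age_long      <- 439.51    # salary ~ age + married

# Part of the simple-regression gap that is attributed to age
bias <- b_married_short - b_married_long

# OVB identity: bias = b_age_long * (slope from regressing age on married),
# so the implied difference in average age between the two groups is:
implied_age_gap <- bias / b_age_long

print(round(bias, 2))
print(round(implied_age_gap, 2))
```

The numbers imply that roughly 2,388 dollars of the simple-regression gap is due to age, which corresponds to married people being about 5.4 years older on average. Running lm(age ~ married, data) on the actual data should give a slope close to that value.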

3. Presenting results from several regressions

In this exercise we estimated three regressions. When we need to present results from several regressions at once, we normally use a table in which each regression is a column and each variable is a row, with standard errors or t-statistics in parentheses underneath each coefficient. The stargazer package does a really good job of combining results from different models. Take a look:

library(stargazer)
stargazer(model1, model2, model3, type="text", digits = 2, intercept.bottom = FALSE)

## 
## =======================================================================================================
##                                                     Dependent variable:                                
##                     -----------------------------------------------------------------------------------
##                                                           salary                                       
##                                 (1)                         (2)                         (3)            
## -------------------------------------------------------------------------------------------------------
## Constant                   26,416.40***                40,078.53***                23,174.82***        
##                              (686.76)                    (299.85)                    (685.73)          
##                                                                                                        
## age                          569.31***                                               439.51***         
##                               (15.81)                                                 (16.06)          
##                                                                                                        
## married                                                16,492.57***                14,104.38***        
##                                                          (382.51)                    (390.17)          
##                                                                                                        
## -------------------------------------------------------------------------------------------------------
## Observations                  63,787                      63,787                      63,787           
## R2                             0.02                        0.03                        0.04            
## Adjusted R2                    0.02                        0.03                        0.04            
## Residual Std. Error   47,223.29 (df = 63785)      47,020.62 (df = 63785)      46,747.22 (df = 63784)   
## F Statistic         1,296.77*** (df = 1; 63785) 1,859.03*** (df = 1; 63785) 1,315.05*** (df = 2; 63784)
## =======================================================================================================
## Note:                                                                       *p<0.1; **p<0.05; ***p<0.01
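
If you would rather show t-statistics than standard errors under the coefficients, stargazer’s report argument controls what is printed. Below is a minimal sketch run on made-up data so that it is self-contained; the data frame toy and the models m1 and m2 are invented for illustration. With the real data you would pass model1, model2 and model3 instead.

```r
# Toy data, invented for illustration only (not the CPS sample)
set.seed(243)
n <- 200
toy <- data.frame(age = sample(25:64, n, replace = TRUE))
# Older people are more likely to be married in this toy sample
toy$married <- rbinom(n, size = 1, prob = (toy$age - 15) / 60)
toy$salary <- 20000 + 450 * toy$age + 14000 * toy$married + rnorm(n, sd = 20000)

m1 <- lm(salary ~ married, data = toy)
m2 <- lm(salary ~ age + married, data = toy)

# report = "vct*" prints variable names, coefficients, t-statistics and stars
# (the table is printed only if the stargazer package is installed)
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(m1, m2, type = "text", digits = 2,
                       intercept.bottom = FALSE, report = "vct*")
}
```

Whichever statistic you report, say in a note under the table what the numbers in parentheses are.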

Exercises:

1. Download cars.csv data from Nexus.
2. Estimate the effect of fuel efficiency (Mileage_w) on price (TMV).
3. Estimate the effect of Horsepower on price.
4. Estimate the effect of fuel efficiency on price while holding horsepower constant. Did the coefficient on fuel efficiency change from part 2? If so, can you explain why?