Jose M. Fernandez
Monday, October 06, 2014
In these notes, we will discuss how to incorporate dummy variables, log transformations, quadratics, and interactions into our linear regression.
For this section we will use the CPS example found in Chapter 8 of Stock and Watson, available as the CPSSW8 data set in the AER package.
Each month the Bureau of Labor Statistics in the U.S. Department of Labor conducts the “Current Population Survey” (CPS), which provides data on labor force characteristics of the population, including the level of employment, unemployment, and earnings. Approximately 65,000 randomly selected U.S. households are surveyed each month. The sample is chosen by randomly selecting addresses from a database of addresses from the most recent decennial census, augmented with data on new housing units constructed after the last census. The exact random sampling scheme is rather complicated (first small geographical areas are randomly selected, then housing units within these areas are randomly selected); details can be found in the Handbook of Labor Statistics and are described on the Bureau of Labor Statistics website (www.bls.gov). The survey conducted each March is more detailed than in other months and asks questions about earnings during the previous year. These data are from the March 2009 survey.
| Variable | Definition |
|---|---|
| gender | 1 if female; 0 if male |
| age | Age in years |
| earnings | Average hourly earnings |
| education | Years of education |
| Northeast | 1 if from the Northeast, 0 otherwise |
| Midwest | 1 if from the Midwest, 0 otherwise |
| South | 1 if from the South, 0 otherwise |
| West | 1 if from the West, 0 otherwise |

In the CPSSW8 data set, the four regional indicators are stored together as a single factor named region.
library("AER", lib.loc="~/R/win-library/3.1")
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
library("lattice", lib.loc="C:/Program Files/R/R-3.1.1/library")
data("CPSSW8")
summary(CPSSW8)
## earnings gender age region
## Min. : 2.0 male :34348 Min. :21.0 Northeast:12371
## 1st Qu.:11.1 female:27047 1st Qu.:33.0 Midwest :15136
## Median :16.2 Median :41.0 South :18963
## Mean :18.4 Mean :41.2 West :14925
## 3rd Qu.:23.6 3rd Qu.:49.0
## Max. :72.1 Max. :64.0
## education
## Min. : 6.0
## 1st Qu.:12.0
## Median :13.0
## Mean :13.6
## 3rd Qu.:16.0
## Max. :20.0
histogram(~ earnings | gender, data = CPSSW8)  # earnings distributions by gender
One thing we can do with categorical variables is to test for earnings differences across groups, a first step in identifying statistical discrimination. A simple linear regression of average hourly earnings on gender gives a quick comparison of mean earnings between females and males.
m1 = lm(earnings ~ factor(gender), data=CPSSW8)
summary(m1)
##
## Call:
## lm(formula = earnings ~ factor(gender), data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.08 -7.09 -1.92 4.82 52.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0864 0.0537 373.9 <2e-16 ***
## factor(gender)female -3.7483 0.0809 -46.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.95 on 61393 degrees of freedom
## Multiple R-squared: 0.0338, Adjusted R-squared: 0.0337
## F-statistic: 2.15e+03 on 1 and 61393 DF, p-value: <2e-16
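The coefficient estimates say that men earn $20.09 per hour on average and that women earn $3.75 less. As an aside, this two-group regression is equivalent to a classical two-sample comparison of means; a minimal sketch (using a pooled-variance t-test, which matches the OLS assumption of a common error variance):

t.test(earnings ~ gender, data = CPSSW8, var.equal = TRUE)
# The difference in sample means equals the coefficient on the female dummy
# (3.75 in absolute value), and the t statistic matches the regression's 46.3.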
In this second regression we include some additional explanatory variables.
\[earnings_i = \beta_0 + \beta_1 Female_i + \beta_2 age_i + \beta_3 education_i + u_i\]
m2 = lm(earnings ~ factor(gender)+age+education, data=CPSSW8)
summary(m2)
##
## Call:
## lm(formula = earnings ~ factor(gender) + age + education, data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.23 -5.75 -1.32 4.28 49.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.02566 0.23720 -42.3 <2e-16 ***
## factor(gender)female -4.25021 0.07146 -59.5 <2e-16 ***
## age 0.15719 0.00335 46.9 <2e-16 ***
## education 1.74813 0.01444 121.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.78 on 61391 degrees of freedom
## Multiple R-squared: 0.249, Adjusted R-squared: 0.249
## F-statistic: 6.78e+03 on 3 and 61391 DF, p-value: <2e-16
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: earnings ~ factor(gender)
## Model 2: earnings ~ factor(gender) + age + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 61393 6083949
## 2 61391 4730226 2 1353723 8785 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
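The anova() comparison is an F-test of the joint null hypothesis that the coefficients on age and education are both zero, which is soundly rejected here. As a sketch, the reported F statistic can be reconstructed from the two residual sums of squares:

rss1 <- sum(residuals(m1)^2)   # restricted model (gender only)
rss2 <- sum(residuals(m2)^2)   # unrestricted model
Fstat <- ((rss1 - rss2) / 2) / (rss2 / df.residual(m2))
Fstat                          # about 8785, matching the anova() output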
Economic theory tells us that there are diminishing returns to experience: as we age we become more productive, but at a decreasing rate. One way to account for this change is by including a quadratic term in our specification.
\[earnings_i = \beta_0 + \beta_1 Female_i + \beta_2 age_i + \beta_3 age_i^2 + \beta_4 education_i + u_i\]
CPSSW8$age2 <- CPSSW8$age^2  # squared age term; using I(age^2) in the formula is equivalent
m3 = lm(earnings ~ factor(gender)+age+age2+education, data=CPSSW8)
summary(m3)
##
## Call:
## lm(formula = earnings ~ factor(gender) + age + age2 + education,
## data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.62 -5.69 -1.28 4.21 48.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.58e+01 5.22e-01 -49.4 <2e-16 ***
## factor(gender)female -4.24e+00 7.08e-02 -59.9 <2e-16 ***
## age 9.86e-01 2.48e-02 39.8 <2e-16 ***
## age2 -9.99e-03 2.96e-04 -33.8 <2e-16 ***
## education 1.72e+00 1.43e-02 120.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.7 on 61390 degrees of freedom
## Multiple R-squared: 0.262, Adjusted R-squared: 0.262
## F-statistic: 5.46e+03 on 4 and 61390 DF, p-value: <2e-16
anova(m2,m3)
## Analysis of Variance Table
##
## Model 1: earnings ~ factor(gender) + age + education
## Model 2: earnings ~ factor(gender) + age + age2 + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 61391 4730226
## 2 61390 4643914 1 86312 1141 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
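Because the coefficient on age is positive and the coefficient on age2 is negative, predicted earnings follow an inverted U shape in age. A quick sketch of the implied peak, found by setting the derivative \(\beta_2 + 2\beta_3 age\) equal to zero:

b <- coef(m3)
-b["age"] / (2 * b["age2"])   # predicted earnings peak near age 49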
Potentially, the returns to education differ by gender. We add this feature to the model by including an interaction term: the product of gender and education.
\[earnings_i = \beta_0 + \beta_1 Female_i + \beta_2 age_i + \beta_3 age_i^2 + \beta_4 education_i + \beta_5 (education_i \times Female_i) + u_i\]
We will see from the regression results that the returns to education do not differ much by gender.
m4 = lm(earnings ~ age+age2+education+education*gender, data=CPSSW8)
summary(m4)
##
## Call:
## lm(formula = earnings ~ age + age2 + education + education *
## gender, data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.51 -5.70 -1.28 4.21 48.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.55e+01 5.44e-01 -46.96 <2e-16 ***
## age 9.86e-01 2.48e-02 39.84 <2e-16 ***
## age2 -9.99e-03 2.96e-04 -33.78 <2e-16 ***
## education 1.71e+00 1.86e-02 91.74 <2e-16 ***
## genderfemale -4.85e+00 4.05e-01 -11.98 <2e-16 ***
## education:genderfemale 4.45e-02 2.91e-02 1.53 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.7 on 61389 degrees of freedom
## Multiple R-squared: 0.262, Adjusted R-squared: 0.262
## F-statistic: 4.37e+03 on 5 and 61389 DF, p-value: <2e-16
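To read the interaction, the estimated return to one more year of education is the education coefficient for men and the education coefficient plus the interaction for women. A sketch using the fitted coefficients:

b <- coef(m4)
b["education"]                               # return for men, about $1.71 per hour
b["education"] + b["education:genderfemale"] # return for women, about $1.76 per hour

The interaction's t value of 1.53 (p = 0.13) confirms that this difference is not statistically significant.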
There are often unobservable characteristics about markets that we would like to capture but simply do not observe (e.g. the unemployment rate by gender or sector, culture, laws). One way to handle this problem is to use categorical variables for the location of the person or firm. These categorical variables will capture any time-invariant differences between locations.
m5 = lm(earnings ~ age+age2+education+education*gender+region, data=CPSSW8)
summary(m5)
##
## Call:
## lm(formula = earnings ~ age + age2 + education + education *
## gender + region, data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.99 -5.67 -1.25 4.17 48.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46e+01 5.50e-01 -44.72 <2e-16 ***
## age 9.83e-01 2.47e-02 39.79 <2e-16 ***
## age2 -9.97e-03 2.95e-04 -33.77 <2e-16 ***
## education 1.70e+00 1.86e-02 91.49 <2e-16 ***
## genderfemale -4.80e+00 4.04e-01 -11.87 <2e-16 ***
## regionMidwest -1.26e+00 1.05e-01 -11.94 <2e-16 ***
## regionSouth -1.22e+00 1.01e-01 -12.10 <2e-16 ***
## regionWest -4.25e-01 1.06e-01 -4.01 6e-05 ***
## education:genderfemale 4.19e-02 2.91e-02 1.44 0.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.68 on 61386 degrees of freedom
## Multiple R-squared: 0.265, Adjusted R-squared: 0.265
## F-statistic: 2.77e+03 on 8 and 61386 DF, p-value: <2e-16
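R drops one level of the factor (here Northeast) as the omitted baseline, so the regional coefficients measure differences relative to the Northeast. If a different baseline is preferred, relevel() changes the omitted category; a sketch (region2 and m5b are illustrative names):

CPSSW8$region2 <- relevel(CPSSW8$region, ref = "South")  # make South the baseline
m5b <- lm(earnings ~ age + age2 + education + education*gender + region2,
          data = CPSSW8)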
The normality assumption about the error term implies the dependent variable can potentially take on both negative and positive values. However, some variables we use often are always positive (e.g. prices, quantities, and income). One method used to ensure a positive fitted dependent variable is to transform it by taking the natural log.
m6 = lm(log(earnings) ~ age+age2+education+education*gender+region, data=CPSSW8)
summary(m6)
##
## Call:
## lm(formula = log(earnings) ~ age + age2 + education + education *
## gender + region, data = CPSSW8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6874 -0.2810 0.0267 0.3190 1.8179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.75e-01 3.00e-02 12.49 < 2e-16 ***
## age 6.17e-02 1.35e-03 45.72 < 2e-16 ***
## age2 -6.37e-04 1.61e-05 -39.46 < 2e-16 ***
## education 8.40e-02 1.02e-03 82.67 < 2e-16 ***
## genderfemale -5.07e-01 2.21e-02 -22.97 < 2e-16 ***
## regionMidwest -5.62e-02 5.75e-03 -9.77 < 2e-16 ***
## regionSouth -7.34e-02 5.49e-03 -13.37 < 2e-16 ***
## regionWest -2.67e-02 5.78e-03 -4.62 3.9e-06 ***
## education:genderfemale 2.01e-02 1.59e-03 12.65 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.474 on 61386 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.268
## F-statistic: 2.81e+03 on 8 and 61386 DF, p-value: <2e-16
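With log earnings as the dependent variable, the slope coefficients are approximately proportional effects: an additional year of education raises earnings by about 8.4 percent for men and about 10.4 percent (0.084 + 0.020) for women. Because the female dummy now interacts with education, the gender gap depends on the education level; a sketch evaluated at the mean of 13.6 years, using the exact transformation \(e^{\beta} - 1\):

b <- coef(m6)
gap <- b["genderfemale"] + b["education:genderfemale"] * 13.6  # gap in log points
(exp(gap) - 1) * 100                                           # about -21 percent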