Dummy Variables, Logs, and Interactions

Jose M. Fernandez

Monday, October 06, 2014

Introduction

In these notes, we will discuss how to incorporate dummy variables, log transformations, quadratics, and interactions into our linear regression.

Data

For this section we will use the CPS example found in Chapter 8.

Each month the Bureau of Labor Statistics in the U.S. Department of Labor conducts the “Current Population Survey” (CPS), which provides data on labor force characteristics of the population, including the level of employment, unemployment, and earnings. Approximately 65,000 randomly selected U.S. households are surveyed each month. The sample is chosen by randomly selecting addresses from a database composed of addresses from the most recent decennial census, augmented with data on new housing units constructed after the last census. The exact random sampling scheme is rather complicated (first small geographical areas are randomly selected, then housing units within these areas are randomly selected); details can be found in the Handbook of Labor Statistics and on the Bureau of Labor Statistics website (www.bls.gov). The survey conducted each March is more detailed than in other months and asks questions about earnings during the previous year. These data are from the March 2009 survey.

Variables

Variable     Definition
gender       1 if female; 0 if male
Age          Age in Years
earnings     Avg. Hourly Earnings
education    Years of Education
Northeast    1 if from the Northeast, 0 otherwise
Midwest      1 if from the Midwest, 0 otherwise
South        1 if from the South, 0 otherwise
West         1 if from the West, 0 otherwise

Load the data

library("AER")
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
library("lattice")
data("CPSSW8")
summary(CPSSW8)
##     earnings       gender           age             region     
##  Min.   : 2.0   male  :34348   Min.   :21.0   Northeast:12371  
##  1st Qu.:11.1   female:27047   1st Qu.:33.0   Midwest  :15136  
##  Median :16.2                  Median :41.0   South    :18963  
##  Mean   :18.4                  Mean   :41.2   West     :14925  
##  3rd Qu.:23.6                  3rd Qu.:49.0                    
##  Max.   :72.1                  Max.   :64.0                    
##    education   
##  Min.   : 6.0  
##  1st Qu.:12.0  
##  Median :13.0  
##  Mean   :13.6  
##  3rd Qu.:16.0  
##  Max.   :20.0
histogram(~CPSSW8$earnings | CPSSW8$gender)

[Figure: histograms of earnings, one panel per gender]

Statistical discrimination

One thing we can do with categorical variables is to identify statistical discrimination.

A simple linear regression of Avg. Hourly Earnings on Gender will give us a quick comparison of earnings between females and males.

m1 = lm(earnings ~ factor(gender), data=CPSSW8)
summary(m1)
## 
## Call:
## lm(formula = earnings ~ factor(gender), data = CPSSW8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.08  -7.09  -1.92   4.82  52.34 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           20.0864     0.0537   373.9   <2e-16 ***
## factor(gender)female  -3.7483     0.0809   -46.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.95 on 61393 degrees of freedom
## Multiple R-squared:  0.0338, Adjusted R-squared:  0.0337 
## F-statistic: 2.15e+03 on 1 and 61393 DF,  p-value: <2e-16
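
With a single dummy regressor, the intercept is the mean of the omitted group (males) and the dummy coefficient is the difference in group means. A minimal sketch that checks this, assuming the AER package is installed; the object names (group_means, raw_gap, m1_check) are illustrative and not part of the original notes:

```r
# Assumes the AER package is installed; it ships the CPSSW8 data used above.
data("CPSSW8", package = "AER")

# Average hourly earnings within each gender group.
group_means <- tapply(CPSSW8$earnings, CPSSW8$gender, mean)

# Raw gap: female mean minus male mean.
raw_gap <- unname(group_means["female"] - group_means["male"])

# The same quantities recovered from the regression coefficients.
# gender is already a factor, so factor() is optional here.
m1_check <- lm(earnings ~ gender, data = CPSSW8)
male_mean_from_lm <- unname(coef(m1_check)["(Intercept)"])
gap_from_lm <- unname(coef(m1_check)["genderfemale"])

print(c(raw_gap = raw_gap, gap_from_lm = gap_from_lm))
```

The two printed numbers should agree, which is why the coefficient on the female dummy can be read directly as the raw earnings gap in dollars per hour.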

Let’s add some controls

In this second regression we include some additional explanatory variables.

\[earnings_i = \beta_0+\beta_1 Female_i +\beta_2 age_i + \beta_3 education_i + u_i\]

m2 = lm(earnings ~ factor(gender)+age+education, data=CPSSW8)
summary(m2)
## 
## Call:
## lm(formula = earnings ~ factor(gender) + age + education, data = CPSSW8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.23  -5.75  -1.32   4.28  49.50 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -10.02566    0.23720   -42.3   <2e-16 ***
## factor(gender)female  -4.25021    0.07146   -59.5   <2e-16 ***
## age                    0.15719    0.00335    46.9   <2e-16 ***
## education              1.74813    0.01444   121.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.78 on 61391 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.249 
## F-statistic: 6.78e+03 on 3 and 61391 DF,  p-value: <2e-16
anova(m1,m2)
## Analysis of Variance Table
## 
## Model 1: earnings ~ factor(gender)
## Model 2: earnings ~ factor(gender) + age + education
##   Res.Df     RSS Df Sum of Sq    F Pr(>F)    
## 1  61393 6083949                             
## 2  61391 4730226  2   1353723 8785 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
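
The F statistic reported by anova() can be reproduced by hand from the two residual sums of squares. A minimal sketch using the rounded values printed in the table above; the variable names are illustrative, and small differences from the printed F come from rounding:

```r
# Values copied from the anova(m1, m2) table above.
rss_restricted   <- 6083949  # model 1: gender only
rss_unrestricted <- 4730226  # model 2: gender + age + education
q                <- 2        # number of restrictions (age and education)
df_unrestricted  <- 61391    # residual degrees of freedom of model 2

# F = [(RSS_r - RSS_u) / q] / [RSS_u / df_u]
f_stat <- ((rss_restricted - rss_unrestricted) / q) /
  (rss_unrestricted / df_unrestricted)

# p-value from the upper tail of the F distribution.
p_value <- pf(f_stat, df1 = q, df2 = df_unrestricted, lower.tail = FALSE)

print(round(f_stat))
```

A large F means the added regressors reduce the residual sum of squares by far more than chance alone would explain.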

Quadratic function

Economic theory tells us that there are diminishing returns to productivity. As we age we become more productive, but at a decreasing rate. One way to account for this change is by including a quadratic term in our specification.

\[earnings_i = \beta_0+\beta_1 Female_i +\beta_2 age_i + \beta_3 age_i^2 +\beta_4 education_i + u_i\]

CPSSW8$age2=CPSSW8$age*CPSSW8$age
m3 = lm(earnings ~ factor(gender)+age+age2+education, data=CPSSW8)
summary(m3)
## 
## Call:
## lm(formula = earnings ~ factor(gender) + age + age2 + education, 
##     data = CPSSW8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.62  -5.69  -1.28   4.21  48.51 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.58e+01   5.22e-01   -49.4   <2e-16 ***
## factor(gender)female -4.24e+00   7.08e-02   -59.9   <2e-16 ***
## age                   9.86e-01   2.48e-02    39.8   <2e-16 ***
## age2                 -9.99e-03   2.96e-04   -33.8   <2e-16 ***
## education             1.72e+00   1.43e-02   120.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.7 on 61390 degrees of freedom
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.262 
## F-statistic: 5.46e+03 on 4 and 61390 DF,  p-value: <2e-16
anova(m2,m3)
## Analysis of Variance Table
## 
## Model 1: earnings ~ factor(gender) + age + education
## Model 2: earnings ~ factor(gender) + age + age2 + education
##   Res.Df     RSS Df Sum of Sq    F Pr(>F)    
## 1  61391 4730226                             
## 2  61390 4643914  1     86312 1141 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
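
With a quadratic term, the effect of one more year of age is no longer a single number: it equals the coefficient on age plus twice the coefficient on age2 times age. A minimal sketch using the rounded estimates printed above; the names b_age, b_age2, marginal_effect, and peak_age are illustrative:

```r
# Rounded estimates copied from summary(m3) above.
b_age  <- 0.986
b_age2 <- -0.00999

# Marginal effect of age on hourly earnings: d(earnings)/d(age).
marginal_effect <- function(age) b_age + 2 * b_age2 * age

# Age at which predicted earnings stop rising (the marginal effect is zero).
peak_age <- -b_age / (2 * b_age2)

print(marginal_effect(c(25, 40, 60)))
print(peak_age)
```

Because the coefficient on age2 is negative, the marginal effect falls with age and turns negative past the peak, which with these rounded estimates lies a little above 49 years.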

Interaction term

Potentially, the returns from education are different by gender. We add this feature to the model by including an interaction term. We multiply gender and education.

\[earnings_i = \beta_0+\beta_1 Female_i +\beta_2 age_i + \beta_3 age_i^2+ \beta_4 education_i+ \beta_5 education_i \times Female_i + u_i\]

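We see from the regression results that there is not much difference in the return to education by gender: the interaction coefficient is small (about 0.04) and not statistically significant (p = 0.13).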

m4 = lm(earnings ~ age+age2+education+education*gender, data=CPSSW8)
summary(m4)
## 
## Call:
## lm(formula = earnings ~ age + age2 + education + education * 
##     gender, data = CPSSW8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.51  -5.70  -1.28   4.21  48.50 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.55e+01   5.44e-01  -46.96   <2e-16 ***
## age                     9.86e-01   2.48e-02   39.84   <2e-16 ***
## age2                   -9.99e-03   2.96e-04  -33.78   <2e-16 ***
## education               1.71e+00   1.86e-02   91.74   <2e-16 ***
## genderfemale           -4.85e+00   4.05e-01  -11.98   <2e-16 ***
## education:genderfemale  4.45e-02   2.91e-02    1.53     0.13    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.7 on 61389 degrees of freedom
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.262 
## F-statistic: 4.37e+03 on 5 and 61389 DF,  p-value: <2e-16
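
In an R formula, education*gender already expands to both main effects plus their interaction, so listing education separately is harmless but redundant. The sketch below checks that, then reads off the return to education for each gender and the earnings gap at a given education level. It assumes the AER package is installed and uses I(age^2) in place of the age2 column; the object names are illustrative:

```r
# Assumes the AER package is installed; it ships the CPSSW8 data used above.
data("CPSSW8", package = "AER")

# The formula written as in the notes, and the shorter equivalent.
fit_long  <- lm(earnings ~ age + I(age^2) + education + education * gender,
                data = CPSSW8)
fit_short <- lm(earnings ~ age + I(age^2) + education * gender,
                data = CPSSW8)
same_fit <- isTRUE(all.equal(coef(fit_long), coef(fit_short)))

b <- coef(fit_short)

# Return to one more year of education, by gender.
slope_male   <- unname(b["education"])
slope_female <- unname(b["education"] + b["education:genderfemale"])

# Female-minus-male earnings gap at a given number of years of education.
gap_at <- function(educ) {
  unname(b["genderfemale"] + b["education:genderfemale"] * educ)
}

print(same_fit)
print(c(male = slope_male, female = slope_female))
print(gap_at(c(12, 16)))
```

With an interaction in the model, the coefficient on genderfemale alone is the gap at zero years of education, so the gap should be evaluated at education levels that actually occur in the sample.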

Location, Location, Location

There are often unobservable characteristics of markets that we would like to capture but for which we have no variable (e.g. unemployment rate by gender or sector, culture, laws, etc.). One way to handle this problem is to include categorical variables for the location of the person or firm. These categorical variables will capture any time-invariant differences between locations.

m5 = lm(earnings ~ age+age2+education+education*gender+region, data=CPSSW8)
summary(m5)
## 
## Call:
## lm(formula = earnings ~ age + age2 + education + education * 
##     gender + region, data = CPSSW8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.99  -5.67  -1.25   4.17  48.14 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.46e+01   5.50e-01  -44.72   <2e-16 ***
## age                     9.83e-01   2.47e-02   39.79   <2e-16 ***
## age2                   -9.97e-03   2.95e-04  -33.77   <2e-16 ***
## education               1.70e+00   1.86e-02   91.49   <2e-16 ***
## genderfemale           -4.80e+00   4.04e-01  -11.87   <2e-16 ***
## regionMidwest          -1.26e+00   1.05e-01  -11.94   <2e-16 ***
## regionSouth            -1.22e+00   1.01e-01  -12.10   <2e-16 ***
## regionWest             -4.25e-01   1.06e-01   -4.01    6e-05 ***
## education:genderfemale  4.19e-02   2.91e-02    1.44     0.15    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.68 on 61386 degrees of freedom
## Multiple R-squared:  0.265,  Adjusted R-squared:  0.265 
## F-statistic: 2.77e+03 on 8 and 61386 DF,  p-value: <2e-16
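
Only three region coefficients appear because R turns a four-level factor into three dummies and treats the first level (Northeast) as the reference group; including all four together with an intercept would be perfectly collinear. A small sketch on a toy factor with the same level order as CPSSW8$region; the toy vector and the object names are made up for illustration:

```r
# Toy factor with the same level order as CPSSW8$region.
region <- factor(c("Northeast", "Midwest", "South", "West", "South"),
                 levels = c("Northeast", "Midwest", "South", "West"))

# Dummy columns that lm() would build: the first level is omitted.
X <- model.matrix(~ region)
print(colnames(X))
print(X[1, ])  # a Northeast observation has zeros in all region dummies

# To compare against a different baseline, change the reference level.
region_south_ref <- relevel(region, ref = "South")
X_south <- model.matrix(~ region_south_ref)
print(colnames(X_south))
```

Each region coefficient in the output above is therefore a difference relative to the Northeast, holding the other regressors fixed. Changing the reference level changes the individual coefficients but not the fit of the model.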

Log Transformation

The normality assumption about the error term implies the dependent variable can potentially take on both negative and positive values. However, there are some variables we use often that are always positive (e.g. price, quantity, and income). One method used to ensure that the dependent variable stays positive is to transform it by taking the natural log.

m6 = lm(log(earnings) ~ age+age2+education+education*gender+region, data=CPSSW8)
summary(m6)
## 
## Call:
## lm(formula = log(earnings) ~ age + age2 + education + education * 
##     gender + region, data = CPSSW8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6874 -0.2810  0.0267  0.3190  1.8179 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.75e-01   3.00e-02   12.49  < 2e-16 ***
## age                     6.17e-02   1.35e-03   45.72  < 2e-16 ***
## age2                   -6.37e-04   1.61e-05  -39.46  < 2e-16 ***
## education               8.40e-02   1.02e-03   82.67  < 2e-16 ***
## genderfemale           -5.07e-01   2.21e-02  -22.97  < 2e-16 ***
## regionMidwest          -5.62e-02   5.75e-03   -9.77  < 2e-16 ***
## regionSouth            -7.34e-02   5.49e-03  -13.37  < 2e-16 ***
## regionWest             -2.67e-02   5.78e-03   -4.62  3.9e-06 ***
## education:genderfemale  2.01e-02   1.59e-03   12.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.474 on 61386 degrees of freedom
## Multiple R-squared:  0.268,  Adjusted R-squared:  0.268 
## F-statistic: 2.81e+03 on 8 and 61386 DF,  p-value: <2e-16
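
Once the dependent variable is in logs, a coefficient b is read as an approximate proportional change in earnings: 100*b percent for small b, or 100*(exp(b) - 1) percent exactly. A minimal sketch using the rounded estimates printed above; the names pct, return_men, return_women, and gap_pct_at are illustrative:

```r
# Rounded estimates copied from summary(m6) above.
b_educ   <- 0.0840   # education
b_female <- -0.507   # genderfemale
b_inter  <- 0.0201   # education:genderfemale

# Exact percentage change implied by a change of b in log earnings.
pct <- function(b) 100 * (exp(b) - 1)

# Return to one more year of education, by gender.
return_men   <- pct(b_educ)
return_women <- pct(b_educ + b_inter)

# Female-minus-male earnings gap (in percent) at a given education level.
gap_pct_at <- function(educ) pct(b_female + b_inter * educ)

print(c(men = return_men, women = return_women))
print(gap_pct_at(c(12, 16)))
```

Unlike in the levels regression, the interaction term here is statistically significant, so in percentage terms the estimated return to education is higher for women and the gender gap narrows as education rises.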