Let’s connect: My email is zshang@uoguelph.ca,
twitter is @MaxZYShang. You are also welcome to visit my website. You can click the code button to show the original R code.
This post is related to the following two questions: How should you add dummy variables? How should you interpret the results?
It may be easier for us to understand theses questions if an example is used. So let’s create one. We will generate our own data. Wait, you may say why not use a real world example. The reason is that when data is collected from the real world, it’s hard (some time impossible) to know the true data-generating process. However, if the data is generated by ourselves, we know for sure how the data is generated. Of course, after the data is generated, we pretend we don’t know the data-generating process and use regression techniques to recover the true relationships. (well, you can also ask your colleague to generate the data and send it to you without telling you how the data is generated. )
Let’s go to our playground and generate the data. Say we want to study how salary is determined by education, experience, and whether someone is holding a management positon. Let’s play God and set up some rules:
The world is big enough for all of us. Everyone starts at 40k per year. That is, as long as you are breathing, you get 40k per year.
Practice makes perfect. Each additional year of experience, the salary increases by 5k.
The more you learn, the more you earn. The annual salary increase for high school, university, and PhD is 0, 10k, and 20k, respectively.
Anyone can hold the helm when the sea is calm. For those holding a management position, pay 20k more.
Born to be great leaders. For those only go to high school but hold a management position, give them an additional 30k.
S**t happens. Random factors will affect the salary with a mean of 0 and standard deviation of 5k.
library(tidyverse)
require(lmtest)
require(car)
set.seed(1)
# generate data
set.edu<-c("high school", "university", "phd")
set.edu<-as.factor(set.edu)
levels(set.edu)<-c("high school","university", "phd")
#levels(set.edu)<-c("university", "phd","high school")
set.mngt<-c("yes","no")
n=5000
d<-data.frame(edu=replicate(n,sample(set.edu,1)),
exp=runif(n,min=0,max = 10),
mngt=replicate(n,sample(set.mngt,1)))
# error term is normal
e<-rnorm(n,0,5000)
# error term is not normal
# e<-rbeta(n,shape1 = 1,shape2 = 5)*5000
# hist(e)
d$salary<-40000+5000*d$exp+e
# education
d<-d %>% mutate(salary = case_when( edu == "phd" ~ salary+20000,
edu == "university" ~ salary+10000,
edu == "high school" ~ salary+0
))
# management
d<-d %>% mutate(salary = case_when(
mngt == "yes" ~ salary + 20000,
mngt == "no" ~ salary + 0
))
# born to be
d<-d %>% mutate(salary = case_when(
mngt == "yes" & edu == "high school" ~ salary + 30000,
TRUE ~ salary
))
attach(d)
Here is the head and summary of the data.
head(d); summary(d)
## edu exp mngt salary
## 1 high school 0.3549953 yes 87800.96
## 2 phd 3.5570365 no 80815.61
## 3 phd 2.4887294 no 67131.31
## 4 university 8.7918038 yes 119055.02
## 5 high school 3.1797888 yes 106786.99
## 6 university 3.2112050 yes 80901.15
## edu exp mngt salary
## high school:1675 Min. :0.001064 no :2499 Min. : 31550
## university :1685 1st Qu.:2.577482 yes:2501 1st Qu.: 73543
## phd :1640 Median :5.035378 Median : 90160
## Mean :5.032461 Mean : 90044
## 3rd Qu.:7.520791 3rd Qu.:106648
## Max. :9.989099 Max. :146679
Relationship betwen salary and education for people with and with no management position.
p<-ggplot(d,aes(edu,salary))+
geom_boxplot()+
geom_jitter(alpha=0.25,color="brown")+
facet_wrap(~mngt)
p
Relationship betwen salary and experience colored by education for people with and with no management position.
p<-ggplot(d,aes(exp,salary,color=edu))+
geom_point(alpha=0.3)+
stat_smooth(method = "lm")+
facet_wrap(~mngt)
p
Let’s pretend that we don’t know the true data-generating process. We may just regress salary on education, experience, and management position. The results are:
md<-lm(salary~ edu +
exp +
mngt,data = d)
summary(md)
##
## Call:
## lm(formula = salary ~ edu + exp + mngt, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25836.8 -6701.8 -14.2 6501.9 25925.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50093.69 320.36 156.37 <2e-16 ***
## eduuniversity -4770.52 296.96 -16.07 <2e-16 ***
## eduphd 5130.94 298.95 17.16 <2e-16 ***
## exp 4954.33 42.22 117.33 <2e-16 ***
## mngtyes 29873.56 243.42 122.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8605 on 4995 degrees of freedom
## Multiple R-squared: 0.8579, Adjusted R-squared: 0.8577
## F-statistic: 7537 on 4 and 4995 DF, p-value: < 2.2e-16
Although the parameters are statistically significant, it does not make any sense. How could a university degree compared to high school reduce your salary by 4770.52? Since we know how the data are generated, the correct coefficient for university education should be about 10k.
The correct model should include an interaction term of education and management position.
Now, let’s add the interaction term between education and management and see what happens:
md<-lm(data = d,salary~
edu +
exp +
mngt+
edu*mngt
)
summary(md)
##
## Call:
## lm(formula = salary ~ edu + exp + mngt + edu * mngt, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21223.1 -3314.8 -56.7 3334.6 18612.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39923.21 209.40 190.65 <2e-16 ***
## eduuniversity 10070.22 239.26 42.09 <2e-16 ***
## eduphd 20116.12 243.72 82.54 <2e-16 ***
## exp 5008.52 24.21 206.86 <2e-16 ***
## mngtyes 49735.74 241.04 206.34 <2e-16 ***
## eduuniversity:mngtyes -29965.51 340.51 -88.00 <2e-16 ***
## eduphd:mngtyes -29782.87 342.73 -86.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4932 on 4993 degrees of freedom
## Multiple R-squared: 0.9533, Adjusted R-squared: 0.9533
## F-statistic: 1.7e+04 on 6 and 4993 DF, p-value: < 2.2e-16
The results are now making sense.
The intercept of 39923.21 (close to 40k) is the basic guaranteed income.
The base for education is high school. Compared to high school, a university education can increase salary by $10070.22 ( close to 10k) on average. Compared to high school, Ph.D. education can increase the salary by $20116.12 (close to 20k).
One more year of experience will increase the salary by $5008.52 (close to 5k).
High school graduates holding a management position have a premium of $49735.74 (close to 50k). These are the born-to-be leaders.
Compared to high school graduates holding a management position, the premium for university graduates holding a management position reduces by 29965.51 to 19770.23 (49735.74-29965.51, close to 20k).
Compared to high school graduates holding a management position, the premium for Ph.D. graduates holding a management position reduces by 29782.87 to 19952.87 (49735.74-29782.87, close to 20k). Alternatively, you can say management position generates a basic premium of 20k regardless of the education level. Additional to this 20K, high school graduates get another 30k, increasing the total premium to 50k.
For our model to work, we need to meet some assumptions.
Normal Q-Q plot looks linear. So this assumption is satisfied.
par(mfrow = c(2,2))
plot(md)
D-W test value is 1.8878 which is close to 2. Hence, this assumption is also satisfied.
dwtest(md)
##
## Durbin-Watson test
##
## data: md
## DW = 2.0237, p-value = 0.7984
## alternative hypothesis: true autocorrelation is greater than 0
The VIF values of predictor variables edu, exp, and mngt are less than 5 so this assumption satisfied.
vif(md)
## GVIF Df GVIF^(1/(2*Df))
## edu 4.008199 2 1.414938
## exp 1.001219 1 1.000609
## mngt 2.985803 1 1.727948
## edu:mngt 8.008466 2 1.682238
You can get the same results by running the model with a subset of the data. Instead of using a dummy variable for education, you can split the data into subsets by education level and run the regression model at each subset.
With data only from high school students, you get this:
d.sub<-d %>%
filter(edu=="high school")
md<-lm(data = d.sub,salary~
exp +
mngt
)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21223.7 -3336.5 -87.6 3293.5 17774.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39919.95 273.48 146.0 <2e-16 ***
## exp 5009.17 42.35 118.3 <2e-16 ***
## mngtyes 49735.84 242.22 205.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4955 on 1672 degrees of freedom
## Multiple R-squared: 0.9704, Adjusted R-squared: 0.9704
## F-statistic: 2.745e+04 on 2 and 1672 DF, p-value: < 2.2e-16
With data only from University students, you get this:
d.sub<-d %>%
filter(edu=="university")
md<-lm(data = d.sub,salary~
exp +
mngt
)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15617.5 -3200.8 -65.7 3315.3 18649.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49805.9 266.2 187.11 <2e-16 ***
## exp 5046.1 41.3 122.20 <2e-16 ***
## mngtyes 19761.1 241.5 81.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4952 on 1682 degrees of freedom
## Multiple R-squared: 0.9304, Adjusted R-squared: 0.9303
## F-statistic: 1.125e+04 on 2 and 1682 DF, p-value: < 2.2e-16
With data only from Ph.D. students, you get this:
d.sub<-d %>%
filter(edu=="phd")
md<-lm(data = d.sub,salary~
exp +
mngt
)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17231.0 -3396.4 -18.2 3378.1 15739.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60245.70 273.98 219.89 <2e-16 ***
## exp 4967.56 42.18 117.76 <2e-16 ***
## mngtyes 19952.33 241.48 82.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4888 on 1637 degrees of freedom
## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9265
## F-statistic: 1.033e+04 on 2 and 1637 DF, p-value: < 2.2e-16