Most of the data can be measured by numbers, such as heights and weights. However, variables such as gender, season, location, etc., cannot be measured by numbers. Instead, we use dummy variables to measure them.
Let’s assume the impact of x on y is different in male and female.
For male \(y=10+5x+e\)
For female \(y=5+x+e\),
where e is random effect with mean zero. So in the true relationship between y and x, gender impacts both the interception and the slope.
First, let’s generate the data we need:
set.seed(1)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.1
n=100
x<-rnorm(n,10,2)
e<-rnorm(n,0,1)
gender<-c(rep(1,50),rep(0,50))
d<-data.frame(cbind(x,gender))
#true slope, male =5, female = 1
d$y<-ifelse(d$gender==1, 10+5*d$x+e,5+d$x+e)
d$gender<-as.factor(d$gender)
attach(d)
## The following objects are masked _by_ .GlobalEnv:
##
## gender, x
First, we can take a look at the relationship between x and y and color the data by gender:
ggplot(data=d,aes(x,y,color=gender))+
geom_point()
It’s clear that the relationship between y and x should not be depicted by one line. We need two: one for male and one for female.
If we only regress y on x and gender, the result would be
summary(lm(y~x+gender))
##
## Call:
## lm(formula = y ~ x + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1698 -2.0864 0.5659 2.6493 7.2009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.2736 2.2041 -5.568 2.29e-07 ***
## x 2.6953 0.2091 12.891 < 2e-16 ***
## gender 45.6315 0.7474 61.053 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.737 on 97 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9751
## F-statistic: 1940 on 2 and 97 DF, p-value: < 2.2e-16
The estimated coeffcient for x is not correct.
The correct set up should be the following, which allows gender to influnence both the interception and the slope.
summary(lm(y~x+gender+gender*x))
##
## Call:
## lm(formula = y ~ x + gender + gender * x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7543 -0.5246 -0.1240 0.5489 2.2400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.20427 0.74111 7.022 3.12e-10 ***
## x 0.98755 0.07117 13.875 < 2e-16 ***
## gender 4.49815 1.13300 3.970 0.000139 ***
## x:gender 4.02667 0.10929 36.844 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9654 on 96 degrees of freedom
## Multiple R-squared: 0.9984, Adjusted R-squared: 0.9983
## F-statistic: 1.983e+04 on 3 and 96 DF, p-value: < 2.2e-16
or if you want to keep it short, use the following, R will know that you are trying to add a dummy variable.
summary(lm(y~x*gender))
##
## Call:
## lm(formula = y ~ x * gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7543 -0.5246 -0.1240 0.5489 2.2400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.20427 0.74111 7.022 3.12e-10 ***
## x 0.98755 0.07117 13.875 < 2e-16 ***
## gender 4.49815 1.13300 3.970 0.000139 ***
## x:gender 4.02667 0.10929 36.844 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9654 on 96 degrees of freedom
## Multiple R-squared: 0.9984, Adjusted R-squared: 0.9983
## F-statistic: 1.983e+04 on 3 and 96 DF, p-value: < 2.2e-16
The model says that, for female (gender=0), the estimated model is \(y=5.20+0.99x\); for male (gender=1), the estimated relationship is \(y=5.20+0.99x+4.5+4.02x\), that is \(y=9.7+5.01x\), pretty close to the true relationship.
Next, let’s try two dummy variables: gender and location
Let’s generate some data where gender does not matter, but location would matter
n=10000
x<-rnorm(n,10,2)
e<-rnorm(n,0,2)
gender<-ifelse(runif(n)>0.5,1,0)
#gender<-c(rep(1,n/2),rep(0,n/2))
location<-c(rep("Toronto",n/2),
rep("Chicago",n/2)
)
d<-data.frame(x=x,gender=gender,location=location)
d$gender<-as.factor(d$gender)
d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 1+1*d$x+e,
ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
ifelse(d$gender=="1" & d$location=="Chicago", 2+2*d$x+e,
ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,NA))))
attach(d)
## The following objects are masked _by_ .GlobalEnv:
##
## gender, location, x
## The following objects are masked from d (pos = 3):
##
## gender, x, y
Plot to see the relationship between \(x\) and \(y\), color the data by gender and seperate them by location.
p<-ggplot(d,aes(x,y,color=gender))+
geom_point()
p+facet_grid(~location)
The impact of gender on y seems radom as it should be. But when you compare the data from Chicago and data from Toronto, the interception and the slope is different.
If we naively ignored the impact of gender and location, the model would be
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4047 -5.5901 0.0189 5.5535 12.8347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6149 0.3006 5.373 7.93e-08 ***
## x 1.4878 0.0295 50.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.964 on 9998 degrees of freedom
## Multiple R-squared: 0.2029, Adjusted R-squared: 0.2028
## F-statistic: 2544 on 1 and 9998 DF, p-value: < 2.2e-16
R-squared is pretty low.
We know gender wouldn’t matter, but let’s just add it to see if it will make any difference
summary(lm(y~x+gender))
##
## Call:
## lm(formula = y ~ x + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.402 -5.592 0.020 5.554 12.838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.617854 0.306891 5.272 1.38e-07 ***
## x 1.487846 0.029499 50.437 < 2e-16 ***
## gender -0.005625 0.119317 -0.047 0.962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.965 on 9997 degrees of freedom
## Multiple R-squared: 0.2029, Adjusted R-squared: 0.2027
## F-statistic: 1272 on 2 and 9997 DF, p-value: < 2.2e-16
As expected, the impact of gender is not significant.
Now let’s examine the impact of location
summary(lm(y~x+location))
##
## Call:
## lm(formula = y ~ x + location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8733 -1.5038 -0.0207 1.4710 7.4832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.13409 0.11348 62.86 <2e-16 ***
## x 1.48990 0.01093 136.37 <2e-16 ***
## locationToronto -11.07925 0.04418 -250.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.209 on 9997 degrees of freedom
## Multiple R-squared: 0.8907, Adjusted R-squared: 0.8906
## F-statistic: 4.072e+04 on 2 and 9997 DF, p-value: < 2.2e-16
The impact of locationi s significant. But our model set up is essentially saying that we think location would only change the interception.
What if location changes both the interception and the slope. we can set the model to be
summary(lm(y~x*location))
##
## Call:
## lm(formula = y ~ x * location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5041 -1.3530 -0.0267 1.3175 7.4462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17843 0.14373 15.157 < 2e-16 ***
## x 1.98616 0.01412 140.706 < 2e-16 ***
## locationToronto -1.48509 0.20012 -7.421 1.26e-13 ***
## x:locationToronto -0.96061 0.01964 -48.913 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.984 on 9996 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9117
## F-statistic: 3.444e+04 on 3 and 9996 DF, p-value: < 2.2e-16
you can also try this:
summary(lm(y~x*location*gender))
##
## Call:
## lm(formula = y ~ x * location * gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5321 -1.3544 -0.0245 1.3130 7.3862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.155796 0.205852 10.473 < 2e-16 ***
## x 1.989341 0.020178 98.589 < 2e-16 ***
## locationToronto -1.655801 0.286231 -5.785 7.48e-09 ***
## gender 0.045105 0.287576 0.157 0.875
## x:locationToronto -0.943339 0.028071 -33.606 < 2e-16 ***
## x:gender -0.006321 0.028241 -0.224 0.823
## locationToronto:gender 0.333634 0.400384 0.833 0.405
## x:locationToronto:gender -0.033734 0.039291 -0.859 0.391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.985 on 9992 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9117
## F-statistic: 1.476e+04 on 7 and 9992 DF, p-value: < 2.2e-16
Gender does not matter and location changes both the interception and the slope.
Now let’s generate some data where gender and location would both matter. Let’s start with two locations.
d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 10+1*d$x+e,
ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
ifelse(d$gender=="1" & d$location=="Chicago", 20+2*d$x+e,
ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,NA))))
attach(d)
## The following objects are masked _by_ .GlobalEnv:
##
## gender, location, x
## The following objects are masked from d (pos = 3):
##
## gender, location, x, y
## The following objects are masked from d (pos = 4):
##
## gender, x, y
p<-ggplot(d,aes(x,y,color=gender))+
geom_point()
p+facet_grid(~location)
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.174 -7.665 -2.346 12.240 23.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.69392 0.54495 15.95 <2e-16 ***
## x 1.46924 0.05348 27.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.81 on 9998 degrees of freedom
## Multiple R-squared: 0.0702, Adjusted R-squared: 0.0701
## F-statistic: 754.8 on 1 and 9998 DF, p-value: < 2.2e-16
summary(lm(y~gender))
##
## Call:
## lm(formula = y ~ gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.546 -7.933 -1.186 7.855 25.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.4877 0.1280 128.78 <2e-16 ***
## gender 13.4856 0.1793 75.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.961 on 9998 degrees of freedom
## Multiple R-squared: 0.3615, Adjusted R-squared: 0.3614
## F-statistic: 5660 on 1 and 9998 DF, p-value: < 2.2e-16
summary(lm(y~x+location))
##
## Call:
## lm(formula = y ~ x + location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5571 -6.7026 0.6993 6.6514 16.0910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.51660 0.38191 43.25 <2e-16 ***
## x 1.47214 0.03677 40.04 <2e-16 ***
## locationToronto -15.70339 0.14868 -105.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.434 on 9997 degrees of freedom
## Multiple R-squared: 0.5606, Adjusted R-squared: 0.5605
## F-statistic: 6376 on 2 and 9997 DF, p-value: < 2.2e-16
summary(lm(y~x*location))
##
## Call:
## lm(formula = y ~ x * location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.605 -6.732 1.030 6.721 16.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.82216 0.53433 22.125 <2e-16 ***
## x 1.94224 0.05248 37.011 <2e-16 ***
## locationToronto -6.61497 0.74399 -8.891 <2e-16 ***
## x:locationToronto -0.90998 0.07301 -12.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared: 0.5673, Adjusted R-squared: 0.5671
## F-statistic: 4368 on 3 and 9996 DF, p-value: < 2.2e-16
summary(lm(y~x+location+x*location))
##
## Call:
## lm(formula = y ~ x + location + x * location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.605 -6.732 1.030 6.721 16.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.82216 0.53433 22.125 <2e-16 ***
## x 1.94224 0.05248 37.011 <2e-16 ***
## locationToronto -6.61497 0.74399 -8.891 <2e-16 ***
## x:locationToronto -0.90998 0.07301 -12.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared: 0.5673, Adjusted R-squared: 0.5671
## F-statistic: 4368 on 3 and 9996 DF, p-value: < 2.2e-16
summary(lm(y~x+location+gender))
##
## Call:
## lm(formula = y ~ x + location + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0384 -2.3588 0.0139 2.3698 9.7488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52273 0.16517 57.65 <2e-16 ***
## x 1.48283 0.01559 95.14 <2e-16 ***
## locationToronto -15.67110 0.06303 -248.64 <2e-16 ***
## gender 13.46727 0.06304 213.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.151 on 9996 degrees of freedom
## Multiple R-squared: 0.921, Adjusted R-squared: 0.921
## F-statistic: 3.887e+04 on 3 and 9996 DF, p-value: < 2.2e-16
summary(lm(y~x*location))
##
## Call:
## lm(formula = y ~ x * location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.605 -6.732 1.030 6.721 16.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.82216 0.53433 22.125 <2e-16 ***
## x 1.94224 0.05248 37.011 <2e-16 ***
## locationToronto -6.61497 0.74399 -8.891 <2e-16 ***
## x:locationToronto -0.90998 0.07301 -12.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared: 0.5673, Adjusted R-squared: 0.5671
## F-statistic: 4368 on 3 and 9996 DF, p-value: < 2.2e-16
summary(lm(y~x*gender))
##
## Call:
## lm(formula = y ~ x * gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.736 -7.808 0.166 7.766 17.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.32060 0.60877 2.169 0.0301 *
## x 1.51759 0.05970 25.419 <2e-16 ***
## gender 14.24115 0.85158 16.723 <2e-16 ***
## x:gender -0.07372 0.08357 -0.882 0.3777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.447 on 9996 degrees of freedom
## Multiple R-squared: 0.4327, Adjusted R-squared: 0.4326
## F-statistic: 2542 on 3 and 9996 DF, p-value: < 2.2e-16
summary(lm(y~x*location*gender))
##
## Call:
## lm(formula = y ~ x * location * gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5321 -1.3544 -0.0245 1.3130 7.3862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.155796 0.205852 10.473 < 2e-16 ***
## x 1.989341 0.020178 98.589 < 2e-16 ***
## locationToronto -1.655801 0.286231 -5.785 7.48e-09 ***
## gender 18.045105 0.287576 62.749 < 2e-16 ***
## x:locationToronto -0.943339 0.028071 -33.606 < 2e-16 ***
## x:gender -0.006321 0.028241 -0.224 0.823
## locationToronto:gender -8.666366 0.400384 -21.645 < 2e-16 ***
## x:locationToronto:gender -0.033734 0.039291 -0.859 0.391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.985 on 9992 degrees of freedom
## Multiple R-squared: 0.9687, Adjusted R-squared: 0.9687
## F-statistic: 4.418e+04 on 7 and 9992 DF, p-value: < 2.2e-16
Finally, let’s try a model with 5 locations.
n=10000
x<-rnorm(n,10,2)
e<-rnorm(n,0,5)
gender<-ifelse(runif(n)>0.5,1,0)
#gender<-c(rep(1,n/2),rep(0,n/2))
location<-c(rep("Toronto",n/5),
rep("Chicago",n/5),
rep("New York",n/5),
rep("Beijing",n/5),rep("Shanghai",n/5)
)
d<-data.frame(x=x,gender=gender,location=location)
d$gender<-as.factor(d$gender)
#true coefficient, male =5, female = 1
d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 1+5*d$x+e,
ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
ifelse(d$gender=="1" & d$location=="Chicago", 2+10*d$x+e,
ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,
ifelse(d$gender=="1" & d$location=="New York",3+15*d$x+e,
ifelse(d$gender=="0" & d$location=="New York",3+5*d$x+e,
ifelse(d$gender=="1" & d$location=="Beijing",8+30*d$x+e,
ifelse(d$gender=="0" & d$location=="Beijing",8+2*d$x+e,
ifelse(d$gender=="1" & d$location=="Shanghai",15+25*d$x+e,
ifelse(d$gender=="0" & d$location=="Shanghai",15+2*d$x+e,NA))))))))))
attach(d)
## The following objects are masked _by_ .GlobalEnv:
##
## gender, location, x
## The following objects are masked from d (pos = 3):
##
## gender, location, x, y
## The following objects are masked from d (pos = 4):
##
## gender, location, x, y
## The following objects are masked from d (pos = 5):
##
## gender, x, y
p<-ggplot(d,aes(x,y,color=gender))+
geom_point()
p+facet_grid(~location)
summary(lm(y~x*location*gender))
##
## Call:
## lm(formula = y ~ x * location * gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.499 -3.430 0.059 3.369 19.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25456 0.80758 10.221 < 2e-16 ***
## x 1.97801 0.07870 25.134 < 2e-16 ***
## locationChicago -5.46995 1.12087 -4.880 1.08e-06 ***
## locationNew York -5.55107 1.15323 -4.813 1.50e-06 ***
## locationShanghai 7.26783 1.14054 6.372 1.94e-10 ***
## locationToronto -6.82771 1.12258 -6.082 1.23e-09 ***
## gender 0.64179 1.15428 0.556 0.5782
## x:locationChicago -0.05470 0.10930 -0.500 0.6168
## x:locationNew York 3.03849 0.11217 27.089 < 2e-16 ***
## x:locationShanghai -0.01133 0.11170 -0.101 0.9192
## x:locationToronto -1.01747 0.10972 -9.273 < 2e-16 ***
## x:gender 27.93490 0.11250 248.310 < 2e-16 ***
## locationChicago:gender -2.43737 1.59633 -1.527 0.1268
## locationNew York:gender -0.41363 1.62282 -0.255 0.7988
## locationShanghai:gender -3.17128 1.62732 -1.949 0.0513 .
## locationToronto:gender -0.58634 1.61511 -0.363 0.7166
## x:locationChicago:gender -19.79186 0.15583 -127.006 < 2e-16 ***
## x:locationNew York:gender -17.94362 0.15851 -113.202 < 2e-16 ***
## x:locationShanghai:gender -4.71275 0.15933 -29.578 < 2e-16 ***
## x:locationToronto:gender -23.96322 0.15828 -151.401 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.03 on 9980 degrees of freedom
## Multiple R-squared: 0.9977, Adjusted R-squared: 0.9977
## F-statistic: 2.248e+05 on 19 and 9980 DF, p-value: < 2.2e-16
So, if you think some factors (gender, location, season, etc.,) may impact your explained variables, set them as a dummy variables.
Good luck and happy modelling.
Max Shang
Email: zshang@me.com
Connect via twitter: @MaxZYShang