Why need dummy variable

Most of the data can be measured by numbers, such as heights and weights. However, variables such as gender, season, location, etc., cannot be measured by numbers. Instead, we use dummy variables to measure them.

Example: gender

Let’s assume the impact of x on y is different in male and female.

For male \(y=10+5x+e\)

For female \(y=5+x+e\),

where e is random effect with mean zero. So in the true relationship between y and x, gender impacts both the interception and the slope.

First, let’s generate the data we need:

set.seed(1)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.1
n=100
x<-rnorm(n,10,2)
e<-rnorm(n,0,1)
gender<-c(rep(1,50),rep(0,50))
d<-data.frame(cbind(x,gender))

#true slope, male =5, female = 1
d$y<-ifelse(d$gender==1, 10+5*d$x+e,5+d$x+e)
d$gender<-as.factor(d$gender)
attach(d)
## The following objects are masked _by_ .GlobalEnv:
## 
##     gender, x

First, we can take a look at the relationship between x and y and color the data by gender:

ggplot(data=d,aes(x,y,color=gender))+
  geom_point()

It’s clear that the relationship between y and x should not be depicted by one line. We need two: one for male and one for female.

If we only regress y on x and gender, the result would be

summary(lm(y~x+gender))
## 
## Call:
## lm(formula = y ~ x + gender)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1698  -2.0864   0.5659   2.6493   7.2009 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.2736     2.2041  -5.568 2.29e-07 ***
## x             2.6953     0.2091  12.891  < 2e-16 ***
## gender       45.6315     0.7474  61.053  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.737 on 97 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9751 
## F-statistic:  1940 on 2 and 97 DF,  p-value: < 2.2e-16

The estimated coeffcient for x is not correct.

The correct set up should be the following, which allows gender to influnence both the interception and the slope.

summary(lm(y~x+gender+gender*x))
## 
## Call:
## lm(formula = y ~ x + gender + gender * x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7543 -0.5246 -0.1240  0.5489  2.2400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.20427    0.74111   7.022 3.12e-10 ***
## x            0.98755    0.07117  13.875  < 2e-16 ***
## gender       4.49815    1.13300   3.970 0.000139 ***
## x:gender     4.02667    0.10929  36.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9654 on 96 degrees of freedom
## Multiple R-squared:  0.9984, Adjusted R-squared:  0.9983 
## F-statistic: 1.983e+04 on 3 and 96 DF,  p-value: < 2.2e-16

or if you want to keep it short, use the following, R will know that you are trying to add a dummy variable.

summary(lm(y~x*gender))
## 
## Call:
## lm(formula = y ~ x * gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7543 -0.5246 -0.1240  0.5489  2.2400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.20427    0.74111   7.022 3.12e-10 ***
## x            0.98755    0.07117  13.875  < 2e-16 ***
## gender       4.49815    1.13300   3.970 0.000139 ***
## x:gender     4.02667    0.10929  36.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9654 on 96 degrees of freedom
## Multiple R-squared:  0.9984, Adjusted R-squared:  0.9983 
## F-statistic: 1.983e+04 on 3 and 96 DF,  p-value: < 2.2e-16

The model says that, for female (gender=0), the estimated model is \(y=5.20+0.99x\); for male (gender=1), the estimated relationship is \(y=5.20+0.99x+4.5+4.02x\), that is \(y=9.7+5.01x\), pretty close to the true relationship.

Next, let’s try two dummy variables: gender and location

Dummy for both gender and location

Gender does not matter, but location matters

Let’s generate some data where gender does not matter, but location would matter

n=10000
x<-rnorm(n,10,2)
e<-rnorm(n,0,2)
gender<-ifelse(runif(n)>0.5,1,0)
#gender<-c(rep(1,n/2),rep(0,n/2))
location<-c(rep("Toronto",n/2),
            rep("Chicago",n/2)
)
d<-data.frame(x=x,gender=gender,location=location)

d$gender<-as.factor(d$gender)

d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 1+1*d$x+e,
            ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
                   ifelse(d$gender=="1" & d$location=="Chicago", 2+2*d$x+e,
                          ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,NA))))
attach(d)
## The following objects are masked _by_ .GlobalEnv:
## 
##     gender, location, x
## The following objects are masked from d (pos = 3):
## 
##     gender, x, y

Plot to see the relationship between \(x\) and \(y\), color the data by gender and seperate them by location.

p<-ggplot(d,aes(x,y,color=gender))+
  geom_point()
p+facet_grid(~location)

The impact of gender on y seems radom as it should be. But when you compare the data from Chicago and data from Toronto, the interception and the slope is different.

If we naively ignored the impact of gender and location, the model would be

summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4047  -5.5901   0.0189   5.5535  12.8347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.6149     0.3006   5.373 7.93e-08 ***
## x             1.4878     0.0295  50.440  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.964 on 9998 degrees of freedom
## Multiple R-squared:  0.2029, Adjusted R-squared:  0.2028 
## F-statistic:  2544 on 1 and 9998 DF,  p-value: < 2.2e-16

R-squared is pretty low.

We know gender wouldn’t matter, but let’s just add it to see if it will make any difference

summary(lm(y~x+gender))
## 
## Call:
## lm(formula = y ~ x + gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.402  -5.592   0.020   5.554  12.838 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.617854   0.306891   5.272 1.38e-07 ***
## x            1.487846   0.029499  50.437  < 2e-16 ***
## gender      -0.005625   0.119317  -0.047    0.962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.965 on 9997 degrees of freedom
## Multiple R-squared:  0.2029, Adjusted R-squared:  0.2027 
## F-statistic:  1272 on 2 and 9997 DF,  p-value: < 2.2e-16

As expected, the impact of gender is not significant.

Now let’s examine the impact of location

summary(lm(y~x+location))
## 
## Call:
## lm(formula = y ~ x + location)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8733 -1.5038 -0.0207  1.4710  7.4832 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.13409    0.11348   62.86   <2e-16 ***
## x                 1.48990    0.01093  136.37   <2e-16 ***
## locationToronto -11.07925    0.04418 -250.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.209 on 9997 degrees of freedom
## Multiple R-squared:  0.8907, Adjusted R-squared:  0.8906 
## F-statistic: 4.072e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

The impact of locationi s significant. But our model set up is essentially saying that we think location would only change the interception.

What if location changes both the interception and the slope. we can set the model to be

summary(lm(y~x*location))
## 
## Call:
## lm(formula = y ~ x * location)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5041 -1.3530 -0.0267  1.3175  7.4462 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17843    0.14373  15.157  < 2e-16 ***
## x                  1.98616    0.01412 140.706  < 2e-16 ***
## locationToronto   -1.48509    0.20012  -7.421 1.26e-13 ***
## x:locationToronto -0.96061    0.01964 -48.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.984 on 9996 degrees of freedom
## Multiple R-squared:  0.9118, Adjusted R-squared:  0.9117 
## F-statistic: 3.444e+04 on 3 and 9996 DF,  p-value: < 2.2e-16

you can also try this:

summary(lm(y~x*location*gender))
## 
## Call:
## lm(formula = y ~ x * location * gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5321 -1.3544 -0.0245  1.3130  7.3862 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.155796   0.205852  10.473  < 2e-16 ***
## x                         1.989341   0.020178  98.589  < 2e-16 ***
## locationToronto          -1.655801   0.286231  -5.785 7.48e-09 ***
## gender                    0.045105   0.287576   0.157    0.875    
## x:locationToronto        -0.943339   0.028071 -33.606  < 2e-16 ***
## x:gender                 -0.006321   0.028241  -0.224    0.823    
## locationToronto:gender    0.333634   0.400384   0.833    0.405    
## x:locationToronto:gender -0.033734   0.039291  -0.859    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.985 on 9992 degrees of freedom
## Multiple R-squared:  0.9118, Adjusted R-squared:  0.9117 
## F-statistic: 1.476e+04 on 7 and 9992 DF,  p-value: < 2.2e-16

Gender does not matter and location changes both the interception and the slope.

Gender and location both matter, 2 locations

Now let’s generate some data where gender and location would both matter. Let’s start with two locations.

d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 10+1*d$x+e,
            ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
                   ifelse(d$gender=="1" & d$location=="Chicago", 20+2*d$x+e,
                          ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,NA))))

attach(d)
## The following objects are masked _by_ .GlobalEnv:
## 
##     gender, location, x
## The following objects are masked from d (pos = 3):
## 
##     gender, location, x, y
## The following objects are masked from d (pos = 4):
## 
##     gender, x, y
p<-ggplot(d,aes(x,y,color=gender))+
  geom_point()
p+facet_grid(~location)

summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.174  -7.665  -2.346  12.240  23.943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.69392    0.54495   15.95   <2e-16 ***
## x            1.46924    0.05348   27.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.81 on 9998 degrees of freedom
## Multiple R-squared:  0.0702, Adjusted R-squared:  0.0701 
## F-statistic: 754.8 on 1 and 9998 DF,  p-value: < 2.2e-16
summary(lm(y~gender))
## 
## Call:
## lm(formula = y ~ gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.546  -7.933  -1.186   7.855  25.430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.4877     0.1280  128.78   <2e-16 ***
## gender       13.4856     0.1793   75.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.961 on 9998 degrees of freedom
## Multiple R-squared:  0.3615, Adjusted R-squared:  0.3614 
## F-statistic:  5660 on 1 and 9998 DF,  p-value: < 2.2e-16
summary(lm(y~x+location))
## 
## Call:
## lm(formula = y ~ x + location)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5571  -6.7026   0.6993   6.6514  16.0910 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.51660    0.38191   43.25   <2e-16 ***
## x                 1.47214    0.03677   40.04   <2e-16 ***
## locationToronto -15.70339    0.14868 -105.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.434 on 9997 degrees of freedom
## Multiple R-squared:  0.5606, Adjusted R-squared:  0.5605 
## F-statistic:  6376 on 2 and 9997 DF,  p-value: < 2.2e-16
summary(lm(y~x*location))
## 
## Call:
## lm(formula = y ~ x * location)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.605  -6.732   1.030   6.721  16.058 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.82216    0.53433  22.125   <2e-16 ***
## x                  1.94224    0.05248  37.011   <2e-16 ***
## locationToronto   -6.61497    0.74399  -8.891   <2e-16 ***
## x:locationToronto -0.90998    0.07301 -12.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5671 
## F-statistic:  4368 on 3 and 9996 DF,  p-value: < 2.2e-16
summary(lm(y~x+location+x*location))
## 
## Call:
## lm(formula = y ~ x + location + x * location)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.605  -6.732   1.030   6.721  16.058 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.82216    0.53433  22.125   <2e-16 ***
## x                  1.94224    0.05248  37.011   <2e-16 ***
## locationToronto   -6.61497    0.74399  -8.891   <2e-16 ***
## x:locationToronto -0.90998    0.07301 -12.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5671 
## F-statistic:  4368 on 3 and 9996 DF,  p-value: < 2.2e-16
summary(lm(y~x+location+gender))
## 
## Call:
## lm(formula = y ~ x + location + gender)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0384  -2.3588   0.0139   2.3698   9.7488 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.52273    0.16517   57.65   <2e-16 ***
## x                 1.48283    0.01559   95.14   <2e-16 ***
## locationToronto -15.67110    0.06303 -248.64   <2e-16 ***
## gender           13.46727    0.06304  213.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 9996 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.921 
## F-statistic: 3.887e+04 on 3 and 9996 DF,  p-value: < 2.2e-16
summary(lm(y~x*location))
## 
## Call:
## lm(formula = y ~ x * location)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.605  -6.732   1.030   6.721  16.058 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.82216    0.53433  22.125   <2e-16 ***
## x                  1.94224    0.05248  37.011   <2e-16 ***
## locationToronto   -6.61497    0.74399  -8.891   <2e-16 ***
## x:locationToronto -0.90998    0.07301 -12.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.377 on 9996 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5671 
## F-statistic:  4368 on 3 and 9996 DF,  p-value: < 2.2e-16
summary(lm(y~x*gender))
## 
## Call:
## lm(formula = y ~ x * gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.736  -7.808   0.166   7.766  17.330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.32060    0.60877   2.169   0.0301 *  
## x            1.51759    0.05970  25.419   <2e-16 ***
## gender      14.24115    0.85158  16.723   <2e-16 ***
## x:gender    -0.07372    0.08357  -0.882   0.3777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.447 on 9996 degrees of freedom
## Multiple R-squared:  0.4327, Adjusted R-squared:  0.4326 
## F-statistic:  2542 on 3 and 9996 DF,  p-value: < 2.2e-16
summary(lm(y~x*location*gender))
## 
## Call:
## lm(formula = y ~ x * location * gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5321 -1.3544 -0.0245  1.3130  7.3862 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.155796   0.205852  10.473  < 2e-16 ***
## x                         1.989341   0.020178  98.589  < 2e-16 ***
## locationToronto          -1.655801   0.286231  -5.785 7.48e-09 ***
## gender                   18.045105   0.287576  62.749  < 2e-16 ***
## x:locationToronto        -0.943339   0.028071 -33.606  < 2e-16 ***
## x:gender                 -0.006321   0.028241  -0.224    0.823    
## locationToronto:gender   -8.666366   0.400384 -21.645  < 2e-16 ***
## x:locationToronto:gender -0.033734   0.039291  -0.859    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.985 on 9992 degrees of freedom
## Multiple R-squared:  0.9687, Adjusted R-squared:  0.9687 
## F-statistic: 4.418e+04 on 7 and 9992 DF,  p-value: < 2.2e-16

Gender and location both matter, 5 locations

Finally, let’s try a model with 5 locations.

n=10000
x<-rnorm(n,10,2)
e<-rnorm(n,0,5)
gender<-ifelse(runif(n)>0.5,1,0)
#gender<-c(rep(1,n/2),rep(0,n/2))
location<-c(rep("Toronto",n/5),
            rep("Chicago",n/5),
            rep("New York",n/5),
            rep("Beijing",n/5),rep("Shanghai",n/5)
)
d<-data.frame(x=x,gender=gender,location=location)

d$gender<-as.factor(d$gender)
#true coefficient, male =5, female = 1


d$y<-ifelse(d$gender=="1" & d$location=="Toronto", 1+5*d$x+e,
            ifelse(d$gender=="0" & d$location=="Toronto", 1+1*d$x+e,
                   ifelse(d$gender=="1" & d$location=="Chicago", 2+10*d$x+e,
                          ifelse(d$gender=="0" & d$location=="Chicago", 2+2*d$x+e,
                                 ifelse(d$gender=="1" & d$location=="New York",3+15*d$x+e,
                                        ifelse(d$gender=="0" & d$location=="New York",3+5*d$x+e,
                                               ifelse(d$gender=="1" & d$location=="Beijing",8+30*d$x+e,
                                                      ifelse(d$gender=="0" & d$location=="Beijing",8+2*d$x+e,
                                                             ifelse(d$gender=="1" & d$location=="Shanghai",15+25*d$x+e,
                                                                    ifelse(d$gender=="0" & d$location=="Shanghai",15+2*d$x+e,NA))))))))))





attach(d)
## The following objects are masked _by_ .GlobalEnv:
## 
##     gender, location, x
## The following objects are masked from d (pos = 3):
## 
##     gender, location, x, y
## The following objects are masked from d (pos = 4):
## 
##     gender, location, x, y
## The following objects are masked from d (pos = 5):
## 
##     gender, x, y
p<-ggplot(d,aes(x,y,color=gender))+
  geom_point()
p+facet_grid(~location)

summary(lm(y~x*location*gender))
## 
## Call:
## lm(formula = y ~ x * location * gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.499  -3.430   0.059   3.369  19.333 
## 
## Coefficients:
##                            Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 8.25456    0.80758   10.221  < 2e-16 ***
## x                           1.97801    0.07870   25.134  < 2e-16 ***
## locationChicago            -5.46995    1.12087   -4.880 1.08e-06 ***
## locationNew York           -5.55107    1.15323   -4.813 1.50e-06 ***
## locationShanghai            7.26783    1.14054    6.372 1.94e-10 ***
## locationToronto            -6.82771    1.12258   -6.082 1.23e-09 ***
## gender                      0.64179    1.15428    0.556   0.5782    
## x:locationChicago          -0.05470    0.10930   -0.500   0.6168    
## x:locationNew York          3.03849    0.11217   27.089  < 2e-16 ***
## x:locationShanghai         -0.01133    0.11170   -0.101   0.9192    
## x:locationToronto          -1.01747    0.10972   -9.273  < 2e-16 ***
## x:gender                   27.93490    0.11250  248.310  < 2e-16 ***
## locationChicago:gender     -2.43737    1.59633   -1.527   0.1268    
## locationNew York:gender    -0.41363    1.62282   -0.255   0.7988    
## locationShanghai:gender    -3.17128    1.62732   -1.949   0.0513 .  
## locationToronto:gender     -0.58634    1.61511   -0.363   0.7166    
## x:locationChicago:gender  -19.79186    0.15583 -127.006  < 2e-16 ***
## x:locationNew York:gender -17.94362    0.15851 -113.202  < 2e-16 ***
## x:locationShanghai:gender  -4.71275    0.15933  -29.578  < 2e-16 ***
## x:locationToronto:gender  -23.96322    0.15828 -151.401  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.03 on 9980 degrees of freedom
## Multiple R-squared:  0.9977, Adjusted R-squared:  0.9977 
## F-statistic: 2.248e+05 on 19 and 9980 DF,  p-value: < 2.2e-16

So, if you think some factors (gender, location, season, etc.,) may impact your explained variables, set them as a dummy variables.

Good luck and happy modelling.

Max Shang

Email: zshang@me.com

Connect via twitter: @MaxZYShang