Harold Nelson
4/5/2022
A dummy variable takes only two values: 0 and 1, or equivalently FALSE and TRUE. It is used in regression to mark the presence or absence of a condition. When a dummy is included among the independent variables, its coefficient is the amount added to the predicted value when the condition is present. I’ll build an example with the cdc dataset.
library(dplyr)  # provides %>% and mutate()
load("cdc.Rdata")
cdc = cdc %>%
  mutate(male_bool = gender == "m",
         male_numeric = ifelse(gender == "m", 1, 0))
table(cdc$male_bool,cdc$male_numeric)
##
##             0     1
##   FALSE 10431     0
##   TRUE      0  9569
Create models bool_model and numeric_model using these two variables in addition to height, to predict weight.
##
## Call:
## lm(formula = weight ~ height + male_bool, data = cdc)
##
## Coefficients:
## (Intercept) height male_boolTRUE
## -128.890 4.359 12.011
##
## Call:
## lm(formula = weight ~ height + male_numeric, data = cdc)
##
## Coefficients:
## (Intercept) height male_numeric
## -128.890 4.359 12.011
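The code that produced the two fits is not shown in the output above. A minimal sketch of how they could be built follows. It assumes nothing beyond base R and uses a six-row stand-in data frame called toy (invented for illustration, not the cdc data), so its numbers will differ from those printed above.

```r
# Hand-made stand-in for cdc; these six rows are invented for illustration
toy = data.frame(
  gender = c("m", "f", "m", "f", "m", "f"),
  height = c(70, 64, 72, 62, 68, 66),
  weight = c(190, 150, 185, 155, 200, 148)
)
toy$male_bool = toy$gender == "m"
toy$male_numeric = ifelse(toy$gender == "m", 1, 0)

# One model per encoding of the same dummy
bool_model = lm(weight ~ height + male_bool, data = toy)
numeric_model = lm(weight ~ height + male_numeric, data = toy)

# Same estimates either way; only the coefficient label differs
same_fit = all.equal(unname(coef(bool_model)), unname(coef(numeric_model)))
print(same_fit)
```

With the real data, replacing toy by cdc in the two lm() calls should give the models printed above.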
Here’s what either of these models tells you.
To predict weight, use the intercept and the height coefficient to get a value for someone who is not male, then add 12.011 pounds if the person is male.
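The rule can be written out as arithmetic. The sketch below copies the coefficients from the printed output; predict_weight is a hypothetical helper, and the height of 70 is an arbitrary illustration value in whatever units the height column uses (assumed here to be inches).

```r
# Coefficients copied from the printed model output
intercept = -128.890
b_height = 4.359
b_male = 12.011

# Hypothetical helper, not part of the original analysis
predict_weight = function(height, male) {
  intercept + b_height * height + b_male * as.numeric(male)
}

not_male_70 = predict_weight(70, male = FALSE)  # 176.24
male_70 = predict_weight(70, male = TRUE)       # 188.251
print(c(not_male_70, male_70))
```

The two predictions differ by exactly the dummy coefficient, whatever height is chosen.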
We might also consider a dummy to label females. Do this and then run a model using the female dummy. Look at the result.
cdc = cdc %>%
mutate(female_bool = gender == "f")
bool_model_f = lm(weight~height + female_bool, data = cdc)
print(bool_model_f)
##
## Call:
## lm(formula = weight ~ height + female_bool, data = cdc)
##
## Coefficients:
## (Intercept) height female_boolTRUE
## -116.879 4.359 -12.011
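The female-dummy model describes the same fit from the opposite baseline. A quick check with the printed coefficients (copied by hand, so only accurate to the three decimals shown):

```r
# Coefficients copied from the male-dummy model printed earlier
male_model_intercept = -128.890
male_coef = 12.011

# With females flagged instead, males become the baseline group
female_model_intercept = male_model_intercept + male_coef  # -116.879
female_coef = -male_coef                                   # -12.011
print(c(female_model_intercept, female_coef))
```

The height coefficient is unchanged because only the labelling of the baseline group has moved.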
What do we get for the coefficient if a dummy is the only variable in the model?
##
## Call:
## lm(formula = weight ~ male_bool, data = cdc)
##
## Coefficients:
## (Intercept) male_boolTRUE
## 151.67 37.66
If a person is not male, the predicted weight is 151.67. If a person is male, the predicted weight is 151.67 + 37.66 = 189.33.
Look at the mean weights of males and females using tapply().
## m f
## 189.3227 151.6662
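These group means match the dummy-only model: the intercept is the female mean, and the intercept plus the dummy coefficient is the male mean. A minimal sketch of that check, using an invented six-row data frame toy rather than the cdc data:

```r
# Invented stand-in data; the factor levels are ordered m, f as in the output above
toy = data.frame(
  gender = factor(c("m", "f", "m", "f", "m", "f"), levels = c("m", "f")),
  weight = c(190, 150, 185, 155, 200, 148)
)
toy$male_bool = toy$gender == "m"

# Mean weight within each gender
group_means = tapply(toy$weight, toy$gender, mean)

# Regression on the dummy alone
dummy_model = lm(weight ~ male_bool, data = toy)
b = coef(dummy_model)

# Predicted weights for each group, rebuilt from the coefficients
from_model = c(m = unname(b["(Intercept)"] + b["male_boolTRUE"]),
               f = unname(b["(Intercept)"]))
print(group_means)
print(from_model)
```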
What happens if we include both a male dummy and a female dummy in a regression model?
##
## Call:
## lm(formula = weight ~ male_bool + female_bool, data = cdc)
##
## Coefficients:
## (Intercept) male_boolTRUE female_boolTRUE
## 151.67 37.66 NA
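This model fails in a non-spectacular way: R fits it but reports NA for female_boolTRUE. Because male_bool + female_bool equals 1 in every row, the two dummies together duplicate the intercept, so a third coefficient cannot be estimated. Some regression procedures refuse to run at all with a complete set of dummies. Either way, one dummy always has to be left out.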
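The redundancy is easy to see directly. The sketch below uses an invented six-row data frame toy (not the cdc data) to show that the two dummies sum to 1 in every row and that lm() then returns NA for the second one.

```r
# Invented stand-in data for illustration
toy = data.frame(
  gender = c("m", "f", "m", "f", "m", "f"),
  weight = c(190, 150, 185, 155, 200, 148)
)
toy$male_bool = toy$gender == "m"
toy$female_bool = toy$gender == "f"

# Every row sums to 1, which is exactly the intercept column
dummies_sum_to_one = all(toy$male_bool + toy$female_bool == 1)

# lm() keeps the intercept and the first dummy, and cannot estimate the second
both_model = lm(weight ~ male_bool + female_bool, data = toy)
print(dummies_sum_to_one)
print(coef(both_model))
```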
What happens if we just put gender in the model and forget about constructing dummy variables?
##
## Call:
## lm(formula = weight ~ gender, data = cdc)
##
## Coefficients:
## (Intercept) genderf
## 189.32 -37.66
Look at the levels of the factor variable gender.
## [1] "m" "f"
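The “reference level” is “m”, the first level of the factor. The reference level is always omitted so that the model does not contain a complete set of dummies. These are not the default levels: by default R sorts factor levels alphabetically, which would put “f” before “m”, so the order here must have been set explicitly.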
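How the level order is controlled can be shown on a small character vector. The vector g below is invented for illustration; factor() and relevel() are base R.

```r
# A tiny invented gender vector
g = c("m", "f", "f", "m")

# Default: levels are sorted alphabetically, so "f" would be the reference
default_levels = levels(factor(g))

# Explicit order: "m" comes first and becomes the reference level
explicit_levels = levels(factor(g, levels = c("m", "f")))

# relevel() moves a chosen level to the front of an existing factor
releveled_levels = levels(relevel(factor(g), ref = "m"))

print(default_levels)
print(explicit_levels)
print(releveled_levels)
```

Either of the last two approaches would give the m-first ordering seen in this dataset.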
What happens if you include gender*height in the regression formula?
##
## Call:
## lm(formula = weight ~ gender * height, data = cdc)
##
## Coefficients:
## (Intercept) genderf height genderf:height
## -181.268 115.459 5.275 -1.897
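You get different slopes and different intercepts for the two genders. The genderf and genderf:height coefficients are the female adjustments to the male intercept and slope.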
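The two fitted lines can be recovered from the printed coefficients. The values below are copied by hand from the output, so the results are only accurate to the three decimals shown.

```r
# Reference level "m": intercept and slope are read off directly
male_intercept = -181.268
male_slope = 5.275

# Level "f": add the genderf and genderf:height adjustments
female_intercept = male_intercept + 115.459  # -65.809
female_slope = male_slope + (-1.897)         # 3.378

print(c(male_intercept, male_slope))
print(c(female_intercept, female_slope))
```

So the fitted line for females starts higher but rises more slowly with height than the line for males.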