Regression 3

Harold Nelson

4/5/2022

Setup

library(tidyverse)

Dummy Variables

A dummy variable has only two values: 0 and 1 (or FALSE and TRUE). This device is used in regression to mark the presence or absence of a condition. If a dummy is included among the independent variables, its coefficient is the amount added to the predicted value when the condition is present. I’ll build an example with the cdc dataset.

load("cdc.Rdata")

cdc = cdc %>% 
  mutate(male_bool = gender == "m",
         male_numeric = ifelse(gender == "m",1,0))

table(cdc$male_bool,cdc$male_numeric)
##        
##             0     1
##   FALSE 10431     0
##   TRUE      0  9569

Models

Create models bool_model and numeric_model that predict weight from height plus one of these two variables.

Solution

bool_model = lm(weight~height + male_bool, data = cdc)
print(bool_model)
## 
## Call:
## lm(formula = weight ~ height + male_bool, data = cdc)
## 
## Coefficients:
##   (Intercept)         height  male_boolTRUE  
##      -128.890          4.359         12.011

numeric_model = lm(weight~height + male_numeric, data = cdc)
print(numeric_model)
## 
## Call:
## lm(formula = weight ~ height + male_numeric, data = cdc)
## 
## Coefficients:
##  (Intercept)        height  male_numeric  
##     -128.890         4.359        12.011

Here’s what either of these models tells you.

To predict weight, use the intercept and the height coefficient to get a value for someone who is not male, then add 12.011 pounds if the person is male. The two models give identical coefficients because R treats FALSE and TRUE as 0 and 1 in a regression.
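As a worked example of this rule, the sketch below plugs the printed coefficients into the prediction formula. The height of 70 (in the dataset's height units) and the helper name predict_weight are illustrative assumptions, not part of the cdc analysis.

```r
# Coefficients copied from the printed bool_model output
intercept <- -128.890
b_height  <- 4.359
b_male    <- 12.011

# Hypothetical helper: predicted weight for a given height and male indicator
predict_weight <- function(height, male) {
  intercept + b_height * height + b_male * as.numeric(male)
}

not_male_70 <- predict_weight(70, FALSE)
male_70     <- predict_weight(70, TRUE)

# The two predictions differ by exactly the dummy coefficient
male_70 - not_male_70
```

With the real fitted model, predict(bool_model, newdata = ...) would do the same arithmetic using the unrounded coefficients.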

A Female Dummy

We might also consider a dummy to label females. Do this and then run a model using the female dummy. Look at the result.

Solution

cdc = cdc %>% 
  mutate(female_bool = gender == "f")

bool_model_f = lm(weight~height + female_bool, data = cdc)
print(bool_model_f)
## 
## Call:
## lm(formula = weight ~ height + female_bool, data = cdc)
## 
## Coefficients:
##     (Intercept)           height  female_boolTRUE  
##        -116.879            4.359          -12.011
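Compared with the male-dummy model, the height coefficient is unchanged, the dummy coefficient has the opposite sign, and the intercept has moved by 12.011 (-128.890 + 12.011 = -116.879) because the baseline group is now males. The sketch below checks the same relationship on simulated data; the data frame toy and its numbers are made up for illustration and are not the cdc data.

```r
set.seed(1)
n <- 200

# Simulated toy data (an assumption, not the cdc dataset)
toy <- data.frame(
  height = rnorm(n, mean = 67, sd = 4),
  male   = rep(c(TRUE, FALSE), length.out = n)
)
toy$weight <- -120 + 4 * toy$height + 10 * toy$male + rnorm(n, sd = 15)
toy$female <- !toy$male

# Same model twice, with the dummy pointing in opposite directions
fit_m <- lm(weight ~ height + male, data = toy)
fit_f <- lm(weight ~ height + female, data = toy)

coef(fit_m)
coef(fit_f)
```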

Only a Dummy

What do we get for our coefficients if a dummy is the only variable in the model?

Solution

model_dummy = lm(weight~male_bool, data = cdc)
print(model_dummy)
## 
## Call:
## lm(formula = weight ~ male_bool, data = cdc)
## 
## Coefficients:
##   (Intercept)  male_boolTRUE  
##        151.67          37.66

If a person is not male, the predicted weight is 151.67. If a person is male, the predicted weight is 151.67 + 37.66 = 189.33.

Look at the mean weights of males and females using tapply().

tapply(cdc$weight, cdc$gender, mean)
##        m        f 
## 189.3227 151.6662
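The group means match the regression: the intercept is the mean for the group coded FALSE, and the intercept plus the dummy coefficient is the mean for the group coded TRUE. A minimal sketch on simulated data (the toy data frame is an assumption, not the cdc data) shows the same identity:

```r
set.seed(2)

# Simulated toy data: 40 males, 60 non-males
toy <- data.frame(male = rep(c(TRUE, FALSE), times = c(40, 60)))
toy$weight <- ifelse(toy$male, 190, 150) + rnorm(nrow(toy), sd = 20)

fit <- lm(weight ~ male, data = toy)

# Group means; the names are "FALSE" and "TRUE"
group_means <- tapply(toy$weight, toy$male, mean)

coef(fit)
group_means
```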

Complete Dummies

What happens if we include both a male dummy and a female dummy in a regression model?

Solution

model_both = lm(weight ~ male_bool + female_bool, data = cdc)
print(model_both)
## 
## Call:
## lm(formula = weight ~ male_bool + female_bool, data = cdc)
## 
## Coefficients:
##     (Intercept)    male_boolTRUE  female_boolTRUE  
##          151.67            37.66               NA

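This model fails in a non-spectacular way. lm() does not stop with an error; it drops the redundant dummy and reports NA for its coefficient. The problem is that male_bool and female_bool always sum to 1, which duplicates the intercept, so the three coefficients cannot be estimated separately. Many other regression procedures will refuse to run with a complete set of dummies. When the model has an intercept, one dummy always has to be left out.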
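The redundancy can be seen directly in the design matrix. The sketch below uses simulated data (the toy data frame is an assumption, not the cdc data) to show that the two dummy columns add up to the intercept column, so the matrix has only two independent columns.

```r
set.seed(3)

# Simulated toy data with a complete set of dummies
toy <- data.frame(male = rep(c(TRUE, FALSE), times = c(40, 60)))
toy$female <- !toy$male
toy$weight <- ifelse(toy$male, 190, 150) + rnorm(nrow(toy), sd = 20)

# Design matrix with columns (Intercept), maleTRUE, femaleTRUE
X <- model.matrix(~ male + female, data = toy)

# The dummy columns sum to the intercept column in every row
dummies_sum_to_one <- all(X[, "maleTRUE"] + X[, "femaleTRUE"] == X[, "(Intercept)"])

# Three columns, but only two are linearly independent
x_rank <- qr(X)$rank

# lm() reports NA for the coefficient it cannot estimate
fit_both <- lm(weight ~ male + female, data = toy)
coef(fit_both)
```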

Use a Categorical Variable

What happens if we just put gender in the model and forget about constructing dummy variables?

Solution

model_gender = lm(weight~gender, data = cdc)
print(model_gender)
## 
## Call:
## lm(formula = weight ~ gender, data = cdc)
## 
## Coefficients:
## (Intercept)      genderf  
##      189.32       -37.66

Look at the levels of the factor variable gender.

levels(cdc$gender)
## [1] "m" "f"

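The “reference level” is “m”, the first level of the factor. The reference level is always omitted from the model to avoid a complete set of dummies. These are not R’s default levels: by default, factor levels are sorted alphabetically, which would put “f” before “m”, so the levels in this dataset must have been set explicitly.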
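If a different reference level is wanted, the level order can be set when the factor is created or changed afterwards with relevel(). The sketch below uses a small made-up vector rather than cdc$gender.

```r
# A small made-up vector for illustration
g <- c("m", "f", "f", "m")

# Default: levels are sorted alphabetically, so "f" comes first
g_default <- factor(g)
levels(g_default)

# Explicit level order, as in the cdc data: "m" is the reference level
g_explicit <- factor(g, levels = c("m", "f"))
levels(g_explicit)

# relevel() moves the chosen level to the front, making it the reference
g_releveled <- relevel(g_explicit, ref = "f")
levels(g_releveled)
```

A model fitted with the releveled factor would report a genderm coefficient instead of genderf.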

Interactions

What happens if you include gender*height in the regression formula?

Solution

interaction_model = lm(weight~gender*height, data = cdc)
print(interaction_model)
## 
## Call:
## lm(formula = weight ~ gender * height, data = cdc)
## 
## Coefficients:
##    (Intercept)         genderf          height  genderf:height  
##       -181.268         115.459           5.275          -1.897

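You get different slopes and different intercepts for the two groups. The (Intercept) and height coefficients describe the line for males, the reference level. The genderf coefficient is the change in the intercept for females, and the genderf:height coefficient is the change in the slope for females.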
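The two fitted lines can be recovered from the printed coefficients with a little arithmetic. The sketch below copies the rounded values from the output above; the variable names are illustrative.

```r
# Coefficients copied from the printed interaction_model output
b0   <- -181.268  # (Intercept)
b_f  <- 115.459   # genderf
b_h  <- 5.275     # height
b_fh <- -1.897    # genderf:height

# Males are the reference level, so their line uses b0 and b_h directly
male_line <- c(intercept = b0, slope = b_h)

# Females add the genderf shift to the intercept and the interaction to the slope
female_line <- c(intercept = b0 + b_f, slope = b_h + b_fh)

male_line
female_line
```

With these rounded coefficients the female line has intercept -65.809 and slope 3.378.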