We would like to examine the effect of disp on mpg for types of gear from the mtcars data.Here the response variable is “mpg”, idependent variable is “disp” and the factor is “gear”.
attach(mtcars)
Lets create dummy variables for “gear”. Number of dummy variables is (n-1) where n is the number of levels. As number of levels for gear is 3, we can create two dummy variables. First add two columns “one” and “two” holding zeros as dummy variables to mtcars data. Then conditionally change values in the following way:
For “gear” == 3, “one” == 1 & “two” == 0
For “gear” == 4, “one” == 0 & “two” == 1
For “gear” == 5, “one” == 0 & “two” == 0
Lets see how dummy variables would look like for “gear”. As R uses difference from the first level as shown below, we can flip the order of levels in “gear”. That is instead of 3 its 5, and instead of 5 its 3.
contrasts(as.factor(gear))
## 4 5
## 3 0 0
## 4 1 0
## 5 0 1
mtcars <- within(mtcars, c(z1 <- rep(0, dim(mtcars)[1]), z2 <- rep(0, dim(mtcars)[1])))
mtcars[,"z1"] <- ifelse(gear == 3,1,0)
mtcars[,"z2"] <- ifelse(gear == 4,1,0)
mtcars_data <- mtcars[,c("mpg","disp","gear","z1","z2")]
head(mtcars_data, n = 10) # print first 10 obs
## mpg disp gear z1 z2
## Mazda RX4 21.0 160.0 4 0 1
## Mazda RX4 Wag 21.0 160.0 4 0 1
## Datsun 710 22.8 108.0 4 0 1
## Hornet 4 Drive 21.4 258.0 3 1 0
## Hornet Sportabout 18.7 360.0 3 1 0
## Valiant 18.1 225.0 3 1 0
## Duster 360 14.3 360.0 3 1 0
## Merc 240D 24.4 146.7 4 0 1
## Merc 230 22.8 140.8 4 0 1
## Merc 280 19.2 167.6 4 0 1
mtcars.fit.model <- lm(mpg ~ disp + z1 + z2, data =mtcars_data)
print(mtcars.fit.model)
##
## Call:
## lm(formula = mpg ~ disp + z1 + z2, data = mtcars_data)
##
## Coefficients:
## (Intercept) disp z1 z2
## 29.63589 -0.04077 -0.22471 -0.08669
A more straightforward way to achieve the above is by directly fitting “gear” in the model. As class of gear is number, convert it to factor.
mtcars[,"gear"] <- as.factor(mtcars$gear) # Convert class of gear from num to factor
mtcars.fit <- lm(mpg ~ disp + gear, data =mtcars)
print(mtcars.fit)
##
## Call:
## lm(formula = mpg ~ disp + gear, data = mtcars)
##
## Coefficients:
## (Intercept) disp gear4 gear5
## 29.41118 -0.04077 0.13802 0.22471
Regression equation when fitted with dummy variable is different from the regression equation when fitted with the factor itself. However, the ANOVA ouputs are identical.
anova(mtcars.fit.model) # model fit with dummy variables
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## disp 1 808.89 808.89 71.4449 3.426e-09 ***
## z1 1 0.12 0.12 0.0109 0.9177
## z2 1 0.02 0.02 0.0021 0.9637
## Residuals 28 317.01 11.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mtcars.fit) # model fit without dummy variables
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## disp 1 808.89 808.89 71.4449 3.426e-09 ***
## gear 2 0.15 0.07 0.0065 0.9935
## Residuals 28 317.01 11.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression equation: mpg = 29.63589 - 0.04077(disp) - 0.22471(z1) - 0.08669(z2)
Average mpg for “gear” ==3 : mpg = 29.63589 - 0.04077(160) - 0.22471(1) - 0.08669(0)
Average mpg for “gear” ==4 : mpg = 29.63589 - 0.04077(160) - 0.22471(0) - 0.08669(1)
Average mpg for “gear” ==5 : mpg = 29.63589 - 0.04077(160) - 0.22471(0) - 0.08669(0)
avg_mpg_gear3 <- 29.63589 - 0.04077*(160) - 0.22471*(1) - 0.08669*(0)
avg_mpg_gear4 <- 29.63589 - 0.04077*(160) - 0.22471*(0) - 0.08669*(1)
avg_mpg_gear5 <- 29.63589 - 0.04077*(160) - 0.22471*(0) - 0.08669*(0)
cat("Gear3:",avg_mpg_gear3,"Gear4:",avg_mpg_gear4,"Gear5:", avg_mpg_gear5, sep = "\n")
## Gear3:
## 22.88798
## Gear4:
## 23.026
## Gear5:
## 23.11269
The effect for the reference (baseline) category is constrained to be null(here its gear 5), and the estimates from the other dummy variables ( which are gear 3 and gear 4) are relative effects, that is the effect of moving from the reference category to the target one(i.e from gear 5 to gear 3).
Null hypothesis: H1 = H2 = H3
Alternate hypotheis: Atleast one is not equal
From the ANOVA table for the model fit using dummy variables, we can conclude that there is statistically no significant difference in mean mpg between gear 5 and gear 3 and between gear 5 and gear 4 at disp 160 as the p values are more then 0.05.
At disp 160 there is no difference in the mpg between different gears.