Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Response to Prompt:
For this analysis I will use the built-in dataset mtcars. The model
uses hp (horsepower) as the quadratic term, am (transmission: 0 =
automatic, 1 = manual) as the dichotomous term, and the am:wt
interaction (transmission by weight) as the dichotomous vs.
quantitative interaction term.
The predictors I chose for the model are horsepower (hp), weight of
the car (wt), type of transmission (am), and the time taken to travel
1/4 mile from a standing start (qsec). I believe each is related in
some form to fuel efficiency (mpg).
# Load the sample data set
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Fit the multiple regression model
model <- lm(mpg ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = mpg ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0468 -1.2187 -0.0369 1.1797 4.0086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.233e+01 1.046e+01 2.135 0.04271 *
## hp -7.286e-02 4.127e-02 -1.765 0.08970 .
## wt -2.370e+00 8.052e-01 -2.943 0.00693 **
## qsec 5.853e-01 4.325e-01 1.353 0.18808
## I(hp^2) 1.680e-04 9.186e-05 1.829 0.07934 .
## am 1.332e+01 3.639e+00 3.660 0.00118 **
## wt:am -4.257e+00 1.276e+00 -3.336 0.00266 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.034 on 25 degrees of freedom
## Multiple R-squared: 0.9082, Adjusted R-squared: 0.8861
## F-statistic: 41.21 on 6 and 25 DF, p-value: 8.93e-12
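Our estimated intercept is 22.33. This is the predicted value of mpg
when all predictors (hp, wt, qsec, am) are 0, that is, for an
automatic car with zero horsepower, zero weight, and a quarter-mile
time of zero. No such car exists, so the intercept is an extrapolation
that anchors the fit rather than a meaningful prediction. The p-value
of 0.04271 tells us that it is statistically significant at the 0.05
level.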
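The estimated coefficient for hp is -0.07286. Because the model also
includes I(hp^2), this is not a constant effect: holding the other
predictors fixed, the slope of predicted mpg with respect to
horsepower is -0.07286 + 2(0.0001680)hp. At hp = 100, for example, one
additional horsepower lowers predicted mpg by about 0.039. The p-value
of 0.08970 tells us that this term is not statistically significant at
the 0.05 level.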
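The estimated coefficient for wt is -2.370. Because the model includes
the wt:am interaction, this is the weight slope for cars with
automatic transmissions (am = 0): as weight increases by 1 unit (1,000
lbs), the predicted miles per gallon decreases by 2.370, holding the
other predictors fixed. The p-value of 0.00693 tells us that this is
statistically significant.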
The estimated coefficient for qsec is 0.5853. This tells us that as
the quarter-mile time increases by 1 second, the predicted miles per
gallon increases by 0.5853, holding the other predictors fixed. The
p-value of 0.18808 tells us that this is not statistically
significant.
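The estimated coefficient for I(hp^2) is 0.0001680. The positive sign
means the fitted relationship between hp and mpg is curved rather than
strictly linear: the decline in predicted mpg becomes less steep as
horsepower increases. The p-value of 0.07934 tells us that this term
is not statistically significant at the 0.05 level, so the evidence
for curvature is weak.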
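To make the combined effect of hp and I(hp^2) concrete, the sketch below evaluates the horsepower slope at a few values and finds where the fitted curve stops decreasing. The helper `hp_slope()`, the chosen horsepower values, and the name `hp_turning` are illustrative assumptions, not part of the original analysis.

```r
# Refit the model so this block runs on its own
data("mtcars")
model <- lm(mpg ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
b <- coef(model)

# Slope of predicted mpg with respect to hp (derivative of the quadratic)
# hp_slope() is a hypothetical helper introduced for illustration
hp_slope <- function(hp) unname(b["hp"] + 2 * b["I(hp^2)"] * hp)

# Evaluate the slope at a few illustrative horsepower values
hp_values <- c(100, 150, 200)
print(data.frame(hp = hp_values, slope = hp_slope(hp_values)))

# Horsepower at which the fitted slope reaches zero
hp_turning <- unname(-b["hp"] / (2 * b["I(hp^2)"]))
print(hp_turning)

# Compare against the horsepower values actually observed
print(range(mtcars$hp))
```

If the turning point falls inside the observed range of hp, the fitted curve implies mpg rising with horsepower beyond that point, which is more likely an artifact of the quadratic form than a real effect.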
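The estimated coefficient for am is 13.32. Because of the wt:am
interaction, this is the predicted difference in mpg between manual
and automatic cars at wt = 0, not an average difference. At a given
weight the difference is 13.32 - 4.257(wt), so manual cars have higher
predicted mpg only below a weight of about 3.13 (roughly 3,130 lbs).
The p-value of 0.00118 tells us that this is statistically
significant.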
The estimated coefficient for wt:am is -4.257. This is the difference
in the weight slope between manual and automatic cars. For cars with
manual transmissions, as weight increases by 1 unit (1,000 lbs), the
predicted miles per gallon decreases by 2.370 + 4.257 = 6.627, compared
with 2.370 for automatic cars. The p-value of 0.00266 tells us that
this is statistically significant.
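The interaction can be read directly off the fitted coefficients. The following is a minimal sketch; the helper `am_effect()`, the example weights, and the variable names are assumptions made for illustration.

```r
# Refit the model so this block runs on its own
data("mtcars")
model <- lm(mpg ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
b <- coef(model)

# Weight slope for automatic cars (am = 0) and manual cars (am = 1)
wt_slope_auto   <- unname(b["wt"])
wt_slope_manual <- unname(b["wt"] + b["wt:am"])
print(c(automatic = wt_slope_auto, manual = wt_slope_manual))

# Predicted manual-minus-automatic difference in mpg at a given weight
# am_effect() is a hypothetical helper introduced for illustration
am_effect <- function(wt) unname(b["am"] + b["wt:am"] * wt)

# Example weights in units of 1,000 lbs
wt_values <- c(2, 3, 4)
print(data.frame(wt = wt_values, manual_minus_auto = am_effect(wt_values)))

# Weight at which the two transmission types have equal predicted mpg
wt_crossover <- unname(-b["am"] / b["wt:am"])
print(wt_crossover)
```

Reporting the two slopes and the crossover weight together gives a fuller picture than either the am or the wt:am coefficient alone.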
# Check for normality of residuals
# (qqPlot and spreadLevelPlot come from the car package)
library(car)
qqPlot(model, main = "Normal Q-Q Plot")
## Fiat 128 Pontiac Firebird
## 18 25
# Check for homoscedasticity of residuals
spreadLevelPlot(model, main = "Spread-Level Plot")
##
## Suggested power transformation: 0.687543
# Residuals vs. fitted values plot, with a reference line at zero
plot(x = fitted(model), y = residuals(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values Plot")
abline(h = 0, lty = 2)
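We can check the assumptions of the multiple regression model by examining the residual plots. If the residuals are approximately normally distributed, have constant variance, and are randomly scattered around zero in the residuals vs. fitted values plot, then the linear model is appropriate. If any of these assumptions is violated, the linear model may not be appropriate. In this case, the Q-Q plot shows residuals that are approximately normally distributed, with the Fiat 128 and Pontiac Firebird labeled as the two most extreme points. The spread-level plot suggests a power transformation of about 0.69, which is not far from 1 (no transformation), so any departure from constant variance appears mild. The residuals vs. fitted values plot shows residuals scattered randomly around zero. Taken together, these indicate that the linear model is appropriate for the sample data set, with the caveat that the sample contains only 32 cars.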
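The visual checks can be supplemented with formal tests. The sketch below uses base R only: `shapiro.test()` for normality of the residuals, and a studentized Breusch-Pagan statistic computed by hand from an auxiliary regression. These tests were not part of the original analysis, and the names `sq_res`, `aux`, `bp_stat`, `bp_df`, and `bp_p` are assumptions introduced here.

```r
# Refit the model so this block runs on its own
data("mtcars")
model <- lm(mpg ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
res <- residuals(model)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
print(sw)

# Studentized Breusch-Pagan test, computed by hand:
# regress the squared residuals on the same predictors, then use
# n * R^2 as a chi-squared statistic with df = number of predictors
sq_res <- res^2
aux <- lm(sq_res ~ hp + wt + qsec + I(hp^2) + am + am:wt, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# A small p-value would suggest non-constant residual variance
print(c(statistic = bp_stat, df = bp_df, p.value = bp_p))
```

With only 32 observations these tests have limited power, so a non-significant result should be read alongside the plots rather than as proof that the assumptions hold. The car package, already used above, also provides `ncvTest()` as an alternative check for non-constant variance.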