Discussion 12:
Using R, build a multiple regression model for data that interests you.
Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis.
Was the linear model appropriate? Why or why not?
Solution:
Data set:
Consider the data set “mtcars” available in the R environment. It compares different car models in terms of fuel economy in miles per gallon (“mpg”), engine displacement in cubic inches (“disp”), gross horsepower (“hp”), and weight in units of 1000 lbs (“wt”).
The goal of the model is to establish the relationship between “mpg” as the response variable and “disp”, “hp”, and “wt” as the predictor variables.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
input <- mtcars[,c("mpg","disp","hp","wt")]
print(input)
## mpg disp hp wt
## Mazda RX4 21.0 160.0 110 2.620
## Mazda RX4 Wag 21.0 160.0 110 2.875
## Datsun 710 22.8 108.0 93 2.320
## Hornet 4 Drive 21.4 258.0 110 3.215
## Hornet Sportabout 18.7 360.0 175 3.440
## Valiant 18.1 225.0 105 3.460
## Duster 360 14.3 360.0 245 3.570
## Merc 240D 24.4 146.7 62 3.190
## Merc 230 22.8 140.8 95 3.150
## Merc 280 19.2 167.6 123 3.440
## Merc 280C 17.8 167.6 123 3.440
## Merc 450SE 16.4 275.8 180 4.070
## Merc 450SL 17.3 275.8 180 3.730
## Merc 450SLC 15.2 275.8 180 3.780
## Cadillac Fleetwood 10.4 472.0 205 5.250
## Lincoln Continental 10.4 460.0 215 5.424
## Chrysler Imperial 14.7 440.0 230 5.345
## Fiat 128 32.4 78.7 66 2.200
## Honda Civic 30.4 75.7 52 1.615
## Toyota Corolla 33.9 71.1 65 1.835
## Toyota Corona 21.5 120.1 97 2.465
## Dodge Challenger 15.5 318.0 150 3.520
## AMC Javelin 15.2 304.0 150 3.435
## Camaro Z28 13.3 350.0 245 3.840
## Pontiac Firebird 19.2 400.0 175 3.845
## Fiat X1-9 27.3 79.0 66 1.935
## Porsche 914-2 26.0 120.3 91 2.140
## Lotus Europa 30.4 95.1 113 1.513
## Ford Pantera L 15.8 351.0 264 3.170
## Ferrari Dino 19.7 145.0 175 2.770
## Maserati Bora 15.0 301.0 335 3.570
## Volvo 142E 21.4 121.0 109 2.780
Creating the multiple regression model
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
##
## Coefficients:
## (Intercept) disp hp wt
## 37.105505 -0.000937 -0.031157 -3.800891
Interpret all coefficients
# plot of the response (mpg) against predictor 1 (disp)
ggplot(data = input, aes(x = disp, y = mpg)) + geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)

# plot of the response (mpg) against predictor 2 (hp)
ggplot(data = input, aes(x = hp, y = mpg)) + geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)

# plot of the response (mpg) against predictor 3 (wt)
ggplot(data = input, aes(x = wt, y = mpg)) + geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)

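Plotting the response against each predictor individually shows a negative correlation in every case: mpg decreases as disp, hp, and wt increase. In the fitted model, each coefficient is the expected change in mpg for a one-unit increase in that predictor with the other two held fixed: about 0.031 mpg less per additional horsepower, about 3.80 mpg less per additional 1000 lbs of weight, and about 0.0009 mpg less per additional cubic inch of displacement. The intercept (37.11) is the predicted mpg when all three predictors are zero, which lies outside the range of the data and has no practical meaning.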
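The correlations behind the scatter plots can also be computed directly. The following is a minimal sketch using base R's cor(); the object name cors is introduced here only for illustration.

```r
# Pearson correlation of mpg with each predictor in the built-in mtcars data.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# 'cors' is an illustrative name: one correlation per predictor column.
cors <- sapply(input[, c("disp", "hp", "wt")], function(x) cor(input$mpg, x))
print(round(cors, 3))
```

A value near -1 indicates a strong negative linear association between mpg and that predictor.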
Residual analysis
# residual plot
ggplot(model, aes(.fitted, .resid)) +
  geom_point(color = "darkgreen", size = 2) +
  labs(title = "Fitted Values vs Residuals", x = "Fitted Values", y = "Residuals")

# normal plot
qqnorm(resid(model))
qqline(resid(model))
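As a numeric complement to the Q-Q plot, a Shapiro-Wilk test could be run on the same residuals. This is only a sketch of one possible check, not part of the original analysis; the name sw is illustrative.

```r
# Refit the same model so this snippet runs on its own.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(model))
print(sw)
```

A small p-value would be evidence against normality. With only 32 observations the test has limited power, so it is best read together with the plot.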

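The residual plot above is used to check whether the residuals scatter randomly around zero with roughly constant spread; a curved pattern would indicate nonlinearity and a funnel shape would indicate non-constant variance. Correlation among the predictors is a separate issue that a residual plot cannot reveal. It is likely here because disp, hp, and wt all tend to increase together with the size of the car, and it is consistent with the large standard error on the disp coefficient. The model as fitted contains no interaction term, so any such correlation comes from the predictors themselves.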
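Correlation among the predictors can be checked directly instead of being read off a residual plot. The sketch below prints the predictor correlation matrix and computes variance inflation factors by hand from the definition VIF_j = 1 / (1 - R^2_j); the names predictors and vif are illustrative, and no extra package is assumed.

```r
# Predictors used in the model, taken from the built-in mtcars data.
predictors <- mtcars[, c("disp", "hp", "wt")]

# Pairwise correlations among the predictors.
print(round(cor(predictors), 2))

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors.
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.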
Was the linear model appropriate? Why or why not?
summary(model)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## disp -0.000937 0.010350 -0.091 0.92851
## hp -0.031157 0.011436 -2.724 0.01097 *
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
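The p-value of the overall F-test (8.65e-11) is far below the 0.05 threshold, so the null hypothesis that all slope coefficients are zero is rejected. Individually, hp and wt are significant at the 5% level, while disp is not (p = 0.93).
The multiple R-squared is 0.8268 (adjusted R-squared 0.8083), so the model explains about 83% of the variability in mpg.
The normal Q-Q plot suggests that the residuals are approximately normally distributed.
Overall, the linear model appears to be a good fit for these data.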
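The task statement also calls for a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction term, which the model above does not contain. One way they could be added is sketched below. The choices are assumptions made for illustration: wt is squared, the mtcars column am (transmission, 0 = automatic, 1 = manual) serves as the dichotomous variable, and it is interacted with wt. The names base and extended are hypothetical.

```r
# Original additive model, refit so this snippet runs on its own.
base <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Extended model (illustrative choice of terms):
#   I(wt^2) : quadratic term in weight
#   am      : dichotomous term (0 = automatic, 1 = manual)
#   wt:am   : dichotomous vs. quantitative interaction
extended <- lm(mpg ~ disp + hp + wt + I(wt^2) + am + wt:am, data = mtcars)
print(summary(extended))

# The base model is nested in the extended one, so an F-test
# can compare the two fits.
print(anova(base, extended))
```

In the extended model the wt:am coefficient is the difference in the weight slope between manual and automatic cars, and the am coefficient is the difference in intercept between the two groups.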
Conclusion:
Based on the above intercept and coefficient values, we create the mathematical equation:
Y = a + b_disp*x1 + b_hp*x2 + b_wt*x3, where Y is mpg, x1 is disp, x2 is hp, and x3 is wt (or)
Y = 37.1055 + (-0.000937)*x1 + (-0.0312)*x2 + (-3.8009)*x3
We can use the regression equation above to predict the mpg of a car when a new set of values for displacement, horsepower, and weight is provided.
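Such a prediction can be sketched with predict(). The input values below (disp = 200, hp = 120, wt = 3.0) are hypothetical and chosen only for illustration, as are the names new_car and pred.

```r
# Refit the same model so this snippet runs on its own.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Hypothetical new car: 200 cu. in., 120 hp, 3000 lbs (wt is in 1000 lbs).
new_car <- data.frame(disp = 200, hp = 120, wt = 3.0)

# Point prediction plus a 95% prediction interval.
pred <- predict(model, newdata = new_car, interval = "prediction")
print(pred)

# The same point prediction computed by hand from the coefficients.
manual <- sum(coef(model) * c(1, 200, 120, 3.0))
print(manual)
```

Plugging these values into the equation by hand gives roughly 21.8 mpg, which matches the fit column returned by predict(). The prediction interval shows how uncertain a forecast for a single new car is.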