Discussion 12:

Using R, build a multiple regression model for data that interests you.

Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis.

Was the linear model appropriate? Why or why not?

Solution:

Data set:

Consider the “mtcars” data set available in the R environment. It compares car models in terms of fuel economy in miles per gallon (“mpg”), engine displacement in cubic inches (“disp”), gross horsepower (“hp”), and weight in thousands of pounds (“wt”).

The goal of the model is to establish the relationship between “mpg” as the response variable and “disp”, “hp”, and “wt” as predictor variables.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
print(input)
##                      mpg  disp  hp    wt
## Mazda RX4           21.0 160.0 110 2.620
## Mazda RX4 Wag       21.0 160.0 110 2.875
## Datsun 710          22.8 108.0  93 2.320
## Hornet 4 Drive      21.4 258.0 110 3.215
## Hornet Sportabout   18.7 360.0 175 3.440
## Valiant             18.1 225.0 105 3.460
## Duster 360          14.3 360.0 245 3.570
## Merc 240D           24.4 146.7  62 3.190
## Merc 230            22.8 140.8  95 3.150
## Merc 280            19.2 167.6 123 3.440
## Merc 280C           17.8 167.6 123 3.440
## Merc 450SE          16.4 275.8 180 4.070
## Merc 450SL          17.3 275.8 180 3.730
## Merc 450SLC         15.2 275.8 180 3.780
## Cadillac Fleetwood  10.4 472.0 205 5.250
## Lincoln Continental 10.4 460.0 215 5.424
## Chrysler Imperial   14.7 440.0 230 5.345
## Fiat 128            32.4  78.7  66 2.200
## Honda Civic         30.4  75.7  52 1.615
## Toyota Corolla      33.9  71.1  65 1.835
## Toyota Corona       21.5 120.1  97 2.465
## Dodge Challenger    15.5 318.0 150 3.520
## AMC Javelin         15.2 304.0 150 3.435
## Camaro Z28          13.3 350.0 245 3.840
## Pontiac Firebird    19.2 400.0 175 3.845
## Fiat X1-9           27.3  79.0  66 1.935
## Porsche 914-2       26.0 120.3  91 2.140
## Lotus Europa        30.4  95.1 113 1.513
## Ford Pantera L      15.8 351.0 264 3.170
## Ferrari Dino        19.7 145.0 175 2.770
## Maserati Bora       15.0 301.0 335 3.570
## Volvo 142E          21.4 121.0 109 2.780

Creating the multiple regression model

# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.
print(model)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
## 
## Coefficients:
## (Intercept)         disp           hp           wt  
##   37.105505    -0.000937    -0.031157    -3.800891

Interpret all coefficients

# plot of the response (mpg) against predictor 1 (disp)
ggplot(data = input, aes(x = disp, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE)

# plot of the response (mpg) against predictor 2 (hp)
ggplot(data = input, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE)

# plot of the response (mpg) against predictor 3 (wt)
ggplot(data = input, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE)

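Plotting the response against each predictor individually shows that “mpg” is negatively correlated with each of “disp”, “hp”, and “wt”. In the fitted model, each coefficient is the expected change in “mpg” for a one-unit increase in that predictor with the other two held fixed: about -3.80 mpg per additional 1000 lbs of weight, about -0.031 mpg per additional horsepower, and about -0.0009 mpg per additional cubic inch of displacement. The intercept, 37.11, is the predicted “mpg” when all three predictors are zero, which lies outside the range of the data.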
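The pairwise relationships seen in the plots can also be put into numbers. The following is a minimal sketch, assuming the same four columns of “mtcars” used above; the name `cors` is introduced here only for illustration.

```r
# Same subset of mtcars as used for the model.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pearson correlation of mpg with each predictor, taken from the
# "mpg" row of the full correlation matrix.
cors <- cor(input)["mpg", c("disp", "hp", "wt")]
print(round(cors, 3))
```

A negative value means that “mpg” tends to fall as that predictor rises. These are marginal correlations, so they need not match the relative sizes of the regression coefficients.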

Residual analysis

# residual plot
ggplot(model, aes(.fitted, .resid)) +
  geom_point(color = "darkgreen", size = 2) +
  labs(title = "Fitted Values vs Residuals", x = "Fitted Values", y = "Residuals")

# normal Q-Q plot of the residuals
qqnorm(resid(model))
qqline(resid(model))

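The residual plot above shows the residuals against the fitted values. It is used to check for curvature and non-constant variance; it does not show correlation among predictors. Note that the model fitted above contains no interaction term, so any correlation among “disp”, “hp”, and “wt” comes from the variables themselves.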
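Correlation among predictors is easier to check directly than to read from a residual plot. One common approach is the variance inflation factor (VIF), computed here by hand in base R as a sketch; the names `predictors` and `vif` are illustrative, and what counts as a "large" VIF is a convention rather than a fixed rule.

```r
# Same subset of mtcars as used for the model.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
predictors <- c("disp", "hp", "wt")

# For each predictor, regress it on the other predictors and convert
# the resulting R-squared into a VIF: 1 / (1 - R^2).
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = input)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF well above 1 indicates that a predictor is largely explained by the others, which inflates the standard error of its coefficient.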

Was the linear model appropriate? Why or why not?

summary(model)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.891 -1.640 -0.172  1.061  5.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
## disp        -0.000937   0.010350  -0.091  0.92851    
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11

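The p-value of the overall F-test (8.65e-11) is below the 0.05 threshold, so the null hypothesis that all slope coefficients are zero is rejected. Individually, “hp” and “wt” are significant at the 0.05 level, while “disp” is not (p = 0.93).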

R2 value: the multiple R-squared of 0.83 (adjusted 0.81) means the model explains about 83% of the variability in “mpg”.
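The R-squared figures reported by summary() can be reproduced by hand from the residual and total sums of squares. This is a minimal sketch that refits the same model so it runs on its own; `rss`, `tss`, `r2`, and `adj_r2` are names chosen here for illustration.

```r
# Refit the model from above so this block is self-contained.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

rss <- sum(resid(model)^2)                    # residual sum of squares
tss <- sum((input$mpg - mean(input$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss                          # multiple R-squared

n <- nrow(input)   # number of observations (32)
p <- 3             # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

The adjusted value penalizes the number of predictors, which is why it is slightly lower than the multiple R-squared.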

The normal Q-Q plot shows that the residuals approximately follow a normal distribution.
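A visual reading of the Q-Q plot can be supplemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; using this particular test is an assumption made here, not something the analysis above relies on, and with only 32 observations its power is limited.

```r
# Refit the model from above so this block is self-contained.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(model))
print(sw)
```

A p-value above the chosen threshold means the test does not reject normality; it does not prove that the residuals are normal.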

Overall, this is a good fit.
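The assignment also asks for a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction, none of which appear in the model above. The following is one possible sketch of such a model, not the fit analyzed in this solution: it assumes “am” (transmission, 0 = automatic, 1 = manual) as the dichotomous variable and “wt” for both the quadratic and the interaction term. The names `model1` and `model2` are illustrative.

```r
# Baseline model from above, fitted on the full mtcars data.
model1 <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Extended model: adds a quadratic term in wt, the dichotomous
# variable am, and the am-by-wt interaction.
model2 <- lm(mpg ~ disp + hp + wt + I(wt^2) + am + wt:am, data = mtcars)

summary(model2)

# F-test comparing the nested models: do the added terms
# reduce the residual sum of squares by more than chance?
anova(model1, model2)
```

In the extended model, the “am” coefficient is the shift in the intercept for manual cars and the “wt:am” coefficient is the change in the “wt” slope for manual cars; the I(wt^2) coefficient captures curvature in the relationship between weight and “mpg”.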

Conclusion:

Based on the above intercept and coefficient values, we create the mathematical equation:

Y = a + b1*x1 + b2*x2 + b3*x3, where x1, x2, and x3 are “disp”, “hp”, and “wt” (or)

Y = 37.1055 + (-0.000937)*x1 + (-0.031157)*x2 + (-3.800891)*x3

We can use the regression equation above to predict the miles per gallon when a new set of values for displacement, horsepower, and weight is provided.
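As a sketch of such a prediction, the block below uses predict() on a single new observation and checks it against the equation evaluated by hand. The values for `new_car` are hypothetical and chosen only for illustration.

```r
# Refit the model from above so this block is self-contained.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# A hypothetical car: 221 cu. in. displacement, 102 hp, 2910 lbs.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Prediction from the fitted model.
pred <- predict(model, newdata = new_car)

# The same prediction from the regression equation, term by term.
b <- coef(model)
manual <- b["(Intercept)"] + b["disp"] * 221 + b["hp"] * 102 + b["wt"] * 2.91

print(unname(pred))
print(unname(manual))
```

Both values agree because predict() evaluates the same linear equation. Predictions are most trustworthy for inputs inside the range of the original data.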