# Load required Libraries
library(dplyr)
library(ggplot2)
library(car)
data(mtcars)
This is regression analysis of the sample mtcars dataset to determine if manual or automatic transmission results in better mpg (miles per gallon) and to quantify the difference in mpg between these transmission types. Strictly speaking a regression model isn’t needed to explain that manual transmission produces better mpg. But the selected Regression model explains that manual transmission improves the mpg by about 2.9 miles.
The mtcars dataset is data frame with 32 observations on 11 variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
A quick pairs analysis of the mtcars data is done. It is apparent from the pairs plot that mpg is negatively correlated with variables cyl, disp, hp and wt. mpg is positively correlated with drat and qsec. The histogram of the mpg variable shows that it is somewhat normal with a slightly longer right tail. A linear regression seems sufficient to model this response variable.
A nested models selection strategy is employed here. The model selection starts with two ends of the spectrum. First by just fitting mpg with just the factorized am variable and then fitting the mpg with rest of the columns in data frame. A better parsimonious model that describes the am coefficient that indicate the estimated change in mpg across the transmission type is somewhere in the middle. A
# simple underfitted model with just one regressor
fit1 <- lm(mpg ~ factor(am), data = mtcars)
# over fitted model will all the columns in the dataset.
fit2 <- lm(mpg ~ factor(am) + ., data = mtcars)
fit1$coefficients
## (Intercept) factor(am)1
## 17.147368 7.244939
fit2$coefficients
## (Intercept) factor(am)1 cyl disp hp drat
## 12.30337416 2.52022689 -0.11144048 0.01333524 -0.02148212 0.78711097
## wt qsec vs am gear carb
## -3.71530393 0.82104075 0.31776281 NA 0.65541302 -0.19941925
The variables cyl, hp, disp, wt are highly correlated. So including all of them as regressor doesn’t meaningfully help the model and only results in higher Variance Inflation Factor (VIF) as seen below.
vif(lm(mpg ~ ., data = mtcars))
## cyl disp hp drat wt qsec vs
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
## am gear carb
## 4.648487 5.357452 7.908747
So these correlated regressors are removed from our over fitted model. After a few trial and error, the following parsimonious model that fits mpg against am, wt and qsec as predictors seems very promising
fit3 <- lm(mpg ~ am + wt + qsec, data = mtcars)
fit3$coefficients
## (Intercept) am wt qsec
## 9.617781 2.935837 -3.916504 1.225886
A anova() test across the 3 models seems to indicate that our parsimonious model captures most of the variance of the overfitted model (fit2) that uses the entire dataset. The low F statistics and high p-value between the models indicate that there isn’t anything gained from adding more regressors. Hence this model (fit3) is selected as our preferred model to quantify the change in mpg across am (transmission) types.
#anova test to validate the model selection
anova(fit1, fit3, fit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ am + wt + qsec
## Model 3: mpg ~ factor(am) + (cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 39.2687 8.025e-08 ***
## 3 21 147.49 7 21.79 0.4432 0.8636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The plot function on selected model provides us with 4 useful plots that helps us diagnose model to see if there are any issues. The residual vs Fitted seems fine. There isn’t any Homoscedasticity or any other pattern in the variance of the residuals. The Normal Q-Q plot doesn’t show anything worrying other than a slight right tail consistent with the distribution of the mpg variable. The Residuals Vs Leverage plot also looks alright with most of the points clusterd in the lower range of the leverage with an exception of one or two outliers.
par(mfrow = c(2,2))
plot(fit3)
According to the model, the coefficient of regressor am is 2.94 rounded to 2 digits. This is the increase in mpg contributed by the manual transmission with the other regressors remaining same. The coefficient of wt is -3.92 and coefficient of qsec is 1.23. So the mpg decreases by 3.92 miles per 1000 lbs increase in wt and mpg increases by 1.23 miles per one sec increase in quarter mile time. The interpretation of this model is also straight forward. Lightness (lower wt) and slower capacity (resulting in slower acceleration and higher qsec) vehicle contributes most of the difference in the mean mpg between automatic and manual transmission vehicles. Still the tranmission contributes to a gain of about 3 mpg all else being equal
summary(fit3)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## am 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
This model is based on just 32 observations. This isn’t very high and linear model developed can be influenced greatly by just a few outliers.
Without a training and test set the accuracy of the selected model is also unknown.
Every effort has been taken to select an optimum model and avoid known pitfalls like colinearity between regressors, checking for normality of the response variable, performing residual diagnostics to make sure there isn’t any correlation or pattern in the residual variance etc., However the predictive power of this model as opposed to the other possible models is still an unknown.