Regression Study on mtcars Data Set

Regression on the ‘mtcars’ Data Set

The Data Set

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The data frame consists of 32 observations on 11 variables:

1 - mpg Miles/(US) gallon
2 - cyl Number of cylinders
3 - disp Displacement (cu.in.)
4 - hp Gross horsepower
5 - drat Rear axle ratio
6 - wt Weight (1000 lbs)
7 - qsec 1/4 mile time
8 - vs Engine shape (0 = V-shaped, 1 = straight)
9 - am Transmission (0 = automatic, 1 = manual)
10 - gear Number of forward gears
11 - carb Number of carburetors
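
As a quick sanity check of the description above, the data set can be inspected directly. This is only a minimal sketch using base R; mtcars ships with the 'datasets' package, so nothing has to be installed.

```r
# mtcars is part of base R (package 'datasets')
data(mtcars)

dim(mtcars)       # number of cars (rows) and variables (columns)
str(mtcars)       # all columns are stored as numeric, including the 0/1 codes vs and am
head(mtcars, 3)   # first three cars
```

Note that vs and am are stored as numeric 0/1 codes, which is why the later models wrap am in factor().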

Exploring the Data Set

The first step is to explore the data set by fitting a linear model. In this model, mpg is the outcome and all remaining variables are the regressors.

fit_exp<- lm(mpg~., data = mtcars)
summary(fit_exp)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

From the summary we can see that, despite the high adjusted R-squared (0.8066), no individual coefficient is significant at the 5% level, so we probably do not need all ten variables to predict mpg. Fitting many correlated regressors inflates the variance of the estimates and risks overfitting.
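
One way to make the correlation among the regressors visible is the variance inflation factor (VIF). The sketch below is not part of the original analysis; it uses the identity that the VIFs are the diagonal of the inverse correlation matrix of the regressors, so it needs only base R. The reading of values above 10 as problematic is a common rule of thumb, not a hard limit.

```r
# Regressors only (everything except the outcome mpg)
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal
# element of the inverse of the regressors' correlation matrix
vif <- diag(solve(cor(X)))

# Largest values first; large VIFs indicate strongly collinear regressors
round(sort(vif, decreasing = TRUE), 2)
```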

Fitting ‘am’ with MPG

Coming to the first part of the question, we seek to find out which type of transmission is associated with higher mpg.

fit_am<- lm(mpg~ factor(am), data = mtcars)
summary(fit_am)

## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The summary estimates that mean mpg is 7.245 miles per gallon higher for cars with a manual transmission (am = 1) than for cars with an automatic transmission, whose mean is 17.147 mpg. However, the low adjusted R-squared (0.3385) shows that ‘am’ alone explains only about a third of the variation in ‘mpg’.
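
With a single 0/1 regressor, the intercept is the mean mpg of the baseline group (am = 0, automatic) and the slope is the difference between the two group means. A short check in base R, given here only as an illustration:

```r
# Mean mpg per transmission type (0 = automatic, 1 = manual)
grp <- tapply(mtcars$mpg, mtcars$am, mean)
grp

# Difference in means; this matches the factor(am)1 coefficient (7.245)
am_diff <- unname(grp["1"] - grp["0"])
am_diff
```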

Regressor Selection

The variables with the most impact on ‘mpg’ are selected with a stepwise algorithm based on AIC, searching in both directions. Starting from the full model, the step function adds or drops one term at a time, keeps the change that lowers AIC the most, and stops when no single change lowers AIC any further.

# trace = 0 suppresses the step-by-step log
model <- step(lm(mpg ~ ., data = mtcars), direction = "both", trace = 0)
model

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

summary(lm(mpg~wt+qsec+am, data = mtcars))

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

From the stepwise search, it becomes evident that, along with ‘am’, the variables ‘qsec’ and ‘wt’ have to be included to obtain a robust model.
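
To see what the stepwise search gained, the full model and the selected model can be compared on AIC directly. This is a sketch, not the author's code; the object names full and sel are chosen here for illustration. AIC() and the criterion used inside step() differ by a constant for a fixed data set, so the ranking of the two models is the same.

```r
full <- lm(mpg ~ ., data = mtcars)               # all ten regressors
sel  <- lm(mpg ~ wt + qsec + am, data = mtcars)  # model chosen by step()

# Lower AIC is better; df counts the coefficients plus the error variance
aic_tab <- AIC(full, sel)
aic_tab
```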

Model Selection and Uncertainty Quantification

Now that we know which covariates ought to be in the model, the next step is to decide on the interactions between them. We compare candidate interaction structures and prefer the model with the highest adjusted R-squared value.

m_1<- summary(lm(mpg ~ wt * qsec * factor(am), data = mtcars))
m_2<- summary(lm(mpg ~ wt + qsec * factor(am), data = mtcars))
m_3<- summary(lm(mpg ~ wt * qsec + factor(am), data = mtcars))
m_4<- summary(lm(mpg ~ qsec * wt + factor(am), data = mtcars))
m_5<- summary(lm(mpg ~ qsec + wt * factor(am), data = mtcars))
m_6<- summary(lm(mpg ~ qsec + wt + factor(am), data = mtcars))
models_radj<- data.frame((m_1)$adj.r.squared,(m_2)$adj.r.squared,(m_3)$adj.r.squared,(m_4)$adj.r.squared, (m_5)$adj.r.squared,(m_6)$adj.r.squared)
colnames(models_radj)<- c("m_1","m_2","m_3","m_4","m_5","m_6")
models_radj

##         m_1       m_2       m_3       m_4       m_5       m_6
## 1 0.8759496 0.8531624 0.8347545 0.8347545 0.8804219 0.8335561
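
Adjusted R-squared is one criterion; for nested models, an F-test offers a complementary check on whether the extra interaction term earns its place. The sketch below compares the additive model (m_6) with the wt-by-am interaction model (m_5); the object names additive and interact are introduced here for illustration.

```r
additive <- lm(mpg ~ qsec + wt + factor(am), data = mtcars)  # same formula as m_6
interact <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)  # same formula as m_5

# Nested F-test: does adding wt:factor(am) reduce the residual
# sum of squares by more than chance alone would explain?
cmp <- anova(additive, interact)
cmp

p_interaction <- cmp[["Pr(>F)"]][2]
p_interaction
```

Because only one term is added, this F-test is equivalent to the t-test on the wt:factor(am)1 coefficient in the interaction model.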

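Model 5 (‘m_5’) has the highest adjusted R-squared value, so we proceed with it. Models ‘m_3’ and ‘m_4’ are the same model with the terms written in a different order, which is why their values coincide. The fitted equation for ‘m_5’ is:
mpg = 9.7230526 + 1.0169736 qsec - 2.9365309 wt + 14.0794278 am - 4.1413764 (wt × am)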

Hence,
when am = 0, the slope of wt is -2.937 and the intercept is 9.723.
when am = 1, the slope of wt is -7.078 and the intercept is 23.802.

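The coefficient 14.079 is therefore the estimated difference between manual and automatic vehicles only at wt = 0, which lies outside the range of the data. Because of the wt × am interaction, the difference at a given weight (in 1000 lbs) and the same qsec is 14.079 - 4.141 × wt. Manual vehicles are predicted to have higher mpg below roughly 3,400 lbs and lower mpg above it, so a manual transmission is the better option for lighter vehicles rather than across the board.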
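
Because the model contains a wt × am interaction, the gap between manual and automatic vehicles changes with weight. The sketch below recomputes the two slopes and intercepts from the fitted coefficients and evaluates the gap at a few weights; the helper am_gap and the weights 2, 3 and 4 are illustrative choices, not part of the original analysis.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)
b <- coef(fit)

# Slope and intercept per transmission type (qsec held fixed)
slope_auto   <- unname(b["wt"])
slope_manual <- unname(b["wt"] + b["wt:factor(am)1"])
int_auto     <- unname(b["(Intercept)"])
int_manual   <- unname(b["(Intercept)"] + b["factor(am)1"])

# Manual-minus-automatic difference in predicted mpg
# at weight w (in 1000 lbs) and the same qsec
am_gap <- function(w) unname(b["factor(am)1"] + b["wt:factor(am)1"] * w)
am_gap(c(2, 3, 4))

# Weight at which the predicted difference is zero
break_even <- unname(-b["factor(am)1"] / b["wt:factor(am)1"])
break_even
```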

The uncertainty in the intercept and slope estimates is quantified below with 95% confidence intervals:

fit<- lm(mpg ~ qsec + wt * factor(am), data = mtcars)
confint(fit)

##                     2.5 %    97.5 %
## (Intercept)    -2.3807791 21.826884
## qsec            0.4998811  1.534066
## wt             -4.3031019 -1.569960
## factor(am)1     7.0308746 21.127981
## wt:factor(am)1 -6.5970316 -1.685721
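
For reference, each interval above is the estimate plus or minus a t quantile times the standard error. The following sketch reproduces the confint() result by hand; the object names est, se, tcrit and manual_ci are introduced here for illustration.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# 97.5% quantile of the t distribution on the residual degrees of freedom (32 - 5 = 27)
tcrit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual_ci

# Agrees with confint() up to numerical tolerance
all.equal(unname(manual_ci), unname(confint(fit)))
```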

Residual Plots and Diagnostics

Figure 2 in the Appendix displays the residual plots for the selected model, stored in the variable ‘fit’. The plots show no clear pattern that would indicate the model is flawed. That said, the Q-Q plot looks rather skewed, which raises some concern. To investigate, the hatvalues and dfbetas of the model are calculated below.

hist(hatvalues(fit), xlab = "hatvalues", main = "Histogram of hatvalues for model 'fit'")

From the histogram, a few observations have hatvalues greater than about 0.3; these can be regarded as high-leverage points.
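
The cut-off 2*5/32 used in the next command is the common 2p/n rule of thumb, where p is the number of estimated coefficients and n the number of observations. A minimal sketch of how it is obtained, with object names (h, p, n, cutoff, high_lev) chosen here for illustration:

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

h <- hatvalues(fit)
p <- length(coef(fit))   # 5 estimated coefficients
n <- nrow(mtcars)        # 32 cars

# Hatvalues sum to p, so their mean is p/n; twice the mean is the cut-off
cutoff <- 2 * p / n      # 0.3125
high_lev <- names(h)[h > cutoff]
high_lev
```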

dfbetas(fit)[which(hatvalues(fit) > 2*5/32),]

##                     (Intercept)        qsec           wt factor(am)1
## Merc 230             0.25656732 -0.35363727  0.012205353  0.13319953
## Lincoln Continental  0.31582832 -0.13136203 -0.548982600 -0.32920707
## Lotus Europa         0.02054631 -0.02254647 -0.007125334  0.02922367
## Maserati Bora        0.06467465 -0.07097064 -0.022428766 -0.39813267
##                     wt:factor(am)1
## Merc 230               -0.15370459
## Lincoln Continental     0.25093655
## Lotus Europa           -0.03400343
## Maserati Bora           0.53072741

The skew in the Q-Q plot may be related to the high-leverage points in the data. However, since the dfbetas of these points are all below 1 in absolute value, as seen above, their influence on the coefficients is limited, and the model can reasonably be used for prediction and inference.
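
The influence check can also be summarised numerically. The sketch below takes the largest absolute dfbetas value among the high-leverage cars and, as an additional measure not used in the original analysis, lists the largest Cook's distances. Treating a dfbetas value below 1 as low influence is a rule of thumb suited to small data sets and is an assumption here.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

# Same high-leverage selection as above: hatvalue > 2p/n
hi <- hatvalues(fit) > 2 * length(coef(fit)) / nrow(mtcars)

# Largest absolute change in any coefficient (in standard-error units)
# caused by dropping one of the high-leverage cars
max_dfb <- max(abs(dfbetas(fit)[hi, ]))
max_dfb

# Cook's distance combines leverage and residual size per observation
cd <- cooks.distance(fit)
round(sort(cd, decreasing = TRUE)[1:5], 3)
```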

Allwyn Joseph

4/19/2017

Executive Summary