Project Description

“You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:”

  1. “Is an automatic or manual transmission better for MPG”

  2. “Quantify the MPG difference between automatic and manual transmissions”

Executive Summary

The objective of this exercise is to compare mileage (mpg) between automatic and manual transmissions by finding a best-fit model that also includes other variables, whether or not they are correlated with transmission type.

The mtcars dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The selected model, fit3, estimates that a manual transmission yields about 3.33 mpg more than an automatic transmission, after adjusting for the number of cylinders, displacement and horsepower.

However, this result should not be taken at face value. Other models that could provide different perspectives were not considered in this study because of limitations, known and unknown. Some observations appear to be outliers and need review for possible exclusion. The study also lacks input from subject matter experts (SMEs) in car manufacturing.

Finally, this is an exercise in becoming familiar with the process of model selection. Developing an appropriate model calls for a good understanding of the business, data analysis and statistical modeling, together with input from SMEs.

Exploratory Analysis

The mtcars dataset has 32 observations and 11 variables, namely mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb. The am variable is the binary indicator for automatic (0) and manual (1) transmission. The mean of mpg is 20.09, with a minimum of 10.40 and a maximum of 33.90.

The variables cyl, vs, gear, am and carb are converted from numeric to factor because they take only a few discrete values in the data. A boxplot (Appendix: Plot #1) indicates higher mileage for the manual type. Further analysis is needed to support this inference.

Packages dplyr and ggplot2 were loaded to support analysis.
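The data preparation described above can be sketched as follows. This is a minimal sketch in base R only, not the report's original code: the copy is named mtcdata to match the Appendix, and the helper vector factor_vars is a name introduced here for illustration.

```r
# Work on a copy so the built-in mtcars data frame stays untouched.
# 'mtcdata' is the name used in the Appendix model code.
mtcdata <- mtcars

# Variables stored as numbers but categorical in nature.
# 'factor_vars' is a helper name introduced for this sketch.
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

# Structure after conversion, and the summary of the outcome variable.
str(mtcdata)
mpg_summary <- summary(mtcdata$mpg)
print(mpg_summary)
```

After the conversion, am is a factor with levels 0 (automatic) and 1 (manual), so 0 becomes the reference level in the models that follow.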

STEP 1. Setting up the null hypothesis.

Null hypothesis: H0: auto = manual. Mean mileage is the same for automatic and manual transmissions.

Alternative hypothesis: H1: auto != manual. There is a difference in mileage between automatic and manual transmission.

A boxplot (Appendix: Plot #1) shows that mileage tends to be higher with manual than with automatic transmission. More work is needed to support this inference.
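A boxplot of this kind, together with the unadjusted group means, can be produced along these lines. The sketch uses base graphics rather than ggplot2 to stay self-contained, and the factor labels "automatic" and "manual" are chosen here for readability; they are not the level names used in the report's models.

```r
mtcdata <- mtcars

# Label the transmission levels for the plot (labels are an assumption of
# this sketch; the models in the report keep the 0/1 coding).
mtcdata$am <- factor(mtcdata$am, levels = c(0, 1),
                     labels = c("automatic", "manual"))

# Unadjusted mean mileage per transmission type.
group_means <- tapply(mtcdata$mpg, mtcdata$am, mean)
print(round(group_means, 2))

# Boxplot of mileage by transmission type.
boxplot(mpg ~ am, data = mtcdata,
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "Mileage by transmission type")
```

The difference between the two group means is the same quantity that the am coefficient of the simplest model, fit0, estimates.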

STEP 2. Multiple models

A set of nested linear models was created by adding one variable at a time, starting with am alone, then adding cyl, disp, hp and the remaining variables in dataset order, ending with the last variable in the dataset, carb (number of carburetors).

STEP 3. Model Analysis

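The anova function was used to compare the ten nested models, fit0 through fit9. The table shows that the largest reductions in RSS come from the first three additions: adding cyl (fit1, p = 7.89e-06) and hp (fit3, p = 0.028) is significant at the 0.05 level, while adding disp (fit2, p = 0.057) is only marginal. None of the later additions (fit4 through fit9) is significant at the 0.05 level. fit0 is the baseline and has no test of its own. On this basis, fit0 through fit3 were carried forward.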
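The comparison can be reproduced with a sketch like the one below. It builds the same ten nested formulas as the Appendix, but in a loop rather than with ten separate update calls; the names regressors, fits and anova_table are introduced here for illustration.

```r
mtcdata <- mtcars
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

# Regressors in the order they are added (same order as the Appendix).
regressors <- c("am", "cyl", "disp", "hp", "drat",
                "wt", "qsec", "vs", "gear", "carb")

# fits[[1]] corresponds to fit0, fits[[10]] to fit9.
fits <- lapply(seq_along(regressors), function(k) {
  lm(reformulate(regressors[1:k], response = "mpg"), data = mtcdata)
})

# Sequential F-tests: each row tests the terms added since the previous model.
anova_table <- do.call(anova, fits)
print(anova_table)
```

Each row of the table tests only the term added at that step, so the result depends on the order in which variables enter the model.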

STEP 4. Variance analysis

The four models, fit0 through fit3, were compared on their residual variance (sigma^2) and on the coefficient for transmission, am1 (manual relative to the automatic reference level), which is the 2nd row of each coefficient summary. The sigma^2 values and am1 coefficient rows were collected with the rbind function. The purpose is to pick the model with the smallest residual variance whose am1 coefficient is significant (p < 0.05).
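One way to collect these quantities is sketched below. The report does not show its exact rbind calls, so this is a plausible reconstruction rather than the original code; the names models, sigma_sq and am_coef are introduced here.

```r
mtcdata <- mtcars
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

fit0 <- lm(mpg ~ am, data = mtcdata)
fit1 <- update(fit0, . ~ . + cyl)
fit2 <- update(fit1, . ~ . + disp)
fit3 <- update(fit2, . ~ . + hp)
models <- list(fit0 = fit0, fit1 = fit1, fit2 = fit2, fit3 = fit3)

# Residual variance (sigma^2) of each model.
sigma_sq <- sapply(models, function(m) summary(m)$sigma^2)

# The am1 row (estimate, std. error, t value, p-value) of each model,
# stacked into one matrix with rbind.
am_coef <- do.call(rbind, lapply(models, function(m) coef(summary(m))["am1", ]))

print(sigma_sq)
print(am_coef)
```

Selecting the row by name ("am1") rather than by position avoids relying on am being the first regressor in every formula.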

STEP 5. Model selection and confidence interval

From the previous step, model 4 (fit3) is selected as having the lowest sigma-squared value, 7.039969, with an am1 p-value of 0.02158. It includes the regressors am, cyl, disp and hp, which may account for part of the mileage difference between transmission types.

The 95% confidence interval for the intercept (automatic transmission at the reference levels) runs from 25.21 to 32.80 mpg, while the interval for the manual-versus-automatic difference (am1) runs from 0.53 to 6.14 mpg. Because the am1 interval excludes zero, the difference is significant at the 5% level.
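The intervals can be obtained with confint, as in this minimal sketch. It refits fit3 directly from its formula, and the name ci is introduced here.

```r
mtcdata <- mtcars
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

fit3 <- lm(mpg ~ am + cyl + disp + hp, data = mtcdata)

# 95% confidence intervals for every coefficient of the selected model.
ci <- confint(fit3, level = 0.95)
print(ci)

# Interval for the manual-versus-automatic difference only.
print(ci["am1", ])
```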

STEP 6. Diagnostics

A Q-Q plot (Appendix: Plot #3) of the standardized residuals against the normal distribution shows approximate normality, with a few values at the upper end that are potential outliers.

A plot of cooks.distance for model fit3 (Appendix: Plot #4) shows some values far from the rest, particularly those above 0.12. These observations need to be reviewed for possible errors, such as data entry mistakes, to determine whether they should be excluded.
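Both diagnostics can be sketched with base R as follows. The 0.12 cutoff is taken from the text above and is not a general rule; the names std_res, cooks and flagged are introduced here for illustration.

```r
mtcdata <- mtcars
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

fit3 <- lm(mpg ~ am + cyl + disp + hp, data = mtcdata)

# Normal Q-Q plot of the standardized residuals.
std_res <- rstandard(fit3)
qqnorm(std_res, main = "Normal Q-Q: standardized residuals of fit3")
qqline(std_res)

# Cook's distance per observation, with the cutoff mentioned in the text.
cooks <- cooks.distance(fit3)
plot(cooks, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 0.12, lty = 2)

# Cars whose Cook's distance exceeds the cutoff, for manual review.
flagged <- names(cooks)[cooks > 0.12]
print(flagged)
```

Listing the flagged cars by name makes it easier to check the underlying records before deciding on any exclusion.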

Interpretation and Results

From the coefficient summary, the intercept of 29.004 is the expected mileage for a car with automatic transmission (am = 0) and 4 cylinders (the reference level of cyl), with disp and hp at zero. Mileage increases by 3.3338 mpg when the transmission is manual (am = 1), holding the other variables constant. Relative to 4 cylinders, 6 and 8 cylinders are associated with decreases of 3.2 and 1.0 mpg, respectively, holding other variables constant, although neither cylinder coefficient is significant at the 0.05 level (p = 0.053 and p = 0.741).

From the plots and the am1 coefficient (p = 0.0216 < 0.05), the null hypothesis can be rejected: mileage differs between automatic and manual transmissions, with manual higher by an estimated 3.33 mpg in model fit3.
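The meaning of "holding the other variables constant" can be illustrated by predicting mileage for two otherwise identical cars that differ only in transmission. The cars below are hypothetical (4 cylinders, disp = 100, hp = 100 are arbitrary illustrative values), and the names new_cars and pred are introduced here.

```r
mtcdata <- mtcars
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcdata[factor_vars] <- lapply(mtcdata[factor_vars], factor)

fit3 <- lm(mpg ~ am + cyl + disp + hp, data = mtcdata)

# Two hypothetical cars, identical except for transmission type.
new_cars <- data.frame(
  am   = factor(c(0, 1), levels = c(0, 1)),
  cyl  = factor(c(4, 4), levels = c(4, 6, 8)),
  disp = c(100, 100),
  hp   = c(100, 100)
)

pred <- predict(fit3, newdata = new_cars)
print(pred)

# The gap between the two predictions equals the am1 coefficient.
print(unname(diff(pred)))
print(unname(coef(fit3)["am1"]))
```

Because the model has no interaction terms, the predicted gap is the same whatever values are chosen for cyl, disp and hp.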

Appendix

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
Plot #1 - Boxplot of Mileage Rate by Transmission

Ten Models -> fit0:fit9
fit0 <- lm(mpg ~ am, data = mtcdata)
fit1 <- update(fit0, mpg ~ am + cyl)
fit2 <- update(fit1, mpg ~ am + cyl + disp)
fit3 <- update(fit2, mpg ~ am + cyl + disp + hp)
fit4 <- update(fit3, mpg ~ am + cyl + disp + hp + drat)
fit5 <- update(fit4, mpg ~ am + cyl + disp + hp + drat + wt)
fit6 <- update(fit5, mpg ~ am + cyl + disp + hp + drat + wt + qsec)
fit7 <- update(fit6, mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs)
fit8 <- update(fit7, mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear)
fit9 <- update(fit8, mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb)
Anova test
## Analysis of Variance Table
## 
## Model  1: mpg ~ am
## Model  2: mpg ~ am + cyl
## Model  3: mpg ~ am + cyl + disp
## Model  4: mpg ~ am + cyl + disp + hp
## Model  5: mpg ~ am + cyl + disp + hp + drat
## Model  6: mpg ~ am + cyl + disp + hp + drat + wt
## Model  7: mpg ~ am + cyl + disp + hp + drat + wt + qsec
## Model  8: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs
## Model  9: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear
## Model 10: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb
##    Res.Df    RSS Df Sum of Sq       F   Pr(>F)    
## 1      30 720.90                                  
## 2      28 264.50  2    456.40 28.4297 7.89e-06 ***
## 3      27 230.46  1     34.04  4.2402  0.05728 .  
## 4      26 183.04  1     47.42  5.9078  0.02809 *  
## 5      25 182.38  1      0.66  0.0820  0.77855    
## 6      24 150.10  1     32.28  4.0216  0.06331 .  
## 7      23 141.21  1      8.89  1.1081  0.30916    
## 8      22 139.02  1      2.18  0.2719  0.60964    
## 9      20 134.00  2      5.02  0.3128  0.73606    
## 10     15 120.40  5     13.60  0.3388  0.88144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Coefficients (sigma^2 for fit0 to fit3, then the am1 row of fit0 to fit3, then all coefficients of fit3)
##           [,1]
## [1,] 24.029887
## [2,]  9.446274
## [3,]  8.535551
## [4,]  7.039969
##      Estimate Std. Error  t value     Pr(>|t|)
## [1,] 7.244939   1.764422 4.106127 0.0002850207
## [2,] 2.559954   1.297579 1.972869 0.0584571679
## [3,] 1.635803   1.317408 1.241683 0.2250322107
## [4,] 3.333828   1.363638 2.444804 0.0215801904
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 29.00423340 1.84536783 15.7173182 8.594568e-15
## am1          3.33382838 1.36363806  2.4448044 2.158019e-02
## cyl6        -3.22228303 1.58865398 -2.0283102 5.288842e-02
## cyl8        -1.01140831 3.03316951 -0.3334493 7.414694e-01
## disp        -0.01501139 0.01058469 -1.4182170 1.680029e-01
## hp          -0.03856603 0.01485958 -2.5953650 1.533180e-02
Confidence intervals
##                   2.5 %       97.5 %
## (Intercept) 25.21102550 32.797441305
## am1          0.53083020  6.136826550
## cyl6        -6.48780805  0.043241996
## cyl8        -7.24617753  5.223360915
## disp        -0.03676853  0.006745755
## hp          -0.06911033 -0.008021730
Plot 3: Normal Q-Q

Plot 4: Cooks.Distance