Synopsis

Using the mtcars dataset, this project explores the relationship between a set of independent variables and the dependent variable “miles per gallon (MPG)”.

The key objective of the study is to address the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

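The results of the study suggest that:

  1. Manual transmission has a positive impact on a vehicle’s MPG compared to automatic transmission.
  2. The difference in MPG between automatic and manual transmissions is statistically significant: about 7.2 MPG in the simple regression model, and about 2.9 MPG once vehicle weight and quarter-mile time are taken into account.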

Exploratory Data Analysis

Based on the outcome of our exploratory data analysis (see Appendix 2), we noted that:

  1. There appears to be a positive relationship between MPG and manual transmission (Exhibit 2).
  2. Besides transmission type (am), MPG also appears to be related to the variables cyl, disp, hp, drat, wt and vs.
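As a rough numerical companion to these visual observations, the sketch below ranks the other variables by their Pearson correlation with mpg. It works on a fresh copy of the built-in data under the hypothetical name `cars` (not part of the original analysis), and it treats coded variables such as am and vs as numeric, so the values are only indicative.

```r
# Fresh copy of the built-in data; `cars` is a hypothetical name chosen so that
# the report's own `mtcars` object (possibly converted to factors) is untouched.
cars <- datasets::mtcars

# Pearson correlation of every other variable with mpg
cor_with_mpg <- cor(cars)["mpg", colnames(cars) != "mpg"]

# Sort from most negative to most positive for easier reading
print(round(sort(cor_with_mpg), 2))
```

A positive value for am and a negative value for wt would be in line with observations 1 and 2.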

In the next section, we will develop the required linear regression models to confirm and quantify our observations.

Modelling and Analysis - Simple Regression Model

In this section, we first set aside observation 2 from our exploratory data analysis and examine the relationship between MPG and transmission type with a simple regression model.

simple_model <- lm(mpg ~ am, data=mtcars)
summary(simple_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Since the p-value of the am coefficient (0.000285) is well below 0.05, we reject the null hypothesis that transmission type has no effect on MPG. On average, manual cars in the dataset achieve about 7.2 MPG more than automatic cars.

However, the adjusted R-squared value of the model indicates that this simple regression model only accounts for 33.9% of the observed variance. We need to take into account the relationship between MPG and the other variables in order to increase the explanatory power of our model.
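With a single 0/1 predictor, the regression coefficients are simply group means in disguise: the intercept is the mean MPG of automatic cars and the am coefficient is the difference between the two group means. The sketch below checks this and adds a 95% confidence interval for the difference. The names `cars`, `group_means`, `simple_fit` and `mean_diff` are hypothetical and used only for illustration.

```r
# Fresh copy of the built-in data (am is numeric: 0 = automatic, 1 = manual)
cars <- datasets::mtcars

# Mean mpg per transmission type; the names of the result are "0" and "1"
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])

# Same simple regression as in the report
simple_fit <- lm(mpg ~ am, data = cars)

print(group_means)        # intercept should equal the "0" (automatic) mean
print(mean_diff)          # should equal the am coefficient
print(coef(simple_fit))

# 95% confidence interval for the manual vs. automatic difference
print(confint(simple_fit, "am", level = 0.95))
```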

Modelling and Analysis - Multiple Regression Model

Given that a relationship may simultaneously exist between MPG and the other independent variables in the dataset (e.g. cyl, disp, hp), we begin by regressing mpg against all other variables in the dataset.

Variables that do not contribute enough to the fit are then eliminated over multiple iterations using the step method, which at each step adds or drops the variable that most reduces the Akaike information criterion (AIC), until no further change lowers the AIC.
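To make the selection criterion concrete, the sketch below reproduces the AIC value that step reports for a linear model, which (up to an additive constant) is n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The helper `step_aic` and the name `cars` are hypothetical and not part of the original analysis; `extractAIC` is the base R function that step relies on.

```r
# Fresh copy of the built-in data, all variables numeric
cars <- datasets::mtcars

# Hypothetical helper: AIC on the scale used by step() for lm fits
step_aic <- function(fit) {
  n <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  n * log(rss / n) + 2 * length(coef(fit))
}

full_fit <- lm(mpg ~ ., data = cars)          # all ten predictors
no_cyl_fit <- lm(mpg ~ . - cyl, data = cars)  # candidate with cyl dropped

aic_full <- step_aic(full_fit)
aic_no_cyl <- step_aic(no_cyl_fit)

print(c(full = aic_full, without_cyl = aic_no_cyl))

# Cross-check against base R: second element is the AIC used by step()
print(extractAIC(full_fit))
```

Dropping a variable is worthwhile under this criterion when the small increase in RSS is outweighed by the penalty of 2 saved for the removed coefficient.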

multi_model <- lm(mpg ~ . ,data=mtcars)
final_model <- step(multi_model, direction = "both")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## + cyl   1    0.0799 147.49 70.898
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## + vs    1    0.2685 147.57 68.915
## + cyl   1    0.1883 147.66 68.932
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## + carb  1     0.685 147.84 66.973
## + vs    1     0.434 148.09 67.028
## + cyl   1     0.414 148.11 67.032
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + gear  1     1.565 148.53 65.121
## + cyl   1     1.003 149.09 65.242
## + vs    1     0.645 149.45 65.319
## + carb  1     0.107 149.99 65.434
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + gear  1     2.977 150.46 63.535
## + cyl   1     2.447 150.99 63.648
## + vs    1     1.121 152.32 63.927
## + carb  1     0.011 153.43 64.160
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## + carb  1     3.227 156.84 62.864
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + cyl   1     0.249 159.82 63.465
## + vs    1     0.249 159.82 63.466
## + gear  1     0.171 159.90 63.481
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + carb  1     8.036 161.25 61.751
## + disp  1     3.276 166.01 62.682
## + cyl   1     1.501 167.78 63.022
## + drat  1     1.400 167.89 63.042
## + gear  1     0.123 169.16 63.284
## + vs    1     0.000 169.29 63.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(final_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

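Since the p-value of the am coefficient (0.0467) is below 0.05, we reject the null hypothesis that transmission type has no effect on MPG once weight and quarter-mile time are accounted for. In addition, the overall F-test (p-value 1.21e-11) rejects the null hypothesis that the variables wt, qsec and am jointly have no impact on MPG.

The final fitted model now has an adjusted R-squared of 0.8336. In other words, about 83% of the variability in MPG can be attributed to the variables wt, qsec and am.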
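One way to check that the extra variables are worth including is an F-test between the nested simple and final models, together with a confidence interval for the adjusted transmission effect. The sketch below refits both models on a fresh copy of the data. The names `cars`, `simple_fit`, `final_fit`, `model_comparison` and `am_ci` are hypothetical, and the final formula is taken from the step output shown above.

```r
# Fresh copy of the built-in data (am is numeric: 0 = automatic, 1 = manual)
cars <- datasets::mtcars

simple_fit <- lm(mpg ~ am, data = cars)
final_fit <- lm(mpg ~ wt + qsec + am, data = cars)

# F-test of the nested models: do wt and qsec add explanatory power?
model_comparison <- anova(simple_fit, final_fit)
print(model_comparison)

# 95% confidence interval for the am effect, adjusted for wt and qsec
am_ci <- confint(final_fit, "am", level = 0.95)
print(am_ci)
```

A small p-value in the comparison table would support the larger model, while the width of the interval for am shows how imprecisely the adjusted transmission effect is estimated with only 32 cars.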

Residual and Other Diagnostic Tests

Finally, we plot the residuals to check for the presence of heteroskedasticity.

par(mfrow = c(2, 2))
plot(final_model)

The residuals vs. fitted chart shows no obvious pattern, which suggests that the residuals are randomly scattered and that heteroskedasticity is not a major concern.
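The visual check can be complemented with formal tests. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand, as n times the R-squared from regressing the squared residuals on the model's predictors, and runs a Shapiro-Wilk test of residual normality. This is a swapped-in illustration using base R only and was not part of the original analysis. All object names are hypothetical.

```r
# Fresh copy of the built-in data and the final model from the step output
cars <- datasets::mtcars
final_fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Squared residuals are the response of the auxiliary regression
sq_res <- residuals(final_fit)^2
aux_fit <- lm(sq_res ~ wt + qsec + am, data = cars)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with
# degrees of freedom equal to the number of predictors
bp_stat <- nrow(cars) * summary(aux_fit)$r.squared
bp_df <- length(coef(aux_fit)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))

# Shapiro-Wilk test of normality of the residuals
print(shapiro.test(residuals(final_fit)))
```

A large p-value for the Breusch-Pagan statistic would be consistent with constant residual variance. With 32 observations, both tests have limited power, so they supplement rather than replace the plots.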

Conclusion

Based on our final multiple regression model, we conclude that:

  1. Manual transmission has a positive impact on vehicle MPG performance compared to automatic transmission.
  2. Holding vehicle weight (wt) and quarter-mile time (qsec) constant, manual transmission vehicles achieve about 2.9 MPG more than automatic transmission vehicles (am coefficient 2.9358, p-value 0.047).
  3. Quarter-mile time (qsec) has a positive association with MPG: each additional second is associated with about 1.2 more MPG.
  4. In contrast, an increase in the weight of the vehicle negatively impacts MPG: each additional 1,000 lbs is associated with about 3.9 fewer MPG.
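To illustrate what an adjusted difference means in practice, the sketch below predicts MPG for two hypothetical cars that share the same weight and quarter-mile time and differ only in transmission type. It uses the final model from the step output (mpg ~ wt + qsec + am). The names `cars`, `final_fit`, `new_cars` and `pred` are assumptions made for this example.

```r
# Fresh copy of the built-in data and the final model from the step output
cars <- datasets::mtcars
final_fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Two hypothetical cars at the sample-average weight and quarter-mile time,
# one automatic (am = 0) and one manual (am = 1)
new_cars <- data.frame(
  wt = mean(cars$wt),
  qsec = mean(cars$qsec),
  am = c(0, 1)
)

pred <- predict(final_fit, newdata = new_cars)
print(pred)

# The gap between the two predictions equals the am coefficient
print(diff(pred))
print(coef(final_fit)["am"])
```

Because the model is additive, the gap is the same at any other fixed combination of wt and qsec.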

Appendices

Appendix 1: Setup, Getting and Cleaning Data

setwd("~/Data Science/Module 7 Regression Model/")

data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Based on an initial review of the independent variables, the following variables should be treated as factors:

  1. cyl - Number of cylinders (4, 6, 8)
  2. vs - Engine shape (0 = V-shaped, 1 = straight)
  3. gear - Number of forward gears (3, 4, 5)
  4. carb - Number of carburetors
  5. am - Transmission (0 = automatic, 1 = manual)

mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Auto','Manual'))

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Auto","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
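The conversion changes how a variable enters a regression: a numeric variable contributes one slope, whereas a factor with k levels contributes k - 1 indicator (dummy) columns. The sketch below shows this for cyl using `model.matrix`. The names `cars`, `cyl_design` and `numeric_design` are hypothetical.

```r
# Numeric version: cyl enters the design matrix as a single column
numeric_design <- model.matrix(~ cyl, data = datasets::mtcars)
print(colnames(numeric_design))

# Factor version: one indicator column per non-baseline level (6 and 8),
# with 4 cylinders as the baseline absorbed into the intercept
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cyl_design <- model.matrix(~ cyl, data = cars)
print(colnames(cyl_design))
print(head(cyl_design, 3))
```

This is why a three-level factor such as cyl or gear uses 2 degrees of freedom in an analysis of variance table, and the six-level carb uses 5.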

Appendix 2: Exploratory Data Analysis

Using the prepared dataset, a visual review of the relationships between the variables was performed.

pairs(mtcars, panel=panel.smooth, main="Exhibit 1: Correlation between mtcars variables")

From the paired scatterplot, we note that:

  1. There appears to be a positive relationship between manual transmission and MPG.
  2. Besides the variable for transmission (am), relationships also appear to exist between mpg and the variables cyl, disp, hp, drat, wt and vs.

A boxplot of mpg by am further supports observation 1.

boxplot(mpg ~ am, data = mtcars, xlab = "Transmission (0 = automatic, 1 = manual)", main="Exhibit 2: MPG by Transmission Type")
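The numbers behind the boxplot can also be inspected directly. The sketch below prints the five-number summary (minimum, lower hinge, median, upper hinge, maximum) of mpg for each transmission type and runs a Welch two-sample t-test on the group means. The names `cars`, `mpg_summary` and `welch` are hypothetical, and the t-test is an added illustration rather than part of the original analysis.

```r
# Fresh copy of the data with am labelled as in Appendix 1
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Five-number summary per group: the values a boxplot is drawn from
mpg_summary <- tapply(cars$mpg, cars$am, fivenum)
print(mpg_summary)

# Welch t-test (unequal variances) for the difference in mean mpg
welch <- t.test(mpg ~ am, data = cars)
print(welch)
```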

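Observation 2 is partly supported by the analysis of variance (AOV) statistics in Exhibit 3, where cyl, disp and wt are significant at the 5% level. Note that aov reports sequential sums of squares, so each row is adjusted only for the variables listed above it.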

Exhibit 3:

analysis <- aov(mpg ~ ., data = mtcars)
summary(analysis)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          2  824.8   412.4  51.377 1.94e-07 ***
## disp         1   57.6    57.6   7.181   0.0171 *  
## hp           1   18.5    18.5   2.305   0.1497    
## drat         1   11.9    11.9   1.484   0.2419    
## wt           1   55.8    55.8   6.950   0.0187 *  
## qsec         1    1.5     1.5   0.190   0.6692    
## vs           1    0.3     0.3   0.038   0.8488    
## am           1   16.6    16.6   2.064   0.1714    
## gear         2    5.0     2.5   0.313   0.7361    
## carb         5   13.6     2.7   0.339   0.8814    
## Residuals   15  120.4     8.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
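Because the sums of squares in this kind of table are sequential, the order of the terms matters. The sketch below illustrates this with just two predictors, wt and am, by fitting them in both orders, and then uses `drop1` to obtain order-independent tests in which each term is adjusted for all others. `anova` on an `lm` fit is used here because it gives the same sequential table as `aov` in a form that is easier to index. All object names are hypothetical.

```r
# Fresh copy of the built-in data
cars <- datasets::mtcars

# Same two predictors, entered in different orders
tab_wt_first <- anova(lm(mpg ~ wt + am, data = cars))
tab_am_first <- anova(lm(mpg ~ am + wt, data = cars))
print(tab_wt_first)
print(tab_am_first)

# Sum of squares attributed to am depends on whether wt came first
ss_am_after_wt <- tab_wt_first["am", "Sum Sq"]
ss_am_alone <- tab_am_first["am", "Sum Sq"]
print(c(after_wt = ss_am_after_wt, entered_first = ss_am_alone))

# Order-independent alternative: each term tested given all the others
print(drop1(lm(mpg ~ wt + am, data = cars), test = "F"))
```

A variable that looks unimportant late in a sequential table may simply share its information with variables entered earlier, so a non-significant row there does not by itself rule out a relationship with mpg.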