Project Assignment - Regression Models Course
Title: MTCARS - Best Fuel Consumption between Automatic and Manual Transmissions
Data Science Specialization Course - Johns Hopkins University - through Coursera
Sergio Vicente Simioni
June 27, 2015

Report Content

  1. Executive Summary
  2. Conclusions/Questions addressing
  3. Exploratory Data Analysis
  4. Strategy for Model Selection
  5. Coefficients Interpretation
  6. Residual plots and diagnosis
  7. Uncertainties and inferences evaluation
  8. APPENDIX

Content Description

Executive Summary

This project uses multiple regression to analyse the MTCARS data, extracted from the 1974 Motor Trend US magazine, which comprises the fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The general formula for multiple regression is \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+...+\beta_iX_i\), where Y is the dependent variable, the X's are the independent variables, and the \(\beta\)'s are the coefficients. Each coefficient gives the expected change in the dependent variable when the corresponding independent variable increases by one unit and the other variables are held constant.

The objective of this project is to explore the relationship between the set of variables (X's) and miles per gallon (Y), and to determine whether fuel consumption differs between the two types of transmission.
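To make the formula concrete, the sketch below fits a three-variable regression and rebuilds one prediction term by term. It assumes R's built-in `mtcars` data set, whose columns are named `wt` and `qsec` rather than the `weight` and `gsec` of the CSV used in this report, so the numbers it prints may differ slightly from the ones reported here. The names `by_hand` and `by_predict` are illustrative only.

```r
# Assumption: built-in mtcars (columns wt, qsec), not the report's CSV copy.
data(mtcars)
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
b <- coef(fit)

# Prediction for the first car, computed term by term from the formula
# Y = b0 + b1*X1 + b2*X2 + b3*X3.
x <- mtcars[1, ]
by_hand <- b["(Intercept)"] + b["wt"] * x$wt + b["qsec"] * x$qsec + b["am"] * x$am

# predict() evaluates the same linear combination.
by_predict <- predict(fit, newdata = x)

print(all.equal(unname(by_hand), unname(by_predict)))
```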

Conclusions/Questions addressing

Based on this specific set of data obtained in 1973-74, it is possible to conclude that manual transmission provides better fuel economy (more miles per gallon) than automatic.

With weight and quarter-mile time held constant, cars with manual transmission travel on average 2.8971 miles per gallon more than cars with automatic transmission. The 95% confidence interval for this difference runs from 0.1058 (worst case) to 5.688 (best case). Because the interval excludes zero, the advantage of the manual transmission is statistically significant at the 5% level.

Exploratory Data Analysis

The code below loads the libraries needed to run the analysis and reads the mtcars file. The exploratory analysis can be seen in the Appendix.

library(ggplot2)
library(dplyr)
library(gridExtra)
library(scatterplot3d)
library(car)
library(knitr)
library(ggExtra)
library(GGally)
mtcars_b <- read.csv("C:/Users/Sergio Simioni/Desktop/mtcars.csv", sep=";")
nom      <- read.csv("C:/Users/Sergio Simioni/Desktop/nom.csv", sep=";")
mtcars   <- select(mtcars_b, -brand)

Strategy for Model Selection

The literature provides several methods for selecting the variables of a multiple regression and identifying the best regression equation, for example Stepwise Forward, Stepwise Backward, Stepwise in both directions and analysis of the VIF (Variance Inflation Factor), among others. I chose Stepwise Backward for this case because its results proved to be more conservative, and consequently more reliable, providing the lowest estimated difference between the transmissions among these methods.

The STEPWISE BACKWARD MODEL evaluates, step by step, the variables that do not contribute significantly to the analysis and eliminates them. The process starts with the full model, in which the dependent variable (MPG) is regressed on all other variables using the function lm. At each step, the function step removes the variable whose elimination lowers the AIC (Akaike Information Criterion) the most, and this is repeated until no removal improves the AIC any further.

After applying the STEPWISE BACKWARD MODEL (see Appendix for details), the variables “carb”, “vs”, “cyl”, “drat”, “gear”, “disp” and “hp” were eliminated without compromising the final result: the adjusted \(R^2\) remained relatively high at 0.8418, and the variables “weight”, “gsec” and “am” were maintained because they demonstrated good statistical significance.
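R's `step()` ranks candidate removals by AIC, as the trace in the Appendix shows. The loop below is a minimal sketch of that backward pass written with `drop1()`, not the internal code of `step()`. It assumes the built-in `mtcars` data (columns `wt` and `qsec`), and the names `current`, `tab` and `cand` are illustrative.

```r
# Assumption: built-in mtcars, not the report's CSV copy.
data(mtcars)
full <- lm(mpg ~ ., data = mtcars)

current <- full
repeat {
  tab <- drop1(current)                 # AIC of the model with each term removed
  aic_now <- tab["<none>", "AIC"]       # AIC of the current model
  cand <- tab[rownames(tab) != "<none>", , drop = FALSE]
  if (nrow(cand) == 0) break            # nothing left to remove
  best <- which.min(cand$AIC)
  if (cand$AIC[best] >= aic_now) break  # no removal lowers the AIC: stop
  # Drop the term whose removal gives the lowest AIC and refit.
  current <- update(current, as.formula(paste(". ~ . -", rownames(cand)[best])))
}

print(formula(current))
```

Running `step(full, direction = "backward", trace = 0)` on the same data should keep the same set of terms, which makes a convenient cross-check.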
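The adjusted \(R^2\) quoted above penalizes the ordinary \(R^2\) for the number of predictors. As a sketch of the arithmetic, the block below recomputes it by hand and compares it with `summary()`. It assumes the built-in `mtcars` data (columns `wt` and `qsec`), so the value may differ slightly from the 0.8418 obtained with the CSV used in this report.

```r
# Assumption: built-in mtcars, not the report's CSV copy.
data(mtcars)
fit3 <- lm(mpg ~ wt + qsec + am, data = mtcars)

n <- nrow(mtcars)              # number of cars
p <- length(coef(fit3)) - 1    # number of predictors, intercept excluded

rss <- sum(resid(fit3)^2)                        # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
r2 <- 1 - rss / tss

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(by_hand = adj_r2, from_summary = summary(fit3)$adj.r.squared))
```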

fit<- lm(mpg~.,data = mtcars)
step<- step(fit, direction="backward", trace = FALSE)
summary(step)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.627172  6.7138759  1.433922 1.626687e-01
## weight      -3.950220  0.6848895 -5.767675 3.425693e-06
## gsec         1.231397  0.2799604  4.398469 1.432222e-04
## am           2.897106  1.3626505  2.126082 4.244934e-02

The final equation obtained is: \[MPG = 9.6272 - 3.9502\,Weight + 1.2314\,gsec + 2.8971\,am\]

Below are the 95% confidence intervals for the three variables (“weight”, “gsec” and “am”) and for the intercept (“intercept”).

confint(lm(mpg~weight+gsec+am, data=mtcars))
##                  2.5 %    97.5 %
## (Intercept) -4.1255789 23.379924
## weight      -5.3531529 -2.547288
## gsec         0.6579241  1.804870
## am           0.1058434  5.688369
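The interval for “am” can be checked by hand as estimate ± t × standard error. The sketch below uses the estimate and standard error printed in the coefficient table above, and assumes 28 residual degrees of freedom (32 cars minus 4 estimated coefficients). The variable names are illustrative.

```r
# Estimate and standard error of "am", copied from the coefficient table above.
est <- 2.897106
se  <- 1.3626505

# Residual degrees of freedom: 32 cars minus 4 estimated coefficients.
df <- 32 - 4

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df)
ci_am <- est + c(-1, 1) * t_crit * se

print(round(ci_am, 4))
```

The result should agree with the “am” row returned by `confint()` up to rounding.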

Coefficients Interpretation

Based on the final regression equation for this specific set of data obtained in 1973-74, we can report that, with weight and quarter-mile time held constant, cars with manual transmission travel on average 2.8971 miles per gallon more than cars with automatic transmission. The 95% confidence interval for this difference runs from 0.1058 (worst case) to 5.688 (best case), so the data support better fuel economy for the manual transmission.
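One way to read the “am” coefficient is to compare two cars that are identical except for the transmission. The sketch below plugs the coefficients of the final equation into a small helper. The helper `predict_mpg` and the example car (weight 3.0, gsec 18.0) are hypothetical and serve only as illustration.

```r
# Coefficients copied from the final equation in this report.
b <- c(intercept = 9.627172, weight = -3.950220, gsec = 1.231397, am = 2.897106)

# Hypothetical helper: predicted mpg for a given weight (lb/1000),
# gsec (1/4 mile time) and am (0 = automatic, 1 = manual).
predict_mpg <- function(weight, gsec, am) {
  unname(b["intercept"] + b["weight"] * weight + b["gsec"] * gsec + b["am"] * am)
}

# Two hypothetical cars, identical except for the transmission.
automatic <- predict_mpg(weight = 3.0, gsec = 18.0, am = 0)
manual    <- predict_mpg(weight = 3.0, gsec = 18.0, am = 1)

# The difference equals the "am" coefficient, whatever weight and gsec are.
print(manual - automatic)
```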

Residual plots and diagnosis

The Residuals vs Fitted plot (see Appendix for details) shows a slight curvature in the fitted line, which was expected to be horizontal. This curvature suggests that an equation containing a higher-order term could provide a better fit. The Normal Q-Q plot shows that the residuals are approximately normal but slightly negatively skewed.
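Both observations can be probed numerically. The sketch below adds a squared weight term and compares the two nested models with an F test, then applies a Shapiro-Wilk test to the residuals. It assumes the built-in `mtcars` data (columns `wt` and `qsec`). The squared weight term is only one possible higher-order term, chosen here as an illustration.

```r
# Assumption: built-in mtcars, not the report's CSV copy.
data(mtcars)
fit_lin  <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Same model plus a squared weight term (an illustrative choice).
fit_quad <- update(fit_lin, . ~ . + I(wt^2))

# F test for the nested models: does the extra term reduce the residual error?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Shapiro-Wilk test of normality on the residuals of the original model.
sw <- shapiro.test(resid(fit_lin))
print(sw)
```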

Uncertainties and inferences evaluation

The analysis is a simplification, given the breadth and complexity of this subject. Other models could provide different values for the same problem, which is why I chose the most conservative method. In addition, the variable “weight” is strongly correlated with the type of transmission (correlation of -0.69, see Appendix). This multicollinearity adds uncertainty to the final estimates, but not enough to invert the relationship between MPG and the type of transmission.
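The effect of this correlation on the final model can be quantified with the variance inflation factor, which for one predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below computes it by hand for weight. It assumes the built-in `mtcars` data (columns `wt` and `qsec`), so values may differ slightly from the `vif()` output in the Appendix.

```r
# Assumption: built-in mtcars, not the report's CSV copy.
data(mtcars)

# Correlation between weight and transmission type.
cor_wt_am <- cor(mtcars$wt, mtcars$am)

# R^2 from regressing weight on the other predictors of the final model.
r2_wt <- summary(lm(wt ~ qsec + am, data = mtcars))$r.squared

# Variance inflation factor for weight: 1 / (1 - R^2).
vif_wt <- 1 / (1 - r2_wt)

print(c(cor_wt_am = cor_wt_am, vif_wt = vif_wt))
```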

APPENDIX

kable(mtcars_b)
brand                mpg  cyl   disp   hp  drat  weight  gsec  vs  am  gear  carb
Mazda RX4           21.0    6  160.0  110   3.9     2.6  16.5   0   1     4     4
Mazda RX4 Wag       21.0    6  160.0  110   3.9     2.9  17.0   0   1     4     4
Datsun 710          22.8    4  108.0   93   3.9     2.3  18.6   1   1     4     1
Hornet 4 Drive      21.4    6  258.0  110   3.1     3.2  19.4   1   0     3     1
Hornet Sportabout   18.7    8  360.0  175   3.2     3.4  17.0   0   0     3     2
Valiant             18.1    6  225.0  105   2.8     3.5  20.2   1   0     3     1
Duster 360          14.3    8  360.0  245   3.2     3.6  15.8   0   0     3     4
Merc 240D           24.4    4  146.7   62   3.7     3.2  20.0   1   0     4     2
Merc 230            22.8    4  140.8   95   3.9     3.2  22.9   1   0     4     2
Merc 280            19.2    6  167.6  123   3.9     3.4  18.3   1   0     4     4
Merc 280C           17.8    6  167.6  123   3.9     3.4  18.9   1   0     4     4
Merc 450SE          16.4    8  275.8  180   3.1     4.1  17.4   0   0     3     3
Merc 450SL          17.3    8  275.8  180   3.1     3.7  17.6   0   0     3     3
Merc 450SLC         15.2    8  275.8  180   3.1     3.8  18.0   0   0     3     3
Cadillac Fleetwoo   10.4    8  472.0  205   2.9     5.3  18.0   0   0     3     4
Lincoln Continent   10.4    8  460.0  215   3.0     5.4  17.8   0   0     3     4
Chrysler Imperial   14.7    8  440.0  230   3.2     5.3  17.4   0   0     3     4
Fiat 128            32.4    4   78.7   66   4.1     2.2  19.5   1   1     4     1
Honda Civic         30.4    4   75.7   52   4.9     1.6  18.5   1   1     4     2
Toyota Corolla      33.9    4   71.1   65   4.2     1.8  19.9   1   1     4     1
Toyota Corona       21.5    4  120.1   97   3.7     2.5  20.0   1   0     3     1
Dodge Challenger    15.5    8  318.0  150   2.8     3.5  16.9   0   0     3     2
AMC Javelin         15.2    8  304.0  150   3.2     3.4  17.3   0   0     3     2
Camaro Z28          13.3    8  350.0  245   3.7     3.8  15.4   0   0     3     4
Pontiac Firebird    19.2    8  400.0  175   3.1     3.8  17.1   0   0     3     2
Fiat X1-9           27.3    4   79.0   66   4.1     1.9  18.9   1   1     4     1
Porsche 914-2       26.0    4  120.3   91   4.4     2.1  16.7   0   1     5     2
Lotus Europa        30.4    4   95.1  113   3.8     1.5  16.9   1   1     5     2
Ford Pantera L      15.8    8  351.0  264   4.2     3.2  14.5   0   1     5     4
Ferrari Dino        19.7    6  145.0  175   3.6     2.8  15.5   0   1     5     6
Maserati Bora       15.0    8  301.0  335   3.5     3.6  14.6   0   1     5     8
Volvo 142E          21.4    4  121.0  109   4.1     2.8  18.6   1   1     4     2

DEFINITION OF THE VARIABLES

kable(nom)
Variable  Definition
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement (cu.in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (lb/1000)
gsec      1/4 mile time
vs        V/S
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors

SUMMARY OF MTCARS

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat         weight           gsec             vs        
##  Min.   :2.8   Min.   :1.500   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.1   1st Qu.:2.575   1st Qu.:16.90   1st Qu.:0.0000  
##  Median :3.7   Median :3.300   Median :17.70   Median :0.0000  
##  Mean   :3.6   Mean   :3.212   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.9   3rd Qu.:3.625   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.9   Max.   :5.400   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

HISTOGRAM OF MILES PER GALLON

ggplot(mtcars, aes(x=mpg)) + 
        geom_histogram(aes(y=..density..), binwidth = 0.5, colour = "blue", fill = "white") + 
        geom_density(alpha = 0.2, fill = "#FF6666") +
        geom_vline(aes(xintercept = mean(mpg)), color = "red", linetype = "dashed", size = 1)

boxplot(mtcars$mpg, horizontal = TRUE, xlab = "Miles per Gallon Summary", col = "red")

ANALYSIS MPG VERSUS WEIGHT

a <- ggplot(mtcars, aes(x=weight, y=mpg)) + 
        geom_point(color="blue") +
        stat_smooth(method="lm", formula = y~poly(x,2))
ggMarginal(a, type="histogram", color="red")

CORRELATION AMONG ALL VARIABLES

cor(mtcars)
##               mpg        cyl       disp         hp       drat     weight
## mpg     1.0000000 -0.8521620 -0.8475514 -0.7761684  0.7001914 -0.8708348
## cyl    -0.8521620  1.0000000  0.9020329  0.8324475 -0.7131038  0.7792749
## disp   -0.8475514  0.9020329  1.0000000  0.7909486 -0.7282924  0.8850443
## hp     -0.7761684  0.8324475  0.7909486  1.0000000 -0.4733891  0.6618896
## drat    0.7001914 -0.7131038 -0.7282924 -0.4733891  1.0000000 -0.7366824
## weight -0.8708348  0.7792749  0.8850443  0.6618896 -0.7366824  1.0000000
## gsec    0.4201387 -0.5915721 -0.4345090 -0.7097455  0.1024837 -0.1712363
## vs      0.6640389 -0.8108118 -0.7104159 -0.7230967  0.4538426 -0.5529061
## am      0.5998324 -0.5226070 -0.5912270 -0.2432043  0.7185935 -0.6894439
## gear    0.4802848 -0.4926866 -0.5555692 -0.1257043  0.6954805 -0.5782591
## carb   -0.5509251  0.5269883  0.3949769  0.7498125 -0.1224818  0.4331112
##              gsec         vs          am       gear        carb
## mpg     0.4201387  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl    -0.5915721 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp   -0.4345090 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp     -0.7097455 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat    0.1024837  0.4538426  0.71859350  0.6954805 -0.12248181
## weight -0.1712363 -0.5529061 -0.68944395 -0.5782591  0.43311119
## gsec    1.0000000  0.7435263 -0.22842686 -0.2113422 -0.65675589
## vs      0.7435263  1.0000000  0.16834512  0.2060233 -0.56960714
## am     -0.2284269  0.1683451  1.00000000  0.7940588  0.05753435
## gear   -0.2113422  0.2060233  0.79405876  1.0000000  0.27407284
## carb   -0.6567559 -0.5696071  0.05753435  0.2740728  1.00000000

CORRELATION AMONG BEST FIT VARIABLES

mtcars_a<- select(mtcars, mpg, weight, gsec, am)
ggpairs(mtcars_a, 
        lower = list(continuous = "smooth",params = c(method = "loess", colour="blue")),
        diag=list(continuous="bar", params=c(colour="blue")),
        upper=list(params=list(corSize=1)), 
        axisLabels='show')

VARIANCE INFLATION FACTORS (VALUES ABOVE 10 INDICATE HIGH COLLINEARITY)

vif(fit)
##       cyl      disp        hp      drat    weight      gsec        vs 
## 15.201564 21.378171  9.693377  3.549308 15.571275  7.827133  4.971498 
##        am      gear      carb 
##  4.641093  5.363006  7.988905
vif(step)
##   weight     gsec       am 
## 2.435111 1.347957 2.493834

REGRESSION PLOTS

w <- ggplot(mtcars, aes(y=mpg, x=weight)) + geom_point(colour="blue")
w<- w + stat_smooth(method="lm", formula = y~poly(x,2))+ ggtitle("Regression Plot \n Polynomial \n Weight")

wa <- ggplot(mtcars, aes(y=mpg, x=weight)) + geom_point(colour="blue")
wa<- wa + stat_smooth()+ ggtitle("Regression Plot \n Loess \n Weight")

g <- ggplot(mtcars, aes(y=mpg, x=gsec)) + geom_point(colour="blue")
g<- g + stat_smooth(method="lm", formula = y~poly(x,2))+ ggtitle("Regression Plot \n Polynomial \n gsec")

gl <- ggplot(mtcars, aes(y=mpg, x=gsec)) + geom_point(colour="blue")
gl<- gl + stat_smooth()+ ggtitle("Regression Plot \n Loess \n gsec")

am <- ggplot(mtcars, aes(y=mpg, x=am)) + geom_point(colour="blue")
am<- am + stat_smooth(method="lm", formula = y~x)+ ggtitle("Regression Plot \n am")

ama <- ggplot(mtcars, aes(y=mpg, x=am)) + geom_point(colour="blue")
ama<- ama + stat_smooth()+ ggtitle("Regression Plot \n Loess \n am")
grid.arrange(w,g,ama, ncol=3)

grid.arrange(wa,gl,am,ncol=3)

PLOT OF RESIDUALS

b<- ggplot2::fortify(step)
resid<- ggplot(b, aes(x= .fitted, y= .resid)) + geom_point(colour="blue")+ stat_smooth(method="lm", formula = y~poly(x,2))
resid<- resid + ggtitle("Residuals Vs Fitted")
normal <- ggplot(b, aes(qqnorm(.stdresid, plot.it = FALSE)[[1]], .stdresid)) + geom_point(colour="blue")+ stat_smooth(method="lm")
normal <- normal + ggtitle("Q-Q Normal")
grid.arrange(resid, normal)

3D EXPLORATORY GRAPHIC FOR MPG, WEIGHT AND GSEC

mtcars_b<- select(mtcars, mpg, weight, gsec)
#scatterplot3d(x, y, z)
s3d<- scatterplot3d(mtcars_b$weight, mtcars_b$gsec, mtcars_b$mpg, pch=16, highlight.3d=TRUE, type ="h", main="3D Scatterplot", xlab="Weight", ylab="gsec", zlab="mpg(Miles per gallon)")
plane <- lm(mpg ~ weight + gsec, data = mtcars_b)
s3d$plane3d(plane)

STEPWISE BACKWARD MODEL DETAILS

fit<- lm(mpg~.,data = mtcars)
step<- step(fit, direction="backward")
## Start:  AIC=69.75
## mpg ~ cyl + disp + hp + drat + weight + gsec + vs + am + gear + 
##     carb
## 
##          Df Sum of Sq    RSS    AIC
## - carb    1     0.023 142.33 67.757
## - vs      1     0.055 142.36 67.764
## - cyl     1     0.252 142.56 67.808
## - drat    1     0.813 143.12 67.934
## - gear    1     1.091 143.40 67.996
## - disp    1     5.632 147.94 68.994
## - hp      1     6.312 148.62 69.140
## <none>                142.31 69.752
## - am      1    10.956 153.26 70.125
## - gsec    1    11.757 154.06 70.292
## - weight  1    32.650 174.96 74.361
## 
## Step:  AIC=67.76
## mpg ~ cyl + disp + hp + drat + weight + gsec + vs + am + gear
## 
##          Df Sum of Sq    RSS    AIC
## - vs      1     0.061 142.39 65.771
## - cyl     1     0.305 142.63 65.825
## - drat    1     0.791 143.12 65.934
## - gear    1     1.176 143.50 66.020
## - hp      1     8.943 151.27 67.707
## <none>                142.33 67.757
## - am      1    11.077 153.41 68.155
## - disp    1    11.303 153.63 68.202
## - gsec    1    13.256 155.58 68.606
## - weight  1    66.773 209.10 78.067
## 
## Step:  AIC=65.77
## mpg ~ cyl + disp + hp + drat + weight + gsec + am + gear
## 
##          Df Sum of Sq    RSS    AIC
## - cyl     1     0.470 142.86 63.876
## - drat    1     0.797 143.19 63.949
## - gear    1     1.178 143.57 64.034
## <none>                142.39 65.771
## - hp      1     9.239 151.63 65.782
## - am      1    11.228 153.62 66.199
## - disp    1    11.251 153.64 66.204
## - gsec    1    16.849 159.24 67.349
## - weight  1    71.160 213.55 76.740
## 
## Step:  AIC=63.88
## mpg ~ disp + hp + drat + weight + gsec + am + gear
## 
##          Df Sum of Sq    RSS    AIC
## - drat    1     1.195 144.06 62.143
## - gear    1     1.875 144.73 62.293
## <none>                142.86 63.876
## - disp    1    10.781 153.64 64.204
## - hp      1    11.454 154.31 64.344
## - am      1    13.008 155.87 64.665
## - gsec    1    30.452 173.31 68.059
## - weight  1    74.208 217.07 75.263
## 
## Step:  AIC=62.14
## mpg ~ disp + hp + weight + gsec + am + gear
## 
##          Df Sum of Sq    RSS    AIC
## - gear    1     3.061 147.12 60.815
## <none>                144.06 62.143
## - disp    1    10.338 154.39 62.360
## - hp      1    11.976 156.03 62.698
## - am      1    15.608 159.66 63.434
## - gsec    1    31.423 175.48 66.457
## - weight  1    77.904 221.96 73.976
## 
## Step:  AIC=60.82
## mpg ~ disp + hp + weight + gsec + am
## 
##          Df Sum of Sq    RSS    AIC
## - disp    1     7.411 154.53 60.388
## - hp      1     9.222 156.34 60.761
## <none>                147.12 60.815
## - gsec    1    30.349 177.46 64.817
## - am      1    32.921 180.04 65.278
## - weight  1    75.343 222.46 72.048
## 
## Step:  AIC=60.39
## mpg ~ hp + weight + gsec + am
## 
##          Df Sum of Sq    RSS    AIC
## - hp      1     6.392 160.92 59.685
## <none>                154.53 60.388
## - gsec    1    23.249 177.78 62.873
## - am      1    26.001 180.53 63.365
## - weight  1    84.066 238.59 72.289
## 
## Step:  AIC=59.69
## mpg ~ weight + gsec + am
## 
##          Df Sum of Sq    RSS    AIC
## <none>                160.92 59.685
## - am      1    25.978 186.90 62.474
## - gsec    1   111.186 272.10 74.494
## - weight  1   191.183 352.10 82.742