Report Content
Content Description
Executive Summary
This project uses multiple regression to analyse the mtcars data set, extracted from the 1974 Motor Trend US magazine, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
The general form of the multiple regression model is \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+...+\beta_iX_i\), where Y is the dependent variable, the X's are the independent variables and the \(\beta\)'s are the coefficients. Each coefficient gives the expected change in the dependent variable when the corresponding independent variable increases by one unit and the other variables are held constant.
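As a minimal sketch of this interpretation, the snippet below fits a small two-predictor model and checks that raising one predictor by one unit, with the other held fixed, shifts the prediction by exactly that predictor's coefficient. It assumes R's built-in `datasets::mtcars` (where the weight column is named `wt`), not the CSV file used in this report; the names `fit_demo`, `base_car` and `heavier` are illustrative only.

```r
# Assumption: built-in datasets::mtcars, which names the weight column wt.
fit_demo <- lm(mpg ~ wt + am, data = mtcars)

# Two hypothetical cars that differ only by one unit of weight (1000 lb).
base_car <- data.frame(wt = 3, am = 0)
heavier  <- data.frame(wt = 4, am = 0)

# The change in predicted mpg equals the wt coefficient.
delta <- predict(fit_demo, heavier) - predict(fit_demo, base_car)
print(unname(delta))
print(unname(coef(fit_demo)["wt"]))
```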
The objective of this project is to explore the relationship between the set of variables (X's) and miles per gallon (Y), and to determine whether fuel consumption differs between the two types of transmission.
Conclusions/Questions addressing
Based on this specific set of data obtained in 1973-74, it is possible to conclude that manual transmission provides lower fuel consumption (more miles per gallon) than automatic transmission.
With weight and gsec held constant, cars with manual transmission travel on average 2.8971 more miles per gallon than cars with automatic transmission. The 95% confidence interval for this difference runs from 0.1058 (worst case) to 5.688 (best case). Because the interval lies entirely above zero, the advantage of the manual transmission is statistically significant at the 5% level.
Exploratory Data Analysis
The code below loads the libraries needed to run the analysis and reads the mtcars file. The exploratory analysis can be seen in the Appendix.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(scatterplot3d)
library(car)
library(knitr)
library(ggExtra)
library(GGally)
mtcars_b <- read.csv("C:/Users/Sergio Simioni/Desktop/mtcars.csv", sep=";")
nom <- read.csv("C:/Users/Sergio Simioni/Desktop/nom.csv", sep=";")
mtcars <- select(mtcars_b, -brand)
Strategy for Model Selection
The literature provides several methods for building a multiple regression model and identifying the best regression equation, for example forward stepwise selection, backward stepwise selection, stepwise selection in both directions and analysis of the VIF (Variance Inflation Factor), among others. I chose backward stepwise selection for this case because its results proved to be more conservative, and consequently more reliable, giving the lowest difference among these methods.
Backward stepwise selection evaluates, step by step, the variables that do not contribute significantly to the model and eliminates them. The process starts with the full model, in which the dependent variable (MPG) is regressed on all other variables using the function lm. At each step, the function step removes the variable whose deletion lowers the AIC (Akaike Information Criterion) the most, and the process is repeated until no removal lowers the AIC any further.
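The elimination loop described above can be sketched by hand with `drop1()`, which reports the AIC of the model after deleting each single term. This is only an illustration of the idea, not a replacement for `step()`. It assumes R's built-in `datasets::mtcars`, where the report's weight and gsec columns are called `wt` and `qsec`; the name `current` is illustrative.

```r
# Assumption: built-in datasets::mtcars (columns wt and qsec instead of
# the weight and gsec names used in this report's CSV file).
current <- lm(mpg ~ ., data = mtcars)

repeat {
  tab  <- drop1(current)                    # AIC after removing each term
  best <- rownames(tab)[which.min(tab$AIC)] # lowest AIC among the options
  if (best == "<none>") break               # keeping the model as-is is best
  # Refit without the selected variable and try again.
  current <- update(current, as.formula(paste(". ~ . -", best)))
}

print(formula(current))
```

The row labelled `<none>` in the `drop1()` table is the current model, so the loop stops as soon as no single deletion improves on it.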
After applying backward stepwise selection (see Appendix for details), the variables “carb”, “vs”, “cyl”, “drat”, “gear”, “disp” and “hp” were eliminated without compromising the final result: the adjusted \(R^2\) remained relatively high at 0.8418, and the variables “weight”, “gsec” and “am” were retained because they showed good statistical significance.
fit<- lm(mpg~.,data = mtcars)
step<- step(fit, direction="backward", trace = FALSE)
summary(step)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.627172 6.7138759 1.433922 1.626687e-01
## weight -3.950220 0.6848895 -5.767675 3.425693e-06
## gsec 1.231397 0.2799604 4.398469 1.432222e-04
## am 2.897106 1.3626505 2.126082 4.244934e-02
The final equation obtained is: \[MPG = 9.6272 - 3.9502\,weight + 1.2314\,gsec + 2.8971\,am\]
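As a quick worked example, the equation can be evaluated for one car from the data table in the Appendix. The sketch below uses the rounded coefficients from the equation and the Mazda RX4 row (weight 2.6, gsec 16.5, am 1, observed mpg 21.0). The function name `predict_mpg` is hypothetical and introduced only for illustration.

```r
# Rounded coefficients taken from the final equation in this report.
coefs <- c(intercept = 9.6272, weight = -3.9502, gsec = 1.2314, am = 2.8971)

# Hypothetical helper: evaluates the fitted equation for one car.
predict_mpg <- function(weight, gsec, am) {
  unname(coefs["intercept"] + coefs["weight"] * weight +
           coefs["gsec"] * gsec + coefs["am"] * am)
}

# Mazda RX4: weight = 2.6, gsec = 16.5, am = 1 (manual).
mazda <- predict_mpg(weight = 2.6, gsec = 16.5, am = 1)
print(mazda)  # about 22.57, against an observed value of 21.0
```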
Below are the 95% confidence intervals for the three variables (“weight”, “gsec” and “am”) and for the intercept (“Intercept”).
confint(lm(mpg~weight+gsec+am, data=mtcars))
## 2.5 % 97.5 %
## (Intercept) -4.1255789 23.379924
## weight -5.3531529 -2.547288
## gsec 0.6579241 1.804870
## am 0.1058434 5.688369
Coefficients Interpretation
Based on the final regression equation for this specific set of data obtained in 1973-74, we can report that, with weight and gsec held constant, cars with manual transmission travel on average 2.8971 more miles per gallon than cars with automatic transmission. The 95% confidence interval for this difference runs from 0.1058 (worst case) to 5.688 (best case). Because the interval lies entirely above zero, the manual transmission shows the better fuel economy.
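The interval for “am” can be reproduced from the coefficient table alone, as estimate ± t × standard error. The sketch below takes the estimate and standard error from the output above and assumes 28 residual degrees of freedom (32 cars minus 4 estimated coefficients). The variable names are illustrative.

```r
# Values copied from the coefficient table for "am".
est <- 2.897106
se  <- 1.3626505

# Assumption: residual degrees of freedom = 32 observations - 4 coefficients.
df_resid <- 32 - 4

# 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci_am  <- est + c(-1, 1) * t_crit * se
print(ci_am)  # matches the confint() row for am: about 0.1058 and 5.6884
```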
Residual plots and diagnosis
The Residuals vs Fitted plot (see Appendix for details) shows a slight curvature in the fitted line, which was expected to be horizontal. This curvature suggests that an equation containing a higher-order term could provide a better fit. The Normal Q-Q plot shows that the residuals are approximately normal but slightly negatively skewed.
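One way to follow up on the curvature is to add a squared term and compare the two nested models with an F test. The sketch below is only a suggestion of how that check could look and was not part of the original analysis. It assumes R's built-in `datasets::mtcars` (columns `wt` and `qsec`), and the choice of a quadratic term in weight is an assumption.

```r
# Assumption: built-in datasets::mtcars, where the report's weight and gsec
# columns are called wt and qsec.
base_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Same model plus a squared weight term (an illustrative choice).
quad_fit <- lm(mpg ~ wt + I(wt^2) + qsec + am, data = mtcars)

# F test for whether the extra term reduces the residual sum of squares
# by more than would be expected by chance.
cmp <- anova(base_fit, quad_fit)
print(cmp)
```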
Uncertainties and inferences evaluation
The analysis is a simplified one, given the breadth and complexity of this subject. Other model selection methods could give different values for the same problem, which is why I chose the most conservative method. In addition, the variable “weight” is noticeably correlated with the type of transmission (correlation of -0.689 with “am”; see the correlation matrix and the VIF values in the Appendix). This multicollinearity introduces some error in the final results, but not enough to invert the relationship between MPG and the type of transmission.
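To make the multicollinearity statement concrete, the VIF of a predictor can be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the other predictors in the model. The sketch below does this for weight. It assumes R's built-in `datasets::mtcars` (columns `wt` and `qsec`), so the value will differ slightly from the `vif(step)` output in the Appendix.

```r
# Assumption: built-in datasets::mtcars (wt and qsec instead of weight and gsec).
# Auxiliary regression: weight explained by the other two predictors.
aux <- lm(wt ~ qsec + am, data = mtcars)
r2  <- summary(aux)$r.squared

# Variance inflation factor for weight in the model mpg ~ wt + qsec + am.
vif_wt <- 1 / (1 - r2)
print(vif_wt)

# Pairwise correlation between weight and transmission type.
print(cor(mtcars$wt, mtcars$am))
```

A VIF of 1 means no inflation of the coefficient's variance. Larger values mean the predictor is increasingly explained by the others.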
kable(mtcars_b)
| brand | mpg | cyl | disp | hp | drat | weight | gsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.6 | 16.5 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.9 | 17.0 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.9 | 2.3 | 18.6 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.1 | 3.2 | 19.4 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.2 | 3.4 | 17.0 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.8 | 3.5 | 20.2 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.2 | 3.6 | 15.8 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.7 | 3.2 | 20.0 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.9 | 3.2 | 22.9 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.9 | 3.4 | 18.3 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.9 | 3.4 | 18.9 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.1 | 4.1 | 17.4 | 0 | 0 | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.1 | 3.7 | 17.6 | 0 | 0 | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.1 | 3.8 | 18.0 | 0 | 0 | 3 | 3 |
| Cadillac Fleetwoo | 10.4 | 8 | 472.0 | 205 | 2.9 | 5.3 | 18.0 | 0 | 0 | 3 | 4 |
| Lincoln Continent | 10.4 | 8 | 460.0 | 215 | 3.0 | 5.4 | 17.8 | 0 | 0 | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.2 | 5.3 | 17.4 | 0 | 0 | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.1 | 2.2 | 19.5 | 1 | 1 | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.9 | 1.6 | 18.5 | 1 | 1 | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.2 | 1.8 | 19.9 | 1 | 1 | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.5 | 20.0 | 1 | 0 | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.8 | 3.5 | 16.9 | 0 | 0 | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.2 | 3.4 | 17.3 | 0 | 0 | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.7 | 3.8 | 15.4 | 0 | 0 | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.1 | 3.8 | 17.1 | 0 | 0 | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.1 | 1.9 | 18.9 | 1 | 1 | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.4 | 2.1 | 16.7 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.8 | 1.5 | 16.9 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.2 | 3.2 | 14.5 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.6 | 2.8 | 15.5 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.5 | 3.6 | 14.6 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.1 | 2.8 | 18.6 | 1 | 1 | 4 | 2 |
kable(nom)
| Variable | Definition |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (lb/1000) |
| gsec | 1/4 mile time |
| vs | V/S |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat weight gsec vs
## Min. :2.8 Min. :1.500 Min. :14.50 Min. :0.0000
## 1st Qu.:3.1 1st Qu.:2.575 1st Qu.:16.90 1st Qu.:0.0000
## Median :3.7 Median :3.300 Median :17.70 Median :0.0000
## Mean :3.6 Mean :3.212 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.9 3rd Qu.:3.625 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.9 Max. :5.400 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
ggplot(mtcars, aes(x=mpg)) +
geom_histogram(aes(y=..density..), binwidth =0.5, colour = "blue", fill = "white") +
geom_density( alpha =0.2, fill = "#FF6666")+
geom_vline( aes(xintercept=mean(mpg)), color="red", linetype ="dashed", size = 1)
boxplot(mtcars$mpg, horizontal = TRUE, xlab = "Miles per Gallon Summary", col = "red")
a= ggplot(mtcars, aes(x=weight, y=mpg)) +
geom_point(color="blue") +
stat_smooth(method="lm", formula = y~poly(x,2))
ggMarginal(a, type="histogram", color="red")
cor(mtcars)
## mpg cyl disp hp drat weight
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.7001914 -0.8708348
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.7131038 0.7792749
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7282924 0.8850443
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4733891 0.6618896
## drat 0.7001914 -0.7131038 -0.7282924 -0.4733891 1.0000000 -0.7366824
## weight -0.8708348 0.7792749 0.8850443 0.6618896 -0.7366824 1.0000000
## gsec 0.4201387 -0.5915721 -0.4345090 -0.7097455 0.1024837 -0.1712363
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.4538426 -0.5529061
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.7185935 -0.6894439
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.6954805 -0.5782591
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.1224818 0.4331112
## gsec vs am gear carb
## mpg 0.4201387 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.5915721 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.4345090 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.7097455 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.1024837 0.4538426 0.71859350 0.6954805 -0.12248181
## weight -0.1712363 -0.5529061 -0.68944395 -0.5782591 0.43311119
## gsec 1.0000000 0.7435263 -0.22842686 -0.2113422 -0.65675589
## vs 0.7435263 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.2284269 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.2113422 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.6567559 -0.5696071 0.05753435 0.2740728 1.00000000
mtcars_a<- select(mtcars, mpg, weight, gsec, am)
# Per-panel options are passed through wrap() instead of a params list
ggpairs(mtcars_a,
lower = list(continuous = wrap("smooth", method = "loess", colour = "blue")),
diag = list(continuous = wrap("barDiag", colour = "blue")),
upper = list(continuous = wrap("cor", size = 1)),
axisLabels = "show")
vif(fit)
## cyl disp hp drat weight gsec vs
## 15.201564 21.378171 9.693377 3.549308 15.571275 7.827133 4.971498
## am gear carb
## 4.641093 5.363006 7.988905
vif(step)
## weight gsec am
## 2.435111 1.347957 2.493834
w <- ggplot(mtcars, aes(y=mpg, x=weight)) + geom_point(colour="blue")
w<- w + stat_smooth(method="lm", formula = y~poly(x,2))+ ggtitle("Regression Plot \n Polynomial \n Weight")
wa <- ggplot(mtcars, aes(y=mpg, x=weight)) + geom_point(colour="blue")
wa<- wa + stat_smooth()+ ggtitle("Regression Plot \n Loess \n Weight")
g <- ggplot(mtcars, aes(y=mpg, x=gsec)) + geom_point(colour="blue")
g<- g + stat_smooth(method="lm", formula = y~poly(x,2))+ ggtitle("Regression Plot \n Polynomial \n gsec")
gl <- ggplot(mtcars, aes(y=mpg, x=gsec)) + geom_point(colour="blue")
gl<- gl + stat_smooth()+ ggtitle("Regression Plot \n Loess \n gsec")
am <- ggplot(mtcars, aes(y=mpg, x=am)) + geom_point(colour="blue")
am<- am + stat_smooth(method="lm", formula = y~x)+ ggtitle("Regression Plot \n am")
ama <- ggplot(mtcars, aes(y=mpg, x=am)) + geom_point(colour="blue")
ama<- ama + stat_smooth()+ ggtitle("Regression Plot \n Loess \n am")
grid.arrange(w,g,am, ncol=3)
grid.arrange(wa,gl,ama,ncol=3)
b<- ggplot2::fortify(step)
resid<- ggplot(b, aes(x= .fitted, y= .resid)) + geom_point(colour="blue")+ stat_smooth(method="lm", formula = y~poly(x,2))
resid<- resid + ggtitle("Residuals Vs Fitted")
normal <- ggplot(b, aes(sample = .stdresid)) + stat_qq(colour="blue") + stat_qq_line()
normal <- normal + ggtitle("Q-Q Normal")
grid.arrange(resid, normal)
mtcars_b<- select(mtcars, mpg, weight, gsec)
#scatterplot3d(x, y, z)
s3d<- scatterplot3d(mtcars_b$weight, mtcars_b$gsec, mtcars_b$mpg, pch=16, highlight.3d=TRUE, type ="h", main="3D Scatterplot", xlab="Weight", ylab="gsec", zlab="mpg(Miles per gallon)")
plane <- lm(mpg ~ weight + gsec, data = mtcars_b)
s3d$plane3d(plane)
fit<- lm(mpg~.,data = mtcars)
step<- step(fit, direction="backward")
## Start: AIC=69.75
## mpg ~ cyl + disp + hp + drat + weight + gsec + vs + am + gear +
## carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.023 142.33 67.757
## - vs 1 0.055 142.36 67.764
## - cyl 1 0.252 142.56 67.808
## - drat 1 0.813 143.12 67.934
## - gear 1 1.091 143.40 67.996
## - disp 1 5.632 147.94 68.994
## - hp 1 6.312 148.62 69.140
## <none> 142.31 69.752
## - am 1 10.956 153.26 70.125
## - gsec 1 11.757 154.06 70.292
## - weight 1 32.650 174.96 74.361
##
## Step: AIC=67.76
## mpg ~ cyl + disp + hp + drat + weight + gsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - vs 1 0.061 142.39 65.771
## - cyl 1 0.305 142.63 65.825
## - drat 1 0.791 143.12 65.934
## - gear 1 1.176 143.50 66.020
## - hp 1 8.943 151.27 67.707
## <none> 142.33 67.757
## - am 1 11.077 153.41 68.155
## - disp 1 11.303 153.63 68.202
## - gsec 1 13.256 155.58 68.606
## - weight 1 66.773 209.10 78.067
##
## Step: AIC=65.77
## mpg ~ cyl + disp + hp + drat + weight + gsec + am + gear
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.470 142.86 63.876
## - drat 1 0.797 143.19 63.949
## - gear 1 1.178 143.57 64.034
## <none> 142.39 65.771
## - hp 1 9.239 151.63 65.782
## - am 1 11.228 153.62 66.199
## - disp 1 11.251 153.64 66.204
## - gsec 1 16.849 159.24 67.349
## - weight 1 71.160 213.55 76.740
##
## Step: AIC=63.88
## mpg ~ disp + hp + drat + weight + gsec + am + gear
##
## Df Sum of Sq RSS AIC
## - drat 1 1.195 144.06 62.143
## - gear 1 1.875 144.73 62.293
## <none> 142.86 63.876
## - disp 1 10.781 153.64 64.204
## - hp 1 11.454 154.31 64.344
## - am 1 13.008 155.87 64.665
## - gsec 1 30.452 173.31 68.059
## - weight 1 74.208 217.07 75.263
##
## Step: AIC=62.14
## mpg ~ disp + hp + weight + gsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 3.061 147.12 60.815
## <none> 144.06 62.143
## - disp 1 10.338 154.39 62.360
## - hp 1 11.976 156.03 62.698
## - am 1 15.608 159.66 63.434
## - gsec 1 31.423 175.48 66.457
## - weight 1 77.904 221.96 73.976
##
## Step: AIC=60.82
## mpg ~ disp + hp + weight + gsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 7.411 154.53 60.388
## - hp 1 9.222 156.34 60.761
## <none> 147.12 60.815
## - gsec 1 30.349 177.46 64.817
## - am 1 32.921 180.04 65.278
## - weight 1 75.343 222.46 72.048
##
## Step: AIC=60.39
## mpg ~ hp + weight + gsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 6.392 160.92 59.685
## <none> 154.53 60.388
## - gsec 1 23.249 177.78 62.873
## - am 1 26.001 180.53 63.365
## - weight 1 84.066 238.59 72.289
##
## Step: AIC=59.69
## mpg ~ weight + gsec + am
##
## Df Sum of Sq RSS AIC
## <none> 160.92 59.685
## - am 1 25.978 186.90 62.474
## - gsec 1 111.186 272.10 74.494
## - weight 1 191.183 352.10 82.742