mtcars Data Using Regression Models
Executive Summary
Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).
They are particularly interested in the following two questions:
* Is an automatic or manual transmission better for MPG
* Quantify the MPG difference between automatic and manual transmissions
Using hypothesis testing and simple linear regression, we conclude that there is a significant difference between the mean MPG of automatic and manual transmission cars, and hence that manual transmission is better than automatic transmission for MPG.
To confirm this conclusion and to adjust for confounding variables such as the weight and quarter mile time (acceleration) of the car, a multivariate regression analysis was run to understand the impact of transmission type on MPG.
The best-fit model indicates that weight and quarter mile time (acceleration) have a significant impact on the MPG difference between automatic and manual transmission cars.
Location
The data was obtained from R (CRAN), and its documentation can be found at
http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
Description
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Format
A data frame with 32 observations on 11 variables
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Pre-Requisites
Before executing this Rmd file, set the working directory to your repository:
> setwd(<your_assignment_repository>)
knitr Global Options
knitr::opts_chunk$set(tidy=FALSE, fig.path='figures/')
Load Libraries
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.2
## Loading required package: grid
Load Data
# load data
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Levels
Here we see that the variable “am” (our predictor) is of class numeric. Since it only takes the values 0 and 1, we convert it to a factor and label the levels Automatic and Manual.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
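The same conversion can also be written as a single factor() call, which ties each label to its code explicitly instead of relying on the order of the levels. This is only a minimal sketch; `cars` is a hypothetical working copy and is not a name used elsewhere in this report.

```r
# Work on a copy so the report's own mtcars object is left untouched.
# `cars` is an illustrative name, not part of the original analysis.
cars <- datasets::mtcars

# Map code 0 to "Automatic" and code 1 to "Manual" in one step.
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Frequency of each transmission type.
am_counts <- table(cars$am)
am_counts
```

Spelling out `levels` and `labels` together guards against mislabelling if the codes were ever stored in a different order.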
Head
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 Manual 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 Manual 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 Manual 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 Automatic 3
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 Automatic 3
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 Automatic 3
## carb
## Mazda RX4 4
## Mazda RX4 Wag 4
## Datsun 710 1
## Hornet 4 Drive 1
## Hornet Sportabout 2
## Valiant 1
Summary
summary(mtcars)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Automatic:19 Min. :3.00 Min. :1.00
## Manual :13 1st Qu.:3.00 1st Qu.:2.00
## Median :4.00 Median :2.00
## Mean :3.69 Mean :2.81
## 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.00 Max. :8.00
Aggregates (count & mean) - mpg ~ am
# count
aggregate(mpg~am, data=mtcars, FUN=function(x){NROW(x)})
## am mpg
## 1 Automatic 19
## 2 Manual 13
# mean
aggregate(mpg~am, data=mtcars, mean)
## am mpg
## 1 Automatic 17.15
## 2 Manual 24.39
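The count and the mean can also be obtained in one aggregate() call by returning both values from the summary function. A small sketch under the assumption of a working copy named `cars` (hypothetical); `by_am` is likewise an illustrative name.

```r
# Illustrative working copy with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Return a named vector so that aggregate() stores both statistics;
# the result column `mpg` is then a matrix with columns "n" and "mean".
by_am <- aggregate(mpg ~ am, data = cars,
                   FUN = function(x) c(n = length(x), mean = mean(x)))
by_am
```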
Aggregates (count & mean) - mpg ~ am & cyl
# count
aggregate(mpg~am*cyl, data=mtcars, FUN=function(x){NROW(x)})
## am cyl mpg
## 1 Automatic 4 3
## 2 Manual 4 8
## 3 Automatic 6 4
## 4 Manual 6 3
## 5 Automatic 8 12
## 6 Manual 8 2
# mean
aggregate(mpg~am*cyl, data=mtcars, mean)
## am cyl mpg
## 1 Automatic 4 22.90
## 2 Manual 4 28.07
## 3 Automatic 6 19.12
## 4 Manual 6 20.57
## 5 Automatic 8 15.05
## 6 Manual 8 15.40
Aggregates (count & mean) - mpg ~ am & vs
# count
aggregate(mpg~am*vs, data=mtcars, FUN=function(x){NROW(x)})
## am vs mpg
## 1 Automatic 0 12
## 2 Manual 0 6
## 3 Automatic 1 7
## 4 Manual 1 7
# mean
aggregate(mpg~am*vs, data=mtcars, mean)
## am vs mpg
## 1 Automatic 0 15.05
## 2 Manual 0 19.75
## 3 Automatic 1 20.74
## 4 Manual 1 28.37
Aggregates (count & mean) - mpg ~ am & gears
# count
aggregate(mpg~am*gear, data=mtcars, FUN=function(x){NROW(x)})
## am gear mpg
## 1 Automatic 3 15
## 2 Automatic 4 4
## 3 Manual 4 8
## 4 Manual 5 5
# mean
aggregate(mpg~am*gear, data=mtcars, mean)
## am gear mpg
## 1 Automatic 3 16.11
## 2 Automatic 4 21.05
## 3 Manual 4 26.27
## 4 Manual 5 21.38
Aggregates (count & mean) - mpg ~ am & carburetors
# count
aggregate(mpg~am*carb, data=mtcars, FUN=function(x){NROW(x)})
## am carb mpg
## 1 Automatic 1 3
## 2 Manual 1 4
## 3 Automatic 2 6
## 4 Manual 2 4
## 5 Automatic 3 3
## 6 Automatic 4 7
## 7 Manual 4 3
## 8 Manual 6 1
## 9 Manual 8 1
# mean
aggregate(mpg~am*carb, data=mtcars, mean)
## am carb mpg
## 1 Automatic 1 20.33
## 2 Manual 1 29.10
## 3 Automatic 2 19.30
## 4 Manual 2 27.05
## 5 Automatic 3 16.30
## 6 Automatic 4 14.30
## 7 Manual 4 19.27
## 8 Manual 6 19.70
## 9 Manual 8 15.00
We will be running linear regression tests on this data.
For linear regression, we need to ensure that the following basic assumptions are met.
* The distribution of mpg is approximately normal
* Outliers are not skewing the data
Boxplot mpg ~ am
ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
geom_boxplot(notch=F)+
scale_x_discrete("Transmission")+
scale_y_continuous("Miles per Gallon")+
ggtitle("MPG by Transmission Type")
Observation
From the above graph, it can be seen that the following basic assumptions are met.
* The distribution of mpg is approximately normal
* Outliers are not skewing the data
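A boxplot only suggests normality visually. As a hedged complement, a Shapiro-Wilk test can be run on mpg overall and within each transmission group; a p-value above 0.05 means that normality is not rejected at the 5% level. The names `sw_all` and `sw_by_am` are illustrative assumptions and do not appear in the original analysis.

```r
# Illustrative working copy; am is kept as the raw 0/1 code here.
cars <- datasets::mtcars

# Shapiro-Wilk test on all 32 mpg values.
# H0: the sample comes from a normal distribution.
sw_all <- shapiro.test(cars$mpg)
sw_all$p.value

# The same test within each transmission group (0 = automatic, 1 = manual).
sw_by_am <- tapply(cars$mpg, cars$am,
                   function(x) shapiro.test(x)$p.value)
sw_by_am
```

With only 32 observations the test has limited power, so it should be read alongside the plots rather than in place of them.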
Boxplot mpg ~ am & cyl / vs / carb / gears
plot1 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
geom_boxplot(notch=F)+facet_grid(.~cyl)+scale_x_discrete("Transmission")+
scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Cylinder")
plot2 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
geom_boxplot(notch=F)+facet_grid(.~vs)+scale_x_discrete("Transmission")+
scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & VS")
plot3 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
geom_boxplot(notch=F)+facet_grid(.~gear)+scale_x_discrete("Transmission")+
scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Gears")
plot4 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
geom_boxplot(notch=F)+facet_grid(.~carb)+scale_x_discrete("Transmission")+
scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Carburetors")
grid.arrange(plot1, plot2, plot3, plot4, nrow=2, ncol=2)
Observation
From the above graph “MPG by Transmission Type & Cylinder”, it can be seen that
* With fewer cylinders, the mpg is far greater (for both Automatic and Manual)
Hence we should definitely consider performing tests with Cylinder.
From the above graph “MPG by Transmission Type & VS”, it can be seen that
* The mpg is higher when vs = 1 (for both Automatic and Manual)
Hence we should definitely consider performing tests with VS.
From the above graph “MPG by Transmission Type & Gears”, it can be seen that
* The mpg is best when Gears = 4 (for both Automatic and Manual)
However, there is no data for Manual transmission with Gears = 3 or for Automatic transmission with Gears = 5, so we will avoid any tests with Gears.
From the above graph “MPG by Transmission Type & Carburetors”, it can be seen that
* The mpg is best when Carburetors = 1 or 2 (for both Automatic and Manual)
However, sufficient data is not available for both Manual and Automatic transmission across all categories of Carburetors, so we will avoid any tests with Carburetors.
Scatterplot mpg ~ all vars
#pairs(~mpg+am+cyl+wt+qsec+vs, data=mtcars,
#pairs(~mpg+disp+hp+drat+gear+carb, data=mtcars,
pairs(~mpg+., data=mtcars,
main="mtcars Scatterplot Matrix")
Observation
From the above graph, it can be seen that for the paired graphs of mpg ~ wt and mpg ~ qsec
* The distribution of mpg is approximately normal
* Outliers are not skewing the data
Considering this, we should include Weight and QSec when performing tests.
From the above graph, it is also seen that the paired graphs for disp, hp, drat, gear and carb are skewed, hence we can ignore these variables while performing our tests.
Correlogram mpg ~ all vars
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.1.2
corrgram(mtcars, order=TRUE,
lower.panel=panel.shade,
upper.panel=panel.pie,
text.panel=panel.txt,
main="MPG Data")
Observation
From the above graph, it is seen that, with respect to mpg,
* wt, disp, cyl and hp are negatively correlated
* qsec, gear, vs, drat are positively correlated
Note:
In a correlogram, when the shaded panel is used, each cell is shaded blue or red depending on the sign of the correlation, with the intensity of the color scaled 0-100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red (1,0,0), through white (1,1,1), to blue (0,0,1).)
Means
Analysis of mileage of automatic vs. manual transmission
means <- aggregate(mpg~am, data=mtcars, mean)
means
## am mpg
## 1 Automatic 17.15
## 2 Manual 24.39
Observation
The average MPG of all the manual transmission cars is 24.3923. This is much higher than average MPG of all the automatic transmission cars which is 17.1474.
We set our alpha-value at 0.05 and run a t-test to find out whether this difference is significant.
ttest
autoData <- mtcars[mtcars$am == "Automatic",]
manualData <- mtcars[mtcars$am == "Manual",]
ttest <- t.test(autoData$mpg, manualData$mpg)
ttest
##
## Welch Two Sample t-test
##
## data: autoData$mpg and manualData$mpg
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
## mean of x mean of y
## 17.15 24.39
Observation
The p-value is 0.0014, so we can reject the null hypothesis and conclude that automatic cars have lower mpg than manual cars. This confirms our observation from the above boxplot titled “MPG by Transmission Type”.
However, this conclusion implicitly assumes that all other characteristics of automatic and manual cars are the same.
This should be further explored using multiple linear regression analysis.
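To make the t-test less of a black box, the Welch statistic, its degrees of freedom and the two-sided p-value can be recomputed by hand from the group means and variances. This is a sketch only; the names `auto`, `manual`, `t_stat`, `welch_df` and `p_val` are illustrative assumptions.

```r
# Illustrative working copy; am is the raw 0/1 code (0 = automatic).
cars   <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Squared standard error contributed by each group.
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom.
welch_df <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df = welch_df)
c(t = t_stat, df = welch_df, p = p_val)
```

These values should agree with the output of t.test() on the same two groups.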
cor test - generic
The mtcars data frame has 11 variables. To explore possible relationships between mpg and the other variables, we check their correlations with mpg using the cor() function. For easy reference, we sort the output of cor() in ascending order.
data(mtcars)
sort(cor(mtcars)[1,])
## wt cyl disp hp carb qsec gear am vs
## -0.8677 -0.8522 -0.8476 -0.7762 -0.5509 0.4187 0.4803 0.5998 0.6640
## drat mpg
## 0.6812 1.0000
Observation
1. From the above data it is seen that
* wt, cyl, disp, hp & carb are negatively correlated with mpg, i.e. as wt, cyl, disp, hp & carb increase, mpg tends to decrease.
* qsec, gear, am, vs & drat are positively correlated with mpg, i.e. as qsec, gear, am, vs & drat increase, mpg tends to increase.
2. Hence, apart from am (which is mandatory for our regression model), we see that wt, cyl, disp and hp are strongly (negatively) correlated with our dependent variable mpg. No variable is strongly positively correlated with mpg.
3. Accordingly, we will carry out cor.test on am, wt & cyl.
Note: Conventionally, the correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and -1. To interpret its value, see which of the following values your correlation r is closest to:
* Exactly -1. A perfect downhill (negative) linear relationship
* -0.70. A strong downhill (negative) linear relationship
* -0.50. A moderate downhill (negative) relationship
* -0.30. A weak downhill (negative) linear relationship
* 0. No linear relationship
* +0.30. A weak uphill (positive) linear relationship
* +0.50. A moderate uphill (positive) relationship
* +0.70. A strong uphill (positive) linear relationship
* Exactly +1. A perfect uphill (positive) linear relationship
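As a concrete illustration of what r measures, the Pearson coefficient for mpg and wt can be computed directly from its definition and compared with cor(). A minimal sketch; `x`, `y` and `r_manual` are illustrative names.

```r
# Weight and fuel economy from the built-in data set.
x <- datasets::mtcars$wt
y <- datasets::mtcars$mpg

# Pearson r: sum of cross-products of deviations, divided by the
# square root of the product of the two sums of squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual
```

By the scale listed above, the result falls in the strong downhill (negative) range.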
cor test - mpg ~ am
cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$am))
cortest$p.value; cortest$conf.int
## [1] 0.000285
## [1] 0.3176 0.7845
## attr(,"conf.level")
## [1] 0.95
Observation
From the above result it is seen that
1. the p-value is 2.8502 × 10^-4. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (0.3176, 0.7845), which does not contain zero; a zero correlation is therefore implausible, hence significant.
Note:
The cor.test function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.
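For Pearson's correlation, the confidence interval reported by cor.test() is based on the Fisher z transformation. The following sketch reproduces it for mpg and am, assuming the default 95% level; `r`, `n`, `z` and `ci` are illustrative names.

```r
# Illustrative working copy; am is the raw 0/1 code.
cars <- datasets::mtcars
r <- cor(cars$mpg, cars$am)
n <- nrow(cars)

# Fisher z transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# Build the interval on the z scale, then transform back with tanh().
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci
```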
cor test - mpg ~ cyl
cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$cyl))
cortest$p.value; cortest$conf.int
## [1] 6.113e-10
## [1] -0.9258 -0.7163
## attr(,"conf.level")
## [1] 0.95
Observation
From the above result it is seen that
1. the p-value is 6.1127 × 10^-10. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (-0.9258, -0.7163), which does not contain zero; a zero correlation is therefore implausible, hence significant.
cor test - mpg ~ wt
cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$wt))
cortest$p.value; cortest$conf.int
## [1] 1.294e-10
## [1] -0.9338 -0.7441
## attr(,"conf.level")
## [1] 0.95
Observation
From the above result it is seen that
1. the p-value is 1.294 × 10^-10. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (-0.9338, -0.7441), which does not contain zero; a zero correlation is therefore implausible, hence significant.
Simple Linear Regression
line_fit <- lm(mpg~am, data=mtcars)
line_smry <- summary(line_fit)
line_smry
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## am 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
Observation
Interpreting the intercept and the am coefficient, we see that, on average, manual transmission cars get 7.2449 mpg more than automatic transmission cars.
In addition, we see that the R^2 value is 0.3598. This means that our model explains 35.9799% of the variance (not sufficient).
Hence we can say that we do not gain much information from our hypothesis test using this model.
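With a single two-level predictor, the regression coefficients are just the group means in another form: the intercept is the mean mpg of automatic cars and the slope is the difference between the manual and automatic means. A short sketch to check this, using illustrative names (`cars`, `fit`, `grp_means`):

```r
# Illustrative working copy; am is the raw 0/1 code (0 = automatic).
cars <- datasets::mtcars
fit  <- lm(mpg ~ am, data = cars)

# Mean mpg per transmission group, indexed by "0" and "1".
grp_means <- tapply(cars$mpg, cars$am, mean)

# Intercept vs. automatic mean, slope vs. difference of the two means.
intercept  <- unname(coef(fit)["(Intercept)"])
slope      <- unname(coef(fit)["am"])
mean_diff  <- unname(grp_means["1"] - grp_means["0"])
c(intercept = intercept, slope = slope, mean_diff = mean_diff)
```

This is why the simple regression adds no information beyond the comparison of means already made with the t-test.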
Multivariate Regression Analysis
We use a stepwise algorithm, via step(), to choose the best linear model.
step_fit <- step(lm(mpg ~ ., data=mtcars), trace=0, steps=10000)
step_smry <- summary(step_fit)
step_smry
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481 -1.556 -0.726 1.411 4.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.618 6.960 1.38 0.17792
## wt -3.917 0.711 -5.51 7e-06 ***
## qsec 1.226 0.289 4.25 0.00022 ***
## am 2.936 1.411 2.08 0.04672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
Observation
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
This shows that, in addition to transmission, wt (weight) & qsec (1/4 mile time) are the most significant variables in explaining the variation in mpg.
The multiple R^2 is 0.8497 (adjusted R^2: 0.834), which means that the model explains 84.9664% of the variation in mpg.
We can conclude that this model fits the data considerably better than the simple model (R^2 of 0.8497 versus 0.3598).
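By default, step() ranks candidate models by AIC, where a lower value is better. As a rough check on the selection, the AIC of the full model can be compared with that of the selected model. This is a sketch only; `full_fit` and `sel_fit` are illustrative names, and the selected formula is copied from the output above.

```r
# Illustrative working copy with am as the raw 0/1 code.
cars <- datasets::mtcars

# Full model with all ten predictors, and the model chosen by step().
full_fit <- lm(mpg ~ ., data = cars)
sel_fit  <- lm(mpg ~ wt + qsec + am, data = cars)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_tab <- AIC(full_fit, sel_fit)
aic_tab
```

Stepwise selection on 32 observations can be unstable, so the chosen model is best treated as one reasonable candidate and not as the only defensible one.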
Best Model - am + wt + qsec
To quantify the mpg difference between automatic and manual transmission, we include 3 variables wt, qsec and am. As seen above, this model captured 84.9664% of total variance.
best_fit <- lm(mpg~am+wt+qsec, data=mtcars)
best_smry <- summary(best_fit)
best_smry
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481 -1.556 -0.726 1.411 4.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.618 6.960 1.38 0.17792
## am 2.936 1.411 2.08 0.04672 *
## wt -3.917 0.711 -5.51 7e-06 ***
## qsec 1.226 0.289 4.25 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
f <- best_smry$fstatistic
best_pval <- pf(f[1], f[2], f[3], lower.tail=FALSE)
attributes(best_pval) <- NULL
Observation
This model captured 84.9664% of the total variance in mpg.
The p-value of the overall F-test is 1.2104 × 10^-11.
Based on the above, we can reject the null hypothesis that all slope coefficients are zero and conclude that our multivariate model fits significantly better than an intercept-only model.
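The overall F-test reported by summary() compares a model against the intercept-only model. A direct comparison of the simple and the multivariate model can be sketched with anova() on the two nested fits. The names `simple_fit`, `multi_fit`, `cmp` and `nested_p` are illustrative assumptions.

```r
# Illustrative working copy with am as the raw 0/1 code.
cars <- datasets::mtcars

# Nested models: the second adds wt and qsec to the first.
simple_fit <- lm(mpg ~ am, data = cars)
multi_fit  <- lm(mpg ~ am + wt + qsec, data = cars)

# Partial F-test. H0: the coefficients of wt and qsec are both zero.
cmp <- anova(simple_fit, multi_fit)
cmp

# p-value of the comparison (second row of the table).
nested_p <- cmp[["Pr(>F)"]][2]
nested_p
```

A small p-value here indicates that adding wt and qsec improves the fit beyond what am alone achieves.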
Result Summary
1. This model explains 84.9664% of the variance in miles per gallon (mpg).
2. We see that wt (weight) & qsec (1/4 mile time) did indeed affect the relationship between am and mpg (mostly wt).
Therefore, given the above analysis, the question “Is an automatic or manual transmission better for MPG” cannot be answered without considering wt (weight) & qsec (1/4 mile time).
Again from the above analysis, to answer the question “Quantify the MPG difference between automatic and manual transmissions”, we refer to the coefficient for am: on average, manual transmission cars get 2.9358 mpg more than automatic transmission cars, holding weight and quarter mile time constant.
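A point estimate alone hides how uncertain the adjusted difference is. As a hedged addition, a 95% confidence interval for the am coefficient can be obtained with confint(). The names `fit` and `am_ci` are illustrative.

```r
# Illustrative working copy with am as the raw 0/1 code,
# so the coefficient is named "am".
cars <- datasets::mtcars
fit  <- lm(mpg ~ am + wt + qsec, data = cars)

# 95% confidence interval for the transmission effect,
# adjusted for weight and quarter mile time.
am_ci <- confint(fit, "am", level = 0.95)
am_ci
```

Because the p-value for am is only just below 0.05, the lower bound of this interval lies only just above zero, so the size of the manual advantage is estimated with limited precision.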
par(mfrow=c(1, 2))
# Histogram with Normal Curve
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram Of MPG")
xfit<-seq(min(x),max(x),length.out=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d, xlab="MPG", main ="Density Of MPG")
par(mfrow=c(2,2))
plot(best_fit)
End Of Report