Analysis Of mtcars Data Using Regression Models

Overview

Executive Summary

Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).

They are particularly interested in the following two questions:
* Is an automatic or manual transmission better for MPG?
* Quantify the MPG difference between automatic and manual transmissions

Using hypothesis testing and simple linear regression, we conclude that there is a significant difference between the mean MPG of automatic and manual transmission cars, and hence that manual transmission is better than automatic transmission for MPG.

To confirm our conclusions & to adjust for confounding variables such as the weight and quarter mile time (acceleration) of the car, multivariate regression analysis was run to understand the impact of transmission type on MPG.

The best-fit model indicates that weight and quarter mile time (acceleration) have a significant impact on the MPG difference between automatic and manual transmission cars.

Data

Location

The data set ships with R (in the datasets package) and its documentation can be found at
http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

Format

A data frame with 32 observations on 11 variables
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Pre Process

Pre-Requisites

Before you start execution of this Rmd file, please set working dir to your repository

> setwd(<your_assignment_repository>)

knitr Global Options

knitr::opts_chunk$set(tidy=FALSE, fig.path='figures/')

Load Libraries

library(ggplot2)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.1.2

## Loading required package: grid

Load Data

# load data
data(mtcars)
names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Levels

Here we see that the variable “am” (our predictor) is of class numeric. Since it only takes the values 0 & 1, we convert it to a factor and label the levels Automatic and Manual.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
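
As a minimal sketch of an alternative, the same conversion can be written as a single factor() call that spells out which code maps to which label. The names cars and am_factor are illustrative only and are not used elsewhere in this report; a copy of the data is used so the report's own mtcars object is left untouched.

```r
# Work on a copy so the report's mtcars object is not modified
cars <- datasets::mtcars

# Explicit levels make the mapping 0 -> Automatic, 1 -> Manual visible
am_factor <- factor(cars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))

# Count of cars per transmission type
table(am_factor)
```

Naming the levels explicitly avoids relying on the sort order of the underlying codes.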

Basic Data Analysis

Head

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs        am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0    Manual    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0    Manual    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1    Manual    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 Automatic    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0 Automatic    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1 Automatic    3
##                   carb
## Mazda RX4            4
## Mazda RX4 Wag        4
## Datsun 710           1
## Hornet 4 Drive       1
## Hornet Sportabout    2
## Valiant              1

Summary

summary(mtcars)

##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##          am          gear           carb     
##  Automatic:19   Min.   :3.00   Min.   :1.00  
##  Manual   :13   1st Qu.:3.00   1st Qu.:2.00  
##                 Median :4.00   Median :2.00  
##                 Mean   :3.69   Mean   :2.81  
##                 3rd Qu.:4.00   3rd Qu.:4.00  
##                 Max.   :5.00   Max.   :8.00

Aggregates (count & mean) - mpg ~ am

# count
aggregate(mpg~am, data=mtcars, FUN=function(x){NROW(x)})

##          am mpg
## 1 Automatic  19
## 2    Manual  13

# mean
aggregate(mpg~am, data=mtcars, mean)

##          am   mpg
## 1 Automatic 17.15
## 2    Manual 24.39

Aggregates (count & mean) - mpg ~ am & cyl

# count
aggregate(mpg~am*cyl, data=mtcars, FUN=function(x){NROW(x)})

##          am cyl mpg
## 1 Automatic   4   3
## 2    Manual   4   8
## 3 Automatic   6   4
## 4    Manual   6   3
## 5 Automatic   8  12
## 6    Manual   8   2

# mean
aggregate(mpg~am*cyl, data=mtcars, mean)

##          am cyl   mpg
## 1 Automatic   4 22.90
## 2    Manual   4 28.07
## 3 Automatic   6 19.12
## 4    Manual   6 20.57
## 5 Automatic   8 15.05
## 6    Manual   8 15.40

Aggregates (count & mean) - mpg ~ am & vs

# count
aggregate(mpg~am*vs, data=mtcars, FUN=function(x){NROW(x)})

##          am vs mpg
## 1 Automatic  0  12
## 2    Manual  0   6
## 3 Automatic  1   7
## 4    Manual  1   7

# mean
aggregate(mpg~am*vs, data=mtcars, mean)

##          am vs   mpg
## 1 Automatic  0 15.05
## 2    Manual  0 19.75
## 3 Automatic  1 20.74
## 4    Manual  1 28.37

Aggregates (count & mean) - mpg ~ am & gears

# count
aggregate(mpg~am*gear, data=mtcars, FUN=function(x){NROW(x)})

##          am gear mpg
## 1 Automatic    3  15
## 2 Automatic    4   4
## 3    Manual    4   8
## 4    Manual    5   5

# mean
aggregate(mpg~am*gear, data=mtcars, mean)

##          am gear   mpg
## 1 Automatic    3 16.11
## 2 Automatic    4 21.05
## 3    Manual    4 26.27
## 4    Manual    5 21.38

Aggregates (count & mean) - mpg ~ am & carburetors

# count
aggregate(mpg~am*carb, data=mtcars, FUN=function(x){NROW(x)})

##          am carb mpg
## 1 Automatic    1   3
## 2    Manual    1   4
## 3 Automatic    2   6
## 4    Manual    2   4
## 5 Automatic    3   3
## 6 Automatic    4   7
## 7    Manual    4   3
## 8    Manual    6   1
## 9    Manual    8   1

# mean
aggregate(mpg~am*carb, data=mtcars, mean)

##          am carb   mpg
## 1 Automatic    1 20.33
## 2    Manual    1 29.10
## 3 Automatic    2 19.30
## 4    Manual    2 27.05
## 5 Automatic    3 16.30
## 6 Automatic    4 14.30
## 7    Manual    4 19.27
## 8    Manual    6 19.70
## 9    Manual    8 15.00

Exploratory Data Analysis

We will be running linear regression tests on this data.
For linear regression, we need to ensure that the following basic assumptions are met.
* The distribution of mpg is approximately normal
* Outliers are not skewing the data

Boxplot mpg ~ am

ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
  geom_boxplot(notch=F)+ 
  scale_x_discrete("Transmission")+
  scale_y_continuous("Miles per Gallon")+
  ggtitle("MPG by Transmission Type")

plot of chunk show_boxplot

Observation
From the above graph, it can be seen that the following basic assumptions are met.
* The distribution of mpg is approximately normal
* Outliers are not skewing the data
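
A boxplot gives only a visual impression, so a numeric check can accompany it. The sketch below is an assumption about how one might do this and is not part of the original analysis: it runs a Shapiro-Wilk normality test on mpg within each transmission group and lists the points a boxplot would flag as outliers. The names cars, normality_p and outliers are illustrative.

```r
# Copy of the original data (am coded 0 = automatic, 1 = manual)
cars <- datasets::mtcars

# Shapiro-Wilk p-value for mpg within each transmission group;
# a small p-value would speak against approximate normality
normality_p <- tapply(cars$mpg, cars$am,
                      function(x) shapiro.test(x)$p.value)
normality_p

# Values beyond 1.5 * IQR from the hinges, i.e. the points that
# geom_boxplot() or boxplot() would draw as outliers
outliers <- tapply(cars$mpg, cars$am,
                   function(x) boxplot.stats(x)$out)
outliers
```

With only 19 and 13 cars per group, such tests have limited power, so they complement the plot rather than replace it.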

Boxplot mpg ~ am & cyl / vs / carb / gears

plot1 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
  geom_boxplot(notch=F)+facet_grid(.~cyl)+scale_x_discrete("Transmission")+
  scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Cylinder")
plot2 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
  geom_boxplot(notch=F)+facet_grid(.~vs)+scale_x_discrete("Transmission")+
  scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & VS")
plot3 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
  geom_boxplot(notch=F)+facet_grid(.~gear)+scale_x_discrete("Transmission")+
  scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Gears")
plot4 <- ggplot(mtcars, aes(x=factor(am),y=mpg,fill=factor(am)))+
  geom_boxplot(notch=F)+facet_grid(.~carb)+scale_x_discrete("Transmission")+
  scale_y_continuous("Miles per Gallon")+ggtitle("MPG by Transmission Type & Carburetors")
grid.arrange(plot1, plot2, plot3, plot4, nrow=2, ncol=2)

plot of chunk show_boxplot1

Observation
From the above graph “MPG by Transmission Type & Cylinder”, it can be seen that
* For fewer cylinders, the mpg is far greater (in both Automatic and Manual)
Hence we should definitely consider performing tests with Cylinder.

From the above graph “MPG by Transmission Type & VS”, it can be seen that
* The mpg is higher when vs = 1 (in both Automatic and Manual)
Hence we should definitely consider performing tests with VS.

From the above graph “MPG by Transmission Type & Gears”, it can be seen that
* The mpg is best when Gears = 4 (in both Automatic and Manual)
However, there is no data for Manual Transmission with Gears = 3 or Automatic Transmission with Gears = 5, so we will avoid any tests with Gears.

From the above graph “MPG by Transmission Type & Carburetors”, it can be seen that
* The mpg is best when Carburetors = 1 or 2 (in both Automatic and Manual)
However, sufficient data is not available for both Manual & Automatic Transmission in all categories of Carburetors, so we will avoid any tests with Carburetors.
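
The missing combinations mentioned above can be confirmed with a cross-tabulation. This is a minimal sketch on a copy of the original data (am coded 0 = automatic, 1 = manual); the names cars, gear_cells and carb_cells are illustrative.

```r
# Copy of the original data (am is numeric 0/1 here)
cars <- datasets::mtcars

# Number of cars in each transmission x gear combination
gear_cells <- table(am = cars$am, gear = cars$gear)
gear_cells

# Same for transmission x carburetors
carb_cells <- table(am = cars$am, carb = cars$carb)
carb_cells

# TRUE marks combinations with no cars at all
gear_cells == 0
carb_cells == 0
```

Empty cells mean an am-by-gear or am-by-carb comparison cannot be estimated for those levels, which supports leaving these variables out.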

Scatterplot mpg ~ all vars

#pairs(~mpg+am+cyl+wt+qsec+vs, data=mtcars, 
#pairs(~mpg+disp+hp+drat+gear+carb, data=mtcars, 
pairs(~mpg+., data=mtcars, 
   main="mtcars Scatterplot Matrix")

plot of chunk show_scatterplot1

Observation
From the above graph, it can be seen that for the paired graphs of mpg ~ wt and mpg ~ qsec
* The distribution of mpg is approximately normal
* Outliers are not skewing the data
Considering this, while performing tests, we should include Weight and QSec.

From the above graph, it is seen that the paired graphs for disp, hp, drat, gear, carb are skewed, hence we can ignore all these variables while performing our tests.

Correlogram mpg ~ all vars

library(corrgram)

## Warning: package 'corrgram' was built under R version 3.1.2

corrgram(mtcars, order=TRUE, 
         lower.panel=panel.shade,
        upper.panel=panel.pie, 
        text.panel=panel.txt,
        main="MPG Data")

Observation
From the above graph, it is seen that
* wt, disp, cyl and hp are negatively correlated with mpg
* qsec, gear, vs, drat are positively correlated with mpg
Note:
In a correlogram, when the shaded panel is used, each cell is shaded blue or red depending on the sign of the correlation, with the intensity of the color scaled 0-100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red (1,0,0), through white (1,1,1), to blue (0,0,1).)

Means Test (ttest)

Means

Analysis of mileage of automatic vs. manual transmission

means <- aggregate(mpg~am, data=mtcars, mean)
means

##          am   mpg
## 1 Automatic 17.15
## 2    Manual 24.39

Observation
The average MPG of all the manual transmission cars is 24.3923. This is much higher than the average MPG of all the automatic transmission cars, which is 17.1474.
We set our alpha value at 0.05 and run a t-test to check whether this difference is statistically significant.

ttest

autoData <- mtcars[mtcars$am == "Automatic",]
manualData <- mtcars[mtcars$am == "Manual",]
ttest <- t.test(autoData$mpg, manualData$mpg)
ttest

## 
##  Welch Two Sample t-test
## 
## data:  autoData$mpg and manualData$mpg
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean of x mean of y 
##     17.15     24.39

Observation
The p-value is 0.0014, which is below 0.05, so we reject the null hypothesis of equal means and conclude that automatic cars have lower mpg than manual cars. This confirms our observation from the above boxplot titled “MPG by Transmission Type”.
However, this conclusion is incomplete, because it assumes that all other characteristics of automatic and manual cars are the same.
This should be explored further using multiple linear regression analysis.
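
The same Welch test can also be written with the formula interface of t.test(), which avoids splitting the data by hand. The sketch below uses a copy of the original data; the names cars, welch and diff_means are illustrative.

```r
# Copy of the original data (am coded 0 = automatic, 1 = manual)
cars <- datasets::mtcars

# Welch two-sample t-test; var.equal = FALSE is the default
welch <- t.test(mpg ~ am, data = cars)

welch$p.value    # p-value of the two-sided test
welch$conf.int   # 95% CI for mean(automatic) - mean(manual)

# Difference in group means, manual minus automatic
diff_means <- unname(diff(welch$estimate))
diff_means
```

The confidence interval refers to automatic minus manual, which is why both of its limits are negative in the output above.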

Correlation Analysis

cor test - generic

The mtcars data frame has 11 variables. To find out which variables may be related to mpg, we check the correlation between mpg and the other variables using the cor() function. For easy reference, we sort the output of cor() in ascending order.

# reload the original data so that am is numeric again (cor() needs numeric columns)
data(mtcars)
sort(cor(mtcars)[1,])

##      wt     cyl    disp      hp    carb    qsec    gear      am      vs 
## -0.8677 -0.8522 -0.8476 -0.7762 -0.5509  0.4187  0.4803  0.5998  0.6640 
##    drat     mpg 
##  0.6812  1.0000

Observation
1. From the above data it is seen that
* wt, cyl, disp, hp & carb are negatively correlated with mpg, i.e. as wt, cyl, disp, hp & carb increase, mpg tends to decrease.
* qsec, gear, am, vs & drat are positively correlated with mpg, i.e. as qsec, gear, am, vs & drat increase, mpg tends to increase.
2. Hence apart from am (which must be in the regression model because it is the variable of interest), we see that wt, cyl, disp, and hp are strongly (negatively) correlated with our dependent variable mpg. No positive correlation reaches the 0.70 level; the largest is drat at 0.6812.
3. Accordingly we will carry out cor.test on am, wt & cyl.
Note: Conventionally, the correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and -1. To interpret its value, see which of the following values your correlation r is closest to:
* Exactly -1. A perfect downhill (negative) linear relationship
* -0.70. A strong downhill (negative) linear relationship
* -0.50. A moderate downhill (negative) relationship
* -0.30. A weak downhill (negative) linear relationship
* 0. No linear relationship
* +0.30. A weak uphill (positive) linear relationship
* +0.50. A moderate uphill (positive) relationship
* +0.70. A strong uphill (positive) linear relationship
* Exactly +1. A perfect uphill (positive) linear relationship
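
The scale above can be applied to the sorted correlations programmatically. In this sketch the cut-off of 0.70 for "strong" is taken from the list above; treating it as a hard threshold is an assumption made for illustration, as are the names cars, r and strong.

```r
# Copy of the original data (all columns numeric)
cars <- datasets::mtcars

# Correlation of mpg with every other variable
r <- cor(cars)["mpg", ]
r <- r[names(r) != "mpg"]   # drop the trivial mpg ~ mpg entry

# Variables whose correlation with mpg is at least 0.70 in magnitude
strong <- names(r)[abs(r) >= 0.7]
strong

# Sign and size of those correlations
round(r[strong], 4)
```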

cor test - mpg ~ am

cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$am))
cortest$p.value; cortest$conf.int

## [1] 0.000285

## [1] 0.3176 0.7845
## attr(,"conf.level")
## [1] 0.95

Observation
From the above result it is seen that
1. the p-value is 2.8502 × 10^-4. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (0.3176, 0.7845), which does not contain zero, so a zero correlation is rejected at the 5% level.
Note:
The cor.test function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.

cor test - mpg ~ cyl

cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$cyl))
cortest$p.value; cortest$conf.int

## [1] 6.113e-10

## [1] -0.9258 -0.7163
## attr(,"conf.level")
## [1] 0.95

Observation
From the above result it is seen that
1. the p-value is 6.1127 × 10^-10. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (-0.9258, -0.7163), which does not contain zero, so a zero correlation is rejected at the 5% level.
Note:
The cor.test function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.

cor test - mpg ~ wt

cortest <- cor.test(mtcars$mpg, as.numeric(mtcars$wt))
cortest$p.value; cortest$conf.int

## [1] 1.294e-10

## [1] -0.9338 -0.7441
## attr(,"conf.level")
## [1] 0.95

Observation
From the above result it is seen that
1. the p-value is 1.294 × 10^-10. This is much less than 0.05, hence the correlation is significant.
2. the 95% confidence interval is (-0.9338, -0.7441), which does not contain zero, so a zero correlation is rejected at the 5% level.
Note:
The cor.test function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.
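
The three tests above repeat the same steps, so they can be collected into one table. This is a sketch only; the names cars, vars and results are illustrative and not part of the original analysis.

```r
# Copy of the original data (all columns numeric)
cars <- datasets::mtcars

# Variables to test against mpg
vars <- c("am", "cyl", "wt")

# One row per variable: estimate, p-value and 95% confidence limits
results <- t(sapply(vars, function(v) {
  ct <- cor.test(cars$mpg, cars[[v]])
  c(estimate = unname(ct$estimate),
    p.value  = ct$p.value,
    lower    = ct$conf.int[1],
    upper    = ct$conf.int[2])
}))

signif(results, 4)
```

A single table makes it easier to compare the strength of the three correlations side by side.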

Linear Regression Models

Simple Linear Regression

line_fit <- lm(mpg~am, data=mtcars)
line_smry <- summary(line_fit)
line_smry

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## am              7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

Observation
Interpreting the intercept and the coefficient, we say that automatic transmission cars average 17.15 mpg and that, on average, manual transmission cars get 7.2449 mpg more than automatic transmission cars.
In addition, we see that the R^2 value is 0.3598. This means that our model explains only 35.9799% of the variance in mpg (not sufficient).
Hence transmission type alone leaves most of the variation in mpg unexplained, and we do not gain much information beyond the t-test from this model.
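
With a single 0/1 predictor, the regression coefficients are just the group means in another form: the intercept is the mean mpg of automatic cars, and the am coefficient is the difference between the two group means. The sketch below checks this on a copy of the data; the names cars, fit and group_means are illustrative.

```r
# Copy of the original data (am coded 0 = automatic, 1 = manual)
cars <- datasets::mtcars

fit <- lm(mpg ~ am, data = cars)

# Mean mpg per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)
group_means

# Intercept: mean mpg of automatic cars (am = 0)
coef(fit)[["(Intercept)"]]

# Slope: mean(manual) - mean(automatic)
coef(fit)[["am"]]
unname(group_means["1"] - group_means["0"])
```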

Multivariate Regression Analysis

We use a stepwise algorithm to choose the best linear model by using step().

step_fit=step(lm(data=mtcars, mpg ~ .),trace=0,steps=10000)
step_smry <- summary(step_fit)
step_smry

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.481 -1.556 -0.726  1.411  4.661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.618      6.960    1.38  0.17792    
## wt            -3.917      0.711   -5.51    7e-06 ***
## qsec           1.226      0.289    4.25  0.00022 ***
## am             2.936      1.411    2.08  0.04672 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared:  0.85,   Adjusted R-squared:  0.834 
## F-statistic: 52.7 on 3 and 28 DF,  p-value: 1.21e-11

Observation

## lm(formula = mpg ~ wt + qsec + am, data = mtcars)

This shows that, in addition to transmission, wt (weight) & qsec (1/4 mile time) are the most significant variables in explaining the variation in mpg.
The multiple R^2 is 0.8497 (adjusted R^2 0.834), which means that the model explains 84.9664% of the variation in mpg.
We can conclude that this model fits the data well, subject to the residual diagnostics in Appendix 2.
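
The summary object stores both the multiple and the adjusted R^2, and the two are easy to confuse. As a sketch (names cars, fit, s, n, p and adj_by_hand are illustrative), they can be extracted and the adjustment reproduced by hand:

```r
# Copy of the original data
cars <- datasets::mtcars

fit <- lm(mpg ~ wt + qsec + am, data = cars)
s <- summary(fit)

s$r.squared       # multiple R^2
s$adj.r.squared   # adjusted R^2, penalised for the number of predictors

# Adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(cars)   # number of observations
p <- 3            # number of predictors (wt, qsec, am)
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
adj_by_hand
```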

Best Model - am + wt + qsec

To quantify the mpg difference between automatic and manual transmission, we include 3 variables wt, qsec and am. As seen above, this model captured 84.9664% of total variance.

best_fit <- lm(mpg~am+wt+qsec, data=mtcars)
best_smry <- summary(best_fit)
best_smry

## 
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.481 -1.556 -0.726  1.411  4.661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.618      6.960    1.38  0.17792    
## am             2.936      1.411    2.08  0.04672 *  
## wt            -3.917      0.711   -5.51    7e-06 ***
## qsec           1.226      0.289    4.25  0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared:  0.85,   Adjusted R-squared:  0.834 
## F-statistic: 52.7 on 3 and 28 DF,  p-value: 1.21e-11

    f <- best_smry$fstatistic
    best_pval <- pf(f[1],f[2],f[3],lower.tail=F)
    attributes(best_pval) <- NULL

Observation
This model captured 84.9664% of total variance in mpg.
The p-value is 1.2104 × 10^-11.
Based on the above, we can reject the null hypothesis that none of the predictors is related to mpg: the overall F-test compares our multivariate model with an intercept-only model, not directly with our simple linear regression model.
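
One way to compare the simple and the multivariate model directly is a nested-model F-test with anova(). This is a sketch that was not part of the original analysis; the names cars, small_fit, full_fit, cmp and p_nested are illustrative.

```r
# Copy of the original data
cars <- datasets::mtcars

small_fit <- lm(mpg ~ am, data = cars)              # simple model
full_fit  <- lm(mpg ~ am + wt + qsec, data = cars)  # adds wt and qsec

# F-test of H0: the coefficients of wt and qsec are both zero
cmp <- anova(small_fit, full_fit)
cmp

# p-value of the comparison (second row of the table)
p_nested <- cmp[["Pr(>F)"]][2]
p_nested
```

A small p-value here indicates that adding wt and qsec improves on the am-only model.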

Result Summary
1. This model explains 84.9664% of the variance in miles per gallon (mpg).
2. We see that wt (weight) & qsec (1/4 mile time) did indeed impact the relationship between am and mpg (mostly wt).

Therefore, given the above analysis, the question “Is an automatic or manual transmission better for MPG” cannot be answered without considering wt (weight) & qsec (1/4 mile time).

Again from the above analysis, to answer the question “Quantify the MPG difference between automatic and manual transmissions”, we refer to the coefficient for am and based on that we can say that, on average, manual transmission cars have 2.9358 mpg more than automatic transmission cars.
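
A point estimate alone hides how uncertain the difference is, so a confidence interval for the am coefficient is a natural companion. A minimal sketch, with illustrative names cars, fit and ci:

```r
# Copy of the original data
cars <- datasets::mtcars

fit <- lm(mpg ~ am + wt + qsec, data = cars)

# Estimated mpg difference, manual vs automatic, holding wt and qsec fixed
coef(fit)[["am"]]

# 95% confidence interval for that difference
ci <- confint(fit, "am", level = 0.95)
ci
```

The am p-value of 0.04672 reported above is just under 0.05, so the lower limit of this interval can be expected to lie only slightly above zero.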

Appendix 1

par(mfrow=c(1, 2))
# Histogram with Normal Curve
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
   main="Histogram Of MPG")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d, xlab="MPG", main ="Density Of MPG")

plot of chunk show_appendix1

Appendix 2 : Residual Diagnostics Of Final Model

par(mfrow=c(2,2))
plot(best_fit)

plot of chunk show_appendix2
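
The diagnostic plots can be backed up with a few numbers. The sketch below is an assumption about which checks are useful and is not part of the original report: a Shapiro-Wilk test on the residuals and the leverage (hat) values of the final model. The names cars, fit, res_test and lev are illustrative.

```r
# Copy of the original data
cars <- datasets::mtcars

fit <- lm(mpg ~ am + wt + qsec, data = cars)

# Normality of the residuals (complements the Normal Q-Q panel)
res_test <- shapiro.test(residuals(fit))
res_test$p.value

# Leverage of each car (complements the Residuals vs Leverage panel)
lev <- hatvalues(fit)

# The three cars with the highest leverage
head(sort(lev, decreasing = TRUE), 3)
```

The hat values always sum to the number of estimated coefficients (four here), which gives a quick sanity check on the output.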

End Of Report

Cyrus Lentin

Saturday, November 15, 2014