There is growing interest in understanding the major factors that affect the performance of cars, especially fuel efficiency: the distance a car can be driven per unit of gasoline, measured in miles per gallon (mpg). All gas-powered cars share some characteristics, and those characteristics may be used to produce a reliable statistical estimate of mpg.
The mtcars dataset, included in the datasets package of the R programming language, records mpg and 10 other characteristics of different car models, with the associated value for each car. Two of the variables are mpg and am, which stands for transmission; there are only two kinds of transmission, manual or automatic.
The aim of this project is to determine whether, and by how much, manual transmission provides better mpg than automatic transmission. The results show that, on average, manual transmission provides about 2.6 extra miles per gallon once the weight, horsepower and number of carburetors of the car are taken into account.
The most important relationship for this project is the one between the variables mpg (miles per US gallon) and am (transmission, manual or automatic). Other factors can also influence the mpg of a car, so the following questions help to understand the relationship between a car's transmission and its mpg:
What are the average mpg values for manual and automatic transmissions? How can the difference be measured? What other characteristics of a car may influence the mpg?
To answer those questions, I first need to load the dataset and analyse it. For this purpose I will use the R language and RStudio to do the statistical analysis.
Loading the data and libraries needed
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.0 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.4-1 rmarkdown_0.6.1 knitr_1.10.5
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7
library(datasets)
library(ggplot2)
library(aod)
# Loading the dataset
data(mtcars)
First review of the dataset
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Some observations: to get an idea of the variables in the dataset, I checked their names and meanings. The dataset contains a total of 32 observations of 11 variables.
?mtcars
[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower
[, 5] drat Rear axle ratio [, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time [, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears [,11] carb Number of carburetors
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
After a short review of the variables in the dataset, I checked the correlation between each variable and mpg.
sort(cor(mtcars)[1,])
## wt cyl disp hp carb qsec
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251 0.4186840
## gear am vs drat mpg
## 0.4802848 0.5998324 0.6640389 0.6811719 1.0000000
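Because the sign of a correlation only gives its direction, it can be easier to compare variables by the absolute value of their correlation with mpg. The following is a minimal sketch of that ranking; the object names cors and ranked are my own and are not part of the original analysis.

```r
# Rank every other variable by the absolute value of its correlation with mpg.
data(mtcars)
cors <- cor(mtcars)["mpg", ]                  # correlations of each variable with mpg
cors <- cors[names(cors) != "mpg"]            # drop the trivial mpg-mpg entry
ranked <- sort(abs(cors), decreasing = TRUE)  # strongest association first
round(ranked, 3)
```

Ranked this way, wt comes first and qsec last, which matches the two ends of the sorted output above.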
It is easy to see that the variables whose coefficients are largest in absolute value are the most strongly correlated with mpg. Some of the observed variables are also related to each other. I use these criteria to eliminate some variables, for instance:
qsec is a measure of time, not related to mpg, and has the lowest correlation factor.
gear is a characteristic with a low correlation factor.
cyl and disp are mechanically related and highly correlated with each other, so one of them (cyl) can be eliminated.
vs has a fairly high correlation factor, but it describes the engine configuration rather than fuel consumption, so it can be eliminated.
Now I create a new dataset that contains only the variables I want to study.
# exclude "cyl" = 2nd, "qsec" = 7th, "vs" = 8th, and "gear" =10th variables
mtcars_01 <- mtcars[c(-2, -7, -8, -10)]
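Dropping columns by position works, but it silently breaks if the column order ever changes. As an alternative sketch, the same subset can be built by name; drop_vars and mtcars_by_name are illustrative names that do not appear elsewhere in this report.

```r
# Build the reduced dataset by column name instead of by position.
data(mtcars)
drop_vars <- c("cyl", "qsec", "vs", "gear")
mtcars_by_name <- mtcars[, !(names(mtcars) %in% drop_vars)]

# Compare with the index-based subset used in the report
all.equal(mtcars_by_name, mtcars[c(-2, -7, -8, -10)])
names(mtcars_by_name)
```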
In particular, I treat the gas mileage of cars with manual and automatic transmissions as two independent samples.
Subsetting mtcars_01 using am as the reference
auto <- subset(mtcars_01, mtcars_01$am==0)
manual <- subset(mtcars_01, mtcars_01$am==1)
summary(auto$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
summary(manual$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
Diff_auto_manual <- c(mean(auto$mpg), mean(manual$mpg))
Answer: As an initial result, I calculate the average mpg for both transmission types and plot the results.
Figure 01: Mean of Manual and Automatic Transmissions
In mtcars_01, the mean mileage of cars with automatic transmission is 17.147 mpg and that of cars with manual transmission is 24.392 mpg. To visualize the averages, separated by transmission type, I created a box plot; it indicates that automatic and manual cars form two separate groups with different mpg performance.
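The group means quoted above, along with the group sizes, can be reproduced in one step. This is only a sketch that works directly on mtcars; the names trans, group_means, group_counts and diff_means are assumptions of mine.

```r
# Mean mpg and number of cars per transmission type.
data(mtcars)
# Label the 0/1 transmission codes so the output is self-explanatory
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
group_means  <- tapply(mtcars$mpg, trans, mean)
group_counts <- table(trans)

round(group_means, 3)
group_counts

# Raw (unadjusted) difference in mean mpg, manual minus automatic
diff_means <- group_means[["Manual"]] - group_means[["Automatic"]]
diff_means
```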
These first results give some basic graphical information about the relationship between transmission type and mpg, but it is important to calculate the numeric values. To do that, I run a two-sample t-test to compare the mean mpg of cars with automatic and manual transmissions and to obtain a p-value.
mpg_auto_vs_manual <- t.test(mpg ~ am, data=mtcars_01)
mpg_auto_vs_manual
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Based on these initial results, I can conclude that:
In mtcars_01, the mean mileage of cars with automatic transmission is 17.147 mpg and that of cars with manual transmission is 24.392 mpg.
The 95% confidence interval for the difference in mean gas mileage (manual minus automatic) is 3.2097 to 11.2802 mpg.
The low p-value of 0.001374 indicates that the difference is statistically significant: the mean mpg of cars with manual transmission is higher than that of cars with automatic transmission.
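To make clear where the t statistic and the fractional degrees of freedom come from, the Welch test can be worked out by hand. The sketch below uses the standard Welch-Satterthwaite formulas on mtcars; all variable names are my own.

```r
# Welch two-sample t-test computed step by step.
data(mtcars)
auto_mpg   <- mtcars$mpg[mtcars$am == 0]
manual_mpg <- mtcars$mpg[mtcars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto_mpg) / length(auto_mpg)
se2_manual <- var(manual_mpg) / length(manual_mpg)

# t statistic (automatic minus manual, the same order t.test uses here)
t_stat <- (mean(auto_mpg) - mean(manual_mpg)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto_mpg) - 1) + se2_manual^2 / (length(manual_mpg) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p = p_value)
```

The three values should agree with the t, df and p-value lines of the t.test output.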
However, there is still a need to find the other factors that influence mpg, so I decided to start with a first linear model that uses only the am variable.
first_model <- lm(mpg~am, data = mtcars_01)
summary(first_model)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars_01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This model adds no information beyond the t-test about the relationship between transmission and mpg.
Interpreting the intercept and the coefficient: on average, cars with automatic transmission get 17.147 mpg, while cars with manual transmission get 7.245 mpg more.
Also, the R-squared value of 0.3598 means that this initial model explains 35.98% of the variance in mpg.
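With a single 0/1 predictor, the regression simply reproduces the two group means, and R-squared is one minus the ratio of residual to total sum of squares. A short sketch of both facts, with object names (fit_am, means, rss, tss, r_squared) chosen only for illustration:

```r
# Relate the am-only regression to the group means and to R-squared.
data(mtcars)
fit_am <- lm(mpg ~ am, data = mtcars)
means  <- tapply(mtcars$mpg, mtcars$am, mean)   # names are "0" and "1"

# Intercept = mean of automatic cars; slope = manual mean minus automatic mean
intercept <- unname(coef(fit_am)["(Intercept)"])
slope     <- unname(coef(fit_am)["am"])
c(intercept = intercept, slope = slope)

# R-squared from its definition
rss <- sum(residuals(fit_am)^2)                 # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r_squared <- 1 - rss / tss
r_squared
```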
To determine which predictors should go into the model, I create a correlation matrix for the mtcars_01 dataset and look at the mpg row.
## data(mtcars_01)
sort(cor(mtcars_01)[1,])
## wt disp hp carb am drat
## -0.8676594 -0.8475514 -0.7761684 -0.5509251 0.5998324 0.6811719
## mpg
## 1.0000000
In addition to am (which must be included in the regression model because it is the variable of interest), I observe that wt and disp are highly correlated with the dependent variable mpg, followed by hp, so they are candidates for the final model.
Because my first model explains only about 36% of the variance, I fit a second model, a multiple linear regression (MLR) of mpg on the variables am, wt, hp and carb. Note that disp is not used in this model.
# first_model <- lm(mpg~am, data = mtcars_01) #already done
second_model <- lm(mpg~am + wt + hp + carb, data = mtcars_01)
summary(second_model)
##
## Call:
## lm(formula = mpg ~ am + wt + hp + carb, data = mtcars_01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7015 -1.8535 -0.3801 1.2983 5.2348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.38232 2.73710 12.196 1.71e-12 ***
## am 2.64516 1.51223 1.749 0.09162 .
## wt -2.69255 0.93049 -2.894 0.00744 **
## hp -0.03063 0.01222 -2.506 0.01853 *
## carb -0.43023 0.47275 -0.910 0.37084
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.545 on 27 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8216
## F-statistic: 36.7 on 4 and 27 DF, p-value: 1.5e-10
This second model explains 84.47% of the variance, compared with 35.98% for the first model. It also shows that wt and hp are significant predictors of mpg alongside am, especially wt, whereas carb is not significant (p = 0.37).
It is also now possible to re-estimate the coefficient for am. With weight, horsepower and carburetors held constant, cars with manual transmission get, on average, 2.65 mpg more than cars with automatic transmission, although this estimate is no longer significant at the 5% level (p = 0.09).
Answer: To measure the difference in mpg between manual and automatic transmissions, I must take into account the other variables in the dataset. In this case, my two models allow me to interpret the data in a way that gives each variable its own weight.
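The uncertainty around the am coefficient in the second model can be expressed as a confidence interval. This is a minimal sketch that refits the same formula on mtcars so that it runs on its own; fit_multi and ci_am are names I introduce here.

```r
# 95% confidence interval for the transmission effect in the multivariable model.
data(mtcars)
fit_multi <- lm(mpg ~ am + wt + hp + carb, data = mtcars)

# confint() returns a one-row matrix with the lower and upper bounds
ci_am <- confint(fit_multi, "am", level = 0.95)
ci_am

# Does the interval contain zero?
contains_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
contains_zero
```

If the interval contains zero, the adjusted transmission effect cannot be distinguished from no effect at the 5% level, which is consistent with the p-value of about 0.09 in the summary output.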
Once the two linear models have been fitted, I run an analysis of variance (ANOVA) to test whether the additional predictors significantly improve the fit.
anova(first_model, second_model)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 27 174.93 3 545.97 28.09 1.866e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of 1.866e-08 allows me to reject the null hypothesis that the additional predictors (wt, hp and carb) do not improve on first_model, and suggests instead that the second (multivariate) model is the better choice.
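The F statistic in the ANOVA table is the drop in residual sum of squares per added predictor, divided by the residual variance of the larger model. The sketch below recomputes it from the two fits; fit_small, fit_large and the other names are illustrative only.

```r
# Nested-model F test computed from the residual sums of squares.
data(mtcars)
fit_small <- lm(mpg ~ am, data = mtcars)
fit_large <- lm(mpg ~ am + wt + hp + carb, data = mtcars)

rss_small <- deviance(fit_small)   # for lm objects, deviance() is the RSS
rss_large <- deviance(fit_large)
df_extra  <- df.residual(fit_small) - df.residual(fit_large)  # predictors added

f_stat <- ((rss_small - rss_large) / df_extra) /
  (rss_large / df.residual(fit_large))
p_value <- pf(f_stat, df_extra, df.residual(fit_large), lower.tail = FALSE)
c(F = f_stat, p = p_value)
```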
Answer: The best approach to discovering which other characteristics of a car (described as variables in the dataset) influence the average mpg is to fit and compare a set of linear regression models. For the aim of this project, however, I will show a plot summarizing the influence of the other variables.
Now that the second model seems to be the most appropriate, its residuals need to be checked to rule out any abnormality.
Figure 02: Residuals
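The diagnostic plots can be complemented with a couple of numeric checks. The sketch below is an optional addition and not part of the original analysis: it runs a Shapiro-Wilk normality test on the residuals and flags cars whose leverage exceeds twice the average, which is a common rule of thumb rather than a strict threshold.

```r
# Numeric residual checks for the multivariable model.
data(mtcars)
fit_multi <- lm(mpg ~ am + wt + hp + carb, data = mtcars)
res <- residuals(fit_multi)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
normality <- shapiro.test(res)
normality$p.value

# Leverage (hat values); flag points above twice the mean leverage
lev <- hatvalues(fit_multi)
high_leverage <- names(lev)[lev > 2 * mean(lev)]
high_leverage
```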
The two major factors when measuring the mpg of a car can be described as transmission (am) and weight (wt) in the dataset.
Transmission alone accounts for about 36% of the variance in mpg.
Weight, in combination with transmission, horsepower and number of carburetors, accounts for over 84% of the variance.
Cars with low weight and manual transmission get better mpg than heavy cars with automatic transmission. With the other variables held constant, the difference is around 2.6 extra miles per US gallon.
Figure 03: Automatic vs Manual MPG, using Weight as cofactor
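The number of cars with am=0 (19) is greater than the number with am=1 (13). According to the regression lines, the mpg payoff of am=1 is better if wt < 2.8 (about 2,800 lb); for heavier cars the fitted line for am=0 is higher. More data are still needed to confirm this and increase its reliability.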
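The weight at which the two fitted lines cross can be computed directly from the interaction model. In the sketch below (fit_int, b and crossover_wt are my own names), the gap between the manual and automatic lines at a given weight is b["am"] + b["wt:am"] * wt, so the lines cross where that expression equals zero; the sign of the interaction coefficient tells on which side of the crossover manual cars are predicted to do better.

```r
# Crossover weight of the automatic and manual regression lines.
data(mtcars)
fit_int <- lm(mpg ~ wt * am, data = mtcars)   # same model as wt + am + wt:am
b <- coef(fit_int)
b

# Manual-minus-automatic gap at weight w: b["am"] + b["wt:am"] * w
crossover_wt <- unname(-b["am"] / b["wt:am"])
crossover_wt   # in units of 1000 lb

# A negative interaction means the manual advantage shrinks as weight grows
manual_better_below <- unname(b["wt:am"]) < 0
manual_better_below
```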
Harold A. Cruz-Sanchez May 23 2015
The following are some of the code chunks used to generate the figures included in this report.
g1 <- ggplot(mtcars_01, aes(factor(am), mpg, fill=factor(am)))
g1 <- g1 + geom_boxplot()
g1 <- g1 + geom_jitter(position=position_jitter(width=.1, height=0))
g1 <- g1 + scale_colour_discrete(name = "Type")
g1 <- g1 + scale_fill_discrete(name="Type", breaks=c("0", "1"),
labels=c("Automatic", "Manual"))
g1 <- g1 + scale_x_discrete(breaks=c("0", "1"), labels=c("Automatic", "Manual"))
g1 <- g1 + xlab("")
g1
Figure 01: Mean of Manual and Automatic Transmissions
par(mfrow = c(2,2))
plot(second_model)
Figure 02: Residuals
fit_00 <- lm(data=mtcars_01, mpg~wt+am+wt*am)
plot(auto$wt, auto$mpg, col="lightblue", pch=20, cex=2, xlab="Weight", ylab="mpg")
points(manual$wt, manual$mpg, col="salmon", pch=20, cex=2)
## am=0;
abline(c(fit_00$coeff[1], fit_00$coeff[2]), col="lightblue", lwd=3, lty=2)
## am=1
abline(c(fit_00$coeff[1]+fit_00$coeff[3], fit_00$coeff[2]+fit_00$coeff[4]), col="salmon", lwd=3, lty=2)
# Legend order must match the plotted colours: automatic = lightblue, manual = salmon
legend("topright", pch=19, col=c("lightblue", "salmon"), legend=c("Automatic", "Manual"))
Figure 03: Automatic vs Manual MPG, using Weight as cofactor