Executive Summary

There is increasing interest in understanding the major factors that affect the performance of cars, especially fuel consumption, that is, the distance a car can be driven per unit of gasoline, measured in Miles Per Gallon (mpg). Some characteristics are shared by all gasoline-powered cars, and those characteristics may be used to produce reliable statistics about mpg.

The mtcars dataset, included in the datasets package of the {R} programming language, records 11 characteristics shared by different car models, with the associated value for each car. Two of those characteristics are mpg and am, which stands for transmission; there are only two kinds of transmission, manual or automatic.

The aim of this project is to determine whether manual transmission provides better mpg than automatic transmission, and by how much. The results show that, on average, manual transmission provides about 2.6 extra miles per gallon once the weight, horsepower and number of carburetors of the car are taken into account.

Data Analysis

The most important relationship for this project is the one between the variables mpg (miles per US gallon) and am (transmission, manual or automatic). Still, other factors can influence the mpg of a car, so the following questions may help in understanding the relationship between a car’s transmission and its mpg:

What are the averages of mpg provided by manual and automatic transmissions? How can the difference be measured? What other characteristics of a car may influence the mpg?

To answer those questions, I first need to load the dataset and analyse it. For this purpose I use the {R} language and RStudio to do the statistical analysis.

Loading the data and libraries needed

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.0     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.4-1   rmarkdown_0.6.1 knitr_1.10.5   
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7
library(datasets)
library(ggplot2)
library(aod)

# Loading the dataset

data(mtcars)

First review of the dataset

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Some observations: to get an idea of the variables in the dataset, I checked their names and meanings. The dataset contains a total of 32 observations on 11 variables.
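
The dimensions quoted above can be checked directly. The following is a minimal sketch; the variable names n_obs, n_vars and all_numeric are introduced here only for illustration.

```r
# Load the dataset shipped with base R
data(mtcars)

# Number of observations (rows) and variables (columns)
n_obs  <- nrow(mtcars)
n_vars <- ncol(mtcars)

# Every column is stored as a number, including the 0/1 flags vs and am
all_numeric <- all(sapply(mtcars, is.numeric))

# Compact overview of the column types and first values
str(mtcars)
```

Because am is stored as a number rather than a factor, functions such as lm() treat it as a 0/1 indicator.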

?mtcars
 [, 1]   mpg     Miles/(US) gallon
 [, 2]   cyl     Number of cylinders
 [, 3]   disp    Displacement (cu.in.)
 [, 4]   hp      Gross horsepower
 [, 5]   drat    Rear axle ratio
 [, 6]   wt      Weight (lb/1000)
 [, 7]   qsec    1/4 mile time
 [, 8]   vs      V/S
 [, 9]   am      Transmission (0 = automatic, 1 = manual)
 [,10]   gear    Number of forward gears
 [,11]   carb    Number of carburetors
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Data Cleaning

After a short review of the variables in the dataset, I checked the correlation between the mpg variable and all the other variables.

sort(cor(mtcars)[1,])
##         wt        cyl       disp         hp       carb       qsec 
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840 
##       gear         am         vs       drat        mpg 
##  0.4802848  0.5998324  0.6640389  0.6811719  1.0000000

Some of the observed variables are related to each other, and some are only weakly related to mpg. The variables with the largest absolute correlation coefficients are the ones most strongly correlated with mpg. I use this criterion, together with what each variable measures, to eliminate some variables:

  • qsec is a measure of time, not directly related to “mpg”, and has the lowest correlation coefficient (0.42)
  • gear has a low correlation coefficient (0.48)
  • cyl and disp are mechanically related and highly correlated with each other, so one of them (cyl) can be eliminated
  • vs has a fairly high correlation coefficient (0.66), but it describes the engine layout (V/S) rather than fuel consumption, so it can be eliminated
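
The screening criterion can be made explicit by ranking the variables by the absolute value of their correlation with mpg and by checking the correlation between cyl and disp. This is only a sketch; the names cor_mpg, ranked and cyl_disp are assumptions introduced for the example.

```r
data(mtcars)

# Correlation of every variable with mpg (first row of the matrix)
cor_mpg <- cor(mtcars)["mpg", ]

# Rank the predictors by absolute correlation, strongest first
ranked <- sort(abs(cor_mpg[names(cor_mpg) != "mpg"]), decreasing = TRUE)
print(round(ranked, 3))

# Correlation between the two mechanically related variables
cyl_disp <- cor(mtcars$cyl, mtcars$disp)
print(cyl_disp)
```

Ranking by absolute value avoids reading a strong negative correlation, such as the one for wt, as a weak one.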

Next I create a new dataset that contains only the variables I want to study.

# exclude "cyl" = 2nd, "qsec" = 7th, "vs" = 8th, and "gear" =10th variables
mtcars_01 <- mtcars[c(-2, -7, -8, -10)]
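
Dropping columns by position depends on the column order of the dataset. As an alternative sketch, the same subset can be built by column name; drop_cols and mtcars_01_by_name are hypothetical names used only in this example.

```r
data(mtcars)

# Columns to exclude, referenced by name instead of position
drop_cols <- c("cyl", "qsec", "vs", "gear")

# Keep every column whose name is not in drop_cols
mtcars_01_by_name <- mtcars[, !(names(mtcars) %in% drop_cols)]

# Compare with the positional version used in the report
same_result <- isTRUE(all.equal(mtcars_01_by_name,
                                mtcars[c(-2, -7, -8, -10)]))
print(names(mtcars_01_by_name))
```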

In particular, I treat the gas mileage of cars with manual and automatic transmissions as two independent samples.

Exploratory Data Analysis

Subsetting mtcars_01 using am as the reference

auto <- subset(mtcars_01, mtcars_01$am==0)
manual <- subset(mtcars_01, mtcars_01$am==1)

summary(auto$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40
summary(manual$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

Question 1: What are the averages of mpg provided by manual and automatic transmissions?

Diff_auto_manual <- c(mean(auto$mpg), mean(manual$mpg))

Answer: As an initial result, I calculate the average mpg for both transmissions and plot the results.

Figure 01: Mean of Manual and Automatic Transmissions

In mtcars_01, the mean mileage of cars with automatic transmission is 17.147 mpg and that of cars with manual transmission is 24.392 mpg. To visualize the mpg by transmission type, I created the boxplot in Figure 01, which indicates that automatic and manual cars form two separate groups with different mpg performance.

Start of Data Analysis

The first results give some basic graphical information about the relationship between transmission type and MPG, but it is important to calculate the numeric values as well. To do that, I run a basic t-test to compare the mean mpg of cars with automatic and manual transmissions and to obtain its p-value:

mpg_auto_vs_manual <- t.test(mpg ~ am, data=mtcars_01)
mpg_auto_vs_manual
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Based on these initial results, I can conclude that:

In mtcars_01, the mean mileage of cars with automatic transmission is 17.147 mpg and that of cars with manual transmission is 24.392 mpg.

The 95% confidence interval of the difference in mean gas mileage (manual minus automatic) is between 3.2097 and 11.2802 mpg.

The low p-value of 0.001374 indicates that the difference is statistically significant: the mean mpg for cars with “Manual Transmission” is higher than the mean mpg for cars with “Automatic Transmission”.
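
The numbers quoted from the t-test can be pulled out of the result object instead of being copied from the printout. The sketch below uses mtcars directly (mpg and am are unchanged in mtcars_01); the names tt, ci_manual_minus_auto and tt_greater are assumptions for illustration. It also runs a one-sided test, which is the form that directly addresses "manual is higher than automatic".

```r
data(mtcars)

# Welch two-sample t-test, automatic (am = 0) minus manual (am = 1)
tt <- t.test(mpg ~ am, data = mtcars)

# Flip the sign and order so the interval reads "manual minus automatic"
ci_manual_minus_auto <- as.numeric(rev(-tt$conf.int))
p_two_sided <- tt$p.value

# One-sided alternative: mean mpg of manual cars is greater
tt_greater <- t.test(mtcars$mpg[mtcars$am == 1],
                     mtcars$mpg[mtcars$am == 0],
                     alternative = "greater")

print(ci_manual_minus_auto)
print(c(two_sided = p_two_sided, one_sided = tt_greater$p.value))
```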

Still, other correlated factors may influence the MPG, so I first fit a simple linear model that uses only the am variable:

first_model <- lm(mpg~am, data = mtcars_01)
summary(first_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars_01)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

This simple model provides no extra information, beyond the t-test, to support a better understanding of the relationship between the transmission and the MPG.

Interpreting the intercept and the coefficient: on average, cars with automatic transmission get 17.147 MPG, while cars with manual transmission get 7.245 MPG more.
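
With a single 0/1 predictor, the intercept is the mean of the am = 0 group and the slope is the difference between the two group means. A short sketch that checks this, using mtcars directly and illustrative names (b, mean_auto, mean_manual):

```r
data(mtcars)

# Simple regression of mpg on the 0/1 transmission indicator
first_model <- lm(mpg ~ am, data = mtcars)
b <- coef(first_model)

# Group means computed independently of the model
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept vs. automatic mean, slope vs. difference of means
print(c(intercept = unname(b["(Intercept)"]), mean_auto = mean_auto))
print(c(slope = unname(b["am"]), diff = mean_manual - mean_auto))
```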

Also, the R2 value of 0.3598 means that this initial model explains only 35.98% of the variance in mpg.
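
The R2 value is one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch recomputes it by hand; rss, tss and r_squared are names chosen for the example.

```r
data(mtcars)
first_model <- lm(mpg ~ am, data = mtcars)

# Residual sum of squares: variation left unexplained by the model
rss <- sum(residuals(first_model)^2)

# Total sum of squares: variation of mpg around its overall mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Share of the variance in mpg explained by the model
r_squared <- 1 - rss / tss
print(r_squared)
```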

Correlation

To determine which predictors should go into our model, I can create a correlation matrix for the mtcars_01 dataset and look at the mpg variable.

## data(mtcars_01)
sort(cor(mtcars_01)[1,])
##         wt       disp         hp       carb         am       drat 
## -0.8676594 -0.8475514 -0.7761684 -0.5509251  0.5998324  0.6811719 
##        mpg 
##  1.0000000

In addition to am (which must be included in our regression model by design), I observe that the variables wt and disp are highly correlated with our dependent variable mpg, which makes both of them candidates for the final model.

Multivariate Linear Regression

Because my first model explains only about 36% of the variance in mpg, I fit a second model, a multivariate linear regression (MLR) of mpg on the variables am, wt, hp and carb.

# first_model <- lm(mpg~am, data = mtcars_01) #already done 
second_model <- lm(mpg~am + wt + hp + carb, data = mtcars_01)
summary(second_model)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp + carb, data = mtcars_01)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7015 -1.8535 -0.3801  1.2983  5.2348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.38232    2.73710  12.196 1.71e-12 ***
## am           2.64516    1.51223   1.749  0.09162 .  
## wt          -2.69255    0.93049  -2.894  0.00744 ** 
## hp          -0.03063    0.01222  -2.506  0.01853 *  
## carb        -0.43023    0.47275  -0.910  0.37084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.545 on 27 degrees of freedom
## Multiple R-squared:  0.8447, Adjusted R-squared:  0.8216 
## F-statistic:  36.7 on 4 and 27 DF,  p-value: 1.5e-10

This second model explains 84.47% of the variance, compared with 35.98% for the first model. It also shows that wt and hp are significant predictors of mpg, wt most strongly, while carb is not (p = 0.37).

It is also possible now to re-estimate the coefficient for am. With wt, hp and carb held fixed, cars with manual transmission provide, on average, 2.65 MPG extra compared to cars with automatic transmission, although this coefficient is significant only at the 10% level (p = 0.092).
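
The uncertainty around the adjusted am coefficient can be expressed as a confidence interval. This sketch refits the second model on mtcars (the four predictors are unchanged in mtcars_01) and uses confint() from the stats package; am_ci is an illustrative name.

```r
data(mtcars)

# Same specification as second_model in the report
second_model <- lm(mpg ~ am + wt + hp + carb, data = mtcars)

# 95% confidence interval for the adjusted manual-vs-automatic effect
am_ci <- confint(second_model, "am", level = 0.95)
print(am_ci)

# TRUE when the interval contains zero, i.e. the effect is not
# distinguishable from zero at the 5% level
includes_zero <- am_ci[1, 1] < 0 && am_ci[1, 2] > 0
print(includes_zero)
```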

Question 2: How can the difference be measured?

Answer: To measure the difference in MPG between manual and automatic transmissions, I must take the other variables in the dataset into account. My two models allow me to interpret the data in such a way that each variable has its own weight (coefficient).

Once the two linear models have been fitted, I run an analysis of variance (ANOVA) to test whether the additional variables significantly improve the fit:

anova(first_model, second_model)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp + carb
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1     30 720.90                                 
## 2     27 174.93  3    545.97 28.09 1.866e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of 1.866e-08 allows me to reject the null hypothesis that the extra terms (wt, hp and carb) add nothing to first_model, and suggests instead that the second (multivariate) model is the better choice.
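
The F statistic in the ANOVA table comes from the two residual sums of squares and their degrees of freedom. A sketch of the arithmetic follows, with illustrative names (rss1, rss2, df1, df2, f_stat, p_val) and the models refitted on mtcars.

```r
data(mtcars)
first_model  <- lm(mpg ~ am, data = mtcars)
second_model <- lm(mpg ~ am + wt + hp + carb, data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss1 <- deviance(first_model);  df1 <- df.residual(first_model)
rss2 <- deviance(second_model); df2 <- df.residual(second_model)

# Drop in RSS per added term, relative to the residual variance
# of the larger model
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```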

Question 3: What other characteristics of a car may influence the mpg?

Answer: To discover which other characteristics of a car (described as variables in the dataset) influence the average MPG, a series of linear models would have to be fitted and compared. For the aim of this project, I only show a plot summarizing the influence of the other variables.

Now that the second model seems to be the most appropriate, its residuals need to be checked to rule out any abnormality.
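
The diagnostic plots can be complemented with numeric checks. The sketch below is not part of the original analysis: it applies a Shapiro-Wilk normality test to the residuals and looks at Cook's distance, both available in the stats package. The names res, sw and cooks are assumptions.

```r
data(mtcars)
second_model <- lm(mpg ~ am + wt + hp + carb, data = mtcars)

# Residuals of the multivariate model, one per car
res <- residuals(second_model)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
print(sw)

# Cook's distance: large values flag influential observations
cooks <- cooks.distance(second_model)
print(head(sort(cooks, decreasing = TRUE), 3))
```

A test on 32 observations has limited power, so it should be read together with the plots rather than instead of them.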

Figure 02: Residuals

Conclusions

The two major factors when measuring the MPG of a car can be described as transmission (am) and weight (wt) in the dataset.

Transmission alone accounts for about 36% of the variance in mpg.

Transmission in combination with weight, horsepower and number of carburetors accounts for about 84% of the variance.

Light cars with manual transmission get better MPG than heavy cars with automatic transmission. With weight, horsepower and number of carburetors held fixed, the estimated difference is around 2.6 extra miles per US gallon for manual transmission.

Figure 03: Automatic vs Manual MPG, using Weight as cofactor

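The number of samples with am=0 (19) is greater than the number with am=1 (13). According to the regression lines, the predicted mpg of am=1 is better only if wt < 2.8 (about 2,800 lb), because the manual line starts higher but falls more steeply with weight. More data are still needed to confirm this and to increase the reliability of the estimates.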
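
The weight at which the two regression lines cross, and the group sizes, can be computed from the interaction model. This is a sketch with illustrative names (fit_int, crossover_wt, counts), fitted on mtcars.

```r
data(mtcars)

# Model with separate intercepts and slopes for each transmission type
fit_int <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(fit_int)

# Predicted manual-minus-automatic gap at weight wt:
#   b["am"] + b["wt:am"] * wt
# It is zero where the lines cross; the interaction term is negative,
# so the manual line lies above the automatic one below this weight
crossover_wt <- unname(-b["am"] / b["wt:am"])
print(crossover_wt)

# Number of cars per transmission type (0 = automatic, 1 = manual)
counts <- table(mtcars$am)
print(counts)
```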

Harold A. Cruz-Sanchez May 23 2015


Appendix

The following are some of the code chunks used to generate the figures included in this report.

g1 <- ggplot(mtcars_01, aes(factor(am), mpg, fill=factor(am)))
g1 <- g1 + geom_boxplot()
g1 <- g1 + geom_jitter(position=position_jitter(width=.1, height=0))
g1 <- g1 + scale_colour_discrete(name = "Type")
g1 <- g1 + scale_fill_discrete(name="Type", breaks=c("0", "1"),
                               labels=c("Automatic", "Manual"))
g1 <- g1 + scale_x_discrete(breaks=c("0", "1"), labels=c("Automatic", "Manual"))
g1 <- g1 + xlab("")
g1

Figure 01: Mean of Manual and Automatic Transmissions

par(mfrow = c(2,2))
plot(second_model)

Figure 02: Residuals

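# Linear model with a weight-by-transmission interaction
fit_00 <- lm(mpg ~ wt * am, data = mtcars_01)
# Use the full data range for the axes so that no manual cars are clipped
plot(auto$wt, auto$mpg, col="lightblue", pch=20, cex=2, xlab="Weight", ylab="mpg",
     xlim=range(mtcars_01$wt), ylim=range(mtcars_01$mpg))
points(manual$wt, manual$mpg, col="salmon", pch=20, cex=2)
## am=0: intercept and slope of the automatic line
abline(c(fit_00$coeff[1], fit_00$coeff[2]), col="lightblue", lwd=3, lty=2)
## am=1: intercept and slope of the manual line
abline(c(fit_00$coeff[1]+fit_00$coeff[3], fit_00$coeff[2]+fit_00$coeff[4]), col="salmon", lwd=3, lty=2)
# lightblue = automatic (am=0), salmon = manual (am=1)
legend("topright", pch=19, col=c("lightblue", "salmon"), legend=c("Automatic", "Manual"))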

Figure 03: Automatic vs Manual MPG, using Weight as cofactor