@Yiyang Zhao

I. Project Overview

This project analyzes the mtcars data set, a collection of 32 cars, and attempts to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. We aim to investigate the following two questions: 1. Is an automatic or manual transmission better for MPG? 2. How can the MPG difference between automatic and manual transmissions be quantified?

II. Executive Summary

In this project, we first load the mtcars data set that ships with R and obtain a brief summary of the relevant variables. Then, some exploratory data analyses, including scatter plots, box plots and hypothesis testing, are performed to understand the relationship between transmission type and MPG. Next, we fit two regression models to the data and analyze the results quantitatively.

1. Reading Data

# Take a look at the data.
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
## [, 1]  mpg   Miles/(US) gallon
## [, 2]  cyl   Number of cylinders
## [, 3]  disp  Displacement (cu.in.)
## [, 4]  hp    Gross horsepower
## [, 5]  drat  Rear axle ratio
## [, 6]  wt    Weight (1000 lbs)
## [, 7]  qsec  1/4 mile time
## [, 8]  vs    V/S
## [, 9]  am    Transmission (0 = automatic, 1 = manual)
## [,10]  gear  Number of forward gears
## [,11]  carb  Number of carburetors

2. Exploratory Data Analysis

# Briefly summarize the data
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
summary(mtcars$am)
## Automatic    Manual 
##        19        13

The summary shows that the mtcars data set contains 32 cars: 19 have an automatic transmission and 13 have a manual transmission. Their fuel efficiency, measured in miles per gallon, ranges from 10.40 to 33.90.

# Obtain scatter plots
# The more relevant variables (`mpg`, `cyl`, `hp`, `wt` and `am`) are extracted
pairs(mtcars[,c("mpg", "cyl", "hp", "wt", "am")], panel = panel.smooth, main = "mtcars data")

For this project, we focus on the panels that pair mpg with cyl, hp, wt and am. The scatter plots suggest that cyl, hp and wt are each related to fuel efficiency: mpg tends to fall as any of them increases. Therefore, as we analyze the relationship between mpg and am (transmission type), we should account for cyl, hp and wt as well.
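
To put numbers on what the scatter plots suggest, one option is to compute the Pearson correlation between mpg and each numeric predictor. This is a minimal sketch and not part of the original analysis; the names `cars`, `predictors` and `mpg_cor` are introduced here for illustration, and the block works on a fresh copy of mtcars so that it does not depend on the factor conversion above.

```r
# Work on a fresh copy of the built-in data set (hypothetical name `cars`).
cars <- datasets::mtcars

# Pearson correlation of mpg with each numeric predictor of interest.
predictors <- c("cyl", "hp", "wt")
mpg_cor <- sapply(predictors, function(v) cor(cars$mpg, cars[[v]]))

# Negative values mean mpg tends to fall as the predictor rises.
print(round(mpg_cor, 3))
```

Correlations only describe pairwise linear association, so they complement the plots rather than replace the regression models fitted later.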

# Obtain comparative boxplots
# Check the difference in mpg based on different types of transmission
boxplot(mpg ~ am, data = mtcars,
        main = "Comparison of MPG by Type of Transmission",
        xlab = "Type of Transmission",
        ylab = "MPG (miles/gallon)",
        ylim = c(10, 35))

From the box plots, we can infer that cars with a manual transmission may be more fuel efficient, since the median MPG for “Manual” is higher than that for “Automatic”.
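
The comparison read off the box plots can also be checked numerically. The sketch below, which is an illustration and not part of the original report, computes the median and the mean MPG for each transmission type; `cars`, `mpg_median` and `mpg_mean` are hypothetical names, and the factor labels mirror the ones assigned earlier.

```r
# Fresh copy of the data with the same factor labels used in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Median (the bold line inside each box) and mean MPG per transmission type.
mpg_median <- tapply(cars$mpg, cars$am, median)
mpg_mean <- tapply(cars$mpg, cars$am, mean)

print(rbind(median = mpg_median, mean = round(mpg_mean, 2)))
```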

# Hypothesis testing
# Use a one-sided Welch t-test to compare mean MPG between the two groups
# H_0: mu_manual - mu_auto = 0
# H_A: mu_manual - mu_auto > 0
hypo <- t.test(mtcars$mpg[mtcars$am == "Manual"], mtcars$mpg[mtcars$am == "Automatic"], alternative = "greater")
hypo
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg[mtcars$am == "Manual"] and mtcars$mpg[mtcars$am == "Automatic"]
## t = 3.7671, df = 18.332, p-value = 0.0006868
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  3.913256      Inf
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737
hypo$p.value
## [1] 0.0006868192

Since the p-value (0.000687) is below 0.05, the null hypothesis is rejected at the 0.05 significance level. Mean MPG differs significantly by transmission type; more specifically, manual cars have a higher mean MPG (24.39 versus 17.15 for automatic cars). Note that this comparison does not adjust for any other variable.
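
To make the test above less of a black box, the Welch statistic can be reproduced by hand. The sketch below is an illustration only; it assumes the standard Welch formulas (unpooled standard error and Welch-Satterthwaite degrees of freedom), and all variable names are introduced here. It uses the raw 0/1 coding of `am` on a fresh copy of the data.

```r
cars <- datasets::mtcars
manual <- cars$mpg[cars$am == 1]  # am == 1 is manual
auto   <- cars$mpg[cars$am == 0]  # am == 0 is automatic

# Squared standard error contributed by each group.
v_manual <- var(manual) / length(manual)
v_auto   <- var(auto) / length(auto)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat <- (mean(manual) - mean(auto)) / sqrt(v_manual + v_auto)
df_welch <- (v_manual + v_auto)^2 /
  (v_manual^2 / (length(manual) - 1) + v_auto^2 / (length(auto) - 1))

# One-sided p-value for H_A: mu_manual - mu_auto > 0.
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)
print(c(t = t_stat, df = df_welch, p = p_value))
```

The three printed values should agree with the `t`, `df` and `p-value` reported by `t.test()` above.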

3. Regression Analysis

# Model selection
# Initial attempt
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

From the summary, the coefficient of am is 7.245, meaning that the mean mpg of manual cars is 7.245 higher than that of automatic cars (17.147). However, the R-squared value of 0.3598 is low: transmission type alone explains only about 36% of the variance in mpg.
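
The interpretation of the coefficient can be verified directly: with a single two-level factor as predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means. The following sketch is illustrative (`cars`, `fit_am` and `group_means` are names introduced here) and also prints a 95% confidence interval for the unadjusted difference.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_am <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = automatic mean; amManual = manual mean minus automatic mean.
print(coef(fit_am))
print(group_means)

# 95% confidence interval for the unadjusted manual-vs-automatic difference.
print(confint(fit_am, "amManual"))
```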

# Adjust for possible confounding variables (cyl, hp, wt) in a second model
fit2 <- lm(mpg ~ am + cyl + hp + wt, mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## amManual     1.47805    1.44115   1.026   0.3142    
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10

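From the summary, the coefficient of am is 1.478. It suggests that, holding cyl, hp and wt constant, the mean mpg of manual cars is 1.478 higher than that of automatic cars, although this estimate is no longer statistically significant (p = 0.3142). The R-squared value is 0.849, so the model explains 84.9% of the variance in mpg.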
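
Because the first model is nested in the second, the two can also be compared formally with an F-test. This is a sketch of one possible check and not a step taken in the original analysis; `fit_small`, `fit_large` and `model_cmp` are hypothetical names that refit the two models on a fresh copy of the data.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# F-test of H_0: the coefficients of cyl, hp and wt are all zero.
model_cmp <- anova(fit_small, fit_large)
print(model_cmp)
```

A small p-value in the second row would indicate that adding cyl, hp and wt reduces the residual sum of squares by more than chance alone would explain.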

4. Residual Plot

par(mfrow = c(2,2))
plot(fit1)

plot(fit2)

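In the residuals-versus-fitted plot of the first model, the points fall into two vertical bands at the two ends of the horizontal axis, because a model with am as its only predictor produces just two fitted values. The wide spread within each band indicates that the model leaves much of the variation in mpg unexplained.
In the second model, the residuals are spread more evenly above and below the zero line across the range of fitted values. Therefore, the second model is the better fit.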
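
The clustering of the first model's residuals at two positions on the horizontal axis can be traced to its fitted values. The sketch below is illustrative only (`cars` and `fit_am` are names introduced here): it lists the distinct fitted values of the am-only model and the residual standard deviation within each transmission group.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_am <- lm(mpg ~ am, data = cars)

# With a single two-level factor, each car's fitted value is its group mean.
print(unique(round(fitted(fit_am), 3)))

# Spread of the residuals within each transmission group.
print(tapply(residuals(fit_am), cars$am, sd))
```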

III. Conclusion

Without adjusting for other variables, manual transmission is better for MPG: manual cars average 7.245 more miles per gallon than automatic cars, and the p-value of 0.000687 indicates that this difference is statistically significant. However, the linear model that uses only am is a poor fit, as seen in the patterned residual plot and the low R-squared value (0.3598). When cyl (number of cylinders), hp (gross horsepower) and wt (weight) are included in the second model, the residuals are evenly spread and the R-squared value (0.849) is much higher, so the second linear model is the better fit. In that model, the estimated advantage of manual transmission shrinks to 1.478 MPG and is no longer statistically significant (p = 0.3142). In conclusion, the MPG difference is only partially attributable to transmission type; the influence of confounding variables such as weight is also substantial. With only 32 cars the estimates are imprecise, and more data would allow firmer inferences.