@Yiyang Zhao
This project analyzes the mtcars data set, a collection of cars, and attempts to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. We aim to investigate the following two questions: 1. Is an automatic or manual transmission better for MPG? 2. How can the MPG difference between automatic and manual transmissions be quantified?
We first load the mtcars data that ships with R and obtain a brief summary of the relevant variables. Then some exploratory data analyses, including scatter plots, box plots and a hypothesis test, are performed to understand the relationship between transmission type and MPG. Next, we fit several regression models to the data and analyze the results quantitatively.
# Take a look at the data.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## [, 1] mpg Miles/(US) gallon
## [, 2] cyl Number of cylinders
## [, 3] disp Displacement (cu.in.)
## [, 4] hp Gross horsepower
## [, 5] drat Rear axle ratio
## [, 6] wt Weight (1000 lbs)
## [, 7] qsec 1/4 mile time
## [, 8] vs V/S
## [, 9] am Transmission (0 = automatic, 1 = manual)
## [,10] gear Number of forward gears
## [,11] carb Number of carburetors
# Briefly summarize the data
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
mtcars$am = as.factor(mtcars$am)
levels(mtcars$am) = c("Automatic", "Manual")
summary(mtcars$am)
## Automatic Manual
## 19 13
From the data summary, we see that the mtcars data set contains 32 cars. 19 of them have an automatic transmission and 13 have a manual transmission; their fuel efficiency, measured in miles per gallon, ranges from 10.40 to 33.90.
# Obtain scatter plots
# The more relevant variables `mpg`, `cyl`, `hp`, `wt` and `am` are extracted
pairs(mtcars[,c("mpg", "cyl", "hp", "wt", "am")], panel = panel.smooth, main = "mtcars data")
For this project, we focus on the first column, which shows how mpg responds to changes in cyl, hp, wt and am respectively. From the scatter plots, cyl, hp and wt all appear to be related to fuel efficiency: mpg tends to fall as each of them increases. Therefore, as we analyze the relationship between mpg and am (transmission type), we may need to account for cyl, hp and wt as well.
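The visual impression from the scatter plots can be backed by a quick numeric check. The following is a minimal sketch, not part of the original analysis: it computes the Pearson correlation of each candidate predictor with mpg on a fresh copy of the data, where am is still coded 0/1. The names `cars` and `cor_with_mpg` are introduced here for illustration only.

```r
# Work on a fresh copy so the factor conversion done elsewhere is not required.
cars <- datasets::mtcars

# Pearson correlation of each candidate predictor with mpg.
predictors <- c("cyl", "hp", "wt", "am")
cor_with_mpg <- sapply(cars[, predictors], function(x) cor(x, cars$mpg))

round(cor_with_mpg, 3)
```

A strongly negative value for cyl, hp or wt would be consistent with the downward trends in the first column of the scatter plot matrix.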
# Obtain comparative boxplots
# Check the difference in mpg based on different types of transmission
boxplot(mpg ~ am, data = mtcars, main = "Comparison of MPG by type of Transmission",
xlab = "Type of Transmission",
ylab = "MPG (miles/gallon)",
ylim = c(10, 35)
)
From the plots, we can infer that cars with a manual transmission may be more fuel efficient, since the median MPG for “Manual” (the thick line in the box) is higher.
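The numbers behind the boxplot can be computed directly. This is a small sketch under the assumption that a per-group summary is wanted; `cars`, `group_means` and `group_medians` are illustrative names that do not appear in the original code.

```r
# Fresh copy of the data with am recoded as a labelled factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean and median mpg within each transmission group.
group_means   <- tapply(cars$mpg, cars$am, mean)
group_medians <- tapply(cars$mpg, cars$am, median)

rbind(mean = group_means, median = group_medians)
```

The boxplot displays the medians, while the t-test and the regression models in this report compare the means.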
# Hypothesis testing
# Use a t-test to check their relationship
# H_0: mu_manual - mu_auto = 0
# H_A: mu_manual - mu_auto > 0
hypo <- t.test(mtcars$mpg[mtcars$am == "Manual"], mtcars$mpg[mtcars$am == "Automatic"], alternative = "greater")
hypo
##
## Welch Two Sample t-test
##
## data: mtcars$mpg[mtcars$am == "Manual"] and mtcars$mpg[mtcars$am == "Automatic"]
## t = 3.7671, df = 18.332, p-value = 0.0006868
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 3.913256 Inf
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
hypo$p.value
## [1] 0.0006868192
Since the p-value (0.000687) is below 0.05, the null hypothesis is rejected at the 0.05 significance level. There is a statistically significant difference in mean MPG between the two transmission types; more specifically, manual cars have a better MPG on average.
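To make the test statistic less of a black box, the Welch t-test can be reproduced by hand. The sketch below is illustrative rather than part of the original analysis: it uses a fresh copy of the data with am coded 0/1, and the variable names are chosen here for readability.

```r
cars   <- datasets::mtcars
manual <- cars$mpg[cars$am == 1]
auto   <- cars$mpg[cars$am == 0]

# Variance of each sample mean; Welch's test does not pool the two variances.
v_manual <- var(manual) / length(manual)
v_auto   <- var(auto) / length(auto)

# Test statistic: difference in means divided by its standard error.
t_stat <- (mean(manual) - mean(auto)) / sqrt(v_manual + v_auto)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_manual + v_auto)^2 /
  (v_manual^2 / (length(manual) - 1) + v_auto^2 / (length(auto) - 1))

# One-sided p-value for H_A: mu_manual - mu_auto > 0.
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p = p_value)
```

These three values should agree with the `t`, `df` and `p-value` fields printed by `t.test()` with `alternative = "greater"`.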
# Model selection
# Initial attempt
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
From the summary, the coefficient of am is 7.245, meaning that the mean mpg for manual cars is 7.245 higher than for automatic cars. However, the R-squared value of 0.3598 is low: the model explains only about 36% of the variance in mpg.
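With a single two-level factor as predictor, the regression coefficients are just a re-expression of the group means. The following sketch, which uses illustrative names and a fresh copy of the data, checks this: the intercept equals the mean mpg of the reference level (Automatic) and the `amManual` coefficient equals the difference between the two group means.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept versus the mean of the reference group.
c(intercept = unname(coef(fit)["(Intercept)"]),
  mean_automatic = unname(group_means["Automatic"]))

# Slope versus the difference in group means.
c(slope = unname(coef(fit)["amManual"]),
  mean_difference = unname(group_means["Manual"] - group_means["Automatic"]))
```

This is why the model cannot explain any variation in mpg within a transmission group, which limits its R-squared.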
# Finding the confounding variables using other models
fit2 <- lm(mpg ~ am + cyl + hp + wt, mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4765 -1.8471 -0.5544 1.2758 5.6608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654 3.10478 11.642 4.94e-12 ***
## amManual 1.47805 1.44115 1.026 0.3142
## cyl -0.74516 0.58279 -1.279 0.2119
## hp -0.02495 0.01365 -1.828 0.0786 .
## wt -2.60648 0.91984 -2.834 0.0086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
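From the summary, the coefficient of am is 1.478. It suggests that, holding cyl, hp and wt constant, the mean mpg for manual cars is about 1.48 higher than for automatic cars, although this difference is not statistically significant in this model (p = 0.314). The R-squared value is 0.849, so the model explains about 84.9% of the variance in mpg.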
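Because the first model is nested in the second, one possible way to compare them formally is an F-test via `anova()`, together with a confidence interval for the adjusted transmission effect. This is a sketch of an additional check that the original analysis does not perform; `fit_small`, `fit_large`, `comparison` and `am_ci` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# F-test of whether adding cyl, hp and wt reduces the residual sum of squares
# by more than chance alone would explain.
comparison <- anova(fit_small, fit_large)
comparison

# 95% confidence interval for the manual-vs-automatic difference after adjustment.
am_ci <- confint(fit_large, "amManual", level = 0.95)
am_ci
```

A small p-value in the `anova()` table supports keeping the extra predictors, and an interval for `amManual` that contains zero matches the non-significant coefficient in the summary.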
par(mfrow = c(2,2))
plot(fit1)
plot(fit2)
In the Residuals vs Fitted plot of the first model, the data points accumulate at the two ends of the horizontal axis. This clear pattern suggests a poor model fit.
In the second, modified model, the residuals are spread more evenly above and below the zero line. Therefore, the second model is better.
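The clustering in the first residual plot has a simple cause: a model with one two-level factor can only produce two distinct fitted values, one per group. The sketch below, with illustrative names and a fresh copy of the data, counts the distinct fitted values of each model and draws only the Residuals vs Fitted panel (`which = 1`) for a side-by-side comparison.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# Number of distinct fitted values produced by each model.
n_fitted_small <- length(unique(round(fitted(fit_small), 6)))
n_fitted_large <- length(unique(round(fitted(fit_large), 6)))
c(small = n_fitted_small, large = n_fitted_large)

# Residuals vs Fitted for both models, side by side.
old_par <- par(mfrow = c(1, 2))
plot(fit_small, which = 1, main = "mpg ~ am")
plot(fit_large, which = 1, main = "mpg ~ am + cyl + hp + wt")
par(old_par)
```

Comparing the residual standard errors of the two models (`sigma(fit_small)` and `sigma(fit_large)`) gives a numeric counterpart to the visual comparison.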
In general, manual transmission is better for MPG. Quantitatively, the p-value of 0.000687 from the t-test indicates a statistically significant MPG difference between automatic and manual transmissions, with manual cars averaging 7.245 mpg more. However, the linear model that takes only am into consideration is a poor fit, as seen in the patterned residual plot and the low R-squared value (0.3598). When other variables such as cyl (number of cylinders), hp (gross horsepower) and wt (weight) are accounted for in the second model, the residual plot is more even and the R-squared value (0.849) is much higher. Hence, the second linear model is a better fit. In that model the estimated advantage of manual transmission shrinks to about 1.48 mpg and is not statistically significant (p = 0.314). In conclusion, the MPG differences may be partially due to transmission type, but the influence of the confounding variables is also substantial. The picture would be clearer if more data were available, so that more inferences could be made about the specifics.