In this project, the mtcars dataset is analyzed to explore the relationship between a set of variables and miles per gallon (MPG). The dataset reports fuel consumption and 10 design and performance parameters for 32 automobiles. Regression models and supporting analyses are used to answer two main questions:
- “Is an automatic or manual transmission better for MPG?”
- “Quantify the MPG difference between automatic and manual transmissions”
A t-test shows a significant difference in MPG between the two types of transmission. Several linear models were then fitted, guided by the correlation between MPG and each parameter, and the one with the highest adjusted R-squared value and acceptable p-values for its variables was chosen. The packages ggplot2, datasets and gridExtra are needed.
First we loaded and processed the dataset mtcars as below:
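library(ggplot2)    # scatter plots in Appendix 2
library(gridExtra)  # grid.arrange() in Appendix 2
data(mtcars)
# Treat the discrete variables as factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
# am is coded 0 = automatic, 1 = manual
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))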
The boxplot in Appendix 1 gives a first view of the impact of the type of transmission on MPG.
A t-test can be run to get additional information:
automatic_data <- mtcars[mtcars$am == "Automatic",]
manual_data <- mtcars[mtcars$am == "Manual",]
t.test(automatic_data$mpg, manual_data$mpg)
##
## Welch Two Sample t-test
##
## data: automatic_data$mpg and manual_data$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The p-value of the t-test is 0.001374, well below 0.05, so we reject the null hypothesis that the mean MPG is the same for both types of transmission. The sample means are 17.15 MPG for automatic and 24.39 MPG for manual transmission, a difference of about 7.25 MPG in favor of manual cars (95% confidence interval: 3.21 to 11.28), which agrees with the gap visible in the boxplot.
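As a minimal sketch, the same comparison can be run through the formula interface of t.test, with the estimate and confidence interval pulled out of the result. The names cars, tt, mean_diff and ci are illustrative and not part of the original analysis; the raw 0/1 coding of am is used so that the block is self-contained.

```r
# Self-contained sketch: Welch t-test of mpg by transmission
# (am: 0 = automatic, 1 = manual in the raw dataset)
cars <- datasets::mtcars

tt <- t.test(mpg ~ am, data = cars)            # Welch test is the default (var.equal = FALSE)
group_means <- unname(tt$estimate)             # means for am = 0 and am = 1, in that order
mean_diff <- group_means[2] - group_means[1]   # manual minus automatic
ci <- -rev(as.numeric(tt$conf.int))            # flip the interval to match manual minus automatic

print(round(c(difference = mean_diff, lower = ci[1], upper = ci[2], p = tt$p.value), 4))
```

Reporting the interval as manual minus automatic keeps the sign consistent with the amManual coefficient of the regression models.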
To examine the relationship between MPG and car weight, displacement, number of carburetors, 1/4 mile time, rear axle ratio and horsepower, scatter plots have been drawn for automatic and manual transmissions in Appendix 2.
MPG appears to be negatively correlated with weight, displacement, number of carburetors and horsepower, and positively correlated with 1/4 mile time and rear axle ratio.
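The direction of these relationships can be checked numerically. The sketch below computes the Pearson correlation between mpg and each plotted variable on the raw dataset, where every column is numeric. Treating the carburetor count as a numeric variable is an assumption made for this illustration, and the names cars, vars and r are not part of the original analysis.

```r
# Self-contained sketch: Pearson correlation of mpg with each plotted variable
cars <- datasets::mtcars                        # raw data, all columns numeric
vars <- c("wt", "disp", "carb", "hp", "qsec", "drat")

# One correlation per variable; sapply keeps the variable names on the result
r <- sapply(vars, function(v) cor(cars$mpg, cars[[v]]))

print(round(sort(r), 2))                        # most negative first
```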
model_1 <- lm(mpg~am, data = mtcars)
summary(model_1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
In this model, the intercept (17.147) is the mean MPG of automatic cars and the amManual coefficient (7.245) is the additional MPG of manual cars, which gives the mean of 24.39 already found for manual cars. However, the multiple R-squared value only reaches 0.3598, which means that this model explains only 35.98% of the variance in MPG.
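The link between the coefficients and the group means can be verified directly. This is a small sketch under the same factor coding as above; cars, fit, b and group_means are illustrative names.

```r
# Self-contained sketch: with a single two-level factor, the regression
# coefficients reproduce the group means
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)
b <- coef(fit)

group_means <- sapply(split(cars$mpg, cars$am), mean)              # mean mpg per transmission
automatic_from_model <- unname(b["(Intercept)"])                   # baseline level
manual_from_model <- unname(b["(Intercept)"] + b["amManual"])      # baseline + difference

print(c(automatic = automatic_from_model, manual = manual_from_model))
print(group_means)
```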
To improve on this fit, and since mpg is strongly negatively correlated with cyl, disp, hp and wt, these variables are considered in turn to understand which ones contribute to better fuel efficiency. Several regression models have been tested (see Appendix 3). Adding disp does not improve the fit (the adjusted R-squared value drops from 0.8361 to 0.8305), and the model with both disp and am (0.8344) still falls short of the model without them:
fit_final <- lm(mpg ~ wt+hp+cyl, data=mtcars)
In this model, wt and cyl6 are significant at the 0.05 level, hp is borderline (p = 0.064) and cyl8 is not significant (p = 0.154). The multiple R-squared value is 0.8572, which means that this model explains 85.72% of the variance. Since we have two models of the same data, we run an ANOVA to compare them and check whether they are significantly different.
anova(model_1, fit_final)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + hp + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 27 160.78 3 560.12 31.354 6.048e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
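As the p-value is equal to 6.048e-09, we reject the null hypothesis that both models fit equally well. The multiple regression model is thus significantly different from the first simple model. Strictly speaking, this F-test assumes nested models, which mpg ~ am and mpg ~ wt + hp + cyl are not, so the result is best read as indicative.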
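A nested comparison avoids that caveat. The sketch below adds am to the formula of fit_final and tests whether the extra term reduces the residual sum of squares. It also lists the adjusted R-squared of each model, which can serve as a rough ranking even for non-nested models. The model named full is a hypothetical extension introduced here for illustration, and the other names are illustrative as well.

```r
# Self-contained sketch: nested F-test and adjusted R-squared comparison
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

reduced <- lm(mpg ~ wt + hp + cyl, data = cars)        # same formula as fit_final
full <- lm(mpg ~ wt + hp + cyl + am, data = cars)      # hypothetical extension adding am

# Nested comparison: does adding am significantly reduce the residual sum of squares?
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalizes extra terms, so it gives a rough ranking of all three models
adj_r2 <- c(am_only = summary(lm(mpg ~ am, data = cars))$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared,
            full = summary(full)$adj.r.squared)
print(round(adj_r2, 4))
```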
Finally, the residuals have been checked for normality, and the plot of residuals vs. fitted values has been examined for patterns.
Appendix 4 shows the 4 diagnostic plots of the final model. No pattern such as heteroskedasticity is observed. The residuals seem to be normally distributed and no point appears to have substantial influence on the regression model.
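The visual checks can be complemented with a few numeric ones. The sketch below is one possible way to do this and is not part of the original analysis: it applies a Shapiro-Wilk test to the residuals and lists the cars with the largest Cook's distance and leverage. The names cars, fit, sw, cook and lev are illustrative.

```r
# Self-contained sketch: numeric companions to the four diagnostic plots
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ wt + hp + cyl, data = cars)       # same formula as fit_final

sw <- shapiro.test(residuals(fit))                # null hypothesis: residuals are normal
cook <- cooks.distance(fit)                       # influence of each car on the fit
lev <- hatvalues(fit)                             # leverage of each car

print(sw)
print(round(head(sort(cook, decreasing = TRUE), 3), 3))   # three most influential cars
print(round(head(sort(lev, decreasing = TRUE), 3), 3))    # three highest-leverage cars
```

A large Shapiro-Wilk p-value is compatible with normal residuals but does not prove it, so these numbers are best read alongside the plots rather than instead of them.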
The analysis shows that manual transmission is associated with a higher MPG on average (about 7.2 MPG more than automatic), but the type of transmission does not necessarily have an impact of its own: once weight, horsepower, displacement and number of cylinders are accounted for (model_7), the estimated difference shrinks to about 1.8 MPG and is no longer significant (p = 0.2155). Other factors, such as a lower weight and fewer cylinders, contribute more significantly to fuel efficiency. We can either include or not include the type of transmission in the final model (model_8 and fit_final can be compared, but it may be hard to reach a clear final decision). To improve the study, more data on different cars should be collected to decide whether the type of transmission should be considered. Analyses should be done on cars with similar characteristics, such as similar weight, the same number of cylinders and the same horsepower. Only then will we be able to conclude which type of transmission brings better fuel efficiency.
# Appendix 1: boxplot of MPG by type of transmission
boxplot(mpg ~ am, data = mtcars, xlab = "Transmission", ylab = "MPG", main = "MPG vs. Transmission")
# Appendix 2: MPG against each variable, colored by type of transmission
# (am already carries the labels "Automatic" and "Manual")
g1 <- ggplot(mtcars, aes(x = wt, y = mpg, color = am)) +
  geom_point() +
  xlab("Weight") + ylab("MPG") +
  ggtitle("MPG vs. weight") + theme(legend.title = element_blank())
g2 <- ggplot(mtcars, aes(x = disp, y = mpg, color = am)) +
  geom_point() +
  xlab("Displacement") + ylab("MPG") +
  ggtitle("MPG vs. displacement") + theme(legend.title = element_blank())
g3 <- ggplot(mtcars, aes(x = carb, y = mpg, color = am)) +
  geom_point() +
  xlab("Number of carburetors") + ylab("MPG") +
  ggtitle("MPG vs. number of carburetors") + theme(legend.title = element_blank())
g4 <- ggplot(mtcars, aes(x = qsec, y = mpg, color = am)) +
  geom_point() +
  xlab("1/4 mile time") + ylab("MPG") +
  ggtitle("MPG vs. 1/4 mile time") + theme(legend.title = element_blank())
g5 <- ggplot(mtcars, aes(x = drat, y = mpg, color = am)) +
  geom_point() +
  xlab("Rear axle ratio") + ylab("MPG") +
  ggtitle("MPG vs. rear axle ratio") + theme(legend.title = element_blank())
g6 <- ggplot(mtcars, aes(x = hp, y = mpg, color = am)) +
  geom_point() +
  xlab("Horsepower") + ylab("MPG") +
  ggtitle("MPG vs. horsepower") + theme(legend.title = element_blank())
# Arrange the six plots in a 3 x 2 grid
grid.arrange(g1, g2, g3, g4, g5, g6, ncol = 2, nrow = 3)
# Appendix 3: candidate regression models
model_2 <- lm(mpg ~ wt, data=mtcars)
summary(model_2)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
model_3 <- lm(mpg ~ wt+hp, data=mtcars)
summary(model_3)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
model_4 <- lm(mpg ~ wt+hp+disp, data=mtcars)
summary(model_4)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## wt -3.800891 1.066191 -3.565 0.00133 **
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
model_5 <- lm(mpg ~ wt+hp+cyl, data=mtcars)
summary(model_5)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2612 -1.0320 -0.3210 0.9281 5.3947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.84600 2.04102 17.563 2.67e-16 ***
## wt -3.18140 0.71960 -4.421 0.000144 ***
## hp -0.02312 0.01195 -1.934 0.063613 .
## cyl6 -3.35902 1.40167 -2.396 0.023747 *
## cyl8 -3.18588 2.17048 -1.468 0.153705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.44 on 27 degrees of freedom
## Multiple R-squared: 0.8572, Adjusted R-squared: 0.8361
## F-statistic: 40.53 on 4 and 27 DF, p-value: 4.869e-11
model_6 <- lm(mpg ~ wt+hp+disp+cyl, data=mtcars)
summary(model_6)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2740 -1.0349 -0.3831 0.9810 5.4192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.002405 2.130726 16.897 1.54e-15 ***
## wt -3.428626 1.055455 -3.248 0.00319 **
## hp -0.023517 0.012216 -1.925 0.06523 .
## disp 0.004199 0.012917 0.325 0.74774
## cyl6 -3.466011 1.462979 -2.369 0.02554 *
## cyl8 -3.753227 2.813996 -1.334 0.19385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.482 on 26 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.8305
## F-statistic: 31.37 on 5 and 26 DF, p-value: 3.18e-10
model_7 <- lm(mpg ~ wt+hp+disp+cyl+am, data=mtcars)
summary(model_7)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp + cyl + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9374 -1.3347 -0.3903 1.1910 5.0757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.864276 2.695416 12.564 2.67e-12 ***
## wt -2.738695 1.175978 -2.329 0.0282 *
## hp -0.032480 0.013983 -2.323 0.0286 *
## disp 0.004088 0.012767 0.320 0.7515
## cyl6 -3.136067 1.469090 -2.135 0.0428 *
## cyl8 -2.717781 2.898149 -0.938 0.3573
## amManual 1.806099 1.421079 1.271 0.2155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8344
## F-statistic: 27.03 on 6 and 25 DF, p-value: 8.861e-10
# Appendix 4: the 4 diagnostic plots of the final model, in a 2 x 2 layout
par(mfrow = c(2,2))
plot(fit_final)