The goal of this project (the course project for the Regression Models course on Coursera) is to analyse the Motor Trend Car Road Tests dataset included in the R datasets package, exploring the relationship between a set of variables and the outcome, miles per gallon (MPG).
We are particularly interested in finding out which transmission type (AM), automatic or manual, is better for MPG, and in quantifying the MPG difference between the two.
We use a t-test to determine whether mean MPG differs between the two transmission groups, and we find a significant difference. The mean MPG for manual transmission cars, 24.4, is larger than the mean for automatic transmission cars, 17.1.
We carry out the analysis with 3 linear regression models: one simple univariable model, one multivariable model with all the variables in the dataset, and one multivariable model with variables selected using the Akaike Information Criterion (AIC).
We conclude that the multivariable model with selected variables is the best of the 3 because it explains about 85% of the variance of MPG with only 3 predictors: WT (weight), QSEC (1/4 mile time) and AM. In that model, manual transmission is associated with about 2.9 more MPG than automatic transmission, holding weight and 1/4 mile time constant.
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG?
What is the MPG difference between automatic and manual transmissions?
We work with the Motor Trend Car Road Tests included in The R Datasets Package.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
We load data and transform AM into a factor variable with levels “Automatic” and “Manual”.
# Load Library
library(ggplot2)
# Load Data
data(mtcars)
# Convert am to a factor and relabel its levels (0 = Automatic, 1 = Manual)
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs        am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0    Manual    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0    Manual    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1    Manual    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 Automatic    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0 Automatic    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1 Automatic    3
##                   carb
## Mazda RX4            4
## Mazda RX4 Wag        4
## Datsun 710           1
## Hornet 4 Drive       1
## Hornet Sportabout    2
## Valiant              1
We draw a boxplot of MPG by transmission type (AM).
# Boxplot of MPG by transmission type (ggplot() replaces the deprecated qplot())
g <- ggplot(mtcars, aes(x = am, y = mpg, color = am)) + geom_boxplot()
g + ggtitle("Miles Per Gallon by Transmission Type")
In general, manual transmission cars show higher MPG values than automatic transmission cars.
We use a t-test (Welch's two-sample test, the default of t.test) to determine whether mean MPG differs between the two transmission groups.
We assume the test statistic follows a Student's t-distribution under the null hypothesis of equal group means.
# Inference
result <- t.test(mpg ~ factor(am), data=mtcars)
result$p.value
## [1] 0.001373638
result$estimate
## mean in group Automatic    mean in group Manual
##                17.14737                24.39231
Since the p-value is 0.001, which is less than 0.05, we reject the null hypothesis of equal means. There is a significant difference in MPG between the two groups. The mean MPG for manual transmission cars, 24.4, is larger than the mean for automatic transmission cars, 17.1.
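To quantify the difference rather than only test it, the object returned by t.test also carries a confidence interval. The following is a minimal sketch, assuming the same default Welch test as above; the names mt, tt, diff_means and ci are illustrative and not part of the original analysis.

```r
# Self-contained: reload the data and relabel am (0 = Automatic, 1 = Manual)
data(mtcars)
mt <- mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test (t.test default: var.equal = FALSE)
tt <- t.test(mpg ~ am, data = mt)

# Difference in group means, Manual minus Automatic
diff_means <- unname(tt$estimate[2] - tt$estimate[1])

# 95% confidence interval; t.test reports it for Automatic minus Manual,
# so a negative interval means Manual has the higher mean
ci <- tt$conf.int

print(diff_means)
print(ci)
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the 0.05 level.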
We use MPG as the dependent variable and AM as the independent variable to fit a linear regression.
# Regression Analysis
# Univariate Linear Regression Analysis
uni <- lm(mpg ~ am, data = mtcars)
summary(uni)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Since the p-value is 0.0003, which is less than 0.05, we reject the null hypothesis that the AM coefficient is zero. The R-squared value is 0.36 (adjusted R-squared 0.34), which means the model explains only about 36% of the variance of MPG. We need to include other predictor variables to improve the model.
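With a single two-level factor as the only predictor, the fitted coefficients simply restate the group means: the intercept is the mean of the reference level (Automatic) and the amManual coefficient is the difference between the two means. A short sketch to check this, using illustrative names (mt, fit, group_means) that do not appear in the original code:

```r
# Self-contained: reload the data and relabel am (0 = Automatic, 1 = Manual)
data(mtcars)
mt <- mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = mt)
group_means <- tapply(mt$mpg, mt$am, mean)

# Intercept should equal the mean MPG of the reference level (Automatic)
intercept_matches <- isTRUE(all.equal(
  unname(coef(fit)["(Intercept)"]),
  unname(group_means["Automatic"])
))

# Slope should equal the Manual mean minus the Automatic mean
slope_matches <- isTRUE(all.equal(
  unname(coef(fit)["amManual"]),
  unname(group_means["Manual"] - group_means["Automatic"])
))

print(c(intercept = intercept_matches, slope = slope_matches))
```

The p-value of this model (0.000285) differs from the t-test p-value (0.001374) because lm assumes equal variances in both groups, whereas t.test does not by default.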
We fit a single linear regression model of MPG on all 10 remaining variables.
# Multivariate Linear Regression Analysis
multi <- lm(mpg ~ ., data = mtcars)
summary(multi)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## amManual     2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
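The R-squared value is 0.87 (adjusted R-squared 0.81), which means that the model explains about 87% of the variance of the MPG variable. However, no individual coefficient is significant at the 0.05 level. WT (weight) comes closest (p = 0.063), followed by AM (p = 0.234) and QSEC, the 1/4 mile time (p = 0.274), so these three variables appear to contribute most to explaining the variation in MPG.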
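In the coefficient table above, the overall F-test is highly significant (p = 3.793e-07) even though no single predictor reaches the 0.05 level. One plausible explanation, which the original analysis does not examine, is that several predictors are strongly correlated with each other. The sketch below looks at a hand-picked subset of predictors; the choice of columns and the names preds and cor_mat are assumptions made for illustration.

```r
data(mtcars)

# Pairwise correlations among a few of the numeric predictors
preds <- mtcars[, c("cyl", "disp", "hp", "wt")]
cor_mat <- round(cor(preds), 2)

print(cor_mat)
```

Strong correlations between predictors inflate the standard errors of the individual coefficients, which is one motivation for the variable selection step that follows.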
We use the Akaike Information Criterion (AIC) in a stepwise algorithm, starting from the full model, to select the combination of variables that best represents our model.
# Multivariate Linear Regression Analysis with Model Selection: Akaike information criterion (AIC)
multi <- step(multi, direction = "both", trace = 0, steps = 10000)
summary(multi)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4811 -1.5555 -0.7257  1.4110  4.6610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.6178     6.9596   1.382 0.177915
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The AIC selects a model with 3 variables: mpg ~ wt + qsec + am.
The R-squared value is 0.85 (adjusted R-squared 0.83), which means that the model explains about 85% of the variance of the MPG variable. The AM coefficient is significant at alpha = 0.05 (p = 0.047), so we reject the null hypothesis of no difference in MPG between the two transmission types. Holding weight and 1/4 mile time constant, manual transmission cars get about 2.9 more MPG than automatic transmission cars.
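The three models can also be put side by side on the same criteria. The sketch below refits them under illustrative names (uni_fit, full_fit, sel_fit, which are not the object names used in the report) and compares AIC and adjusted R-squared. Note that AIC() and the value step() works with internally differ by a constant for a given dataset, so the ranking of the models is the same.

```r
# Self-contained: reload the data and relabel am (0 = Automatic, 1 = Manual)
data(mtcars)
mt <- mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Refit the three models discussed in the text
uni_fit  <- lm(mpg ~ am, data = mt)              # univariable model
full_fit <- lm(mpg ~ ., data = mt)               # all 10 predictors
sel_fit  <- lm(mpg ~ wt + qsec + am, data = mt)  # formula chosen by step()

# Lower AIC is better
aic_tab <- AIC(uni_fit, full_fit, sel_fit)
print(aic_tab)

# Adjusted R-squared penalises the number of predictors
adj_r2 <- sapply(
  list(uni = uni_fit, full = full_fit, selected = sel_fit),
  function(m) summary(m)$adj.r.squared
)
print(round(adj_r2, 4))
```

On both criteria the 3-predictor model comes out ahead of the other two, which supports preferring it over the full model despite its slightly lower unadjusted R-squared.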
We studied a boxplot of MPG by AM and saw that, in general, manual transmission cars have higher MPG values than automatic transmission cars.
We used a t-test to compare mean MPG between the two transmission groups and found a significant difference (p = 0.001). The mean MPG for manual transmission cars, 24.4, is larger than the mean for automatic transmission cars, 17.1.
We fitted 3 linear regression models: one simple univariable model, one multivariable model with all variables, and one multivariable model with variables selected using the AIC.
We concluded that the multivariable model with selected variables is the best of the 3 because it explains about 85% of the variance of MPG with only 3 predictors: WT (weight), QSEC (1/4 mile time) and AM. In that model, manual transmission adds about 2.9 MPG over automatic transmission when weight and 1/4 mile time are held constant.
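# Residual diagnostics for the univariable model
par(mfrow = c(2, 2))
plot(uni, pch = 19)
# Residual diagnostics for the AIC-selected model (multi now holds the result of step())
par(mfrow = c(2, 2))
plot(multi, pch = 19)
# Pairwise scatterplots of all variables
mtcars_vars <- mtcars
mar.orig <- par()$mar # save the original margins
par(mar = c(1, 1, 1, 1)) # use narrow margins for the pairs plot
pairs(mtcars_vars, panel = panel.smooth, col = 9 + mtcars$wt)
par(mar = mar.orig) # restore the original margins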