Regression Models: Course Project

Executive Summary

The goal of this project (the course project of the Regression Models Course from Coursera) is to analyse the Motor Trend Car Road Tests Dataset included in The R Datasets Package, exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).

We are particularly interested in finding out which transmission type (AM), automatic or manual, is better for MPG, and in quantifying the MPG difference between the two.

We use a t-test to determine whether mean MPG differs between the two transmission types, and we find a significant difference between automatic and manual cars. The mean MPG for manual transmission cars, 24.4, is larger than the mean for automatic transmission cars, 17.1.

We carry out the analysis with 3 different linear regression models: one simple univariable model, one multivariable model with all variables in the dataset, and one multivariable model with variables selected by the AIC (Akaike Information Criterion).

We conclude that the multivariable model with selected variables is the best of the 3 because it explains about 85% of the variance of MPG with only 3 predictors: WT (weight), QSEC (1/4 mile time) and AM.

Instructions

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”

Data

We work with the Motor Trend Car Road Tests included in The R Datasets Package.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Data Processing

We load data and transform AM into a factor variable with levels “Automatic” and “Manual”.

# Load Library
library(ggplot2)


# Load Data
data(mtcars)


# Convert am to a factor and relabel its levels
mtcars$am <- as.factor(as.character(mtcars$am))
levels(mtcars$am) <- c("Automatic", "Manual")


head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs        am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0    Manual    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0    Manual    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1    Manual    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 Automatic    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0 Automatic    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1 Automatic    3
##                   carb
## Mazda RX4            4
## Mazda RX4 Wag        4
## Datsun 710           1
## Hornet 4 Drive       1
## Hornet Sportabout    2
## Valiant              1
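
As a quick sanity check on the recoding above, the sketch below cross-tabulates the original 0/1 coding against the new labels. It works on a fresh copy of the data so that it can be run on its own; the names `cars`, `am_label` and `recode_check` are illustrative and not part of the analysis. It assumes the coding documented on the mtcars help page (0 = automatic, 1 = manual).

```r
# Fresh copy so the check does not depend on earlier state (`cars` is an illustrative name)
cars <- datasets::mtcars
am_label <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Rows: original coding; columns: new labels. Off-diagonal cells should be zero.
recode_check <- table(original = cars$am, recoded = am_label)
print(recode_check)
```

If the labels were assigned in the wrong order, the counts would appear in the off-diagonal cells instead.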

Exploratory Data Analysis

We draw a boxplot of MPG by transmission type (AM).

# Boxplot of MPG by transmission type
# (ggplot() is used here because qplot() is deprecated in recent ggplot2 releases)
g <- ggplot(mtcars, aes(x = am, y = mpg, color = am)) + geom_boxplot()
g + ggtitle("Miles Per Gallon by Transmission Type")

In general, the manual transmission yields higher values of MPG than automatic transmission.

Statistical Inference Analysis

We use a t-test to determine whether the mean MPG differs between automatic and manual cars.
The null hypothesis is that the two group means are equal; under it, we assume the test statistic follows a Student’s t-distribution.

# Inference
result <- t.test(mpg ~ factor(am), data=mtcars)
result$p.value

## [1] 0.001373638

result$estimate

## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

Since the p-value is 0.001, which is less than 0.05, we reject the null hypothesis of equal means. There is a significant difference in MPG between the two groups. The mean MPG for manual transmission cars, 24.4, is larger than the mean for automatic transmission cars, 17.1.
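
To quantify the difference rather than only test it, the test object also carries a confidence interval. This is a minimal sketch, assuming the default settings of `t.test()` (Welch's unequal-variance test at the 95% level); `cars`, `welch` and `mean_diff` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# t.test() defaults to Welch's test (var.equal = FALSE)
welch <- t.test(mpg ~ am, data = cars)

# Difference in group means, manual minus automatic
mean_diff <- unname(welch$estimate[2] - welch$estimate[1])
print(mean_diff)

# The interval is reported for automatic minus manual, so both limits are negative
print(welch$conf.int)
```

An interval that excludes zero agrees with rejecting the null hypothesis at the 0.05 level.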

Simple Linear Regression Model Analysis

We use MPG as the dependent variable and AM as the independent variable to fit a linear regression.

# Regression Analysis
# Univariate Linear Regression Analysis
uni <- lm(mpg ~ am, data = mtcars)
summary(uni)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

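Since the p-value is 0.0003, which is less than 0.05, we reject the null hypothesis that the transmission coefficient is zero. On average, manual cars get 7.2 MPG more than automatic cars. However, the multiple R-squared is 0.36 (adjusted 0.34), which means the model explains only about 36% of the variance in MPG. We need to include other predictor variables to improve the model.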
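
With a single two-level factor as the only predictor, the intercept is the mean of the reference group (automatic) and the slope is the difference between the two group means. The sketch below checks this against the group means directly; the object names are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean MPG of automatic cars; slope = manual mean minus automatic mean
intercept_matches <- isTRUE(all.equal(unname(coef(fit)["(Intercept)"]),
                                      unname(group_means["Automatic"])))
slope_matches <- isTRUE(all.equal(unname(coef(fit)["amManual"]),
                                  unname(group_means["Manual"] - group_means["Automatic"])))
print(c(intercept = intercept_matches, slope = slope_matches))

# 95% confidence interval for the MPG difference
print(confint(fit, "amManual"))
```

The p-value here differs from the one in the t-test section because `lm()` pools the variance across both groups, whereas `t.test()` uses Welch's unequal-variance test by default.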

Multiple Linear Regression Model Analysis

We fit a single linear regression model of MPG on all 10 remaining variables.

# Multivariable Linear Regression Analysis
multi <- lm(mpg ~ ., data = mtcars)
summary(multi)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## amManual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

The multiple R-squared is 0.87 (adjusted 0.81), which means that the model can explain about 87% of the variance of the MPG variable. No individual coefficient is significant at the 0.05 level. Weight (WT) has the smallest p-value (0.063), followed by transmission (AM, 0.234) and quarter-mile time (QSEC, 0.274).
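
The two R-squared figures in the output are related by a penalty for the number of predictors. The sketch below recomputes the adjusted value from the multiple R-squared, with n observations and p predictors; the object names are illustrative.

```r
cars <- datasets::mtcars
full <- lm(mpg ~ ., data = cars)   # am left as 0/1 here; the fit is the same as with a two-level factor

r2 <- summary(full)$r.squared
n  <- nrow(cars)                   # number of cars
p  <- length(coef(full)) - 1       # number of predictors, excluding the intercept

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(c(multiple = r2, adjusted = adj_r2), 4))
```

Because the penalty grows with p, the adjusted value is the fairer figure when comparing models with different numbers of predictors.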

Model Selection

We use the AIC (Akaike Information Criterion) in a stepwise algorithm to select the combination of variables that best balances goodness of fit against model size.

# Stepwise model selection by the Akaike information criterion (AIC)
# Note: this overwrites `multi` with the selected model
multi <- step(multi, direction = "both", trace = 0, steps = 10000)
summary(multi)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The AIC selects a model with these 3 variables: “mpg ~ wt + qsec + am”.

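The multiple R-squared is 0.85 (adjusted 0.83), which means that the model can explain about 85% of the variance of the MPG variable. The transmission coefficient (p = 0.047) is significant at alpha = 0.05: holding weight and quarter-mile time fixed, manual cars are estimated to get about 2.94 MPG more than automatic cars. We therefore reject the null hypothesis in favor of the alternative hypothesis that there is a difference in MPG between the two groups.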
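
One way to compare the three models side by side is to tabulate AIC and adjusted R-squared, and to run an F-test on the nested pair. This is a sketch that refits the three formulas reported above on a fresh copy of the data; the object names are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

simple   <- lm(mpg ~ am, data = cars)
full     <- lm(mpg ~ ., data = cars)
selected <- lm(mpg ~ wt + qsec + am, data = cars)

models <- list(simple = simple, full = full, selected = selected)
comparison <- data.frame(
  aic    = sapply(models, AIC),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = names(models)
)
print(round(comparison, 3))

# F-test: do wt and qsec improve on the transmission-only model?
nested_test <- anova(simple, selected)
print(nested_test)
```

A lower AIC and a higher adjusted R-squared both point to the selected model, which is consistent with the choice made by the stepwise search.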

Conclusions

We studied a boxplot of MPG by transmission type (AM), concluding that in general, manual transmission cars have higher MPG than automatic transmission cars.

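We fitted 3 linear regression models: one simple univariable model, one multivariable model with all variables, and one multivariable model with variables selected by the AIC. The selected model (WT, QSEC and AM) explains about 85% of the variance in MPG and estimates that manual cars get about 2.9 MPG more than automatic cars of the same weight and quarter-mile time.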

Appendix

Figure 1: Residuals of the Simple Linear Regression Model

# Residuals
par(mfrow = c(2, 2))
plot(uni, pch = 19)

Figure 2: Residuals of the Multiple Linear Regression Model (AIC-selected)

# Residuals
par(mfrow = c(2, 2))
plot(multi, pch = 19)

Figure 3: Correlations Between Variables

# Correlations
mtcars_vars <- mtcars
mar.orig <- par()$mar  # save the original margins
par(mar = c(1, 1, 1, 1))  # use narrow margins for the pairs plot
pairs(mtcars_vars, panel = panel.smooth, col = 9 + mtcars$wt)
par(mar = mar.orig)  # restore the original margins
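
The pairs plot can be complemented with the numeric correlations. The sketch below ranks each variable by the absolute value of its Pearson correlation with MPG, using the original numeric coding of every column; the object names are illustrative.

```r
cars <- datasets::mtcars            # all columns numeric, am coded 0/1

# Pearson correlation of every other variable with mpg (mpg is the first column)
cor_with_mpg <- cor(cars)["mpg", -1]

# Sort by strength of association, regardless of sign
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
print(round(ranked, 3))
```

Several predictors are strongly correlated with MPG and with each other, which helps explain why few coefficients are individually significant in the model with all variables.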