This work was created for the course project of the Regression Models course on Coursera (Aug 2014). In this assignment we address the relationship between fuel consumption, measured in miles per gallon (MPG), and the type of transmission (automatic/manual), so we have to answer the following basic questions:
To answer these questions we follow the steps below, each explained in detail in its own section:
Load the required libraries, read the dataset and split the data into automatic and manual transmission, so that we can fit a separate regression model for each type of transmission.
Carry out some exploratory data analysis in order to select a predictor variable for the linear regression models. To this end we use correlation matrix plots to visualize the correlations between the candidate predictors, and finally select Weight as the best candidate for the linear regression model of each type of transmission.
Finally, apply the fitted linear regression models to predict the results and carry out some diagnostics and residual analysis. In this section we use inference to determine how fuel consumption in miles per gallon behaves as a function of the selected predictor variable.
Our basic strategy is to fit a linear regression model for each type of transmission using the same predictor variable, and to quantify the difference between automatic and manual cars using the regression parameters obtained.
To execute the R code, the following libraries must be loaded.
library(caret)
library(ggplot2)
library(corrgram)
library(gridExtra)
We start by reading the mtcars dataset and applying some transformations to generate better visualizations.
# Loading data
data(mtcars)
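The transformations themselves are not listed, but the code further down relies on columns named MPG, Weight and Transmision. A minimal sketch of what such a step could look like is given here. The column names other than those three are taken from the text, and the label assignment is an assumption inferred from the outputs shown later (19 cars in the manual subset, including the Hornet 4 Drive). Note that the R help page for mtcars documents `am` as 0 = automatic and 1 = manual, so the labels may need to be swapped.

```r
data(mtcars)

# Hypothetical renaming: only MPG, Weight and Transmision appear in the
# report's code; the remaining names are taken from the text.
old <- c("mpg", "cyl", "disp", "hp", "wt")
new <- c("MPG", "Cylinders", "Displacement", "HorsePower", "Weight")
names(mtcars)[match(old, names(mtcars))] <- new

# Assumed labelling, inferred from the report's outputs.
# ?mtcars documents am as 0 = automatic, 1 = manual.
mtcars$Transmision <- factor(mtcars$am, levels = c(0, 1),
                             labels = c("Manual", "Automatic"))

# Group sizes, to compare against the outputs later in the report.
table(mtcars$Transmision)
```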
Now we split the data by transmission type.
# Split data : automatic / manual
manual <- mtcars[which(mtcars$Transmision == 'Manual'),]
automatic <- mtcars[which(mtcars$Transmision == 'Automatic'),]
We start our exploratory data analysis by comparing fuel consumption in miles per gallon across the types of transmission. The result is shown in the next figure:
Two basic questions can be addressed from this figure:
To continue the exploratory data analysis, our objective now is to select the predictor variable to be used in the linear regression models. The next figure shows the correlation matrix, in which highly correlated variables stand out in strong colours (red for negative and blue for positive correlation values), so the candidate predictor is easy to pick.
From this plot we select Weight as the best predictor for the data as a whole, but we have to check whether this variable is also well correlated with MPG within each group, manual and automatic. The next figures show the correlation matrices for automatic and manual transmissions, where we can see that Weight is still a good candidate for both transmission types, although in the case of manual transmission there are better candidates (Cylinders, Displacement and HorsePower).
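The same ranking can be checked numerically rather than read off the colours. The sketch below is an illustration, not the code used to produce the figures: it works on the raw mtcars column names (`mpg`, `wt`, `am`), and the helper `mpg_cor` is a hypothetical name introduced here.

```r
data(mtcars)

# Correlation of every other variable with mpg, strongest (in absolute
# value) first. `mpg_cor` is a hypothetical helper for this sketch.
mpg_cor <- function(d) {
  r <- cor(d)["mpg", setdiff(names(d), "mpg")]
  r[order(abs(r), decreasing = TRUE)]
}

overall <- mpg_cor(mtcars)

# Same ranking within each transmission group. `am` is dropped first
# because it is constant inside a group, so its correlation is undefined.
by_group <- lapply(split(mtcars[, names(mtcars) != "am"], mtcars$am), mpg_cor)

round(head(overall, 4), 3)
lapply(by_group, function(r) round(head(r, 4), 3))
```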
Once the predictor variable has been selected, it is time to fit the linear regression models for each type of transmission.
The next plot shows fuel consumption in miles per gallon versus weight, coloured by transmission type, together with the linear regression line for each type of transmission and its shaded confidence region.
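A plot of this kind can be sketched with ggplot2, which is among the libraries loaded above. This is an assumed reconstruction rather than the original plotting code: it uses the raw mtcars columns, and the axis labels are chosen here for illustration.

```r
library(ggplot2)
data(mtcars)

# `am` is converted to a factor so that each transmission group gets its
# own colour, its own fitted line and its own confidence band.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "am")

# print(p) draws the figure.
```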
From this figure we can state that:
Now we can carry out some inference by using the linear regression models in a more interpretable formulation, with Weight centred on the mean weight of each group.
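Centring makes the intercept directly readable: a least-squares line passes through the point of means, so with a centred predictor the intercept equals the mean MPG of the group, while the slope is unchanged. The sketch below illustrates this on the raw mtcars columns; the variable `wt_c` is a name introduced here.

```r
data(mtcars)

# One model per transmission group, with weight centred on the group mean.
fits <- lapply(split(mtcars, mtcars$am), function(d) {
  d$wt_c <- d$wt - mean(d$wt)
  lm(mpg ~ wt_c, data = d)
})

# Intercepts of the centred models and the group means of mpg, side by side.
intercepts  <- sapply(fits, function(f) coef(f)[[1]])
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
rbind(intercepts, group_means)
```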
# Manual transmission: linear regression model with Weight centred on the group mean.
manual.weight.mean <- mean(manual$Weight)
manual.model.lm <- lm(MPG ~ I(Weight - manual.weight.mean), data = manual)
# Manual transmission: coefficient summary.
summary(manual.model.lm)$coefficients
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      17.147     0.5800  29.563 4.667e-16
## I(Weight - manual.weight.mean)   -3.786     0.7666  -4.939 1.246e-04
From this result we can state that for manual transmission cars:
# Automatic transmission: linear regression model with Weight centred on the group mean.
automatic.weight.mean <- mean(automatic$Weight)
automatic.model.lm <- lm(MPG ~ I(Weight - automatic.weight.mean), data = automatic)
# Automatic transmission: coefficient summary.
summary(automatic.model.lm)$coefficients
##                                   Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                         24.392     0.7449  32.747 2.568e-12
## I(Weight - automatic.weight.mean)   -9.084     1.2566  -7.229 1.688e-05
From this result we can state that for automatic transmission cars:
The next plot presents the residuals for automatic and manual transmissions under the proposed linear regression models.
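A residual check of this kind can be sketched in base R. The code below is an illustration on the raw mtcars columns, not the original plotting code. Centring the predictor does not change the residuals, so the uncentred model `mpg ~ wt` is used for brevity.

```r
data(mtcars)

fits <- lapply(split(mtcars, mtcars$am), function(d) lm(mpg ~ wt, data = d))

# Residuals of a least-squares fit with an intercept sum to zero.
resid_sums <- sapply(fits, function(f) sum(resid(f)))

# The two cars with the largest absolute standardized residuals per group.
largest <- lapply(fits, function(f) {
  r <- rstandard(f)
  head(r[order(abs(r), decreasing = TRUE)], 2)
})
largest

# Residuals against fitted values, one panel per group.
op <- par(mfrow = c(1, 2))
for (g in names(fits)) {
  plot(fitted(fits[[g]]), resid(fits[[g]]),
       xlab = "Fitted mpg", ylab = "Residual", main = paste("am =", g))
  abline(h = 0, lty = 2)
}
par(op)
```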
From these plots we cannot see any pattern in the residuals, but we can easily detect a couple of outliers in the predictions. Finally, let's take a look at some influence measures for the manual transmission:
hatvalues(manual.model.lm)
##      Hornet 4 Drive   Hornet Sportabout             Valiant
##             0.08083             0.06258             0.06140
##          Duster 360           Merc 240D            Merc 230
##             0.05627             0.08344             0.08784
##            Merc 280           Merc 280C          Merc 450SE
##             0.06258             0.06258             0.06097
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood
##             0.05277             0.05264             0.25429
## Lincoln Continental   Chrysler Imperial       Toyota Corona
##             0.30445             0.28099             0.20892
##    Dodge Challenger         AMC Javelin          Camaro Z28
##             0.05833             0.06288             0.05310
##    Pontiac Firebird
##             0.05316
and for the automatic transmission type:
hatvalues(automatic.model.lm)
##      Mazda RX4  Mazda RX4 Wag     Datsun 710       Fiat 128    Honda Civic
##        0.08649        0.12405        0.07874        0.08667        0.21563
## Toyota Corolla      Fiat X1-9  Porsche 914-2   Lotus Europa Ford Pantera L
##        0.14955        0.12652        0.09300        0.25346        0.20304
##   Ferrari Dino  Maserati Bora     Volvo 142E
##        0.10514        0.37099        0.10673
The Lincoln Continental and Chrysler Imperial among the manual transmission cars, and the Maserati Bora and Lotus Europa among the automatic transmission cars, have the largest hat values, that is, the highest leverage on their respective fitted lines.
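One common way to turn hat values into a short list of cars is a rule-of-thumb cut-off of twice the mean hat value, 2p/n, where p is the number of estimated coefficients and n the number of cars in the group. This threshold is a convention assumed here, not something the analysis above relies on. The sketch uses the raw mtcars columns; hat values are unaffected by centring the predictor.

```r
data(mtcars)

high_leverage <- lapply(split(mtcars, mtcars$am), function(d) {
  h <- hatvalues(lm(mpg ~ wt, data = d))
  p <- 2                     # estimated coefficients: intercept and slope
  h[h > 2 * p / nrow(d)]     # rule-of-thumb cut-off: twice the mean hat value
})

high_leverage
```

A point with high leverage is not necessarily influential; measures such as `cooks.distance()` combine leverage with the size of the residual.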