This project aims to answer questions about the fuel efficiency (miles per gallon) of cars with automatic and manual transmissions, based on the “mtcars” data set.
A predictive modelling process was adopted to find the answers, using lm, glm and an lm model with splines.
The first step in the model building process was to understand the data with the help of graphs/plots. Since we have multiple predictors, we needed to understand the characteristics of the predictors and the relationships among them.
Scanning through the data also identifies the data pre-processing needs.
The next step was to identify and build the models and then evaluate them. Model evaluation and performance comparison were done with cross-validation in the caret package, using the models' RMSE and R-squared values.
Model assumptions were checked with four diagnostic plots, viz. the residual plot, the normal Q-Q plot, the Scale-Location plot and the Residuals vs. Leverage plot.
The best performing model was chosen for answering the questions.
Refer to clause (7) below for the answers, Annexure A for graphs/plots and Annexure B for the R code.
The basic data structure was understood with ?mtcars (details not reproduced here for the sake of brevity). Further scanning was done with the str(mtcars) command.
mtcars is a 32 x 11 data frame. There are only 32 observations and, with mpg as the response, 10 predictors are available for prediction.
It was decided to convert vs and am into categorical variables (factors), and cyl, gear and carb into ordered factors.
We first check for multicollinearity among the continuous variables.
We can see strong correlations among the continuous variables, except for qsec.
For categorical variables, visual patterns can be deceptive, so we use the vif function from the car package to calculate the Variance Inflation Factor (VIF) and confirm the collinearity.
GVIF^(1/(2*Df)) is the key quantity for detecting collinearity among categorical variables. Values of around 1 to 1.5 indicate very low collinearity.
Values greater than 1.5 indicate moderate concern. carb falls in this category.
Values greater than sqrt(5) = 2.24 indicate high concern. vs and gear fall in this category.
Values greater than sqrt(10) = 3.16 indicate severe collinearity. cyl, disp, hp, wt and qsec fall in this category.
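The thresholds above can be illustrated with a small base-R sketch. It is not the car::vif call used in this report: to stay self-contained it computes the ordinary VIF by hand, and it assumes only the five continuous predictors, so its numbers will differ from the GVIF values quoted above. For a term with one degree of freedom, GVIF^(1/(2*Df)) is simply sqrt(VIF), which is why the cut-offs are sqrt(5) and sqrt(10).

```r
# Base-R sketch: VIF for the continuous predictors only (an assumption made
# to keep the example self-contained; the report uses car::vif on all terms).
data("mtcars")
num_vars <- c("disp", "hp", "drat", "wt", "qsec")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors.
vif_manual <- sapply(num_vars, function(v) {
  others <- setdiff(num_vars, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# For a 1-df term, GVIF^(1/(2*Df)) equals sqrt(VIF).
gvif_scaled <- sqrt(vif_manual)

# Bin the scaled values with the cut-offs used in the text.
concern <- cut(gvif_scaled,
               breaks = c(1, 1.5, sqrt(5), sqrt(10), Inf),
               labels = c("low", "moderate", "high", "severe"),
               include.lowest = TRUE)
print(data.frame(VIF = round(vif_manual, 2),
                 scaled = round(gvif_scaled, 2),
                 concern = concern))
```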
Intuitively, the number of cylinders and carb are related. The horsepower of the engine depends on the number of cylinders, the displacement and vs (engine shape). The weight (wt) of a car increases as its number of cylinders and its displacement increase.
Based on the observations above, we will develop five models: i) an lm and a glm model, each with only the hp, wt and am predictors; ii) an lm and a glm model, each with all predictors; iii) an lm model with a spline on the hp variable.
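As an illustration of the spline idea in iii), the sketch below fits a cubic B-spline on hp and compares it with a straight-line hp effect. The knot at the median hp is a hypothetical choice made for this example, not the knot used in Annexure B, and the object names (cars, fit_spline, fit_linear) are illustrative.

```r
library(splines)  # ships with R
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

# Hypothetical knot choice: the median hp, which lies inside the data range.
knot <- median(cars$hp)

# bs() already contains the linear part of hp, so hp is not added separately.
fit_spline <- lm(mpg ~ bs(hp, knots = knot) + wt + am, data = cars)
fit_linear <- lm(mpg ~ hp + wt + am, data = cars)

# Nested-model F test: does the spline improve on a straight-line hp effect?
print(anova(fit_linear, fit_spline))
```

With only 32 observations, the extra spline coefficients use up degrees of freedom quickly, so a comparison like this is worth making before adopting the more flexible model.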
We will first carry out 10-fold cross-validation using the caret package.
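To make the procedure concrete, here is a minimal base-R sketch of 10-fold cross-validation. It is not the caret::train call used in the report: the fold assignment (fold_id) and the helper cv_rmse are illustrative names, and both models deliberately reuse the same folds so that they are compared on identical splits.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

set.seed(100)
k <- 10
# Assign each of the 32 rows to one of k folds; both models reuse these folds.
fold_id <- sample(rep(seq_len(k), length.out = nrow(cars)))

# Hypothetical helper: held-out RMSE per fold for a given fitting function.
cv_rmse <- function(fit_fun) {
  sapply(seq_len(k), function(i) {
    train_set <- cars[fold_id != i, ]
    test_set  <- cars[fold_id == i, ]
    fit <- fit_fun(mpg ~ wt + hp + am, data = train_set)
    sqrt(mean((test_set$mpg - predict(fit, newdata = test_set))^2))
  })
}

rmse_lm  <- cv_rmse(lm)
rmse_glm <- cv_rmse(glm)  # default family is gaussian, i.e. least squares

print(round(c(lm = mean(rmse_lm), glm = mean(rmse_glm)), 3))
print(all.equal(rmse_lm, rmse_glm))
```

On shared folds, lm and a Gaussian glm give the same held-out errors, because they fit the same least-squares model.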
Model glm_3 has the lowest mean RMSE, 2.42, but a lower R-squared of 0.86.
Model lm_3 has an RMSE of 2.46 with an R-squared of 0.93.
A choice has to be made between glm_3 and lm_3. A higher R-squared indicates better goodness of fit, meaning the model explains a larger share of the total variation. Note that glm with its default Gaussian family fits the same least-squares model as lm, so the two can differ here only because each train call draws its own random cross-validation folds.
RMSE is an absolute measure of the precision of a model's predictions. It is the root-mean-square distance between the observed data points and the model's predictions, and it is expressed in the original units of the response, here miles per gallon (mpg).
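Both measures can be computed directly from the residuals. The sketch below does so in-sample for the three-predictor model, so the values are not the cross-validated figures quoted above; the object names are illustrative.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- lm(mpg ~ wt + hp + am, data = cars)

res <- residuals(fit)

# RMSE: square root of the mean squared residual, in mpg.
rmse <- sqrt(mean(res^2))

# R-squared: 1 minus residual sum of squares over total sum of squares.
r2 <- 1 - sum(res^2) / sum((cars$mpg - mean(cars$mpg))^2)

print(c(RMSE = rmse, R2 = r2))
```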
The trade-off between R-squared and RMSE depends heavily on the primary goal: explanation or prediction accuracy. Since we are interested in the difference in fuel efficiency between manual and automatic transmissions, we prefer the model with the better accuracy, which is glm_3.
Diagnostic plots were used to check the model assumptions:
The residual plot checks for linearity and homoscedasticity (constant variance). It shows a fairly evenly distributed cloud of points.
The normal Q-Q plot checks the assumption that the residuals are normally distributed. It was found to be satisfactory.
The Scale-Location (or Spread-Location) plot checks more clearly for homoscedasticity (constant variance). It uses the square root of the absolute standardized residuals, which helps to confirm the spread visually. It was found to be acceptable.
The Residuals vs. Leverage plot identifies influential observations and high-leverage points. No points fall outside the Cook's distance contours (dotted lines), so no observation is flagged as influential.
The plots are not ideal, which suggests that additional predictors may be needed to improve the model's predictions further.
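The visual checks can be backed by a few numbers. The following sketch is an addition to the plots rather than part of the original analysis: it lists the largest Cook's distances, flags high-leverage cars using the common 2 * mean(leverage) rule of thumb (an assumed cut-off), and runs a Shapiro-Wilk test on the residuals.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- glm(mpg ~ wt + hp + am, data = cars)

cd  <- cooks.distance(fit)  # influence of each observation on the fit
lev <- hatvalues(fit)       # leverage of each observation

# Three most influential cars by Cook's distance.
print(round(sort(cd, decreasing = TRUE)[1:3], 3))

# Cars whose leverage exceeds twice the average leverage (rule of thumb).
print(names(lev)[lev > 2 * mean(lev)])

# Formal normality check on the residuals.
print(shapiro.test(residuals(fit)))
```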
Q1) Is an automatic or manual transmission better for mpg?
Consider model glm_3, which includes only the hp, wt and am predictors. The fitted equation is
mpg = 34 - 0.037*hp - 2.878*wt + 2.083*ammanual
where ammanual is 1 for a manual transmission and 0 for an automatic transmission (the reference level).
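As a worked example of this equation, the sketch below refits the three-predictor model and predicts mpg for one hypothetical car (110 hp, 3,000 lb; both values are made up for illustration) under each transmission type. The gap between the two predictions is exactly the ammanual coefficient.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- glm(mpg ~ wt + hp + am, data = cars)
print(round(coef(fit), 3))

# Hypothetical car: 110 hp, 3,000 lb (wt is recorded in units of 1000 lb).
new_cars <- data.frame(hp = 110, wt = 3.0,
                       am = factor(c("automatic", "manual"),
                                   levels = levels(cars$am)))
pred <- predict(fit, newdata = new_cars)
print(round(pred, 2))

# Difference manual minus automatic, with hp and wt held fixed.
print(unname(pred[2] - pred[1]))
```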
##### CONCLUSION 1: Manual transmission gives better mpg than automatic transmission.
Q2) Quantify the mpg difference between automatic & manual transmission.
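Manual transmission gives 2.08 miles per gallon more than automatic transmission, holding hp and wt constant.
The 95% confidence interval for this difference is -0.61 to 4.78 mpg. Because the interval includes zero, the difference is not statistically significant at the 5% level.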
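The interval reported here can be reproduced by hand as a normal-quantile (Wald) interval, estimate ± 1.96 standard errors. The sketch below also shows the t-based interval that confint gives for the equivalent lm fit; with 28 residual degrees of freedom it is slightly wider. Object names are illustrative.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit_lm <- lm(mpg ~ wt + hp + am, data = cars)

est <- coef(summary(fit_lm))["ammanual", "Estimate"]
se  <- coef(summary(fit_lm))["ammanual", "Std. Error"]

ci_wald <- est + c(-1, 1) * qnorm(0.975) * se  # normal quantile
ci_t    <- confint(fit_lm)["ammanual", ]       # t quantile, 28 df

print(round(rbind(wald = ci_wald, t = ci_t), 2))
```

Either way the interval spans zero, so the sign of the transmission effect is not settled by this model alone.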
Figure 1: Pair Plots
Figure 2: mpg vs. hp
Figure 3: Model Performance Plots
knitr::opts_chunk$set(echo = TRUE)
data("mtcars")
# convert vs & am to factors; cyl, gear & carb to ordered factors
mtcars1 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb) })
require(graphics)
pairs(mtcars1,main="mtcars data at a glance", gap=1/4)
set.seed(100)
# keep mpg and the continuous predictors (columns 1, 3, 4, 5, 6, 7)
pred1 <- mtcars1[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(pred1)
print(round(cor_matrix,2))
options(warn = -1) # to suppress warning messages
set.seed(100)
fit1<- lm(mpg~., data=mtcars1) # included all predictors
# Let us find out VIF
library(car)
temp <- vif(fit1)
print(round(temp, 2)) # third column is GVIF^(1/(2*Df))
set.seed(100)
fit2 <- lm(mpg~ wt+hp+am, data=mtcars1) ## lm model with only wt,hp & am
fit3 <- glm(mpg~., data=mtcars1) # glm with all predictors
fit4 <- glm(mpg ~ wt + hp + am, data = mtcars1) # glm with only wt, hp & am
library(splines)
plot(mtcars1$mpg,mtcars1$hp)
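# NOTE: the knot at hp = 20 lies below the smallest hp in the data (52), and
# bs(hp) already spans the linear hp term; both choices may be worth revisiting.
fit5 <- lm(mpg ~ hp + bs(hp, knots = c(20)) + wt + am, data = mtcars1) ## model with spline for hp predictor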
options(warn = -1) # to suppress warning messages
library(caret)
library(ggplot2)
set.seed(100)
lm_all <- train(form = mpg ~., data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) # lm with all predictors
lm_3 <- train(form = mpg ~ wt+hp+am, data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) # lm with 3 predictors
glm_all <- train(form = mpg ~., data=mtcars1, method="glm", trControl = trainControl(method = "cv",number=10)) # glm with all predictors
glm_3 <- train(form = mpg ~wt+hp+am, data=mtcars1, method="glm", trControl = trainControl(method = "cv",number=10)) # glm with 3 predictors
lm_spline <- train(form = mpg ~ hp+bs(hp, knots = c(20))+wt+am, data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) ## with spline
## Extract the performance measures.
# name the list entries after the models so the summary matches the text
summary(resamples(list(lm_all = lm_all, lm_3 = lm_3, glm_all = glm_all, glm_3 = glm_3, lm_spline = lm_spline)))
par(mfrow=c(2,2)) # plot in 2 by 2 matrix
plot(fit4)
par(mfrow = c(1, 1)) # Reset the plotting layout
options(warn = -1) # to suppress warning messages
confint(fit4)