1) Executive Summary

This project answers questions about the fuel efficiency (miles per gallon, mpg) of cars with automatic and manual transmissions, based on the “mtcars” data set.

A predictive modelling process was adopted, using lm, glm and an lm model with a spline term.

The first step in the model-building process was to understand the data with the help of plots. Since there are multiple predictors, we needed to understand their characteristics and the relationships among them.

Scanning through the data also identified the data pre-processing needs.

The next step was to identify and build the models and then evaluate them. Model evaluation and performance comparison were done with cross-validation in the caret package, using the models’ RMSE and R-squared values.

Model assumptions were checked with four diagnostic plots, viz. the Residuals vs. Fitted plot, the Normal Q-Q plot, the Scale-Location plot and the Residuals vs. Leverage plot.

The best-performing model was chosen for answering the questions.

Refer to Section 7 below for the answers, Annexure A for the plots and Annexure B for the R code.

2) Understanding the data

The basic data structure was reviewed with ?mtcars (details not reproduced here for the sake of brevity). Further scanning was done with the str(mtcars) command.

mtcars is a 32 x 11 data frame. There are only 32 observations and, with mpg as the response, 10 predictors are available.

It was decided to convert vs and am into categorical variables (factors), and cyl, gear and carb into ordered factors.

3) Understanding relationship of predictors with plots

Observations based on the pair plots (see Figure 1 in Annex A):
  1. We can see multicollinearity in this data.
  2. Some curvature is found in the relationships, hence we will also use the glm function in addition to lm.
  3. We will introduce a spline for the hp variable.

We will confirm the multicollinearity of the continuous variables first.

Observations based on the output of the cor function:
  1. We can see strong correlation among the continuous variables, except qsec.

  2. For categorical variables, visual patterns can be deceptive, hence we will use the vif function (car package) to calculate the Variance Inflation Factor and confirm the collinearity.
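The correlation check in item 1 can be sketched as follows. This is a minimal illustration in base R; the 0.8 cut-off for a "strong" correlation is an assumption made here for illustration, not a threshold taken from the analysis.

```r
# Continuous columns of mtcars (the same columns the report passes to cor)
data("mtcars")
num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
cor_matrix <- cor(num_vars)

# For each variable, find its strongest absolute correlation with any other variable
abs_cor <- abs(cor_matrix)
diag(abs_cor) <- NA
strongest <- apply(abs_cor, 1, max, na.rm = TRUE)

# 0.8 is an illustrative cut-off (assumption), not part of the original analysis
strong_flag <- strongest > 0.8
print(round(cbind(strongest, strong_flag), 2))
```

Looking at the strongest partner per variable is one simple way to spot a variable that stands apart from the rest.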

Observations:

GVIF^(1/(2*Df)) is the key quantity for detecting collinearity when categorical variables are present. Values of about 1 to 1.5 indicate very low collinearity.

Values greater than 1.5 indicate moderate concern. carb falls in this category.

Values greater than sqrt(5) = 2.24 indicate high concern. vs and gear fall in this category.

Values greater than sqrt(10) = 3.16 indicate severe collinearity. cyl, disp, hp, wt and qsec fall in this category.
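The bands above can be expressed as a small helper. This is a sketch only: the function name classify_gvif and the example values are hypothetical and are not the values obtained in this analysis. The cut-offs sqrt(5) and sqrt(10) arise because GVIF^(1/(2*Df)) is on the square-root scale of the ordinary VIF (for a term with Df = 1 it equals sqrt(VIF)), so they correspond to the common VIF rules of thumb of 5 and 10.

```r
# Classify adjusted GVIF values, GVIF^(1/(2*Df)), into the bands used in this report.
# classify_gvif is a hypothetical helper written for illustration.
classify_gvif <- function(adj_gvif) {
  cut(adj_gvif,
      breaks = c(-Inf, 1.5, sqrt(5), sqrt(10), Inf),
      labels = c("very low", "moderate", "high", "severe"))
}

# Hypothetical values for illustration only (not the values from this analysis)
example <- c(a = 1.2, b = 1.8, c = 2.5, d = 4.0)
bands <- classify_gvif(example)
print(data.frame(adj_gvif = example, band = bands))
```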

Intuitively, the number of cylinders and carb are related. The horsepower of the engine depends on the number of cylinders, the displacement and vs. The weight (wt) of a car increases as the number of cylinders and the displacement increase.

4) Model Building:

Based on the observations above, we will develop five models:
  1. an lm model and a glm model, each with only the hp, wt and am predictors;
  2. an lm model and a glm model, each with all predictors;
  3. an lm model with a spline for the hp variable.
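As a minimal sketch of the three-predictor pair of models, the code below fits both and compares their coefficients. It relies on the fact that glm uses the gaussian family with an identity link when no family is given, in which case it fits the same least-squares model as lm. The object names (cars, fit_lm, fit_glm) are chosen here for illustration.

```r
data("mtcars")
# Label the transmission factor as in the report
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

fit_lm  <- lm(mpg ~ wt + hp + am, data = cars)
fit_glm <- glm(mpg ~ wt + hp + am, data = cars)  # default family = gaussian

# Coefficients side by side
print(round(cbind(lm = coef(fit_lm), glm = coef(fit_glm)), 3))

# TRUE when the two sets of coefficients agree up to numerical tolerance
same_fit <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_fit)
```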

The data were plotted to locate the knot point for the spline (refer to the code in Annexure B).
With reference to the plot in Figure 2 in Annex A, we will introduce a knot point at 20 mpg.
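When a spline is built with bs(hp, knots = ...), the knots are read on the scale of hp, so it helps to check the range of hp before choosing one. The sketch below is illustrative only: it places a single knot at the median of hp, which is an assumption made for this example and not the knot used in this report.

```r
library(splines)
data("mtcars")

hp_range <- range(mtcars$hp)   # interior knots should lie inside this range
knot_hp  <- median(mtcars$hp)  # illustrative knot (assumption), not the report's choice

# Cubic B-spline basis: degree 3 plus 1 interior knot gives 4 basis columns
basis <- bs(mtcars$hp, knots = knot_hp)

# Spline for hp alongside wt and am
fit_spline <- lm(mpg ~ bs(hp, knots = knot_hp) + wt + factor(am), data = mtcars)

print(hp_range)
print(dim(basis))
```

The spline basis already contains the linear trend in hp, so a separate linear hp term is left out of this sketch.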

5) Model Performance & comparison

We will first carry out 10-fold cross-validation using the caret package.
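The mechanics of 10-fold cross-validation can be sketched in base R. This is not the caret implementation used in the report. The helper cv_rmse and the fold assignment are written here for illustration, and every model is evaluated on the same folds so that differences in RMSE reflect the models rather than the random split.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

set.seed(100)
k <- 10
# Assign each of the 32 rows to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = nrow(cars)))

# Hypothetical helper: per-fold RMSE for a fitting function such as lm or glm
cv_rmse <- function(fit_fun) {
  sapply(seq_len(k), function(i) {
    train <- cars[fold_id != i, ]
    test  <- cars[fold_id == i, ]
    fit   <- fit_fun(mpg ~ wt + hp + am, data = train)
    sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  })
}

rmse_lm  <- cv_rmse(lm)
rmse_glm <- cv_rmse(glm)  # gaussian family by default
print(c(lm = mean(rmse_lm), glm = mean(rmse_glm)))
```

With only 32 observations each fold holds 3 or 4 cars, so per-fold estimates are noisy and the mean over folds is the figure to compare.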


6) Model Selection:

Model glm_3 has the lowest mean RMSE, at 2.42, but a lower R-squared of 0.86.

Model lm_3 has an RMSE of 2.46 with an R-squared of 0.93.

A choice has to be made between models glm_3 and lm_3. A higher R-squared indicates better goodness of fit, meaning the model explains a larger share of the total variation.

RMSE is an absolute measure of the precision of a model’s predictions. It is the square root of the mean squared difference between observed and predicted values, i.e. the typical distance of the observed data points from the regression line, and it is expressed in the original units of the response, here miles per gallon (mpg).
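Both measures can be computed directly from observed and predicted values. The sketch below does so in-sample for the three-predictor model, so the numbers it prints are not the cross-validated values quoted in this section.

```r
data("mtcars")
fit <- lm(mpg ~ wt + hp + factor(am), data = mtcars)

obs  <- mtcars$mpg
pred <- fitted(fit)

# RMSE: root of the mean squared prediction error, in mpg
rmse <- sqrt(mean((obs - pred)^2))

# R-squared: share of the total variation in mpg explained by the model (unitless)
r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(rmse = rmse, r2 = r2))
```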

The trade-off between R-squared and RMSE depends heavily on the primary goal: explanation or prediction accuracy. Since we are interested in the difference in fuel efficiency between manual and automatic transmissions, we will prefer the model with the better accuracy, that is, model glm_3.

Diagnostic Checks

These are done to check the model assumptions.

Observations (refer to Figure 3 in the Annex):

The Residuals vs. Fitted plot checks for linearity and homoscedasticity (constant variance). It shows a fairly evenly distributed cloud of points.

The Normal Q-Q plot checks the assumption that the residuals are normally distributed. It is found to be satisfactory.

The Scale-Location (or Spread-Location) plot checks more clearly for homoscedasticity (constant variance). It uses the square root of the absolute standardized residuals, which helps to confirm the spread visually. Found OK.

The Residuals vs. Leverage plot identifies influential observations and high-leverage points. No points fall outside the Cook's distance contours (dotted lines), so no observation is unduly influential.

The plots are not ideal, indicating that more predictors are required to improve the model's predictions further.
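The quantities behind these plots can also be inspected numerically. The following sketch, using base R functions on the three-predictor glm fit, computes the values plotted on the Scale-Location and Residuals vs. Leverage panels. It is meant as a companion to the plots, not a replacement for them.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- glm(mpg ~ wt + hp + am, data = cars)

# y-axis of the Scale-Location plot: sqrt of absolute standardized residuals
scale_loc <- sqrt(abs(rstandard(fit)))

# Leverage (hat values) and Cook's distance for every car
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)

# The three cars with the largest Cook's distance
print(head(sort(cook, decreasing = TRUE), 3))
```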

7) ANSWERING QUESTIONS

Q1) Is an automatic or manual transmission better for mpg?

Consider model glm_3, which includes only the hp, wt and am predictors. The fitted equation is

mpg = 34 - 0.037*hp - 2.878*wt + 2.083*ammanual

where ammanual = 1 for a manual transmission and 0 for an automatic transmission (the reference level).
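As a worked example of reading this equation, the sketch below predicts mpg for two otherwise identical cars that differ only in transmission. The values wt = 3 (that is, 3000 lbs) and hp = 120 are hypothetical and chosen only for illustration.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- glm(mpg ~ wt + hp + am, data = cars)

# Two hypothetical cars with the same wt and hp, one automatic and one manual
new_cars <- data.frame(
  wt = 3, hp = 120,
  am = factor(c("automatic", "manual"), levels = levels(cars$am))
)
pred <- predict(fit, newdata = new_cars)

# Predicted manual-minus-automatic difference; equals the ammanual coefficient
gap <- unname(pred[2] - pred[1])
print(round(c(automatic = unname(pred[1]), manual = unname(pred[2]), gap = gap), 2))
```

Because the model has no interaction terms, the gap is the same whatever values of wt and hp are chosen.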

CONCLUSION 1: Manual transmission gives better mpg than automatic transmission.

Q2) Quantify the mpg difference between automatic & manual transmission.

CONCLUSION 2:

Manual transmission gives 2.08 miles per gallon more than automatic transmission, holding hp and wt constant.

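Confidence interval with the confint function:

The 95% confidence interval for this difference is -0.61 to 4.78 mpg. Because the interval includes zero, the difference is not statistically significant at the 5% level.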
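The interval in this report was obtained by calling confint on the glm fit. As a hedged comparison, the sketch below uses the equivalent lm fit to show two common ways of forming the interval for the transmission coefficient: the t-based interval from confint and the normal-quantile (Wald) interval from confint.default. It is not asserted here that either one reproduces the profiled glm interval exactly.

```r
data("mtcars")
cars <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
fit <- lm(mpg ~ wt + hp + am, data = cars)

# t-based interval (uses the residual degrees of freedom)
ci_t <- confint(fit, "ammanual", level = 0.95)

# Normal-quantile (Wald) interval
ci_z <- confint.default(fit, "ammanual", level = 0.95)

print(rbind(t_based = ci_t[1, ], normal_based = ci_z[1, ]))
```

With only 32 observations the t-based interval is somewhat wider than the normal-based one, which is worth bearing in mind when judging whether zero lies inside the interval.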

8) ANNEXURE A: PLOTS / FIGURES

Figure 1: Pair Plots

Figure 2: mpg vs. hp

Figure 3: Model Performance Plots

9) Annexure B: All R code

knitr::opts_chunk$set(echo = TRUE)
data("mtcars")
# convert vs & am in factor variable, cyl,gear & carb into ordered variable.
mtcars1 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb) })
require(graphics)
pairs(mtcars1,main="mtcars data at a glance", gap=1/4)
set.seed(100)
pred1 <- mtcars1[,c(1,3,4,5,6,7)]
cor_matrix <- cor(pred1)
print(round(cor_matrix,2))
options(warn = -1) # to suppress warning messages
set.seed(100)
fit1<- lm(mpg~., data=mtcars1) # included all predictors
# Let us find out VIF
library(car)
temp <- vif(fit1)
#print(round(temp,2))
set.seed(100)
fit2 <- lm(mpg ~ wt + hp + am, data = mtcars1) # lm with only wt, hp & am
fit3 <- glm(mpg ~ ., data = mtcars1) # glm with all predictors
fit4 <- glm(mpg ~ wt + hp + am, data = mtcars1) # glm with only wt, hp & am
library(splines)
plot(mtcars1$mpg, mtcars1$hp) # mpg on the x-axis, hp on the y-axis
# model with a spline for the hp predictor; bs() reads knots on the hp scale (hp ranges from 52 to 335)
fit5 <- lm(mpg ~ hp + bs(hp, knots = c(20)) + wt + am, data = mtcars1)
options(warn = -1) # to suppress warning messages
library(caret)
library(ggplot2)
set.seed(100)
lm_all <- train(form = mpg ~., data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) # lm with all predictors
lm_3 <- train(form = mpg ~ wt+hp+am, data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) # lm with 3 predictors
glm_all <- train(form = mpg ~., data=mtcars1, method="glm", trControl = trainControl(method = "cv",number=10)) # glm with all predictors
glm_3 <- train(form = mpg ~wt+hp+am, data=mtcars1, method="glm", trControl = trainControl(method = "cv",number=10)) # glm with 3 predictors
lm_spline <- train(form = mpg ~ hp+bs(hp, knots = c(20))+wt+am, data=mtcars1, method="lm", trControl = trainControl(method = "cv",number=10)) ## with spline

## Extract the performance measures.
summary(resamples(list(Model1=lm_all,Model2=lm_3,Model3=glm_all,Model4=glm_3,Model5=lm_spline)))
par(mfrow=c(2,2)) # plot in 2 by 2 matrix
plot(fit4)
par(mfrow = c(1, 1)) # Reset the plotting layout
options(warn = -1) # to suppress warning messages
confint(fit4)