Introduction

The following data set contains data on the fuel efficiency of 38 cars (taken from Abraham & Ledolter, 2006). The data set has 38 observations and 8 variables.

Variable Description:

- MPG: fuel efficiency in miles per gallon
- GPM: fuel consumption in gallons per 100 miles (GPM = 100/MPG)
- WT: weight of the car
- DIS: engine displacement
- NC: number of cylinders
- HP: horsepower
- ACC: acceleration
- ET: engine type (binary, 0/1)

#Reading the data
FuelEff <- read.csv("C:/Users/Gourab/Desktop/Dataset - Data Mining/FuelEfficiency.csv")
head(FuelEff)
##    MPG   GPM    WT DIS NC  HP  ACC ET
## 1 16.9 5.917 4.360 350  8 155 14.9  1
## 2 15.5 6.452 4.054 351  8 142 14.3  1
## 3 19.2 5.208 3.605 267  8 125 15.0  1
## 4 18.5 5.405 3.940 360  8 150 13.0  1
## 5 30.0 3.333 2.155  98  4  68 16.5  0
## 6 27.5 3.636 2.560 134  4  95 14.2  0

Structure & Summary

#Summary and Structure of the data
str(FuelEff)
## 'data.frame':    38 obs. of  8 variables:
##  $ MPG: num  16.9 15.5 19.2 18.5 30 27.5 27.2 30.9 20.3 17 ...
##  $ GPM: num  5.92 6.45 5.21 5.41 3.33 ...
##  $ WT : num  4.36 4.05 3.6 3.94 2.15 ...
##  $ DIS: int  350 351 267 360 98 134 119 105 131 163 ...
##  $ NC : int  8 8 8 8 4 4 4 4 5 6 ...
##  $ HP : int  155 142 125 150 68 95 97 75 103 125 ...
##  $ ACC: num  14.9 14.3 15 13 16.5 14.2 14.7 14.5 15.9 13.6 ...
##  $ ET : int  1 1 1 1 0 0 0 0 0 0 ...
summary(FuelEff)
##       MPG             GPM              WT             DIS       
##  Min.   :15.50   Min.   :2.681   Min.   :1.915   Min.   : 85.0  
##  1st Qu.:18.52   1st Qu.:3.292   1st Qu.:2.208   1st Qu.:105.0  
##  Median :24.25   Median :4.160   Median :2.685   Median :148.5  
##  Mean   :24.76   Mean   :4.331   Mean   :2.863   Mean   :177.3  
##  3rd Qu.:30.38   3rd Qu.:5.398   3rd Qu.:3.410   3rd Qu.:229.5  
##  Max.   :37.30   Max.   :6.452   Max.   :4.360   Max.   :360.0  
##        NC              HP             ACC              ET        
##  Min.   :4.000   Min.   : 65.0   Min.   :11.30   Min.   :0.0000  
##  1st Qu.:4.000   1st Qu.: 78.5   1st Qu.:14.03   1st Qu.:0.0000  
##  Median :4.500   Median :100.0   Median :14.80   Median :0.0000  
##  Mean   :5.395   Mean   :101.7   Mean   :14.86   Mean   :0.2895  
##  3rd Qu.:6.000   3rd Qu.:123.8   3rd Qu.:15.78   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :155.0   Max.   :19.20   Max.   :1.0000

In this data set we have:

- 4 numeric variables: MPG, GPM, WT & ACC
- 4 integer variables: DIS, NC, HP & ET (ET is in fact a binary 0/1 variable)

The summary does not indicate the presence of missing values in any of the variables. We will try to model fuel efficiency, measured in GPM, as a function of the weight, displacement, number of cylinders, horsepower, acceleration and engine type. We analyse GPM instead of the usual fuel-efficiency measure MPG because the reciprocal transformation GPM = 100/MPG leads to an approximately linear relationship between the response and the predictors.
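
As a quick sanity check of this relationship, we can verify that the GPM column in the data is indeed 100/MPG up to the rounding used in the file; a minimal sketch:

#Sanity check (sketch): GPM should equal 100/MPG up to rounding
max(abs(FuelEff$GPM - 100/FuelEff$MPG))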

Detecting & Treating Outliers

In this problem our objective is to model fuel efficiency (MPG or GPM) as a function of the other variables. One natural choice is of course a linear regression model. Linear regression, being a parametric model, can be strongly affected by outliers, so we first inspect each variable with a boxplot.
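
The individual boxplots below can also be drawn in a single panel, which makes the variables easier to compare side by side; a minimal sketch:

#A compact alternative (sketch): all eight boxplots in one 2 x 4 panel
par(mfrow = c(2, 4))
for (v in names(FuelEff)) boxplot(FuelEff[[v]], main = v)
par(mfrow = c(1, 1))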

boxplot(FuelEff$MPG)

boxplot(FuelEff$GPM)

boxplot(FuelEff$WT)

boxplot(FuelEff$DIS)

boxplot(FuelEff$NC)

boxplot(FuelEff$HP)

boxplot(FuelEff$ACC)

boxplot(FuelEff$ET)

Among all the variables, only ACC indicates the presence of some extreme values. Let us replace these values with the upper benchmark Q3 + 1.5*IQR.

#Finding the quartile of ACC
quantile(FuelEff$ACC)
##     0%    25%    50%    75%   100% 
## 11.300 14.025 14.800 15.775 19.200
#The interquartile range is given by
IQR(FuelEff$ACC)
## [1] 1.75
#benchmark value = Q3 +  1.5*IQR will be
benchMark = 15.775 + 1.5*IQR(FuelEff$ACC)
benchMark
## [1] 18.4
#Detecting the extreme values:
FuelEff$ACC[FuelEff$ACC > benchMark]
## [1] 18.7 19.2
#Replacing these values with the benchmark value that we have set
FuelEff$ACC[FuelEff$ACC > benchMark] = benchMark

#Taking the final look at the distribution of ACC
boxplot(FuelEff$ACC)

The final boxplot gives a clear indication of how much influence the outliers had on the distribution of ACC.
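
If the same treatment were needed for several variables, the capping step could be wrapped in a small reusable helper (the function name cap_upper is hypothetical); a sketch:

#Hypothetical helper (sketch): cap any numeric vector at Q3 + 1.5*IQR
cap_upper <- function(x) {
  upper <- unname(quantile(x, 0.75)) + 1.5 * IQR(x)
  x[x > upper] <- upper
  x
}
#e.g. FuelEff$ACC <- cap_upper(FuelEff$ACC) would have the same effect as the steps above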

Checking for linearity assumption

The first and foremost assumption of linear regression is that the dependent variable and the independent variables should be linearly related.

#Verifying the linearity assumption through scatterplots
plot(GPM~MPG,data=FuelEff)

plot(GPM~WT,data=FuelEff)

plot(GPM~DIS,data=FuelEff)

plot(GPM~NC,data=FuelEff)

plot(GPM~HP,data=FuelEff)

plot(GPM~ACC,data=FuelEff)

plot(GPM~ET,data=FuelEff)

All the predictors are roughly linearly related to the response.
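
The same check can be made in a single display with a scatterplot matrix; a minimal sketch:

#Scatterplot matrix (sketch): the same linearity check in one display
pairs(~ GPM + WT + DIS + NC + HP + ACC + ET, data = FuelEff)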

Model Building

We will not be using the variable MPG in our models, since GPM is the response. Therefore, before we start to build our model, let us remove this variable from the data set.

#Deleting the variable MPG from the dataset
FuelEff=FuelEff[-1]
head(FuelEff)
##     GPM    WT DIS NC  HP  ACC ET
## 1 5.917 4.360 350  8 155 14.9  1
## 2 6.452 4.054 351  8 142 14.3  1
## 3 5.208 3.605 267  8 125 15.0  1
## 4 5.405 3.940 360  8 150 13.0  1
## 5 3.333 2.155  98  4  68 16.5  0
## 6 3.636 2.560 134  4  95 14.2  0

Building Regression Model on all the variables

## regression on all data
m1=lm(GPM~.,data=FuelEff)
summary(m1)
## 
## Call:
## lm(formula = GPM ~ ., data = FuelEff)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48991 -0.25461  0.03753  0.18524  0.64504 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.688441   0.688524  -3.905 0.000476 ***
## WT           0.753110   0.456856   1.648 0.109358    
## DIS         -0.004745   0.002706  -1.753 0.089404 .  
## NC           0.444900   0.122271   3.639 0.000986 ***
## HP           0.024040   0.006789   3.541 0.001284 ** 
## ACC          0.076643   0.047143   1.626 0.114128    
## ET          -0.961089   0.265928  -3.614 0.001054 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.312 on 31 degrees of freedom
## Multiple R-squared:  0.939,  Adjusted R-squared:  0.9272 
## F-statistic:  79.5 on 6 and 31 DF,  p-value: < 2.2e-16

The resulting model is:

GPM = -2.688 + 0.753 WT - 0.005 DIS + 0.445 NC + 0.024 HP + 0.077 ACC - 0.961 ET

Note: ET should be converted to a factor rather than left as an integer variable.
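
A minimal sketch of that conversion is shown below, but it is not run here so that the numeric correlation matrix computed next still works; for a two-level 0/1 variable the fitted regression is numerically identical either way.

#Sketch only (not executed in this analysis):
#FuelEff$ET <- factor(FuelEff$ET)      # 0/1 becomes a two-level factor
#m1 <- lm(GPM ~ ., data = FuelEff)     # the fit is unchanged for a 0/1 dummy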

The model results in high values of multiple and adjusted R-squared, and yet several individual coefficients (WT, DIS and ACC) are not significant at the 5% level. This pattern suggests the presence of multicollinearity in the data, i.e. the predictor variables in this model are themselves related. For example, one can expect that a heavy car has a large engine and high horsepower. Let us calculate the pairwise correlations among all the predictor variables.

#Pairwise correlation among all the predictor variables
cor(FuelEff[,2:7])
##              WT        DIS         NC         HP         ACC         ET
## WT   1.00000000  0.9507647  0.9166777  0.9172204 -0.02377844  0.6673661
## DIS  0.95076469  1.0000000  0.9402812  0.8717993 -0.13855958  0.7746636
## NC   0.91667774  0.9402812  1.0000000  0.8638473 -0.12524272  0.8311721
## HP   0.91722045  0.8717993  0.8638473  1.0000000 -0.24701158  0.7202350
## ACC -0.02377844 -0.1385596 -0.1252427 -0.2470116  1.00000000 -0.3134769
## ET   0.66736606  0.7746636  0.8311721  0.7202350 -0.31347692  1.0000000

The correlation matrix shows very strong associations among several of the predictors (for example, the correlation between WT and DIS is 0.95). Multicollinearity of this magnitude can affect the analysis considerably and therefore needs to be addressed.

Best subset Regression

Before doing any correction, let us compute all possible regressions with their R-squared and adjusted R-squared and see how the various models perform. These calculations can be carried out with the function regsubsets in the R package leaps.

## best subset regression in R
library(leaps)  
X=FuelEff[,2:7]
y=FuelEff[,1]
out=summary(regsubsets(X,y,nbest=2,nvmax=ncol(X)))
tab=cbind(out$which,out$rsq,out$adjr2)
tab
##   (Intercept) WT DIS NC HP ACC ET                    
## 1           1  1   0  0  0   0  0 0.8579697 0.8540244
## 1           1  0   0  0  1   0  0 0.7880098 0.7821212
## 2           1  1   1  0  0   0  0 0.8926952 0.8865635
## 2           1  1   0  0  0   0  1 0.8751262 0.8679906
## 3           1  0   0  1  1   0  1 0.9145736 0.9070360
## 3           1  1   1  1  0   0  0 0.9028083 0.8942326
## 4           1  0   0  1  1   1  1 0.9323970 0.9242027
## 4           1  1   0  1  1   0  1 0.9204005 0.9107520
## 5           1  1   1  1  1   0  1 0.9337702 0.9234218
## 5           1  0   1  1  1   1  1 0.9336238 0.9232525
## 6           1  1   1  1  1   1  1 0.9389733 0.9271617

The resulting table shows the trade-off between model size and model fit. The model with just one regressor, the weight of the automobile, already achieves an R-squared of 85.8%. This single variable explains most of the variability in fuel consumption. The model is also easy to explain, as the fuel needed to travel a certain distance must be related to the weight of the object being pushed forward.
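
One way to visualise this trade-off, using the out object created above, is to plot the adjusted R-squared of each candidate model against its number of predictors; a sketch:

#Sketch: size/fit trade-off from the regsubsets summary
plot(rowSums(out$which) - 1, out$adjr2,
     xlab = "Number of predictors", ylab = "Adjusted R-squared")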

Modelling Fuel Efficiency with only weight as predictor

#Model with only Weight as predictor
m2=lm(GPM~WT,data=FuelEff)
summary(m2)
## 
## Call:
## lm(formula = GPM ~ WT, data = FuelEff)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88072 -0.29041  0.00659  0.19021  1.13164 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006101   0.302681   -0.02    0.984    
## WT           1.514798   0.102721   14.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4417 on 36 degrees of freedom
## Multiple R-squared:  0.858,  Adjusted R-squared:  0.854 
## F-statistic: 217.5 on 1 and 36 DF,  p-value: < 2.2e-16

The resulting model is:

GPM = -0.006 + 1.515 WT

As discussed, this model performs very well in terms of both R-squared and the significance of the coefficient. One option is therefore to discard all other variables and keep only WT. However, if we want to include other variables, it is important to treat the multicollinearity first and then build a multiple linear regression model.
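
A formal way to ask whether the remaining five predictors add anything beyond weight is a nested-model F test (a sketch, with m1 and m2 as fitted above):

#Sketch: nested-model F test; m2 (weight only) is nested in m1 (all predictors)
anova(m2, m1)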

Treating Multicollinearity

We will use the vif() function from the library car to detect and treat multicollinearity. We start with a model containing all the variables and compute the VIF of each variable; we then drop the variable with the highest VIF and refit the model with the remaining variables. This process continues until every variable has a VIF of less than 5 (this benchmark is not rigid).

#Treating Multicollinearity using VIF
library(car)

vif(lm(GPM~.,data=FuelEff))
##        WT       DIS        NC        HP       ACC        ET 
## 39.642747 21.989048 14.603382 12.252858  1.864045  5.678396
vif(lm(GPM~ DIS + NC + HP + ACC + ET,data=FuelEff)) #Drop WT from the model
##       DIS        NC        HP       ACC        ET 
##  9.836248 12.716555  4.886803  1.302502  3.828254
vif(lm(GPM~ DIS + HP + ACC + ET,data=FuelEff)) #Drop NC from the model
##      DIS       HP      ACC       ET 
## 5.564249 4.488758 1.205819 2.816388

Removing the variables WT and NC has lowered the level of multicollinearity. (Note that the idea here is to reduce the multicollinearity, not to make the correlations among the variables zero.)

#Removing the variable WT & NC from the dataset and storing the result in the data frame Fuel

Fuel = FuelEff[,-c(2,4)]
head(Fuel)
##     GPM DIS  HP  ACC ET
## 1 5.917 350 155 14.9  1
## 2 6.452 351 142 14.3  1
## 3 5.208 267 125 15.0  1
## 4 5.405 360 150 13.0  1
## 5 3.333  98  68 16.5  0
## 6 3.636 134  95 14.2  0

Best subset Regression with the new data Fuel

## best subset regression in R
X=Fuel[,2:5]
y=FuelEff[,1]
out=summary(regsubsets(X,y,nbest=2,nvmax=ncol(X)))
tab=cbind(out$which,out$rsq,out$adjr2)
tab
##   (Intercept) DIS HP ACC ET                    
## 1           1   0  1   0  0 0.7880098 0.7821212
## 1           1   1  0   0  0 0.6771806 0.6682134
## 2           1   0  1   1  0 0.8622486 0.8543771
## 2           1   0  1   0  1 0.8173060 0.8068664
## 3           1   0  1   1  1 0.8763265 0.8654141
## 3           1   1  1   1  0 0.8654669 0.8535963
## 4           1   1  1   1  1 0.8931507 0.8801993

From the result we can see that regressing GPM on DIS, HP, ACC and ET gives the highest adjusted R-squared.

Modelling Fuel Efficiency with only DIS, HP, ACC & ET as predictors

#Modelling Fuel Efficiency with only DIS, HP, ACC & ET as predictor
m3 = lm(GPM~ DIS + HP + ACC + ET,data=FuelEff)
summary(m3)
## 
## Call:
## lm(formula = GPM ~ DIS + HP + ACC + ET, data = FuelEff)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72153 -0.26220  0.09793  0.31936  0.72120 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.555685   0.851066  -3.003  0.00507 ** 
## DIS          0.003980   0.001746   2.279  0.02923 *  
## HP           0.038266   0.005270   7.261 2.49e-08 ***
## ACC          0.167980   0.048628   3.454  0.00153 ** 
## ET          -0.702315   0.240186  -2.924  0.00620 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4001 on 33 degrees of freedom
## Multiple R-squared:  0.8932, Adjusted R-squared:  0.8802 
## F-statistic: 68.96 on 4 and 33 DF,  p-value: 1.485e-15

Checking the normality assumption

It is important that we also check the assumption of normality of the residuals. With a sample size of only 38 we cannot expect the residuals to look exactly normal; however, if the histogram of the errors is nearly symmetric, we will treat the normality assumption as reasonable.

hist(m1$residuals)

hist(m2$residuals)

hist(m3$residuals)

Roughly, all the histograms are nearly symmetric, and so we retain the normality assumption.
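
A normal Q-Q plot is another standard check that could complement the histograms; a sketch for model 3:

#Sketch: normal Q-Q plot of the residuals of model 3
qqnorm(m3$residuals)
qqline(m3$residuals)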

Checking the Homoscedasticity assumption

First, let us calculate the predicted values for each of the models on the training set.

#predicted value for model 1 on the training data
pred1 = predict(m1, newdata = FuelEff[,-1])

#predicted value for model 2 on the training data
pred2 = predict(m2, newdata = FuelEff[,-1])

#predicted value for model 3 on the training data
pred3 = predict(m3, newdata = FuelEff[,-1])

Now let us draw the residual vs. predicted plots.

#Residual Vs. predicted values for model 1
plot(pred1, m1$residuals)
abline(h=0)

#Residual Vs. predicted values for model 2
plot(pred2, m2$residuals)
abline(h=0)

#Residual Vs. predicted values for model 3
plot(pred3, m3$residuals)
abline(h=0)

In all the above plots the residuals are scattered randomly around zero across the range of predicted values. There is no indication of any kind of cone-shaped pattern and hence no sign of heteroscedasticity, so we assume homoscedasticity.
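
R's built-in regression diagnostics produce the same residuals-versus-fitted display directly; for example:

#Sketch: built-in residuals-vs-fitted diagnostic plot for model 3
plot(m3, which = 1)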

Testing for auto-correlation of errors

The presence of auto-correlation can be tested using the Durbin-Watson test. The hypotheses for the test are

H0: the errors are not auto-correlated vs. H1: the errors are auto-correlated.

The test statistic is D = sum over t of (e_t - e_(t-1))^2 divided by the sum of e_t^2, where e_t denotes the residuals. D can assume values between 0 and 4, and values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < D < 2.5 show that there is no auto-correlation in the data, while values below 1.5 or above 2.5 are a cause for concern.

#Durbin-Watson test statistic

DWatson = function(model,data)
{
  num = 0
  #numerator: sum of squared successive differences of the residuals
  for(i in 2:nrow(data))
  {
    num = num + (model$residuals[i-1] - model$residuals[i])^2
  }
  #denominator: sum of squared residuals
  denom = sum((model$residuals)^2)
  dw = num/denom
  return(dw)
}


#Model 1
DWatson(m1,FuelEff)
##       1 
## 2.03187
#Model 2
DWatson(m2,FuelEff)
##        1 
## 1.321797
#Model 3
DWatson(m3,FuelEff)
##        1 
## 1.891336

The Durbin-Watson statistic lies between 1.5 and 2.5 for all three models, which indicates that there is no auto-correlation present in the errors.
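
As a cross-check of the hand-computed statistic, the car package (already loaded above for vif()) provides a Durbin-Watson test that also reports a p-value; a sketch:

#Sketch: formal Durbin-Watson test from the car package
durbinWatsonTest(m1)
durbinWatsonTest(m2)
durbinWatsonTest(m3)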

Cross-Validation

So far we have come across three different models: m1, m2 and m3. It is necessary to check the performance of each model on data not used for fitting. We are going to use leave-one-out cross-validation and calculate the mean error, RMSE and MAPE for all three models. This will help us compare the models.

Cross-validation (leave one out method) removes one case from the data set of n cases, fits the model to the reduced data set, and predicts the response of that one case that has been removed from the estimation. This is repeated for each of the n cases. The summary statistics of the n genuine out-of-sample prediction errors (mean error, root mean square error, mean absolute percent error) help us assess the out-of-sample prediction performance. Cross-validation is very informative as it evaluates the model on new data.
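
The three leave-one-out loops below share the same structure; as a sketch, they could be wrapped in a small helper (the function name loocv_stats is hypothetical) that takes a model formula and returns the three summary statistics:

#Hypothetical helper (sketch) wrapping the leave-one-out loop used below
loocv_stats <- function(formula, data, response = "GPM") {
  n    <- nrow(data)
  err  <- numeric(n)
  perc <- numeric(n)
  for (k in 1:n) {
    fit     <- lm(formula, data = data[-k, ])                  #fit without case k
    pred    <- predict(fit, newdata = data[k, , drop = FALSE]) #predict the held-out case
    obs     <- data[[response]][k]
    err[k]  <- obs - pred
    perc[k] <- abs(err[k]) / obs
  }
  c(me = mean(err), rmse = sqrt(mean(err^2)), mape = 100 * mean(perc))
}
#e.g. loocv_stats(GPM ~ ., FuelEff) should reproduce the statistics for model 1 below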

Cross Validation for the model on all six regressors

#CROSS-VALIDATION:
## cross-validation (leave one out) for the model on all six regressors
n=length(FuelEff$GPM)
diff=numeric(n)
percdiff=numeric(n)
for (k in 1:n) {
  train1=c(1:n)
  train=train1[train1!=k]
  ## the R expression "train1[train1!=k]" picks from train1 those 
  ## elements that are different from k and stores those elements in the
  ## object train. 
  ## For k=1, train consists of elements that are different from 1; that 
  ## is 2, 3, ..., n.
  m1=lm(GPM~.,data=FuelEff[train,])          #The proposed Model
  pred=predict(m1,newdata=FuelEff[-train,])  #Making prediction using the model
  obs=FuelEff$GPM[-train]                    #The observed value
  diff[k]=obs-pred                           #The ERROR = Obs. value - pred. value
  percdiff[k]=abs(diff[k])/obs               #percentage difference
}


me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me   # mean error
## [1] -0.00311191
rmse # root mean square error (RMSE)
## [1] 0.3484421
mape # mean absolute percent error (MAPE)
## [1] 6.681503

Cross Validation for the model on weight only

## cross-validation (leave one out) for the model on weight only
n=length(FuelEff$GPM)
diff=numeric(n)
percdiff=numeric(n)
for (k in 1:n) {
  train1=c(1:n)
  train=train1[train1!=k]
  m2=lm(GPM~WT,data=FuelEff[train,])
  pred=predict(m2,newdata=FuelEff[-train,])
  obs=FuelEff$GPM[-train]
  diff[k]=obs-pred
  percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me   # mean error
## [1] -0.002960953
rmse # root mean square error
## [1] 0.4517422
mape # mean absolute percent error
## [1] 8.232507

Cross Validation for the model with DIS, HP, ACC and ET as predictors

## cross-validation (leave one out) for the model 3
n=length(FuelEff$GPM)
diff=numeric(n)
percdiff=numeric(n)
for (k in 1:n) {
  train1=c(1:n)
  train=train1[train1!=k]
  m3 = lm(GPM~ DIS + HP + ACC + ET,data=FuelEff[train,])
  pred=predict(m3,newdata=FuelEff[-train,])
  obs=FuelEff$GPM[-train]
  diff[k]=obs-pred
  percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me   # mean error
## [1] 7.012006e-16
rmse # root mean square error
## [1] 0.3728664
mape # mean absolute percent error
## [1] 7.485257

Based on RMSE and MAPE, model 1 with all six predictors gives the best out-of-sample performance, followed by model 3 with four predictors; the weight-only model 2 has the largest errors. Model 3 remains attractive, however, because it comes close to model 1 while avoiding the severe multicollinearity among the full set of predictors.

References:
Abraham, B. & Ledolter, J. (2006), Introduction to Regression Modeling.
Ledolter, J., Data Mining and Business Analytics with R.
Data source: http://www.biz.uiowa.edu/faculty/jledolter/DataMining/datatext.html