The following data set contains data on the fuel efficiency of 38 cars (taken from Abraham & Ledolter, 2006). The data set has 38 observations and 8 variables.
Variable Description:
- MPG: fuel efficiency, in miles per gallon
- GPM: gallons per 100 miles, the reciprocal transformation GPM = 100/MPG
- WT: weight of the car
- DIS: engine displacement
- NC: number of cylinders
- HP: horsepower
- ACC: acceleration
- ET: engine type (binary: 1 or 0)
#Reading the data
FuelEff <- read.csv("C:/Users/Gourab/Desktop/Dataset - Data Mining/FuelEfficiency.csv")
head(FuelEff)
## MPG GPM WT DIS NC HP ACC ET
## 1 16.9 5.917 4.360 350 8 155 14.9 1
## 2 15.5 6.452 4.054 351 8 142 14.3 1
## 3 19.2 5.208 3.605 267 8 125 15.0 1
## 4 18.5 5.405 3.940 360 8 150 13.0 1
## 5 30.0 3.333 2.155 98 4 68 16.5 0
## 6 27.5 3.636 2.560 134 4 95 14.2 0
#Summary and Structure of the data
str(FuelEff)
## 'data.frame': 38 obs. of 8 variables:
## $ MPG: num 16.9 15.5 19.2 18.5 30 27.5 27.2 30.9 20.3 17 ...
## $ GPM: num 5.92 6.45 5.21 5.41 3.33 ...
## $ WT : num 4.36 4.05 3.6 3.94 2.15 ...
## $ DIS: int 350 351 267 360 98 134 119 105 131 163 ...
## $ NC : int 8 8 8 8 4 4 4 4 5 6 ...
## $ HP : int 155 142 125 150 68 95 97 75 103 125 ...
## $ ACC: num 14.9 14.3 15 13 16.5 14.2 14.7 14.5 15.9 13.6 ...
## $ ET : int 1 1 1 1 0 0 0 0 0 0 ...
summary(FuelEff)
## MPG GPM WT DIS
## Min. :15.50 Min. :2.681 Min. :1.915 Min. : 85.0
## 1st Qu.:18.52 1st Qu.:3.292 1st Qu.:2.208 1st Qu.:105.0
## Median :24.25 Median :4.160 Median :2.685 Median :148.5
## Mean :24.76 Mean :4.331 Mean :2.863 Mean :177.3
## 3rd Qu.:30.38 3rd Qu.:5.398 3rd Qu.:3.410 3rd Qu.:229.5
## Max. :37.30 Max. :6.452 Max. :4.360 Max. :360.0
## NC HP ACC ET
## Min. :4.000 Min. : 65.0 Min. :11.30 Min. :0.0000
## 1st Qu.:4.000 1st Qu.: 78.5 1st Qu.:14.03 1st Qu.:0.0000
## Median :4.500 Median :100.0 Median :14.80 Median :0.0000
## Mean :5.395 Mean :101.7 Mean :14.86 Mean :0.2895
## 3rd Qu.:6.000 3rd Qu.:123.8 3rd Qu.:15.78 3rd Qu.:1.0000
## Max. :8.000 Max. :155.0 Max. :19.20 Max. :1.0000
In this data set we have:
- 4 numeric variables: MPG, GPM, WT & ACC
- 4 integer variables: DIS, NC, HP & ET (ET is in fact a binary indicator)
The summary does not indicate the presence of missing values in any of the variables. We will model fuel efficiency, measured in GPM, as a function of weight, displacement, number of cylinders, horsepower, acceleration and engine type. We analyse GPM instead of the usual fuel efficiency measure MPG because the reciprocal transformation GPM = 100/MPG leads to an approximately linear relationship between the response and the predictors.
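A quick sanity check of this relationship, using only the columns already shown above:
#GPM in the file is the reciprocal transformation of MPG (rounded to 3 decimals)
head(round(100/FuelEff$MPG, 3)) #compare with head(FuelEff$GPM)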
In this problem our objective is to model the fuel efficiency (MPG or GPM) as a function of the other variables. One natural candidate is, of course, a linear regression model. Linear regression, being a parametric model, is strongly affected by outliers, so we first inspect each variable.
boxplot(FuelEff$MPG)
boxplot(FuelEff$GPM)
boxplot(FuelEff$WT)
boxplot(FuelEff$DIS)
boxplot(FuelEff$NC)
boxplot(FuelEff$HP)
boxplot(FuelEff$ACC)
boxplot(FuelEff$ET)
Among all the variables, ACC indicates the presence of some extreme values. Let us replace these values with the benchmark value Q3 + 1.5*IQR.
#Finding the quartile of ACC
quantile(FuelEff$ACC)
## 0% 25% 50% 75% 100%
## 11.300 14.025 14.800 15.775 19.200
#The interquartile range is given by
IQR(FuelEff$ACC)
## [1] 1.75
#benchmark value = Q3 + 1.5*IQR
benchMark = unname(quantile(FuelEff$ACC, 0.75)) + 1.5*IQR(FuelEff$ACC)
benchMark
## [1] 18.4
#Detecting the extreme values:
FuelEff$ACC[FuelEff$ACC > benchMark]
## [1] 18.7 19.2
#Replacing these values with the benchmark value that we have set
FuelEff$ACC[FuelEff$ACC > benchMark] = benchMark
#Taking the final look at the distribution of ACC
boxplot(FuelEff$ACC)
The final boxplot clearly shows how much influence the two outliers had on the distribution of ACC.
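The same capping rule can be packaged in a small helper; this is a sketch, and the name capUpper is introduced here purely for illustration:
#Hypothetical helper: cap values above Q3 + 1.5*IQR at that benchmark
capUpper = function(x) {
  bench = unname(quantile(x, 0.75)) + 1.5*IQR(x)
  x[x > bench] = bench
  x
}
#capUpper(FuelEff$ACC) reproduces the replacement carried out above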
The first and foremost assumption of linear regression is that the dependent variable and the independent variables are linearly related.
#Verifying the linearity assumption through scatterplots
plot(GPM~MPG,data=FuelEff)
plot(GPM~WT,data=FuelEff)
plot(GPM~DIS,data=FuelEff)
plot(GPM~NC,data=FuelEff)
plot(GPM~HP,data=FuelEff)
plot(GPM~ACC,data=FuelEff)
plot(GPM~ET,data=FuelEff)
All the predictors appear to be nearly linearly related to the response.
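The same check can be produced in a single display rather than seven separate calls (an equivalent alternative):
#All pairwise scatterplots at once
pairs(FuelEff)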
We will not be using the variable MPG any further. Therefore, before we start building our model, let us remove it from the data set.
#Deleting the variable MPG from the dataset
FuelEff=FuelEff[-1]
head(FuelEff)
## GPM WT DIS NC HP ACC ET
## 1 5.917 4.360 350 8 155 14.9 1
## 2 6.452 4.054 351 8 142 14.3 1
## 3 5.208 3.605 267 8 125 15.0 1
## 4 5.405 3.940 360 8 150 13.0 1
## 5 3.333 2.155 98 4 68 16.5 0
## 6 3.636 2.560 134 4 95 14.2 0
## regression on all data
m1=lm(GPM~.,data=FuelEff)
summary(m1)
##
## Call:
## lm(formula = GPM ~ ., data = FuelEff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48991 -0.25461 0.03753 0.18524 0.64504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.688441 0.688524 -3.905 0.000476 ***
## WT 0.753110 0.456856 1.648 0.109358
## DIS -0.004745 0.002706 -1.753 0.089404 .
## NC 0.444900 0.122271 3.639 0.000986 ***
## HP 0.024040 0.006789 3.541 0.001284 **
## ACC 0.076643 0.047143 1.626 0.114128
## ET -0.961089 0.265928 -3.614 0.001054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.312 on 31 degrees of freedom
## Multiple R-squared: 0.939, Adjusted R-squared: 0.9272
## F-statistic: 79.5 on 6 and 31 DF, p-value: < 2.2e-16
The resulting model is:
GPM = -2.688 + 0.753 WT - 0.005 DIS + 0.445 NC + 0.024 HP + 0.077 ACC - 0.961 ET
Note: strictly speaking, ET should be converted to a factor and not left as an integer variable.
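The conversion is one line; we leave ET numeric here because the cor() call below requires numeric columns, and for a 0/1 variable the fitted coefficients are identical either way:
#Optional conversion (not applied here)
#FuelEff$ET = factor(FuelEff$ET)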
The model has high multiple and adjusted R-squared values, yet several predictors (WT, DIS, ACC) are individually insignificant. This is a classic symptom of multicollinearity, i.e. the predictor variables in this model are themselves related. For example, one can expect that a car with a large weight has a large engine size and large horsepower. Let us calculate the pairwise correlations among all the predictor variables.
#Pairwise correlation among all the predictor variables
cor(FuelEff[,2:7])
## WT DIS NC HP ACC ET
## WT 1.00000000 0.9507647 0.9166777 0.9172204 -0.02377844 0.6673661
## DIS 0.95076469 1.0000000 0.9402812 0.8717993 -0.13855958 0.7746636
## NC 0.91667774 0.9402812 1.0000000 0.8638473 -0.12524272 0.8311721
## HP 0.91722045 0.8717993 0.8638473 1.0000000 -0.24701158 0.7202350
## ACC -0.02377844 -0.1385596 -0.1252427 -0.2470116 1.00000000 -0.3134769
## ET 0.66736606 0.7746636 0.8311721 0.7202350 -0.31347692 1.0000000
The correlation matrix shows very strong associations among several predictors: WT, DIS, NC and HP are all pairwise correlated above 0.86. The presence of multicollinearity of this degree may affect the analysis to a great extent and therefore needs to be corrected.
Before doing any correction, let us fit all possible regressions, with their R-square and adjusted R-square, and see how the various models perform. These calculations can be carried out using the function regsubsets in the R package leaps.
## best subset regression in R
library(leaps)
X=FuelEff[,2:7]
y=FuelEff[,1]
out=summary(regsubsets(X,y,nbest=2,nvmax=ncol(X)))
tab=cbind(out$which,out$rsq,out$adjr2) #last two columns: R-squared, adjusted R-squared
tab
## (Intercept) WT DIS NC HP ACC ET
## 1 1 1 0 0 0 0 0 0.8579697 0.8540244
## 1 1 0 0 0 1 0 0 0.7880098 0.7821212
## 2 1 1 1 0 0 0 0 0.8926952 0.8865635
## 2 1 1 0 0 0 0 1 0.8751262 0.8679906
## 3 1 0 0 1 1 0 1 0.9145736 0.9070360
## 3 1 1 1 1 0 0 0 0.9028083 0.8942326
## 4 1 0 0 1 1 1 1 0.9323970 0.9242027
## 4 1 1 0 1 1 0 1 0.9204005 0.9107520
## 5 1 1 1 1 1 0 1 0.9337702 0.9234218
## 5 1 0 1 1 1 1 1 0.9336238 0.9232525
## 6 1 1 1 1 1 1 1 0.9389733 0.9271617
The resulting table shows the trade-off between model size and model fit. The model with just one regressor, the weight of the automobile, already achieves an R-square of 0.858: this single variable explains most of the variability in fuel efficiency. The model is also easy to explain, as the fuel needed to travel a certain distance must be related to the weight of the object being pushed forward.
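The leaps package also offers a plot method for regsubsets objects that shows this trade-off graphically (an optional sketch reusing the call from above):
#Visualize the subsets, shaded by adjusted R-square
plot(regsubsets(X, y, nbest=2, nvmax=ncol(X)), scale="adjr2")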
#Model with only Weight as predictor
m2=lm(GPM~WT,data=FuelEff)
summary(m2)
##
## Call:
## lm(formula = GPM ~ WT, data = FuelEff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88072 -0.29041 0.00659 0.19021 1.13164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006101 0.302681 -0.02 0.984
## WT 1.514798 0.102721 14.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4417 on 36 degrees of freedom
## Multiple R-squared: 0.858, Adjusted R-squared: 0.854
## F-statistic: 217.5 on 1 and 36 DF, p-value: < 2.2e-16
The resulting model is:
GPM = -0.006 + 1.515 WT
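As a quick usage illustration (the weight value 3 is chosen arbitrarily, not taken from the data), the model predicts about -0.006 + 1.515*3 = 4.54 gallons per 100 miles:
#Illustrative prediction at WT = 3
predict(m2, newdata = data.frame(WT = 3))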
As discussed, this model performs very well on both R-square and p-value. One option is simply to discard all other variables and keep only WT. However, if we want to include other variables, it is important that we first treat the multicollinearity and only then build a multiple linear regression model.
We will use the vif() function from the package car to detect and treat multicollinearity. The VIF of a predictor is 1/(1 - R-square), where the R-square comes from regressing that predictor on the remaining ones. We start with a model containing all the variables, check the VIF of each variable, drop the variable with the highest VIF, and refit the model with the remaining variables. This process continues until every variable has a VIF below 5. (This benchmark is not rigid.)
#Treating Multicollinearity using VIF
library(car)
vif(lm(GPM~.,data=FuelEff))
## WT DIS NC HP ACC ET
## 39.642747 21.989048 14.603382 12.252858 1.864045 5.678396
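A quick manual check of this definition (a sketch) reproduces the vif() value for WT:
#Manual VIF for WT: regress WT on the remaining predictors
r2 = summary(lm(WT ~ DIS + NC + HP + ACC + ET, data=FuelEff))$r.squared
1/(1 - r2) #matches the 39.64 reported by vif()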
vif(lm(GPM~ DIS + NC + HP + ACC + ET,data=FuelEff)) #Drop WT from the model
## DIS NC HP ACC ET
## 9.836248 12.716555 4.886803 1.302502 3.828254
vif(lm(GPM~ DIS + HP + ACC + ET,data=FuelEff)) #Drop NC from the model
## DIS HP ACC ET
## 5.564249 4.488758 1.205819 2.816388
Removing the variables WT and NC has lowered the level of multicollinearity; DIS is still marginally above 5, which we tolerate given that the benchmark is not rigid. [Note that the idea here is to reduce the multicollinearity, not to make the correlations among the variables zero.]
#Removing the variable WT & NC from the dataset and storing the result in the data frame Fuel
Fuel = FuelEff[,-c(2,4)]
head(Fuel)
## GPM DIS HP ACC ET
## 1 5.917 350 155 14.9 1
## 2 6.452 351 142 14.3 1
## 3 5.208 267 125 15.0 1
## 4 5.405 360 150 13.0 1
## 5 3.333 98 68 16.5 0
## 6 3.636 134 95 14.2 0
## best subset regression on the reduced data set
X=Fuel[,2:5]
y=Fuel[,1]
out=summary(regsubsets(X,y,nbest=2,nvmax=ncol(X)))
tab=cbind(out$which,out$rsq,out$adjr2)
tab
## (Intercept) DIS HP ACC ET
## 1 1 0 1 0 0 0.7880098 0.7821212
## 1 1 1 0 0 0 0.6771806 0.6682134
## 2 1 0 1 1 0 0.8622486 0.8543771
## 2 1 0 1 0 1 0.8173060 0.8068664
## 3 1 0 1 1 1 0.8763265 0.8654141
## 3 1 1 1 1 0 0.8654669 0.8535963
## 4 1 1 1 1 1 0.8931507 0.8801993
From the result we can see that the model regressing GPM on DIS, HP, ACC and ET gives the highest adjusted R-square (0.880).
#Modelling fuel efficiency with only DIS, HP, ACC & ET as predictors
m3 = lm(GPM~ DIS + HP + ACC + ET,data=FuelEff)
summary(m3)
##
## Call:
## lm(formula = GPM ~ DIS + HP + ACC + ET, data = FuelEff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72153 -0.26220 0.09793 0.31936 0.72120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.555685 0.851066 -3.003 0.00507 **
## DIS 0.003980 0.001746 2.279 0.02923 *
## HP 0.038266 0.005270 7.261 2.49e-08 ***
## ACC 0.167980 0.048628 3.454 0.00153 **
## ET -0.702315 0.240186 -2.924 0.00620 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4001 on 33 degrees of freedom
## Multiple R-squared: 0.8932, Adjusted R-squared: 0.8802
## F-statistic: 68.96 on 4 and 33 DF, p-value: 1.485e-15
It is also important to check the assumption of normality of the residuals. With a sample size of only 38, an exactly normal distribution cannot be expected; however, if the histogram of the residuals is nearly symmetric we will treat the normality assumption as satisfied.
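#Histograms of the residuals for models m1, m2 and m3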
hist(m1$residuals)
hist(m2$residuals)
hist(m3$residuals)
Roughly, all three histograms are nearly symmetric, and so we retain the normality assumption.
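A normal Q-Q plot gives a sharper check than a histogram; a sketch for model 3:
#Q-Q plot of the model 3 residuals
qqnorm(m3$residuals)
qqline(m3$residuals)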
Next we check for heteroscedasticity. First let us calculate the predicted values for each of the models on the training set.
#predicted value for model 1 on the training data
pred1 = predict(m1, newdata = FuelEff[,-1])
#predicted value for model 2 on the training data
pred2 = predict(m2, newdata = FuelEff[,-1])
#predicted value for model 3 on the training data
pred3 = predict(m3, newdata = FuelEff[,-1])
Now let us draw the residual vs. predicted plots.
#Residual Vs. predicted values for model 1
plot(pred1, m1$residuals)
abline(h=0)
#Residual Vs. predicted values for model 2
plot(pred2, m2$residuals)
abline(h=0)
#Residual Vs. predicted values for model 3
plot(pred3, m3$residuals)
abline(h=0)
For all the above plots the residuals are randomly scattered against the predicted values. There is no indication of any kind of cone-shaped pattern, and hence no sign of heteroscedasticity; we therefore assume homoscedasticity.
The presence of autocorrelation can be tested using the Durbin-Watson test. The hypotheses for the test are H0: the residuals are not autocorrelated, against H1: the residuals are autocorrelated. The test statistic is D = sum over t = 2..n of (e_t - e_(t-1))^2, divided by the sum over t = 1..n of e_t^2, where e_t denotes the residuals. D can take values between 0 and 4, and values around 2 indicate no autocorrelation. As a rule of thumb, 1.5 < D < 2.5 shows that there is no autocorrelation in the data, while D < 1.5 or D > 2.5 is a cause for concern.
#Durbin-Watson test statistic
DWatson = function(model,data)
{
  #numerator: sum of squared differences between successive residuals
  num = 0
  for(i in 2:nrow(data))
  {
    num = num + (model$residuals[i-1] - model$residuals[i])^2
  }
  #denominator: sum of squared residuals
  denom = sum((model$residuals)^2)
  dw = num/denom
  return(dw)
}
#Model 1
DWatson(m1,FuelEff)
## 1
## 2.03187
#Model 2
DWatson(m2,FuelEff)
## 1
## 1.321797
#Model 3
DWatson(m3,FuelEff)
## 1
## 1.891336
The Durbin-Watson test statistic lies between 1.5 and 2.5 for all three models, which indicates that there is no autocorrelation present.
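As a cross-check, the car package loaded earlier provides its own implementation, which also reports a p-value:
#Built-in Durbin-Watson test from the car package
durbinWatsonTest(m1)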
So far we have come across three different models: m1, m2 and m3. It is necessary to check how each model performs on unseen data. We will use leave-one-out cross-validation and calculate the mean error, RMSE and MAPE for all three models; this will let us compare them.
Cross-validation (the leave-one-out method) removes one case from the data set of n cases, fits the model to the remaining data, and predicts the response of the case that was removed from the estimation. This is repeated for each of the n cases. Summary statistics of the n genuine out-of-sample prediction errors (mean error, root mean square error, mean absolute percent error) help us assess the out-of-sample prediction performance. Cross-validation is very informative because it evaluates the model on new data.
#CROSS-VALIDATION:
## cross-validation (leave one out) for the model on all six regressors
n=length(FuelEff$GPM)
diff=numeric(n) #vector to hold the leave-one-out prediction errors
percdiff=numeric(n) #vector to hold the absolute percentage errors
for (k in 1:n) {
train1=c(1:n)
train=train1[train1!=k]
## the R expression "train1[train1!=k]" picks from train1 those
## elements that are different from k and stores those elements in the
## object train.
## For k=1, train consists of elements that are different from 1; that
## is 2, 3, ..., n.
m1=lm(GPM~.,data=FuelEff[train,]) #The proposed Model
pred=predict(m1,newdata=FuelEff[-train,]) #Making prediction using the model
obs=FuelEff$GPM[-train] #The observed value
diff[k]=obs-pred #The ERROR = Obs. value - pred. value
percdiff[k]=abs(diff[k])/obs #percentage difference
}
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me # mean error
## [1] -0.00311191
rmse # root mean square error (RMSE)
## [1] 0.3484421
mape # mean absolute percent error (MAPE)
## [1] 6.681503
## cross-validation (leave one out) for the model on weight only
n=length(FuelEff$GPM)
diff=numeric(n)
percdiff=numeric(n)
for (k in 1:n) {
train1=c(1:n)
train=train1[train1!=k]
m2=lm(GPM~WT,data=FuelEff[train,])
pred=predict(m2,newdata=FuelEff[-train,])
obs=FuelEff$GPM[-train]
diff[k]=obs-pred
percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me # mean error
## [1] -0.002960953
rmse # root mean square error
## [1] 0.4517422
mape # mean absolute percent error
## [1] 8.232507
## cross-validation (leave one out) for the model 3
n=length(FuelEff$GPM)
diff=numeric(n)
percdiff=numeric(n)
for (k in 1:n) {
train1=c(1:n)
train=train1[train1!=k]
m3 = lm(GPM~ DIS + HP + ACC + ET,data=FuelEff[train,]) #fit on the training cases only
pred=predict(m3,newdata=FuelEff[-train,])
obs=FuelEff$GPM[-train]
diff[k]=obs-pred
percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me # mean error
## [1] 7.012006e-16
rmse # root mean square error
## [1] 0.3728664
mape # mean absolute percent error
## [1] 7.485257
Based on RMSE and MAPE, model 1 with all six predictors performs best, followed by model 3 with four predictors; the weight-only model 2 performs worst. Model 3 sacrifices little accuracy while avoiding the severe multicollinearity of the full model, which makes it a reasonable compromise.
Reference: Ledolter, J. (2013), Data Mining and Business Analytics with R, Wiley.
Data Source: http://www.biz.uiowa.edu/faculty/jledolter/DataMining/datatext.html