DATA PREPROCESSING & SELECTING THE MODEL

BOSTON DATASET (displaying only 11 observations)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9
0.22489 12.5 7.87 0 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.45 15.0
Let's build a model on the Boston dataset, collected by the US Census Service and concerning housing in the area of Boston, Massachusetts.
Here we will consider only medv (median value of owner-occupied homes, in USD 1000s) and lstat (percentage of lower-status population). We model medv as depending on lstat, so 'lstat' is the explanatory/independent variable and 'medv' is the outcome/dependent variable. As we have only two variables, we use a SIMPLE LINEAR REGRESSION MODEL.
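Before anything else the data must be available in the session. A minimal sketch, assuming the Boston data frame comes from the MASS package (the original session may have loaded it differently):

# Load the Boston housing data; it ships with the MASS package.
library(MASS)
data(Boston)
# The two variables used throughout this analysis.
str(Boston[,c("lstat","medv")])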

Checking for linearity

library(ggplot2)
# Scatter plot of medv against the explanatory variable lstat, with a loess smooth.
ggplot(Boston,aes(lstat,medv))+geom_point()+geom_smooth()
`geom_smooth()` using method = 'loess'

From the plot we infer that there is a nonlinear relation between the two variables, so a straight-line fit will not do; we need to capture the nonlinearity, either by transforming a variable or by using a nonlinear regression model.

Check for outcome variable transformation

# Fit a provisional simple linear model, then estimate the Box-Cox
# transformation parameter (lambda) for the outcome variable.
dummymodel<-lm(medv~lstat,data=Boston)
library(car)
powerTransform(dummymodel)
Estimated transformation parameters 
        Y1 
0.03068753 
Here the estimated lambda is about 0.031, which is close to zero; in the Box-Cox family, lambda = 0 corresponds to a logarithmic transformation of the outcome variable.
This leaves two candidate nonlinear models (polynomial regression and regression on a log-transformed outcome). We will use polynomial regression, since it models the curvature directly while keeping medv on its original, interpretable scale.
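For reference, a minimal sketch of the logarithmic alternative suggested by lambda being near zero (logmodel is an illustrative name; this model is not pursued further below):

# Fit the log-transformed alternative, for comparison only.
logmodel<-lm(log(medv)~lstat,data=Boston)
summary(logmodel)$r.squared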

Nonlinear regression model

# Fit a 7th-order (orthogonal) polynomial in lstat to see how many orders are significant.
model<-lm(medv~poly(lstat,7),data=Boston)
summary(model)

Call:
lm(formula = medv ~ poly(lstat, 7), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3746  -3.1382  -0.7072   2.0646  26.9642 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2319  97.168  < 2e-16 ***
poly(lstat, 7)1 -152.4595     5.2164 -29.227  < 2e-16 ***
poly(lstat, 7)2   64.2272     5.2164  12.313  < 2e-16 ***
poly(lstat, 7)3  -27.0511     5.2164  -5.186 3.13e-07 ***
poly(lstat, 7)4   25.4517     5.2164   4.879 1.44e-06 ***
poly(lstat, 7)5  -19.2524     5.2164  -3.691 0.000248 ***
poly(lstat, 7)6    6.5088     5.2164   1.248 0.212708    
poly(lstat, 7)7    1.9416     5.2164   0.372 0.709888    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.216 on 498 degrees of freedom
Multiple R-squared:  0.6828,    Adjusted R-squared:  0.6783 
F-statistic: 153.1 on 7 and 498 DF,  p-value: < 2.2e-16
From the summary we see that the coefficients are significant up to the 5th order, so let's build a model up to the 5th order.
But before building it, let's use resampling methods (cross-validation and bootstrapping) to estimate how the model performs on data it has not seen, guarding against overfitting and giving a stable estimate of the error rate.

Hold-out approach

library(caret)
Warning: package 'caret' was built under R version 3.4.4
Loading required package: lattice
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
Here we split the data, assigning 70% of the observations to train1 and the remaining 30% to test1.
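A quick sanity check on the split proportions (illustrative only):

# Confirm the split is roughly 70/30.
nrow(train1)/nrow(Boston)
nrow(test1)/nrow(Boston)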

Bootstrapping method

library(caret)
method<-trainControl(method="boot",number = 100)
Here we set up the bootstrapping method: the model is fit on 100 resamples drawn with replacement from the training data and evaluated on the observations left out of each resample, so each observation plays both training and testing roles multiple times.

BUILD THE MODEL

library(caret)
library(car)
# Train the 5th-order polynomial with lm, scored by the resampling scheme above.
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2624  -3.0449  -0.6638   2.0655  27.1644 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5211     0.2759  81.627  < 2e-16 ***
`poly(lstat, 5)1` -129.6562     5.2057 -24.907  < 2e-16 ***
`poly(lstat, 5)2`   55.5696     5.2057  10.675  < 2e-16 ***
`poly(lstat, 5)3`  -20.3589     5.2057  -3.911  0.00011 ***
`poly(lstat, 5)4`   22.4564     5.2057   4.314 2.09e-05 ***
`poly(lstat, 5)5`  -12.4691     5.2057  -2.395  0.01713 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.206 on 350 degrees of freedom
Multiple R-squared:  0.6886,    Adjusted R-squared:  0.6841 
F-statistic: 154.8 on 5 and 350 DF,  p-value: < 2.2e-16
Here all the coefficients are significant and the R-squared is 0.6886, which indicates that the explanatory variable explains about 69% of the variance in the outcome variable.
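The same figure can also be read programmatically off the fitted lm object that caret stores in its finalModel slot:

# R-squared extracted straight from the underlying lm fit.
summary(model$finalModel)$r.squared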

EVALUATE THE MODEL

pred<-predict(model,train1)
library(DMwR)
Loading required package: grid
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.7249725 26.6426539  5.1616522  0.1817132 
Here we apply the model to the training data, predict the outcome variable, and check the accuracy of the model using regression evaluation metrics.
We get a mean absolute percentage error (mape) of 0.182 (i.e. the model has about 18% error), which is acceptable.
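To make the mape figure concrete, here is the same quantity computed by hand, matching the definition regr.eval uses:

# mape = mean of |actual - predicted| / actual
mean(abs(train1$medv-pred)/train1$medv)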
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 3.7136637 27.5263661  5.2465576  0.1689368 
Here we predict the outcome on the test data and check the accuracy of the model. The mape is 0.169 (i.e. about 17% error), which is also acceptable.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.96016, p-value = 0.0002532
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.1894, p-value = 0.004067
library(car)
durbinWatsonTest(pred)
[1] 0.09268972
Now let's check the normality and independence of the errors.
The Shapiro-Wilk and Anderson-Darling tests both give p-values below 0.05, so we reject the null hypothesis of normality: the values are not normally distributed.
Therefore normality is not met, but that is tolerable here.
Note that durbinWatsonTest applied to a plain numeric vector returns the Durbin-Watson statistic itself, not a p-value, so the 0.093 above should not be compared against 0.05 (a statistic near 2 indicates uncorrelated errors). Strictly, all of these checks concern the model residuals rather than the predictions.
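A minimal sketch of the same checks run on the residuals instead, assuming caret's stored lm fit in model$finalModel (res is an illustrative name):

# The regression assumptions concern the residuals, not the predictions.
res<-residuals(model$finalModel)
shapiro.test(res)                    # normality of the residuals
durbinWatsonTest(model$finalModel)   # DW test with a p-value; statistic near 2 => independent errors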

Reshuffle the data

set.seed(123)
Here we fix the random seed: set.seed does not itself reshuffle anything, but it makes the next random split reproducible while still producing a different split than the first one, letting us check how well the model performs on different groupings of the data.
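A short sketch of what fixing the seed buys us (idx1 and idx2 are illustrative names):

# The same seed before the same call yields an identical partition.
set.seed(123)
idx1<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
set.seed(123)
idx2<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
identical(idx1,idx2)  # TRUE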
Now we will build and test the model by using different resampling methods.

BUILDING AND TESTING A MODEL USING K-FOLD CROSS VALIDATION TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="cv",number = 5)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6939  -2.9651  -0.7246   2.1306  26.5039 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5374     0.2698  83.537  < 2e-16 ***
`poly(lstat, 5)1` -125.5151     5.0904 -24.657  < 2e-16 ***
`poly(lstat, 5)2`   53.5415     5.0904  10.518  < 2e-16 ***
`poly(lstat, 5)3`  -23.3540     5.0904  -4.588 6.25e-06 ***
`poly(lstat, 5)4`   20.8914     5.0904   4.104 5.05e-05 ***
`poly(lstat, 5)5`  -16.7496     5.0904  -3.290   0.0011 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.09 on 350 degrees of freedom
Multiple R-squared:  0.6868,    Adjusted R-squared:  0.6823 
F-statistic: 153.5 on 5 and 350 DF,  p-value: < 2.2e-16
Here all coefficients are significant and the R-squared is 0.6868 (i.e. about 69%).
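caret also stores the cross-validated performance averaged over the 5 folds, which is a more honest estimate of out-of-sample error than the in-sample summary above:

# Resampled performance metrics averaged over the folds.
model$results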

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.6783894 25.4753014  5.0473064  0.1770015 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
      mae       mse      rmse      mape 
 3.789995 30.447014  5.517881  0.182509 
Here we have a mape on the train data of 0.177 (i.e. about 18%) and on the test data of 0.183 (i.e. about 18%), which is consistent.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.95577, p-value = 0.0001019
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.4235, p-value = 0.001074
library(car)
durbinWatsonTest(pred)
[1] 0.09084939
Here the p-values of the Shapiro-Wilk and Anderson-Darling tests are again less than 0.05.
Therefore normality is not met.
As before, the Durbin-Watson value shown is the statistic itself rather than a p-value; the proper independence check is on the model residuals (see the sketch in the previous section).

Reshuffle the data

set.seed(234)

BUILDING AND TESTING A MODEL USING REPEATED CROSS VALIDATION TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="repeatedcv",number = 5,repeats = 4)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4962  -3.0343  -0.5861   2.1363  27.3362 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5360     0.2756  81.781  < 2e-16 ***
`poly(lstat, 5)1` -129.5693     5.1993 -24.920  < 2e-16 ***
`poly(lstat, 5)2`   54.7446     5.1993  10.529  < 2e-16 ***
`poly(lstat, 5)3`  -22.2463     5.1993  -4.279 2.43e-05 ***
`poly(lstat, 5)4`   23.8176     5.1993   4.581 6.45e-06 ***
`poly(lstat, 5)5`  -17.2875     5.1993  -3.325 0.000978 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.199 on 350 degrees of freedom
Multiple R-squared:  0.6909,    Adjusted R-squared:  0.6865 
F-statistic: 156.4 on 5 and 350 DF,  p-value: < 2.2e-16
Here all coefficients are significant and the R-squared is 0.6909 (i.e. about 69%).
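With repeated cross-validation, caret keeps the per-fold metrics for all 5 x 4 = 20 resamples, which shows how stable the estimate is:

# Per-resample metrics across all folds and repeats.
head(model$resample)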

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.7026234 26.5775911  5.1553459  0.1805658 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 3.7064532 28.0838291  5.2994178  0.1597686 
Here the mape values on the train and test data are 0.181 (i.e. about 18% error) and 0.160 (i.e. about 16% error) respectively, which is consistent.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.96078, p-value = 0.0002894
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.0107, p-value = 0.01126
library(car)
durbinWatsonTest(pred)
[1] 0.09557605
Here the p-values of the Shapiro-Wilk and Anderson-Darling tests are less than 0.05.
Therefore normality is not met.
The Durbin-Watson value is again the statistic rather than a p-value (see the earlier caveat on checking the residuals).

Shuffle the data

set.seed(678)

BUILDING AND TESTING A MODEL USING LEAVE-ONE-OUT CROSS VALIDATION (LOOCV) TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]

method<-trainControl(method="LOOCV")

library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2649  -3.0819  -0.4029   2.1627  26.3091 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5649     0.2704  83.447  < 2e-16 ***
`poly(lstat, 5)1` -126.0024     5.1021 -24.696  < 2e-16 ***
`poly(lstat, 5)2`   55.1407     5.1021  10.807  < 2e-16 ***
`poly(lstat, 5)3`  -23.4965     5.1021  -4.605 5.77e-06 ***
`poly(lstat, 5)4`   18.1712     5.1021   3.562  0.00042 ***
`poly(lstat, 5)5`  -12.9093     5.1021  -2.530  0.01184 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.102 on 350 degrees of freedom
Multiple R-squared:  0.6867,    Adjusted R-squared:  0.6822 
F-statistic: 153.4 on 5 and 350 DF,  p-value: < 2.2e-16
Here the coefficients are significant and the R-squared is 0.6867 (i.e. about 69%).

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.6901094 25.5924942  5.0589025  0.1815415 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 4.0339984 31.8179971  5.6407444  0.1948971 
Here the mape values on the train and test data are 0.182 (i.e. about 18% error) and 0.195 (i.e. about 19.5% error) respectively, both of which are acceptable.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.95255, p-value = 5.365e-05
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 2.2627, p-value = 9.222e-06
library(car)
durbinWatsonTest(pred)
[1] 0.1188522
Here, as the p-values in both the Shapiro-Wilk and Anderson-Darling tests are less than 0.05, normality is not met.
The Durbin-Watson value is the statistic rather than a p-value; as noted earlier, the independence check properly applies to the model residuals.
SUMMARY

ATTRIBUTE            BOOTSTRAPPING  K-FOLD_CV  REPEATED_CV  LOOCV  MEAN
R_squared            0.689          0.687      0.691        0.687  0.688
mape_on_train_data   0.182          0.177      0.181        0.182  0.180
mape_on_test_data    0.169          0.183      0.160        0.195  0.177
Here the R-squared is consistent across the models built with the different resampling methods, hovering around 0.69 (i.e. the model shows that 'lstat' explains roughly 69% of the variance in 'medv').

The mean absolute percentage error (mape) is about 0.180 (18% error) on the train data and about 0.177 (17.7% error) on the test data across all the resampling methods, which is also consistent.

As the R-squared and the error rate are consistent across all the models (i.e. the model can be expected to work well on future data), we can deploy this model.
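As a closing usage sketch, a deployed model is consumed through predict(); the lstat values below are made up for illustration:

# Predict medv (in USD 1000s) for hypothetical lstat values.
newdata<-data.frame(lstat=c(5,15,30))
predict(model,newdata)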