DATA PREPROCESSING & SELECTING THE MODEL
BOSTON DATASET (displaying only 11 observations)
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
| 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
| 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
| 0.22489 | 12.5 | 7.87 | 0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5 | 311 | 15.2 | 392.52 | 20.45 | 15.0 |
Let's build a model on the Boston dataset, collected by the U.S. Census Service, concerning housing in the area of Boston, Massachusetts.
Here we will consider only medv (the median value of owner-occupied homes, in USD 1000s) and lstat (the percentage of the population with lower socioeconomic status) when building a model in which medv depends on lstat. So 'lstat' is the explanatory/independent variable and 'medv' is the outcome/dependent variable. As we have only two variables, we use a SIMPLE LINEAR REGRESSION MODEL.
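As a starting point, a minimal sketch of that straight-line fit (assuming the Boston data frame comes from the MASS package; fit_slr is an illustrative name):
library(MASS)                                # the Boston data frame ships with MASS
fit_slr <- lm(medv ~ lstat, data = Boston)   # simple linear regression of medv on lstat
summary(fit_slr)                             # slope, intercept and R-squared of the straight-line fit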
Checking for linearity
library(ggplot2)
ggplot(Boston,aes(medv,lstat))+geom_point()+geom_smooth()
`geom_smooth()` using method = 'loess'

From the plot we infer that there is a nonlinear relationship between the two variables, so a straight line will not fit well; we instead capture the curvature with a non-linear (polynomial) regression model.
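To make the curvature explicit, one option (a sketch, plotting lstat on the x-axis as the explanatory variable) is to overlay a straight-line fit and a loess smoother on the same scatterplot:
library(ggplot2)
ggplot(Boston, aes(lstat, medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +   # straight-line fit
  geom_smooth(method = "loess", se = FALSE)                  # flexible loess smoother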
Non-linear (polynomial) regression model
model<-lm(medv~poly(lstat,7),data=Boston)
summary(model)
Call:
lm(formula = medv ~ poly(lstat, 7), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-14.3746 -3.1382 -0.7072 2.0646 26.9642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5328 0.2319 97.168 < 2e-16 ***
poly(lstat, 7)1 -152.4595 5.2164 -29.227 < 2e-16 ***
poly(lstat, 7)2 64.2272 5.2164 12.313 < 2e-16 ***
poly(lstat, 7)3 -27.0511 5.2164 -5.186 3.13e-07 ***
poly(lstat, 7)4 25.4517 5.2164 4.879 1.44e-06 ***
poly(lstat, 7)5 -19.2524 5.2164 -3.691 0.000248 ***
poly(lstat, 7)6 6.5088 5.2164 1.248 0.212708
poly(lstat, 7)7 1.9416 5.2164 0.372 0.709888
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.216 on 498 degrees of freedom
Multiple R-squared: 0.6828, Adjusted R-squared: 0.6783
F-statistic: 153.1 on 7 and 498 DF, p-value: < 2.2e-16
From the summary we can see that the coefficients are significant up to the 5th order, so let's build a model up to the 5th order; a quick cross-check of that choice is sketched below.
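One way to double-check the chosen degree is an F-test between nested polynomial fits; a sketch (the fit names are illustrative):
fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)   # 5th-order polynomial fit
fit7 <- lm(medv ~ poly(lstat, 7), data = Boston)   # 7th-order polynomial fit
anova(fit5, fit7)                                  # F-test: do the 6th- and 7th-order terms add anything?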
But before building it, let's use resampling methods (cross-validation and bootstrapping) to get reliable estimates of the model's error on unseen data and reduce the risk of overfitting.
Hold-out approach
library(caret)
Warning: package 'caret' was built under R version 3.4.4
Loading required package: lattice
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
Here, we split the data, assigning 70% of the observations to the train1 data and the remaining 30% to the test1 data.
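A quick sanity check of the split sizes (a sketch):
nrow(train1)   # roughly 70% of the 506 rows
nrow(test1)    # the remaining ~30%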
Bootstrapping method
library(caret)
method<-trainControl(method="boot",number = 100)
Here, we use the bootstrap method: the model is fitted on 100 resamples drawn with replacement from the training data and evaluated on the observations left out of each resample, so each observation ends up being used for training and testing multiple times.
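For intuition, a single bootstrap resample can be drawn by hand (a sketch; the seed and the idx name are illustrative):
set.seed(1)                                   # illustrative seed
idx <- sample(nrow(train1), replace = TRUE)   # one bootstrap resample, drawn with replacement
length(unique(idx)) / nrow(train1)            # roughly 63% of rows appear; the rest are "out-of-bag"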
BUILD THE MODEL
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.2624 -3.0449 -0.6638 2.0655 27.1644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5211 0.2759 81.627 < 2e-16 ***
`poly(lstat, 5)1` -129.6562 5.2057 -24.907 < 2e-16 ***
`poly(lstat, 5)2` 55.5696 5.2057 10.675 < 2e-16 ***
`poly(lstat, 5)3` -20.3589 5.2057 -3.911 0.00011 ***
`poly(lstat, 5)4` 22.4564 5.2057 4.314 2.09e-05 ***
`poly(lstat, 5)5` -12.4691 5.2057 -2.395 0.01713 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.206 on 350 degrees of freedom
Multiple R-squared: 0.6886, Adjusted R-squared: 0.6841
F-statistic: 154.8 on 5 and 350 DF, p-value: < 2.2e-16
Here all the coefficients are significant and the multiple R-squared is 0.689 (which indicates that the explanatory variable explains about 69% of the variation in the outcome variable).
EVALUATE THE MODEL
pred<-predict(model,train1)
library(DMwR)
Loading required package: grid
regr.eval(train1$medv,pred)
mae mse rmse mape
3.7249725 26.6426539 5.1616522 0.1817132
Here, we apply the fitted model to the train data, predict the outcome variable, and then check the accuracy of the model using regression evaluation metrics.
We get a mean absolute percentage error (mape) of about 0.18 (i.e. the model has roughly 18% error), which is acceptable.
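The same mape figure can also be computed by hand if the DMwR package is unavailable (a sketch):
mean(abs(train1$medv - pred) / train1$medv)   # mean absolute percentage error on the train data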
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
mae mse rmse mape
3.7136637 27.5263661 5.2465576 0.1689368
Here, we predict the outcome on the test data and check the accuracy of the model. We get a mape of about 0.17 (i.e. roughly 17% error), which is also acceptable.
Checking regression model assumptions
shapiro.test(pred)
Shapiro-Wilk normality test
data: pred
W = 0.96016, p-value = 0.0002532
library(nortest)
ad.test(pred)
Anderson-Darling normality test
data: pred
A = 1.1894, p-value = 0.004067
library(car)
durbinWatsonTest(pred)
[1] 0.09268972
So, let's check the normality and independence of the errors (strictly speaking, these checks are usually run on the model residuals rather than on the predictions; a residual-based sketch follows below).
From the Shapiro-Wilk test and the Anderson-Darling test we have p-values less than 0.05, so we reject the null hypothesis and conclude that the data are not normally distributed.
Therefore, normality is not met, but that is acceptable here.
Note that durbinWatsonTest() applied to a plain vector returns the Durbin-Watson statistic itself (values near 2 indicate uncorrelated errors), not a p-value, so the 0.05 threshold does not apply; we proceed on the assumption that there is no correlation among the errors.
Therefore, independence of the errors is taken to be met.
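A sketch of the same checks run on the residuals instead of the predictions (res is an illustrative name):
res <- test1$medv - predict(model, test1)   # residuals on the test data
shapiro.test(res)                           # normality of the residuals
library(car)
durbinWatsonTest(res)                       # Durbin-Watson statistic; values near 2 suggest uncorrelated errors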
Reshuffle the data
set.seed(123)
Here, setting a new random seed makes the next train/test split reproducible while producing a different partition, so we can check how well the model performs on different groupings of the data.
Now we will build and test the model using different resampling methods.
BUILDING AND TESTING A MODEL USING K-FOLD CROSS VALIDATION TECHNIQUE
library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="cv",number = 5)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-12.6939 -2.9651 -0.7246 2.1306 26.5039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5374 0.2698 83.537 < 2e-16 ***
`poly(lstat, 5)1` -125.5151 5.0904 -24.657 < 2e-16 ***
`poly(lstat, 5)2` 53.5415 5.0904 10.518 < 2e-16 ***
`poly(lstat, 5)3` -23.3540 5.0904 -4.588 6.25e-06 ***
`poly(lstat, 5)4` 20.8914 5.0904 4.104 5.05e-05 ***
`poly(lstat, 5)5` -16.7496 5.0904 -3.290 0.0011 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.09 on 350 degrees of freedom
Multiple R-squared: 0.6868, Adjusted R-squared: 0.6823
F-statistic: 153.5 on 5 and 350 DF, p-value: < 2.2e-16
Here, all coefficients are significant and the multiple R-squared is 0.687 (i.e. about 69%).
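The cross-validated performance estimates can also be read directly off the caret object (a sketch):
model$results    # RMSE, R-squared and MAE averaged over the 5 folds
model$resample   # per-fold performance for the final model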
Checking the accuracy of the model
pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
mae mse rmse mape
3.6783894 25.4753014 5.0473064 0.1770015
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
mae mse rmse mape
3.789995 30.447014 5.517881 0.182509
Here, we have a mape of 0.177 (about 18%) on the train data and 0.183 (about 18%) on the test data, which is consistent.
Checking the regression model assumptions
shapiro.test(pred)
Shapiro-Wilk normality test
data: pred
W = 0.95577, p-value = 0.0001019
library(nortest)
ad.test(pred)
Anderson-Darling normality test
data: pred
A = 1.4235, p-value = 0.001074
library(car)
durbinWatsonTest(pred)
[1] 0.09084939
Here, the p-values of the Shapiro-Wilk and Anderson-Darling tests are less than 0.05.
Therefore, normality is not met.
As in the earlier check, the Durbin-Watson call returns the statistic itself rather than a p-value; we again proceed on the assumption that there is no serial correlation.
Therefore, independence of the errors is taken to be met.
Reshuffle the data
set.seed(234)
BUILDING AND TESTING A MODEL USING REPEATED CROSS VALIDATION TECHNIQUE
library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="repeatedcv",number = 5,repeats = 4)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.4962 -3.0343 -0.5861 2.1363 27.3362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5360 0.2756 81.781 < 2e-16 ***
`poly(lstat, 5)1` -129.5693 5.1993 -24.920 < 2e-16 ***
`poly(lstat, 5)2` 54.7446 5.1993 10.529 < 2e-16 ***
`poly(lstat, 5)3` -22.2463 5.1993 -4.279 2.43e-05 ***
`poly(lstat, 5)4` 23.8176 5.1993 4.581 6.45e-06 ***
`poly(lstat, 5)5` -17.2875 5.1993 -3.325 0.000978 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.199 on 350 degrees of freedom
Multiple R-squared: 0.6909, Adjusted R-squared: 0.6865
F-statistic: 156.4 on 5 and 350 DF, p-value: < 2.2e-16
Here, all coefficients are significant and the multiple R-squared is 0.691 (i.e. about 69%).
Checking the accuracy of the model
pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
mae mse rmse mape
3.7026234 26.5775911 5.1553459 0.1805658
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
mae mse rmse mape
3.7064532 28.0838291 5.2994178 0.1597686
Here, the mape values on the train and test data are 0.181 (about 18% error) and 0.160 (about 16% error) respectively, which is reasonably consistent.
Checking the regression model assumptions
shapiro.test(pred)
Shapiro-Wilk normality test
data: pred
W = 0.96078, p-value = 0.0002894
library(nortest)
ad.test(pred)
Anderson-Darling normality test
data: pred
A = 1.0107, p-value = 0.01126
library(car)
durbinWatsonTest(pred)
[1] 0.09557605
Here, the p-values of the Shapiro-Wilk and Anderson-Darling tests are less than 0.05.
Therefore, normality is not met.
As before, the Durbin-Watson call returns the statistic itself rather than a p-value; we again proceed on the assumption that there is no serial correlation.
Therefore, independence of the errors is taken to be met.
Shuffle the data
set.seed(678)
BUILDING AND TESTING A MODEL USING LEAVE-ONE-OUT CROSS VALIDATION (LOOCV) TECHNIQUE
library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="LOOCV")
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.2649 -3.0819 -0.4029 2.1627 26.3091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5649 0.2704 83.447 < 2e-16 ***
`poly(lstat, 5)1` -126.0024 5.1021 -24.696 < 2e-16 ***
`poly(lstat, 5)2` 55.1407 5.1021 10.807 < 2e-16 ***
`poly(lstat, 5)3` -23.4965 5.1021 -4.605 5.77e-06 ***
`poly(lstat, 5)4` 18.1712 5.1021 3.562 0.00042 ***
`poly(lstat, 5)5` -12.9093 5.1021 -2.530 0.01184 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.102 on 350 degrees of freedom
Multiple R-squared: 0.6867, Adjusted R-squared: 0.6822
F-statistic: 153.4 on 5 and 350 DF, p-value: < 2.2e-16
Here, the coefficients are significant and the multiple R-squared is 0.687 (i.e. about 69%).
Checking the accuracy of the model
pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
mae mse rmse mape
3.6901094 25.5924942 5.0589025 0.1815415
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
mae mse rmse mape
4.0339984 31.8179971 5.6407444 0.1948971
Here, the mape values on the train and test data are 0.182 (about 18% error) and 0.195 (about 19% error) respectively, which is still acceptable.
Checking the regression model assumptions
shapiro.test(pred)
Shapiro-Wilk normality test
data: pred
W = 0.95255, p-value = 5.365e-05
library(nortest)
ad.test(pred)
Anderson-Darling normality test
data: pred
A = 2.2627, p-value = 9.222e-06
library(car)
durbinWatsonTest(pred)
[1] 0.1188522
Here, as the p-values of both the Shapiro-Wilk and Anderson-Darling tests are less than 0.05, normality is not met.
As before, the Durbin-Watson call returns the statistic itself rather than a p-value; we again take the independence of the errors to be met.
SUMMARY
| R_squared | mape_on_train_data | mape_on_test_data |
| --- | --- | --- |
| 0.670 | 0.180 | 0.176 |
| 0.680 | 0.177 | 0.180 |
| 0.690 | 0.180 | 0.159 |
| 0.68 | 0.18 | 0.19 |
| 0.670 | 0.180 | 0.176 |
Here, the R-squared is consistent across the models built with the different resampling methods, varying around 0.67-0.69 (i.e. the models indicate that 'lstat' explains roughly two-thirds of the variation in 'medv').
The mean absolute percentage error (mape) on the train data and the test data is around 0.18 (i.e. roughly 18% error) for all the models built with the different methods, which is also consistent.
As the R-squared and the error rate are consistent across all the models (i.e. the model can be expected to perform similarly on future data), we can deploy this model.
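If desired, one row of the summary table can be assembled programmatically from a fitted caret model and the corresponding train/test frames; a hedged helper sketch (summarise_fit is an illustrative name):
summarise_fit <- function(model, train_df, test_df) {
  r2         <- summary(model)$r.squared                                    # R-squared of the final lm fit
  mape_train <- mean(abs(train_df$medv - predict(model, train_df)) / train_df$medv)
  mape_test  <- mean(abs(test_df$medv  - predict(model, test_df))  / test_df$medv)
  data.frame(R_squared = r2, mape_on_train_data = mape_train, mape_on_test_data = mape_test)
}
summarise_fit(model, train1, test1)   # e.g. one row for the LOOCV model fitted last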