DATA PREPROCESSING & SELECTING THE MODEL

BOSTON DATASET (displaying only 11 observations)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9
0.22489 12.5 7.87 0 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.45 15.0
Let's build a model on the Boston dataset, collected by the US Census Service and concerning housing in the area of Boston, Massachusetts.
Here we will consider only medv (median value of owner-occupied homes, in USD 1000s) and lstat (percentage of lower-status population). We model medv as depending on lstat, so 'lstat' is the explanatory/independent variable and 'medv' is the outcome/dependent variable. As we have only two variables, we use a SIMPLE LINEAR REGRESSION MODEL.
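Before anything else the data must be available in the session. A minimal sketch, assuming the Boston data frame comes from the MASS package (the original session may have loaded it differently):

# Load the Boston housing data; it ships with the MASS package.
library(MASS)
data(Boston)
# The two variables used throughout this analysis.
str(Boston[,c("lstat","medv")])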

Checking for linearity

library(ggplot2)
# Scatter plot of medv against the explanatory variable lstat, with a loess smooth.
ggplot(Boston,aes(lstat,medv))+geom_point()+geom_smooth()
`geom_smooth()` using method = 'loess'

From the plot we infer that there is a nonlinear relation between the two variables, so a straight-line fit will not do; we need to capture the nonlinearity, either by transforming a variable or by using a nonlinear regression model.

Check for outcome variable transformation

# Fit a provisional simple linear model, then estimate the Box-Cox
# transformation parameter (lambda) for the outcome variable.
dummymodel<-lm(medv~lstat,data=Boston)
library(car)
powerTransform(dummymodel)
Estimated transformation parameters 
        Y1 
0.03068753 
Here the estimated lambda is about 0.031, which is close to zero; in the Box-Cox family, lambda = 0 corresponds to a logarithmic transformation of the outcome variable.
This leaves two candidate nonlinear models (polynomial regression and regression on a log-transformed outcome). We will use polynomial regression, since it models the curvature directly while keeping medv on its original, interpretable scale.
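For reference, a minimal sketch of the logarithmic alternative suggested by lambda being near zero (logmodel is an illustrative name; this model is not pursued further below):

# Fit the log-transformed alternative, for comparison only.
logmodel<-lm(log(medv)~lstat,data=Boston)
summary(logmodel)$r.squared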

Nonlinear regression model

# Fit a 7th-order (orthogonal) polynomial in lstat to see how many orders are significant.
model<-lm(medv~poly(lstat,7),data=Boston)
summary(model)

Call:
lm(formula = medv ~ poly(lstat, 7), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3746  -3.1382  -0.7072   2.0646  26.9642 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2319  97.168  < 2e-16 ***
poly(lstat, 7)1 -152.4595     5.2164 -29.227  < 2e-16 ***
poly(lstat, 7)2   64.2272     5.2164  12.313  < 2e-16 ***
poly(lstat, 7)3  -27.0511     5.2164  -5.186 3.13e-07 ***
poly(lstat, 7)4   25.4517     5.2164   4.879 1.44e-06 ***
poly(lstat, 7)5  -19.2524     5.2164  -3.691 0.000248 ***
poly(lstat, 7)6    6.5088     5.2164   1.248 0.212708    
poly(lstat, 7)7    1.9416     5.2164   0.372 0.709888    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.216 on 498 degrees of freedom
Multiple R-squared:  0.6828,    Adjusted R-squared:  0.6783 
F-statistic: 153.1 on 7 and 498 DF,  p-value: < 2.2e-16
From the summary we see that the coefficients are significant up to the 5th order, so let's build a model up to the 5th order.
But before building it, let's use resampling methods (cross-validation and bootstrapping) to estimate how the model performs on data it has not seen, guarding against overfitting and giving a stable estimate of the error rate.

Hold-out approach

library(caret)
Warning: package 'caret' was built under R version 3.4.4
Loading required package: lattice
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
Here we split the data, assigning 70% of the observations to train1 and the remaining 30% to test1.
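A quick sanity check on the split proportions (illustrative only):

# Confirm the split is roughly 70/30.
nrow(train1)/nrow(Boston)
nrow(test1)/nrow(Boston)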

Bootstrapping method

library(caret)
method<-trainControl(method="boot",number = 100)
Here we set up the bootstrapping method: the model is fit on 100 resamples drawn with replacement from the training data and evaluated on the observations left out of each resample, so each observation plays both training and testing roles multiple times.

BUILD THE MODEL

library(caret)
library(car)
# Train the 5th-order polynomial with lm, scored by the resampling scheme above.
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2624  -3.0449  -0.6638   2.0655  27.1644 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5211     0.2759  81.627  < 2e-16 ***
`poly(lstat, 5)1` -129.6562     5.2057 -24.907  < 2e-16 ***
`poly(lstat, 5)2`   55.5696     5.2057  10.675  < 2e-16 ***
`poly(lstat, 5)3`  -20.3589     5.2057  -3.911  0.00011 ***
`poly(lstat, 5)4`   22.4564     5.2057   4.314 2.09e-05 ***
`poly(lstat, 5)5`  -12.4691     5.2057  -2.395  0.01713 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.206 on 350 degrees of freedom
Multiple R-squared:  0.6886,    Adjusted R-squared:  0.6841 
F-statistic: 154.8 on 5 and 350 DF,  p-value: < 2.2e-16
Here all the coefficients are significant and the R-squared is 0.6886, which indicates that the explanatory variable explains about 69% of the variance in the outcome variable.
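The same figure can also be read programmatically off the fitted lm object that caret stores in its finalModel slot:

# R-squared extracted straight from the underlying lm fit.
summary(model$finalModel)$r.squared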

EVALUATE THE MODEL

pred<-predict(model,train1)
library(DMwR)
Loading required package: grid
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.7249725 26.6426539  5.1616522  0.1817132 
Here we apply the model to the training data, predict the outcome variable, and check the accuracy of the model using regression evaluation metrics.
We get a mean absolute percentage error (mape) of 0.182 (i.e. the model has about 18% error), which is acceptable.
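To make the mape figure concrete, here is the same quantity computed by hand, matching the definition regr.eval uses:

# mape = mean of |actual - predicted| / actual
mean(abs(train1$medv-pred)/train1$medv)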
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 3.7136637 27.5263661  5.2465576  0.1689368 
Here we predict the outcome on the test data and check the accuracy of the model. The mape is 0.169 (i.e. about 17% error), which is also acceptable.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.96016, p-value = 0.0002532
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.1894, p-value = 0.004067
library(car)
durbinWatsonTest(pred)
[1] 0.09268972
Now let's check the normality and independence of the errors.
The Shapiro-Wilk and Anderson-Darling tests both give p-values below 0.05, so we reject the null hypothesis of normality: the values are not normally distributed.
Therefore normality is not met, but that is tolerable here.
Note that durbinWatsonTest applied to a plain numeric vector returns the Durbin-Watson statistic itself, not a p-value, so the 0.093 above should not be compared against 0.05 (a statistic near 2 indicates uncorrelated errors). Strictly, all of these checks concern the model residuals rather than the predictions.
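A minimal sketch of the same checks run on the residuals instead, assuming caret's stored lm fit in model$finalModel (res is an illustrative name):

# The regression assumptions concern the residuals, not the predictions.
res<-residuals(model$finalModel)
shapiro.test(res)                    # normality of the residuals
durbinWatsonTest(model$finalModel)   # DW test with a p-value; statistic near 2 => independent errors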

Reshuffle the data

set.seed(123)
Here we fix the random seed: set.seed does not itself reshuffle anything, but it makes the next random split reproducible while still producing a different split than the first one, letting us check how well the model performs on different groupings of the data.
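A short sketch of what fixing the seed buys us (idx1 and idx2 are illustrative names):

# The same seed before the same call yields an identical partition.
set.seed(123)
idx1<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
set.seed(123)
idx2<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
identical(idx1,idx2)  # TRUE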
Now we will build and test the model by using different resampling methods.

BUILDING AND TESTING A MODEL USING K-FOLD CROSS VALIDATION TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="cv",number = 5)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6939  -2.9651  -0.7246   2.1306  26.5039 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5374     0.2698  83.537  < 2e-16 ***
`poly(lstat, 5)1` -125.5151     5.0904 -24.657  < 2e-16 ***
`poly(lstat, 5)2`   53.5415     5.0904  10.518  < 2e-16 ***
`poly(lstat, 5)3`  -23.3540     5.0904  -4.588 6.25e-06 ***
`poly(lstat, 5)4`   20.8914     5.0904   4.104 5.05e-05 ***
`poly(lstat, 5)5`  -16.7496     5.0904  -3.290   0.0011 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.09 on 350 degrees of freedom
Multiple R-squared:  0.6868,    Adjusted R-squared:  0.6823 
F-statistic: 153.5 on 5 and 350 DF,  p-value: < 2.2e-16
Here all coefficients are significant and the R-squared is 0.6868 (i.e. about 69%).
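caret also stores the cross-validated performance averaged over the 5 folds, which is a more honest estimate of out-of-sample error than the in-sample summary above:

# Resampled performance metrics averaged over the folds.
model$results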

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.6783894 25.4753014  5.0473064  0.1770015 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
      mae       mse      rmse      mape 
 3.789995 30.447014  5.517881  0.182509 
Here we have a mape on the train data of 0.177 (i.e. about 18%) and on the test data of 0.183 (i.e. about 18%), which is consistent.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.95577, p-value = 0.0001019
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.4235, p-value = 0.001074
library(car)
durbinWatsonTest(pred)
[1] 0.09084939
Here the p-values of the Shapiro-Wilk and Anderson-Darling tests are again less than 0.05.
Therefore normality is not met.
As before, the Durbin-Watson value shown is the statistic itself rather than a p-value; the proper independence check is on the model residuals (see the sketch in the previous section).

Reshuffle the data

set.seed(234)

BUILDING AND TESTING A MODEL USING REPEATED CROSS VALIDATION TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]
method<-trainControl(method="repeatedcv",number = 5,repeats = 4)
library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4962  -3.0343  -0.5861   2.1363  27.3362 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5360     0.2756  81.781  < 2e-16 ***
`poly(lstat, 5)1` -129.5693     5.1993 -24.920  < 2e-16 ***
`poly(lstat, 5)2`   54.7446     5.1993  10.529  < 2e-16 ***
`poly(lstat, 5)3`  -22.2463     5.1993  -4.279 2.43e-05 ***
`poly(lstat, 5)4`   23.8176     5.1993   4.581 6.45e-06 ***
`poly(lstat, 5)5`  -17.2875     5.1993  -3.325 0.000978 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.199 on 350 degrees of freedom
Multiple R-squared:  0.6909,    Adjusted R-squared:  0.6865 
F-statistic: 156.4 on 5 and 350 DF,  p-value: < 2.2e-16
Here all coefficients are significant and the R-squared is 0.6909 (i.e. about 69%).
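With repeated cross-validation, caret keeps the per-fold metrics for all 5 x 4 = 20 resamples, which shows how stable the estimate is:

# Per-resample metrics across all folds and repeats.
head(model$resample)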

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.7026234 26.5775911  5.1553459  0.1805658 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 3.7064532 28.0838291  5.2994178  0.1597686 
Here the mape values on the train and test data are 0.181 (i.e. about 18% error) and 0.160 (i.e. about 16% error) respectively, which is consistent.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.96078, p-value = 0.0002894
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 1.0107, p-value = 0.01126
library(car)
durbinWatsonTest(pred)
[1] 0.09557605
Here the p-values of the Shapiro-Wilk and Anderson-Darling tests are less than 0.05.
Therefore normality is not met.
The Durbin-Watson value is again the statistic rather than a p-value (see the earlier caveat on checking the residuals).

Shuffle the data

set.seed(678)

BUILDING AND TESTING A MODEL USING LEAVE-ONE-OUT CROSS VALIDATION (LOOCV) TECHNIQUE

library(caret)
split<-createDataPartition(Boston$medv,p=0.7,list=FALSE)
train1<-Boston[split,]
test1<-Boston[-split,]

method<-trainControl(method="LOOCV")

library(caret)
library(car)
model<-train(medv~poly(lstat,5),data=train1,method="lm",trControl=method)
summary(model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2649  -3.0819  -0.4029   2.1627  26.3091 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.5649     0.2704  83.447  < 2e-16 ***
`poly(lstat, 5)1` -126.0024     5.1021 -24.696  < 2e-16 ***
`poly(lstat, 5)2`   55.1407     5.1021  10.807  < 2e-16 ***
`poly(lstat, 5)3`  -23.4965     5.1021  -4.605 5.77e-06 ***
`poly(lstat, 5)4`   18.1712     5.1021   3.562  0.00042 ***
`poly(lstat, 5)5`  -12.9093     5.1021  -2.530  0.01184 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.102 on 350 degrees of freedom
Multiple R-squared:  0.6867,    Adjusted R-squared:  0.6822 
F-statistic: 153.4 on 5 and 350 DF,  p-value: < 2.2e-16
Here the coefficients are significant and the R-squared is 0.6867 (i.e. about 69%).

Checking the accuracy of the model

pred<-predict(model,train1)
library(DMwR)
regr.eval(train1$medv,pred)
       mae        mse       rmse       mape 
 3.6901094 25.5924942  5.0589025  0.1815415 
pred<-predict(model,test1)
regr.eval(test1$medv,pred)
       mae        mse       rmse       mape 
 4.0339984 31.8179971  5.6407444  0.1948971 
Here the mape values on the train and test data are 0.182 (i.e. about 18% error) and 0.195 (i.e. about 19.5% error) respectively, both of which are acceptable.

Checking the regression model assumptions

shapiro.test(pred)

    Shapiro-Wilk normality test

data:  pred
W = 0.95255, p-value = 5.365e-05
library(nortest)
ad.test(pred)

    Anderson-Darling normality test

data:  pred
A = 2.2627, p-value = 9.222e-06
library(car)
durbinWatsonTest(pred)
[1] 0.1188522
Here, as the p-values in both the Shapiro-Wilk and Anderson-Darling tests are less than 0.05, normality is not met.
The Durbin-Watson value is the statistic rather than a p-value; as noted earlier, the independence check properly applies to the model residuals.
SUMMARY

ATTRIBUTE            BOOTSTRAPPING  K-FOLD_CV  REPEATED_CV  LOOCV  MEAN
R_squared            0.689          0.687      0.691        0.687  0.688
mape_on_train_data   0.182          0.177      0.181        0.182  0.180
mape_on_test_data    0.169          0.183      0.160        0.195  0.177
Here the R-squared is consistent across the models built with the different resampling methods, hovering around 0.69 (i.e. the model shows that 'lstat' explains roughly 69% of the variance in 'medv').

The mean absolute percentage error (mape) is about 0.180 (18% error) on the train data and about 0.177 (17.7% error) on the test data across all the resampling methods, which is also consistent.

As the R-squared and the error rate are consistent across all the models (i.e. the model can be expected to work well on future data), we can deploy this model.
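As a closing usage sketch, a deployed model is consumed through predict(); the lstat values below are made up for illustration:

# Predict medv (in USD 1000s) for hypothetical lstat values.
newdata<-data.frame(lstat=c(5,15,30))
predict(model,newdata)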