Select the method

library(MASS)
View(Boston)
library(ggplot2)
ggplot(Boston,aes(lstat,medv))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess'

Looking at the graph above, we can see that x (lstat) and y (medv) have a curvilinear relationship, so we use non-linear regression, i.e. polynomial regression.

Train the model

Splitting the data into 80% training and 20% test data

library(caret)
## Loading required package: lattice
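createDataPartition draws a random sample, so the exact split below will vary between runs; setting a seed first makes it reproducible (an addition to the original code, any seed value works):

set.seed(123)   # for a reproducible train/test split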
split<-createDataPartition(Boston$lstat,p=0.8,list=FALSE)
training<-Boston[split,]
test<-Boston[-split,]
model<-lm(medv ~ poly(lstat,7),data=training)
summary(model)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 7), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.753  -3.031  -0.847   2.079  26.874 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5224     0.2611  86.268  < 2e-16 ***
## poly(lstat, 7)1 -141.3579     5.2605 -26.871  < 2e-16 ***
## poly(lstat, 7)2   59.2898     5.2605  11.271  < 2e-16 ***
## poly(lstat, 7)3  -27.8237     5.2605  -5.289 2.03e-07 ***
## poly(lstat, 7)4   26.2592     5.2605   4.992 8.97e-07 ***
## poly(lstat, 7)5  -15.3378     5.2605  -2.916  0.00375 ** 
## poly(lstat, 7)6    9.4938     5.2605   1.805  0.07187 .  
## poly(lstat, 7)7   -4.2257     5.2605  -0.803  0.42229    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.261 on 398 degrees of freedom
## Multiple R-squared:  0.6967, Adjusted R-squared:  0.6914 
## F-statistic: 130.6 on 7 and 398 DF,  p-value: < 2.2e-16

From the summary of the model we can see that the polynomial coefficients are significant up to order 5 (the order-6 and order-7 terms are not), hence we refit the model with a polynomial of order 5.

model<-lm(medv ~ poly(lstat,5),data=training)
library(DMwR)
## Loading required package: grid
pred1<-predict(model,training)
regr.eval(training$medv,pred1)
##        mae        mse       rmse       mape 
##  3.7223450 27.3937711  5.2339059  0.1769079

Test the model

pred<-predict(model,test)
regr.eval(test$medv,pred)
##        mae        mse       rmse       mape 
##  3.7798880 25.3439013  5.0342727  0.1811757

We use "mape" (mean absolute percentage error) as the accuracy measure. Here it is approximately 18% error, so this model has an accuracy of roughly 82% on the test data.
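As a sanity check (not part of the original output), the MAPE reported by regr.eval can be reproduced by hand: it is the mean absolute error expressed as a fraction of the actual values.

mean(abs(test$medv-pred)/test$medv)   # should match the mape value reported above (~0.181)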

Checking the assumptions on the residuals

1.Linearity

In the Residuals vs Fitted plot, if the red line runs along the dotted horizontal line and the fitted values are scattered around it without any systematic pattern, the linearity assumption for the residuals is met.
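The text describes this diagnostic but the command is not shown; base R's plot method for lm objects produces it with which = 1:

plot(model,1)   # Residuals vs Fitted plot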

2.Normality of residuals

Statistical tests such as the Shapiro-Wilk test and the Anderson-Darling test are used to check the normality of the residuals.

Null hypothesis: the data are normally distributed. Alternative hypothesis: the data are not normally distributed.

library(car)
## Loading required package: carData
shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.91074, p-value = 9.644e-15

As the p-value is < 0.05 we reject the null hypothesis. Hence the residuals are not normally distributed.
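The Anderson-Darling test mentioned above is not in base R; a minimal sketch using the nortest package (an addition to the original analysis):

library(nortest)
ad.test(model$residuals)   # Anderson-Darling normality test on the residuals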

3.Homoscedasticity of Residuals

Checking for constant error variance

plot(model,3)
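As an added, more formal check of constant variance, the car package (loaded above) provides a score test for non-constant error variance:

ncvTest(model)   # null hypothesis: constant error variance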

4.Independence of Errors

To check for correlation between the errors we use the Durbin-Watson test. Null hypothesis: no correlation between the errors. Alternative hypothesis: the errors are correlated.

durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.4322083      1.114143       0
##  Alternative hypothesis: rho != 0

As the p-value is < 0.05 we reject the null hypothesis, so there is correlation between the errors.

Checking for outliers

boxplot(model$residuals)

There are many outliers among the residuals, which affects the quality of the predictions.
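As an added complement to the boxplot, the car package offers a Bonferroni test on the largest studentized residual:

outlierTest(model)   # flags observations with unusually large residuals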

Prediction

predicted<-data.frame(test$lstat,test$medv,pred)
View(predicted)
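To visualise how the order-5 fit tracks the test data, the predictions can be plotted against lstat with ggplot2 (an added sketch; plot_df, actual and fitted are our own names):

plot_df<-data.frame(lstat=test$lstat,actual=test$medv,fitted=pred)
ggplot(plot_df,aes(lstat))+geom_point(aes(y=actual))+geom_point(aes(y=fitted),colour="red")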