library(MASS)
View(Boston)
library(ggplot2)
ggplot(Boston,aes(lstat,medv))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess'
From the graph above we can see that x and y have a curvilinear relationship, so we use non-linear regression, i.e. polynomial regression.
## Train the model
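Before committing to a particular fit, one quick way to eyeball a candidate polynomial is to overlay it on the scatter plot. A minimal sketch (the degree 2 here is only for illustration, not the final model):

ggplot(Boston, aes(lstat, medv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))  # quadratic fit for comparison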
Splitting the data into 80% training and 20% test data.
library(caret)
## Loading required package: lattice
split<-createDataPartition(Boston$lstat,p=0.8,list=FALSE)
training<-Boston[split,]
test<-Boston[-split,]
model<-lm(medv ~ poly(lstat,7),data=training)
summary(model)
##
## Call:
## lm(formula = medv ~ poly(lstat, 7), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.753 -3.031 -0.847 2.079 26.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5224 0.2611 86.268 < 2e-16 ***
## poly(lstat, 7)1 -141.3579 5.2605 -26.871 < 2e-16 ***
## poly(lstat, 7)2 59.2898 5.2605 11.271 < 2e-16 ***
## poly(lstat, 7)3 -27.8237 5.2605 -5.289 2.03e-07 ***
## poly(lstat, 7)4 26.2592 5.2605 4.992 8.97e-07 ***
## poly(lstat, 7)5 -15.3378 5.2605 -2.916 0.00375 **
## poly(lstat, 7)6 9.4938 5.2605 1.805 0.07187 .
## poly(lstat, 7)7 -4.2257 5.2605 -0.803 0.42229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.261 on 398 degrees of freedom
## Multiple R-squared: 0.6967, Adjusted R-squared: 0.6914
## F-statistic: 130.6 on 7 and 398 DF, p-value: < 2.2e-16
Looking at the summary of the model, the polynomial coefficients are significant up to order 5. Hence we refit the model with a polynomial of order 5.
model<-lm(medv ~ poly(lstat,5),data=training)
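As a complementary check on the chosen order, anova() can compare nested polynomial fits. A sketch on the same training data (the names model5 and model7, and the degree-7 refit, are only for this comparison):

model5 <- lm(medv ~ poly(lstat, 5), data = training)
model7 <- lm(medv ~ poly(lstat, 7), data = training)
anova(model5, model7)  # non-significant F => orders 6 and 7 add little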
library(DMwR)
## Loading required package: grid
pred1<-predict(model,training)
regr.eval(training$medv,pred1)
## mae mse rmse mape
## 3.7223450 27.3937711 5.2339059 0.1769079
pred<-predict(model,test)
regr.eval(test$medv,pred)
## mae mse rmse mape
## 3.7798880 25.3439013 5.0342727 0.1811757
We use MAPE (mean absolute percentage error) as the accuracy measure. Here it is approximately 18%, so the model has an accuracy of about 82% on the test data.
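For reference, MAPE can also be computed by hand, which makes the 18% figure concrete (equivalent to the mape column of the regr.eval output above):

mean(abs((test$medv - pred) / test$medv))  # ~0.18, i.e. about 18% average error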
In the Residuals vs Fitted plot, if the red line runs along the dotted horizontal line and the residuals scatter around it without any systematic pattern, the linearity assumption is met.
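The code chunk for this plot is not shown above; it is the first of base R's diagnostic plots for lm objects:

plot(model, 1)  # Residuals vs Fitted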
Statistical tests are used to check the normality of residuals, e.g. the Shapiro-Wilk test and the Anderson-Darling test.
Null hypothesis: the data is normally distributed. Alternative hypothesis: the data is not normally distributed.
library(car)
## Loading required package: carData
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.91074, p-value = 9.644e-15
As the p-value is < 0.05 we reject the null hypothesis. Hence the residuals are not normally distributed.
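The Anderson-Darling test mentioned above can be run via the nortest package (an assumption on my part; the package is not loaded in the original):

library(nortest)
ad.test(model$residuals)  # Anderson-Darling normality test; same hypotheses as above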
Checking for constant error variance (homoscedasticity) with the Scale-Location plot:
plot(model,3)
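Besides the Scale-Location plot, a formal test for non-constant error variance is available in the car package loaded above; a sketch:

ncvTest(model)  # score test; H0: constant error variance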
To check for correlation between errors we use the Durbin-Watson test. Null hypothesis: no correlation between errors. Alternative hypothesis: correlation between errors.
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.4322083 1.114143 0
## Alternative hypothesis: rho != 0
As the p-value is < 0.05 we reject the null hypothesis, so there is correlation between the errors.
boxplot(model$residuals)
There are many outliers among the residuals, which affects the quality of the predictions.
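To see which observations drive this, car's outlierTest() flags the most extreme studentized residuals (a sketch; car is already loaded):

outlierTest(model)  # Bonferroni-adjusted test for the largest studentized residuals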
predicted<-data.frame(test$lstat,test$medv,pred)
View(predicted)