The data was taken from the UCI Machine Learning Repository.
The data contains 14 variables; I list some of them below:
We shall use all the variables listed above to predict MEDV.
I saved the data as boston.csv on my computer and use readr to read it into R.
library(readr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(GGally)
boston <- read_csv("boston.csv")
Let us take a look at the data:
str(boston)
## Classes 'tbl_df', 'tbl' and 'data.frame': 506 obs. of 16 variables:
## $ TOWN : chr "Nahant" "Swampscott" "Swampscott" "Marblehead" ...
## $ TRACT : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
## $ LON : num -71 -71 -70.9 -70.9 -70.9 ...
## $ LAT : num 42.3 42.3 42.3 42.3 42.3 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
We show some initial plots of the data.
Let's see if there is a need to transform some of the variables.
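# Select columns by name: at this point TOWN and TRACT are still present, so positions 3, 4, 8, 9, 11 would pick LON, LAT, INDUS, CHAS, RM
ggpairs(boston, columns = c("MEDV", "CRIM", "NOX", "RM", "DIS"))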
ggplot(boston, aes(x = MEDV, y = RM)) + geom_point()
There seems to be a linear relationship between MEDV and RM, though some points clearly fall outside the linear trend.
Transforming the data would help resolve some of the distributional skewness.
We use the preProcess function in the caret package to transform, center, and scale the data.
boston <- boston[, -c(1, 2)]
pp <- preProcess(boston, method = c("BoxCox", "center", "scale"))
boston <- predict(pp, boston)
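As a rough illustration of what the "BoxCox", "center", and "scale" steps do to a single positive variable, here is a minimal base-R sketch. The helper name box_cox and the fixed lambda are assumptions for illustration only; preProcess estimates lambda from the data, so its results will differ.

```r
# Hypothetical helper: Box-Cox transform for a fixed lambda (x must be > 0).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First few CRIM values from the str() output above.
crim <- c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905)

# lambda = 0 (a log transform) is chosen here only for illustration.
crim_bc <- box_cox(crim, lambda = 0)

# Center and scale: subtract the mean, then divide by the standard deviation.
crim_std <- (crim_bc - mean(crim_bc)) / sd(crim_bc)

print(round(crim_std, 3))
```

After this step the variable has mean 0 and standard deviation 1, which is why the values printed later in this post are no longer on the original scale.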
Let us check the distribution of some of the variables after the transformations:
ggpairs(boston, columns = c(3, 4, 8, 9, 11))
The plots above suggest that transforming the variables has resolved some of the skewness problems, which can improve the fit of a linear model.
Extract MEDV from the predictors and split the data into a training and a test set.
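bostonY <- boston$MEDV  # outcome as a plain numeric vector, so bostonY[indx] selects rows
bostonX <- boston[-3]   # drop column 3 (MEDV), leaving the 13 predictors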
indx <- createDataPartition(bostonX$CHAS, p = .50, list = FALSE)
trainX <- bostonX[indx, ]
testX <- bostonX[-indx, ]
trainY <- bostonY[indx]
testY <- bostonY[-indx]
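The split above is random, so the fitted model and the metrics reported later change from run to run. Setting a seed fixes that. The sketch below uses base R's sample() on row indices rather than createDataPartition, and the seed value 123 is an arbitrary assumption.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

n <- 506                                      # number of rows reported by str(boston)
train_rows <- sort(sample(n, size = n / 2))   # half of the rows for training
test_rows  <- setdiff(seq_len(n), train_rows) # the remaining rows for testing

length(train_rows)  # 253
length(test_rows)   # 253
```

Calling set.seed() immediately before createDataPartition() makes the caret split reproducible in the same way.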
We use linear regression to predict MEDV:
regModel <- train(trainX, trainY,
method = "lm")
regModel
## Linear Regression
##
## 253 samples
## 13 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 253, 253, 253, 253, 253, 253, ...
## Resampling results:
##
## RMSE Rsquared
## 0.6334229 0.5517792
##
##
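The "Bootstrapped (25 reps)" line means the RMSE and R-squared above are averages over 25 bootstrap resamples of the training set, each evaluated on the rows left out of that resample. A minimal base-R sketch of the idea on simulated data follows; the variables x and y and the seed are made up for illustration, and caret's internals differ in detail.

```r
set.seed(1)  # arbitrary seed for the simulated example

# Simulated data standing in for a training set.
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 0.7 * sim$x + rnorm(n, sd = 0.5)

boot_rmse <- replicate(25, {
  in_bag  <- sample(n, replace = TRUE)            # rows drawn with replacement
  out_bag <- setdiff(seq_len(n), in_bag)          # rows not drawn are held out
  fit  <- lm(y ~ x, data = sim[in_bag, ])         # fit on the bootstrap sample
  pred <- predict(fit, newdata = sim[out_bag, ])  # predict the held-out rows
  sqrt(mean((sim$y[out_bag] - pred)^2))           # RMSE on the held-out rows
})

mean(boot_rmse)  # resampled RMSE estimate
```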
Testing our model on the test set:
testPred <- predict(regModel, testX)
Save the test predictions in a data frame to compare them with testY:
testResult <- data.frame(observation = testY, testPred)
head(testResult)
## observation testPred
## 1 0.3184308 0.612397677
## 3 1.2899894 0.756990144
## 4 1.1860099 0.791168996
## 8 0.1107223 -0.021337513
## 9 -0.5982272 -0.390436066
## 11 -0.8206419 0.005716975
Let's plot the observations against the predictions to visualize how well the model fits:
ggplot(testResult, aes(testPred, observation)) +
geom_point()
Since the points fall fairly close to the 45-degree line, the model does fairly well in predicting MEDV.
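The 45-degree line itself is not drawn in the plot above. A sketch of how it could be added with geom_abline, using only the six rows printed by head(testResult) so that the snippet runs on its own (the names demo and p are hypothetical):

```r
library(ggplot2)

# The six rows shown by head(testResult) above.
demo <- data.frame(
  observation = c(0.3184308, 1.2899894, 1.1860099,
                  0.1107223, -0.5982272, -0.8206419),
  testPred    = c(0.612397677, 0.756990144, 0.791168996,
                  -0.021337513, -0.390436066, 0.005716975)
)

p <- ggplot(demo, aes(testPred, observation)) +
  geom_point() +
  # Reference line y = x: points on it are predicted exactly.
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")

print(p)
```

Replacing demo with testResult would draw the same reference line over the full test set.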
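Finally, the root mean squared error (RMSE) is used to measure the performance of our model. It can be interpreted roughly as the typical distance between the observed values and the model predictions. Because MEDV was transformed, centered, and scaled along with the predictors, the RMSE here is in units of the standardized MEDV rather than the original scale. We also report the R-squared.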
RMSE(testPred, testY)
## [1] 0.6141431
R2(testPred, testY)
## [1] 0.6719048
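For reference, both metrics are easy to compute by hand. The sketch below assumes caret's default behaviour, where RMSE is the square root of the mean squared difference and R2 is the squared correlation between predictions and observations. The helper names rmse_manual and r2_manual are hypothetical. It uses only the six rows printed by head(testResult), so the numbers will not match the full test-set values above.

```r
# Hypothetical helpers mirroring the two metrics.
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2_manual   <- function(pred, obs) cor(pred, obs)^2

# The six rows shown by head(testResult) earlier.
obs  <- c(0.3184308, 1.2899894, 1.1860099,
          0.1107223, -0.5982272, -0.8206419)
pred <- c(0.612397677, 0.756990144, 0.791168996,
          -0.021337513, -0.390436066, 0.005716975)

rmse_manual(pred, obs)
r2_manual(pred, obs)
```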