Predicting House - Pricing

The data were taken from the UCI Machine Learning Repository.

The raw file has 16 columns: two identifiers (TOWN and TRACT) and 14 variables. Some of the variables are listed below:

MEDV - median value of owner-occupied homes, in thousands of dollars
LON - longitude of the census tract
LAT - latitude of the census tract
CRIM - per capita crime rate
RM - average number of rooms per dwelling
RAD - measure of closeness to important highways
DIS - measure of how far the tract is from employment centers in Boston
CHAS - 1 if the tract is close to the Charles River, 0 otherwise

We shall use all the other variables in the data, not only those listed above, to predict MEDV.

I saved the data as boston.csv on my computer and used readr to read it into R.

library(readr)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(ggplot2)
library(GGally)
boston <- read_csv("boston.csv")

Let us take a look at the structure of the data.

str(boston)

## Classes 'tbl_df', 'tbl' and 'data.frame':    506 obs. of  16 variables:
##  $ TOWN   : chr  "Nahant" "Swampscott" "Swampscott" "Marblehead" ...
##  $ TRACT  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
##  $ LON    : num  -71 -71 -70.9 -70.9 -70.9 ...
##  $ LAT    : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

We show some initial plots of the data.

Let's see if there is a need to transform some of the variables.

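# Columns 5, 6, 10, 11, 13 are MEDV, CRIM, NOX, RM, DIS, the same variables plotted after the transformation
ggpairs(boston, columns = c(5, 6, 10, 11, 13))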

ggplot(boston, aes(x = MEDV, y = RM)) + geom_point()

There seems to be a linear relationship between MEDV and RM, although some points clearly fall outside the linear trend.

Transforming the data would help resolve some of the distributional skewness.

We use the preProcess function in the caret package to transform, center, and scale the data. The TOWN and TRACT identifier columns are dropped first.

boston <- boston[, -c(1, 2)]

pp <- preProcess(boston, method = c("BoxCox", "center", "scale"))
boston <- predict(pp, boston)
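
To make the three steps concrete, here is a minimal base-R sketch of what a Box-Cox transformation followed by centering and scaling does to a single positive variable. The vector x and the fixed lambda are made up for illustration; preProcess estimates lambda from the data, so this is not a reproduction of its output.

```r
# Box-Cox transform for a strictly positive vector; lambda = 0 is the log case
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

set.seed(1)                       # arbitrary seed, only for repeatability
x <- rlnorm(200)                  # made-up right-skewed data, not the Boston data
lambda <- 0                       # hypothetical value; caret would estimate it

x_bc  <- box_cox(x, lambda)
x_std <- (x_bc - mean(x_bc)) / sd(x_bc)   # center, then scale

# Mean is 0 and standard deviation is 1, up to floating point error
c(mean = mean(x_std), sd = sd(x_std))
```

The Box-Cox transformation is only defined for strictly positive values, which is why the sketch checks the input first.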

Let us check the distribution of some of the variables after the transformation.

ggpairs(boston, columns = c(3, 4, 8, 9, 11))

The plots above show that transforming the variables has resolved some of the skewness problems, which should help the predictive performance of our model.
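
Skewness can also be put into a number instead of being judged by eye. The sketch below computes the sample skewness of a simulated right-skewed variable before and after a log transformation. The data and the skewness helper are my own illustration and are not taken from the Boston data.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)                      # arbitrary seed
raw <- rlnorm(500)               # simulated right-skewed variable (assumption)
transformed <- log(raw)          # a log transform, i.e. Box-Cox with lambda = 0

skewness(raw)                    # clearly positive
skewness(transformed)            # much closer to 0
```

Applying the same helper to a column before and after preprocessing would give a quick check on whether the transformation helped for that variable.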

We extract MEDV from the predictors and split the data into a training set and a test set.

bostonY <- boston$MEDV   # response as a plain numeric vector
bostonX <- boston[-3]    # all columns except MEDV (column 3)

indx <- createDataPartition(bostonX$CHAS, p = .50, list = FALSE)
trainX <- bostonX[indx, ]
testX <- bostonX[-indx, ]

trainY <- bostonY[indx]
testY <- bostonY[-indx]
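
For readers without caret, a plain random 50/50 split can be sketched in base R as below. This is a simple random split, not the stratified sampling that createDataPartition performs, and the data frame d is a made-up stand-in. Setting a seed makes the partition repeatable from run to run.

```r
set.seed(123)                              # arbitrary seed so the split is repeatable

# Made-up stand-in for the predictors and response
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)

n <- nrow(d)
train_rows <- sample(n, size = floor(0.5 * n))   # row numbers used for training

train_set <- d[train_rows, ]
test_set  <- d[-train_rows, ]                    # everything not used for training

c(train = nrow(train_set), test = nrow(test_set))
```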

We use linear regression to predict MEDV:

regModel <- train(trainX, trainY,
                  method = "lm")

regModel

## Linear Regression 
## 
## 253 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 253, 253, 253, 253, 253, 253, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.6334229  0.5517792
## 
##
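
The RMSE and R squared printed above are averages over 25 bootstrap resamples of the training data. The sketch below shows the general idea in base R on simulated data: refit the model on each bootstrap sample and score it on the rows that sample left out. It illustrates the technique and is not caret's exact procedure; d, n_reps, and the model formula are all made up for the example.

```r
set.seed(42)                                # arbitrary seed

# Simulated data standing in for the training set
d <- data.frame(x = rnorm(200))
d$y <- 0.5 * d$x + rnorm(200, sd = 0.6)

n_reps <- 25
rmse_boot <- numeric(n_reps)

for (b in seq_len(n_reps)) {
  in_bag  <- sample(nrow(d), replace = TRUE)      # bootstrap sample of rows
  out_bag <- setdiff(seq_len(nrow(d)), in_bag)    # rows not drawn this time
  fit  <- lm(y ~ x, data = d[in_bag, ])
  pred <- predict(fit, newdata = d[out_bag, ])
  rmse_boot[b] <- sqrt(mean((d$y[out_bag] - pred)^2))
}

mean(rmse_boot)                             # resampled estimate of the RMSE
```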

Testing our model on the test set:

testPred <- predict(regModel, testX)

We save the predictions in a data frame to compare them with testY.

testResult <- data.frame(observation = testY, testPred)
head(testResult)

##    observation     testPred
## 1    0.3184308  0.612397677
## 3    1.2899894  0.756990144
## 4    1.1860099  0.791168996
## 8    0.1107223 -0.021337513
## 9   -0.5982272 -0.390436066
## 11  -0.8206419  0.005716975

Let's plot the observations against the predictions to visualize how well the model fits.

ggplot(testResult, aes(testPred, observation)) +
        geom_point() +
        geom_abline(intercept = 0, slope = 1)  # 45-degree reference line

The points fall fairly close to the 45-degree line, which suggests the model does fairly well in predicting MEDV.

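Finally, we use the root mean squared error (RMSE) to measure the performance of our model. It is the square root of the average squared difference between the observed values and the model predictions, so it can be read as a typical distance between the two. Because MEDV was transformed and scaled along with the predictors, the RMSE here is in standardized units rather than thousands of dollars. We also report R squared.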

RMSE(testPred, testY)

## [1] 0.6141431

R2(testPred, testY)

## [1] 0.6719048
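
Both statistics are easy to compute by hand, which helps in checking what they mean. The sketch below uses two short made-up vectors rather than the test set. The R squared here is the squared correlation between predictions and observations, which I believe is what caret's R2 computes by default, so treat that as an assumption.

```r
# Root mean squared error between predictions and observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R squared as the squared Pearson correlation
r_squared <- function(pred, obs) cor(pred, obs)^2

obs  <- c(1, 2, 3, 4)             # made-up observations
pred <- c(1.5, 1.5, 3.5, 3.5)     # made-up predictions

rmse(pred, obs)        # 0.5: every prediction is off by exactly 0.5
r_squared(pred, obs)   # 0.8
```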

Predicting House - Pricing

Daniel Abban

13 July 2016