Introduction

‘Ninety percent of all millionaires become so through owning real estate. However, seventy-five to eighty-five percent of real estate investments fail’

Investments in this industry require careful analysis of multiple parameters. While some have the gift of making the right calls, for everyone else there is regression.

This page walks through an analysis and prediction of the median value of homes given a set of factors such as structural quality, neighborhood, accessibility, etc. Although it uses the built-in Boston housing data set from the MASS library in R, the analysis and approach can be extended to real-world investment decisions.

If you’re looking for a quick read instead of the entire code, here’s your escape

Required Libraries

We will utilize the following libraries:

library(MASS)
library(tidyverse)
library(caret)
library(GGally)
library(broom)
library(modelr)
library(gridExtra)
library(corrplot)
library(leaps)
library(glmnet)
library(WVPlots)
library(rpart)

Exploratory Data Analysis

Data Import

data(Boston)

We begin by checking the dimensions of data:

dim(Boston)
## [1] 506  14

There are 506 rows and 14 variables. Let’s take a look at a snapshot of the data:

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We observe no missing values and no obvious abnormal values at a first glance (an extremely rare real-world scenario).

Let’s visualize the distribution of each variable:

Boston %>% 
  gather(key,val,-chas) %>% 
  ggplot(aes(val,fill = factor(chas))) + 
  geom_histogram() + 
  facet_wrap(~key, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histograms above, we observe:

  • The proportion of owner-occupied units built prior to 1940 (age) and the proportion of blacks by town (black) are highly skewed to the left, meaning most of their values occur at the higher end.
  • The average number of rooms per dwelling (rm) follows a roughly normal distribution, i.e. most dwellings average about 6 rooms.
  • More dwellings have smaller distances to the five Boston employment centers (dis is skewed to the right).
  • More dwellings have a lower median value (under $25,000) than a higher one (medv is skewed to the right).
  • Most suburbs have a relatively low proportion of lower-status residents, i.e. adults without high school education and male workers classified as laborers (lstat is skewed to the right).
  • The full-value property tax rate per $10,000 (tax) separates into 2 distinct clusters: one below 500 and the other between 666 and 711.
  • The index of accessibility to radial highways (rad) is also separated into 2 distinct clusters: most dwellings have an index below 10, while the rest sit at the maximum value of 24.

Correlation

cormat <- cor(Boston[,-14])
corrplot(cormat)

Studying the correlation between the variables (a quick programmatic cross-check follows the list), we notice:

  • A strong correlation of 0.912 between rad and tax. This is expected, as greater accessibility to radial highways tends to come with a higher property tax rate.
  • A correlation of 0.76 between the proportion of non-retail business acres per town (indus) and the nitrogen oxides concentration (nox). This corroborates the fact that non-retail businesses contribute heavily to nitrogen oxide levels in the air.
  • A correlation of 0.73 between indus and the property tax rate (tax). The tax rate may also be influenced by the presence of non-retail business near the dwellings.
  • A correlation of 0.73 between the proportion of owner-occupied units built prior to 1940 (age) and nox. This suggests that the older parts of the city, where the older houses are situated, suffer more air pollution.
  • A negative correlation of 0.74 between the mean distance to the five Boston employment centers (dis) and age. Interestingly, older homes tend to be closer to the employment centers, which suggests that the city expanded outward from where the employment centers are located.
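As a quick cross-check of the numbers quoted above, the strongest pairwise correlations can also be pulled out of cormat programmatically. A minimal sketch (the 0.7 cutoff is an arbitrary, illustrative choice):

# List variable pairs whose absolute correlation exceeds 0.7
as.data.frame(as.table(cormat)) %>% 
  filter(as.character(Var1) < as.character(Var2), abs(Freq) > 0.7) %>% 
  arrange(desc(abs(Freq)))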

Let’s take a cursory glance at the correlation coefficient of each variable with medv, the value to be predicted, along with its significance. The code below works because all the columns are numeric.

Boston %>% 
  gather(var,value,-medv) %>% 
  group_by(var) %>% 
  summarize(correlation = cor(medv,value), p_value = cor.test(medv,value)$p.value) %>% 
  arrange(p_value)
## # A tibble: 13 x 3
##        var correlation      p_value
##      <chr>       <dbl>        <dbl>
##  1   lstat  -0.7376627 5.081103e-88
##  2      rm   0.6953599 2.487229e-74
##  3 ptratio  -0.5077867 1.609509e-34
##  4   indus  -0.4837252 4.900260e-31
##  5     tax  -0.4685359 5.637734e-29
##  6     nox  -0.4273208 7.065042e-24
##  7    crim  -0.3883046 1.173987e-19
##  8     rad  -0.3816262 5.465933e-19
##  9     age  -0.3769546 1.569982e-18
## 10      zn   0.3604453 5.713584e-17
## 11   black   0.3334608 1.318113e-14
## 12     dis   0.2499287 1.206612e-08
## 13    chas   0.1752602 7.390623e-05

Before interpreting the above results, it is important to visualize the scatter plots to back up the correlation numbers; at times a non-linear curve justifies a better fit. We will also use the GGally package shortly.

Boston %>% 
  mutate(crim_l = log(crim), lstat_l = log(lstat)) %>% 
  gather(var,value,-medv) %>% 
  ggplot(aes(value,medv)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~factor(var),scales = 'free_x')

The above summaries indicate:

  • A negative correlation of 0.74 between medv and lstat (percent of lower status of the population), i.e. the higher the proportion of lower-status residents, the lower the value of the house. This can be attributed to affordability.
  • A positive correlation of 0.70 between medv and the average number of rooms per dwelling (rm), i.e. as the number of rooms increases, a hike in the price of the dwelling can be observed.
  • Possible log transformations: log(crim) and log(lstat) show a stronger linear relationship with medv than the raw variables.

The tax and rad variables exhibit two clusters in their distributions and appear highly correlated. The figure below confirms this finding:

Boston %>% 
  select(tax,rad) %>% 
  ggpairs()

Creating new Variables

We create a new categorical variable to explore relationships between the different continuous variables. Above we noticed that rad = 24 and tax > 500 form a distinct cluster.

Boston %>% 
  mutate(rad_c = rad == 24) -> Boston

Given the high correlation between tax and rad, it can be assumed, for simplicity of the model, that the variation in tax is accounted for accurately enough by the rad variable.
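A small sanity check on this assumption might look like the following sketch (the particular summaries shown are illustrative, not from the original write-up):

# How strongly are tax and rad related, and how cleanly does rad_c split tax?
cor(Boston$tax, Boston$rad)
Boston %>% 
  group_by(rad_c) %>% 
  summarize(mean_tax = mean(tax), n = n())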

Relationship between medv, the categorical variables and the other factors

There are a number of variable selection techniques that can arrive at a good model sooner, but a deeper understanding of the categorical variables in our data can make a huge difference, since interaction terms are not easily accounted for by variable selection algorithms. Let’s explore how these categorical variables affect the relationship of each factor with medv.

crim

ggplot(Boston, aes(crim, medv)) + geom_point()

For lower crime rate areas, medv is higher, as expected, but the graph is too crowded near 0. We can fix that by taking a log. We will also see how the categorical variables interact with this relationship.

ggplot(Boston,aes(log(crim),medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

ggplot(Boston,aes(log(crim),medv,col=factor(rad_c))) + 
  geom_point() + 
  geom_smooth(method = 'lm') 

ggplot(Boston,aes(log(crim),medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  facet_grid(~rad_c)

When adding rad_c and chas, there is a clear change of slope in the right panel (rad_c = TRUE), which means that we would need to include an interaction term while modeling.

zn

ggplot(Boston,aes(zn,medv, col = factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

There is not much difference in slope for zn when checking its interaction with chas; let’s check the interaction with the other categorical variable:

ggplot(Boston,aes(zn,medv, col = rad_c)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

ggplot(Boston,aes(zn,medv, col = factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~rad_c)

indus

ggplot(Boston,aes(indus,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

ggplot(Boston,aes(indus,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~ rad_c)

nox

ggplot(Boston,aes(nox,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

ggplot(Boston,aes(nox,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') 

rm

ggplot(Boston,aes(rm,medv)) + geom_point() + geom_smooth(method='lm')

rm appears to have a strong positive correlation with medv; let’s analyze it against the different categorical variables.

ggplot(Boston,aes(rm,medv, col=factor(rad_c))) + 
  geom_point() +
  geom_smooth(method = 'lm')

A HUGE interaction term is required to capture the slope change between rm and rad_c.

ggplot(Boston,aes(rm,medv, col=factor(chas))) + 
  geom_point() +
  geom_smooth(method = 'lm')

There is no specific interaction with chas; let’s observe them all together.

ggplot(Boston,aes(rm,medv, col=factor(chas))) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~rad_c)

age

ggplot(Boston,aes(age,medv)) + geom_point() + geom_smooth(method = 'lm')

The trend shows that medv decreases as age increases. Does it consistently decrease across different categories? Let’s find out.

ggplot(Boston,aes(age,medv, col = rad_c)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

ggplot(Boston,aes(age,medv, col = factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~rad_c)

Interestingly, while the rad variable is quite well distributed across age, the older houses accumulate heavily at the maximum rad value. We discovered earlier that houses in this quadrant tend to deviate from the linear, “bigger house, bigger price” pattern. Do these old houses also fetch lower prices for the same number of rooms?

Boston %>% 
  filter(age>50) %>% 
  ggplot(aes(rm,medv,col=age)) + 
  geom_point() + geom_smooth(method = 'lm') + 
  facet_wrap(~factor(rad))

Older houses are mostly distributed in the rad = 24 quadrant and thus probably tend to fetch lower prices even with a higher number of rooms.

dis

ggplot(Boston,aes(dis,medv)) + 
  geom_point()

Surprisingly, as the distance increases, the price also increases. Could this be because the houses near the employment centres are smaller?

ggplot(Boston,aes(dis,medv,col=age)) + 
  geom_point()

We notice that old houses are closer to the employment centers and we already noted that the prices of old constructions are slightly lower.

ggplot(Boston,aes(dis,medv,col=age)) + 
  geom_point() + 
  facet_wrap(~factor(round(rm)))

The size of the houses does not seem to play a huge role in this disparity; the age of construction does.

Let’s analyze the effect of categorical variables and see if there are any insights

ggplot(Boston,aes(dis,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  facet_wrap(~rad_c)

ggplot(Boston,aes(dis,medv,col=factor(rad_c))) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~rad_c)

ptratio

ggplot(Boston,aes(ptratio,medv)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

As the ptratio increases, the medv decreases. Exploring further:

ggplot(Boston,aes(ptratio,medv,col = factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  facet_wrap(~rad_c)

black

ggplot(Boston,aes(black,medv)) + 
  geom_point() +
  geom_smooth(method = 'lm')

Most of the observations with high values of black are concentrated around the middle of the medv range, though the distribution is quite scattered. Let’s see if it can be segmented better by the categorical variables:

ggplot(Boston,aes(black,medv,col= factor(chas))) + 
  geom_point() +
  geom_smooth(method = 'lm')

ggplot(Boston,aes(black,medv,col= factor(rad_c))) + 
  geom_point() +
  geom_smooth(method = 'lm')

ggplot(Boston,aes(black,medv,col= factor(chas))) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~rad_c)

lstat

ggplot(Boston,aes(lstat,medv)) + 
  geom_point()

There appears to be a strong negative correlation between lstat and medv; let’s delve deeper.

Checking across categories:

ggplot(Boston,aes(lstat,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Adding chas might explain the trend better; let’s also check with rad_c. The lines above seem to curve towards the end, so let’s explore a different smoothing method in geom_smooth:

ggplot(Boston,aes(lstat,medv,col=factor(rad_c))) + 
  geom_point() + 
  geom_smooth(method = 'loess')

ggplot(Boston,aes(lstat,medv,col=factor(chas))) + 
  geom_point() + 
  geom_smooth(method = 'loess') + 
  facet_wrap(~rad_c)

Let’s check if log treatment makes it more linear

ggplot(Boston,aes(log(lstat),medv,col=factor(rad_c))) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Log treatment might be a better option for lstat

EDA In Summary

Potential Findings

  • We derived a new categorical variable (rad_c) from the existing data, since the continuous distributions of rad and tax are segmented into identifiable clusters.
  • The crim variable is mostly concentrated around 0; taking its log creates a better association with medv.
  • rad_c, the newly generated categorical variable, separates the data into different slopes better than the existing chas variable.
  • The rm variable shows a huge change in slope when combined with the rad_c categorical variable, so it needs an interaction term. Other candidates for interaction terms are log(crim), black and lstat.
  • There is possible curvature in lstat; a quadratic term or a log transformation could be tried.

Modeling

Before modeling, it is only wise to split the data into train and test sets in order to keep tabs on the predictive power of our model.

Splitting data

Splitting the data 80/20 for initial model exploration. All columns are retained in Boston.Model; the redundancy between tax and rad noted earlier is addressed when building the models.

Boston.Model <- Boston 
index <- sample(nrow(Boston.Model), nrow(Boston.Model)*.80)
train <- Boston.Model[index,]
test <- Boston.Model[-index,]
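Note that sample() is not seeded here, so the exact split (and every downstream number) will vary between runs. If reproducibility matters, a seed can be set first; the value below is arbitrary, not from the original analysis:

# Optional: fix the RNG seed so the 80/20 split is reproducible
set.seed(1)
index <- sample(nrow(Boston.Model), nrow(Boston.Model)*.80)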

For the modeling phase, both classical and regularization techniques for variable selection are implemented in order to come up with the best linear regression model for the dependent variable. We begin with a simple linear regression on just lstat (the variable we found most strongly correlated with medv):

model.base <- lm(medv~lstat, data = train)
summary(model.base)
## 
## Call:
## lm(formula = medv ~ lstat, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10.08  -4.06  -1.55   1.74  24.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.96507    0.62889   55.60   <2e-16 ***
## lstat       -0.97765    0.04342  -22.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.203 on 402 degrees of freedom
## Multiple R-squared:  0.5578, Adjusted R-squared:  0.5567 
## F-statistic: 507.1 on 1 and 402 DF,  p-value: < 2.2e-16

The base model with lstat explains 56% of the total variation. To visualize this, let’s take a look at the graph below:

plot(medv~lstat, data = train)
abline(model.base, col = 'red',lty = 2)

As pointed out earlier, the top and bottom parts of the scatter plot exhibit slight curvature. Let’s check the residual plots for this model to confirm this:

par(mfrow=c(2,2))
plot(model.base)

Not only does homoscedasticity seem to be violated, the residuals also indicate curvature unexplained by the model. Let’s explore the alternative of a log transformation on lstat:

model.extended <- lm(medv~log(lstat), data = train)
summary(model.extended)
## 
## Call:
## lm(formula = medv ~ log(lstat), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7131  -3.6026  -0.6803   2.0815  25.9172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.7867     1.0781   48.96   <2e-16 ***
## log(lstat)  -12.7321     0.4412  -28.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.322 on 402 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.6737 
## F-statistic: 832.9 on 1 and 402 DF,  p-value: < 2.2e-16

The R-squared value has increased appreciably, which is a good sign. Let’s check the residual plots and fit.

par(mfrow= c(2,2))
plot(model.extended)

The log term seems to perform much better and the curvature looks to be better explained. We’ll now add our second most significant variable, rm:

model.extended2 <- lm(medv~log(lstat) + rm, data = train)
summary(model.extended2)
## 
## Call:
## lm(formula = medv ~ log(lstat) + rm, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8409 -3.1719 -0.8167  2.1999 26.8939 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.6579     3.9118   6.303 7.68e-10 ***
## log(lstat)  -10.0881     0.5455 -18.493  < 2e-16 ***
## rm            3.4857     0.4683   7.444 6.02e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.995 on 401 degrees of freedom
## Multiple R-squared:  0.714,  Adjusted R-squared:  0.7126 
## F-statistic: 500.5 on 2 and 401 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model.extended2)

Although there is a fair increase in the adjusted R-squared, we see a slight increase in curvature in the residual plot. Earlier we observed that the slopes of both lstat and rm changed with the addition of the categorical variable rad_c. Let’s add it to our model:

model.extended3 <- lm(medv ~ (log(lstat) + rm) * rad_c, train)
summary(model.extended3)
## 
## Call:
## lm(formula = medv ~ (log(lstat) + rm) * rad_c, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9519  -2.3716  -0.4693   2.0241  23.6381 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -22.4137     4.4722  -5.012 8.13e-07 ***
## log(lstat)            -4.0166     0.6218  -6.460 3.06e-10 ***
## rm                     8.7734     0.5249  16.715  < 2e-16 ***
## rad_cTRUE             95.9916     6.4456  14.892  < 2e-16 ***
## log(lstat):rad_cTRUE -13.2589     1.0738 -12.347  < 2e-16 ***
## rm:rad_cTRUE         -10.1037     0.7695 -13.130  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.002 on 398 degrees of freedom
## Multiple R-squared:  0.8178, Adjusted R-squared:  0.8155 
## F-statistic: 357.3 on 5 and 398 DF,  p-value: < 2.2e-16

Visualizing the residuals:

par(mfrow = c(2,2))
plot(model.extended3)

The addition of the interaction terms increased the adjusted R-squared to nearly 82%. Before adding the rest of the significant variables, let’s compare our current models on out-of-sample predictions:

model.base.mspe <- mean((predict(model.base,newdata = test) - test[,'medv'])^2)
model.extended.mspe <- mean((predict(model.extended,newdata = test) - test[,'medv'])^2)
model.extended2.mspe <- mean((predict(model.extended2,newdata = test) - test[,'medv'])^2)
model.extended3.mspe <- mean((predict(model.extended3,newdata = test) - test[,'medv'])^2)

models <- c(model.base.mspe,model.extended.mspe,model.extended2.mspe,model.extended3.mspe)

plot(models)

The last model, with the categorical variable and its interactions, has done much better than the other models. Let’s now evaluate a few variable selection techniques.

Variable selection techniques

Best Subset: The best subset technique by default uses an exhaustive search to arrive at the best subset model, i.e. the model containing only those variables that account for the maximum variation in the response.

regfit.model <- regsubsets(medv~., data = train, nvmax = 13)

plot(regfit.model, scale = "bic")
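Rather than reading the winning subset off the plot, it can also be extracted programmatically from the regsubsets object; a short sketch:

reg.summary <- summary(regfit.model)
which.min(reg.summary$bic)                          # subset size with the lowest BIC
coef(regfit.model, id = which.min(reg.summary$bic)) # coefficients of that subset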

Opting for BIC to choose the simplest adequate model, we get the final formula: medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat. A quick glimpse of this model shows a rather low adjusted R-squared value:

best.subset <- lm(medv~ crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat, data = train)

This model doesn’t compare well with the model we built earlier using just three predictors (plus their interaction terms).

Stepwise Variable Selection: In stepwise variable selection, variables are added (or removed) one step at a time until no further addition or removal improves the model.

nullmodel <- lm(medv~1, data=train)
fullmodel <- lm(medv~., data=train)

model.step.bwd<- step(fullmodel,direction='backward', trace = 0)
model.step.fwd<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward', trace = 0)
model.step.bth<- step(nullmodel,scope=list(lower=nullmodel, upper=fullmodel),  direction='both', trace = 0)
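To confirm that the three searches converge to the same specification (as noted below), their formulas can be printed side by side:

formula(model.step.bwd)
formula(model.step.fwd)
formula(model.step.bth)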

Notice that all the above variable selection techniques narrow down to a single model, and it doesn’t outperform our earlier selected model. We shall create a few more custom models, incorporating the insights gained from the earlier analysis, to build even more accurate models. Instead of splitting into training and testing data, we will perform K-fold cross validation to compare the different models.

Cross Validation and Ensembling

model_cust <- train(medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c, data = Boston.Model, method = 'lm', trControl = trainControl(method = 'repeatedcv', repeats = 5, number = 20, verboseIter = F))
model_step <- train(medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + age +indus, data = Boston.Model, method = 'lm', trControl = trainControl(method = 'repeatedcv', repeats = 5, number = 20, verboseIter = F))
model_lasso <- train(medv~chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax, data = Boston.Model, method = 'lm',  trControl = trainControl(method = 'repeatedcv', repeats = 5, number = 20, verboseIter = F))
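The variable set used in model_lasso above presumably comes from a lasso fit that is not shown in this write-up. A minimal sketch of how such a selection could be obtained with glmnet (loaded at the top) is below; the lambda.1se choice is an assumption, not taken from the original analysis:

# Lasso on the training predictors; non-zero coefficients are the selected variables
x <- model.matrix(medv ~ ., data = train)[, -1]   # drop the intercept column
y <- train$medv
cv.lasso <- cv.glmnet(x, y, alpha = 1)            # alpha = 1 is the lasso penalty
coef(cv.lasso, s = "lambda.1se")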

Model Evaluation

R Square / Adjusted R Square and Predicted R Square

fmod <- lm(medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c, data = Boston.Model)
press.ind <- residuals(fmod)/(1 - lm.influence(fmod)$hat)
tss <- sum(anova(fmod)$"Sum Sq")
PRESS <- sum(press.ind^2)
(pred.r.squared <- 1 - PRESS/(tss))
## [1] 0.8413558

For the selected model above, we see an R-squared of 0.85 and an adjusted R-squared of 0.85; together with the predicted R-squared above, we can be confident that the chosen model is not overfitting.
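For reference, the in-sample R-squared figures quoted here can be read straight off the fitted object, e.g. with broom (loaded at the top):

glance(fmod) %>% select(r.squared, adj.r.squared, sigma)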

Root Mean Squared Error

Let’s put all our models together and check their performance

model_list <- list(m.cust=model_cust,m.step=model_step,m.lasso = model_lasso)
resamps <- resamples(model_list)
bwplot(resamps, metric = "RMSE")

dotplot(resamps, metric='RMSE')

The dotplot, in combination with the box-and-whisker plot, suggests picking the m.cust model as the better predictor among the three.

Gains Chart

A gains chart is usually used for, and is more relevant to, classification problems. However, since this model is aimed at predicting the median prices of houses, it is intuitive to check whether our predictions sort in the same order as the true outcome.

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

Boston$Prediction <- predict(model_cust, Boston.Model)

GainCurvePlot(Boston, "Prediction","medv", "Gain Curve")

The Gain Curve above looks very promising as our predictions fall well above the diagonal line i.e. the line of random prediction.
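As a rough numeric companion to the gain curve, the Spearman rank correlation between the predictions and medv measures how closely the two orderings agree (a supplementary check, not part of the original comparison):

# Rank agreement between predicted and actual median values
cor(Boston$Prediction, Boston$medv, method = "spearman")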

Regression V/s Tree (CART)

Trees are a lazy man’s best friend: they arrive at a final model with relative ease while retaining enough explanatory power to convince the stakeholders. Having invested so long in doing the EDA and running different linear models, let’s pit them against this so-called ‘hack’ from the machine learning collection of algorithms.

We will utilize bagged regression trees via caret’s ‘treebag’ method, which grows rpart trees on bootstrap samples.

We use the entire data set, since trees do not need any transformation of variables. For comparison, we also fit a full linear regression model, and eventually we compare both against our chosen regression model.
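As a quick point of reference before the bagged comparison, a single unbagged rpart tree can also be grown and inspected in a couple of lines (a sketch, separate from the comparison below):

data(Boston)                                      # start from the raw data set; columns added during EDA would otherwise leak in
single.tree <- rpart(medv ~ ., data = Boston)     # one regression tree on all predictors
printcp(single.tree)                              # complexity / cross-validated error table
plot(single.tree); text(single.tree, cex = 0.7)   # quick look at the splits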

data("Boston")
model_tree <- train(medv ~ ., data = Boston, method = 'treebag', trControl = trainControl(method = 'repeatedcv', repeats = 5, number = 20, verboseIter = F))
model_full <- train(medv~., data = Boston, method = 'lm', trControl = trainControl(method = 'repeatedcv', repeats = 5, number = 20, verboseIter = F))
model_list <- list(m.cust=model_cust,m.tree=model_tree,m.full = model_full)
resamps <- resamples(model_list)
dotplot(resamps, metric = "RMSE")

Final Model

Thus, we arrive at the final model

\[
\begin{aligned}
medv = {} & 8.71 - 15.33 \cdot nox - 0.78 \cdot ptratio - 0.03 \cdot age - 3.06 \cdot \log(lstat) + 8.00 \cdot rm \\
          & + 0.50 \cdot \log(crim) - 0.88 \cdot dis + 101.66 \cdot rad\_c - 11.71 \cdot \log(lstat) \cdot rad\_c \\
          & - 9.21 \cdot rm \cdot rad\_c - 3.77 \cdot \log(crim) \cdot rad\_c - 3.36 \cdot dis \cdot rad\_c
\end{aligned}
\]
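These coefficients should correspond, up to rounding, to the fitted fmod object built above (the same specification as model_cust), which can be printed directly with, for example:

tidy(fmod)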

Final Word

It is evident that CART performs better than the plain full linear regression model (and this generally extends to many problems where little feature engineering is done). Trees benefit a lot from their inherent ability to handle missing data and from not requiring much EDA effort to identify variable transformations. However, because of the simplicity of linear regression, the analysis done in the exploratory phase and its incorporation into the final linear model lets it slightly outperform the CART model.

So, in short, if you have the time and an intention to impress your boss with the most powerful model that can both explain and predict, spend 80% of your time doing EDA and the remaining trying to prove that you didn’t waste your time after all :-)