This dataset contains information about 506 census tracts of Boston from the 1970 census. As an aspiring data scientist, understanding how to model data like this is of great importance to me. In this kernel, I will use the Boston housing data by Harrison and Rubinfeld (1979) and explore which factors affect the median value of homes by performing a linear regression analysis.

Boston Housing Data

You can load this data via the ‘MASS’ library. The data has the following features, with medv being the target (dependent) variable:
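  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
  • nox - nitrogen oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted mean of distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • lstat - percentage of the population with lower socioeconomic status
  • medv - median value of owner-occupied homes in $1000s

(Feature descriptions follow the MASS package documentation.)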

We will structure the code as follows:

  1. Loading the data
  2. Preparing the data
  3. Exploratory Data Analysis
  4. Building the model and accuracy analysis
  5. Final Analysis of the model

1. Load the Boston Housing data and assign it to any variable

library(MASS)
housing <- Boston

Other libraries we may need:

library(corrplot) #for visualisation of correlation
library(lattice) #for visualisation
library(ggplot2) #for visualisation
library(caTools) #for splitting data into testing and training data
library(dplyr) #manipulating dataframe
library(plotly) #converting ggplot to plotly
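
If any of these packages are not already installed, they can be installed first (a one-off step):

install.packages(c('corrplot', 'lattice', 'ggplot2', 'caTools', 'dplyr', 'plotly')) #install any packages that are missing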

2. Preparing the data

Checking for missing (NA) values and removing them

numberOfNA <- sum(is.na(housing)) #total number of missing values
if(numberOfNA > 0) {
  housing <- housing[complete.cases(housing),]
}
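
A per-column check is another quick way to confirm that nothing is missing (a small sketch):

colSums(is.na(housing)) #number of NAs in each column; all zeros for this dataset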

Prepare the training and testing data

set.seed(123)
split <- sample.split(housing,SplitRatio = 0.75) #returns a logical vector used to split the data according to the split ratio
train <- subset(housing,split==TRUE)
test <- subset(housing,split==FALSE)
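
Note that sample.split() is more commonly called on the target column (e.g. sample.split(housing$medv, SplitRatio = 0.75)) so that the split is made row by row; either way, it is worth checking the resulting sizes (a small sketch):

nrow(train) #number of training observations
nrow(test)  #number of test observations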

3. Exploratory Data Analysis

This is a crucial part and usually takes up most of the time. A proper and extensive EDA would reveal interesting patterns and help to prepare the data in a better way!

Now let’s perform some exploratory data analysis to understand how the variables of the data are related to one another.

First, let’s see the structure of the different variables in the Boston Housing dataset:

str(housing)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Here we can see that the variables ‘chas’ and ‘rad’ are stored as integers; ‘chas’ is a 0/1 dummy variable and ‘rad’ is an index, so both are effectively categorical rather than continuous.

The head command shows the first six rows of the dataset:

head(housing)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

The summary command gives the basic statistics of the dataset, such as the minimum, maximum, mean, median, and 1st and 3rd quartiles:

summary(housing)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here we can see that the variables ‘crim’ and ‘black’ take a wide range of values.

The variables ‘crim’, ‘zn’, ‘rm’ and ‘black’ have a large difference between their median and mean, which indicates a lot of outliers in those variables.

par(mfrow = c(1, 4))
boxplot(housing$crim, main='crim', col='skyblue')
boxplot(housing$zn, main='zn', col='skyblue')
boxplot(housing$rm, main='rm', col='skyblue')
boxplot(housing$black, main='black', col='skyblue')

As suggested earlier, the variables ‘crim’, ‘zn’, ‘rm’ and ‘black’ do have a lot of outliers.

Finding correlation

Correlation is a statistical measure that indicates the level of linear dependence between a pair of variables. Its value lies between -1 and +1.

  • A value above 0 indicates a positive correlation, i.e. Y tends to increase as X increases.
  • A value below 0 indicates a negative correlation, i.e. Y tends to decrease as X increases.
  • A value of 0 suggests no linear relationship.

Usually we would use the function ‘cor’ to find the correlation between two variables, but since we have 14 variables here, it is easier to examine the correlations using the corrplot function from the ‘corrplot’ package.
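
For example, cor applied to a single pair of variables looks like this:

cor(housing$rm, housing$medv)    #positive: more rooms tends to mean a higher median value
cor(housing$lstat, housing$medv) #negative: a higher lstat tends to mean a lower median value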

Correlation plots are a great way of exploring data and seeing the level of interaction between the variables.

corrplot(cor(housing))
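
A few corrplot options can make the plot easier to read, for example showing only the upper triangle with numeric coefficients (a sketch using standard corrplot arguments):

corrplot(cor(housing), method = 'number', type = 'upper', tl.col = 'black')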

Since this is a linear regression exercise that looks at how the median value of homes in Boston varies with different factors, it makes sense to see the trends of all the variables.

Before moving on to analyzing linearity between ‘medv’ and different variables, there are few things we must know:

[Figure: Types of Linear Models in R. Source: Montefiore Institute]

We will now try to find out the linearity between ‘medv’ and the other variables, keeping one thing in mind: “It is not worth complicating the model for a very small increase in the Adjusted R-squared value.”

dropList <- c('chas','rad','crim','zn','black')
#We drop chas and rad because they are categorical (a dummy variable and an index)
#We drop crim, zn and black because they have a lot of outliers
housingplot <- housing[,!colnames(housing) %in% dropList]
splom(housingplot, col = 'skyblue')

The first row of the plot is the most useful: it indicates how the different variables relate to the median value of homes in Boston.

Comparing the scatter plots and the Adjusted R-squared values of a simple linear fit versus a quadratic fit for each predictor, only ‘lstat’ shows a large enough improvement in Adjusted R-squared to justify adding its squared term to the model via the identity function I().
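
As a quick illustration of that comparison for ‘lstat’ (a sketch; the exact values depend on the training split):

fit_linear    <- lm(medv ~ lstat, data = train)              #simple linear fit
fit_quadratic <- lm(medv ~ lstat + I(lstat^2), data = train) #adds the squared term
summary(fit_linear)$adj.r.squared    #Adjusted R-squared of the linear fit
summary(fit_quadratic)$adj.r.squared #noticeably higher for the quadratic fit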

4. Building the model and accuracy analysis

How to analyze a model

In Linear Regression

  • The Null Hypothesis is that the coefficients associated with the variables are zero.
  • The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).
  • If the Pr(>|t|) value has 3 stars, it means that the coefficient is of very high statistical significance. A Pr(>|t|) value less than 0.05 is generally considered good.
  • Multiple R-squared measures the proportion of the variation in your dependent variable explained by all of your independent variables in the model.
  • Adjusted R-squared measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable. It penalizes you for adding independent variables that do not help in predicting the dependent variable.
  • If the F-statistic is significant then the model is good (the higher the F-statistic, the better).
  • Our key objective is to determine the variable(s) that give the best predictive model.

Let’s begin by fitting all the variables.

# Fitting a multiple linear regression
# '.' fits medv using all the other variables as predictors
lm.fit1 <- lm(medv~.,data=train)
summary(lm.fit1)
## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4182  -2.4857  -0.3826   1.5040  24.2247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.014799   5.797019   5.523 6.55e-08 ***
## crim         -0.108156   0.033650  -3.214 0.001431 ** 
## zn            0.039578   0.015572   2.542 0.011465 *  
## indus        -0.014607   0.067178  -0.217 0.827999    
## chas          3.596486   0.917827   3.918 0.000107 ***
## nox         -18.362276   4.209855  -4.362 1.70e-05 ***
## rm            4.229451   0.469245   9.013  < 2e-16 ***
## age           0.001901   0.015649   0.121 0.903374    
## dis          -1.409436   0.223701  -6.301 8.97e-10 ***
## rad           0.252868   0.070250   3.600 0.000365 ***
## tax          -0.010172   0.003953  -2.573 0.010488 *  
## ptratio      -0.900114   0.144507  -6.229 1.36e-09 ***
## black         0.009349   0.003127   2.990 0.002992 ** 
## lstat        -0.473235   0.060917  -7.768 9.00e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.439 on 348 degrees of freedom
## Multiple R-squared:  0.7723, Adjusted R-squared:  0.7638 
## F-statistic:  90.8 on 13 and 348 DF,  p-value: < 2.2e-16

Iteration 1

  • Adjusted R-squared value is around 0.76
  • F-statistic value is 90.8
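
These figures can also be pulled directly from the summary object instead of being read off the printed output (a small sketch):

s1 <- summary(lm.fit1)
s1$adj.r.squared       #Adjusted R-squared, about 0.76
s1$fstatistic['value'] #F-statistic, about 90.8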

Improvements

  • The variables ‘age’ and ‘indus’ have very high Pr(>|t|) values and low significance, so removing them could give us a better model.
  • As we noticed in the EDA, the relationship between ‘lstat’ and ‘medv’ is non-linear, so adding a squared term should give a better fit.

lm.fit2 <- lm(medv~.-age-indus+I(lstat^2),data=train)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ . - age - indus + I(lstat^2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1653  -2.3843  -0.2234   1.7692  23.8610 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.279359   5.171206   7.209 3.52e-12 ***
## crim         -0.151059   0.030552  -4.944 1.19e-06 ***
## zn            0.016149   0.014038   1.150 0.250769    
## chas          3.135843   0.817476   3.836 0.000148 ***
## nox         -13.254046   3.539960  -3.744 0.000212 ***
## rm            3.692559   0.408420   9.041  < 2e-16 ***
## dis          -1.280147   0.185742  -6.892 2.58e-11 ***
## rad           0.238068   0.059762   3.984 8.26e-05 ***
## tax          -0.008809   0.003170  -2.778 0.005760 ** 
## ptratio      -0.787533   0.127831  -6.161 2.00e-09 ***
## black         0.009370   0.002795   3.352 0.000890 ***
## lstat        -1.603102   0.134567 -11.913  < 2e-16 ***
## I(lstat^2)    0.033168   0.003672   9.032  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.991 on 349 degrees of freedom
## Multiple R-squared:  0.8154, Adjusted R-squared:  0.8091 
## F-statistic: 128.5 on 12 and 349 DF,  p-value: < 2.2e-16

Iteration 2

  • Adjusted R-squared value increased to around 0.81
  • F-statistic value increased to 128.5
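
Because lm.fit2 both drops and adds terms, it is not nested inside lm.fit1, so an information criterion is a convenient way to compare the two models (a sketch; lower AIC is better):

AIC(lm.fit1, lm.fit2) #lm.fit2 should show the lower (better) AIC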

Improvements

  • The variable ‘zn’ has a very high Pr(>|t|) value and low significance, so removing it could give us a better model.
  • Adding interactions between highly significant variables could give us a better model.

lm.fit3 <- lm(medv~.-indus-age-zn+rm*lstat-black+rm*rad+lstat*rad,data=train)
summary(lm.fit3)
## 
## Call:
## lm(formula = medv ~ . - indus - age - zn + rm * lstat - black + 
##     rm * rad + lstat * rad, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4588  -1.8988  -0.1295   1.5532  21.6076 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.206624   5.769480  -3.849 0.000141 ***
## crim         -0.097030   0.027330  -3.550 0.000438 ***
## chas          2.845805   0.723912   3.931 0.000102 ***
## nox         -15.513258   3.009914  -5.154 4.28e-07 ***
## rm           11.623469   0.588117  19.764  < 2e-16 ***
## dis          -0.855225   0.141565  -6.041 3.92e-09 ***
## rad           2.760981   0.293506   9.407  < 2e-16 ***
## tax          -0.009317   0.002682  -3.473 0.000578 ***
## ptratio      -0.679298   0.105630  -6.431 4.18e-10 ***
## lstat         1.751194   0.227317   7.704 1.38e-13 ***
## rm:lstat     -0.317622   0.038477  -8.255 3.17e-15 ***
## rm:rad       -0.332028   0.042475  -7.817 6.45e-14 ***
## rad:lstat    -0.032398   0.004144  -7.819 6.37e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 349 degrees of freedom
## Multiple R-squared:  0.8642, Adjusted R-squared:  0.8596 
## F-statistic: 185.1 on 12 and 349 DF,  p-value: < 2.2e-16

Iteration 3

  • Adjusted R-squared value increased to around 0.86
  • F-statistic value increased to 185.1

5. Final Analysis of the model

residuals <- data.frame('Residuals' = lm.fit3$residuals)
res_hist <- ggplot(residuals, aes(x=Residuals)) + geom_histogram(color='black', fill='skyblue') + ggtitle('Histogram of Residuals')
res_hist

Looking at the above histogram, we can say that the distribution of residuals is slightly right-skewed but otherwise reasonably close to normal.
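
A formal check such as the Shapiro-Wilk test can complement the histogram (a sketch; a very small p-value would indicate a departure from normality):

shapiro.test(residuals(lm.fit3)) #Shapiro-Wilk normality test on the residuals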

plot(lm.fit3, col='skyblue') #standard diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

test$predicted.medv <- predict(lm.fit3, test)
pl1 <- test %>% 
  ggplot(aes(medv, predicted.medv)) +
  geom_point(alpha=0.5) + 
  stat_smooth(colour='black') +
  xlab('Actual value of medv') +
  ylab('Predicted value of medv') +
  theme_bw()

ggplotly(pl1)
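
To put a number on out-of-sample accuracy, we can also compute the root mean squared error on the test set (a sketch; the exact value depends on the split):

rmse <- sqrt(mean((test$medv - test$predicted.medv)^2)) #RMSE of predictions on the test set
rmse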

I hope you enjoyed this analysis! I think going forward it would be interesting to use a method other than linear regression.

Comments, questions, and upvotes are welcome!

Thank You!