This dataset contains information about 506 census tracts of Boston from the 1970 census. As an aspiring data scientist, understanding how to model data like this is of great importance to me. In this kernel, I will use the Boston housing data of Harrison and Rubinfeld (1978) to explore which factors affect the median value of homes, and I will perform a linear regression analysis on that data.
You can load this data via the ‘MASS’ library. The data has 14 features (shown in the str() output below), with medv being the target (dependent) variable:
library(MASS)
housing <- Boston
library(corrplot) #for visualisation of correlation
library(lattice) #for visualisation
library(ggplot2) #for visualisation
library(caTools) #for splitting data into training and testing sets
library(dplyr) #manipulating dataframe
library(plotly) #converting ggplot to plotly
Checking for missing (NA) values and removing any incomplete rows:
numberOfNA <- sum(is.na(housing)) #summing over a logical matrix counts the TRUEs
if(numberOfNA>0) {
housing <- housing[complete.cases(housing),]
}
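Had any values been missing, a per-column count would show where; a quick optional check I added (not in the original kernel):
colSums(is.na(housing)) #number of NA values in each column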
Prepare the training and testing data
set.seed(123)
split <- sample.split(housing, SplitRatio = 0.75) #returns a logical vector marking which rows go to the training set
train <- subset(housing,split==TRUE)
test <- subset(housing,split==FALSE)
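A side note: sample.split() from caTools is usually given the outcome vector rather than the whole data frame. A sketch of that more common idiom, for reference only (the results below come from the split above, not from this one):
split2 <- sample.split(housing$medv, SplitRatio = 0.75) #split on the target vector
train2 <- subset(housing, split2 == TRUE)
test2 <- subset(housing, split2 == FALSE)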
Exploratory data analysis is a crucial part of the workflow and usually takes up most of the time: a proper, extensive EDA reveals interesting patterns and helps us prepare the data in a better way.
Let's perform some exploratory data analysis to understand how the variables of the data are related to one another.
First, let's see the structure of the different variables in the Boston housing dataset:
str(housing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Here we can see that the variables ‘chas’ and ‘rad’ are stored as integers; ‘chas’ is a 0/1 dummy variable and ‘rad’ an accessibility index, so both are effectively categorical rather than continuous.
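A quick tabulation (my addition, not in the original kernel) confirms that ‘chas’ takes only the values 0 and 1:
table(housing$chas) #chas is a 0/1 dummy for tracts bounding the Charles River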
The head command shows the first six rows of the dataset:
head(housing)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
The summary command gives the basic statistics of the dataset: minimum, maximum, mean, median, and the first and third quartiles of each variable.
summary(housing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here we can see that the variables ‘crim’ and ‘black’ take a wide range of values.
The variables ‘crim’, ‘zn’, ‘rm’ and ‘black’ show a large difference between their median and mean, which suggests skewed distributions with a lot of outliers. Boxplots will confirm this:
par(mfrow = c(1, 4))
boxplot(housing$crim, main = 'crim', col = 'skyblue')
boxplot(housing$zn, main = 'zn', col = 'skyblue')
boxplot(housing$rm, main = 'rm', col = 'skyblue')
boxplot(housing$black, main = 'black', col = 'skyblue')
As suggested earlier, the variables ‘crim’, ‘zn’, ‘rm’ and ‘black’ do have a lot of outliers.
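To make "a lot of outliers" concrete, we could count the points falling outside the usual 1.5 * IQR boxplot whiskers; count_outliers below is a hypothetical helper I added, not part of the original analysis:
#Count values outside the 1.5*IQR whiskers used by boxplot
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(housing[c('crim', 'zn', 'rm', 'black')], count_outliers)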
Correlation is a statistical measure of the level of linear dependence between a pair of variables; its value lies between -1 and +1.
Usually we would use the function ‘cor’ to find the correlation between two variables, but since we have 14 variables here, it is easier to examine all the pairwise correlations with the corrplot function from the ‘corrplot’ library.
Correlation plots are a great way of exploring data and seeing the level of interaction between the variables.
corrplot(cor(housing))
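If the default circles are hard to read, corrplot can also print the coefficients themselves; an optional variant (not in the original kernel):
#Show the numeric correlation coefficients on the upper triangle only
corrplot(cor(housing), method = 'number', type = 'upper')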
Since this is a linear regression experiment, looking at how the median value of homes in Boston varies with different factors, it makes sense to examine the trends of all the variables.
Before moving on to analyzing linearity between ‘medv’ and the other variables, there are a few things we should know about R-squared and Adjusted R-squared (the figure originally shown here is from the Montefiore Institute).
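For reference, the standard definitions behind those terms (my summary, not the original figure):
R-squared = 1 - RSS/TSS, where RSS is the residual sum of squares and TSS the total sum of squares.
Adjusted R-squared = 1 - (1 - R-squared) * (n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.
Unlike plain R-squared, the adjusted version penalizes predictors that add little explanatory power, which is exactly why we use it to judge whether a more complex model is worth it.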
We will now try to find the linear relationships between ‘medv’ and the other variables, keeping one thing in mind: it is not worth complicating the model for a very small increase in the Adjusted R-squared value.
dropList <- c('chas','rad','crim','zn','black')
#We drop chas and rad because they are effectively categorical, not continuous
#We drop crim, zn and black because they have a lot of outliers
housingplot <- housing[,!colnames(housing) %in% dropList]
splom(housingplot, col = 'skyblue')
The first row of the plot matrix is the most useful: it shows how each variable relates to the median value of homes in Boston.
Comparing the scatter plots and the Adjusted R-squared values of ‘medv’ against each variable, only ‘lstat’ shows a large enough gain in Adjusted R-squared from a quadratic fit over a linear one to justify squaring it inside the model with the identity function I().
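As a quick check of that claim for ‘lstat’ (a sketch I added; fit_lin and fit_quad are not part of the original kernel):
#Compare a linear and a quadratic fit of medv on lstat alone
fit_lin <- lm(medv ~ lstat, data = train)
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = train)
c(linear = summary(fit_lin)$adj.r.squared, quadratic = summary(fit_quad)$adj.r.squared)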
Linear Regression
Let’s begin by fitting all the variables.
# Fitting a multiple linear regression
# '.' on the right-hand side of the formula means: use all other variables as predictors
lm.fit1 <- lm(medv~.,data=train)
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.4182 -2.4857 -0.3826 1.5040 24.2247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.014799 5.797019 5.523 6.55e-08 ***
## crim -0.108156 0.033650 -3.214 0.001431 **
## zn 0.039578 0.015572 2.542 0.011465 *
## indus -0.014607 0.067178 -0.217 0.827999
## chas 3.596486 0.917827 3.918 0.000107 ***
## nox -18.362276 4.209855 -4.362 1.70e-05 ***
## rm 4.229451 0.469245 9.013 < 2e-16 ***
## age 0.001901 0.015649 0.121 0.903374
## dis -1.409436 0.223701 -6.301 8.97e-10 ***
## rad 0.252868 0.070250 3.600 0.000365 ***
## tax -0.010172 0.003953 -2.573 0.010488 *
## ptratio -0.900114 0.144507 -6.229 1.36e-09 ***
## black 0.009349 0.003127 2.990 0.002992 **
## lstat -0.473235 0.060917 -7.768 9.00e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.439 on 348 degrees of freedom
## Multiple R-squared: 0.7723, Adjusted R-squared: 0.7638
## F-statistic: 90.8 on 13 and 348 DF, p-value: < 2.2e-16
# Drop age and indus (insignificant in lm.fit1) and add a squared lstat term
lm.fit2 <- lm(medv~.-age-indus+I(lstat^2),data=train)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ . - age - indus + I(lstat^2), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1653 -2.3843 -0.2234 1.7692 23.8610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.279359 5.171206 7.209 3.52e-12 ***
## crim -0.151059 0.030552 -4.944 1.19e-06 ***
## zn 0.016149 0.014038 1.150 0.250769
## chas 3.135843 0.817476 3.836 0.000148 ***
## nox -13.254046 3.539960 -3.744 0.000212 ***
## rm 3.692559 0.408420 9.041 < 2e-16 ***
## dis -1.280147 0.185742 -6.892 2.58e-11 ***
## rad 0.238068 0.059762 3.984 8.26e-05 ***
## tax -0.008809 0.003170 -2.778 0.005760 **
## ptratio -0.787533 0.127831 -6.161 2.00e-09 ***
## black 0.009370 0.002795 3.352 0.000890 ***
## lstat -1.603102 0.134567 -11.913 < 2e-16 ***
## I(lstat^2) 0.033168 0.003672 9.032 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.991 on 349 degrees of freedom
## Multiple R-squared: 0.8154, Adjusted R-squared: 0.8091
## F-statistic: 128.5 on 12 and 349 DF, p-value: < 2.2e-16
# Also drop zn and black, and add interaction terms between rm, lstat and rad
lm.fit3 <- lm(medv~.-indus-age-zn+rm*lstat-black+rm*rad+lstat*rad,data=train)
summary(lm.fit3)
##
## Call:
## lm(formula = medv ~ . - indus - age - zn + rm * lstat - black +
## rm * rad + lstat * rad, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4588 -1.8988 -0.1295 1.5532 21.6076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.206624 5.769480 -3.849 0.000141 ***
## crim -0.097030 0.027330 -3.550 0.000438 ***
## chas 2.845805 0.723912 3.931 0.000102 ***
## nox -15.513258 3.009914 -5.154 4.28e-07 ***
## rm 11.623469 0.588117 19.764 < 2e-16 ***
## dis -0.855225 0.141565 -6.041 3.92e-09 ***
## rad 2.760981 0.293506 9.407 < 2e-16 ***
## tax -0.009317 0.002682 -3.473 0.000578 ***
## ptratio -0.679298 0.105630 -6.431 4.18e-10 ***
## lstat 1.751194 0.227317 7.704 1.38e-13 ***
## rm:lstat -0.317622 0.038477 -8.255 3.17e-15 ***
## rm:rad -0.332028 0.042475 -7.817 6.45e-14 ***
## rad:lstat -0.032398 0.004144 -7.819 6.37e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.423 on 349 degrees of freedom
## Multiple R-squared: 0.8642, Adjusted R-squared: 0.8596
## F-statistic: 185.1 on 12 and 349 DF, p-value: < 2.2e-16
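One compact way to compare the three fits side by side (my addition, not in the original kernel) is the Akaike Information Criterion, where lower is better:
#Lower AIC indicates a better trade-off between goodness of fit and model complexity
AIC(lm.fit1, lm.fit2, lm.fit3)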
residuals <- data.frame('Residuals' = lm.fit3$residuals)
res_hist <- ggplot(residuals, aes(x=Residuals)) + geom_histogram(color='black', fill='skyblue') + ggtitle('Histogram of Residuals')
res_hist
Looking at the above histogram, the residuals are centered near zero with a slight right skew, so they can be treated as approximately normally distributed.
par(mfrow = c(2, 2)) #arrange the four diagnostic plots in one grid
plot(lm.fit3, col = 'skyblue')
test$predicted.medv <- predict(lm.fit3, test)
pl1 <- test %>%
ggplot(aes(medv, predicted.medv)) +
geom_point(alpha = 0.5) +
stat_smooth(colour = 'black') + #colour set outside aes() so the line is actually black
xlab('Actual value of medv') +
ylab('Predicted value of medv') +
theme_bw()
ggplotly(pl1)
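To attach a single number to the test-set performance (a sketch I added, not in the original kernel), we can compute the root mean squared error of the predictions:
#RMSE of lm.fit3 on the held-out test set, in units of medv ($1000s)
rmse <- sqrt(mean((test$medv - test$predicted.medv)^2))
rmse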
I hope you enjoyed this analysis! I think going forward it would be interesting to use a method other than linear regression.
Comments, questions, and upvotes are welcome!