We take the Boston Housing dataset, collected in 1978, with 506 entries of aggregated data on 14 features for homes in various Boston suburbs.
housing_data <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header = F)
colnames(housing_data) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
attach(housing_data); str(housing_data) # attach() lets us refer to columns (e.g. PTRATIO) by bare name later
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : num 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let’s first make our own assumptions about the data and then verify by further analysis whether the data backs them up.
What do we have? Read the list if you’re curious. If not, head over to the “So!”.
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's
So! Say I want to predict Boston housing prices (MEDV); then I want to see the relationships between MEDV and the variables affecting it.
In this case the response, or target, variable is ‘MEDV’, which contains the median home values in Boston at that time.
The most obvious predictors for it are ‘RM’, the number of rooms, and ‘LSTAT’, the percentage of lower-status population. ‘CRIM’ can also be taken as a third predictor for the model.
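A quick correlation check can back up this choice (a minimal sketch; every column here is numeric, so cor() applies directly):
# correlations of each feature with MEDV, sorted ascending;
# RM and LSTAT should show the strongest positive/negative relationships
round(sort(cor(housing_data)[, "MEDV"]), 2)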
housing_frgm <- housing_data[, c("MEDV", "RM", "LSTAT", "CRIM")]
str(housing_frgm); summary(housing_frgm)
## 'data.frame': 506 obs. of 4 variables:
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ LSTAT: num 4.98 9.14 4.03 2.94 5.33 ...
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## MEDV RM LSTAT CRIM
## Min. : 5.00 Min. :3.561 Min. : 1.73 Min. : 0.00632
## 1st Qu.:17.02 1st Qu.:5.886 1st Qu.: 6.95 1st Qu.: 0.08204
## Median :21.20 Median :6.208 Median :11.36 Median : 0.25651
## Mean :22.53 Mean :6.285 Mean :12.65 Mean : 3.61352
## 3rd Qu.:25.00 3rd Qu.:6.623 3rd Qu.:16.95 3rd Qu.: 3.67708
## Max. :50.00 Max. :8.780 Max. :37.97 Max. :88.97620
A MEDV max of exactly 50.00 looks suspicious, and indeed prices in this dataset are known to be censored (top-coded) at $50,000. Unrealistic nowadays anyway, but let’s roll with it.
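We can count how many observations sit right at that cap (a quick check):
sum(housing_frgm$MEDV == 50) # number of top-coded prices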
library(ggplot2)
#storing ggplots for three predictors: 'RM', 'LSTAT', 'CRIM'
plot1 <- ggplot(housing_frgm, aes(x = RM, y = MEDV)) + geom_point(col = "blue") + ylab("Prices in $1000's") + xlab("Number of rooms")
plot2 <- ggplot(housing_frgm, aes(x = LSTAT, y = MEDV)) + geom_point(col = "blue") + ylab("Prices in $1000's") + xlab("% of lower class")
plot3 <- ggplot(housing_frgm, aes(x = CRIM, y = MEDV)) + geom_point(col = "blue") + ylab("Prices in $1000's") + xlab("Crime rate")
After some thought I have also decided to add ‘PTRATIO’ to the housing fragment, as it shows the pupil-teacher ratio, which can be an important indicator for families with children.
Let’s add it to our housing fragment, make its plot, store it, and finally visualize the relationships between these five variables.
housing_frgm <- cbind(housing_frgm, PTRATIO) #PTRATIO is in scope thanks to attach(housing_data)
plot4 <- ggplot(housing_frgm, aes(x = PTRATIO, y = MEDV)) + geom_point(col = "blue") + ylab("Prices in $1000's") + xlab("Pupil/teacher")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2) #four plots arranged
We have at least three linear trends here: MEDV against RM, LSTAT, and PTRATIO. As for CRIM, the housing price doesn’t really seem to depend on it: houses with lower values show the same crime rates as the most expensive ones. This variable will not be the most useful one for our model, but let’s still take a look.
housing_lm <- lm(MEDV ~ ., housing_frgm)
summary(housing_lm)
##
## Call:
## lm(formula = MEDV ~ ., data = housing_frgm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5492 -3.1454 -0.9357 1.6894 30.1563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.92330 3.97564 4.257 2.48e-05 ***
## RM 4.61862 0.42716 10.812 < 2e-16 ***
## LSTAT -0.53431 0.04564 -11.708 < 2e-16 ***
## CRIM -0.06544 0.03081 -2.124 0.0342 *
## PTRATIO -0.88969 0.11883 -7.487 3.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.211 on 501 degrees of freedom
## Multiple R-squared: 0.6815, Adjusted R-squared: 0.6789
## F-statistic: 268 on 4 and 501 DF, p-value: < 2.2e-16
Yes, as we noticed before, the predictor CRIM won’t contribute much to the model. It is better to drop it and refit ‘housing_lm’.
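Before dropping it, a partial F-test can confirm the decision; drop1() refits the current four-predictor model without each term in turn (a quick sketch):
drop1(housing_lm, test = "F") # F-test for removing each term separately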
housing_frgm$CRIM <- NULL #dropping CRIM
housing_lm <- lm(MEDV ~ ., housing_frgm)
summary(housing_lm)
##
## Call:
## lm(formula = MEDV ~ ., data = housing_frgm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4871 -3.1047 -0.7976 1.8129 29.6559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.56711 3.91320 4.745 2.73e-06 ***
## RM 4.51542 0.42587 10.603 < 2e-16 ***
## LSTAT -0.57181 0.04223 -13.540 < 2e-16 ***
## PTRATIO -0.93072 0.11765 -7.911 1.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.229 on 502 degrees of freedom
## Multiple R-squared: 0.6786, Adjusted R-squared: 0.6767
## F-statistic: 353.3 on 3 and 502 DF, p-value: < 2.2e-16
Now every feature left in the model is ‘statistically significant’. The adjusted R-squared indicates that about 68% of the variation in the given Boston housing prices can be explained by the chosen features (number of rooms, % of lower status, and pupil/teacher ratio), at least using THIS model.
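As a side note, confidence intervals for the coefficients give one more way to read the fit (a minimal sketch using base R’s confint()):
round(confint(housing_lm), 3) # 95% confidence intervals for the coefficients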
Finally, we should also check whether the residuals are independent and whether they are (approximately) normally distributed.
ggplot(housing_frgm, aes(x = housing_lm$fitted.values, y = housing_lm$residuals)) + geom_point() + xlab("Estimated values") + ylab("Residuals") + geom_smooth(method = "lm", se = FALSE)
I am not sure about this clustering of points in the first plot. Could there be a pattern here or not? At least there is no clear one. A good sign, though, is that plenty of points sit near the residual = 0 line. Let’s assume the given observations are independent and have no effect on each other.
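For a more formal take on independence, one common check is the Durbin-Watson test for autocorrelated residuals (a sketch, assuming the lmtest package is installed; note it is designed for ordered data, so treat the result with care here):
library(lmtest)
dwtest(housing_lm) # H0: no first-order autocorrelation in the residuals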
qqnorm(housing_lm$residuals, ylab = "Residuals Quantiles"); qqline(housing_lm$residuals, col = "red")
As for the second plot, we see a considerable deviation from normality in the upper tail of the Q-Q plot. We can also easily spot a few outliers.
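A Shapiro-Wilk test puts a number on that visual impression (a quick sketch; with n = 506 even mild deviations will be flagged):
shapiro.test(housing_lm$residuals) # H0: residuals are normally distributed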
In this last part I find it interesting to build a second multiple linear model on 70% of the data we have and then test it on the remaining 30% that the model hasn’t seen before.
Let’s shuffle and split our data into train and test sets:
set.seed(1) #for reproducibility of results
#shuffle & split
train_set <- sample(1:nrow(housing_frgm), round(0.7*nrow(housing_frgm),0))
train <- housing_frgm[train_set, ]
#take the remaining 30%
test_set <- setdiff(1:nrow(housing_frgm), train_set)
test <- housing_frgm[test_set, ]
dim(train)
## [1] 354 4
dim(test)
## [1] 152 4
Reminder: the housing fragment had 506 observations (354 + 152).
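A quick sanity check that the split is clean (nothing lost, no overlap):
stopifnot(nrow(train) + nrow(test) == nrow(housing_frgm),
          length(intersect(train_set, test_set)) == 0)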
Running the second model:
housing_lm_train <- lm(MEDV ~ ., train)
summary(housing_lm_train)
##
## Call:
## lm(formula = MEDV ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.252 -2.900 -0.952 1.841 28.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.62444 4.56875 3.639 0.000315 ***
## RM 4.81921 0.49617 9.713 < 2e-16 ***
## LSTAT -0.55875 0.04826 -11.578 < 2e-16 ***
## PTRATIO -0.93743 0.13347 -7.023 1.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.011 on 350 degrees of freedom
## Multiple R-squared: 0.7053, Adjusted R-squared: 0.7028
## F-statistic: 279.2 on 3 and 350 DF, p-value: < 2.2e-16
We are still good on the p-values, and by the way the explained variance went up from 68% to 70%. Did the split happen to leave some outliers out of the training set?
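One way to probe that question: count how many of the top-coded MEDV = 50 rows (prime outlier candidates) landed in each split (a quick check):
c(train = sum(train$MEDV == 50), test = sum(test$MEDV == 50))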
predictors_test <- test[, c("RM", "LSTAT", "PTRATIO")] #dropping target variable MEDV
predictions <- predict(housing_lm_train, predictors_test)
plot(test$MEDV, predictions, xlab = "True values", ylab = "Predicted values by 'train' model"); abline(0, 1, col = "blue") #y = x, the perfect-prediction line
In the plot above, each point pairs a true housing price from the held-out test set with the value predicted by the ‘train’ model; the blue line marks perfect prediction (y = x). The points follow the line reasonably well, so the second model did a good job at predicting Boston housing prices. I am happy and can sleep well.
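“Good job” deserves a number, so let’s also compute the test-set RMSE, which summarizes the prediction error in the same $1000’s units as MEDV (a minimal sketch):
rmse <- sqrt(mean((test$MEDV - predictions)^2)) # root mean squared error on unseen data
rmse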