The Boston Housing Dataset

The dataset was obtained from the StatLib archive. The Boston house-price data are from Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978, and were used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

In R, the MASS library contains the Boston data set, which has 506 rows and 14 columns. The dataset records medv (median house value) and 13 predictors for 506 neighborhoods around Boston. The variable descriptions are given in the Data Type section below.

Explore the Boston Housing Dataset

First of all, let’s take a look at the variables in the dataset Boston.

library(MASS)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(Boston)
dim(Boston)
## [1] 506  14
  • names() — print all the variable names;
  • head() — print the first rows of the dataset (6 by default);
  • dim() — print the dimensions of the dataset (rows, columns).

Data Type

Then we need to know the data type of every variable. Based on the information we have, each variable is listed below as (variable name — description — variable type); we verify the stored types in R right after the list.

  • crim — per capita crime rate by town — Numerical, continuous

  • zn — proportion of residential land zoned for lots over 25,000 sq.ft. — Numerical, continuous

  • indus — proportion of non-retail business acres per town. — Numerical, continuous

  • chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). — categorical, nominal

  • nox — nitrogen oxides concentration (parts per 10 million). — Numerical, continuous

  • rm — average number of rooms per dwelling. — Numerical, continuous

  • age — proportion of owner-occupied units built prior to 1940. — Numerical, continuous

  • dis — weighted mean of distances to five Boston employment centres. — Numerical, continuous

  • rad — index of accessibility to radial highways; a larger index denotes better accessibility. — categorical, ordinal

  • tax — full-value property-tax rate per $10,000. — Numerical, continuous

  • ptratio — pupil-teacher ratio by town. — Numerical, continuous

  • black — \(1000 (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town. — Numerical, continuous

  • lstat — lower status of the population (percent). — Numerical, continuous

  • medv — median value of owner-occupied homes in $1000s. — Numerical, continuous
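
We can double-check how R stores these variables with str(); note that chas and rad are kept as numeric columns in the data frame even though we treat them as categorical, so we wrap them in as.factor() when a factor is needed.

str(Boston)
# chas and rad are stored as numbers; use as.factor(chas) / as.factor(rad) for categorical treatment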

Summary of all the variables

Are there any missing values?

sapply(Boston, anyNA)
##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE 
## ptratio   black   lstat    medv 
##   FALSE   FALSE   FALSE   FALSE

From the output, we know there are no missing values in the Boston dataset.

Next, we obtain the summary of all 14 variables as follows.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Histogram

We can draw a histogram for a particular variable, say medv.

library(ggplot2)
attach(Boston)

### In base R

hist(medv)

### ggplot

ggplot(Boston, aes(x = medv)) + geom_histogram(binwidth = 5, color = "darkblue",
    fill = "lightblue")

We can also draw histograms for all the numerical variables in the data frame Boston.

library(Hmisc)
hist.data.frame(Boston)
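
If Hmisc is not available, a similar grid of histograms can be sketched with ggplot2 facets; this version assumes the tidyr package for reshaping.

library(tidyr)
Boston_long <- pivot_longer(Boston, cols = everything(), names_to = "variable",
    values_to = "value")  # stack all 14 columns into (variable, value) pairs
ggplot(Boston_long, aes(x = value)) + geom_histogram(bins = 30) +
    facet_wrap(~variable, scales = "free")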

Boxplot

We can draw a boxplot for each column. For example, ptratio:

### In base R

boxplot(ptratio)

### ggplot
ggplot(Boston, aes(y = ptratio)) + geom_boxplot()

#### boxplot of ptratio by chas groups
ggplot(Boston, aes(x = as.factor(chas), y = ptratio)) + geom_boxplot()

Bar chart & Pie chart

attach(Boston)

### In base R
counts <- table(chas)
barplot(counts, main = "Chas Distribution", xlab = "Whether on the banks of the Charles River ( 0 = No; 1 = Yes)")

pie(counts, labels = names(counts), main = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")

### ggplot
ggplot(Boston, aes(x = as.factor(chas))) + geom_bar()

ggplot(Boston, aes(x = rad)) + geom_bar()

ggplot(Boston, aes(x = factor(1), fill = as.factor(chas))) + geom_bar(stat = "count") +
    coord_polar("y") + labs(fill = "chas", title = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")

Scatter plot

attach(Boston)
### In base R
plot(lstat, medv)

### ggplot
ggplot(Boston, aes(x = lstat, y = medv, color = as.factor(chas))) + geom_point(size = 2)

We can also combine the scatter plot with its marginal histograms.

library(ggExtra)
p = ggplot(Boston, aes(x = lstat, y = medv, color = rad)) + geom_point()
ggMarginal(p, type = "histogram")

Chapter 3 Simple Linear Regression

We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (proportion of the owner-occupied units built prior to 1940), and lstat (percent of households with low socioeconomic status).

Find the strongest linear correlation (between the numerical independent variables and the dependent variable)

Here, we use the correlation matrix to find the independent variable that has the strongest linear correlation with the dependent variable.

options(width = 300)
Boston_num <- Boston[, -c(4, 9)]  ## remove the chas and rad columns, which are categorical variables;
cor_matrix <- cor(Boston_num)
round(cor_matrix, 2)
##          crim    zn indus   nox    rm   age   dis   tax ptratio black lstat  medv
## crim     1.00 -0.20  0.41  0.42 -0.22  0.35 -0.38  0.58    0.29 -0.39  0.46 -0.39
## zn      -0.20  1.00 -0.53 -0.52  0.31 -0.57  0.66 -0.31   -0.39  0.18 -0.41  0.36
## indus    0.41 -0.53  1.00  0.76 -0.39  0.64 -0.71  0.72    0.38 -0.36  0.60 -0.48
## nox      0.42 -0.52  0.76  1.00 -0.30  0.73 -0.77  0.67    0.19 -0.38  0.59 -0.43
## rm      -0.22  0.31 -0.39 -0.30  1.00 -0.24  0.21 -0.29   -0.36  0.13 -0.61  0.70
## age      0.35 -0.57  0.64  0.73 -0.24  1.00 -0.75  0.51    0.26 -0.27  0.60 -0.38
## dis     -0.38  0.66 -0.71 -0.77  0.21 -0.75  1.00 -0.53   -0.23  0.29 -0.50  0.25
## tax      0.58 -0.31  0.72  0.67 -0.29  0.51 -0.53  1.00    0.46 -0.44  0.54 -0.47
## ptratio  0.29 -0.39  0.38  0.19 -0.36  0.26 -0.23  0.46    1.00 -0.18  0.37 -0.51
## black   -0.39  0.18 -0.36 -0.38  0.13 -0.27  0.29 -0.44   -0.18  1.00 -0.37  0.33
## lstat    0.46 -0.41  0.60  0.59 -0.61  0.60 -0.50  0.54    0.37 -0.37  1.00 -0.74
## medv    -0.39  0.36 -0.48 -0.43  0.70 -0.38  0.25 -0.47   -0.51  0.33 -0.74  1.00
### Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, hc.order = TRUE)

The last column of the correlation matrix shows the correlation coefficients \(r\) between the predictors and the response variable (medv). We find that lstat and medv have the strongest linear correlation, which we confirm with the short check below.
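
As a quick sanity check, we can sort the absolute correlations with medv; lstat should come out on top.

cors <- cor_matrix[, "medv"]
sort(abs(cors[names(cors) != "medv"]), decreasing = TRUE)  # lstat has the largest |r|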

For the next step, we will explore the linear relationship between the two variables. That is, \(y\)=medv, \(x\)=lstat.

plot(Boston$lstat, Boston$medv)

The data appear to show a decreasing relationship: as the percentage of the lower-status population (lstat) increases, the median home value decreases, which fits common sense. However, the scatterplot also shows some curvature, suggesting a curvilinear model might be a better fit. In this chapter, we first fit the straight-line model.

cor(Boston$lstat, Boston$medv)
## [1] -0.7376627
cor.test(Boston$lstat, Boston$medv)
## 
##  Pearson's product-moment correlation
## 
## data:  Boston$lstat and Boston$medv
## t = -24.528, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7749982 -0.6951959
## sample estimates:
##        cor 
## -0.7376627
pv <- cor.test(Boston$lstat, Boston$medv)$p.value

We perform the hypothesis test for linear correlation: \(H_0: \rho=0\) vs \(H_a: \rho\neq0\). Since the p-value \(=5.08\times 10^{-88}\) < \(\alpha=0.05\), we reject \(H_0\). Therefore, median home price and low-status percentage are significantly linearly related.

lm.fit = lm(medv ~ lstat, data = Boston)
attach(Boston)
lm.fit = lm(medv ~ lstat)

lm.fit
## 
## Call:
## lm(formula = medv ~ lstat)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494

The estimated regression equation obtained by the least squares method is \(\hat{y} = 34.554 - 0.950x\). Before we use the fitted model for point or interval estimation, we need to assess model adequacy.
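
For simple linear regression, the least squares estimates have the closed form \(\hat\beta_1 = r\, s_y/s_x\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\), so we can verify the coefficients by hand.

r <- cor(Boston$lstat, Boston$medv)
b1 <- r * sd(Boston$medv)/sd(Boston$lstat)  # slope
b0 <- mean(Boston$medv) - b1 * mean(Boston$lstat)  # intercept
c(b0, b1)  # matches coef(lm.fit)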

Test of model utility

anova(lm.fit)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
names(summary(lm.fit))
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
summary(lm.fit)$coefficients
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 34.5538409 0.56262735  61.41515 3.743081e-236
## lstat       -0.9500494 0.03873342 -24.52790  5.081103e-88
confint(lm.fit, level = 0.95)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
# Combine to display the coef and confidence interval for the parameters
# together;
coef <- summary(lm.fit)$coefficients
coef_CI <- confint(lm.fit, level = 0.95)
cbind(coef, coef_CI)
##               Estimate Std. Error   t value      Pr(>|t|)     2.5 %     97.5 %
## (Intercept) 34.5538409 0.56262735  61.41515 3.743081e-236 33.448457 35.6592247
## lstat       -0.9500494 0.03873342 -24.52790  5.081103e-88 -1.026148 -0.8739505

Now, we test \(H_0:\beta_1=0\) vs \(H_a: \beta_1\neq 0\). For simple linear regression, the result is the same as the correlation hypothesis test above. We conclude that the percentage of lower status (lstat) is a significant predictor of the median house price (medv).

The 95% confidence interval for the slope \(\beta_1\) is (-1.026, -0.874). We are 95% confident that the mean median home price decreases by 0.874 to 1.026 thousand dollars per additional percent increase in the low-status share of the population.

The estimated standard deviation of \(\epsilon\) is \(s=6.216\), which implies that most of the observed median home prices will fall within approximately \(2s\approx 12.4\) thousand dollars of their respective predicted values when using the least squares line.

The coefficient of determination is 0.544. This value implies that about 54% of the sample variation in median home price is explained by its linear relationship with the low-status percentage.
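
In fact, for simple linear regression \(r^2\) is just the square of the correlation coefficient, which we can verify directly.

cor(Boston$lstat, Boston$medv)^2
## [1] 0.5441463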

Given the relatively small \(2s\) and the significant linear relationship between \(x\) and \(y\), we might use the model for prediction. However, since \(r^2\) is not very high, we may want to add more predictors to the model later. We can obtain the confidence interval and prediction interval as follows.

plot(Boston$lstat, Boston$medv, pch = "+")
abline(lm.fit, lwd = 3, col = "red")

summary(lm.fit)$r.squared
## [1] 0.5441463
summary(lm.fit)$sigma
## [1] 6.21576
summary(lm.fit)$fstatistic
##    value    numdf    dendf 
## 601.6179   1.0000 504.0000
### Confidence Interval for E(y);
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence")
## $fit
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
## 
## $se.fit
##         1         2         3 
## 0.4052473 0.2948138 0.2908931 
## 
## $df
## [1] 504
## 
## $residual.scale
## [1] 6.21576
CI <- predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence")  ## Save as CI and we will access items one by one below;
CI$fit
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
CI$se.fit
##         1         2         3 
## 0.4052473 0.2948138 0.2908931

In abline(),

  • lwd — change the width of the line;
  • col — change the color of the line.

Residual Analysis

We plot the residuals vs the predictor variable lstat.

res = resid(lm.fit)

plot(Boston$lstat, res)

There is a clear U-shape in the residual plot, and the residuals show a non-constant variance issue: the size of the residuals decreases as the value of lstat increases. This residual plot indicates that a multiplicative model may be appropriate.
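
A multiplicative model corresponds to modeling \(\log(y)\) as a linear function of \(x\); a minimal sketch (not pursued further in this chapter):

fit_log <- lm(log(medv) ~ lstat, data = Boston)
plot(Boston$lstat, resid(fit_log))  # check whether the U-shape and the funnel pattern shrink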

Prediction Interval for y

predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846
# detach(Boston)

# par(mfrow = c(2, 2))
# plot(lm.fit)
# plot(predict(lm.fit), residuals(lm.fit))
# plot(predict(lm.fit), rstudent(lm.fit))
# plot(hatvalues(lm.fit))  # hatvalues() produces leverage statistics, which can be computed for any number of predictors
# which.max(hatvalues(lm.fit))

Chapter 4 Multiple Regression Models

If we fit a model with all predictors…

Now we consider building a model with more than one predictor. We have 13 independent variables to choose from. Should all the predictors be included in the model, or only a few of them? Which ones should be in the model? We will decide by using variable selection procedures, which are discussed in Chapter 8.

fit.full <- lm(medv ~ ., data = Boston)  # the '.' is shorthand that avoids listing all 13 predictors
anova_alt(fit.full)  # anova_alt() is not part of base R; it is assumed to have been loaded earlier

By using variable selection procedures, we can identify the most important predictors. For now, suppose we have chosen 5 predictors: lstat, rm, dis, ptratio, and chas.

Boston_5_pred <- Boston[, c("lstat", "rm", "dis", "ptratio", "chas", "medv")]
attach(Boston_5_pred)
## The following objects are masked from Boston (pos = 3):
## 
##     chas, dis, lstat, medv, ptratio, rm
## The following objects are masked from Boston (pos = 4):
## 
##     chas, dis, lstat, medv, ptratio, rm

First, we model the median home value (medv, \(y\), in $1000s) as a function of the lower status of the population (\(x_1\)), average number of rooms (\(x_2\)), distance to Boston employment centres (\(x_3\)), pupil-teacher ratio (\(x_4\)), and whether the tract bounds the river (\(x_5\)).
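
As a starting point, a first-order model with all five chosen predictors can be fit directly; this is only a sketch, and the individual relationships are examined below.

fit_5 <- lm(medv ~ lstat + rm + dis + ptratio + chas, data = Boston_5_pred)
summary(fit_5)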

The relationship between lstat and medv

In Chapter 3, we fitted a straight-line model to relate medv (\(y\)) and lstat (\(x\)).

plot(Boston$lstat, Boston$medv)

The scatterplot shows a curvilinear relationship, which suggests a curvilinear model might be a better fit. We will fit the curvilinear model \(y=\beta_0+\beta_1 x+\beta_2 x^2+\varepsilon\).

fit_quad <- lm(medv ~ lstat + I(lstat^2))
anova_alt(fit_quad)
summary(fit_quad)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

First, we test the overall significance of this curvilinear model.

\(H_0: \beta_1=\beta_2=0\)

\(H_a:\) At least one of the \(\beta\)’s \(\neq 0\)

Since the p-value is almost 0, which is less than \(\alpha=0.05\), we reject \(H_0\) and conclude the curvilinear model is adequate.

In order to decide whether the curvilinear term lstat\(^2\) should be kept in the model, we can run the \(t\)-test or the partial \(F\)-test. Since only one additional term (lstat\(^2\)) is considered here, the two tests are equivalent; a numerical check of this equivalence appears after the test below.

anova(lm.fit, fit_quad)

\(H_0: y=\beta_0+\beta_1 x\) (Reduced Model)

\(H_a:y=\beta_0+\beta_1 x +\beta_2 x^2\) (Complete Model)

Since the \(p\)-value \(=7.63\times 10^{-28} < \alpha\), we reject \(H_0\) and conclude the complete model is preferred. Therefore, medv and lstat are related in a curvilinear way, with upward curvature (as the estimated \(\beta_2>0\)).
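
As a numerical check on the equivalence of the \(t\)-test and the partial \(F\)-test, the squared \(t\) statistic of the quadratic term should equal the partial \(F\) statistic.

t_val <- summary(fit_quad)$coefficients["I(lstat^2)", "t value"]
t_val^2  # about 135.2
anova(lm.fit, fit_quad)$F[2]  # same value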

Interaction between rm and dis

An analyst hypothesizes that the effect of distance to Boston employment centres on house price will be greater when there are more rooms. To test this hypothesis, we need to run an interaction model for \(E(y)\) as a function of dis (\(x_1\)), rm (\(x_2\)), and the interaction dis × rm (\(x_1 x_2\)).

fit_interaction <- lm(medv ~ rm * dis, data = Boston_5_pred)
anova_alt(fit_interaction)
summary(fit_interaction)
## 
## Call:
## lm(formula = medv ~ rm * dis, data = Boston_5_pred)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.423  -3.276   0.104   2.831  38.061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.2533     4.8953  -3.116  0.00194 ** 
## rm            5.7020     0.7851   7.263 1.45e-12 ***
## dis          -5.7579     1.3500  -4.265 2.39e-05 ***
## rm:dis        0.9855     0.2119   4.652 4.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.415 on 502 degrees of freedom
## Multiple R-squared:  0.5164, Adjusted R-squared:  0.5135 
## F-statistic: 178.7 on 3 and 502 DF,  p-value: < 2.2e-16

\(H_0\): There is no significant interaction between rm and dis.

\(H_a:\) There is a significant interaction between rm and dis.

Since the p-value \(=4.22\times 10^{-6} < \alpha=0.05\), we conclude there is a significant interaction between rm and dis: the number of rooms and the distance to Boston employment centres interact positively on median house price.
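
To see the interaction numerically, the estimated effect of a one-unit increase in dis is \(-5.758 + 0.986\,\texttt{rm}\), which we can evaluate at a few room counts.

b <- coef(fit_interaction)
b["dis"] + b["rm:dis"] * c(5, 6, 7)  # slope of dis at rm = 5, 6, 7

The slope of dis switches from negative to positive once rm exceeds about 5.8 rooms, consistent with the positive interaction.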

Residual Analysis

res = resid(fit_interaction)
pred = predict(fit_interaction)

plot(pred, res)