The dataset was obtained from the StatLib archive. It is the Boston house-price data of Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978, used in Belsley, Kuh & Welsch, ‘Regression Diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
In R, the MASS library contains the Boston data set, which has 506 rows and 14 columns. The dataset records medv (median house value) and 13 predictors for 506 neighborhoods around Boston. The variable descriptions are as follows:
crim —per capita crime rate by town.
zn — proportion of residential land zoned for lots over 25,000 sq.ft.
indus — proportion of non-retail business acres per town.
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox — nitrogen oxides concentration (parts per 10 million).
rm — average number of rooms per dwelling.
age — proportion of owner-occupied units built prior to 1940.
dis — weighted mean of distances to five Boston employment centres.
rad — index of accessibility to radial highways.
tax — full-value property-tax rate per $10,000.
ptratio — pupil-teacher ratio by town.
black — \(1000 (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
lstat — lower status of the population (percent).
medv — median value of owner-occupied homes in $1000s.
First of all, let’s take a look at the variables in the dataset Boston.
library(MASS)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
dim(Boston)
## [1] 506 14
- names() — print all the variable names;
- head() — print the first 6 rows of the dataset;
- dim() — print the dimensions (number of rows, number of columns) of the dataset.
Then we need to know the data type of every variable. Based on the information we have, the variables are classified as follows (variable name — description — variable type):
crim — per capita crime rate by town — Numerical, continuous
zn — proportion of residential land zoned for lots over 25,000 sq.ft. — Numerical, continuous
indus — proportion of non-retail business acres per town. — Numerical, continuous
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). — categorical, nominal
nox — nitrogen oxides concentration (parts per 10 million). — Numerical, continuous
rm — average number of rooms per dwelling. — Numerical, continuous
age — proportion of owner-occupied units built prior to 1940. — Numerical, continuous
dis — weighted mean of distances to five Boston employment centres. — Numerical, continuous
rad — index of accessibility to radial highways; a larger index denotes better accessibility. — categorical, ordinal
tax — full-value property-tax rate per $10,000. — Numerical, continuous
ptratio — pupil-teacher ratio by town. — Numerical, continuous
black — \(1000 (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town. — Numerical, continuous
lstat — lower status of the population (percent). — Numerical, continuous
medv — median value of owner-occupied homes in $1000s. — Numerical, continuous
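We can confirm these storage types directly in R; note that chas and rad are stored as numeric (integer) columns even though we treat them as categorical.
str(Boston)  # compact display of the structure and storage type of every column
sapply(Boston, class)  # class of each column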
Are there any missing values?
sapply(Boston, anyNA)
## crim zn indus chas nox rm age dis rad tax
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## ptratio black lstat medv
## FALSE FALSE FALSE FALSE
From the output, we know there are no missing values in the Boston dataset.
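As an equivalent check, we can count the missing values in each column; all counts should be zero.
colSums(is.na(Boston))  # number of NAs per column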
Next, we find the summary of all 14 variables as follows.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can draw a histogram for a particular variable, say medv.
library(ggplot2)
attach(Boston)
### In base R
hist(medv)
### ggplot
ggplot(Boston, aes(x = medv)) + geom_histogram(binwidth = 5, color = "darkblue",
fill = "lightblue")
We can also draw histograms for all the numerical variables in the data frame Boston at once.
library(Hmisc)
hist.data.frame(Boston)
We can draw a boxplot for each column. For example, ptratio:
### In base R
boxplot(ptratio)
### ggplot
ggplot(Boston, aes(y = ptratio)) + geom_boxplot()
#### boxplot of ptratio by chas groups
ggplot(Boston, aes(x = as.factor(chas), y = ptratio)) + geom_boxplot()
For a categorical variable such as chas, we can draw a bar chart or a pie chart.
attach(Boston)
### In base R
counts <- table(chas)
barplot(counts, main = "Chas Distribution", xlab = "Whether on the banks of the Charles River ( 0 = No; 1 = Yes)")
pie(counts, labels = names(counts), main = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")
### ggplot
ggplot(Boston, aes(x = as.factor(chas))) + geom_bar()
ggplot(Boston, aes(x = rad)) + geom_bar()
ggplot(Boston, aes(x = factor(1), fill = as.factor(chas))) + geom_bar(stat = "count") +
coord_polar("y") + labs(fill = "chas", title = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")
To explore the relationship between two numerical variables, we can draw a scatter plot, for example medv against lstat.
attach(Boston)
### In base R
plot(lstat, medv)
### ggplot
ggplot(Boston, aes(x = lstat, y = medv, color = as.factor(chas))) + geom_point(size = 2)
We can also combine the scatter plot with marginal histograms.
library(ggExtra)
p = ggplot(Boston, aes(x = lstat, y = medv, color = rad)) + geom_point()
ggMarginal(p, type = "histogram")
We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (proportion of the owner-occupied units built prior to 1940), and lstat (percent of households with low socioeconomic status).
Here, we use the correlation matrix to find the independent variable that has the strongest linear correlation with the dependent variable.
options(width = 300)
Boston_num <- Boston[, -c(4, 9)] ## remove the chas and rad columns, which are categorical variables;
cor_matrix <- cor(Boston_num)
round(cor_matrix, 2)
## crim zn indus nox rm age dis tax ptratio black lstat medv
## crim 1.00 -0.20 0.41 0.42 -0.22 0.35 -0.38 0.58 0.29 -0.39 0.46 -0.39
## zn -0.20 1.00 -0.53 -0.52 0.31 -0.57 0.66 -0.31 -0.39 0.18 -0.41 0.36
## indus 0.41 -0.53 1.00 0.76 -0.39 0.64 -0.71 0.72 0.38 -0.36 0.60 -0.48
## nox 0.42 -0.52 0.76 1.00 -0.30 0.73 -0.77 0.67 0.19 -0.38 0.59 -0.43
## rm -0.22 0.31 -0.39 -0.30 1.00 -0.24 0.21 -0.29 -0.36 0.13 -0.61 0.70
## age 0.35 -0.57 0.64 0.73 -0.24 1.00 -0.75 0.51 0.26 -0.27 0.60 -0.38
## dis -0.38 0.66 -0.71 -0.77 0.21 -0.75 1.00 -0.53 -0.23 0.29 -0.50 0.25
## tax 0.58 -0.31 0.72 0.67 -0.29 0.51 -0.53 1.00 0.46 -0.44 0.54 -0.47
## ptratio 0.29 -0.39 0.38 0.19 -0.36 0.26 -0.23 0.46 1.00 -0.18 0.37 -0.51
## black -0.39 0.18 -0.36 -0.38 0.13 -0.27 0.29 -0.44 -0.18 1.00 -0.37 0.33
## lstat 0.46 -0.41 0.60 0.59 -0.61 0.60 -0.50 0.54 0.37 -0.37 1.00 -0.74
## medv -0.39 0.36 -0.48 -0.43 0.70 -0.38 0.25 -0.47 -0.51 0.33 -0.74 1.00
### Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, hc.order = TRUE)
The last column of the correlation matrix shows the correlation coefficient \(r\) between each predictor and the response variable (medv). We find that lstat and medv have the strongest linear correlation.
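We can also extract and rank these correlations programmatically; a small base R sketch:
cors <- cor_matrix[, "medv"]  # correlation of every variable with medv
sort(abs(cors[names(cors) != "medv"]), decreasing = TRUE)  # rank the predictors by |r|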
For the next step, we will explore the linear relationship between the two variables. That is, \(y\)=medv, \(x\)=lstat.
plot(Boston$lstat, Boston$medv)
The scatterplot shows a clear negative relationship: as the percentage of lower-status population (lstat) increases, the median home value decreases, which fits common sense. The plot also shows some curvature, suggesting that a curvilinear model might be a better fit. In this chapter, we first fit the straight-line model.
cor(Boston$lstat, Boston$medv)
## [1] -0.7376627
cor.test(Boston$lstat, Boston$medv)
##
## Pearson's product-moment correlation
##
## data: Boston$lstat and Boston$medv
## t = -24.528, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7749982 -0.6951959
## sample estimates:
## cor
## -0.7376627
pv <- cor.test(Boston$lstat, Boston$medv)$p.value
We perform the hypothesis test for linear correlation: \(H_0: \rho=0\) vs \(H_a: \rho\neq0\). Since the p-value = \(5.08\times 10^{-88}\) < \(\alpha=0.05\), we reject \(H_0\). Therefore, median home price and low-status percentage are significantly linearly related.
lm.fit = lm(medv ~ lstat, data = Boston)  # specify the data frame explicitly
attach(Boston)
lm.fit = lm(medv ~ lstat)  # the equivalent fit using the attached Boston data
lm.fit
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
The estimated regression equation from the least squares method is \(\hat{y} = 34.554 - 0.950x\). Before we use the fitted model for point or interval estimation, we need to assess model adequacy.
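As a check on the fitted coefficients, we can compute the least squares estimates directly from the formulas \(\hat\beta_1 = S_{xy}/S_{xx}\) and \(\hat\beta_0 = \bar{y}-\hat\beta_1\bar{x}\); the results should match coef(lm.fit) above.
x <- Boston$lstat
y <- Boston$medv
b1 <- cov(x, y) / var(x)  # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)  # intercept: ybar - b1 * xbar
c(b0, b1)  # should agree with coef(lm.fit)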
anova(lm.fit)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
names(summary(lm.fit))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(lm.fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236
## lstat -0.9500494 0.03873342 -24.52790 5.081103e-88
confint(lm.fit, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# Combine to display the coef and confidence interval for the parameters
# together;
coef <- summary(lm.fit)$coefficients
coef_CI <- confint(lm.fit, level = 0.95)
cbind(coef, coef_CI)
## Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
## (Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236 33.448457 35.6592247
## lstat -0.9500494 0.03873342 -24.52790 5.081103e-88 -1.026148 -0.8739505
Now, we test \(H_0:\beta_1=0\) vs \(H_a: \beta_1\neq 0\). For simple linear regression, the result is the same as the correlation hypothesis test above. We conclude that the percentage of lower status (lstat) is a significant predictor of the median house price (medv).
The 95% confidence interval for the slope \(\beta_1\) is (-1.026, -0.874). We are 95% confident that the mean median home price decreases by between 0.874 and 1.026 thousand dollars for each additional one-percent increase in the low-status population.
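As a sanity check, this interval can be reproduced by hand from \(\hat\beta_1 \pm t_{0.025,\,504}\,\mathrm{SE}(\hat\beta_1)\):
b1 <- coef(lm.fit)["lstat"]
se_b1 <- summary(lm.fit)$coefficients["lstat", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm.fit))  # t critical value with 504 df
b1 + c(-1, 1) * t_crit * se_b1  # matches confint(lm.fit) above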
The estimated standard deviation of \(\epsilon\) is \(s = 6.216\), which implies that most observed median home prices will fall within approximately \(2s \approx 12.4\) thousand dollars of their predicted values when using the least squares line.
The coefficient of determination is 0.544. This implies that about 54% of the sample variation in median home price is explained by the linear relationship with the low-status percentage (lstat).
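For simple linear regression, \(r^2\) is simply the squared correlation coefficient, which we can verify directly:
cor(Boston$lstat, Boston$medv)^2  # equals summary(lm.fit)$r.squared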
With a relatively small \(2s\) and a significant linear relationship between \(x\) and \(y\), we can reasonably make predictions. However, since \(r^2\) is not very high, we may want to add more predictors to the model. We can obtain confidence intervals and prediction intervals as follows.
plot(Boston$lstat, Boston$medv, pch = "+")
abline(lm.fit, lwd = 3, col = "red")
summary(lm.fit)$r.squared
## [1] 0.5441463
summary(lm.fit)$sigma
## [1] 6.21576
summary(lm.fit)$fstatistic
## value numdf dendf
## 601.6179 1.0000 504.0000
### Confidence Interval for E(y);
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence")
## $fit
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
##
## $se.fit
## 1 2 3
## 0.4052473 0.2948138 0.2908931
##
## $df
## [1] 504
##
## $residual.scale
## [1] 6.21576
CI <- predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence") ## Save as CI and we will access items one by one below;
CI$fit
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
CI$se.fit
## 1 2 3
## 0.4052473 0.2948138 0.2908931
In abline():
- lwd — set the line width;
- col — set the line color.
We plot the residuals against the predictor variable lstat.
res = resid(lm.fit)
plot(Boston$lstat, res)
There is a clear U-shape in the residual plot, and the spread of the residuals decreases as lstat increases, indicating non-constant variance. This residual plot suggests that a multiplicative model may be more appropriate.
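As a minimal sketch of that idea (an illustration added here, not the chapter's prescribed remedy), we could fit a log-transformed response and re-examine the residuals.
fit_log <- lm(log(medv) ~ lstat, data = Boston)  # multiplicative (log-linear) model
plot(Boston$lstat, resid(fit_log))  # check whether the curvature and spread improve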
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
# detach(Boston)
# par(mfrow = c(2, 2))
# plot(lm.fit)
# plot(predict(lm.fit), residuals(lm.fit))
# plot(predict(lm.fit), rstudent(lm.fit))
# plot(hatvalues(lm.fit))  # hatvalues() produces leverage statistics,
#                          # which can be computed for any number of predictors
# which.max(hatvalues(lm.fit))
Now we consider building a model with more than one predictor. We have 13 independent variables to choose from. Should all the predictors be included in the model, or only a few of them? Which ones should be in the model? We will decide by using variable selection procedures, which are discussed in Chapter 8.
fit.full <- lm(medv ~ ., data = Boston)  # '.' is shorthand for all the other variables, avoiding listing all 13 predictors
anova_alt(fit.full)
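Although variable selection procedures are deferred to Chapter 8, a quick preview is possible with base R's step(), which performs stepwise elimination by AIC. This is only a sketch of one such procedure, not the method used below.
fit.step <- step(fit.full, direction = "backward", trace = 0)  # backward elimination by AIC
formula(fit.step)  # the predictors retained by this particular procedure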
By using variable selection procedures, we can identify the most important predictors. For now, suppose we choose five predictors: lstat, rm, dis, ptratio, and chas.
Boston_5_pred <- Boston[, c("lstat", "rm", "dis", "ptratio", "chas", "medv")]
attach(Boston_5_pred)
## The following objects are masked from Boston (pos = 3):
##
## chas, dis, lstat, medv, ptratio, rm
## The following objects are masked from Boston (pos = 4):
##
## chas, dis, lstat, medv, ptratio, rm
First, we model the median house value (medv, \(y\), in $1000s) as a function of the lower status of the population (\(x_1\)), average number of rooms (\(x_2\)), distance to Boston employment centres (\(x_3\)), pupil-teacher ratio (\(x_4\)), and whether the tract bounds the river (\(x_5\)).
In Chapter 3, we fitted a straight-line model to relate medv (\(y\)) and lstat (\(x\)).
plot(Boston$lstat, Boston$medv)
The scatterplot shows a curvilinear relationship, which suggests a curvilinear model might be a better fit. We will fit the curvilinear model \(y=\beta_0+\beta_1 x+\beta_2 x^2+\varepsilon\).
fit_quad <- lm(medv ~ lstat + I(lstat^2))
anova_alt(fit_quad)
summary(fit_quad)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
First, we test the overall significance of this curvilinear model.
\(H_0: \beta_1=\beta_2=0\)
\(H_a:\) At least one of the \(\beta\)’s \(\neq 0\)
Since the p-value is almost 0, which is less than \(\alpha=0.05\), we reject \(H_0\) and conclude the curvilinear model is adequate.
To decide whether the curvilinear term lstat\(^2\) should be kept in the model, we can run a \(t\)-test or a partial \(F\)-test. Since only one additional term (lstat\(^2\)) is considered here, the two tests are equivalent, as we verify numerically below.
anova(lm.fit, fit_quad)
\(H_0: y=\beta_0+\beta_1 x\) (Reduced Model)
\(H_a:y=\beta_0+\beta_1 x +\beta_2 x^2\) (Complete Model)
Since the \(p\)-value = 7.63e-28 \(<\alpha\), we reject \(H_0\) and conclude the complete model is preferred. Therefore, medv and lstat are related in a curvilinear way, with upward curvature (since \(\hat\beta_2>0\)).
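We can check the claimed equivalence of the \(t\)-test and the partial \(F\)-test numerically: the squared \(t\) statistic for the lstat\(^2\) coefficient equals the partial \(F\) statistic.
t_val <- summary(fit_quad)$coefficients["I(lstat^2)", "t value"]
t_val^2  # roughly 135.2
anova(lm.fit, fit_quad)$F[2]  # the partial F statistic; the same value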
An analyst hypothesizes that the effect of distance to Boston employment centres on house price will be greater when the house has more rooms. To test this hypothesis, we will run an interaction model for \(E(y)\) as a function of dis (\(x_1\)), rm (\(x_2\)), and their interaction dis\(\times\)rm (\(x_1 x_2\)).
fit_interaction <- lm(medv ~ rm * dis, data = Boston_5_pred)
anova_alt(fit_interaction)
summary(fit_interaction)
##
## Call:
## lm(formula = medv ~ rm * dis, data = Boston_5_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.423 -3.276 0.104 2.831 38.061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.2533 4.8953 -3.116 0.00194 **
## rm 5.7020 0.7851 7.263 1.45e-12 ***
## dis -5.7579 1.3500 -4.265 2.39e-05 ***
## rm:dis 0.9855 0.2119 4.652 4.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.415 on 502 degrees of freedom
## Multiple R-squared: 0.5164, Adjusted R-squared: 0.5135
## F-statistic: 178.7 on 3 and 502 DF, p-value: < 2.2e-16
\(H_0\): There is no significant interaction between rm and dis.
\(H_a:\) There is a significant interaction between rm and dis.
Since the p-value = \(4.22\times 10^{-6}\) \(<\alpha=.05\), we conclude there is a significant interaction between rm and dis: the distance to Boston employment centres and the number of rooms interact positively on median house price.
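To see what the positive interaction means concretely, we can compute the fitted slope of dis at a few representative room counts, since \(\partial E(y)/\partial x_1 = \hat\beta_1 + \hat\beta_3 x_2 = -5.7579 + 0.9855\,x_2\) (a small added illustration):
b <- coef(fit_interaction)
for (r in c(5, 6, 7)) cat("rm =", r, ": slope of dis =", round(b["dis"] + b["rm:dis"] * r, 3), "\n")
# the effect of dis on medv is negative for small rm and positive for larger rm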
res = resid(fit_interaction)
pred = predict(fit_interaction)
plot(pred, res)