The dataset was obtained from the StatLib archive. It is the Boston house-price data of Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978, used in Belsley, Kuh & Welsch, ‘Regression Diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
In R, the MASS library contains the Boston data set, which has 506 rows and 14 columns. The dataset records medv (median house value) and 13 predictors for 506 neighborhoods around Boston. The variable descriptions are as follows:
crim —per capita crime rate by town.
zn — proportion of residential land zoned for lots over 25,000 sq.ft.
indus — proportion of non-retail business acres per town.
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox — nitrogen oxides concentration (parts per 10 million).
rm — average number of rooms per dwelling.
age — proportion of owner-occupied units built prior to 1940.
dis — weighted mean of distances to five Boston employment centres.
rad — index of accessibility to radial highways.
tax — full-value property-tax rate per $10,000.
ptratio — pupil-teacher ratio by town.
black — \(1000 (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
lstat — lower status of the population (percent).
medv — median value of owner-occupied homes in $1000s.
First of all, let’s take a look at the variables in the dataset Boston.
library(MASS)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
dim(Boston)
## [1] 506 14
- names() — print all the variable names;
- head() — print the first 6 rows of the dataset;
- dim() — print the dimensions (number of rows, number of columns) of the dataset.
Then we need to know the data type of every variable. Based on the information we have, the variables are classified as follows (variable name — description — variable type):
crim — per capita crime rate by town — Numerical, continuous
zn — proportion of residential land zoned for lots over 25,000 sq.ft. — Numerical, continuous
indus — proportion of non-retail business acres per town. — Numerical, continuous
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). — categorical, nominal
nox — nitrogen oxides concentration (parts per 10 million). — Numerical, continuous
rm — average number of rooms per dwelling. — Numerical, continuous
age — proportion of owner-occupied units built prior to 1940. — Numerical, continuous
dis — weighted mean of distances to five Boston employment centres. — Numerical, continuous
rad — index of accessibility to radial highways; a larger index denotes better accessibility. — categorical, ordinal
tax — full-value property-tax rate per $10,000. — Numerical, continuous
ptratio — pupil-teacher ratio by town. — Numerical, continuous
black — \(1000 (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town. — Numerical, continuous
lstat — lower status of the population (percent). — Numerical, continuous
medv — median value of owner-occupied homes in $1000s. — Numerical, continuous
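We can confirm these storage types directly in R; note that chas and rad are stored as numeric (integer) columns even though we treat them as categorical.
str(Boston)  # compact display of the structure and storage type of every column
sapply(Boston, class)  # class of each column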
Are there any missing values?
sapply(Boston, anyNA)
## crim zn indus chas nox rm age dis rad tax
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## ptratio black lstat medv
## FALSE FALSE FALSE FALSE
From the output, we know there are no missing values in the Boston dataset.
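As an equivalent check, we can count the missing values in each column; all counts should be zero.
colSums(is.na(Boston))  # number of NAs per column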
Next, we find the summary of all 14 variables as follows.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can draw a histogram for a particular variable, say medv.
library(ggplot2)
attach(Boston)
### In base R
hist(medv)
### ggplot
ggplot(Boston, aes(x = medv)) + geom_histogram(binwidth = 5, color = "darkblue",
fill = "lightblue")
We can also draw histograms for all the numerical variables in the data frame Boston at once.
library(Hmisc)
hist.data.frame(Boston)
We can draw a boxplot for each column. For example, ptratio:
### In base R
boxplot(ptratio)
### ggplot
ggplot(Boston, aes(y = ptratio)) + geom_boxplot()
#### boxplot of ptratio by chas groups
ggplot(Boston, aes(x = as.factor(chas), y = ptratio)) + geom_boxplot()
For a categorical variable such as chas, we can draw a bar chart or a pie chart.
attach(Boston)
### In base R
counts <- table(chas)
barplot(counts, main = "Chas Distribution", xlab = "Whether on the banks of the Charles River ( 0 = No; 1 = Yes)")
pie(counts, labels = names(counts), main = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")
### ggplot
ggplot(Boston, aes(x = as.factor(chas))) + geom_bar()
ggplot(Boston, aes(x = rad)) + geom_bar()
ggplot(Boston, aes(x = factor(1), fill = as.factor(chas))) + geom_bar(stat = "count") +
coord_polar("y") + labs(fill = "chas", title = "If house on the banks of the Charles River ( 0 = No; 1 = Yes)")
To explore the relationship between two numerical variables, we can draw a scatter plot, for example medv against lstat.
attach(Boston)
### In base R
plot(lstat, medv)
### ggplot
ggplot(Boston, aes(x = lstat, y = medv, color = as.factor(chas))) + geom_point(size = 2)
We can also combine the scatter plot with marginal histograms.
library(ggExtra)
p = ggplot(Boston, aes(x = lstat, y = medv, color = rad)) + geom_point()
ggMarginal(p, type = "histogram")
We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (proportion of the owner-occupied units built prior to 1940), and lstat (percent of households with low socioeconomic status).
Here, we use the correlation matrix to find the independent variable that has the strongest linear correlation with the dependent variable.
options(width = 300)
Boston_num <- Boston[, -c(4, 9)] ## remove the chas and rad columns, which are categorical variables;
cor_matrix <- cor(Boston_num)
round(cor_matrix, 2)
## crim zn indus nox rm age dis tax ptratio black lstat medv
## crim 1.00 -0.20 0.41 0.42 -0.22 0.35 -0.38 0.58 0.29 -0.39 0.46 -0.39
## zn -0.20 1.00 -0.53 -0.52 0.31 -0.57 0.66 -0.31 -0.39 0.18 -0.41 0.36
## indus 0.41 -0.53 1.00 0.76 -0.39 0.64 -0.71 0.72 0.38 -0.36 0.60 -0.48
## nox 0.42 -0.52 0.76 1.00 -0.30 0.73 -0.77 0.67 0.19 -0.38 0.59 -0.43
## rm -0.22 0.31 -0.39 -0.30 1.00 -0.24 0.21 -0.29 -0.36 0.13 -0.61 0.70
## age 0.35 -0.57 0.64 0.73 -0.24 1.00 -0.75 0.51 0.26 -0.27 0.60 -0.38
## dis -0.38 0.66 -0.71 -0.77 0.21 -0.75 1.00 -0.53 -0.23 0.29 -0.50 0.25
## tax 0.58 -0.31 0.72 0.67 -0.29 0.51 -0.53 1.00 0.46 -0.44 0.54 -0.47
## ptratio 0.29 -0.39 0.38 0.19 -0.36 0.26 -0.23 0.46 1.00 -0.18 0.37 -0.51
## black -0.39 0.18 -0.36 -0.38 0.13 -0.27 0.29 -0.44 -0.18 1.00 -0.37 0.33
## lstat 0.46 -0.41 0.60 0.59 -0.61 0.60 -0.50 0.54 0.37 -0.37 1.00 -0.74
## medv -0.39 0.36 -0.48 -0.43 0.70 -0.38 0.25 -0.47 -0.51 0.33 -0.74 1.00
### Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, hc.order = TRUE)
The last column of the correlation matrix shows the correlation coefficient \(r\) between each predictor and the response variable (medv). We find that lstat and medv have the strongest linear correlation.
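We can also extract and rank these correlations programmatically; a small base R sketch:
cors <- cor_matrix[, "medv"]  # correlation of every variable with medv
sort(abs(cors[names(cors) != "medv"]), decreasing = TRUE)  # rank the predictors by |r|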
For the next step, we will explore the linear relationship between the two variables. That is, \(y\)=medv, \(x\)=lstat.
plot(Boston$lstat, Boston$medv)
The scatterplot shows a clear negative relationship: as the percentage of lower-status population (lstat) increases, the median home value decreases, which fits common sense. The plot also shows some curvature, suggesting that a curvilinear model might be a better fit. In this chapter, we first fit the straight-line model.
cor(Boston$lstat, Boston$medv)
## [1] -0.7376627
cor.test(Boston$lstat, Boston$medv)
##
## Pearson's product-moment correlation
##
## data: Boston$lstat and Boston$medv
## t = -24.528, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7749982 -0.6951959
## sample estimates:
## cor
## -0.7376627
pv <- cor.test(Boston$lstat, Boston$medv)$p.value
We perform the hypothesis test for linear correlation: \(H_0: \rho=0\) vs \(H_a: \rho\neq0\). Since the p-value = \(5.08\times 10^{-88}\) < \(\alpha=0.05\), we reject \(H_0\). Therefore, median home price and low-status percentage are significantly linearly related.
lm.fit = lm(medv ~ lstat, data = Boston)  # specify the data frame explicitly
attach(Boston)
lm.fit = lm(medv ~ lstat)  # the equivalent fit using the attached Boston data
lm.fit
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
The estimated regression equation from the least squares method is \(\hat{y} = 34.554 - 0.950x\). Before we use the fitted model for point or interval estimation, we need to assess model adequacy.
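As a check on the fitted coefficients, we can compute the least squares estimates directly from the formulas \(\hat\beta_1 = S_{xy}/S_{xx}\) and \(\hat\beta_0 = \bar{y}-\hat\beta_1\bar{x}\); the results should match coef(lm.fit) above.
x <- Boston$lstat
y <- Boston$medv
b1 <- cov(x, y) / var(x)  # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)  # intercept: ybar - b1 * xbar
c(b0, b1)  # should agree with coef(lm.fit)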
anova(lm.fit)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
names(summary(lm.fit))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(lm.fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236
## lstat -0.9500494 0.03873342 -24.52790 5.081103e-88
confint(lm.fit, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# Combine to display the coef and confidence interval for the parameters
# together;
coef <- summary(lm.fit)$coefficients
coef_CI <- confint(lm.fit, level = 0.95)
cbind(coef, coef_CI)
## Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
## (Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236 33.448457 35.6592247
## lstat -0.9500494 0.03873342 -24.52790 5.081103e-88 -1.026148 -0.8739505
Now, we test \(H_0:\beta_1=0\) vs \(H_a: \beta_1\neq 0\). For simple linear regression, the result is the same as the correlation hypothesis test above. We conclude that the percentage of lower status (lstat) is a significant predictor of the median house price (medv).
The 95% confidence interval for the slope \(\beta_1\) is (-1.026, -0.874). We are 95% confident that the mean median home price decreases by between 0.874 and 1.026 thousand dollars for each additional one-percent increase in the low-status population.
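As a sanity check, this interval can be reproduced by hand from \(\hat\beta_1 \pm t_{0.025,\,504}\,\mathrm{SE}(\hat\beta_1)\):
b1 <- coef(lm.fit)["lstat"]
se_b1 <- summary(lm.fit)$coefficients["lstat", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm.fit))  # t critical value with 504 df
b1 + c(-1, 1) * t_crit * se_b1  # matches confint(lm.fit) above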
The estimated standard deviation of \(\epsilon\) is \(s = 6.216\), which implies that most observed median home prices will fall within approximately \(2s \approx 12.4\) thousand dollars of their predicted values when using the least squares line.
The coefficient of determination is 0.544. This implies that about 54% of the sample variation in median home price is explained by the linear relationship with the low-status percentage (lstat).
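For simple linear regression, \(r^2\) is simply the squared correlation coefficient, which we can verify directly:
cor(Boston$lstat, Boston$medv)^2  # equals summary(lm.fit)$r.squared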
With a relatively small \(2s\) and a significant linear relationship between \(x\) and \(y\), we can reasonably make predictions. However, since \(r^2\) is not very high, we may want to add more predictors to the model. We can obtain confidence intervals and prediction intervals as follows.
plot(Boston$lstat, Boston$medv, pch = "+")
abline(lm.fit, lwd = 3, col = "red")
summary(lm.fit)$r.squared
## [1] 0.5441463
summary(lm.fit)$sigma
## [1] 6.21576
summary(lm.fit)$fstatistic
## value numdf dendf
## 601.6179 1.0000 504.0000
### Confidence Interval for E(y);
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence")
## $fit
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
##
## $se.fit
## 1 2 3
## 0.4052473 0.2948138 0.2908931
##
## $df
## [1] 504
##
## $residual.scale
## [1] 6.21576
CI <- predict(lm.fit, data.frame(lstat = c(5, 10, 15)), se.fit = TRUE, interval = "confidence") ## Save as CI and we will access items one by one below;
CI$fit
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
CI$se.fit
## 1 2 3
## 0.4052473 0.2948138 0.2908931
In abline():
- lwd — set the line width;
- col — set the line color.
We plot the residuals against the predictor variable lstat.
res = resid(lm.fit)
plot(Boston$lstat, res)
There is a clear U-shape in the residual plot, and the spread of the residuals decreases as lstat increases, indicating non-constant variance. This residual plot suggests that a multiplicative model may be more appropriate.
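As a minimal sketch of that idea (an illustration added here, not the chapter's prescribed remedy), we could fit a log-transformed response and re-examine the residuals.
fit_log <- lm(log(medv) ~ lstat, data = Boston)  # multiplicative (log-linear) model
plot(Boston$lstat, resid(fit_log))  # check whether the curvature and spread improve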
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
# detach(Boston)
# par(mfrow = c(2, 2))
# plot(lm.fit)
# plot(predict(lm.fit), residuals(lm.fit))
# plot(predict(lm.fit), rstudent(lm.fit))
# plot(hatvalues(lm.fit))  # hatvalues() produces leverage statistics,
#                          # which can be computed for any number of predictors
# which.max(hatvalues(lm.fit))
Now we consider building a model with more than one predictor. We have 13 independent variables to choose from. Should all the predictors be included in the model, or only a few of them? Which ones should be in the model? We will decide by using variable selection procedures, which are discussed in Chapter 8.
fit.full <- lm(medv ~ ., data = Boston)  # '.' is shorthand for all the other variables, avoiding listing all 13 predictors
anova_alt(fit.full)
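Although variable selection procedures are deferred to Chapter 8, a quick preview is possible with base R's step(), which performs stepwise elimination by AIC. This is only a sketch of one such procedure, not the method used below.
fit.step <- step(fit.full, direction = "backward", trace = 0)  # backward elimination by AIC
formula(fit.step)  # the predictors retained by this particular procedure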
By using variable selection procedures, we can identify the most important predictors. For now, suppose we choose five predictors: lstat, rm, dis, ptratio, and chas.
Boston_5_pred <- Boston[, c("lstat", "rm", "dis", "ptratio", "chas", "medv")]
attach(Boston_5_pred)
## The following objects are masked from Boston (pos = 3):
##
## chas, dis, lstat, medv, ptratio, rm
## The following objects are masked from Boston (pos = 4):
##
## chas, dis, lstat, medv, ptratio, rm
First, we model the median house value (medv, \(y\), in $1000s) as a function of the lower status of the population (\(x_1\)), average number of rooms (\(x_2\)), distance to Boston employment centres (\(x_3\)), pupil-teacher ratio (\(x_4\)), and whether the tract bounds the river (\(x_5\)).
In Chapter 3, we fitted a straight-line model to relate medv (\(y\)) and lstat (\(x\)).
plot(Boston$lstat, Boston$medv)
The scatterplot shows a curvilinear relationship, which suggests a curvilinear model might be a better fit. We will fit the curvilinear model \(y=\beta_0+\beta_1 x+\beta_2 x^2+\varepsilon\).
fit_quad <- lm(medv ~ lstat + I(lstat^2))
anova_alt(fit_quad)
summary(fit_quad)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
First, we test the overall significance of this curvilinear model.
\(H_0: \beta_1=\beta_2=0\)
\(H_a:\) At least one of the \(\beta\)’s \(\neq 0\)
Since the p-value is almost 0, which is less than \(\alpha=0.05\), we reject \(H_0\) and conclude the curvilinear model is adequate.
To decide whether the curvilinear term lstat\(^2\) should be kept in the model, we can run a \(t\)-test or a partial \(F\)-test. Since only one additional term (lstat\(^2\)) is considered here, the two tests are equivalent, as we verify numerically below.
anova(lm.fit, fit_quad)
\(H_0: y=\beta_0+\beta_1 x\) (Reduced Model)
\(H_a:y=\beta_0+\beta_1 x +\beta_2 x^2\) (Complete Model)
Since the \(p\)-value = 7.63e-28 \(<\alpha\), we reject \(H_0\) and conclude the complete model is preferred. Therefore, medv and lstat are related in a curvilinear way, with upward curvature (since \(\hat\beta_2>0\)).
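We can check the claimed equivalence of the \(t\)-test and the partial \(F\)-test numerically: the squared \(t\) statistic for the lstat\(^2\) coefficient equals the partial \(F\) statistic.
t_val <- summary(fit_quad)$coefficients["I(lstat^2)", "t value"]
t_val^2  # roughly 135.2
anova(lm.fit, fit_quad)$F[2]  # the partial F statistic; the same value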
An analyst hypothesizes that the effect of distance to Boston employment centres on house price will be greater when the house has more rooms. To test this hypothesis, we will run an interaction model for \(E(y)\) as a function of dis (\(x_1\)), rm (\(x_2\)), and their interaction dis\(\times\)rm (\(x_1 x_2\)).
fit_interaction <- lm(medv ~ rm * dis, data = Boston_5_pred)
anova_alt(fit_interaction)
summary(fit_interaction)
##
## Call:
## lm(formula = medv ~ rm * dis, data = Boston_5_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.423 -3.276 0.104 2.831 38.061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.2533 4.8953 -3.116 0.00194 **
## rm 5.7020 0.7851 7.263 1.45e-12 ***
## dis -5.7579 1.3500 -4.265 2.39e-05 ***
## rm:dis 0.9855 0.2119 4.652 4.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.415 on 502 degrees of freedom
## Multiple R-squared: 0.5164, Adjusted R-squared: 0.5135
## F-statistic: 178.7 on 3 and 502 DF, p-value: < 2.2e-16
\(H_0\): There is no significant interaction between rm and dis.
\(H_a:\) There is a significant interaction between rm and dis.
Since the p-value = \(4.22\times 10^{-6}\) \(<\alpha=.05\), we conclude there is a significant interaction between rm and dis: the distance to Boston employment centres and the number of rooms interact positively on median house price.
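To see what the positive interaction means concretely, we can compute the fitted slope of dis at a few representative room counts, since \(\partial E(y)/\partial x_1 = \hat\beta_1 + \hat\beta_3 x_2 = -5.7579 + 0.9855\,x_2\) (a small added illustration):
b <- coef(fit_interaction)
for (r in c(5, 6, 7)) cat("rm =", r, ": slope of dis =", round(b["dis"] + b["rm:dis"] * r, 3), "\n")
# the effect of dis on medv is negative for small rm and positive for larger rm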
res = resid(fit_interaction)
pred = predict(fit_interaction)
plot(pred, res)