Analysis of Boston housing data using linear regression ,trees and GAM

Introduction

Boston housing data is a data set in package MASS. The data set has 506 rows and 14 columns.

This report provides an analysis and evaluation of the factors affecting the median value of the owner occupied homes in the suburbs of Boston. The in-built data set of Boston Housing Data in package MASS is used for this analysis and various factors about the structural quality, neighbourhood, accessibility and air pollution such as per capita crime rate by town, proportion of non-retail business acres per town, index of accessibility to radial highways etc are taken into account for this study.

Approach Methods of analysis includes the following * Summary statistics of the variables and finding correlation between variables, * Exploratory data analysis using visualization * Random sampling of data set into 80/20 training and testing data set * Fitting various models such as generalized linear regression, regression tree, generalized additive model (GAM) and neural network * Finally, comparing the models based on in-sample (MSPE) and out-of-sample prediction errors (MSPE).

Packages required

Packages required Loading required packages:

library(MASS)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(corrplot)
library(leaps)
library(rpart)
library(mgcv)
library(glmnet)
library(boot)
library(rpart.plot)

Exploratory data analysis

#loading data
data(Boston)
dim(Boston)

## [1] 506  14

#a look at first few rows
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

#a look at structure of the data set
glimpse(Boston)

## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

#summary statistics
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

#Check for missing values
sum(is.na(Boston))

## [1] 0

#Check for duplicated values
sum(duplicated(Boston))

## [1] 0

#checking correlation between variables
corrplot(cor(Boston), method = "number", type = "upper", diag = FALSE)

There are no missing or duplicate values.

From correlation matrix, some of the observations made are as follows:

Median value of owner-occupied homes (in 1000$) increases as average number of rooms per dwelling increases and it decreases if percent of lower status population in the area increases
nox or nitrogen oxides concentration (ppm) increases with increase in proportion of non-retail business acres per town and proportion of owner-occupied units built prior to 1940.
rad and tax have a strong positive correlation of 0.91 which implies that as accessibility of radial highways increases, the full value property-tax rate per $10,000 also increases.
crim is strongly associated with variables rad and tax which implies as accessibility to radial highways increases, per capita crime rate increases.
indus has strong positive correlation with nox, which supports the notion that nitrogen oxides concentration is high in industrial areas.

Data Visualization

Boston %>%
  gather(key, val, -medv) %>%
  ggplot(aes(x = val, y = medv)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE, col = "blue") +
  facet_wrap(~key, scales = "free") +
  theme_gray() +
  ggtitle("Scatter plot of dependent variables vs Median Value (medv)")

table(Boston$chas)

## 
##   0   1 
## 471  35

Observations * Proportion of owner occupied units built prior to 1940 (age) and proportion of blacks by town (black) is heavily skewed to left, while per capita crime rate in town (crim) and weighted mean of distances to five Boston employment centres (dis) is heavily skewed to right. * rm is normally distributed with mean of approximately 6. * Most of the properties are situated close to the five Boston employment centres (dis skewed to right) * There is a high proportion of owner occupied units built prior to 1940 (age skewed to left) and blacks in town (black skewed to right) * From scatter plots, it is seen that lstat and rm show strong correlation with medv. * 93% of the properties are away from Charles river. The properties bordering the river seems to have higher median prices.

Split data set into 80:20 train and test data

set.seed(12383010)
index <- sample(nrow(Boston), nrow(Boston) * 0.80)
Boston.train <- Boston[index, ]
Boston.test <- Boston[-index, ]

Building linear regression model

model1 <- lm(medv ~ ., data = Boston.train)
model1.sum <- summary(model1)
model1.sum

## 
## Call:
## lm(formula = medv ~ ., data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9568  -2.5728  -0.5662   1.6289  24.5479 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.637283   5.859638   6.935 1.69e-11 ***
## crim         -0.121515   0.037991  -3.199 0.001494 ** 
## zn            0.042038   0.015510   2.710 0.007018 ** 
## indus        -0.026726   0.067262  -0.397 0.691327    
## chas          3.207705   0.956704   3.353 0.000878 ***
## nox         -17.301829   4.303363  -4.021 6.97e-05 ***
## rm            3.393375   0.480399   7.064 7.50e-12 ***
## age           0.008526   0.014667   0.581 0.561355    
## dis          -1.509803   0.220647  -6.843 3.02e-11 ***
## rad           0.359537   0.075594   4.756 2.78e-06 ***
## tax          -0.014072   0.004205  -3.346 0.000899 ***
## ptratio      -0.959623   0.141643  -6.775 4.60e-11 ***
## black         0.008467   0.003143   2.694 0.007357 ** 
## lstat        -0.597214   0.058622 -10.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.739 on 390 degrees of freedom
## Multiple R-squared:  0.7502, Adjusted R-squared:  0.7419 
## F-statistic:  90.1 on 13 and 390 DF,  p-value: < 2.2e-16

#Looking at model summary, we see that variables indus and age are insignificant
#Building model without variables indus and age
model2 <- lm(medv ~ . -indus -age, data = Boston.train)
model2.sum <- summary(model2)
model2.sum

## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1161  -2.5762  -0.6103   1.5993  24.8802 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.372746   5.822013   6.934 1.69e-11 ***
## crim         -0.121738   0.037887  -3.213 0.001421 ** 
## zn            0.041536   0.015232   2.727 0.006681 ** 
## chas          3.200015   0.951031   3.365 0.000842 ***
## nox         -17.097070   3.982669  -4.293 2.23e-05 ***
## rm            3.478815   0.464366   7.492 4.56e-13 ***
## dis          -1.529129   0.204693  -7.470 5.25e-13 ***
## rad           0.363258   0.073345   4.953 1.09e-06 ***
## tax          -0.014658   0.003870  -3.788 0.000176 ***
## ptratio      -0.961374   0.140294  -6.853 2.82e-11 ***
## black         0.008655   0.003124   2.771 0.005859 ** 
## lstat        -0.584915   0.053401 -10.953  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.73 on 392 degrees of freedom
## Multiple R-squared:  0.7499, Adjusted R-squared:  0.7429 
## F-statistic: 106.8 on 11 and 392 DF,  p-value: < 2.2e-16

Summary of model 2 shows that all variables are significant (based on p-value).

Variable Selection

Best subset (13 variable) and stepwise (forward, backward, both) techniques of variable selection were used to come up with the best linear regression model for the dependent variable medv.

#Variable Selection using best subset regression
model.subset <- regsubsets(medv ~ ., data = Boston.train, nbest = 1, nvmax = 13)
summary(model.subset)

## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 1, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 9  ( 1 )  "*"  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 10  ( 1 ) "*"  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

plot(model.subset, scale = "bic")

#Variable selection using stepwise regression
nullmodel <- lm(medv ~ 1, data = Boston.train)
fullmodel <- lm(medv ~ ., data = Boston.train)

#forward selection
model.step.f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "forward")

## Start:  AIC=1805.3
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   19540.4 15527 1478.2
## + rm       1   17275.3 17793 1533.2
## + ptratio  1    8828.8 26239 1690.1
## + indus    1    8166.1 26902 1700.2
## + tax      1    7382.3 27686 1711.8
## + nox      1    5805.6 29262 1734.2
## + crim     1    5174.7 29893 1742.8
## + rad      1    4793.6 30274 1747.9
## + age      1    4701.5 30366 1749.2
## + zn       1    4147.3 30921 1756.5
## + black    1    3502.4 31565 1764.8
## + dis      1    1761.5 33306 1786.5
## + chas     1    1032.5 34035 1795.2
## <none>                 35068 1805.3
## 
## Step:  AIC=1478.18
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1   3076.05 12451 1391.0
## + ptratio  1   2079.76 13448 1422.1
## + dis      1    808.72 14719 1458.6
## + chas     1    730.95 14796 1460.7
## + age      1    479.95 15048 1467.5
## + tax      1    200.38 15327 1474.9
## + black    1    138.86 15389 1476.5
## + indus    1     90.46 15437 1477.8
## + crim     1     88.99 15438 1477.8
## <none>                 15527 1478.2
## + zn       1     57.50 15470 1478.7
## + nox      1     24.68 15503 1479.5
## + rad      1      2.86 15525 1480.1
## 
## Step:  AIC=1390.98
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1   1230.22 11221 1351.0
## + chas     1    549.74 11902 1374.7
## + dis      1    440.02 12011 1378.5
## + black    1    303.06 12148 1383.0
## + tax      1    238.13 12213 1385.2
## + crim     1    205.56 12246 1386.3
## + age      1    107.88 12344 1389.5
## <none>                 12451 1391.0
## + rad      1     53.40 12398 1391.2
## + indus    1     28.21 12423 1392.1
## + zn       1      8.15 12443 1392.7
## + nox      1      0.45 12451 1393.0
## 
## Step:  AIC=1350.95
## medv ~ lstat + rm + ptratio
## 
##         Df Sum of Sq   RSS    AIC
## + dis    1    586.38 10635 1331.3
## + chas   1    448.39 10773 1336.5
## + black  1    253.45 10968 1343.7
## + age    1    159.20 11062 1347.2
## + crim   1     79.13 11142 1350.1
## <none>               11221 1351.0
## + zn     1     42.95 11178 1351.4
## + rad    1     31.25 11190 1351.8
## + tax    1     20.26 11201 1352.2
## + indus  1      1.72 11219 1352.9
## + nox    1      0.70 11220 1352.9
## 
## Step:  AIC=1331.27
## medv ~ lstat + rm + ptratio + dis
## 
##         Df Sum of Sq   RSS    AIC
## + nox    1    582.46 10052 1310.5
## + black  1    349.11 10286 1319.8
## + chas   1    317.85 10317 1321.0
## + indus  1    271.04 10364 1322.8
## + tax    1    185.09 10450 1326.2
## + crim   1    182.07 10453 1326.3
## + zn     1     81.16 10554 1330.2
## <none>               10635 1331.3
## + age    1     12.17 10623 1332.8
## + rad    1      4.54 10630 1333.1
## 
## Step:  AIC=1310.51
## medv ~ lstat + rm + ptratio + dis + nox
## 
##         Df Sum of Sq     RSS    AIC
## + chas   1    381.58  9670.8 1296.9
## + black  1    192.40  9859.9 1304.7
## + crim   1    109.40  9942.9 1308.1
## + zn     1     90.42  9961.9 1308.9
## + rad    1     77.60  9974.7 1309.4
## <none>               10052.3 1310.5
## + indus  1     44.48 10007.9 1310.7
## + tax    1      7.13 10045.2 1312.2
## + age    1      6.61 10045.7 1312.2
## 
## Step:  AIC=1296.88
## medv ~ lstat + rm + ptratio + dis + nox + chas
## 
##         Df Sum of Sq    RSS    AIC
## + black  1   163.353 9507.4 1292.0
## + zn     1    92.410 9578.3 1295.0
## + crim   1    86.734 9584.0 1295.2
## + rad    1    72.269 9598.5 1295.8
## + indus  1    57.009 9613.7 1296.5
## <none>               9670.8 1296.9
## + tax    1     4.108 9666.6 1298.7
## + age    1     2.960 9667.8 1298.8
## 
## Step:  AIC=1292
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
## 
##         Df Sum of Sq    RSS    AIC
## + rad    1   128.106 9379.3 1288.5
## + zn     1   103.795 9403.6 1289.6
## + crim   1    54.458 9452.9 1291.7
## <none>               9507.4 1292.0
## + indus  1    46.188 9461.2 1292.0
## + tax    1     0.120 9507.3 1294.0
## + age    1     0.052 9507.3 1294.0
## 
## Step:  AIC=1288.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad
## 
##         Df Sum of Sq    RSS    AIC
## + tax    1   241.894 9137.4 1280.0
## + crim   1   199.207 9180.1 1281.8
## + indus  1    72.169 9307.1 1287.4
## + zn     1    66.868 9312.4 1287.6
## <none>               9379.3 1288.5
## + age    1     2.863 9376.4 1290.4
## 
## Step:  AIC=1279.96
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax
## 
##         Df Sum of Sq    RSS    AIC
## + crim   1   200.083 8937.3 1273.0
## + zn     1   135.443 9002.0 1275.9
## <none>               9137.4 1280.0
## + indus  1     8.588 9128.8 1281.6
## + age    1     2.736 9134.7 1281.8
## 
## Step:  AIC=1273.02
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax + crim
## 
##         Df Sum of Sq    RSS    AIC
## + zn     1   166.375 8770.9 1267.4
## <none>               8937.3 1273.0
## + indus  1    11.992 8925.3 1274.5
## + age    1     1.174 8936.1 1275.0
## 
## Step:  AIC=1267.43
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax + crim + zn
## 
##         Df Sum of Sq    RSS    AIC
## <none>               8770.9 1267.4
## + age    1    8.0495 8762.9 1269.0
## + indus  1    4.0055 8766.9 1269.2

#Backward selection
model.step.b <- step(fullmodel, direction = "backward")

## Start:  AIC=1270.89
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1      3.55  8762.9 1269.0
## - age      1      7.59  8766.9 1269.2
## <none>                  8759.3 1270.9
## - black    1    163.05  8922.4 1276.3
## - zn       1    164.99  8924.3 1276.4
## - crim     1    229.78  8989.1 1279.3
## - tax      1    251.46  9010.8 1280.3
## - chas     1    252.49  9011.8 1280.4
## - nox      1    363.06  9122.4 1285.3
## - rad      1    508.06  9267.4 1291.7
## - ptratio  1   1030.90  9790.3 1313.8
## - dis      1   1051.60  9811.0 1314.7
## - rm       1   1120.64  9880.0 1317.5
## - lstat    1   2331.00 11090.4 1364.2
## 
## Step:  AIC=1269.05
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - age      1      8.05  8770.9 1267.4
## <none>                  8762.9 1269.0
## - black    1    163.94  8926.8 1274.5
## - zn       1    173.25  8936.1 1275.0
## - crim     1    228.32  8991.2 1277.5
## - chas     1    249.31  9012.2 1278.4
## - tax      1    323.39  9086.3 1281.7
## - nox      1    412.93  9175.8 1285.7
## - rad      1    555.41  9318.3 1291.9
## - ptratio  1   1057.52  9820.4 1313.1
## - dis      1   1078.05  9840.9 1313.9
## - rm       1   1149.65  9912.5 1316.9
## - lstat    1   2358.89 11121.8 1363.4
## 
## Step:  AIC=1267.43
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8770.9 1267.4
## - zn       1    166.37  8937.3 1273.0
## - black    1    171.77  8942.7 1273.3
## - crim     1    231.02  9002.0 1275.9
## - chas     1    253.32  9024.3 1276.9
## - tax      1    321.00  9091.9 1280.0
## - nox      1    412.34  9183.3 1284.0
## - rad      1    548.85  9319.8 1290.0
## - ptratio  1   1050.67  9821.6 1311.1
## - dis      1   1248.65 10019.6 1319.2
## - rm       1   1255.75 10026.7 1319.5
## - lstat    1   2684.39 11455.3 1373.3

#stepwise selection
model.step <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "both")

## Start:  AIC=1805.3
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   19540.4 15527 1478.2
## + rm       1   17275.3 17793 1533.2
## + ptratio  1    8828.8 26239 1690.1
## + indus    1    8166.1 26902 1700.2
## + tax      1    7382.3 27686 1711.8
## + nox      1    5805.6 29262 1734.2
## + crim     1    5174.7 29893 1742.8
## + rad      1    4793.6 30274 1747.9
## + age      1    4701.5 30366 1749.2
## + zn       1    4147.3 30921 1756.5
## + black    1    3502.4 31565 1764.8
## + dis      1    1761.5 33306 1786.5
## + chas     1    1032.5 34035 1795.2
## <none>                 35068 1805.3
## 
## Step:  AIC=1478.18
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    3076.1 12451 1391.0
## + ptratio  1    2079.8 13448 1422.1
## + dis      1     808.7 14719 1458.6
## + chas     1     731.0 14796 1460.7
## + age      1     480.0 15047 1467.5
## + tax      1     200.4 15327 1474.9
## + black    1     138.9 15389 1476.5
## + indus    1      90.5 15437 1477.8
## + crim     1      89.0 15438 1477.8
## <none>                 15527 1478.2
## + zn       1      57.5 15470 1478.7
## + nox      1      24.7 15503 1479.5
## + rad      1       2.9 15525 1480.1
## - lstat    1   19540.4 35068 1805.3
## 
## Step:  AIC=1390.98
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1    1230.2 11221 1351.0
## + chas     1     549.7 11902 1374.7
## + dis      1     440.0 12011 1378.5
## + black    1     303.1 12148 1383.0
## + tax      1     238.1 12213 1385.2
## + crim     1     205.6 12246 1386.3
## + age      1     107.9 12344 1389.5
## <none>                 12451 1391.0
## + rad      1      53.4 12398 1391.2
## + indus    1      28.2 12423 1392.1
## + zn       1       8.2 12443 1392.7
## + nox      1       0.5 12451 1393.0
## - rm       1    3076.1 15527 1478.2
## - lstat    1    5341.1 17793 1533.2
## 
## Step:  AIC=1350.95
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq   RSS    AIC
## + dis      1     586.4 10635 1331.3
## + chas     1     448.4 10773 1336.5
## + black    1     253.5 10968 1343.7
## + age      1     159.2 11062 1347.2
## + crim     1      79.1 11142 1350.1
## <none>                 11221 1351.0
## + zn       1      43.0 11178 1351.4
## + rad      1      31.3 11190 1351.8
## + tax      1      20.3 11201 1352.2
## + indus    1       1.7 11219 1352.9
## + nox      1       0.7 11220 1352.9
## - ptratio  1    1230.2 12451 1391.0
## - rm       1    2226.5 13448 1422.1
## - lstat    1    4303.1 15524 1480.1
## 
## Step:  AIC=1331.27
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq   RSS    AIC
## + nox      1     582.5 10052 1310.5
## + black    1     349.1 10286 1319.8
## + chas     1     317.9 10317 1321.0
## + indus    1     271.0 10364 1322.8
## + tax      1     185.1 10450 1326.2
## + crim     1     182.1 10453 1326.3
## + zn       1      81.2 10554 1330.2
## <none>                 10635 1331.3
## + age      1      12.2 10623 1332.8
## + rad      1       4.5 10630 1333.1
## - dis      1     586.4 11221 1351.0
## - ptratio  1    1376.6 12011 1378.5
## - rm       1    1841.1 12476 1393.8
## - lstat    1    4837.1 15472 1480.7
## 
## Step:  AIC=1310.51
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq     RSS    AIC
## + chas     1     381.6  9670.8 1296.9
## + black    1     192.4  9859.9 1304.7
## + crim     1     109.4  9942.9 1308.1
## + zn       1      90.4  9961.9 1308.9
## + rad      1      77.6  9974.7 1309.4
## <none>                 10052.3 1310.5
## + indus    1      44.5 10007.9 1310.7
## + tax      1       7.1 10045.2 1312.2
## + age      1       6.6 10045.7 1312.2
## - nox      1     582.5 10634.8 1331.3
## - dis      1    1168.1 11220.5 1352.9
## - ptratio  1    1613.5 11665.8 1368.7
## - rm       1    1726.1 11778.4 1372.5
## - lstat    1    3497.6 13549.9 1429.1
## 
## Step:  AIC=1296.88
## medv ~ lstat + rm + ptratio + dis + nox + chas
## 
##           Df Sum of Sq     RSS    AIC
## + black    1     163.4  9507.4 1292.0
## + zn       1      92.4  9578.3 1295.0
## + crim     1      86.7  9584.0 1295.2
## + rad      1      72.3  9598.5 1295.8
## + indus    1      57.0  9613.7 1296.5
## <none>                  9670.8 1296.9
## + tax      1       4.1  9666.6 1298.7
## + age      1       3.0  9667.8 1298.8
## - chas     1     381.6 10052.3 1310.5
## - nox      1     646.2 10316.9 1321.0
## - dis      1    1085.5 10756.3 1337.9
## - ptratio  1    1499.1 11169.9 1353.1
## - rm       1    1672.6 11343.4 1359.3
## - lstat    1    3344.7 13015.4 1414.9
## 
## Step:  AIC=1292
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
## 
##           Df Sum of Sq     RSS    AIC
## + rad      1    128.11  9379.3 1288.5
## + zn       1    103.79  9403.6 1289.6
## + crim     1     54.46  9452.9 1291.7
## <none>                  9507.4 1292.0
## + indus    1     46.19  9461.2 1292.0
## + tax      1      0.12  9507.3 1294.0
## + age      1      0.05  9507.3 1294.0
## - black    1    163.35  9670.8 1296.9
## - chas     1    352.53  9859.9 1304.7
## - nox      1    486.55  9993.9 1310.2
## - dis      1   1027.59 10535.0 1331.5
## - ptratio  1   1434.48 10941.9 1346.8
## - rm       1   1764.85 11272.2 1358.8
## - lstat    1   2987.69 12495.1 1400.4
## 
## Step:  AIC=1288.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad
## 
##           Df Sum of Sq     RSS    AIC
## + tax      1    241.89  9137.4 1280.0
## + crim     1    199.21  9180.1 1281.8
## + indus    1     72.17  9307.1 1287.4
## + zn       1     66.87  9312.4 1287.6
## <none>                  9379.3 1288.5
## + age      1      2.86  9376.4 1290.4
## - rad      1    128.11  9507.4 1292.0
## - black    1    219.19  9598.5 1295.8
## - chas     1    340.52  9719.8 1300.9
## - nox      1    610.59  9989.9 1312.0
## - dis      1   1056.50 10435.8 1329.6
## - ptratio  1   1528.99 10908.3 1347.5
## - rm       1   1575.83 10955.1 1349.3
## - lstat    1   3094.90 12474.2 1401.7
## 
## Step:  AIC=1279.96
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax
## 
##           Df Sum of Sq     RSS    AIC
## + crim     1    200.08  8937.3 1273.0
## + zn       1    135.44  9002.0 1275.9
## <none>                  9137.4 1280.0
## + indus    1      8.59  9128.8 1281.6
## + age      1      2.74  9134.7 1281.8
## - black    1    203.97  9341.4 1286.9
## - tax      1    241.89  9379.3 1288.5
## - chas     1    294.50  9431.9 1290.8
## - rad      1    369.88  9507.3 1294.0
## - nox      1    422.19  9559.6 1296.2
## - dis      1   1027.43 10164.8 1321.0
## - rm       1   1398.79 10536.2 1335.5
## - ptratio  1   1434.42 10571.8 1336.9
## - lstat    1   3089.14 12226.5 1395.6
## 
## Step:  AIC=1273.02
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax + crim
## 
##           Df Sum of Sq     RSS    AIC
## + zn       1    166.37  8770.9 1267.4
## <none>                  8937.3 1273.0
## + indus    1     11.99  8925.3 1274.5
## + age      1      1.17  8936.1 1275.0
## - black    1    174.61  9111.9 1278.8
## - crim     1    200.08  9137.4 1280.0
## - tax      1    242.77  9180.1 1281.8
## - chas     1    259.36  9196.7 1282.6
## - nox      1    473.18  9410.5 1291.9
## - rad      1    504.45  9441.8 1293.2
## - dis      1   1100.15 10037.5 1317.9
## - rm       1   1412.90 10350.2 1330.3
## - ptratio  1   1496.37 10433.7 1333.6
## - lstat    1   2717.20 11654.5 1378.3
## 
## Step:  AIC=1267.43
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad + 
##     tax + crim + zn
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8770.9 1267.4
## + age      1      8.05  8762.9 1269.0
## + indus    1      4.01  8766.9 1269.2
## - zn       1    166.37  8937.3 1273.0
## - black    1    171.77  8942.7 1273.3
## - crim     1    231.02  9002.0 1275.9
## - chas     1    253.32  9024.3 1276.9
## - tax      1    321.00  9091.9 1280.0
## - nox      1    412.34  9183.3 1284.0
## - rad      1    548.85  9319.8 1290.0
## - ptratio  1   1050.67  9821.6 1311.1
## - dis      1   1248.65 10019.6 1319.2
## - rm       1   1255.75 10026.7 1319.5
## - lstat    1   2684.39 11455.3 1373.3

AIC(model.step)

## [1] 2415.927

BIC(model.step)

## [1] 2467.946

summary(model.step)

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + rad + tax + crim + zn, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1161  -2.5762  -0.6103   1.5993  24.8802 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.372746   5.822013   6.934 1.69e-11 ***
## lstat        -0.584915   0.053401 -10.953  < 2e-16 ***
## rm            3.478815   0.464366   7.492 4.56e-13 ***
## ptratio      -0.961374   0.140294  -6.853 2.82e-11 ***
## dis          -1.529129   0.204693  -7.470 5.25e-13 ***
## nox         -17.097070   3.982669  -4.293 2.23e-05 ***
## chas          3.200015   0.951031   3.365 0.000842 ***
## black         0.008655   0.003124   2.771 0.005859 ** 
## rad           0.363258   0.073345   4.953 1.09e-06 ***
## tax          -0.014658   0.003870  -3.788 0.000176 ***
## crim         -0.121738   0.037887  -3.213 0.001421 ** 
## zn            0.041536   0.015232   2.727 0.006681 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.73 on 392 degrees of freedom
## Multiple R-squared:  0.7499, Adjusted R-squared:  0.7429 
## F-statistic: 106.8 on 11 and 392 DF,  p-value: < 2.2e-16

From Best subset regression and stepwise selection (forward, backward, both), we see that all variables except indus and age are significant.

Model Assessment

#Model Diagnostics for model 2
par(mfrow = c(2,2))
plot(model.step)

par(mfrow = c(1,1))

Residuals vs Fitted plot shows that the relationship between medv and predictors is not completely linear. Also, normal qq plot is skewed implying that residuals are not normally distributed. A different functional from may be required.

Models are compared based on adjusted r square, AIC, BIC criteria for in-sample performance and mean square prediction error (MSPE) for out-of-sample performance

#In-sample performance
#MSE
model.sum <- summary(model1)
(model.sum$sigma) ^ 2

## [1] 22.45987

model2.sum <- summary(model2)
(model2.sum$sigma) ^ 2

## [1] 22.37486

#R-squared
model1.sum$r.squared

## [1] 0.750217

model2.sum$r.squared

## [1] 0.7498863

#Adjusted r square
model1.sum$adj.r.squared

## [1] 0.7418909

model2.sum$adj.r.squared

## [1] 0.7428679

#AIC 
AIC(model1)

## [1] 2419.393

AIC(model2)

## [1] 2415.927

#BIC
BIC(model1)

## [1] 2479.414

BIC(model2)

## [1] 2467.946

#Out-of-sample Prediction or test error (MSPE)
model1.pred.test <- predict(model1, newdata = Boston.test)
model1.mspe <- mean((model1.pred.test - Boston.test$medv) ^ 2)
model1.mspe

## [1] 23.91468

model2.pred.test <- predict(model2, newdata = Boston.test)
model2.mspe <- mean((model2.pred.test - Boston.test$medv) ^ 2)
model2.mspe

## [1] 23.58354

#Cross Validation
model1.glm = glm(medv ~ ., data = Boston)
cv.glm(data = Boston, glmfit = model1.glm, K = 5)$delta[2]

## [1] 24.16365

model2.glm <- glm(medv ~ . -indus -age, data = Boston)
cv.glm(data = Boston, glmfit = model2.glm, K = 5)$delta[2]

## [1] 23.82476

Based on AIC criteria and adjusted R square values, model 2 is slightly better than model 1. In-sample MSE is nearly the same for both models.

We need to check out-of-sample MSPE for both models. Based on out-of-sample prediction error, model 2 is slightly better than model 1. MSPE of model 1 is 20.7113 while that of model 2 is 20.6784. Based on cross validation also, model 2 performs better.

Fitting Regression Trees

Following regression trees were fitted to the training data. * Using default value, cp = 0.01 and no additional constraints. This resulted in a tree with 8 terminal nodes. * Making cp = 0.001 and allowing the tree to grow large This results in a tree with 27 terminal nodes. A plot of cp values vs error rates (fig 7) shows that a cp value of 0.0072 would reduce the complexity of the model. * Finally, with cp = 0.0072. This results in a tree with 10 terminal nodes.

#default value of cp = 0.01
Boston.tree <- rpart(medv ~ ., data = Boston.train)
Boston.tree

## n= 404 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 404 35067.83000 22.82376  
##    2) lstat>=9.755 227  5123.02300 17.18282  
##      4) lstat>=16.085 115  1935.10100 14.26522  
##        8) crim>=5.76921 57   627.49720 11.79298 *
##        9) crim< 5.76921 58   616.84840 16.69483 *
##      5) lstat< 16.085 112  1203.84900 20.17857 *
##    3) lstat< 9.755 177 13457.97000 30.05819  
##      6) rm< 7.141 135  5507.54400 26.81852  
##       12) tax< 548 128  2713.95900 25.92188  
##         24) rm< 6.5445 70   525.02070 23.13571 *
##         25) rm>=6.5445 58   989.73600 29.28448 *
##       13) tax>=548 7   808.92860 43.21429 *
##      7) rm>=7.141 42  1979.24600 40.47143  
##       14) rm< 7.445 17    59.46118 34.75882 *
##       15) rm>=7.445 25   987.76160 44.35600 *

#Plotting the tree
rpart.plot(Boston.tree, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

plotcp(Boston.tree)

printcp(Boston.tree)

## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train)
## 
## Variables actually used in tree construction:
## [1] crim  lstat rm    tax  
## 
## Root node error: 35068/404 = 86.802
## 
## n= 404 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.470141      0   1.00000 1.00205 0.088668
## 2 0.170275      1   0.52986 0.53927 0.045741
## 3 0.056595      2   0.35958 0.40188 0.040452
## 4 0.056578      3   0.30299 0.40309 0.044836
## 5 0.034197      4   0.24641 0.36308 0.044066
## 6 0.026578      5   0.21221 0.31303 0.040008
## 7 0.019698      6   0.18564 0.31248 0.040015
## 8 0.010000      7   0.16594 0.27943 0.039762

#Building a large tree
Boston.largetree <- rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)

plot(Boston.largetree)
text(Boston.largetree)

printcp(Boston.largetree)

## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] black   crim    dis     indus   lstat   nox     ptratio rad    
##  [9] rm      tax    
## 
## Root node error: 35068/404 = 86.802
## 
## n= 404 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4701414      0   1.00000 1.00199 0.088924
## 2  0.1702752      1   0.52986 0.53930 0.046107
## 3  0.0565948      2   0.35958 0.40589 0.042325
## 4  0.0565782      3   0.30299 0.38842 0.043911
## 5  0.0341966      4   0.24641 0.33213 0.040310
## 6  0.0265777      5   0.21221 0.29703 0.038979
## 7  0.0196977      6   0.18564 0.28913 0.038851
## 8  0.0088490      7   0.16594 0.26006 0.037794
## 9  0.0082962      8   0.15709 0.24019 0.029925
## 10 0.0061924      9   0.14879 0.23732 0.030105
## 11 0.0055046     10   0.14260 0.23016 0.030094
## 12 0.0041547     11   0.13710 0.23440 0.030639
## 13 0.0039535     12   0.13294 0.23592 0.030634
## 14 0.0038486     13   0.12899 0.23463 0.030629
## 15 0.0037686     14   0.12514 0.23198 0.030501
## 16 0.0031753     15   0.12137 0.22617 0.028763
## 17 0.0019657     16   0.11820 0.21893 0.028622
## 18 0.0019599     17   0.11623 0.21474 0.028587
## 19 0.0016566     18   0.11427 0.21478 0.028584
## 20 0.0015947     20   0.11096 0.21543 0.028610
## 21 0.0013438     21   0.10936 0.21592 0.028584
## 22 0.0012121     22   0.10802 0.21850 0.028677
## 23 0.0011492     23   0.10681 0.21878 0.028674
## 24 0.0010170     24   0.10566 0.21854 0.028671
## 25 0.0010045     25   0.10464 0.21898 0.028686
## 26 0.0010000     26   0.10364 0.21929 0.028683

plotcp(Boston.largetree)

#however, from plotcp, we observe that a tree with more than 7 to 9 splits is not very helpful.
#further pruning the tree to limit to 9 splits;corresponding cp value from plot is 0.0072
pruned.tree <- prune(Boston.largetree, cp = 0.0072)
pruned.tree

## n= 404 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 404 35067.83000 22.82376  
##    2) lstat>=9.755 227  5123.02300 17.18282  
##      4) lstat>=16.085 115  1935.10100 14.26522  
##        8) crim>=5.76921 57   627.49720 11.79298 *
##        9) crim< 5.76921 58   616.84840 16.69483 *
##      5) lstat< 16.085 112  1203.84900 20.17857 *
##    3) lstat< 9.755 177 13457.97000 30.05819  
##      6) rm< 7.141 135  5507.54400 26.81852  
##       12) tax< 548 128  2713.95900 25.92188  
##         24) rm< 6.5445 70   525.02070 23.13571 *
##         25) rm>=6.5445 58   989.73600 29.28448  
##           50) lstat>=5.195 31   267.05940 27.12581 *
##           51) lstat< 5.195 27   412.36300 31.76296 *
##       13) tax>=548 7   808.92860 43.21429 *
##      7) rm>=7.141 42  1979.24600 40.47143  
##       14) rm< 7.445 17    59.46118 34.75882 *
##       15) rm>=7.445 25   987.76160 44.35600  
##         30) ptratio>=17.6 7   465.96860 38.88571 *
##         31) ptratio< 17.6 18   230.86500 46.48333 *

rpart.plot(pruned.tree, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE, extra = 1)

#In-sample MSE
mean((predict(Boston.tree) - Boston.train$medv) ^ 2)      #default tree

## [1] 14.40372

mean((predict(Boston.largetree) - Boston.train$medv) ^ 2)  #large tree

## [1] 8.995731

mean((predict(pruned.tree) - Boston.train$medv) ^ 2)       #pruned tree

## [1] 12.9155

#out-of-sample performance
#Mean squared error loss for this tree
mean((predict(Boston.tree, newdata = Boston.test) - Boston.test$medv) ^ 2)  #default tree

## [1] 22.69261

mean((predict(Boston.largetree, newdata = Boston.test) - Boston.test$medv) ^ 2)   #large tree

## [1] 20.06999

mean((predict(pruned.tree, newdata = Boston.test) - Boston.test$medv) ^ 2)     #pruned tree

## [1] 21.45344

From plot, we can observe that cross validation error does not always go down when tree becomes more complex. We can see that a tree with more than 7 to 9 splits is not very helpful.

The large tree results in lowest in-sample and out-of-sample prediction error compared to other two trees. But there is a large difference between in-sample and out-of-sample performance for this tree. In contrast, trees with default cp value and pruned tree have lower difference between in-sample and out-of-sample prediction errors. Further, they are easier to read and interpret and do not sacrifice much in terms of out-of-sample prediction error. Since pruned has lower prediction errors compared to tree with default cp value, it is the chosen tree.

Generalized Additive Model

Residual diagnostics of linear regression model showed that the relation between medv and predictor variables may not be linear. Since the correct transformation of predictor variables is not known, GAM can be used to model non-linearity. GAM is fit using smoothing splines, s(), which is available in gam library in R. In the model, smoothing spline is used for all continuous variables except ‘chas’ and ‘rad’, which are of integer type and which have less than 10 unique values. It is not recommended to use smoothing splines on such variables.

#model 1 - not using s() on chas and rad, leaving them as integers
Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) + 
                    s(tax) + s(ptratio) + s(black) + s(lstat) + chas + rad, data = Boston.train)
summary(Boston.gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + 
##     s(dis) + s(tax) + s(ptratio) + s(black) + s(lstat) + chas + 
##     rad
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.8584     1.1963  16.599   <2e-16 ***
## chas          0.9520     0.6939   1.372    0.171    
## rad           0.3145     0.1286   2.445    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    5.085  6.083 12.204 1.26e-12 ***
## s(zn)      1.000  1.000  1.750  0.18675    
## s(indus)   4.716  5.671  3.046  0.00502 ** 
## s(nox)     8.817  8.978 13.766  < 2e-16 ***
## s(rm)      8.288  8.854 18.778  < 2e-16 ***
## s(age)     1.000  1.000  0.268  0.60507    
## s(dis)     8.779  8.981  8.441 1.43e-11 ***
## s(tax)     5.933  6.802  8.377 6.67e-09 ***
## s(ptratio) 1.894  2.368 16.809 2.36e-08 ***
## s(black)   1.000  1.000  0.532  0.46652    
## s(lstat)   7.458  8.418 20.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 100/102
## R-sq.(adj) =  0.891   Deviance explained = 90.6%
## GCV = 11.084  Scale est. = 9.5212    n = 404

Variables which have edf close to 1 have linear relationship with medv. Summary of the model shows that variables ‘zn’, ‘age’ and ‘black’ have edf of 1. This along with their p-values, indicates that they have a linear relationship with medv. Rest of the variables have a non-linear relationship with medv.

#model 2 - removing s() from functions which are linear
Boston.gam <- gam(medv ~ s(crim) + zn + s(indus) + s(nox) + s(rm) + age + s(dis) + 
                    s(tax) + s(ptratio) + black + s(lstat) + chas + rad, data = Boston.train)
summary(Boston.gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + zn + s(indus) + s(nox) + s(rm) + age + s(dis) + 
##     s(tax) + s(ptratio) + black + s(lstat) + chas + rad
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.282679   1.776422  10.855   <2e-16 ***
## zn           0.021160   0.015822   1.337   0.1820    
## age         -0.006635   0.012089  -0.549   0.5834    
## black        0.001743   0.002338   0.746   0.4563    
## chas         0.947958   0.694317   1.365   0.1730    
## rad          0.333164   0.141282   2.358   0.0189 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    5.777  6.885 10.699 3.50e-12 ***
## s(indus)   4.643  5.572  3.046  0.00499 ** 
## s(nox)     8.816  8.978 13.678  < 2e-16 ***
## s(rm)      8.287  8.854 18.709  < 2e-16 ***
## s(dis)     8.778  8.980  8.375 1.82e-11 ***
## s(tax)     6.362  7.387  7.426 1.01e-08 ***
## s(ptratio) 1.892  2.365 16.027 5.27e-08 ***
## s(lstat)   7.453  8.416 20.462  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.89   Deviance explained = 90.6%
## GCV =  11.13  Scale est. = 9.5316    n = 404

Performance metrics of the model is shown below:

#Model AIC, BIC, mean residual deviance
AIC(Boston.gam)

## [1] 2112.764

BIC(Boston.gam)

## [1] 2348.88

Boston.gam$deviance

## [1] 3297.872

The non-linear relationship of variables with medv can be seen in the following plots:

#plot
plot(Boston.gam, shade = TRUE, seWithMean = TRUE, scale = 0)

In-sample and out-of-sample prediction errors are shown below:

#In-sample prediction
(Boston.gam.mse <- mean((predict(Boston.gam) - Boston.train$medv) ^ 2))

## [1] 8.163049

#Out-of-sample prediction - MSPE
(Boston.gam.mspe <- mean((predict(Boston.gam, newdata = Boston.test) - Boston.test$medv) ^ 2))

## [1] 11.96812

Conclusion

Three different methods - linear regression, regression tree and GAM - were used to predict median housing price (medv) in the above problem. In each method, the best possible model was chosen based on its performance using in-sample and out-of-sample data sets. The performance measures used include mean squared error (in-sample) and mean squared prediction error (out-of-sample). Finally, a comprehensive model comparison was done between different methods. Lower the MSPE, better is the model.

It was observed GAM performs the best among the model with lowest out-of-sample MSPE. Moreover, it is able to capture non-linear relations in the data set.