Boston housing data is a data set in package MASS. The data set has 506 rows and 14 columns.
This report provides an analysis and evaluation of the factors affecting the median value of the owner occupied homes in the suburbs of Boston. The in-built data set of Boston Housing Data in package MASS is used for this analysis and various factors about the structural quality, neighbourhood, accessibility and air pollution such as per capita crime rate by town, proportion of non-retail business acres per town, index of accessibility to radial highways etc are taken into account for this study.
Approach Methods of analysis includes the following * Summary statistics of the variables and finding correlation between variables, * Exploratory data analysis using visualization * Random sampling of data set into 80/20 training and testing data set * Fitting various models such as generalized linear regression, regression tree, generalized additive model (GAM) and neural network * Finally, comparing the models based on in-sample (MSPE) and out-of-sample prediction errors (MSPE).
Packages required Loading required packages:
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(corrplot)
library(leaps)
library(rpart)
library(mgcv)
library(glmnet)
library(boot)
library(rpart.plot)
#loading data
data(Boston)
dim(Boston)
## [1] 506 14
#a look at first few rows
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
#a look at structure of the data set
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
#summary statistics
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#Check for missing values
sum(is.na(Boston))
## [1] 0
#Check for duplicated values
sum(duplicated(Boston))
## [1] 0
#checking correlation between variables
corrplot(cor(Boston), method = "number", type = "upper", diag = FALSE)
There are no missing or duplicate values.
From correlation matrix, some of the observations made are as follows:
Boston %>%
gather(key, val, -medv) %>%
ggplot(aes(x = val, y = medv)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE, col = "blue") +
facet_wrap(~key, scales = "free") +
theme_gray() +
ggtitle("Scatter plot of dependent variables vs Median Value (medv)")
table(Boston$chas)
##
## 0 1
## 471 35
Observations * Proportion of owner occupied units built prior to 1940 (age) and proportion of blacks by town (black) is heavily skewed to left, while per capita crime rate in town (crim) and weighted mean of distances to five Boston employment centres (dis) is heavily skewed to right. * rm is normally distributed with mean of approximately 6. * Most of the properties are situated close to the five Boston employment centres (dis skewed to right) * There is a high proportion of owner occupied units built prior to 1940 (age skewed to left) and blacks in town (black skewed to right) * From scatter plots, it is seen that lstat and rm show strong correlation with medv. * 93% of the properties are away from Charles river. The properties bordering the river seems to have higher median prices.
set.seed(12383010)
index <- sample(nrow(Boston), nrow(Boston) * 0.80)
Boston.train <- Boston[index, ]
Boston.test <- Boston[-index, ]
model1 <- lm(medv ~ ., data = Boston.train)
model1.sum <- summary(model1)
model1.sum
##
## Call:
## lm(formula = medv ~ ., data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9568 -2.5728 -0.5662 1.6289 24.5479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.637283 5.859638 6.935 1.69e-11 ***
## crim -0.121515 0.037991 -3.199 0.001494 **
## zn 0.042038 0.015510 2.710 0.007018 **
## indus -0.026726 0.067262 -0.397 0.691327
## chas 3.207705 0.956704 3.353 0.000878 ***
## nox -17.301829 4.303363 -4.021 6.97e-05 ***
## rm 3.393375 0.480399 7.064 7.50e-12 ***
## age 0.008526 0.014667 0.581 0.561355
## dis -1.509803 0.220647 -6.843 3.02e-11 ***
## rad 0.359537 0.075594 4.756 2.78e-06 ***
## tax -0.014072 0.004205 -3.346 0.000899 ***
## ptratio -0.959623 0.141643 -6.775 4.60e-11 ***
## black 0.008467 0.003143 2.694 0.007357 **
## lstat -0.597214 0.058622 -10.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.739 on 390 degrees of freedom
## Multiple R-squared: 0.7502, Adjusted R-squared: 0.7419
## F-statistic: 90.1 on 13 and 390 DF, p-value: < 2.2e-16
#Looking at model summary, we see that variables indus and age are insignificant
#Building model without variables indus and age
model2 <- lm(medv ~ . -indus -age, data = Boston.train)
model2.sum <- summary(model2)
model2.sum
##
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1161 -2.5762 -0.6103 1.5993 24.8802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.372746 5.822013 6.934 1.69e-11 ***
## crim -0.121738 0.037887 -3.213 0.001421 **
## zn 0.041536 0.015232 2.727 0.006681 **
## chas 3.200015 0.951031 3.365 0.000842 ***
## nox -17.097070 3.982669 -4.293 2.23e-05 ***
## rm 3.478815 0.464366 7.492 4.56e-13 ***
## dis -1.529129 0.204693 -7.470 5.25e-13 ***
## rad 0.363258 0.073345 4.953 1.09e-06 ***
## tax -0.014658 0.003870 -3.788 0.000176 ***
## ptratio -0.961374 0.140294 -6.853 2.82e-11 ***
## black 0.008655 0.003124 2.771 0.005859 **
## lstat -0.584915 0.053401 -10.953 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.73 on 392 degrees of freedom
## Multiple R-squared: 0.7499, Adjusted R-squared: 0.7429
## F-statistic: 106.8 on 11 and 392 DF, p-value: < 2.2e-16
Summary of model 2 shows that all variables are significant (based on p-value).
Best subset (13 variable) and stepwise (forward, backward, both) techniques of variable selection were used to come up with the best linear regression model for the dependent variable medv.
#Variable Selection using best subset regression
model.subset <- regsubsets(medv ~ ., data = Boston.train, nbest = 1, nvmax = 13)
summary(model.subset)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 1,
## nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " " " " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 9 ( 1 ) "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 1 ) "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(model.subset, scale = "bic")
#Variable selection using stepwise regression
nullmodel <- lm(medv ~ 1, data = Boston.train)
fullmodel <- lm(medv ~ ., data = Boston.train)
#forward selection
model.step.f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "forward")
## Start: AIC=1805.3
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 19540.4 15527 1478.2
## + rm 1 17275.3 17793 1533.2
## + ptratio 1 8828.8 26239 1690.1
## + indus 1 8166.1 26902 1700.2
## + tax 1 7382.3 27686 1711.8
## + nox 1 5805.6 29262 1734.2
## + crim 1 5174.7 29893 1742.8
## + rad 1 4793.6 30274 1747.9
## + age 1 4701.5 30366 1749.2
## + zn 1 4147.3 30921 1756.5
## + black 1 3502.4 31565 1764.8
## + dis 1 1761.5 33306 1786.5
## + chas 1 1032.5 34035 1795.2
## <none> 35068 1805.3
##
## Step: AIC=1478.18
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 3076.05 12451 1391.0
## + ptratio 1 2079.76 13448 1422.1
## + dis 1 808.72 14719 1458.6
## + chas 1 730.95 14796 1460.7
## + age 1 479.95 15048 1467.5
## + tax 1 200.38 15327 1474.9
## + black 1 138.86 15389 1476.5
## + indus 1 90.46 15437 1477.8
## + crim 1 88.99 15438 1477.8
## <none> 15527 1478.2
## + zn 1 57.50 15470 1478.7
## + nox 1 24.68 15503 1479.5
## + rad 1 2.86 15525 1480.1
##
## Step: AIC=1390.98
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1230.22 11221 1351.0
## + chas 1 549.74 11902 1374.7
## + dis 1 440.02 12011 1378.5
## + black 1 303.06 12148 1383.0
## + tax 1 238.13 12213 1385.2
## + crim 1 205.56 12246 1386.3
## + age 1 107.88 12344 1389.5
## <none> 12451 1391.0
## + rad 1 53.40 12398 1391.2
## + indus 1 28.21 12423 1392.1
## + zn 1 8.15 12443 1392.7
## + nox 1 0.45 12451 1393.0
##
## Step: AIC=1350.95
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 586.38 10635 1331.3
## + chas 1 448.39 10773 1336.5
## + black 1 253.45 10968 1343.7
## + age 1 159.20 11062 1347.2
## + crim 1 79.13 11142 1350.1
## <none> 11221 1351.0
## + zn 1 42.95 11178 1351.4
## + rad 1 31.25 11190 1351.8
## + tax 1 20.26 11201 1352.2
## + indus 1 1.72 11219 1352.9
## + nox 1 0.70 11220 1352.9
##
## Step: AIC=1331.27
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 582.46 10052 1310.5
## + black 1 349.11 10286 1319.8
## + chas 1 317.85 10317 1321.0
## + indus 1 271.04 10364 1322.8
## + tax 1 185.09 10450 1326.2
## + crim 1 182.07 10453 1326.3
## + zn 1 81.16 10554 1330.2
## <none> 10635 1331.3
## + age 1 12.17 10623 1332.8
## + rad 1 4.54 10630 1333.1
##
## Step: AIC=1310.51
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 381.58 9670.8 1296.9
## + black 1 192.40 9859.9 1304.7
## + crim 1 109.40 9942.9 1308.1
## + zn 1 90.42 9961.9 1308.9
## + rad 1 77.60 9974.7 1309.4
## <none> 10052.3 1310.5
## + indus 1 44.48 10007.9 1310.7
## + tax 1 7.13 10045.2 1312.2
## + age 1 6.61 10045.7 1312.2
##
## Step: AIC=1296.88
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + black 1 163.353 9507.4 1292.0
## + zn 1 92.410 9578.3 1295.0
## + crim 1 86.734 9584.0 1295.2
## + rad 1 72.269 9598.5 1295.8
## + indus 1 57.009 9613.7 1296.5
## <none> 9670.8 1296.9
## + tax 1 4.108 9666.6 1298.7
## + age 1 2.960 9667.8 1298.8
##
## Step: AIC=1292
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
##
## Df Sum of Sq RSS AIC
## + rad 1 128.106 9379.3 1288.5
## + zn 1 103.795 9403.6 1289.6
## + crim 1 54.458 9452.9 1291.7
## <none> 9507.4 1292.0
## + indus 1 46.188 9461.2 1292.0
## + tax 1 0.120 9507.3 1294.0
## + age 1 0.052 9507.3 1294.0
##
## Step: AIC=1288.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad
##
## Df Sum of Sq RSS AIC
## + tax 1 241.894 9137.4 1280.0
## + crim 1 199.207 9180.1 1281.8
## + indus 1 72.169 9307.1 1287.4
## + zn 1 66.868 9312.4 1287.6
## <none> 9379.3 1288.5
## + age 1 2.863 9376.4 1290.4
##
## Step: AIC=1279.96
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax
##
## Df Sum of Sq RSS AIC
## + crim 1 200.083 8937.3 1273.0
## + zn 1 135.443 9002.0 1275.9
## <none> 9137.4 1280.0
## + indus 1 8.588 9128.8 1281.6
## + age 1 2.736 9134.7 1281.8
##
## Step: AIC=1273.02
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax + crim
##
## Df Sum of Sq RSS AIC
## + zn 1 166.375 8770.9 1267.4
## <none> 8937.3 1273.0
## + indus 1 11.992 8925.3 1274.5
## + age 1 1.174 8936.1 1275.0
##
## Step: AIC=1267.43
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax + crim + zn
##
## Df Sum of Sq RSS AIC
## <none> 8770.9 1267.4
## + age 1 8.0495 8762.9 1269.0
## + indus 1 4.0055 8766.9 1269.2
#Backward selection
model.step.b <- step(fullmodel, direction = "backward")
## Start: AIC=1270.89
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 3.55 8762.9 1269.0
## - age 1 7.59 8766.9 1269.2
## <none> 8759.3 1270.9
## - black 1 163.05 8922.4 1276.3
## - zn 1 164.99 8924.3 1276.4
## - crim 1 229.78 8989.1 1279.3
## - tax 1 251.46 9010.8 1280.3
## - chas 1 252.49 9011.8 1280.4
## - nox 1 363.06 9122.4 1285.3
## - rad 1 508.06 9267.4 1291.7
## - ptratio 1 1030.90 9790.3 1313.8
## - dis 1 1051.60 9811.0 1314.7
## - rm 1 1120.64 9880.0 1317.5
## - lstat 1 2331.00 11090.4 1364.2
##
## Step: AIC=1269.05
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 8.05 8770.9 1267.4
## <none> 8762.9 1269.0
## - black 1 163.94 8926.8 1274.5
## - zn 1 173.25 8936.1 1275.0
## - crim 1 228.32 8991.2 1277.5
## - chas 1 249.31 9012.2 1278.4
## - tax 1 323.39 9086.3 1281.7
## - nox 1 412.93 9175.8 1285.7
## - rad 1 555.41 9318.3 1291.9
## - ptratio 1 1057.52 9820.4 1313.1
## - dis 1 1078.05 9840.9 1313.9
## - rm 1 1149.65 9912.5 1316.9
## - lstat 1 2358.89 11121.8 1363.4
##
## Step: AIC=1267.43
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 8770.9 1267.4
## - zn 1 166.37 8937.3 1273.0
## - black 1 171.77 8942.7 1273.3
## - crim 1 231.02 9002.0 1275.9
## - chas 1 253.32 9024.3 1276.9
## - tax 1 321.00 9091.9 1280.0
## - nox 1 412.34 9183.3 1284.0
## - rad 1 548.85 9319.8 1290.0
## - ptratio 1 1050.67 9821.6 1311.1
## - dis 1 1248.65 10019.6 1319.2
## - rm 1 1255.75 10026.7 1319.5
## - lstat 1 2684.39 11455.3 1373.3
#stepwise selection
model.step <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "both")
## Start: AIC=1805.3
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 19540.4 15527 1478.2
## + rm 1 17275.3 17793 1533.2
## + ptratio 1 8828.8 26239 1690.1
## + indus 1 8166.1 26902 1700.2
## + tax 1 7382.3 27686 1711.8
## + nox 1 5805.6 29262 1734.2
## + crim 1 5174.7 29893 1742.8
## + rad 1 4793.6 30274 1747.9
## + age 1 4701.5 30366 1749.2
## + zn 1 4147.3 30921 1756.5
## + black 1 3502.4 31565 1764.8
## + dis 1 1761.5 33306 1786.5
## + chas 1 1032.5 34035 1795.2
## <none> 35068 1805.3
##
## Step: AIC=1478.18
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 3076.1 12451 1391.0
## + ptratio 1 2079.8 13448 1422.1
## + dis 1 808.7 14719 1458.6
## + chas 1 731.0 14796 1460.7
## + age 1 480.0 15047 1467.5
## + tax 1 200.4 15327 1474.9
## + black 1 138.9 15389 1476.5
## + indus 1 90.5 15437 1477.8
## + crim 1 89.0 15438 1477.8
## <none> 15527 1478.2
## + zn 1 57.5 15470 1478.7
## + nox 1 24.7 15503 1479.5
## + rad 1 2.9 15525 1480.1
## - lstat 1 19540.4 35068 1805.3
##
## Step: AIC=1390.98
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1230.2 11221 1351.0
## + chas 1 549.7 11902 1374.7
## + dis 1 440.0 12011 1378.5
## + black 1 303.1 12148 1383.0
## + tax 1 238.1 12213 1385.2
## + crim 1 205.6 12246 1386.3
## + age 1 107.9 12344 1389.5
## <none> 12451 1391.0
## + rad 1 53.4 12398 1391.2
## + indus 1 28.2 12423 1392.1
## + zn 1 8.2 12443 1392.7
## + nox 1 0.5 12451 1393.0
## - rm 1 3076.1 15527 1478.2
## - lstat 1 5341.1 17793 1533.2
##
## Step: AIC=1350.95
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 586.4 10635 1331.3
## + chas 1 448.4 10773 1336.5
## + black 1 253.5 10968 1343.7
## + age 1 159.2 11062 1347.2
## + crim 1 79.1 11142 1350.1
## <none> 11221 1351.0
## + zn 1 43.0 11178 1351.4
## + rad 1 31.3 11190 1351.8
## + tax 1 20.3 11201 1352.2
## + indus 1 1.7 11219 1352.9
## + nox 1 0.7 11220 1352.9
## - ptratio 1 1230.2 12451 1391.0
## - rm 1 2226.5 13448 1422.1
## - lstat 1 4303.1 15524 1480.1
##
## Step: AIC=1331.27
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 582.5 10052 1310.5
## + black 1 349.1 10286 1319.8
## + chas 1 317.9 10317 1321.0
## + indus 1 271.0 10364 1322.8
## + tax 1 185.1 10450 1326.2
## + crim 1 182.1 10453 1326.3
## + zn 1 81.2 10554 1330.2
## <none> 10635 1331.3
## + age 1 12.2 10623 1332.8
## + rad 1 4.5 10630 1333.1
## - dis 1 586.4 11221 1351.0
## - ptratio 1 1376.6 12011 1378.5
## - rm 1 1841.1 12476 1393.8
## - lstat 1 4837.1 15472 1480.7
##
## Step: AIC=1310.51
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 381.6 9670.8 1296.9
## + black 1 192.4 9859.9 1304.7
## + crim 1 109.4 9942.9 1308.1
## + zn 1 90.4 9961.9 1308.9
## + rad 1 77.6 9974.7 1309.4
## <none> 10052.3 1310.5
## + indus 1 44.5 10007.9 1310.7
## + tax 1 7.1 10045.2 1312.2
## + age 1 6.6 10045.7 1312.2
## - nox 1 582.5 10634.8 1331.3
## - dis 1 1168.1 11220.5 1352.9
## - ptratio 1 1613.5 11665.8 1368.7
## - rm 1 1726.1 11778.4 1372.5
## - lstat 1 3497.6 13549.9 1429.1
##
## Step: AIC=1296.88
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + black 1 163.4 9507.4 1292.0
## + zn 1 92.4 9578.3 1295.0
## + crim 1 86.7 9584.0 1295.2
## + rad 1 72.3 9598.5 1295.8
## + indus 1 57.0 9613.7 1296.5
## <none> 9670.8 1296.9
## + tax 1 4.1 9666.6 1298.7
## + age 1 3.0 9667.8 1298.8
## - chas 1 381.6 10052.3 1310.5
## - nox 1 646.2 10316.9 1321.0
## - dis 1 1085.5 10756.3 1337.9
## - ptratio 1 1499.1 11169.9 1353.1
## - rm 1 1672.6 11343.4 1359.3
## - lstat 1 3344.7 13015.4 1414.9
##
## Step: AIC=1292
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
##
## Df Sum of Sq RSS AIC
## + rad 1 128.11 9379.3 1288.5
## + zn 1 103.79 9403.6 1289.6
## + crim 1 54.46 9452.9 1291.7
## <none> 9507.4 1292.0
## + indus 1 46.19 9461.2 1292.0
## + tax 1 0.12 9507.3 1294.0
## + age 1 0.05 9507.3 1294.0
## - black 1 163.35 9670.8 1296.9
## - chas 1 352.53 9859.9 1304.7
## - nox 1 486.55 9993.9 1310.2
## - dis 1 1027.59 10535.0 1331.5
## - ptratio 1 1434.48 10941.9 1346.8
## - rm 1 1764.85 11272.2 1358.8
## - lstat 1 2987.69 12495.1 1400.4
##
## Step: AIC=1288.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad
##
## Df Sum of Sq RSS AIC
## + tax 1 241.89 9137.4 1280.0
## + crim 1 199.21 9180.1 1281.8
## + indus 1 72.17 9307.1 1287.4
## + zn 1 66.87 9312.4 1287.6
## <none> 9379.3 1288.5
## + age 1 2.86 9376.4 1290.4
## - rad 1 128.11 9507.4 1292.0
## - black 1 219.19 9598.5 1295.8
## - chas 1 340.52 9719.8 1300.9
## - nox 1 610.59 9989.9 1312.0
## - dis 1 1056.50 10435.8 1329.6
## - ptratio 1 1528.99 10908.3 1347.5
## - rm 1 1575.83 10955.1 1349.3
## - lstat 1 3094.90 12474.2 1401.7
##
## Step: AIC=1279.96
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax
##
## Df Sum of Sq RSS AIC
## + crim 1 200.08 8937.3 1273.0
## + zn 1 135.44 9002.0 1275.9
## <none> 9137.4 1280.0
## + indus 1 8.59 9128.8 1281.6
## + age 1 2.74 9134.7 1281.8
## - black 1 203.97 9341.4 1286.9
## - tax 1 241.89 9379.3 1288.5
## - chas 1 294.50 9431.9 1290.8
## - rad 1 369.88 9507.3 1294.0
## - nox 1 422.19 9559.6 1296.2
## - dis 1 1027.43 10164.8 1321.0
## - rm 1 1398.79 10536.2 1335.5
## - ptratio 1 1434.42 10571.8 1336.9
## - lstat 1 3089.14 12226.5 1395.6
##
## Step: AIC=1273.02
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax + crim
##
## Df Sum of Sq RSS AIC
## + zn 1 166.37 8770.9 1267.4
## <none> 8937.3 1273.0
## + indus 1 11.99 8925.3 1274.5
## + age 1 1.17 8936.1 1275.0
## - black 1 174.61 9111.9 1278.8
## - crim 1 200.08 9137.4 1280.0
## - tax 1 242.77 9180.1 1281.8
## - chas 1 259.36 9196.7 1282.6
## - nox 1 473.18 9410.5 1291.9
## - rad 1 504.45 9441.8 1293.2
## - dis 1 1100.15 10037.5 1317.9
## - rm 1 1412.90 10350.2 1330.3
## - ptratio 1 1496.37 10433.7 1333.6
## - lstat 1 2717.20 11654.5 1378.3
##
## Step: AIC=1267.43
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## tax + crim + zn
##
## Df Sum of Sq RSS AIC
## <none> 8770.9 1267.4
## + age 1 8.05 8762.9 1269.0
## + indus 1 4.01 8766.9 1269.2
## - zn 1 166.37 8937.3 1273.0
## - black 1 171.77 8942.7 1273.3
## - crim 1 231.02 9002.0 1275.9
## - chas 1 253.32 9024.3 1276.9
## - tax 1 321.00 9091.9 1280.0
## - nox 1 412.34 9183.3 1284.0
## - rad 1 548.85 9319.8 1290.0
## - ptratio 1 1050.67 9821.6 1311.1
## - dis 1 1248.65 10019.6 1319.2
## - rm 1 1255.75 10026.7 1319.5
## - lstat 1 2684.39 11455.3 1373.3
AIC(model.step)
## [1] 2415.927
BIC(model.step)
## [1] 2467.946
summary(model.step)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + rad + tax + crim + zn, data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1161 -2.5762 -0.6103 1.5993 24.8802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.372746 5.822013 6.934 1.69e-11 ***
## lstat -0.584915 0.053401 -10.953 < 2e-16 ***
## rm 3.478815 0.464366 7.492 4.56e-13 ***
## ptratio -0.961374 0.140294 -6.853 2.82e-11 ***
## dis -1.529129 0.204693 -7.470 5.25e-13 ***
## nox -17.097070 3.982669 -4.293 2.23e-05 ***
## chas 3.200015 0.951031 3.365 0.000842 ***
## black 0.008655 0.003124 2.771 0.005859 **
## rad 0.363258 0.073345 4.953 1.09e-06 ***
## tax -0.014658 0.003870 -3.788 0.000176 ***
## crim -0.121738 0.037887 -3.213 0.001421 **
## zn 0.041536 0.015232 2.727 0.006681 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.73 on 392 degrees of freedom
## Multiple R-squared: 0.7499, Adjusted R-squared: 0.7429
## F-statistic: 106.8 on 11 and 392 DF, p-value: < 2.2e-16
From Best subset regression and stepwise selection (forward, backward, both), we see that all variables except indus and age are significant.
#Model Diagnostics for model 2
par(mfrow = c(2,2))
plot(model.step)
par(mfrow = c(1,1))
Residuals vs Fitted plot shows that the relationship between medv and predictors is not completely linear. Also, normal qq plot is skewed implying that residuals are not normally distributed. A different functional from may be required.
Models are compared based on adjusted r square, AIC, BIC criteria for in-sample performance and mean square prediction error (MSPE) for out-of-sample performance
#In-sample performance
#MSE
model.sum <- summary(model1)
(model.sum$sigma) ^ 2
## [1] 22.45987
model2.sum <- summary(model2)
(model2.sum$sigma) ^ 2
## [1] 22.37486
#R-squared
model1.sum$r.squared
## [1] 0.750217
model2.sum$r.squared
## [1] 0.7498863
#Adjusted r square
model1.sum$adj.r.squared
## [1] 0.7418909
model2.sum$adj.r.squared
## [1] 0.7428679
#AIC
AIC(model1)
## [1] 2419.393
AIC(model2)
## [1] 2415.927
#BIC
BIC(model1)
## [1] 2479.414
BIC(model2)
## [1] 2467.946
#Out-of-sample Prediction or test error (MSPE)
model1.pred.test <- predict(model1, newdata = Boston.test)
model1.mspe <- mean((model1.pred.test - Boston.test$medv) ^ 2)
model1.mspe
## [1] 23.91468
model2.pred.test <- predict(model2, newdata = Boston.test)
model2.mspe <- mean((model2.pred.test - Boston.test$medv) ^ 2)
model2.mspe
## [1] 23.58354
#Cross Validation
model1.glm = glm(medv ~ ., data = Boston)
cv.glm(data = Boston, glmfit = model1.glm, K = 5)$delta[2]
## [1] 24.16365
model2.glm <- glm(medv ~ . -indus -age, data = Boston)
cv.glm(data = Boston, glmfit = model2.glm, K = 5)$delta[2]
## [1] 23.82476
Based on AIC criteria and adjusted R square values, model 2 is slightly better than model 1. In-sample MSE is nearly the same for both models.
We need to check out-of-sample MSPE for both models. Based on out-of-sample prediction error, model 2 is slightly better than model 1. MSPE of model 1 is 20.7113 while that of model 2 is 20.6784. Based on cross validation also, model 2 performs better.
Following regression trees were fitted to the training data. * Using default value, cp = 0.01 and no additional constraints. This resulted in a tree with 8 terminal nodes. * Making cp = 0.001 and allowing the tree to grow large This results in a tree with 27 terminal nodes. A plot of cp values vs error rates (fig 7) shows that a cp value of 0.0072 would reduce the complexity of the model. * Finally, with cp = 0.0072. This results in a tree with 10 terminal nodes.
#default value of cp = 0.01
Boston.tree <- rpart(medv ~ ., data = Boston.train)
Boston.tree
## n= 404
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 404 35067.83000 22.82376
## 2) lstat>=9.755 227 5123.02300 17.18282
## 4) lstat>=16.085 115 1935.10100 14.26522
## 8) crim>=5.76921 57 627.49720 11.79298 *
## 9) crim< 5.76921 58 616.84840 16.69483 *
## 5) lstat< 16.085 112 1203.84900 20.17857 *
## 3) lstat< 9.755 177 13457.97000 30.05819
## 6) rm< 7.141 135 5507.54400 26.81852
## 12) tax< 548 128 2713.95900 25.92188
## 24) rm< 6.5445 70 525.02070 23.13571 *
## 25) rm>=6.5445 58 989.73600 29.28448 *
## 13) tax>=548 7 808.92860 43.21429 *
## 7) rm>=7.141 42 1979.24600 40.47143
## 14) rm< 7.445 17 59.46118 34.75882 *
## 15) rm>=7.445 25 987.76160 44.35600 *
#Plotting the tree
rpart.plot(Boston.tree, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)
plotcp(Boston.tree)
printcp(Boston.tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train)
##
## Variables actually used in tree construction:
## [1] crim lstat rm tax
##
## Root node error: 35068/404 = 86.802
##
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.470141 0 1.00000 1.00205 0.088668
## 2 0.170275 1 0.52986 0.53927 0.045741
## 3 0.056595 2 0.35958 0.40188 0.040452
## 4 0.056578 3 0.30299 0.40309 0.044836
## 5 0.034197 4 0.24641 0.36308 0.044066
## 6 0.026578 5 0.21221 0.31303 0.040008
## 7 0.019698 6 0.18564 0.31248 0.040015
## 8 0.010000 7 0.16594 0.27943 0.039762
#Building a large tree
Boston.largetree <- rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)
plot(Boston.largetree)
text(Boston.largetree)
printcp(Boston.largetree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] black crim dis indus lstat nox ptratio rad
## [9] rm tax
##
## Root node error: 35068/404 = 86.802
##
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.4701414 0 1.00000 1.00199 0.088924
## 2 0.1702752 1 0.52986 0.53930 0.046107
## 3 0.0565948 2 0.35958 0.40589 0.042325
## 4 0.0565782 3 0.30299 0.38842 0.043911
## 5 0.0341966 4 0.24641 0.33213 0.040310
## 6 0.0265777 5 0.21221 0.29703 0.038979
## 7 0.0196977 6 0.18564 0.28913 0.038851
## 8 0.0088490 7 0.16594 0.26006 0.037794
## 9 0.0082962 8 0.15709 0.24019 0.029925
## 10 0.0061924 9 0.14879 0.23732 0.030105
## 11 0.0055046 10 0.14260 0.23016 0.030094
## 12 0.0041547 11 0.13710 0.23440 0.030639
## 13 0.0039535 12 0.13294 0.23592 0.030634
## 14 0.0038486 13 0.12899 0.23463 0.030629
## 15 0.0037686 14 0.12514 0.23198 0.030501
## 16 0.0031753 15 0.12137 0.22617 0.028763
## 17 0.0019657 16 0.11820 0.21893 0.028622
## 18 0.0019599 17 0.11623 0.21474 0.028587
## 19 0.0016566 18 0.11427 0.21478 0.028584
## 20 0.0015947 20 0.11096 0.21543 0.028610
## 21 0.0013438 21 0.10936 0.21592 0.028584
## 22 0.0012121 22 0.10802 0.21850 0.028677
## 23 0.0011492 23 0.10681 0.21878 0.028674
## 24 0.0010170 24 0.10566 0.21854 0.028671
## 25 0.0010045 25 0.10464 0.21898 0.028686
## 26 0.0010000 26 0.10364 0.21929 0.028683
plotcp(Boston.largetree)
#however, from plotcp, we observe that a tree with more than 7 to 9 splits is not very helpful.
#further pruning the tree to limit to 9 splits;corresponding cp value from plot is 0.0072
pruned.tree <- prune(Boston.largetree, cp = 0.0072)
pruned.tree
## n= 404
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 404 35067.83000 22.82376
## 2) lstat>=9.755 227 5123.02300 17.18282
## 4) lstat>=16.085 115 1935.10100 14.26522
## 8) crim>=5.76921 57 627.49720 11.79298 *
## 9) crim< 5.76921 58 616.84840 16.69483 *
## 5) lstat< 16.085 112 1203.84900 20.17857 *
## 3) lstat< 9.755 177 13457.97000 30.05819
## 6) rm< 7.141 135 5507.54400 26.81852
## 12) tax< 548 128 2713.95900 25.92188
## 24) rm< 6.5445 70 525.02070 23.13571 *
## 25) rm>=6.5445 58 989.73600 29.28448
## 50) lstat>=5.195 31 267.05940 27.12581 *
## 51) lstat< 5.195 27 412.36300 31.76296 *
## 13) tax>=548 7 808.92860 43.21429 *
## 7) rm>=7.141 42 1979.24600 40.47143
## 14) rm< 7.445 17 59.46118 34.75882 *
## 15) rm>=7.445 25 987.76160 44.35600
## 30) ptratio>=17.6 7 465.96860 38.88571 *
## 31) ptratio< 17.6 18 230.86500 46.48333 *
rpart.plot(pruned.tree, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE, extra = 1)
#In-sample MSE
mean((predict(Boston.tree) - Boston.train$medv) ^ 2) #default tree
## [1] 14.40372
mean((predict(Boston.largetree) - Boston.train$medv) ^ 2) #large tree
## [1] 8.995731
mean((predict(pruned.tree) - Boston.train$medv) ^ 2) #pruned tree
## [1] 12.9155
#out-of-sample performance
#Mean squared error loss for this tree
mean((predict(Boston.tree, newdata = Boston.test) - Boston.test$medv) ^ 2) #default tree
## [1] 22.69261
mean((predict(Boston.largetree, newdata = Boston.test) - Boston.test$medv) ^ 2) #large tree
## [1] 20.06999
mean((predict(pruned.tree, newdata = Boston.test) - Boston.test$medv) ^ 2) #pruned tree
## [1] 21.45344
From plot, we can observe that cross validation error does not always go down when tree becomes more complex. We can see that a tree with more than 7 to 9 splits is not very helpful.
The large tree results in lowest in-sample and out-of-sample prediction error compared to other two trees. But there is a large difference between in-sample and out-of-sample performance for this tree. In contrast, trees with default cp value and pruned tree have lower difference between in-sample and out-of-sample prediction errors. Further, they are easier to read and interpret and do not sacrifice much in terms of out-of-sample prediction error. Since pruned has lower prediction errors compared to tree with default cp value, it is the chosen tree.
Residual diagnostics of linear regression model showed that the relation between medv and predictor variables may not be linear. Since the correct transformation of predictor variables is not known, GAM can be used to model non-linearity. GAM is fit using smoothing splines, s(), which is available in gam library in R. In the model, smoothing spline is used for all continuous variables except ‘chas’ and ‘rad’, which are of integer type and which have less than 10 unique values. It is not recommended to use smoothing splines on such variables.
#model 1 - not using s() on chas and rad, leaving them as integers
Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) +
s(tax) + s(ptratio) + s(black) + s(lstat) + chas + rad, data = Boston.train)
summary(Boston.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) +
## s(dis) + s(tax) + s(ptratio) + s(black) + s(lstat) + chas +
## rad
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8584 1.1963 16.599 <2e-16 ***
## chas 0.9520 0.6939 1.372 0.171
## rad 0.3145 0.1286 2.445 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(crim) 5.085 6.083 12.204 1.26e-12 ***
## s(zn) 1.000 1.000 1.750 0.18675
## s(indus) 4.716 5.671 3.046 0.00502 **
## s(nox) 8.817 8.978 13.766 < 2e-16 ***
## s(rm) 8.288 8.854 18.778 < 2e-16 ***
## s(age) 1.000 1.000 0.268 0.60507
## s(dis) 8.779 8.981 8.441 1.43e-11 ***
## s(tax) 5.933 6.802 8.377 6.67e-09 ***
## s(ptratio) 1.894 2.368 16.809 2.36e-08 ***
## s(black) 1.000 1.000 0.532 0.46652
## s(lstat) 7.458 8.418 20.525 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 100/102
## R-sq.(adj) = 0.891 Deviance explained = 90.6%
## GCV = 11.084 Scale est. = 9.5212 n = 404
Variables which have edf close to 1 have linear relationship with medv. Summary of the model shows that variables ‘zn’, ‘age’ and ‘black’ have edf of 1. This along with their p-values, indicates that they have a linear relationship with medv. Rest of the variables have a non-linear relationship with medv.
#model 2 - removing s() from functions which are linear
Boston.gam <- gam(medv ~ s(crim) + zn + s(indus) + s(nox) + s(rm) + age + s(dis) +
s(tax) + s(ptratio) + black + s(lstat) + chas + rad, data = Boston.train)
summary(Boston.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + zn + s(indus) + s(nox) + s(rm) + age + s(dis) +
## s(tax) + s(ptratio) + black + s(lstat) + chas + rad
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.282679 1.776422 10.855 <2e-16 ***
## zn 0.021160 0.015822 1.337 0.1820
## age -0.006635 0.012089 -0.549 0.5834
## black 0.001743 0.002338 0.746 0.4563
## chas 0.947958 0.694317 1.365 0.1730
## rad 0.333164 0.141282 2.358 0.0189 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(crim) 5.777 6.885 10.699 3.50e-12 ***
## s(indus) 4.643 5.572 3.046 0.00499 **
## s(nox) 8.816 8.978 13.678 < 2e-16 ***
## s(rm) 8.287 8.854 18.709 < 2e-16 ***
## s(dis) 8.778 8.980 8.375 1.82e-11 ***
## s(tax) 6.362 7.387 7.426 1.01e-08 ***
## s(ptratio) 1.892 2.365 16.027 5.27e-08 ***
## s(lstat) 7.453 8.416 20.462 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.89 Deviance explained = 90.6%
## GCV = 11.13 Scale est. = 9.5316 n = 404
Performance metrics of the model is shown below:
#Model AIC, BIC, mean residual deviance
AIC(Boston.gam)
## [1] 2112.764
BIC(Boston.gam)
## [1] 2348.88
Boston.gam$deviance
## [1] 3297.872
The non-linear relationship of variables with medv can be seen in the following plots:
#plot
plot(Boston.gam, shade = TRUE, seWithMean = TRUE, scale = 0)
In-sample and out-of-sample prediction errors are shown below:
#In-sample prediction
(Boston.gam.mse <- mean((predict(Boston.gam) - Boston.train$medv) ^ 2))
## [1] 8.163049
#Out-of-sample prediction - MSPE
(Boston.gam.mspe <- mean((predict(Boston.gam, newdata = Boston.test) - Boston.test$medv) ^ 2))
## [1] 11.96812
Three different methods - linear regression, regression tree and GAM - were used to predict median housing price (medv) in the above problem. In each method, the best possible model was chosen based on its performance using in-sample and out-of-sample data sets. The performance measures used include mean squared error (in-sample) and mean squared prediction error (out-of-sample). Finally, a comprehensive model comparison was done between different methods. Lower the MSPE, better is the model.
It was observed GAM performs the best among the model with lowest out-of-sample MSPE. Moreover, it is able to capture non-linear relations in the data set.