This report provides an analysis and evaluation of the factors affecting the median value of the owner-occupied homes in the suburbs of Boston. The in-built data set of Boston Housing Data in package MASS is used for this analysis and various factors about the structural quality, neighborhood, accessibility and air pollution such as per capita crime rate by town, proportion of non-retail business acres per town, index of accessibility to radial highways etc. are taken into account for this study. To start with, the random sample with 80:20 dataset has to be created, find a best model using linear regression with AIC and BIC and LASSO variable selection. Using the final model, Report in-sample (MSE) and out-of-sample prediction errors (MSPE). Build a regression tree and compare it with the linear regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.
Methods of analysis includes the following. Summary statistics of the variables and finding correlation between variables, Exploratory data analysis using visualization. Random sampling of data set into 80/20 training and testing data set. Fitting various models such as generalized linear regression using different variable selections and finding the best model. Calculate 5-fold cross validation, fit a decision tree (CART) on the same data and compare it with the linear regression model.
Boston housing data is a data set in package MASS. The data set has 506 rows and 14 columns.
The Boston Housing dataset to understand the distribution of every variable and also to observe if there are any notable outliers. The distribution for age and tax seems pretty reasonable and there are no outliers which is a positive point to start with. I can clearly observe that there are many outliers particularly in 3 variables named crime, zn and black. Zn and crim distribution are skewed right even after standard normalization and black variable seems to have a left skewed distribution. nox and rm variable has a perfectly equal distribution and the boxplot is symmetrical. Rad and crim are extremely negatively skewed.
To further the analysis, created at the scatter plots and see which predictor variables have a positive or a negative correlation with the response variable (medv). This will give a starting point to analyze which covariates are important for consideration while building the regression model. Moreover, it will also help us realize if there are any covariates which are strongly correlated with each other leading to collinearity. Looking at the model building and variable selection, there are both positive and negative correlation with the response variable ‘medv’. Crim, indus, nox, age, rad, tax, ptratio, lstat all of them have a negative correlation with the lstat one having the strongest correlation.
Finally, after comparing the performance of regression tree with linear model, the performance of the regression tree looks good as the Mean Standard Predicted Error is less for regression tree which is 19.355.
data(Boston); #this data is in MASS package
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Boston %>%
gather(key, val, -medv) %>%
ggplot(aes(x = val, y = medv)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE, col = "blue") +
facet_wrap(~key, scales = "free") +
theme_gray() +
ggtitle("Scatter plot of dependent variables vs Median Value (medv)")
## `geom_smooth()` using formula 'y ~ x'
corrplot(cor(Boston), method = "number", type = "upper", diag = FALSE)
set.seed(14006542)
s_index <- sample(nrow(Boston),nrow(Boston)*0.70)
b_train <- Boston[s_index,]
b_test <- Boston[-s_index,]
lm_model1 <- lm(medv~., data=b_train)
summary(lm_model1)
##
## Call:
## lm(formula = medv ~ ., data = b_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9119 -2.7467 -0.6709 1.4983 23.6015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.875444 6.414557 6.996 1.41e-11 ***
## crim -0.103527 0.042986 -2.408 0.016556 *
## zn 0.052664 0.016844 3.127 0.001921 **
## indus 0.029534 0.073781 0.400 0.689196
## chas 2.723410 1.088942 2.501 0.012855 *
## nox -20.195312 4.896818 -4.124 4.68e-05 ***
## rm 3.040044 0.497370 6.112 2.69e-09 ***
## age 0.011924 0.016214 0.735 0.462573
## dis -1.738834 0.251711 -6.908 2.43e-11 ***
## rad 0.339305 0.082783 4.099 5.20e-05 ***
## tax -0.012853 0.004559 -2.819 0.005095 **
## ptratio -1.041704 0.171491 -6.074 3.33e-09 ***
## black 0.011167 0.003321 3.362 0.000861 ***
## lstat -0.629431 0.061219 -10.282 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.929 on 340 degrees of freedom
## Multiple R-squared: 0.7387, Adjusted R-squared: 0.7287
## F-statistic: 73.94 on 13 and 340 DF, p-value: < 2.2e-16
model1.sum <- summary(lm_model1)
(model1.sum$sigma) ^ 2
## [1] 24.29432
AIC(lm_model1)
## [1] 2149.67
BIC(lm_model1)
## [1] 2207.709
model1.sum$r.squared
## [1] 0.7386966
model1.sum$adj.r.squared
## [1] 0.7287056
par(mfrow = c(2,2))
plot(lm_model1)
#### Variable Selection
#boston_model_1 <- glm(medv ~ ., family = gaussian, b_train)
#boston_model_2 <- step(german_model_aic, direction = "backward")
#summary(boston_model_back_aic)
boston_model_1 <- lm(medv~., data = b_train)
boston_model_2 <- lm(medv~crim+zn, data = b_train)
summary(boston_model_1)
##
## Call:
## lm(formula = medv ~ ., data = b_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9119 -2.7467 -0.6709 1.4983 23.6015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.875444 6.414557 6.996 1.41e-11 ***
## crim -0.103527 0.042986 -2.408 0.016556 *
## zn 0.052664 0.016844 3.127 0.001921 **
## indus 0.029534 0.073781 0.400 0.689196
## chas 2.723410 1.088942 2.501 0.012855 *
## nox -20.195312 4.896818 -4.124 4.68e-05 ***
## rm 3.040044 0.497370 6.112 2.69e-09 ***
## age 0.011924 0.016214 0.735 0.462573
## dis -1.738834 0.251711 -6.908 2.43e-11 ***
## rad 0.339305 0.082783 4.099 5.20e-05 ***
## tax -0.012853 0.004559 -2.819 0.005095 **
## ptratio -1.041704 0.171491 -6.074 3.33e-09 ***
## black 0.011167 0.003321 3.362 0.000861 ***
## lstat -0.629431 0.061219 -10.282 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.929 on 340 degrees of freedom
## Multiple R-squared: 0.7387, Adjusted R-squared: 0.7287
## F-statistic: 73.94 on 13 and 340 DF, p-value: < 2.2e-16
summary(boston_model_2)
##
## Call:
## lm(formula = medv ~ crim + zn, data = b_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.737 -5.277 -1.795 1.944 30.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.91533 0.54469 42.070 < 2e-16 ***
## crim -0.37795 0.05522 -6.845 3.43e-11 ***
## zn 0.11045 0.01932 5.716 2.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.387 on 351 degrees of freedom
## Multiple R-squared: 0.2189, Adjusted R-squared: 0.2145
## F-statistic: 49.2 on 2 and 351 DF, p-value: < 2.2e-16
AIC(boston_model_1); BIC(boston_model_1)
## [1] 2149.67
## [1] 2207.709
AIC(boston_model_2); BIC(boston_model_2)
## [1] 2515.289
## [1] 2530.766
subset.result <- regsubsets(medv~.,data=b_train, nbest=1, nvmax = 13)
summary(subset.result)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = b_train, nbest = 1, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 9 ( 1 ) " " "*" " " " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(subset.result, scale = "bic")
nullmodel=lm(medv~1, data=b_train)
fullmodel=lm(medv~., data=b_train)
Backward
boston_model_step_b <- step(fullmodel,direction='backward')
## Start: AIC=1143.06
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 3.89 8264.0 1141.2
## - age 1 13.14 8273.2 1141.6
## <none> 8260.1 1143.1
## - crim 1 140.91 8401.0 1147.0
## - chas 1 151.96 8412.0 1147.5
## - tax 1 193.10 8453.2 1149.2
## - zn 1 237.50 8497.6 1151.1
## - black 1 274.65 8534.7 1152.6
## - rad 1 408.13 8668.2 1158.1
## - nox 1 413.22 8673.3 1158.3
## - ptratio 1 896.42 9156.5 1177.5
## - rm 1 907.62 9167.7 1178.0
## - dis 1 1159.36 9419.4 1187.6
## - lstat 1 2568.24 10828.3 1236.9
##
## Step: AIC=1141.23
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 13.36 8277.3 1139.8
## <none> 8264.0 1141.2
## - crim 1 142.14 8406.1 1145.3
## - chas 1 161.19 8425.2 1146.1
## - tax 1 207.05 8471.0 1148.0
## - zn 1 234.77 8498.7 1149.1
## - black 1 273.05 8537.0 1150.7
## - rad 1 418.68 8682.6 1156.7
## - nox 1 426.14 8690.1 1157.0
## - ptratio 1 900.12 9164.1 1175.8
## - rm 1 905.20 9169.2 1176.0
## - dis 1 1220.20 9484.2 1188.0
## - lstat 1 2567.78 10831.7 1235.0
##
## Step: AIC=1139.8
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 8277.3 1139.8
## - crim 1 142.34 8419.7 1143.8
## - chas 1 163.92 8441.2 1144.7
## - tax 1 206.59 8483.9 1146.5
## - zn 1 223.93 8501.3 1147.2
## - black 1 282.34 8559.7 1149.7
## - rad 1 413.61 8690.9 1155.1
## - nox 1 418.63 8696.0 1155.3
## - ptratio 1 890.19 9167.5 1174.0
## - rm 1 976.72 9254.0 1177.3
## - dis 1 1385.24 9662.6 1192.6
## - lstat 1 2714.75 10992.1 1238.2
summary(boston_model_step_b)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = b_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.0798 -2.7191 -0.7501 1.5986 23.8547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.349339 6.364990 6.968 1.66e-11 ***
## crim -0.104020 0.042892 -2.425 0.015820 *
## zn 0.050578 0.016628 3.042 0.002534 **
## chas 2.802968 1.077056 2.602 0.009659 **
## nox -18.621443 4.477465 -4.159 4.05e-05 ***
## rm 3.084974 0.485623 6.353 6.75e-10 ***
## dis -1.804990 0.238586 -7.565 3.61e-13 ***
## rad 0.327840 0.079305 4.134 4.49e-05 ***
## tax -0.012076 0.004134 -2.922 0.003714 **
## ptratio -1.023388 0.168744 -6.065 3.50e-09 ***
## black 0.011293 0.003306 3.416 0.000713 ***
## lstat -0.613561 0.057933 -10.591 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.92 on 342 degrees of freedom
## Multiple R-squared: 0.7382, Adjusted R-squared: 0.7297
## F-statistic: 87.64 on 11 and 342 DF, p-value: < 2.2e-16
AIC(boston_model_step_b); BIC(boston_model_step_b)
## [1] 2146.409
## [1] 2196.71
Forward
boston_model_step_f <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
## Start: AIC=1592.16
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 17536.4 14075 1307.7
## + rm 1 13803.7 17807 1391.0
## + ptratio 1 8258.4 23353 1487.0
## + tax 1 6771.2 24840 1508.8
## + indus 1 6754.5 24857 1509.1
## + nox 1 5269.2 26342 1529.6
## + rad 1 4755.1 26856 1536.5
## + crim 1 4623.1 26988 1538.2
## + black 1 3924.3 27687 1547.2
## + age 1 3838.0 27773 1548.3
## + zn 1 3625.3 27986 1551.0
## + chas 1 1642.0 29969 1575.3
## + dis 1 1482.7 30128 1577.2
## <none> 31611 1592.2
##
## Step: AIC=1307.72
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 2359.36 11715 1244.8
## + ptratio 1 1672.50 12402 1264.9
## + dis 1 686.37 13388 1292.0
## + chas 1 510.60 13564 1296.6
## + age 1 274.04 13801 1302.8
## + black 1 199.76 13875 1304.7
## + tax 1 113.34 13961 1306.9
## + zn 1 101.22 13973 1307.2
## <none> 14075 1307.7
## + crim 1 58.44 14016 1308.2
## + indus 1 39.68 14035 1308.7
## + nox 1 9.09 14066 1309.5
## + rad 1 2.64 14072 1309.7
##
## Step: AIC=1244.77
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1079.74 10636 1212.5
## + black 1 420.71 11295 1233.8
## + dis 1 401.66 11314 1234.4
## + chas 1 367.43 11348 1235.5
## + tax 1 174.78 11540 1241.5
## + crim 1 128.95 11586 1242.8
## <none> 11715 1244.8
## + age 1 62.45 11653 1244.9
## + rad 1 62.19 11653 1244.9
## + zn 1 29.73 11686 1245.9
## + indus 1 13.71 11702 1246.4
## + nox 1 2.95 11712 1246.7
##
## Step: AIC=1212.54
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 597.71 10038 1194.1
## + black 1 362.17 10273 1202.3
## + chas 1 300.24 10335 1204.4
## + age 1 139.08 10496 1209.9
## <none> 10636 1212.5
## + rad 1 39.58 10596 1213.2
## + crim 1 31.98 10604 1213.5
## + zn 1 19.71 10616 1213.9
## + indus 1 17.38 10618 1214.0
## + tax 1 1.14 10634 1214.5
## + nox 1 1.11 10634 1214.5
##
## Step: AIC=1194.07
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 600.89 9437.0 1174.2
## + black 1 510.50 9527.3 1177.6
## + chas 1 184.19 9853.7 1189.5
## + zn 1 162.65 9875.2 1190.3
## + indus 1 159.26 9878.6 1190.4
## + tax 1 128.88 9909.0 1191.5
## + crim 1 111.32 9926.5 1192.1
## <none> 10037.9 1194.1
## + age 1 20.28 10017.6 1195.3
## + rad 1 2.70 10035.2 1196.0
##
## Step: AIC=1174.21
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + black 1 286.540 9150.4 1165.3
## + chas 1 237.409 9199.6 1167.2
## + zn 1 171.311 9265.6 1169.7
## + rad 1 105.874 9331.1 1172.2
## <none> 9437.0 1174.2
## + crim 1 52.370 9384.6 1174.2
## + age 1 6.998 9430.0 1176.0
## + indus 1 3.659 9433.3 1176.1
## + tax 1 0.035 9436.9 1176.2
##
## Step: AIC=1165.3
## medv ~ lstat + rm + ptratio + dis + nox + black
##
## Df Sum of Sq RSS AIC
## + chas 1 219.448 8931.0 1158.7
## + zn 1 200.302 8950.1 1159.5
## + rad 1 197.247 8953.2 1159.6
## <none> 9150.4 1165.3
## + crim 1 14.263 9136.2 1166.8
## + tax 1 11.492 9138.9 1166.8
## + age 1 1.289 9149.1 1167.2
## + indus 1 1.182 9149.2 1167.2
##
## Step: AIC=1158.71
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 202.413 8728.6 1152.6
## + rad 1 184.546 8746.4 1153.3
## <none> 8931.0 1158.7
## + tax 1 14.162 8916.8 1160.1
## + crim 1 11.700 8919.3 1160.2
## + indus 1 5.880 8925.1 1160.5
## + age 1 0.559 8930.4 1160.7
##
## Step: AIC=1152.59
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
##
## Df Sum of Sq RSS AIC
## + rad 1 114.382 8614.2 1149.9
## <none> 8728.6 1152.6
## + crim 1 35.347 8693.2 1153.2
## + indus 1 8.888 8719.7 1154.2
## + age 1 8.883 8719.7 1154.2
## + tax 1 0.133 8728.4 1154.6
##
## Step: AIC=1149.92
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad
##
## Df Sum of Sq RSS AIC
## + tax 1 194.508 8419.7 1143.8
## + crim 1 130.263 8483.9 1146.5
## <none> 8614.2 1149.9
## + indus 1 13.968 8600.2 1151.3
## + age 1 13.110 8601.1 1151.4
##
## Step: AIC=1143.84
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax
##
## Df Sum of Sq RSS AIC
## + crim 1 142.343 8277.3 1139.8
## <none> 8419.7 1143.8
## + age 1 13.571 8406.1 1145.3
## + indus 1 5.375 8414.3 1145.6
##
## Step: AIC=1139.8
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax + crim
##
## Df Sum of Sq RSS AIC
## <none> 8277.3 1139.8
## + age 1 13.3638 8264.0 1141.2
## + indus 1 4.1161 8273.2 1141.6
summary(boston_model_step_f)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black +
## chas + zn + rad + tax + crim, data = b_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.0798 -2.7191 -0.7501 1.5986 23.8547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.349339 6.364990 6.968 1.66e-11 ***
## lstat -0.613561 0.057933 -10.591 < 2e-16 ***
## rm 3.084974 0.485623 6.353 6.75e-10 ***
## ptratio -1.023388 0.168744 -6.065 3.50e-09 ***
## dis -1.804990 0.238586 -7.565 3.61e-13 ***
## nox -18.621443 4.477465 -4.159 4.05e-05 ***
## black 0.011293 0.003306 3.416 0.000713 ***
## chas 2.802968 1.077056 2.602 0.009659 **
## zn 0.050578 0.016628 3.042 0.002534 **
## rad 0.327840 0.079305 4.134 4.49e-05 ***
## tax -0.012076 0.004134 -2.922 0.003714 **
## crim -0.104020 0.042892 -2.425 0.015820 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.92 on 342 degrees of freedom
## Multiple R-squared: 0.7382, Adjusted R-squared: 0.7297
## F-statistic: 87.64 on 11 and 342 DF, p-value: < 2.2e-16
AIC(boston_model_step_f); BIC(boston_model_step_f)
## [1] 2146.409
## [1] 2196.71
b_pred1 <- predict(object = boston_model_step_b, newdata = b_train)
mean((b_pred1-b_train$medv)^2)
## [1] 23.38228
b_pred2 <- predict(object = boston_model_step_f, newdata = b_train)
mean((b_pred2-b_train$medv)^2)
## [1] 23.38228
We see that all variables except indus and age are significant.
lasso_fit <- glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1)
#lambda = 0.5
coef(lasso_fit,s=1)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 15.283205304
## crim .
## zn .
## indus .
## chas 0.074725982
## nox .
## rm 3.862575281
## age .
## dis .
## rad .
## tax .
## ptratio -0.620249174
## black 0.001973097
## lstat -0.496888716
Cross Validation in LASSO
library(repr)
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
SST <- sum((true - mean(true))^2)
R_square <- 1 - SSE / SST
RMSE = sqrt(SSE/nrow(df))
MSE = SSE/nrow(df)
MSPE = mean((predicted - true) ^ 2)
# Model performance metrics
data.frame(
RMSE = RMSE,
Rsquare = R_square,
MSE = MSE,
MSPE = MSPE
)
}
lambdas <- 10^seq(2, -3, by = -.1)
#### Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)
#### Best
lambda_best <- lasso_reg$lambda.min
lambda_best
## [1] 0.03162278
lasso_model <- glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, lambda = lambda_best, standardize = TRUE)
summary(lasso_model)
## Length Class Mode
## a0 1 -none- numeric
## beta 13 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 6 -none- call
## nobs 1 -none- numeric
predictions_train <- predict(lasso_model, s = lambda_best, newx = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]))
eval_results(Boston$medv, predictions_train, b_train)
## RMSE Rsquare MSE MSPE
## 1 5.600383 0.7400767 31.36429 21.94261
coef(lasso_model,s=0.5)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 34.303181925
## crim -0.097436553
## zn 0.041028775
## indus .
## chas 2.677113119
## nox -16.217323677
## rm 3.868389212
## age .
## dis -1.387273559
## rad 0.249561791
## tax -0.009688829
## ptratio -0.928919938
## black 0.008997440
## lstat -0.522714979
coef(lasso_model,s=1)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 34.303181925
## crim -0.097436553
## zn 0.041028775
## indus .
## chas 2.677113119
## nox -16.217323677
## rm 3.868389212
## age .
## dis -1.387273559
## rad 0.249561791
## tax -0.009688829
## ptratio -0.928919938
## black 0.008997440
## lstat -0.522714979
Using 5-fold cross-validation to search optimal lambda:
#use 5-fold cross validation to pick lambda
cv_lasso_fit <- cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, nfolds = 5)
plot(cv_lasso_fit)
b_predt <- predict(object = lm_model1, newdata = b_test)
mean((b_predt - b_test$medv)^2)
## [1] 19.90497
b_predt1 <- predict(object = boston_model_step_b, newdata = b_test)
model1.mspe <- mean((b_predt1 - b_test$medv) ^ 2)
model1.mspe
## [1] 19.70492
b_predt2 <- predict(object = boston_model_step_f, newdata = b_test)
model2.mspe <- mean((b_predt2 - b_test$medv) ^ 2)
model2.mspe
## [1] 19.70492
predictions_test <- predict(lasso_model, s = lambda_best, newx = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]))
eval_results(Boston$medv, predictions_test, b_test)
## RMSE Rsquare MSE MSPE
## 1 8.546683 0.7400767 73.04578 21.94261
Cross validation
#Cross Validation
model1.glm = glm(medv ~ ., data = Boston)
cv.glm(data = Boston, glmfit = model1.glm, K = 5)$delta[2]
## [1] 24.04002
model2.glm <- glm(medv ~ . -indus -age, data = Boston)
cv.glm(data = Boston, glmfit = model2.glm, K = 5)$delta[2]
## [1] 23.43058
finalmodel <- lm(medv ~ . -indus -age, data = Boston)
summary(finalmodel)
##
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
AIC(finalmodel)
## [1] 3023.726
BIC(finalmodel)
## [1] 3078.671
b_predt3 <- predict(object = finalmodel, newdata = b_test)
model3.mspe <- mean((b_predt3 - b_test$medv) ^ 2)
model3.mspe
## [1] 17.54899
Fit a regression tree (CART) on the same data
#default value of cp = 0.01
Boston.tree <- rpart(medv ~ ., data = b_train)
plotcp(Boston.tree)
#Plotting the tree
rpart.plot(Boston.tree, type = 3, fallen.leaves = TRUE)
#Building a large tree
Boston.largetree <- rpart(formula = medv ~ ., data = b_train, cp = 0.001)
printcp(Boston.largetree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = b_train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age crim dis lstat nox ptratio rad rm tax
##
## Root node error: 31611/354 = 89.297
##
## n= 354
##
## CP nsplit rel error xerror xstd
## 1 0.4574074 0 1.00000 1.00311 0.097625
## 2 0.1702137 1 0.54259 0.66426 0.071590
## 3 0.0699232 2 0.37238 0.46387 0.059291
## 4 0.0556685 3 0.30246 0.42574 0.057210
## 5 0.0330562 4 0.24679 0.36911 0.055261
## 6 0.0157256 5 0.21373 0.32477 0.051965
## 7 0.0118478 6 0.19801 0.29451 0.049421
## 8 0.0111765 7 0.18616 0.27557 0.039533
## 9 0.0093937 8 0.17498 0.27464 0.039514
## 10 0.0059391 9 0.16559 0.25856 0.035935
## 11 0.0048666 10 0.15965 0.25629 0.035140
## 12 0.0043668 11 0.15478 0.25516 0.035271
## 13 0.0033539 12 0.15041 0.25479 0.035132
## 14 0.0033193 13 0.14706 0.25322 0.035232
## 15 0.0026320 14 0.14374 0.25321 0.035194
## 16 0.0026011 15 0.14111 0.25237 0.035160
## 17 0.0024909 16 0.13851 0.25005 0.034750
## 18 0.0024075 17 0.13602 0.24581 0.034759
## 19 0.0021375 18 0.13361 0.24835 0.034794
## 20 0.0020096 19 0.13147 0.24559 0.034784
## 21 0.0016004 20 0.12946 0.24435 0.034719
## 22 0.0014010 21 0.12786 0.24096 0.034716
## 23 0.0013597 22 0.12646 0.23963 0.034697
## 24 0.0012039 23 0.12510 0.23974 0.034695
## 25 0.0011517 24 0.12390 0.24105 0.034699
## 26 0.0011241 25 0.12275 0.24192 0.034721
## 27 0.0010000 26 0.12162 0.24192 0.034721
plotcp(Boston.largetree)
#Plotting the tree
rpart.plot(Boston.largetree, type = 3, fallen.leaves = TRUE)
pruned.tree <- prune(Boston.largetree, cp = 0.006)
pruned.tree
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 31611.03000 22.75565
## 2) lstat>=9.605 211 5113.79300 17.49431
## 4) crim>=5.84803 66 950.69940 12.69697
## 8) lstat>=19.73 37 278.27240 10.64865 *
## 9) lstat< 19.73 29 319.12690 15.31034 *
## 5) crim< 5.84803 145 1952.74900 19.67793
## 10) lstat>=16.085 45 511.50580 16.91778 *
## 11) lstat< 16.085 100 944.14000 20.92000 *
## 3) lstat< 9.605 143 12038.12000 30.51888
## 6) rm< 7.437 121 5670.63900 27.90331
## 12) rm< 6.6805 78 2769.49800 25.07179
## 24) age< 88.8 71 748.84390 23.92254 *
## 25) age>=88.8 7 975.71430 36.72857 *
## 13) rm>=6.6805 43 1141.40300 33.03953
## 26) dis>=2.84255 35 347.21140 31.62857 *
## 27) dis< 2.84255 8 419.66880 39.21250 *
## 7) rm>=7.437 22 986.84950 44.90455
## 14) ptratio>=15.4 10 605.07600 40.88000 *
## 15) ptratio< 15.4 12 84.82917 48.25833 *
rpart.plot(pruned.tree, type = 3, fallen.leaves = TRUE, extra = 1)
#In-sample MSE
mean((predict(Boston.tree) - b_train$medv) ^ 2)
## [1] 15.62523
#out-of-sample performance
mean((predict(Boston.tree, newdata = b_test) - b_test$medv) ^ 2) #default tree
## [1] 19.18994
mean((predict(Boston.largetree) - b_train$medv) ^ 2) #large tree
## [1] 10.8605
mean((predict(Boston.largetree, newdata = b_test) - b_test$medv) ^ 2) #large tree
## [1] 14.88378
mean((predict(pruned.tree) - b_train$medv) ^ 2) #pruned tree
## [1] 14.78641
mean((predict(pruned.tree, newdata = b_test) - b_test$medv) ^ 2) #large tree
## [1] 19.35511
```{r Linear Model}s lm_model <- lm(formula = medv ~ ., data = b_train) b_test_lm_pred <- predict(lm_model, b_test) mean((b_test_lm_pred - b_test$medv)^2)
b_test_dt_pred <- predict(pruned.tree, b_test) mean((b_test_dt_pred - b_test$medv)^2)
```r
b_test_dt_pred1 <- predict(Boston.largetree, b_test)
mean((b_test_dt_pred1 - b_test$medv)^2)
## [1] 14.88378
b_test_dt_pred2 <- predict(pruned.tree, b_test)
mean((b_test_dt_pred2 - b_test$medv)^2)
## [1] 19.35511