Boston Housing

Executive Summary:

Problem

This report provides an analysis and evaluation of the factors affecting the median value of the owner-occupied homes in the suburbs of Boston. The in-built data set of Boston Housing Data in package MASS is used for this analysis and various factors about the structural quality, neighborhood, accessibility and air pollution such as per capita crime rate by town, proportion of non-retail business acres per town, index of accessibility to radial highways etc. are taken into account for this study. To start with, the random sample with 80:20 dataset has to be created, find a best model using linear regression with AIC and BIC and LASSO variable selection. Using the final model, Report in-sample (MSE) and out-of-sample prediction errors (MSPE). Build a regression tree and compare it with the linear regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.

Approach

Methods of analysis includes the following. Summary statistics of the variables and finding correlation between variables, Exploratory data analysis using visualization. Random sampling of data set into 80/20 training and testing data set. Fitting various models such as generalized linear regression using different variable selections and finding the best model. Calculate 5-fold cross validation, fit a decision tree (CART) on the same data and compare it with the linear regression model.

Major Results

Boston housing data is a data set in package MASS. The data set has 506 rows and 14 columns.

The Boston Housing dataset to understand the distribution of every variable and also to observe if there are any notable outliers. The distribution for age and tax seems pretty reasonable and there are no outliers which is a positive point to start with. I can clearly observe that there are many outliers particularly in 3 variables named crime, zn and black. Zn and crim distribution are skewed right even after standard normalization and black variable seems to have a left skewed distribution. nox and rm variable has a perfectly equal distribution and the boxplot is symmetrical. Rad and crim are extremely negatively skewed.

To further the analysis, created at the scatter plots and see which predictor variables have a positive or a negative correlation with the response variable (medv). This will give a starting point to analyze which covariates are important for consideration while building the regression model. Moreover, it will also help us realize if there are any covariates which are strongly correlated with each other leading to collinearity. Looking at the model building and variable selection, there are both positive and negative correlation with the response variable ‘medv’. Crim, indus, nox, age, rad, tax, ptratio, lstat all of them have a negative correlation with the lstat one having the strongest correlation.

Finally, after comparing the performance of regression tree with linear model, the performance of the regression tree looks good as the Mean Standard Predicted Error is less for regression tree which is 19.355.

Boston house prices dataset

data(Boston); #this data is in MASS package

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Boston %>%
  gather(key, val, -medv) %>%
  ggplot(aes(x = val, y = medv)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE, col = "blue") +
  facet_wrap(~key, scales = "free") +
  theme_gray() +
  ggtitle("Scatter plot of dependent variables vs Median Value (medv)")

## `geom_smooth()` using formula 'y ~ x'

corrplot(cor(Boston), method = "number", type = "upper", diag = FALSE)

set.seed(14006542)
s_index <- sample(nrow(Boston),nrow(Boston)*0.70)
b_train <- Boston[s_index,]
b_test <- Boston[-s_index,]

lm_model1 <- lm(medv~., data=b_train)
summary(lm_model1)

## 
## Call:
## lm(formula = medv ~ ., data = b_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9119  -2.7467  -0.6709   1.4983  23.6015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.875444   6.414557   6.996 1.41e-11 ***
## crim         -0.103527   0.042986  -2.408 0.016556 *  
## zn            0.052664   0.016844   3.127 0.001921 ** 
## indus         0.029534   0.073781   0.400 0.689196    
## chas          2.723410   1.088942   2.501 0.012855 *  
## nox         -20.195312   4.896818  -4.124 4.68e-05 ***
## rm            3.040044   0.497370   6.112 2.69e-09 ***
## age           0.011924   0.016214   0.735 0.462573    
## dis          -1.738834   0.251711  -6.908 2.43e-11 ***
## rad           0.339305   0.082783   4.099 5.20e-05 ***
## tax          -0.012853   0.004559  -2.819 0.005095 ** 
## ptratio      -1.041704   0.171491  -6.074 3.33e-09 ***
## black         0.011167   0.003321   3.362 0.000861 ***
## lstat        -0.629431   0.061219 -10.282  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.929 on 340 degrees of freedom
## Multiple R-squared:  0.7387, Adjusted R-squared:  0.7287 
## F-statistic: 73.94 on 13 and 340 DF,  p-value: < 2.2e-16

model1.sum <- summary(lm_model1)
(model1.sum$sigma) ^ 2

## [1] 24.29432

AIC(lm_model1)

## [1] 2149.67

BIC(lm_model1)

## [1] 2207.709

model1.sum$r.squared

## [1] 0.7386966

model1.sum$adj.r.squared

## [1] 0.7287056

par(mfrow = c(2,2))
plot(lm_model1)

#### Variable Selection

Compare Model Fit Manually

#boston_model_1 <- glm(medv ~ ., family = gaussian, b_train)
#boston_model_2 <- step(german_model_aic, direction = "backward")
#summary(boston_model_back_aic)

boston_model_1 <- lm(medv~., data = b_train)
boston_model_2 <- lm(medv~crim+zn, data = b_train)
summary(boston_model_1)

## 
## Call:
## lm(formula = medv ~ ., data = b_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9119  -2.7467  -0.6709   1.4983  23.6015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.875444   6.414557   6.996 1.41e-11 ***
## crim         -0.103527   0.042986  -2.408 0.016556 *  
## zn            0.052664   0.016844   3.127 0.001921 ** 
## indus         0.029534   0.073781   0.400 0.689196    
## chas          2.723410   1.088942   2.501 0.012855 *  
## nox         -20.195312   4.896818  -4.124 4.68e-05 ***
## rm            3.040044   0.497370   6.112 2.69e-09 ***
## age           0.011924   0.016214   0.735 0.462573    
## dis          -1.738834   0.251711  -6.908 2.43e-11 ***
## rad           0.339305   0.082783   4.099 5.20e-05 ***
## tax          -0.012853   0.004559  -2.819 0.005095 ** 
## ptratio      -1.041704   0.171491  -6.074 3.33e-09 ***
## black         0.011167   0.003321   3.362 0.000861 ***
## lstat        -0.629431   0.061219 -10.282  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.929 on 340 degrees of freedom
## Multiple R-squared:  0.7387, Adjusted R-squared:  0.7287 
## F-statistic: 73.94 on 13 and 340 DF,  p-value: < 2.2e-16

summary(boston_model_2)

## 
## Call:
## lm(formula = medv ~ crim + zn, data = b_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.737  -5.277  -1.795   1.944  30.574 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.91533    0.54469  42.070  < 2e-16 ***
## crim        -0.37795    0.05522  -6.845 3.43e-11 ***
## zn           0.11045    0.01932   5.716 2.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.387 on 351 degrees of freedom
## Multiple R-squared:  0.2189, Adjusted R-squared:  0.2145 
## F-statistic:  49.2 on 2 and 351 DF,  p-value: < 2.2e-16

AIC(boston_model_1); BIC(boston_model_1)

## [1] 2149.67

## [1] 2207.709

AIC(boston_model_2); BIC(boston_model_2)

## [1] 2515.289

## [1] 2530.766

subset.result <- regsubsets(medv~.,data=b_train, nbest=1, nvmax = 13)
summary(subset.result)

## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = b_train, nbest = 1, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 7  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 9  ( 1 )  " "  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 10  ( 1 ) " "  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

plot(subset.result, scale = "bic")

Forward/Backward/Stepwise Regression Using AIC

nullmodel=lm(medv~1, data=b_train)
fullmodel=lm(medv~., data=b_train)

Backward

boston_model_step_b <- step(fullmodel,direction='backward')

## Start:  AIC=1143.06
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1      3.89  8264.0 1141.2
## - age      1     13.14  8273.2 1141.6
## <none>                  8260.1 1143.1
## - crim     1    140.91  8401.0 1147.0
## - chas     1    151.96  8412.0 1147.5
## - tax      1    193.10  8453.2 1149.2
## - zn       1    237.50  8497.6 1151.1
## - black    1    274.65  8534.7 1152.6
## - rad      1    408.13  8668.2 1158.1
## - nox      1    413.22  8673.3 1158.3
## - ptratio  1    896.42  9156.5 1177.5
## - rm       1    907.62  9167.7 1178.0
## - dis      1   1159.36  9419.4 1187.6
## - lstat    1   2568.24 10828.3 1236.9
## 
## Step:  AIC=1141.23
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - age      1     13.36  8277.3 1139.8
## <none>                  8264.0 1141.2
## - crim     1    142.14  8406.1 1145.3
## - chas     1    161.19  8425.2 1146.1
## - tax      1    207.05  8471.0 1148.0
## - zn       1    234.77  8498.7 1149.1
## - black    1    273.05  8537.0 1150.7
## - rad      1    418.68  8682.6 1156.7
## - nox      1    426.14  8690.1 1157.0
## - ptratio  1    900.12  9164.1 1175.8
## - rm       1    905.20  9169.2 1176.0
## - dis      1   1220.20  9484.2 1188.0
## - lstat    1   2567.78 10831.7 1235.0
## 
## Step:  AIC=1139.8
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8277.3 1139.8
## - crim     1    142.34  8419.7 1143.8
## - chas     1    163.92  8441.2 1144.7
## - tax      1    206.59  8483.9 1146.5
## - zn       1    223.93  8501.3 1147.2
## - black    1    282.34  8559.7 1149.7
## - rad      1    413.61  8690.9 1155.1
## - nox      1    418.63  8696.0 1155.3
## - ptratio  1    890.19  9167.5 1174.0
## - rm       1    976.72  9254.0 1177.3
## - dis      1   1385.24  9662.6 1192.6
## - lstat    1   2714.75 10992.1 1238.2

summary(boston_model_step_b)

## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = b_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.0798  -2.7191  -0.7501   1.5986  23.8547 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.349339   6.364990   6.968 1.66e-11 ***
## crim         -0.104020   0.042892  -2.425 0.015820 *  
## zn            0.050578   0.016628   3.042 0.002534 ** 
## chas          2.802968   1.077056   2.602 0.009659 ** 
## nox         -18.621443   4.477465  -4.159 4.05e-05 ***
## rm            3.084974   0.485623   6.353 6.75e-10 ***
## dis          -1.804990   0.238586  -7.565 3.61e-13 ***
## rad           0.327840   0.079305   4.134 4.49e-05 ***
## tax          -0.012076   0.004134  -2.922 0.003714 ** 
## ptratio      -1.023388   0.168744  -6.065 3.50e-09 ***
## black         0.011293   0.003306   3.416 0.000713 ***
## lstat        -0.613561   0.057933 -10.591  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.92 on 342 degrees of freedom
## Multiple R-squared:  0.7382, Adjusted R-squared:  0.7297 
## F-statistic: 87.64 on 11 and 342 DF,  p-value: < 2.2e-16

AIC(boston_model_step_b); BIC(boston_model_step_b)

## [1] 2146.409

## [1] 2196.71

Forward

boston_model_step_f <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')

## Start:  AIC=1592.16
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   17536.4 14075 1307.7
## + rm       1   13803.7 17807 1391.0
## + ptratio  1    8258.4 23353 1487.0
## + tax      1    6771.2 24840 1508.8
## + indus    1    6754.5 24857 1509.1
## + nox      1    5269.2 26342 1529.6
## + rad      1    4755.1 26856 1536.5
## + crim     1    4623.1 26988 1538.2
## + black    1    3924.3 27687 1547.2
## + age      1    3838.0 27773 1548.3
## + zn       1    3625.3 27986 1551.0
## + chas     1    1642.0 29969 1575.3
## + dis      1    1482.7 30128 1577.2
## <none>                 31611 1592.2
## 
## Step:  AIC=1307.72
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1   2359.36 11715 1244.8
## + ptratio  1   1672.50 12402 1264.9
## + dis      1    686.37 13388 1292.0
## + chas     1    510.60 13564 1296.6
## + age      1    274.04 13801 1302.8
## + black    1    199.76 13875 1304.7
## + tax      1    113.34 13961 1306.9
## + zn       1    101.22 13973 1307.2
## <none>                 14075 1307.7
## + crim     1     58.44 14016 1308.2
## + indus    1     39.68 14035 1308.7
## + nox      1      9.09 14066 1309.5
## + rad      1      2.64 14072 1309.7
## 
## Step:  AIC=1244.77
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1   1079.74 10636 1212.5
## + black    1    420.71 11295 1233.8
## + dis      1    401.66 11314 1234.4
## + chas     1    367.43 11348 1235.5
## + tax      1    174.78 11540 1241.5
## + crim     1    128.95 11586 1242.8
## <none>                 11715 1244.8
## + age      1     62.45 11653 1244.9
## + rad      1     62.19 11653 1244.9
## + zn       1     29.73 11686 1245.9
## + indus    1     13.71 11702 1246.4
## + nox      1      2.95 11712 1246.7
## 
## Step:  AIC=1212.54
## medv ~ lstat + rm + ptratio
## 
##         Df Sum of Sq   RSS    AIC
## + dis    1    597.71 10038 1194.1
## + black  1    362.17 10273 1202.3
## + chas   1    300.24 10335 1204.4
## + age    1    139.08 10496 1209.9
## <none>               10636 1212.5
## + rad    1     39.58 10596 1213.2
## + crim   1     31.98 10604 1213.5
## + zn     1     19.71 10616 1213.9
## + indus  1     17.38 10618 1214.0
## + tax    1      1.14 10634 1214.5
## + nox    1      1.11 10634 1214.5
## 
## Step:  AIC=1194.07
## medv ~ lstat + rm + ptratio + dis
## 
##         Df Sum of Sq     RSS    AIC
## + nox    1    600.89  9437.0 1174.2
## + black  1    510.50  9527.3 1177.6
## + chas   1    184.19  9853.7 1189.5
## + zn     1    162.65  9875.2 1190.3
## + indus  1    159.26  9878.6 1190.4
## + tax    1    128.88  9909.0 1191.5
## + crim   1    111.32  9926.5 1192.1
## <none>               10037.9 1194.1
## + age    1     20.28 10017.6 1195.3
## + rad    1      2.70 10035.2 1196.0
## 
## Step:  AIC=1174.21
## medv ~ lstat + rm + ptratio + dis + nox
## 
##         Df Sum of Sq    RSS    AIC
## + black  1   286.540 9150.4 1165.3
## + chas   1   237.409 9199.6 1167.2
## + zn     1   171.311 9265.6 1169.7
## + rad    1   105.874 9331.1 1172.2
## <none>               9437.0 1174.2
## + crim   1    52.370 9384.6 1174.2
## + age    1     6.998 9430.0 1176.0
## + indus  1     3.659 9433.3 1176.1
## + tax    1     0.035 9436.9 1176.2
## 
## Step:  AIC=1165.3
## medv ~ lstat + rm + ptratio + dis + nox + black
## 
##         Df Sum of Sq    RSS    AIC
## + chas   1   219.448 8931.0 1158.7
## + zn     1   200.302 8950.1 1159.5
## + rad    1   197.247 8953.2 1159.6
## <none>               9150.4 1165.3
## + crim   1    14.263 9136.2 1166.8
## + tax    1    11.492 9138.9 1166.8
## + age    1     1.289 9149.1 1167.2
## + indus  1     1.182 9149.2 1167.2
## 
## Step:  AIC=1158.71
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
## 
##         Df Sum of Sq    RSS    AIC
## + zn     1   202.413 8728.6 1152.6
## + rad    1   184.546 8746.4 1153.3
## <none>               8931.0 1158.7
## + tax    1    14.162 8916.8 1160.1
## + crim   1    11.700 8919.3 1160.2
## + indus  1     5.880 8925.1 1160.5
## + age    1     0.559 8930.4 1160.7
## 
## Step:  AIC=1152.59
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
## 
##         Df Sum of Sq    RSS    AIC
## + rad    1   114.382 8614.2 1149.9
## <none>               8728.6 1152.6
## + crim   1    35.347 8693.2 1153.2
## + indus  1     8.888 8719.7 1154.2
## + age    1     8.883 8719.7 1154.2
## + tax    1     0.133 8728.4 1154.6
## 
## Step:  AIC=1149.92
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad
## 
##         Df Sum of Sq    RSS    AIC
## + tax    1   194.508 8419.7 1143.8
## + crim   1   130.263 8483.9 1146.5
## <none>               8614.2 1149.9
## + indus  1    13.968 8600.2 1151.3
## + age    1    13.110 8601.1 1151.4
## 
## Step:  AIC=1143.84
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax
## 
##         Df Sum of Sq    RSS    AIC
## + crim   1   142.343 8277.3 1139.8
## <none>               8419.7 1143.8
## + age    1    13.571 8406.1 1145.3
## + indus  1     5.375 8414.3 1145.6
## 
## Step:  AIC=1139.8
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax + crim
## 
##         Df Sum of Sq    RSS    AIC
## <none>               8277.3 1139.8
## + age    1   13.3638 8264.0 1141.2
## + indus  1    4.1161 8273.2 1141.6

summary(boston_model_step_f)

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black + 
##     chas + zn + rad + tax + crim, data = b_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.0798  -2.7191  -0.7501   1.5986  23.8547 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.349339   6.364990   6.968 1.66e-11 ***
## lstat        -0.613561   0.057933 -10.591  < 2e-16 ***
## rm            3.084974   0.485623   6.353 6.75e-10 ***
## ptratio      -1.023388   0.168744  -6.065 3.50e-09 ***
## dis          -1.804990   0.238586  -7.565 3.61e-13 ***
## nox         -18.621443   4.477465  -4.159 4.05e-05 ***
## black         0.011293   0.003306   3.416 0.000713 ***
## chas          2.802968   1.077056   2.602 0.009659 ** 
## zn            0.050578   0.016628   3.042 0.002534 ** 
## rad           0.327840   0.079305   4.134 4.49e-05 ***
## tax          -0.012076   0.004134  -2.922 0.003714 ** 
## crim         -0.104020   0.042892  -2.425 0.015820 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.92 on 342 degrees of freedom
## Multiple R-squared:  0.7382, Adjusted R-squared:  0.7297 
## F-statistic: 87.64 on 11 and 342 DF,  p-value: < 2.2e-16

AIC(boston_model_step_f); BIC(boston_model_step_f)

## [1] 2146.409

## [1] 2196.71

MSE

b_pred1 <- predict(object = boston_model_step_b, newdata = b_train)
mean((b_pred1-b_train$medv)^2)

## [1] 23.38228

b_pred2 <- predict(object = boston_model_step_f, newdata = b_train)
mean((b_pred2-b_train$medv)^2)

## [1] 23.38228

We see that all variables except indus and age are significant.

Lasso Regression

lasso_fit <- glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1)
#lambda = 0.5
coef(lasso_fit,s=1)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 15.283205304
## crim         .          
## zn           .          
## indus        .          
## chas         0.074725982
## nox          .          
## rm           3.862575281
## age          .          
## dis          .          
## rad          .          
## tax          .          
## ptratio     -0.620249174
## black        0.001973097
## lstat       -0.496888716

Cross Validation in LASSO

library(repr)
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE = sqrt(SSE/nrow(df))
  MSE = SSE/nrow(df)
  MSPE =  mean((predicted - true) ^ 2)
  # Model performance metrics
data.frame(
  RMSE = RMSE,
  Rsquare = R_square,
  MSE = MSE,
  MSPE = MSPE
)
  
}

lambdas <- 10^seq(2, -3, by = -.1)

####  Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

####  Best 
lambda_best <- lasso_reg$lambda.min 
lambda_best

## [1] 0.03162278

lasso_model <- glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, lambda = lambda_best, standardize = TRUE)
summary(lasso_model)

##           Length Class     Mode   
## a0         1     -none-    numeric
## beta      13     dgCMatrix S4     
## df         1     -none-    numeric
## dim        2     -none-    numeric
## lambda     1     -none-    numeric
## dev.ratio  1     -none-    numeric
## nulldev    1     -none-    numeric
## npasses    1     -none-    numeric
## jerr       1     -none-    numeric
## offset     1     -none-    logical
## call       6     -none-    call   
## nobs       1     -none-    numeric

predictions_train <- predict(lasso_model, s = lambda_best, newx =  as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]))
eval_results(Boston$medv, predictions_train, b_train)

##       RMSE   Rsquare      MSE     MSPE
## 1 5.600383 0.7400767 31.36429 21.94261

coef(lasso_model,s=0.5)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  34.303181925
## crim         -0.097436553
## zn            0.041028775
## indus         .          
## chas          2.677113119
## nox         -16.217323677
## rm            3.868389212
## age           .          
## dis          -1.387273559
## rad           0.249561791
## tax          -0.009688829
## ptratio      -0.928919938
## black         0.008997440
## lstat        -0.522714979

coef(lasso_model,s=1)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  34.303181925
## crim         -0.097436553
## zn            0.041028775
## indus         .          
## chas          2.677113119
## nox         -16.217323677
## rm            3.868389212
## age           .          
## dis          -1.387273559
## rad           0.249561791
## tax          -0.009688829
## ptratio      -0.928919938
## black         0.008997440
## lstat        -0.522714979

Using 5-fold cross-validation to search optimal lambda:

#use 5-fold cross validation to pick lambda
cv_lasso_fit <- cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, nfolds = 5)
plot(cv_lasso_fit)

(iii) Test the out-of-sample performance. Using final linear model built from

on the 80% of original data, test with the remaining 20% testing data. (Try predict() function in R.) Report out-of-sample model MSPE etc.

b_predt <- predict(object = lm_model1, newdata = b_test)
mean((b_predt - b_test$medv)^2)

## [1] 19.90497

b_predt1 <- predict(object = boston_model_step_b, newdata = b_test)
model1.mspe <- mean((b_predt1 - b_test$medv) ^ 2)
model1.mspe

## [1] 19.70492

b_predt2 <- predict(object = boston_model_step_f, newdata = b_test)
model2.mspe <- mean((b_predt2 - b_test$medv) ^ 2)
model2.mspe

## [1] 19.70492

predictions_test <- predict(lasso_model, s = lambda_best, newx =  as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]))
eval_results(Boston$medv, predictions_test, b_test)

##       RMSE   Rsquare      MSE     MSPE
## 1 8.546683 0.7400767 73.04578 21.94261

Cross validation

#Cross Validation
model1.glm = glm(medv ~ ., data = Boston)
cv.glm(data = Boston, glmfit = model1.glm, K = 5)$delta[2]

## [1] 24.04002

model2.glm <- glm(medv ~ . -indus -age, data = Boston)
cv.glm(data = Boston, glmfit = model2.glm, K = 5)$delta[2]

## [1] 23.43058

finalmodel <- lm(medv ~ . -indus -age, data = Boston)
summary(finalmodel)

## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

AIC(finalmodel)

## [1] 3023.726

BIC(finalmodel)

## [1] 3078.671

b_predt3 <- predict(object = finalmodel, newdata = b_test)
model3.mspe <- mean((b_predt3 - b_test$medv) ^ 2)
model3.mspe

## [1] 17.54899

Fit a regression tree (CART) on the same data

#default value of cp = 0.01
Boston.tree <- rpart(medv ~ ., data = b_train)
plotcp(Boston.tree)

#Plotting the tree
rpart.plot(Boston.tree, type = 3, fallen.leaves = TRUE)

#Building a large tree
Boston.largetree <- rpart(formula = medv ~ ., data = b_train, cp = 0.001)

printcp(Boston.largetree)

## 
## Regression tree:
## rpart(formula = medv ~ ., data = b_train, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] age     crim    dis     lstat   nox     ptratio rad     rm      tax    
## 
## Root node error: 31611/354 = 89.297
## 
## n= 354 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4574074      0   1.00000 1.00311 0.097625
## 2  0.1702137      1   0.54259 0.66426 0.071590
## 3  0.0699232      2   0.37238 0.46387 0.059291
## 4  0.0556685      3   0.30246 0.42574 0.057210
## 5  0.0330562      4   0.24679 0.36911 0.055261
## 6  0.0157256      5   0.21373 0.32477 0.051965
## 7  0.0118478      6   0.19801 0.29451 0.049421
## 8  0.0111765      7   0.18616 0.27557 0.039533
## 9  0.0093937      8   0.17498 0.27464 0.039514
## 10 0.0059391      9   0.16559 0.25856 0.035935
## 11 0.0048666     10   0.15965 0.25629 0.035140
## 12 0.0043668     11   0.15478 0.25516 0.035271
## 13 0.0033539     12   0.15041 0.25479 0.035132
## 14 0.0033193     13   0.14706 0.25322 0.035232
## 15 0.0026320     14   0.14374 0.25321 0.035194
## 16 0.0026011     15   0.14111 0.25237 0.035160
## 17 0.0024909     16   0.13851 0.25005 0.034750
## 18 0.0024075     17   0.13602 0.24581 0.034759
## 19 0.0021375     18   0.13361 0.24835 0.034794
## 20 0.0020096     19   0.13147 0.24559 0.034784
## 21 0.0016004     20   0.12946 0.24435 0.034719
## 22 0.0014010     21   0.12786 0.24096 0.034716
## 23 0.0013597     22   0.12646 0.23963 0.034697
## 24 0.0012039     23   0.12510 0.23974 0.034695
## 25 0.0011517     24   0.12390 0.24105 0.034699
## 26 0.0011241     25   0.12275 0.24192 0.034721
## 27 0.0010000     26   0.12162 0.24192 0.034721

plotcp(Boston.largetree)

#Plotting the tree
rpart.plot(Boston.largetree, type = 3, fallen.leaves = TRUE)

pruned.tree <- prune(Boston.largetree, cp = 0.006)
pruned.tree

## n= 354 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 354 31611.03000 22.75565  
##    2) lstat>=9.605 211  5113.79300 17.49431  
##      4) crim>=5.84803 66   950.69940 12.69697  
##        8) lstat>=19.73 37   278.27240 10.64865 *
##        9) lstat< 19.73 29   319.12690 15.31034 *
##      5) crim< 5.84803 145  1952.74900 19.67793  
##       10) lstat>=16.085 45   511.50580 16.91778 *
##       11) lstat< 16.085 100   944.14000 20.92000 *
##    3) lstat< 9.605 143 12038.12000 30.51888  
##      6) rm< 7.437 121  5670.63900 27.90331  
##       12) rm< 6.6805 78  2769.49800 25.07179  
##         24) age< 88.8 71   748.84390 23.92254 *
##         25) age>=88.8 7   975.71430 36.72857 *
##       13) rm>=6.6805 43  1141.40300 33.03953  
##         26) dis>=2.84255 35   347.21140 31.62857 *
##         27) dis< 2.84255 8   419.66880 39.21250 *
##      7) rm>=7.437 22   986.84950 44.90455  
##       14) ptratio>=15.4 10   605.07600 40.88000 *
##       15) ptratio< 15.4 12    84.82917 48.25833 *

rpart.plot(pruned.tree, type = 3, fallen.leaves = TRUE, extra = 1)

#In-sample MSE
mean((predict(Boston.tree) - b_train$medv) ^ 2)

## [1] 15.62523

#out-of-sample performance
mean((predict(Boston.tree, newdata = b_test) - b_test$medv) ^ 2)  #default tree

## [1] 19.18994

mean((predict(Boston.largetree) - b_train$medv) ^ 2)  #large tree

## [1] 10.8605

mean((predict(Boston.largetree, newdata = b_test) - b_test$medv) ^ 2)   #large tree

## [1] 14.88378

mean((predict(pruned.tree) - b_train$medv) ^ 2)       #pruned tree

## [1] 14.78641

mean((predict(pruned.tree, newdata = b_test) - b_test$medv) ^ 2)   #large tree

## [1] 19.35511

Comparison with Linear Model

```{r Linear Model}s lm_model <- lm(formula = medv ~ ., data = b_train) b_test_lm_pred <- predict(lm_model, b_test) mean((b_test_lm_pred - b_test$medv)^2)

b_test_dt_pred <- predict(pruned.tree, b_test) mean((b_test_dt_pred - b_test$medv)^2)


```r
b_test_dt_pred1 <- predict(Boston.largetree, b_test)
mean((b_test_dt_pred1 - b_test$medv)^2)

## [1] 14.88378

b_test_dt_pred2 <- predict(pruned.tree, b_test)
mean((b_test_dt_pred2 - b_test$medv)^2)

## [1] 19.35511