This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms.
The data was originally published by Harrison, D. and Rubinfeld, D.L. “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol.5, 81-102, 1978.
The training sample contains 354 observations, with 13 predictor variables and 1 response variable.
Below are the columns included in the data set, all of which are numeric:
1. crim - per capita crime rate by town
2. zn - proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus - proportion of non-retail business acres per town
4. chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. nox - nitrogen oxides concentration (parts per 10 million)
6. rm - average number of rooms per dwelling
7. age - proportion of owner-occupied units built prior to 1940
8. dis - weighted mean of distances to five Boston employment centres
9. rad - index of accessibility to radial highways
10. tax - full-value property-tax rate per $10,000
11. ptratio - pupil-teacher ratio by town
12. black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. lstat - lower status of the population (percent)
14. medv - median value of owner-occupied homes in $1000s
The following are the first 6 observations of the Boston Dataset.
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
We have dropped the black column from the dataset to avoid building a racially discriminatory model.
boston_df <- subset(boston_df, select = -c(black))
colnames(boston_df)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "lstat" "medv"
After dropping black, the dataset consists of 13 attributes and 506 rows of data.
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Number of rows
## [1] 506
Number of columns
## [1] 13
Checking every observation of every covariate for missing values, we observe that there are no missing values and no duplicate rows.
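A minimal sketch of this check, assuming the data frame is named boston_df as above:
# Count missing values per column and count duplicated rows
colSums(is.na(boston_df))
sum(duplicated(boston_df))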
## crim zn indus chas nox rm age dis rad tax
## 0 0 0 0 0 0 0 0 0 0
## ptratio lstat medv
## 0 0 0
We have analysed box plots to understand the distribution of every variable and to check for notable outliers. The distributions of age and tax look reasonable, with no outliers, which is a good starting point. There are many outliers in three variables in particular: crim, zn and black. The zn and crim distributions remain right-skewed even after standard normalization, and black appears left-skewed. The nox and rm distributions are roughly symmetric, with symmetric boxplots. Rad and crim are extremely positively skewed. Since chas is a dummy variable taking the value 0 or 1, examining its distribution is not meaningful. A sketch of the code that would produce these boxplots follows.
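A one-line sketch, assuming the pre-drop data frame is named boston_full (since black is discussed) and that standard normalization is applied via scale():
# Boxplots of the standardized variables to compare spreads and spot outliers
boxplot(scale(boston_full), las = 2, main = "Standardized Boston variables")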
To further the analysis, we examine the correlation matrix and scatter plots to see which predictor variables have a positive or negative correlation with the response variable (medv). This gives a starting point for deciding which covariates to consider while building the regression model. It also reveals covariates that are strongly correlated with each other, a potential source of collinearity. Exploring these plots and values up front sets the ground for model building and variable selection. In general, the predictors show both positive and negative correlations with the response. Crim, indus, nox, age, rad, tax, ptratio and lstat are all negatively correlated with medv, with lstat having the strongest correlation, while zn, chas, rm, dis and black are positively correlated, with rm having the strongest positive correlation. Rad and tax have a strong positive correlation of 0.91, implying that as accessibility to radial highways increases, the full-value property-tax rate per $10,000 also increases. Indus has a strong positive correlation with nox, indicating that nitrogen oxide concentrations are very high in industrial areas.
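A sketch of the corresponding exploratory code in base R, again on the assumed pre-drop frame boston_full:
# Correlation matrix (rounded for readability) and all pairwise scatter plots
round(cor(boston_full), 2)
pairs(boston_full)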
Full Model
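The full model is fit as in the Call shown below; the object name full_model is ours:
full_model <- lm(medv ~ ., data = boston_full)
summary(full_model)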
##
## Call:
## lm(formula = medv ~ ., data = boston_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Model After Removing the black Column from the Dataset
##
## Call:
## lm(formula = medv ~ ., data = boston_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## crim -0.121389 0.033000 -3.678 0.000261 ***
## zn 0.046963 0.013879 3.384 0.000772 ***
## indus 0.013468 0.062145 0.217 0.828520
## chas 2.839993 0.870007 3.264 0.001173 **
## nox -18.758022 3.851355 -4.870 1.50e-06 ***
## rm 3.658119 0.420246 8.705 < 2e-16 ***
## age 0.003611 0.013329 0.271 0.786595
## dis -1.490754 0.201623 -7.394 6.17e-13 ***
## rad 0.289405 0.066908 4.325 1.84e-05 ***
## tax -0.012682 0.003801 -3.337 0.000912 ***
## ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
## lstat -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
T-test:
From the above model we can see that the p-values of age and indus are greater than 0.05, so under the usual hypothesis-testing criterion they should not be included in the model.
F-test:
From the above model we can also confidently say that the linear regression as a whole is significant in explaining the variation in the response variable: the p-value of the overall F-test is less than 2.2e-16, well below 0.05.
Our main objective in plotting R-squared against subset size from the best-subset search is to maximize R-squared. The highest R-squared obtained is about 0.70, and according to the plot the variables zn, indus and age can be left out of the model.
For adjusted R-squared the objective is likewise maximization. According to that plot we can drop zn, indus, age, rad and tax while keeping adjusted R-squared at its maximum. The plots can be generated as in the sketch below.
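A sketch using the leaps package, assuming the training frame is Boston.train:
library(leaps)
# Best-subset search over all 12 predictors
subset_fit <- regsubsets(medv ~ ., data = Boston.train, nvmax = 12)
subset_summary <- summary(subset_fit)
plot(subset_summary$rsq, type = "b", xlab = "Number of predictors", ylab = "R-squared")
plot(subset_summary$adjr2, type = "b", xlab = "Number of predictors", ylab = "Adjusted R-squared")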
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + lstat
##
## Final Model:
## medv ~ zn + chas + nox + rm + dis + rad + tax + ptratio + lstat
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 341 8309.251 1193.464
## 2 - indus 1 16.73413 342 8325.985 1188.307
## 3 - age 1 17.68703 343 8343.673 1183.189
## 4 - crim 1 128.14733 344 8471.820 1182.715
Since BIC penalizes model complexity more heavily, minimizing it gives us a good and less complex model. According to the BIC path we can drop the variables crim, age and indus from the final model.
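The path above can be reproduced with stepAIC and a log(n) penalty (with k = log(n), the column labelled AIC is actually BIC); a sketch:
library(MASS)
n <- nrow(Boston.train)
# Penalty k = log(n) turns stepAIC into BIC-based selection
model_bic <- stepAIC(lm(medv ~ ., data = Boston.train), direction = "both",
                     k = log(n), trace = FALSE)
model_bic$anova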
##
## Call:
## lm(formula = medv ~ . - indus - age - crim, data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2992 -2.8149 -0.4606 1.9329 24.4232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.830094 5.945090 8.382 1.35e-15 ***
## zn 0.047722 0.016797 2.841 0.004763 **
## chas 3.351016 1.169201 2.866 0.004412 **
## nox -20.559700 4.409203 -4.663 4.47e-06 ***
## rm 2.597822 0.485076 5.355 1.56e-07 ***
## dis -1.798079 0.237867 -7.559 3.72e-13 ***
## rad 0.292811 0.075764 3.865 0.000133 ***
## tax -0.015343 0.004182 -3.669 0.000282 ***
## ptratio -0.792909 0.165321 -4.796 2.41e-06 ***
## lstat -0.641971 0.056374 -11.388 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.963 on 344 degrees of freedom
## Multiple R-squared: 0.696, Adjusted R-squared: 0.6881
## F-statistic: 87.51 on 9 and 344 DF, p-value: < 2.2e-16
Mean Square Error
## [1] 23.93169
Stepwise Selection
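The search below starts from the intercept-only model and adds or drops one term at a time; a sketch of the call, with object names ours:
null_model <- lm(medv ~ 1, data = Boston.train)
# scope gives the largest model the search may grow to
model_aic <- step(null_model,
                  scope = ~ crim + zn + indus + chas + nox + rm + age + dis +
                    rad + tax + ptratio + lstat,
                  direction = "both")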
## Start: AIC=1547.55
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 14844.1 13024 1280.3
## + rm 1 11169.2 16699 1368.2
## + indus 1 6543.4 21325 1454.8
## + ptratio 1 6015.6 21853 1463.5
## + tax 1 6005.8 21863 1463.6
## + nox 1 4799.2 23069 1482.6
## + crim 1 4539.7 23329 1486.6
## + rad 1 4045.7 23823 1494.0
## + age 1 3076.2 24792 1508.1
## + zn 1 2911.5 24957 1510.5
## + dis 1 1381.0 26487 1531.5
## + chas 1 374.8 27494 1544.8
## <none> 27868 1547.5
##
## Step: AIC=1280.26
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 1770.2 11254 1230.5
## + ptratio 1 1092.5 11932 1251.2
## + dis 1 787.6 12237 1260.2
## + chas 1 383.7 12640 1271.7
## + age 1 360.8 12663 1272.3
## + tax 1 141.9 12882 1278.4
## <none> 13024 1280.3
## + indus 1 64.7 12960 1280.5
## + crim 1 25.8 12998 1281.6
## + zn 1 23.2 13001 1281.6
## + nox 1 6.1 13018 1282.1
## + rad 1 0.3 13024 1282.3
## - lstat 1 14844.1 27868 1547.5
##
## Step: AIC=1230.55
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 759.9 10494 1207.8
## + dis 1 480.9 10773 1217.1
## + chas 1 322.0 10932 1222.3
## + tax 1 182.2 11072 1226.8
## + age 1 97.0 11157 1229.5
## + crim 1 69.4 11185 1230.4
## <none> 11254 1230.5
## + indus 1 33.3 11221 1231.5
## + rad 1 26.6 11228 1231.7
## + zn 1 6.4 11248 1232.3
## + nox 1 2.4 11252 1232.5
## - rm 1 1770.2 13024 1280.3
## - lstat 1 5445.1 16699 1368.2
##
## Step: AIC=1207.8
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 555.5 9938.6 1190.5
## + chas 1 231.2 10262.9 1201.9
## + age 1 113.4 10380.7 1206.0
## <none> 10494.1 1207.8
## + rad 1 28.1 10466.0 1208.8
## + tax 1 21.0 10473.1 1209.1
## + zn 1 18.9 10475.2 1209.2
## + crim 1 18.3 10475.8 1209.2
## + nox 1 11.1 10483.0 1209.4
## + indus 1 0.1 10494.0 1209.8
## - ptratio 1 759.9 11254.0 1230.5
## - rm 1 1437.6 11931.7 1251.2
## - lstat 1 4152.6 14646.8 1323.8
##
## Step: AIC=1190.55
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 703.5 9235.1 1166.6
## + indus 1 341.8 9596.7 1180.2
## + tax 1 195.6 9743.0 1185.5
## + chas 1 144.4 9794.2 1187.4
## + zn 1 142.4 9796.1 1187.4
## + crim 1 72.3 9866.2 1190.0
## <none> 9938.6 1190.5
## + age 1 25.2 9913.4 1191.7
## + rad 1 4.6 9934.0 1192.4
## - dis 1 555.5 10494.1 1207.8
## - ptratio 1 834.5 10773.1 1217.1
## - rm 1 1136.0 11074.6 1226.9
## - lstat 1 4615.1 14553.6 1323.6
##
## Step: AIC=1166.56
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 205.43 9029.7 1160.6
## + zn 1 148.49 9086.6 1162.8
## + rad 1 88.23 9146.9 1165.2
## + indus 1 62.30 9172.8 1166.2
## <none> 9235.1 1166.6
## + crim 1 24.22 9210.9 1167.6
## + age 1 8.40 9226.7 1168.2
## + tax 1 4.81 9230.3 1168.4
## - nox 1 703.48 9938.6 1190.5
## - ptratio 1 1045.21 10280.3 1202.5
## - rm 1 1080.66 10315.8 1203.7
## - dis 1 1247.90 10483.0 1209.4
## - lstat 1 3156.68 12391.8 1268.6
##
## Step: AIC=1160.6
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 168.46 8861.2 1155.9
## + rad 1 100.32 8929.4 1158.6
## + indus 1 79.08 8950.6 1159.5
## <none> 9029.7 1160.6
## + crim 1 14.57 9015.1 1162.0
## + age 1 3.87 9025.8 1162.5
## + tax 1 1.39 9028.3 1162.5
## - chas 1 205.43 9235.1 1166.6
## - nox 1 764.54 9794.2 1187.4
## - ptratio 1 946.14 9975.8 1193.9
## - rm 1 1075.58 10105.2 1198.4
## - dis 1 1199.55 10229.2 1202.8
## - lstat 1 3072.19 12101.9 1262.3
##
## Step: AIC=1155.93
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn
##
## Df Sum of Sq RSS AIC
## + indus 1 85.02 8776.2 1154.5
## + rad 1 57.92 8803.3 1155.6
## <none> 8861.2 1155.9
## + crim 1 37.74 8823.5 1156.4
## + tax 1 21.54 8839.7 1157.1
## + age 1 14.42 8846.8 1157.3
## - zn 1 168.46 9029.7 1160.6
## - chas 1 225.40 9086.6 1162.8
## - ptratio 1 692.05 9553.3 1180.5
## - nox 1 774.50 9635.7 1183.6
## - rm 1 937.76 9799.0 1189.5
## - dis 1 1342.06 10203.3 1203.8
## - lstat 1 3136.33 11997.5 1261.2
##
## Step: AIC=1154.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus
##
## Df Sum of Sq RSS AIC
## + rad 1 71.64 8704.5 1153.6
## <none> 8776.2 1154.5
## + crim 1 40.02 8736.2 1154.9
## + age 1 17.16 8759.0 1155.8
## - indus 1 85.02 8861.2 1155.9
## + tax 1 3.89 8772.3 1156.4
## - zn 1 174.40 8950.6 1159.5
## - chas 1 243.94 9020.1 1162.2
## - nox 1 454.25 9230.4 1170.4
## - ptratio 1 507.80 9284.0 1172.4
## - rm 1 828.02 9604.2 1184.4
## - dis 1 1426.49 10202.7 1205.8
## - lstat 1 3008.06 11784.2 1256.8
##
## Step: AIC=1153.62
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus +
## rad
##
## Df Sum of Sq RSS AIC
## + tax 1 242.65 8461.9 1145.6
## + crim 1 134.08 8570.5 1150.1
## <none> 8704.5 1153.6
## - rad 1 71.64 8776.2 1154.5
## + age 1 24.54 8680.0 1154.6
## - indus 1 98.74 8803.3 1155.6
## - zn 1 127.64 8832.2 1156.8
## - chas 1 253.99 8958.5 1161.8
## - nox 1 525.65 9230.2 1172.4
## - ptratio 1 574.82 9279.4 1174.2
## - rm 1 747.98 9452.5 1180.8
## - dis 1 1401.52 10106.1 1204.5
## - lstat 1 3079.53 11784.1 1258.8
##
## Step: AIC=1145.61
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus +
## rad + tax
##
## Df Sum of Sq RSS AIC
## + crim 1 133.31 8328.6 1142.0
## - indus 1 9.93 8471.8 1144.0
## <none> 8461.9 1145.6
## + age 1 21.66 8440.2 1146.7
## - zn 1 192.58 8654.5 1151.6
## - chas 1 209.37 8671.3 1152.3
## - tax 1 242.65 8704.5 1153.6
## - rad 1 310.40 8772.3 1156.4
## - nox 1 449.08 8911.0 1161.9
## - ptratio 1 523.13 8985.0 1164.8
## - rm 1 675.25 9137.1 1170.8
## - dis 1 1378.72 9840.6 1197.0
## - lstat 1 3136.35 11598.2 1255.2
##
## Step: AIC=1141.99
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus +
## rad + tax + crim
##
## Df Sum of Sq RSS AIC
## - indus 1 15.09 8343.7 1140.6
## <none> 8328.6 1142.0
## + age 1 19.33 8309.3 1143.2
## - crim 1 133.31 8461.9 1145.6
## - chas 1 190.45 8519.0 1148.0
## - zn 1 220.66 8549.2 1149.2
## - tax 1 241.88 8570.5 1150.1
## - rad 1 407.35 8735.9 1156.9
## - nox 1 466.03 8794.6 1159.3
## - ptratio 1 538.89 8867.5 1162.2
## - rm 1 663.89 8992.5 1167.1
## - dis 1 1460.51 9789.1 1197.2
## - lstat 1 2669.87 10998.4 1238.4
##
## Step: AIC=1140.63
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + rad + tax +
## crim
##
## Df Sum of Sq RSS AIC
## <none> 8343.7 1140.6
## + age 1 17.69 8326.0 1141.9
## + indus 1 15.09 8328.6 1142.0
## - crim 1 128.15 8471.8 1144.0
## - chas 1 181.86 8525.5 1146.3
## - zn 1 227.86 8571.5 1148.2
## - tax 1 341.46 8685.1 1152.8
## - rad 1 472.96 8816.6 1158.1
## - nox 1 563.81 8907.5 1161.8
## - ptratio 1 588.56 8932.2 1162.8
## - rm 1 699.80 9043.5 1167.1
## - dis 1 1480.64 9824.3 1196.5
## - lstat 1 2739.57 11083.2 1239.1
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## zn + rad + tax + crim, data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5886 -2.8169 -0.5257 1.8192 24.3188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.234002 5.911169 8.498 5.98e-16 ***
## lstat -0.611279 0.057601 -10.612 < 2e-16 ***
## rm 2.585905 0.482123 5.364 1.50e-07 ***
## ptratio -0.808922 0.164453 -4.919 1.35e-06 ***
## dis -1.854224 0.237667 -7.802 7.43e-14 ***
## nox -21.130869 4.389163 -4.814 2.22e-06 ***
## chas 3.183488 1.164305 2.734 0.00658 **
## zn 0.051315 0.016767 3.061 0.00238 **
## rad 0.349322 0.079221 4.409 1.39e-05 ***
## tax -0.015578 0.004158 -3.747 0.00021 ***
## crim -0.104928 0.045716 -2.295 0.02232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.932 on 343 degrees of freedom
## Multiple R-squared: 0.7006, Adjusted R-squared: 0.6919
## F-statistic: 80.26 on 10 and 343 DF, p-value: < 2.2e-16
Mean Square Error
## [1] 23.5697
Fitting the best linear regression model using AIC, we arrive at the above final model with the lowest AIC value. In this model the variables age and indus are dropped. This agrees with the very first model fitted on the entire dataset, where the t-tests indicated that age and indus should be dropped because their p-values exceed 0.05.
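Next we fit the lasso with glmnet. The coefficient magnitudes below suggest the predictors were standardized before fitting; a sketch under that assumption, with object names ours:
library(glmnet)
X <- scale(as.matrix(Boston.train[, setdiff(names(Boston.train), "medv")]))
y <- Boston.train$medv
cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the lasso penalty
cv_lasso$lambda.min
cv_lasso$lambda.1se
coef(cv_lasso, s = "lambda.min")
coef(cv_lasso, s = "lambda.1se")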
Lambda Min
## [1] 0.006627913
Lambda 1se
## [1] 0.362035
Coefficients from Lambda Min
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.4679421
## crim -0.8899073
## zn 1.1898281
## indus -0.4299548
## chas 0.8144242
## nox -2.4138474
## rm 1.7327443
## age 0.3845212
## dis -3.8391074
## rad 2.8179921
## tax -2.2946981
## ptratio -1.7014649
## lstat -4.4432175
Coefficients from Lambda 1se
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.535109167
## crim -0.039528378
## zn 0.006475944
## indus -0.260645724
## chas 0.493677038
## nox -0.806436894
## rm 2.137370834
## age .
## dis -1.407380924
## rad .
## tax -0.070337823
## ptratio -1.401210354
## lstat -4.255274002
Lasso MSE Lambda Min
## [1] 23.47515
Lasso MSE Lambda 1se
## [1] 26.8531
Lasso MSPE Lambda Min
## [1] 22.72916
Lasso MSPE Lambda 1se
## [1] 23.77414
#Final Model
Final_Model <- lm(medv ~.-indus-age-crim, data = Boston.train)
summary(Final_Model)
##
## Call:
## lm(formula = medv ~ . - indus - age - crim, data = Boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2992 -2.8149 -0.4606 1.9329 24.4232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.830094 5.945090 8.382 1.35e-15 ***
## zn 0.047722 0.016797 2.841 0.004763 **
## chas 3.351016 1.169201 2.866 0.004412 **
## nox -20.559700 4.409203 -4.663 4.47e-06 ***
## rm 2.597822 0.485076 5.355 1.56e-07 ***
## dis -1.798079 0.237867 -7.559 3.72e-13 ***
## rad 0.292811 0.075764 3.865 0.000133 ***
## tax -0.015343 0.004182 -3.669 0.000282 ***
## ptratio -0.792909 0.165321 -4.796 2.41e-06 ***
## lstat -0.641971 0.056374 -11.388 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.963 on 344 degrees of freedom
## Multiple R-squared: 0.696, Adjusted R-squared: 0.6881
## F-statistic: 87.51 on 9 and 344 DF, p-value: < 2.2e-16
We have selected the model from BIC as the final model because it has a lower MSE and is less complex.
MSPE on Out-of-Sample Data
#MSPE
pred= predict(Final_Model, newdata = Boston.test)
mspe_pred= mean((pred - Boston.test$medv)^2)
mspe_pred
## [1] 23.42387
#Q4
#Cross Validation
library(boot)
model_2 <- glm(medv~.-indus-age-crim, data = boston_df)
cv.glm(data = boston_df, glmfit = model_2, K = 3)$delta[2]
## [1] 25.5441
We are getting an MSE from cross-validation of 25.5 and an out-of-sample MSPE of 23.42. The two values are close, so we can conclude that the model gives consistent results.
In this case we choose the cp value that minimizes the prediction error RMSE (root mean squared error), which corresponds to the average difference between the observed and predicted values of the model. The best cp value is computed below, along with the tuning plot:
## cp
## 1 0.00632333
The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the tree from the current node exceeds cp, tree building does not continue.
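A sketch of the cp tuning with caret; the 10-fold setting and object names are assumptions:
library(caret)
set.seed(13478476)
cv_tree <- train(medv ~ ., data = Boston.train, method = "rpart",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneLength = 10)
cv_tree$bestTune  # best cp
plot(cv_tree)     # RMSE against cp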
The decision tree plotted above is called a regression tree; regression trees are built for continuous response variables. In this method the set of training examples is broken down into smaller and smaller subsets while an associated decision tree is incrementally developed; at the end of the learning process, a decision tree covering the training set is returned.
When the data are simple and we use only one predictor, the prediction is easy to read off a plot of x against y, but this becomes difficult with multiple predictors: with three or more predictors, drawing such a graph is difficult if not impossible.
In contrast, the regression tree easily accommodates additional predictors.
For example, consider the tree printed below: for the terminal node with tax >= 548, we predict a median value of owner-occupied homes of about $39,457 (39.45714 in $1000s), based on 7 observations.
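A sketch of fitting, plotting and printing the tree with rpart, using the tuned cp value; object names are ours:
library(rpart)
library(rpart.plot)
boston_tree <- rpart(medv ~ ., data = Boston.train, cp = 0.00632333)
rpart.plot(boston_tree)  # the plotted regression tree
print(boston_tree)       # the node/split listing shown below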
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 27868.34000 22.51017
## 2) lstat>=9.725 201 4959.59100 17.48806
## 4) lstat>=14.915 113 2138.37800 14.71681
## 8) nox>=0.657 59 767.89560 12.37966
## 16) lstat>=18.93 37 273.20920 10.64054 *
## 17) lstat< 18.93 22 194.56950 15.30455 *
## 9) nox< 0.657 54 696.09260 17.27037 *
## 5) lstat< 14.915 88 839.03900 21.04659 *
## 3) lstat< 9.725 153 11179.23000 29.10784
## 6) rm< 7.0155 116 4378.80600 25.88276
## 12) tax< 548 109 1961.24700 25.01101
## 24) rm< 6.5445 65 614.46150 22.84615 *
## 25) rm>=6.5445 44 592.13640 28.20909
## 50) lstat>=4.92 27 217.51190 26.42593 *
## 51) lstat< 4.92 17 152.42120 31.04118 *
## 13) tax>=548 7 1044.87700 39.45714 *
## 7) rm>=7.0155 37 1811.23700 39.21892
## 14) rm< 7.437 17 61.43529 34.02941 *
## 15) rm>=7.437 20 902.82200 43.63000 *
The following is the head of predictions.
## 7 8 9 13 14 15
## 21.04659 17.27037 17.27037 17.27037 22.84615 21.04659
## [1] 3.882004
## RMSE Rsquared MAE
## 3.882004 0.855859 2.850962
The RMSE for the regression tree is 3.882004 and the R-squared is 85.6%.
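These metrics can be computed as in this sketch, assuming a held-out frame Boston.test (postResample is from caret):
tree_pred <- predict(boston_tree, newdata = Boston.test)
sqrt(mean((tree_pred - Boston.test$medv)^2))                   # RMSE
caret::postResample(pred = tree_pred, obs = Boston.test$medv)  # RMSE, R-squared, MAE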
## cp
## 1 0.004326469
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 30194.93000 22.60847
## 2) lstat>=9.645 207 4640.87000 17.34396
## 4) lstat>=19.85 52 732.74080 12.08077
## 8) nox>=0.603 38 286.59390 10.45526 *
## 9) nox< 0.603 14 73.20929 16.49286 *
## 5) lstat< 19.85 155 1984.41500 19.10968
## 10) lstat>=14.915 60 643.55400 16.69000 *
## 11) lstat< 14.915 95 767.70360 20.63789 *
## 3) lstat< 9.645 147 11738.37000 30.02177
## 6) rm< 7.4545 127 5678.08700 27.67953
## 12) dis>=1.9704 119 2763.28800 26.58992
## 24) rm< 6.803 84 1001.47700 24.38333
## 48) rm< 6.531 54 419.47700 22.89259 *
## 49) rm>=6.531 30 245.98670 27.06667 *
## 25) rm>=6.803 35 371.22290 31.88571 *
## 13) dis< 1.9704 8 671.92880 43.88750 *
## 7) rm>=7.4545 20 939.28950 44.89500
## 14) ptratio>=16.15 10 662.96000 41.80000 *
## 15) ptratio< 16.15 10 84.74900 47.99000 *
## 4 5 8 10 12 16
## 31.88571 31.88571 16.69000 16.69000 20.63789 22.89259
## [1] 4.639071
## RMSE Rsquared MAE
## 4.6390715 0.7503581 3.3348107
We do not get the same results for the two regression trees.
In both cases the training data are randomly selected, so different training sets yield different prediction errors. The main objective of the regression tree is to minimize this prediction error and so provide the best possible predictions.
The bankruptcy dataset contains the companies’ financial information and their bankruptcy status for particular years. It is a subset of Man Xu’s project containing the financial information of the companies.
## [1] "FYEAR" "DLRSN" "CUSIP" "R1" "R2" "R3" "R4" "R5" "R6"
## [10] "R7" "R8" "R9" "R10"
There are 13 attributes in this dataset. They are:
1. FYEAR: Fiscal Year for which the data has been taken.
2. DLRSN: It is a Bankruptcy/Non-Bankruptcy Flag. Bankruptcy = 1 and Non-Bankruptcy = 0.
3. CUSIP: the Committee on Uniform Securities Identification Procedures number, a unique nine-character identifier assigned to all stocks (and registered bonds) in the U.S. and Canada.
4. R1: Working Capital / Total Asset
5. R2: Retained Earning / Total Asset
6. R3: Earning Before Interest & Tax/Total Asset
7. R4: Market Capital / Total Liability
8. R5: SALE / Total Asset
9. R6: Total Liability / Total Asset
10. R7: Current Asset / Current Liability
11. R8: Net Income / Total Asset
12. R9: LOG(SALE); because the raw sales figures are huge, the log of sales is used.
13. R10: The market capital of a company in a particular fiscal year. This data also exists on the log scale.
As CUSIP is an assigned ID and the fiscal year will not play much of a role in this subset of Man Xu's dataset, we drop these columns.
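Mirroring the earlier subset() call on the Boston data, the drop might look like this sketch, assuming the frame is named bankruptcy_df:
bankruptcy_df <- subset(bankruptcy_df, select = -c(FYEAR, CUSIP))
colnames(bankruptcy_df)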
Now, our dataset is ready, so let’s dive into Exploratory Data Analysis.
The dataset contains 5436 rows and 11 columns. Now let's check the structure of the dataset. As we can see below, none of the variables are factors, and with domain understanding it is clear that none should be converted into factors. DLRSN is the only 0/1 variable, and it will be used as the response for logistic regression.
## 'data.frame': 5436 obs. of 11 variables:
## $ DLRSN: int 0 0 0 1 0 0 0 0 1 0 ...
## $ R1 : num 0.307 0.761 -0.514 -0.466 2.023 ...
## $ R2 : num 0.887 0.592 0.338 0.371 0.215 ...
## $ R3 : num 1.648 0.453 0.299 0.496 0.183 ...
## $ R4 : num -0.1992 -0.3699 -0.0291 -0.3734 6.6954 ...
## $ R5 : num 1.093 0.186 -0.433 -0.267 -1.148 ...
## $ R6 : num -0.3133 0.0396 0.83 0.9778 -1.5059 ...
## $ R7 : num -0.197 0.327 -0.708 -0.611 2.876 ...
## $ R8 : num 1.207 0.428 0.476 0.457 0.287 ...
## $ R9 : num 0.282 1.107 2.179 0.152 -0.986 ...
## $ R10 : num 0.1589 0.7934 2.4846 0.0478 0.7911 ...
Let's check the first six rows of the dataset:
## DLRSN R1 R2 R3 R4 R5 R6
## 1 0 0.3071395 0.8870057 1.6476808 -0.19915760 1.0929640 -0.31328867
## 2 0 0.7607365 0.5924934 0.4530028 -0.36989014 0.1861539 0.03961907
## 3 0 -0.5135961 0.3376148 0.2990145 -0.02907996 -0.4326047 0.82999324
## 4 1 -0.4661293 0.3707467 0.4960667 -0.37342897 -0.2674240 0.97779888
## 5 0 2.0234223 0.2148764 0.1825948 6.69536040 -1.1483381 -1.50588927
## 6 0 0.9074985 0.3868797 0.4778914 -0.34715872 1.4079893 -0.48390230
## R7 R8 R9 R10
## 1 -0.19679332 1.2067628 0.2824709 0.15889625
## 2 0.32749732 0.4284181 1.1069652 0.79344276
## 3 -0.70778613 0.4761533 2.1791755 2.48458450
## 4 -0.61097507 0.4568098 0.1519511 0.04778851
## 5 2.87647687 0.2873749 -0.9864421 0.79107673
## 6 0.07025944 0.5278110 0.5024659 -0.16464802
It is clear that R10 (the market capital) and R9 (the sales data) are on the log scale. Checking for null values in the dataset is very important; if we find any, we have to impute them, with the mean or median or as the domain dictates.
## DLRSN R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
## 0 0 0 0 0 0 0 0 0 0 0
From the above plot and column counts it is clear that there are no null values in the dataset.
## DLRSN R1 R2 R3
## Min. :0.0000 Min. :-4.3828 Min. :-2.2418 Min. :-2.06423
## 1st Qu.:0.0000 1st Qu.:-0.7501 1st Qu.:-1.0805 1st Qu.:-1.07887
## Median :0.0000 Median :-0.2220 Median : 0.1337 Median : 0.06858
## Mean :0.1428 Mean :-0.2352 Mean :-0.2915 Mean :-0.24411
## 3rd Qu.:0.0000 3rd Qu.: 0.4821 3rd Qu.: 0.5103 3rd Qu.: 0.50070
## Max. :1.0000 Max. : 2.0234 Max. : 1.4854 Max. : 2.14246
## R4 R5 R6 R7
## Min. :-0.42712 Min. :-1.3639 Min. :-1.505889 Min. :-1.23340
## 1st Qu.:-0.38752 1st Qu.:-0.8748 1st Qu.:-0.656827 1st Qu.:-0.77996
## Median :-0.30754 Median :-0.3508 Median : 0.003493 Median :-0.43205
## Mean : 0.23903 Mean :-0.1338 Mean : 0.195707 Mean :-0.09612
## 3rd Qu.: 0.02746 3rd Qu.: 0.2878 3rd Qu.: 0.608376 3rd Qu.: 0.17578
## Max. : 6.69536 Max. : 4.0362 Max. : 5.110424 Max. : 2.87648
## R8 R9 R10
## Min. :-2.2082 Min. :-2.76356 Min. :-2.2140
## 1st Qu.:-1.0054 1st Qu.:-0.66606 1st Qu.:-0.6413
## Median : 0.2053 Median : 0.06219 Median : 0.1230
## Mean :-0.2272 Mean : 0.02538 Mean : 0.1806
## 3rd Qu.: 0.5289 3rd Qu.: 0.81215 3rd Qu.: 0.9878
## Max. : 2.0006 Max. : 2.17918 Max. : 2.4846
Checking the summary statistics, nothing significant stands out. Let's move to boxplots to understand each variable more intuitively.
## [1] "Z-transformation has been applied over data!"
Analysis of Boxplot:
* R1: 3 outliers. R1 is the working-capital-to-total-asset ratio, which measures a company's ability to cover short-term financial obligations. These 3 outliers fall below the lower whisker.
* R2,R3, R8, R9, and R10: No Outliers.
* R4: It is clear that, R4 is heavily skewed because of the outliers.
* R5, R6 and R7: They have similar numbers of outliers.
Analysis of Density plots:
* Bi-Modal: R1, R2, R3, R5, R7, R8 variables are bi-modal.
* Negative Skewness: R1, R2
* Positive Skewness: R4, R5, R6, R7, R8
Analysis of Correlation Matrix:
From the correlation matrix, the following points are evident:
* DLRSN, the response variable, is strongly negatively correlated with R10, the market capital.
Strong correlations among the predictor variables themselves:
* R6 negatively correlated with R1 and R7.
* R2 positively correlated with R3, R8 and R9.
* R3 positively correlated with R8 and R9.
* R9 positively correlated with R10.
* R1 positively correlated with R7.
It is evident that the number of companies that went bankrupt is close to 1000.
We have used the seed 13478476, which is the M Number of one of the group members.
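A sketch of the roughly 70/30 split; object names are assumptions:
set.seed(13478476)
train_index <- sample(nrow(bankruptcy_df), size = round(0.7 * nrow(bankruptcy_df)))
bankruptcy.train <- bankruptcy_df[train_index, ]
bankruptcy.test <- bankruptcy_df[-train_index, ]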
## Observations in training set: 3806 and in testing set: 1630
Since the response variable is qualitative, i.e. we classify a company as bankrupt or not, we use logistic regression to build the model.
The team has used generalized linear models with the following link functions:
1. Logit link
2. Probit link
3. Complementary log-log link
To build the models, the team used all the variables R1 to R10 as predictors, in combination with each of the link functions; a sketch of the fits follows.
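A sketch of the three fits, with model names ours:
logit_model <- glm(DLRSN ~ ., family = binomial(link = "logit"), data = bankruptcy.train)
probit_model <- glm(DLRSN ~ ., family = binomial(link = "probit"), data = bankruptcy.train)
cloglog_model <- glm(DLRSN ~ ., family = binomial(link = "cloglog"), data = bankruptcy.train)
# Mean residual deviance, AIC and BIC, e.g. for the logit fit:
logit_model$deviance / logit_model$df.residual
AIC(logit_model)
BIC(logit_model)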
## ********Logit-Link Model********
## Mean Residual Deviance is 0.5439795
## AIC is 2086.402
## BIC is 2155.09
## ********Probit-Link Model********
## Mean Residual Deviance is 0.5452349
## AIC is 2091.167
## BIC is 2159.854
## ********Complementary Log Log Model********
## Mean Residual Deviance is 0.549709
## AIC is 2108.146
## BIC is 2176.833
As we can see, the models with the different links (logit, probit and cloglog) are all comparable. Another important point, discussed before, is that all three are full models.
## ********Step Wise selection, direction=exhaustive********
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
##
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R6 + R7 + R8 + R9 + R10
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 3795 2064.402 2086.402
## 2 - R4 1 0.6810090 3796 2065.083 2085.083
## 3 - R5 1 0.7111673 3797 2065.794 2083.794
## ********Step Wise selection, direction=forward********
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
##
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 3795 2064.402 2086.402
## ********Step Wise selection, direction=backward********
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
##
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R6 + R7 + R8 + R9 + R10
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 3795 2064.402 2086.402
## 2 - R4 1 0.6810090 3796 2065.083 2085.083
## 3 - R5 1 0.7111673 3797 2065.794 2083.794
## ********BIC********
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
##
## Final Model:
## DLRSN ~ R2 + R3 + R6 + R7 + R9 + R10
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 3795 2064.402 2155.090
## 2 - R4 1 0.6810090 3796 2065.083 2147.527
## 3 - R5 1 0.7111673 3797 2065.794 2139.993
## 4 - R1 1 4.1889926 3798 2069.983 2135.938
## 5 - R8 1 5.1585338 3799 2075.142 2132.852
## ********Lasso********
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.60126171
## R1 0.12066150
## R2 0.59469602
## R3 -0.59088678
## R4 -0.08358830
## R5 0.06509002
## R6 0.21883720
## R7 -0.50769397
## R8 -0.20634429
## R9 0.26436658
## R10 -1.45828944
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.48248014
## R1 .
## R2 0.44262186
## R3 -0.46306918
## R4 -0.06474861
## R5 0.09685529
## R6 0.15140263
## R7 -0.33956672
## R8 -0.16384191
## R9 0.15933240
## R10 -1.30846529
## [1] 10.000 2088.005
Looking at the above image and the output of the different variable-selection strategies, the following points are clear:
1. The residual deviance is the same across the strategies.
2. The AIC values of the models are comparable.
3. Judged on complexity, the BIC-based strategy gives the least complex model.
Therefore we have selected the BIC-based model as the final model.
Our Final model is:
## Mean Residual Deviance for the final model is 0.09047091
## AUC for full model is 0.8787171
## [1] 0.6355754
## Predicted
## True 0 1
## 0 1516 1754
## 1 19 517
## Asymmetric Misclassification rate 0.4658434 False Positives 0.5363914 False Negatives 0.03544776
For in-sample testing, the AUC of the model comes out to 87.87%, which exceeds the 70% target, so the model performs well on the in-sample data. The asymmetric misclassification rate comes out to about 46.6%, the expected misclassification cost on the training dataset at the chosen cut-off.
##3 & 4. Out of Sample testing and Cut - off probabilities through grid Search
## AUC for full model is 0.8636841
## [1] 0.6779141
## Predicted
## True 0 1
## 0 495 895
## 1 6 234
## Out-Of-Sample Asymmetric Misclassification rate 0.5527607 False Positives 0.6438849 False Negatives 0.025
For out-of-sample testing the area under the ROC curve comes out to about 86%, meaning the model ranks a randomly chosen bankrupt firm above a randomly chosen healthy one about 86% of the time. The out-of-sample asymmetric misclassification rate comes out to 55.27%, the expected misclassification cost on the test dataset.
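The report does not show the grid-search code; a plausible sketch follows, where the 15:1 false-negative weight and the object names are assumptions (bankruptcy_model is the fitted final model):
pred_prob <- predict(bankruptcy_model, newdata = bankruptcy.test, type = "response")
p_grid <- seq(0.01, 0.99, by = 0.01)
cost <- sapply(p_grid, function(pcut) {
  pred_class <- as.numeric(pred_prob >= pcut)
  fn <- (bankruptcy.test$DLRSN == 1) & (pred_class == 0)  # bankrupt predicted healthy
  fp <- (bankruptcy.test$DLRSN == 0) & (pred_class == 1)  # healthy predicted bankrupt
  mean(15 * fn + fp)  # hypothetical asymmetric weights
})
p_grid[which.min(cost)]  # cut-off probability minimizing the asymmetric cost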
Cross Validation with Cost Function
## [1] 0.6737797
Cross Validation with AUC Function
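The cost function aucCost used below is not defined in this excerpt; a plausible version using ROCR, offered as an assumption:
library(ROCR)
# cv.glm passes (observed responses, fitted probabilities) to the cost function
aucCost <- function(obs, pred_p) {
  unlist(performance(prediction(pred_p, obs), "auc")@y.values)
}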
cv.glm(data = bankruptcy_df, glmfit = bankruptcy_model, cost = aucCost, K = 3)$delta[2]
## [1] 0.8680642
From part 3 we get an asymmetric misclassification rate of 0.55 and an AUC of 0.86; cross-validation gives 0.67 and 0.87 respectively. The cross-validated results are thus close to those obtained in part 3.
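The MSE and MSPE values below appear to come from a classification tree (CART) on the bankruptcy data; a hedged sketch of such a fit, with all names and settings assumed:
library(rpart)
bankruptcy_tree <- rpart(as.factor(DLRSN) ~ ., data = bankruptcy.train, method = "class")
# Squared difference between the predicted class-1 probability and the 0/1 outcome
train_prob <- predict(bankruptcy_tree, newdata = bankruptcy.train)[, 2]
mean((train_prob - bankruptcy.train$DLRSN)^2)  # in-sample MSE
test_prob <- predict(bankruptcy_tree, newdata = bankruptcy.test)[, 2]
mean((test_prob - bankruptcy.test$DLRSN)^2)    # out-of-sample MSPE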
MSE
## [1] 0.4157473
MSPE
## [1] 0.4142877
From homework 3 we got an MSPE of 0.476 for logistic regression, while here we get about 0.41, so CART appears to do a better job of fitting this data than logistic regression.
MSE
## Warning in bankruptcy.train.pred.tree_1 - bankruptcy.train_1$DLRSN: longer
## object length is not a multiple of shorter object length
## [1] 0.4171277
MSPE
## [1] 0.4153906
We observe that the MSPE for the 70:30 train:test split is 0.413 and for the 90:10 split it is 0.415. Irrespective of the split we choose, we get almost the same MSPE, so the sample appears representative of the entire population.