Classification Tree


This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms.

The data was originally published by Harrison, D. and Rubinfeld, D.L., “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

The training sample contains 354 observations, with 13 predictor variables and 1 response variable.
Below are the columns included in the data set, all of which are numeric:
1. crim - per capita crime rate by town
2. zn - proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus - proportion of non-retail business acres per town
4. chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. nox - nitrogen oxides concentration (parts per 10 million)
6. rm - average number of rooms per dwelling
7. age - proportion of owner-occupied units built prior to 1940
8. dis - weighted mean of distances to five Boston employment centres
9. rad - index of accessibility to radial highways
10. tax - full-value property-tax rate per $10,000
11. ptratio - pupil-teacher ratio by town
12. black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. lstat - lower status of the population (percent)
14. medv - median value of owner-occupied homes in $1000s

The following are the first 6 observations of the Boston Dataset.

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Dropping the black Column

We have dropped the black column from the dataset to avoid building the model on a racially derived feature.

# drop the black column and list the remaining columns
boston_df <- subset(boston_df, select = -c(black))
colnames(boston_df)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "lstat"   "medv"

Structure of Boston Dataset

After dropping black, the dataset consists of 13 attributes and 506 rows of data.

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Number of rows

## [1] 506

Number of columns

## [1] 13

Missing Values and Duplicates


Checking for missing values in any observation of any covariate, we find that there are no missing values and no duplicates, as the counts below confirm.
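
A minimal sketch of this check, assuming the data frame is named boston_df as above:

# count missing values per column and count duplicated rows
colSums(is.na(boston_df))
sum(duplicated(boston_df))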

##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0       0 
## ptratio   lstat    medv 
##       0       0       0

Boxplot

We analysed boxplots to understand the distribution of every variable and to spot any notable outliers. The distributions of age and tax look reasonable, with no outliers, which is a positive starting point. There are clearly many outliers in three variables in particular: crim, zn and black (the latter examined before it was dropped). zn and crim remain right-skewed even after standard normalization, while black has a left-skewed distribution. nox and rm have roughly symmetric distributions and their boxplots look balanced. rad also shows a strongly skewed distribution. Since chas is a dummy variable taking only the values 0 and 1, its distribution is not meaningful to examine.
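
A minimal sketch of how such boxplots could be drawn; the scale() call reflects our assumption about the standard normalization mentioned above:

# standardize each column, then draw side-by-side boxplots
boxplot(scale(boston_df), las = 2, main = "Boxplots of standardized variables")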


Scatter plot and correlation matrix

To further the analysis, we look at the correlation matrix and scatter plots to see which predictor variables have a positive or negative correlation with the response variable (medv). This gives a starting point for deciding which covariates to consider while building the regression model. It also reveals covariates that are strongly correlated with each other, which can lead to collinearity. Examining these plots and values during the initial exploration sets the ground for model building and variable selection. Overall, there are both positive and negative correlations with the response variable: crim, indus, nox, age, rad, tax, ptratio and lstat are negatively correlated with medv, with lstat showing the strongest negative correlation, while zn, chas, rm, dis and black are positively correlated, with rm showing the strongest positive correlation. rad and tax have a strong positive correlation of 0.91, implying that as accessibility to radial highways increases, the full-value property-tax rate per $10,000 also increases. indus has a strong positive correlation with nox, indicating that nitrogen oxide concentrations are high in industrial areas.
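
A minimal sketch of how the correlation matrix and scatter plots could be produced:

# correlation matrix, rounded for readability
round(cor(boston_df), 2)
# scatter-plot matrix of every pair of variables
pairs(boston_df)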

Linear Regression

Full Model
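
The summary below comes from regressing medv on all predictors (the Call line confirms the formula; the object name full_model is our own choice):

# full model on the dataset that still includes the black column
full_model <- lm(medv ~ ., data = boston_full)
summary(full_model)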

## 
## Call:
## lm(formula = medv ~ ., data = boston_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Model After Removing the black Column from the Dataset

## 
## Call:
## lm(formula = medv ~ ., data = boston_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## crim         -0.121389   0.033000  -3.678 0.000261 ***
## zn            0.046963   0.013879   3.384 0.000772 ***
## indus         0.013468   0.062145   0.217 0.828520    
## chas          2.839993   0.870007   3.264 0.001173 ** 
## nox         -18.758022   3.851355  -4.870 1.50e-06 ***
## rm            3.658119   0.420246   8.705  < 2e-16 ***
## age           0.003611   0.013329   0.271 0.786595    
## dis          -1.490754   0.201623  -7.394 6.17e-13 ***
## rad           0.289405   0.066908   4.325 1.84e-05 ***
## tax          -0.012682   0.003801  -3.337 0.000912 ***
## ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
## lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16

T-test:
From the model above, the p-values of age and indus are greater than 0.05, so under the hypothesis-testing methodology they should not be included in the model.

F-test:
From the model above we can also confidently say that the linear regression as a whole is significant in explaining the variation in the response variable: the p-value of the overall F-test is below 2.2e-16, which is less than 0.05.

Variable Selection - BIC

Our main objective in the R-squared plot, produced after the best-subset search, is to maximize R-squared. The highest R-squared obtained is about 70%. According to the R-squared plot, the variables zn, indus and age can be left out of the model.

For adjusted R-squared the objective is likewise to maximize its value. According to that plot, we can drop the variables zn, indus, age, rad and tax while keeping adjusted R-squared at its maximum.
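
A hedged sketch of how these subset plots could be produced with the leaps package (the use of regsubsets is an assumption based on the description):

library(leaps)
# best subset of each size, then plot R-squared, adjusted R-squared and BIC
subsets <- regsubsets(medv ~ ., data = Boston.train, nvmax = 12)
subset_summary <- summary(subsets)
plot(subset_summary$rsq, type = "b", ylab = "R-squared")
plot(subset_summary$adjr2, type = "b", ylab = "Adjusted R-squared")
plot(subsets, scale = "bic")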

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat
## 
## Final Model:
## medv ~ zn + chas + nox + rm + dis + rad + tax + ptratio + lstat
## 
## 
##      Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                            341   8309.251 1193.464
## 2 - indus  1  16.73413       342   8325.985 1188.307
## 3   - age  1  17.68703       343   8343.673 1183.189
## 4  - crim  1 128.14733       344   8471.820 1182.715

Since we want to minimize BIC, this gives us the best and least complex model. We can drop the variables crim, age and indus to obtain the final model.
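
The path above can be reproduced with stepwise selection under the BIC penalty k = log(n); a sketch under that assumption (the printed table is the $anova component of the fitted object):

library(MASS)
# stepwise selection with the BIC penalty (k = log(n))
n <- nrow(Boston.train)
model_bic <- stepAIC(lm(medv ~ ., data = Boston.train), k = log(n))
model_bic$anova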

Linear model with BIC

## 
## Call:
## lm(formula = medv ~ . - indus - age - crim, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2992  -2.8149  -0.4606   1.9329  24.4232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.830094   5.945090   8.382 1.35e-15 ***
## zn            0.047722   0.016797   2.841 0.004763 ** 
## chas          3.351016   1.169201   2.866 0.004412 ** 
## nox         -20.559700   4.409203  -4.663 4.47e-06 ***
## rm            2.597822   0.485076   5.355 1.56e-07 ***
## dis          -1.798079   0.237867  -7.559 3.72e-13 ***
## rad           0.292811   0.075764   3.865 0.000133 ***
## tax          -0.015343   0.004182  -3.669 0.000282 ***
## ptratio      -0.792909   0.165321  -4.796 2.41e-06 ***
## lstat        -0.641971   0.056374 -11.388  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.963 on 344 degrees of freedom
## Multiple R-squared:  0.696,  Adjusted R-squared:  0.6881 
## F-statistic: 87.51 on 9 and 344 DF,  p-value: < 2.2e-16

Mean Square Error
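
A sketch of how this in-sample MSE could be computed from the fitted BIC model (refit here so the snippet is self-contained):

# in-sample mean squared error of the BIC model
model_bic <- lm(medv ~ . - indus - age - crim, data = Boston.train)
mean(residuals(model_bic)^2)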

## [1] 23.93169

Variable Selection - AIC

Stepwise Selection
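
A sketch of the search that produces the trace below, starting from the intercept-only model and moving in both directions (object names are assumptions):

# stepwise AIC selection starting from the null model
null_model <- lm(medv ~ 1, data = Boston.train)
full_model <- lm(medv ~ ., data = Boston.train)
model_aic <- step(null_model,
                  scope = list(lower = null_model, upper = full_model),
                  direction = "both")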

## Start:  AIC=1547.55
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   14844.1 13024 1280.3
## + rm       1   11169.2 16699 1368.2
## + indus    1    6543.4 21325 1454.8
## + ptratio  1    6015.6 21853 1463.5
## + tax      1    6005.8 21863 1463.6
## + nox      1    4799.2 23069 1482.6
## + crim     1    4539.7 23329 1486.6
## + rad      1    4045.7 23823 1494.0
## + age      1    3076.2 24792 1508.1
## + zn       1    2911.5 24957 1510.5
## + dis      1    1381.0 26487 1531.5
## + chas     1     374.8 27494 1544.8
## <none>                 27868 1547.5
## 
## Step:  AIC=1280.26
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    1770.2 11254 1230.5
## + ptratio  1    1092.5 11932 1251.2
## + dis      1     787.6 12237 1260.2
## + chas     1     383.7 12640 1271.7
## + age      1     360.8 12663 1272.3
## + tax      1     141.9 12882 1278.4
## <none>                 13024 1280.3
## + indus    1      64.7 12960 1280.5
## + crim     1      25.8 12998 1281.6
## + zn       1      23.2 13001 1281.6
## + nox      1       6.1 13018 1282.1
## + rad      1       0.3 13024 1282.3
## - lstat    1   14844.1 27868 1547.5
## 
## Step:  AIC=1230.55
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1     759.9 10494 1207.8
## + dis      1     480.9 10773 1217.1
## + chas     1     322.0 10932 1222.3
## + tax      1     182.2 11072 1226.8
## + age      1      97.0 11157 1229.5
## + crim     1      69.4 11185 1230.4
## <none>                 11254 1230.5
## + indus    1      33.3 11221 1231.5
## + rad      1      26.6 11228 1231.7
## + zn       1       6.4 11248 1232.3
## + nox      1       2.4 11252 1232.5
## - rm       1    1770.2 13024 1280.3
## - lstat    1    5445.1 16699 1368.2
## 
## Step:  AIC=1207.8
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq     RSS    AIC
## + dis      1     555.5  9938.6 1190.5
## + chas     1     231.2 10262.9 1201.9
## + age      1     113.4 10380.7 1206.0
## <none>                 10494.1 1207.8
## + rad      1      28.1 10466.0 1208.8
## + tax      1      21.0 10473.1 1209.1
## + zn       1      18.9 10475.2 1209.2
## + crim     1      18.3 10475.8 1209.2
## + nox      1      11.1 10483.0 1209.4
## + indus    1       0.1 10494.0 1209.8
## - ptratio  1     759.9 11254.0 1230.5
## - rm       1    1437.6 11931.7 1251.2
## - lstat    1    4152.6 14646.8 1323.8
## 
## Step:  AIC=1190.55
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq     RSS    AIC
## + nox      1     703.5  9235.1 1166.6
## + indus    1     341.8  9596.7 1180.2
## + tax      1     195.6  9743.0 1185.5
## + chas     1     144.4  9794.2 1187.4
## + zn       1     142.4  9796.1 1187.4
## + crim     1      72.3  9866.2 1190.0
## <none>                  9938.6 1190.5
## + age      1      25.2  9913.4 1191.7
## + rad      1       4.6  9934.0 1192.4
## - dis      1     555.5 10494.1 1207.8
## - ptratio  1     834.5 10773.1 1217.1
## - rm       1    1136.0 11074.6 1226.9
## - lstat    1    4615.1 14553.6 1323.6
## 
## Step:  AIC=1166.56
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq     RSS    AIC
## + chas     1    205.43  9029.7 1160.6
## + zn       1    148.49  9086.6 1162.8
## + rad      1     88.23  9146.9 1165.2
## + indus    1     62.30  9172.8 1166.2
## <none>                  9235.1 1166.6
## + crim     1     24.22  9210.9 1167.6
## + age      1      8.40  9226.7 1168.2
## + tax      1      4.81  9230.3 1168.4
## - nox      1    703.48  9938.6 1190.5
## - ptratio  1   1045.21 10280.3 1202.5
## - rm       1   1080.66 10315.8 1203.7
## - dis      1   1247.90 10483.0 1209.4
## - lstat    1   3156.68 12391.8 1268.6
## 
## Step:  AIC=1160.6
## medv ~ lstat + rm + ptratio + dis + nox + chas
## 
##           Df Sum of Sq     RSS    AIC
## + zn       1    168.46  8861.2 1155.9
## + rad      1    100.32  8929.4 1158.6
## + indus    1     79.08  8950.6 1159.5
## <none>                  9029.7 1160.6
## + crim     1     14.57  9015.1 1162.0
## + age      1      3.87  9025.8 1162.5
## + tax      1      1.39  9028.3 1162.5
## - chas     1    205.43  9235.1 1166.6
## - nox      1    764.54  9794.2 1187.4
## - ptratio  1    946.14  9975.8 1193.9
## - rm       1   1075.58 10105.2 1198.4
## - dis      1   1199.55 10229.2 1202.8
## - lstat    1   3072.19 12101.9 1262.3
## 
## Step:  AIC=1155.93
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn
## 
##           Df Sum of Sq     RSS    AIC
## + indus    1     85.02  8776.2 1154.5
## + rad      1     57.92  8803.3 1155.6
## <none>                  8861.2 1155.9
## + crim     1     37.74  8823.5 1156.4
## + tax      1     21.54  8839.7 1157.1
## + age      1     14.42  8846.8 1157.3
## - zn       1    168.46  9029.7 1160.6
## - chas     1    225.40  9086.6 1162.8
## - ptratio  1    692.05  9553.3 1180.5
## - nox      1    774.50  9635.7 1183.6
## - rm       1    937.76  9799.0 1189.5
## - dis      1   1342.06 10203.3 1203.8
## - lstat    1   3136.33 11997.5 1261.2
## 
## Step:  AIC=1154.52
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus
## 
##           Df Sum of Sq     RSS    AIC
## + rad      1     71.64  8704.5 1153.6
## <none>                  8776.2 1154.5
## + crim     1     40.02  8736.2 1154.9
## + age      1     17.16  8759.0 1155.8
## - indus    1     85.02  8861.2 1155.9
## + tax      1      3.89  8772.3 1156.4
## - zn       1    174.40  8950.6 1159.5
## - chas     1    243.94  9020.1 1162.2
## - nox      1    454.25  9230.4 1170.4
## - ptratio  1    507.80  9284.0 1172.4
## - rm       1    828.02  9604.2 1184.4
## - dis      1   1426.49 10202.7 1205.8
## - lstat    1   3008.06 11784.2 1256.8
## 
## Step:  AIC=1153.62
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus + 
##     rad
## 
##           Df Sum of Sq     RSS    AIC
## + tax      1    242.65  8461.9 1145.6
## + crim     1    134.08  8570.5 1150.1
## <none>                  8704.5 1153.6
## - rad      1     71.64  8776.2 1154.5
## + age      1     24.54  8680.0 1154.6
## - indus    1     98.74  8803.3 1155.6
## - zn       1    127.64  8832.2 1156.8
## - chas     1    253.99  8958.5 1161.8
## - nox      1    525.65  9230.2 1172.4
## - ptratio  1    574.82  9279.4 1174.2
## - rm       1    747.98  9452.5 1180.8
## - dis      1   1401.52 10106.1 1204.5
## - lstat    1   3079.53 11784.1 1258.8
## 
## Step:  AIC=1145.61
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus + 
##     rad + tax
## 
##           Df Sum of Sq     RSS    AIC
## + crim     1    133.31  8328.6 1142.0
## - indus    1      9.93  8471.8 1144.0
## <none>                  8461.9 1145.6
## + age      1     21.66  8440.2 1146.7
## - zn       1    192.58  8654.5 1151.6
## - chas     1    209.37  8671.3 1152.3
## - tax      1    242.65  8704.5 1153.6
## - rad      1    310.40  8772.3 1156.4
## - nox      1    449.08  8911.0 1161.9
## - ptratio  1    523.13  8985.0 1164.8
## - rm       1    675.25  9137.1 1170.8
## - dis      1   1378.72  9840.6 1197.0
## - lstat    1   3136.35 11598.2 1255.2
## 
## Step:  AIC=1141.99
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + indus + 
##     rad + tax + crim
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1     15.09  8343.7 1140.6
## <none>                  8328.6 1142.0
## + age      1     19.33  8309.3 1143.2
## - crim     1    133.31  8461.9 1145.6
## - chas     1    190.45  8519.0 1148.0
## - zn       1    220.66  8549.2 1149.2
## - tax      1    241.88  8570.5 1150.1
## - rad      1    407.35  8735.9 1156.9
## - nox      1    466.03  8794.6 1159.3
## - ptratio  1    538.89  8867.5 1162.2
## - rm       1    663.89  8992.5 1167.1
## - dis      1   1460.51  9789.1 1197.2
## - lstat    1   2669.87 10998.4 1238.4
## 
## Step:  AIC=1140.63
## medv ~ lstat + rm + ptratio + dis + nox + chas + zn + rad + tax + 
##     crim
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8343.7 1140.6
## + age      1     17.69  8326.0 1141.9
## + indus    1     15.09  8328.6 1142.0
## - crim     1    128.15  8471.8 1144.0
## - chas     1    181.86  8525.5 1146.3
## - zn       1    227.86  8571.5 1148.2
## - tax      1    341.46  8685.1 1152.8
## - rad      1    472.96  8816.6 1158.1
## - nox      1    563.81  8907.5 1161.8
## - ptratio  1    588.56  8932.2 1162.8
## - rm       1    699.80  9043.5 1167.1
## - dis      1   1480.64  9824.3 1196.5
## - lstat    1   2739.57 11083.2 1239.1
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     zn + rad + tax + crim, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5886  -2.8169  -0.5257   1.8192  24.3188 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.234002   5.911169   8.498 5.98e-16 ***
## lstat        -0.611279   0.057601 -10.612  < 2e-16 ***
## rm            2.585905   0.482123   5.364 1.50e-07 ***
## ptratio      -0.808922   0.164453  -4.919 1.35e-06 ***
## dis          -1.854224   0.237667  -7.802 7.43e-14 ***
## nox         -21.130869   4.389163  -4.814 2.22e-06 ***
## chas          3.183488   1.164305   2.734  0.00658 ** 
## zn            0.051315   0.016767   3.061  0.00238 ** 
## rad           0.349322   0.079221   4.409 1.39e-05 ***
## tax          -0.015578   0.004158  -3.747  0.00021 ***
## crim         -0.104928   0.045716  -2.295  0.02232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.932 on 343 degrees of freedom
## Multiple R-squared:  0.7006, Adjusted R-squared:  0.6919 
## F-statistic: 80.26 on 10 and 343 DF,  p-value: < 2.2e-16

Mean Square Error

## [1] 23.5697

While fitting the best linear regression model using AIC, we arrive at the above as the final model, with the lowest AIC value. In this model the variables age and indus are dropped. This agrees with the very first model fit on the entire dataset, where the t-tests indicated that age and indus should be dropped because their p-values are greater than 0.05.

Lasso Variable Selection

## Loading required package: Matrix
## Loaded glmnet 3.0-2

Selecting Lambda Value
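
A hedged sketch of the cross-validated lasso fit from which lambda.min and lambda.1se are read (glmnet requires a numeric predictor matrix, built here with model.matrix):

library(glmnet)
X <- model.matrix(medv ~ ., data = Boston.train)[, -1]  # drop intercept column
y <- Boston.train$medv
cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 gives the lasso
cv_lasso$lambda.min
cv_lasso$lambda.1se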

Lambda Min

## [1] 0.006627913

Lambda 1se

## [1] 0.362035

Coefficients from Lambda Min

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.4679421
## crim        -0.8899073
## zn           1.1898281
## indus       -0.4299548
## chas         0.8144242
## nox         -2.4138474
## rm           1.7327443
## age          0.3845212
## dis         -3.8391074
## rad          2.8179921
## tax         -2.2946981
## ptratio     -1.7014649
## lstat       -4.4432175

Coefficients from Lambda 1se

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 22.535109167
## crim        -0.039528378
## zn           0.006475944
## indus       -0.260645724
## chas         0.493677038
## nox         -0.806436894
## rm           2.137370834
## age          .          
## dis         -1.407380924
## rad          .          
## tax         -0.070337823
## ptratio     -1.401210354
## lstat       -4.255274002

Lasso MSE Lambda Min

## [1] 23.47515

Lasso MSE Lambda 1se

## [1] 26.8531

Lasso MSPE Lambda Min

## [1] 22.72916

Lasso MSPE Lambda 1se

## [1] 23.77414

Final Model

# Final Model
Final_Model <- lm(medv ~ . - indus - age - crim, data = Boston.train)
summary(Final_Model)
## 
## Call:
## lm(formula = medv ~ . - indus - age - crim, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2992  -2.8149  -0.4606   1.9329  24.4232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.830094   5.945090   8.382 1.35e-15 ***
## zn            0.047722   0.016797   2.841 0.004763 ** 
## chas          3.351016   1.169201   2.866 0.004412 ** 
## nox         -20.559700   4.409203  -4.663 4.47e-06 ***
## rm            2.597822   0.485076   5.355 1.56e-07 ***
## dis          -1.798079   0.237867  -7.559 3.72e-13 ***
## rad           0.292811   0.075764   3.865 0.000133 ***
## tax          -0.015343   0.004182  -3.669 0.000282 ***
## ptratio      -0.792909   0.165321  -4.796 2.41e-06 ***
## lstat        -0.641971   0.056374 -11.388  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.963 on 344 degrees of freedom
## Multiple R-squared:  0.696,  Adjusted R-squared:  0.6881 
## F-statistic: 87.51 on 9 and 344 DF,  p-value: < 2.2e-16

We have selected the model from BIC as the final model because its MSE is comparable to the AIC model's while it is less complex.

MSPE of out-of-Sample Data

# MSPE
pred <- predict(Final_Model, newdata = Boston.test)
mspe_pred <- mean((pred - Boston.test$medv)^2)
mspe_pred
## [1] 23.42387

Cross Validation

# Q4
# Cross Validation
library(boot)
model_2 <- glm(medv ~ . - indus - age - crim, data = boston_df)

cv.glm(data = boston_df, glmfit = model_2, K = 3)$delta[2]
## [1] 25.5441

We get an MSE from cross-validation of about 25.5, while the out-of-sample prediction error is 23.42. The two values are close, so we conclude that we are getting consistent results.

Fitting Regression Tree CART

In this case we choose the cp value that minimizes the cross-validated prediction error, measured by RMSE (root mean squared error), which corresponds to the average difference between the observed and predicted values. The best cp value is computed below along with the best-tune plot:
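
The mention of a best-tune plot suggests caret; a sketch under that assumption (the method and grid size are guesses):

library(caret)
set.seed(13478476)
# 10-fold cross-validation over a grid of cp values; train() keeps the cp
# with the lowest cross-validated RMSE
cart_fit <- train(medv ~ ., data = Boston.train, method = "rpart",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneLength = 10)
cart_fit$bestTune  # best cp value
plot(cart_fit)     # RMSE versus cp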

##           cp
## 1 0.00632333

The complexity parameter (cp) controls the size of the decision tree and selects the optimal tree size. If the cost of adding another split to the tree from the current node is above the value of cp, tree building does not continue.

Regression Tree

The decision tree plotted above is called a regression tree. Regression trees are built for continuous response variables. In this method, the set of training examples is broken down into smaller and smaller subsets while an associated decision tree is developed incrementally. At the end of the learning process, a decision tree covering the training set is returned.

Comparing CART to Linear Regression Model

When the data are simple and we use only one predictor, the prediction is easy to read off a plot of the response against that predictor. The same task becomes difficult with multiple predictors: with three or more predictors, drawing such a graph becomes difficult, if not impossible. In contrast, a regression tree easily accommodates additional predictors.

For example, consider the observations that fall into the terminal node reached via rm < 7.0155 and tax >= 548 in the tree printed below: for these, the tree predicts a median value of owner-occupied homes of about $39,457 (a fitted value of 39.45714, in $1000s), based on 7 observations.

## n= 354 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 354 27868.34000 22.51017  
##    2) lstat>=9.725 201  4959.59100 17.48806  
##      4) lstat>=14.915 113  2138.37800 14.71681  
##        8) nox>=0.657 59   767.89560 12.37966  
##         16) lstat>=18.93 37   273.20920 10.64054 *
##         17) lstat< 18.93 22   194.56950 15.30455 *
##        9) nox< 0.657 54   696.09260 17.27037 *
##      5) lstat< 14.915 88   839.03900 21.04659 *
##    3) lstat< 9.725 153 11179.23000 29.10784  
##      6) rm< 7.0155 116  4378.80600 25.88276  
##       12) tax< 548 109  1961.24700 25.01101  
##         24) rm< 6.5445 65   614.46150 22.84615 *
##         25) rm>=6.5445 44   592.13640 28.20909  
##           50) lstat>=4.92 27   217.51190 26.42593 *
##           51) lstat< 4.92 17   152.42120 31.04118 *
##       13) tax>=548 7  1044.87700 39.45714 *
##      7) rm>=7.0155 37  1811.23700 39.21892  
##       14) rm< 7.437 17    61.43529 34.02941 *
##       15) rm>=7.437 20   902.82200 43.63000 *

The following is the head of predictions.

##        7        8        9       13       14       15 
## 21.04659 17.27037 17.27037 17.27037 22.84615 21.04659
## [1] 3.882004
##     RMSE Rsquared      MAE 
## 3.882004 0.855859 2.850962

The RMSE for the regression tree is 3.882004 and the R-squared is about 85.6%.

Fitting Another Regression Tree with New Random Set

##            cp
## 1 0.004326469

## n= 354 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 354 30194.93000 22.60847  
##    2) lstat>=9.645 207  4640.87000 17.34396  
##      4) lstat>=19.85 52   732.74080 12.08077  
##        8) nox>=0.603 38   286.59390 10.45526 *
##        9) nox< 0.603 14    73.20929 16.49286 *
##      5) lstat< 19.85 155  1984.41500 19.10968  
##       10) lstat>=14.915 60   643.55400 16.69000 *
##       11) lstat< 14.915 95   767.70360 20.63789 *
##    3) lstat< 9.645 147 11738.37000 30.02177  
##      6) rm< 7.4545 127  5678.08700 27.67953  
##       12) dis>=1.9704 119  2763.28800 26.58992  
##         24) rm< 6.803 84  1001.47700 24.38333  
##           48) rm< 6.531 54   419.47700 22.89259 *
##           49) rm>=6.531 30   245.98670 27.06667 *
##         25) rm>=6.803 35   371.22290 31.88571 *
##       13) dis< 1.9704 8   671.92880 43.88750 *
##      7) rm>=7.4545 20   939.28950 44.89500  
##       14) ptratio>=16.15 10   662.96000 41.80000 *
##       15) ptratio< 16.15 10    84.74900 47.99000 *
##        4        5        8       10       12       16 
## 31.88571 31.88571 16.69000 16.69000 20.63789 22.89259
## [1] 4.639071
##      RMSE  Rsquared       MAE 
## 4.6390715 0.7503581 3.3348107

We do not get the same results from the two regression trees.

Conclusion

In both cases the training data is randomly selected, so different training sets yield different prediction errors. The main objective of the regression tree is to minimize this prediction error so as to provide the best predictions.


Logistic Regression to Predict the Bankruptcy

The bankruptcy dataset contains the companies’ financial information and their bankruptcy status for particular years. It is a subset of Man Xu’s project containing the financial information of the companies.

##  [1] "FYEAR" "DLRSN" "CUSIP" "R1"    "R2"    "R3"    "R4"    "R5"    "R6"   
## [10] "R7"    "R8"    "R9"    "R10"

There are 13 attributes in this dataset. They are:
1. FYEAR: Fiscal Year for which the data has been taken.
2. DLRSN: It is a Bankruptcy/Non-Bankruptcy Flag. Bankruptcy = 1 and Non-Bankruptcy = 0.
3. CUSIP: The Committee on Uniform Securities Identification Procedures number, a unique nine-character identifier assigned to all stocks (and registered bonds) in the U.S. and Canada.
4. R1: Working Capital / Total Asset
5. R2: Retained Earning / Total Asset
6. R3: Earning Before Interest & Tax/Total Asset
7. R4: Market Capital / Total Liability
8. R5: SALE / Total Asset
9. R6: Total Liability / Total Asset
10. R7: Current Asset / Current Liability
11. R8: Net Income / Total Asset
12. R9: LOG(SALE); the log of sales is used because the raw sales figures are huge.
13. R10: The market capital of a company in a particular fiscal year, also on the log scale.

Since CUSIP is an assigned ID and the fiscal year will not play much of a role in this subset of Man Xu's dataset, we drop these two columns, as sketched below.
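
A minimal sketch of the drop, assuming the data frame is named bankruptcy_df:

# drop the identifier and fiscal-year columns
bankruptcy_df <- subset(bankruptcy_df, select = -c(FYEAR, CUSIP))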

Now, our dataset is ready, so let’s dive into Exploratory Data Analysis.

Exploratory Data Analysis:

The dataset contains 5436 rows and 11 columns. Now let's check the structure of the dataset. As we can see below, none of the variables are factors, and with domain understanding it is clear that none should be converted into factors. DLRSN is the only 0/1 variable, and it is the response we will predict with logistic regression.

## 'data.frame':    5436 obs. of  11 variables:
##  $ DLRSN: int  0 0 0 1 0 0 0 0 1 0 ...
##  $ R1   : num  0.307 0.761 -0.514 -0.466 2.023 ...
##  $ R2   : num  0.887 0.592 0.338 0.371 0.215 ...
##  $ R3   : num  1.648 0.453 0.299 0.496 0.183 ...
##  $ R4   : num  -0.1992 -0.3699 -0.0291 -0.3734 6.6954 ...
##  $ R5   : num  1.093 0.186 -0.433 -0.267 -1.148 ...
##  $ R6   : num  -0.3133 0.0396 0.83 0.9778 -1.5059 ...
##  $ R7   : num  -0.197 0.327 -0.708 -0.611 2.876 ...
##  $ R8   : num  1.207 0.428 0.476 0.457 0.287 ...
##  $ R9   : num  0.282 1.107 2.179 0.152 -0.986 ...
##  $ R10  : num  0.1589 0.7934 2.4846 0.0478 0.7911 ...

Let's check the first few rows of the dataset:

##   DLRSN         R1        R2        R3          R4         R5          R6
## 1     0  0.3071395 0.8870057 1.6476808 -0.19915760  1.0929640 -0.31328867
## 2     0  0.7607365 0.5924934 0.4530028 -0.36989014  0.1861539  0.03961907
## 3     0 -0.5135961 0.3376148 0.2990145 -0.02907996 -0.4326047  0.82999324
## 4     1 -0.4661293 0.3707467 0.4960667 -0.37342897 -0.2674240  0.97779888
## 5     0  2.0234223 0.2148764 0.1825948  6.69536040 -1.1483381 -1.50588927
## 6     0  0.9074985 0.3868797 0.4778914 -0.34715872  1.4079893 -0.48390230
##            R7        R8         R9         R10
## 1 -0.19679332 1.2067628  0.2824709  0.15889625
## 2  0.32749732 0.4284181  1.1069652  0.79344276
## 3 -0.70778613 0.4761533  2.1791755  2.48458450
## 4 -0.61097507 0.4568098  0.1519511  0.04778851
## 5  2.87647687 0.2873749 -0.9864421  0.79107673
## 6  0.07025944 0.5278110  0.5024659 -0.16464802

It is clear that R10, the market cap, and R9, the sales data, are on the log scale. Checking for null values in the dataset is very important; if we find any, we have to impute them, with the mean or median or as the domain dictates.

## DLRSN    R1    R2    R3    R4    R5    R6    R7    R8    R9   R10 
##     0     0     0     0     0     0     0     0     0     0     0


From the above plot and column values it is clear there are no Null Values in the dataset.

##      DLRSN              R1                R2                R3          
##  Min.   :0.0000   Min.   :-4.3828   Min.   :-2.2418   Min.   :-2.06423  
##  1st Qu.:0.0000   1st Qu.:-0.7501   1st Qu.:-1.0805   1st Qu.:-1.07887  
##  Median :0.0000   Median :-0.2220   Median : 0.1337   Median : 0.06858  
##  Mean   :0.1428   Mean   :-0.2352   Mean   :-0.2915   Mean   :-0.24411  
##  3rd Qu.:0.0000   3rd Qu.: 0.4821   3rd Qu.: 0.5103   3rd Qu.: 0.50070  
##  Max.   :1.0000   Max.   : 2.0234   Max.   : 1.4854   Max.   : 2.14246  
##        R4                 R5                R6                  R7          
##  Min.   :-0.42712   Min.   :-1.3639   Min.   :-1.505889   Min.   :-1.23340  
##  1st Qu.:-0.38752   1st Qu.:-0.8748   1st Qu.:-0.656827   1st Qu.:-0.77996  
##  Median :-0.30754   Median :-0.3508   Median : 0.003493   Median :-0.43205  
##  Mean   : 0.23903   Mean   :-0.1338   Mean   : 0.195707   Mean   :-0.09612  
##  3rd Qu.: 0.02746   3rd Qu.: 0.2878   3rd Qu.: 0.608376   3rd Qu.: 0.17578  
##  Max.   : 6.69536   Max.   : 4.0362   Max.   : 5.110424   Max.   : 2.87648  
##        R8                R9                R10         
##  Min.   :-2.2082   Min.   :-2.76356   Min.   :-2.2140  
##  1st Qu.:-1.0054   1st Qu.:-0.66606   1st Qu.:-0.6413  
##  Median : 0.2053   Median : 0.06219   Median : 0.1230  
##  Mean   :-0.2272   Mean   : 0.02538   Mean   : 0.1806  
##  3rd Qu.: 0.5289   3rd Qu.: 0.81215   3rd Qu.: 0.9878  
##  Max.   : 2.0006   Max.   : 2.17918   Max.   : 2.4846

Checking the summary statistics, nothing significant stands out. Let's move to boxplots to understand each variable more intuitively.

## [1] "Z-transformation has been applied over data!"


Analysis of Boxplot:
* R1: 3 outliers. R1 is the working capital to total asset ratio, which reflects a company's ability to cover short-term financial obligations. These 3 outliers lie below the minimum whisker.
* R2, R3, R8, R9, and R10: no outliers.
* R4: clearly heavily skewed because of its outliers.
* R5, R6 and R7: all share a similar number of outliers.

Analysis of Density plots:
* Bi-Modal: R1, R2, R3, R5, R7, R8 variables are bi-modal.
* Negative Skewness: R1, R2
* Positive Skewness: R4, R5, R6, R7, R8

Analysis of Correlation Matrix:
From the correlation matrix, the following points are evident:
* DLRSN, the response variable, is strongly negatively correlated with R10, the market capital.
Strong correlations among the predictors themselves:
* R6 is negatively correlated with R1 and R7.
* R2 is positively correlated with R3, R8 and R9.
* R3 is positively correlated with R8 and R9.
* R9 is positively correlated with R10.
* R1 is positively correlated with R7.

It is evident that the number of companies that went bankrupt is just under 800 (the mean of DLRSN is 0.1428, i.e. about 14% of the 5,436 companies).

Splitting Dataset into training and testing dataset

We have used 13478476 as the seed, which is the M Number of one of the group members.
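
A sketch of an approximately 70/30 split consistent with the counts reported below (the exact sampling call is an assumption):

set.seed(13478476)  # M Number of one of the group members
index <- sample(nrow(bankruptcy_df), round(nrow(bankruptcy_df) * 0.7))
bankruptcy.train <- bankruptcy_df[index, ]
bankruptcy.test  <- bankruptcy_df[-index, ]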

## Observations in training set:  3806  and in testing set:  1630

Bankruptcy Prediction

Since the response variable is qualitative, i.e. we classify a company as bankrupt or not, we use logistic regression for model building.

Generalised Linear Regression

The team has used generalised linear regression with the following link functions: 1. logit link, 2. probit link, 3. complementary log-log link. To build the models, the team used all the variables, R1 to R10, as predictors, in combination with each of the different link functions.
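
A sketch of the three fits (object names are our own):

# full models under the three link functions
fit_logit   <- glm(DLRSN ~ ., family = binomial(link = "logit"),   data = bankruptcy.train)
fit_probit  <- glm(DLRSN ~ ., family = binomial(link = "probit"),  data = bankruptcy.train)
fit_cloglog <- glm(DLRSN ~ ., family = binomial(link = "cloglog"), data = bankruptcy.train)
AIC(fit_logit)  # and likewise for the probit and cloglog fits
BIC(fit_logit)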

## ********Logit-Link Model********
## Mean Residual Deviance is  0.5439795
## AIC  is  2086.402
## BIC is  2155.09
## ********Probit-Link Model********
## Mean Residual Deviance is  0.5452349
## AIC is  2091.167
## BIC is  2159.854
## ********Complimentary Log Log Model********
## Mean Residual Deviance is  0.549709
## AIC is  2108.146
## BIC is  2176.833

As we can see, the models with the different links (logit, probit and cloglog) are comparable. Another important point, discussed before, is that all three are full models.

2. Variable Selection and the Best Model's Mean Residual Deviance, Area Under Curve & Misclassification Rate

## ********Step Wise selection, direction=exhaustive********
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
## 
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R6 + R7 + R8 + R9 + R10
## 
## 
##   Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                        3795   2064.402 2086.402
## 2 - R4  1 0.6810090      3796   2065.083 2085.083
## 3 - R5  1 0.7111673      3797   2065.794 2083.794
## ********Step Wise selection, direction=forward********
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
## 
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       3795   2064.402 2086.402
## ********Step Wise selection, direction=backward********
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
## 
## Final Model:
## DLRSN ~ R1 + R2 + R3 + R6 + R7 + R8 + R9 + R10
## 
## 
##   Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                        3795   2064.402 2086.402
## 2 - R4  1 0.6810090      3796   2065.083 2085.083
## 3 - R5  1 0.7111673      3797   2065.794 2083.794
## ********BIC********
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## DLRSN ~ R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10
## 
## Final Model:
## DLRSN ~ R2 + R3 + R6 + R7 + R9 + R10
## 
## 
##   Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                        3795   2064.402 2155.090
## 2 - R4  1 0.6810090      3796   2065.083 2147.527
## 3 - R5  1 0.7111673      3797   2065.794 2139.993
## 4 - R1  1 4.1889926      3798   2069.983 2135.938
## 5 - R8  1 5.1585338      3799   2075.142 2132.852
## ********Lasso********

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -2.60126171
## R1           0.12066150
## R2           0.59469602
## R3          -0.59088678
## R4          -0.08358830
## R5           0.06509002
## R6           0.21883720
## R7          -0.50769397
## R8          -0.20634429
## R9           0.26436658
## R10         -1.45828944
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -2.48248014
## R1           .         
## R2           0.44262186
## R3          -0.46306918
## R4          -0.06474861
## R5           0.09685529
## R6           0.15140263
## R7          -0.33956672
## R8          -0.16384191
## R9           0.15933240
## R10         -1.30846529
## [1]   10.000 2088.005

Looking at the output of the different variable selection strategies, the following points are clear: 1. the residual deviance is essentially the same for all strategies; 2. the AIC values of the models are comparable; 3. in terms of model complexity, the BIC strategy gives the least complex model.

Therefore, for the final model, we have selected the BIC-based model.

Our final model is:

## Mean Residual Deviance for the final model is  0.09047091

## AUC for full model is  0.8787171
## [1] 0.6355754

##     Predicted
## True    0    1
##    0 1516 1754
##    1   19  517
## Asymertic Misclassification rate  0.4658434 False Positives  0.5363914 False Negatives  0.03544776

For in-sample testing, the AUC comes out to be 87.87%, which exceeds the required 70%; hence the model performs well on the in-sample data. The asymmetric misclassification rate comes out to be about 46%, which is the cost-weighted probability of misclassifying observations in the training dataset.

3 & 4. Out-of-Sample Testing and Cut-off Probabilities through Grid Search

## AUC for full model is  0.8636841
## [1] 0.6779141

##     Predicted
## True   0   1
##    0 495 895
##    1   6 234
## Out-Of-Sample Sample Asymertic        Misclassification rate  0.5527607 False Positives  0.6438849 False Negatives  0.025

In out-of-sample testing the area under the curve comes out to be about 86%, meaning the model discriminates well between bankrupt and non-bankrupt companies. The out-of-sample asymmetric misclassification rate comes out to be 55.27%, the cost-weighted probability of misclassifying observations in the testing dataset.

Cross Validation with Cross Function

## [1] 0.6737797

Cross Validation with AUC Function
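
The aucCost function passed to cv.glm below is not defined in this excerpt; one possible definition, computing the AUC from the Mann-Whitney rank statistic, is sketched here (purely an assumption):

# cost(observed, predicted) returning the AUC; cv.glm supplies the observed
# response and the cross-validated predicted probabilities
aucCost <- function(obs, pred) {
  r  <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}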

cv.glm(data = bankruptcy_df, glmfit = bankruptcy_model, cost = aucCost, K = 3)$delta[2]
## [1] 0.8680642

From part 3 we obtained an asymmetric misclassification rate of about 0.55 and an AUC of about 0.86. The cross-validated estimates above (0.67 and 0.87) are close to those values, so cross-validation gives broadly similar results.

MSE

## [1] 0.4157473

MSPE

## [1] 0.4142877

From homework 3 we obtained an MSPE of 0.476, while here we get about 0.41, so we can say that CART does a better job of fitting the data than logistic regression.

MSE

## Warning in bankruptcy.train.pred.tree_1 - bankruptcy.train_1$DLRSN: longer
## object length is not a multiple of shorter object length
## [1] 0.4171277

MSPE

## [1] 0.4153906

We observe that the MSPE for the 70:30 train:test split is 0.414 and for the 90:10 split is 0.415. So the sample appears representative of the entire population: irrespective of which split we choose, we get almost the same MSPE.