In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
- zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
- indus: proportion of non-retail business acres per suburb (predictor variable)
- chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
- nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
- rm: average number of rooms per dwelling (predictor variable)
- age: proportion of owner-occupied units built prior to 1940 (predictor variable)
- dis: weighted mean of distances to five Boston employment centers (predictor variable)
- rad: index of accessibility to radial highways (predictor variable)
- tax: full-value property-tax rate per $10,000 (predictor variable)
- ptratio: pupil-teacher ratio by town (predictor variable)
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (predictor variable)
- lstat: lower status of the population (percent) (predictor variable)
- medv: median value of owner-occupied homes in $1000s (predictor variable)
- target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
- A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.
- Assigned predictions (probabilities, classifications) for the evaluation data set. Use a 0.5 threshold.
- Include your R statistical programming code in an Appendix.
| Variable Name | Short Description |
|---|---|
| zn | proportion of residential land zoned for large lots (over 25000 square feet) |
| indus | proportion of non-retail business acres per suburb |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centers |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
| target | whether the crime rate is above the median crime rate (1) or not (0) |
Our dataset has 466 observations and 13 variables. Each observation represents a neighborhood of a major city. The response variable target is a binary indicator: 1 means the neighborhood's crime rate is above the median crime rate and 0 means it is not. There are no missing values in the dataset.
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <fct> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
## vars n mean sd median trimmed mad min max range skew
## zn 1 466 11.58 23.36 0.00 5.35 0.00 0.00 100.00 100.00 2.18
## indus 2 466 11.11 6.85 9.69 10.91 9.34 0.46 27.74 27.28 0.29
## chas* 3 466 1.07 0.26 1.00 1.00 0.00 1.00 2.00 1.00 3.34
## nox 4 466 0.55 0.12 0.54 0.54 0.13 0.39 0.87 0.48 0.75
## rm 5 466 6.29 0.70 6.21 6.26 0.52 3.86 8.78 4.92 0.48
## age 6 466 68.37 28.32 77.15 70.96 30.02 2.90 100.00 97.10 -0.58
## dis 7 466 3.80 2.11 3.19 3.54 1.91 1.13 12.13 11.00 1.00
## rad 8 466 9.53 8.69 5.00 8.70 1.48 1.00 24.00 23.00 1.01
## tax 9 466 409.50 167.90 334.50 401.51 104.52 187.00 711.00 524.00 0.66
## ptratio 10 466 18.40 2.20 18.90 18.60 1.93 12.60 22.00 9.40 -0.75
## lstat 11 466 12.63 7.10 11.35 11.88 7.07 1.73 37.97 36.24 0.91
## medv 12 466 22.59 9.24 21.20 21.63 6.00 5.00 50.00 45.00 1.08
## target* 13 466 1.49 0.50 1.00 1.49 0.00 1.00 2.00 1.00 0.03
## kurtosis se
## zn 3.81 1.08
## indus -1.24 0.32
## chas* 9.15 0.01
## nox -0.04 0.01
## rm 1.54 0.03
## age -1.01 1.31
## dis 0.47 0.10
## rad -0.86 0.40
## tax -1.15 7.78
## ptratio -0.40 0.10
## lstat 0.50 0.33
## medv 1.37 0.43
## target* -2.00 0.02
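As a quick sanity check on the summary table, the `se` column can be reproduced from `sd` and `n`, since the standard error of the mean is sd / sqrt(n). The sketch below is in Python (standard library only) rather than the R used for the analysis, and the variable names are illustrative. It also reads the class balance off the `target*` row, assuming the starred factor columns are summarised on their integer codes 1 and 2, as their min and max suggest.

```python
import math

n = 466  # observations reported in the summary table

# (sd, reported se) pairs copied from the summary output above
reported = {
    "zn": (23.36, 1.08),
    "indus": (6.85, 0.32),
    "age": (28.32, 1.31),
    "tax": (167.90, 7.78),
}

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

computed = {name: round(standard_error(sd, n), 2)
            for name, (sd, _) in reported.items()}

for name, (sd, se) in reported.items():
    print(name, "reported:", se, "computed:", computed[name])

# Assumption: target* is coded 1/2, so mean - 1 is the share of class 1.
share_high_crime = round(1.49 - 1, 2)
print("approximate share of high-crime neighborhoods:", share_high_crime)
```

With a share close to one half, the two classes are nearly balanced, which is what a median split should produce.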
The histograms below show that most of the distributions are skewed; only medv and rm are approximately normally distributed. age and ptratio are left-skewed, whereas dis, lstat and zn are right-skewed. indus, rad and tax have bimodal distributions.
The box plots below show outliers in many of the variables. rad and tax have a large interquartile range. The spread of age, dis, nox, rad, tax and zn differs between the two values of the response variable, so we will consider adding quadratic terms for them.
In the correlation table below, nox, age, rad, tax and indus have a strong positive correlation with target (all above 0.6), and dis has a strong negative correlation (-0.62). The remaining variables are more weakly correlated with target.
| Variable | Correlation with target |
|---|---|
| target | 1.0000000 |
| nox | 0.7261062 |
| age | 0.6301062 |
| rad | 0.6281049 |
| tax | 0.6111133 |
| indus | 0.6048507 |
| lstat | 0.4691270 |
| ptratio | 0.2508489 |
| chas | 0.0800419 |
| rm | -0.1525533 |
| medv | -0.2705507 |
| zn | -0.4316818 |
| dis | -0.6186731 |
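Each entry in the table is a Pearson correlation between a numeric predictor and the 0/1 target. As an illustration only, the Python sketch below computes that coefficient from scratch for the first nine rows of nox and target shown in the data preview. Nine rows are far too few to reproduce the 0.726 reported for the full data, so the result should be read as a demonstration of the formula and its sign, not as a check of the table.

```python
import math

# First nine rows of nox and target, copied from the data preview above.
nox = [0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515]
target = [1, 1, 1, 0, 0, 0, 1, 1, 0]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r_nox = pearson(nox, target)
print("correlation of nox with target on nine rows:", round(r_nox, 3))
```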
The analysis above shows that the data needs some transformations because of skewness and outliers. We remove the tax variable due to multicollinearity, take log transformations of age and lstat because of their skewness, and add quadratic terms for rad, nox and zn because their variance differs between the two classes.
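The transformation step can be sketched as follows. This is a Python illustration, not the R code used for the analysis, and the function name `transform` is hypothetical. It also makes one assumption that the write-up does not settle: the squared value of rad, nox and zn replaces the raw value under the same name, which is one reading of the single coefficient per variable in the second model's output.

```python
import math

def transform(row):
    """Return a transformed copy of one observation (a dict of predictors)."""
    out = dict(row)
    out.pop("tax", None)                   # dropped because of multicollinearity
    out["age"] = math.log(row["age"])      # log transform to reduce skewness
    out["lstat"] = math.log(row["lstat"])
    for name in ("rad", "nox", "zn"):
        # Assumption: the squared value replaces the raw value.
        out[name] = row[name] ** 2
    return out

# First observation from the data preview above.
first_row = {"zn": 0, "indus": 19.58, "chas": 0, "nox": 0.605, "rm": 7.929,
             "age": 96.2, "dis": 2.0459, "rad": 5, "tax": 403,
             "ptratio": 14.7, "lstat": 3.70, "medv": 50.0}

transformed = transform(first_row)
print(transformed)
```

The log is safe here because the summary table shows minimum values of 2.90 for age and 1.73 for lstat. zn contains zeros, so it could not be log-transformed without an offset.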
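In the first model we use all 12 predictors and run the glm. Note that tax is still included at this stage, as the output below shows. Many of the predictors have significant p-values (<= 0.05); indus, chas, rm and lstat have large p-values, and zn is marginal (p = 0.057).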
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
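The estimates above are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of high crime for a given increase in that predictor, holding the others fixed. The Python sketch below applies this to a few Model 1 coefficients. The step sizes are illustrative choices, not values from the write-up: a full one-unit step in nox is not meaningful because nox only spans 0.39 to 0.87.

```python
import math

# Selected Model 1 estimates (log-odds scale), copied from the output above.
coef = {"nox": 49.122297, "rad": 0.666366, "age": 0.034189, "tax": -0.006171}

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a `delta` increase in the predictor."""
    return math.exp(beta * delta)

or_rad = odds_ratio(coef["rad"])               # one-step increase in the highway index
or_nox = odds_ratio(coef["nox"], delta=0.01)   # 0.01 increase in nox
or_tax = odds_ratio(coef["tax"], delta=10)     # 10-unit increase in the tax rate

print("rad:", round(or_rad, 3))
print("nox (+0.01):", round(or_nox, 3))
print("tax (+10):", round(or_tax, 3))
```

An odds ratio above 1 means the odds of being above the median crime rate go up with the predictor; below 1 means they go down.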
To try to improve on the first model, the second model uses the transformed variables and runs the glm again. The p-values of several predictors increase with the transformed variables (age is no longer significant), and the AIC rises from 218.05 to 227.77, so this model fits worse than the earlier one.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_train_trans)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0433 -0.2233 0.0000 0.0000 3.2207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.050185 5.743773 -4.884 1.04e-06 ***
## zn -0.003025 0.001538 -1.966 0.049279 *
## indus -0.111100 0.045632 -2.435 0.014904 *
## chas 1.341454 0.722182 1.858 0.063240 .
## nox 44.473984 7.183977 6.191 5.99e-10 ***
## rm -0.356816 0.652810 -0.547 0.584664
## age 0.896847 0.639678 1.402 0.160907
## dis 0.661694 0.207061 3.196 0.001395 **
## rad 0.048170 0.012613 3.819 0.000134 ***
## ptratio 0.342741 0.111882 3.063 0.002188 **
## lstat 0.435860 0.649096 0.671 0.501911
## medv 0.149548 0.059824 2.500 0.012425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 203.77 on 454 degrees of freedom
## AIC: 227.77
##
## Number of Fisher Scoring iterations: 10
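The AIC values in the two summaries can be checked by hand. For a logistic regression on individual 0/1 responses the residual deviance equals -2 times the log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients including the intercept. A small Python sketch of that arithmetic, with illustrative names:

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC = residual deviance + 2k for a logistic model with 0/1 responses."""
    return residual_deviance + 2 * n_coefficients

# Residual deviance and coefficient count (intercept included)
# taken from the two model summaries above.
aic_model_1 = round(aic_from_deviance(192.05, 13), 2)
aic_model_2 = round(aic_from_deviance(203.77, 12), 2)

print("Model 1 AIC:", aic_model_1)
print("Model 2 AIC:", aic_model_2)
```

Both results agree with the AIC lines printed by the summaries. The second model has one fewer coefficient but a residual deviance almost 12 points higher, which is why its AIC is worse.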
In our third model we use stepAIC from the MASS package to select the features. The selected model keeps eight predictors, all with significant p-values (< 0.05), and its AIC drops to 215.32.
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
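To show how these coefficients turn into a probability and a class at the 0.5 threshold, the Python sketch below scores the first training observation from the data preview with the Model 3 estimates. It is an illustration, not the R prediction code used for the evaluation set: the function names are hypothetical, and the coefficients are the rounded values printed above, so the result may differ slightly from what R would report.

```python
import math

# Model 3 estimates copied from the summary above.
coef = {
    "(Intercept)": -37.415922, "zn": -0.068648, "nox": 42.807768,
    "age": 0.032950, "dis": 0.654896, "rad": 0.725109,
    "tax": -0.007756, "ptratio": 0.323628, "medv": 0.110472,
}

# First training observation from the data preview (its observed target is 1).
row = {"zn": 0, "nox": 0.605, "age": 96.2, "dis": 2.0459,
       "rad": 5, "tax": 403, "ptratio": 14.7, "medv": 50.0}

def predict_probability(coef, row):
    """Inverse logit of the linear predictor."""
    eta = coef["(Intercept)"] + sum(coef[name] * value
                                    for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))

def classify(probability, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0

p = predict_probability(coef, row)
label = classify(p)
print("probability:", round(p, 4), "class:", label)
```

The predicted class agrees with the observed target of 1 for this row.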
We tabulate the metrics for the three models built above. From the ROC curves below, Model 1 and Model 3 have almost the same AUC while Model 2 has a lower AUC, so we compare Models 1 and 3. Model 3 has the lower AIC, which is preferable, and it uses fewer predictors than Model 1, while the other metrics are nearly identical between the two. We therefore choose Model 3 as our best model.
| Metric | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Accuracy | 0.9163090 | 0.9055794 | 0.9120172 |
| Class. Error Rate | 0.0836910 | 0.0944206 | 0.0879828 |
| Sensitivity | 0.9039301 | 0.8733624 | 0.9039301 |
| Specificity | 0.9282700 | 0.9367089 | 0.9198312 |
| Precision | 0.9241071 | 0.9302326 | 0.9159292 |
| F1 | 0.9139073 | 0.9009009 | 0.9098901 |
| AIC | 218.0469179 | 227.7670973 | 215.3228528 |
| Coefficients (incl. intercept) | 13.0000000 | 12.0000000 | 9.0000000 |
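The metrics in the table all derive from a confusion matrix at the 0.5 threshold. The write-up does not report the counts, so the Python sketch below uses counts back-calculated from the Model 3 rates and the 466 observations (229 positives, 237 negatives). Treat these counts as an inference, not as reported results. The function name is illustrative.

```python
# Assumed counts: back-calculated from the Model 3 rates in the table,
# not reported in the write-up.
tp, fn, tn, fp = 207, 22, 218, 19

def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

metrics = classification_metrics(tp, fn, tn, fp)
for name, value in metrics.items():
    print(name, round(value, 7))
```

With these counts every Model 3 entry in the table is reproduced to the printed precision, which supports the back-calculation.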