Overview

In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

 zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)

 indus: proportion of non-retail business acres per suburb (predictor variable)

 chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)

 nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)

 rm: average number of rooms per dwelling (predictor variable)

 age: proportion of owner-occupied units built prior to 1940 (predictor variable)

 dis: weighted mean of distances to five Boston employment centers (predictor variable)

 rad: index of accessibility to radial highways (predictor variable)

 tax: full-value property-tax rate per $10,000 (predictor variable)

 ptratio: pupil-teacher ratio by town (predictor variable)

 black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (predictor variable)

 lstat: lower status of the population (percent) (predictor variable)

 medv: median value of owner-occupied homes in $1000s (predictor variable)

 target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Deliverables

 A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.

 Assigned prediction (probabilities, classifications) for the evaluation data set. Use 0.5 threshold.

 Include your R statistical programming code in an Appendix.

Variable Name	Short Description
zn	proportion of residential land zoned for large lots (over 25000 square feet)
indus	proportion of non-retail business acres per suburb
chas	a dummy var. for whether the suburb borders the Charles River (1) or not (0)
nox	nitrogen oxides concentration (parts per 10 million)
rm	average number of rooms per dwelling
age	proportion of owner-occupied units built prior to 1940
dis	weighted mean of distances to five Boston employment centers
rad	index of accessibility to radial highways
tax	full-value property-tax rate per $10,000
ptratio	pupil-teacher ratio by town
lstat	lower status of the population (percent)
medv	median value of owner-occupied homes in $1000s
target	whether the crime rate is above the median crime rate (1) or not (0)
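
As a compact reference for the modeling task, the binary logistic regression model and the 0.5 classification rule named in the deliverables can be written as follows. The symbols (p_i, x_ij, beta_j, k) are notation introduced here, and treating a probability of exactly 0.5 as class 0 is an assumption, since the assignment only says to use a 0.5 threshold.

```latex
\[
\log\frac{p_i}{1-p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad
p_i = \Pr(\text{target}_i = 1 \mid x_i)
    = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\right)}}
\]
\[
\hat{y}_i =
\begin{cases}
1 & \text{if } \hat{p}_i > 0.5 \\
0 & \text{otherwise}
\end{cases}
\]
```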

Data exploration

The training data set has 466 observations and 13 variables (12 predictors plus the response). Each observation represents one neighborhood of a major city. The response variable target is binary: 1 means the neighborhood's crime rate is above the median crime rate and 0 means it is not. There are no missing values in the data set; every variable has n = 466 in the summary table below.
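
The checks described above (dimensions, missing values, class balance) can be sketched in base R. The data frame `crime_demo` is a hypothetical four-row stand-in, not the real training data; with the real data the same calls would be applied to the training data frame.

```r
# Hypothetical stand-in for the training data (the real file is not reproduced here)
crime_demo <- data.frame(
  zn     = c(0, 30, 0, 80),
  nox    = c(0.605, 0.428, 0.740, 0.520),
  chas   = factor(c(0, 1, 0, 0)),
  target = factor(c(1, 0, 1, 0))
)

dims            <- dim(crime_demo)               # number of rows, number of columns
missing_per_col <- colSums(is.na(crime_demo))    # count of NA values in each column
class_counts    <- table(crime_demo$target)      # how many 0s and 1s in the response

print(dims)
print(missing_per_col)
print(class_counts)
```

A column sum of zero for every variable is what "no missing values" means here.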

## Rows: 466
## Columns: 13
## $ zn      <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus   <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas    <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm      <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age     <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis     <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad     <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax     <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat   <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv    <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target  <fct> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…

##         vars   n   mean     sd median trimmed    mad    min    max  range  skew
## zn         1 466  11.58  23.36   0.00    5.35   0.00   0.00 100.00 100.00  2.18
## indus      2 466  11.11   6.85   9.69   10.91   9.34   0.46  27.74  27.28  0.29
## chas*      3 466   1.07   0.26   1.00    1.00   0.00   1.00   2.00   1.00  3.34
## nox        4 466   0.55   0.12   0.54    0.54   0.13   0.39   0.87   0.48  0.75
## rm         5 466   6.29   0.70   6.21    6.26   0.52   3.86   8.78   4.92  0.48
## age        6 466  68.37  28.32  77.15   70.96  30.02   2.90 100.00  97.10 -0.58
## dis        7 466   3.80   2.11   3.19    3.54   1.91   1.13  12.13  11.00  1.00
## rad        8 466   9.53   8.69   5.00    8.70   1.48   1.00  24.00  23.00  1.01
## tax        9 466 409.50 167.90 334.50  401.51 104.52 187.00 711.00 524.00  0.66
## ptratio   10 466  18.40   2.20  18.90   18.60   1.93  12.60  22.00   9.40 -0.75
## lstat     11 466  12.63   7.10  11.35   11.88   7.07   1.73  37.97  36.24  0.91
## medv      12 466  22.59   9.24  21.20   21.63   6.00   5.00  50.00  45.00  1.08
## target*   13 466   1.49   0.50   1.00    1.49   0.00   1.00   2.00   1.00  0.03
##         kurtosis   se
## zn          3.81 1.08
## indus      -1.24 0.32
## chas*       9.15 0.01
## nox        -0.04 0.01
## rm          1.54 0.03
## age        -1.01 1.31
## dis         0.47 0.10
## rad        -0.86 0.40
## tax        -1.15 7.78
## ptratio    -0.40 0.10
## lstat       0.50 0.33
## medv        1.37 0.43
## target*    -2.00 0.02

Data Visualization

The histograms show that most of the predictors are skewed; only medv and rm are close to normally distributed. age and ptratio are left-skewed, whereas dis, lstat and zn are right-skewed. indus, rad and tax have bimodal distributions.
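
The skew directions described above can also be checked numerically. The sketch below uses the plain moment-based skewness (third central moment divided by the 1.5 power of the second); the helper name `skewness_g1` and the two toy vectors are hypothetical, and the summary table in this report may use a slightly different small-sample scaling, so values need not match it exactly.

```r
# Moment-based skewness: positive for a long right tail, negative for a long left tail
skewness_g1 <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)       # toy data with no skew
right_skewed <- c(0, 0, 0, 0, 100)     # toy data shaped like zn: mostly zero, a few large values

skew_sym   <- skewness_g1(symmetric)
skew_right <- skewness_g1(right_skewed)
print(c(symmetric = skew_sym, right_skewed = skew_right))
```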

The box plots show outliers in many of the variables. rad and tax have a large interquartile range. The spread of age, dis, nox, rad, tax and zn differs between the two values of the response variable, so we consider adding quadratic terms for these variables.

In the correlation table below, nox, age, rad, tax and indus have the strongest positive correlations with target (all above 0.6), and dis has the strongest negative correlation (-0.62). The remaining variables are more weakly correlated with target.

Variable	Correlation with target
target	1.0000000
nox	0.7261062
age	0.6301062
rad	0.6281049
tax	0.6111133
indus	0.6048507
lstat	0.4691270
ptratio	0.2508489
chas	0.0800419
rm	-0.1525533
medv	-0.2705507
zn	-0.4316818
dis	-0.6186731
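
A table like the one above can be produced by correlating every column with the response and sorting. This is a minimal sketch on simulated data (the variables only borrow the names nox and dis; the numbers are made up). It assumes target is stored as numeric 0/1; if it is a factor, as in the glimpse output earlier, it would first need converting, for example with `as.numeric(as.character(target))`.

```r
# Simulated stand-in data; only the variable names come from the report
set.seed(1)
n      <- 200
nox    <- runif(n, 0.39, 0.87)
dis    <- 12 - 10 * nox + rnorm(n)                       # built to fall as nox rises
target <- as.integer(nox + rnorm(n, sd = 0.1) > median(nox))
demo   <- data.frame(nox, dis, target)

# Correlation of every column with target, largest first
cors <- cor(demo)[, "target"]
print(sort(cors, decreasing = TRUE))
```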

Data preparation

The analysis above shows that the data needs some transformations because of skewness and outliers. For the transformed data set we remove the tax variable because of multicollinearity, take log transformations of age and lstat because of their skewness, and use quadratic terms for rad, nox and zn because their spread differs between the two classes.
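
The preparation steps above might look like the following sketch. It assumes the transformed values replace the original columns in place, which is a guess based on the Model 2 output listing a single coefficient for each of zn, nox, rad, age and lstat. The data frame `prep_demo` is a hypothetical three-row stand-in.

```r
# Hypothetical three-row stand-in for the training data
prep_demo <- data.frame(
  zn    = c(0, 30, 80),
  nox   = c(0.605, 0.428, 0.520),
  rad   = c(5, 6, 24),
  age   = c(96.2, 7.8, 71.3),
  lstat = c(3.70, 5.19, 7.67),
  tax   = c(403, 300, 666)
)

# Drop tax, then overwrite the remaining columns with their transformed values
keep <- setdiff(names(prep_demo), "tax")
transformed <- transform(
  prep_demo[, keep],
  age   = log(age),     # log transform for skewness
  lstat = log(lstat),
  rad   = rad^2,        # quadratic terms (assumed to replace the originals)
  nox   = nox^2,
  zn    = zn^2
)
print(transformed)
```

Note that log(zn) would not be usable here because zn contains zeros; squaring avoids that problem.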

Building Models

Model 1

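In this model we use all 12 predictors, untransformed, and fit a binary logistic regression with glm. Note that tax is still included here, as the output below shows; it is removed only in the transformed data set used for Model 2. nox, age, dis, rad, tax, ptratio and medv are significant at the 0.05 level, zn is marginal (p = 0.057), and indus, chas, rm and lstat have large p-values.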

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8464  -0.1445  -0.0017   0.0029   3.4665  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
## zn           -0.065946   0.034656  -1.903  0.05706 .  
## indus        -0.064614   0.047622  -1.357  0.17485    
## chas          0.910765   0.755546   1.205  0.22803    
## nox          49.122297   7.931706   6.193 5.90e-10 ***
## rm           -0.587488   0.722847  -0.813  0.41637    
## age           0.034189   0.013814   2.475  0.01333 *  
## dis           0.738660   0.230275   3.208  0.00134 ** 
## rad           0.666366   0.163152   4.084 4.42e-05 ***
## tax          -0.006171   0.002955  -2.089  0.03674 *  
## ptratio       0.402566   0.126627   3.179  0.00148 ** 
## lstat         0.045869   0.054049   0.849  0.39608    
## medv          0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9
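
For a 0/1 response, the AIC in this output is the residual deviance plus twice the number of estimated coefficients (intercept included). The sketch below checks that relation on simulated data and then repeats the arithmetic with the values reported for Model 1. The data frame `demo` and its variables are hypothetical and unrelated to the crime data.

```r
# Simulated stand-in data (hypothetical; not the crime data set)
set.seed(621)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.2 * x1 - 0.8 * x2))
demo <- data.frame(y, x1, x2)

# Same call shape as in the report: response against all other columns
fit <- glm(y ~ ., family = "binomial", data = demo)

k           <- length(coef(fit))            # coefficients, intercept included
aic_by_hand <- deviance(fit) + 2 * k        # residual deviance + 2k
print(c(AIC = AIC(fit), by_hand = aic_by_hand))

# Same arithmetic with the values reported for Model 1: 13 coefficients
model1_aic <- 192.05 + 2 * 13
print(model1_aic)
```

This is why a model with a slightly higher residual deviance can still have a lower AIC if it uses fewer coefficients.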

Model 2

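To try to improve the model, the second model runs glm on the transformed variables. The transformations do not help: age is no longer significant (p = 0.16), rm and lstat still have large p-values, the residual deviance rises from 192.05 to 203.77 and the AIC rises from 218.05 to 227.77, so this model fits worse than Model 1.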

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_train_trans)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0433  -0.2233   0.0000   0.0000   3.2207  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -28.050185   5.743773  -4.884 1.04e-06 ***
## zn           -0.003025   0.001538  -1.966 0.049279 *  
## indus        -0.111100   0.045632  -2.435 0.014904 *  
## chas          1.341454   0.722182   1.858 0.063240 .  
## nox          44.473984   7.183977   6.191 5.99e-10 ***
## rm           -0.356816   0.652810  -0.547 0.584664    
## age           0.896847   0.639678   1.402 0.160907    
## dis           0.661694   0.207061   3.196 0.001395 ** 
## rad           0.048170   0.012613   3.819 0.000134 ***
## ptratio       0.342741   0.111882   3.063 0.002188 ** 
## lstat         0.435860   0.649096   0.671 0.501911    
## medv          0.149548   0.059824   2.500 0.012425 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 203.77  on 454  degrees of freedom
## AIC: 227.77
## 
## Number of Fisher Scoring iterations: 10

Model 3

In our third model we use stepAIC from the MASS package to select predictors, starting from Model 1 on the untransformed data. The selected model keeps eight predictors (zn, nox, age, dis, rad, tax, ptratio and medv), all of which have significant p-values (<= 0.05), and its AIC falls to 215.32.

## 
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     medv, family = "binomial", data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8295  -0.1752  -0.0021   0.0032   3.4191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -37.415922   6.035013  -6.200 5.65e-10 ***
## zn           -0.068648   0.032019  -2.144  0.03203 *  
## nox          42.807768   6.678692   6.410 1.46e-10 ***
## age           0.032950   0.010951   3.009  0.00262 ** 
## dis           0.654896   0.214050   3.060  0.00222 ** 
## rad           0.725109   0.149788   4.841 1.29e-06 ***
## tax          -0.007756   0.002653  -2.924  0.00346 ** 
## ptratio       0.323628   0.111390   2.905  0.00367 ** 
## medv          0.110472   0.035445   3.117  0.00183 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 197.32  on 457  degrees of freedom
## AIC: 215.32
## 
## Number of Fisher Scoring iterations: 9
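
Because Model 3 is nested inside Model 1 (same data, four predictors dropped), the two can also be compared with a likelihood-ratio test on the reported residual deviances. This check is not part of the original analysis; it is a sketch using only the numbers printed in the two summaries above.

```r
# Residual deviances and degrees of freedom as reported in the summaries
dev_model1 <- 192.05; df_model1 <- 453   # full model, 12 predictors
dev_model3 <- 197.32; df_model3 <- 457   # stepAIC model, 8 predictors

lr_stat <- dev_model3 - dev_model1       # increase in deviance from dropping 4 predictors
lr_df   <- df_model3 - df_model1         # number of predictors dropped
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The p-value is well above 0.05 (roughly 0.26), so dropping indus, chas, rm and lstat does not significantly worsen the fit. With the fitted objects available, `anova(model3, model1, test = "Chisq")` would run the same test; the object names there are hypothetical.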

Choosing the model

We create a table of metrics for the three models built above. From the ROC curves, Model 1 and Model 3 have almost the same AUC and Model 2 has a lower AUC, so we compare Models 1 and 3. Model 3 has the lower AIC (215.32 versus 218.05), which is preferable, and it needs fewer coefficients (9 versus 13, counting the intercept), while the other metrics are almost the same for the two models. We therefore choose Model 3 as our best model.

Metric	Model 1	Model 2	Model 3
Accuracy	0.9163090	0.9055794	0.9120172
Class. Error Rate	0.0836910	0.0944206	0.0879828
Sensitivity	0.9039301	0.8733624	0.9039301
Specificity	0.9282700	0.9367089	0.9198312
Precision	0.9241071	0.9302326	0.9159292
F1	0.9139073	0.9009009	0.9098901
AIC	218.0469179	227.7670973	215.3228528
Coefficients (incl. intercept)	13	12	9
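
The metrics in the table above can be computed from predicted probabilities and actual classes along the following lines. This is a sketch in base R rather than the code used for the report: the function names and the eight toy observations are hypothetical, a probability strictly above 0.5 is assumed to mean class 1, and the AUC uses the rank (Mann-Whitney) formula.

```r
# Confusion-matrix metrics at a given probability threshold
classification_metrics <- function(actual, prob, threshold = 0.5) {
  predicted <- as.integer(prob > threshold)
  tp <- sum(predicted == 1 & actual == 1)   # true positives
  tn <- sum(predicted == 0 & actual == 0)   # true negatives
  fp <- sum(predicted == 1 & actual == 0)   # false positives
  fn <- sum(predicted == 0 & actual == 1)   # false negatives
  accuracy    <- (tp + tn) / length(actual)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision   <- tp / (tp + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  c(accuracy = accuracy, error_rate = 1 - accuracy,
    sensitivity = sensitivity, specificity = specificity,
    precision = precision, f1 = f1)
}

# AUC as the share of (positive, negative) pairs ranked in the right order
auc_rank <- function(actual, prob) {
  r  <- rank(prob)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example (made-up values)
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob   <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1)

metrics <- classification_metrics(actual, prob)
auc     <- auc_rank(actual, prob)
print(round(metrics, 4))
print(auc)
```

With a fitted model, `prob` would come from `predict(model, newdata, type = "response")`, where `model` and `newdata` stand for the chosen glm object and the evaluation data.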

Link for the .rmd

DATA 621 - Homework 3

Peter