1. Data Exploration

The “Neighbourhood crime data” training data set contains 466 rows and 13 columns. The variables are thought to have a positive or negative effect on the crime rate being above median crime rate.Running a summary() function on the data set, we are able to get the mean, median, first and third quartile and the minimum and maximum values for each variable. We included a correlation plot and pairs plot to visualize the relationship among the variables.

We explored the structure of the variables for both the training and evaluation data sets and finally observed how Target variable is affected by other factors.

2. Data Preparation

Interestingly this data was much clean and there were no NA values identified in the data set. But going through the dataset description we could identify that some value though were numerical, but were representing certin classes and therefore should be converted into factors to help come up with right model. These included chas and rad.

3. Build Models

We focussed on building a logistic regression model since the target variable was binary variable with values limited to 0 or 1. Therefore, we used the glm package available within R to perform this. We build 3 LR models by applying the target variable against different dependent variables, and in one model we applied against all the dependent variables. Once we executed the models, we captured the AIC, null and residual deviance values in a table for easy representation of basic parameters of these models.

4. Select Models

Out of the three models, the model that included all the parameters had least AIC value, as well as least residual deviance value. That clearly suggested that this model is better at learning. We also generated the AIC for other models as well using stepAIC from MASS package.

Appendix

Library

##   zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0

Data Exploration & Preparation

Summary of data

See a summary of each column in the train_df set

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Structure of the data

## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...
## 'data.frame':    40 obs. of  12 variables:
##  $ zn     : int  0 0 0 0 0 25 25 0 0 0 ...
##  $ indus  : num  7.07 8.14 8.14 8.14 5.96 5.13 5.13 4.49 4.49 2.89 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.469 0.538 0.538 0.538 0.499 0.453 0.453 0.449 0.449 0.445 ...
##  $ rm     : num  7.18 6.1 6.5 5.95 5.85 ...
##  $ age    : num  61.1 84.5 94.4 82 41.5 66.2 93.4 56.1 56.8 69.6 ...
##  $ dis    : num  4.97 4.46 4.45 3.99 3.93 ...
##  $ rad    : int  2 4 4 4 5 8 8 3 3 2 ...
##  $ tax    : int  242 307 307 307 279 284 284 247 247 276 ...
##  $ ptratio: num  17.8 21 21 21 19.2 19.7 19.7 18.5 18.5 18 ...
##  $ lstat  : num  4.03 10.26 12.8 27.71 8.77 ...
##  $ medv   : num  34.7 18.2 18.4 13.2 21 18.7 16 26.6 22.2 21.4 ...

NA check

## character(0)

There are no NAs observed

The summary() function for the training and testing data sets indicates that there are no missing values in the data. The response variable “target” is binary with 1 indicates crime rate is above median cirme rate and 0 indicates crime rate is not above median crime rate.

Let’s observe how the target variable is effected by other factors:
1. The plot of “target” against “age” shows target equalling one (above median crime rate) increases as the proportion of owner-occupied units built prior to 1940 increaases; the boxplot further shows that a larger mean of proportions of owner-occupied units built prior to 1940 is assoicated with higher crime rate.
2. Plots of crime rate against pupil-teacher ratio indicate higher crime rate “1” is associated with higher pupil-teacher ratio.

Modeling

Binary Logistic Regression

We are running Binary Logistic regression model with three 3 different set of parameters

Modelling with Target ~ Age

## 
## Call:
## glm(formula = target ~ age, family = binomial(), data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9906  -0.6040  -0.1609   0.6659   2.9096  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.773112   0.465483  -10.25   <2e-16 ***
## age          0.066060   0.005922   11.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 424.75  on 464  degrees of freedom
## AIC: 428.75
## 
## Number of Fisher Scoring iterations: 5

Modelling with Target ~ ptratio

## 
## Call:
## glm(formula = target ~ ptratio, family = binomial(), data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5439  -1.1075  -0.7538   1.0160   1.7812  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.51685    0.86035  -5.250 1.52e-07 ***
## ptratio      0.24303    0.04617   5.264 1.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 615.64  on 464  degrees of freedom
## AIC: 619.64
## 
## Number of Fisher Scoring iterations: 4

Modelling with Target ~ .(every other variable)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5265  -0.0409   0.0000   0.0001   4.3848  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.526e+01  5.099e+03  -0.011   0.9914    
## zn          -1.609e-01  6.574e-02  -2.447   0.0144 *  
## indus       -1.562e-01  1.166e-01  -1.340   0.1802    
## chas1       -2.603e-01  9.626e-01  -0.270   0.7869    
## nox          6.863e+01  1.362e+01   5.038 4.71e-07 ***
## rm          -1.225e+00  1.010e+00  -1.213   0.2250    
## age          1.871e-02  1.569e-02   1.193   0.2330    
## dis          5.351e-01  2.671e-01   2.003   0.0452 *  
## rad2        -4.532e-01  7.114e+03   0.000   0.9999    
## rad3         1.783e+01  5.099e+03   0.003   0.9972    
## rad4         2.221e+01  5.099e+03   0.004   0.9965    
## rad5         1.950e+01  5.099e+03   0.004   0.9969    
## rad6         1.738e+01  5.099e+03   0.003   0.9973    
## rad7         2.700e+01  5.099e+03   0.005   0.9958    
## rad8         2.564e+01  5.099e+03   0.005   0.9960    
## rad24        4.404e+01  5.457e+03   0.008   0.9936    
## tax         -9.491e-03  5.442e-03  -1.744   0.0811 .  
## ptratio      4.824e-02  2.040e-01   0.236   0.8131    
## lstat        6.778e-02  6.441e-02   1.052   0.2927    
## medv         2.195e-01  9.964e-02   2.203   0.0276 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 116.98  on 446  degrees of freedom
## AIC: 156.98
## 
## Number of Fisher Scoring iterations: 20

Comparing different models performance

Model AIC Null.Deviance Resid.Deviance
2 target ~ age 428.7471 645.8758 464
21 target ~ ptratio 619.6385 645.8758 464
3 target ~ . 156.9822 645.8758 446

Looking at the table, we can identify on a high level that 3rd model that includes all the parameters is better suited. Therefore, let’s come up with a confusion matrix for 3rd model that includes all the parameters.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 233  10
##          1   4 219
##                                           
##                Accuracy : 0.97            
##                  95% CI : (0.9501, 0.9835)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9399          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.9563          
##             Specificity : 0.9831          
##          Pos Pred Value : 0.9821          
##          Neg Pred Value : 0.9588          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4700          
##    Detection Prevalence : 0.4785          
##       Balanced Accuracy : 0.9697          
##                                           
##        'Positive' Class : 1               
## 

Using StepAIC

Using the MASS package provided ‘stepAIC’ lets try to further refine the available models within it

## Start:  AIC=156.98
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##           Df Deviance    AIC
## - ptratio  1   117.04 155.04
## - chas     1   117.06 155.06
## - lstat    1   118.07 156.07
## - age      1   118.44 156.44
## - rm       1   118.50 156.50
## - indus    1   118.82 156.82
## <none>         116.98 156.98
## - tax      1   120.42 158.42
## - dis      1   121.06 159.06
## - medv     1   122.98 160.98
## - zn       1   125.74 163.74
## - nox      1   185.39 223.39
## - rad      8   233.74 257.74
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=155.04
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     lstat + medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##         Df Deviance    AIC
## - chas   1   117.11 153.11
## - lstat  1   118.14 154.15
## - age    1   118.46 154.46
## - rm     1   118.53 154.53
## <none>       117.04 155.04
## - indus  1   119.35 155.35
## - tax    1   120.42 156.42
## - dis    1   121.17 157.17
## - medv   1   124.02 160.02
## - zn     1   127.07 163.07
## - nox    1   187.89 223.89
## - rad    8   242.93 264.93
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=153.11
## target ~ zn + indus + nox + rm + age + dis + rad + tax + lstat + 
##     medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##         Df Deviance    AIC
## - lstat  1   118.17 152.17
## - age    1   118.46 152.46
## - rm     1   118.54 152.54
## <none>       117.11 153.11
## - indus  1   120.17 154.17
## - tax    1   120.66 154.66
## - dis    1   121.41 155.41
## - medv   1   124.07 158.07
## - zn     1   127.10 161.10
## - nox    1   190.43 224.43
## - rad    8   247.55 267.55
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=152.17
## target ~ zn + indus + nox + rm + age + dis + rad + tax + medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##         Df Deviance    AIC
## <none>       118.17 152.17
## - age    1   120.74 152.74
## - indus  1   120.93 152.93
## - rm     1   121.05 153.05
## - tax    1   121.73 153.73
## - dis    1   122.35 154.35
## - medv   1   125.18 157.18
## - zn     1   127.58 159.58
## - nox    1   191.60 223.60
## - rad    8   249.92 267.92
## 
## Call:
## glm(formula = target ~ zn + indus + nox + rm + age + dis + rad + 
##     tax + medv, family = binomial, data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3520  -0.0443   0.0000   0.0001   4.3170  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.053e+01  3.170e+03  -0.016   0.9873    
## zn          -1.480e-01  5.772e-02  -2.564   0.0104 *  
## indus       -1.613e-01  9.835e-02  -1.640   0.1009    
## nox          6.718e+01  1.255e+01   5.353 8.64e-08 ***
## rm          -1.462e+00  8.701e-01  -1.681   0.0928 .  
## age          2.172e-02  1.364e-02   1.592   0.1113    
## dis          5.469e-01  2.689e-01   2.034   0.0420 *  
## rad2        -1.873e-02  4.418e+03   0.000   1.0000    
## rad3         1.695e+01  3.170e+03   0.005   0.9957    
## rad4         2.139e+01  3.170e+03   0.007   0.9946    
## rad5         1.839e+01  3.170e+03   0.006   0.9954    
## rad6         1.661e+01  3.170e+03   0.005   0.9958    
## rad7         2.563e+01  3.170e+03   0.008   0.9935    
## rad8         2.434e+01  3.170e+03   0.008   0.9939    
## rad24        4.192e+01  3.387e+03   0.012   0.9901    
## tax         -8.591e-03  4.788e-03  -1.794   0.0728 .  
## medv         2.047e-01  8.269e-02   2.475   0.0133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 118.17  on 449  degrees of freedom
## AIC: 152.17
## 
## Number of Fisher Scoring iterations: 19
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 233  10
##          1   4 219
##                                           
##                Accuracy : 0.97            
##                  95% CI : (0.9501, 0.9835)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9399          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.9563          
##             Specificity : 0.9831          
##          Pos Pred Value : 0.9821          
##          Neg Pred Value : 0.9588          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4700          
##    Detection Prevalence : 0.4785          
##       Balanced Accuracy : 0.9697          
##                                           
##        'Positive' Class : 1               
## 

Plotting ROC

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases