1 Introduction

The objective of this assignment is to predict if a given neighborhood in Boston is likely to experience a high crime rate. For categorical purposes, we define a neighborhood crime rate as “high” if it exceeds the citywide median crime rate.

Because we are interested in making simple yes/no determinations, we will build various binary logistic regression models to make our predictions.

We will train our models using a prescribed data set that captures a variety of metrics for each Boston neighborhood.

2 Data Collection

The training data comprises 466 observations and 13 variables.

Below is a brief description of the variables in our data set:

zn indus chas nox rm age dis rad tax ptratio black lstat medv target
0 19.58 0 0.605 7.93 96.2 2.05 5 403 14.7 369 3.70 50.0 1
0 19.58 1 0.871 5.40 100.0 1.32 5 403 14.7 397 26.82 13.4 1
0 18.10 0 0.740 6.49 100.0 1.98 24 666 20.2 387 18.85 15.4 1
30 4.93 0 0.428 6.39 7.8 7.04 6 300 16.6 375 5.19 23.7 0
0 2.46 0 0.488 7.16 92.2 2.70 3 193 17.8 394 4.82 37.9 0
0 8.56 0 0.520 6.78 71.3 2.86 5 384 20.9 396 7.67 26.5 0

2.1 Data Overview

The response variable, target, is binary. There is one binary variable, chas, among our predictor variables. Four of the predictors are expressed as proportions, with values ranging from 0 to 100:

  • indus
  • indust
  • age
  • lstat

Below is the descriptive summary of all of our variables:

       zn          indus           chas           nox             rm           age     
 Min.   :  0   Min.   : 0.5   Min.   :0.00   Min.   :0.39   Min.   :3.9   Min.   :  3  
 1st Qu.:  0   1st Qu.: 5.1   1st Qu.:0.00   1st Qu.:0.45   1st Qu.:5.9   1st Qu.: 44  
 Median :  0   Median : 9.7   Median :0.00   Median :0.54   Median :6.2   Median : 77  
 Mean   : 12   Mean   :11.1   Mean   :0.07   Mean   :0.55   Mean   :6.3   Mean   : 68  
 3rd Qu.: 16   3rd Qu.:18.1   3rd Qu.:0.00   3rd Qu.:0.62   3rd Qu.:6.6   3rd Qu.: 94  
 Max.   :100   Max.   :27.7   Max.   :1.00   Max.   :0.87   Max.   :8.8   Max.   :100  
      dis            rad            tax         ptratio         black         lstat         medv   
 Min.   : 1.1   Min.   : 1.0   Min.   :187   Min.   :12.6   Min.   :  0   Min.   : 2   Min.   : 5  
 1st Qu.: 2.1   1st Qu.: 4.0   1st Qu.:281   1st Qu.:16.9   1st Qu.:376   1st Qu.: 7   1st Qu.:17  
 Median : 3.2   Median : 5.0   Median :334   Median :18.9   Median :391   Median :11   Median :21  
 Mean   : 3.8   Mean   : 9.5   Mean   :410   Mean   :18.4   Mean   :357   Mean   :13   Mean   :23  
 3rd Qu.: 5.2   3rd Qu.:24.0   3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:396   3rd Qu.:17   3rd Qu.:25  
 Max.   :12.1   Max.   :24.0   Max.   :711   Max.   :22.0   Max.   :397   Max.   :38   Max.   :50  
     target    
 Min.   :0.00  
 1st Qu.:0.00  
 Median :0.00  
 Mean   :0.49  
 3rd Qu.:1.00  
 Max.   :1.00  

There are no missing values in a data set.

Let’s take a closer look at each variable individually.

3 Response Variables

3.1 zn : Zoned

The zn variable has a significant positive skew, as evident in the histogram, box plot, and qq plots below. We also note that 73% of observations (339 of 466 total) have a value of 0. Furthermore, there appears to be relationship between crime rates and zn: 93% of high crime areas have no land zoned for large lots. Also, about half of low crime areas exclude large-lot zoning. The boxplots by crime type indicate a much higher variance of zn values for low crime neighborhoods compared to high crime areas.

Below is the summarized table of observed zn observations:


  0  10  20  25  30  35  40  45  50  55  60  70  75  80  85  90  95 100 Sum 
339  10  36   8   9   9   7   6   3   3   4   3   3  15   2   4   4   1 466 

Below are summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
    0.0     0.0     0.0    11.6    16.2   100.0    23.4     2.2     6.8 

below are the relevant plots:


3.2 indus: Industry

The variable indus represents the proportion of non-retail business acres per suburb. The histogram below indicates a bi-modal quality to the variable’s distribution. The distribution has a very mild positive skew, the kurtosis is significantly below that of a normal distribution. In the boxplot below, we see that high crime areas tend to have a higher industry concentration compared to low crime areas.

Below is the summary table, with values rounded to the nearest percentage:


  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  18  20  22  26  28 Sum 
  1   9  31  29  30  27  45  21  26  11  27  14   4   6   9   3 121  28  14   5   5 466 

Here are the summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   0.46    5.15    9.69   11.11   18.10   27.74    6.85    0.29    1.76 

Below are the plots:

3.3 chas: Charles River

The variable chas is binary with the value 1 indicating that the neighborhood borders the Charles River. Only 7% of all neighborhoods (33 in total) border the river.

Let’s look at the summary of observations:


  0   1 Sum 
433  33 466 

Below is the scatter plot for ‘chas’ and ‘target’:

     
        0   1 Sum
  0   225 208 433
  1    12  21  33
  Sum 237 229 466

Below for the t-test details:


    2-sample test for equality of proportions with continuity correction

data:  table(crime$chas, crime$target)
X-squared = 2, df = 1, p-value = 0.1
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.031  0.343
sample estimates:
prop 1 prop 2 
  0.52   0.36 


3.4 nox: Nitrogen Oxide

This variable exhibits a moderate, positive skew, with a kurtosis value similar to that of a normally distributed variable. The final boxplot below indicates higher median nox values in high crime areas vis-a-vis the low crime counterparts. We also see moderately higher nox variance in high crime areas.

Here is a summary of observed values, rounded to the nearest 0.1 parts per 10 million:


0.4 0.5 0.6 0.7 0.8 0.9 Sum 
122 149  95  76   8  16 466 

Below are summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   0.39    0.45    0.54    0.55    0.62    0.87    0.12    0.75    2.98 

Below are the plots:


3.5 rm : Rooms

The predictor rm is count measure describing the average number of rooms per dwelling. It appears that lower crime areas, on average, are associated with a higher number of rooms per dwelling; however, the room counts by crime type are fairly close in value.

Here is a distribution of room counts, rounded to the nearest whole number:


  4   5   6   7   8   9 Sum 
  4  37 284 115  23   3 466 

Summary statistics are below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   3.86    5.89    6.21    6.29    6.63    8.78    0.70    0.48    4.56 

Below are the plots:


3.6 age: Age

The predictor ‘age’ has a significant left skew. In the boxplots below, we see a significantly higher mean percentage of older homes in high crime areas.

Here is a table of age values rounded to the nearest 5th percentage.


  5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100 Sum 
  6   7   9  20   6  25  22  18  17  15  16  18  14  19  22  22  33  43  64  70 466 

Below is the summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   2.90   43.88   77.15   68.37   94.10  100.00   28.32   -0.58    2.00 

Below are the plots:


3.7 dist: Distance

The predictor dist describes the average distance to Boston employment centers. The variable is moderately right skewed. Based on the boxplots below, we see see that low crime areas are associated with higher average distances to employment centers.

Here is a table of distance values, rounded to the nearest unit:


  1   2   3   4   5   6   7   8   9  11  12 Sum 
 28 145  83  65  47  44  24  16   9   4   1 466 

Summary statistics are as follows:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
    1.1     2.1     3.2     3.8     5.2    12.1     2.1     1.0     3.5 

Below are the plots:

3.8 rad: Radial highways

The rad variable is an integer-valued index measure indicating an area’s accessibility to radial highways. In the boxplots below, there appears to be a significant positive association between high crime rates and rad value.

The distribution of this variable is tri-modal, with values clustering around index values of 4, 5, and 24.

Here is a numerical distribution of the index values:


  1   2   3   4   5   6   7   8  24 Sum 
 17  20  36 103 109  25  15  20 121 466 

Below are summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
    1.0     4.0     5.0     9.5    24.0    24.0     8.7     1.0     2.1 

Below are the plots:

3.9 tax: Tax

The tax variable refers to the the tax rate per $10k of property value. High crime areas also appear to have a strong, positive association with the tax value–see the boxplots below.

Here is distribution of tax values, rounded the nearest $25:


175 200 225 250 275 300 325 350 375 400 425 475 675 700 Sum 
  1  14  35  20  61  82  24  11  13  51  27   1 121   5 466 

Below is the summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
 187.00  281.00  334.50  409.50  666.00  711.00  167.90    0.66    1.86 

Below are the plots:

3.10 ptratio: pupil-to-student

The predictor ptratio indicates the average school, pupil-to-student ratio, and has a right skewed distribution. The boxplots below indicate a positive relationship between ptratio and high crime.

Here is a distribution of ptratio values, rounded to the nearest whole number:


 13  14  15  16  17  18  19  20  21  22 Sum 
 15   2  55  21  45  67  65 145  49   2 466 

Below are the summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
  12.60   16.90   18.90   18.40   20.20   22.00    2.20   -0.76    2.61 

Below are the plots:

3.11 lstat: Lower Status

The variable lstat indicates the proportion of the population deemed to be of lower status. The variable is right skewed, and high crime areas tend to have be associated with larger lstat values.

Here is a summary table of lstat values rounded to the nearest 2:


  2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38 Sum 
  9  48  59  51  60  44  45  35  38  17  14  16   8   4  12   1   3   1   1 466 

Summary statistics are provided below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   1.73    7.04   11.35   12.63   16.93   37.97    7.10    0.91    3.52 

Below are the plots:

3.12 medv: Median value

The last feature variable in our data set is medv, which represents the median value of residential homes in a given area, in $1,000s. The variable is right skewed, and high values of medv appear to be associated with lower crime rates.

Let’s look at medv values, rounded to the nearest $2k:


  6   9  12  15  18  21  24  27  30  33  36  39  42  45  48  51  54  57  60  63  66  69  72  75 Sum 
  2   2  16  11  18  37  31  38  73  58  65  10  19  14  14  12  11   5   1   4   5   3   2  15 466 

Below is the summary statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
    5.0    17.0    21.2    22.6    25.0    50.0     9.2     1.1     4.4 

Below are the plots:

4 Target Variable: target

The binary response variable, target, indicates whether a particular area has a crime rate above the median Boston crime rate.


  0   1 Sum 
237 229 466 

The split is very close to half and half (50/50).


5 Bivariate Relationships

Let’s look at scatter plot/correlation matrix to get a better feel for the relationships between pairs of variables.

5.1 Data Preparation

In this section, we describe data modifications and transformations applied before fitting our regression models.

5.2 Missing Values

As mentioned previously, our data set contains no missing elements; so we do not need to use any imputation procedures.

5.3 Variable Transformations

In binary logistic regression, it is desirable to have predictor variables that are normally distributed, whenever possible. Let’s review predictor variables and discuss potential transformation procedures.

zn

The zn predictor is highly right skewed. This variable also has many zero values–over 70% of observations. Let’s compare the percentage of high crime areas for zero-valued zn vs. values 1 or higher:

     zn_zero zn_1_plus
[1,]   0.631     0.118

    2-sample test for equality of proportions with continuity correction

data:  table(crime$zn_zero, crime$target)
X-squared = 100, df = 1, p-value <2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.432 0.595
sample estimates:
prop 1 prop 2 
 0.882  0.369 

The difference in these proportions is statistically significant.

Let’s look at a jittered scatter plot of nonzero zn by target value:

indus
Let’s create a new categorical variable, indus_high, with a value of 1 if the industry proportion is close to the higher valued mode, and value of 0 if the industry value is close to the lower mode.

Comparing the high crime rates for each value of the new indust_high variable:

     val_0 val_1
[1,] 0.231  0.92

The percentage of high crime areas are very different for each value of indus_high.

nox

The variable nox is moderately right skewed. let’s perform the box-cox procedure to determine an appropriate transformation:

Fitted parameters:
 lambda    beta sigmasq 
 -0.949  -0.862   0.126 

Convergence code returned by optim: 0

Based on the box-cox procedure output, We will create a new variable nox_mod, that is the reciprocal of the raw nox value. We then multiply the reciprocal by -1 to preserve the direction of the original relationship.

The transformed variable is more symmetrical, with a skewness value closer to zero:

[1] 0.0487

The variances for each target value are also similar–see below:

rm

The variable rm has a mild positive skew and high kurtosis value. Let’s look at the suggested box cox transformation:

Fitted parameters:
 lambda    beta sigmasq 
 0.2038  2.2239  0.0263 

Convergence code returned by optim: 0

Based on this output, we will transform the variable by taking the quarter root of the raw value.

The transformed variable is more symmetric, with a skewness value of:

[1] 0.0416

The variances of rm_mod for each target value appear to be fairly similar:

age

Age has a moderate left skew. Let’s review the suggested box-cox transformation:

Fitted parameters:
  lambda     beta  sigmasq 
    1.32   205.70 10492.78 

Convergence code returned by optim: 0

We apply the suggested power transformation of 1.3 and store in a new variable, age_mod.

dis

The predictor dis has a moderate positive skew. Let’s transform using the box-cox transformation:

Fitted parameters:
 lambda    beta sigmasq 
 -0.147   1.072   0.205 

Convergence code returned by optim: 0

Given that the value of the lambda parameter is fairly close to 0, we will use the log transformation and save to results to a new variable, dis_mod.

The transformed distribution has improved skew:

[1] 0.143

The transformed variable has similar variances across each target value.

rad

     val_0 val_1
[1,] 0.313     1

tax

The variable tax also has a bi-modal shape, with values densely clustered around 300 and 700–with no values recording in the training data between between 470 and 665. It assigns a value of 1 when the tax value is greater than or equal to 500, and 0 otherwise. The 500 cutoff reflects an approximate halfway point between the two modal centers.

Let’s perform another sanity check to determine if there is significant relationship with our target variable:

     val_0 val_1
[1,] 0.318  0.96

ptratio

The predictor variable ptratio has a moderate negative skew. Let’s perform box-cox transformations:

Fitted parameters:
  lambda     beta  sigmasq 
4.14e+00 4.57e+04 3.61e+08 

Convergence code returned by optim: 0

lstat

The lstat variable has a moderate positive skew.

Let’s look the suggested power transformation from the box-cox procedure:

Fitted parameters:
 lambda    beta sigmasq 
  0.233   3.235   1.055 

Convergence code returned by optim: 0

Based on this output, we create a new variable lstat_mod, that applies a quarter root transformation to the original variable.

The transformed variable is fairly symmetric, with a skewness value of:

[1] -0.00564

The variances of the transformed variable are also similar across each target variable value.

medv

The predictor medv has a moderate, positive skew. Let’s look a the suggested box-cox transformation:

Fitted parameters:
 lambda    beta sigmasq 
  0.235   4.469   0.693 

Convergence code returned by optim: 0

Based on this output, we will apply a quarter root transformation and assign to a new variable, medv_mod.

The newly transformed variable is virtually symmetric, with a skewness value of:

[1] 0.0402

6 Build Model

6.1 Variable Overview

Let’s take a look at the variables that we will consider for our models:

  • zn_zero: a transformed, binary variable
  • indus_high: a transformed, binary variable
  • chas: a binary variable
  • nox_mod: a transformed, continuous variable
  • rm_mod: a transformed, continuous variable
  • dis_mod: a transformed, continuous variable
  • rad_high: a transformed, binary variable. We should be not include this variable and tax_high together, given the high correlation between these variables.
  • tax_high: a transformed, binary variable. We should not include this variable in the same model as the rad_high variable.
  • lstat_mod: a transformed, continuous variable
  • medv_mod: a transformed, continuous variable. We will consider adding a quadratic term because the variances appear to be different for each target variable value.
  • age: this variable (whether transformed or not) is asymmetric and has different variances for each target value.
  • ptratio: this predictor and its transformed counterpart have moderate skew and different variances for each value of the target predictor.

6.2 Model 1: Manual Selection

In the first model, we will only include binary variables and continuous variables that are approximately normally distributed. The goal is to have a relatively simple model with interpretable coefficients.

Let’s review the VIFs for the included predictors:

target_mod    zn_zero indus_high    nox_mod     rm_mod   tax_high  lstat_mod       chas 
      2.49       1.65       4.21       4.54       1.98       2.75       2.96       1.04 

All VIF values are below 5.

Let’s review our model output:


Call:
glm(formula = target ~ zn_zero + indus_high + chas + nox_mod + 
    rm_mod + tax_high + lstat_mod + medv_mod, family = "binomial", 
    data = crime)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4574  -0.3228  -0.0541   0.3044   3.0534  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -9.906     10.622   -0.93   0.3510    
zn_zero        0.810      0.521    1.56   0.1199    
indus_high    -0.249      0.583   -0.43   0.6692    
chas           0.894      0.593    1.51   0.1321    
nox_mod        7.935      1.162    6.83  8.6e-12 ***
rm_mod        10.534      6.049    1.74   0.0816 .  
tax_high       1.748      0.674    2.59   0.0095 ** 
lstat_mod      1.371      1.191    1.15   0.2497    
medv_mod       2.125      1.558    1.36   0.1727    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 268.71  on 457  degrees of freedom
AIC: 286.7

Number of Fisher Scoring iterations: 6

Based on the Wald tests of significance, only two of our predictors, nox_mod and tax_high, are statistically significant.

Let’s visually examine whether or not there is significant interaction between these two variables:

Here is output from our revised model:


Call:
glm(formula = target ~ zn_zero + indus_high + chas + nox_mod + 
    rm_mod + tax_high + lstat_mod + medv_mod + rm_mod:lstat_mod, 
    family = "binomial", data = crime)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5923  -0.3265  -0.0496   0.3115   3.1440  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -91.598     42.313   -2.16    0.030 *  
zn_zero             0.818      0.521    1.57    0.116    
indus_high         -0.288      0.583   -0.49    0.621    
chas                0.935      0.584    1.60    0.109    
nox_mod             8.131      1.171    6.94  3.8e-12 ***
rm_mod             64.066     27.522    2.33    0.020 *  
tax_high            1.622      0.674    2.41    0.016 *  
lstat_mod          49.826     24.723    2.02    0.044 *  
medv_mod            1.255      1.600    0.78    0.433    
rm_mod:lstat_mod  -30.954     15.778   -1.96    0.050 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 264.27  on 456  degrees of freedom
AIC: 284.3

Number of Fisher Scoring iterations: 6

With the addition of the interaction term, we now see statistically significant p-values for rm_mod, lstat_stat_mod, and the interaction term.

Now, let’s create the marginal plots from earlier for our revised model:

6.3 Model 2: Simplification of Model 1:

First model included a variety of predictors that produced statistically insignificant z scores.


Call:
glm(formula = target ~ nox_mod + rm_mod + lstat_mod + tax_high + 
    rm_mod:lstat_mod, family = "binomial", data = crime)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.725  -0.362  -0.080   0.296   2.884  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -94.141     40.418   -2.33    0.020 *  
nox_mod             7.930      0.958    8.27   <2e-16 ***
rm_mod             68.026     25.647    2.65    0.008 ** 
lstat_mod          52.244     23.980    2.18    0.029 *  
tax_high            1.377      0.547    2.52    0.012 *  
rm_mod:lstat_mod  -32.700     15.231   -2.15    0.032 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 270.69  on 460  degrees of freedom
AIC: 282.7

Number of Fisher Scoring iterations: 6

Model 2 has a few benefits as compared to model 1:

  • The model is simple, with only 4 predictors plus one interaction term. Model 2, on the other hand, has 8 unique predictors plus one interaction term
  • All predictors are statistically significant in Model 2. Model 2 also has the added benefit that it’s AIC Criterion is lower than that of Model 1. In other words, Model 2 may provide a better fit to the data overall.

6.4 Model 3: Automatic Selection Stepwise Model

In the ast model, we’ll we’ll fit a binary logistic regression using a stepwise regression procedure, with variable selection occurring in both forward and backward directions.

We’ll exclude rad_high due to its extremely high correlation with tax_high.


Call:
glm(formula = target ~ nox_mod + tax_high + dis_mod + medv_mod + 
    age_mod + ptratio + chas, family = "binomial", data = crime)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7953  -0.2878  -0.0246   0.3008   2.8622  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.00960    3.98159   -0.50  0.61375    
nox_mod     12.73699    1.73709    7.33  2.3e-13 ***
tax_high     2.28549    0.65937    3.47  0.00053 ***
dis_mod      3.18955    0.74094    4.30  1.7e-05 ***
medv_mod     6.47191    1.37538    4.71  2.5e-06 ***
age_mod      0.00646    0.00219    2.95  0.00314 ** 
ptratio      0.31871    0.09854    3.23  0.00122 ** 
chas         1.36917    0.60292    2.27  0.02315 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 244.91  on 458  degrees of freedom
AIC: 260.9

Number of Fisher Scoring iterations: 7

The stepwise procedure resulted in the selection of 7 predictor variables.

The stepwise regression model has a lower AIC value than Model 2, which may indicate a better fit.

7 Model Selection

7.1 Select Best Model

Let’s compute the associated p-value for the difference in deviance for these two models:

[1] 0.17

The p-value does not indicate a statistically significant difference. This is additional evidence that Model 2 is superior to Model 1:

Now let’s compare all three model fits, using AIC, corrected AIC, BIC, and loglikehood values:

     AIC AICc BIC loglik
[1,] 284  285 326   -132
[2,] 283  283 308   -135
[3,] 261  261 294   -122

The first three terms, AIC, AICc, and BIC, provide a consistent interpretation of model fits:

  • Model 1 has the highest AIC, AICc, and BIC values, which is indicative of poor relative fit.
  • Model 2 has values for these measures that are lower than those of Model 1, but higher than the values of Model 3. The interpretation is that Model 2 is superior to Model 2, but inferior to model 3.
  • Model 3 has the lowest measures of AIC, AICc, and BIC, which is indicative of a superior fit to the other two models.

Binary logistic regression models are fit via maximum likelihood estimation, and higher log likelihoods are associated with better model fits. One of the drawbacks of standard log-likelihood values is that is does not penalize for model size (i.e. number of parameters). Using log-likelihood measures alone to assess model fit could bias preference in favor of larger, over fitting models. There are alternative log-likelihood measures that do penalize for model size, but for now we’ll review the non-penalizing measure:

  • Model 1 has a higher log-likelihood than Model 2. The difference in values is likely explained by the additional parameters used vis-a-vis Model 1.
  • Model 3 has the highest log-likelihood, despite having fewer parameters than Model 1.

Moving forward, we choose Model 3 as the best of the models examined in this assignment. The model strikes a middle ground in terms of size in relation to the other two models, and is superior in all goodness-of-fit measures examined.

7.2 Best Model Metrics

Below is a ROC curve of our selected model and area under the curve:

$auc
Area under the curve: 0.958

As a brief aside, let’s compare Model 3’s AUC to the AUCs for Model 1 and Model 2:

[1] "Model 1: 0.95198349087023"
[1] "Model 2: 0.947082342969801"

The AUC for Model 1 is identical to Model 3’s AUC, while model 2’s AUC is lower.

We now need to choose an appropriate cutoff probability measure for predicting whether a neighborhood has a high or low crime rate. One common measure used is called Youden’s index, which attempts to maximize both sensitivity and specificity:

Using the coords() function in the pRoc package, the optimal measure is:

[1] 0.484

From a practical standpoint, this value is very close to 0.50; so we will use a probability of 50% as our cutoff.

We can now produce a confusion matrix, and various classification metrics:

          Reference
Prediction   0   1
         0 208  23
         1  29 206
$accuracy
[1] 0.888

$error_rt
[1] 0.112

$precision
[1] 0.877

$sensitivity
[1] 0.9

$specificity
[1] 0.878

$F1
[1] 0.888

Our model has relatively high values of sensitivity and specificity. We conclude that our model is fairly strong. Our last step is to judge our model against a test data set.

8 Test Data Predictions

Let’s make crime level predictions using the evaluation data set.

zn indus chas nox rm age dis rad tax ptratio black lstat medv zn_zero indus_high nox_mod rm_mod age_mod dis_mod rad_high tax_high lstat_mod medv_mod predict_prob predict_target
0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 392.8 4.03 34.7 1 0 -2.13 1.64 209.8 1.603 0 0 1.42 2.43 0.211 0
0 8.14 0 0.538 6.10 84.5 4.46 4 307 21.0 380.0 10.26 18.2 1 0 -1.86 1.57 319.8 1.496 0 0 1.79 2.06 0.771 1
0 8.14 0 0.538 6.50 94.4 4.46 4 307 21.0 387.9 12.80 18.4 1 0 -1.86 1.60 369.4 1.494 0 0 1.89 2.07 0.827 1
0 8.14 0 0.538 5.95 82.0 3.99 4 307 21.0 232.6 27.71 13.2 1 0 -1.86 1.56 307.6 1.384 0 0 2.29 1.91 0.437 0
0 5.96 0 0.499 5.85 41.5 3.93 5 279 19.2 396.9 8.77 21.0 1 0 -2.00 1.55 126.9 1.370 0 0 1.72 2.14 0.085 0
25 5.13 0 0.453 5.74 66.2 7.22 8 284 19.7 395.1 13.15 18.7 0 0 -2.21 1.55 232.9 1.978 0 0 1.90 2.08 0.071 0
25 5.13 0 0.453 5.97 93.4 6.82 8 284 19.7 378.1 14.44 16.0 0 0 -2.21 1.56 364.3 1.920 0 0 1.95 2.00 0.081 0
0 4.49 0 0.449 6.63 56.1 4.44 3 247 18.5 392.3 6.53 26.6 1 0 -2.23 1.60 187.8 1.490 0 0 1.60 2.27 0.022 0
0 4.49 0 0.449 6.12 56.8 3.75 3 247 18.5 395.1 8.44 22.2 1 0 -2.23 1.57 190.8 1.321 0 0 1.70 2.17 0.007 0
0 2.89 0 0.445 6.16 69.6 3.50 2 276 18.0 391.8 11.34 21.4 1 0 -2.25 1.58 248.5 1.251 0 0 1.83 2.15 0.005 0
0 25.65 0 0.581 5.86 97.0 1.94 2 188 19.1 370.3 25.41 17.3 1 1 -1.72 1.56 382.7 0.665 0 0 2.25 2.04 0.487 0
0 25.65 0 0.581 5.61 95.6 1.76 2 188 19.1 359.3 27.26 15.7 1 1 -1.72 1.54 375.5 0.564 0 0 2.29 1.99 0.323 0
0 21.89 0 0.624 5.64 94.7 1.98 4 437 21.2 396.9 18.34 14.3 1 1 -1.60 1.54 370.9 0.683 0 0 2.07 1.95 0.817 1
0 19.58 0 0.605 6.10 93.0 2.28 5 403 14.7 240.2 9.81 25.0 1 1 -1.65 1.57 362.3 0.826 0 0 1.77 2.24 0.744 1
0 19.58 0 0.605 5.88 97.3 2.39 5 403 14.7 348.1 12.03 19.1 1 1 -1.65 1.56 384.2 0.871 0 0 1.86 2.09 0.601 1
0 10.59 1 0.489 5.96 92.1 3.88 4 277 18.6 393.2 17.27 21.7 1 0 -2.04 1.56 357.7 1.355 0 0 2.04 2.16 0.461 0
0 6.20 0 0.504 6.55 21.4 3.38 8 307 17.4 380.3 3.76 31.5 1 0 -1.98 1.60 53.6 1.216 0 0 1.39 2.37 0.102 0
0 6.20 0 0.507 8.25 70.4 3.65 8 307 17.4 378.9 3.95 48.3 1 0 -1.97 1.70 252.3 1.295 0 0 1.41 2.64 0.775 1
22 5.86 0 0.431 6.96 6.8 8.91 7 330 19.1 386.1 3.53 29.6 0 0 -2.32 1.62 12.1 2.187 0 0 1.37 2.33 0.035 0
90 2.97 0 0.400 7.09 20.8 7.31 1 285 15.3 394.7 7.85 32.2 0 0 -2.50 1.63 51.7 1.989 0 0 1.67 2.38 0.001 0
80 1.76 0 0.385 6.23 31.5 9.09 1 241 18.2 341.6 12.93 20.1 0 0 -2.60 1.58 88.7 2.207 0 0 1.90 2.12 0.000 0
33 2.18 0 0.472 6.62 58.1 3.37 7 222 18.4 393.4 8.93 28.4 0 0 -2.12 1.60 196.5 1.215 0 0 1.73 2.31 0.045 0
0 9.90 0 0.544 6.12 52.8 2.64 4 304 18.4 396.9 5.98 22.1 1 0 -1.84 1.57 173.6 0.971 0 0 1.56 2.17 0.213 0
0 7.38 0 0.493 6.42 40.1 4.72 5 287 19.6 396.9 6.12 25.0 1 0 -2.03 1.59 121.4 1.552 0 0 1.57 2.24 0.199 0
0 7.38 0 0.493 6.31 28.9 5.42 5 287 19.6 396.9 6.15 23.0 1 0 -2.03 1.58 79.3 1.689 0 0 1.57 2.19 0.179 0
0 5.19 0 0.515 5.89 59.6 5.62 5 224 20.2 394.8 10.56 18.5 1 0 -1.94 1.56 203.2 1.725 0 0 1.80 2.07 0.484 0
80 2.01 0 0.435 6.63 29.7 8.34 4 280 17.0 390.9 5.99 24.5 0 0 -2.30 1.60 82.1 2.122 0 0 1.56 2.23 0.015 0
0 18.10 0 0.718 3.56 87.9 1.61 24 666 20.2 354.7 7.12 27.5 1 1 -1.39 1.37 336.7 0.478 1 1 1.63 2.29 0.999 1
0 18.10 1 0.631 7.02 97.5 1.20 24 666 20.2 392.1 2.96 50.0 1 1 -1.58 1.63 385.2 0.184 1 1 1.31 2.66 1.000 1
0 18.10 0 0.584 6.35 86.1 2.05 24 666 20.2 83.5 17.64 14.5 1 1 -1.71 1.59 327.7 0.719 1 1 2.05 1.95 0.875 1
0 18.10 0 0.740 5.93 87.9 1.82 24 666 20.2 69.0 34.02 8.4 1 1 -1.35 1.56 336.7 0.599 1 1 2.42 1.70 0.990 1
0 18.10 0 0.740 5.63 93.9 1.82 24 666 20.2 396.9 22.88 12.8 1 1 -1.35 1.54 366.8 0.597 1 1 2.19 1.89 0.998 1
0 18.10 0 0.740 5.82 92.4 1.87 24 666 20.2 391.4 22.11 10.5 1 1 -1.35 1.55 359.2 0.624 1 1 2.17 1.80 0.996 1
0 18.10 0 0.740 6.22 100.0 2.00 24 666 20.2 395.7 16.59 18.4 1 1 -1.35 1.58 398.1 0.696 1 1 2.02 2.07 1.000 1
0 18.10 0 0.740 5.85 96.6 1.90 24 666 20.2 240.5 23.79 10.8 1 1 -1.35 1.55 380.6 0.640 1 1 2.21 1.81 0.997 1
0 18.10 0 0.713 6.53 86.5 2.44 24 666 20.2 50.9 18.13 14.1 1 1 -1.40 1.60 329.7 0.890 1 1 2.06 1.94 0.998 1
0 18.10 0 0.713 6.38 88.4 2.57 24 666 20.2 391.4 14.65 17.7 1 1 -1.40 1.59 339.1 0.943 1 1 1.96 2.05 0.999 1
0 18.10 0 0.655 6.21 65.4 2.96 24 666 20.2 396.9 13.22 21.4 1 1 -1.53 1.58 229.2 1.086 1 1 1.91 2.15 0.998 1
0 9.69 0 0.585 5.79 70.6 2.89 6 391 19.2 396.9 14.10 18.3 1 0 -1.71 1.55 253.2 1.062 0 0 1.94 2.07 0.678 1
0 11.93 0 0.573 6.98 91.0 2.17 1 273 21.0 396.9 5.64 23.9 1 0 -1.75 1.62 352.2 0.774 0 0 1.54 2.21 0.819 1

9 Appendix

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache=FALSE, comment=NA,
                      fig.align='center',options(width = 110, digits=3))
library(knitr)
library(ggplot2)
library(moments)
library(ggthemes)
library(qqplotr)
library(gridExtra)
library(car)
library(geoR)
library(alr3)
library(rcompanion)
library(sme)
library(caret)
library(pROC)
# read in data
crime <- read.csv('https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/
dataset/crime-training-data.csv')
kable(head(crime))
options(digits=2, width=100)
(summary(crime))
addmargins(table(5*round(crime$zn/5,0)))
with(crime, c(summary(zn), SD=sd(zn), Skew=skewness(zn), Kurt=kurtosis(zn)))

h <- ggplot(crime, aes(zn)) + geom_histogram(fill = 'salmon', binwidth = 20, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of zn') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=zn)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of zn") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", zn)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of zn', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), zn)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of zn by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)

addmargins(table(round(crime$indus,0)))
with(crime, c(summary(indus), SD=sd(indus), Skew=skewness(indus), Kurt=kurtosis(indus)))
h <- ggplot(crime, aes(indus)) + geom_histogram(fill = 'salmon', binwidth = 5, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of indus') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=indus)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of indus") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", indus)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of indus', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), indus)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of indus by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)




addmargins(table(crime$chas))
ggplot(crime, aes(x=target, y=chas)) + geom_jitter(color='salmon') + theme_classic() + 
  labs(title ='Jittered Scatter Plot of chas vs.target') + theme(plot.title = element_text(hjust = 0.5)) 

addmargins(table(crime$chas, crime$target))

prop.test(table(crime$chas, crime$target))
addmargins(table(round(crime$nox,1)))
options(digits=2)
with(crime, c(summary(nox), SD=sd(nox), Skew=skewness(nox), Kurt=kurtosis(nox)))
h <- ggplot(crime, aes(nox)) + geom_histogram(fill = 'salmon', binwidth = 0.05, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of nox') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=nox)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of nox") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", nox)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of nox', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), nox)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of nox by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)

addmargins(table(round(crime$rm,0)))
options(digits=2)
with(crime, c(summary(rm), SD=sd(rm), Skew=skewness(rm), Kurt=kurtosis(rm)))
h <- ggplot(crime, aes(rm)) + geom_histogram(fill = 'salmon', binwidth = 0.5, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of rm') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=rm)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of rm") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", rm)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of rm', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), rm)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of rm by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)

addmargins(table(5*round(crime$age/5,0)))
options(digits=2)
with(crime, c(summary(age), SD=sd(age), Skew=skewness(age), Kurt=kurtosis(age)))
h <- ggplot(crime, aes(age)) + geom_histogram(fill = 'salmon', binwidth = 5, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of age') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=age)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of age") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", age)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of age', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), age)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of age by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$dis,0)))
options(digits=2)
with(crime, c(summary(dis), SD=sd(dis), Skew=skewness(dis), Kurt=kurtosis(dis)))
h <- ggplot(crime, aes(dis)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of dis') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=dis)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of dis") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", dis)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of dis', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), dis)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of dis by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$rad,0)))
options(digits=2)
with(crime, c(summary(rad), SD=sd(rad), Skew=skewness(rad), Kurt=kurtosis(rad)))
h <- ggplot(crime, aes(rad)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of rad') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=rad)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of rad") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", rad)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of rad', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), rad)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of rad by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(25*round(crime$tax/25,0)))
options(digits=2)
with(crime, c(summary(tax), SD=sd(tax), Skew=skewness(tax), Kurt=kurtosis(tax)))
h <- ggplot(crime, aes(tax)) + geom_histogram(fill = 'salmon', binwidth = 25, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of tax') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=tax)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of tax") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", tax)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of tax', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), tax)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of tax by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$ptratio,0)))
options(digits=2)
with(crime, c(summary(ptratio), SD=sd(ptratio), Skew=skewness(ptratio), Kurt=kurtosis(ptratio)))
h <- ggplot(crime, aes(ptratio)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of ptratio') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=ptratio)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of ptratio") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", ptratio)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of ptratio', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), ptratio)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of ptratio by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(2*round(crime$lstat/2,0)))
options(digits=2)
with(crime, c(summary(lstat), SD=sd(lstat), Skew=skewness(lstat), Kurt=kurtosis(lstat)))
h <- ggplot(crime, aes(lstat)) + geom_histogram(fill = 'salmon', binwidth = 2 , color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of lstat') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=lstat)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of lstat") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", lstat)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of lstat', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), lstat)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of lstat by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(3*round(crime$medv/2,0)))
options(digits=2)
with(crime, c(summary(medv), SD=sd(medv), Skew=skewness(medv), Kurt=kurtosis(medv)))
h <- ggplot(crime, aes(medv)) + geom_histogram(fill = 'salmon', binwidth = 3, color = 'grey69' ) + 
 theme_classic() + labs(title = 'Histogram of medv') + theme(plot.title = element_text(hjust = 0.5)) 

q <- ggplot(crime, aes(sample=medv)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of medv") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

b1 <- ggplot(crime, aes(x="", medv)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
  labs(title = 'Boxplot of medv', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

b2 <- ggplot(crime, aes(x=factor(target), medv)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of medv by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(crime$target))

mycols = c('green3','blue')
crime$target_mod <- factor(ifelse(crime$target ==1 , 'y', 'n'))

panel.cor <- function(x,y, digits = 2, cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x,y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  text(0.5,0.5, txt, cex =sqrt(abs(r*6)))
  
}

pairs(crime[,1:13],  col= mycols[crime$target_mod], gap=0.3, lower.panel = panel.smooth, upper.panel = panel.cor, cex.labels = 3)

options(digits=3)
cbind(zn_zero=mean(crime$target[crime$zn==0]), zn_1_plus=mean(crime$target[crime$zn != 0]))

crime$zn_zero <- ifelse(crime$zn == 0,1,0)
prop.test(table(crime$zn_zero,crime$target))

ggplot(crime[crime$zn != 0,], aes(x=zn, target)) + geom_jitter(color='salmon', width=0,
                                                               height=0.05) +
  labs(title = 'Scatter of zn (greater than 0) by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

crime$indus_high <- ifelse(crime$indus <= 14, 0, 1)
options(digits=3)
cbind(val_0=mean(crime$target[crime$indus_high==0]),val_1=mean(crime$target[crime$indus_high==1]))

boxcoxfit(crime$nox)
crime$nox_mod <- -1 / crime$nox
skewness(crime$nox_mod)

h <-  ggplot(crime, aes(nox_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.15, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of nox_mod') + theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(crime, aes(x=factor(target), nox_mod)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of nox_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol= 2)

boxcoxfit(crime$rm)
crime$rm_mod <- crime$rm ^ 0.25

options(digits=3)
skewness(crime$rm_mod)

h <-  ggplot(crime, aes(rm_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.02, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of rm_mod') + theme(plot.title = element_text(hjust = 0.5))


b <- ggplot(crime, aes(x=factor(target), rm_mod)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of rm_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol=2)

boxcoxfit(crime$age)


crime$age_mod <- crime$age^1.3

h <- ggplot(crime, aes(age_mod)) + geom_histogram(fill = 'salmon', binwidth = 50, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of age_mod') + theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(crime, aes(x=factor(target), age_mod)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of rm_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol=2)

boxcoxfit(crime$dis)

crime$dis_mod <- log(crime$dis)
skewness(crime$dis_mod)


h <-  ggplot(crime, aes(dis_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of dis_mod') + theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(crime, aes(x=factor(target), dis_mod)) + geom_boxplot(fill='salmon', color='grey69') +
  labs(x='target', title = 'Boxplot of dis_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol=2)

crime$rad_high <- ifelse(crime$rad >= 15, 1, 0)
options(digits=3)
cbind(val_0=mean(crime$target[crime$rad_high==0]),val_1=mean(crime$target[crime$rad_high==1]))

crime$tax_high <- ifelse(crime$tax >= 500, 1,0)
options(digits=3)
cbind(val_0=mean(crime$target[crime$tax_high==0]),val_1=mean(crime$target[crime$tax_high==1]))

boxcoxfit(crime$ptratio)
boxcoxfit(crime$lstat)
crime$lstat_mod <- crime$lstat^0.25
options(digits=3)
skewness(crime$lstat_mod)

h <-  ggplot(crime, aes(lstat_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of lstat_mod') + theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(crime, aes(x=factor(target), lstat_mod)) + geom_boxplot(fill='salmon', color='grey69') + labs(x='target', title = 'Boxplot of lstat_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol=2)

boxcoxfit(crime$medv)
crime$medv_mod <- crime$medv^0.25

options(digits=3)
skewness(crime$medv_mod)

h <-  ggplot(crime, aes(medv_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
 theme_classic() + labs(title = 'Histogram of medv_mod') + theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(crime, aes(x=factor(target), medv_mod)) + geom_boxplot(fill='salmon', color='grey69') + labs(x='target', title = 'Boxplot of medv_mod by target') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(h,b, ncol=2)

options(digits=3)
mylm <- lm(target ~ . - age_mod - rad_high - ptratio - dis_mod ,data=crime[,c(15:22,10, 23:24, 14,3)])
vif(mylm)

mylogit1 <- glm(target~zn_zero + indus_high + chas + nox_mod + 
                  rm_mod + tax_high + lstat_mod + 
                  medv_mod ,family="binomial",data=crime)
summary(mylogit1)

mmps(mylogit1, terms = ~ . - zn_zero - indus_high - chas -tax_high, layout=c(3,2), key=FALSE)

par(mfrow=c(1,1))
plot(crime$rm_mod,crime$lstat_mod,pch=crime$target+1,col=crime$target+1,
     xlab='rm_mod',ylab='lstat_mod', main="Test Interaction of lstat_mod and rm_mod")
abline(lsfit(crime$rm_mod[crime$target==0],crime$lstat_mod[crime$target==0]),lty=1,col=1)
abline(lsfit(crime$rm_mod[crime$target==1],crime$lstat_mod[crime$target==1]),lty=2,col=2)
legend(1.65, 2.2,legend=c("No","Yes"),pch=1:2,col=1:2,title="High Crime?")
mylogit1 <- glm(target~zn_zero + indus_high + chas + nox_mod + 
                  rm_mod + tax_high + lstat_mod + 
                  medv_mod + rm_mod:lstat_mod ,family="binomial",data=crime)
summary(mylogit1)

mmps(mylogit1, terms = ~ . - zn_zero - indus_high - chas -tax_high, layout=c(3,2), key=FALSE)

mylogit2 <- glm(target~ nox_mod + rm_mod + lstat_mod + tax_high + 
                  rm_mod:lstat_mod ,family="binomial",data=crime)

summary(mylogit2)

model.upper <- glm(target~zn_zero + indus_high + chas + nox_mod + 
                  rm_mod  + dis_mod + tax_high + lstat_mod + 
                  medv_mod + ptratio+ age_mod  ,family="binomial",data=crime)

model.null = glm(target ~ 1, 
                 data=crime,
                 family = "binomial")
mylogit3 <- step(model.null, scope = list(upper=model.upper, lower = model.null),
                 trace = 0, direction = 'both')

summary(mylogit3)

pchisq(mylogit2$deviance - mylogit1$deviance,mylogit2$df.residual - mylogit1$df.residual ,lower=FALSE)

m1 <- cbind(AIC=AIC(mylogit1), AICc=AICc(mylogit1), BIC = BIC(mylogit1), loglik=logLik(mylogit1))
m2 <- cbind(AIC=AIC(mylogit2), AICc=AICc(mylogit2), BIC = BIC(mylogit2), loglik=logLik(mylogit2))
m3 <- cbind(AIC=AIC(mylogit3), AICc=AICc(mylogit3), BIC = BIC(mylogit3), loglik=logLik(mylogit3))
rbind(m1,m2,m3)
crime$predict <- predict(mylogit3, crime, type='response')

myroc <- roc(crime$target, crime$predict, plot=T, asp=NA,
                legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")

myroc["auc"]

# models 1 and 2 prediction probs
m1_pred <- predict(mylogit1, crime, type="response")
m2_pred <- predict(mylogit2, crime, type="response")

#AUC
paste("Model 1:",roc(crime$target, m1_pred)["auc"])
paste("Model 2:",roc(crime$target, m2_pred)["auc"])

coords(myroc, "best", ret='threshold', best.method="youden")
crime$predict_target <- ifelse(crime$predict >= 0.5, 1, 0)  
cm <- confusionMatrix(data=as.factor(crime$predict_target), reference=as.factor(crime$target), positive='1')

cm$table

a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL  # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL  # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1

# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)


test_crime <- read.csv('https://raw.githubusercontent.com/niteen11/MSDS/master/
DATA621/dataset/crime-evaluation-data.csv')
test_crime$zn_zero <- ifelse(test_crime$zn == 0,1,0)
test_crime$indus_high <- ifelse(test_crime$indus <= 14, 0, 1)
test_crime$nox_mod <- -1 / test_crime$nox
test_crime$rm_mod <- test_crime$rm ^ 0.25
test_crime$age_mod <- test_crime$age^1.3
test_crime$dis_mod <- log(test_crime$dis)
test_crime$rad_high <- ifelse(test_crime$rad >= 15, 1, 0)
test_crime$tax_high <- ifelse(test_crime$tax >= 500, 1,0)
test_crime$lstat_mod <- test_crime$lstat^0.25 
test_crime$medv_mod <- test_crime$medv^0.25
test_crime$predict_prob <- predict(mylogit3, test_crime, type='response')
test_crime$predict_target <- ifelse(test_crime$predict_prob >= 0.50, 1,0)

#write.csv(test_crime,"test_predictions.csv", row.names=FALSE)


kable(test_crime)