Analysis

The objective of this analysis is to build a binary logistic regression model to predict whether Boston neighborhoods will be at risk for high crime levels. A high crime level is defined as a crime rate above the median crime rate.

Data Exploration

The data set includes 466 observations with 12 variables (excluding the target variable). Descriptive information on the data set follows:

variable Description                                                                     Type
zn       proportion of residential land zoned for large lots (over 25,000 square feet)   predictor
indus    proportion of non-retail business acres per suburb                              predictor
chas     dummy variable for whether the suburb borders the Charles River (1) or not (0)  predictor
nox      nitrogen oxides concentration (parts per 10 million)                            predictor
rm       average number of rooms per dwelling                                            predictor
age      proportion of owner-occupied units built prior to 1940                          predictor
dis      weighted mean of distances to five Boston employment centers                    predictor
rad      index of accessibility to radial highways                                       predictor
tax      full-value property-tax rate per $10,000                                        predictor
ptratio  pupil-teacher ratio by town                                                     predictor
lstat    lower status of the population                                                  predictor
medv     median value of owner-occupied homes in $1000s                                  predictor
target   whether the crime rate is above the median crime rate (1) or not (0)            response

The str, summary and skim functions are used to display a high level summary of the data set below. A detailed exploration of the variables in the data set follows.

Summary of Variables

'data.frame':   466 obs. of  13 variables:
 $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
 $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
 $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
 $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
 $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
 $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
 $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
 $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
 $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
 $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
 $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
 $ target : int  1 1 1 0 0 0 1 1 0 0 ...
       zn            indus            chas            nox              rm            age       
 Min.   :  0.0   Min.   : 0.46   Min.   :0.000   Min.   :0.389   Min.   :3.86   Min.   :  2.9  
 1st Qu.:  0.0   1st Qu.: 5.14   1st Qu.:0.000   1st Qu.:0.448   1st Qu.:5.89   1st Qu.: 43.9  
 Median :  0.0   Median : 9.69   Median :0.000   Median :0.538   Median :6.21   Median : 77.2  
 Mean   : 11.6   Mean   :11.11   Mean   :0.071   Mean   :0.554   Mean   :6.29   Mean   : 68.4  
 3rd Qu.: 16.2   3rd Qu.:18.10   3rd Qu.:0.000   3rd Qu.:0.624   3rd Qu.:6.63   3rd Qu.: 94.1  
 Max.   :100.0   Max.   :27.74   Max.   :1.000   Max.   :0.871   Max.   :8.78   Max.   :100.0  
      dis             rad             tax         ptratio         lstat           medv          target     
 Min.   : 1.13   Min.   : 1.00   Min.   :187   Min.   :12.6   Min.   : 1.7   Min.   : 5.0   Min.   :0.000  
 1st Qu.: 2.10   1st Qu.: 4.00   1st Qu.:281   1st Qu.:16.9   1st Qu.: 7.0   1st Qu.:17.0   1st Qu.:0.000  
 Median : 3.19   Median : 5.00   Median :334   Median :18.9   Median :11.3   Median :21.2   Median :0.000  
 Mean   : 3.80   Mean   : 9.53   Mean   :410   Mean   :18.4   Mean   :12.6   Mean   :22.6   Mean   :0.491  
 3rd Qu.: 5.21   3rd Qu.:24.00   3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:16.9   3rd Qu.:25.0   3rd Qu.:1.000  
 Max.   :12.13   Max.   :24.00   Max.   :711   Max.   :22.0   Max.   :38.0   Max.   :50.0   Max.   :1.000  
Data summary
Name crime
Number of rows 466
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zn 0 1 11.58 23.36 0.00 0.00 0.00 16.25 100.00 ▇▁▁▁▁
indus 0 1 11.11 6.85 0.46 5.15 9.69 18.10 27.74 ▇▆▁▇▁
chas 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.55 0.12 0.39 0.45 0.54 0.62 0.87 ▇▇▅▃▁
rm 0 1 6.29 0.70 3.86 5.89 6.21 6.63 8.78 ▁▂▇▂▁
age 0 1 68.37 28.32 2.90 43.88 77.15 94.10 100.00 ▂▂▂▃▇
dis 0 1 3.80 2.11 1.13 2.10 3.19 5.21 12.13 ▇▅▂▁▁
rad 0 1 9.53 8.69 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
tax 0 1 409.50 167.90 187.00 281.00 334.50 666.00 711.00 ▇▇▅▁▇
ptratio 0 1 18.40 2.20 12.60 16.90 18.90 20.20 22.00 ▁▃▅▅▇
lstat 0 1 12.63 7.10 1.73 7.04 11.35 16.93 37.97 ▇▇▅▂▁
medv 0 1 22.59 9.24 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
target 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇

Detailed Data Exploration

The detailed analysis will include a review of observed values, summary statistics and EDA visualizations. The objective of this analysis is to identify appropriate data preparation measures and to inform the model development process.
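
The summary statistics reported for each variable below can be sketched in base R. This is a minimal sketch, not the code used for this analysis: describe_var is a hypothetical helper, and the moment-based skewness and kurtosis formulas are an assumption. As a sanity check, when applied to chas rebuilt from its observed counts (433 zeros and 33 ones), they agree with the SD, Skew and Kurt values reported in the chas section.

```r
# Hypothetical helper (not from the original analysis): summary statistics
# in the same layout as the "Summary Statistics" tables in this section.
describe_var <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment (population form)
  m3 <- mean((x - m)^3)   # third central moment
  m4 <- mean((x - m)^4)   # fourth central moment
  q  <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  c(Min = min(x), Q1 = q[1], Median = q[2], Mean = m, Q3 = q[3],
    Max  = max(x),
    SD   = sd(x),         # sample standard deviation
    Skew = m3 / m2^1.5,   # moment-based skewness
    Kurt = m4 / m2^2)     # non-excess kurtosis (a normal distribution gives 3)
}

# chas rebuilt from its observed counts: 433 zeros and 33 ones
chas <- c(rep(0, 433), rep(1, 33))
chas_stats <- describe_var(chas)
print(round(chas_stats, 4))
```

Because the kurtosis here is not excess kurtosis, a value near 3 (rather than 0) is the benchmark for a normal distribution.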

Variable: zn

zn reflects the proportion of residential land zoned for large lots (over 25,000 square feet). The data ranges from 0 to 100 with a mean and median of 11.58 and 0, respectively. Zero values represent almost 73% of the data (339 of 466 observations). The density plots reveal that high crime is associated with few or no large lots. Finally, the box plots indicate substantially higher variance in low crime areas than in high crime areas. This is consistent with the high number of zero values for the high crime designation. Key takeaway - a zero or low zn value likely indicates a high crime area.

Observed Value:


  0  12  18  20  21  22  25  28  30  33  34  35  40  45  52  55  60  70  75  80  82  85  90  95 100 Sum 
339  10   2  21   4   9   8   3   6   3   3   3   7   6   3   3   4   3   3  13   2   2   4   4   1 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   0.00    0.00    0.00   11.58   16.25  100.00   23.36    2.18    6.84 

EDA Panel:


Variable: indus

indus is defined as the proportion of non-retail business acres per suburb. Observed values for indus range from 0 to 28 with a bi-modal distribution in aggregate. When plotted by target value the distribution for low crime approaches normal, but the high crime distribution remains bi-modal with peaks in the high single-digits and 18. Both the density plots and box plots show a positive correlation between indus and crime. I note that the variance for high crime is higher than that of low crime. Key takeaway - positive correlation between indus and crime.

Observed Value:


  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  18  20  22  26  28 Sum 
  1   9  31  29  30  27  45  21  26  11  27  14   4   6   9   3 121  28  14   5   5 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
  0.460   5.145   9.690  11.105  18.100  27.740   6.846   0.289   1.764 

EDA Panel:


Variable: chas

chas is binary with the value 1 indicating that the neighborhood borders the Charles River. Only 7% of all neighborhoods (33 in total) border the river. Key takeaway - High percentage of zero values may mean this variable is not significant.

Observed Value:


  0   1 Sum 
433  33 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
 0.0000  0.0000  0.0000  0.0708  0.0000  1.0000  0.2568  3.3463 12.1974 

EDA Panel:


Variable: nox

nox represents the concentration of nitrogen oxides in Boston neighborhoods. The variable ranges between 0.39 and 0.87, with a peak around 0.43-0.45. The mean value is 0.55 and the skew is positive at 0.75. Both density plots show roughly normal characteristics, and the kurtosis of about 3 is in line with a normal distribution. This variable appears similar to indus, sharing the positive correlation with crime and the higher variance in the high crime box plot. Key takeaway - positive correlation between nox and crime.

Observed Value:


0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.57 0.58  0.6 0.61 0.62 
   4   17   18    6   27   34   20   14    4    2   28   10   16   26    5   31   13    6   27   18   12   14 
0.63 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.74 0.77 0.87  Sum 
   4   10    4   10    8   14   11   16    5    8    8   16  466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
  0.389   0.448   0.538   0.554   0.624   0.871   0.117   0.749   2.977 

EDA Panel:


Variable: rm

rm describes the average number of rooms per dwelling. This variable appears to have the most normal distribution among the variables. Medians and variances are similar for the low and high crime outcomes. Key takeaway - I’m not sure this variable will be significant in our regression models.

Observed Value:


  4   5   6   7   8   9 Sum 
  4  37 284 115  23   3 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
  3.863   5.887   6.210   6.291   6.630   8.780   0.705   0.481   4.562 

EDA Panel:


Variable: age

age reflects the proportion of owner-occupied units built prior to 1940. age has a mean of 68.4 and is left skewed. The variable has a range of 3 to 100, with an upper-end concentration in the values between 94 and 100. This is consistent with an old town like Boston. Higher values of age appear to be associated with higher crime. This could reflect the more affluent demographics moving from older urban areas to newer suburbs. This is supported by the box plots, which show that low crime has a materially lower median age and significantly higher variance. Low crime has a fairly flat distribution with fat tails. Key takeaway - high age appears to be associated with high crime.

Observed Value:


  3   6   7   8   9  10  13  14  15  16  17  18  19  20  21  22  23  25  26  28  29  30  31  32  33  34  35 
  1   3   2   3   1   3   1   1   2   3   2   8   2   2   3   5   3   1   2   6   5   1   4   9   5   6   1 
 36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  56  57  58  59  60  61  62  63 
  3   7   5   2   4   2   5   3   2   3   5   4   4   3   2   1   5   3   9   3   1   4   4   3   1   6   3 
 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
  1   5   2   3   4   3   5   3   4   5   5   3   4   5   5   4   5   3   5  10   7   7   5   4  10   7   8 
 91  92  93  94  95  96  97  98  99 100 Sum 
  9   9   8  14  14  14  14  18  10  42 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   2.90   43.88   77.15   68.37   94.10  100.00   28.32   -0.58    2.00 

EDA Panel:


Variable: dis

dis is defined as the weighted mean of distances to five Boston employment centers. Observed values range from 1 to 12, with a significant concentration between 2 and 6. The variable is right skewed (skew of 1.00). The box plots indicate that low crime areas are associated with greater distances to employment centers. Key takeaway - dis appears to be negatively correlated with crime.

Observed Value:


  1   2   3   4   5   6   7   8   9  11  12 Sum 
 28 145  83  65  47  44  24  16   9   4   1 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   1.13    2.10    3.19    3.80    5.21   12.13    2.11    1.00    3.49 

EDA Panel:


Variable: rad

rad is an index that measures a neighborhood’s accessibility to radial highways. Observed values range from 1 to 24, with three peaks at 4, 5 and 24, respectively. The box plots show a small variance and a low median for low crime neighborhoods and a high median and variance for high crime neighborhoods. Key takeaway - there appears to be a positive correlation between rad and crime.

Observed Value:


  1   2   3   4   5   6   7   8  24 Sum 
 17  20  36 103 109  25  15  20 121 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   1.00    4.00    5.00    9.53   24.00   24.00    8.69    1.01    2.15 

EDA Panel:


Variable: tax

tax is the property tax rate per $10k of property value. Observed values range from 187 to 711, with concentrations around 300 and 666. The density plots show a bi-modal distribution for high crime neighborhoods with peaks near 300 and 700. Similar to rad, high crime appears positively correlated with tax. Additionally, tax, like rad, has a high median and a large variance in high crime areas - no doubt a result of its bi-modal distribution. Key takeaway - tax is positively correlated with crime and its distribution is very similar to that of rad.

Observed Value:


187 188 193 198 216 222 223 224 226 233 242 243 244 245 247 252 254 255 256 264 265 270 273 276 277 279 281 
  1   5   8   1   5   6   5   9   1   9   1   4   1   3   2   2   5   1   1  12   2   7   4   8  10   3   4 
284 287 289 293 296 300 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384 391 398 402 
  5   6   5   3   8   7  13   4  35   7   1   2   6   9   2   2   2   3   2   1   2   3   2  11   7  12   2 
403 411 422 430 432 437 469 666 711 Sum 
 28   2   1   3   9  14   1 121   5 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
187.000 281.000 334.500 409.502 666.000 711.000 167.900   0.661   1.860 

EDA Panel:


Variable: ptratio

ptratio reflects the average pupil-to-teacher ratio of schools in the town. The variable has an observed value range of 13 to 22, with concentrations at 15 and 18 to 20. High crime neighborhoods appear to be associated with higher pupil-teacher ratios. Low crime neighborhoods have a median ratio lower than high crime neighborhoods, but a variance that is higher. Key takeaway - ptratio is positively correlated with high crime, and its distribution in high crime neighborhoods has a strong left skew.

Observed Value:


 15  20 Sum 
138 328 466 

 13  14  15  16  17  18  19  20  21  22 Sum 
 15   2  55  21  45  67  65 145  49   2 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
 12.600  16.900  18.900  18.398  20.200  22.000   2.197  -0.757   2.611 

EDA Panel:


Variable: lstat

lstat is the proportion of the population considered to be of lower status. Observed values range between 2 and 38, with concentrations around 4 to 10 and 12 to 18. The box plots indicate that high crime neighborhoods have a higher median and variance. Key takeaway - the high crime density chart appears to be nearly normally distributed and significantly overlaps with the low crime distribution. I’m not sure this is ideal for classification purposes.

Observed Value:


  0   5  10  15  20  25  30  35  40 Sum 
  4 126 130 103  54  28  16   4   1 466 

  2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28 
  4  14  20  35  25  32  28  24  37  18  23  24  19  22  20  18  19  11  11  11   2   8   9   2   5   4   3 
 29  30  31  32  34  35  37  38 Sum 
  2   5   5   1   2   1   1   1 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
  1.730   7.043  11.350  12.631  16.930  37.970   7.102   0.909   3.518 

EDA Panel:


Variable: medv

medv reflects the median value (in $thousands) of residential homes in Boston neighborhoods. The variable has an observed value range of 5 to 50. The variable is right skewed, and high values of medv appear to be associated with lower crime rates. The box plots show that the variances of medv by crime type are similar; however, low crime neighborhoods have a higher median medv of approximately 22, versus 18 for high crime neighborhoods. Key takeaway - the density plots depict significant overlap between the low and high crime neighborhoods.

Observed Value:


  5  10  15  20  25  30  35  40  45  50 Sum 
 10  34  82 145  93  41  28   7   9  17 466 

  5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
  2   2   6   7   3   9   4  11  16  20  16  14  16  19  28  40  25  33  35  30  18   4   6  10   9   9   5 
 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 Sum 
  8  10   1   8   6   3   2   1   1   1   2   2   3   2   1   1   1   1  15 466 

Summary Statistics:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD    Skew    Kurt 
   5.00   17.02   21.20   22.59   25.00   50.00    9.24    1.08    4.39 

EDA Panel:


Variable: target

target, our response variable, has 466 observations: 237 0s and 229 1s, a split of approximately 51% and 49%, respectively. The variable has no missing data.

Observed Value:

Item/Response  Low Crime (0)  High Crime (1)  Totals
Frequency      237            229             466
Percentage     51%            49%             100%

Correlation

We utilized a pairs plot to better understand the correlation between variables. In the plot below, we employ scatter plots and loess curves to better understand the relationship between variables - high crime neighborhoods are represented in red and low crime neighborhoods in blue.

Pairs Plot Observations:

Here are my observations from the pairs plot above. These observations as well as the EDA analysis above will inform my data preparation decisions in the next section.

  • High Correlation - There are 7 correlations with an absolute value greater than 70% in the pairs plot, excluding the target:

    • indus | tax: 73%
    • indus | nox: 76%
    • nox | age: 74%
    • nox | dis: -77%
    • rm | medv: 71%
    • age | dis: -75%
    • rad | tax: 91%
  • Convex Loess Curves - the loess curves for rm, ptratio and medv all change direction. Referring to the density plots for each of these variables, I believe this reflects the large extent of overlap between the low crime and high crime density plots. This may be consistent with a reduced ability to classify.

  • L-Shaped Loess Curve - the loess curve for zn and target almost forms an L-shape. This likely reflects the high concentration of zero values.

  • S-Curves - The loess curves for age, dis, nox and lstat seem to form sigmoid curves. This could be an indication of a better ability to classify.
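
The high-correlation list above can also be produced programmatically. The following is a minimal sketch on simulated stand-in data, assuming Pearson correlation; high_cor_pairs and the toy columns a, b and c are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds a cutoff (0.70 in the text above).
high_cor_pairs <- function(df, cutoff = 0.70) {
  cm  <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                    var2 = colnames(cm)[idx[, "col"]],
                    r    = cm[idx],
                    stringsAsFactors = FALSE)
  out[order(-abs(out$r)), ]
}

# Simulated stand-in data: b is built from a, c is independent
set.seed(1)
a   <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))
pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

On the crime data, the same call would be made on the predictor columns only, leaving target out.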

Data Preparation

The data preparation phase of my analysis will be used to address problems with the data (missing values, outliers, multicollinearity), transform variables and engineer features. These actions aim to improve the modeling phase of the analysis and the ability of our models to classify accurately.

Missing Data and Outliers

The skim function (from the skimr package) showed that the data set has no missing values. Similarly, the EDA analysis above did not identify any problems with outliers.

Summary of Data Preparation Actions

The following table outlines my data preparation actions and rationale.

variable  Data Prep Action                                                        Rationale
age       ageNew, a categorical with three levels: 1 (<=25), 2 (26-74), 3 (>=75)  The tails of the density plots appear to be good classifiers
dis       disNew, a categorical with two levels: 1 (<5), 0 (>=5)                  Bi-modal distribution of high crime
tax       drop the variable or exclude it from the models                         0.90+ correlation with rad
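
The two derived variables can be sketched as follows. The coding follows the printed revised data set, where disNew is 1 when dis is below 5 and ageNew takes the values 1 to 3. The exact handling of the boundary values (25 and 75) is my assumption, and add_derived is a hypothetical helper rather than the code used in this analysis.

```r
# Sketch of the recoding; the cut points are assumptions read from the
# preparation table and the printed data, not the author's code.
add_derived <- function(df) {
  df$disNew <- ifelse(df$dis < 5, 1, 0)
  df$ageNew <- ifelse(df$age >= 75, 3L, ifelse(df$age > 25, 2L, 1L))
  df
}

# First five values of dis and age as shown in the str() output
toy <- data.frame(dis = c(2.05, 1.32, 1.98, 7.04, 2.70),
                  age = c(96.2, 100, 100, 7.8, 92.2))
toy <- add_derived(toy)
print(toy)
```

For these five rows the result matches the disNew and ageNew values printed for the revised data set.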

Summary Information and Plots

We use the str, summary and skim functions to provide high level summary information on our revised data set. Plots of our new variables are also set forth below.

Revised Data Set

'data.frame':   466 obs. of  15 variables:
 $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
 $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
 $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
 $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
 $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
 $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
 $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
 $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
 $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
 $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
 $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
 $ target : int  1 1 1 0 0 0 1 1 0 0 ...
 $ disNew : num  1 1 1 0 1 1 1 1 0 0 ...
 $ ageNew : int  3 3 3 1 3 2 3 3 2 1 ...
       zn            indus            chas            nox              rm            age       
 Min.   :  0.0   Min.   : 0.46   Min.   :0.000   Min.   :0.389   Min.   :3.86   Min.   :  2.9  
 1st Qu.:  0.0   1st Qu.: 5.14   1st Qu.:0.000   1st Qu.:0.448   1st Qu.:5.89   1st Qu.: 43.9  
 Median :  0.0   Median : 9.69   Median :0.000   Median :0.538   Median :6.21   Median : 77.2  
 Mean   : 11.6   Mean   :11.11   Mean   :0.071   Mean   :0.554   Mean   :6.29   Mean   : 68.4  
 3rd Qu.: 16.2   3rd Qu.:18.10   3rd Qu.:0.000   3rd Qu.:0.624   3rd Qu.:6.63   3rd Qu.: 94.1  
 Max.   :100.0   Max.   :27.74   Max.   :1.000   Max.   :0.871   Max.   :8.78   Max.   :100.0  
      dis             rad             tax         ptratio         lstat           medv          target     
 Min.   : 1.13   Min.   : 1.00   Min.   :187   Min.   :12.6   Min.   : 1.7   Min.   : 5.0   Min.   :0.000  
 1st Qu.: 2.10   1st Qu.: 4.00   1st Qu.:281   1st Qu.:16.9   1st Qu.: 7.0   1st Qu.:17.0   1st Qu.:0.000  
 Median : 3.19   Median : 5.00   Median :334   Median :18.9   Median :11.3   Median :21.2   Median :0.000  
 Mean   : 3.80   Mean   : 9.53   Mean   :410   Mean   :18.4   Mean   :12.6   Mean   :22.6   Mean   :0.491  
 3rd Qu.: 5.21   3rd Qu.:24.00   3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:16.9   3rd Qu.:25.0   3rd Qu.:1.000  
 Max.   :12.13   Max.   :24.00   Max.   :711   Max.   :22.0   Max.   :38.0   Max.   :50.0   Max.   :1.000  
     disNew          ageNew    
 Min.   :0.000   Min.   :1.00  
 1st Qu.:0.000   1st Qu.:2.00  
 Median :1.000   Median :3.00  
 Mean   :0.727   Mean   :2.42  
 3rd Qu.:1.000   3rd Qu.:3.00  
 Max.   :1.000   Max.   :3.00  
Data summary
Name crime
Number of rows 466
Number of columns 15
_______________________
Column type frequency:
numeric 15
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zn 0 1 11.58 23.36 0.00 0.00 0.00 16.25 100.00 ▇▁▁▁▁
indus 0 1 11.11 6.85 0.46 5.15 9.69 18.10 27.74 ▇▆▁▇▁
chas 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.55 0.12 0.39 0.45 0.54 0.62 0.87 ▇▇▅▃▁
rm 0 1 6.29 0.70 3.86 5.89 6.21 6.63 8.78 ▁▂▇▂▁
age 0 1 68.37 28.32 2.90 43.88 77.15 94.10 100.00 ▂▂▂▃▇
dis 0 1 3.80 2.11 1.13 2.10 3.19 5.21 12.13 ▇▅▂▁▁
rad 0 1 9.53 8.69 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
tax 0 1 409.50 167.90 187.00 281.00 334.50 666.00 711.00 ▇▇▅▁▇
ptratio 0 1 18.40 2.20 12.60 16.90 18.90 20.20 22.00 ▁▃▅▅▇
lstat 0 1 12.63 7.10 1.73 7.04 11.35 16.93 37.97 ▇▇▅▂▁
medv 0 1 22.59 9.24 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
target 0 1 0.49 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
disNew 0 1 0.73 0.45 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
ageNew 0 1 2.42 0.66 1.00 2.00 3.00 3.00 3.00 ▂▁▆▁▇


New Variables


Model Building

The sample.split function is used to create training and test data sets. The training set includes 350 rows and the test set includes 116. Next we will use the training data set to create models. Once the models are created, the test data set will be used to evaluate their performance. Evaluating on held-out data guards against selecting an overfit model.
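
A 350/116 split can be sketched in base R as shown here. This is an illustration only: a plain random sample stands in for the splitting function used in the analysis, and split_train_test, crime_toy and the seed value are hypothetical.

```r
# Base-R sketch of a 350/116 split; a plain random sample stands in
# for the splitting function used in the analysis.
split_train_test <- function(df, n_train, seed = 123) {
  set.seed(seed)                       # makes the split reproducible
  train_idx <- sample(nrow(df), n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Stand-in data frame with the same number of rows as the crime data
crime_toy <- data.frame(id = 1:466, target = rep(c(0, 1), 233))
parts <- split_train_test(crime_toy, n_train = 350)
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```

Unlike a stratified split, this simple version does not force the share of 0s and 1s to match between the two sets.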

Model 1: All Original Variables

The first model uses all the original variables except tax, which was set aside because of its high correlation with rad. This model should establish a good baseline to improve upon.
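
Each model section follows the same workflow: fit a logistic regression on the training set, predict probabilities for the test set, classify, and tabulate a confusion matrix. Here is a minimal sketch of that workflow on simulated data. The variables x1 and x2, the simulated coefficients and the 0.5 cutoff are assumptions for illustration.

```r
# Simulated stand-in data; x1, x2 and the true coefficients are hypothetical.
set.seed(42)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))
sim   <- data.frame(y, x1, x2)
train <- sim[1:300, ]
test  <- sim[301:400, ]

# Fit on the training rows, then score the held-out rows
fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = train)
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)       # assumed 0.5 classification cutoff

# Rows are predictions, columns are the reference classes
conf_mat <- table(Prediction = factor(pred, levels = 0:1),
                  Reference  = factor(test$y, levels = 0:1))
print(conf_mat)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```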


Call:
glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + 
    rad + ptratio + lstat + medv, family = binomial(link = "logit"), 
    data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8619  -0.1458  -0.0001   0.0040   2.8314  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -45.2510     8.5187   -5.31  1.1e-07 ***
zn           -0.1498     0.0623   -2.40   0.0162 *  
indus        -0.1118     0.0545   -2.05   0.0402 *  
chas          0.2473     0.8905    0.28   0.7812    
nox          45.2207     9.6111    4.71  2.5e-06 ***
rm            0.3041     0.9205    0.33   0.7411    
age           0.0345     0.0168    2.06   0.0399 *  
dis           0.7221     0.2877    2.51   0.0121 *  
rad           0.5379     0.1709    3.15   0.0016 ** 
ptratio       0.3300     0.1467    2.25   0.0245 *  
lstat         0.1156     0.0695    1.66   0.0961 .  
medv          0.2141     0.0918    2.33   0.0196 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 130.95  on 338  degrees of freedom
AIC: 154.9

Number of Fisher Scoring iterations: 9
          Reference
Prediction  0  1
         0 52  7
         1 10 47
fitting null model for pseudo-r2

Model summary and confusion matrix of running this model against test data are above. The accuracy rate (0.853) is very good and the McFadden R^2 value (0.73) is also high. AIC value is 155. Additionally, consider the ROC curve for this model.
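
The reported metrics for Model 1 can be reproduced by hand from the output above. This sketch uses only numbers printed in the model summary and confusion matrix. Computing McFadden R^2 as one minus the ratio of residual to null deviance is an assumption that holds for a binary (0/1) response, where the deviance equals -2 times the log-likelihood.

```r
# Model 1 confusion matrix on the test set
# (rows = prediction, columns = reference), filled column by column
cm1 <- matrix(c(52, 10, 7, 47), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

accuracy    <- sum(diag(cm1)) / sum(cm1)          # (52 + 47) / 116
sensitivity <- cm1["1", "1"] / sum(cm1[, "1"])    # 47 / (47 + 7)
specificity <- cm1["0", "0"] / sum(cm1[, "0"])    # 52 / (52 + 10)

null_dev  <- 485.10
resid_dev <- 130.95
n_coef    <- 12                                   # intercept + 11 predictors
mcfadden  <- 1 - resid_dev / null_dev
aic       <- resid_dev + 2 * n_coef

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, mcfadden = mcfadden, aic = aic), 3))
```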

Area under the curve is 0.941.
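
For reference, the area under the ROC curve equals the probability that a randomly chosen high crime case receives a higher predicted probability than a randomly chosen low crime case. A minimal base-R sketch of that calculation follows. auc_rank is a hypothetical helper, and the toy scores are made up for illustration.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r  <- rank(score)                 # average ranks handle tied scores
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly
score   <- c(0.10, 0.40, 0.35, 0.80)
label   <- c(0, 0, 1, 1)
auc_toy <- auc_rank(score, label)
print(auc_toy)  # 0.75
```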

Model 2: Well Performing Variables From Model 1 Plus New Variables - disNew and ageNew

Model 2 performed worse than Model 1, and it appears the new variables had only a marginal impact.


Call:
glm(formula = target ~ nox + rad + disNew + ageNew, family = binomial(link = "logit"), 
    data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6901  -0.2596  -0.0211   0.0027   3.0568  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -18.483      2.679   -6.90  5.2e-12 ***
nox           20.464      4.836    4.23  2.3e-05 ***
rad            0.624      0.144    4.33  1.5e-05 ***
disNew         2.655      1.071    2.48    0.013 *  
ageNew         0.634      0.437    1.45    0.147    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 150.83  on 345  degrees of freedom
AIC: 160.8

Number of Fisher Scoring iterations: 8
          Reference
Prediction  0  1
         0 51  8
         1 13 44
fitting null model for pseudo-r2

Model summary and confusion matrix of running this model against test data are above. The accuracy rate (0.819) is very good and the McFadden R^2 value (0.689) is also high. AIC value is 161. Additionally, consider the ROC curve for this model.

Area under the curve is 0.937.

Model 3: But The New Variable Should Work

In this model, not yet willing to accept that the new variables were ineffective, I combine the best performing variables, nox and rad, with the new variables and the variables they were derived from: dis and disNew, and age and ageNew. This had a positive effect on accuracy and performance.


Call:
glm(formula = target ~ nox + rad + disNew + dis + age + ageNew, 
    family = binomial(link = "logit"), data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7462  -0.1694  -0.0021   0.0005   2.8860  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -43.53050    7.17951   -6.06  1.3e-09 ***
nox          42.69831    8.03167    5.32  1.1e-07 ***
rad           0.81938    0.18138    4.52  6.3e-06 ***
disNew        8.42730    2.06516    4.08  4.5e-05 ***
dis           1.63966    0.35921    4.56  5.0e-06 ***
age           0.00747    0.02350    0.32     0.75    
ageNew        0.84177    0.89047    0.95     0.34    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 126.37  on 343  degrees of freedom
AIC: 140.4

Number of Fisher Scoring iterations: 9
          Reference
Prediction  0  1
         0 52  7
         1  6 51
fitting null model for pseudo-r2

Model summary and confusion matrix of running this model against test data are above. The accuracy rate (0.888) is very good and the McFadden R^2 value (0.74) is also high. AIC value is 140. Additionally, consider the ROC curve for this model.

Area under the curve is 0.969.

Model 4: Drop Poor Performing Variables From Model 3 (age and ageNew) and Add medv and zn From Model 1

Model 4 seeks to improve upon Model 3 by dropping the poor performing variables from Model 3 (age and ageNew) and adding medv and zn, two average-performing variables from Model 1. This resulted in an additional improvement over Model 3.


Call:
glm(formula = target ~ nox + dis + disNew + rad + medv + zn, 
    family = binomial(link = "logit"), data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2078  -0.1409  -0.0007   0.0018   2.3021  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -45.7332     7.6330   -5.99  2.1e-09 ***
nox          51.4673     8.9988    5.72  1.1e-08 ***
dis           1.8325     0.4158    4.41  1.0e-05 ***
disNew        7.1469     1.7964    3.98  6.9e-05 ***
rad           0.6241     0.1826    3.42  0.00063 ***
medv          0.0844     0.0425    1.99  0.04675 *  
zn           -0.0704     0.0397   -1.77  0.07607 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 124.04  on 343  degrees of freedom
AIC: 138

Number of Fisher Scoring iterations: 9
          Reference
Prediction  0  1
         0 56  3
         1  4 53
fitting null model for pseudo-r2

Model summary and confusion matrix of running this model against test data are above. The accuracy rate (0.94) is very good and the McFadden R^2 value (0.744) is also high. AIC value is 138. Additionally, consider the ROC curve for this model.

Area under the curve is 0.971.

Model 5: Drop zn and medv from Model 4

In Model 5 we drop zn and medv because the absolute values of their z statistics were less than 2.0 (p-values of 0.076 and 0.047, respectively).
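
The z values and p-values for medv and zn can be recomputed from the estimates and standard errors printed in the Model 4 output. Because those inputs are rounded to four decimals, the results differ slightly from the printed p-values.

```r
# Estimates and standard errors for medv and zn from the Model 4 output
est <- c(medv = 0.0844, zn = -0.0704)
se  <- c(medv = 0.0425, zn = 0.0397)

z <- est / se                 # Wald z statistic
p <- 2 * pnorm(-abs(z))       # two-sided p-value from the standard normal

print(round(cbind(z, p), 3))
```

Both absolute z values fall just under 2.0, which is the screening rule applied here.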


Call:
glm(formula = target ~ nox + dis + disNew + rad, family = binomial(link = "logit"), 
    data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1884  -0.1778  -0.0043   0.0007   2.3900  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -41.981      6.906   -6.08  1.2e-09 ***
nox           46.856      7.803    6.00  1.9e-09 ***
dis            1.517      0.350    4.33  1.5e-05 ***
disNew         7.927      1.983    4.00  6.4e-05 ***
rad            0.783      0.171    4.58  4.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 131.35  on 345  degrees of freedom
AIC: 141.4

Number of Fisher Scoring iterations: 9
          Reference
Prediction  0  1
         0 52  7
         1  4 53
fitting null model for pseudo-r2

Model summary and confusion matrix of running this model against test data are above. The accuracy rate (0.905) is very good and the McFadden R^2 value (0.729) is also high. AIC value is 141. Additionally, consider the ROC curve for this model.

Area under the curve is 0.968.

Model Selection

Model 4 and Model 5 distinguished themselves from the other models. My model selection will be between these two models. The table below sets forth some key metrics that will facilitate the selection process.

Model    Variables  McFadden R-Squared  ROC/AUC  Accuracy  AIC  Intuitive
Model 4  6          0.744               0.971    0.94      138  Yes
Model 5  4          0.729               0.968    0.91      141  Yes

Both models yielded very good results. As we compare the key metrics we see many similarities. Model 4 has two more variables than Model 5; however, the AIC, which penalizes additional variables, is still slightly lower for Model 4. That said, the difference between the two AICs is small. Given this, the decision between the two models comes down to accuracy and McFadden R-squared. On these two metrics Model 4 is superior to Model 5. As a result, Model 4 is the selected model.
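
The comparison can be rebuilt from the deviances and confusion matrices printed for Models 4 and 5. This is a sketch using only those printed numbers. The column names are my own, and McFadden R-squared is taken as one minus the ratio of residual to null deviance, which assumes a binary (0/1) response.

```r
null_dev <- 485.10   # null deviance, shared by both models
n_test   <- 116      # rows in the test set

compare <- data.frame(
  model     = c("Model 4", "Model 5"),
  n_coef    = c(7, 5),                  # coefficients including the intercept
  resid_dev = c(124.04, 131.35),
  correct   = c(56 + 53, 52 + 53)       # diagonal of each confusion matrix
)
compare$mcfadden <- 1 - compare$resid_dev / null_dev
compare$aic      <- compare$resid_dev + 2 * compare$n_coef
compare$accuracy <- compare$correct / n_test

print(compare)
```

The AIC gap of about 3.3 favors Model 4 even after the penalty for its two extra coefficients.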

The Model 4 summary and predictions against the evaluation data set follow.

The fitted coefficients indicate that the probability of high crime increases as nox, dis, disNew, rad and medv increase, and decreases as zn increases. Based upon my understanding of our variables this appears to make sense. Note that medv and zn are less significant than the other variables.


Call:
glm(formula = target ~ nox + dis + disNew + rad + medv + zn, 
    family = binomial(link = "logit"), data = crimeTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2078  -0.1409  -0.0007   0.0018   2.3021  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -45.7332     7.6330   -5.99  2.1e-09 ***
nox          51.4673     8.9988    5.72  1.1e-08 ***
dis           1.8325     0.4158    4.41  1.0e-05 ***
disNew        7.1469     1.7964    3.98  6.9e-05 ***
rad           0.6241     0.1826    3.42  0.00063 ***
medv          0.0844     0.0425    1.99  0.04675 *  
zn           -0.0704     0.0397   -1.77  0.07607 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 485.10  on 349  degrees of freedom
Residual deviance: 124.04  on 343  degrees of freedom
AIC: 138

Number of Fisher Scoring iterations: 9

Predictions Using Evaluation Data Set

zn indus chas nox rm age dis rad tax ptratio lstat medv disNew ageNew prob predict
0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 4.03 34.7 1 2 0.237 0
0 8.14 0 0.538 6.10 84.5 4.46 4 307 21.0 10.26 18.2 1 3 0.788 1
0 8.14 0 0.538 6.50 94.4 4.46 4 307 21.0 12.80 18.4 1 3 0.789 1
0 8.14 0 0.538 5.95 82.0 3.99 4 307 21.0 27.71 13.2 1 3 0.506 1
0 5.96 0 0.499 5.85 41.5 3.93 5 279 19.2 8.77 21.0 1 2 0.310 0
25 5.13 0 0.453 5.74 66.2 7.22 8 284 19.7 13.15 18.7 0 2 0.012 0
25 5.13 0 0.453 5.97 93.4 6.82 8 284 19.7 14.44 16.0 0 3 0.005 0
0 4.49 0 0.449 6.63 56.1 4.44 3 247 18.5 6.53 26.6 1 2 0.038 0
0 4.49 0 0.449 6.12 56.8 3.75 3 247 18.5 8.44 22.2 1 2 0.008 0
0 2.89 0 0.445 6.16 69.6 3.50 2 276 18.0 11.34 21.4 1 2 0.002 0
0 25.65 0 0.581 5.86 97.0 1.94 2 188 19.1 25.41 17.3 1 3 0.082 0
0 25.65 0 0.581 5.61 95.6 1.76 2 188 19.1 27.26 15.7 1 3 0.053 0
0 21.89 0 0.624 5.64 94.7 1.98 4 437 21.2 18.34 14.3 1 3 0.703 1
0 19.58 0 0.605 6.10 93.0 2.28 5 403 14.7 9.81 25.0 1 3 0.877 1
0 19.58 0 0.605 5.88 97.3 2.39 5 403 14.7 12.03 19.1 1 3 0.840 1
0 10.59 1 0.489 5.96 92.1 3.88 4 277 18.6 17.27 21.7 1 3 0.121 0
0 6.20 0 0.504 6.55 21.4 3.38 8 307 17.4 3.76 31.5 1 1 0.767 1
0 6.20 0 0.507 8.25 70.4 3.65 8 307 17.4 3.95 48.3 1 2 0.963 1
22 5.86 0 0.431 6.96 6.8 8.91 7 330 19.1 3.53 29.6 0 1 0.129 0
90 2.97 0 0.400 7.09 20.8 7.31 1 285 15.3 7.85 32.2 0 1 0.000 0
80 1.76 0 0.385 6.23 31.5 9.09 1 241 18.2 12.93 20.1 0 2 0.000 0
33 2.18 0 0.472 6.62 58.1 3.37 7 222 18.4 8.93 28.4 1 2 0.025 0
0 9.90 0 0.544 6.12 52.8 2.64 4 304 18.4 5.98 22.1 1 2 0.200 0
0 7.38 0 0.493 6.42 40.1 4.72 5 287 19.6 6.12 25.0 1 2 0.662 1
0 7.38 0 0.493 6.31 28.9 5.42 5 287 19.6 6.15 23.0 0 2 0.005 0
0 5.19 0 0.515 5.89 59.6 5.62 5 224 20.2 10.56 18.5 0 2 0.014 0
80 2.01 0 0.435 6.63 29.7 8.34 4 280 17.0 5.99 24.5 0 2 0.000 0
0 18.10 0 0.718 3.56 87.9 1.61 24 666 20.2 7.12 27.5 1 3 1.000 1
0 18.10 1 0.631 7.02 97.5 1.20 24 666 20.2 2.96 50.0 1 3 1.000 1
0 18.10 0 0.584 6.35 86.1 2.05 24 666 20.2 17.64 14.5 1 3 1.000 1
0 18.10 0 0.740 5.93 87.9 1.82 24 666 20.2 34.02 8.4 1 3 1.000 1
0 18.10 0 0.740 5.63 93.9 1.82 24 666 20.2 22.88 12.8 1 3 1.000 1
0 18.10 0 0.740 5.82 92.4 1.87 24 666 20.2 22.11 10.5 1 3 1.000 1
0 18.10 0 0.740 6.22 100.0 2.00 24 666 20.2 16.59 18.4 1 3 1.000 1
0 18.10 0 0.740 5.85 96.6 1.90 24 666 20.2 23.79 10.8 1 3 1.000 1
0 18.10 0 0.713 6.53 86.5 2.44 24 666 20.2 18.13 14.1 1 3 1.000 1
0 18.10 0 0.713 6.38 88.4 2.57 24 666 20.2 14.65 17.7 1 3 1.000 1
0 18.10 0 0.655 6.21 65.4 2.96 24 666 20.2 13.22 21.4 1 2 1.000 1
0 9.69 0 0.585 5.79 70.6 2.89 6 391 19.2 14.10 18.3 1 2 0.892 1
0 11.93 0 0.573 6.98 91.0 2.17 1 273 21.0 5.64 23.9 1 3 0.077 0

The split between predicted outcomes is shown in the tables below, first as counts and then as proportions.


 0  1 
19 21 

    0     1 
0.475 0.525
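
As a check on the prediction table above, the probability for an evaluation row can be recomputed by hand from the Model 4 coefficients. This sketch uses the first and sixth evaluation rows. The 0.5 cutoff is an assumption, though it is consistent with the row whose probability of 0.506 is classified as 1. Small differences from the printed probabilities come from rounding in the displayed coefficients and inputs.

```r
# Model 4 coefficients as printed in the summary output
coefs <- c("(Intercept)" = -45.7332, nox = 51.4673, dis = 1.8325,
           disNew = 7.1469, rad = 0.6241, medv = 0.0844, zn = -0.0704)

# First and sixth rows of the evaluation table (printed prob: 0.237 and 0.012)
newdata <- data.frame(nox = c(0.469, 0.453), dis = c(4.97, 7.22),
                      disNew = c(1, 0), rad = c(2, 8),
                      medv = c(34.7, 18.7), zn = c(0, 25))

# Design matrix with columns in the same order as the coefficients
X    <- cbind("(Intercept)" = 1, as.matrix(newdata[, names(coefs)[-1]]))
eta  <- drop(X %*% coefs)          # linear predictor (log-odds)
prob <- plogis(eta)                # inverse logit: 1 / (1 + exp(-eta))
pred <- ifelse(prob > 0.5, 1, 0)   # assumed 0.5 classification cutoff

print(round(cbind(eta, prob, pred), 3))
```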