The objective of this analysis is to build a binary logistic regression model to predict whether Boston neighborhoods will be at risk for high crime levels. A high crime level is defined as a crime rate above the median crime rate.
The data set includes 466 observations with 12 variables (excluding the target variable). Descriptive information on the data set follows:
| variable | Description | Type |
|---|---|---|
| zn | proportion of residential land zoned for large lots (over 25000 square feet) | predictor |
| indus | proportion of non-retail business acres per suburb | predictor |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) | predictor |
| nox | nitrogen oxides concentration (parts per 10 million) | predictor |
| rm | average number of rooms per dwelling | predictor |
| age | proportion of owner-occupied units built prior to 1940 | predictor |
| dis | weighted mean of distances to five Boston employment centers | predictor |
| rad | index of accessibility to radial highways | predictor |
| tax | full-value property-tax rate per $10,000 | predictor |
| ptratio | pupil-teacher ratio by town | predictor |
| lstat | lower status of the population (percent) | predictor |
| medv | median value of owner-occupied homes in $1000s | predictor |
| target | whether the crime rate is above the median crime rate (1) or not (0) | response |
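As a minimal sketch of how a response of this kind can be derived, assuming a raw per-neighborhood crime rate is available (the vector `crim` and its values are hypothetical and not part of this data set), each rate is compared with the median:

```r
# Hypothetical crime rates for six neighborhoods (illustration only)
crim <- c(0.02, 0.08, 0.25, 3.70, 9.20, 0.60)

# 1 when the rate is above the median rate, 0 otherwise
target <- as.integer(crim > median(crim))
target
```

By construction, a median split of this kind produces two classes of nearly equal size.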
The str, summary and skim functions are used below to display a high-level summary of the data set. A detailed exploration of the variables in the data set follows.
'data.frame': 466 obs. of 13 variables:
$ zn : num 0 0 0 30 0 0 0 0 0 80 ...
$ indus : num 19.58 19.58 18.1 4.93 2.46 ...
$ chas : int 0 1 0 0 0 0 0 0 0 0 ...
$ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
$ rm : num 7.93 5.4 6.49 6.39 7.16 ...
$ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
$ dis : num 2.05 1.32 1.98 7.04 2.7 ...
$ rad : int 5 5 24 6 3 5 24 24 5 1 ...
$ tax : int 403 403 666 300 193 384 666 666 224 315 ...
$ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
$ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
$ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
$ target : int 1 1 1 0 0 0 1 1 0 0 ...
zn indus chas nox rm age
Min. : 0.0 Min. : 0.46 Min. :0.000 Min. :0.389 Min. :3.86 Min. : 2.9
1st Qu.: 0.0 1st Qu.: 5.14 1st Qu.:0.000 1st Qu.:0.448 1st Qu.:5.89 1st Qu.: 43.9
Median : 0.0 Median : 9.69 Median :0.000 Median :0.538 Median :6.21 Median : 77.2
Mean : 11.6 Mean :11.11 Mean :0.071 Mean :0.554 Mean :6.29 Mean : 68.4
3rd Qu.: 16.2 3rd Qu.:18.10 3rd Qu.:0.000 3rd Qu.:0.624 3rd Qu.:6.63 3rd Qu.: 94.1
Max. :100.0 Max. :27.74 Max. :1.000 Max. :0.871 Max. :8.78 Max. :100.0
dis rad tax ptratio lstat medv target
Min. : 1.13 Min. : 1.00 Min. :187 Min. :12.6 Min. : 1.7 Min. : 5.0 Min. :0.000
1st Qu.: 2.10 1st Qu.: 4.00 1st Qu.:281 1st Qu.:16.9 1st Qu.: 7.0 1st Qu.:17.0 1st Qu.:0.000
Median : 3.19 Median : 5.00 Median :334 Median :18.9 Median :11.3 Median :21.2 Median :0.000
Mean : 3.80 Mean : 9.53 Mean :410 Mean :18.4 Mean :12.6 Mean :22.6 Mean :0.491
3rd Qu.: 5.21 3rd Qu.:24.00 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:16.9 3rd Qu.:25.0 3rd Qu.:1.000
Max. :12.13 Max. :24.00 Max. :711 Max. :22.0 Max. :38.0 Max. :50.0 Max. :1.000
| Name | crime |
| Number of rows | 466 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
The detailed analysis will include a review of observed values, summary statistics and EDA visualizations. The objective of this analysis is to identify appropriate data preparation measures and to inform the model development process.
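The Skew and Kurt columns in the summary statistics can be illustrated with moment-based formulas. This is a sketch only: the report does not say which function produced these columns, so the formulas are an assumption. They are consistent with the convention used in the text, where a kurtosis of about 3 indicates a normal distribution. The example rebuilds chas from its observed counts (433 zeros and 33 ones).

```r
# Rebuild chas from its observed counts in this report: 433 zeros, 33 ones
x <- rep(c(0, 1), times = c(433, 33))

# Central moments (divide by n, not n - 1)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skew <- m3 / m2^1.5  # 0 for a symmetric distribution
kurt <- m4 / m2^2    # about 3 for a normal distribution (not excess kurtosis)

round(c(Mean = mean(x), SD = sd(x), Skew = skew, Kurt = kurt), 4)
```

The results should come out close to the values reported for chas (SD 0.2568, Skew 3.3463, Kurt 12.1974).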
Variable: zn
zn reflects the proportion of residential land zoned for large lots (over 25,000 square feet). The data range from 0 to 100, with a mean of 11.58 and a median of 0. Zero values represent almost 73% of the data (339 of 466 observations). The density plots reveal that high crime is associated with few or no large lots. Finally, the box plots indicate substantially higher variance in low crime areas than in high crime areas, which is consistent with the high number of zero values for the high crime designation. Key takeaway - a zero or low zn value is likely to indicate a high crime area.
Observed Value:
0 12 18 20 21 22 25 28 30 33 34 35 40 45 52 55 60 70 75 80 82 85 90 95 100 Sum
339 10 2 21 4 9 8 3 6 3 3 3 7 6 3 3 4 3 3 13 2 2 4 4 1 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.00 0.00 0.00 11.58 16.25 100.00 23.36 2.18 6.84
EDA Panel:
Variable: indus
indus is defined as the proportion of non-retail business acres per suburb. Observed values for indus range from 0 to 28, with a bi-modal distribution in aggregate. When plotted by target value, the distribution for low crime approaches normal, but the high crime distribution remains bi-modal, with peaks in the high single digits and at 18. Both the density plots and box plots show a positive correlation between indus and crime. I note that the variance for high crime is higher than that for low crime. Key takeaway - positive correlation between indus and crime.
Observed Value:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 18 20 22 26 28 Sum
1 9 31 29 30 27 45 21 26 11 27 14 4 6 9 3 121 28 14 5 5 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.460 5.145 9.690 11.105 18.100 27.740 6.846 0.289 1.764
EDA Panel:
Variable: chas
chas is binary, with the value 1 indicating that the neighborhood borders the Charles River. Only 7% of all neighborhoods (33 in total) border the river. Key takeaway - the high percentage of zero values may mean this variable is not significant.
Observed Value:
0 1 Sum
433 33 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.0000 0.0000 0.0000 0.0708 0.0000 1.0000 0.2568 3.3463 12.1974
EDA Panel:
Variable: nox
nox represents the concentration of nitrogen oxides in Boston neighborhoods. The variable ranges between 0.39 and 0.87, with a peak around 0.43-0.45. The mean is 0.55 and the skew is positive at 0.75. Both density plots are roughly bell-shaped, and the kurtosis of about 3 is in line with a normal distribution. This variable appears similar to indus, sharing the positive correlation with crime and the higher variance in the high crime box plot. Key takeaway - positive correlation between nox and crime.
Observed Value:
0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.57 0.58 0.6 0.61 0.62
4 17 18 6 27 34 20 14 4 2 28 10 16 26 5 31 13 6 27 18 12 14
0.63 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.74 0.77 0.87 Sum
4 10 4 10 8 14 11 16 5 8 8 16 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.389 0.448 0.538 0.554 0.624 0.871 0.117 0.749 2.977
EDA Panel:
Variable: rm
rm is a measure of the average number of rooms per dwelling. This variable appears to have the most normal distribution among the variables. Medians and variances are similar for the low and high crime outcomes. Key takeaway - I’m not sure this variable will be significant in our regression models.
Observed Value:
4 5 6 7 8 9 Sum
4 37 284 115 23 3 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
3.863 5.887 6.210 6.291 6.630 8.780 0.705 0.481 4.562
EDA Panel:
Variable: age
age reflects the proportion of owner-occupied units built prior to 1940. age has a mean of 68.4 and is left skewed. The variable has a range of 3 to 100, with an upper-end concentration in the values between 94 and 100. This is consistent with an old city like Boston. Higher values of age appear to be associated with higher crime. This could reflect more affluent residents moving from older urban areas to newer suburbs. This is supported by the box plots, which show that low crime areas have a materially lower median age and significantly higher variance. Low crime has a fairly flat distribution with fat tails. Key takeaway - high age appears to be associated with high crime.
Observed Value:
3 6 7 8 9 10 13 14 15 16 17 18 19 20 21 22 23 25 26 28 29 30 31 32 33 34 35
1 3 2 3 1 3 1 1 2 3 2 8 2 2 3 5 3 1 2 6 5 1 4 9 5 6 1
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63
3 7 5 2 4 2 5 3 2 3 5 4 4 3 2 1 5 3 9 3 1 4 4 3 1 6 3
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
1 5 2 3 4 3 5 3 4 5 5 3 4 5 5 4 5 3 5 10 7 7 5 4 10 7 8
91 92 93 94 95 96 97 98 99 100 Sum
9 9 8 14 14 14 14 18 10 42 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
2.90 43.88 77.15 68.37 94.10 100.00 28.32 -0.58 2.00
EDA Panel:
Variable: dis
dis is defined as the weighted mean of distances to five Boston employment centers. Observed values range from 1 to 12, with a significant concentration between 2 and 6. The variable is right skewed (skew of 1.00). The box plots indicate that low crime areas are associated with higher average distances to employment centers. Key takeaway - dis appears to be negatively correlated with crime.
Observed Value:
1 2 3 4 5 6 7 8 9 11 12 Sum
28 145 83 65 47 44 24 16 9 4 1 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.13 2.10 3.19 3.80 5.21 12.13 2.11 1.00 3.49
EDA Panel:
Variable: rad
rad is an index that measures a neighborhood’s accessibility to radial highways. Observed values range from 1 to 24, with three peaks at 4, 5 and 24. The box plots show a small variance and a low median for low crime neighborhoods, and a high median and variance for high crime neighborhoods. Key takeaway - there appears to be a positive correlation between rad and crime.
Observed Value:
1 2 3 4 5 6 7 8 24 Sum
17 20 36 103 109 25 15 20 121 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.00 4.00 5.00 9.53 24.00 24.00 8.69 1.01 2.15
EDA Panel:
Variable: tax
tax is the full-value property tax rate per $10,000 of property value. Observed values range from 187 to 711, with concentrations around 300 and at 666. The density plot shows a bi-modal distribution for high crime neighborhoods, with peaks near 300 and 700. Similar to rad, high crime appears positively correlated with tax. Additionally, tax, like rad, has a high median and a large variance in the high crime group - no doubt a result of its bi-modal distribution. Key takeaway - tax is positively correlated with crime and its distribution is very similar to that of rad.
Observed Value:
187 188 193 198 216 222 223 224 226 233 242 243 244 245 247 252 254 255 256 264 265 270 273 276 277 279 281
1 5 8 1 5 6 5 9 1 9 1 4 1 3 2 2 5 1 1 12 2 7 4 8 10 3 4
284 287 289 293 296 300 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384 391 398 402
5 6 5 3 8 7 13 4 35 7 1 2 6 9 2 2 2 3 2 1 2 3 2 11 7 12 2
403 411 422 430 432 437 469 666 711 Sum
28 2 1 3 9 14 1 121 5 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
187.000 281.000 334.500 409.502 666.000 711.000 167.900 0.661 1.860
EDA Panel:
Variable: ptratio
ptratio reflects the average pupil-to-teacher ratio in schools. The variable has an observed value range of 13 to 22, with concentrations at 15 and 18 to 20. High crime neighborhoods appear to be associated with higher pupil-teacher ratios. Low crime neighborhoods have a lower median ratio than high crime neighborhoods, but a higher variance. Key takeaway - high crime is positively correlated with ptratio, and the high crime distribution has a strong left skew.
Observed Value:
15 20 Sum
138 328 466
13 14 15 16 17 18 19 20 21 22 Sum
15 2 55 21 45 67 65 145 49 2 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
12.600 16.900 18.900 18.398 20.200 22.000 2.197 -0.757 2.611
EDA Panel:
Variable: lstat
lstat is the percentage of the population considered to be of lower status. Observed values range between 2 and 38, with concentrations around 4 to 10 and 12 to 18. The box plots indicate that high crime neighborhoods have a higher median and variance. Key takeaway - the high crime density chart appears to be nearly normally distributed and overlaps significantly with the low crime distribution. I’m not sure this is ideal for classification purposes.
Observed Value:
0 5 10 15 20 25 30 35 40 Sum
4 126 130 103 54 28 16 4 1 466
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
4 14 20 35 25 32 28 24 37 18 23 24 19 22 20 18 19 11 11 11 2 8 9 2 5 4 3
29 30 31 32 34 35 37 38 Sum
2 5 5 1 2 1 1 1 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.730 7.043 11.350 12.631 16.930 37.970 7.102 0.909 3.518
EDA Panel:
Variable: medv
medv reflects the median value (in thousands of dollars) of owner-occupied homes in Boston neighborhoods. The variable has an observed value range of 5 to 50. It is right skewed, and high values of medv appear to be associated with lower crime rates. The box plots show that the variances of medv by crime type are similar; however, low crime neighborhoods have a higher median medv of approximately 22, versus 18 for high crime neighborhoods. Key takeaway - the density plots depict significant overlap between the low and high crime neighborhoods.
Observed Value:
5 10 15 20 25 30 35 40 45 50 Sum
10 34 82 145 93 41 28 7 9 17 466
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
2 2 6 7 3 9 4 11 16 20 16 14 16 19 28 40 25 33 35 30 18 4 6 10 9 9 5
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Sum
8 10 1 8 6 3 2 1 1 1 2 2 3 2 1 1 1 1 15 466
Summary Statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
5.00 17.02 21.20 22.59 25.00 50.00 9.24 1.08 4.39
EDA Panel:
Variable: target
target, our response variable, has 466 observations: 237 0s and 229 1s, a split of approximately 51% and 49%, respectively. The variable has no missing data.
Observed Value:
| Item/Response | Low Crime(0) | High Crime (1) | Totals |
|---|---|---|---|
| Frequency | 237 | 229 | 466 |
| Percentage | 51% | 49% | 100% |
We utilized a pairs plot to better understand the correlation between variables. In the plot below, we employ scatter plots and loess curves to better understand the relationship between variables - high crime neighborhoods are represented in red and low crime neighborhoods in blue.
Pairs Plot Observations:
Here are my observations from the pairs plot above. These observations as well as the EDA analysis above will inform my data preparation decisions in the next section.
High Correlation - There are 7 correlations with an absolute value greater than 70% in the pairs plot, excluding the target:
| Variable pair | Correlation |
|---|---|
| indus, tax | 73% |
| indus, nox | 76% |
| nox, age | 74% |
| nox, dis | -77% |
| rm, medv | 71% |
| age, dis | 75% |
| rad, tax | 91% |
Convex Loess Curves - the loess curves for rm, ptratio and medv all change direction. Referring to the density plots for each of these variables, I believe this reflects the large extent of overlap between the low crime and high crime density plots. This may be consistent with a reduced ability to classify.
L-Shaped Loess Curve - the loess curve for zn and target almost forms an L-shape. This likely reflects the high concentration of zero values.
S-Curves - the loess curves for age, dis, nox and lstat seem to form sigmoid curves. Could this be an indication of a better ability to classify?
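The screen for high correlations can be sketched in base R. This is an illustration on synthetic data, not the author's code: the data frame `df`, its columns `a`, `b` and `c`, and the seed are all hypothetical. The 0.7 cutoff mirrors the 70% threshold used above.

```r
set.seed(1)  # hypothetical seed, for reproducibility only
n <- 200
x <- rnorm(n)

# Synthetic data: b is built from a, c is independent noise
df <- data.frame(a = x, b = x + rnorm(n, sd = 0.3), c = rnorm(n))

cm <- cor(df)

# Look at each pair once (upper triangle) and keep |r| > 0.7
idx <- which(upper.tri(cm) & abs(cm) > 0.7, arr.ind = TRUE)
high <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                   var2 = colnames(cm)[idx[, "col"]],
                   r    = cm[idx])
high
```

Applied to the predictors of the crime data, a filter of this kind would list the pairs flagged above.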
The data preparation phase of my analysis addresses problems with the data (missing values, outliers, multicollinearity), transforms variables and engineers new features. These actions aim to improve the modeling phase of the analysis and the ability of our models to classify accurately.
Missing Data and Outliers
The skim function (from the skimr package) showed that the data set has no missing values. Similarly, the EDA above did not identify any problems with outliers.
Summary of Data Preparation Actions
The following table outlines my data preparation actions and rationale.
| variable | Data Prep Action | Rationale |
|---|---|---|
| age | ageNew, a categorical with three levels: 1 (<=25), 2 (26-74), 3 (>=75) | The tails of the density plots appear to be good classifiers |
| dis | disNew, a categorical with two levels: 1 (<5), 0 (>=5) | Bi-modal distribution of high crime |
| tax | drop variable or don’t use | 0.90+ correlation with rad |
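A minimal sketch of how the two derived variables could be built is shown below. The exact code is not given in the report, so the cut points and their handling at the boundaries are assumptions, chosen to agree with the disNew and ageNew values in the str output that follows. The input values are the first five age and dis observations from that output.

```r
# First five age and dis values from the str() output in this report
crime <- data.frame(
  age = c(96.2, 100, 100, 7.8, 92.2),
  dis = c(2.05, 1.32, 1.98, 7.04, 2.70)
)

# disNew: 1 when dis is below 5, otherwise 0 (cut point assumed)
crime$disNew <- ifelse(crime$dis < 5, 1, 0)

# ageNew: 1 (age <= 25), 2 (between 25 and 75), 3 (age >= 75); cut points assumed
crime$ageNew <- ifelse(crime$age <= 25, 1L,
                       ifelse(crime$age < 75, 2L, 3L))
crime
```

If the levels are meant to be unordered categories, wrapping the new columns in `factor()` before modeling would be a reasonable alternative. The models in this report enter them as numeric values.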
We use the str, summary and skim functions to provide high level summary information on our revised data set. Plots of our new variables are also set forth below.
'data.frame': 466 obs. of 15 variables:
$ zn : num 0 0 0 30 0 0 0 0 0 80 ...
$ indus : num 19.58 19.58 18.1 4.93 2.46 ...
$ chas : int 0 1 0 0 0 0 0 0 0 0 ...
$ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
$ rm : num 7.93 5.4 6.49 6.39 7.16 ...
$ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
$ dis : num 2.05 1.32 1.98 7.04 2.7 ...
$ rad : int 5 5 24 6 3 5 24 24 5 1 ...
$ tax : int 403 403 666 300 193 384 666 666 224 315 ...
$ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
$ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
$ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
$ target : int 1 1 1 0 0 0 1 1 0 0 ...
$ disNew : num 1 1 1 0 1 1 1 1 0 0 ...
$ ageNew : int 3 3 3 1 3 2 3 3 2 1 ...
zn indus chas nox rm age
Min. : 0.0 Min. : 0.46 Min. :0.000 Min. :0.389 Min. :3.86 Min. : 2.9
1st Qu.: 0.0 1st Qu.: 5.14 1st Qu.:0.000 1st Qu.:0.448 1st Qu.:5.89 1st Qu.: 43.9
Median : 0.0 Median : 9.69 Median :0.000 Median :0.538 Median :6.21 Median : 77.2
Mean : 11.6 Mean :11.11 Mean :0.071 Mean :0.554 Mean :6.29 Mean : 68.4
3rd Qu.: 16.2 3rd Qu.:18.10 3rd Qu.:0.000 3rd Qu.:0.624 3rd Qu.:6.63 3rd Qu.: 94.1
Max. :100.0 Max. :27.74 Max. :1.000 Max. :0.871 Max. :8.78 Max. :100.0
dis rad tax ptratio lstat medv target
Min. : 1.13 Min. : 1.00 Min. :187 Min. :12.6 Min. : 1.7 Min. : 5.0 Min. :0.000
1st Qu.: 2.10 1st Qu.: 4.00 1st Qu.:281 1st Qu.:16.9 1st Qu.: 7.0 1st Qu.:17.0 1st Qu.:0.000
Median : 3.19 Median : 5.00 Median :334 Median :18.9 Median :11.3 Median :21.2 Median :0.000
Mean : 3.80 Mean : 9.53 Mean :410 Mean :18.4 Mean :12.6 Mean :22.6 Mean :0.491
3rd Qu.: 5.21 3rd Qu.:24.00 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:16.9 3rd Qu.:25.0 3rd Qu.:1.000
Max. :12.13 Max. :24.00 Max. :711 Max. :22.0 Max. :38.0 Max. :50.0 Max. :1.000
disNew ageNew
Min. :0.000 Min. :1.00
1st Qu.:0.000 1st Qu.:2.00
Median :1.000 Median :3.00
Mean :0.727 Mean :2.42
3rd Qu.:1.000 3rd Qu.:3.00
Max. :1.000 Max. :3.00
| Name | crime |
| Number of rows | 466 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| disNew | 0 | 1 | 0.73 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| ageNew | 0 | 1 | 2.42 | 0.66 | 1.00 | 2.00 | 3.00 | 3.00 | 3.00 | ▂▁▆▁▇ |
The sample.split function is used to create training and test data sets. The training set includes 350 rows and the test set includes 116. Next, we use the training set to fit the models and then use the test set to evaluate their performance. Evaluating on held-out data guards against rewarding a model that overfits the training set.
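The split can be sketched with base R alone. This is a stand-in for illustration and not the author's code: it uses `sample` rather than sample.split, and the seed, the toy data frame and the `id` column are hypothetical. Only the row counts (350 and 116) come from the report.

```r
set.seed(123)  # hypothetical seed, for reproducibility only

# Toy stand-in for the crime data: 466 rows with a binary target
crime <- data.frame(id = 1:466,
                    target = rep(c(0, 1), times = c(237, 229)))

# Draw 350 row indices at random for training; the rest form the test set
train_idx  <- sample(nrow(crime), size = 350)
crimeTrain <- crime[train_idx, ]
crimeTest  <- crime[-train_idx, ]

c(train = nrow(crimeTrain), test = nrow(crimeTest))
```

Unlike this simple random draw, a stratified split would also keep the proportion of 0s and 1s roughly equal across the two sets.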
The first model uses all the original variables except tax, which was excluded because of its high correlation with rad. This model should establish a good baseline to improve upon.
Call:
glm(formula = target ~ zn + indus + chas + nox + rm + age + dis +
rad + ptratio + lstat + medv, family = binomial(link = "logit"),
data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8619 -0.1458 -0.0001 0.0040 2.8314
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -45.2510 8.5187 -5.31 1.1e-07 ***
zn -0.1498 0.0623 -2.40 0.0162 *
indus -0.1118 0.0545 -2.05 0.0402 *
chas 0.2473 0.8905 0.28 0.7812
nox 45.2207 9.6111 4.71 2.5e-06 ***
rm 0.3041 0.9205 0.33 0.7411
age 0.0345 0.0168 2.06 0.0399 *
dis 0.7221 0.2877 2.51 0.0121 *
rad 0.5379 0.1709 3.15 0.0016 **
ptratio 0.3300 0.1467 2.25 0.0245 *
lstat 0.1156 0.0695 1.66 0.0961 .
medv 0.2141 0.0918 2.33 0.0196 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 130.95 on 338 degrees of freedom
AIC: 154.9
Number of Fisher Scoring iterations: 9
Reference
Prediction 0 1
0 52 7
1 10 47
fitting null model for pseudo-r2
The model summary and the confusion matrix from running this model against the test data are above. The accuracy rate (0.853) is very good and the McFadden R^2 value (0.73) is also high. The AIC value is 155. Additionally, consider the ROC curve for this model.
Area under the curve is 0.941.
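The accuracy and McFadden R^2 figures quoted for this model can be checked by hand from the output above. The sketch below only repeats that arithmetic. It assumes the pseudo R^2 is computed as one minus the ratio of residual to null deviance, which is equivalent to the log-likelihood form for a 0/1 response.

```r
# Model 1 confusion matrix on the test set (rows: prediction, columns: reference)
cm <- matrix(c(52, 10, 7, 47), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy: correctly classified cases (the diagonal) over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# McFadden pseudo R^2 from the residual and null deviances in the summary
mcfadden <- 1 - 130.95 / 485.10

round(c(accuracy = accuracy, mcfadden = mcfadden), 3)
```

The same two lines of arithmetic apply to the other models by substituting their confusion matrices and deviances.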
Model 2 uses only nox, rad and the two new variables, disNew and ageNew. It performed worse than Model 1, and it appears the new variables had only a marginal impact.
Call:
glm(formula = target ~ nox + rad + disNew + ageNew, family = binomial(link = "logit"),
data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6901 -0.2596 -0.0211 0.0027 3.0568
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.483 2.679 -6.90 5.2e-12 ***
nox 20.464 4.836 4.23 2.3e-05 ***
rad 0.624 0.144 4.33 1.5e-05 ***
disNew 2.655 1.071 2.48 0.013 *
ageNew 0.634 0.437 1.45 0.147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 150.83 on 345 degrees of freedom
AIC: 160.8
Number of Fisher Scoring iterations: 8
Reference
Prediction 0 1
0 51 8
1 13 44
fitting null model for pseudo-r2
The model summary and the confusion matrix from running this model against the test data are above. The accuracy rate (0.819) is very good and the McFadden R^2 value (0.689) is also high. The AIC value is 161. Additionally, consider the ROC curve for this model.
Area under the curve is 0.937.
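For readers unfamiliar with the area under the ROC curve, it equals the probability that a randomly chosen high crime case receives a higher predicted probability than a randomly chosen low crime case. The sketch below computes it from ranks. The function name `auc_rank` and the toy scores are hypothetical, and this is not necessarily how the report's values were produced.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic, scaled to 0-1)
auc_rank <- function(score, label) {
  r  <- rank(score)        # average ranks are used for ties
  n1 <- sum(label == 1)    # number of positive cases
  n0 <- sum(label == 0)    # number of negative cases
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical predicted probabilities and true classes
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

auc_rank(score, label)
```

In this toy example three of the four positive-negative pairs are ordered correctly, so the result is 0.75.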
In Model 3, unwilling to accept that the new variables were ineffective, I combine the best performing variables, nox and rad, with the new variables and the variables they were derived from (dis and disNew, age and ageNew). This had a positive effect on accuracy and overall performance.
Call:
glm(formula = target ~ nox + rad + disNew + dis + age + ageNew,
family = binomial(link = "logit"), data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7462 -0.1694 -0.0021 0.0005 2.8860
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -43.53050 7.17951 -6.06 1.3e-09 ***
nox 42.69831 8.03167 5.32 1.1e-07 ***
rad 0.81938 0.18138 4.52 6.3e-06 ***
disNew 8.42730 2.06516 4.08 4.5e-05 ***
dis 1.63966 0.35921 4.56 5.0e-06 ***
age 0.00747 0.02350 0.32 0.75
ageNew 0.84177 0.89047 0.95 0.34
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 126.37 on 343 degrees of freedom
AIC: 140.4
Number of Fisher Scoring iterations: 9
Reference
Prediction 0 1
0 52 7
1 6 51
fitting null model for pseudo-r2
The model summary and the confusion matrix from running this model against the test data are above. The accuracy rate (0.888) is very good and the McFadden R^2 value (0.74) is also high. The AIC value is 140. Additionally, consider the ROC curve for this model.
Area under the curve is 0.969.
Model 4 seeks to improve upon Model 3 by dropping its poorly performing variables (age and ageNew) and adding medv and zn, two moderately significant variables from Model 1.
Call:
glm(formula = target ~ nox + dis + disNew + rad + medv + zn,
family = binomial(link = "logit"), data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2078 -0.1409 -0.0007 0.0018 2.3021
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -45.7332 7.6330 -5.99 2.1e-09 ***
nox 51.4673 8.9988 5.72 1.1e-08 ***
dis 1.8325 0.4158 4.41 1.0e-05 ***
disNew 7.1469 1.7964 3.98 6.9e-05 ***
rad 0.6241 0.1826 3.42 0.00063 ***
medv 0.0844 0.0425 1.99 0.04675 *
zn -0.0704 0.0397 -1.77 0.07607 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 124.04 on 343 degrees of freedom
AIC: 138
Number of Fisher Scoring iterations: 9
Reference
Prediction 0 1
0 56 3
1 4 53
fitting null model for pseudo-r2
The model summary and the confusion matrix from running this model against the test data are above. The accuracy rate (0.94) is very good and the McFadden R^2 value (0.744) is also high. The AIC value is 138. Additionally, consider the ROC curve for this model.
Area under the curve is 0.971.
In Model 5 we drop zn and medv because the absolute values of their z statistics were less than 2.0.
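The z value column in the model summaries is the estimate divided by its standard error, and the p-value is the two-sided normal tail probability of that ratio. As a sketch, the figures for medv and zn can be approximately recovered from the rounded Model 4 coefficients. Small differences from the printed p-values are expected because of that rounding.

```r
# Rounded estimates and standard errors from the Model 4 summary
est <- c(medv = 0.0844, zn = -0.0704)
se  <- c(medv = 0.0425, zn = 0.0397)

z <- est / se             # Wald z statistic
p <- 2 * pnorm(-abs(z))   # two-sided p-value under the standard normal

round(cbind(z, p), 3)
```

Both z statistics fall just inside the range of -2 to 2, which corresponds to p-values near or above 0.05.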
Call:
glm(formula = target ~ nox + dis + disNew + rad, family = binomial(link = "logit"),
data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1884 -0.1778 -0.0043 0.0007 2.3900
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.981 6.906 -6.08 1.2e-09 ***
nox 46.856 7.803 6.00 1.9e-09 ***
dis 1.517 0.350 4.33 1.5e-05 ***
disNew 7.927 1.983 4.00 6.4e-05 ***
rad 0.783 0.171 4.58 4.7e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 131.35 on 345 degrees of freedom
AIC: 141.4
Number of Fisher Scoring iterations: 9
Reference
Prediction 0 1
0 52 7
1 4 53
fitting null model for pseudo-r2
The model summary and the confusion matrix from running this model against the test data are above. The accuracy rate (0.905) is very good and the McFadden R^2 value (0.729) is also high. The AIC value is 141. Additionally, consider the ROC curve for this model.
Area under the curve is 0.968.
Model 4 and Model 5 distinguished themselves from the other models. My model selection will be between these two models. The table below sets forth some key metrics that will facilitate the selection process.
| Model | Variables | MF R-Squared | ROC/AUC | Accuracy | AIC | Intuitive |
|---|---|---|---|---|---|---|
| Model 4 | 6 | 0.744 | 0.971 | 0.94 | 138 | Yes |
| Model 5 | 4 | 0.729 | 0.968 | 0.91 | 141 | Yes |
Both models yielded very good results, and the key metrics are similar. Model 4 has two more variables than Model 5; however, the extra parameters did not result in a higher AIC: Model 4 has the lower value (138 versus 141), although the difference is small. Given this, the decision between the two models comes down to accuracy and McFadden R-squared. On both metrics Model 4 is superior to Model 5. As a result, Model 4 is the selected model.
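The AIC comparison can be made concrete. For a logistic regression on 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients (intercept included). The sketch below applies this to the deviances reported above. The helper name `aic_from_deviance` is hypothetical.

```r
# AIC for an ungrouped binary logistic regression:
# residual deviance + 2 * (number of coefficients, including the intercept)
aic_from_deviance <- function(deviance, k) deviance + 2 * k

aic4 <- aic_from_deviance(124.04, k = 7)  # Model 4: 6 predictors + intercept
aic5 <- aic_from_deviance(131.35, k = 5)  # Model 5: 4 predictors + intercept

c(model4 = aic4, model5 = aic5)
```

Model 4 pays a penalty of 4 for its two extra coefficients, but its residual deviance is about 7.3 lower, which is why its AIC is still the smaller of the two.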
The Model 4 summary and prediction against the evaluation data set follow:
The fitted coefficients indicate that the probability of high crime increases as nox, dis, disNew, rad and medv increase, and decreases as zn increases. Based upon my understanding of our variables, this appears to make sense. Note that medv and zn are less significant than the other variables.
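To make the fitted model concrete, the sketch below rebuilds the predicted probability by hand from the rounded Model 4 coefficients printed in the summary. The helper name `predict_prob` is hypothetical. Because the coefficients are rounded, results should only match the prob column of the evaluation table to about two or three decimals.

```r
# Model 4 coefficients as printed in the summary (rounded)
b <- c(intercept = -45.7332, nox = 51.4673, dis = 1.8325,
       disNew = 7.1469, rad = 0.6241, medv = 0.0844, zn = -0.0704)

predict_prob <- function(nox, dis, disNew, rad, medv, zn) {
  # Linear predictor (log-odds of high crime)
  eta <- b["intercept"] + b["nox"] * nox + b["dis"] * dis +
    b["disNew"] * disNew + b["rad"] * rad + b["medv"] * medv + b["zn"] * zn
  unname(plogis(eta))  # inverse logit: 1 / (1 + exp(-eta))
}

# First row of the evaluation table (reported prob: 0.237)
p1 <- predict_prob(nox = 0.469, dis = 4.97, disNew = 1, rad = 2,
                   medv = 34.7, zn = 0)

# Sixth row of the evaluation table (reported prob: 0.012)
p6 <- predict_prob(nox = 0.453, dis = 7.22, disNew = 0, rad = 8,
                   medv = 18.7, zn = 25)

round(c(p1, p6), 3)
```

With a fitted glm object available, `predict(model, newdata, type = "response")` performs the same calculation at full precision.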
Call:
glm(formula = target ~ nox + dis + disNew + rad + medv + zn,
family = binomial(link = "logit"), data = crimeTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2078 -0.1409 -0.0007 0.0018 2.3021
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -45.7332 7.6330 -5.99 2.1e-09 ***
nox 51.4673 8.9988 5.72 1.1e-08 ***
dis 1.8325 0.4158 4.41 1.0e-05 ***
disNew 7.1469 1.7964 3.98 6.9e-05 ***
rad 0.6241 0.1826 3.42 0.00063 ***
medv 0.0844 0.0425 1.99 0.04675 *
zn -0.0704 0.0397 -1.77 0.07607 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 349 degrees of freedom
Residual deviance: 124.04 on 343 degrees of freedom
AIC: 138
Number of Fisher Scoring iterations: 9
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | disNew | ageNew | prob | predict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.07 | 0 | 0.469 | 7.18 | 61.1 | 4.97 | 2 | 242 | 17.8 | 4.03 | 34.7 | 1 | 2 | 0.237 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.10 | 84.5 | 4.46 | 4 | 307 | 21.0 | 10.26 | 18.2 | 1 | 3 | 0.788 | 1 |
| 0 | 8.14 | 0 | 0.538 | 6.50 | 94.4 | 4.46 | 4 | 307 | 21.0 | 12.80 | 18.4 | 1 | 3 | 0.789 | 1 |
| 0 | 8.14 | 0 | 0.538 | 5.95 | 82.0 | 3.99 | 4 | 307 | 21.0 | 27.71 | 13.2 | 1 | 3 | 0.506 | 1 |
| 0 | 5.96 | 0 | 0.499 | 5.85 | 41.5 | 3.93 | 5 | 279 | 19.2 | 8.77 | 21.0 | 1 | 2 | 0.310 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.74 | 66.2 | 7.22 | 8 | 284 | 19.7 | 13.15 | 18.7 | 0 | 2 | 0.012 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.97 | 93.4 | 6.82 | 8 | 284 | 19.7 | 14.44 | 16.0 | 0 | 3 | 0.005 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.63 | 56.1 | 4.44 | 3 | 247 | 18.5 | 6.53 | 26.6 | 1 | 2 | 0.038 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.12 | 56.8 | 3.75 | 3 | 247 | 18.5 | 8.44 | 22.2 | 1 | 2 | 0.008 | 0 |
| 0 | 2.89 | 0 | 0.445 | 6.16 | 69.6 | 3.50 | 2 | 276 | 18.0 | 11.34 | 21.4 | 1 | 2 | 0.002 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.86 | 97.0 | 1.94 | 2 | 188 | 19.1 | 25.41 | 17.3 | 1 | 3 | 0.082 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.61 | 95.6 | 1.76 | 2 | 188 | 19.1 | 27.26 | 15.7 | 1 | 3 | 0.053 | 0 |
| 0 | 21.89 | 0 | 0.624 | 5.64 | 94.7 | 1.98 | 4 | 437 | 21.2 | 18.34 | 14.3 | 1 | 3 | 0.703 | 1 |
| 0 | 19.58 | 0 | 0.605 | 6.10 | 93.0 | 2.28 | 5 | 403 | 14.7 | 9.81 | 25.0 | 1 | 3 | 0.877 | 1 |
| 0 | 19.58 | 0 | 0.605 | 5.88 | 97.3 | 2.39 | 5 | 403 | 14.7 | 12.03 | 19.1 | 1 | 3 | 0.840 | 1 |
| 0 | 10.59 | 1 | 0.489 | 5.96 | 92.1 | 3.88 | 4 | 277 | 18.6 | 17.27 | 21.7 | 1 | 3 | 0.121 | 0 |
| 0 | 6.20 | 0 | 0.504 | 6.55 | 21.4 | 3.38 | 8 | 307 | 17.4 | 3.76 | 31.5 | 1 | 1 | 0.767 | 1 |
| 0 | 6.20 | 0 | 0.507 | 8.25 | 70.4 | 3.65 | 8 | 307 | 17.4 | 3.95 | 48.3 | 1 | 2 | 0.963 | 1 |
| 22 | 5.86 | 0 | 0.431 | 6.96 | 6.8 | 8.91 | 7 | 330 | 19.1 | 3.53 | 29.6 | 0 | 1 | 0.129 | 0 |
| 90 | 2.97 | 0 | 0.400 | 7.09 | 20.8 | 7.31 | 1 | 285 | 15.3 | 7.85 | 32.2 | 0 | 1 | 0.000 | 0 |
| 80 | 1.76 | 0 | 0.385 | 6.23 | 31.5 | 9.09 | 1 | 241 | 18.2 | 12.93 | 20.1 | 0 | 2 | 0.000 | 0 |
| 33 | 2.18 | 0 | 0.472 | 6.62 | 58.1 | 3.37 | 7 | 222 | 18.4 | 8.93 | 28.4 | 1 | 2 | 0.025 | 0 |
| 0 | 9.90 | 0 | 0.544 | 6.12 | 52.8 | 2.64 | 4 | 304 | 18.4 | 5.98 | 22.1 | 1 | 2 | 0.200 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.42 | 40.1 | 4.72 | 5 | 287 | 19.6 | 6.12 | 25.0 | 1 | 2 | 0.662 | 1 |
| 0 | 7.38 | 0 | 0.493 | 6.31 | 28.9 | 5.42 | 5 | 287 | 19.6 | 6.15 | 23.0 | 0 | 2 | 0.005 | 0 |
| 0 | 5.19 | 0 | 0.515 | 5.89 | 59.6 | 5.62 | 5 | 224 | 20.2 | 10.56 | 18.5 | 0 | 2 | 0.014 | 0 |
| 80 | 2.01 | 0 | 0.435 | 6.63 | 29.7 | 8.34 | 4 | 280 | 17.0 | 5.99 | 24.5 | 0 | 2 | 0.000 | 0 |
| 0 | 18.10 | 0 | 0.718 | 3.56 | 87.9 | 1.61 | 24 | 666 | 20.2 | 7.12 | 27.5 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 1 | 0.631 | 7.02 | 97.5 | 1.20 | 24 | 666 | 20.2 | 2.96 | 50.0 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.584 | 6.35 | 86.1 | 2.05 | 24 | 666 | 20.2 | 17.64 | 14.5 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.93 | 87.9 | 1.82 | 24 | 666 | 20.2 | 34.02 | 8.4 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.63 | 93.9 | 1.82 | 24 | 666 | 20.2 | 22.88 | 12.8 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.82 | 92.4 | 1.87 | 24 | 666 | 20.2 | 22.11 | 10.5 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.22 | 100.0 | 2.00 | 24 | 666 | 20.2 | 16.59 | 18.4 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.85 | 96.6 | 1.90 | 24 | 666 | 20.2 | 23.79 | 10.8 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.53 | 86.5 | 2.44 | 24 | 666 | 20.2 | 18.13 | 14.1 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.38 | 88.4 | 2.57 | 24 | 666 | 20.2 | 14.65 | 17.7 | 1 | 3 | 1.000 | 1 |
| 0 | 18.10 | 0 | 0.655 | 6.21 | 65.4 | 2.96 | 24 | 666 | 20.2 | 13.22 | 21.4 | 1 | 2 | 1.000 | 1 |
| 0 | 9.69 | 0 | 0.585 | 5.79 | 70.6 | 2.89 | 6 | 391 | 19.2 | 14.10 | 18.3 | 1 | 2 | 0.892 | 1 |
| 0 | 11.93 | 0 | 0.573 | 6.98 | 91.0 | 2.17 | 1 | 273 | 21.0 | 5.64 | 23.9 | 1 | 3 | 0.077 | 0 |
The split between predicted outcomes is shown in the tables below, first as counts and then as proportions.
0 1
19 21
0 1
0.475 0.525
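The two tables above can be reproduced from the prob column of the evaluation table. The sketch assumes a classification cutoff of 0.5, which the report does not state but which is consistent with the row whose probability of 0.506 is classified as 1.

```r
# prob column of the evaluation table, in row order
prob <- c(0.237, 0.788, 0.789, 0.506, 0.310, 0.012, 0.005, 0.038, 0.008,
          0.002, 0.082, 0.053, 0.703, 0.877, 0.840, 0.121, 0.767, 0.963,
          0.129, 0.000, 0.000, 0.025, 0.200, 0.662, 0.005, 0.014, 0.000,
          rep(1.000, 11), 0.892, 0.077)

# Classify as high crime (1) when the probability exceeds the assumed 0.5 cutoff
pred <- ifelse(prob > 0.5, 1, 0)

table(pred)              # counts per predicted class
prop.table(table(pred))  # proportions per predicted class
```

The counts come to 19 low crime and 21 high crime predictions, matching the tables above.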