————————————————————————————————————————–

In this assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided).

Part 1: Data Exploration

Data Summary

The data set we are exploring focuses on the binomial success or failure of whether the crime rate is above (1) or below (0) the median for a particular neighborhood. There are 466 rows of data, each representing a neighborhood located in the Boston metropolitan area. For each neighborhood we are provided with 13 attributes that could potentially be used as predictor variables and one response variable (“target”) that indicates whether or not the neighborhood does, in fact, have an above-median crime rate. Our assignment is to try to predict whether or not a neighborhood is likely to have an above-median crime rate based on the 13 attributes provided in the data set. A summary table for the data set is provided below.

vars n mean sd median min max range skew kurtosis se
zn 466 12 23 0 0 100 100 2 3.81 1.08
indus 466 11 7 10 0 28 28 0 -1.24 0.32
chas 466 0 0 0 0 1 1 3 9.15 0.01
nox 466 1 0 1 0 1 0 1 -0.04 0.01
rm 466 6 1 6 4 9 5 0 1.54 0.03
age 466 68 28 77 3 100 97 -1 -1.01 1.31
dis 466 4 2 3 1 12 11 1 0.47 0.10
rad 466 10 9 5 1 24 23 1 -0.86 0.40
tax 466 410 168 335 187 711 524 1 -1.15 7.78
ptratio 466 18 2 19 13 22 9 -1 -0.40 0.10
black 466 357 91 391 0 397 397 -3 7.34 4.23
lstat 466 13 7 11 2 38 36 1 0.50 0.33
medv 466 23 9 21 5 50 45 1 1.37 0.43
target 466 0 1 0 0 1 1 0 -2.00 0.02

Our analysis of the data yielded no evidence of missing values. Some variables, such as ‘zn’, ‘age’, ‘rad’, and ‘tax’, show evidence of significant skew, as indicated by the large differences between their mean and median values, while ‘black’ shows signs of being potentially problematic given its large kurtosis and standard error values.
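These checks can be reproduced with a short sketch; the summary table above resembles the output of the describe() function from the psych package. The data frame and file names below are assumptions for illustration:

# Load the training data (file name assumed for illustration)
crime_train <- read.csv("crime-training-data.csv")

# Confirm there are no missing values in any column
colSums(is.na(crime_train))

# Descriptive statistics similar to the summary table shown above
library(psych)
describe(crime_train)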

Analysis of Predictor Variables

Boxplots of each independent variable relative to the binary response are one way in which we can further investigate potential skew issues while also gaining insight into the predictive aspects of the data:
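One way to generate such boxplots is sketched below, assuming the crime_train data frame from above (the original plotting code resides in the Appendix):

# Reshape to long format so every predictor can share one faceted plot
library(ggplot2)
library(tidyr)

crime_long <- pivot_longer(crime_train, cols = -target,
                           names_to = "variable", values_to = "value")

# One boxplot per predictor, split by the binary response
ggplot(crime_long, aes(x = factor(target), y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "target (1 = above-median crime rate)", y = NULL)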

The boxplots shown above indicate that our skew suspicions regarding the ‘zn’, ‘age’, ‘rad’, and ‘tax’ variables appear to have been justified. Furthermore, we are able to deduce some predictive aspects of the variables in relation to the response variable. For example, it appears as though larger values for ‘zn’, ‘rm’, ‘dis’, and ‘medv’ will each tend to decrease the likelihood that a neighborhood will have an above-median crime rate, while larger values for ‘indus’, ‘nox’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘lstat’ will each tend to increase the likelihood of an above-median crime rate for a given neighborhood.

Histograms allow us to more thoroughly examine whether the distribution of a variable is skewed, as well as whether there may be a high incidence of specific variable values throughout the data set:

The histograms again demonstrate the skew issues we’ve previously identified while also indicating that the distributions of many of the skewed variables may be dominated by high frequencies of specific data values. For example, ‘zn’, ‘indus’, ‘chas’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘black’ all appear to not only be skewed, but also have a large number of data values “clustered” around a specific number.

Aggregating variable values by their frequencies allows us to assess whether any distinctive clustering effects occur within the data. For example, the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables each have an unusually large number of identical values. Analysis reveals that 121 rows of the data set contain recurring values for each of these variables.
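A sketch of this frequency aggregation; the specific values tested are those reported in the clustering table further below:

# Most common value and its frequency for each predictor
sapply(crime_train[, names(crime_train) != "target"], function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  c(value = names(tab)[1], count = as.integer(tab[1]))
})

# Rows sharing the recurring values of all four variables simultaneously
sum(crime_train$indus == 18.10 & crime_train$rad == 24 &
    crime_train$tax == 666 & crime_train$ptratio == 20.2)  # 121 rows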

The ‘zn’, ‘age’ and ‘black’ variables also show evidence of clustering on a single data value, though not entirely coincident with the 121 rows mentioned above for the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables.

Further analysis reveals that more than 70% of the values for ‘zn’ are zeroes, indicating that the variable may be a good candidate for conversion to a binary ‘0/1’ indicator where all non-zero values are converted to 1’s. Similarly, analysis of the boxplot for the ‘age’ variable indicates that the vast majority of instances where ‘age’ coincides with a positive value of the response variable occur for values of (age > 80). As such, ‘age’ also appears to be a strong candidate for conversion to a binary ‘0/1’ indicator, with all values of (age > 80) converted to 1’s while values of (age <= 80) are converted to zeroes.

The reason for the clustering of values for the ‘black’ variable is much more difficult to interpret due to the nature of the variable itself: its values are calculated using a quadratic equation whose meaning is unclear. However, a component of that equation is a simple proportion, which can be easily interpreted. Therefore, the ‘black’ variable is also a strong candidate for transformation during Data Preparation.

Variables that show such high frequencies of specific values cannot easily be transformed without information loss. For example, analysis of the geographic variable ‘rad’ reveals that ‘rad’ in some ways may be an indicator of the economic status of a neighborhood, and economic status may be a significant indicator of crime rates. As such, transformation of such values during data preparation must only be attempted after careful consideration of the consequences.

The table below summarizes the clustering observed for these six variables.

Attribute Most Common Value Times Repeated
zn 0 339
ptratio 20.2 128
indus 18.10 121
tax 666 121
rad 24 121
black 396.90 112

The clustering of these data values can be strongly indicative of collinearity amongst the respective variables. We can identify prospective collinear variables through the use of a correlation matrix.

Correlation Matrix

A correlation matrix for the full data set is provided below. As can be seen in the matrix, several variables appear to be more than (+/-) 0.50 correlated with the response variable, including ‘indus’, ‘nox’, ‘age’, ‘dis’, ‘rad’, and ‘tax’. Variables exhibiting such relatively high correlation with the response variable should be of particular interest for purposes of constructing an effective binary logistic regression model.

zn indus chas nox rm age dis rad tax ptratio black lstat medv target
zn 1.00 -0.54 -0.04 -0.52 0.32 -0.57 0.66 -0.32 -0.32 -0.39 0.18 -0.43 0.38 -0.43
indus -0.54 1.00 0.06 0.76 -0.39 0.64 -0.70 0.60 0.73 0.39 -0.36 0.61 -0.50 0.60
chas -0.04 0.06 1.00 0.10 0.09 0.08 -0.10 -0.02 -0.05 -0.13 0.04 -0.05 0.16 0.08
nox -0.52 0.76 0.10 1.00 -0.30 0.74 -0.77 0.60 0.65 0.18 -0.38 0.60 -0.43 0.73
rm 0.32 -0.39 0.09 -0.30 1.00 -0.23 0.20 -0.21 -0.30 -0.36 0.13 -0.63 0.71 -0.15
age -0.57 0.64 0.08 0.74 -0.23 1.00 -0.75 0.46 0.51 0.26 -0.27 0.61 -0.38 0.63
dis 0.66 -0.70 -0.10 -0.77 0.20 -0.75 1.00 -0.49 -0.53 -0.23 0.29 -0.51 0.26 -0.62
rad -0.32 0.60 -0.02 0.60 -0.21 0.46 -0.49 1.00 0.91 0.47 -0.45 0.50 -0.40 0.63
tax -0.32 0.73 -0.05 0.65 -0.30 0.51 -0.53 0.91 1.00 0.47 -0.44 0.56 -0.49 0.61
ptratio -0.39 0.39 -0.13 0.18 -0.36 0.26 -0.23 0.47 0.47 1.00 -0.18 0.38 -0.52 0.25
black 0.18 -0.36 0.04 -0.38 0.13 -0.27 0.29 -0.45 -0.44 -0.18 1.00 -0.35 0.33 -0.35
lstat -0.43 0.61 -0.05 0.60 -0.63 0.61 -0.51 0.50 0.56 0.38 -0.35 1.00 -0.74 0.47
medv 0.38 -0.50 0.16 -0.43 0.71 -0.38 0.26 -0.40 -0.49 -0.52 0.33 -0.74 1.00 -0.27
target -0.43 0.60 0.08 0.73 -0.15 0.63 -0.62 0.63 0.61 0.25 -0.35 0.47 -0.27 1.00

The correlation matrix also shows that several of the potential predictor variables are correlated with each other. Such relatively high correlations are usually an indication of collinearity between the respective variables. The table below summarizes the largest such correlations:

Variable Correlated w/
zn .66 with dis
indus .76 with nox
nox -.77 with dis
rm .71 with medv
age -.75 with dis
dis -.70 with indus
rad .91 with tax
lstat -.74 with medv

Many of these correlations appear to make sense from an intuitive perspective. For example, the correlation between tax and rad suggests that neighborhoods with relatively better access to radial highways in the Boston area also have relatively high property tax rates. Similarly, the median value of homes (medv) appears to be highly correlated with the average number of rooms per dwelling (rm). Nitrogen oxide levels (nox) appear to be higher in neighborhoods with relatively high levels of industrial zoning (indus), etc.

Such strong correlations can be used during model building to select what are expected to be significant predictors of the response variable while helping to avoid the potential inclusion of additional variables that may be collinear with those already included. For example, the ‘tax’ variable might justifiably be dropped from consideration during model building due to its high correlation with the ‘rad’ variable.

The correlation matrix also shows that the ‘chas’ variable is not highly correlated with either the target variable or any of the other predictor variables. Additional analysis shows that only 33 out of the 466 rows in the data set have a non-zero value for ‘chas’. In other words, the ‘chas’ variable does not appear to be imparting much substantive information within the context of the other variables and could justifiably be dropped from consideration for inclusion in any regression models.
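A sketch of the correlation analysis, including the ‘chas’ count mentioned above:

# Correlation matrix, rounded to two decimals
round(cor(crime_train), 2)

# Magnitude of each variable's correlation with the response, strongest first
sort(abs(round(cor(crime_train)[, "target"], 2)), decreasing = TRUE)

# 'chas' carries little information: only 33 non-zero values
sum(crime_train$chas != 0)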

Conclusion of Data Exploration

During our data exploration efforts we identified one variable (chas) that can justifiably be ignored during model building, and we identified several instances of significant correlation amongst various predictor variables that could serve as justification for dropping one or more other variables from the model building process. Furthermore, we identified 121 rows of data that, due to recurring variable values, require further investigation to determine whether or not additional action might be required to make use of those rows in our analysis. Finally, we identified three variables (‘zn’, ‘age’ and ‘black’) that are good candidates for transformation during the Data Preparation process.

————————————————————————————————————————–

Part 2 - Data Preparation

Our Data Preparation efforts included an investigation of what appeared to be an abnormally large number of records within the data set having shared values, the conversion of two predictor variables to binary “0/1” factor variables, and the simplification of the ‘black’ variable via mathematical transformation. While we also considered the possibility of transforming one or more of the predictor variables that have skewed distributions, we chose not to apply any such transforms prior to model building since normal distributions aren’t necessarily required for logistic regression modeling. Transforms can be applied if the marginal model plots for a logistic regression model show evidence of deviance between the modeled data and the actual data, but they aren’t required prior to model building.

Step 1: Investigation of Recurring Variable Values

Our data exploration work showed evidence of 121 out of 466 neighborhoods within the training data set having recurring values for each of four variables: ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’. Initially, this seemingly anomalous repetition of values across such a large percentage of the data led us to question whether those 121 records were, in fact, legitimate, and might perhaps be candidates for some sort of transformation (e.g., replacing the recurring value via imputation or use of a mean or median) or even removal from the data set prior to building models so as to minimize the skew they obviously produce in the distributions of the four affected variables.

However, we concluded that such clustering could legitimately be explained by each of those 121 neighborhoods being located within the bounds of a single municipality. If that is, in fact, the case, then we would expect each neighborhood to have identical proportions of non-retail business acres per town (as indicated by ‘indus’), identical indices of accessibility to radial highways (as indicated by ‘rad’), identical full-value property-tax rates per $10,000 (as indicated by ‘tax’), and identical pupil-teacher ratios (as indicated by ‘ptratio’). Therefore, it was decided that no action was required to address that aspect of the data.

Step 2: Converting the ‘zn’ Variable to a Binary ‘0/1’ Factor

Our analysis of the ‘zn’ variable indicated that nearly 73% of its values were zeroes, with the remaining 27% being positive values ranging between 12.5 and 100 (inclusive). Since the variable is meant to indicate “the proportion of residential land zoned for large lots” within a neighborhood, we concluded it was reasonable to assume that nearly 73% of neighborhoods within the data set have no such zoning. Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicative of whether or not a neighborhood has such zoning is justified.

The ‘zn’ variable was transformed via simple thresholding where all non-zero values were converted to ‘1’, yielding the following interpretation:

  • zn = 0: the neighborhood has no residential land zoned for large lots
  • zn = 1: the neighborhood has residential land zoned for large lots

Step 3: Converting the ‘age’ Variable to a Binary ‘0/1’ Factor

The ‘age’ variable is meant to represent “the proportion of owner-occupied units built prior to 1940.” Our analysis of the variable’s boxplots led us to conclude that the vast majority of instances where the crime rate was deemed to be above the median occurred in neighborhoods where \(age > 80\). Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicating whether or not a neighborhood has more than 80% of its owner-occupied units built prior to 1940 is justified.

The ‘age’ variable was transformed via simple thresholding where all values greater than 80 were converted to ‘1’, and all values less than or equal to 80 were converted to ‘0’, yielding the following interpretation:

  • age = 0: the proportion of owner-occupied units built prior to 1940 is <= 80
  • age = 1: the proportion of owner-occupied units built prior to 1940 is > 80
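Both threshold conversions reduce to one line each, as in the sketch below; converting the results to factors is what produces the ‘zn1’ and ‘age1’ coefficient labels seen in Part 3:

# zn: 1 if the neighborhood has any residential land zoned for large lots
crime_train$zn <- factor(as.integer(crime_train$zn > 0))

# age: 1 if more than 80% of owner-occupied units were built prior to 1940
crime_train$age <- factor(as.integer(crime_train$age > 80))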

Step 4: Transforming the ‘black’ Variable into a Proportion

The ‘black’ variable is defined as the result of the mathematical formula \(1000(Bk - 0.63)^2\), where Bk is “the proportion of blacks by town”. The exact meaning of this metric is very difficult to interpret: we can see that Bk is a proportion with a clear meaning, but no explanation has been provided as to why the other components of the equation are necessary.

However, through simple algebra we can transform the variable so that we are left with only the proportion Bk, which is an understandable metric for our purposes. Rearranging the formula for ‘black’ yields the following equation for the proportion Bk:

\[Bk = \frac{\sqrt{\text{black}} + 19.92235}{31.62278}\]

Note that the quadratic actually has two roots, \(Bk = 0.63 \pm \sqrt{\text{black}/1000}\); the equation above takes the upper branch, which is only valid where the true proportion is at least 0.63. Applying this formula to the ‘black’ variable changes its meaning to: “the proportion of blacks by town”.
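Applied in R, the transformation is a single vectorized expression (a sketch, using the crime_train data frame assumed earlier):

# Recover the proportion Bk from 'black' (upper branch of the quadratic)
crime_train$black <- (sqrt(crime_train$black) + 19.92235) / 31.62278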

The transformed data set can be seen here:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-Data.csv

————————————————————————————————————————–

Part 3 - Build Models

Model 1: Use the bestglm Function to Build a Model

Approach:

Our first approach applies R’s bestglm function in an attempt to identify the “best subset” of predictor variables to use in a binary logistic regression model. An initial application of the function made use of all 13 of the predictor variables, applied backward selection via the ‘logit’ method, and used the Akaike Information Criterion (AIC) to produce the subset of variables yielding the lowest overall AIC value across all possible combinations of our predictors. The result of the initial run was a model comprising 11 predictor variables, with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 11 11
AIC 212.88 189.46
Accuracy 0.9206 0.9239
Classification Error Rate 0.0794 0.0761
Precision 0.9248 0.9357
Sensitivity 0.9127 0.9067
Specificity 0.9282 0.9404
F1 Score 0.9187 0.9211
AUC 0.9205 0.9235
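A call along the following lines drives the subset search (a sketch, not the exact Appendix code; bestglm expects the response in the last column of the supplied data frame):

library(bestglm)

# Reorder columns so that the response 'target' comes last
Xy <- crime_train[, c(setdiff(names(crime_train), "target"), "target")]

# Best subset of predictors by AIC; setting IC = "BIC" instead yields
# the 4-variable model discussed below
best_aic <- bestglm(Xy, family = binomial, IC = "AIC")
summary(best_aic$BestModel)   # BestModel is an ordinary glm fit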

A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.
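The outlier-removal iterations can be sketched as follows; the residual cutoff of 2 and the single trimming pass are assumptions, as the Appendix may iterate differently:

# Standardized deviance residuals for the fitted best-subset model
fit <- best_aic$BestModel
res <- rstandard(fit, type = "deviance")
plot(res, ylab = "Standardized deviance residuals")

# Drop the outlying rows and refit
keep <- abs(res) <= 2
best_aic_trim <- bestglm(Xy[keep, ], family = binomial, IC = "AIC")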

The table shows that the performance metrics did improve as a result of the removal of the outliers, with the AIC in particular experiencing a rather significant decrease.

Though this model appears to yield strong performance statistics, the number of predictor variables (11) was deemed too large to be of practical use. As such, an alternative application of the bestglm function was then attempted wherein the ‘IC’ parameter was set to “BIC” instead of “AIC” to force the function to make use of the Bayesian Information Criterion (BIC) for purposes of identifying the “best subset” of predictor variables.

The result of applying the BIC method was a model comprising only 4 predictor variables with the following performance metrics:

Metric Value
Number of Predictors 4
AIC 227.34
Accuracy 0.8777
Classification Error Rate 0.1223
Precision 0.8874
Sensitivity 0.8603
Specificity 0.8945
F1 Score 0.8736
AUC 0.8774
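The performance metrics reported throughout this report can be derived from a fitted model with a sketch like the one below; the 0.5 classification cutoff is an assumption:

library(pROC)

# In-sample predicted probabilities and 0/1 classifications
probs  <- predict(fit, type = "response")
pred   <- as.integer(probs > 0.5)
actual <- Xy$target

# Confusion-matrix counts
tp <- sum(pred == 1 & actual == 1); tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0); fn <- sum(pred == 0 & actual == 1)

accuracy    <- (tp + tn) / length(actual)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
auc(roc(actual, probs))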

The marginal model plots for this model show very strong agreement between the modeled data and the actual data, with no deviance evident in any of the four plots:

A plot of the standardized deviance residuals for this model revealed no high-leverage outliers, and as we can see from the table above, the performance metrics of the model are somewhat lower than those of the final 11-variable bestglm model discussed above. However, that tradeoff in performance comes with the benefit of a vast improvement in the simplicity of the model: only four variables are required to produce results that are only slightly less accurate than those of the 11-variable model.

Therefore, it would appear reasonable to conclude that this smaller model would be preferred over the larger, more complex model if a slight decrease in model performance is acceptable for purposes of predicting the likelihood of a particular neighborhood having a crime rate above the median.

The logit coefficients for this 4-variable model are as follows:

Coefficient Variable
-18.575 Intercept
+32.290 nox
+ 1.048 age1
+ 0.705 rad
- 0.009 tax

While these coefficients can tell us whether a neighborhood is more or less likely to have a crime rate that is above the median, according to the supplemental Week 3 materials we were provided, we should not attempt to interpret the magnitude of the effect of each predictor’s coefficient on the response variable.

Inferences:

We can infer the following from the coefficients listed above:

  • nox: The higher the nitrogen oxides concentration (nox) is within a neighborhood, the more likely it is that the neighborhood will have an above-median crime rate. In other words, the more air pollution a neighborhood has, the more likely it is to have an above-median crime rate.

  • age1: If more than 80% of a neighborhood’s owner-occupied units were built prior to 1940 (age), the neighborhood is more likely to have an above-median crime rate. So neighborhoods with largely older housing stocks would appear more likely to have an above-median crime rate.

  • rad: The higher a neighborhood’s index of accessibility to radial highways (rad) is, the more likely it is that the neighborhood will have an above-median crime rate. Assuming the ‘rad’ index increases if more radial highways are accessible to a neighborhood, this would seem to indicate that neighborhoods located in urban areas (where access to radial highways is typically a feature) are likely to be more prone to having higher crime rates.

  • tax: The higher a neighborhood’s full-value property-tax rate per $10,000 (tax) is, the less likely it is that the neighborhood will have an above-median crime rate. The interpretation of this coefficient is much less clear than those of the three variables discussed above. Property tax rates vary widely between municipalities and are highly susceptible to being changed for either political or budgetary reasons. Perhaps it is true that in the Boston metropolitan area an increase in a neighborhood’s property tax rate will decrease the likelihood of the neighborhood having an above-median crime rate. On the other hand, the value of this coefficient may simply be an artifact of the mathematics that comprise the logistic modeling process.

Model 2: Logit Model Using Backward Selection

Approach:

Our second model applied simple backward selection and a logit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, ‘ptratio’ and ‘lstat’) until all p-values were below 0.05. The final result was a model comprised of 7 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 7 7
AIC 221.87 221.87
Accuracy 0.9120 0.9142
Classification Error Rate 0.0880 0.0858
Precision 0.9159 0.9163
Sensitivity 0.9039 0.9083
Specificity 0.9198 0.9198
F1 Score 0.9099 0.9123
AUC 0.9120 0.9142
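A sketch of the backward-selection loop; using term-level likelihood-ratio tests from drop1() as the drop criterion is an assumption, as the actual iterations may have relied on Wald p-values from summary():

# Initial model: all predictors except 'chas' and 'tax' (per Part 1)
m <- glm(target ~ . - chas - tax, data = crime_train,
         family = binomial(link = "logit"))

# Repeatedly drop the least significant term until all p-values < 0.05
repeat {
  d <- drop1(m, test = "Chisq")[-1, ]   # term-level tests, '<none>' removed
  if (max(d[["Pr(>Chi)"]]) < 0.05) break
  worst <- rownames(d)[which.max(d[["Pr(>Chi)"]])]
  m <- update(m, as.formula(paste(". ~ . -", worst)))
}
summary(m)

# Model 3 repeats the procedure with family = binomial(link = "probit")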

A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding a model with the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.

The table shows that the performance metrics improved very little as a result of the removal of the outliers, with the AIC exhibiting no change at all.

The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:

Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.

The logit coefficients for this 7-variable model are as follows:

Coefficient Variable
-31.025 Intercept
- 2.632 zn1
- 0.104 indus
+45.130 nox
+ 1.394 age1
+ 0.747 dis
+ 0.509 rad
+ 0.088 medv

Inferences:

The ‘nox’, ‘age1’, and ‘rad’ predictors for Model 2 overlap those discussed in Model 1, and all have directionally similar coefficients to those found in Model 1. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 discussion above.

For the four remaining variables we can state the following:

  • zn1: Since we split the ‘zn’ variable into a binary indicator of whether or not a neighborhood has residential land zoned for large lots, the presence of large-lot zoning indicates that a neighborhood is less likely to have an above-median crime rate. This would seem to indicate that neighborhoods with denser housing are more likely to experience higher crime rates.

  • indus: The larger the proportion of non-retail business acres in the neighborhood, the less likely it is to have an above-median crime rate. This result appears to conform with that of the ‘zn1’ variable in that lower housing density seems to be indicative of a lower likelihood of having a relatively high crime rate.

  • dis: The larger the weighted mean of distances to five Boston employment centers for a neighborhood, the more likely it is to have an above-median crime rate. This result is somewhat non-intuitive in that many people would expect neighborhoods located farther away from an employment center to have a lesser likelihood of having higher crime rates since they would tend to be located outside of an urban center. However, housing densities may, in fact, be higher outside of business districts, thereby increasing the likelihood of a neighborhood having a higher crime rate. If this is the case, the result obtained for dis would appear to conform with those obtained for zn1 and indus.

  • medv: The larger the median value for homes is within a neighborhood, the more likely that neighborhood is to have above-median crime rates. This result is also somewhat non-intuitive since wealthier / more expensive neighborhoods are generally thought to have lower crime rates. This result would warrant further investigation at some point.

————————————————————————————————————————–

Model 3: Probit Model Using Backward Selection

Approach:

The third model applied simple backward selection and a probit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, and ‘lstat’) until all p-values were below 0.05. The final result was a model comprising 8 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 8 7
AIC 224.46 221.77
Accuracy 0.9142 0.9120
Classification Error Rate 0.0858 0.0880
Precision 0.9163 0.9159
Sensitivity 0.9083 0.9039
Specificity 0.9198 0.9198
F1 Score 0.9133 0.9099
AUC 0.9142 0.9120

Removing Outliers and Next Iteration:

A plot of the standardized deviance residuals for this model revealed the need to remove four high-leverage outliers. Removal of the outliers led to the ‘ptratio’ variable losing its statistical significance, which allowed us to reduce the number of predictors from 8 to 7 and produced the performance metrics indicated in the ‘Without Outliers’ column of the table above.

While the performance metrics show little change relative to the 8-variable “With Outliers” model, the 7-variable model is preferred since it is somewhat simpler than the 8-variable model.

The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:

Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.

The probit coefficients for this 7-variable model are as follows:

Coefficient Variable
-15.171 Intercept
- 1.237 zn1
- 0.050 indus
+22.241 nox
+ 0.730 age1
+ 0.334 dis
+ 0.274 rad
+ 0.036 medv

Inferences:

The (+/-) signs of the coefficients mirror those of the logit approach shown in Model 2 and as such will not be reiterated here. In fact, there appears to be little difference between the logit (Model 2) and probit (Model 3) models we created via backward selection. Both consist of the same set of predictors and the coefficients of those predictors indicate similar predictive qualities.

————————————————————————————————————————–

Model 4: Forward Selection + AIC

Approach:

The fourth model made use of forward selection with a logit link function in an attempt to produce a binary logistic regression model yielding the lowest AIC value possible when only statistically significant predictor variables were considered.

The forward selection process made use of the knowledge we gained in Part 1 via the construction of the correlation matrix. As we discussed in Part 1, several of the potential predictor variables show evidence of being strongly correlated with the response variable. The table below summarizes the correlations of all 13 potential predictor variables.

Variable Corr. w/ Target
zn -0.43
indus 0.60
chas 0.08
nox 0.73
rm -0.15
age 0.63
dis -0.62
rad 0.63
tax 0.61
ptratio 0.25
black -0.35
lstat 0.47
medv -0.27

Our knowledge of these correlations allowed us to carefully select variables to be added to the model based on the absolute value of their correlation with the response variable, with the variable having the highest absolute value added first followed by others in descending order of their absolute values. If the model did not show improved metrics after a variable was added (perhaps due to collinearity), that predictor was discarded and the next was tried.
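A sketch of this correlation-guided forward selection; using AIC improvement as the keep/discard criterion is an assumption standing in for the “improved metrics” described above:

# Rank candidate predictors by |correlation with target|
# (data.matrix() converts the 0/1 factors back to numeric codes)
cors <- cor(data.matrix(crime_train))[, "target"]
candidates <- names(sort(abs(cors[names(cors) != "target"]),
                         decreasing = TRUE))

# Add each candidate in turn, keeping it only if the model improves
current <- glm(target ~ 1, data = crime_train, family = binomial)
for (v in candidates) {
  trial <- update(current, as.formula(paste(". ~ . +", v)))
  if (AIC(trial) < AIC(current)) current <- trial
}
summary(current)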

Iterations:

Using this iterative process, the variables nox, rad, age, tax, ptratio, and medv were successfully added to the model, yielding the metrics shown below:

Metric Value
Number of Predictors 6
AIC 220.52
Accuracy 0.9120
Classification Error Rate 0.0880
Precision 0.9273
Sensitivity 0.8908
Specificity 0.9325
F1 Score 0.9087
AUC 0.9120

The marginal model plots for this model show relatively strong agreement between the modeled data and the actual data, with some minor deviance evident with both the ‘medv’ and ‘ptratio’ variables:

A plot of the standardized deviance residuals for this 6-variable model revealed no high-leverage outliers, and its logit coefficients are as follows:

Coefficient Variable
-27.916 Intercept
+34.940 nox
+ 0.779 rad
+ 1.190 age1
- 0.010 tax
+ 0.330 ptratio
+ 0.062 medv

Inferences:

The ‘nox’, ‘age1’, ‘rad’, ‘tax’, and ‘medv’ predictors for Model 4 overlap those discussed for Models 1 and 2, and all have directionally similar coefficients to those models. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 and Model 2 discussions above.

For the remaining variable, ‘ptratio’, we can state the following:

  • ptratio: The higher the pupil-teacher ratio is in a neighborhood, the more likely the neighborhood is to have an above-median crime rate. This result is difficult to interpret since most people would typically associate higher pupil-teacher ratios with neighborhoods of lower economic status. However, our correlation matrix shows only a 0.38 correlation between those two variables within the data set. On the other hand, the matrix shows a -0.52 correlation between ‘ptratio’ and median home values (medv), which would indicate that wealthier neighborhoods do, in fact, enjoy lower pupil-teacher ratios.

————————————————————————————————————————–

Part 4. Select Models

Step 1: Compare key statistics

The chart below summarizes the model statistics for all four of our models. The models are listed from left to right in accordance with the order in which they were described in Part 3.

Metric bestglm Backward / Logit Backward / Probit Forward - AIC
# Predictors 4 7 7 6
AIC 227.34 221.87 221.77 220.52
Accuracy 0.8777 0.9142 0.9120 0.9120
Class.Err.R. 0.1223 0.0858 0.0880 0.0880
Precision 0.8874 0.9163 0.9159 0.9273
Sensitivity 0.8603 0.9083 0.9039 0.8908
Specificity 0.8945 0.9198 0.9198 0.9325
F1 Score 0.8736 0.9123 0.9099 0.9087
AUC 0.8774 0.9142 0.9120 0.9120

As we can see in the table, Models 2, 3, and 4 are fairly similar in their performance metrics, with Model 2 (Backward / Logit) yielding the highest F1 Score and AUC and Model 4 (Forward / AIC) yielding the lowest AIC and highest precision and specificity. However, the differences in their performance metrics are marginal at best. Of those three, Model 4 is the simplest overall since it has the smallest number of predictor variables. Furthermore, Model 4’s marginal model plots exhibited less overall deviance than did those of either Model 2 or Model 3.

Model 1 appears to slightly underperform each of the other three models relative to every performance metric we’ve evaluated. However, with only four predictor variables, the model is easily the simplest of the four models we’ve constructed.

Step 2: Pick the top two

Of the four models, we chose to first eliminate Models 2 and 3 from consideration due to Model 4’s greater simplicity, lower AIC, better precision and specificity, and better marginal model plots. Model 1 was retained due to its relative simplicity (only 4 variables) even though its performance metrics were slightly below those of the other three models.

The two final models can be summarized as follows:

  • Model 1: Less complex, ideal marginal model plots, but slightly lower performance.

  • Model 4: More complex (2 additional variables), slightly less than ideal marginal model plots, but better overall performance.

Step 3: Pick the “best” model

Picking our preferred model required answering the following question:

  • Should the smaller/simpler model be preferred over the larger, more complex model?

An objective answer to this question was found through the use of a Likelihood Ratio Test. In a likelihood ratio test, two nested models having different numbers of predictors are compared to determine whether or not a smaller logistic regression model should be preferred over a larger logistic regression model, assuming both are “valid” in the sense that their marginal model plots show no significant evidence of deviance and their performance metrics are somewhat similar.

In our likelihood ratio test, the smaller model is considered to be the null hypothesis H0, and a p-value for the overall model fit statistic that is less than 0.05 would indicate that we should reject the null hypothesis in favor of the alternative hypothesis HA. In other words, a small p-value would provide evidence against the smaller model and in favor of the larger model.
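The test itself is a short call from the lmtest package; a sketch with illustrative object names follows (the larger model is passed first, matching the ordering in the output below):

library(lmtest)

# Model 4 (larger) and Model 1 (smaller) are nested logistic regressions
model4 <- glm(target ~ nox + rad + age + tax + ptratio + medv,
              data = crime_train, family = binomial)
model1 <- glm(target ~ nox + age + rad + tax,
              data = crime_train, family = binomial)

lrtest(model4, model1)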

We conducted a likelihood ratio test using R’s lrtest() function. The results are displayed below:

## Likelihood ratio test
## 
## Model 1: target ~ nox + rad + age + tax + ptratio + medv
## Model 2: target ~ nox + age + rad + tax
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   7 -103.26                        
## 2   5 -108.67 -2 10.821   0.004469 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show a p-value of less than 0.05, indicating that we should reject the null hypothesis H0 in favor of the alternative hypothesis HA. Specifically, there is sufficient statistical evidence for us to justifiably prefer Model 4 (the larger model) over Model 1. Therefore, we selected Model 4, the Forward - AIC model, as the basis for our predictions of the response variable in the Evaluation data set.

Step 4: Apply to the evaluation data

To ensure the model’s efficacy when applied to the Evaluation data set, we apply the same set of transformations used on the Training data set prior to building our individual models. The results of those transformations can be found here:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-EvalData-.csv
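A sketch of the transformation-plus-prediction step; file and object names are assumptions:

# Load the evaluation data (file name assumed for illustration)
crime_eval <- read.csv("crime-evaluation-data.csv")

# Apply the same transformations used on the training data
crime_eval$zn    <- factor(as.integer(crime_eval$zn > 0))
crime_eval$age   <- factor(as.integer(crime_eval$age > 80))
crime_eval$black <- (sqrt(crime_eval$black) + 19.92235) / 31.62278

# Probabilities and 0.5-cutoff classifications from the Forward - AIC model
crime_eval$prob   <- predict(model4, newdata = crime_eval, type = "response")
crime_eval$target <- as.integer(crime_eval$prob > 0.5)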

The Forward - AIC model is then applied to the evaluation data to yield a set of predicted values for the ‘target’ response variable. The results of the prediction process are shown in the table below.

zn indus chas nox rm age dis rad tax ptratio black lstat medv target
0 7.07 0 0.469 7.185 0 4.9671 2 242 17.8 1.2568 4.03 34.7 0
0 8.14 0 0.538 6.096 1 4.4619 4 307 21.0 1.2465 10.26 18.2 0
0 8.14 0 0.538 6.495 1 4.4547 4 307 21.0 1.2528 12.80 18.4 0
0 8.14 0 0.538 5.950 1 3.9900 4 307 21.0 1.1123 27.71 13.2 1
0 5.96 0 0.499 5.850 0 3.9342 5 279 19.2 1.2600 8.77 21.0 0
1 5.13 0 0.453 5.741 0 7.2254 8 284 19.7 1.2586 13.15 18.7 0
1 5.13 0 0.453 5.966 1 6.8185 8 284 19.7 1.2449 14.44 16.0 0
0 4.49 0 0.449 6.630 0 4.4377 3 247 18.5 1.2563 6.53 26.6 0
0 4.49 0 0.449 6.121 0 3.7476 3 247 18.5 1.2586 8.44 22.2 0
0 2.89 0 0.445 6.163 0 3.4952 2 276 18.0 1.2560 11.34 21.4 0
0 25.65 0 0.581 5.856 1 1.9444 2 188 19.1 1.2385 25.41 17.3 0
0 25.65 0 0.581 5.613 1 1.7572 2 188 19.1 1.2294 27.26 15.7 0
0 21.89 0 0.624 5.637 1 1.9799 4 437 21.2 1.2600 18.34 14.3 1
0 19.58 0 0.605 6.101 1 2.2834 5 403 14.7 1.1201 9.81 25.0 1
0 19.58 0 0.605 5.880 1 2.3887 5 403 14.7 1.2200 12.03 19.1 1
0 10.59 1 0.489 5.960 1 3.8771 4 277 18.6 1.2571 17.27 21.7 0
0 6.20 0 0.504 6.552 0 3.3751 8 307 17.4 1.2467 3.76 31.5 1
0 6.20 0 0.507 8.247 0 3.6519 8 307 17.4 1.2456 3.95 48.3 1
1 5.86 0 0.431 6.957 0 8.9067 7 330 19.1 1.2514 3.53 29.6 0
1 2.97 0 0.400 7.088 0 7.3073 1 285 15.3 1.2583 7.85 32.2 0
1 1.76 0 0.385 6.230 0 9.0892 1 241 18.2 1.2145 12.93 20.1 0
1 2.18 0 0.472 6.616 0 3.3700 7 222 18.4 1.2572 8.93 28.4 0
0 9.90 0 0.544 6.122 0 2.6403 4 304 18.4 1.2600 5.98 22.1 0
0 7.38 0 0.493 6.415 0 4.7211 5 287 19.6 1.2600 6.12 25.0 0
0 7.38 0 0.493 6.312 0 5.4159 5 287 19.6 1.2600 6.15 23.0 0
0 5.19 0 0.515 5.895 0 5.6150 5 224 20.2 1.2583 10.56 18.5 0
1 2.01 0 0.435 6.635 0 8.3440 4 280 17.0 1.2553 5.99 24.5 0
0 18.10 0 0.718 3.561 1 1.6132 24 666 20.2 1.2256 7.12 27.5 1
0 18.10 1 0.631 7.016 1 1.2024 24 666 20.2 1.2561 2.96 50.0 1
0 18.10 0 0.584 6.348 1 2.0527 24 666 20.2 0.9189 17.64 14.5 1
0 18.10 0 0.740 5.935 1 1.8206 24 666 20.2 0.8926 34.02 8.4 1
0 18.10 0 0.740 5.627 1 1.8172 24 666 20.2 1.2600 22.88 12.8 1
0 18.10 0 0.740 5.818 1 1.8662 24 666 20.2 1.2557 22.11 10.5 1
0 18.10 0 0.740 6.219 1 2.0048 24 666 20.2 1.2590 16.59 18.4 1
0 18.10 0 0.740 5.854 1 1.8956 24 666 20.2 1.1204 23.79 10.8 1
0 18.10 0 0.713 6.525 1 2.4358 24 666 20.2 0.8557 18.13 14.1 1
0 18.10 0 0.713 6.376 1 2.5671 24 666 20.2 1.2556 14.65 17.7 1
0 18.10 0 0.655 6.209 0 2.9634 24 666 20.2 1.2600 13.22 21.4 1
0 9.69 0 0.585 5.794 0 2.8927 6 391 19.2 1.2600 14.10 18.3 1
0 11.93 0 0.573 6.976 1 2.1675 1 273 21.0 1.2600 5.64 23.9 0

The full set of evaluation data including the predicted values can also be found at the following web link:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-PRED-EVAL-ALL_M_DATA.csv

Summary statistics for the ‘target’ variable for both the training data set and the evaluation data set are shown below (NOTE: “n” represents the number of records within a data set).

Data Set n Mean sd Med. Min Max Range Skew Kurtosis SE
Evaluation 40 0.45 0.50 0.00 0.00 1.00 1 0.19 -2.01 0.08
Training 466 0.49 0.50 0.00 0.00 1.00 1 0.03 -2.00 0.02

Conclusion:

After developing 4 distinct models using a best subset algorithm, backward selection with both logit and probit link functions, and forward selection, we’ve selected what we believe to be our best model based on the results of a likelihood ratio test. Though the training data set presented a variety of challenges, including significant clustering of data values within roughly 25% of the cases and the heavily skewed distributions of some of the predictor variables, the binary logistic regression model we’ve selected demonstrated greater than 90% accuracy, precision, and specificity while generating an F1 Score greater than .90 and an AUC of more than .91. Furthermore, the model appears to be well fit to the training data, as evidenced by the strong results seen in the model’s marginal model plots.

Therefore, we believe we’ve selected a model that could be effective for predicting the likelihood that a neighborhood in the Boston metropolitan area will have an above-median crime rate. However, we can’t be certain the model will be effective for judging the crime rates of other metropolitan areas due to wide disparities in the ways in which factors such as property tax rates and zoning are applied by other municipalities. Further research would therefore be required to determine whether our selected model could be applied to other geographic areas.

————————————————————————————————————————–

Appendix

The Appendix to this document containing all of the R code used for this assignment (as well as other relevant output) can be found in a separate PDF file accessible via this web link:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-Appendix.pdf