————————————————————————————————————————–
In this assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided).
The data set we are exploring focuses on the binomial success or failure of whether the crime rate is above (1) or below (0) the median for a particular neighborhood. There are 466 rows of data, each representing a neighborhood located in the Boston metropolitan area. For each neighborhood we are provided with 13 attributes that could potentially be used as predictor variables and one response variable (“target”) that indicates whether or not the neighborhood does, in fact, have an above-median crime rate. Our assignment is to try to predict whether or not a neighborhood is likely to have an above-median crime rate based on the 13 attributes provided in the data set. A summary table for the data set is provided below.
| vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 466 | 12 | 23 | 0 | 0 | 100 | 100 | 2 | 3.81 | 1.08 |
| indus | 466 | 11 | 7 | 10 | 0 | 28 | 28 | 0 | -1.24 | 0.32 |
| chas | 466 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 9.15 | 0.01 |
| nox | 466 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | -0.04 | 0.01 |
| rm | 466 | 6 | 1 | 6 | 4 | 9 | 5 | 0 | 1.54 | 0.03 |
| age | 466 | 68 | 28 | 77 | 3 | 100 | 97 | -1 | -1.01 | 1.31 |
| dis | 466 | 4 | 2 | 3 | 1 | 12 | 11 | 1 | 0.47 | 0.10 |
| rad | 466 | 10 | 9 | 5 | 1 | 24 | 23 | 1 | -0.86 | 0.40 |
| tax | 466 | 410 | 168 | 335 | 187 | 711 | 524 | 1 | -1.15 | 7.78 |
| ptratio | 466 | 18 | 2 | 19 | 13 | 22 | 9 | -1 | -0.40 | 0.10 |
| black | 466 | 357 | 91 | 391 | 0 | 397 | 397 | -3 | 7.34 | 4.23 |
| lstat | 466 | 13 | 7 | 11 | 2 | 38 | 36 | 1 | 0.50 | 0.33 |
| medv | 466 | 23 | 9 | 21 | 5 | 50 | 45 | 1 | 1.37 | 0.43 |
| target | 466 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | -2.00 | 0.02 |
Our analysis of the data yielded no evidence of missing values. Some variables, such as ‘zn’, ‘age’, ‘rad’, and ‘tax’, show evidence of significant skew, as indicated by the large differences between their mean and median values, while ‘black’ shows evidence of potentially being problematic given its large kurtosis and standard error values.
Boxplots of each independent variable relative to the binary response are one way in which we can further investigate potential skew issues while also gaining insight into the predictive aspects of the data:
The boxplots shown above indicate that our skew suspicions regarding the ‘zn’, ‘age’, ‘rad’, and ‘tax’ variables appear to have been justified. Furthermore, we are able to deduce some predictive aspects of the variables in relation to the response variable. For example, it appears as though larger values for ‘zn’, ‘rm’, ‘dis’, and ‘medv’ will each tend to decrease the likelihood that a neighborhood will have an above-median crime rate, while larger values for ‘indus’, ‘nox’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘lstat’ will each tend to increase the likelihood of an above-median crime rate for a given neighborhood.
Histograms allow us to more thoroughly examine whether the distribution of a variable is skewed as well as whether there may be high incidence of specific variable values throughout the data set:
The histograms again demonstrate the skew issues we’ve previously identified while also indicating that the distributions of many of the skewed variables may be dominated by high frequencies of specific data values. For example, ‘zn’, ‘indus’, ‘chas’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘black’ all appear to not only be skewed, but also have a large number of data values “clustered” around a specific number.
Aggregating variable values by their frequencies allows us to assess whether any distinctive clustering effects occur within the data. For example, the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables each have an unusually large number of identical values. Analysis reveals that 121 rows of the data set contain recurring values for each of these variables.
The ‘zn’, ‘age’ and ‘black’ variables also show evidence of clustering on a single data value, though not entirely coincident with the 121 rows mentioned above for the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables.
Further analysis reveals that more than 70% of the values for ‘zn’ are zeroes, indicating that the variable may be a good candidate for conversion to a binary ‘0/1’ indicator where all non-zero values are converted to 1’s. Similarly, analysis of the boxplot for the ‘age’ variable indicates that the vast majority of instances where ‘age’ corresponds to a positive value of the response variable occur for values of (age > 80). As such, ‘age’ also appears to be a strong candidate for conversion to a binary ‘0/1’ indicator, with all values of (age > 80) converted to 1’s and all values of (age <= 80) converted to zeroes.
The reason for the clustering of values for the ‘black’ variable is much more difficult to interpret due to the nature of the variable itself: its values are calculated using a quadratic equation whose meaning is unclear. However, a component of that equation is a simple proportion, which can be easily interpreted. Therefore, the ‘black’ variable is also a strong candidate for transformation during Data Preparation.
Variables that show such high frequencies of specific values cannot easily be transformed without information loss. For example, the geographic variable ‘rad’ may in some ways be an indicator of the economic status of a neighborhood, and economic status may be a significant indicator of crime rates. As such, transformation of such values during data preparation must only be attempted after careful consideration of the consequences.
The table below summarizes the clustering observed for these six variables.
| Attribute | Most Common Value | Times Repeated |
|---|---|---|
| zn | 0 | 339 |
| ptratio | 20.2 | 128 |
| indus | 18.10 | 121 |
| tax | 666 | 121 |
| rad | 24 | 121 |
| black | 396.90 | 112 |
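Counts like these can be reproduced directly from the data. Below is a minimal sketch in R, assuming the training data has been loaded into a data frame named ‘train’ (our name, not one given in the assignment):

```r
# Most frequent value and its count for each of the six clustered variables
vars <- c("zn", "ptratio", "indus", "tax", "rad", "black")
sapply(train[vars], function(x) {
  tab <- sort(table(x), decreasing = TRUE)   # frequencies, most common first
  c(value = names(tab)[1], count = as.integer(tab[1]))
})
```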
The clustering of these data values can be strongly indicative of collinearity amongst the respective variables. We can identify prospective collinear variables through the use of a correlation matrix.
A correlation matrix for the full data set is provided below. As can be seen in the matrix, several variables are correlated with the response variable at magnitudes above 0.50, including ‘indus’, ‘nox’, ‘age’, ‘dis’, ‘rad’, and ‘tax’. Variables exhibiting such relatively high correlation with the response variable should be of particular interest for purposes of constructing an effective binary logistic regression model.
| | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zn | 1.00 | -0.54 | -0.04 | -0.52 | 0.32 | -0.57 | 0.66 | -0.32 | -0.32 | -0.39 | 0.18 | -0.43 | 0.38 | -0.43 |
| indus | -0.54 | 1.00 | 0.06 | 0.76 | -0.39 | 0.64 | -0.70 | 0.60 | 0.73 | 0.39 | -0.36 | 0.61 | -0.50 | 0.60 |
| chas | -0.04 | 0.06 | 1.00 | 0.10 | 0.09 | 0.08 | -0.10 | -0.02 | -0.05 | -0.13 | 0.04 | -0.05 | 0.16 | 0.08 |
| nox | -0.52 | 0.76 | 0.10 | 1.00 | -0.30 | 0.74 | -0.77 | 0.60 | 0.65 | 0.18 | -0.38 | 0.60 | -0.43 | 0.73 |
| rm | 0.32 | -0.39 | 0.09 | -0.30 | 1.00 | -0.23 | 0.20 | -0.21 | -0.30 | -0.36 | 0.13 | -0.63 | 0.71 | -0.15 |
| age | -0.57 | 0.64 | 0.08 | 0.74 | -0.23 | 1.00 | -0.75 | 0.46 | 0.51 | 0.26 | -0.27 | 0.61 | -0.38 | 0.63 |
| dis | 0.66 | -0.70 | -0.10 | -0.77 | 0.20 | -0.75 | 1.00 | -0.49 | -0.53 | -0.23 | 0.29 | -0.51 | 0.26 | -0.62 |
| rad | -0.32 | 0.60 | -0.02 | 0.60 | -0.21 | 0.46 | -0.49 | 1.00 | 0.91 | 0.47 | -0.45 | 0.50 | -0.40 | 0.63 |
| tax | -0.32 | 0.73 | -0.05 | 0.65 | -0.30 | 0.51 | -0.53 | 0.91 | 1.00 | 0.47 | -0.44 | 0.56 | -0.49 | 0.61 |
| ptratio | -0.39 | 0.39 | -0.13 | 0.18 | -0.36 | 0.26 | -0.23 | 0.47 | 0.47 | 1.00 | -0.18 | 0.38 | -0.52 | 0.25 |
| black | 0.18 | -0.36 | 0.04 | -0.38 | 0.13 | -0.27 | 0.29 | -0.45 | -0.44 | -0.18 | 1.00 | -0.35 | 0.33 | -0.35 |
| lstat | -0.43 | 0.61 | -0.05 | 0.60 | -0.63 | 0.61 | -0.51 | 0.50 | 0.56 | 0.38 | -0.35 | 1.00 | -0.74 | 0.47 |
| medv | 0.38 | -0.50 | 0.16 | -0.43 | 0.71 | -0.38 | 0.26 | -0.40 | -0.49 | -0.52 | 0.33 | -0.74 | 1.00 | -0.27 |
| target | -0.43 | 0.60 | 0.08 | 0.73 | -0.15 | 0.63 | -0.62 | 0.63 | 0.61 | 0.25 | -0.35 | 0.47 | -0.27 | 1.00 |
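A matrix of this kind is a one-liner in R (again assuming the data frame ‘train’, with all columns numeric at this pre-transformation stage):

```r
# Pairwise Pearson correlations, rounded to two decimal places
cor_matrix <- round(cor(train), 2)
cor_matrix[, "target"]   # each variable's correlation with the response
```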
The correlation matrix also shows that several of the potential predictor variables are correlated with each other. Such relatively high correlations are usually an indication of collinearity between the respective variables. The table below summarizes the largest such correlations:
| Variable | Correlated w/ |
|---|---|
| zn | .66 with dis |
| indus | .76 with nox |
| nox | -.77 with dis |
| rm | .71 with medv |
| age | -.75 with dis |
| dis | -.70 with indus |
| rad | .91 with tax |
| lstat | -.74 with medv |
Many of these correlations appear to make sense from an intuitive perspective. For example, the correlation between tax and rad suggests that neighborhoods with relatively better access to radial highways in the Boston area also have relatively high property tax rates. Similarly, the median value of homes (medv) appears to be highly correlated with the average number of rooms per dwelling (rm). Nitrogen oxide levels (nox) appear to be higher in neighborhoods with relatively high levels of industrial zoning (indus), etc.
Such strong correlations can be used during model building to select what are expected to be significant predictors of the response variable while helping to avoid the inclusion of additional variables that may be collinear with those already included. For example, the ‘tax’ variable might justifiably be dropped from consideration during model building due to its high correlation with the ‘rad’ variable.
The correlation matrix also shows that the ‘chas’ variable is not highly correlated with either the target variable or any of the other predictor variables. Additional analysis shows that only 33 out of the 466 rows in the data set have a non-zero value for ‘chas’. In other words, the ‘chas’ variable does not appear to be imparting much substantive information within the context of the other variables and could justifiably be dropped from consideration for inclusion in any regression models.
During our data exploration efforts we identified one variable (chas) that can justifiably be ignored during model building, and we identified several instances of significant correlation amongst various predictor variables that could serve as justification for dropping one or more other variables from the model building process. Furthermore, we identified 121 rows of data that, due to recurring variable values, require further investigation to determine whether or not additional action might be required to make use of those rows in our analysis. Finally, we identified three variables (‘zn’, ‘age’ and ‘black’) that are good candidates for transformation during the Data Preparation process.
————————————————————————————————————————–
Our Data Preparation efforts included an investigation of what appeared to be an abnormally large number of records within the data set sharing identical values, the conversion of two predictor variables to binary “0/1” factor variables, and the simplification of the ‘black’ variable’s interpretation via mathematical transformation. While we also considered transforming one or more of the predictor variables that have skewed distributions, we chose not to apply any such transforms prior to model building since normal distributions aren’t necessarily required for logistic regression modeling. Transforms can be applied if the marginal model plots for a logistic regression model show evidence of deviance between the modeled data and the actual data, but they aren’t required prior to model building.
Our data exploration work showed evidence of 121 out of 466 neighborhoods within the training data set having recurring values for each of four variables: ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’. Initially, this seemingly anomalous repetition of values across such a large percentage of the data led us to question whether those 121 records were, in fact, legitimate, and whether they might be candidates for some sort of transformation (e.g., replacing the recurring value via imputation or use of a mean or median) or even removal from the data set prior to building models so as to minimize the skew they obviously produce in the distributions of the four affected variables.
However, we concluded that such clustering could legitimately be explained by each of those 121 neighborhoods being located within the bounds of a single municipality. If that is, in fact, the case then we would expect each neighborhood to have identical proportions of non-retail business acres per suburb (as indicated by ‘indus’), identical indices of accessibility to radial highways (as indicated by ‘rad’), identical full-value property-tax rate per $10,000 (as indicated by ‘tax’), and identical pupil-teacher ratios (as indicated by ‘ptratio’). Therefore, it was decided that no action was required to address that aspect of the data.
Our analysis of the ‘zn’ variable indicated that nearly 73% of its values were zeroes, with the remaining 27% being positive values ranging between 12.5 and 100 (inclusive). Since the variable is meant to indicate “the proportion of residential land zoned for large lots” within a neighborhood, we concluded it was reasonable to assume that nearly 73% of neighborhoods within the data set have no such zoning. Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicating whether or not a neighborhood has such zoning is justified.
The ‘zn’ variable was transformed via simple thresholding where all non-zero values were converted to ‘1’, yielding the following interpretation: a value of ‘1’ indicates that a neighborhood has at least some residential land zoned for large lots, while ‘0’ indicates that it has none.
The ‘age’ variable is meant to represent “the proportion of owner-occupied units built prior to 1940.” Our analysis of the variable’s boxplots led us to conclude that the vast majority of instances where the crime rate was deemed to be above the median occurred in neighborhoods where \(age > 80\). Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicating whether or not a neighborhood has more than 80% of its owner-occupied units built prior to 1940 is justified.
The ‘age’ variable was transformed via simple thresholding where all values greater than 80 were converted to ‘1’ and all values less than or equal to 80 were converted to ‘0’, yielding the following interpretation: a value of ‘1’ indicates that more than 80% of a neighborhood’s owner-occupied units were built prior to 1940, while ‘0’ indicates that 80% or fewer were.
The ‘black’ variable is defined as the result of the mathematical formula \(1000(Bk - 0.63)^2\) where Bk is “the proportion of blacks by town”. The exact meaning of this metric is very difficult to interpret: we can see that Bk is a proportion that has a clear meaning, but no explanation has been provided as to why the other components of the equation are necessary.
However, through simple algebra we can transform the variable so that we are left with only the proportion Bk, which is an understandable metric for our purposes. Rearranging the formula for ‘black’ (and taking the positive square root) yields the following equation for the proportion Bk:
\[Bk = \frac{\sqrt{black} + 19.92235}{31.62278}\]
Applying this formula to the ‘black’ variable changes its meaning to: “the proportion of blacks by town”.
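A minimal sketch of the three transformations in R follows (the data frame name ‘train’ and the use of factor() are our assumptions; the cleaned data itself is linked below):

```r
# 'zn': 0/1 factor flagging any large-lot residential zoning
# (appears as 'zn1' in model output)
train$zn <- factor(as.integer(train$zn > 0))

# 'age': 0/1 factor flagging >80% of owner-occupied units built prior to 1940 ('age1')
train$age <- factor(as.integer(train$age > 80))

# 'black': invert 1000(Bk - 0.63)^2, taking the positive root, to recover Bk
train$black <- (sqrt(train$black) + 19.92235) / 31.62278
```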
The transformed data set can be seen here:
https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-Data.csv
————————————————————————————————————————–
Our first approach applies R’s bestglm function in an attempt to identify the “best subset” of predictor variables to use in a binary logistic regression model. An initial application of the function made use of all 13 of the predictor variables, fitting binomial (‘logit’) models and using the Akaike Information Criterion (AIC) to select, from all possible combinations of our predictors, the subset that yields the lowest overall AIC. The result of the initial run was a model comprised of 11 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:
| Metric | With Outliers | Without Outliers |
|---|---|---|
| Number of Predictors | 11 | 11 |
| AIC | 212.88 | 189.46 |
| Accuracy | 0.9206 | 0.9239 |
| Classification Error Rate | 0.0794 | 0.0761 |
| Precision | 0.9248 | 0.9357 |
| Sensitivity | 0.9127 | 0.9067 |
| Specificity | 0.9282 | 0.9404 |
| F1 Score | 0.9187 | 0.9211 |
| AUC | 0.9205 | 0.9235 |
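Metrics of this kind can be computed from a confusion matrix. The sketch below assumes the fitted glm object is named ‘fit’ and that a 0.5 probability cutoff is used (both our assumptions):

```r
# Classify fitted probabilities at 0.5, then tabulate against the actual response
pred   <- as.integer(fitted(fit) > 0.5)
actual <- as.integer(as.character(train$target))  # works for numeric or factor target
tp <- sum(pred == 1 & actual == 1); tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0); fn <- sum(pred == 0 & actual == 1)
accuracy    <- (tp + tn) / length(actual)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)   # a.k.a. recall
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
```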
A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.
The table shows that the performance metrics did improve as a result of the removal of the outliers, with the AIC in particular experiencing a rather significant decrease.
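The outlier screen can be sketched with standardized deviance residuals; a cutoff of ±2 is a common rule of thumb (the exact criterion used in our iterations may differ):

```r
# Standardized deviance residuals; observations beyond +/-2 flagged as potential outliers
res <- rstandard(fit, type = "deviance")
plot(res, ylab = "Standardized deviance residuals")
abline(h = c(-2, 2), lty = 2)
train_trimmed <- train[abs(res) <= 2, ]   # refit the model on the trimmed data
```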
Though this model appears to yield strong performance statistics, the number of predictor variables (11) was deemed too large to be of practical use. As such, an alternative application of the bestglm function was attempted wherein the ‘IC’ parameter was set to “BIC” instead of “AIC”, forcing the function to use the Bayesian Information Criterion (BIC) to identify the “best subset” of predictor variables.
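Both runs reduce to a single call each; a minimal sketch (the object names are ours, and bestglm expects the response as the last column of the data frame passed in):

```r
library(bestglm)

# bestglm expects a data frame with the predictors first and the response last
Xy <- data.frame(train[, setdiff(names(train), "target")], y = train$target)

best_aic <- bestglm(Xy, family = binomial, IC = "AIC")  # the 11-predictor model
best_bic <- bestglm(Xy, family = binomial, IC = "BIC")  # the 4-predictor model
summary(best_bic$BestModel)
```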
The result of applying the BIC method was a model comprised of only 4 predictor variables with the following performance metrics:
| Metric | Value |
|---|---|
| Number of Predictors | 4 |
| AIC | 227.34 |
| Accuracy | 0.8777 |
| Classification Error Rate | 0.1223 |
| Precision | 0.8874 |
| Sensitivity | 0.8603 |
| Specificity | 0.8945 |
| F1 Score | 0.8736 |
| AUC | 0.8774 |
The marginal model plots for this model show very strong agreement between the modeled data and the actual data, with no deviance evident in any of the four plots:
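Plots of this kind can be generated with the car package; a sketch, assuming the 4-variable glm object is stored as ‘fit_bic’ (our name):

```r
library(car)

# Marginal model plots: the data smooth and the model smooth should coincide per panel
mmps(fit_bic)   # mmps() is car's shorthand for marginalModelPlots()
```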
A plot of the standardized deviance residuals for this model revealed no high leverage outliers, and as we can see from the table above, the performance metrics of the model are somewhat lower than those of the final 11-variable bestglm model discussed above. However, that tradeoff in performance comes with the benefit of a vast improvement in the simplicity of the model: only four variables are required to produce results that are only slightly less accurate than the 11-variable model.
Therefore, it would appear reasonable to conclude that this smaller model would be preferred over the larger, more complex model if a slight decrease in model performance is acceptable for purposes of predicting the likelihood of a particular neighborhood having a crime rate above the median.
The logit coefficients for this 4-variable model are as follows:
| Coefficient | Variable |
|---|---|
| -18.575 | Intercept |
| +32.290 | nox |
| + 1.048 | age1 |
| + 0.705 | rad |
| - 0.009 | tax |
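Written out, this model estimates the log-odds of an above-median crime rate as:

\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -18.575 + 32.290\,nox + 1.048\,age1 + 0.705\,rad - 0.009\,tax\]

where \(\hat{p}\) denotes the predicted probability that a neighborhood’s ‘target’ value is 1.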
While these coefficients can tell us whether a neighborhood is more or less likely to have a crime rate that is above the median, according to the supplemental Week 3 materials we were provided we should not attempt to interpret the magnitude of the effect of each predictor’s coefficient on the response variable.
We can infer the following from the coefficients listed above:
nox: The higher the nitrogen oxides concentration (nox) is within a neighborhood, the more likely it is that the neighborhood will have an above-median crime rate. In other words, the more air pollution a neighborhood has, the more likely it is to have an above-median crime rate.
age1: If more than 80% of a neighborhood’s owner-occupied units were built prior to 1940 (age), the neighborhood is more likely to have an above-median crime rate. So neighborhoods with largely older housing stocks would appear more likely to have an above-median crime rate.
rad: The higher a neighborhood’s index of accessibility to radial highways (rad) is, the more likely it is that the neighborhood will have an above-median crime rate. Assuming the ‘rad’ index increases if more radial highways are accessible to a neighborhood, this would seem to indicate that neighborhoods located in urban areas (where access to radial highways is typically a feature) are likely to be more prone to having higher crime rates.
tax: The higher a neighborhood’s full-value property-tax rate per $10,000 (tax) is, the less likely it is that the neighborhood will have an above-median crime rate. The interpretation of this coefficient is much less clear than those of the three variables discussed above. Property tax rates vary widely between municipalities and are highly susceptible to being changed for either political or budgetary reasons. Perhaps it is true that in the Boston metropolitan area an increase in a neighborhood’s property tax rate will decrease the likelihood of a neighborhood having an above-median crime rate. On the other hand, the value of this coefficient may simply be an artifact of the mathematics underlying the logistic modeling process.
Our second model applied simple backward selection and a logit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, ‘ptratio’ and ‘lstat’) until all p-values were below 0.05. The final result was a model comprised of 7 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:
| Metric | With Outliers | Without Outliers |
|---|---|---|
| Number of Predictors | 7 | 7 |
| AIC | 221.87 | 221.87 |
| Accuracy | 0.9120 | 0.9142 |
| Classification Error Rate | 0.0880 | 0.0858 |
| Precision | 0.9159 | 0.9163 |
| Sensitivity | 0.9039 | 0.9083 |
| Specificity | 0.9198 | 0.9198 |
| F1 Score | 0.9099 | 0.9123 |
| AUC | 0.9120 | 0.9142 |
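The iterative fitting behind these results can be sketched as follows (a sketch, not our verbatim code; ‘train’ again names the transformed training data):

```r
# Backward selection by hand: start without 'chas' and 'tax', then repeatedly drop
# the least significant predictor and refit until all p-values fall below 0.05
fit <- glm(target ~ . - chas - tax, data = train, family = binomial(link = "logit"))
summary(fit)                      # inspect p-values
fit <- update(fit, . ~ . - rm)    # 'rm' was least significant; 'black', 'ptratio',
                                  # and 'lstat' were removed in later iterations
```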
A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding a model with the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.
The table shows that the performance metrics improved very little as a result of the removal of the outliers, with the AIC exhibiting no change at all.
The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:
Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.
The logit coefficients for this 7-variable model are as follows:
| Coefficient | Variable |
|---|---|
| -31.025 | Intercept |
| - 2.632 | zn1 |
| - 0.104 | indus |
| +45.130 | nox |
| + 1.394 | age1 |
| + 0.747 | dis |
| + 0.509 | rad |
| + 0.088 | medv |
The ‘nox’, ‘age1’, and ‘rad’ predictors for Model 2 overlap those discussed in Model 1, and all have directionally similar coefficients to those found in Model 1. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 discussion above.
For the four remaining variables we can state the following:
zn1: Since we converted the ‘zn’ variable into a binary indicator of whether or not a neighborhood has residential land zoned for large lots, the negative coefficient tells us that the presence of large lot zoning makes a neighborhood less likely to have an above-median crime rate. This would seem to indicate that neighborhoods with denser housing are more likely to experience higher crime rates.
indus: The larger the proportion of non-retail business acres in the neighborhood, the less likely it is to have an above-median crime rate. This result appears to conform with that of the ‘zn1’ variable in that lower housing density seems to be indicative of a lower likelihood of having a relatively high crime rate.
dis: The larger the weighted mean of distances to five Boston employment centers for a neighborhood, the more likely it is to have an above-median crime rate. This result is somewhat non-intuitive in that many people would expect neighborhoods located farther away from an employment center to have a lesser likelihood of having higher crime rates since they would tend to be located outside of an urban center. However, housing densities may, in fact, be higher outside of business districts, thereby increasing the likelihood of a neighborhood having a higher crime rate. If this is the case, the result obtained for dis would appear to conform with those obtained for zn1 and indus.
medv: The larger the median value for homes is within a neighborhood, the more likely that neighborhood is to have above-median crime rates. This result is also somewhat non-intuitive since wealthier / more expensive neighborhoods are generally thought to have lower crime rates. This result would warrant further investigation at some point.
————————————————————————————————————————–
The third model applied simple backward selection and a probit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, and ‘lstat’) until all p-values were below 0.05. The final result was a model comprised of 8 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:
| Metric | With Outliers | Without Outliers |
|---|---|---|
| Number of Predictors | 8 | 7 |
| AIC | 224.46 | 221.77 |
| Accuracy | 0.9142 | 0.9120 |
| Classification Error Rate | 0.0858 | 0.0880 |
| Precision | 0.9163 | 0.9159 |
| Sensitivity | 0.9083 | 0.9039 |
| Specificity | 0.9198 | 0.9198 |
| F1 Score | 0.9133 | 0.9099 |
| AUC | 0.9142 | 0.9120 |
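The probit fit differs from the logit fit only in the link argument passed to glm(); a sketch with the final seven predictors (because ‘zn’ and ‘age’ were converted to factors, their coefficients appear as ‘zn1’ and ‘age1’ in the output):

```r
# Same predictors as the final logit model, refit with a probit link
fit_probit <- glm(target ~ zn + indus + nox + age + dis + rad + medv,
                  data = train, family = binomial(link = "probit"))
summary(fit_probit)
```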
A plot of the standardized deviance residuals for this model revealed the need to remove four high leverage outliers. Removal of the outliers led to the ‘ptratio’ variable losing its statistical significance, which allowed us to reduce the number of predictors from 8 to 7 and produced the performance metrics indicated in the ‘Without Outliers’ column in the table above.
While the performance metrics show little change relative to the 8-variable “With Outliers” model, the 7-variable model is preferred since it is somewhat simpler than the 8-variable model.
The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:
Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.
The probit coefficients for this 7-variable model are as follows:
| Coefficient | Variable |
|---|---|
| -15.171 | Intercept |
| - 1.237 | zn1 |
| - 0.050 | indus |
| +22.241 | nox |
| + 0.730 | age1 |
| + 0.334 | dis |
| + 0.274 | rad |
| + 0.036 | medv |
The (+/-) signs of the coefficients mirror those of the logit approach shown in Model 2 and as such will not be reiterated here. In fact, there appears to be little difference between the logit (Model 2) and probit (Model 3) models we created via backward selection. Both consist of the same set of predictors and the coefficients of those predictors indicate similar predictive qualities.
————————————————————————————————————————–
The fourth model made use of forward selection with a logit link function in an attempt to produce a binary logistic regression model yielding the lowest AIC value possible when only statistically significant predictor variables were considered.
The forward selection process made use of the knowledge we gained in Part 1 via the construction of the correlation matrix. As we discussed in Part 1, several of the potential predictor variables show evidence of being strongly correlated with the response variable. The table below summarizes the correlations of all 13 potential predictor variables.
| Var | Target |
|---|---|
| zn | -0.43 |
| indus | 0.60 |
| chas | 0.08 |
| nox | 0.73 |
| rm | -0.15 |
| age | 0.63 |
| dis | -0.62 |
| rad | 0.63 |
| tax | 0.61 |
| ptratio | 0.25 |
| black | -0.35 |
| lstat | 0.47 |
| medv | -0.27 |
Our knowledge of these correlations allowed us to carefully select variables to be added to the model based on the absolute value of their correlation with the response variable, with the variable having the highest absolute value added first followed by others in descending order of their absolute values. If the model did not show improved metrics after a variable was added (perhaps due to collinearity), that predictor was discarded and the next was tried.
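One way to sketch this loop in R is shown below. It assumes all columns (including the 0/1 indicators) are stored as numeric so that cor() applies, and it uses AIC as the sole yardstick; the actual process also weighed statistical significance, so this is only an approximation:

```r
# Forward selection guided by |correlation with target|: try predictors in descending
# order, keeping each one only if it improves the model's AIC
cors  <- cor(train)[, "target"]
cands <- names(sort(abs(cors[names(cors) != "target"]), decreasing = TRUE))
fit <- glm(target ~ 1, data = train, family = binomial)
for (v in cands) {
  candidate <- update(fit, as.formula(paste(". ~ . +", v)))
  if (AIC(candidate) < AIC(fit)) fit <- candidate   # otherwise discard, try the next
}
summary(fit)
```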
Using this iterative process the variables nox, rad, age, tax, ptratio, and medv were successfully added to the model yielding the metrics shown below:
| Metric | Value |
|---|---|
| Number of Predictors | 6 |
| AIC | 220.52 |
| Accuracy | 0.9120 |
| Classification Error Rate | 0.0880 |
| Precision | 0.9273 |
| Sensitivity | 0.8908 |
| Specificity | 0.9325 |
| F1 Score | 0.9087 |
| AUC | 0.9120 |
The marginal model plots for this model show relatively strong agreement between the modeled data and the actual data, with some minor deviance evident with both the ‘medv’ and ‘ptratio’ variables:
A plot of the standardized deviance residuals for this 6-variable model revealed no high leverage outliers, and its logit coefficients are as follows:
| Coefficient | Variable |
|---|---|
| -27.916 | Intercept |
| +34.940 | nox |
| + 0.779 | rad |
| + 1.190 | age1 |
| - 0.010 | tax |
| + 0.330 | ptratio |
| + 0.062 | medv |
The ‘nox’, ‘age1’, ‘rad’, ‘tax’, and ‘medv’ predictors for Model 4 overlap those discussed for Models 1 and 2, and all have directionally similar coefficients to those models. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 and Model 2 discussions above.
Of the remaining ‘ptratio’ variable we can state the following: the higher a neighborhood’s pupil-teacher ratio, the more likely it is that the neighborhood will have an above-median crime rate.
————————————————————————————————————————–
The table below summarizes the model statistics for all four of our models. The models are listed from left to right in accordance with the order in which they were described in Part 3.
| Metric | bestglm | Backward / Logit | Backward / Probit | Forward - AIC |
|---|---|---|---|---|
| # Predictors | 4 | 7 | 7 | 6 |
| AIC | 227.34 | 221.87 | 221.77 | 220.52 |
| Accuracy | 0.8777 | 0.9142 | 0.9120 | 0.9120 |
| Class.Err.R. | 0.1223 | 0.0858 | 0.0880 | 0.0880 |
| Precision | 0.8874 | 0.9163 | 0.9159 | 0.9273 |
| Sensitivity | 0.8603 | 0.9083 | 0.9039 | 0.8908 |
| Specificity | 0.8945 | 0.9198 | 0.9198 | 0.9325 |
| F1 Score | 0.8736 | 0.9123 | 0.9099 | 0.9087 |
| AUC | 0.8774 | 0.9142 | 0.9120 | 0.9120 |
As we can see in the table, Models 2, 3, and 4 are fairly similar in their performance metrics, with Model 2 (Backward / Logit) yielding the highest F1 Score and AUC and Model 4 (Forward / AIC) yielding the lowest AIC and highest precision and specificity. However, the differences in their performance metrics are marginal at best. Of those three, Model 4 is the simplest overall since it has the smallest number of predictor variables. Furthermore, Model 4’s marginal model plots exhibited less overall deviance than did those of either Model 2 or Model 3.
Model 1 appears to slightly underperform each of the other three models relative to every performance metric we’ve evaluated. However, with only four predictor variables, the model is easily the simplest of the four models we’ve constructed.
Of the four models, we chose to first eliminate Models 2 and 3 from consideration due to Model 4’s greater simplicity, lower AIC, better precision and specificity, and better marginal model plots. Model 1 was retained due to its relative simplicity (only 4 variables) even though its performance metrics were slightly below those of the other three models.
The two final models can be summarized as follows:
Model 1: Less complex, ideal marginal model plots, but slightly lower performance
Model 4: More complex (2 additional variables), slightly less than ideal marginal model plots, but better overall performance.
Picking our preferred model required answering the following question: is Model 4’s better overall performance worth the additional complexity of its two extra predictor variables, or should the simpler Model 1 be preferred?
An objective answer to this question was found through the use of a likelihood ratio test. In a likelihood ratio test, two nested models having different numbers of predictors are compared to determine whether or not the smaller logistic regression model should be preferred over the larger one, assuming both are “valid” in the sense that their marginal model plots show no significant evidence of deviance and their performance metrics are reasonably similar.
In our likelihood ratio test, the smaller model is considered to be the null hypothesis H0, and a p-value for the overall model fit statistic that is less than 0.05 would indicate that we should reject the null hypothesis in favor of the alternative hypothesis HA. In other words, a small p-value would provide evidence against the smaller model and in favor of the larger model.
We conducted a likelihood ratio test using R’s lrtest() function. The results are displayed below:
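The call itself is a one-liner; lrtest() comes from the lmtest package, and the model object names below are our assumptions:

```r
library(lmtest)

# Larger (6-predictor) Model 4 first, nested 4-predictor Model 1 second
lrtest(fit_forward, fit_bestglm)
```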
## Likelihood ratio test
##
## Model 1: target ~ nox + rad + age + tax + ptratio + medv
## Model 2: target ~ nox + age + rad + tax
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -103.26
## 2 5 -108.67 -2 10.821 0.004469 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show a p-value of less than 0.05, indicating that we should reject the null hypothesis H0 in favor of the alternative hypothesis HA. Specifically, there is sufficient statistical evidence for us to justifiably prefer Model 4 (the larger model) over Model 1. Therefore, we selected Model 4, the Forward - AIC model, as the basis for our predictions of the response variable in the Evaluation data set.
To ensure the model’s efficacy when applied to the Evaluation data set, we apply the same set of transformations used on the Training data set prior to building our individual models. The results of those transformations can be found here:
https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-EvalData-.csv
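Scoring the evaluation set is then a matter of applying the fitted model object; a sketch, with our assumed object name ‘fit_forward’ and the cleaned file read from a local copy:

```r
# Score the cleaned evaluation data with the selected model and classify at 0.5
eval_data <- read.csv("621-HW3-Clean-EvalData-.csv")
eval_data$age <- factor(eval_data$age)   # match the factor coding used in training
p_hat <- predict(fit_forward, newdata = eval_data, type = "response")
eval_data$target <- as.integer(p_hat > 0.5)
```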
The Forward - AIC model is then applied to the evaluation data to yield a set of predicted values for the ‘target’ response variable. The results of the prediction process are shown in the table below.
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.07 | 0 | 0.469 | 7.185 | 0 | 4.9671 | 2 | 242 | 17.8 | 1.2568 | 4.03 | 34.7 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.096 | 1 | 4.4619 | 4 | 307 | 21.0 | 1.2465 | 10.26 | 18.2 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.495 | 1 | 4.4547 | 4 | 307 | 21.0 | 1.2528 | 12.80 | 18.4 | 0 |
| 0 | 8.14 | 0 | 0.538 | 5.950 | 1 | 3.9900 | 4 | 307 | 21.0 | 1.1123 | 27.71 | 13.2 | 1 |
| 0 | 5.96 | 0 | 0.499 | 5.850 | 0 | 3.9342 | 5 | 279 | 19.2 | 1.2600 | 8.77 | 21.0 | 0 |
| 1 | 5.13 | 0 | 0.453 | 5.741 | 0 | 7.2254 | 8 | 284 | 19.7 | 1.2586 | 13.15 | 18.7 | 0 |
| 1 | 5.13 | 0 | 0.453 | 5.966 | 1 | 6.8185 | 8 | 284 | 19.7 | 1.2449 | 14.44 | 16.0 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.630 | 0 | 4.4377 | 3 | 247 | 18.5 | 1.2563 | 6.53 | 26.6 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.121 | 0 | 3.7476 | 3 | 247 | 18.5 | 1.2586 | 8.44 | 22.2 | 0 |
| 0 | 2.89 | 0 | 0.445 | 6.163 | 0 | 3.4952 | 2 | 276 | 18.0 | 1.2560 | 11.34 | 21.4 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.856 | 1 | 1.9444 | 2 | 188 | 19.1 | 1.2385 | 25.41 | 17.3 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.613 | 1 | 1.7572 | 2 | 188 | 19.1 | 1.2294 | 27.26 | 15.7 | 0 |
| 0 | 21.89 | 0 | 0.624 | 5.637 | 1 | 1.9799 | 4 | 437 | 21.2 | 1.2600 | 18.34 | 14.3 | 1 |
| 0 | 19.58 | 0 | 0.605 | 6.101 | 1 | 2.2834 | 5 | 403 | 14.7 | 1.1201 | 9.81 | 25.0 | 1 |
| 0 | 19.58 | 0 | 0.605 | 5.880 | 1 | 2.3887 | 5 | 403 | 14.7 | 1.2200 | 12.03 | 19.1 | 1 |
| 0 | 10.59 | 1 | 0.489 | 5.960 | 1 | 3.8771 | 4 | 277 | 18.6 | 1.2571 | 17.27 | 21.7 | 0 |
| 0 | 6.20 | 0 | 0.504 | 6.552 | 0 | 3.3751 | 8 | 307 | 17.4 | 1.2467 | 3.76 | 31.5 | 1 |
| 0 | 6.20 | 0 | 0.507 | 8.247 | 0 | 3.6519 | 8 | 307 | 17.4 | 1.2456 | 3.95 | 48.3 | 1 |
| 1 | 5.86 | 0 | 0.431 | 6.957 | 0 | 8.9067 | 7 | 330 | 19.1 | 1.2514 | 3.53 | 29.6 | 0 |
| 1 | 2.97 | 0 | 0.400 | 7.088 | 0 | 7.3073 | 1 | 285 | 15.3 | 1.2583 | 7.85 | 32.2 | 0 |
| 1 | 1.76 | 0 | 0.385 | 6.230 | 0 | 9.0892 | 1 | 241 | 18.2 | 1.2145 | 12.93 | 20.1 | 0 |
| 1 | 2.18 | 0 | 0.472 | 6.616 | 0 | 3.3700 | 7 | 222 | 18.4 | 1.2572 | 8.93 | 28.4 | 0 |
| 0 | 9.90 | 0 | 0.544 | 6.122 | 0 | 2.6403 | 4 | 304 | 18.4 | 1.2600 | 5.98 | 22.1 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.415 | 0 | 4.7211 | 5 | 287 | 19.6 | 1.2600 | 6.12 | 25.0 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.312 | 0 | 5.4159 | 5 | 287 | 19.6 | 1.2600 | 6.15 | 23.0 | 0 |
| 0 | 5.19 | 0 | 0.515 | 5.895 | 0 | 5.6150 | 5 | 224 | 20.2 | 1.2583 | 10.56 | 18.5 | 0 |
| 1 | 2.01 | 0 | 0.435 | 6.635 | 0 | 8.3440 | 4 | 280 | 17.0 | 1.2553 | 5.99 | 24.5 | 0 |
| 0 | 18.10 | 0 | 0.718 | 3.561 | 1 | 1.6132 | 24 | 666 | 20.2 | 1.2256 | 7.12 | 27.5 | 1 |
| 0 | 18.10 | 1 | 0.631 | 7.016 | 1 | 1.2024 | 24 | 666 | 20.2 | 1.2561 | 2.96 | 50.0 | 1 |
| 0 | 18.10 | 0 | 0.584 | 6.348 | 1 | 2.0527 | 24 | 666 | 20.2 | 0.9189 | 17.64 | 14.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.935 | 1 | 1.8206 | 24 | 666 | 20.2 | 0.8926 | 34.02 | 8.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.627 | 1 | 1.8172 | 24 | 666 | 20.2 | 1.2600 | 22.88 | 12.8 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.818 | 1 | 1.8662 | 24 | 666 | 20.2 | 1.2557 | 22.11 | 10.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.219 | 1 | 2.0048 | 24 | 666 | 20.2 | 1.2590 | 16.59 | 18.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.854 | 1 | 1.8956 | 24 | 666 | 20.2 | 1.1204 | 23.79 | 10.8 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.525 | 1 | 2.4358 | 24 | 666 | 20.2 | 0.8557 | 18.13 | 14.1 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.376 | 1 | 2.5671 | 24 | 666 | 20.2 | 1.2556 | 14.65 | 17.7 | 1 |
| 0 | 18.10 | 0 | 0.655 | 6.209 | 0 | 2.9634 | 24 | 666 | 20.2 | 1.2600 | 13.22 | 21.4 | 1 |
| 0 | 9.69 | 0 | 0.585 | 5.794 | 0 | 2.8927 | 6 | 391 | 19.2 | 1.2600 | 14.10 | 18.3 | 1 |
| 0 | 11.93 | 0 | 0.573 | 6.976 | 1 | 2.1675 | 1 | 273 | 21.0 | 1.2600 | 5.64 | 23.9 | 0 |
The full set of evaluation data including the predicted values can also be found at the following web link:
https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-PRED-EVAL-ALL_M_DATA.csv
Summary statistics for the ‘target’ variable for both the training data set and the evaluation data set are shown below (NOTE: “n” represents the number of records within a data set).
| Data Set | n | Mean | sd | Med. | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaluation | 40 | 0.45 | 0.50 | 0.00 | 0.00 | 1.00 | 1 | 0.19 | -2.01 | 0.08 |
| Training | 466 | 0.49 | 0.50 | 0.00 | 0.00 | 1.00 | 1 | 0.03 | -2.00 | 0.02 |
After developing 4 distinct models using a best subset algorithm, backward selection using both logit and probit link functions, and forward selection we’ve selected what we believe to be our best model based on the results of a likelihood ratio test. Though the training data set presented a variety of challenges, including significant clustering of data values within roughly 25% of the cases and the heavily skewed distributions of some of the predictor variables, the binary logistic regression model we’ve selected demonstrated greater than 90% accuracy, precision, and specificity while generating an F1 Score greater than .90 and an AUC of more than .91. Furthermore, the model appears to be well-fit relative to the training data as evidenced by the strong results seen in the model’s marginal model plots.
Therefore, we believe we’ve selected a model that could be effective for predicting the likelihood that a neighborhood in the Boston metropolitan area will have an above-median crime rate. However, we can’t be certain the model will be effective for judging the crime rates of other metropolitan areas due to wide disparities in the ways in which factors such as property tax rates and zoning are applied by other municipalities. Further research would therefore be required to determine whether our selected model could be applied to other geographic areas.
————————————————————————————————————————–
The Appendix to this document containing all of the R code used for this assignment (as well as other relevant output) can be found in a separate PDF file accessible via this web link:
https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-Appendix.pdf