————————————————————————————————————————–

In this assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median (1) or not (0). Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided).

Part 1: Data Exploration

Data Summary

The data set we are exploring focuses on the binomial success or failure of whether the crime rate is above (1) or below (0) the median for a particular neighborhood. There are 466 rows of data, each representing a neighborhood located in the Boston metropolitan area. For each neighborhood we are provided with 13 attributes that could potentially be used as predictor variables and one response variable (“target”) that indicates whether or not the neighborhood does, in fact, have an above-median crime rate. Our assignment is to try to predict whether or not a neighborhood is likely to have an above-median crime rate based on the 13 attributes provided in the data set. A summary table for the data set is provided below.

vars n mean sd median min max range skew kurtosis se
zn 466 12 23 0 0 100 100 2 3.81 1.08
indus 466 11 7 10 0 28 28 0 -1.24 0.32
chas 466 0 0 0 0 1 1 3 9.15 0.01
nox 466 1 0 1 0 1 0 1 -0.04 0.01
rm 466 6 1 6 4 9 5 0 1.54 0.03
age 466 68 28 77 3 100 97 -1 -1.01 1.31
dis 466 4 2 3 1 12 11 1 0.47 0.10
rad 466 10 9 5 1 24 23 1 -0.86 0.40
tax 466 410 168 335 187 711 524 1 -1.15 7.78
ptratio 466 18 2 19 13 22 9 -1 -0.40 0.10
black 466 357 91 391 0 397 397 -3 7.34 4.23
lstat 466 13 7 11 2 38 36 1 0.50 0.33
medv 466 23 9 21 5 50 45 1 1.37 0.43
target 466 0 1 0 0 1 1 0 -2.00 0.02

Our analysis of the data yielded no evidence of missing values. Some variables, such as ‘zn’, ‘age’, ‘rad’, and ‘tax’, show evidence of significant skew, as indicated by the large differences between their mean and median values, while ‘black’ shows signs of being potentially problematic given its large kurtosis and standard error values.
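These checks can be reproduced with a short sketch; the summary table above resembles the output of the describe() function from the psych package. The data frame and file names below are assumptions for illustration:

# Load the training data (file name assumed for illustration)
crime_train <- read.csv("crime-training-data.csv")

# Confirm there are no missing values in any column
colSums(is.na(crime_train))

# Descriptive statistics similar to the summary table shown above
library(psych)
describe(crime_train)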

Analysis of Predictor Variables

Boxplots of each independent variable relative to the binary response are one way in which we can further investigate potential skew issues while also gaining insight into the predictive aspects of the data:
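One way to generate such boxplots is sketched below, assuming the crime_train data frame from above (the original plotting code resides in the Appendix):

# Reshape to long format so every predictor can share one faceted plot
library(ggplot2)
library(tidyr)

crime_long <- pivot_longer(crime_train, cols = -target,
                           names_to = "variable", values_to = "value")

# One boxplot per predictor, split by the binary response
ggplot(crime_long, aes(x = factor(target), y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "target (1 = above-median crime rate)", y = NULL)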

The boxplots shown above indicate that our skew suspicions regarding the ‘zn’, ‘age’, ‘rad’, and ‘tax’ variables appear to have been justified. Furthermore, we are able to deduce some predictive aspects of the variables in relation to the response variable. For example, it appears as though larger values for ‘zn’, ‘rm’, ‘dis’, and ‘medv’ will each tend to decrease the likelihood that a neighborhood will have an above-median crime rate, while larger values for ‘indus’, ‘nox’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘lstat’ will each tend to increase the likelihood of an above-median crime rate for a given neighborhood.

Histograms allow us to more thoroughly examine whether the distribution of a variable is skewed, as well as whether there may be a high incidence of specific variable values throughout the data set:

The histograms again demonstrate the skew issues we’ve previously identified while also indicating that the distributions of many of the skewed variables may be dominated by high frequencies of specific data values. For example, ‘zn’, ‘indus’, ‘chas’, ‘age’, ‘rad’, ‘tax’, ‘ptratio’, and ‘black’ all appear to not only be skewed, but also have a large number of data values “clustered” around a specific number.

Aggregating variable values by their frequencies allows us to assess whether any distinctive clustering effects occur within the data. For example, the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables each have an unusually large number of identical values. Analysis reveals that 121 rows of the data set contain recurring values for each of these variables.
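A sketch of this frequency aggregation; the specific values tested are those reported in the clustering table further below:

# Most common value and its frequency for each predictor
sapply(crime_train[, names(crime_train) != "target"], function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  c(value = names(tab)[1], count = as.integer(tab[1]))
})

# Rows sharing the recurring values of all four variables simultaneously
sum(crime_train$indus == 18.10 & crime_train$rad == 24 &
    crime_train$tax == 666 & crime_train$ptratio == 20.2)  # 121 rows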

The ‘zn’, ‘age’ and ‘black’ variables also show evidence of clustering on a single data value, though not entirely coincident with the 121 rows mentioned above for the ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’ variables.

Further analysis reveals that more than 70% of the values for ‘zn’ are zeroes, indicating that the variable may be a good candidate for conversion to a binary ‘0/1’ indicator where all non-zero values are converted to 1’s. Similarly, analysis of the boxplot for the ‘age’ variable indicates that the vast majority of instances where ‘age’ coincides with a positive value of the response variable occur for values of (age > 80). As such, ‘age’ also appears to be a strong candidate for conversion to a binary ‘0/1’ indicator, with all values of (age > 80) converted to 1’s while values of (age <= 80) are converted to zeroes.

The reason for the clustering of values for the ‘black’ variable is much more difficult to interpret due to the nature of the variable itself: its values are calculated using a quadratic equation whose meaning is unclear. However, a component of that equation is a simple proportion, which can be easily interpreted. Therefore, the ‘black’ variable is also a strong candidate for transformation during Data Preparation.

Variables that show such high frequencies of specific values cannot easily be transformed without information loss. For example, analysis of the geographic variable ‘rad’ reveals that ‘rad’ in some ways may be an indicator of the economic status of a neighborhood, and economic status may be a significant indicator of crime rates. As such, transformation of such values during data preparation must only be attempted after careful consideration of the consequences.

The table below summarizes the clustering observed for these six variables.

Attribute Most Common Value Times Repeated
zn 0 339
ptratio 20.2 128
indus 18.10 121
tax 666 121
rad 24 121
black 396.90 112

The clustering of these data values can be strongly indicative of collinearity amongst the respective variables. We can identify prospective collinear variables through the use of a correlation matrix.

Correlation Matrix

A correlation matrix for the full data set is provided below. As can be seen in the matrix, several variables appear to be more than (+/-) 0.50 correlated with the response variable, including ‘indus’, ‘nox’, ‘age’, ‘dis’, ‘rad’, and ‘tax’. Variables exhibiting such relatively high correlation with the response variable should be of particular interest for purposes of constructing an effective binary logistic regression model.

zn indus chas nox rm age dis rad tax ptratio black lstat medv target
zn 1.00 -0.54 -0.04 -0.52 0.32 -0.57 0.66 -0.32 -0.32 -0.39 0.18 -0.43 0.38 -0.43
indus -0.54 1.00 0.06 0.76 -0.39 0.64 -0.70 0.60 0.73 0.39 -0.36 0.61 -0.50 0.60
chas -0.04 0.06 1.00 0.10 0.09 0.08 -0.10 -0.02 -0.05 -0.13 0.04 -0.05 0.16 0.08
nox -0.52 0.76 0.10 1.00 -0.30 0.74 -0.77 0.60 0.65 0.18 -0.38 0.60 -0.43 0.73
rm 0.32 -0.39 0.09 -0.30 1.00 -0.23 0.20 -0.21 -0.30 -0.36 0.13 -0.63 0.71 -0.15
age -0.57 0.64 0.08 0.74 -0.23 1.00 -0.75 0.46 0.51 0.26 -0.27 0.61 -0.38 0.63
dis 0.66 -0.70 -0.10 -0.77 0.20 -0.75 1.00 -0.49 -0.53 -0.23 0.29 -0.51 0.26 -0.62
rad -0.32 0.60 -0.02 0.60 -0.21 0.46 -0.49 1.00 0.91 0.47 -0.45 0.50 -0.40 0.63
tax -0.32 0.73 -0.05 0.65 -0.30 0.51 -0.53 0.91 1.00 0.47 -0.44 0.56 -0.49 0.61
ptratio -0.39 0.39 -0.13 0.18 -0.36 0.26 -0.23 0.47 0.47 1.00 -0.18 0.38 -0.52 0.25
black 0.18 -0.36 0.04 -0.38 0.13 -0.27 0.29 -0.45 -0.44 -0.18 1.00 -0.35 0.33 -0.35
lstat -0.43 0.61 -0.05 0.60 -0.63 0.61 -0.51 0.50 0.56 0.38 -0.35 1.00 -0.74 0.47
medv 0.38 -0.50 0.16 -0.43 0.71 -0.38 0.26 -0.40 -0.49 -0.52 0.33 -0.74 1.00 -0.27
target -0.43 0.60 0.08 0.73 -0.15 0.63 -0.62 0.63 0.61 0.25 -0.35 0.47 -0.27 1.00

The correlation matrix also shows that several of the potential predictor variables are correlated with each other. Such relatively high correlations are usually an indication of collinearity between the respective variables. The table below summarizes the largest such correlations:

Variable Correlated w/
zn .66 with dis
indus .76 with nox
nox -.77 with dis
rm .71 with medv
age -.75 with dis
dis -.70 with indus
rad .91 with tax
lstat -.74 with medv

Many of these correlations appear to make sense from an intuitive perspective. For example, the correlation between tax and rad suggests that neighborhoods with relatively better access to radial highways in the Boston area also have relatively high property tax rates. Similarly, the median value of homes (medv) appears to be highly correlated with the average number of rooms per dwelling (rm). Nitrogen oxide levels (nox) appear to be higher in neighborhoods with relatively high levels of industrial zoning (indus), etc.

Such strong correlations can be used during model building to select what are expected to be significant predictors of the response variable while helping to avoid the potential inclusion of additional variables that may be collinear with those already included. For example, the ‘tax’ variable might justifiably be dropped from consideration during model building due to its high correlation with the ‘rad’ variable.

The correlation matrix also shows that the ‘chas’ variable is not highly correlated with either the target variable or any of the other predictor variables. Additional analysis shows that only 33 out of the 466 rows in the data set have a non-zero value for ‘chas’. In other words, the ‘chas’ variable does not appear to be imparting much substantive information within the context of the other variables and could justifiably be dropped from consideration for inclusion in any regression models.
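A sketch of the correlation analysis, including the ‘chas’ count mentioned above:

# Correlation matrix, rounded to two decimals
round(cor(crime_train), 2)

# Magnitude of each variable's correlation with the response, strongest first
sort(abs(round(cor(crime_train)[, "target"], 2)), decreasing = TRUE)

# 'chas' carries little information: only 33 non-zero values
sum(crime_train$chas != 0)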

Conclusion of Data Exploration

During our data exploration efforts we identified one variable (chas) that can justifiably be ignored during model building, and we identified several instances of significant correlation amongst various predictor variables that could serve as justification for dropping one or more other variables from the model building process. Furthermore, we identified 121 rows of data that, due to recurring variable values, require further investigation to determine whether or not additional action might be required to make use of those rows in our analysis. Finally, we identified three variables (‘zn’, ‘age’ and ‘black’) that are good candidates for transformation during the Data Preparation process.

————————————————————————————————————————–

Part 2 - Data Preparation

Our Data Preparation efforts included an investigation of what appeared to be an abnormally large number of records within the data set having shared values, the conversion of two predictor variables to binary “0/1” factor variables, and the simplification of the ‘black’ variable via mathematical transformation. While we also considered the possibility of transforming one or more of the predictor variables that have skewed distributions, we chose not to apply any such transforms prior to model building since normal distributions aren’t necessarily required for logistic regression modeling. Transforms can be applied if the marginal model plots for a logistic regression model show evidence of deviance between the modeled data and the actual data, but they aren’t required prior to model building.

Step 1: Investigation of Recurring Variable Values

Our data exploration work showed evidence of 121 out of 466 neighborhoods within the training data set having recurring values for each of four variables: ‘indus’, ‘rad’, ‘tax’, and ‘ptratio’. Initially, this seemingly anomalous repetition of values across such a large percentage of the data led us to question whether those 121 records were, in fact, legitimate, and might perhaps be candidates for some sort of transformation (e.g., replacing the recurring value via imputation or use of a mean or median) or even removal from the data set prior to building models so as to minimize the skew they obviously produce in the distributions of the four affected variables.

However, we concluded that such clustering could legitimately be explained by each of those 121 neighborhoods being located within the bounds of a single municipality. If that is, in fact, the case, then we would expect each neighborhood to have identical proportions of non-retail business acres per town (as indicated by ‘indus’), identical indices of accessibility to radial highways (as indicated by ‘rad’), identical full-value property-tax rates per $10,000 (as indicated by ‘tax’), and identical pupil-teacher ratios (as indicated by ‘ptratio’). Therefore, it was decided that no action was required to address that aspect of the data.

Step 2: Converting the ‘zn’ Variable to a Binary ‘0/1’ Factor

Our analysis of the ‘zn’ variable indicated that nearly 73% of its values were zeroes, with the remaining 27% being positive values ranging between 12.5 and 100 (inclusive). Since the variable is meant to indicate “the proportion of residential land zoned for large lots” within a neighborhood, we concluded it was reasonable to assume that nearly 73% of neighborhoods within the data set have no such zoning. Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicative of whether or not a neighborhood has such zoning is justified.

The ‘zn’ variable was transformed via simple thresholding where all non-zero values were converted to ‘1’, yielding the following interpretation:

  • zn = 0: the neighborhood has no residential land zoned for large lots
  • zn = 1: the neighborhood has residential land zoned for large lots

Step 3: Converting the ‘age’ Variable to a Binary ‘0/1’ Factor

The ‘age’ variable is meant to represent “the proportion of owner-occupied units built prior to 1940.” Our analysis of the variable’s boxplots led us to conclude that the vast majority of instances where the crime rate was deemed to be above the median occurred in neighborhoods where \(age > 80\). Therefore, we believe a conversion of the variable to a binary ‘0/1’ factor indicating whether or not a neighborhood has more than 80% of its owner-occupied units built prior to 1940 is justified.

The ‘age’ variable was transformed via simple thresholding where all values greater than 80 were converted to ‘1’, and all values less than or equal to 80 were converted to ‘0’, yielding the following interpretation:

  • age = 0: the proportion of owner-occupied units built prior to 1940 is <= 80
  • age = 1: the proportion of owner-occupied units built prior to 1940 is > 80
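Both threshold conversions reduce to one line each, as in the sketch below; converting the results to factors is what produces the ‘zn1’ and ‘age1’ coefficient labels seen in Part 3:

# zn: 1 if the neighborhood has any residential land zoned for large lots
crime_train$zn <- factor(as.integer(crime_train$zn > 0))

# age: 1 if more than 80% of owner-occupied units were built prior to 1940
crime_train$age <- factor(as.integer(crime_train$age > 80))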

Step 4: Transforming the ‘black’ Variable into a Proportion

The ‘black’ variable is defined as the result of the mathematical formula \(1000(Bk - 0.63)^2\), where Bk is “the proportion of blacks by town”. The exact meaning of this metric is very difficult to interpret: we can see that Bk is a proportion with a clear meaning, but no explanation has been provided as to why the other components of the equation are necessary.

However, through simple algebra we can transform the variable so that we are left with only the proportion Bk, which is an understandable metric for our purposes. Rearranging the formula for ‘black’ yields the following equation for the proportion Bk:

\[Bk = \frac{\sqrt{\text{black}} + 19.92235}{31.62278}\]

Note that the quadratic actually has two roots, \(Bk = 0.63 \pm \sqrt{\text{black}/1000}\); the equation above takes the upper branch, which is only valid where the true proportion is at least 0.63. Applying this formula to the ‘black’ variable changes its meaning to: “the proportion of blacks by town”.
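Applied in R, the transformation is a single vectorized expression (a sketch, using the crime_train data frame assumed earlier):

# Recover the proportion Bk from 'black' (upper branch of the quadratic)
crime_train$black <- (sqrt(crime_train$black) + 19.92235) / 31.62278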

The transformed data set can be seen here:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-Data.csv

————————————————————————————————————————–

Part 3 - Build Models

Model 1: Use the bestglm Function to Build a Model

Approach:

Our first approach applies R’s bestglm function in an attempt to identify the “best subset” of predictor variables to use in a binary logistic regression model. An initial application of the function made use of all 13 of the predictor variables, applied backward selection via the ‘logit’ method, and used the Akaike Information Criterion (AIC) to produce the subset of variables yielding the lowest overall AIC value across all possible combinations of our predictors. The result of the initial run was a model comprising 11 predictor variables, with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 11 11
AIC 212.88 189.46
Accuracy 0.9206 0.9239
Classification Error Rate 0.0794 0.0761
Precision 0.9248 0.9357
Sensitivity 0.9127 0.9067
Specificity 0.9282 0.9404
F1 Score 0.9187 0.9211
AUC 0.9205 0.9235
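A call along the following lines drives the subset search (a sketch, not the exact Appendix code; bestglm expects the response in the last column of the supplied data frame):

library(bestglm)

# Reorder columns so that the response 'target' comes last
Xy <- crime_train[, c(setdiff(names(crime_train), "target"), "target")]

# Best subset of predictors by AIC; setting IC = "BIC" instead yields
# the 4-variable model discussed below
best_aic <- bestglm(Xy, family = binomial, IC = "AIC")
summary(best_aic$BestModel)   # BestModel is an ordinary glm fit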

A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.
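The outlier-removal iterations can be sketched as follows; the residual cutoff of 2 and the single trimming pass are assumptions, as the Appendix may iterate differently:

# Standardized deviance residuals for the fitted best-subset model
fit <- best_aic$BestModel
res <- rstandard(fit, type = "deviance")
plot(res, ylab = "Standardized deviance residuals")

# Drop the outlying rows and refit
keep <- abs(res) <= 2
best_aic_trim <- bestglm(Xy[keep, ], family = binomial, IC = "AIC")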

The table shows that the performance metrics did improve as a result of the removal of the outliers, with the AIC in particular experiencing a rather significant decrease.

Though this model appears to yield strong performance statistics, the number of predictor variables (11) was deemed too large to be of practical use. As such, an alternative application of the bestglm function was then attempted wherein the ‘IC’ parameter was set to “BIC” instead of “AIC” to force the function to make use of the Bayesian Information Criterion (BIC) for purposes of identifying the “best subset” of predictor variables.

The result of applying the BIC method was a model comprising only 4 predictor variables with the following performance metrics:

Metric Value
Number of Predictors 4
AIC 227.34
Accuracy 0.8777
Classification Error Rate 0.1223
Precision 0.8874
Sensitivity 0.8603
Specificity 0.8945
F1 Score 0.8736
AUC 0.8774
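The performance metrics reported throughout this report can be derived from a fitted model with a sketch like the one below; the 0.5 classification cutoff is an assumption:

library(pROC)

# In-sample predicted probabilities and 0/1 classifications
probs  <- predict(fit, type = "response")
pred   <- as.integer(probs > 0.5)
actual <- Xy$target

# Confusion-matrix counts
tp <- sum(pred == 1 & actual == 1); tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0); fn <- sum(pred == 0 & actual == 1)

accuracy    <- (tp + tn) / length(actual)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
auc(roc(actual, probs))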

The marginal model plots for this model show very strong agreement between the modeled data and the actual data, with no deviance evident in any of the four plots:

A plot of the standardized deviance residuals for this model revealed no high-leverage outliers, and as we can see from the table above, the performance metrics of the model are somewhat lower than those of the final 11-variable bestglm model discussed above. However, that tradeoff in performance comes with the benefit of a vast improvement in the simplicity of the model: only four variables are required to produce results that are only slightly less accurate than those of the 11-variable model.

Therefore, it would appear reasonable to conclude that this smaller model would be preferred over the larger, more complex model if a slight decrease in model performance is acceptable for purposes of predicting the likelihood of a particular neighborhood having a crime rate above the median.

The logit coefficients for this 4-variable model are as follows:

Coefficient Variable
-18.575 Intercept
+32.290 nox
+ 1.048 age1
+ 0.705 rad
- 0.009 tax

While these coefficients can tell us whether a neighborhood is more or less likely to have a crime rate that is above the median, according to the supplemental Week 3 materials we were provided, we should not attempt to interpret the magnitude of the effect of each predictor’s coefficient on the response variable.

Inferences:

We can infer the following from the coefficients listed above:

  • nox: The higher the nitrogen oxides concentration (nox) is within a neighborhood, the more likely it is that the neighborhood will have an above-median crime rate. In other words, the more air pollution a neighborhood has, the more likely it is to have an above-median crime rate.

  • age1: If more than 80% of a neighborhood’s owner-occupied units were built prior to 1940 (age), the neighborhood is more likely to have an above-median crime rate. So neighborhoods with largely older housing stocks would appear more likely to have an above-median crime rate.

  • rad: The higher a neighborhood’s index of accessibility to radial highways (rad) is, the more likely it is that the neighborhood will have an above-median crime rate. Assuming the ‘rad’ index increases if more radial highways are accessible to a neighborhood, this would seem to indicate that neighborhoods located in urban areas (where access to radial highways is typically a feature) are likely to be more prone to having higher crime rates.

  • tax: The higher a neighborhood’s full-value property-tax rate per $10,000 (tax) is, the less likely it is that the neighborhood will have an above-median crime rate. The interpretation of this coefficient is much less clear than those of the three variables discussed above. Property tax rates vary widely between municipalities and are highly susceptible to being changed for either political or budgetary reasons. Perhaps it is true that in the Boston metropolitan area an increase in a neighborhood’s property tax rate will decrease the likelihood of the neighborhood having an above-median crime rate. On the other hand, the value of this coefficient may simply be an artifact of the mathematics that comprise the logistic modeling process.

Model 2: Logit Model Using Backward Selection

Approach:

Our second model applied simple backward selection and a logit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, ‘ptratio’ and ‘lstat’) until all p-values were below 0.05. The final result was a model comprised of 7 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 7 7
AIC 221.87 221.87
Accuracy 0.9120 0.9142
Classification Error Rate 0.0880 0.0858
Precision 0.9159 0.9163
Sensitivity 0.9039 0.9083
Specificity 0.9198 0.9198
F1 Score 0.9099 0.9123
AUC 0.9120 0.9142
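A sketch of the backward-selection loop; using term-level likelihood-ratio tests from drop1() as the drop criterion is an assumption, as the actual iterations may have relied on Wald p-values from summary():

# Initial model: all predictors except 'chas' and 'tax' (per Part 1)
m <- glm(target ~ . - chas - tax, data = crime_train,
         family = binomial(link = "logit"))

# Repeatedly drop the least significant term until all p-values < 0.05
repeat {
  d <- drop1(m, test = "Chisq")[-1, ]   # term-level tests, '<none>' removed
  if (max(d[["Pr(>Chi)"]]) < 0.05) break
  worst <- rownames(d)[which.max(d[["Pr(>Chi)"]])]
  m <- update(m, as.formula(paste(". ~ . -", worst)))
}
summary(m)

# Model 3 repeats the procedure with family = binomial(link = "probit")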

A plot of the standardized deviance residuals for this model revealed a number of outliers. The rows containing those outliers were then removed from the data set via a series of modeling iterations, yielding a model with the performance metrics indicated in the ‘Without Outliers’ column of the table shown above.

The table shows that the performance metrics improved very little as a result of the removal of the outliers, with the AIC exhibiting no change at all.

The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:

Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.

The logit coefficients for this 7-variable model are as follows:

Coefficient Variable
-31.025 Intercept
- 2.632 zn1
- 0.104 indus
+45.130 nox
+ 1.394 age1
+ 0.747 dis
+ 0.509 rad
+ 0.088 medv

Inferences:

The ‘nox’, ‘age1’, and ‘rad’ predictors for Model 2 overlap those discussed in Model 1, and all have directionally similar coefficients to those found in Model 1. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 discussion above.

For the four remaining variables we can state the following:

  • zn1: Since we split the ‘zn’ variable into a binary indicator of whether or not a neighborhood has residential land zoned for large lots, the presence of large-lot zoning indicates that a neighborhood is less likely to have an above-median crime rate. This would seem to indicate that neighborhoods with denser housing are more likely to experience higher crime rates.

  • indus: The larger the proportion of non-retail business acres in the neighborhood, the less likely it is to have an above-median crime rate. This result appears to conform with that of the ‘zn1’ variable in that lower housing density seems to be indicative of a lower likelihood of having a relatively high crime rate.

  • dis: The larger the weighted mean of distances to five Boston employment centers for a neighborhood, the more likely it is to have an above-median crime rate. This result is somewhat non-intuitive in that many people would expect neighborhoods located farther away from an employment center to have a lesser likelihood of having higher crime rates since they would tend to be located outside of an urban center. However, housing densities may, in fact, be higher outside of business districts, thereby increasing the likelihood of a neighborhood having a higher crime rate. If this is the case, the result obtained for dis would appear to conform with those obtained for zn1 and indus.

  • medv: The larger the median value for homes is within a neighborhood, the more likely that neighborhood is to have above-median crime rates. This result is also somewhat non-intuitive since wealthier / more expensive neighborhoods are generally thought to have lower crime rates. This result would warrant further investigation at some point.

————————————————————————————————————————–

Model 3: Probit Model Using Backward Selection

Approach:

The third model applied simple backward selection and a probit link function to produce a binary logistic regression model. The initial modeling iteration excluded both the ‘chas’ and ‘tax’ variables for the reasons stated in Part 1. Successive iterations removed the least statistically significant predictors (‘rm’, ‘black’, and ‘lstat’) until all p-values were below 0.05. The final result was a model comprising 8 predictor variables with performance metrics as indicated in the ‘With Outliers’ column of the table below:

Metric With Outliers Without Outliers
Number of Predictors 8 7
AIC 224.46 221.77
Accuracy 0.9142 0.9120
Classification Error Rate 0.0858 0.0880
Precision 0.9163 0.9159
Sensitivity 0.9083 0.9039
Specificity 0.9198 0.9198
F1 Score 0.9133 0.9099
AUC 0.9142 0.9120

Removing Outliers and Next Iteration:

A plot of the standardized deviance residuals for this model revealed the need to remove four high-leverage outliers. Removal of the outliers led to the ‘ptratio’ variable losing its statistical significance, which allowed us to reduce the number of predictors from 8 to 7 and produced the performance metrics indicated in the ‘Without Outliers’ column of the table above.

While the performance metrics show little change relative to the 8-variable “With Outliers” model, the 7-variable model is preferred since it is somewhat simpler than the 8-variable model.

The marginal model plots for this model show some evidence of deviance with both the ‘indus’ and ‘dis’ variables:

Such deviance can be an indicator of a sub-optimal model and we will need to consider this factor later when we attempt to select our preferred model.

The probit coefficients for this 7-variable model are as follows:

Coefficient Variable
-15.171 Intercept
- 1.237 zn1
- 0.050 indus
+22.241 nox
+ 0.730 age1
+ 0.334 dis
+ 0.274 rad
+ 0.036 medv

Inferences:

The (+/-) signs of the coefficients mirror those of the logit approach shown in Model 2 and as such will not be reiterated here. In fact, there appears to be little difference between the logit (Model 2) and probit (Model 3) models we created via backward selection. Both consist of the same set of predictors and the coefficients of those predictors indicate similar predictive qualities.

————————————————————————————————————————–

Model 4: Forward Selection + AIC

Approach:

The fourth model made use of forward selection with a logit link function in an attempt to produce a binary logistic regression model yielding the lowest AIC value possible when only statistically significant predictor variables were considered.

The forward selection process made use of the knowledge we gained in Part 1 via the construction of the correlation matrix. As we discussed in Part 1, several of the potential predictor variables show evidence of being strongly correlated with the response variable. The table below summarizes the correlations of all 13 potential predictor variables.

Variable Corr. w/ Target
zn -0.43
indus 0.60
chas 0.08
nox 0.73
rm -0.15
age 0.63
dis -0.62
rad 0.63
tax 0.61
ptratio 0.25
black -0.35
lstat 0.47
medv -0.27

Our knowledge of these correlations allowed us to carefully select variables to be added to the model based on the absolute value of their correlation with the response variable, with the variable having the highest absolute value added first followed by others in descending order of their absolute values. If the model did not show improved metrics after a variable was added (perhaps due to collinearity), that predictor was discarded and the next was tried.
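A sketch of this correlation-guided forward selection; using AIC improvement as the keep/discard criterion is an assumption standing in for the “improved metrics” described above:

# Rank candidate predictors by |correlation with target|
# (data.matrix() converts the 0/1 factors back to numeric codes)
cors <- cor(data.matrix(crime_train))[, "target"]
candidates <- names(sort(abs(cors[names(cors) != "target"]),
                         decreasing = TRUE))

# Add each candidate in turn, keeping it only if the model improves
current <- glm(target ~ 1, data = crime_train, family = binomial)
for (v in candidates) {
  trial <- update(current, as.formula(paste(". ~ . +", v)))
  if (AIC(trial) < AIC(current)) current <- trial
}
summary(current)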

Iterations:

Using this iterative process, the variables nox, rad, age, tax, ptratio, and medv were successfully added to the model, yielding the metrics shown below:

Metric Value
Number of Predictors 6
AIC 220.52
Accuracy 0.9120
Classification Error Rate 0.0880
Precision 0.9273
Sensitivity 0.8908
Specificity 0.9325
F1 Score 0.9087
AUC 0.9120

The marginal model plots for this model show relatively strong agreement between the modeled data and the actual data, with some minor deviance evident with both the ‘medv’ and ‘ptratio’ variables:

A plot of the standardized deviance residuals for this 6-variable model revealed no high-leverage outliers, and its logit coefficients are as follows:

Coefficient Variable
-27.916 Intercept
+34.940 nox
+ 0.779 rad
+ 1.190 age1
- 0.010 tax
+ 0.330 ptratio
+ 0.062 medv

Inferences:

The ‘nox’, ‘age1’, ‘rad’, ‘tax’, and ‘medv’ predictors for Model 4 overlap those discussed for Models 1 and 2, and all have directionally similar coefficients to those models. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 and Model 2 discussions above.

For the remaining variable, ‘ptratio’, we can state the following:

  • ptratio: The higher the pupil-teacher ratio is in a neighborhood, the more likely the neighborhood is to have an above-median crime rate. This result is difficult to interpret since most people would typically associate higher pupil-teacher ratios with neighborhoods of lower economic status. However, our correlation matrix shows only a 0.38 correlation between those two variables within the data set. On the other hand, the matrix shows a -0.52 correlation between ‘ptratio’ and median home values (medv), which would indicate that wealthier neighborhoods do, in fact, enjoy lower pupil-teacher ratios.

————————————————————————————————————————–

Part 4. Select Models

Step 1: Compare key statistics

The chart below summarizes the model statistics for all four of our models. The models are listed from left to right in accordance with the order in which they were described in Part 3.

Metric bestglm Backward / Logit Backward / Probit Forward - AIC
# Predictors 4 7 7 6
AIC 227.34 221.87 221.77 220.52
Accuracy 0.8777 0.9142 0.9120 0.9120
Class.Err.R. 0.1223 0.0858 0.0880 0.0880
Precision 0.8874 0.9163 0.9159 0.9273
Sensitivity 0.8603 0.9083 0.9039 0.8908
Specificity 0.8945 0.9198 0.9198 0.9325
F1 Score 0.8736 0.9123 0.9099 0.9087
AUC 0.8774 0.9142 0.9120 0.9120

As we can see in the table, Models 2, 3, and 4 are fairly similar in their performance metrics, with Model 2 (Backward / Logit) yielding the highest F1 Score and AUC and Model 4 (Forward / AIC) yielding the lowest AIC and highest precision and specificity. However, the differences in their performance metrics are marginal at best. Of those three, Model 4 is the simplest overall since it has the smallest number of predictor variables. Furthermore, Model 4’s marginal model plots exhibited less overall deviance than did those of either Model 2 or Model 3.

Model 1 appears to slightly underperform each of the other three models relative to every performance metric we’ve evaluated. However, with only four predictor variables, the model is easily the simplest of the four models we’ve constructed.

Step 2: Pick the top two

Of the four models, we chose to first eliminate Models 2 and 3 from consideration due to Model 4’s greater simplicity, lower AIC, better precision and specificity, and better marginal model plots. Model 1 was retained due to its relative simplicity (only 4 variables) even though its performance metrics were slightly below those of the other three models.

The two final models can be summarized as follows:

  • Model 1: Less complex, ideal marginal model plots, but slightly lower performance.

  • Model 4: More complex (2 additional variables), slightly less than ideal marginal model plots, but better overall performance.

Step 3: Pick the “best” model

Picking our preferred model required answering the following question:

  • Should the smaller/simpler model be preferred over the larger, more complex model?

An objective answer to this question was found through the use of a Likelihood Ratio Test. In a likelihood ratio test, two nested models having different numbers of predictors are compared to determine whether or not a smaller logistic regression model should be preferred over a larger logistic regression model, assuming both are “valid” in the sense that their marginal model plots show no significant evidence of deviance and their performance metrics are somewhat similar.

In our likelihood ratio test, the smaller model is considered to be the null hypothesis H0, and a p-value for the overall model fit statistic that is less than 0.05 would indicate that we should reject the null hypothesis in favor of the alternative hypothesis HA. In other words, a small p-value would provide evidence against the smaller model and in favor of the larger model.
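The test itself is a short call from the lmtest package; a sketch with illustrative object names follows (the larger model is passed first, matching the ordering in the output below):

library(lmtest)

# Model 4 (larger) and Model 1 (smaller) are nested logistic regressions
model4 <- glm(target ~ nox + rad + age + tax + ptratio + medv,
              data = crime_train, family = binomial)
model1 <- glm(target ~ nox + age + rad + tax,
              data = crime_train, family = binomial)

lrtest(model4, model1)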

We conducted a likelihood ratio test using R’s lrtest() function. The results are displayed below:

## Likelihood ratio test
## 
## Model 1: target ~ nox + rad + age + tax + ptratio + medv
## Model 2: target ~ nox + age + rad + tax
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   7 -103.26                        
## 2   5 -108.67 -2 10.821   0.004469 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show a p-value of less than 0.05, indicating that we should reject the null hypothesis H0 in favor of the alternative hypothesis HA. Specifically, there is sufficient statistical evidence for us to justifiably prefer Model 4 (the larger model) over Model 1. Therefore, we selected Model 4, the Forward - AIC model, as the basis for our predictions of the response variable in the Evaluation data set.

Step 4: Apply to the evaluation data

To ensure the model’s efficacy when applied to the Evaluation data set, we apply the same set of transformations used on the Training data set prior to building our individual models. The results of those transformations can be found here:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/621-HW3-Clean-EvalData-.csv
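A sketch of the transformation-plus-prediction step; file and object names are assumptions:

# Load the evaluation data (file name assumed for illustration)
crime_eval <- read.csv("crime-evaluation-data.csv")

# Apply the same transformations used on the training data
crime_eval$zn    <- factor(as.integer(crime_eval$zn > 0))
crime_eval$age   <- factor(as.integer(crime_eval$age > 80))
crime_eval$black <- (sqrt(crime_eval$black) + 19.92235) / 31.62278

# Probabilities and 0.5-cutoff classifications from the Forward - AIC model
crime_eval$prob   <- predict(model4, newdata = crime_eval, type = "response")
crime_eval$target <- as.integer(crime_eval$prob > 0.5)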

The Forward - AIC model is then applied to the evaluation data to yield a set of predicted values for the ‘target’ response variable. The results of the prediction process are shown in the table below.

zn indus chas nox rm age dis rad tax ptratio black lstat medv target
0 7.07 0 0.469 7.185 0 4.9671 2 242 17.8 1.2568 4.03 34.7 0
0 8.14 0 0.538 6.096 1 4.4619 4 307 21.0 1.2465 10.26 18.2 0
0 8.14 0 0.538 6.495 1 4.4547 4 307 21.0 1.2528 12.80 18.4 0
0 8.14 0 0.538 5.950 1 3.9900 4 307 21.0 1.1123 27.71 13.2 1
0 5.96 0 0.499 5.850 0 3.9342 5 279 19.2 1.2600 8.77 21.0 0
1 5.13 0 0.453 5.741 0 7.2254 8 284 19.7 1.2586 13.15 18.7 0
1 5.13 0 0.453 5.966 1 6.8185 8 284 19.7 1.2449 14.44 16.0 0
0 4.49 0 0.449 6.630 0 4.4377 3 247 18.5 1.2563 6.53 26.6 0
0 4.49 0 0.449 6.121 0 3.7476 3 247 18.5 1.2586 8.44 22.2 0
0 2.89 0 0.445 6.163 0 3.4952 2 276 18.0 1.2560 11.34 21.4 0
0 25.65 0 0.581 5.856 1 1.9444 2 188 19.1 1.2385 25.41 17.3 0
0 25.65 0 0.581 5.613 1 1.7572 2 188 19.1 1.2294 27.26 15.7 0
0 21.89 0 0.624 5.637 1 1.9799 4 437 21.2 1.2600 18.34 14.3 1
0 19.58 0 0.605 6.101 1 2.2834 5 403 14.7 1.1201 9.81 25.0 1
0 19.58 0 0.605 5.880 1 2.3887 5 403 14.7 1.2200 12.03 19.1 1
0 10.59 1 0.489 5.960 1 3.8771 4 277 18.6 1.2571 17.27 21.7 0
0 6.20 0 0.504 6.552 0 3.3751 8 307 17.4 1.2467 3.76 31.5 1
0 6.20 0 0.507 8.247 0 3.6519 8 307 17.4 1.2456 3.95 48.3 1
1 5.86 0 0.431 6.957 0 8.9067 7 330 19.1 1.2514 3.53 29.6 0
1 2.97 0 0.400 7.088 0 7.3073 1 285 15.3 1.2583 7.85 32.2 0
1 1.76 0 0.385 6.230 0 9.0892 1 241 18.2 1.2145 12.93 20.1 0
1 2.18 0 0.472 6.616 0 3.3700 7 222 18.4 1.2572 8.93 28.4 0
0 9.90 0 0.544 6.122 0 2.6403 4 304 18.4 1.2600 5.98 22.1 0
0 7.38 0 0.493 6.415 0 4.7211 5 287 19.6 1.2600 6.12 25.0 0
0 7.38 0 0.493 6.312 0 5.4159 5 287 19.6 1.2600 6.15 23.0 0
0 5.19 0 0.515 5.895 0 5.6150 5 224 20.2 1.2583 10.56 18.5 0
1 2.01 0 0.435 6.635 0 8.3440 4 280 17.0 1.2553 5.99 24.5 0
0 18.10 0 0.718 3.561 1 1.6132 24 666 20.2 1.2256 7.12 27.5 1
0 18.10 1 0.631 7.016 1 1.2024 24 666 20.2 1.2561 2.96 50.0 1
0 18.10 0 0.584 6.348 1 2.0527 24 666 20.2 0.9189 17.64 14.5 1
0 18.10 0 0.740 5.935 1 1.8206 24 666 20.2 0.8926 34.02 8.4 1
0 18.10 0 0.740 5.627 1 1.8172 24 666 20.2 1.2600 22.88 12.8 1
0 18.10 0 0.740 5.818 1 1.8662 24 666 20.2 1.2557 22.11 10.5 1
0 18.10 0 0.740 6.219 1 2.0048 24 666 20.2 1.2590 16.59 18.4 1
0 18.10 0 0.740 5.854 1 1.8956 24 666 20.2 1.1204 23.79 10.8 1
0 18.10 0 0.713 6.525 1 2.4358 24 666 20.2 0.8557 18.13 14.1 1
0 18.10 0 0.713 6.376 1 2.5671 24 666 20.2 1.2556 14.65 17.7 1
0 18.10 0 0.655 6.209 0 2.9634 24 666 20.2 1.2600 13.22 21.4 1
0 9.69 0 0.585 5.794 0 2.8927 6 391 19.2 1.2600 14.10 18.3 1
0 11.93 0 0.573 6.976 1 2.1675 1 273 21.0 1.2600 5.64 23.9 0

The full set of evaluation data including the predicted values can also be found at the following web link:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-PRED-EVAL-ALL_M_DATA.csv

Summary statistics for the ‘target’ variable for both the training data set and the evaluation data set are shown below (NOTE: “n” represents the number of records within a data set).

Data Set n Mean sd Med. Min Max Range Skew Kurtosis SE
Evaluation 40 0.45 0.50 0.00 0.00 1.00 1 0.19 -2.01 0.08
Training 466 0.49 0.50 0.00 0.00 1.00 1 0.03 -2.00 0.02

Conclusion:

After developing 4 distinct models using a best subset algorithm, backward selection with both logit and probit link functions, and forward selection, we’ve selected what we believe to be our best model based on the results of a likelihood ratio test. Though the training data set presented a variety of challenges, including significant clustering of data values within roughly 25% of the cases and the heavily skewed distributions of some of the predictor variables, the binary logistic regression model we’ve selected demonstrated greater than 90% accuracy, precision, and specificity while generating an F1 Score greater than .90 and an AUC of more than .91. Furthermore, the model appears to be well fit to the training data, as evidenced by the strong results seen in the model’s marginal model plots.

Therefore, we believe we’ve selected a model that could be effective for predicting the likelihood that a neighborhood in the Boston metropolitan area will have an above-median crime rate. However, we can’t be certain the model will be effective for judging the crime rates of other metropolitan areas due to wide disparities in the ways in which factors such as property tax rates and zoning are applied by other municipalities. Further research would therefore be required to determine whether our selected model could be applied to other geographic areas.

————————————————————————————————————————–

Appendix

The Appendix to this document containing all of the R code used for this assignment (as well as other relevant output) can be found in a separate PDF file accessible via this web link:

https://github.com/spsstudent15/2016-02-621-W2/blob/master/HW-3/HW3-Appendix.pdf