Build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Use a 0.5 threshold. Below is a short description of the variables of interest in the data set:
The dataset includes information on 466 neighborhoods in the city of Boston. Despite its East Coast location and reputation as a bastion of liberalism, Boston is among the most racially segregated of American cities. Attempts to integrate the schools using busing in the 1970s led to sustained violence, including deaths. Recent scholarship has highlighted the widespread use of redlining, a process by which institutions such as banks refused to offer mortgages or other financial services to people of certain races if they wished to purchase a home in certain neighborhoods despite creditworthiness.
In short, one would probably not want to construct a model to predict crime by neighborhood that uses variables such as race without having a clear idea of the model’s intended use and an ethical framework for evaluating said model. This, however, is an academic exercise, so we proceed. Let’s preview the data.
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 5 | 384 | 20.9 | 7.67 | 26.5 | 0 |
0 | 18.10 | 0 | 0.693 | 5.453 | 100.0 | 1.4896 | 24 | 666 | 20.2 | 30.59 | 5.0 | 1 |
0 | 18.10 | 0 | 0.693 | 4.519 | 100.0 | 1.6582 | 24 | 666 | 20.2 | 36.98 | 7.0 | 1 |
0 | 5.19 | 0 | 0.515 | 6.316 | 38.1 | 6.4584 | 5 | 224 | 20.2 | 5.68 | 22.2 | 0 |
80 | 3.64 | 0 | 0.392 | 5.876 | 19.1 | 9.2203 | 1 | 315 | 16.4 | 9.25 | 20.9 | 0 |
22 | 5.86 | 0 | 0.431 | 6.438 | 8.9 | 7.3967 | 7 | 330 | 19.1 | 3.59 | 24.8 | 0 |
0 | 12.83 | 0 | 0.437 | 6.286 | 45.0 | 4.5026 | 5 | 398 | 18.7 | 8.94 | 21.4 | 0 |
0 | 18.10 | 0 | 0.532 | 7.061 | 77.0 | 3.4106 | 24 | 666 | 20.2 | 7.01 | 25.0 | 1 |
22 | 5.86 | 0 | 0.431 | 8.259 | 8.4 | 8.9067 | 7 | 330 | 19.1 | 3.54 | 42.8 | 1 |
0 | 2.46 | 0 | 0.488 | 6.153 | 68.8 | 3.2797 | 3 | 193 | 17.8 | 13.15 | 29.6 | 0 |
Expected variables are present. Note that, as indicated in the variables’ descriptions, many of the variables have already been scaled or transformed in some way. Let’s calculate summary statistics and generate a box plot for further review.
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
We see 466 records in our training set and no missing values for any variable, so no imputation using medians or other methods will be required. Beyond that, without a specific question in mind, it’s difficult to draw conclusions from this large table of numbers.
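As a quick, hedged sanity check (the data frame name follows the appendix code), counting NAs per column confirms there is nothing to impute:

```r
# Confirm there are no missing values to impute; crime_train_data is the
# training data frame loaded in the appendix.
colSums(is.na(crime_train_data))                                # expect all zeros
sum(complete.cases(crime_train_data)) == nrow(crime_train_data) # expect TRUE
```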
Now, we visualize using box plots. We’ll separate the box plots by the target value, which signifies whether or not the neighborhood is high crime. And we’ll approximate Boston Red Sox colors.
The box plot for the dummy variable (chas), which indicates whether the neighborhood borders the Charles River, is not meaningful, but several other variables show clear differences in distribution between neighborhoods where the crime rate is below versus above the median - the target variable by which the box plots are split. We might later revisit these distributions after transformations such as logs.
To check for skewness, let’s examine the distribution of each variable independent of target variable value.
Skewness abounds. We will file this away for now and revisit in the Data Preparation part of the project. In particular, zn, nox, age, dis, ptratio, and lstat seem likely candidates for transformations.
We will now check the correlations between the variables.
 | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1.0000000 | -0.5382664 | -0.0401620 | -0.5170452 | 0.3198141 | -0.5725805 | 0.6601243 | -0.3154812 | -0.3192841 | -0.3910357 | -0.4329925 | 0.3767171 | -0.4316818 |
indus | -0.5382664 | 1.0000000 | 0.0611832 | 0.7596301 | -0.3927118 | 0.6395818 | -0.7036189 | 0.6006284 | 0.7322292 | 0.3946898 | 0.6071102 | -0.4961743 | 0.6048507 |
chas | -0.0401620 | 0.0611832 | 1.0000000 | 0.0974558 | 0.0905098 | 0.0788837 | -0.0965771 | -0.0159004 | -0.0467648 | -0.1286606 | -0.0514232 | 0.1615653 | 0.0800419 |
nox | -0.5170452 | 0.7596301 | 0.0974558 | 1.0000000 | -0.2954897 | 0.7351278 | -0.7688840 | 0.5958298 | 0.6538780 | 0.1762687 | 0.5962426 | -0.4301227 | 0.7261062 |
rm | 0.3198141 | -0.3927118 | 0.0905098 | -0.2954897 | 1.0000000 | -0.2328125 | 0.1990158 | -0.2084457 | -0.2969343 | -0.3603471 | -0.6320245 | 0.7053368 | -0.1525533 |
age | -0.5725805 | 0.6395818 | 0.0788837 | 0.7351278 | -0.2328125 | 1.0000000 | -0.7508976 | 0.4603143 | 0.5121245 | 0.2554479 | 0.6056200 | -0.3781560 | 0.6301062 |
dis | 0.6601243 | -0.7036189 | -0.0965771 | -0.7688840 | 0.1990158 | -0.7508976 | 1.0000000 | -0.4949919 | -0.5342546 | -0.2333394 | -0.5075280 | 0.2566948 | -0.6186731 |
rad | -0.3154812 | 0.6006284 | -0.0159004 | 0.5958298 | -0.2084457 | 0.4603143 | -0.4949919 | 1.0000000 | 0.9064632 | 0.4714516 | 0.5031013 | -0.3976683 | 0.6281049 |
tax | -0.3192841 | 0.7322292 | -0.0467648 | 0.6538780 | -0.2969343 | 0.5121245 | -0.5342546 | 0.9064632 | 1.0000000 | 0.4744223 | 0.5641886 | -0.4900329 | 0.6111133 |
ptratio | -0.3910357 | 0.3946898 | -0.1286606 | 0.1762687 | -0.3603471 | 0.2554479 | -0.2333394 | 0.4714516 | 0.4744223 | 1.0000000 | 0.3773560 | -0.5159153 | 0.2508489 |
lstat | -0.4329925 | 0.6071102 | -0.0514232 | 0.5962426 | -0.6320245 | 0.6056200 | -0.5075280 | 0.5031013 | 0.5641886 | 0.3773560 | 1.0000000 | -0.7358008 | 0.4691270 |
medv | 0.3767171 | -0.4961743 | 0.1615653 | -0.4301227 | 0.7053368 | -0.3781560 | 0.2566948 | -0.3976683 | -0.4900329 | -0.5159153 | -0.7358008 | 1.0000000 | -0.2705507 |
target | -0.4316818 | 0.6048507 | 0.0800419 | 0.7261062 | -0.1525533 | 0.6301062 | -0.6186731 | 0.6281049 | 0.6111133 | 0.2508489 | 0.4691270 | -0.2705507 | 1.0000000 |
We see some very high positive and negative correlations between variables. Let’s construct a more effective visualization.
The correlation plot reveals several candidates for combining variables due to high pairwise correlation.
As a final step, let’s look just at the correlation between each independent variable and the target variable.
 | target |
---|---|
zn | -0.4316818 |
indus | 0.6048507 |
chas | 0.0800419 |
nox | 0.7261062 |
rm | -0.1525533 |
age | 0.6301062 |
dis | -0.6186731 |
rad | 0.6281049 |
tax | 0.6111133 |
ptratio | 0.2508489 |
lstat | 0.4691270 |
medv | -0.2705507 |
target | 1.0000000 |
Nox, the concentration of nitrogen oxides (a significant pollutant harmful to human health) in a neighborhood, shows the strongest correlation with the target variable at .73. Next, age, rad, tax, and indus all correlate with the target at just above .6. Dis shows the largest negative correlation with the target at -.62, followed by zn at -.43. Zn represents the percentage of residential land zoned for large lots, which may be an indicator of large rental housing such as apartments.
Prior to modelling the training data set, we must prepare the data. We do not have any missing values, so imputation is not required. We will start with a model that uses all variables regardless of skew or collinearity, then progress to transformations and consider combining correlated variables in pursuit of accurate and valid models.
Let’s look at how transformations might solve distribution issues with some of our variables. Earlier, we saw a strong right skew in the distribution of the variable lstat, which tracks the “lower status” of a neighborhood’s population. Probably not the best phrasing.
What would a log transformation do to this distribution?
Looks slightly better. Let’s generate log transformations for all variables in the dataset.
Medv looks slightly better. However, age remains strongly left skewed. Dis is now bimodal.
What about other transformations such as quadratic ones?
Nope. Not a lot of improvement.
In Part 1, we saw high correlations among variables such as rad and tax (.91). Including strongly correlated predictors together tends to inflate the standard errors of their coefficient estimates, which can lead to overfitting and unstable models, so we’ll likely want to examine combining some of these variables. We will not combine variables here but will instead revisit this idea in Part 3 when evaluating our models.
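As a sketch only (not used in any model below), here is one way such a combination could be constructed for rad and tax; the name access_tax_index is ours and purely illustrative.

```r
# Illustrative combinations of the highly correlated rad and tax (r = .91).
# Not part of the models reported in this assignment.
library(dplyr)

crime_combined <- crime_train_data %>%
  mutate(access_tax_index = as.numeric(scale(rad)) + as.numeric(scale(tax)))

# Alternative: take the first principal component of the two standardized variables.
pc1 <- prcomp(crime_train_data[, c("rad", "tax")], scale. = TRUE)$x[, 1]
```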
Our textbooks have also discussed the possibility of creating bins for continuous variables. For example, dis, the weighted mean of distances from a neighborhood to five Boston employment centers, might be better suited to three categories than to remaining a continuous variable, which could improve model performance.
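A minimal sketch of such a binning, assuming tertile cut points and arbitrary level names chosen only for illustration:

```r
# Bin dis into three categories at its tertiles; breaks and labels are
# illustrative choices, not values used elsewhere in this report.
dis_breaks <- quantile(crime_train_data$dis, probs = c(0, 1/3, 2/3, 1))
dis_bin <- cut(crime_train_data$dis, breaks = dis_breaks,
               labels = c("near", "mid", "far"), include.lowest = TRUE)
table(dis_bin)
```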
Prior to building the model, we’re going to split our training data into a true training set and a validation set. We’ll go 80/20 training to validation.
Following convention, we will start with a model consisting of all variables, none of which have been transformed. While we’ve moved on to Part 3, where we will construct the models, the boundary between data preparation and model building is grey. We will continue to explore transformations and collinearity.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9392 -0.1315 -0.0024 0.0015 3.4720
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.954712 8.008648 -5.488 4.06e-08 ***
## zn -0.062257 0.039283 -1.585 0.113003
## indus -0.053255 0.058450 -0.911 0.362232
## chas 0.675370 0.857697 0.787 0.431035
## nox 51.934147 9.821117 5.288 1.24e-07 ***
## rm -0.519049 0.888956 -0.584 0.559297
## age 0.029116 0.015881 1.833 0.066735 .
## dis 0.766202 0.265472 2.886 0.003899 **
## rad 0.733765 0.194240 3.778 0.000158 ***
## tax -0.006802 0.003386 -2.009 0.044539 *
## ptratio 0.400174 0.153418 2.608 0.009097 **
## lstat 0.092014 0.063229 1.455 0.145599
## medv 0.215374 0.087839 2.452 0.014210 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 143.67 on 359 degrees of freedom
## AIC: 169.67
##
## Number of Fisher Scoring iterations: 9
Our most significant variables generally match those that showed the highest correlations with the target earlier. We have an AIC of 169.67 and a residual deviance of 143.67.
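Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to read; a short sketch using the model object name from the appendix:

```r
# Convert log-odds coefficients to odds ratios for the full model.
# For example, exp(0.029) ~ 1.03 for age: a one-unit increase in age multiplies
# the estimated odds of high crime by about 1.03, holding the other predictors fixed.
exp(coef(model1_untransformed))
```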
Let’s run further diagnostics on the model. We will set a probability of .5 as being the cutoff for determining if a neighborhood will be high crime. Here, we check the relationship between the logit of the outcome and each predictive variable. (Target and the binary dummy variable chas should be ignored.) Again, these steps also could be labelled as data preparation.
## 135 367 190 409 435 22
## "neg" "pos" "pos" "pos" "neg" "pos"
Tax and zn do not show linear associations with the outcome on the logit scale. Along with the previously discussed lstat, they might benefit from transformations.
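A hedged sketch of transformations one might try for these two variables; because zn contains zeros (its minimum is 0 in the summary table), log1p() is used in place of a plain log, and the new column names are ours, purely illustrative:

```r
# Possible transformations for tax and zn on the training split.
crime_transformed <- within(crime_train_data_train, {
  log_tax  <- log(tax)     # tax is strictly positive, so a plain log is fine
  log1p_zn <- log1p(zn)    # zn has zeros, so use log(1 + zn)
})
```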
Let’s use Cook’s Distance to check for outliers.
.rownames | target | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid | index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
457 | 1 | 0 | 10.59 | 0 | 0.489 | 5.412 | 9.8 | 3.5875 | 4 | 277 | 18.6 | 29.55 | 23.7 | -2.5803687 | 1.629657 | 2.303642 | 0.1738331 | 0.6191799 | 0.2586387 | 2.534435 | 131 |
14 | 1 | 22 | 5.86 | 0 | 0.431 | 8.259 | 8.4 | 8.9067 | 7 | 330 | 19.1 | 3.54 | 42.8 | -0.3920093 | 1.194062 | 1.347768 | 0.3430954 | 0.6273771 | 0.0905137 | 1.662891 | 234 |
338 | 1 | 20 | 6.96 | 0 | 0.464 | 5.856 | 42.1 | 4.4290 | 3 | 223 | 18.6 | 13.00 | 21.1 | -6.0250580 | 1.133342 | 3.472023 | 0.0030902 | 0.6062579 | 0.0989419 | 3.477400 | 252 |
An outlier is not necessarily influential. Let’s check for that.
Let’s pull the point that lies more than 3 standardized residuals from 0.
.rownames | target | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid | index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
338 | 1 | 20 | 6.96 | 0 | 0.464 | 5.856 | 42.1 | 4.429 | 3 | 223 | 18.6 | 13 | 21.1 | -6.025058 | 1.133342 | 3.472023 | 0.0030902 | 0.6062579 | 0.0989419 | 3.4774 | 252 |
Observation 338 is an influential outlier.
Next, we check multicollinearity.
## zn indus chas nox rm age dis rad
## 1.881239 3.156811 1.229952 4.960830 6.065538 2.642072 4.089579 1.947943
## tax ptratio lstat medv
## 2.376876 2.419764 2.723224 8.696936
A common rule of thumb is that VIF scores above 5 indicate a high degree of multicollinearity, so rm and medv have issues in this regard.
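For reference, the classical VIF for predictor j is 1/(1 - R²_j), where R²_j comes from regressing that predictor on the remaining predictors. car::vif() applied to a glm works from the fitted coefficient covariance matrix, so the values can differ somewhat, but the sketch below shows the underlying idea for medv:

```r
# Classical (OLS-based) VIF for medv: regress medv on the other predictors,
# excluding the response, and plug the R^2 into 1 / (1 - R^2).
r2_medv <- summary(lm(medv ~ . - target, data = crime_train_data_train))$r.squared
1 / (1 - r2_medv)
```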
In summary, we have checked the linearity of each predictor with the logit of the outcome, looked for influential outliers using Cook’s distance and standardized residuals, and assessed multicollinearity with VIF scores. These are among many methods to check the assumptions and diagnostics of logistic regression models. We will not repeat these steps - other than the summary diagnostics - for our additional attempts at constructing a model to predict high-crime neighborhoods.
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + age + dis +
## rad + tax + ptratio + lstat + medv, family = "binomial",
## data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9586 -0.1457 -0.0028 0.0018 3.4402
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -44.415908 7.905817 -5.618 1.93e-08 ***
## zn -0.063314 0.038909 -1.627 0.103689
## indus -0.051556 0.058270 -0.885 0.376276
## chas 0.697782 0.857637 0.814 0.415869
## nox 50.739927 9.490760 5.346 8.98e-08 ***
## age 0.025030 0.014057 1.781 0.074979 .
## dis 0.724497 0.251411 2.882 0.003955 **
## rad 0.711955 0.188502 3.777 0.000159 ***
## tax -0.007035 0.003359 -2.095 0.036203 *
## ptratio 0.359167 0.134298 2.674 0.007486 **
## lstat 0.105872 0.058173 1.820 0.068764 .
## medv 0.175190 0.052623 3.329 0.000871 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 144.02 on 360 degrees of freedom
## AIC: 168.02
##
## Number of Fisher Scoring iterations: 9
Initially, for model 2, we removed the variable with the highest collinearity (medv), which led to slightly higher residual deviance and AIC. Because medv has a low p-value in the original model, we instead removed rm, which led to a slight increase in residual deviance but a drop in AIC. Either way, not a big difference.
Here, we try log transformations of a couple of variables that did not show linear relationships with the logit of the outcome.
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + log(lstat) + log(medv), family = "binomial",
## data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -0.1649 -0.0048 0.0029 3.3521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -51.103411 11.370274 -4.494 6.97e-06 ***
## zn -0.050353 0.034899 -1.443 0.149070
## indus -0.052091 0.058319 -0.893 0.371739
## chas 0.751968 0.890887 0.844 0.398632
## nox 49.332175 9.664369 5.105 3.32e-07 ***
## rm 0.063858 0.756310 0.084 0.932712
## age 0.030108 0.015184 1.983 0.047387 *
## dis 0.712916 0.252333 2.825 0.004724 **
## rad 0.664907 0.180679 3.680 0.000233 ***
## tax -0.006223 0.003470 -1.793 0.072946 .
## ptratio 0.346223 0.148787 2.327 0.019967 *
## log(lstat) 0.542533 0.796489 0.681 0.495773
## log(medv) 3.519152 2.010423 1.750 0.080040 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 149.84 on 359 degrees of freedom
## AIC: 175.84
##
## Number of Fisher Scoring iterations: 9
Adding log transformations to lstat and medv has increased both our residual deviance and AIC, which is less than desirable.
Here, we include only the variables that were significant (p-value < .05) in model 1.
##
## Call:
## glm(formula = target ~ nox + rad + dis + ptratio + medv + tax +
## age, family = "binomial", data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.97027 -0.17714 -0.01856 0.00296 3.13739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.666602 6.897163 -5.606 2.07e-08 ***
## nox 44.611899 7.789706 5.727 1.02e-08 ***
## rad 0.739014 0.164010 4.506 6.61e-06 ***
## dis 0.487788 0.196070 2.488 0.01285 *
## ptratio 0.366746 0.127129 2.885 0.00392 **
## medv 0.110761 0.041069 2.697 0.00700 **
## tax -0.008714 0.002875 -3.031 0.00244 **
## age 0.034673 0.013382 2.591 0.00957 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 152.25 on 364 degrees of freedom
## AIC: 168.25
##
## Number of Fisher Scoring iterations: 9
The residual deviance is higher here than in model 1 without a significant drop in AIC.
We’re not making good progress here on improving upon model 1, the initial try that included all untransformed variables. Let’s try the leaps package, which will start with the most significant variable and then continue adding variables/features until many possible models have been evaluated; this is known as forward selection. Note that while the leaps package executed perfectly in RStudio and allowed us to evaluate numerous possible models, it caused an error when “knitting.”
The leaps package provided adjusted R^2 scores for a variety of models. It started with just the nox variable and then added variables, constructing 12 models in total. The 10th model (zn + indus + nox + rm + age + dis + rad + tax + lstat + medv) has the highest adjusted R^2, but the 6th model is only slightly lower in adjusted R^2 with only six variables. In the interest of simplicity, we’ll take that 6th model from leaps (indus, nox, age, rad, tax, medv), which translates to the below.
##
## Call:
## glm(formula = target ~ nox + rad + tax + medv + age + indus,
## family = "binomial", data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.84034 -0.26538 -0.04395 0.00700 2.73765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.816101 3.518666 -5.916 3.30e-09 ***
## nox 34.004675 7.540052 4.510 6.49e-06 ***
## rad 0.647194 0.157451 4.110 3.95e-05 ***
## tax -0.008066 0.003092 -2.608 0.0091 **
## medv 0.026737 0.029892 0.894 0.3711
## age 0.018464 0.011659 1.584 0.1133
## indus -0.024421 0.055520 -0.440 0.6600
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 165.20 on 365 degrees of freedom
## AIC: 179.2
##
## Number of Fisher Scoring iterations: 8
Note that we selected this model from leaps based on adjusted R^2 but are now comparing it to other models using residual deviance and AIC, where it is found wanting. Deciding among residual deviance, AIC, adjusted R^2, and other model diagnostics such as AICc and BIC is always a challenge and may be domain-dependent, or even problem-dependent.
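For what it’s worth, these criteria are easy to pull side by side for glm fits; a sketch using the model object names from the appendix:

```r
# Compare information criteria and residual deviance across the fitted models.
models <- list(m1 = model1_untransformed, m2 = model2_coll, m3 = model3_some_logs,
               m4 = model4_sig_var, m5 = model5_leap, m6 = model6_step)
data.frame(AIC      = sapply(models, AIC),
           BIC      = sapply(models, BIC),
           deviance = sapply(models, deviance))
```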
We now look at forward selection using the stepAIC function from the MASS package. As with the leaps package, we had problems “knitting” the stepAIC call here; it processed fine in RStudio and led to the four-variable model described below as model 6.
##
## Call:
## glm(formula = target ~ nox + rad + medv + age, family = "binomial",
## data = crime_train_data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.65002 -0.34124 -0.10477 0.01132 2.76068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.39756 2.53421 -7.260 3.88e-13 ***
## nox 23.79415 4.69296 5.070 3.97e-07 ***
## rad 0.49666 0.13052 3.805 0.000142 ***
## medv 0.06182 0.02781 2.223 0.026196 *
## age 0.01839 0.01133 1.623 0.104634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 176.87 on 367 degrees of freedom
## AIC: 186.87
##
## Number of Fisher Scoring iterations: 8
In summary, for Part 3, Build Models, we attempted a number of methods to construct a good model. After setting a baseline with all variables, we attempted to address collinearity by removing a variable with a high VIF score in Model 2. In Model 3, we applied logarithmic transformations to two variables that showed skewed distributions. In Model 4, we included only the variables that showed significance in Model 1. Models 5 and 6 both used forward selection via R functions.
Some of our findings were counterintuitive in that the models with fewer variables showed higher AICs. Also, transformations did not seem effective, at least in terms of AIC and residual deviance.
We will now run the remaining 20% of our training data through the models created with the 80% training set. We’ll evaluate models 1, 2, 5, and 6. Finally, we will run the evaluation file, which does not include target data, through our chosen model to output predictions.
Model 1 confusion matrix and AUC information:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 5
## 1 5 48
##
## Accuracy : 0.8936
## 95% CI : (0.813, 0.9478)
## No Information Rate : 0.5638
## P-Value [Acc > NIR] : 3.307e-12
##
## Kappa : 0.7837
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9057
## Specificity : 0.8780
## Pos Pred Value : 0.9057
## Neg Pred Value : 0.8780
## Prevalence : 0.5638
## Detection Rate : 0.5106
## Detection Prevalence : 0.5638
## Balanced Accuracy : 0.8919
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = crime_train_data_valid$target, predictor = model1_predict)
##
## Data: model1_predict in 41 controls (crime_train_data_valid$target 0) < 53 cases (crime_train_data_valid$target 1).
## Area under the curve: 0.954
Now the same for model 2:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 5
## 1 6 48
##
## Accuracy : 0.883
## 95% CI : (0.8003, 0.9401)
## No Information Rate : 0.5638
## P-Value [Acc > NIR] : 1.993e-11
##
## Kappa : 0.7614
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9057
## Specificity : 0.8537
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.8750
## Prevalence : 0.5638
## Detection Rate : 0.5106
## Detection Prevalence : 0.5745
## Balanced Accuracy : 0.8797
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = crime_train_data_valid$target, predictor = model2_predict)
##
## Data: model2_predict in 41 controls (crime_train_data_valid$target 0) < 53 cases (crime_train_data_valid$target 1).
## Area under the curve: 0.9544
And model 5:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 10
## 1 7 43
##
## Accuracy : 0.8191
## 95% CI : (0.7263, 0.891)
## No Information Rate : 0.5638
## P-Value [Acc > NIR] : 1.497e-07
##
## Kappa : 0.6353
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.8113
## Specificity : 0.8293
## Pos Pred Value : 0.8600
## Neg Pred Value : 0.7727
## Prevalence : 0.5638
## Detection Rate : 0.4574
## Detection Prevalence : 0.5319
## Balanced Accuracy : 0.8203
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = crime_train_data_valid$target, predictor = model5_predict)
##
## Data: model5_predict in 41 controls (crime_train_data_valid$target 0) < 53 cases (crime_train_data_valid$target 1).
## Area under the curve: 0.9383
Finally, model 6:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 10
## 1 5 43
##
## Accuracy : 0.8404
## 95% CI : (0.7505, 0.9078)
## No Information Rate : 0.5638
## P-Value [Acc > NIR] : 1.05e-08
##
## Kappa : 0.68
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.8113
## Specificity : 0.8780
## Pos Pred Value : 0.8958
## Neg Pred Value : 0.7826
## Prevalence : 0.5638
## Detection Rate : 0.4574
## Detection Prevalence : 0.5106
## Balanced Accuracy : 0.8447
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = crime_train_data_valid$target, predictor = model6_predict)
##
## Data: model6_predict in 41 controls (crime_train_data_valid$target 0) < 53 cases (crime_train_data_valid$target 1).
## Area under the curve: 0.9305
Models 1 and 2 provide better accuracy and AUC than models 5 and 6 (as well as lower AIC on the training fit). Due to a slight edge in simplicity from having one fewer variable, we will go with model 2, which is simply all untransformed variables other than the multicollinear rm. In reality, for a given business problem, there are often considerations at play that would lead one to value specificity over sensitivity, or vice versa. Predicting that a neighborhood is high crime could be an expensive mistake, or it could be the safer assumption of the two, depending on context. Here, we don’t know.
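If the business context did favor one type of error, the 0.5 cutoff could be tuned on the validation set. A hedged sketch using pROC (object names follow the appendix; the coords() behavior is assumed from recent pROC versions):

```r
# Inspect the sensitivity/specificity trade-off at the default cutoff and at the
# ROC-optimal (Youden) cutoff for model 2 on the validation set.
library(pROC)
roc2 <- roc(crime_train_data_valid$target, model2_predict)
coords(roc2, x = 0.5, input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))
coords(roc2, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```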
Finally, we run the evaluation data set through model 2; the resulting predictions are submitted along with this report. A summary of the predictions for the evaluation set:
The above suggests we predict neighborhoods to be high-crime at a slightly lower rate than indicated by the frequency distribution of the target value in the original full training set. We would need to know more about the sampling technique used to split training from evaluation data before determining whether this is problematic.
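A quick way to quantify that comparison (a sketch; object names follow the appendix code):

```r
# Predicted high-crime rate on the evaluation set versus the observed rate
# in the full training set (about 0.49 per the summary statistics).
mean(model2_eval)
mean(crime_train_data$target)
```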
Model diagnostics: http://www.sthda.com/english/articles/36-classification-methods-essentials/148-logistic-regression-assumptions-and-diagnostics-in-r/
Leaps package usage: https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
library(knitr)
library(dplyr)
library(kableExtra)

url_crime_train <- 'https://raw.githubusercontent.com/littlejohnjeff/Data_621_Fall_2019/master/crime-training-data_modified.csv'
crime_train_data <- read.csv(url_crime_train, header = TRUE)
kable(crime_train_data[1:15,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library(psych)
kable(psych::describe(crime_train_data)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"))

library(reshape2)
library(ggplot2)
# Reshape to long format so box plots and histograms can be faceted by variable
crime_plot <- melt(crime_train_data, id.vars = 'target') %>%
  mutate(target = as.factor(target))

ggplot(data = crime_plot, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = target)) +
  facet_wrap(~ variable, dir = "h", scales = 'free') +
  scale_fill_manual(values = c("blue3", "red3"))

ggplot(crime_plot, aes(value)) +
  geom_histogram(bins = 25, fill = "Red3") +
  facet_wrap(~ variable, dir = "h", scales = 'free')

ggplot(crime_plot, aes(log(value))) +
  geom_histogram(bins = 25, fill = "Blue3") +
  facet_wrap(~ variable, dir = "h", scales = 'free')
library(stats)
library(corrplot)
cor_train <- cor(crime_train_data, method = "pearson")
kable(cor_train, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

corrplot(cor_train)

cor_train_target <- as.data.frame(cor_train) %>% dplyr::select(target)
kable(cor_train_target, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

ggplot(crime_train_data, aes(x = lstat)) + geom_histogram(bins = 25, fill = "Red3")

ggplot(crime_train_data, aes(x = log(lstat))) + geom_histogram(bins = 25, fill = "Blue3")
train_size <- floor(0.8 * nrow(crime_train_data))
set.seed(123)
train_ind <- sample(seq_len(nrow(crime_train_data)), size = train_size)

crime_train_data_train <- crime_train_data[train_ind, ]
crime_train_data_valid <- crime_train_data[-train_ind, ]

model1_untransformed <- glm(formula = target ~ ., family = "binomial", data = crime_train_data_train)
summary(model1_untransformed)

library(tidyr)
probabilities <- predict(model1_untransformed, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)

# Check linearity of each numeric predictor with the logit of the fitted probabilities
mydata <- crime_train_data_train %>% dplyr::select_if(is.numeric)
predictors <- colnames(mydata)

mydata <- mydata %>%
  mutate(logit = log(probabilities / (1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~ predictors, scales = "free_y")
plot(model1_untransformed, which = 4, id.n = 3)
library(broom)
model1_untransformed.data <- augment(model1_untransformed) %>% mutate(index = 1:n())

kable(model1_untransformed.data %>% top_n(3, .cooksd)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

ggplot(model1_untransformed.data, aes(index, .std.resid)) +
  geom_point(alpha = .5) +
  theme_light()

library(car)
car::vif(model1_untransformed)
model2_coll <- glm(formula = target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio + lstat + medv,
                   family = "binomial", data = crime_train_data_train)
summary(model2_coll)

model3_some_logs <- glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + log(lstat) + log(medv),
                        family = "binomial", data = crime_train_data_train)
summary(model3_some_logs)

model4_sig_var <- glm(formula = target ~ nox + rad + dis + ptratio + medv + tax + age,
                      family = "binomial", data = crime_train_data_train)
summary(model4_sig_var)

model5_leap <- glm(formula = target ~ nox + rad + tax + medv + age + indus,
                   family = "binomial", data = crime_train_data_train)
summary(model5_leap)

model6_step <- glm(formula = target ~ nox + rad + medv + age,
                   family = "binomial", data = crime_train_data_train)
summary(model6_step)
library(caret)
model1_predict <- predict(model1_untransformed, newdata = crime_train_data_valid, type = "response")
model1_predict_data <- ifelse(model1_predict > .5, 1, 0)
# confusionMatrix() expects factors with matching levels
confusionMatrix(data = factor(model1_predict_data, levels = c(0, 1)),
                factor(crime_train_data_valid$target, levels = c(0, 1)), positive = '1')

library('pROC')
plot(roc(crime_train_data_valid$target, model1_predict))

model2_predict <- predict(model2_coll, newdata = crime_train_data_valid, type = "response")
model2_predict_data <- ifelse(model2_predict > .5, 1, 0)
confusionMatrix(data = factor(model2_predict_data, levels = c(0, 1)),
                factor(crime_train_data_valid$target, levels = c(0, 1)), positive = '1')

plot(roc(crime_train_data_valid$target, model2_predict))

model5_predict <- predict(model5_leap, newdata = crime_train_data_valid, type = "response")
model5_predict_data <- ifelse(model5_predict > .5, 1, 0)
confusionMatrix(data = factor(model5_predict_data, levels = c(0, 1)),
                factor(crime_train_data_valid$target, levels = c(0, 1)), positive = '1')

plot(roc(crime_train_data_valid$target, model5_predict))

model6_predict <- predict(model6_step, newdata = crime_train_data_valid, type = "response")
model6_predict_data <- ifelse(model6_predict > .5, 1, 0)
confusionMatrix(data = factor(model6_predict_data, levels = c(0, 1)),
                factor(crime_train_data_valid$target, levels = c(0, 1)), positive = '1')

plot(roc(crime_train_data_valid$target, model6_predict))

url_crime_eval <- 'https://raw.githubusercontent.com/littlejohnjeff/Data_621_Fall_2019/master/crime-evaluation-data_modified.csv'
crime_eval_data <- read.csv(url_crime_eval, header = TRUE)
model2_eval_data <- predict(model2_coll, newdata = crime_eval_data, type = "response")
model2_eval <- ifelse(model2_eval_data > .5, 1, 0)
hist(model2_eval)
crime_eval_data_out <- cbind(crime_eval_data, model2_eval)
hist(crime_train_data$target)
write.csv(crime_eval_data_out, "DATA 621Data_621_Hw_3_Evaluation_Output.csv", row.names = FALSE)
library(leaps)
reg_subsets.out <- regsubsets(target ~ ., data = crime_train_data_train,
                              nbest = 1,        # 1 best model for each number of predictors
                              nvmax = NULL,     # NULL for no limit on number of variables
                              force.in = NULL, force.out = NULL,
                              method = "exhaustive")

library(car)
layout(matrix(1:2, ncol = 2))
# Adjusted R^2 plot of the candidate subsets
res.legend <- subsets(reg_subsets.out, statistic = "adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")

summary.out <- summary(reg_subsets.out)   # create the summary object referenced below
summary.out$adjr2
summary.out$which[6, ]

library(MASS)
mod6a1 <- lm(target ~ ., data = crime_train_data_train)
mod6a2 <- lm(target ~ 1, data = crime_train_data_train)
mod6a <- stepAIC(mod6a2, direction = "forward", scope = list(upper = mod6a1, lower = mod6a2))
summary(mod6a)