Our dataset consists of 14 variables from Harrison and Rubinfeld's 1978 study of the demand for clean air in Boston. Our task is to use this information to predict whether an area's crime rate is above or below the regional average. One of the variables, medv, appears to be right-censored at 50, as noted by several subsequent studies. The variables collected are zn (proportion of residential land zoned for large lots), indus (proportion of non-retail business acres), chas (Charles River indicator), nox (nitric oxides concentration), rm (average rooms per dwelling), age (proportion of units built before 1940), dis (weighted distance to employment centers), rad (index of accessibility to radial highways), tax (property tax rate), ptratio (pupil-teacher ratio), black (a transformation of the Black population proportion), lstat (percent lower-status population), medv (median home value), and the binary target.
When we look at the correlations, we find that several variables are correlated with one another as well as with the target variable. We will use methods to deal with this multicollinearity and to make sure that interactions which could strengthen the model are accounted for.
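The check itself is short. Condensed from the appendix code, we plot the correlations and then ask car::vif for variance inflation factors from a model on all thirteen predictors; the VIF output follows:

library(car)
library(corrplot)
correl.matrix<-cor(boston_data_set[,1:14], use="complete.obs")
corrplot(correl.matrix,method="color",type="upper")
full_model<-glm(data=boston_data_set,target~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv)
vif(full_model)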
## zn indus chas nox rm age dis rad
## 2.325878 4.124187 1.091361 4.514656 2.401350 3.145982 4.248625 6.911438
## tax ptratio black lstat medv
## 9.217279 2.025427 1.356751 3.662172 3.773939
Tax, a measure of the property tax rate, and rad, a measure of access to radial highways, are highly correlated with each other, and each has a VIF greater than 5. For modeling purposes, we created a compound variable, taxrad, which is their product.
Looking at plots split by positive and negative target values, we can see that some of the data could benefit from transformation. We create squared terms for four left-skewed variables: tax, rad, indus, and our composite, taxrad (see the pipeline below).
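For reference, the engineered terms are created with a short dplyr pipeline (taken from the appendix); note that taxsq is actually tax raised to the 1.7 power rather than a literal square:

boston_data_set %>%
mutate(taxrad=tax*rad) %>%
mutate(radsq = rad^2) %>%
mutate(taxsq = tax^1.7) %>%   # a 1.7 power, not a true square
mutate(indussq = indus^2) %>%
mutate(taxradsq = taxrad^2)->boston_data_set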
A quick word about the logit model is necessary at this point. Its general form is \(\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}\). The logit model fits a linear combination of the predictor variables: the sum of the intercept and the predictors multiplied by their betas (the sensitivity to a change in each X) makes up the exponent on the right side. The odds of being in a category, \(\pi/(1-\pi)\), are logged (the link function) before being set equal to the linear model.
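As a minimal illustration of the link in R, using a hypothetical linear-predictor value eta that is not drawn from any of our models:

eta<-0.7                        # a hypothetical b0 + b1*X1 + ... + bn*Xn
odds<-exp(eta)                  # odds = pi/(1-pi)
pi_hat<-odds/(1+odds)           # back out the probability
all.equal(pi_hat, plogis(eta))  # plogis() is the inverse logit link: TRUE
qlogis(pi_hat)                  # qlogis() recovers the log-odds, 0.7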
After testing many variables and models, we arrived at three candidate models, evaluated by AIC. First, we found a sparse model with the four most statistically significant predictors, arrived at by removing variables in pairs (a rough sketch of this pruning follows the summary): nox, rad, tax, and black. We check VIF and find that in this smaller model, tax and rad no longer exhibit a collinearity problem. The residuals for this model are somewhat left-shifted. One of our variables, "black", has the potential to be rather offensive; the language is not the way we would phrase it today. We should remember, however, that discrimination takes many forms and has included genuine difficulty finding housing in safer or more desirable areas. All four variables are statistically significant, the model is relatively simple, and its AIC is 129.28.
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = most_significant_predictors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86057 -0.30885 -0.04057 0.00113 2.65525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.117427 7.048278 -0.584 0.55910
## nox 35.676274 6.251694 5.707 1.15e-08 ***
## rad 0.693592 0.170983 4.056 4.98e-05 ***
## tax -0.008705 0.003331 -2.613 0.00897 **
## black -0.041351 0.016668 -2.481 0.01311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.25 on 279 degrees of freedom
## Residual deviance: 119.28 on 275 degrees of freedom
## AIC: 129.28
##
## Number of Fisher Scoring iterations: 9
## nox rad tax black
## 1.768794 1.326936 1.607416 1.060482
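The pruning behind this model was done by hand, removing weak variables in pairs. A rough mechanization of the idea, shown only as a sketch, drops the single least significant term per pass and stops at four predictors, so it approximates rather than reproduces our procedure:

current<-glm(target~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,family=binomial(link='logit'),data=training_set)
repeat {
  p<-coef(summary(current))[-1,"Pr(>|z|)"]   # p-values, intercept excluded
  if (length(p)<=4 || max(p)<0.05) break     # stop at four predictors or all significant
  current<-update(current,as.formula(paste(". ~ . -",names(which.max(p)))))
}
summary(current)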
Our second model includes a more complicated set of explanatory variables. The left-skewed tax and rad enter as squared terms (taxsq, radsq), taxrad enters in both levels and logs, and taxrad interacts with log(age). This model has fairly normal residuals and low Cook's Distance scores for all observations, and its AIC of 114.67 is the lowest of the models tested so far.
##
## Call:
## glm(formula = target ~ nox + age + log(taxrad) + (taxrad * log(age)) +
## taxsq + radsq + medv, family = binomial(link = "logit"),
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1703 -0.1590 -0.0108 0.0166 3.5977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.209e+01 1.251e+01 -4.163 3.13e-05 ***
## nox 4.630e+01 1.056e+01 4.386 1.15e-05 ***
## age 1.411e-01 3.535e-02 3.992 6.56e-05 ***
## log(taxrad) 6.189e+00 1.568e+00 3.947 7.91e-05 ***
## taxrad -6.194e-03 3.753e-03 -1.650 0.09885 .
## log(age) -5.076e+00 1.681e+00 -3.020 0.00253 **
## taxsq 4.792e-05 9.027e-05 0.531 0.59550
## radsq 1.980e-01 6.908e-02 2.867 0.00415 **
## medv 4.524e-03 3.803e-02 0.119 0.90531
## taxrad:log(age) -5.149e-04 6.359e-04 -0.810 0.41807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.248 on 279 degrees of freedom
## Residual deviance: 94.668 on 270 degrees of freedom
## AIC: 114.67
##
## Number of Fisher Scoring iterations: 10
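The residual and Cook's Distance checks mentioned above come from plots like these, condensed from the appendix code; the 4/n reference line is our addition, a common rule of thumb rather than part of the original plots:

plot(resid(complicated_model),ylab="Deviance residuals")
plot(cooks.distance(complicated_model),ylab="Cook's distance")
abline(h=4/nrow(training_set),lty=2)   # rule-of-thumb influence cutoff (our addition)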
Dis, a measure of distance to employment centers, has not fit well into our models and has not appeared significant. It is, however, negatively correlated with our target: a simple linear model on dis alone has an adjusted R-squared of 0.3803.
##
## Call:
## lm(formula = target ~ training_set$dis, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7673 -0.3116 0.1371 0.2835 1.2676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03320 0.04884 21.15 <2e-16 ***
## training_set$dis -0.14605 0.01113 -13.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3937 on 278 degrees of freedom
## Multiple R-squared: 0.3825, Adjusted R-squared: 0.3803
## F-statistic: 172.2 on 1 and 278 DF, p-value: < 2.2e-16
Dis is negatively correlated with our target and with some other variables. This does not prevent dis from being significant on its own, but it inhibits its use as an interaction term. When we plot dis against indus, nox, age, and taxrad, there appears to be interesting structure that a cluster analysis might capture. All four variables show negative correlations with dis in the correlation plot, and each shows an interaction with dis that a cluster model might exploit (see the sketch below).
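The clustering step itself is short. The sketch below is the appendix code with the cluster labels extracted explicitly and the inputs selected by name; the seed is our addition, since kmeans uses random starts:

set.seed(102)   # kmeans starts are random; seed added for reproducibility
km<-kmeans(training_set[,c('indus','nox','age','dis','taxrad')],centers=8)
training_set$means_group<-km$cluster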
We include the output of an eight-cluster k-means as the variable means_group, and add it to a model with zn, nox, age, dis, taxrad, ptratio, black, medv, and taxsq. The AIC for this model is 113.58. The residuals are fairly normal, and leverage, measured by Cook's Distance, is under control. The means_group variable, tested by VIF, fits well with the other variables in this model.
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + taxrad + ptratio +
## black + medv + taxsq + means_group, family = binomial(link = "logit"),
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7988 -0.1815 -0.0031 0.0000 3.2863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.888e+00 1.094e+01 -0.264 0.791810
## zn -4.078e-02 3.808e-02 -1.071 0.284255
## nox 4.597e+01 1.052e+01 4.370 1.24e-05 ***
## age 2.379e-02 1.533e-02 1.552 0.120628
## dis 4.896e-01 2.771e-01 1.767 0.077188 .
## taxrad 2.997e-03 8.671e-04 3.456 0.000548 ***
## ptratio 4.725e-01 1.802e-01 2.622 0.008731 **
## black -9.015e-02 2.751e-02 -3.277 0.001051 **
## medv 1.293e-01 4.984e-02 2.595 0.009465 **
## taxsq -2.760e-04 9.379e-05 -2.942 0.003259 **
## means_group -6.546e-01 2.198e-01 -2.978 0.002898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.248 on 279 degrees of freedom
## Residual deviance: 91.582 on 269 degrees of freedom
## AIC: 113.58
##
## Number of Fisher Scoring iterations: 11
## zn nox age dis taxrad ptratio
## 1.648745 3.021711 1.553451 2.650129 3.405757 1.781346
## black medv taxsq means_group
## 1.366804 2.008031 5.134847 1.738339
After looking at three models, we evaluate each on the validation set that we segregated at the beginning of data preparation. For each model we plot a ROC curve, calculate the area under it, and find the probability threshold at which the model performs best.
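The procedure is the same for each model; it is sketched here for the k-means model, condensed from the appendix (pROC's coords(roc_obj, "best") would locate a best threshold more directly):

prediction_set<-predict(kmeans_model,newdata=testing_set,type='response')
roc_obj<-roc(testing_set$target,prediction_set)
plot(roc_obj)
auc(roc_obj)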
Model 3: K-Means Model
## Area under the curve: 0.9629
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 20
## 1 4 77
##
## Accuracy : 0.871
## 95% CI : (0.8141, 0.9155)
## No Information Rate : 0.5215
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7434
## Mcnemar's Test P-Value : 0.0022
##
## Sensitivity : 0.7938
## Specificity : 0.9551
## Pos Pred Value : 0.9506
## Neg Pred Value : 0.8095
## Prevalence : 0.5215
## Detection Rate : 0.4140
## Detection Prevalence : 0.4355
## Balanced Accuracy : 0.8744
##
## 'Positive' Class : 1
##
Model 2: Complicated Model
## Area under the curve: 0.9613
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 83 12
## 1 6 85
##
## Accuracy : 0.9032
## 95% CI : (0.8514, 0.9416)
## No Information Rate : 0.5215
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8066
## Mcnemar's Test P-Value : 0.2386
##
## Sensitivity : 0.8763
## Specificity : 0.9326
## Pos Pred Value : 0.9341
## Neg Pred Value : 0.8737
## Prevalence : 0.5215
## Detection Rate : 0.4570
## Detection Prevalence : 0.4892
## Balanced Accuracy : 0.9044
##
## 'Positive' Class : 1
##
Model 1: Four Variable Model
## Area under the curve: 0.9537
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 14
## 1 9 83
##
## Accuracy : 0.8763
## 95% CI : (0.8203, 0.92)
## No Information Rate : 0.5215
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7528
## Mcnemar's Test P-Value : 0.4042
##
## Sensitivity : 0.8557
## Specificity : 0.8989
## Pos Pred Value : 0.9022
## Neg Pred Value : 0.8511
## Prevalence : 0.5215
## Detection Rate : 0.4462
## Detection Prevalence : 0.4946
## Balanced Accuracy : 0.8773
##
## 'Positive' Class : 1
##
All three of our models provide a respectable area under the curve (AUC). Model 3, which includes the k-means clustering variable, pairs the highest AUC with the lowest AIC, in line with our earlier AIC analysis, and we judge it the most robust model even though Model 2 posts slightly higher raw accuracy at its chosen threshold. While Model 3 includes ten terms and is not parsimonious, all of the variables create a good model together, so we select it.
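The class predictions below come from applying the k-means model to the evaluation data with a 0.5 cutoff (condensed from the appendix):

eval_probs<-predict(kmeans_model,newdata=evaluation_set,type='response')
as.integer(eval_probs > .5)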
## [,1]
## [1,] 0
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 0
## [6,] 0
## [7,] 0
## [8,] 0
## [9,] 0
## [10,] 0
## [11,] 1
## [12,] 1
## [13,] 1
## [14,] 1
## [15,] 1
## [16,] 0
## [17,] 0
## [18,] 1
## [19,] 0
## [20,] 0
## [21,] 0
## [22,] 0
## [23,] 0
## [24,] 0
## [25,] 0
## [26,] 0
## [27,] 0
## [28,] 1
## [29,] 1
## [30,] 1
## [31,] 1
## [32,] 1
## [33,] 1
## [34,] 1
## [35,] 1
## [36,] 1
## [37,] 1
## [38,] 1
## [39,] 0
## [40,] 0
Appendix
library(e1071)
library(dplyr)
library(ggplot2)
library(gridExtra)
suppressWarnings(suppressMessages(library(MASS)))
suppressWarnings(suppressMessages(library(car)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(pROC)))
suppressWarnings(suppressMessages(library(caret)))
boston_data_set<-read.csv('https://raw.githubusercontent.com/WigodskyD/data-sets/master/crime-training-data.csv')
full_model<-glm(data=boston_data_set,target~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv)   # gaussian fit is fine here; VIF depends only on the predictors
vif(full_model)
boston_data_set %>%
mutate(taxrad=tax*rad) %>%
mutate(radsq = rad^2) %>%
mutate(taxsq = tax^1.7) %>%
mutate(indussq = indus^2) %>%
mutate(taxradsq = taxrad^2)->boston_data_set
correl.matrix<-cor(boston_data_set[,1:14], use="complete.obs")
corrplot(correl.matrix,method="color",type="upper")
plota<-ggplot(boston_data_set, aes(x=target, y=zn, group=target))+geom_boxplot()+labs(x='target',y='zn')+theme(panel.background = element_rect(fill = '#4286f4'))
plotb<-ggplot(boston_data_set, aes(x=target, y=indus, group=target))+geom_boxplot()+labs(x='target',y='indus')+theme(panel.background = element_rect(fill = '#4286f4'))
plotc<-ggplot(boston_data_set, aes(x=target, y=chas, group=target))+geom_boxplot()+labs(x='target',y='chas')+theme(panel.background = element_rect(fill = '#4286f4'))
plotd<-ggplot(boston_data_set, aes(x=target, y=nox, group=target))+geom_boxplot()+labs(x='target',y='nox')+theme(panel.background = element_rect(fill = '#4286f4'))
plote<-ggplot(boston_data_set, aes(x=target, y=rm, group=target))+geom_boxplot()+labs(x='target',y='rm')+theme(panel.background = element_rect(fill = '#4286f4'))
plotf<-ggplot(boston_data_set, aes(x=target, y=age, group=target))+geom_boxplot()+labs(x='target',y='age')+theme(panel.background = element_rect(fill = '#4286f4'))
plotg<-ggplot(boston_data_set, aes(x=target, y=dis, group=target))+geom_boxplot()+labs(x='target',y='dis')+theme(panel.background = element_rect(fill = '#4286f4'))
ploth<-ggplot(boston_data_set, aes(x=target, y=rad, group=target))+geom_boxplot()+labs(x='target',y='rad')+theme(panel.background = element_rect(fill = '#4286f4'))
ploti<-ggplot(boston_data_set, aes(x=target, y=tax, group=target))+geom_boxplot()+labs(x='target',y='tax')+theme(panel.background = element_rect(fill = '#4286f4'))
plotj<-ggplot(boston_data_set, aes(x=target, y=ptratio, group=target))+geom_boxplot()+labs(x='target',y='ptratio')+theme(panel.background = element_rect(fill = '#4286f4'))
plotk<-ggplot(boston_data_set, aes(x=target, y=lstat, group=target))+geom_boxplot()+labs(x='target',y='lstat')+theme(panel.background = element_rect(fill = '#4286f4'))
plotl<-ggplot(boston_data_set, aes(x=target, y=medv, group=target))+geom_boxplot()+labs(x='target',y='medv')+theme(panel.background = element_rect(fill = '#4286f4'))
grid.arrange(plota,plotb,plotc,plotd,plote,plotf,plotg,ploth,ploti,plotj,plotk,plotl,nrow = 3)
set.seed(102)
testing_indices<-sample(seq_len(nrow(boston_data_set)),floor(.4*nrow(boston_data_set)))
testing_set<-boston_data_set[testing_indices,]
training_set<-boston_data_set[-testing_indices,]
most_significant_predictors<-training_set[,c(4,8,9,11,14)]   # nox, rad, tax, black, target
most_significant_model<-glm(data=most_significant_predictors,target~.,family=binomial(link='logit'))
summary(most_significant_model)
vif(most_significant_model)
ggplot()+geom_point(aes(x=seq_along(resid(most_significant_model)),y=resid(most_significant_model)),color='blue',shape=9,size=3)+ylim(-1.3,1.3)+ theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='4 Variable Model',y='Residuals')
ggplot()+geom_point(aes(x=seq_along(cooks.distance(most_significant_model)),y=cooks.distance(most_significant_model)),color='blue',shape=9,size=3)+ theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='4 Variable Model',y="Cook's Distance")
complicated_model<-glm(data=training_set,target~nox+age+log(taxrad)+(taxrad*log(age))+taxsq+radsq+medv,family=binomial(link='logit'))
summary(complicated_model)
ggplot()+geom_point(aes(x=seq_along(resid(complicated_model)),y=resid(complicated_model)),color='blue',shape=9,size=3)+ylim(-1.3,1.3)+ theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='Complicated Model',y='Residuals')
ggplot()+geom_point(aes(x=seq_along(cooks.distance(complicated_model)),y=cooks.distance(complicated_model)),color='blue',shape=9,size=3)+ theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='Complicated Model',y="Cook's Distance")
dis_correlation_model<-lm(data=training_set,target~training_set$dis)
summary(dis_correlation_model)
plot.a<-ggplot()+geom_point(aes(y=training_set$dis,x=training_set$indus,color=training_set$target))+labs(x='indus',y='dis',color='target\n')+ theme(panel.background = element_rect(fill = '#f4f5f7'))
plot.b<-ggplot()+geom_point(aes(y=training_set$dis,x=training_set$nox,color=training_set$target))+labs(x='nox',y='dis',color='target')+ theme(panel.background = element_rect(fill = '#f4f5f7'))
plot.c<-ggplot()+geom_point(aes(y=training_set$dis,x=training_set$age,color=training_set$target))+labs(x='age',y='dis',color='target\n')+ theme(panel.background = element_rect(fill = '#f4f5f7'))
plot.d<-ggplot()+geom_point(aes(y=training_set$dis,x=training_set$taxrad,color=training_set$target))+labs(x='taxrad',y='dis',color='target')+ theme(panel.background = element_rect(fill = '#f4f5f7'))
grid.arrange(plot.a,plot.b,plot.c,plot.d,nrow = 2)
# cluster on indus, nox, age, dis and taxrad; keep only the cluster labels
training_set$means_group<-kmeans(training_set[,c('indus','nox','age','dis','taxrad')],8)$cluster
kmeans_model<-glm(data=training_set,target~zn+nox+age+dis+taxrad+ptratio+black+medv+taxsq+means_group,family=binomial(link='logit'))
summary(kmeans_model)
vif(kmeans_model)
ggplot()+geom_point(aes(x=seq_along(resid(kmeans_model)),y=resid(kmeans_model)),color='blue',shape=9,size=3)+ylim(-1.3,1.3)+ theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='K-means Model',y='Residuals')
ggplot()+geom_point(aes(x=seq_along(cooks.distance(kmeans_model)),y=cooks.distance(kmeans_model)),color='blue',shape=9,size=3)+
theme(panel.background = element_rect(fill = '#d3dded'))+labs(x='K-means Model',y="Cook's Distance")
# note: clusters are re-estimated on the testing set, so labels need not align with the training clusters
testing_set$means_group<-kmeans(testing_set[,c('indus','nox','age','dis','taxrad')],8)$cluster
prediction_set<-predict(kmeans_model,newdata=testing_set,type='response')
target<-testing_set[,14]
ROC_set<-cbind(target,prediction_set)
roc_function_object<-roc(ROC_set[,1],ROC_set[,2])
plot(roc_function_object)
auc(roc_function_object)
column1<-matrix(unlist(roc_function_object[2]))   # sensitivities
column2<-matrix(unlist(roc_function_object[3]))   # specificities
roc_function_matrix<-as.data.frame(cbind(column1,column2),ncol=2)
colnames(roc_function_matrix)<-c('column1','column2')
roc_function_matrix %>%
mutate(mean = .5*(column1+(1-column2))) %>%
mutate(euclidean_dist = ((column1-mean)^2+((1-column2)-mean)^2)^.5)->roc_function_matrix
threshold<-(which.max(roc_function_matrix$euclidean_dist))/length(roc_function_matrix[,1])
as.data.frame(ROC_set) %>% mutate(guess = prediction_set > threshold)->ROC_set
ROC_set$guess[ROC_set$guess==TRUE]<-1
ROC_set$guess<-as.factor(ROC_set$guess)
ROC_set$target<-as.factor(ROC_set$target)
confusionMatrix(ROC_set$guess,ROC_set$target,positive="1")
prediction_set<-predict(complicated_model,newdata=testing_set,type='response')
target<-testing_set[,14]
ROC_set<-cbind(target,prediction_set)
roc_function_object<-roc(ROC_set[,1],ROC_set[,2])
plot(roc_function_object)
auc(roc_function_object)
column1<-matrix(unlist(roc_function_object[2]))   # sensitivities
column2<-matrix(unlist(roc_function_object[3]))   # specificities
roc_function_matrix<-as.data.frame(cbind(column1,column2),ncol=2)
colnames(roc_function_matrix)<-c('column1','column2')
roc_function_matrix %>%
mutate(mean = .5*(column1+(1-column2))) %>%
mutate(euclidean_dist = ((column1-mean)^2+((1-column2)-mean)^2)^.5)->roc_function_matrix
threshold<-(which.max(roc_function_matrix$euclidean_dist))/length(roc_function_matrix[,1])
as.data.frame(ROC_set) %>% mutate(guess = prediction_set > threshold)->ROC_set
ROC_set$guess[ROC_set$guess==TRUE]<-1
ROC_set$guess<-as.factor(ROC_set$guess)
ROC_set$target<-as.factor(ROC_set$target)
confusionMatrix(ROC_set$guess,ROC_set$target,positive="1")
prediction_set<-predict(most_significant_model,newdata=testing_set,type='response')
target<-testing_set[,14]
ROC_set<-cbind(target,prediction_set)
roc_function_object<-roc(ROC_set[,1],ROC_set[,2])
plot(roc_function_object)
auc(roc_function_object)
column1<-matrix(unlist(roc_function_object[2]))   # sensitivities
column2<-matrix(unlist(roc_function_object[3]))   # specificities
roc_function_matrix<-as.data.frame(cbind(column1,column2),ncol=2)
colnames(roc_function_matrix)<-c('column1','column2')
roc_function_matrix %>%
mutate(mean = .5*(column1+(1-column2))) %>%
mutate(euclidean_dist = ((column1-mean)^2+((1-column2)-mean)^2)^.5)->roc_function_matrix
threshold<-(which.max(roc_function_matrix$euclidean_dist))/length(roc_function_matrix[,1])
as.data.frame(ROC_set) %>% mutate(guess = prediction_set > threshold)->ROC_set
ROC_set$guess[ROC_set$guess==TRUE]<-1
ROC_set$guess<-as.factor(ROC_set$guess)
ROC_set$target<-as.factor(ROC_set$target)
confusionMatrix(ROC_set$guess,ROC_set$target,positive="1")
evaluation_set<-read.csv('https://raw.githubusercontent.com/WigodskyD/data-sets/master/crime-evaluation-data.csv')
evaluation_set %>%
mutate(taxrad=tax*rad) %>%
mutate(radsq = rad^2) %>%
mutate(taxsq = tax^1.7) %>%
mutate(indussq = indus^2) %>%
mutate(taxradsq = taxrad^2)->evaluation_set
# the evaluation data has no target column, so the column indices shift; select the clustering variables by name
evaluation_set$means_group<-kmeans(evaluation_set[,c('indus','nox','age','dis','taxrad')],8)$cluster
prediction_set<-matrix(predict(kmeans_model,newdata=evaluation_set,type='response'))
prediction_set[prediction_set<.5]<-0
prediction_set[prediction_set>.5]<-1
prediction_set
write.csv(prediction_set,'C:/Users/dawig/Desktop/Data621/WigodskyDanpredictions.csv')