1. Data Exploration

Summary:

  • The training data consist of 1 response variable and 12 predictor variables.
  • The data are complete, with no missing values.
  • The variables rad, tax, and ptratio share approximately 120 observations in which their values are 24, 666, and 20.2, respectively.
  • Each such rad/tax pair has a target of 1; this will affect the model.

I read the crime training data from the provided CSV into a data frame. The data seem very clean, with no missing values; I wrote a loop to double-check for missing values to be certain. Most of the data are numeric, save for chas, which is a categorical dummy variable.
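
A compact way to repeat that check, shown here only as a sketch that assumes the data frame is named crime as in the code appendix, is to count missing values per column:

# Number of NA values in each column; all zeros confirms the data are complete
colSums(is.na(crime))
# Single overall count across the whole data frame
sum(is.na(crime))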

A summary of the data is provided in the table below:

Variable Mean Median Standard Deviation
target 0.4914 0 0.5004636
chas 0.07082 0 0.2567920
zn 11.58 0 23.3646511
indus 11.105 9.690 6.8458549
nox 0.5543 0.5380 0.1166667
rm 6.291 6.210 0.7048513
age 77.15 68.37 28.3213784
dis 3.796 3.191 2.1069496
rad 9.53 5.00 8.6859272
tax 409.5 344.5 167.9000887
ptratio 18.4 18.9 2.1968447
lstat 12.631 11.350 7.1018907
medv 22.59 21.20 9.2396814

You can see that for rad and tax the means are much higher than the medians, which indicates that the data are skewed. Upon closer inspection, I noticed that rad, tax, and ptratio have matched sets of (24, 666, 20.2). Each rad = 24 pairs with a tax = 666; however, some ptratio values of 20.2 do not correspond to the other two values. This is because 24 and 666 are extreme outliers for rad and tax, but 20.2 is not an outlier for ptratio.
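
The overlap can be verified directly. The snippet below is illustrative only and assumes the untransformed training data frame crime:

# Count the observations that share the (rad, tax) pair of 24 and 666
sum(crime$rad == 24 & crime$tax == 666)
# How many of those also have ptratio == 20.2, and what their targets are
sum(crime$rad == 24 & crime$tax == 666 & crime$ptratio == 20.2)
table(crime$target[crime$rad == 24 & crime$tax == 666])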

Figure 1: Nox violin plot shows a clear difference in distribution between response-variable cases.

Figure 2: RM violin plot does not show a clear difference in distribution between response-variable cases.

Figure 3: Nox vs. target shows a clear relationship between the response variable and the predictor.

2. Data Preparation

Summary:

  • Rad, tax, and ptratio had high collinearity (>0.9) that disappeared when the (24, 666, 20.2) values were replaced by medians.
  • No other predictors were correlated with one another above 0.8, so no other variables were transformed.
  • Violin plots were generated for each variable. Only medv and rm showed no obvious difference between the target cases (0, 1).
  • The variable nox (nitrogen oxides concentration) had the largest correlation with the target.

I prefer to take a minimalist approach to preparing data, fixing only those things that are broken. In this way we can have confidence that the data will meaningfully predict the outcome. In this case, the only problem that seemed to need fixing was the substitution error discussed previously in rad, tax, and ptratio.

I opted for a median replacement for the identified incorrect values. The values were deemed incorrect because rad is an index of accessibility to radial highways. Rad otherwise appears to run on a scale of 1 to 8, so a value of 24 is absurd. Furthermore, every time rad was 24, tax was 666. Superstitions about that number aside, since tax is the property-tax rate per $10,000, I would expect some high outliers representing wealthy neighborhoods. However, 120 such identical values out of 466 observations is unreasonably high. I would also expect the pupil-teacher ratio for a wealthy neighborhood to be well below the mean value of 18.4, not above it at 20.2. Finally, the probability that these three values would always co-occur is practically zero.

Figure 4: Tax vs. rad shows that the high correlation is due to 120 outliers with high leverage.

The rad vs. tax values also created high leverage for the correlation between the two variables, as seen in figure four. Keep in mind that what looks like a single point is actually 120 points in the exact same position. I opted for a median substitution because the mean is not resilient against outliers, and moving those data close to the center greatly reduces their leverage. This is especially important as each set of (24, 666, 20.2) corresponded to a target of 1.
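
The effect of that leverage can be seen by comparing the rad-tax correlation before and after the substitution. This is a minimal sketch of the idea, assuming a fresh copy of the training data:

# Correlation with the 120 repeated (24, 666) points included
cor(crime$rad, crime$tax)
# Replace the suspect values with medians, as in the data preparation step
crime$tax[crime$tax == 666] <- median(crime$tax)
crime$rad[crime$rad == 24] <- median(crime$rad)
crime$ptratio[crime$ptratio == 20.2] <- median(crime$ptratio)
# Correlation after the substitution is much weaker
cor(crime$rad, crime$tax)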

Variable Correlation with Target
zn -0.4316818
indus 0.6048507
chas 0.08004187
nox 0.7261062
rm -0.1525533
age 0.6301062
dis -0.6186731
rad 0.3121268
tax 0.2090816
ptratio 0.115086
lstat 0.469127
medv -0.2705507

3. Build Models

Summary:

  • Two backward-eliminated models were created, one with a logit link and one with a probit link.
  • The R function regsubsets() was used to guide a good end point for the backward-eliminated models.
  • A forward-selected model starting with nox was also created. This model avoided rad, tax, and ptratio.

My first step in building models was to use the regsubsets() function, which finds the best variables to use for each possible model size. In other words, if you want to create an eight-variable model, it will tell you which 8 variables produce the best model based on criteria such as adjusted R-squared, Mallows' Cp, and BIC. I did this to help guide the process: if a model I built independently varied too greatly from the model suggested by regsubsets(), it might be cause to revisit my methods.
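
As an illustration of how the regsubsets() output can guide that comparison, the sketch below pulls out the terms chosen for the eight-variable subset; it assumes the regfit.full object created in the code appendix:

library(leaps)
# Intercept plus predictors included in the best eight-variable subset
reg.summary <- summary(regfit.full)
names(which(reg.summary$which[8, ]))
# BIC for each subset size, to help judge a good stopping point
reg.summary$bic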

The goal is, of course, to create a model that has high accuracy, but also one that I have high confidence will work well on the client's data. To that end I am using well-tested backward elimination with both logit and probit links, to see which produces higher accuracy against a test set.

Since the variable nox is the predictor most highly correlated with the target, I am also interested in building a forward-selected model that starts with nox and adds any other variable aside from rad, tax, and ptratio. Since those variables had 120 out of 466 values replaced, I am concerned about how using them will affect the generalization of the model to the client's data. If this model performs with close to the same accuracy as the other two, it should be the recommended model.
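
One way to carry out that forward step, shown here only as a sketch and not the exact procedure used, is base R's add1(), which scores each candidate addition to the nox-only model while keeping rad, tax, and ptratio out of the scope:

# Start from the single-predictor nox model (nox.fit in the appendix)
nox.fit <- glm(target ~ nox, data = crime, family = binomial)
# Evaluate each remaining candidate, excluding rad, tax, and ptratio
add1(nox.fit,
     scope = ~ nox + zn + indus + chas + rm + age + dis + lstat + medv,
     test = "LRT")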

Eight Variable Logit Model:

parameter Value
Intercept -41.46706143
zn -0.07190461
nox 49.66969441
age 0.02656795
dis 0.60600888
tax -0.01022942
rad 0.88472112
ptratio 0.41245041
medv 0.09469977

Seven Variable Probit Model:

parameter Value
Intercept -21.374638936
nox 26.607996613
age 0.011643943
dis 0.192553245
tax -0.005991887
rad 0.452207740
ptratio 0.234360912
medv 0.039826087

Three Variable Nox Model:

parameter Value
Intercept -18.23049857
nox 32.13766020
dis 0.27895617
zn -0.04385274

That crime is linked to poverty is an observation that goes back millennia; the Roman Emperor Marcus Aurelius said, "Poverty is the mother of crime." I expect the variable zn, the proportion of land zoned for large lots, to have a negative slope, as wealthier people can afford to live on larger plots of land. Nox, which tracks air pollution, should have a positive slope, as poorer people may only be able to afford to live in more polluted areas. The variable age tracks the age of buildings and should have a positive slope, as upper-class people in the US tend to live in newer constructions, the so-called McMansions. The variable dis, the weighted mean distance from employment centers, should have a positive slope, since poorer people may find themselves living farther from work. The variable tax, measuring the property-tax rate, ought to have a negative slope: more valuable land is owned by wealthier people, who are less likely to commit crimes. The variable rad, the distance from radial highways, should have a positive slope for a similar reason as dis: poor people have less access to transportation. The variable ptratio should have a positive slope, since education funding is tied to property taxes, and poorer neighborhoods cannot afford to hire as many teachers as wealthier ones. Finally, medv, which tracks the median value of owner-occupied homes, I would expect to be negative, but it is positive and is the only variable whose sign does not match expectations. Perhaps this is because poorer neighborhoods have very few owner-occupied homes; this results in a slope that is close to zero.
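
To make these slope interpretations concrete, logit coefficients can be converted to odds ratios by exponentiating them. The sketch below assumes the fitted eight-variable model object eig.fit from the code appendix:

# Odds ratio per one-unit increase in each predictor
exp(coef(eig.fit))
# Example: the zn coefficient of about -0.072 corresponds to exp(-0.072) ~ 0.93,
# i.e. roughly a 7% drop in the odds of target = 1 for each additional
# percentage point of land zoned for large lots
exp(-0.07190461)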

4. Select Models

Summary:

  • Six-fold cross-validation was used to create test sets and generate the model-selection metrics.
  • The 8-variable logit model has 91.2% mean accuracy on the test sets.
  • The 7-variable probit model has 90.3% mean accuracy on the test sets.
  • The 3-variable nox forward-selected model has 85.6% mean accuracy on the test sets.

The 3-variable nox forward-selected model is recommended, as it is more likely to generalize at the reported accuracy.

I used six-fold cross-validation to test each model. This algorithm shuffles the data and iteratively divides it into 5/6 for a training set and 1/6 for a test set. A model is fit on each training set and used to make predictions on the corresponding test set. The algorithm then generates a confusion matrix for each test set and calculates accuracy, classification error rate, precision, sensitivity, specificity, and F1, and finally takes the mean of each metric across the folds. These means are reported in the table below, and a sketch of the underlying formulas follows the table. The cross-validation also used the roc() function from the pROC package to generate an ROC curve and AUC for each fold.

Mean Measure 8-var Logit 7-var Probit 3-var Nox
accuracy 0.9117184 0.9031152 0.8559942
classification error rate 0.0882816 0.09688484 0.1440058
precision 0.9170535 0.904981 0.8469928
sensitivity 0.9091372 0.9044007 0.8724329
specificity 0.9180123 0.9056955 0.8412602
F1 0.9117115 0.9027346 0.8582994
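
For reference, the metrics above reduce to simple ratios of the confusion-matrix cells. The helper below is a sketch of those formulas rather than the exact loop used in the appendix:

# Compute the reported metrics from a 2x2 confusion matrix
# (rows = actual target, columns = predicted class)
classification_metrics <- function(conf) {
  TN <- conf[1, 1]; FP <- conf[1, 2]
  FN <- conf[2, 1]; TP <- conf[2, 2]
  precision   <- TP / (TP + FP)
  sensitivity <- TP / (TP + FN)
  c(accuracy    = (TP + TN) / sum(conf),
    error_rate  = (FP + FN) / sum(conf),
    precision   = precision,
    sensitivity = sensitivity,
    specificity = TN / (TN + FP),
    F1          = 2 * precision * sensitivity / (precision + sensitivity))
}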

We can see that the 8-variable logit (8v) model is only marginally better than the 7-variable probit (7v) model on all metrics. The area under the curve ranges from 0.9361 to 0.9913 across folds for the 8v model, and from 0.9278 to 0.9866 for the 7v model. Although the difference is marginal, the 8v model is the preferred of these two.

Figure 5: Residuals of the 3v model. Note that the two negatively sloped parallel lines are the expected behavior.

The 3-variable model (3v) underperforms the other two models on all of the above metrics by about 5%. The area under the curve of its ROC plots ranges from 0.9037 to 0.9733, only slightly less than the other two models. The 3v model has one distinct advantage: it is more likely to predict outcomes at the indicated accuracy than the other two.

The 8v and 7v models depend on three variables (rad, tax, ptratio) that have had 26% of their values replaced with medians. Had these data not been entered incorrectly, the coefficients and significance might have been quite different, especially since each of these sets corresponded to a target value of 1.

In light of these considerations, I recommend that the 3v model be used. It performs only marginally less accurately than the others, at an accuracy that is considered very good for social-science models, and I have much more confidence that it will work at the indicated accuracy on the client's data.

Code Appendix

Below is the R code used in the analysis organized by section.

Data Exploration

suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(leaps)))
suppressMessages(suppressWarnings(library(pROC)))
path <- 'C:\\Users\\Nate\\Documents\\DataSet\\crime-training-data_modified.csv'
crime <- path %>% read.csv() %>% as.data.frame()
crime %>% head()
crime %>% summary()
crime %>% sapply(sd)
# Double-check that there are no NAs: is.na(sum(column)) is TRUE if the column contains any NA
for(i in 1:ncol(crime)){
  nas <- crime[,i] %>% sum() %>% is.na()
  print(nas)
}
crime$tax[crime$tax == 666] <- median(crime$tax, na.rm = TRUE)
crime$rad[crime$rad == 24] <- median(crime$rad, na.rm = TRUE)
crime$ptratio[crime$ptratio == 20.2] <- median(crime$ptratio, na.rm = TRUE)
cor(crime)
cor(crime$target,crime[,1:13])
ggplot(crime,aes(x=as.factor(target), y=zn),)+
  geom_violin( fill='red2',alpha=0.2)+
  geom_boxplot(width=0.025)
ggplot(crime,aes(x=as.factor(target), y=indus),)+
  geom_violin( fill='orangered',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=nox),)+
  geom_violin( fill='orange',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=rm),)+
  geom_violin( fill='goldenrod',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=age),)+
  geom_violin( fill='yellow',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=dis),)+
  geom_violin( fill='yellowgreen',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=rad),)+
  geom_violin( fill='green',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=tax),)+
  geom_violin( fill='cyan',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=ptratio),)+
  geom_violin( fill='blue',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=lstat),)+
  geom_violin( fill='blueviolet',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime,aes(x=as.factor(target), y=medv),)+
  geom_violin( fill='violet',alpha=0.2)+
  geom_boxplot(width=0.1)
ggplot(crime, aes(x=target))+
  geom_bar(fill="blue")
ggplot(crime, aes(x=zn))+
  geom_histogram(fill="goldenrod")
ggplot(crime, aes(x=indus))+
  geom_histogram(fill="skyblue")
ggplot(crime, aes(x=nox))+
  geom_histogram(fill="slateblue1")
ggplot(crime, aes(x=rm))+
  geom_histogram(fill="steelblue")
ggplot(crime, aes(x=age))+
  geom_histogram(fill="steelblue2")
ggplot(crime, aes(x=dis))+
  geom_histogram(fill="royalblue")
ggplot(crime, aes(x=rad))+
  geom_bar(fill="lightblue")
ggplot(crime, aes(x=tax))+
  geom_histogram(fill="steelblue2")
ggplot(crime, aes(x=ptratio))+
  geom_histogram(fill="sienna")
ggplot(crime, aes(x=lstat))+
  geom_histogram(fill="olivedrab")
ggplot(crime, aes(x=medv))+
  geom_histogram(fill="olivedrab2")
ggplot(crime, aes(x=chas))+
  geom_bar(fill="olivedrab3")
regfit.full=regsubsets(target~zn+
                indus+
                chas+
                nox+
                rm+
                age+
                dis+
                rad+
                tax+
                ptratio+
                lstat+
                medv,data=crime, nvmax = 12)
summary(regfit.full)
nox.fit <- glm(target ~ nox, data=crime,family = 'binomial')
summary(nox.fit)
ggplot(crime, aes(x=nox,y=target))+
  geom_point(color='red')+
  geom_smooth(method = 'glm', method.args = list(family = "binomial"), 
    se = FALSE)
all.fit <- glm(target ~., data=crime,family = 'binomial')
summary(all.fit)
#remove rm
el.fit <- glm(target ~zn+
                indus+
                chas+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                lstat+
                medv, data=crime,family = 'binomial')
summary(el.fit)
#remove chas
ten.fit <- glm(target ~zn+
                indus+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                lstat+
                medv, data=crime,family = 'binomial')
summary(ten.fit)
#remove indus
nin.fit <- glm(target ~zn+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                lstat+
                medv, data=crime,family = 'binomial')
summary(nin.fit)
#remove lstat
eig.fit <- glm(target ~zn+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                medv, data=crime,family = 'binomial')
summary(eig.fit)
# This model gives the lowest AIC and matches the best 8-predictor model from the regsubsets() function.

#remove zn to see if AIC continues to decrease
sev.fit <- glm(target ~
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                medv, data=crime,family = 'binomial')
summary(sev.fit)
# removing 'zn' caused a sharp increase in AIC from 215.32 to 219.45

Six-fold cross-validation of the eight-variable logit model:

# Set random seed. Don't remove this line.
set.seed(1)

# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <-  rep(0,6)

for (i in 1:6) {
  # These indices indicate the interval of the test set
  
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  
  # Exclude them from the train set
  train <- shuffled[-indices,]
  
  # Include them in the test set
  test <- shuffled[indices,]
  pred <- rep(0,nrow(test))
  # A model is learned using each training set
  fit <- glm(target ~zn+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                medv, data=train,family = 'binomial' (link = 'logit'))
  
  # Make a prediction on the test set using tree
  prob <- predict.glm(fit,test,type = 'response')
  prob[is.na(prob)] <- 0
  for(j in 1:length(prob)){
    if(prob[j] >0.5){
      pred[j] = 1
    }
    else if (prob[j]< 0.5){
      pred[j] = 0
    }
  }
  roc1 <- roc(test$target,prob)
  plot(roc1)
  print(roc1)
  # Assign the confusion matrix to conf
  conf <- table(test$target,pred)
  print(conf)
  TN <- conf[1,1]
  TN # True Negatives
  FN <- conf[2,1]
  FN # False Negatives
  TP <- conf[2,2]
  TP # True Positives
  FP <- conf[1,2]
  FP # False Positives
  
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf))/sum(conf)
  cers[i] <- (FP+FN)/sum(conf)
  pres[i] <- (TP)/(TP+FP)
  sens[i] <- (TP)/(TP+FN)
  spes[i] <- (TN)/(TN+FP)
  f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}

# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
prb_fit <- glm(target ~ ., family = 'binomial' (link = 'probit'), data=crime)
summary(prb_fit)

# The probit model yields only 7 statistically significant variables and has a higher AIC
prb2_fit <- glm(target ~ 
                #zn+
                #indus+
                #chas+
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                #lstat+
                medv, family = 'binomial' (link = 'probit'), data=crime)
summary(prb2_fit)
# Set random seed.
set.seed(1)

# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <- rep(0,6)

for (i in 1:6) {
  # These indices indicate the interval of the test set
  
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  
  # Exclude them from the train set
  train <- shuffled[-indices,]
  
  # Include them in the test set
  test <- shuffled[indices,]
  pred <- rep(0,nrow(test))
  # A model is learned using each training set
  fit <- glm(target ~ 
                nox+
                age+
                dis+
                rad+
                tax+
                ptratio+
                medv, family = 'binomial' (link = 'probit'), data=train)
  
  # Make a prediction on the test set using tree
  prob <- predict.glm(fit,test,type = 'response')
  prob[is.na(prob)] <- 0
  for(j in 1:length(prob)){
    if(prob[j] >0.5){
      pred[j] = 1
    }
    else if (prob[j]< 0.5){
      pred[j] = 0
    }
  }
  roc1 <- roc(test$target,prob)
  plot(roc1)
  print(roc1)
  # Assign the confusion matrix to conf
  conf <- table(test$target,pred)
  print(conf)
  TN <- conf[1,1]
  TN # True Negatives
  FN <- conf[2,1]
  FN # False Negatives
  TP <- conf[2,2]
  TP # True Positives
  FP <- conf[1,2]
  FP # False Positives
  
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf))/sum(conf)
  cers[i] <- (FP+FN)/sum(conf)
  pres[i] <- (TP)/(TP+FP)
  sens[i] <- (TP)/(TP+FN)
  spes[i] <- (TN)/(TN+FP)
  f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}

# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
# For this model I want to avoid tax, rad, and ptratio since they had so many bad values
# I initially ran nox alone through the 6-fold validation and got 81% accuracy
# How many variables can I add that improve on that without over-fitting?
nx_fit <- glm(target ~
                nox+
                #lstat +
                dis +
                #medv+
                #age +
                #indus+
                #rm+
                #chas+
                zn, family = 'binomial', data=crime)
summary(nx_fit)
# Set random seed.
set.seed(1)

# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <- rep(0,6)

for (i in 1:6) {
  # These indices indicate the interval of the test set
  
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  
  # Exclude them from the train set
  train <- shuffled[-indices,]
  
  # Include them in the test set
  test <- shuffled[indices,]
  pred <- rep(0,nrow(test))
  # A model is learned using each training set
  fit <- glm(target ~ nox+
                dis +
                zn
             , family = 'binomial' (link = 'logit'), data=train)
  
  # Make a prediction on the test set using tree
  prob <- predict.glm(fit,test,type = 'response')
  prob[is.na(prob)] <- 0
  for(j in 1:length(prob)){
    if(prob[j] >0.5){
      pred[j] = 1
    }
    else if (prob[j]< 0.5){
      pred[j] = 0
    }
  }
  roc1 <- roc(test$target,prob)
  plot(roc1)
  print(roc1)
  # Assign the confusion matrix to conf
  conf <- table(test$target,pred)
  print(conf)
  TN <- conf[1,1]
  TN # True Negatives
  FN <- conf[2,1]
  FN # False Negatives
  TP <- conf[2,2]
  TP # True Positives
  FP <- conf[1,2]
  FP # False Positives
  
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf))/sum(conf)
  cers[i] <- (FP+FN)/sum(conf)
  pres[i] <- (TP)/(TP+FP)
  sens[i] <- (TP)/(TP+FN)
  spes[i] <- (TN)/(TN+FP)
  f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}

# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
plot(resid(eig.fit) ~ fitted(eig.fit))
plot(resid(prb2_fit) ~ fitted(prb2_fit))
plot(resid(nx_fit) ~ fitted(nx_fit))
path <- "C:\\Users\\Nate\\Documents\\DataSet\\crime-evaluation-data_modified.csv"
crime_eval <- path %>% read.csv() %>% as.data.frame()
head(crime_eval)
summary(crime_eval)
ev_prd <- predict.glm(nx_fit, crime_eval,type = 'response')
ev_prd
# Classify the evaluation predictions; pred must be re-initialized to the length of ev_prd
pred <- rep(0, length(ev_prd))
for(j in 1:length(ev_prd)){
  if(ev_prd[j] > 0.5){
    pred[j] <- 1
  } else {
    pred[j] <- 0
  }
}
crm_ev <- data.frame(cbind(crime_eval$INDEX,pred))
write.csv(crm_ev, file="NCooper_PredictedCrime.csv")