Summary:
I read the crime training data from the provided CSV into a data frame. The data seem very clean, with no missing values. To be certain, I wrote a loop to double-check for missing values. Most of the data are numeric, save for chas, which is a categorical dummy variable.
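For example, the same check can be done in one line with base R (a minimal sketch; crime is the data frame name used in the appendix code):
colSums(is.na(crime)) # number of missing values per column; every entry is zero here
anyNA(crime) # single TRUE/FALSE test for any missing value in the data frame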
A summary of the data is provided in the table below:
Variable | Mean | Median | Standard Deviation |
---|---|---|---|
target | 0.4914 | 0 | 0.5004636 |
chas | 0.07082 | 0 | 0.2567920 |
zn | 11.58 | 0 | 23.3646511 |
indus | 11.105 | 9.690 | 6.8458549 |
nox | 0.5543 | 0.5380 | 0.1166667 |
rm | 6.291 | 6.210 | 0.7048513 |
age | 77.15 | 68.37 | 28.3213784 |
dis | 3.796 | 3.191 | 2.1069496 |
rad | 9.53 | 5.00 | 8.6859272 |
tax | 409.5 | 344.5 | 167.9000887 |
ptratio | 18.4 | 18.9 | 2.1968447 |
lstat | 12.631 | 11.350 | 7.1018907 |
medv | 22.59 | 21.20 | 9.2396814 |
You can see that for rad and tax the means are much higher than the medians, which indicates the data are skewed. Upon closer inspection, I noticed that rad, tax, and ptratio share repeated sets of values (24, 666, 20.2). Every rad of 24 matches a tax of 666; however, there are some ptratio values of 20.2 that do not correspond to the other two. This is because 24 and 666 are extreme outliers for rad and tax, but 20.2 is not an outlier for ptratio.
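This pattern can be verified directly from the raw data, before any replacement (a quick sketch using the crime data frame from the appendix):
sum(crime$rad == 24 & crime$tax == 666) # 120 rows in this data set where the rad/tax pair co-occurs
sum(crime$ptratio == 20.2 & !(crime$rad == 24 & crime$tax == 666)) # ptratio = 20.2 also appears outside that pattern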
Figure 1: Nox violin plot, showing a clear difference in distribution between the two target classes.
Figure 2: Rm violin plot, showing no clear difference in distribution between the two target classes.
Figure 3: Nox vs. the target variable, showing a clear relationship between the response and the predictor.
Summary
I prefer to take a minimalist approach to preparing data, fixing only those things that are broken. In this way we can have confidence that the data will meaningfully predict the outcome. In this case, the only problem that seemed to need fixing was the apparent substitution errors in rad, tax, and ptratio discussed previously.
I opted for a median replacement for the identified incorrect values. The values were deemed incorrect because rad is an index of accessibility to radial highways; rad appears to use a Likert-style scale of 1-8, so a value of 24 is absurd. Furthermore, every time rad was 24, tax was 666. Superstitions about that number aside, since this is property tax per $10,000, I would expect some outliers in the $6,660,000 range representing wealthy neighborhoods; however, 120 such values out of 466 observations is unreasonably high. I would also expect the pupil-teacher ratio of a wealthy neighborhood to be well below the mean value of 18.4, not above it at 20.2. The probability that these values would always occur together is also practically zero.
Figure 4: Tax vs. rad, showing that the high correlation is due to 120 outliers with high leverage.
The rad vs. tax values also created high leverage for the correlation between the two variables, as seen in figure four. Keep in mind that what looks like a single point is actually 120 points in exactly the same position. I opted for a median substitution, as the mean is not resilient against outliers, and moving those data close to the center greatly reduces their leverage. This is especially important because each set of (24, 666, 20.2) corresponded to a target of 1.
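The substitution itself is one assignment per variable, as in the appendix code:
crime$tax[crime$tax == 666] <- median(crime$tax, na.rm = TRUE) # median is computed from the column as-is, then substituted in place
crime$rad[crime$rad == 24] <- median(crime$rad, na.rm = TRUE)
crime$ptratio[crime$ptratio == 20.2] <- median(crime$ptratio, na.rm = TRUE)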
Variable | Correlation with Target |
---|---|
zn | -0.4316818 |
indus | 0.6048507 |
chas | 0.08004187 |
nox | 0.7261062 |
rm | -0.1525533 |
age | 0.6301062 |
dis | -0.6186731 |
rad | 0.3121268 |
tax | 0.2090816 |
ptratio | 0.115086 |
lstat | 0.469127 |
medv | -0.2705507 |
Summary:
My first step in building models was to use the regsubsets() function, which finds the best combination of variables for each possible model size. In other words, if you want to create an eight-variable model, it will tell you which eight variables produce the best model according to criteria such as adjusted R-squared, Mallows' Cp, and BIC. I did this to help guide the process: if a model I built independently varied too greatly from the one suggested by regsubsets(), it might be cause to revisit my methods.
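The call itself is compact (a shorthand sketch, assuming the data frame contains only the twelve predictors plus target; the full call spelling out every predictor is in the appendix):
library(leaps)
regfit.full <- regsubsets(target ~ ., data = crime, nvmax = 12) # best subset of each size from 1 to 12 predictors
summary(regfit.full)$bic # BIC of the best model at each size; which.min() picks the overall winner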
The goal is, of course, to create a model that has high accuracy, but also one that I have high confidence will work well on the client’s data. To that end I am using well-tested backwards elimination with both logit and probit link functions, to see which produces higher accuracy against a test set.
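The appendix carries out the elimination manually, refitting after dropping the least significant variable each time; an automated equivalent, shown only for illustration and not the exact procedure used, would be:
full.fit <- glm(target ~ ., data = crime, family = binomial(link = 'logit'))
drop1(full.fit, test = 'LRT') # AIC and likelihood-ratio test for removing each predictor in turn
step(full.fit, direction = 'backward', trace = 0) # repeat removals until AIC stops improving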
Since nox is the variable most highly correlated with target, I am also interested in a forward-selected model that starts with nox and adds variables other than rad, tax, and ptratio. Since those three variables had 120 out of 466 values replaced, I am concerned about how using them will affect how well the model generalizes to the client’s data. If this model performs with close to the same accuracy as the other two, it should be the recommended model.
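One way to carry out this restricted forward step is to start from the nox-only model and ask which single addition (excluding rad, tax, and ptratio) most improves AIC; the appendix does this by commenting candidate variables in and out, but an add1() call sketches the same idea:
nox.only <- glm(target ~ nox, data = crime, family = 'binomial')
add1(nox.only, scope = ~ nox + zn + indus + chas + rm + age + dis + lstat + medv, test = 'LRT') # AIC for each candidate addition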
Eight Variable Logit Model:
Parameter | Coefficient |
---|---|
Intercept | -41.46706143 |
zn | -0.07190461 |
nox | 49.66969441 |
age | 0.02656795 |
dis | 0.60600888 |
tax | -0.01022942 |
rad | 0.88472112 |
ptratio | 0.41245041 |
medv | 0.09469977 |
Seven Variable Probit Model:
Parameter | Coefficient |
---|---|
Intercept | -21.374638936 |
nox | 26.607996613 |
age | 0.011643943 |
dis | 0.192553245 |
tax | -0.005991887 |
rad | 0.452207740 |
ptratio | 0.234360912 |
medv | 0.039826087 |
Three Variable Nox Model:
Parameter | Coefficient |
---|---|
Intercept | -18.23049857 |
nox | 32.13766020 |
dis | 0.27895617 |
zn | -0.04385274 |
That crime is linked to poverty is an observation that goes back millennia; the Roman Emperor Marcus Aurelius is credited with saying “Poverty is the mother of crime.” I expect the variable zn, the portion of land zoned for large lots, to have a negative slope, as wealthier people can afford to live on larger plots of land. Nox, which tracks air pollution, should have a positive slope, as poorer people may only be able to afford to live in more polluted areas. ‘age’ tracks the age of buildings and should have a positive slope, as upper-class people in the US tend to live in newer constructions, the so-called McMansions. ‘dis’, the weighted mean distance from employment centers, should have a positive slope, since poorer people may find themselves living further from work. ‘tax’, measuring property tax per $10,000, ought to have a negative slope: more valuable land is owned by wealthier people, who are less likely to commit crimes. ‘rad’, the index of accessibility to radial highways, should have a positive slope for a similar reason as dis: poorer people have less access to transportation. ‘ptratio’ should have a positive slope, since education funding is tied to property taxes and poorer neighborhoods cannot afford to hire as many teachers as wealthier ones. ‘medv’, which tracks the median value of owner-occupied homes, I would expect to be negative, but it is positive; it is the only variable whose sign does not match expectations. Perhaps this is because poorer neighborhoods have very few owner-occupied homes, which results in a slope that is close to zero.
Summary:
The 3-variable nox forward-selected model is recommended, as it is more likely to generalize at the reported accuracy.
I used six-fold cross-validation to test each model. The algorithm shuffles the data and iteratively divides it into 5/6 for a training set and 1/6 for a test set. It fits a model to each training set and uses it to make predictions on the corresponding test set. It then builds a confusion matrix for each test set and calculates accuracy, classification error rate (CER), precision, sensitivity, specificity, and F1, and finally takes the mean of each metric across the six folds. These means are reported in the table below; the fold-level formulas are shown after the table. The cross-validation also used the roc() function from the pROC package to generate an ROC curve and AUC for each fold.
Mean Measure | 8-var Logit | 7-var Probit | 3-var Nox |
---|---|---|---|
accuracy | 0.9117184 | 0.9031152 | 0.8559942 |
CER | 0.0882816 | 0.09688484 | 0.1440058 |
precision | 0.9170535 | 0.904981 | 0.8469928 |
sensitivity | 0.9091372 | 0.9044007 | 0.8724329 |
specificity | 0.9180123 | 0.9056955 | 0.8412602 |
F1 | 0.9117115 | 0.9027346 | 0.8582994 |
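For reference, within each fold the metrics are computed from the 2x2 confusion matrix using the same formulas as the appendix code (actual class in rows, predicted class in columns):
TN <- conf[1,1]; FN <- conf[2,1]; TP <- conf[2,2]; FP <- conf[1,2] # true/false negatives and positives
accuracy <- (TP + TN) / sum(conf)
cer <- (FP + FN) / sum(conf) # classification error rate
precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)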
We can see that the eight-variable logit (8v) model is only marginally better than the seven-variable (7v) probit model in every metric. Across folds, the area under the curve ranges from 0.9361 to 0.9913 for the 8v model and from 0.9278 to 0.9866 for the 7v model. Although the difference is marginal, the 8v model is preferred between these two.
Residuals of the 3v model; note that the two negatively sloped parallel lines are the expected behavior for a binary response.
The three-variable model (3v) underperforms the other two models on all of the above metrics by about five percentage points. The area under the curve of its ROC plots ranges from 0.9037 to 0.9733, only slightly less than the other two models. The 3v model has one distinct advantage: it is more likely to predict outcomes at the indicated accuracy than the other two.
The 8v and 7v models depend on three variables (rad, tax, ptratio) that had 26% of their values replaced with medians. Had these data been entered correctly, the coefficients and their significance might have been quite different, especially since every one of these sets corresponded to a target value of 1.
In light of these considerations, I recommend the 3v model. It performs only marginally less accurately than the others, at an accuracy that is considered very good for social-science models, and I have much more confidence that it will work at the indicated accuracy on the client’s data.
Below is the R code used in the analysis, organized by section.
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(leaps)))
suppressMessages(suppressWarnings(library(pROC)))
path <- 'C:\\Users\\Nate\\Documents\\DataSet\\crime-training-data_modified.csv'
crime <- path %>% read.csv() %>% as.data.frame()
crime %>% head()
crime %>% summary()
crime %>% sapply(sd)
for(i in 1:ncol(crime)){
nas <- crime[,i] %>% sum() %>% is.na() # sum() returns NA if the column contains any NA
print(nas)
}# Double-checking that there are no NA's; every column prints FALSE
crime$tax[crime$tax == 666] <- median(crime$tax, na.rm = TRUE)
crime$rad[crime$rad == 24] <- median(crime$rad, na.rm = TRUE)
crime$ptratio[crime$ptratio == 20.2] <- median(crime$ptratio, na.rm = TRUE)
cor(crime)
cor(crime$target,crime[,1:13])
ggplot(crime, aes(x=as.factor(target), y=zn))+
geom_violin(fill='red2', alpha=0.2)+
geom_boxplot(width=0.025)
ggplot(crime, aes(x=as.factor(target), y=indus))+
geom_violin(fill='orangered', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=nox))+
geom_violin(fill='orange', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=rm))+
geom_violin(fill='goldenrod', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=age))+
geom_violin(fill='yellow', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=dis))+
geom_violin(fill='yellowgreen', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=rad))+
geom_violin(fill='green', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=tax))+
geom_violin(fill='cyan', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=ptratio))+
geom_violin(fill='blue', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=lstat))+
geom_violin(fill='blueviolet', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=as.factor(target), y=medv))+
geom_violin(fill='violet', alpha=0.2)+
geom_boxplot(width=0.1)
ggplot(crime, aes(x=target))+
geom_bar(fill="blue")
ggplot(crime, aes(x=zn))+
geom_histogram(fill="goldenrod")
ggplot(crime, aes(x=indus))+
geom_histogram(fill="skyblue")
ggplot(crime, aes(x=nox))+
geom_histogram(fill="slateblue1")
ggplot(crime, aes(x=rm))+
geom_histogram(fill="steelblue")
ggplot(crime, aes(x=age))+
geom_histogram(fill="steelblue2")
ggplot(crime, aes(x=dis))+
geom_histogram(fill="royalblue")
ggplot(crime, aes(x=rad))+
geom_bar(fill="lightblue")
ggplot(crime, aes(x=tax))+
geom_histogram(fill="steelblue2")
ggplot(crime, aes(x=ptratio))+
geom_histogram(fill="sienna")
ggplot(crime, aes(x=lstat))+
geom_histogram(fill="olivedrab")
ggplot(crime, aes(x=medv))+
geom_histogram(fill="olivedrab2")
ggplot(crime, aes(x=chas))+
geom_bar(fill="olivedrab3")
regfit.full=regsubsets(target~zn+
indus+
chas+
nox+
rm+
age+
dis+
rad+
tax+
ptratio+
lstat+
medv,data=crime, nvmax = 12)
summary(regfit.full)
nox.fit <- glm(target ~ nox, data=crime,family = 'binomial')
summary(nox.fit)
ggplot(crime, aes(x=nox,y=target))+
geom_point(color='red')+
geom_smooth(method = 'glm', method.args = list(family = "binomial"),
se = FALSE)
all.fit <- glm(target ~., data=crime,family = 'binomial')
summary(all.fit)
#remove rm
el.fit <- glm(target ~zn+
indus+
chas+
nox+
age+
dis+
rad+
tax+
ptratio+
lstat+
medv, data=crime,family = 'binomial')
summary(el.fit)
#remove chas
ten.fit <- glm(target ~zn+
indus+
nox+
age+
dis+
rad+
tax+
ptratio+
lstat+
medv, data=crime,family = 'binomial')
summary(ten.fit)
#remove indus
nin.fit <- glm(target ~zn+
nox+
age+
dis+
rad+
tax+
ptratio+
lstat+
medv, data=crime,family = 'binomial')
summary(nin.fit)
#remove lstat
eig.fit <- glm(target ~zn+
nox+
age+
dis+
rad+
tax+
ptratio+
medv, data=crime,family = 'binomial')
summary(eig.fit)
#This model gives the lowest AIC, and matches the best model with 8 predictors from the regsubsets() function.
#remove zn to see if AIC continues to decrease
sev.fit <- glm(target ~
nox+
age+
dis+
rad+
tax+
ptratio+
medv, data=crime,family = 'binomial')
summary(sev.fit)
# removing 'zn' caused a sharp increase in AIC from 215.32 to 219.45
# Six-fold cross-validation using the eight-variable logit model:
# Set random seed. Don't remove this line.
set.seed(1)
# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <- rep(0,6)
for (i in 1:6) {
# These indices indicate the interval of the test set
indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
# Exclude them from the train set
train <- shuffled[-indices,]
# Include them in the test set
test <- shuffled[indices,]
pred <- rep(0,nrow(test))
# A model is learned using each training set
fit <- glm(target ~zn+
nox+
age+
dis+
rad+
tax+
ptratio+
medv, data=train,family = 'binomial' (link = 'logit'))
# Make predictions on the test set with the fitted model
prob <- predict.glm(fit,test,type = 'response')
prob[is.na(prob)] <- 0
for(j in 1:length(prob)){
if(prob[j] >0.5){
pred[j] = 1
}
else if (prob[j]< 0.5){
pred[j] = 0
}
}
roc1 <- roc(test$target,prob)
plot(roc1)
print(roc1)
# Assign the confusion matrix to conf
conf <- table(test$target,pred)
print(conf)
TN <- conf[1,1]
TN # True Negatives
FN <- conf[2,1]
FN # False Negatives
TP <- conf[2,2]
TP # True Positives
FP <- conf[1,2]
FP # False Positives
# Assign the accuracy of this model to the ith index in accs
accs[i] <- sum(diag(conf))/sum(conf)
cers[i] <- (FP+FN)/sum(conf)
pres[i] <- (TP)/(TP+FP)
sens[i] <- (TP)/(TP+FN)
spes[i] <- (TN)/(TN+FP)
f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}
# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
prb_fit <- glm(target ~ ., family = 'binomial' (link = 'probit'), data=crime)
summary(prb_fit)
# The probit model yields only 7 statistically significant variables and has a higher AIC
prb2_fit <- glm(target ~
#zn+
#indus+
#chas+
nox+
age+
dis+
rad+
tax+
ptratio+
#lstat+
medv, family = 'binomial' (link = 'probit'), data=crime)
summary(prb2_fit)
# Set random seed.
set.seed(1)
# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <- rep(0,6)
for (i in 1:6) {
# These indices indicate the interval of the test set
indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
# Exclude them from the train set
train <- shuffled[-indices,]
# Include them in the test set
test <- shuffled[indices,]
pred <- rep(0,nrow(test))
# A model is learned using each training set
fit <- glm(target ~
nox+
age+
dis+
rad+
tax+
ptratio+
medv, family = 'binomial' (link = 'probit'), data=train)
# Make predictions on the test set with the fitted model
prob <- predict.glm(fit,test,type = 'response')
prob[is.na(prob)] <- 0
for(j in 1:length(prob)){
if(prob[j] >0.5){
pred[j] = 1
}
else if (prob[j]< 0.5){
pred[j] = 0
}
}
roc1 <- roc(test$target,prob)
plot(roc1)
print(roc1)
# Assign the confusion matrix to conf
conf <- table(test$target,pred)
print(conf)
TN <- conf[1,1]
TN # True Negatives
FN <- conf[2,1]
FN # False Negatives
TP <- conf[2,2]
TP # True Positives
FP <- conf[1,2]
FP # False Positives
# Assign the accuracy of this model to the ith index in accs
accs[i] <- sum(diag(conf))/sum(conf)
cers[i] <- (FP+FN)/sum(conf)
pres[i] <- (TP)/(TP+FP)
sens[i] <- (TP)/(TP+FN)
spes[i] <- (TN)/(TN+FP)
f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}
# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
# For this model I want to avoid tax, rad, and ptratio since they had so many bad values
# I initially ran nox alone through the 6-fold validation and got 81% accuracy
# How many variables can I add that improve that without over-fitting?
nx_fit <- glm(target ~
nox+
#lstat +
dis +
#medv+
#age +
#indus+
#rm+
#chas+
zn, family = 'binomial', data=crime)
summary(nx_fit)
# Set random seed.
set.seed(1)
# Initialize the accs vector
n <- nrow(crime)
shuffled <- crime[sample(n),]
accs <- rep(0,6)
cers <- rep(0,6)
pres <- rep(0,6)
sens <- rep(0,6)
spes <- rep(0,6)
f1s <- rep(0,6)
for (i in 1:6) {
# These indices indicate the interval of the test set
indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
# Exclude them from the train set
train <- shuffled[-indices,]
# Include them in the test set
test <- shuffled[indices,]
pred <- rep(0,nrow(test))
# A model is learned using each training set
fit <- glm(target ~ nox+
dis +
zn
, family = 'binomial' (link = 'logit'), data=train)
# Make predictions on the test set with the fitted model
prob <- predict.glm(fit,test,type = 'response')
prob[is.na(prob)] <- 0
for(j in 1:length(prob)){
if(prob[j] >0.5){
pred[j] = 1
}
else if (prob[j]< 0.5){
pred[j] = 0
}
}
roc1 <- roc(test$target,prob)
plot(roc1)
print(roc1)
# Assign the confusion matrix to conf
conf <- table(test$target,pred)
print(conf)
TN <- conf[1,1]
TN # True Negatives
FN <- conf[2,1]
FN # False Negatives
TP <- conf[2,2]
TP # True Positives
FP <- conf[1,2]
FP # False Positives
# Assign the accuracy of this model to the ith index in accs
accs[i] <- sum(diag(conf))/sum(conf)
cers[i] <- (FP+FN)/sum(conf)
pres[i] <- (TP)/(TP+FP)
sens[i] <- (TP)/(TP+FN)
spes[i] <- (TN)/(TN+FP)
f1s[i] <- (2*pres[i]*sens[i])/(pres[i]+sens[i])
}
# Print out the mean of accs
mean(accs)
mean(cers)
mean(pres)
mean(sens)
mean(spes)
mean(f1s)
plot(resid(eig.fit) ~ fitted(eig.fit))
plot(resid(prb2_fit) ~ fitted(prb2_fit))
plot(resid(nx_fit) ~ fitted(nx_fit))
path <- "C:\\Users\\Nate\\Documents\\DataSet\\crime-evaluation-data_modified.csv"
crime_eval <- path %>% read.csv() %>% as.data.frame()
head(crime_eval)
summary(crime_eval)
ev_prd <- predict.glm(nx_fit, crime_eval,type = 'response')
ev_prd
pred <- rep(0, length(ev_prd)) # initialize predictions for the evaluation set (don't reuse the vector from the CV loop)
for(j in 1:length(ev_prd)){
if(ev_prd[j] >0.5){
pred[j] = 1
}
else if (ev_prd[j]< 0.5){
pred[j] = 0
}
}
crm_ev <- data.frame(cbind(crime_eval$INDEX,pred))
write.csv(crm_ev, file="NCooper_PredictedCrime.csv")