The dataset contains information about crime in various neighborhoods of a major city (Boston).
The available predictor variables include:
| Num | Variable | Description |
|---|---|---|
| 1. | zn | proportion of residential land zoned for large lots (over 25,000 square feet) |
| 2. | indus | proportion of non-retail business acres per suburb |
| 3. | chas | a dummy variable for whether the suburb borders the Charles River (1) or not (0) |
| 4. | nox | nitrogen oxides concentration (parts per 10 million) |
| 5. | rm | average number of rooms per dwelling |
| 6. | age | proportion of owner-occupied units built prior to 1940 |
| 7. | dis | weighted mean of distances to five Boston employment centers |
| 8. | rad | index of accessibility to radial highways |
| 9. | tax | full-value property-tax rate per $10,000 |
| 10. | ptratio | pupil-teacher ratio by town |
| 11. | lstat | lower status of the population (percent) |
| 12. | medv | median value of owner-occupied homes in $1000s |
The project description indicated that one more predictor variable should have been included:
13. black: \(1000 \cdot (Bk - 0.63)^2\), where \(Bk\) is the proportion of blacks by town
However, this feature was not found in the dataset, so it had apparently been removed before the dataset was made available to us.
The target variable is binary, set to 0 or 1, indicating whether the crime rate in the neighborhood is above the median crime rate (1 = High Crime Rate) or below (0 = Low Crime Rate).
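A minimal sketch of how such a target can be constructed, assuming a hypothetical raw per-capita crime-rate column crim (the raw rate is not included in the modified dataset; the values below are purely illustrative):

# Hypothetical illustration only: 'crim' is not in the modified dataset
crim <- c(0.02, 0.08, 0.15, 0.63, 3.67, 9.92)  # example per-capita crime rates
target <- as.integer(crim > median(crim))      # 1 = above the median (high crime)
target
## [1] 0 0 0 1 1 1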
Here are the first 10 observations from the training dataset:
| | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
| 2 | 0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
| 3 | 0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
| 4 | 30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
| 5 | 0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
| 6 | 0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 5 | 384 | 20.9 | 7.67 | 26.5 | 0 |
| 7 | 0 | 18.10 | 0 | 0.693 | 5.453 | 100.0 | 1.4896 | 24 | 666 | 20.2 | 30.59 | 5.0 | 1 |
| 8 | 0 | 18.10 | 0 | 0.693 | 4.519 | 100.0 | 1.6582 | 24 | 666 | 20.2 | 36.98 | 7.0 | 1 |
| 9 | 0 | 5.19 | 0 | 0.515 | 6.316 | 38.1 | 6.4584 | 5 | 224 | 20.2 | 5.68 | 22.2 | 0 |
| 10 | 80 | 3.64 | 0 | 0.392 | 5.876 | 19.1 | 9.2203 | 1 | 315 | 16.4 | 9.25 | 20.9 | 0 |
## [1] "Number of columns (features) = 13"
## [1] "Number of rows (observations) = 466"
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
## Factor w/ 2 levels "LowCrime","HighCrime": 2 2 2 1 1 1 2 2 1 1 ...
## LowCrime HighCrime
## 237 229
## [1] "Number of columns with missing values = 0"
The class counts above show no significant imbalance in the target variable: 237 observations have a low crime rate and 229 have a high crime rate.
To gauge variable importance, we ran the Boruta function:
## Boruta performed 99 iterations in 28.514 secs.
## 11 attributes confirmed important: age, dis, indus, lstat, medv and 6 more;
## No attributes deemed unimportant.
## 1 tentative attributes left: chas;
Since chas is very close to the shadow (random) variables, its contribution to the model is questionable.

References:

- Kursa, Jankowski, Rudnicki: "Boruta – A System for Feature Selection", Fundamenta Informaticae 101 (2010), 271–285, available at https://pdfs.semanticscholar.org/85a8/b1d9c52f9f795fda7e12376e751526953f38.pdf
- Kursa, Rudnicki: "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11 (September 2010), available at https://www.jstatsoft.org/index.php/jss/article/view/v036i11/v36i11.pdf
- Kursa: "Boruta for those in a hurry", CRAN vignette, available at https://cran.r-project.org/web/packages/Boruta/vignettes/inahurry.pdf
The scatter plots of age vs. dis, indus vs. dis, and nox vs. dis show that aged housing, non-retail business acreage, and nitrogen oxides concentration (nox) are all elevated near Boston employment centers. All four of these variables are correlated (raw distance is negatively correlated with the other three, so "near" is the inverse), which can also be seen from the correlation plots above.
Let's add an interaction term to our dataset to account for the cumulative effect of these four input variables.
We will define the interaction term using the following formula: \(intterm = \sqrt{\frac{dis}{nox \cdot indus} + age}\).
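In R this is computed as follows (mirroring the computation in the appendix code):

# square root of (distance scaled down by pollution and industry) plus housing age
train.df$intterm <- sqrt(train.df$dis / (train.df$nox * train.df$indus) + train.df$age)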
Let's see how the newly added interaction term helps us categorize the target variable:
Higher values of intterm generally correspond to the high-crime observations.

Next, we transform the rad variable (index of accessibility to radial highways) into three buckets using round(log(1 + rad)). Let's see how the crime rate is divided into these three buckets after we add the newly transformed rad variable:
Let's see how the transformed rad variable helps us categorize the data:
(Y-values of 1, 2, 3 have been “jittered” for display purposes)
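For reference, the bucket boundaries implied by round(log(1 + rad)) are sketched below; the cut points are derived here from the rounding itself and are not stated in the original:

# Buckets implied by round(log(1 + rad)):
#   1 (low access):    rad in 1..3    (log(1 + rad) rounds to 1)
#   2 (medium access): rad in 4..11   (log(1 + rad) rounds to 2)
#   3 (high access):   rad in 12..32  (log(1 + rad) rounds to 3)
sapply(c(1, 5, 24), function(r) round(log(1 + r)))
## [1] 1 2 3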
Next, we bucketize the tax rate variable, again using round(log(1 + tax)) and subtracting 4 so that the buckets are labeled 1, 2, 3. There are no observations with tax rates between 470 and 664, but if there were, they would be bucketed into the medium category. Let's see how the crime rate is divided into these three buckets after we transform the tax rate variable:
This makes sense, since areas with a higher tax rate have higher-value properties. The data indicate that the crime rate tends to be higher in the areas where the property values are higher.
(Y-values of 1, 2, 3 have been “jittered” for display purposes)
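For reference, the bucket boundaries implied by round(log(1 + tax)) − 4, again derived from the rounding and not stated in the original:

# Buckets implied by round(log(1 + tax)) - 4:
#   1 (low tax):    tax in   90..243   (log(1 + tax) rounds to 5)
#   2 (medium tax): tax in  244..664   (log(1 + tax) rounds to 6)
#   3 (high tax):   tax in  665..1807  (log(1 + tax) rounds to 7)
sapply(c(193, 403, 666), function(t) round(log(1 + t)) - 4)
## [1] 1 2 3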
Next, we bucketize the lstat (lower status percent) variable using round(log(lstat)), which yields four buckets. Let's see how the crime rate is divided into these four buckets after we transform the lstat variable:
(Y-values of 1, 2, 3, 4 have been “jittered” for display purposes)
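For reference, the approximate bucket boundaries implied by round(log(lstat)), derived from the rounding and not stated in the original:

# Buckets implied by round(log(lstat)), lstat in percent:
#   1 (very low): lstat in about [1.65, 4.48)
#   2 (low):      lstat in about [4.48, 12.18)
#   3 (medium):   lstat in about [12.18, 33.12)
#   4 (high):     lstat in about [33.12, 90.02)
sapply(c(3.70, 5.19, 26.82, 36.98), function(s) round(log(s)))
## [1] 1 2 3 4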
Finally, we bucketize the zn variable (zoning for large residential lots) into two buckets. Let's see how the crime rate is divided into these two buckets after we transform the zn variable:
The above illustrates that higher crime rates occur in areas that are not zoned for large residential lots.
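For reference, here is how the transform round(log(1 + sqrt(zn)))/2 + 1 maps the zn values that actually occur in the data (0, or 12.5 and above) onto two buckets; the boundary reasoning below is derived here, not stated in the original:

# zn = 0 (no large lots): round(log(1 + 0))/2 + 1 = 0/2 + 1 = 1
# zn >= 12.5:             sqrt(zn) >= 3.54, so log(1 + sqrt(zn)) lies in
#                         [1.51, 2.40], which rounds to 2, giving 2/2 + 1 = 2
sapply(c(0, 12.5, 100), function(z) round(log(1 + sqrt(z)))/2 + 1)
## [1] 1 2 2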
Create binary logistic regression models, using different combinations of (transformed) variables.
Create functions to perform cross-validation of each model, and to assess model metrics.
The base model uses the eight predictors below:

- rad: Index of accessibility to radial highways
- tax: Full-value property-tax rate per $10,000
- lstat: Lower status of the population (percent)
- age: Proportion of owner-occupied units built prior to 1940
- indus: Proportion of non-retail business acres per suburb
- nox: Nitrogen oxides concentration
- zn: Proportion of residential land zoned for large lots
- dis: Weighted mean of distances to five Boston employment centers

model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis", train.df)
print.model.matrix("Base Model", model.metrix)
## [1] "Printing Metrics for model: Base Model"
## [1] "AUC : 0.953606020208308"
## [1] "Accuracy : 0.87741935483871"
## [1] "Recall : 0.85940805394676"
## [1] "Precision : 0.884431012250161"
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm", train.df)
print.model.matrix("Model with interaction term", model.metrix)
## [1] "Printing Metrics for model: Model with interaction term"
## [1] "AUC : 0.957999559963697"
## [1] "Accuracy : 0.898924731182796"
## [1] "Recall : 0.898792043940506"
## [1] "Precision : 0.893309572855849"
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm + radtrans + taxtrans", train.df)
print.model.matrix("Model with transformed variable term", model.metrix)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "Printing Metrics for model: Model with transformed variable term"
## [1] "AUC : 0.974811756641251"
## [1] "Accuracy : 0.926881720430108"
## [1] "Recall : 0.937286211490877"
## [1] "Precision : 0.914603635922785"
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + medv+rm+ptratio +chas", train.df)
print.model.matrix("Model with all original variables", model.metrix)
## [1] "Printing Metrics for model: Model with all original variables"
## [1] "AUC : 0.961738716946576"
## [1] "Accuracy : 0.896774193548387"
## [1] "Recall : 0.880492903173179"
## [1] "Precision : 0.906328407368241"
Finally, we drop the chas variable: Boruta confirmed the importance of all other variables in the train.df dataset, with the exception of chas, which was "tentative", i.e., not as important as the other variables.

model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + medv+rm+ptratio", train.df)
print.model.matrix("Model with all original variables but chas", model.metrix)
## [1] "Printing Metrics for model: Model with all original variables but chas"
## [1] "AUC : 0.959861242646322"
## [1] "Accuracy : 0.901075268817204"
## [1] "Recall : 0.885691464774179"
## [1] "Precision : 0.911204960453081"
Removing chas improves the model performance. The best two models are compared in the bar chart below.
Both models have 11 variables.
The interaction term and transformed variables brought higher accuracy, AUC, and recall to the model, while the precision values remain very close.
The discussion point here is that, since the interaction term and transformed variables increase the complexity of the model, we should work with domain experts to determine whether the gain in predictive performance justifies the added complexity.
training = train.df
testing = read.csv("./Data/crime-evaluation-data_modified.csv", stringsAsFactors = FALSE)
testing$intterm = ((testing$dis / (testing$nox * testing$indus)) + testing$age) ^ 0.5
testing$radtrans = round(log(1+testing$rad))
testing$taxtrans = round(log(1+testing$tax))-4
model.formula = "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm + radtrans + taxtrans"
model <- glm(formula = model.formula,
family = "binomial", data = training)
predicted <- predict(model, newdata = testing,type="response")
labels = ifelse(predicted > 0.5, 1, 0)  # classify at the 0.5 probability threshold
testing$targetproba = round(predicted,1)
testing$target = labels
testing = data.frame(testing)
write.table(testing, "./PredictedOutcome_v2.csv", row.names = FALSE, sep=",")
options(scipen = 999, digits=6, width=150)
library(dplyr)
library(ggplot2)
library(corrplot)
library(Hmisc)
library(ROCR)
library(parallel)
library(Boruta)
library(kableExtra)
# Set working directory and load the training data
mydir = "C:/Users/Michael/Dropbox/priv/CUNY/MSDS/201909-Fall/DATA621_Nasrin/20191030_HW3_Crime/HW-3"
setwd(mydir)
train.df = read.csv("./Data/crime-training-data_modified.csv")
# 1. Data Exploration
## **Load Data**
numhead = 10
head(train.df,n = numhead) %>%
kable(row.names = 1:numhead) %>%
kable_styling(c("striped", "bordered"))
print(paste("Number of columns (features) = ", ncol(train.df)))
print(paste("Number of rows (observations) = ", nrow(train.df)))
#### **Let's analyze the datatypes of each column:**
str(train.df)
targetAsFactor <- factor(train.df$target, labels=c("LowCrime","HighCrime"))
str(targetAsFactor)
summary(targetAsFactor)
## **Check missing values**
miss.cols = apply(train.df, 2, function(x) any(is.na(x)))
print(paste("Number of columns with missing values = ", sum(miss.cols)))
## **Check Class Imbalance**
ggplot(train.df, aes(target, fill = targetAsFactor)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=1) +
labs(title="Bar Chart: Counts for target variable",
caption="Source: Crime dataset for major city (Boston)")
## **Boruta: Variable Importance**
set.seed(1)
Bor.hvo <- Boruta(target ~ . , data=train.df)
print(Bor.hvo)
plot(Bor.hvo, cex.axis=0.75, las=2, main="Boruta algorithm for Feature Selection")
## **Check variable correlation**
res2<-rcorr(as.matrix(train.df))
respearson=rcorr(as.matrix(train.df),type = "pearson")
resspearman=rcorr(as.matrix(train.df),type = "spearman")
res3 <- cor(as.matrix(train.df))
#### Pearson correlation
corrplot(corr = respearson$r, type = "upper", outline = T, order="hclust",
p.mat = respearson$P, sig.level = 0.05, insig = "blank", addCoef.col = "black",
title = "\nCorrelation (Pearson)",
number.cex = 0.9, number.font = 1, number.digits = 2 )
#### Spearman rank correlation
corrplot(corr = resspearman$r, type = "upper", outline = T, order="hclust",
p.mat = resspearman$P, sig.level = 0.05, insig = "blank", addCoef.col = "black",
title = "\nRank Correlation (Spearman)",
number.cex = 0.9, number.font = 1, number.digits = 2)
#### actual correlations (not rank correlations)
corrplot(corr = res3, type = "upper", outline = T, order="hclust",
sig.level = 0.05, insig = "blank", addCoef.col = "black",
title = "\nActual correlations (not rank correlations)",
number.cex = 0.9, number.font = 1, number.digits = 2 )
## **Look for interactions**
### **Scatter plot of age vs. dis**
ggplot(train.df, aes(x = dis, y=age, color=targetAsFactor)) +
geom_point() + ggtitle("Percentage of aged housing vs. Distance from employment centers")
### **Scatter plot of indus vs. dis**
ggplot(train.df, aes(x = dis, y=indus, color=targetAsFactor)) +
geom_point() + ggtitle("Proportion of non-retail business acres per suburb vs. \nDistance from employment centers")
### **Scatter plot of nox vs. dis**
ggplot(train.df, aes(x = dis, y=nox, color=targetAsFactor)) +
geom_point() + ggtitle("Nitrogen Oxides Concentration vs. \nDistance from employment centers")
# 2. Data Preparation
## **Add Interaction Term**
train.df$intterm = ((train.df$dis / (train.df$nox * train.df$indus)) + train.df$age) ^ 0.5
### **Scatter plot showing effect of newly added interaction term for categorizing data**
ggplot(train.df, aes(x = seq(1:nrow(train.df)), y=intterm, color=targetAsFactor)) +
geom_point() + ggtitle("Interaction term for each of the 466 cases") + labs(x = "Each case in training dataset (n=466)")
## **Bucketize rad variable**
train.df$radtrans = round(log(1+train.df$rad))
ggplot(train.df, aes(x=radtrans, fill=targetAsFactor))+
geom_bar(position=position_dodge()) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3.5) +
ggtitle(label="Crime Rate bucketed by accessibility to radial highways",
subtitle="1: low access; 2: medium access; 3: high access")
ggplot(train.df, aes(x = seq(1:nrow(train.df)), y=radtrans, color=targetAsFactor)) +
geom_point() +
geom_jitter(width = 0, height = 0.1) +
ggtitle("Bucketed radial highway accessibility for each of the 466 cases") +
labs(x = "Each case in training dataset (n=466)",
y = "Radial highway accessibility: 1=low / 2=medium / 3=high")
## **Bucketize tax rate variable**
# round(log(1+tax)) gives results in the set {5,6,7}
# So, subtracting 4 from the above makes the results {1,2,3} which makes more sense.
train.df$taxtrans = round(log(1+train.df$tax))-4
ggplot(train.df, aes(x=taxtrans, fill=targetAsFactor))+
geom_bar(position=position_dodge()) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3.5) +
ggtitle(label="Crime Rate bucketed by property tax rate",
subtitle="1: low tax; 2: medium tax; 3: high tax")
ggplot(train.df, aes(x = seq(1:nrow(train.df)), y=taxtrans, color=targetAsFactor)) +
geom_point() +
geom_jitter(width = 0, height = 0.1) +
ggtitle("Bucketed tax rate for each of the 466 cases") +
labs(x = "Each case in training dataset (n=466)",
y = "tax rate: 1=low / 2=medium / 3=high")
## **Bucketize "lstat" (lower status percent) variable**
train.df$lstattrans = round(log(train.df$lstat))
ggplot(train.df, aes(x=lstattrans, fill=targetAsFactor))+
geom_bar(position=position_dodge()) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3.5) +
ggtitle(label="Crime Rate bucketed by transformed 'lower status' rate",
subtitle="1: very low; 2: low; 3: medium; 4: high")
ggplot(train.df, aes(x = seq(1:nrow(train.df)), y=lstattrans, color=targetAsFactor)) +
geom_point() +
geom_jitter(width = 0, height = 0.1) +
ggtitle("Bucketed 'lstat' ('lower Status') rate for each of the 466 cases") +
labs(x = "Each case in training dataset (n=466)",
y = "lstat (lower status) rate: 1=very low / 2=low / 3=medium / 4=high")
## **Bucketize ZN variable** (zoning for large lots)
train.df$zntrans = round(log(1+train.df$zn^0.5))/2+1
ggplot(train.df, aes(x=as.factor(zntrans), fill=targetAsFactor))+
geom_bar(position=position_dodge()) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, color="black",
position = position_dodge(0.9), size=3.5) +
ggtitle(label="Crime Rate bucketed by transformed 'ZN' (zoning) rate") +
xlab("1: No large lots ; 2: Large lots")
ggplot(train.df, aes(x = seq(1:nrow(train.df)), y=zntrans, color=targetAsFactor)) +
geom_point() +
geom_jitter(width = 0, height = 0.1) +
scale_y_continuous(name="ZN (zoning): 1=No large lots; 2=Large lots", breaks=c(1,2)) +
ggtitle("Bucketed 'ZN' (zoning) for each of the 466 cases") +
labs(x = "Each case in training dataset (n=466)")
# 3. Build Models
## **Model fitting and evaluation**
cross.validation <- function(K, model.formula, LCdata) {
auclist = NULL
accuracylist = NULL
recalllist = NULL
precisionlist = NULL
fold.size <- nrow(LCdata) / K
for(k in 1:K){
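# rows of the k-th validation fold; note that when nrow(LCdata)/K is not an
# integer these indices are fractional, and R truncates them when subsetting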
testing.index <- (k-1)*fold.size + 1:fold.size
training <- LCdata[-testing.index,]
testing.org <- LCdata[testing.index,]
testing = testing.org[, names(testing.org) != 'target']  # drop the target column before predicting
model <- glm(formula = model.formula,
family = "binomial", data = training)
predicted <- predict(model, newdata = testing,type="response")
pred <- prediction(predicted, testing.org$target)
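# index of the ROCR cutoff closest to 0.5, used below to read accuracy,
# precision and recall at the 0.5 classification threshold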
ind = which.max(round(slot(pred, 'cutoffs')[[1]],1) == 0.5)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values)
auclist[k] = auc
acc.perf = performance(pred, measure = "acc")
acc = slot(acc.perf, "y.values")[[1]][ind]
accuracylist[k] = acc
prec.perf = performance(pred, measure = "prec")
prec = slot(prec.perf, "y.values")[[1]][ind]
precisionlist[k] = prec
recall.perf = performance(pred, measure = "tpr")
recall = slot(recall.perf, "y.values")[[1]][ind]
recalllist[k] = recall
}
return(list("AUC" = mean(auclist), "Accuracy" = mean(accuracylist), "Recall" = mean(recalllist), "Precision" = mean(precisionlist)))
}
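# global accumulator collecting the metrics of every model (filled by print.model.matrix)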
df.metrix <<- NULL
print.model.matrix = function(model.name, matrixobj)
{
print(paste("Printing Metrics for model: ", model.name))
for(i in 1 : length(matrixobj))
{
df = data.frame("Model" = model.name, "Metrix"=names(matrixobj)[[i]], "Value" = matrixobj[[i]])
df.metrix <<- rbind(df, df.metrix)
print(paste(names(matrixobj)[[i]], ":", matrixobj[[i]]))
}
}
## **Cross Validation**
## **1. Base Model**
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis", train.df)
print.model.matrix("Base Model", model.metrix)
## **2. Model with Interaction Term**
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm", train.df)
print.model.matrix("Model with interaction term", model.metrix)
## **3. Model with transformed variables**
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm + radtrans + taxtrans", train.df)
print.model.matrix("Model with transformed variable term", model.metrix)
## **4. Model including all original variables**
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + medv+rm+ptratio +chas", train.df)
print.model.matrix("Model with all original variables", model.metrix)
## **5. Model with all original variables except chas**
model.metrix = cross.validation(5, "target~ rad+tax+lstat+age+indus+nox+zn+dis + medv+rm+ptratio", train.df)
print.model.matrix("Model with all original variables but chas", model.metrix)
# 4. Select Models
## Model #3 is best
ggplot(df.metrix, aes(x = Model, y=Value, fill=Metrix)) +
geom_bar(stat='identity', position=position_dodge()) +
labs(title="Bar Chart : Model performance evaluation",
caption="Source: Crime Rate dataset of a major city") +
geom_text(aes(label=Value), vjust=0.25, hjust=1.1, color="black",
position = position_dodge(0.9), size=3.5) +
coord_flip()
### **Discussion:**
selected.metrix <- df.metrix[df.metrix$Model %in% c('Model with all original variables but chas',
'Model with transformed variable term'), ]
ggplot(selected.metrix, aes(x = Metrix, y=Value, fill=Model)) +
geom_bar(stat='identity', position=position_dodge()) +
labs(title="Best 2 models performance evaluation",
caption="Source: Crime Rate dataset of a major city") +
geom_text(aes(label=Value), vjust=0.25, hjust=1.1, color="black",
position = position_dodge(0.9), size=3.5) +
coord_flip()
# 5. Appendix
### **Fit model and predict on evaluation dataset**
training = train.df
testing = read.csv("./Data/crime-evaluation-data_modified.csv", stringsAsFactors = FALSE)
testing$intterm = ((testing$dis / (testing$nox * testing$indus)) + testing$age) ^ 0.5
testing$radtrans = round(log(1+testing$rad))
testing$taxtrans = round(log(1+testing$tax))-4
model.formula = "target~ rad+tax+lstat+age+indus+nox+zn+dis + intterm + radtrans + taxtrans"
model <- glm(formula = model.formula,
family = "binomial", data = training)
predicted <- predict(model, newdata = testing,type="response")
labels = ifelse(predicted > 0.5, 1, 0)  # classify at the 0.5 probability threshold
testing$targetproba = round(predicted,1)
testing$target = labels
testing = data.frame(testing)
write.table(testing, "./PredictedOutcome_V2.csv", row.names = FALSE, sep=",")