1. Data Exploration

The dataset of interest for this report is the crime rate of Boston area neighbourhood. This analyitc report will explore the different variables that affect whether a neighbourhood is above or below the median crime rate of the whole Boston area. The dataset is stored in a github repostiory and read into R programing environment first. The first 6 rows of the data given are as below.The target variable is under “target” column that indicates whether the crime rate in the neigborhood is above the median crime rate or not.

##   zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0

A corresponding short description of the predictors variables are given as below. Further analysis will show that some of the variables do not effect the predictive power of the models at all. The interesting variables are the nitrogen oxides concentration which is caused by high traffic. Another unusual variable is ‘chas’ which indicates whether a neighborhood is close to Charles River. In another word, the river is acting as a border dividing the neighborhood into several distinct groups. This is proven to be not the case in this analysis.

##    Columns_Name                                   Short_Desc
## 1            zn             Residential Land Zone Proportion
## 2         indus                     Non-Retail Business Acre
## 3          chas                   borders the Charles River 
## 4           nox               nitrogen oxides concentration 
## 5            rm                      average number of rooms
## 6           age           proportion buildings prior to 1940
## 7           dis  distances to five Boston employment centers
## 8           rad            accessibility to radial highways 
## 9           tax                           property-tax rate 
## 10      ptratio                         pupil-teacher ratio 
## 11        lstat              lower status of the population 
## 12         medv        median value of owner-occupied homes 
## 13       target              crime rate is above the median

The data is checked to see if it contains any blank or NA values. As there was none, no imputation or inference was needed for this dataset. Below is the output that confirm the absent of NA values from the dataset.

## [1] na_count  col_names
## <0 rows> (or 0-length row.names)

A correlation plot was generated to examine the heteroscadasity, multicollinearity and autocorrelation among the variables. As below plot shows some variable already shown high correlation. For example; property tax rate was correlated with a neighborhood being close to radial highways. Radial highway is literal terms for highway that link directly to urban center. We might be able to assume high tax rate area is usually correlated with affluent neighborhood with better infrastrutures.These two variables also show higher correlation with the target variable. The suprising note was that the distant to employment centers was negatively correlated with high crime area indicator or the target variable.

Below is a mosaic of small bar charts that explore the distribution of each variables. As easily imagined, the target variable is a binary variable that only have two possible values of 1 or 0 with 1 meaning the area’s crimerate is above the median and 0 being the opposite. The “dis”, “age” and “rad” variable all show skewdness in the distribution thus are good candidate for data transformation.

Below table shows the correlation and statistical significant indication between variables in term of numbers. Charles river border as a variable is pretty insignificant for its correlation with other variables with it’s p-value greater than 0.05 in most cases.

##            zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## zn       1.00 -0.54 -0.04 -0.52  0.32 -0.57  0.66 -0.32 -0.32   -0.39
## indus   -0.54  1.00  0.06  0.76 -0.39  0.64 -0.70  0.60  0.73    0.39
## chas    -0.04  0.06  1.00  0.10  0.09  0.08 -0.10 -0.02 -0.05   -0.13
## nox     -0.52  0.76  0.10  1.00 -0.30  0.74 -0.77  0.60  0.65    0.18
## rm       0.32 -0.39  0.09 -0.30  1.00 -0.23  0.20 -0.21 -0.30   -0.36
## age     -0.57  0.64  0.08  0.74 -0.23  1.00 -0.75  0.46  0.51    0.26
## dis      0.66 -0.70 -0.10 -0.77  0.20 -0.75  1.00 -0.49 -0.53   -0.23
## rad     -0.32  0.60 -0.02  0.60 -0.21  0.46 -0.49  1.00  0.91    0.47
## tax     -0.32  0.73 -0.05  0.65 -0.30  0.51 -0.53  0.91  1.00    0.47
## ptratio -0.39  0.39 -0.13  0.18 -0.36  0.26 -0.23  0.47  0.47    1.00
## lstat   -0.43  0.61 -0.05  0.60 -0.63  0.61 -0.51  0.50  0.56    0.38
## medv     0.38 -0.50  0.16 -0.43  0.71 -0.38  0.26 -0.40 -0.49   -0.52
## target  -0.43  0.60  0.08  0.73 -0.15  0.63 -0.62  0.63  0.61    0.25
##         lstat  medv target
## zn      -0.43  0.38  -0.43
## indus    0.61 -0.50   0.60
## chas    -0.05  0.16   0.08
## nox      0.60 -0.43   0.73
## rm      -0.63  0.71  -0.15
## age      0.61 -0.38   0.63
## dis     -0.51  0.26  -0.62
## rad      0.50 -0.40   0.63
## tax      0.56 -0.49   0.61
## ptratio  0.38 -0.52   0.25
## lstat    1.00 -0.74   0.47
## medv    -0.74  1.00  -0.27
## target   0.47 -0.27   1.00
## 
## n= 466 
## 
## 
## P
##         zn     indus  chas   nox    rm     age    dis    rad    tax   
## zn             0.0000 0.3870 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## indus   0.0000        0.1874 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## chas    0.3870 0.1874        0.0355 0.0509 0.0890 0.0372 0.7321 0.3138
## nox     0.0000 0.0000 0.0355        0.0000 0.0000 0.0000 0.0000 0.0000
## rm      0.0000 0.0000 0.0509 0.0000        0.0000 0.0000 0.0000 0.0000
## age     0.0000 0.0000 0.0890 0.0000 0.0000        0.0000 0.0000 0.0000
## dis     0.0000 0.0000 0.0372 0.0000 0.0000 0.0000        0.0000 0.0000
## rad     0.0000 0.0000 0.7321 0.0000 0.0000 0.0000 0.0000        0.0000
## tax     0.0000 0.0000 0.3138 0.0000 0.0000 0.0000 0.0000 0.0000       
## ptratio 0.0000 0.0000 0.0054 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
## lstat   0.0000 0.0000 0.2679 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## medv    0.0000 0.0000 0.0005 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## target  0.0000 0.0000 0.0843 0.0000 0.0010 0.0000 0.0000 0.0000 0.0000
##         ptratio lstat  medv   target
## zn      0.0000  0.0000 0.0000 0.0000
## indus   0.0000  0.0000 0.0000 0.0000
## chas    0.0054  0.2679 0.0005 0.0843
## nox     0.0001  0.0000 0.0000 0.0000
## rm      0.0000  0.0000 0.0000 0.0010
## age     0.0000  0.0000 0.0000 0.0000
## dis     0.0000  0.0000 0.0000 0.0000
## rad     0.0000  0.0000 0.0000 0.0000
## tax     0.0000  0.0000 0.0000 0.0000
## ptratio         0.0000 0.0000 0.0000
## lstat   0.0000         0.0000 0.0000
## medv    0.0000  0.0000        0.0000
## target  0.0000  0.0000 0.0000

Box plot was also used to explore potential outlier within the dataset. There was no apparent outlier from the plot as shown below. The residential land zone proportion or ‘zn’ is the only variable that show potential extremes. However, this was expected as people tend to be clustered in one area created patches of high residential zones.

2. Data Preparation

As the data does not have any missing value nor NA value as shown in above data exploration sections, no imputation was done. The data was randomly broken into 75 and 25 percentage group for training models and testing models.

3. Building Model

This analytic exercise will mainly use different logistic regression with different link functions to explore the peformance of each model.

Both logit and probit link functions are of symmetrical distribution. They also show little difference per below image from stackoverflow.com. However, if the probability of success rises slowly from zero, then tapers off more quickly as it approaches one, the cloglog link function should be explored.

Logit vs. Probit vs. coglog ``

The three confusion matrixes below shows the performance of the 3 models are very similar. As shown in below table, the accuracy of the models are trailing 85%.

##           
## model1_pre  0  1
##          0 57 11
##          1  7 42
##           
## model2_pre  0  1
##          0 59 11
##          1  5 42
##           
## model3_pre  0  1
##          0 60 12
##          1  4 41

The three confusion matrixes below shows the performance of the 3 models are very similar. As shown in below table, the accuracy of the models are trailing 85%.

Measurements m1results m2results m3results
F1-Score 0.8636364 0.8805970 0.8823529
Accuracy 0.8461538 0.8632479 0.8632479
Sensitivity 0.7924528 0.7924528 0.7735849
Specificity 0.7924528 0.7924528 0.7735849
AUC 0.8415389 0.8571639 0.8555425

3.1 Model Performance measure using ROC plots.

The unexpected measures of the models performance was using ROC plots. The AUC or area under the curve values vary wildly using functions two packages or libraries: “pROC” vs. “MLMetrics”. The built in function from “MLMetrics” seems to provide more accurate picture of the model performance. As apparent in below ROC plots, the AUC values were exagerated in “pROC” package by about 10%.

Another measure for evaluating model is Cp value; the smaller the Cp value, the better the model. This is also known as best subsets regression procedure. The idea is to provide the smallest Cp values with the best combination of predictor variables. From below graph, the best Cp can be achieved using only the variables: nox, age, rad, ptratio and medv.

Below graph is for selecting bic (Shwartz’s Baysian Information Criterion), to counter overfitting of the model.This tool suggest to include vairables: nox, age, rad and medv.

4. SELECT MODELS

There are many measurements that can be used to evaluate the performance of various model. The one measure that truly matters is the Accuracy of the model over time and the model’s consistentcy over time. For this reason, I choose to use “model 2” out of the 3 models I have examined. However, there was no siginifcant difference among all the models. They all seems to provide quite amazing results. Further enhancements could be done by feature enginearing.

A separate file is attached for the evaluation dataset using “model 2”. A simple scatter plot show the density of the predicted values for the evaluation dataset. The model predict higher crime rate areas in the evaluate dataset than the training dataset. The training dataset have about even distribution the two types of neighborhood.

5. Appendix: R Codes

options(warn = 0)
library(dplyr)
library(GGally)
library(MASS)
library(ggplot2)
library(Hmisc)
library(corrplot)
library(lattice)
library(ggcorrplot)
library(kableExtra)
library(tidyr)
library(dplyr)
library(pROC)
library(ROCR)
library(MASS)
library(reshape2)
library(plotROC)
library(leaps)
library(MLmetrics)
library(knitr)

crimetrain <- read.table ("https://raw.githubusercontent.com/angus001/Data621_BusinessAnalytics/master/crim_train_data_20180409.csv", header = T, sep =",")

crimetest <- read.table ("https://raw.githubusercontent.com/angus001/Data621_BusinessAnalytics/master/crime_test_data_20180409.csv",header =T, sep =",")
head(crimetrain)

# provide short description of the variables.

Short_Desc <- c("Residential Land Zone Proportion", "Non-Retail Business Acre" ," borders the Charles River "
              ,"nitrogen oxides concentration ", " average number of rooms", "proportion buildings prior to 1940",  " distances to five Boston employment centers", " accessibility to radial highways "," property-tax rate ", "pupil-teacher ratio ", " lower status of the population ", " median value of owner-occupied homes ", " crime rate is above the median "   )

Columns_Name <- colnames (crimetrain)
data.frame(Columns_Name, Short_Desc)


#check how many na in each variable/column
na_count <-sapply(crimetrain, function(x) sum(length(which(is.na(x)))))
na_count <- data.frame(na_count)
na_count$col_names <- rownames(na_count)
rownames(na_count) <- NULL #Reset the row names of the data frame

df1 <-filter(na_count, na_count > 0) #filter the dataframe into one that shows columns with na values
head( df1[order(-df1$na_count),], 9) #sort descending and show column with most na


#pairs plot or correlation plots
corrt <- round (cor(crimetrain), 2)
ggcorrplot(corrt, hc.order = TRUE, lab=TRUE, type = "lower")


crimet1 <- melt(crimetrain)

ggplot(crimet1, aes(x = value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free") + xlab("") + ylab("Frequency")



res2 <- rcorr(as.matrix(crimetrain))
res2


#boxplot
ggplot(stack(crimetrain), aes(x = ind, y = values ))+geom_boxplot()+theme(axis.text.x = element_text(angle = 80, hjust = 1))

#split the dataset into training and testing sample

samplesize <- floor(0.75*nrow(crimetrain))

set.seed(245)
train_ind <- sample(seq_len(nrow(crimetrain)), size = samplesize)

train_data <-crimetrain[train_ind,]
test_data <-crimetrain[-train_ind,]


# build 3 different link function logistic regression models.
pmodel1 <- glm(target~., data = train_data, #subset = train,
               family = binomial(link = "probit"))


pmodel2 <- glm(target~., data = train_data, #subset = train,
               family = binomial (link = "logit"))


pmodel3 <- glm(target~., data = train_data, #subset = train,
               family = binomial (link = "cloglog"))

model_prob = predict(pmodel1, test_data,type = "response" )
model_prob2 = predict(pmodel2, test_data,type = "response" )
model_prob3 = predict(pmodel3, test_data,type = "response" )

model1_pre = ifelse(model_prob > 0.5, "1", "0")
model2_pre = ifelse(model_prob2 > 0.5, "1", "0")
model3_pre = ifelse(model_prob3 > 0.5, "1", "0")


m1_f1 <- F1_Score(y_pred = model1_pre, y_true = test_data$target)
m2_f1 <- F1_Score(y_pred = model2_pre, y_true = test_data$target)
m3_f1 <- F1_Score(y_pred = model3_pre, y_true = test_data$target)

table(model1_pre,test_data$target)
table(model2_pre,test_data$target)
table(model3_pre,test_data$target)

#create a function to call all model measurements
modelmetrics <- function(predvalue, truevalue)
{
  a1 <- F1_Score(y_pred = predvalue, y_true = truevalue)
  a2 <- Accuracy(y_pred = predvalue, y_true = truevalue)
  a3 <- Sensitivity(y_pred = predvalue, y_true = truevalue)
  a4 <- Specificity(y_pred = predvalue, y_true = truevalue)
  a5 <- AUC(y_pred = predvalue, y_true = truevalue)
  values1 <- (c(a1,a2,a4,a4,a5))
  label1 <- c("F1-Score", "Accuracy","Sensitivity","Specificity", "AUC")
  return (values1)
}  
  
m1results <- modelmetrics(model1_pre, test_data$target)
m2results <- modelmetrics(model2_pre, test_data$target)
m3results <- modelmetrics(model3_pre, test_data$target)


Measurements <- c("F1-Score", "Accuracy","Sensitivity","Specificity", "AUC")
metricsdf <- data.frame(Measurements, m1results, m2results, m3results)

#show the measurements in a nice html table
metricsdf %>% kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))


#evaluate models using ROC

test_data$model_pred1 = model_prob
test_data$model_pred2 = model_prob2
test_data$model_pred3 = model_prob3

g1 <- roc(target ~ model_pred1, data = test_data)
g2 <- roc(target ~ model_pred2, data = test_data)
g3 <- roc(target ~ model_pred3, data = test_data)


auc1 = auc(test_data$target, test_data$model_pred1)
auc2 = auc(test_data$target, test_data$model_pred2)
auc3 = auc(test_data$target, test_data$model_pred3)


plot(g1, main = sprintf('Model 1 AUC:  %.3f', auc1) )
plot(g2, main = sprintf('Model 2 AUC:  %.3f', auc2) )
plot(g3, main = sprintf('Model 3 AUC:  %.3f', auc3) )

#Cp value plot

plot(regfit.fwd, scale = "Cp" )
title(main = 'Graph for Best Subsets Regression', sub = "Shaded variables are contributing to smaller Cp", col.sub = "red", cex.sub = .35)

#CIB value plot
plot(regfit.fwd )

eval_prob2 = predict(pmodel2, crimetest,type = "response" )
eval_pre = ifelse(eval_prob2 > 0.5, "1", "0")

crimetest$predictions <-c(eval_pre)
#write.table(crimetest, file = "HW3_Predictions.csv", sep=",")

plot(crimetest$predictions )