In this homework assignment, we will explore, analyze and model a data set containing 466 records. Each record contains 13 variables (12 predictors, 1 response variable) that represent characteristics of a neighborhood. The data includes fields related to zoning, demographics, education, property values, etc.
The first objective is to load and evaluate the data. We want to inspect a small sample to make sure that the data has loaded correctly and that it looks coherent.
#load the data from file
data.train <- read.table("crime-training-data_modified.csv", sep = ',', header = TRUE)
#display the first few rows as a coherence check
kable(head(data.train,5))
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
The data looks good, so we can move on to inspection.
Next we will compute some basic summary statistics using describe() from the psych package.
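The call below is a sketch of how that summary was likely produced; the object name "stats" is an assumption (it is reused by the max/median barplot further down).
#compute per-column summary statistics with psych::describe
library(psych)
stats <- describe(data.train)
kable(stats)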
 | vars | n | mean | sd | median | trimmed |
---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 |
 | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|
zn | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
There doesn't seem to be much that is immediately concerning about the data. We see some skewness in a few of the non-binary variables, and the variable "tax" has a scale that is an order of magnitude larger than the others, but nothing of major concern; all of these issues are relatively easy to correct.
Here we consider the mode (not included above) and a count of the number of unique entries in each column.
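The sketch below shows one way these could have been computed; the helper function get_modes is hypothetical and returns every tied value when a column is multi-modal.
#hypothetical helper: most frequent value(s) in a vector (ties return several modes)
get_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
#modes for every column, then a count of distinct values per column
lapply(data.train, get_modes)
kable(data.frame(x = sapply(data.train, function(col) length(unique(col)))),
      col.names = "unique values")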
## $zn
## [1] 0
##
## $indus
## [1] 18.1
##
## $chas
## [1] 0
##
## $nox
## [1] 0.538
##
## $rm
## [1] 5.713 6.127 6.167 6.229 6.405 6.417
##
## $age
## [1] 100
##
## $dis
## [1] 3.4952 5.2873 5.4007 5.7209 6.8147
##
## $rad
## [1] 24
##
## $tax
## [1] 666
##
## $ptratio
## [1] 20.2
##
## $lstat
## [1] 6.36 7.79 8.05
##
## $medv
## [1] 50
##
## $target
## [1] 0
 | unique values |
---|---|
zn | 26 |
indus | 73 |
chas | 2 |
nox | 79 |
rm | 419 |
age | 333 |
dis | 380 |
rad | 9 |
tax | 63 |
ptratio | 46 |
lstat | 424 |
medv | 218 |
target | 2 |
Here we see that we might be able to convert some of these variables to binary variables. ZN and AGE both represent proportions, each biased towards an extreme (0% and 100% respectively). We may be able to simplify by converting these columns to a binary "0" or "1".
If we look at the unique-entry count for each column, we see that "rad" only contains 9 distinct values (likely representative of neighborhoods or something similar), which suggests that this variable is more categorical than numerical; as such, we'll convert it to a factor with 9 levels.
Next we look at various visualizations of the data to get a feel for the shapes of the distributions and to give us a sense as to whether there are any major outliers that we need to be concerned about.
#tidy data.train into key/value pairs for faceted plotting
data.train.tidy <- data.train %>%
  gather()
#histogram of every variable
ggplot(data.train.tidy, aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')
#build a boxplot from our reformatted data.train
n <- ggplot(data = data.train.tidy, aes(x = key, y = value)) +
  geom_boxplot() +
  ggtitle("Looking for Outliers - Boxplot") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
n
#create a barplot of each variable's max/median ratio to highlight potential outliers
n <- ggplot(data = stats, aes(x = rownames(stats), y = max / median)) +
  geom_bar(stat = "identity") +
  ggtitle("Max vs Median - Looking for Outliers") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
n
We do not see much in the way of outliers here either, other than tax, which needs to be re-scaled. The histograms and boxplots give us a good visualization of all of the variables, and everything looks coherent at this stage. The max-to-median ratio shows that no variable contains absurd outliers.
Next we consider the correlation and coefficient of determination of our predictors to our target.
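The correlations are computed along the lines of the sketch below; the use of the corrr package is an assumption (its correlate() function prints the method message shown next and, in older versions, labels the variables in a "rowname" column, which the plotting code relies on).
#pairwise correlations, keeping each predictor's correlation with the target
library(corrr)
cor <- correlate(data.train) %>% focus(target)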
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
cor$rs <- cor$target^2
#create a barplot showing the correlation of the target to each variable
cor %>%
mutate(rowname = factor(rowname, levels = rowname[order(target)])) %>%
ggplot(aes(x = rowname, y = target)) +
geom_bar(stat = "identity") +
ylab("Correlation with Target ") +
xlab("Variable")+
ggtitle("Correlation to Target Var") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
#create a barplot showing the r-squared of each variable vs the target
cor %>%
mutate(rowname = factor(rowname, levels = rowname[order(rs)])) %>%
ggplot(aes(x = rowname, y = rs)) +
geom_bar(stat = "identity") +
ylab("R2 vs Target") +
xlab("Variable")+
ggtitle("R2 to Target Var") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
In terms of correlation to the target, we see that several variables have an absolute correlation of roughly 0.5 or greater. The R-squared plot shows the explanatory power of each variable and suggests that we might want to focus on variables like "nox", "age" and "rad" while ignoring "chas" and "rm". Intuition suggests that we should be able to get a reasonably good prediction with R-squared values such as these.
#correlation matrix of all variables, displayed as a heatmap (melt() from reshape2)
cor.data <- cor(data.train)
ggplot(data = melt(cor.data), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile()
Here we see that we might have some aliasing (collinearity) issues: "tax" and "rad" appear to be highly related to one another. These variables represent tax burden and highway accessibility respectively, and both may be measures of the same underlying thing (a specific geographic location or neighborhood, for example). Otherwise, we don't seem to have much inter-relatedness between variables, which is a good thing.
There are a few minor changes that we can make based on the above observations:
We'll take the proportional data and convert it to binary columns - this is an exercise in logistic regression, after all. We'll also retain the original columns so that we can assess the impact of this change if desired; a sketch of the conversion is shown below.
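This is a minimal sketch of the conversion; the exact cut points (any zoning at all for ZN, a value of exactly 100% for AGE) are assumptions.
#convert the proportion-style columns to binary indicators (cut points are assumptions)
data.train$znBin <- ifelse(data.train$zn > 0, 1, 0)
data.train$ageBin <- ifelse(data.train$age == 100, 1, 0)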
We'll also correct for skew where required. First, we'll visually inspect the suspect variables more closely:
 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
znBin | 14 | 466 | 0.2725322 | 0.4457407 | 0.00000 | 0.2165775 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 1.0184383 | -0.9648402 | 0.0206485 |
ageBin | 15 | 466 | 0.0901288 | 0.2866739 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 2.8533585 | 6.1548765 | 0.0132799 |
Next, we'll apply the transformations and re-inspect:
#modify
data.train$dis <- log(data.train$dis)
data.train$rad <- log(data.train$rad)
data.train$tax <- sqrt(data.train$tax)
data.train$nox <- log(data.train$nox)
#take another peek
hist(data.train$dis)
 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.0000000 | 5.3542781 | 0.0000000 | 0.0000000 | 100.0000000 | 100.0000000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.6900000 | 10.9082353 | 9.3403800 | 0.4600000 | 27.7400000 | 27.2800000 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | -0.6109729 | 0.2026121 | -0.6198967 | -0.6202169 | 0.2549574 | -0.9441759 | -0.1381133 | 0.8060626 | 0.3709699 | -0.7234642 | 0.0093858 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.2100000 | 6.2570615 | 0.5166861 | 3.8630000 | 8.7800000 | 4.9170000 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.1500000 | 70.9553476 | 30.0226500 | 2.9000000 | 100.0000000 | 97.1000000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 1.1875848 | 0.5412299 | 1.1603153 | 1.1782986 | 0.6732974 | 0.1218636 | 2.4953931 | 2.3735296 | 0.1426949 | -1.0086881 | 0.0250720 |
rad | 8 | 466 | 1.8706996 | 0.8653619 | 1.6094379 | 1.8764842 | 0.3308326 | 0.0000000 | 3.1780538 | 3.1780538 | 0.3191654 | -0.6307697 | 0.0400871 |
tax | 9 | 466 | 19.8348783 | 4.0142657 | 18.2893361 | 19.7342218 | 3.0263710 | 13.6747943 | 26.6645833 | 12.9897889 | 0.5091007 | -1.1856441 | 0.1859573 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.9000000 | 18.5970588 | 1.9273800 | 12.6000000 | 22.0000000 | 9.4000000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.3500000 | 11.8809626 | 7.0720020 | 1.7300000 | 37.9700000 | 36.2400000 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.2000000 | 21.6304813 | 6.0045300 | 5.0000000 | 50.0000000 | 45.0000000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.0000000 | 0.4893048 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 0.0342293 | -2.0031131 | 0.0231835 |
znBin | 14 | 466 | 0.2725322 | 0.4457407 | 0.0000000 | 0.2165775 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 1.0184383 | -0.9648402 | 0.0206485 |
ageBin | 15 | 466 | 0.0901288 | 0.2866739 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 2.8533585 | 6.1548765 | 0.0132799 |
We can see that the results look much closer to normally distributed after the transformations.
For the variable "rad", which we previously identified as having only 9 distinct values, we'll convert it to a factor and encode the values as 1-9 to make life easier; a sketch of the recoding follows.
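A minimal sketch of the recoding, assuming the same match()/as.factor() treatment that is applied to the evaluation data later on:
#re-code the 9 distinct rad values as factor levels 1-9
data.train$rad <- match(data.train$rad, unique(data.train$rad))
data.train$rad <- as.factor(data.train$rad)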
Finally, after our transforms, we'll split the data into a train and test set. This way, we can perform some model evaluation using the held-out test set.
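A sketch of the split; the 75/25 ratio is inferred from the 349 training rows used by the models below, and the seed is an assumption.
#split into a ~75/25 train/test partition
set.seed(123)
train_ind <- sample(seq_len(nrow(data.train)), size = floor(0.75 * nrow(data.train)))
data.test <- data.train[-train_ind, ]
data.train <- data.train[train_ind, ]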
First, we’ll build a model using all of the data exactly as it appears in the original file.
#get a fresh copy of the data
data.orig <- read.table("crime-training-data_modified.csv", sep = ',', header = TRUE)
#rebuild train/test
data.orig.test <- data.orig [-train_ind, ]
data.orig.train <- data.orig [train_ind, ]
#build a model using the original, unmodified data
model_orig <- glm(formula = data.orig.train$target ~ .,data.orig.train[,],family="binomial")
summary(model_orig)
##
## Call:
## glm(formula = data.orig.train$target ~ ., family = "binomial",
## data = data.orig.train[, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.00434 -0.11934 0.00000 0.00125 3.07110
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.398974 8.026598 -5.407 6.41e-08 ***
## zn -0.178760 0.086629 -2.064 0.039064 *
## indus -0.137292 0.065978 -2.081 0.037445 *
## chas 0.901319 0.845654 1.066 0.286503
## nox 59.004084 11.013512 5.357 8.44e-08 ***
## rm -0.963916 0.878829 -1.097 0.272721
## age 0.051233 0.017587 2.913 0.003577 **
## dis 0.771590 0.296352 2.604 0.009224 **
## rad 0.759221 0.229646 3.306 0.000946 ***
## tax -0.008198 0.003818 -2.147 0.031760 *
## ptratio 0.347956 0.168622 2.064 0.039063 *
## lstat 0.061905 0.073619 0.841 0.400412
## medv 0.176052 0.079204 2.223 0.026232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 137.33 on 336 degrees of freedom
## AIC: 163.33
##
## Number of Fisher Scoring iterations: 9
#check the accuracy on the hold-out set, classifying with a 0.5 threshold
p <- predict(model_orig, data.orig.test[,], type = "response")
p[p > 0.5] <- 1
p[p < 1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.orig.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59 9
## 1 3 46
##
## Accuracy : 0.8974
## 95% CI : (0.8277, 0.9459)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7929
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.8364
## Specificity : 0.9516
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.8676
## Prevalence : 0.4701
## Detection Rate : 0.3932
## Detection Prevalence : 0.4188
## Balanced Accuracy : 0.8940
##
## 'Positive' Class : 1
##
# get the model results and plot
model_orig.output <- augment(model_orig) %>%
mutate(index = 1:n())
model_orig.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 22
## .rownames data.orig.train~ zn indus chas nox rm age dis
## <chr> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 62 1 0 9.9 0 0.544 4.97 37.8 2.52
## 2 304 0 0 9.69 0 0.585 5.67 28.8 2.80
## 3 457 1 0 10.6 0 0.489 5.41 9.8 3.59
## # ... with 13 more variables: rad <int>, tax <int>, ptratio <dbl>,
## # lstat <dbl>, medv <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_orig.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.orig.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
We get a surprisingly decent result with an accuracy near 90% and an AIC of 163 - my intuition is that we are already near the point of diminishing returns in this particular assignment, and it may be hard to improve upon this model. However, we do see some sizable outliers in the Cook's D output, and the residuals plot shows some standardized residuals with absolute values greater than 3.
Next, we'll fit the same full model on our modified data, using all the variables. Note that we'll run this model twice: once retaining the original proportional ZN and AGE columns, and once using only their binary versions.
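The two fits are sketched below; the data-column selections are taken from the printed calls in the output, but the object name of the first model is hypothetical (only model_all, the binary-variable fit, is reused later).
#full model on the transformed data, keeping the original zn and age columns
model_all_prop <- glm(data.train$target ~ ., family = "binomial",
                      data = data.train[, -c(ncol(data.train) - 1, ncol(data.train) - 2)])
summary(model_all_prop)
#full model using the binary versions instead (columns 1 and 6, zn and age, are dropped)
model_all <- glm(data.train$target ~ ., family = "binomial",
                 data = data.train[, -c(1, 6)])
summary(model_all)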
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[,
## -c(ncol(data.train) - 1, ncol(data.train) - 2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.32383 -0.00102 0.00000 0.00003 3.03244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 31.38596 10.06861 3.117 0.001826 **
## zn -0.27736 0.16098 -1.723 0.084903 .
## indus -0.27203 0.16571 -1.642 0.100678
## chas -0.59856 1.21751 -0.492 0.622982
## nox 41.64628 10.71680 3.886 0.000102 ***
## rm -1.85110 1.47715 -1.253 0.210149
## age 0.02958 0.02078 1.423 0.154711
## dis 0.93825 1.80340 0.520 0.602878
## rad2 24.45937 2195.68961 0.011 0.991112
## rad3 -2.90651 1.72127 -1.689 0.091298 .
## rad4 -20.96470 4525.58117 -0.005 0.996304
## rad5 -19.34965 5361.28382 -0.004 0.997120
## rad6 -8.03856 7076.31647 -0.001 0.999094
## rad7 2.40880 0.85061 2.832 0.004628 **
## rad8 9.01548 6.18967 1.457 0.145244
## rad9 -19.76591 5742.74294 -0.003 0.997254
## tax -0.34797 0.33414 -1.041 0.297692
## ptratio 0.12692 0.24609 0.516 0.606025
## lstat 0.11121 0.09349 1.190 0.234228
## medv 0.30831 0.13645 2.259 0.023853 *
## ageBin 18.00070 3684.00775 0.005 0.996101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.814 on 348 degrees of freedom
## Residual deviance: 72.359 on 328 degrees of freedom
## AIC: 114.36
##
## Number of Fisher Scoring iterations: 20
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[,
## -c(1, 6)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.36372 -0.00308 0.00000 0.00002 3.03065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 30.53433 10.33514 2.954 0.00313 **
## indus -0.33242 0.15609 -2.130 0.03320 *
## chas -0.09346 1.14803 -0.081 0.93512
## nox 44.73128 10.42712 4.290 1.79e-05 ***
## rm -0.69199 1.27802 -0.541 0.58819
## dis 0.65532 1.89771 0.345 0.72985
## rad2 23.90571 2114.21608 0.011 0.99098
## rad3 -3.63477 1.64934 -2.204 0.02754 *
## rad4 -20.28378 4696.78990 -0.004 0.99655
## rad5 -19.46388 5660.42644 -0.003 0.99726
## rad6 -11.71883 7179.94527 -0.002 0.99870
## rad7 2.48793 0.83783 2.970 0.00298 **
## rad8 8.04944 4.93686 1.630 0.10300
## rad9 -18.36249 6074.08684 -0.003 0.99759
## tax -0.22495 0.31289 -0.719 0.47217
## ptratio 0.01028 0.22674 0.045 0.96384
## lstat 0.15562 0.09351 1.664 0.09605 .
## medv 0.22290 0.11226 1.985 0.04709 *
## znBin -5.60072 3.94063 -1.421 0.15524
## ageBin 18.83165 3645.07888 0.005 0.99588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 74.30 on 329 degrees of freedom
## AIC: 114.3
##
## Number of Fisher Scoring iterations: 20
p <- predict(model_all,data.test[,-c(1,6)],type="response")
p[p>0.5] <-1
p[p<1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 3
## 1 1 52
##
## Accuracy : 0.9658
## 95% CI : (0.9148, 0.9906)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9312
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9455
## Specificity : 0.9839
## Pos Pred Value : 0.9811
## Neg Pred Value : 0.9531
## Prevalence : 0.4701
## Detection Rate : 0.4444
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.9647
##
## 'Positive' Class : 1
##
# get the model results and plot
model_all.output <- augment(model_all) %>%
mutate(index = 1:n())
model_all.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 22
## .rownames data.train.targ~ indus chas nox rm dis rad tax
## <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 240 0 21.9 0 -0.472 5.86 0.512 7 20.9
## 2 249 1 8.56 0 -0.654 6.23 0.934 1 19.6
## 3 354 1 10.6 1 -0.715 5.40 1.30 7 16.6
## # ... with 13 more variables: ptratio <dbl>, lstat <dbl>, medv <dbl>,
## # znBin <dbl>, ageBin <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_all.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
With the modified data, we can see that the AIC drops to 114 (~30% drop from the last model) and the accuracy of the model improves to nearly 97%! Once again, we see a few large outliers which we would like to do something about if possible.
Note also that it makes almost no difference whether we use binary or proportional data for ZN and AGE. Given that binary variables are simpler, I’ve opted to take the (incredibly small) performance hit here and stick with the binary variables.
Finally, we’ll try cherry-picking some of the more important variables based on what we’ve observed above.
First we'll use regsubsets() from the leaps package to sub-select the important variables:
#exhaustive best-subset search; we'll then pick the subset size with the lowest Cp
regfit.full <- regsubsets(data.orig.train$target ~ ., data = data.orig.train, nvmax = 12)
s <- summary(regfit.full)
s
## Subset selection object
## Call: regsubsets.formula(data.orig.train$target ~ ., , data = data.orig.train,
## nvmax = 12)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " "*" " " "*" "*" " " " " "*"
## 6 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" " " " " "*"
## 7 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 8 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" "*" "*" "*"
## 9 ( 1 ) " " " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
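The subset size is then chosen by minimizing Mallows' Cp; the line below is assumed to be the source of the value printed next.
#index of the subset size with the lowest Cp
which.min(s$cp)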
## [1] 4
The above tells us that the size-4 subset, which contains only nox, age, rad and medv, should be the best-performing (lowest Cp) model, so let's try it!
#fit the reduced model using only the selected predictors
model_alpha <- glm(formula = data.train$target ~ nox + age + rad + medv,
                   data = data.train[, -c(ncol(data.test) - 1, ncol(data.test) - 2)],
                   family = "binomial")
summary(model_alpha)
##
## Call:
## glm(formula = data.train$target ~ nox + age + rad + medv, family = "binomial",
## data = data.train[, -c(ncol(data.test) - 1, ncol(data.test) -
## 2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.43430 -0.22886 0.00000 0.00016 2.61216
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.29545 2.79550 2.252 0.024322 *
## nox 17.34888 3.33260 5.206 1.93e-07 ***
## age 0.01467 0.01302 1.126 0.260022
## rad2 19.54788 1632.34303 0.012 0.990445
## rad3 -1.10292 0.92653 -1.190 0.233901
## rad4 -17.63571 3160.23153 -0.006 0.995547
## rad5 -18.36714 3823.50441 -0.005 0.996167
## rad6 -16.43258 4890.47714 -0.003 0.997319
## rad7 2.14043 0.58892 3.635 0.000278 ***
## rad8 4.32561 1.05535 4.099 4.15e-05 ***
## rad9 -19.11541 3722.90641 -0.005 0.995903
## medv 0.09140 0.03970 2.302 0.021320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 115.87 on 337 degrees of freedom
## AIC: 139.87
##
## Number of Fisher Scoring iterations: 19
p <- predict(model_alpha,data.test[,-c(ncol(data.test)-1,ncol(data.test)-2)],type="response")
p[p>0.5] <-1
p[p<1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 62 3
## 1 0 52
##
## Accuracy : 0.9744
## 95% CI : (0.9269, 0.9947)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9484
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9455
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9538
## Prevalence : 0.4701
## Detection Rate : 0.4444
## Detection Prevalence : 0.4444
## Balanced Accuracy : 0.9727
##
## 'Positive' Class : 1
##
# get the model results and plot
model_alpha.output <- augment(model_alpha) %>%
mutate(index = 1:n())
model_alpha.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 14
## .rownames data.train.targ~ nox age rad medv .fitted .se.fit
## <chr> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 263 1 -0.536 42.6 3 24.5 -1.24 0.861
## 2 457 1 -0.715 9.8 7 23.7 -1.67 0.747
## 3 69 1 -0.536 72.9 3 19.7 -1.24 0.779
## # ... with 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_alpha.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
This optimized model blows the original model out of the water, with an AIC of 140 and an accuracy of about 97.4%. We can also see that there are no standardized residuals with absolute values greater than 3, which is also good. We have a simpler model (fewer variables) with better predictions and smaller residuals - this is the model we'll run with.
Based on the above, we'll go with model_alpha, as it shows superior performance, lower error and fewer predictors than the other two models that were tested.
First we'll prep the out-of-sample data in the same way as was done in-sample:
#load the evaluation data from file and apply the same transformations used on the training data
data.eval <- read.table("crime-evaluation-data_modified.csv", sep = ',', header = TRUE)
data.eval$dis <- log(data.eval$dis)
data.eval$rad <- log(data.eval$rad)
data.eval$tax <- sqrt(data.eval$tax)
data.eval$nox <- log(data.eval$nox)
#re-code rad as a factor, as was done for the training data
data.eval$rad <- match(data.eval$rad, unique(data.eval$rad))
data.eval$rad <- as.factor(data.eval$rad)
Next we'll make our predictions:
p <- predict(model_alpha,data.eval,type = "response")
#convert to binomial using 50% threshold
p[p>0.5] <-1
p[p<1] <- 0
kable(data.frame(p),row.names=T)
 | p |
---|---|
1 | 0 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 0 |
6 | 0 |
7 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 0 |
12 | 0 |
13 | 1 |
14 | 1 |
15 | 0 |
16 | 1 |
17 | 0 |
18 | 0 |
19 | 0 |
20 | 0 |
21 | 0 |
22 | 0 |
23 | 1 |
24 | 0 |
25 | 0 |
26 | 0 |
27 | 1 |
28 | 1 |
29 | 1 |
30 | 1 |
31 | 1 |
32 | 1 |
33 | 1 |
34 | 1 |
35 | 1 |
36 | 1 |
37 | 1 |
38 | 1 |
39 | 0 |
40 | 1 |