In this homework assignment, we will explore, analyze and model a data set containing 466 records. Each record contains 13 variables (12 predictors, 1 response variable) that represent characteristics of a neighborhood. The data includes fields related to zoning, demographics, education, property values, etc.
The first objective is to load and evaluate the data. We want to inspect a small sample to make sure that the data has loaded correctly and that it looks coherent.
#load the data from file
data.train <- read.table("crime-training-data_modified.csv", sep = ',', header = TRUE)
#display the first few rows as a coherence check
kable(head(data.train,5))
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
The data looks good, so we can move on to inspection.
Next we will compute some basic summary statistics using describe() from the psych package.
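The call below is a sketch of how that summary was likely produced; the object name "stats" is an assumption (it is reused by the max/median barplot further down).
#compute per-column summary statistics with psych::describe
library(psych)
stats <- describe(data.train)
kable(stats)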
 | vars | n | mean | sd | median | trimmed |
---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 |
 | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|
zn | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
There doesn't seem to be much that is immediately concerning about the data. We see some skewness in a few of the non-binary variables, and the variable "tax" has a scale that is an order of magnitude larger than the others, but nothing of major concern; all of these issues are relatively easy to correct.
Here we consider the mode (not included above) and a count of the number of unique entries in each column.
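The sketch below shows one way these could have been computed; the helper function get_modes is hypothetical and returns every tied value when a column is multi-modal.
#hypothetical helper: most frequent value(s) in a vector (ties return several modes)
get_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
#modes for every column, then a count of distinct values per column
lapply(data.train, get_modes)
kable(data.frame(x = sapply(data.train, function(col) length(unique(col)))),
      col.names = "unique values")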
## $zn
## [1] 0
##
## $indus
## [1] 18.1
##
## $chas
## [1] 0
##
## $nox
## [1] 0.538
##
## $rm
## [1] 5.713 6.127 6.167 6.229 6.405 6.417
##
## $age
## [1] 100
##
## $dis
## [1] 3.4952 5.2873 5.4007 5.7209 6.8147
##
## $rad
## [1] 24
##
## $tax
## [1] 666
##
## $ptratio
## [1] 20.2
##
## $lstat
## [1] 6.36 7.79 8.05
##
## $medv
## [1] 50
##
## $target
## [1] 0
 | unique values |
---|---|
zn | 26 |
indus | 73 |
chas | 2 |
nox | 79 |
rm | 419 |
age | 333 |
dis | 380 |
rad | 9 |
tax | 63 |
ptratio | 46 |
lstat | 424 |
medv | 218 |
target | 2 |
Here we see that we might be able to convert some of these variables to binary variables. ZN and AGE both represent proportions, each biased towards an extreme (0% and 100% respectively). We may be able to simplify by converting these columns to a binary "0" or "1".
If we look at the unique-entry count for each column, we see that "rad" only contains 9 distinct values (likely representative of neighborhoods or something similar), which suggests that this variable is more categorical than numerical; as such, we'll convert it to a factor with 9 levels.
Next we look at various visualizations of the data to get a feel for the shapes of the distributions and to give us a sense as to whether there are any major outliers that we need to be concerned about.
#tidy data.train into key/value pairs for faceted plotting
data.train.tidy <- data.train %>%
  gather()
#histogram of every variable
ggplot(data.train.tidy, aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')
#build a boxplot from our reformatted data.train
n <- ggplot(data = data.train.tidy, aes(x = key, y = value)) +
  geom_boxplot() +
  ggtitle("Looking for Outliers - Boxplot") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
n
#create a barplot of each variable's max/median ratio to highlight potential outliers
n <- ggplot(data = stats, aes(x = rownames(stats), y = max / median)) +
  geom_bar(stat = "identity") +
  ggtitle("Max vs Median - Looking for Outliers") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
n
We do not see much in the way of outliers here either, other than tax, which needs to be re-scaled. The histograms and boxplots give us a good visualization of all of the variables, and everything looks coherent at this stage. The max-to-median ratio shows that no variable contains absurd outliers.
Next we consider the correlation and coefficient of determination of our predictors to our target.
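The correlations are computed along the lines of the sketch below; the use of the corrr package is an assumption (its correlate() function prints the method message shown next and, in older versions, labels the variables in a "rowname" column, which the plotting code relies on).
#pairwise correlations, keeping each predictor's correlation with the target
library(corrr)
cor <- correlate(data.train) %>% focus(target)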
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
cor$rs <- cor$target^2
#create a barplot showing the correlation of the target to each variable
cor %>%
mutate(rowname = factor(rowname, levels = rowname[order(target)])) %>%
ggplot(aes(x = rowname, y = target)) +
geom_bar(stat = "identity") +
ylab("Correlation with Target ") +
xlab("Variable")+
ggtitle("Correlation to Target Var") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
#create a barplot showing the r-squared of each variable vs the target
cor %>%
mutate(rowname = factor(rowname, levels = rowname[order(rs)])) %>%
ggplot(aes(x = rowname, y = rs)) +
geom_bar(stat = "identity") +
ylab("R2 vs Target") +
xlab("Variable")+
ggtitle("R2 to Target Var") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
In terms of correlation to the target, we see that several variables have an absolute correlation of roughly 0.5 or greater. The R-squared plot shows the explanatory power of each variable and suggests that we might want to focus on variables like "nox", "age" and "rad" while ignoring "chas" and "rm". Intuition suggests that we should be able to get a reasonably good prediction with R-squared values such as these.
#correlation matrix of all variables, displayed as a heatmap (melt() from reshape2)
cor.data <- cor(data.train)
ggplot(data = melt(cor.data), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile()
Here we see that we might have some aliasing (collinearity) issues: "tax" and "rad" appear to be highly related to one another. These variables represent tax burden and highway accessibility respectively, and both may be measures of the same underlying thing (a specific geographic location or neighborhood, for example). Otherwise, we don't seem to have much inter-relatedness between variables, which is a good thing.
There are a few minor changes that we can make based on the above observations:
We'll take the proportional data and convert it to binary columns - this is an exercise in logistic regression, after all. We'll also retain the original columns so that we can assess the impact of this change if desired; a sketch of the conversion is shown below.
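This is a minimal sketch of the conversion; the exact cut points (any zoning at all for ZN, a value of exactly 100% for AGE) are assumptions.
#convert the proportion-style columns to binary indicators (cut points are assumptions)
data.train$znBin <- ifelse(data.train$zn > 0, 1, 0)
data.train$ageBin <- ifelse(data.train$age == 100, 1, 0)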
We'll also correct for skew where required. First, we'll visually inspect the suspect variables more closely:
 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
znBin | 14 | 466 | 0.2725322 | 0.4457407 | 0.00000 | 0.2165775 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 1.0184383 | -0.9648402 | 0.0206485 |
ageBin | 15 | 466 | 0.0901288 | 0.2866739 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 2.8533585 | 6.1548765 | 0.0132799 |
Next, we'll apply the transformations and re-inspect:
#modify
data.train$dis <- log(data.train$dis)
data.train$rad <- log(data.train$rad)
data.train$tax <- sqrt(data.train$tax)
data.train$nox <- log(data.train$nox)
#take another peek
hist(data.train$dis)
 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.0000000 | 5.3542781 | 0.0000000 | 0.0000000 | 100.0000000 | 100.0000000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.6900000 | 10.9082353 | 9.3403800 | 0.4600000 | 27.7400000 | 27.2800000 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | -0.6109729 | 0.2026121 | -0.6198967 | -0.6202169 | 0.2549574 | -0.9441759 | -0.1381133 | 0.8060626 | 0.3709699 | -0.7234642 | 0.0093858 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.2100000 | 6.2570615 | 0.5166861 | 3.8630000 | 8.7800000 | 4.9170000 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.1500000 | 70.9553476 | 30.0226500 | 2.9000000 | 100.0000000 | 97.1000000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 1.1875848 | 0.5412299 | 1.1603153 | 1.1782986 | 0.6732974 | 0.1218636 | 2.4953931 | 2.3735296 | 0.1426949 | -1.0086881 | 0.0250720 |
rad | 8 | 466 | 1.8706996 | 0.8653619 | 1.6094379 | 1.8764842 | 0.3308326 | 0.0000000 | 3.1780538 | 3.1780538 | 0.3191654 | -0.6307697 | 0.0400871 |
tax | 9 | 466 | 19.8348783 | 4.0142657 | 18.2893361 | 19.7342218 | 3.0263710 | 13.6747943 | 26.6645833 | 12.9897889 | 0.5091007 | -1.1856441 | 0.1859573 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.9000000 | 18.5970588 | 1.9273800 | 12.6000000 | 22.0000000 | 9.4000000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.3500000 | 11.8809626 | 7.0720020 | 1.7300000 | 37.9700000 | 36.2400000 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.2000000 | 21.6304813 | 6.0045300 | 5.0000000 | 50.0000000 | 45.0000000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.0000000 | 0.4893048 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 0.0342293 | -2.0031131 | 0.0231835 |
znBin | 14 | 466 | 0.2725322 | 0.4457407 | 0.0000000 | 0.2165775 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 1.0184383 | -0.9648402 | 0.0206485 |
ageBin | 15 | 466 | 0.0901288 | 0.2866739 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 | 1.0000000 | 2.8533585 | 6.1548765 | 0.0132799 |
We can see that the results look much closer to normally distributed after the transformations.
For the variable "rad", which we previously identified as having only 9 distinct values, we'll convert it to a factor and encode the values as 1-9 to make life easier; a sketch of the recoding follows.
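A minimal sketch of the recoding, assuming the same match()/as.factor() treatment that is applied to the evaluation data later on:
#re-code the 9 distinct rad values as factor levels 1-9
data.train$rad <- match(data.train$rad, unique(data.train$rad))
data.train$rad <- as.factor(data.train$rad)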
Finally, after our transforms, we'll split the data into a train and test set. This way, we can perform some model evaluation using the held-out test set.
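A sketch of the split; the 75/25 ratio is inferred from the 349 training rows used by the models below, and the seed is an assumption.
#split into a ~75/25 train/test partition
set.seed(123)
train_ind <- sample(seq_len(nrow(data.train)), size = floor(0.75 * nrow(data.train)))
data.test <- data.train[-train_ind, ]
data.train <- data.train[train_ind, ]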
First, we’ll build a model using all of the data exactly as it appears in the original file.
#get a fresh copy of the data
data.orig <- read.table("crime-training-data_modified.csv", sep = ',', header = TRUE)
#rebuild train/test
data.orig.test <- data.orig [-train_ind, ]
data.orig.train <- data.orig [train_ind, ]
#build a model using the original, unmodified data
model_orig <- glm(formula = data.orig.train$target ~ .,data.orig.train[,],family="binomial")
summary(model_orig)
##
## Call:
## glm(formula = data.orig.train$target ~ ., family = "binomial",
## data = data.orig.train[, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.00434 -0.11934 0.00000 0.00125 3.07110
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.398974 8.026598 -5.407 6.41e-08 ***
## zn -0.178760 0.086629 -2.064 0.039064 *
## indus -0.137292 0.065978 -2.081 0.037445 *
## chas 0.901319 0.845654 1.066 0.286503
## nox 59.004084 11.013512 5.357 8.44e-08 ***
## rm -0.963916 0.878829 -1.097 0.272721
## age 0.051233 0.017587 2.913 0.003577 **
## dis 0.771590 0.296352 2.604 0.009224 **
## rad 0.759221 0.229646 3.306 0.000946 ***
## tax -0.008198 0.003818 -2.147 0.031760 *
## ptratio 0.347956 0.168622 2.064 0.039063 *
## lstat 0.061905 0.073619 0.841 0.400412
## medv 0.176052 0.079204 2.223 0.026232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 137.33 on 336 degrees of freedom
## AIC: 163.33
##
## Number of Fisher Scoring iterations: 9
#check the accuracy on the hold-out set, classifying with a 0.5 threshold
p <- predict(model_orig, data.orig.test[,], type = "response")
p[p > 0.5] <- 1
p[p < 1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.orig.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 59 9
## 1 3 46
##
## Accuracy : 0.8974
## 95% CI : (0.8277, 0.9459)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7929
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.8364
## Specificity : 0.9516
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.8676
## Prevalence : 0.4701
## Detection Rate : 0.3932
## Detection Prevalence : 0.4188
## Balanced Accuracy : 0.8940
##
## 'Positive' Class : 1
##
# get the model results and plot
model_orig.output <- augment(model_orig) %>%
mutate(index = 1:n())
model_orig.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 22
## .rownames data.orig.train~ zn indus chas nox rm age dis
## <chr> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 62 1 0 9.9 0 0.544 4.97 37.8 2.52
## 2 304 0 0 9.69 0 0.585 5.67 28.8 2.80
## 3 457 1 0 10.6 0 0.489 5.41 9.8 3.59
## # ... with 13 more variables: rad <int>, tax <int>, ptratio <dbl>,
## # lstat <dbl>, medv <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_orig.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.orig.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
We get a surprisingly decent result with an accuracy near 90% and an AIC of 163 - my intuition is that we are already near the point of diminishing returns in this particular assignment, and it may be hard to improve upon this model. However, we do see some sizable outliers in the Cook's D output, and the residuals plot shows some standardized residuals with absolute values greater than 3.
Next, we'll fit the same full model on our modified data, using all the variables. Note that we'll run this model twice: once retaining the original proportional ZN and AGE columns, and once using only their binary versions.
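The two fits are sketched below; the data-column selections are taken from the printed calls in the output, but the object name of the first model is hypothetical (only model_all, the binary-variable fit, is reused later).
#full model on the transformed data, keeping the original zn and age columns
model_all_prop <- glm(data.train$target ~ ., family = "binomial",
                      data = data.train[, -c(ncol(data.train) - 1, ncol(data.train) - 2)])
summary(model_all_prop)
#full model using the binary versions instead (columns 1 and 6, zn and age, are dropped)
model_all <- glm(data.train$target ~ ., family = "binomial",
                 data = data.train[, -c(1, 6)])
summary(model_all)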
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[,
## -c(ncol(data.train) - 1, ncol(data.train) - 2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.32383 -0.00102 0.00000 0.00003 3.03244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 31.38596 10.06861 3.117 0.001826 **
## zn -0.27736 0.16098 -1.723 0.084903 .
## indus -0.27203 0.16571 -1.642 0.100678
## chas -0.59856 1.21751 -0.492 0.622982
## nox 41.64628 10.71680 3.886 0.000102 ***
## rm -1.85110 1.47715 -1.253 0.210149
## age 0.02958 0.02078 1.423 0.154711
## dis 0.93825 1.80340 0.520 0.602878
## rad2 24.45937 2195.68961 0.011 0.991112
## rad3 -2.90651 1.72127 -1.689 0.091298 .
## rad4 -20.96470 4525.58117 -0.005 0.996304
## rad5 -19.34965 5361.28382 -0.004 0.997120
## rad6 -8.03856 7076.31647 -0.001 0.999094
## rad7 2.40880 0.85061 2.832 0.004628 **
## rad8 9.01548 6.18967 1.457 0.145244
## rad9 -19.76591 5742.74294 -0.003 0.997254
## tax -0.34797 0.33414 -1.041 0.297692
## ptratio 0.12692 0.24609 0.516 0.606025
## lstat 0.11121 0.09349 1.190 0.234228
## medv 0.30831 0.13645 2.259 0.023853 *
## ageBin 18.00070 3684.00775 0.005 0.996101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.814 on 348 degrees of freedom
## Residual deviance: 72.359 on 328 degrees of freedom
## AIC: 114.36
##
## Number of Fisher Scoring iterations: 20
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[,
## -c(1, 6)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.36372 -0.00308 0.00000 0.00002 3.03065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 30.53433 10.33514 2.954 0.00313 **
## indus -0.33242 0.15609 -2.130 0.03320 *
## chas -0.09346 1.14803 -0.081 0.93512
## nox 44.73128 10.42712 4.290 1.79e-05 ***
## rm -0.69199 1.27802 -0.541 0.58819
## dis 0.65532 1.89771 0.345 0.72985
## rad2 23.90571 2114.21608 0.011 0.99098
## rad3 -3.63477 1.64934 -2.204 0.02754 *
## rad4 -20.28378 4696.78990 -0.004 0.99655
## rad5 -19.46388 5660.42644 -0.003 0.99726
## rad6 -11.71883 7179.94527 -0.002 0.99870
## rad7 2.48793 0.83783 2.970 0.00298 **
## rad8 8.04944 4.93686 1.630 0.10300
## rad9 -18.36249 6074.08684 -0.003 0.99759
## tax -0.22495 0.31289 -0.719 0.47217
## ptratio 0.01028 0.22674 0.045 0.96384
## lstat 0.15562 0.09351 1.664 0.09605 .
## medv 0.22290 0.11226 1.985 0.04709 *
## znBin -5.60072 3.94063 -1.421 0.15524
## ageBin 18.83165 3645.07888 0.005 0.99588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 74.30 on 329 degrees of freedom
## AIC: 114.3
##
## Number of Fisher Scoring iterations: 20
p <- predict(model_all,data.test[,-c(1,6)],type="response")
p[p>0.5] <-1
p[p<1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 3
## 1 1 52
##
## Accuracy : 0.9658
## 95% CI : (0.9148, 0.9906)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9312
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9455
## Specificity : 0.9839
## Pos Pred Value : 0.9811
## Neg Pred Value : 0.9531
## Prevalence : 0.4701
## Detection Rate : 0.4444
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.9647
##
## 'Positive' Class : 1
##
# get the model results and plot
model_all.output <- augment(model_all) %>%
mutate(index = 1:n())
model_all.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 22
## .rownames data.train.targ~ indus chas nox rm dis rad tax
## <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 240 0 21.9 0 -0.472 5.86 0.512 7 20.9
## 2 249 1 8.56 0 -0.654 6.23 0.934 1 19.6
## 3 354 1 10.6 1 -0.715 5.40 1.30 7 16.6
## # ... with 13 more variables: ptratio <dbl>, lstat <dbl>, medv <dbl>,
## # znBin <dbl>, ageBin <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_all.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
With the modified data, we can see that the AIC drops to 114 (~30% drop from the last model) and the accuracy of the model improves to nearly 97%! Once again, we see a few large outliers which we would like to do something about if possible.
Note also that it makes almost no difference whether we use binary or proportional data for ZN and AGE. Given that binary variables are simpler, I’ve opted to take the (incredibly small) performance hit here and stick with the binary variables.
Finally, we’ll try cherry-picking some of the more important variables based on what we’ve observed above.
First we'll use regsubsets() from the leaps package to sub-select the important variables:
#exhaustive best-subset search; we'll then pick the subset size with the lowest Cp
regfit.full <- regsubsets(data.orig.train$target ~ ., data = data.orig.train, nvmax = 12)
s <- summary(regfit.full)
s
## Subset selection object
## Call: regsubsets.formula(data.orig.train$target ~ ., , data = data.orig.train,
## nvmax = 12)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " "*" " " "*" "*" " " " " "*"
## 6 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" " " " " "*"
## 7 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 8 ( 1 ) " " " " "*" "*" " " "*" " " "*" "*" "*" "*" "*"
## 9 ( 1 ) " " " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
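The subset size is then chosen by minimizing Mallows' Cp; the line below is assumed to be the source of the value printed next.
#index of the subset size with the lowest Cp
which.min(s$cp)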
## [1] 4
The above tells us that the size-4 subset, which contains only nox, age, rad and medv, should be the best-performing (lowest Cp) model, so let's try it!
#fit the reduced model using only the selected predictors
model_alpha <- glm(formula = data.train$target ~ nox + age + rad + medv,
                   data = data.train[, -c(ncol(data.test) - 1, ncol(data.test) - 2)],
                   family = "binomial")
summary(model_alpha)
##
## Call:
## glm(formula = data.train$target ~ nox + age + rad + medv, family = "binomial",
## data = data.train[, -c(ncol(data.test) - 1, ncol(data.test) -
## 2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.43430 -0.22886 0.00000 0.00016 2.61216
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.29545 2.79550 2.252 0.024322 *
## nox 17.34888 3.33260 5.206 1.93e-07 ***
## age 0.01467 0.01302 1.126 0.260022
## rad2 19.54788 1632.34303 0.012 0.990445
## rad3 -1.10292 0.92653 -1.190 0.233901
## rad4 -17.63571 3160.23153 -0.006 0.995547
## rad5 -18.36714 3823.50441 -0.005 0.996167
## rad6 -16.43258 4890.47714 -0.003 0.997319
## rad7 2.14043 0.58892 3.635 0.000278 ***
## rad8 4.32561 1.05535 4.099 4.15e-05 ***
## rad9 -19.11541 3722.90641 -0.005 0.995903
## medv 0.09140 0.03970 2.302 0.021320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.81 on 348 degrees of freedom
## Residual deviance: 115.87 on 337 degrees of freedom
## AIC: 139.87
##
## Number of Fisher Scoring iterations: 19
p <- predict(model_alpha,data.test[,-c(ncol(data.test)-1,ncol(data.test)-2)],type="response")
p[p>0.5] <-1
p[p<1] <- 0
predicted <- as.factor(p)
actual <- as.factor(data.test$target)
confusionMatrix(data = predicted , reference = actual ,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 62 3
## 1 0 52
##
## Accuracy : 0.9744
## 95% CI : (0.9269, 0.9947)
## No Information Rate : 0.5299
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9484
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9455
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9538
## Prevalence : 0.4701
## Detection Rate : 0.4444
## Detection Prevalence : 0.4444
## Balanced Accuracy : 0.9727
##
## 'Positive' Class : 1
##
# get the model results and plot
model_alpha.output <- augment(model_alpha) %>%
mutate(index = 1:n())
model_alpha.output %>% top_n(3, .cooksd)
## # A tibble: 3 x 14
## .rownames data.train.targ~ nox age rad medv .fitted .se.fit
## <chr> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 263 1 -0.536 42.6 3 24.5 -1.24 0.861
## 2 457 1 -0.715 9.8 7 23.7 -1.67 0.747
## 3 69 1 -0.536 72.9 3 19.7 -1.24 0.779
## # ... with 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(model_alpha.output, aes(index, .std.resid)) +
geom_point(aes(color=as.factor(data.train$target)), alpha = .5) +
theme_bw()+
ggtitle("Model output - Looking for Outliers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
This optimized model blows the original model out of the water, with an AIC of 140 and an accuracy of about 97.4%. We can also see that there are no standardized residuals with absolute values greater than 3, which is also good. We have a simpler model (fewer variables) with better predictions and smaller residuals - this is the model we'll run with.
Based on the above, we'll go with model_alpha, as it shows superior performance, lower error and fewer predictors than the other two models that were tested.
First we'll prep the out-of-sample data in the same way as was done in-sample:
#load the evaluation data from file and apply the same transformations used on the training data
data.eval <- read.table("crime-evaluation-data_modified.csv", sep = ',', header = TRUE)
data.eval$dis <- log(data.eval$dis)
data.eval$rad <- log(data.eval$rad)
data.eval$tax <- sqrt(data.eval$tax)
data.eval$nox <- log(data.eval$nox)
#re-code rad as a factor, as was done for the training data
data.eval$rad <- match(data.eval$rad, unique(data.eval$rad))
data.eval$rad <- as.factor(data.eval$rad)
Next we'll make our predictions:
p <- predict(model_alpha,data.eval,type = "response")
#convert to binomial using 50% threshold
p[p>0.5] <-1
p[p<1] <- 0
kable(data.frame(p),row.names=T)
 | p |
---|---|
1 | 0 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 0 |
6 | 0 |
7 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 0 |
12 | 0 |
13 | 1 |
14 | 1 |
15 | 0 |
16 | 1 |
17 | 0 |
18 | 0 |
19 | 0 |
20 | 0 |
21 | 0 |
22 | 0 |
23 | 1 |
24 | 0 |
25 | 0 |
26 | 0 |
27 | 1 |
28 | 1 |
29 | 1 |
30 | 1 |
31 | 1 |
32 | 1 |
33 | 1 |
34 | 1 |
35 | 1 |
36 | 1 |
37 | 1 |
38 | 1 |
39 | 0 |
40 | 1 |