Cover Page
Data 621 - Week 3 HW
Baron Curtin
CUNY School of Professional Studies
Introduction
The purpose of this assignment is to develop a binary logistic model that can “reliably” predict whether a neighborhood is at risk for high crime levels
Data Exploration
Non-Visual Inspection
Variables
* Response Variable: target * Explanatory Variables:
data_frame(explanatory_variables = names(crimeTraining)) %>% filter(explanatory_variables !=
"target") %>% arrange(explanatory_variables)
glimpse(crimeTraining)
## Observations: 466
## Variables: 14
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 10...
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5...
## $ chas <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693...
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519...
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38....
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896...
## $ rad <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5,...
## $ tax <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330,...
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, ...
## $ black <dbl> 369.30, 396.90, 386.73, 374.71, 394.12, 395.58, 396.90...
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5....
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20...
## $ target <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
The glipmse function of dplyr shows us that there are 466 observations and 14 variables
* 10 variables of type dble
* 4 variables of type int
summary(crimeTraining)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio black lstat
## Min. :187.0 Min. :12.6 Min. : 0.32 Min. : 1.730
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.:375.61 1st Qu.: 7.043
## Median :334.5 Median :18.9 Median :391.34 Median :11.350
## Mean :409.5 Mean :18.4 Mean :357.12 Mean :12.631
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:396.24 3rd Qu.:16.930
## Max. :711.0 Max. :22.0 Max. :396.90 Max. :37.970
## medv target
## Min. : 5.00 Min. :0.0000
## 1st Qu.:17.02 1st Qu.:0.0000
## Median :21.20 Median :0.0000
## Mean :22.59 Mean :0.4914
## 3rd Qu.:25.00 3rd Qu.:1.0000
## Max. :50.00 Max. :1.0000
- Summary is able to show that none of the variables have missing values
- The mean of target is below .5 which means there are more observations where the crime rate is below the median
Basic Stats
crimeStats <- basicStats(crimeTraining)[c("nobs", "NAs", "Minimum", "Maximum",
"1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness",
"Kurtosis"), ] %>% as_tibble() %>% rownames_to_column() %>% gather(var,
value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>%
rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`,
max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>%
mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables,
n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness,
kurtosis) %>% as_tibble()
crimeStats
- basicStats is further able to confirm no missing values
- The variable black has the highest variance
- The largest skew value is ~3.3 with the chas variable
- This variable is just a dummy variable for whether the subrub borders the Charles River
Correlation
crimeTraining %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "predictor") %>%
as_data_frame() %>% select(predictor, target) %>% filter(predictor != "target") %>%
arrange(desc(target))
- The correlation table above shows that nox has the strongest positive correlation with target and dis has the strongest negative correlation with target
ggcorr(crimeTraining, palette = "RdBu", label = T, geom = "tile")
* The correlation matrix shows tax and rad have the highest correlation to each other
Visual Inspection
Density Plots
vis <- melt(crimeTraining)
## No id variables; using all as measure variables
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
- The variable rm is the only variable that closely mirrors a normal distribution
- The variables zn, chas, and dis are heavily skewed right
- The variables nox, lstat, and medv are are also skewed right
- indus, rad, tax, and target are multi-modal
- The density plots reveal that most is the data is not normal
Histogram
crimeTraining %>% mutate(target = as.factor(target)) %>% keep(is.numeric) %>%
gather() %>% ggplot(aes(value)) + geom_histogram(bins = 35) + facet_wrap(~key,
scales = "free")
- The histograms provide a clearer picture of skew revealing that age, dis, lstat are skewed
- The variables are candidates for transformation
Box Plots
Box plots are provide good visual representations of the variance in the data
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
coord_flip()
- The box plots reveal that many of the variables have low variances
Removing the tax, and black variables…
vis %>% filter(!variable %in% c("tax", "black")) %>% ggplot(aes(x = variable,
y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean,
color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
- Removing the tax and black variable made it easier to see that almost all of the variables are skewed as the means and medians do not align
- Of the remaining variables, zn and age appear to have the highest variances
Removing the zn and age variables…
vis %>% filter(!variable %in% c("tax", "black", "zn", "age")) %>% ggplot(aes(x = variable,
y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean,
color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
- Many of the variables are skewed right as the mean appears to the right of the median
Data Preparation
To reduce the effect of skew on the model, logistic transformations will be performed on age, dis, lstat
transCrime <- crimeTraining %>% mutate_at(c("age", "dis", "lstat"), log)
Build Models
Leaps subsetting
Untransformed Data
regDiags <- regsubsets(target ~ ., data = crimeTraining, method = "exhaustive",
nvmax = NULL, nbest = 1)
diagSum <- summary(regDiags)
print(diagSum)
## Subset selection object
## Call: regsubsets.formula(target ~ ., data = crimeTraining, method = "exhaustive",
## nvmax = NULL, nbest = 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " "*" " " "*" " " "*" " " " " "*"
## 6 ( 1 ) " " " " " " "*" " " "*" " " "*" " " "*" "*" " " "*"
## 7 ( 1 ) " " " " " " "*" " " "*" " " "*" "*" "*" "*" " " "*"
## 8 ( 1 ) " " " " " " "*" " " "*" " " "*" "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" " " " " "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# determine best fits
plot(diagSum$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum$cp), diagSum$cp[which.min(diagSum$cp)], pch = 20, col = "red")
# cp plot
par(mfrow = c(1, 2))
plot(regDiags, scale = "Cp", main = "Cp")
# r^2 splot
plot(regDiags, scale = "adjr2", main = "Adjusted R^2")
- Based on Cp, a model that includes nox, age, rad, ptratio, and medv would be the best predictor
- Based on Adjusted R^2, a model that includes nox, age, rad, tax, ptratio, black, and medv would be the best predictor
- Both metrics share the nox, age, rad, ptratio, and medv variables
Model 1
m1 <- glm(target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"),
data = crimeTraining)
summary(m1)
##
## Call:
## glm(formula = target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"),
## data = crimeTraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.96654 -0.29783 -0.03987 0.00769 2.80829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.936540 3.683449 -6.770 1.29e-11 ***
## nox 25.334778 4.084106 6.203 5.53e-10 ***
## age 0.019403 0.009308 2.085 0.03711 *
## rad 0.512600 0.114818 4.464 8.03e-06 ***
## ptratio 0.274193 0.098737 2.777 0.00549 **
## medv 0.085445 0.027979 3.054 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 224.71 on 460 degrees of freedom
## AIC: 236.71
##
## Number of Fisher Scoring iterations: 8
- nox has the greatest impact by far on target
- age has the minimum impact on target
Model 2
m2 <- glm(target ~ nox + age + rad + tax + ptratio + black + medv, family = binomial(link = "logit"),
data = crimeTraining)
summary(m2)
##
## Call:
## glm(formula = target ~ nox + age + rad + tax + ptratio + black +
## medv, family = binomial(link = "logit"), data = crimeTraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.60500 -0.22062 -0.01422 0.00332 2.84619
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.560623 4.891894 -4.816 1.46e-06 ***
## nox 33.274178 5.182633 6.420 1.36e-10 ***
## age 0.021944 0.009720 2.258 0.023966 *
## rad 0.722503 0.140598 5.139 2.77e-07 ***
## tax -0.009475 0.002654 -3.570 0.000357 ***
## ptratio 0.352663 0.110197 3.200 0.001373 **
## black -0.012652 0.006906 -1.832 0.066958 .
## medv 0.069553 0.029705 2.341 0.019208 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 204.54 on 458 degrees of freedom
## AIC: 220.54
##
## Number of Fisher Scoring iterations: 9
- nox has the greatest impact by far on target
- tax has the minimum impact on target and is also negative
- Positive coefficients: nox, age, rad, ptratio, medv
- Negative coefficients: tax, black
- Interesting to note that black has a negative impact
Model 3: All Variables
m3 <- glm(target ~ ., family = binomial(link = "logit"), data = crimeTraining)
summary(m3)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = crimeTraining)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2854 -0.1372 -0.0017 0.0020 3.4721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.839521 7.028726 -5.241 1.59e-07 ***
## zn -0.061720 0.034410 -1.794 0.072868 .
## indus -0.072580 0.048546 -1.495 0.134894
## chas 1.032352 0.759627 1.359 0.174139
## nox 50.159513 8.049503 6.231 4.62e-10 ***
## rm -0.692145 0.741431 -0.934 0.350548
## age 0.034522 0.013883 2.487 0.012895 *
## dis 0.765795 0.234407 3.267 0.001087 **
## rad 0.663015 0.165135 4.015 5.94e-05 ***
## tax -0.006593 0.003064 -2.152 0.031422 *
## ptratio 0.442217 0.132234 3.344 0.000825 ***
## black -0.013094 0.006680 -1.960 0.049974 *
## lstat 0.047571 0.054508 0.873 0.382802
## medv 0.199734 0.071022 2.812 0.004919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 186.15 on 452 degrees of freedom
## AIC: 214.15
##
## Number of Fisher Scoring iterations: 9
- nox has the greatest impact by far on target
- tax has the minimum impact on target and is also negative
- Positive coefficients: chas, nox, age, dis, rad, ptratio, lstat, medv
- Negative coefficients: zn, indus, rm, tax, black
- Interesting to note that black has a negative impact
Transformed Data
regDiags2 <- regsubsets(target ~ ., data = transCrime, method = "exhaustive",
nvmax = NULL, nbest = 1)
diagSum2 <- summary(regDiags2)
print(diagSum2)
## Subset selection object
## Call: regsubsets.formula(target ~ ., data = transCrime, method = "exhaustive",
## nvmax = NULL, nbest = 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " "*"
## 4 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " "*" " " " " "*"
## 7 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" "*" " " "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# determine best fits
plot(diagSum2$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum2$cp), diagSum2$cp[which.min(diagSum2$cp)], pch = 20,
col = "red")
# cp plot
par(mfrow = c(1, 2))
plot(regDiags2, scale = "Cp", main = "Cp")
# r^2 splot
plot(regDiags2, scale = "adjr2", main = "Adjusted R^2")
- Based on Cp, a model that includes zn, nox age, rad, and medv would be the best predictor
- Based on Adjusted R^2, a model that includes zn, nox age, rad, ptratio, black, lstat, and medv would be the best predictor
Model 4
m4 <- glm(target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"),
data = transCrime)
summary(m4)
##
## Call:
## glm(formula = target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"),
## data = transCrime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7789 -0.3356 -0.0212 0.0126 2.8086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.58074 2.34312 -7.503 6.23e-14 ***
## zn -0.04485 0.02175 -2.062 0.0392 *
## nox 23.46589 3.85329 6.090 1.13e-09 ***
## age 0.32489 0.44868 0.724 0.4690
## rad 0.46428 0.11393 4.075 4.60e-05 ***
## medv 0.05010 0.02324 2.156 0.0311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 230.03 on 460 degrees of freedom
## AIC: 242.03
##
## Number of Fisher Scoring iterations: 8
- nox has the greatest impact by far on target
- zn has the minimum impact on target and is also negative
- Positive coefficients: nox, age, rad, medv
- Negative coefficients: zn
Model 5
m5 <- glm(target ~ zn + nox + age + rad + ptratio + black + lstat + medv, family = binomial(link = "logit"),
data = transCrime)
summary(m5)
##
## Call:
## glm(formula = target ~ zn + nox + age + rad + ptratio + black +
## lstat + medv, family = binomial(link = "logit"), data = transCrime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.39925 -0.28971 -0.02109 0.00695 2.82159
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.078965 4.892781 -4.308 1.65e-05 ***
## zn -0.029569 0.020935 -1.412 0.15783
## nox 24.817093 4.134384 6.003 1.94e-09 ***
## age 0.345874 0.471596 0.733 0.46331
## rad 0.506805 0.119092 4.256 2.09e-05 ***
## ptratio 0.241822 0.104946 2.304 0.02121 *
## black -0.010392 0.006291 -1.652 0.09855 .
## lstat 0.286818 0.574518 0.499 0.61762
## medv 0.103543 0.039160 2.644 0.00819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 221.03 on 457 degrees of freedom
## AIC: 239.03
##
## Number of Fisher Scoring iterations: 8
- nox has the greatest impact by far on target
- zn has the minimum impact on target and is also negative
- Positive coefficients: nox, age, rad, ptratio, lstat, medv
- Negative coefficients: zn, black
- Interesting to note that black has a negative impact on target
Model 6: All Variables
m6 <- glm(target ~ ., family = binomial(link = "logit"), data = transCrime)
summary(m6)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = transCrime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2707 -0.1514 -0.0029 0.0026 3.2364
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.348929 8.261441 -5.247 1.54e-07 ***
## zn -0.046753 0.029791 -1.569 0.116556
## indus -0.032247 0.048264 -0.668 0.504048
## chas 1.072549 0.730298 1.469 0.141929
## nox 51.605183 7.851591 6.573 4.95e-11 ***
## rm -0.320429 0.714474 -0.448 0.653805
## age 1.048890 0.700517 1.497 0.134313
## dis 3.284414 0.932797 3.521 0.000430 ***
## rad 0.619195 0.159589 3.880 0.000104 ***
## tax -0.005005 0.002912 -1.719 0.085637 .
## ptratio 0.410696 0.128682 3.192 0.001415 **
## black -0.013125 0.006719 -1.953 0.050790 .
## lstat 0.554218 0.682897 0.812 0.417039
## medv 0.175183 0.069097 2.535 0.011234 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 189.36 on 452 degrees of freedom
## AIC: 217.36
##
## Number of Fisher Scoring iterations: 9
- nox has the greatest impact by far on target
- tax has the minimum impact on target and is also negative
- Positive coefficients: chas, nox, age, dis, rad, ptratio, lstat, medv
- Negative coefficients: zn, indus, rm, tax, black
- Interesting to note that black has a negative impact on target
Select Models
We will use the Adjusted R^2 to select the best model. Based on the R^2, Model 2 & Model 5 are the best linear models. Both R^2 sat at .61, however we will use Model 2 (without transfromations).
Evaluation
Model 2
par(mfrow = c(2, 2))
plot(m2)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
- The histogram of the residuals do not show a normal distribution
- The qqplot shows a fairly linear relationship, the only exception being towards the tail end of the residuals
- The residual indicates that there is not constant variance throughout, as there is a noticable pattern around 0
Test Model
trainingMinus <- crimeTraining %>% select(-target)
test_results <- predict(m2, newdata = trainingMinus, type = "response")
df <- bind_cols(crimeTraining, data.frame(scored_target = test_results)) %>%
mutate(scored_target = if_else(scored_target > 0.5, 1, 0)) %>% print
## # A tibble: 466 x 15
## zn indus chas nox rm age dis rad tax ptratio black
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 0 19.6 0 0.605 7.93 96.2 2.05 5 403 14.7 369.
## 2 0 19.6 1 0.871 5.40 100 1.32 5 403 14.7 397.
## 3 0 18.1 0 0.74 6.48 100 1.98 24 666 20.2 387.
## 4 30 4.93 0 0.428 6.39 7.8 7.04 6 300 16.6 375.
## 5 0 2.46 0 0.488 7.16 92.2 2.70 3 193 17.8 394.
## 6 0 8.56 0 0.52 6.78 71.3 2.86 5 384 20.9 396.
## 7 0 18.1 0 0.693 5.45 100 1.49 24 666 20.2 397.
## 8 0 18.1 0 0.693 4.52 100 1.66 24 666 20.2 88.3
## 9 0 5.19 0 0.515 6.32 38.1 6.46 5 224 20.2 390.
## 10 80 3.64 0 0.392 5.88 19.1 9.22 1 315 16.4 395.
## # ... with 456 more rows, and 4 more variables: lstat <dbl>, medv <dbl>,
## # target <int>, scored_target <dbl>
Performance
cm <- confusionMatrix(as.factor(df$scored_target), as.factor(df$target), positive = "1",
mode = "everything") %>% print
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 218 26
## 1 19 203
##
## Accuracy : 0.9034
## 95% CI : (0.8729, 0.9287)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8067
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.8865
## Specificity : 0.9198
## Pos Pred Value : 0.9144
## Neg Pred Value : 0.8934
## Precision : 0.9144
## Recall : 0.8865
## F1 : 0.9002
## Prevalence : 0.4914
## Detection Rate : 0.4356
## Detection Prevalence : 0.4764
## Balanced Accuracy : 0.9031
##
## 'Positive' Class : 1
##
curveRoc <- roc(df$target, df$scored_target)
plot(curveRoc, legacy.axes = T, main = "pROC")
- The model used had a ~90% accuracy
- The CER is ~10%
- Precision was ~95%
- Negative prediction rate was only ~89%. The positive prediction rate of ~91%
- The model struggled slightly more with predicting negatives
- The model struggled slightly more with predicting negatives
- The sensitivity is ~.89
- The specificity is ~.92
- The F1 is ~.90
- The AUC is 0.9031471
Prediction for Test Data
test_results <- predict(m2, newdata = crimeTest, type = "response")
bind_cols(crimeTest, data.frame(scored_target = test_results)) %>% mutate(scored_target = if_else(scored_target >
0.5, 1, 0)) %>% print
## # A tibble: 40 x 14
## zn indus chas nox rm age dis rad tax ptratio black
## <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
## 2 0 8.14 0 0.538 6.10 84.5 4.46 4 307 21 380.
## 3 0 8.14 0 0.538 6.50 94.4 4.45 4 307 21 388.
## 4 0 8.14 0 0.538 5.95 82 3.99 4 307 21 233.
## 5 0 5.96 0 0.499 5.85 41.5 3.93 5 279 19.2 397.
## 6 25 5.13 0 0.453 5.74 66.2 7.23 8 284 19.7 395.
## 7 25 5.13 0 0.453 5.97 93.4 6.82 8 284 19.7 378.
## 8 0 4.49 0 0.449 6.63 56.1 4.44 3 247 18.5 392.
## 9 0 4.49 0 0.449 6.12 56.8 3.75 3 247 18.5 395.
## 10 0 2.89 0 0.445 6.16 69.6 3.50 2 276 18 392.
## # ... with 30 more rows, and 3 more variables: lstat <dbl>, medv <dbl>,
## # scored_target <dbl>
Code Appendix
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)
libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC",
"zoo", "ISLR", "leaps", "fBasics", "reshape2", "tidyverse", "GGally", "gridExtra",
"ROCR")
loadPkg <- function(x) {
if (!require(x, character.only = T))
install.packages(x, dependencies = T)
require(x, character.only = T)
}
lapply(libs, loadPkg)
crimeTraining <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week3/crime-training-data.csv") %>%
as.tibble()
crimeTest <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week3/crime-evaluation-data.csv") %>%
as.tibble()
data_frame(explanatory_variables = names(crimeTraining)) %>% filter(explanatory_variables !=
"target") %>% arrange(explanatory_variables)
glimpse(crimeTraining)
summary(crimeTraining)
crimeStats <- basicStats(crimeTraining)[c("nobs", "NAs", "Minimum", "Maximum",
"1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness",
"Kurtosis"), ] %>% as_tibble() %>% rownames_to_column() %>% gather(var,
value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>%
rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`,
max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>%
mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables,
n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness,
kurtosis) %>% as_tibble()
crimeStats
crimeTraining %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "predictor") %>%
as_data_frame() %>% select(predictor, target) %>% filter(predictor != "target") %>%
arrange(desc(target))
ggcorr(crimeTraining, palette = "RdBu", label = T, geom = "tile")
vis <- melt(crimeTraining)
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
crimeTraining %>% mutate(target = as.factor(target)) %>% keep(is.numeric) %>%
gather() %>% ggplot(aes(value)) + geom_histogram(bins = 35) + facet_wrap(~key,
scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
coord_flip()
vis %>% filter(!variable %in% c("tax", "black")) %>% ggplot(aes(x = variable,
y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean,
color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
vis %>% filter(!variable %in% c("tax", "black", "zn", "age")) %>% ggplot(aes(x = variable,
y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean,
color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
transCrime <- crimeTraining %>% mutate_at(c("age", "dis", "lstat"), log)
regDiags <- regsubsets(target ~ ., data = crimeTraining, method = "exhaustive",
nvmax = NULL, nbest = 1)
diagSum <- summary(regDiags)
print(diagSum)
# determine best fits
plot(diagSum$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum$cp), diagSum$cp[which.min(diagSum$cp)], pch = 20, col = "red")
# cp plot
par(mfrow = c(1, 2))
plot(regDiags, scale = "Cp", main = "Cp")
# r^2 splot
plot(regDiags, scale = "adjr2", main = "Adjusted R^2")
m1 <- glm(target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"),
data = crimeTraining)
summary(m1)
m2 <- glm(target ~ nox + age + rad + tax + ptratio + black + medv, family = binomial(link = "logit"),
data = crimeTraining)
summary(m2)
m3 <- glm(target ~ ., family = binomial(link = "logit"), data = crimeTraining)
summary(m3)
regDiags2 <- regsubsets(target ~ ., data = transCrime, method = "exhaustive",
nvmax = NULL, nbest = 1)
diagSum2 <- summary(regDiags2)
print(diagSum2)
# determine best fits
plot(diagSum2$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum2$cp), diagSum2$cp[which.min(diagSum2$cp)], pch = 20,
col = "red")
# cp plot
par(mfrow = c(1, 2))
plot(regDiags2, scale = "Cp", main = "Cp")
# r^2 splot
plot(regDiags2, scale = "adjr2", main = "Adjusted R^2")
m4 <- glm(target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"),
data = transCrime)
summary(m4)
m5 <- glm(target ~ zn + nox + age + rad + ptratio + black + lstat + medv, family = binomial(link = "logit"),
data = transCrime)
summary(m5)
m6 <- glm(target ~ ., family = binomial(link = "logit"), data = transCrime)
summary(m6)
par(mfrow = c(2, 2))
plot(m2)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
trainingMinus <- crimeTraining %>% select(-target)
test_results <- predict(m2, newdata = trainingMinus, type = "response")
df <- bind_cols(crimeTraining, data.frame(scored_target = test_results)) %>%
mutate(scored_target = if_else(scored_target > 0.5, 1, 0)) %>% print
cm <- confusionMatrix(as.factor(df$scored_target), as.factor(df$target), positive = "1",
mode = "everything") %>% print
curveRoc <- roc(df$target, df$scored_target)
plot(curveRoc, legacy.axes = T, main = "pROC")
test_results <- predict(m2, newdata = crimeTest, type = "response")
bind_cols(crimeTest, data.frame(scored_target = test_results)) %>% mutate(scored_target = if_else(scored_target >
0.5, 1, 0)) %>% print