Data 621 - Week3

Cover Page

Data 621 - Week 3 HW

Baron Curtin

CUNY School of Professional Studies

Introduction

The purpose of this assignment is to develop a binary logistic model that can “reliably” predict whether a neighborhood is at risk for high crime levels

Data Exploration

Non-Visual Inspection

Variables

* Response Variable: target * Explanatory Variables:

data_frame(explanatory_variables = names(crimeTraining)) %>% filter(explanatory_variables != 
    "target") %>% arrange(explanatory_variables)

glimpse(crimeTraining)

## Observations: 466
## Variables: 14
## $ zn      <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 10...
## $ indus   <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5...
## $ chas    <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693...
## $ rm      <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519...
## $ age     <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38....
## $ dis     <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896...
## $ rad     <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5,...
## $ tax     <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330,...
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, ...
## $ black   <dbl> 369.30, 396.90, 386.73, 374.71, 394.12, 395.58, 396.90...
## $ lstat   <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5....
## $ medv    <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20...
## $ target  <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...

The glipmse function of dplyr shows us that there are 466 observations and 14 variables
* 10 variables of type dble
* 4 variables of type int

summary(crimeTraining)

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         black            lstat       
##  Min.   :187.0   Min.   :12.6   Min.   :  0.32   Min.   : 1.730  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.:375.61   1st Qu.: 7.043  
##  Median :334.5   Median :18.9   Median :391.34   Median :11.350  
##  Mean   :409.5   Mean   :18.4   Mean   :357.12   Mean   :12.631  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:396.24   3rd Qu.:16.930  
##  Max.   :711.0   Max.   :22.0   Max.   :396.90   Max.   :37.970  
##       medv           target      
##  Min.   : 5.00   Min.   :0.0000  
##  1st Qu.:17.02   1st Qu.:0.0000  
##  Median :21.20   Median :0.0000  
##  Mean   :22.59   Mean   :0.4914  
##  3rd Qu.:25.00   3rd Qu.:1.0000  
##  Max.   :50.00   Max.   :1.0000

Summary is able to show that none of the variables have missing values
The mean of target is below .5 which means there are more observations where the crime rate is below the median

Basic Stats

crimeStats <- basicStats(crimeTraining)[c("nobs", "NAs", "Minimum", "Maximum", 
    "1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness", 
    "Kurtosis"), ] %>% as_tibble() %>% rownames_to_column() %>% gather(var, 
    value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>% 
    rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`, 
    max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>% 
    mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables, 
    n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness, 
    kurtosis) %>% as_tibble()

crimeStats

basicStats is further able to confirm no missing values
The variable black has the highest variance
The largest skew value is ~3.3 with the chas variable
- This variable is just a dummy variable for whether the subrub borders the Charles River

Correlation

crimeTraining %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "predictor") %>% 
    as_data_frame() %>% select(predictor, target) %>% filter(predictor != "target") %>% 
    arrange(desc(target))

The correlation table above shows that nox has the strongest positive correlation with target and dis has the strongest negative correlation with target

ggcorr(crimeTraining, palette = "RdBu", label = T, geom = "tile")

* The correlation matrix shows tax and rad have the highest correlation to each other

Visual Inspection

Density Plots

vis <- melt(crimeTraining)

## No id variables; using all as measure variables

ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")

The variable rm is the only variable that closely mirrors a normal distribution
The variables zn, chas, and dis are heavily skewed right
The variables nox, lstat, and medv are are also skewed right
indus, rad, tax, and target are multi-modal
The density plots reveal that most is the data is not normal

Histogram

crimeTraining %>% mutate(target = as.factor(target)) %>% keep(is.numeric) %>% 
    gather() %>% ggplot(aes(value)) + geom_histogram(bins = 35) + facet_wrap(~key, 
    scales = "free")

The histograms provide a clearer picture of skew revealing that age, dis, lstat are skewed
The variables are candidates for transformation

Box Plots

Box plots are provide good visual representations of the variance in the data

ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip()

The box plots reveal that many of the variables have low variances

Removing the tax, and black variables…

vis %>% filter(!variable %in% c("tax", "black")) %>% ggplot(aes(x = variable, 
    y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean, 
    color = "red", geom = "point", shape = 18, size = 3) + coord_flip()

Removing the tax and black variable made it easier to see that almost all of the variables are skewed as the means and medians do not align
Of the remaining variables, zn and age appear to have the highest variances

Removing the zn and age variables…

vis %>% filter(!variable %in% c("tax", "black", "zn", "age")) %>% ggplot(aes(x = variable, 
    y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean, 
    color = "red", geom = "point", shape = 18, size = 3) + coord_flip()

Many of the variables are skewed right as the mean appears to the right of the median

Data Preparation

To reduce the effect of skew on the model, logistic transformations will be performed on age, dis, lstat

transCrime <- crimeTraining %>% mutate_at(c("age", "dis", "lstat"), log)

Build Models

Leaps subsetting

Untransformed Data

regDiags <- regsubsets(target ~ ., data = crimeTraining, method = "exhaustive", 
    nvmax = NULL, nbest = 1)
diagSum <- summary(regDiags)
print(diagSum)

## Subset selection object
## Call: regsubsets.formula(target ~ ., data = crimeTraining, method = "exhaustive", 
##     nvmax = NULL, nbest = 1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  "*" " " " " " " " " " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  "*" " " " " " " "*" " " " "     " "   " "   " " 
## 3  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   " " 
## 4  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " "*"     " "   " "   "*" 
## 6  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " "*"     "*"   " "   "*" 
## 7  ( 1 )  " " " "   " "  "*" " " "*" " " "*" "*" "*"     "*"   " "   "*" 
## 8  ( 1 )  " " " "   " "  "*" " " "*" " " "*" "*" "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" " " "*" " " "*" "*" "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" " "   " "  "*" " " "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" " " "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

# determine best fits
plot(diagSum$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum$cp), diagSum$cp[which.min(diagSum$cp)], pch = 20, col = "red")

# cp plot
par(mfrow = c(1, 2))
plot(regDiags, scale = "Cp", main = "Cp")

# r^2 splot
plot(regDiags, scale = "adjr2", main = "Adjusted R^2")

Based on Cp, a model that includes nox, age, rad, ptratio, and medv would be the best predictor
Based on Adjusted R^2, a model that includes nox, age, rad, tax, ptratio, black, and medv would be the best predictor
Both metrics share the nox, age, rad, ptratio, and medv variables

Model 1

m1 <- glm(target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"), 
    data = crimeTraining)
summary(m1)

## 
## Call:
## glm(formula = target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"), 
##     data = crimeTraining)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.96654  -0.29783  -0.03987   0.00769   2.80829  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.936540   3.683449  -6.770 1.29e-11 ***
## nox          25.334778   4.084106   6.203 5.53e-10 ***
## age           0.019403   0.009308   2.085  0.03711 *  
## rad           0.512600   0.114818   4.464 8.03e-06 ***
## ptratio       0.274193   0.098737   2.777  0.00549 ** 
## medv          0.085445   0.027979   3.054  0.00226 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 224.71  on 460  degrees of freedom
## AIC: 236.71
## 
## Number of Fisher Scoring iterations: 8

nox has the greatest impact by far on target
age has the minimum impact on target

Model 2

m2 <- glm(target ~ nox + age + rad + tax + ptratio + black + medv, family = binomial(link = "logit"), 
    data = crimeTraining)
summary(m2)

## 
## Call:
## glm(formula = target ~ nox + age + rad + tax + ptratio + black + 
##     medv, family = binomial(link = "logit"), data = crimeTraining)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.60500  -0.22062  -0.01422   0.00332   2.84619  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -23.560623   4.891894  -4.816 1.46e-06 ***
## nox          33.274178   5.182633   6.420 1.36e-10 ***
## age           0.021944   0.009720   2.258 0.023966 *  
## rad           0.722503   0.140598   5.139 2.77e-07 ***
## tax          -0.009475   0.002654  -3.570 0.000357 ***
## ptratio       0.352663   0.110197   3.200 0.001373 ** 
## black        -0.012652   0.006906  -1.832 0.066958 .  
## medv          0.069553   0.029705   2.341 0.019208 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 204.54  on 458  degrees of freedom
## AIC: 220.54
## 
## Number of Fisher Scoring iterations: 9

nox has the greatest impact by far on target
tax has the minimum impact on target and is also negative
Positive coefficients: nox, age, rad, ptratio, medv
Negative coefficients: tax, black
- Interesting to note that black has a negative impact

Model 3: All Variables

m3 <- glm(target ~ ., family = binomial(link = "logit"), data = crimeTraining)
summary(m3)

## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = crimeTraining)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2854  -0.1372  -0.0017   0.0020   3.4721  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -36.839521   7.028726  -5.241 1.59e-07 ***
## zn           -0.061720   0.034410  -1.794 0.072868 .  
## indus        -0.072580   0.048546  -1.495 0.134894    
## chas          1.032352   0.759627   1.359 0.174139    
## nox          50.159513   8.049503   6.231 4.62e-10 ***
## rm           -0.692145   0.741431  -0.934 0.350548    
## age           0.034522   0.013883   2.487 0.012895 *  
## dis           0.765795   0.234407   3.267 0.001087 ** 
## rad           0.663015   0.165135   4.015 5.94e-05 ***
## tax          -0.006593   0.003064  -2.152 0.031422 *  
## ptratio       0.442217   0.132234   3.344 0.000825 ***
## black        -0.013094   0.006680  -1.960 0.049974 *  
## lstat         0.047571   0.054508   0.873 0.382802    
## medv          0.199734   0.071022   2.812 0.004919 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 186.15  on 452  degrees of freedom
## AIC: 214.15
## 
## Number of Fisher Scoring iterations: 9

nox has the greatest impact by far on target
tax has the minimum impact on target and is also negative
Positive coefficients: chas, nox, age, dis, rad, ptratio, lstat, medv
Negative coefficients: zn, indus, rm, tax, black
- Interesting to note that black has a negative impact

Transformed Data

regDiags2 <- regsubsets(target ~ ., data = transCrime, method = "exhaustive", 
    nvmax = NULL, nbest = 1)
diagSum2 <- summary(regDiags2)
print(diagSum2)

## Subset selection object
## Call: regsubsets.formula(target ~ ., data = transCrime, method = "exhaustive", 
##     nvmax = NULL, nbest = 1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  "*" " " " " " " " " " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  "*" " " " " " " "*" " " " "     " "   " "   " " 
## 3  ( 1 )  " " " "   " "  "*" " " " " " " "*" " " " "     " "   " "   "*" 
## 4  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " "*" " " "*" " " "*"     " "   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " "*" " " "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " "*" " " "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" "*" " " "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" " "   " "  "*" "*" "*" " " "*" "*" "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" " " "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

# determine best fits
plot(diagSum2$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum2$cp), diagSum2$cp[which.min(diagSum2$cp)], pch = 20, 
    col = "red")

# cp plot
par(mfrow = c(1, 2))
plot(regDiags2, scale = "Cp", main = "Cp")

# r^2 splot
plot(regDiags2, scale = "adjr2", main = "Adjusted R^2")

Based on Cp, a model that includes zn, nox age, rad, and medv would be the best predictor
Based on Adjusted R^2, a model that includes zn, nox age, rad, ptratio, black, lstat, and medv would be the best predictor

Model 4

m4 <- glm(target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"), 
    data = transCrime)
summary(m4)

## 
## Call:
## glm(formula = target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"), 
##     data = transCrime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7789  -0.3356  -0.0212   0.0126   2.8086  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -17.58074    2.34312  -7.503 6.23e-14 ***
## zn           -0.04485    0.02175  -2.062   0.0392 *  
## nox          23.46589    3.85329   6.090 1.13e-09 ***
## age           0.32489    0.44868   0.724   0.4690    
## rad           0.46428    0.11393   4.075 4.60e-05 ***
## medv          0.05010    0.02324   2.156   0.0311 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 230.03  on 460  degrees of freedom
## AIC: 242.03
## 
## Number of Fisher Scoring iterations: 8

nox has the greatest impact by far on target
zn has the minimum impact on target and is also negative
Positive coefficients: nox, age, rad, medv
Negative coefficients: zn

Model 5

m5 <- glm(target ~ zn + nox + age + rad + ptratio + black + lstat + medv, family = binomial(link = "logit"), 
    data = transCrime)
summary(m5)

## 
## Call:
## glm(formula = target ~ zn + nox + age + rad + ptratio + black + 
##     lstat + medv, family = binomial(link = "logit"), data = transCrime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.39925  -0.28971  -0.02109   0.00695   2.82159  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -21.078965   4.892781  -4.308 1.65e-05 ***
## zn           -0.029569   0.020935  -1.412  0.15783    
## nox          24.817093   4.134384   6.003 1.94e-09 ***
## age           0.345874   0.471596   0.733  0.46331    
## rad           0.506805   0.119092   4.256 2.09e-05 ***
## ptratio       0.241822   0.104946   2.304  0.02121 *  
## black        -0.010392   0.006291  -1.652  0.09855 .  
## lstat         0.286818   0.574518   0.499  0.61762    
## medv          0.103543   0.039160   2.644  0.00819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 221.03  on 457  degrees of freedom
## AIC: 239.03
## 
## Number of Fisher Scoring iterations: 8

nox has the greatest impact by far on target
zn has the minimum impact on target and is also negative
Positive coefficients: nox, age, rad, ptratio, lstat, medv
Negative coefficients: zn, black
- Interesting to note that black has a negative impact on target

Model 6: All Variables

m6 <- glm(target ~ ., family = binomial(link = "logit"), data = transCrime)
summary(m6)

## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = transCrime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2707  -0.1514  -0.0029   0.0026   3.2364  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -43.348929   8.261441  -5.247 1.54e-07 ***
## zn           -0.046753   0.029791  -1.569 0.116556    
## indus        -0.032247   0.048264  -0.668 0.504048    
## chas          1.072549   0.730298   1.469 0.141929    
## nox          51.605183   7.851591   6.573 4.95e-11 ***
## rm           -0.320429   0.714474  -0.448 0.653805    
## age           1.048890   0.700517   1.497 0.134313    
## dis           3.284414   0.932797   3.521 0.000430 ***
## rad           0.619195   0.159589   3.880 0.000104 ***
## tax          -0.005005   0.002912  -1.719 0.085637 .  
## ptratio       0.410696   0.128682   3.192 0.001415 ** 
## black        -0.013125   0.006719  -1.953 0.050790 .  
## lstat         0.554218   0.682897   0.812 0.417039    
## medv          0.175183   0.069097   2.535 0.011234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 189.36  on 452  degrees of freedom
## AIC: 217.36
## 
## Number of Fisher Scoring iterations: 9

nox has the greatest impact by far on target
tax has the minimum impact on target and is also negative
Positive coefficients: chas, nox, age, dis, rad, ptratio, lstat, medv
Negative coefficients: zn, indus, rm, tax, black
- Interesting to note that black has a negative impact on target

Select Models

We will use the Adjusted R^2 to select the best model. Based on the R^2, Model 2 & Model 5 are the best linear models. Both R^2 sat at .61, however we will use Model 2 (without transfromations).

Evaluation

Model 2

par(mfrow = c(2, 2))

plot(m2)

hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)

The histogram of the residuals do not show a normal distribution
The qqplot shows a fairly linear relationship, the only exception being towards the tail end of the residuals
The residual indicates that there is not constant variance throughout, as there is a noticable pattern around 0

Test Model

trainingMinus <- crimeTraining %>% select(-target)

test_results <- predict(m2, newdata = trainingMinus, type = "response")

df <- bind_cols(crimeTraining, data.frame(scored_target = test_results)) %>% 
    mutate(scored_target = if_else(scored_target > 0.5, 1, 0)) %>% print

## # A tibble: 466 x 15
##       zn indus  chas   nox    rm   age   dis   rad   tax ptratio black
##    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>   <dbl> <dbl>
##  1     0 19.6      0 0.605  7.93  96.2  2.05     5   403    14.7 369. 
##  2     0 19.6      1 0.871  5.40 100    1.32     5   403    14.7 397. 
##  3     0 18.1      0 0.74   6.48 100    1.98    24   666    20.2 387. 
##  4    30  4.93     0 0.428  6.39   7.8  7.04     6   300    16.6 375. 
##  5     0  2.46     0 0.488  7.16  92.2  2.70     3   193    17.8 394. 
##  6     0  8.56     0 0.52   6.78  71.3  2.86     5   384    20.9 396. 
##  7     0 18.1      0 0.693  5.45 100    1.49    24   666    20.2 397. 
##  8     0 18.1      0 0.693  4.52 100    1.66    24   666    20.2  88.3
##  9     0  5.19     0 0.515  6.32  38.1  6.46     5   224    20.2 390. 
## 10    80  3.64     0 0.392  5.88  19.1  9.22     1   315    16.4 395. 
## # ... with 456 more rows, and 4 more variables: lstat <dbl>, medv <dbl>,
## #   target <int>, scored_target <dbl>

Performance

cm <- confusionMatrix(as.factor(df$scored_target), as.factor(df$target), positive = "1", 
    mode = "everything") %>% print

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 218  26
##          1  19 203
##                                           
##                Accuracy : 0.9034          
##                  95% CI : (0.8729, 0.9287)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8067          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8865          
##             Specificity : 0.9198          
##          Pos Pred Value : 0.9144          
##          Neg Pred Value : 0.8934          
##               Precision : 0.9144          
##                  Recall : 0.8865          
##                      F1 : 0.9002          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4356          
##    Detection Prevalence : 0.4764          
##       Balanced Accuracy : 0.9031          
##                                           
##        'Positive' Class : 1               
##

curveRoc <- roc(df$target, df$scored_target)
plot(curveRoc, legacy.axes = T, main = "pROC")

The model used had a ~90% accuracy
The CER is ~10%
Precision was ~95%
Negative prediction rate was only ~89%. The positive prediction rate of ~91%
- The model struggled slightly more with predicting negatives
The sensitivity is ~.89
The specificity is ~.92
The F1 is ~.90
The AUC is 0.9031471

Prediction for Test Data

test_results <- predict(m2, newdata = crimeTest, type = "response")
bind_cols(crimeTest, data.frame(scored_target = test_results)) %>% mutate(scored_target = if_else(scored_target > 
    0.5, 1, 0)) %>% print

## # A tibble: 40 x 14
##       zn indus  chas   nox    rm   age   dis   rad   tax ptratio black
##    <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>   <dbl> <dbl>
##  1     0  7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  393.
##  2     0  8.14     0 0.538  6.10  84.5  4.46     4   307    21    380.
##  3     0  8.14     0 0.538  6.50  94.4  4.45     4   307    21    388.
##  4     0  8.14     0 0.538  5.95  82    3.99     4   307    21    233.
##  5     0  5.96     0 0.499  5.85  41.5  3.93     5   279    19.2  397.
##  6    25  5.13     0 0.453  5.74  66.2  7.23     8   284    19.7  395.
##  7    25  5.13     0 0.453  5.97  93.4  6.82     8   284    19.7  378.
##  8     0  4.49     0 0.449  6.63  56.1  4.44     3   247    18.5  392.
##  9     0  4.49     0 0.449  6.12  56.8  3.75     3   247    18.5  395.
## 10     0  2.89     0 0.445  6.16  69.6  3.50     2   276    18    392.
## # ... with 30 more rows, and 3 more variables: lstat <dbl>, medv <dbl>,
## #   scored_target <dbl>

Code Appendix

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)

libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC", 
    "zoo", "ISLR", "leaps", "fBasics", "reshape2", "tidyverse", "GGally", "gridExtra", 
    "ROCR")

loadPkg <- function(x) {
    if (!require(x, character.only = T)) 
        install.packages(x, dependencies = T)
    require(x, character.only = T)
}

lapply(libs, loadPkg)
crimeTraining <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week3/crime-training-data.csv") %>% 
    as.tibble()
crimeTest <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week3/crime-evaluation-data.csv") %>% 
    as.tibble()
data_frame(explanatory_variables = names(crimeTraining)) %>% filter(explanatory_variables != 
    "target") %>% arrange(explanatory_variables)
glimpse(crimeTraining)
summary(crimeTraining)
crimeStats <- basicStats(crimeTraining)[c("nobs", "NAs", "Minimum", "Maximum", 
    "1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness", 
    "Kurtosis"), ] %>% as_tibble() %>% rownames_to_column() %>% gather(var, 
    value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>% 
    rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`, 
    max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>% 
    mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables, 
    n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness, 
    kurtosis) %>% as_tibble()

crimeStats
crimeTraining %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "predictor") %>% 
    as_data_frame() %>% select(predictor, target) %>% filter(predictor != "target") %>% 
    arrange(desc(target))
ggcorr(crimeTraining, palette = "RdBu", label = T, geom = "tile")
vis <- melt(crimeTraining)

ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")
crimeTraining %>% mutate(target = as.factor(target)) %>% keep(is.numeric) %>% 
    gather() %>% ggplot(aes(value)) + geom_histogram(bins = 35) + facet_wrap(~key, 
    scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip()
vis %>% filter(!variable %in% c("tax", "black")) %>% ggplot(aes(x = variable, 
    y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean, 
    color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
vis %>% filter(!variable %in% c("tax", "black", "zn", "age")) %>% ggplot(aes(x = variable, 
    y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean, 
    color = "red", geom = "point", shape = 18, size = 3) + coord_flip()
transCrime <- crimeTraining %>% mutate_at(c("age", "dis", "lstat"), log)
regDiags <- regsubsets(target ~ ., data = crimeTraining, method = "exhaustive", 
    nvmax = NULL, nbest = 1)
diagSum <- summary(regDiags)
print(diagSum)
# determine best fits
plot(diagSum$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum$cp), diagSum$cp[which.min(diagSum$cp)], pch = 20, col = "red")

# cp plot
par(mfrow = c(1, 2))
plot(regDiags, scale = "Cp", main = "Cp")

# r^2 splot
plot(regDiags, scale = "adjr2", main = "Adjusted R^2")
m1 <- glm(target ~ nox + age + rad + ptratio + medv, family = binomial(link = "logit"), 
    data = crimeTraining)
summary(m1)
m2 <- glm(target ~ nox + age + rad + tax + ptratio + black + medv, family = binomial(link = "logit"), 
    data = crimeTraining)
summary(m2)
m3 <- glm(target ~ ., family = binomial(link = "logit"), data = crimeTraining)
summary(m3)
regDiags2 <- regsubsets(target ~ ., data = transCrime, method = "exhaustive", 
    nvmax = NULL, nbest = 1)
diagSum2 <- summary(regDiags2)
print(diagSum2)
# determine best fits
plot(diagSum2$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(diagSum2$cp), diagSum2$cp[which.min(diagSum2$cp)], pch = 20, 
    col = "red")

# cp plot
par(mfrow = c(1, 2))
plot(regDiags2, scale = "Cp", main = "Cp")

# r^2 splot
plot(regDiags2, scale = "adjr2", main = "Adjusted R^2")
m4 <- glm(target ~ zn + nox + age + rad + medv, family = binomial(link = "logit"), 
    data = transCrime)
summary(m4)
m5 <- glm(target ~ zn + nox + age + rad + ptratio + black + lstat + medv, family = binomial(link = "logit"), 
    data = transCrime)
summary(m5)
m6 <- glm(target ~ ., family = binomial(link = "logit"), data = transCrime)
summary(m6)
par(mfrow = c(2, 2))

plot(m2)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
trainingMinus <- crimeTraining %>% select(-target)

test_results <- predict(m2, newdata = trainingMinus, type = "response")

df <- bind_cols(crimeTraining, data.frame(scored_target = test_results)) %>% 
    mutate(scored_target = if_else(scored_target > 0.5, 1, 0)) %>% print
cm <- confusionMatrix(as.factor(df$scored_target), as.factor(df$target), positive = "1", 
    mode = "everything") %>% print

curveRoc <- roc(df$target, df$scored_target)
plot(curveRoc, legacy.axes = T, main = "pROC")
test_results <- predict(m2, newdata = crimeTest, type = "response")
bind_cols(crimeTest, data.frame(scored_target = test_results)) %>% mutate(scored_target = if_else(scored_target > 
    0.5, 1, 0)) %>% print

Data 621 - Week3

Baron Curtin

2018-06-29

Cover Page

Introduction

Data Exploration

Non-Visual Inspection

Variables

Basic Stats

Correlation

Visual Inspection

Density Plots

Histogram

Box Plots

Data Preparation

Build Models

Leaps subsetting

Untransformed Data

Model 1

Model 2

Model 3: All Variables

Transformed Data

Model 4

Model 5

Model 6: All Variables

Select Models

Evaluation

Model 2

Test Model

Performance

Prediction for Test Data

Code Appendix