CUNY MSDS HW3 - Binary Logistic Regression

Nicholas Schettini

June 20, 2018

Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

zn: proportion of residential land zoned for large lots
indus: proportion of non-retail business acres per suburb
chas: dummy variable for whether the suburb borders the Charles River (1) or not (0)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
lstat: percentage of the population of lower socioeconomic status
medv: median value of owner-occupied homes (in $1000s)
target: whether the crime rate is above the median crime rate (1) or not (0)

Data Exploration

The dataset contains 13 predictor variables related to housing, transportation, the environment, education, and crime, plus the binary response variable target, which can be either a 0 or a 1. A 1 indicates that the crime rate is above the median. For this data exploration, we will be focusing on a binary logistic regression for that response.
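For reference, logistic regression models the probability p = Pr(target = 1) through the log-odds of the response:

$$\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

so each coefficient shifts the log-odds of a neighborhood being above the median crime rate.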

## # A tibble: 6 x 14
##      zn indus  chas   nox    rm    age   dis   rad   tax ptratio black
##   <dbl> <dbl> <int> <dbl> <dbl>  <dbl> <dbl> <int> <int>   <dbl> <dbl>
## 1   0   19.6      0 0.605  7.93  96.2   2.05     5   403    14.7   369
## 2   0   19.6      1 0.871  5.40 100     1.32     5   403    14.7   397
## 3   0   18.1      0 0.740  6.48 100     1.98    24   666    20.2   387
## 4  30.0  4.93     0 0.428  6.39   7.80  7.04     6   300    16.6   375
## 5   0    2.46     0 0.488  7.16  92.2   2.70     3   193    17.8   394
## 6   0    8.56     0 0.520  6.78  71.3   2.86     5   384    20.9   396
## # ... with 3 more variables: lstat <dbl>, medv <dbl>, target <int>

Based on the statistics below, most variables do not show skew or kurtosis extreme enough to demand transformations outright, though a few (zn, chas, black) stand out; transformations are revisited after the visual exploration. The data also show that there are no NA values.
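The table was produced along the lines of the appendix code: psych's describe() supplies the summary statistics, with a count of NA values appended per column (a minimal sketch).

library(psych)
train_stats <- describe(crime_train)  # n, mean, sd, median, skew, kurtosis, se, ...
train_stats$na_count <- sapply(crime_train, function(y) sum(is.na(y)))  # NAs per column
train_stats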

| variable | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zn | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 | 0 |
| indus | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 | 0 |
| chas | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 | 0 |
| nox | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 | 0 |
| rm | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 | 0 |
| age | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 | 0 |
| dis | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 | 0 |
| rad | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 | 0 |
| tax | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 | 0 |
| ptratio | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 | 0 |
| black | 466 | 357.1201502 | 91.3211298 | 391.34000 | 383.5064439 | 8.2432560 | 0.3200 | 396.9000 | 396.5800 | -2.9163108 | 7.3419310 | 4.2303696 | 0 |
| lstat | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 | 0 |
| medv | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 | 0 |
| target | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 | 0 |

Visual Exploration

Boxplots

The boxplots below show all of the variables in the dataset. This visualization helps show how the data are spread for each variable.

The boxplots show that some variables, such as rad, zn, and tax, have a much larger spread than the others.

Histograms

Looking at the histograms below, the variables zn, dis, age, and lstat appear skewed. Since they are skewed, applying a transformation may help normalize those variables.
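As a quick numeric check (a minimal sketch using psych::skew; lstat is used as the example because it is strictly positive), a log transform pulls the skew toward zero:

library(psych)
skew(crime_train$lstat)       # about 0.91 on the raw scale (see the table above)
skew(log(crime_train$lstat))  # noticeably closer to 0 after the transform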

Correlation

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by target - since that’s what we’re trying to predict.
| variable | correlation with target |
|---|---|
| zn | -0.4316818 |
| indus | 0.6048507 |
| chas | 0.0800419 |
| nox | 0.7261062 |
| rm | -0.1525533 |
| age | 0.6301062 |
| dis | -0.6186731 |
| rad | 0.6281049 |
| tax | 0.6111133 |
| ptratio | 0.2508489 |
| black | -0.3529568 |
| lstat | 0.4691270 |
| medv | -0.2705507 |
| target | 1.0000000 |

Below is a visual representation of the correlation matrix. Among the predictor pairs, rad and tax have the strongest relationship, with a positive correlation of 0.91.
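The values above can be reproduced with base R (a sketch; the appendix formats the same output with kable):

cor(crime_train)[, "target"]           # correlation of every column with target
cor(crime_train$rad, crime_train$tax)  # the rad/tax pair: about 0.91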

Missing Values

According to the graph, the data show no missing values.

Data Preparation

The crime dataset has some characteristics that call for transformations of the data. These transformations include taking the log of certain variables and adding quadratic terms.

age and lstat are both skewed, so log terms will be added to help build a better-fitting model. zn and rad both have high variance, so quadratic terms will be added to help capture possible non-linear relationships. A sketch of the derived variables follows.
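This mirrors the appendix code; the new columns are added alongside the originals:

transform_crime <- crime_train
transform_crime$logage <- log(transform_crime$age)      # log of skewed variable
transform_crime$loglstat <- log(transform_crime$lstat)
transform_crime$quadzn <- transform_crime$zn^2          # quadratic terms
transform_crime$quadrad <- transform_crime$rad^2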

Build Models

Throughout this section, various models will be created to determine which provides the best fit for predicting whether a neighborhood's crime rate is above the median, as given by the dataset. Different methods will be used, as discussed below.

Model 1 - Base Model: All variables

All of the variables will be included to establish a base model. This will show which variables are significant in our dataset and inform the other models. This model is based on the original data, before the transformed (log/quadratic) variables are added to account for potential issues in the data.

Looking at the model, nine of the coefficients (the intercept plus eight predictors) are statistically significant at the 0.05 level based on their p-values. The variable nox has the largest coefficient (partly because it is on a much smaller scale than the other predictors) and also has the highest correlation with the response variable.

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2854  -0.1372  -0.0017   0.0020   3.4721  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -36.839521   7.028726  -5.241 1.59e-07 ***
## zn           -0.061720   0.034410  -1.794 0.072868 .  
## indus        -0.072580   0.048546  -1.495 0.134894    
## chas          1.032352   0.759627   1.359 0.174139    
## nox          50.159513   8.049503   6.231 4.62e-10 ***
## rm           -0.692145   0.741431  -0.934 0.350548    
## age           0.034522   0.013883   2.487 0.012895 *  
## dis           0.765795   0.234407   3.267 0.001087 ** 
## rad           0.663015   0.165135   4.015 5.94e-05 ***
## tax          -0.006593   0.003064  -2.152 0.031422 *  
## ptratio       0.442217   0.132234   3.344 0.000825 ***
## black        -0.013094   0.006680  -1.960 0.049974 *  
## lstat         0.047571   0.054508   0.873 0.382802    
## medv          0.199734   0.071022   2.812 0.004919 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 186.15  on 452  degrees of freedom
## AIC: 214.15
## 
## Number of Fisher Scoring iterations: 9

The model's p-value is low - if the p is low, the null must go - so we reject the null hypothesis that the coefficients are unrelated to the response variable (i.e., that they are all zero).

The AIC is 214.15, with a residual deviance of 186.15.
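That overall p-value comes from comparing the null and residual deviances with a chi-squared (likelihood-ratio) test; a sketch using the values reported in the summary above:

# LR statistic: null deviance minus residual deviance, on 465 - 452 = 13 df
pchisq(645.88 - 186.15, df = 13, lower.tail = FALSE)  # effectively zero, so reject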

Model 2 - Backwards Elimination - Original Data

Variables will be removed one by one to determine the best-fitting model. After each variable is removed, the model will be re-run, until the most favorable output (lowest AIC and residual deviance) is produced. Only the final output will be shown. This approach is similar to the forward-selection variant; however, I find it easier to work backwards and eliminate variables rather than add them.

## 
## Call:
## glm(formula = target ~ . - lstat - indus, family = binomial, 
##     data = crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1802  -0.1317  -0.0019   0.0020   3.4471  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -33.695936   6.719182  -5.015 5.31e-07 ***
## zn           -0.058474   0.032118  -1.821  0.06867 .  
## chas          0.883472   0.742410   1.190  0.23404    
## nox          45.852121   7.184459   6.382 1.75e-10 ***
## rm           -0.906620   0.684211  -1.325  0.18515    
## age           0.038058   0.012558   3.031  0.00244 ** 
## dis           0.764512   0.233408   3.275  0.00106 ** 
## rad           0.730418   0.158162   4.618 3.87e-06 ***
## tax          -0.007777   0.002716  -2.864  0.00419 ** 
## ptratio       0.435162   0.132545   3.283  0.00103 ** 
## black        -0.012373   0.006363  -1.945  0.05183 .  
## medv          0.200483   0.071726   2.795  0.00519 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 189.14  on 454  degrees of freedom
## AIC: 213.14
## 
## Number of Fisher Scoring iterations: 9

The AIC is 213.14, with a residual deviance of 189.14. This model has a lower AIC than Model 1, but a slightly higher residual deviance.
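As an aside, R's built-in step() automates this backward pruning using AIC (a sketch; the model above was pruned by hand, so the result is not guaranteed to match):

model2_auto <- step(model1, direction = "backward", trace = FALSE)  # drop terms by AIC
summary(model2_auto)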

Model 3 - All Data, Including Transformations

Model 3 includes all of the original variables, plus the variables created by the transformations (log and quadratic). The log and quadratic terms should help offset the large amount of skew in the data and bring the affected variables closer to normal.

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = transform_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1264  -0.1182  -0.0008   0.0006   3.6775  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.349e+01  7.674e+00  -3.061 0.002203 ** 
## zn          -5.076e-02  8.178e-02  -0.621 0.534815    
## indus       -1.025e-01  5.699e-02  -1.798 0.072209 .  
## chas         1.129e+00  8.046e-01   1.403 0.160687    
## nox          5.438e+01  9.013e+00   6.034  1.6e-09 ***
## rm          -1.369e+00  8.175e-01  -1.675 0.093911 .  
## age          1.009e-01  2.921e-02   3.454 0.000552 ***
## dis          6.962e-01  2.647e-01   2.630 0.008536 ** 
## rad          7.024e-01  7.040e-01   0.998 0.318466    
## tax         -8.260e-03  3.727e-03  -2.216 0.026687 *  
## ptratio      4.816e-01  1.414e-01   3.406 0.000660 ***
## black       -1.225e-02  6.335e-03  -1.933 0.053184 .  
## lstat        1.454e-01  1.266e-01   1.148 0.250784    
## medv         1.980e-01  7.674e-02   2.581 0.009860 ** 
## logage      -3.092e+00  1.102e+00  -2.806 0.005023 ** 
## loglstat    -1.957e+00  1.637e+00  -1.195 0.231961    
## quadzn      -1.865e-04  2.588e-03  -0.072 0.942554    
## quadrad      4.647e-03  6.345e-02   0.073 0.941613    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 177.33  on 448  degrees of freedom
## AIC: 213.33
## 
## Number of Fisher Scoring iterations: 13

The AIC is 213.33, with a residual deviance of 177.33. This model has a very slightly higher AIC than Model 2, but a lower residual deviance.

Model 4 - Only Significant Variables

Model 4 uses only significant variables - including those from the transformations.

## 
## Call:
## glm(formula = target ~ indus + nox + rm + age + dis + tax + ptratio + 
##     black + medv + logage, family = "binomial", data = transform_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.47882  -0.37078  -0.06167   0.16655   2.97476  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.926302   5.805363  -4.294 1.76e-05 ***
## indus        -0.136749   0.047429  -2.883 0.003936 ** 
## nox          43.493967   6.654248   6.536 6.31e-11 ***
## rm           -0.701425   0.534087  -1.313 0.189077    
## age           0.068533   0.022810   3.005 0.002660 ** 
## dis           0.510113   0.159032   3.208 0.001338 ** 
## tax           0.004350   0.001724   2.524 0.011616 *  
## ptratio       0.367778   0.101720   3.616 0.000300 ***
## black        -0.011914   0.005596  -2.129 0.033267 *  
## medv          0.195032   0.055063   3.542 0.000397 ***
## logage       -1.884538   0.970968  -1.941 0.052272 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 239.91  on 455  degrees of freedom
## AIC: 261.91
## 
## Number of Fisher Scoring iterations: 7

The AIC is 261.91, which is significantly higher than in the previous models, and the residual deviance of 239.91 is also much higher.

Model 5 - Backwards Elimination - Only Significant Variables

Model 5 starts from the significant variables only - including those from the transformations - and removes variables one by one, re-running the model after each removal until the most favorable output (lowest AIC and residual deviance) is produced. Only the final output is shown. In this case, elimination removed nothing further, so the resulting model is identical to Model 4.

## 
## Call:
## glm(formula = target ~ indus + nox + rm + age + dis + tax + ptratio + 
##     black + medv + logage, family = "binomial", data = transform_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.47882  -0.37078  -0.06167   0.16655   2.97476  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.926302   5.805363  -4.294 1.76e-05 ***
## indus        -0.136749   0.047429  -2.883 0.003936 ** 
## nox          43.493967   6.654248   6.536 6.31e-11 ***
## rm           -0.701425   0.534087  -1.313 0.189077    
## age           0.068533   0.022810   3.005 0.002660 ** 
## dis           0.510113   0.159032   3.208 0.001338 ** 
## tax           0.004350   0.001724   2.524 0.011616 *  
## ptratio       0.367778   0.101720   3.616 0.000300 ***
## black        -0.011914   0.005596  -2.129 0.033267 *  
## medv          0.195032   0.055063   3.542 0.000397 ***
## logage       -1.884538   0.970968  -1.941 0.052272 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 239.91  on 455  degrees of freedom
## AIC: 261.91
## 
## Number of Fisher Scoring iterations: 7

Model 6 - Leaps Package

The leaps package is a regression subset selection tool: it automatically generates all possible models and is used to find the "best" one. Its result will be compared to another tool further below.

After running the package, I entered the "best" model manually in R, so as not to have to re-run the search each time.
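The search itself looks roughly like this (a cleaned-up sketch of the appendix call; leaps() works on a design matrix and ranks subsets by Mallows' Cp). The winning formula was then refit with glm, producing the summary below.

library(leaps)
x <- model.matrix(target ~ . - 1, data = transform_crime)  # predictor matrix, no intercept
y <- transform_crime$target
bestmods <- leaps(x, y, nbest = 1)         # best subset of each size
bestmods$which[which.min(bestmods$Cp), ]   # predictors in the lowest-Cp subset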

## 
## Call:
## glm(formula = transform_crime$target ~ nox + age + rad + ptratio + 
##     medv + logage + quadrad, family = "binomial", data = transform_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.05357  -0.30004  -0.04184   0.01074   2.91188  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.108923   4.384475  -4.586 4.51e-06 ***
## nox          25.411290   4.193786   6.059 1.37e-09 ***
## age           0.055513   0.020528   2.704  0.00685 ** 
## rad           0.539888   0.421872   1.280  0.20064    
## ptratio       0.273311   0.101230   2.700  0.00694 ** 
## medv          0.087290   0.029080   3.002  0.00268 ** 
## logage       -1.811286   0.862736  -2.099  0.03578 *  
## quadrad      -0.001642   0.039667  -0.041  0.96698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 221.27  on 458  degrees of freedom
## AIC: 237.27
## 
## Number of Fisher Scoring iterations: 12

Model 7 - glmulti Package

The glmulti package is an "automated model selection and model averaging" tool. It automatically generates all possible models "with the specified response and explanatory variables" and is used to find the "best" one. The function took over 10 minutes to run on my PC, since it was set to use an exhaustive approach to selecting the best model.

After running the package, I entered the "best" model manually in R, so as not to have to re-run the (10+ minute) search each time.
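The call looked roughly like this (mirroring the appendix; note that fitfunction = "lm" fits by least squares, which is why the refitted summaries below report a gaussian family):

library(glmulti)
glmulti.lm.out <- glmulti(target ~ ., data = crime_train,
                          level = 1,          # main effects only, no interactions
                          method = "h",       # exhaustive search
                          crit = "aic",       # rank candidate models by AIC
                          confsetsize = 5,    # keep the 5 best models
                          plotty = FALSE, report = FALSE,
                          fitfunction = "lm") # least-squares fits
glmulti.lm.out@formulas                       # formulas of the retained models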

glmulti - all data including transformations

## 
## Call:
## glm(formula = transform_crime$target ~ nox + age + rad + ptratio + 
##     medv + logage + quadrad, data = transform_crime)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.62263  -0.18943  -0.04644   0.16223   0.97175  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.0819950  0.3189065  -3.393 0.000752 ***
## nox          1.8557632  0.2169094   8.555  < 2e-16 ***
## age          0.0075058  0.0018394   4.081 5.30e-05 ***
## rad          0.0659960  0.0154537   4.271 2.37e-05 ***
## ptratio      0.0146221  0.0085612   1.708 0.088321 .  
## medv         0.0077351  0.0019676   3.931 9.75e-05 ***
## logage      -0.1829056  0.0744502  -2.457 0.014390 *  
## quadrad     -0.0017688  0.0005524  -3.202 0.001459 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09394292)
## 
##     Null deviance: 116.466  on 465  degrees of freedom
## Residual deviance:  43.026  on 458  degrees of freedom
## AIC: 230.26
## 
## Number of Fisher Scoring iterations: 2

glmulti - all original data only

## 
## Call:
## glm(formula = crime_train$target ~ 1 + nox + age + rad + ptratio + 
##     medv, data = transform_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5997  -0.2030  -0.0439   0.1252   0.9264  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.4128361  0.2249301  -6.281 7.79e-10 ***
## nox          1.9566942  0.2157623   9.069  < 2e-16 ***
## age          0.0035317  0.0007664   4.608 5.27e-06 ***
## rad          0.0171066  0.0023402   7.310 1.19e-12 ***
## ptratio      0.0127163  0.0086324   1.473    0.141    
## medv         0.0080212  0.0019934   4.024 6.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09682873)
## 
##     Null deviance: 116.466  on 465  degrees of freedom
## Residual deviance:  44.541  on 460  degrees of freedom
## AIC: 242.39
## 
## Number of Fisher Scoring iterations: 2

Looking at the two models from the glmulti package, the model with the transformed variables performs slightly better (AIC 230.26 vs. 242.39). Note that both were refit with glm's default gaussian family rather than family = binomial, which is why their summaries report a gaussian dispersion parameter.

Model Selection

Based on the models above, I decided to use Model 7, as determined by the glmulti package. Its AIC and residual deviance gave the best values among the candidates for prediction.

Evaluating the Model

Before running the evaluation data through the model, I split the training data 80/20. My thinking is that this allows me to better check the accuracy of the model, since this way I can actually verify whether the target value predicted by the model is correct.

After splitting the data and creating two new sets (training and testing), I created an ROC graph to help determine what threshold to use in the model.
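A sketch of the split and the ROC curve, mirroring the appendix code (the split is stratified on the response; ROCR's performance() traces the TP/FP rates across cutoffs):

library(caTools)
library(ROCR)
splitdata <- transform_crime
split <- sample.split(splitdata$target, SplitRatio = 0.8)  # stratified 80/20 split
training <- subset(splitdata, split == TRUE)
testing <- subset(splitdata, split == FALSE)
modelglmulti3 <- glm(target ~ nox + age + rad + ptratio + medv,
                     family = binomial, data = training)
res <- predict(modelglmulti3, newdata = training, type = "response")  # probabilities
ROCRPred <- prediction(res, training$target)
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")
plot(ROCRPerf, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))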

Looking at the graph, a 0.3 threshold seems to be the most suitable choice for my testing: it gives about a 0.9 TP rate while giving only about a 0.2 FP rate. Thresholds of 0.1 and 0.2 give slightly higher TP rates but also a high FP rate, which, in my opinion, is not worth the slight increase in TP. A threshold of 0.4 gives a slightly lower FP rate, but a significantly lower TP rate.

##            PredictedValue
## ActualValue FALSE TRUE
##           0   138   37
##           1    12  176
## [1] 0.873

After testing Model 7 on the training portion of the split, the accuracy is around 0.873, which suggests a good fit.
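Accuracy here is the share of correct classifications, read off the diagonal of the confusion matrix (a sketch continuing the code above; cm is a hypothetical name):

cm <- table(ActualValue = training$target, PredictedValue = res > 0.3)
round(sum(diag(cm)) / sum(cm), 3)   # (TN + TP) / all observations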

##            PredictedValue
## ActualValue FALSE TRUE
##           0    48   14
##           1     1   40
## [1] 0.894

After testing Model 7 on the testing portion of the split, the accuracy is around 0.894, which also suggests a good fit. I believe this model will be sufficient for scoring the evaluation data.

Testing With Model 7

Using the glmulti model (Model 7) on the evaluation dataset with a threshold of 0.3, the model predicts 12 observations below the median crime rate and 28 above it.
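The counts and summary below come from code along these lines (mirroring the appendix):

predict1 <- predict(modelglmulti, newdata = crime_eval1, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)   # classify at the 0.3 threshold
table(predict12)                            # predicted 0s and 1s
summary(predict1)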

## predict12
##  0  1 
## 12 28
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2763  0.2616  0.4140  0.5128  0.8257  1.2303

For a point of reference, I also ran the first (baseline) model against the evaluation dataset. That model predicts an even 20/20 split. Note that Model 7's predicted values above range outside [0, 1] because of the gaussian refit, while the baseline logistic model's probabilities stay within [0, 1].

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000009 0.0832492 0.4714571 0.5201166 0.9999995 1.0000000
## predict11
##  0  1 
## 20 20

References

All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf

Appendix

library(tidyverse)
library(knitr)
library(psych)
library(readr)
library(kableExtra)
library(ggiraph)
library(cowplot)
library(reshape2)
library(corrgram)
library(gridExtra)
library(usdm)
library(mice)
library(pROC)
library(caTools)
library(caret)
library(ROCR)

crime_train <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/crime-training-data.csv")
crime_eval <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/crime-evaluation-data.csv")

train <- describe(crime_train)
train$na_count <- sapply(crime_train, function(y) sum(length(which(is.na(y)))))

kable(train, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

long <- melt(crime_train, id.vars = "target") %>%
  dplyr::filter(variable != "chas") %>%
  mutate(target = as.factor(target))

ggplot(data = long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = target)) +
  facet_wrap(~ variable, scales = "free")

crime_hist <- crime_train

crime_hist %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 35)

ggplot(crime_train, aes(medv, target)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

kable(cor(drop_na(crime_train))[, 14], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(crime_train), order = TRUE, upper.panel = panel.cor, main = "Crime Training Data")

library(Amelia)
missmap(crime_eval, main = "Missing values vs observed")

transform_crime <- crime_train
transform_crime$logage <- log(transform_crime$age)
transform_crime$loglstat <- log(transform_crime$lstat)
transform_crime$quadzn <- transform_crime$zn^2
transform_crime$quadrad <- transform_crime$rad^2

crime_eval1 <- crime_eval
crime_eval1$logage <- log(crime_eval$age)
crime_eval1$loglstat <- log(crime_eval$lstat)
crime_eval1$quadzn <- crime_eval$zn^2
crime_eval1$quadrad <- crime_eval$rad^2

model1 <- glm(target ~ ., family = "binomial", data = crime_train)
summary(model1)

model2 <- glm(target ~ . - lstat - indus, family = binomial, data = crime_train)
summary(model2)

model3 <- glm(target ~ ., family = "binomial", data = transform_crime)
summary(model3)

model4 <- glm(target ~ indus + nox + rm + age + dis + tax + ptratio + black + medv + logage,
              family = "binomial", data = transform_crime)
summary(model4)

model5 <- glm(target ~ indus + nox + rm + age + dis + tax + ptratio + black + medv + logage,
              family = "binomial", data = transform_crime)
summary(model5)

library(leaps)

x <- model.matrix(target ~ . - 1, data = transform_crime)
y <- transform_crime$target
bestmods <- leaps(x, y, nbest = 1)
bestmods
min(bestmods$Cp)

leapsmodel <- glm(target ~ nox + age + rad + ptratio + medv + logage + quadrad,
                  family = "binomial", data = transform_crime)

summary(leapsmodel)

library(rJava)
library(glmulti)

glmulti.lm.out <- glmulti(target ~ ., data = crime_train,
                          level = 1,          # no interactions considered
                          method = "h",       # exhaustive approach
                          crit = "aic",       # AIC as criterion
                          confsetsize = 5,    # keep 5 best models
                          plotty = F, report = F,  # no plots or interim reports
                          fitfunction = "lm") # lm fit function

glmulti.lm.out@formulas

# no family argument: glm defaults to gaussian (least squares), matching the summary above
modelglmulti <- glm(target ~ nox + age + rad + ptratio + medv + logage + quadrad,
                    data = transform_crime)

summary(modelglmulti)

modelglmulti2 <- glm(target ~ 1 + nox + age + rad + ptratio + medv, data = transform_crime)

summary(modelglmulti2)

splitdata <- transform_crime

split <- sample.split(splitdata$target, SplitRatio = 0.8)  # stratify the split on the response
training <- subset(splitdata, split == TRUE)
testing <- subset(splitdata, split == FALSE)

modelglmulti3 <- glm(target ~ 1 + nox + age + rad + ptratio + medv,
                     family = "binomial", data = training)
res <- predict(modelglmulti3, newdata = training, type = "response")

ROCRPred <- prediction(res, training$target)
ROCRPref <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPref, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

cm_train <- table(ActualValue = training$target, PredictedValue = res > 0.3)
cm_train
round(sum(diag(cm_train)) / sum(cm_train), 3)  # accuracy, in place of hand-typed counts

res <- predict(modelglmulti3, newdata = testing, type = "response")
cm_test <- table(ActualValue = testing$target, PredictedValue = res > 0.3)
cm_test
round(sum(diag(cm_test)) / sum(cm_test), 3)

predict1 <- predict(modelglmulti, newdata = crime_eval1, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)
table(predict12)
summary(predict1)

predict2 <- predict(model1, newdata = crime_eval1, type = "response")
summary(predict2)
predict11 <- ifelse(predict2 > 0.5, 1, 0)
table(predict11)