Overview
In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you, or variables that you derive from them.
Data Exploration
The dataset contains 13 predictor variables related to housing, transportation, the environment, education, and crime, plus the binary response variable target. This exploration supports a binary logistic regression on target, which is 1 when the neighborhood's crime rate is above the median and 0 otherwise.
## # A tibble: 6 x 14
## zn indus chas nox rm age dis rad tax ptratio black
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 0 19.6 0 0.605 7.93 96.2 2.05 5 403 14.7 369
## 2 0 19.6 1 0.871 5.40 100 1.32 5 403 14.7 397
## 3 0 18.1 0 0.740 6.48 100 1.98 24 666 20.2 387
## 4 30.0 4.93 0 0.428 6.39 7.80 7.04 6 300 16.6 375
## 5 0 2.46 0 0.488 7.16 92.2 2.70 3 193 17.8 394
## 6 0 8.56 0 0.520 6.78 71.3 2.86 5 384 20.9 396
## # ... with 3 more variables: lstat <dbl>, medv <dbl>, target <int>
Based on the statistics below, most variables do not show extreme enough skew or kurtosis to warrant immediate transformation, though a few (zn, chas, black) are heavily skewed; the visual exploration below revisits this. The data also show that there are no NA values.
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 | 0 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 | 0 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 | 0 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 | 0 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 | 0 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 | 0 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 | 0 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 | 0 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 | 0 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 | 0 |
black | 11 | 466 | 357.1201502 | 91.3211298 | 391.34000 | 383.5064439 | 8.2432560 | 0.3200 | 396.9000 | 396.5800 | -2.9163108 | 7.3419310 | 4.2303696 | 0 |
lstat | 12 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 | 0 |
medv | 13 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 | 0 |
target | 14 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 | 0 |
Visual Exploration
Boxplots
The boxplots below cover all of the variables in the dataset and show how the data are spread for each variable.
They show that some variables, for example rad, zn, and tax, have far more spread than the others.
Histograms
Looking at the histograms below, the variables zn, dis, age, and lstat appear skewed. This skew suggests applying transformations to help normalize those variables.
Correlation
The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.
For this project, it makes sense to break down the correlation by target, since that is what we are trying to predict.

variable | correlation with target |
---|---|
zn | -0.4316818 |
indus | 0.6048507 |
chas | 0.0800419 |
nox | 0.7261062 |
rm | -0.1525533 |
age | 0.6301062 |
dis | -0.6186731 |
rad | 0.6281049 |
tax | 0.6111133 |
ptratio | 0.2508489 |
black | -0.3529568 |
lstat | 0.4691270 |
medv | -0.2705507 |
target | 1.0000000 |
Below is a visual representation of the correlation plot. Among the predictors themselves, rad and tax have the strongest pairwise correlation (0.91, positive).
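These correlations can be reproduced directly; a minimal sketch using the crime_train object loaded in the appendix:

round(cor(crime_train)["rad", "tax"], 2)  # pairwise correlation between rad and tax
round(cor(crime_train)[, "target"], 4)    # each variable's correlation with target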
Missing Values
According to the graph, the data show no missing values.
Data Preparation
The crime dataset presents some features that call for transformations of the data. These transformations include adding the log of variables and adding quadratic terms.
age and lstat are both skewed, so we will add log terms to the model to help build a better-fitting model. zn and rad both have high variance, so we will add quadratic terms for them, as sketched below.
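A minimal sketch of the derived variables, mirroring the appendix code:

transform_crime <- crime_train
transform_crime$logage <- log(transform_crime$age)      # log term for skewed age
transform_crime$loglstat <- log(transform_crime$lstat)  # log term for skewed lstat
transform_crime$quadzn <- transform_crime$zn^2          # quadratic term for zn
transform_crime$quadrad <- transform_crime$rad^2        # quadratic term for rad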
Build Models
Throughout this section, various models will be created to determine which gives the best fit for predicting whether a neighborhood's crime rate is above the median, as given by the dataset. Different methods will be used, as discussed below.
Model 1 - Base Model: All variables
All of the variables will be included to establish a base model. This will show which variables are significant in our dataset and inform the other models. This model is based on the original data, before the transformed (log/quadratic) variables are added to account for potential issues in the data.
Looking at the model, nine of the coefficients (including the intercept) are statistically significant at the 0.05 level. The variable nox has the largest coefficient and also has the highest correlation with the response variable.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2854 -0.1372 -0.0017 0.0020 3.4721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.839521 7.028726 -5.241 1.59e-07 ***
## zn -0.061720 0.034410 -1.794 0.072868 .
## indus -0.072580 0.048546 -1.495 0.134894
## chas 1.032352 0.759627 1.359 0.174139
## nox 50.159513 8.049503 6.231 4.62e-10 ***
## rm -0.692145 0.741431 -0.934 0.350548
## age 0.034522 0.013883 2.487 0.012895 *
## dis 0.765795 0.234407 3.267 0.001087 **
## rad 0.663015 0.165135 4.015 5.94e-05 ***
## tax -0.006593 0.003064 -2.152 0.031422 *
## ptratio 0.442217 0.132234 3.344 0.000825 ***
## black -0.013094 0.006680 -1.960 0.049974 *
## lstat 0.047571 0.054508 0.873 0.382802
## medv 0.199734 0.071022 2.812 0.004919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 186.15 on 452 degrees of freedom
## AIC: 214.15
##
## Number of Fisher Scoring iterations: 9
The model's overall p-value is low, so we reject the null hypothesis that the coefficients are unrelated to the response variable.
The AIC is 214.15, with a residual deviance of 186.15.
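That overall test compares the null and residual deviances with a chi-square test; a minimal sketch:

# likelihood-ratio test of the full model against the intercept-only model
pchisq(645.88 - 186.15, df = 465 - 452, lower.tail = FALSE)
# essentially 0, so the predictors jointly improve on the null model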
Model 2 - Backwards Elimination - Original Data
Variables will be removed one by one to determine the best-fitting model. After each variable is removed, the model is re-run, until the most favorable AIC and residual deviance are reached. Only the final output is shown. This approach is the counterpart of the 'forward selection' variant; I find it easier to work backwards and eliminate variables rather than add them.
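The elimination here was done by hand; for reference, R's step() function automates an equivalent AIC-driven backwards search. A hedged sketch, assuming the model1 object fit in the appendix:

model2_auto <- step(model1, direction = "backward", trace = FALSE)  # drop terms while AIC improves
summary(model2_auto)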
##
## Call:
## glm(formula = target ~ . - lstat - indus, family = binomial,
## data = crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1802 -0.1317 -0.0019 0.0020 3.4471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -33.695936 6.719182 -5.015 5.31e-07 ***
## zn -0.058474 0.032118 -1.821 0.06867 .
## chas 0.883472 0.742410 1.190 0.23404
## nox 45.852121 7.184459 6.382 1.75e-10 ***
## rm -0.906620 0.684211 -1.325 0.18515
## age 0.038058 0.012558 3.031 0.00244 **
## dis 0.764512 0.233408 3.275 0.00106 **
## rad 0.730418 0.158162 4.618 3.87e-06 ***
## tax -0.007777 0.002716 -2.864 0.00419 **
## ptratio 0.435162 0.132545 3.283 0.00103 **
## black -0.012373 0.006363 -1.945 0.05183 .
## medv 0.200483 0.071726 2.795 0.00519 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 189.14 on 454 degrees of freedom
## AIC: 213.14
##
## Number of Fisher Scoring iterations: 9
The AIC is 213.14, with a residual deviance of 189.14. This model has a lower AIC than model 1, but a slightly higher residual deviance.
Model 3 - All Data, Including Transformations
Model 3 includes all original variables, plus the transformed (log and quadratic) variables. These should help reduce the pronounced skew in the data and bring the affected variables closer to normal.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1264 -0.1182 -0.0008 0.0006 3.6775
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.349e+01 7.674e+00 -3.061 0.002203 **
## zn -5.076e-02 8.178e-02 -0.621 0.534815
## indus -1.025e-01 5.699e-02 -1.798 0.072209 .
## chas 1.129e+00 8.046e-01 1.403 0.160687
## nox 5.438e+01 9.013e+00 6.034 1.6e-09 ***
## rm -1.369e+00 8.175e-01 -1.675 0.093911 .
## age 1.009e-01 2.921e-02 3.454 0.000552 ***
## dis 6.962e-01 2.647e-01 2.630 0.008536 **
## rad 7.024e-01 7.040e-01 0.998 0.318466
## tax -8.260e-03 3.727e-03 -2.216 0.026687 *
## ptratio 4.816e-01 1.414e-01 3.406 0.000660 ***
## black -1.225e-02 6.335e-03 -1.933 0.053184 .
## lstat 1.454e-01 1.266e-01 1.148 0.250784
## medv 1.980e-01 7.674e-02 2.581 0.009860 **
## logage -3.092e+00 1.102e+00 -2.806 0.005023 **
## loglstat -1.957e+00 1.637e+00 -1.195 0.231961
## quadzn -1.865e-04 2.588e-03 -0.072 0.942554
## quadrad 4.647e-03 6.345e-02 0.073 0.941613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 177.33 on 448 degrees of freedom
## AIC: 213.33
##
## Number of Fisher Scoring iterations: 13
The AIC is 213.33, with a residual deviance of 177.33. This model has a very slightly higher AIC than model 2, but a slightly lower residual deviance.
Model 4 - Only Significant Variables
Model 4 uses only significant variables - including those from the transformations.
##
## Call:
## glm(formula = target ~ indus + nox + rm + age + dis + tax + ptratio +
## black + medv + logage, family = "binomial", data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.47882 -0.37078 -0.06167 0.16655 2.97476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.926302 5.805363 -4.294 1.76e-05 ***
## indus -0.136749 0.047429 -2.883 0.003936 **
## nox 43.493967 6.654248 6.536 6.31e-11 ***
## rm -0.701425 0.534087 -1.313 0.189077
## age 0.068533 0.022810 3.005 0.002660 **
## dis 0.510113 0.159032 3.208 0.001338 **
## tax 0.004350 0.001724 2.524 0.011616 *
## ptratio 0.367778 0.101720 3.616 0.000300 ***
## black -0.011914 0.005596 -2.129 0.033267 *
## medv 0.195032 0.055063 3.542 0.000397 ***
## logage -1.884538 0.970968 -1.941 0.052272 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 239.91 on 455 degrees of freedom
## AIC: 261.91
##
## Number of Fisher Scoring iterations: 7
The AIC is 261.91 and the residual deviance is 239.91, both substantially higher than in the previous models.
Model 5 - Backwards Elimination - Only Significant Variables
Model 5 uses only significant variables, including those from the transformations. Variables will be removed one by one; after each removal, the model is re-run until the most favorable AIC and residual deviance are reached. Only the final output is shown. In this case, backwards elimination removed no further variables, so the final model is identical to model 4.
##
## Call:
## glm(formula = target ~ indus + nox + rm + age + dis + tax + ptratio +
## black + medv + logage, family = "binomial", data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.47882 -0.37078 -0.06167 0.16655 2.97476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.926302 5.805363 -4.294 1.76e-05 ***
## indus -0.136749 0.047429 -2.883 0.003936 **
## nox 43.493967 6.654248 6.536 6.31e-11 ***
## rm -0.701425 0.534087 -1.313 0.189077
## age 0.068533 0.022810 3.005 0.002660 **
## dis 0.510113 0.159032 3.208 0.001338 **
## tax 0.004350 0.001724 2.524 0.011616 *
## ptratio 0.367778 0.101720 3.616 0.000300 ***
## black -0.011914 0.005596 -2.129 0.033267 *
## medv 0.195032 0.055063 3.542 0.000397 ***
## logage -1.884538 0.970968 -1.941 0.052272 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 239.91 on 455 degrees of freedom
## AIC: 261.91
##
## Number of Fisher Scoring iterations: 7
Model 6 - Leaps Package
The leaps package is a "regression subset selection" tool: it automatically generates candidate models over subsets of the predictors and is used to find the "best" model. Its result will be compared with another selection tool further below.
After running the package, I entered the "best" model manually in R, so as not to have to rerun the search each time.
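A condensed sketch of the leaps search (the full code is in the appendix); leaps() ranks the best subset of each size by Mallows' Cp:

library(leaps)
x <- model.matrix(target ~ . - 1, data = transform_crime)  # predictor matrix, response excluded
y <- transform_crime$target
bestmods <- leaps(x, y, nbest = 1)        # best subset at each model size
bestmods$which[which.min(bestmods$Cp), ]  # predictors in the lowest-Cp subset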
##
## Call:
## glm(formula = transform_crime$target ~ nox + age + rad + ptratio +
## medv + logage + quadrad, family = "binomial", data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.05357 -0.30004 -0.04184 0.01074 2.91188
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.108923 4.384475 -4.586 4.51e-06 ***
## nox 25.411290 4.193786 6.059 1.37e-09 ***
## age 0.055513 0.020528 2.704 0.00685 **
## rad 0.539888 0.421872 1.280 0.20064
## ptratio 0.273311 0.101230 2.700 0.00694 **
## medv 0.087290 0.029080 3.002 0.00268 **
## logage -1.811286 0.862736 -2.099 0.03578 *
## quadrad -0.001642 0.039667 -0.041 0.96698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 221.27 on 458 degrees of freedom
## AIC: 237.27
##
## Number of Fisher Scoring iterations: 12
Model 7 - glmulti Package
The glmulti package is an "automated model selection and model averaging" tool. The package automatically generates all possible models "with the specified response and explanatory variables" and is used to find the "best" one. The function took over 10 minutes to run on my PC, since it was set to use an exhaustive approach to selecting the best model.
After running the package, I entered the "best" model manually in R, so as not to have to rerun (10+ minutes) each time.
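A condensed sketch of the exhaustive search (the full call is in the appendix):

library(glmulti)
glmulti_out <- glmulti(target ~ ., data = crime_train,
                       level = 1,          # main effects only, no interactions
                       method = "h",       # exhaustive search over all subsets
                       crit = "aic",       # rank candidate models by AIC
                       confsetsize = 5,    # retain the 5 best models
                       plotty = FALSE, report = FALSE,
                       fitfunction = "lm")
summary(glmulti_out)$bestmodel             # formula of the best model found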
glmulti - all data including transformations
##
## Call:
## glm(formula = transform_crime$target ~ nox + age + rad + ptratio +
## medv + logage + quadrad, data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.62263 -0.18943 -0.04644 0.16223 0.97175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0819950 0.3189065 -3.393 0.000752 ***
## nox 1.8557632 0.2169094 8.555 < 2e-16 ***
## age 0.0075058 0.0018394 4.081 5.30e-05 ***
## rad 0.0659960 0.0154537 4.271 2.37e-05 ***
## ptratio 0.0146221 0.0085612 1.708 0.088321 .
## medv 0.0077351 0.0019676 3.931 9.75e-05 ***
## logage -0.1829056 0.0744502 -2.457 0.014390 *
## quadrad -0.0017688 0.0005524 -3.202 0.001459 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.09394292)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 43.026 on 458 degrees of freedom
## AIC: 230.26
##
## Number of Fisher Scoring iterations: 2
glmulti - original data only
##
## Call:
## glm(formula = crime_train$target ~ 1 + nox + age + rad + ptratio +
## medv, data = transform_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5997 -0.2030 -0.0439 0.1252 0.9264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4128361 0.2249301 -6.281 7.79e-10 ***
## nox 1.9566942 0.2157623 9.069 < 2e-16 ***
## age 0.0035317 0.0007664 4.608 5.27e-06 ***
## rad 0.0171066 0.0023402 7.310 1.19e-12 ***
## ptratio 0.0127163 0.0086324 1.473 0.141
## medv 0.0080212 0.0019934 4.024 6.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.09682873)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 44.541 on 460 degrees of freedom
## AIC: 242.39
##
## Number of Fisher Scoring iterations: 2
Looking at the two models from the glmulti package, the model with the transformed variables performs slightly better (AIC 230.26 vs. 242.39).
Model Selection
Based on the above models, I've decided to use model 7, as determined by the glmulti package. Its AIC and residual deviance seemed best suited for prediction.
Evaluating the Model
Before running the evaluation data through the model, I split the training data 80/20. This allows me to better check the accuracy of the model, since I can verify whether the target value predicted by the model is correct.
After splitting the data into two new sets (training and testing), I created an ROC curve to help determine what threshold to use in the model.
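A minimal sketch of the split and ROC construction, mirroring the appendix (caTools for the split, ROCR for the curve); the seed is an assumption added here for reproducibility, not part of the original run:

library(caTools)
library(ROCR)
set.seed(42)                                        # assumed seed; not in the original
split <- sample.split(transform_crime$target, SplitRatio = 0.8)
training <- subset(transform_crime, split == TRUE)
testing <- subset(transform_crime, split == FALSE)
fit <- glm(target ~ nox + age + rad + ptratio + medv,
           family = "binomial", data = training)
res <- predict(fit, newdata = training, type = "response")
pred <- prediction(res, training$target)            # ROCR prediction object
perf <- performance(pred, "tpr", "fpr")             # TPR vs. FPR across thresholds
plot(perf, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, by = 0.1))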
Looking at the graph, the 0.3 threshold seems to be the most suitable choice. It gives about a 0.9 true-positive rate with only a ~0.2 false-positive rate. Lower thresholds (0.1, 0.2) give slightly higher TP rates but also a high FP rate, which, in my opinion, is not worth the slight increase in TP. 0.4 gives a slightly lower FP rate but a significantly lower TP rate.
## PredictedValue
## ActualValue FALSE TRUE
## 0 138 37
## 1 12 176
## [1] 0.873
After testing model 7 on the split training data, the accuracy is around 0.873, which seems like a good fit.
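Accuracy here is the share of correct classifications, i.e., the confusion-matrix diagonal over the total; a minimal sketch using the training split above:

cm <- table(ActualValue = training$target, PredictedValue = res > 0.3)
sum(diag(cm)) / sum(cm)   # accuracy = (TN + TP) / total observations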
## PredictedValue
## ActualValue FALSE TRUE
## 0 48 14
## 1 1 40
## [1] 0.894
After testing model 7 on the split testing data, the accuracy is around 0.894, which seems like a good fit. I believe this model will be sufficient for the evaluation data.
Testing With Model 7
Using the glmulti model (model 7) on the evaluation dataset, with a threshold of 0.3, the model predicts 12 observations below the median crime rate and 28 above it.
## predict12
## 0 1
## 12 28
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2763 0.2616 0.4140 0.5128 0.8257 1.2303
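Note that the summary above includes fitted values outside [0, 1]: the final glm call omitted family = "binomial" (hence the gaussian dispersion parameter in the model 7 output), so it behaves as a linear probability model. The classification step itself is just thresholding; a minimal sketch, assuming the objects from the appendix:

predict1 <- predict(modelglmulti, newdata = crime_eval1, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)   # classify at the 0.3 threshold
table(predict12)                            # counts of predicted 0s and 1s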
For a point of reference, I also ran the first (baseline) model against the evaluation dataset. This model predicts an even 20/20 split.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000009 0.0832492 0.4714571 0.5201166 0.9999995 1.0000000
## predict11
## 0 1
## 20 20
References
All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Appendix
library(tidyverse)
library(knitr)
library(psych)
library(readr)
library(kableExtra)
library(ggiraph)
library(cowplot)
library(reshape2)
library(corrgram)
library(gridExtra)
library(usdm)
library(mice)
library(pROC)
library(caTools)
library(caret)
library(ROCR)
crime_train <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/crime-training-data.csv")
crime_eval <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/crime-evaluation-data.csv")
train <- describe(crime_train)
train$na_count <- sapply(crime_train, function(y) sum(is.na(y)))  # count NAs per column
kable(train, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")
long <- melt(crime_train, id.vars = "target") %>%
  dplyr::filter(variable != "chas") %>%
  mutate(target = as.factor(target))

ggplot(data = long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = target)) +
  facet_wrap(~ variable, scales = "free")
crime_hist <- crime_train

crime_hist %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 35)
ggplot(crime_train, aes(medv, target)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
kable(cor(drop_na(crime_train))[, 14], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")
corrgram(drop_na(crime_train), order = TRUE, upper.panel = panel.cor, main = "Crime Training Data")
library(Amelia)
missmap(crime_eval, main = "Missing values vs observed")
transform_crime <- crime_train
transform_crime$logage <- log(transform_crime$age)
transform_crime$loglstat <- log(transform_crime$lstat)
transform_crime$quadzn <- transform_crime$zn^2
transform_crime$quadrad <- transform_crime$rad^2
crime_eval1 <- crime_eval
crime_eval1$logage <- log(crime_eval$age)
crime_eval1$loglstat <- log(crime_eval$lstat)
crime_eval1$quadzn <- crime_eval$zn^2
crime_eval1$quadrad <- crime_eval$rad^2
model1 <- glm(target ~ ., family = "binomial", data = crime_train)
summary(model1)
model2 <- glm(target ~ . - lstat - indus, family = binomial, data = crime_train)
summary(model2)
model3 <- glm(target ~ ., family = "binomial", data = transform_crime)
summary(model3)
model4 <- glm(target ~ indus + nox + rm + age + dis + tax + ptratio + black + medv + logage,
              family = "binomial", data = transform_crime)
summary(model4)
model5 <- glm(target ~ indus + nox + rm + age + dis + tax + ptratio + black + medv + logage,
              family = "binomial", data = transform_crime)
summary(model5)
library(leaps)
x <- model.matrix(target ~ . - 1, data = transform_crime)  # predictor matrix, excluding the response
y <- transform_crime$target
bestmods <- leaps(x, y, nbest = 1)
bestmods
min(bestmods$Cp)
leapsmodel <- glm(transform_crime$target ~ nox + age + rad + ptratio + medv + logage + quadrad,
                  family = "binomial", data = transform_crime)
summary(leapsmodel)
library(rJava)
library(glmulti)
glmulti.lm.out <- glmulti(target ~ ., data = crime_train,
                          level = 1,               # No interaction considered
                          method = "h",            # Exhaustive approach
                          crit = "aic",            # AIC as criteria
                          confsetsize = 5,         # Keep 5 best models
                          plotty = F, report = F,  # No plot or interim reports
                          fitfunction = "lm")      # lm function
# family is not specified, so glm defaults to gaussian (a linear probability model)
modelglmulti <- glm(transform_crime$target ~ nox + age + rad + ptratio + medv + logage + quadrad,
                    data = transform_crime)
summary(modelglmulti)
modelglmulti2 <- glm(crime_train$target ~ 1 + nox + age + rad + ptratio + medv,
                     data = transform_crime)  # also gaussian by default
summary(modelglmulti2)
splitdata <- transform_crime

split <- sample.split(splitdata$target, SplitRatio = 0.8)  # stratified 80/20 split on the response
training <- subset(splitdata, split == TRUE)
testing <- subset(splitdata, split == FALSE)
modelglmulti3 <- glm(target ~ 1 + nox + age + rad + ptratio + medv,
                     family = "binomial", data = training)
res <- predict(modelglmulti3, newdata = training, type = "response")
ROCRPred <- prediction(res, training$target)
ROCRPref <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPref, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, by = 0.1))
(table(ActualValue = training$target, PredictedValue = res > 0.3))
round((149 + 167) / (149 + 167 + 9 + 37), 3)  # accuracy from the counts of this particular split
res <- predict(modelglmulti3, newdata = testing, type = "response")
(table(ActualValue = testing$target, PredictedValue = res > 0.3))
round((42 + 51) / (42 + 51 + 9 + 2), 3)  # accuracy on the held-out 20%
predict1 <- predict(modelglmulti, newdata = crime_eval1, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)
table(predict12)
summary(predict1)
predict2 <- predict(model1, newdata = crime_eval1, type = "response")
summary(predict2)
predict11 <- ifelse(predict2 > 0.5, 1, 0)
table(predict11)