## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.30 3.70 50.0
## 2 0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 396.90 26.82 13.4
## 3 0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 386.73 18.85 15.4
## 4 30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 374.71 5.19 23.7
## 5 0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 394.12 4.82 37.9
## 6 0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 395.58 7.67 26.5
## target
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
## [1] 466
The data set includes statistics from neighborhoods near Boston to predict whether the crime rate is above or below the median crime rate. There are 466 records. Each record constitutes data from a neighborhood.
The following is a list of the variables that will be used to build a binary logistic regression model to predict whether the crime rate is above the median (1) or below the median (0):
The following are sumamry statistics for each of the variables described above:
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio black lstat
## Min. :187.0 Min. :12.6 Min. : 0.32 Min. : 1.730
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.:375.61 1st Qu.: 7.043
## Median :334.5 Median :18.9 Median :391.34 Median :11.350
## Mean :409.5 Mean :18.4 Mean :357.12 Mean :12.631
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:396.24 3rd Qu.:16.930
## Max. :711.0 Max. :22.0 Max. :396.90 Max. :37.970
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.59
## 3rd Qu.:25.00
## Max. :50.00
There is no missing data from the data set.
This proportion ranges from 0 to 100. What is striking is that the median value is zero and the average is 11.58. If the proportion is zero, that means that no residential land is zoned for large lots. This suggests that the data is skewed to the right. Proportions of land zoned for large lots greater than 40% are outliers.
The mean for the index of accessibility to radial highways is almost double the median value.
The histogram shows that while most of the values are between 2 and 5, there are no values between 7 and 22 and then there is a spike near 24. The neighborhoods are either fairly inaccessible to radial highways or very accessible to radial highways.
The mean full value property tax rate is larger than the median value.
Most of the neighborhoods have full property tax rates betwen 200 and 425. There are no neighborhoods with property tax rates between 500 and 625 and a spike in the number of neighborhoods with higher property tax rates, at around 675.
With respect to property tax and accessibility to radial highways, there is a bifurcation between neighborhoods, in which neighborhoods fall into 2 separate categories with a large break in between.
The median proportion of black people per town is higher than the mean, indicating that the data is left skewed.
Most of the neighborhoods have a high proportion of black residents. However there are many neighborhoods with very few black residents. There is a great deal of segregation present in the suburbs of Boston. Values of the black variable below 375 are outliers.
The following variables have a number of outliers:
- medv: median value of owner-occupied homes in 1000s of dollars.
Homes valued at higher amounts are outliers.
- lstat: lower status of the population (percent)
There are outliers at the upper end of the range of data.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
There are outliers in communities that have few black residents.
- zn: proportion of residential land zoned for large lots (over 25000 square feet)
Communities that have a proportion of lots over 40% are outliers.
The following are the correlation values between each of the variables. The closer the correlation is to 1 or -1, the more highly correlated the variables.
## zn indus chas nox rm
## zn 1.00000000 -0.53826643 -0.04016203 -0.51704518 0.31981410
## indus -0.53826643 1.00000000 0.06118317 0.75963008 -0.39271181
## chas -0.04016203 0.06118317 1.00000000 0.09745577 0.09050979
## nox -0.51704518 0.75963008 0.09745577 1.00000000 -0.29548972
## rm 0.31981410 -0.39271181 0.09050979 -0.29548972 1.00000000
## age -0.57258054 0.63958182 0.07888366 0.73512782 -0.23281251
## dis 0.66012434 -0.70361886 -0.09657711 -0.76888404 0.19901584
## rad -0.31548119 0.60062839 -0.01590037 0.59582984 -0.20844570
## tax -0.31928408 0.73222922 -0.04676476 0.65387804 -0.29693430
## ptratio -0.39103573 0.39468980 -0.12866058 0.17626871 -0.36034706
## black 0.17941504 -0.35813561 0.04444450 -0.38015487 0.13266756
## lstat -0.43299252 0.60711023 -0.05142322 0.59624264 -0.63202445
## medv 0.37671713 -0.49617432 0.16156528 -0.43012267 0.70533679
## age dis rad tax ptratio
## zn -0.57258054 0.66012434 -0.31548119 -0.31928408 -0.3910357
## indus 0.63958182 -0.70361886 0.60062839 0.73222922 0.3946898
## chas 0.07888366 -0.09657711 -0.01590037 -0.04676476 -0.1286606
## nox 0.73512782 -0.76888404 0.59582984 0.65387804 0.1762687
## rm -0.23281251 0.19901584 -0.20844570 -0.29693430 -0.3603471
## age 1.00000000 -0.75089759 0.46031430 0.51212452 0.2554479
## dis -0.75089759 1.00000000 -0.49499193 -0.53425464 -0.2333394
## rad 0.46031430 -0.49499193 1.00000000 0.90646323 0.4714516
## tax 0.51212452 -0.53425464 0.90646323 1.00000000 0.4744223
## ptratio 0.25544785 -0.23333940 0.47145160 0.47442229 1.0000000
## black -0.27346774 0.29384407 -0.44637503 -0.44250586 -0.1816395
## lstat 0.60562001 -0.50752800 0.50310125 0.56418864 0.3773560
## medv -0.37815605 0.25669476 -0.39766826 -0.49003287 -0.5159153
## black lstat medv
## zn 0.1794150 -0.43299252 0.3767171
## indus -0.3581356 0.60711023 -0.4961743
## chas 0.0444445 -0.05142322 0.1615653
## nox -0.3801549 0.59624264 -0.4301227
## rm 0.1326676 -0.63202445 0.7053368
## age -0.2734677 0.60562001 -0.3781560
## dis 0.2938441 -0.50752800 0.2566948
## rad -0.4463750 0.50310125 -0.3976683
## tax -0.4425059 0.56418864 -0.4900329
## ptratio -0.1816395 0.37735605 -0.5159153
## black 1.0000000 -0.35336588 0.3300286
## lstat -0.3533659 1.00000000 -0.7358008
## medv 0.3300286 -0.73580078 1.0000000
Tax (the full value property tax rate per $10,000) and rad (index of accessibility to radial highways) have a strong positive correlation. Communities with higher property tax rates are more accessible to radial highways.
There is a positive correlation between tax and indus, between indus and nox, and between medv and rm.
There is a negative correlation between dis and nox, between dis and age, between age and nox and between lstat and medv.
To some of the correlated variables, I will use PCA (principal component analysis) to combine the correleated variables. Tax and rad will be combined. Dis, nox and age will be combined as well.
For each of the variables that will undergo PCA, subtract the is subtracted from the value. The eigenvectors are found. The transpose of one column of eigenvectors is multiplied by the transpose of the variables that are correlated. The original variables are then removed from the data table.
I am removing the data connecting black people to crime, as that is devisive.
The data is shuffled randomly. 60% of the data is in the training set and 40% of the data is in the testing set.
A logistic regression model will be built using the backward elimination model. A logit model is being implemented because the target is binary - crime rate is above the median (1) or below the median (0). Initially all of the variables will be present, and then they will be removed one at a time. The variable with the highest p value, which has the least affect on wins, will be elinimated first. Variables will be removed until every predictor has a p value below 0.05.
##
## Call:
## glm(formula = train$target ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.09875 -0.49964 0.03685 0.28834 2.20024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17260 3.79201 -0.573 0.566685
## zn -0.04240 0.02088 -2.031 0.042221 *
## indus 0.04393 0.03944 1.114 0.265355
## chas 0.14923 0.76013 0.196 0.844359
## rm -0.09147 0.59222 -0.154 0.877257
## ptratio -0.01935 0.10256 -0.189 0.850345
## lstat 0.05485 0.05037 1.089 0.276207
## medv 0.10843 0.05561 1.950 0.051207 .
## tax_rad_pca -0.01051 0.00214 -4.912 9.01e-07 ***
## dis_nox_age_pca -0.03457 0.01013 -3.413 0.000644 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 183.13 on 270 degrees of freedom
## AIC: 203.13
##
## Number of Fisher Scoring iterations: 6
rm has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train$target ~ zn + indus + chas + ptratio + lstat +
## medv + tax_rad_pca + dis_nox_age_pca, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.09771 -0.50191 0.04179 0.28530 2.20992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.579788 2.725130 -0.947 0.343809
## zn -0.043030 0.020585 -2.090 0.036581 *
## indus 0.044228 0.039355 1.124 0.261092
## chas 0.138455 0.758326 0.183 0.855128
## ptratio -0.022541 0.100375 -0.225 0.822319
## lstat 0.057323 0.047704 1.202 0.229498
## medv 0.102137 0.037698 2.709 0.006742 **
## tax_rad_pca -0.010423 0.002057 -5.067 4.04e-07 ***
## dis_nox_age_pca -0.033918 0.009181 -3.694 0.000221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 183.15 on 271 degrees of freedom
## AIC: 201.15
##
## Number of Fisher Scoring iterations: 6
chas has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train$target ~ zn + indus + ptratio + lstat + medv +
## tax_rad_pca + dis_nox_age_pca, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10069 -0.49674 0.04313 0.28519 2.20695
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.534912 2.712709 -0.934 0.350068
## zn -0.043246 0.020590 -2.100 0.035702 *
## indus 0.045106 0.039010 1.156 0.247578
## ptratio -0.024844 0.099591 -0.249 0.803008
## lstat 0.056822 0.047604 1.194 0.232620
## medv 0.102285 0.037700 2.713 0.006665 **
## tax_rad_pca -0.010406 0.002052 -5.072 3.95e-07 ***
## dis_nox_age_pca -0.033877 0.009161 -3.698 0.000218 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 183.18 on 272 degrees of freedom
## AIC: 199.18
##
## Number of Fisher Scoring iterations: 6
ptratio has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train$target ~ zn + indus + lstat + medv + tax_rad_pca +
## dis_nox_age_pca, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10598 -0.50048 0.04026 0.28477 2.24260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.131056 1.292547 -2.422 0.015419 *
## zn -0.041781 0.019468 -2.146 0.031866 *
## indus 0.046197 0.038682 1.194 0.232371
## lstat 0.059143 0.046575 1.270 0.204139
## medv 0.106341 0.034043 3.124 0.001786 **
## tax_rad_pca -0.010326 0.002028 -5.090 3.57e-07 ***
## dis_nox_age_pca -0.034205 0.009080 -3.767 0.000165 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 183.25 on 273 degrees of freedom
## AIC: 197.25
##
## Number of Fisher Scoring iterations: 6
indus has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train$target ~ zn + lstat + medv + tax_rad_pca +
## dis_nox_age_pca, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.98972 -0.49488 0.04391 0.28850 2.21317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.589591 1.195968 -2.165 0.03037 *
## zn -0.046758 0.019393 -2.411 0.01590 *
## lstat 0.066766 0.045752 1.459 0.14449
## medv 0.102238 0.033445 3.057 0.00224 **
## tax_rad_pca -0.011214 0.001930 -5.811 6.22e-09 ***
## dis_nox_age_pca -0.035284 0.009056 -3.896 9.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 184.65 on 274 degrees of freedom
## AIC: 196.65
##
## Number of Fisher Scoring iterations: 6
lstat has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train$target ~ zn + medv + tax_rad_pca + dis_nox_age_pca,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.03601 -0.50362 0.05317 0.28005 2.57021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.085947 0.585945 -1.853 0.06384 .
## zn -0.044320 0.018557 -2.388 0.01693 *
## medv 0.072523 0.025816 2.809 0.00497 **
## tax_rad_pca -0.011500 0.001941 -5.926 3.10e-09 ***
## dis_nox_age_pca -0.040593 0.008623 -4.707 2.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.15 on 279 degrees of freedom
## Residual deviance: 186.79 on 275 degrees of freedom
## AIC: 196.79
##
## Number of Fisher Scoring iterations: 6
## (Intercept) zn medv tax_rad_pca
## -0.113533677 -0.004633538 0.007582130 -0.001202332
## dis_nox_age_pca
## -0.004243935
The variables that have an affect on the target variable are zn, medv, tax/rad combined, and dis/nox/age combined.
The marginal effects reflect the change in the probability the target equals 1 given a 1 unit change in the independent variable. The marginal effect is determined at the mean for each of the independent variables. A 1 unit increase in zn, the proportion of residential land zoned for large lots, results in a 0.5% decrease in the probability that the target equals 1. A 1 unit increase in medv, median value of owner-occupied homes in $1000s, results in a 0.8% increase in the probability the target value will be 1. The next 2 variables are harder to have intution that links their meaning back to the variables tax, rad, dis, nox and age. However a 1 unit increase in the combined variable of tax, full-value property-tax rate per 10,000 dollars, and rad, the index of accessibility to radial highways, results in a 0.1% decrease in the probability that the target will equal 1. A 1 unit incerase in the combined variable, dis, weighted mean of distances to five Boston employment centers, nox, nitrogen oxides concentration (parts per 10 million), and age, proportion of owner-occupied units built prior to 1940, results in a 0.4% decrease in the probability that the target will be 1.
##
## Call:
## glm(formula = train2$target ~ ., family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6804 -0.0908 0.0000 0.0018 3.4085
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.313886 10.980207 -2.761 0.00577 **
## zn -0.045773 0.038319 -1.195 0.23227
## indus -0.115487 0.075428 -1.531 0.12575
## chas 2.082124 1.037282 2.007 0.04472 *
## nox 49.926256 11.509390 4.338 1.44e-05 ***
## rm 0.050818 1.006292 0.051 0.95972
## age 0.035866 0.016796 2.135 0.03273 *
## dis 0.669740 0.286641 2.337 0.01946 *
## rad 0.696678 0.222568 3.130 0.00175 **
## tax -0.009329 0.005669 -1.646 0.09983 .
## ptratio 0.584311 0.202493 2.886 0.00391 **
## black -0.044583 0.020554 -2.169 0.03008 *
## lstat 0.097390 0.079185 1.230 0.21873
## medv 0.164493 0.089236 1.843 0.06528 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 102.66 on 266 degrees of freedom
## AIC: 130.66
##
## Number of Fisher Scoring iterations: 9
rm has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ zn + indus + chas + nox + age +
## dis + rad + tax + ptratio + black + lstat + medv, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6772 -0.0903 0.0000 0.0018 3.4092
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.200817 10.751159 -2.809 0.00497 **
## zn -0.045334 0.037298 -1.215 0.22419
## indus -0.115582 0.075390 -1.533 0.12525
## chas 2.082105 1.038597 2.005 0.04499 *
## nox 50.038158 11.304226 4.427 9.58e-06 ***
## age 0.036280 0.014677 2.472 0.01344 *
## dis 0.671732 0.284147 2.364 0.01808 *
## rad 0.697734 0.221652 3.148 0.00164 **
## tax -0.009316 0.005667 -1.644 0.10020
## ptratio 0.586709 0.197148 2.976 0.00292 **
## black -0.044580 0.020593 -2.165 0.03041 *
## lstat 0.095562 0.070581 1.354 0.17575
## medv 0.167692 0.063003 2.662 0.00778 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 102.67 on 267 degrees of freedom
## AIC: 128.67
##
## Number of Fisher Scoring iterations: 9
zn has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ indus + chas + nox + age + dis +
## rad + tax + ptratio + black + lstat + medv, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7882 -0.1239 0.0000 0.0014 3.2165
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -29.65153 10.84836 -2.733 0.00627 **
## indus -0.12815 0.07706 -1.663 0.09631 .
## chas 2.16013 0.99410 2.173 0.02978 *
## nox 51.99672 11.79631 4.408 1.04e-05 ***
## age 0.03625 0.01453 2.495 0.01261 *
## dis 0.53378 0.25165 2.121 0.03391 *
## rad 0.69773 0.20584 3.390 0.00070 ***
## tax -0.01018 0.00537 -1.895 0.05804 .
## ptratio 0.61628 0.19461 3.167 0.00154 **
## black -0.04739 0.02096 -2.261 0.02376 *
## lstat 0.09697 0.07076 1.370 0.17054
## medv 0.15311 0.06055 2.529 0.01145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 104.45 on 268 degrees of freedom
## AIC: 128.45
##
## Number of Fisher Scoring iterations: 9
lstat has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ indus + chas + nox + age + dis +
## rad + tax + ptratio + black + medv, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.641 -0.124 0.000 0.002 3.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.541354 10.927955 -2.429 0.015151 *
## indus -0.120912 0.076365 -1.583 0.113341
## chas 2.520250 0.988196 2.550 0.010761 *
## nox 51.051982 11.623669 4.392 1.12e-05 ***
## age 0.039435 0.014712 2.680 0.007353 **
## dis 0.490853 0.250375 1.960 0.049941 *
## rad 0.679160 0.202444 3.355 0.000794 ***
## tax -0.009406 0.005262 -1.788 0.073855 .
## ptratio 0.566253 0.188068 3.011 0.002605 **
## black -0.046906 0.021341 -2.198 0.027952 *
## medv 0.109984 0.050441 2.180 0.029224 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 106.25 on 269 degrees of freedom
## AIC: 128.25
##
## Number of Fisher Scoring iterations: 9
indus has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + age + dis + rad +
## tax + ptratio + black + medv, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.84679 -0.14105 0.00000 0.00211 3.15247
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.177377 10.574234 -2.381 0.017265 *
## chas 2.068314 0.909367 2.274 0.022939 *
## nox 41.438971 8.883767 4.665 3.09e-06 ***
## age 0.035617 0.013966 2.550 0.010764 *
## dis 0.474410 0.245682 1.931 0.053484 .
## rad 0.706779 0.197877 3.572 0.000355 ***
## tax -0.008507 0.005048 -1.685 0.091903 .
## ptratio 0.512238 0.180677 2.835 0.004581 **
## black -0.038355 0.018480 -2.075 0.037945 *
## medv 0.120506 0.051832 2.325 0.020075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 109.03 on 270 degrees of freedom
## AIC: 129.03
##
## Number of Fisher Scoring iterations: 9
dis has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + age + rad + tax +
## ptratio + black + medv, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79360 -0.15389 0.00000 0.00281 2.91780
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.751389 8.851240 -1.780 0.075147 .
## chas 1.735424 0.875360 1.983 0.047420 *
## nox 32.406463 6.899168 4.697 2.64e-06 ***
## age 0.026774 0.012764 2.098 0.035943 *
## rad 0.763509 0.200657 3.805 0.000142 ***
## tax -0.010535 0.004961 -2.124 0.033685 *
## ptratio 0.441114 0.173763 2.539 0.011130 *
## black -0.037041 0.018350 -2.019 0.043530 *
## medv 0.075602 0.043273 1.747 0.080619 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 112.76 on 271 degrees of freedom
## AIC: 130.76
##
## Number of Fisher Scoring iterations: 9
medv has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + age + rad + tax +
## ptratio + black, family = binomial(link = "logit"), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82982 -0.19751 0.00000 0.00179 2.78606
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.676885 8.386879 -1.392 0.16384
## chas 1.712481 0.873377 1.961 0.04991 *
## nox 33.637060 6.843295 4.915 8.86e-07 ***
## age 0.021814 0.012220 1.785 0.07425 .
## rad 0.834703 0.198278 4.210 2.56e-05 ***
## tax -0.012630 0.004784 -2.640 0.00829 **
## ptratio 0.271708 0.133077 2.042 0.04118 *
## black -0.034609 0.018167 -1.905 0.05677 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 116.00 on 272 degrees of freedom
## AIC: 132
##
## Number of Fisher Scoring iterations: 9
age has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax + ptratio +
## black, family = binomial(link = "logit"), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.98315 -0.20136 0.00000 0.00136 2.67148
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.571432 8.192355 -1.657 0.0976 .
## chas 1.872093 0.864814 2.165 0.0304 *
## nox 36.633572 6.418908 5.707 1.15e-08 ***
## rad 0.791397 0.185304 4.271 1.95e-05 ***
## tax -0.010182 0.004371 -2.330 0.0198 *
## ptratio 0.264634 0.131003 2.020 0.0434 *
## black -0.030856 0.017957 -1.718 0.0857 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 119.31 on 273 degrees of freedom
## AIC: 133.31
##
## Number of Fisher Scoring iterations: 9
black has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax + ptratio,
## family = binomial(link = "logit"), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.05454 -0.21510 0.00002 0.00142 2.71055
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.596282 4.785049 -5.349 8.83e-08 ***
## chas 1.784314 0.872626 2.045 0.0409 *
## nox 37.494616 6.563676 5.712 1.11e-08 ***
## rad 0.789544 0.181775 4.344 1.40e-05 ***
## tax -0.008826 0.004194 -2.104 0.0353 *
## ptratio 0.223146 0.122496 1.822 0.0685 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 124.04 on 274 degrees of freedom
## AIC: 136.04
##
## Number of Fisher Scoring iterations: 9
ptratio has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.11112 -0.26339 0.00008 0.00427 2.61870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.060019 3.129636 -6.410 1.46e-10 ***
## chas 1.548497 0.847409 1.827 0.0677 .
## nox 34.905414 6.073472 5.747 9.07e-09 ***
## rad 0.659414 0.157231 4.194 2.74e-05 ***
## tax -0.006772 0.004011 -1.688 0.0913 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 127.40 on 275 degrees of freedom
## AIC: 137.4
##
## Number of Fisher Scoring iterations: 8
tax has the (highest p value) lowest affect on the target and will be removed next.
##
## Call:
## glm(formula = train2$target ~ chas + nox + rad, family = binomial(link = "logit"),
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0092 -0.2997 0.0001 0.0041 2.6731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.3184 2.9016 -6.658 2.78e-11 ***
## chas 1.6418 0.8208 2.000 0.0455 *
## nox 30.4133 4.9546 6.138 8.34e-10 ***
## rad 0.5541 0.1388 3.991 6.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.11 on 279 degrees of freedom
## Residual deviance: 130.38 on 276 degrees of freedom
## AIC: 138.38
##
## Number of Fisher Scoring iterations: 8
## (Intercept) chas nox rad
## -1.42247526 0.12089180 2.23943281 0.04079843
The variables that have an affect on the target variable are chas, nox and rad.
The marginal effects reflect the change in the probability the target equals 1 given a 1 unit change in the independent variable. The marginal effect is determined at the mean for each of the independent variables. A suburb that borders the Charles River (chas=1) results in a 12% increase in the probability that the target equals 1. A 1 unit increase in nox, nitrogen oxides concentration (parts per 10 million), results in a 224% increase in the likelihood that the target is 1. A 1 unit increase in rad, index of accessibility to radial highways, results in a 4% increase in the probability the target value will be 1.
## [1] 0.33
## [1] 0.9004522
The cutoff associated with a point the maximum distance from the ROC curve is 0.33. I will use 0.33 as the cutoff for making predictions. A value above 0.33 will be assigned a target of one and a value below 0.33 wil be assigned a value of zero. The area under the curve is 0.90.
## true
## pred 0 1
## 0 76 5
## 1 22 83
The model predicted 76 0’s that were actually 0. The model predicted 5 0’s that were actually 1.
The model predicted 22 1’s that were actaully 0. The model predicted 83 1’s that were actually 1.
## accuracy error precision sensitivity specificity f1
## 1 0.8548387 0.1451613 0.424581 0.7755102 0.9431818 0.5487365
## [1] 0.86
## [1] 0.9447347
The cutoff associated with a point the farthest distance from the ROC curve is 0.86. I will use 0.86 as the cutoff for making predictions. A value above 0.86 will be assigned a target of one and a value below 0.86 wil be assigned a value of zero. The area under the curve is 0.94.
## true
## pred 0 1
## 0 97 19
## 1 2 68
## accuracy error precision sensitivity specificity f1
## 1 0.8870968 0.1129032 0.4511628 0.979798 0.7816092 0.6178344
The model predicted 97 0’s that were actually 0. The model predicted 19 0’s that were actually 1.
The model predicted 2 1’s that were actaully 0. The model predicted 68 1’s that were actually 1.
## [1] 0.28
## [1] 0.9185207
The cutoff associated with a point the farthest distance from the ROC curve is 0.28. I will use 0.28 as the cutoff for making predictions. A value above 0.28 will be assigned a target of one and a value below 0.86 wil be assigned a value of zero. The area under the curve is 0.92.
## true
## pred 0 1
## 0 65 7
## 1 20 94
The model predicted 65 0’s that were actually 0. The model predicted 7 0’s that were actually 1.
The model predicted 20 1’s that were actaully 0. The model predicted 94 1’s that were actually 1.
## accuracy error precision sensitivity specificity f1
## 1 0.8548387 0.1451613 0.4140127 0.7647059 0.9306931 0.5371901
When comparing the three models, the accuracy is greatest for the 2nd model. The precision is also highest for the second model. (The precision is the ratio of correct predictions of zero to total predictions of zero.) The sensitivity is highest for model 2. (The sensitivity is the ratio of the correct predictions of zero to all cases in which the target is zero.) The specificity is the greatest for the first model. (The specificity is the ratio of the correct predictions of 1 to the all cases in which the target is one.) The F1 score is highest for the second model. (The F1 score is equal to 2xPrecisionxSensitivity/(Precision+Sensitivity) and gives a balance between the precision and sensitivity.) The area under the roc curve is the farthest from 0.5 for model 2. (The farther the area is from 0.5, the better the model.)
I will use model 2 to make a prediction for the test data because it has a higher accuracy, precision, sensitivity, F1 score and area under the ROC curve than the other models.
The predictions for the evaluation set are below:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
crime <- read.csv(“https://raw.githubusercontent.com/swigodsky/Data621/master/crime-training-data.csv”, stringsAsFactors = FALSE) head(crime) nrow(crime)
crime_target_removed <- crime[-14] summary(crime_target_removed)
plot(crime\(zn, ylab="Proportion of Residential Land Zoned For Large Lots", main="Residential Land Zoned For Large Lots") hist(crime\)zn, xlab=“Proportion of Residential Land Zoned For Large Lots”,main=“Residential Land Zoned For Large Lots”) boxplot(crime$zn, main=“Residential Land Zoned For Large Lots”)
plot(crime\(rad, ylab="Index of Accessibility to Radial Highways", main="Accessibility to Radial Highways") hist(crime\)rad, xlab=“Index of Accessibility to Radial Highways”,main=“Accessibility to Radial Highways”) boxplot(crime$rad, main=“Index of Accessibility to Radial Highways”)
plot(crime$tax, ylab=“Full-Value Property-Tax Rate per \(10,000", main="Full-Value Property-Tax Rate") hist(crime\)tax, xlab=”Full-Value Property-Tax Rate per \(10,000",main="Full-Value Property-Tax Rate") boxplot(crime\)tax, main=“Full-Value Property-Tax Rate”)
plot(crime\(black, ylab="1000(num black people - 0.63)^2", main="Proportion of Black People by Town") hist(crime\)black, xlab=“1000(num black people - 0.63)^2”,main=“Proportion of Black People by Town”) boxplot(crime$black, main=“Proportion of Black People by Town”)
library(dplyr) library(tidyr) library(ggplot2)
boxplot(crime_target_removed, las=2,horizontal=TRUE)#, ylim=c(0,3000))
correlation <- cor(crime_target_removed, method = “pearson”) correlation pairs(crime_target_removed)
library(factoextra) crime_df <- crime
tax_mean <- mean(crime\(tax) rad_mean <- mean(crime\)rad) crime_df\(tax <- crime_df\)tax-tax_mean crime_df\(rad <- crime_df\)rad-rad_mean tax_rad <- data.frame(list(rad=crime_df\(rad, tax=crime_df\)tax), stringsAsFactors = FALSE) pca <- prcomp(tax_rad, scale = FALSE) eigen <- get_eigenvalue(pca) tax_rad <- as.matrix(tax_rad) tax_rad <- t(tax_rad) tax_rad_pca <- pca$rotation[,1] %*% tax_rad tax_rad_pca <- t(tax_rad_pca) crime_df <- cbind(crime_df, tax_rad_pca)
dis_mean <- mean(crime\(dis) nox_mean <- mean(crime\)nox) age_mean <- mean(crime\(age) crime_df\)dis <- crime_df\(dis-dis_mean crime_df\)nox <- crime_df\(nox-nox_mean crime_df\)age <- crime_df\(age-age_mean dis_nox_age <- data.frame(list(dis=crime_df\)dis, nox=crime_df\(nox, age=crime_df\)age), stringsAsFactors = FALSE) pca2 <- prcomp(dis_nox_age, scale = FALSE) eigen2 <- get_eigenvalue(pca2) dis_nox_age <- as.matrix(dis_nox_age) dis_nox_age <- t(dis_nox_age) dis_nox_age_pca <- pca2$rotation[,1] %*% dis_nox_age dis_nox_age_pca <- t(dis_nox_age_pca) crime_df <- cbind(crime_df, dis_nox_age_pca)
crime_df <-subset(crime_df, select=-c(tax,rad,dis,nox,age, black)) ##BUILD MODELS set.seed(13) n <- nrow(crime_df) shuffle_df <- crime_df[sample(n),] train_indeces <- 1:round(0.6n) train <- shuffle_df[train_indeces,] test_indeces <- (round(.6n)+1):n test <- shuffle_df[test_indeces,]
logit1 <- glm(train$target ~ ., data=train, family=binomial (link=“logit”)) summary(logit1)
logit1 <- update(logit1, .~. -rm, data = train, family=binomial (link=“logit”)) summary(logit1)
logit1 <- update(logit1, .~. -chas, data = train, family=binomial (link=“logit”)) summary(logit1)
logit1 <- update(logit1, .~. -ptratio, data = train, family=binomial (link=“logit”)) summary(logit1)
logit1 <- update(logit1, .~. -indus, data = train, family=binomial (link=“logit”)) summary(logit1)
logit1 <- update(logit1, .~. -lstat, data = train, family=binomial (link=“logit”)) summary(logit1) logitscalar1 <- mean(dlogis(predict(logit1,type=“link”))) logitscalar1*coef(logit1)
set.seed(10) n <- nrow(crime) shuffle_df2 <- crime[sample(n),] train_indeces <- 1:round(0.6n) train2 <- shuffle_df2[train_indeces,] test_indeces <- (round(.6n)+1):n test2 <- shuffle_df2[test_indeces,]
logit2 <- glm(train2$target ~ ., data=train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -rm, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -zn, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -lstat, data = train2, family=binomial (link=“logit”))
logit2 <- update(logit2, .~. -indus, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -dis, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -medv, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -age, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -black, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -ptratio, data = train2, family=binomial (link=“logit”)) summary(logit2)
logit2 <- update(logit2, .~. -tax, data = train2, family=binomial (link=“logit”)) summary(logit2)
crime_df3 <- crime
tax_rad2 <- crime_df3\(tax + crime_df3\)rad crime_df3 <- cbind(crime_df3, tax_rad2)
dis_nox_age2 <- crime_df3\(dis + crime_df3\)nox + crime_df3$age crime_df3 <- cbind(crime_df3, dis_nox_age2)
crime_df3 <-subset(crime_df3, select=-c(tax,rad,dis,nox,age, black))
set.seed(15) n <- nrow(crime_df3) shuffle_df3 <- crime_df3[sample(n),] train_indeces <- 1:round(0.6n) train3 <- shuffle_df3[train_indeces,] test_indeces <- (round(.6n)+1):n test3 <- shuffle_df3[test_indeces,]
logit3 <- glm(train3$target ~ ., data=train3, family=binomial (link=“logit”)) summary(logit3)
logit3 <- update(logit3, .~. -rm, data = train3, family=binomial (link=“logit”)) summary(logit3)
logit3 <- update(logit3, .~. -ptratio, data = train3, family=binomial (link=“logit”)) summary(logit3)
logit3 <- update(logit3, .~. -indus, data = train3, family=binomial (link=“logit”)) summary(logit3)
logit3 <- update(logit3, .~. -chas, data = train3, family=binomial (link=“logit”)) summary(logit3)
acc <- function(pred, test){ totalnum <- length(pred) numRight <- length(which(pred==test$target)) accuracy <- numRight/totalnum return(accuracy) }
err <- function(pred, test){ totalnum <- length(pred) numWrong <- length(which(pred!=test$target)) error <- numWrong/totalnum return(error) }
prec <- function(pred, test){ true_pos <- length(which((pred==0)&(test\(target==0))) all_pos <- length(which(test\)target==0)) + length(which(pred==0)) precision <- true_pos/all_pos return(precision) }
sens <- function(pred, test){ true_pos <- length(which((pred==0)&(test\(target==0))) false_neg <- length(which((pred==1)&(test\)target==0))) sensitivity <- true_pos/(true_pos+false_neg) return(sensitivity) }
spec <- function(pred, test){ true_neg <- length(which((pred==1)&(test\(target==1))) false_pos <- length(which((pred==0)&(test\)target==1))) sensitivity <- true_neg/(true_neg+false_pos) return(sensitivity) }
f1 <- function(pred, test){ true_pos <- length(which((pred==0)&(test\(target==0))) all_pos <- length(which(test\)target==0)) + length(which(pred==0)) precision <- true_pos/all_pos
false_neg <- length(which((pred==1)&(test$target==0))) sensitivity <- true_pos/(true_pos+false_neg)
f1 <- 2precisionsensitivity/(precision+sensitivity) return(f1) } #distance function taken from https://stackoverflow.com/questions/35194048/using-r-how-to-calculate-the-distance-from-one-point-to-a-line dist2d <- function(a,b,c) { v1 <- b - c v2 <- a - b m <- cbind(v1,v2) d2 <- abs(det(m))/sqrt(sum(v1*v1)) return(d2) }
roc <- function(pred,test){ d <- 0 #set distance between point and line y=x to 0 roc_tester <- data.frame(o_m_specificity=NA, sensitivity=NA)[numeric(0), ] auc=0 for (cutoff in seq(0,1.0,0.01)){ test_df <- pred #make a copy of the predictions #set scored (predicted) values in test_df according to whether the probability is above or below the cut-off threshold test_df[test_df < cutoff] <- 0 test_df[test_df >= cutoff] <- 1
spec_val <- spec(test_df, test)
sens_val <- sens(test_df, test)
roc_tester <- rbind(roc_tester, list(o_m_specificity=1-spec_val,sensitivity= sens_val))
#calculating Euclidean distance between point and y=x a2 <- c(1-spec_val,sens_val) b2 <- c(0,0) c2 <- c(1,1) d2 <- dist2d(a2,b2,c2) # distance of point a from line (b,c) if (d2>d){ d <- d2 cut_off_val <- cutoff }
#calculating area of trapezoid for each set of data points
if (cutoff>=0.1){
num_values = nrow(roc_tester)
base2 = roc_tester$sensitivity[num_values]
base1 = roc_tester$sensitivity[num_values-1]
height2 = roc_tester$o_m_specificity[num_values]
height1 = roc_tester$o_m_specificity[num_values-1]
area = .5*(base1+base2)*(height2-height1)
auc = auc + area
}
}
roc_plot <- ggplot(roc_tester, aes(x = o_m_specificity, y = sensitivity)) + geom_point() + geom_abline(slope=1) + labs(x="False Positive Rate (1-specificity)", y="True Positive Rate (sensitivity)", title="ROC Curve" )
return(list(roc_plot=roc_plot, auc_val=auc, cut_off_val=cut_off_val)) }
pred_logit1 <- predict(logit1, newdata=test, type=“response”)
roc_vals <- roc(pred_logit1,test) roc_vals\(roc_plot print(roc_vals\)cut_off_val) print(roc_vals$auc_val)
pred_logit1[pred_logit1>=0.33] <- 1 pred_logit1[pred_logit1<0.33] <- 0 table(pred=pred_logit1, true=test$target)
accuracy1 <- acc(round(pred_logit1), test) error1 <- err(round(pred_logit1), test) precision1 <- prec(round(pred_logit1), test) sensitivity1 <- sens(round(pred_logit1), test) specificity1 <- spec(round(pred_logit1), test) f11 <- f1(round(pred_logit1), test)
stat1 <- data.frame(list(accuracy=accuracy1, error=error1, precision=precision1, sensitivity=sensitivity1, specificity=specificity1, f1=f11)) print(stat1)
pred_logit2 <- predict(logit2, newdata=test2, type=“response”) roc_vals2 <- roc(pred_logit2,test2) roc_vals2\(roc_plot print(roc_vals2\)cut_off_val) print(roc_vals2$auc_val)
pred_logit2[pred_logit2>=0.86] <- 1 pred_logit2[pred_logit2<0.86] <- 0
table(pred=round(pred_logit2), true=test2$target)
accuracy2 <- acc(round(pred_logit2), test2) error2 <- err(round(pred_logit2), test2) precision2 <- prec(round(pred_logit2), test2) sensitivity2 <- sens(round(pred_logit2), test2) specificity2 <- spec(round(pred_logit2), test2) f12 <- f1(round(pred_logit2), test2)
stat2 <- data.frame(list(accuracy=accuracy2, error=error2, precision=precision2, sensitivity=sensitivity2, specificity=specificity2, f1=f12)) print(stat2)
pred_logit3 <- predict(logit3, newdata=test3, type=“response”) roc_vals3 <- roc(pred_logit3,test3) roc_vals3\(roc_plot print(roc_vals3\)cut_off_val) print(roc_vals3$auc_val)
pred_logit3[pred_logit3>=0.28] <- 1 pred_logit3[pred_logit3<0.28] <- 0
table(pred=round(pred_logit3), true=test3$target)
accuracy3 <- acc(round(pred_logit3), test3) error3 <- err(round(pred_logit3), test3) precision3 <- prec(round(pred_logit3), test3) sensitivity3 <- sens(round(pred_logit3), test3) specificity3 <- spec(round(pred_logit3), test3) f13 <- f1(round(pred_logit3), test3)
stat3 <- data.frame(list(accuracy=accuracy3, error=error3, precision=precision3, sensitivity=sensitivity3, specificity=specificity3, f1=f13)) print(stat3) ##SELECT MODEL eval_data <- read.csv(‘https://raw.githubusercontent.com/swigodsky/Data621/master/crime-evaluation-data.csv’)
pred_eval <- predict(logit2, newdata=eval_data, type=“response”) pred_eval[pred_eval>=0.86] <- 1 pred_eval[pred_eval<0.86] <- 0 pred_eval