DATA 621 HW3

DATA EXPLORATION

##   zn indus chas   nox    rm   age    dis rad tax ptratio  black lstat medv
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7 369.30  3.70 50.0
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 396.90 26.82 13.4
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 386.73 18.85 15.4
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6 374.71  5.19 23.7
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8 394.12  4.82 37.9
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9 395.58  7.67 26.5
##   target
## 1      1
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0

## [1] 466

The data set includes statistics from neighborhoods near Boston to predict whether the crime rate is above or below the median crime rate. There are 466 records. Each record constitutes data from a neighborhood.

The following is a list of the variables that will be used to build a binary logistic regression model to predict whether the crime rate is above the median (1) or below the median (0):

zn: proportion of residential land zoned for large lots (over 25000 square feet)
indus: proportion of non-retail business acres per suburb
chas: a dummy var. for whether the suburb borders the Charles River
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per 10,000 dollars
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
lstat: percent of lower status of the population
medv: median value of owner-occupied homes in 1000s of dollars

The following are sumamry statistics for each of the variables described above:

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         black            lstat       
##  Min.   :187.0   Min.   :12.6   Min.   :  0.32   Min.   : 1.730  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.:375.61   1st Qu.: 7.043  
##  Median :334.5   Median :18.9   Median :391.34   Median :11.350  
##  Mean   :409.5   Mean   :18.4   Mean   :357.12   Mean   :12.631  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:396.24   3rd Qu.:16.930  
##  Max.   :711.0   Max.   :22.0   Max.   :396.90   Max.   :37.970  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.59  
##  3rd Qu.:25.00  
##  Max.   :50.00

There is no missing data from the data set.

zn - Proportion of Residential Land Zoned For Large Lots

This proportion ranges from 0 to 100. What is striking is that the median value is zero and the average is 11.58. If the proportion is zero, that means that no residential land is zoned for large lots. This suggests that the data is skewed to the right. Proportions of land zoned for large lots greater than 40% are outliers.

rad - Index of Accessibility to Radial Highways

The mean for the index of accessibility to radial highways is almost double the median value.

The histogram shows that while most of the values are between 2 and 5, there are no values between 7 and 22 and then there is a spike near 24. The neighborhoods are either fairly inaccessible to radial highways or very accessible to radial highways.

tax: Full-Value Property-Tax Rate per $10,000

The mean full value property tax rate is larger than the median value.

Most of the neighborhoods have full property tax rates betwen 200 and 425. There are no neighborhoods with property tax rates between 500 and 625 and a spike in the number of neighborhoods with higher property tax rates, at around 675.

With respect to property tax and accessibility to radial highways, there is a bifurcation between neighborhoods, in which neighborhoods fall into 2 separate categories with a large break in between.

black: 1000(Bk - 0.63)^2 where Bk is the Proportion of Black People by Town

The median proportion of black people per town is higher than the mean, indicating that the data is left skewed.

Most of the neighborhoods have a high proportion of black residents. However there are many neighborhoods with very few black residents. There is a great deal of segregation present in the suburbs of Boston. Values of the black variable below 375 are outliers.

The following variables have a number of outliers:
- medv: median value of owner-occupied homes in 1000s of dollars.
Homes valued at higher amounts are outliers.
- lstat: lower status of the population (percent)
There are outliers at the upper end of the range of data.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
There are outliers in communities that have few black residents.
- zn: proportion of residential land zoned for large lots (over 25000 square feet)
Communities that have a proportion of lots over 40% are outliers.

DATA PREPARATION

Correlation of Variables

The following are the correlation values between each of the variables. The closer the correlation is to 1 or -1, the more highly correlated the variables.

##                  zn       indus        chas         nox          rm
## zn       1.00000000 -0.53826643 -0.04016203 -0.51704518  0.31981410
## indus   -0.53826643  1.00000000  0.06118317  0.75963008 -0.39271181
## chas    -0.04016203  0.06118317  1.00000000  0.09745577  0.09050979
## nox     -0.51704518  0.75963008  0.09745577  1.00000000 -0.29548972
## rm       0.31981410 -0.39271181  0.09050979 -0.29548972  1.00000000
## age     -0.57258054  0.63958182  0.07888366  0.73512782 -0.23281251
## dis      0.66012434 -0.70361886 -0.09657711 -0.76888404  0.19901584
## rad     -0.31548119  0.60062839 -0.01590037  0.59582984 -0.20844570
## tax     -0.31928408  0.73222922 -0.04676476  0.65387804 -0.29693430
## ptratio -0.39103573  0.39468980 -0.12866058  0.17626871 -0.36034706
## black    0.17941504 -0.35813561  0.04444450 -0.38015487  0.13266756
## lstat   -0.43299252  0.60711023 -0.05142322  0.59624264 -0.63202445
## medv     0.37671713 -0.49617432  0.16156528 -0.43012267  0.70533679
##                 age         dis         rad         tax    ptratio
## zn      -0.57258054  0.66012434 -0.31548119 -0.31928408 -0.3910357
## indus    0.63958182 -0.70361886  0.60062839  0.73222922  0.3946898
## chas     0.07888366 -0.09657711 -0.01590037 -0.04676476 -0.1286606
## nox      0.73512782 -0.76888404  0.59582984  0.65387804  0.1762687
## rm      -0.23281251  0.19901584 -0.20844570 -0.29693430 -0.3603471
## age      1.00000000 -0.75089759  0.46031430  0.51212452  0.2554479
## dis     -0.75089759  1.00000000 -0.49499193 -0.53425464 -0.2333394
## rad      0.46031430 -0.49499193  1.00000000  0.90646323  0.4714516
## tax      0.51212452 -0.53425464  0.90646323  1.00000000  0.4744223
## ptratio  0.25544785 -0.23333940  0.47145160  0.47442229  1.0000000
## black   -0.27346774  0.29384407 -0.44637503 -0.44250586 -0.1816395
## lstat    0.60562001 -0.50752800  0.50310125  0.56418864  0.3773560
## medv    -0.37815605  0.25669476 -0.39766826 -0.49003287 -0.5159153
##              black       lstat       medv
## zn       0.1794150 -0.43299252  0.3767171
## indus   -0.3581356  0.60711023 -0.4961743
## chas     0.0444445 -0.05142322  0.1615653
## nox     -0.3801549  0.59624264 -0.4301227
## rm       0.1326676 -0.63202445  0.7053368
## age     -0.2734677  0.60562001 -0.3781560
## dis      0.2938441 -0.50752800  0.2566948
## rad     -0.4463750  0.50310125 -0.3976683
## tax     -0.4425059  0.56418864 -0.4900329
## ptratio -0.1816395  0.37735605 -0.5159153
## black    1.0000000 -0.35336588  0.3300286
## lstat   -0.3533659  1.00000000 -0.7358008
## medv     0.3300286 -0.73580078  1.0000000

Tax (the full value property tax rate per $10,000) and rad (index of accessibility to radial highways) have a strong positive correlation. Communities with higher property tax rates are more accessible to radial highways.

There is a positive correlation between tax and indus, between indus and nox, and between medv and rm.
There is a negative correlation between dis and nox, between dis and age, between age and nox and between lstat and medv.
To some of the correlated variables, I will use PCA (principal component analysis) to combine the correleated variables. Tax and rad will be combined. Dis, nox and age will be combined as well.
For each of the variables that will undergo PCA, subtract the is subtracted from the value. The eigenvectors are found. The transpose of one column of eigenvectors is multiplied by the transpose of the variables that are correlated. The original variables are then removed from the data table.
I am removing the data connecting black people to crime, as that is devisive.

Build Models

Breaking the data set into a training set and a testing set

The data is shuffled randomly. 60% of the data is in the training set and 40% of the data is in the testing set.

Backward Elimination - Logistic Regression Model 1 - Based on combining 2 sets of variables by PCA

A logistic regression model will be built using the backward elimination model. A logit model is being implemented because the target is binary - crime rate is above the median (1) or below the median (0). Initially all of the variables will be present, and then they will be removed one at a time. The variable with the highest p value, which has the least affect on wins, will be elinimated first. Variables will be removed until every predictor has a p value below 0.05.

## 
## Call:
## glm(formula = train$target ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.09875  -0.49964   0.03685   0.28834   2.20024  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.17260    3.79201  -0.573 0.566685    
## zn              -0.04240    0.02088  -2.031 0.042221 *  
## indus            0.04393    0.03944   1.114 0.265355    
## chas             0.14923    0.76013   0.196 0.844359    
## rm              -0.09147    0.59222  -0.154 0.877257    
## ptratio         -0.01935    0.10256  -0.189 0.850345    
## lstat            0.05485    0.05037   1.089 0.276207    
## medv             0.10843    0.05561   1.950 0.051207 .  
## tax_rad_pca     -0.01051    0.00214  -4.912 9.01e-07 ***
## dis_nox_age_pca -0.03457    0.01013  -3.413 0.000644 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 183.13  on 270  degrees of freedom
## AIC: 203.13
## 
## Number of Fisher Scoring iterations: 6

rm has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train$target ~ zn + indus + chas + ptratio + lstat + 
##     medv + tax_rad_pca + dis_nox_age_pca, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.09771  -0.50191   0.04179   0.28530   2.20992  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.579788   2.725130  -0.947 0.343809    
## zn              -0.043030   0.020585  -2.090 0.036581 *  
## indus            0.044228   0.039355   1.124 0.261092    
## chas             0.138455   0.758326   0.183 0.855128    
## ptratio         -0.022541   0.100375  -0.225 0.822319    
## lstat            0.057323   0.047704   1.202 0.229498    
## medv             0.102137   0.037698   2.709 0.006742 ** 
## tax_rad_pca     -0.010423   0.002057  -5.067 4.04e-07 ***
## dis_nox_age_pca -0.033918   0.009181  -3.694 0.000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 183.15  on 271  degrees of freedom
## AIC: 201.15
## 
## Number of Fisher Scoring iterations: 6

chas has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train$target ~ zn + indus + ptratio + lstat + medv + 
##     tax_rad_pca + dis_nox_age_pca, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10069  -0.49674   0.04313   0.28519   2.20695  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.534912   2.712709  -0.934 0.350068    
## zn              -0.043246   0.020590  -2.100 0.035702 *  
## indus            0.045106   0.039010   1.156 0.247578    
## ptratio         -0.024844   0.099591  -0.249 0.803008    
## lstat            0.056822   0.047604   1.194 0.232620    
## medv             0.102285   0.037700   2.713 0.006665 ** 
## tax_rad_pca     -0.010406   0.002052  -5.072 3.95e-07 ***
## dis_nox_age_pca -0.033877   0.009161  -3.698 0.000218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 183.18  on 272  degrees of freedom
## AIC: 199.18
## 
## Number of Fisher Scoring iterations: 6

ptratio has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train$target ~ zn + indus + lstat + medv + tax_rad_pca + 
##     dis_nox_age_pca, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10598  -0.50048   0.04026   0.28477   2.24260  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.131056   1.292547  -2.422 0.015419 *  
## zn              -0.041781   0.019468  -2.146 0.031866 *  
## indus            0.046197   0.038682   1.194 0.232371    
## lstat            0.059143   0.046575   1.270 0.204139    
## medv             0.106341   0.034043   3.124 0.001786 ** 
## tax_rad_pca     -0.010326   0.002028  -5.090 3.57e-07 ***
## dis_nox_age_pca -0.034205   0.009080  -3.767 0.000165 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 183.25  on 273  degrees of freedom
## AIC: 197.25
## 
## Number of Fisher Scoring iterations: 6

indus has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train$target ~ zn + lstat + medv + tax_rad_pca + 
##     dis_nox_age_pca, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.98972  -0.49488   0.04391   0.28850   2.21317  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.589591   1.195968  -2.165  0.03037 *  
## zn              -0.046758   0.019393  -2.411  0.01590 *  
## lstat            0.066766   0.045752   1.459  0.14449    
## medv             0.102238   0.033445   3.057  0.00224 ** 
## tax_rad_pca     -0.011214   0.001930  -5.811 6.22e-09 ***
## dis_nox_age_pca -0.035284   0.009056  -3.896 9.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 184.65  on 274  degrees of freedom
## AIC: 196.65
## 
## Number of Fisher Scoring iterations: 6

lstat has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train$target ~ zn + medv + tax_rad_pca + dis_nox_age_pca, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.03601  -0.50362   0.05317   0.28005   2.57021  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.085947   0.585945  -1.853  0.06384 .  
## zn              -0.044320   0.018557  -2.388  0.01693 *  
## medv             0.072523   0.025816   2.809  0.00497 ** 
## tax_rad_pca     -0.011500   0.001941  -5.926 3.10e-09 ***
## dis_nox_age_pca -0.040593   0.008623  -4.707 2.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.15  on 279  degrees of freedom
## Residual deviance: 186.79  on 275  degrees of freedom
## AIC: 196.79
## 
## Number of Fisher Scoring iterations: 6

##     (Intercept)              zn            medv     tax_rad_pca 
##    -0.113533677    -0.004633538     0.007582130    -0.001202332 
## dis_nox_age_pca 
##    -0.004243935

The variables that have an affect on the target variable are zn, medv, tax/rad combined, and dis/nox/age combined.

The marginal effects reflect the change in the probability the target equals 1 given a 1 unit change in the independent variable. The marginal effect is determined at the mean for each of the independent variables. A 1 unit increase in zn, the proportion of residential land zoned for large lots, results in a 0.5% decrease in the probability that the target equals 1. A 1 unit increase in medv, median value of owner-occupied homes in $1000s, results in a 0.8% increase in the probability the target value will be 1. The next 2 variables are harder to have intution that links their meaning back to the variables tax, rad, dis, nox and age. However a 1 unit increase in the combined variable of tax, full-value property-tax rate per 10,000 dollars, and rad, the index of accessibility to radial highways, results in a 0.1% decrease in the probability that the target will equal 1. A 1 unit incerase in the combined variable, dis, weighted mean of distances to five Boston employment centers, nox, nitrogen oxides concentration (parts per 10 million), and age, proportion of owner-occupied units built prior to 1940, results in a 0.4% decrease in the probability that the target will be 1.

Model 2 - Building a Model Without Combining Correlated Variables

Breaking the data set into a training set and a testing set. The data is shuffled randomly. 60% of the data is in the training set and 40% of the data is in the testing set.

Backward Elimination - Logistic Regression Model 2

## 
## Call:
## glm(formula = train2$target ~ ., family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6804  -0.0908   0.0000   0.0018   3.4085  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -30.313886  10.980207  -2.761  0.00577 ** 
## zn           -0.045773   0.038319  -1.195  0.23227    
## indus        -0.115487   0.075428  -1.531  0.12575    
## chas          2.082124   1.037282   2.007  0.04472 *  
## nox          49.926256  11.509390   4.338 1.44e-05 ***
## rm            0.050818   1.006292   0.051  0.95972    
## age           0.035866   0.016796   2.135  0.03273 *  
## dis           0.669740   0.286641   2.337  0.01946 *  
## rad           0.696678   0.222568   3.130  0.00175 ** 
## tax          -0.009329   0.005669  -1.646  0.09983 .  
## ptratio       0.584311   0.202493   2.886  0.00391 ** 
## black        -0.044583   0.020554  -2.169  0.03008 *  
## lstat         0.097390   0.079185   1.230  0.21873    
## medv          0.164493   0.089236   1.843  0.06528 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 102.66  on 266  degrees of freedom
## AIC: 130.66
## 
## Number of Fisher Scoring iterations: 9

rm has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ zn + indus + chas + nox + age + 
##     dis + rad + tax + ptratio + black + lstat + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6772  -0.0903   0.0000   0.0018   3.4092  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -30.200817  10.751159  -2.809  0.00497 ** 
## zn           -0.045334   0.037298  -1.215  0.22419    
## indus        -0.115582   0.075390  -1.533  0.12525    
## chas          2.082105   1.038597   2.005  0.04499 *  
## nox          50.038158  11.304226   4.427 9.58e-06 ***
## age           0.036280   0.014677   2.472  0.01344 *  
## dis           0.671732   0.284147   2.364  0.01808 *  
## rad           0.697734   0.221652   3.148  0.00164 ** 
## tax          -0.009316   0.005667  -1.644  0.10020    
## ptratio       0.586709   0.197148   2.976  0.00292 ** 
## black        -0.044580   0.020593  -2.165  0.03041 *  
## lstat         0.095562   0.070581   1.354  0.17575    
## medv          0.167692   0.063003   2.662  0.00778 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 102.67  on 267  degrees of freedom
## AIC: 128.67
## 
## Number of Fisher Scoring iterations: 9

zn has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ indus + chas + nox + age + dis + 
##     rad + tax + ptratio + black + lstat + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7882  -0.1239   0.0000   0.0014   3.2165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -29.65153   10.84836  -2.733  0.00627 ** 
## indus        -0.12815    0.07706  -1.663  0.09631 .  
## chas          2.16013    0.99410   2.173  0.02978 *  
## nox          51.99672   11.79631   4.408 1.04e-05 ***
## age           0.03625    0.01453   2.495  0.01261 *  
## dis           0.53378    0.25165   2.121  0.03391 *  
## rad           0.69773    0.20584   3.390  0.00070 ***
## tax          -0.01018    0.00537  -1.895  0.05804 .  
## ptratio       0.61628    0.19461   3.167  0.00154 ** 
## black        -0.04739    0.02096  -2.261  0.02376 *  
## lstat         0.09697    0.07076   1.370  0.17054    
## medv          0.15311    0.06055   2.529  0.01145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 104.45  on 268  degrees of freedom
## AIC: 128.45
## 
## Number of Fisher Scoring iterations: 9

lstat has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ indus + chas + nox + age + dis + 
##     rad + tax + ptratio + black + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.641  -0.124   0.000   0.002   3.257  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -26.541354  10.927955  -2.429 0.015151 *  
## indus        -0.120912   0.076365  -1.583 0.113341    
## chas          2.520250   0.988196   2.550 0.010761 *  
## nox          51.051982  11.623669   4.392 1.12e-05 ***
## age           0.039435   0.014712   2.680 0.007353 ** 
## dis           0.490853   0.250375   1.960 0.049941 *  
## rad           0.679160   0.202444   3.355 0.000794 ***
## tax          -0.009406   0.005262  -1.788 0.073855 .  
## ptratio       0.566253   0.188068   3.011 0.002605 ** 
## black        -0.046906   0.021341  -2.198 0.027952 *  
## medv          0.109984   0.050441   2.180 0.029224 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 106.25  on 269  degrees of freedom
## AIC: 128.25
## 
## Number of Fisher Scoring iterations: 9

indus has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + age + dis + rad + 
##     tax + ptratio + black + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.84679  -0.14105   0.00000   0.00211   3.15247  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -25.177377  10.574234  -2.381 0.017265 *  
## chas          2.068314   0.909367   2.274 0.022939 *  
## nox          41.438971   8.883767   4.665 3.09e-06 ***
## age           0.035617   0.013966   2.550 0.010764 *  
## dis           0.474410   0.245682   1.931 0.053484 .  
## rad           0.706779   0.197877   3.572 0.000355 ***
## tax          -0.008507   0.005048  -1.685 0.091903 .  
## ptratio       0.512238   0.180677   2.835 0.004581 ** 
## black        -0.038355   0.018480  -2.075 0.037945 *  
## medv          0.120506   0.051832   2.325 0.020075 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 109.03  on 270  degrees of freedom
## AIC: 129.03
## 
## Number of Fisher Scoring iterations: 9

dis has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + age + rad + tax + 
##     ptratio + black + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79360  -0.15389   0.00000   0.00281   2.91780  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -15.751389   8.851240  -1.780 0.075147 .  
## chas          1.735424   0.875360   1.983 0.047420 *  
## nox          32.406463   6.899168   4.697 2.64e-06 ***
## age           0.026774   0.012764   2.098 0.035943 *  
## rad           0.763509   0.200657   3.805 0.000142 ***
## tax          -0.010535   0.004961  -2.124 0.033685 *  
## ptratio       0.441114   0.173763   2.539 0.011130 *  
## black        -0.037041   0.018350  -2.019 0.043530 *  
## medv          0.075602   0.043273   1.747 0.080619 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 112.76  on 271  degrees of freedom
## AIC: 130.76
## 
## Number of Fisher Scoring iterations: 9

medv has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + age + rad + tax + 
##     ptratio + black, family = binomial(link = "logit"), data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82982  -0.19751   0.00000   0.00179   2.78606  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.676885   8.386879  -1.392  0.16384    
## chas          1.712481   0.873377   1.961  0.04991 *  
## nox          33.637060   6.843295   4.915 8.86e-07 ***
## age           0.021814   0.012220   1.785  0.07425 .  
## rad           0.834703   0.198278   4.210 2.56e-05 ***
## tax          -0.012630   0.004784  -2.640  0.00829 ** 
## ptratio       0.271708   0.133077   2.042  0.04118 *  
## black        -0.034609   0.018167  -1.905  0.05677 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 116.00  on 272  degrees of freedom
## AIC: 132
## 
## Number of Fisher Scoring iterations: 9

age has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax + ptratio + 
##     black, family = binomial(link = "logit"), data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.98315  -0.20136   0.00000   0.00136   2.67148  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -13.571432   8.192355  -1.657   0.0976 .  
## chas          1.872093   0.864814   2.165   0.0304 *  
## nox          36.633572   6.418908   5.707 1.15e-08 ***
## rad           0.791397   0.185304   4.271 1.95e-05 ***
## tax          -0.010182   0.004371  -2.330   0.0198 *  
## ptratio       0.264634   0.131003   2.020   0.0434 *  
## black        -0.030856   0.017957  -1.718   0.0857 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 119.31  on 273  degrees of freedom
## AIC: 133.31
## 
## Number of Fisher Scoring iterations: 9

black has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax + ptratio, 
##     family = binomial(link = "logit"), data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.05454  -0.21510   0.00002   0.00142   2.71055  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -25.596282   4.785049  -5.349 8.83e-08 ***
## chas          1.784314   0.872626   2.045   0.0409 *  
## nox          37.494616   6.563676   5.712 1.11e-08 ***
## rad           0.789544   0.181775   4.344 1.40e-05 ***
## tax          -0.008826   0.004194  -2.104   0.0353 *  
## ptratio       0.223146   0.122496   1.822   0.0685 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 124.04  on 274  degrees of freedom
## AIC: 136.04
## 
## Number of Fisher Scoring iterations: 9

ptratio has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + rad + tax, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11112  -0.26339   0.00008   0.00427   2.61870  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.060019   3.129636  -6.410 1.46e-10 ***
## chas          1.548497   0.847409   1.827   0.0677 .  
## nox          34.905414   6.073472   5.747 9.07e-09 ***
## rad           0.659414   0.157231   4.194 2.74e-05 ***
## tax          -0.006772   0.004011  -1.688   0.0913 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 127.40  on 275  degrees of freedom
## AIC: 137.4
## 
## Number of Fisher Scoring iterations: 8

tax has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train2$target ~ chas + nox + rad, family = binomial(link = "logit"), 
##     data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0092  -0.2997   0.0001   0.0041   2.6731  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -19.3184     2.9016  -6.658 2.78e-11 ***
## chas          1.6418     0.8208   2.000   0.0455 *  
## nox          30.4133     4.9546   6.138 8.34e-10 ***
## rad           0.5541     0.1388   3.991 6.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.11  on 279  degrees of freedom
## Residual deviance: 130.38  on 276  degrees of freedom
## AIC: 138.38
## 
## Number of Fisher Scoring iterations: 8

## (Intercept)        chas         nox         rad 
## -1.42247526  0.12089180  2.23943281  0.04079843

The variables that have an affect on the target variable are chas, nox and rad.

The marginal effects reflect the change in the probability the target equals 1 given a 1 unit change in the independent variable. The marginal effect is determined at the mean for each of the independent variables. A suburb that borders the Charles River (chas=1) results in a 12% increase in the probability that the target equals 1. A 1 unit increase in nox, nitrogen oxides concentration (parts per 10 million), results in a 224% increase in the likelihood that the target is 1. A 1 unit increase in rad, index of accessibility to radial highways, results in a 4% increase in the probability the target value will be 1.

Model 3 - Combining Correlated Variables by Adding their Values Together

In this model, the values of rad and tax will be added to create a new variable. The variables dis, nox and age will be added to create a new variable.

Creating a Test Set and Training Set

Backward Elimination - Logistic Regression Model 3

## 
## Call:
## glm(formula = train3$target ~ ., family = binomial(link = "logit"), 
##     data = train3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63344  -0.52233  -0.09146   0.38090   2.63168  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.336319   3.271190  -2.548   0.0108 *  
## zn           -0.053214   0.021734  -2.448   0.0143 *  
## indus        -0.036189   0.041428  -0.874   0.3824    
## chas          1.336898   0.722313   1.851   0.0642 .  
## rm            0.087923   0.497228   0.177   0.8596    
## ptratio      -0.051419   0.107807  -0.477   0.6334    
## lstat         0.105087   0.047289   2.222   0.0263 *  
## medv          0.087126   0.047039   1.852   0.0640 .  
## tax_rad2      0.009249   0.001782   5.189 2.12e-07 ***
## dis_nox_age2  0.029117   0.009971   2.920   0.0035 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 386.10  on 279  degrees of freedom
## Residual deviance: 192.82  on 270  degrees of freedom
## AIC: 212.82
## 
## Number of Fisher Scoring iterations: 6

rm has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train3$target ~ zn + indus + chas + ptratio + lstat + 
##     medv + tax_rad2 + dis_nox_age2, family = binomial(link = "logit"), 
##     data = train3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63863  -0.51982  -0.09245   0.38420   2.62811  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.023883   2.753298  -2.914  0.00357 ** 
## zn           -0.052502   0.021225  -2.474  0.01338 *  
## indus        -0.037224   0.041141  -0.905  0.36558    
## chas          1.338317   0.723024   1.851  0.06417 .  
## ptratio      -0.047474   0.105692  -0.449  0.65331    
## lstat         0.102568   0.045171   2.271  0.02317 *  
## medv          0.092516   0.036025   2.568  0.01023 *  
## tax_rad2      0.009342   0.001711   5.459 4.78e-08 ***
## dis_nox_age2  0.029750   0.009336   3.187  0.00144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 386.10  on 279  degrees of freedom
## Residual deviance: 192.85  on 271  degrees of freedom
## AIC: 210.85
## 
## Number of Fisher Scoring iterations: 6

ptratio has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train3$target ~ zn + indus + chas + lstat + medv + 
##     tax_rad2 + dis_nox_age2, family = binomial(link = "logit"), 
##     data = train3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.64329  -0.52454  -0.09421   0.38734   2.61057  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.071379   1.488631  -6.094 1.10e-09 ***
## zn           -0.049825   0.019912  -2.502  0.01234 *  
## indus        -0.036615   0.041012  -0.893  0.37196    
## chas          1.379819   0.710168   1.943  0.05202 .  
## lstat         0.105874   0.044343   2.388  0.01696 *  
## medv          0.098507   0.033385   2.951  0.00317 ** 
## tax_rad2      0.009234   0.001694   5.452 4.99e-08 ***
## dis_nox_age2  0.029944   0.009351   3.202  0.00136 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 386.10  on 279  degrees of freedom
## Residual deviance: 193.06  on 272  degrees of freedom
## AIC: 209.06
## 
## Number of Fisher Scoring iterations: 6

indus has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train3$target ~ zn + chas + lstat + medv + tax_rad2 + 
##     dis_nox_age2, family = binomial(link = "logit"), data = train3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7032  -0.5397  -0.1073   0.3970   2.5888  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.063304   1.490919  -6.079 1.21e-09 ***
## zn           -0.047013   0.019438  -2.419  0.01558 *  
## chas          1.328872   0.703479   1.889  0.05889 .  
## lstat         0.098289   0.043566   2.256  0.02406 *  
## medv          0.101983   0.033560   3.039  0.00238 ** 
## tax_rad2      0.008548   0.001465   5.836 5.33e-09 ***
## dis_nox_age2  0.028232   0.009106   3.100  0.00193 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 386.10  on 279  degrees of freedom
## Residual deviance: 193.88  on 273  degrees of freedom
## AIC: 207.88
## 
## Number of Fisher Scoring iterations: 6

chas has the (highest p value) lowest affect on the target and will be removed next.

## 
## Call:
## glm(formula = train3$target ~ zn + lstat + medv + tax_rad2 + 
##     dis_nox_age2, family = binomial(link = "logit"), data = train3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7319  -0.5424  -0.1063   0.3884   2.5598  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.098654   1.470158  -6.189 6.06e-10 ***
## zn           -0.047511   0.019602  -2.424 0.015360 *  
## lstat         0.101118   0.042463   2.381 0.017250 *  
## medv          0.104397   0.033064   3.157 0.001592 ** 
## tax_rad2      0.008275   0.001435   5.765 8.17e-09 ***
## dis_nox_age2  0.030296   0.008957   3.383 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 386.10  on 279  degrees of freedom
## Residual deviance: 197.82  on 274  degrees of freedom
## AIC: 209.82
## 
## Number of Fisher Scoring iterations: 6

##   (Intercept)            zn         lstat          medv      tax_rad2 
## -1.0055093598 -0.0052504772  0.0111747697  0.0115370695  0.0009144766 
##  dis_nox_age2 
##  0.0033481143

The variables that have an affect on the target variable are zn, lstat, medv and the combination of tax and rad.The marginal effects reflect the change in the probability the target equals 1 given a 1 unit change in the independent variable. The marginal effect is determined at the mean for each of the independent variables. A 1 unit increase in zn, proportion of residential land zoned for large lots (over 25000 square feet), results in a 0.5% decrease in the probability the target of 1. A 1 unit increase in lstat, lower status of the population (percent), results in a 1% increase in the probability that the target is 1. A 1 unit increase in medv, median value of owner-occupied homes in $1000s, results in 1% increase in the probability that the target is 1. A 1 unit increase in the combination of the variables tax and rad results in a 0.09% increase in the probability that the target is 1. A 1 unit increase in the combination of the variables dis, nox and age results in a 0.3% increase in the probability that the target is 1.

Using the Test Set To Make and Evaluate Predictions from Each of the 3 Models Built

Prediction from Model 1

## [1] 0.33

## [1] 0.9004522

The cutoff associated with a point the maximum distance from the ROC curve is 0.33. I will use 0.33 as the cutoff for making predictions. A value above 0.33 will be assigned a target of one and a value below 0.33 wil be assigned a value of zero. The area under the curve is 0.90.

Confusion Matrix

##     true
## pred  0  1
##    0 76  5
##    1 22 83

The model predicted 76 0’s that were actually 0. The model predicted 5 0’s that were actually 1.
The model predicted 22 1’s that were actaully 0. The model predicted 83 1’s that were actually 1.

##    accuracy     error precision sensitivity specificity        f1
## 1 0.8548387 0.1451613  0.424581   0.7755102   0.9431818 0.5487365

Prediction from Model 2:

## [1] 0.86

## [1] 0.9447347

The cutoff associated with a point the farthest distance from the ROC curve is 0.86. I will use 0.86 as the cutoff for making predictions. A value above 0.86 will be assigned a target of one and a value below 0.86 wil be assigned a value of zero. The area under the curve is 0.94.

Confusion Matrix

##     true
## pred  0  1
##    0 97 19
##    1  2 68

##    accuracy     error precision sensitivity specificity        f1
## 1 0.8870968 0.1129032 0.4511628    0.979798   0.7816092 0.6178344

The model predicted 97 0’s that were actually 0. The model predicted 19 0’s that were actually 1.
The model predicted 2 1’s that were actaully 0. The model predicted 68 1’s that were actually 1.

Prediction from Model 3:

## [1] 0.28

## [1] 0.9185207

The cutoff associated with a point the farthest distance from the ROC curve is 0.28. I will use 0.28 as the cutoff for making predictions. A value above 0.28 will be assigned a target of one and a value below 0.86 wil be assigned a value of zero. The area under the curve is 0.92.

Confusion Matrix

##     true
## pred  0  1
##    0 65  7
##    1 20 94

The model predicted 65 0’s that were actually 0. The model predicted 7 0’s that were actually 1.
The model predicted 20 1’s that were actaully 0. The model predicted 94 1’s that were actually 1.

##    accuracy     error precision sensitivity specificity        f1
## 1 0.8548387 0.1451613 0.4140127   0.7647059   0.9306931 0.5371901

SELECT MODEL

When comparing the three models, the accuracy is greatest for the 2nd model. The precision is also highest for the second model. (The precision is the ratio of correct predictions of zero to total predictions of zero.) The sensitivity is highest for model 2. (The sensitivity is the ratio of the correct predictions of zero to all cases in which the target is zero.) The specificity is the greatest for the first model. (The specificity is the ratio of the correct predictions of 1 to the all cases in which the target is one.) The F1 score is highest for the second model. (The F1 score is equal to 2xPrecisionxSensitivity/(Precision+Sensitivity) and gives a balance between the precision and sensitivity.) The area under the roc curve is the farthest from 0.5 for model 2. (The farther the area is from 0.5, the better the model.)

I will use model 2 to make a prediction for the test data because it has a higher accuracy, precision, sensitivity, F1 score and area under the ROC curve than the other models.

The predictions for the evaluation set are below:

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  0  0  0  0  0  0  0  0  0  0 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
##  0  0  1  1  1  1  1  1  1  1  1  1  1  0  0

APPENDIX

DATA EXPLORATION

crime <- read.csv(“https://raw.githubusercontent.com/swigodsky/Data621/master/crime-training-data.csv”, stringsAsFactors = FALSE) head(crime) nrow(crime)

crime_target_removed <- crime[-14] summary(crime_target_removed)

plot(crime$zn, ylab="Proportion of Residential Land Zoned For Large Lots", main="Residential Land Zoned For Large Lots") hist(crime$zn, xlab=“Proportion of Residential Land Zoned For Large Lots”,main=“Residential Land Zoned For Large Lots”) boxplot(crime$zn, main=“Residential Land Zoned For Large Lots”)

plot(crime$rad, ylab="Index of Accessibility to Radial Highways", main="Accessibility to Radial Highways") hist(crime$rad, xlab=“Index of Accessibility to Radial Highways”,main=“Accessibility to Radial Highways”) boxplot(crime$rad, main=“Index of Accessibility to Radial Highways”)

plot(crime$tax, ylab=“Full-Value Property-Tax Rate per $10,000", main="Full-Value Property-Tax Rate") hist(crime$tax, xlab=”Full-Value Property-Tax Rate per $10,000",main="Full-Value Property-Tax Rate") boxplot(crime$tax, main=“Full-Value Property-Tax Rate”)

plot(crime$black, ylab="1000(num black people - 0.63)^2", main="Proportion of Black People by Town") hist(crime$black, xlab=“1000(num black people - 0.63)^2”,main=“Proportion of Black People by Town”) boxplot(crime$black, main=“Proportion of Black People by Town”)

library(dplyr) library(tidyr) library(ggplot2)

boxplot(crime_target_removed, las=2,horizontal=TRUE)#, ylim=c(0,3000))

DATA PREPARATION

correlation <- cor(crime_target_removed, method = “pearson”) correlation pairs(crime_target_removed)

library(factoextra) crime_df <- crime

tax_mean <- mean(crime$tax) rad_mean <- mean(crime$rad) crime_df$tax <- crime_df$tax-tax_mean crime_df$rad <- crime_df$rad-rad_mean tax_rad <- data.frame(list(rad=crime_df$rad, tax=crime_df$tax), stringsAsFactors = FALSE) pca <- prcomp(tax_rad, scale = FALSE) eigen <- get_eigenvalue(pca) tax_rad <- as.matrix(tax_rad) tax_rad <- t(tax_rad) tax_rad_pca <- pca$rotation[,1] %*% tax_rad tax_rad_pca <- t(tax_rad_pca) crime_df <- cbind(crime_df, tax_rad_pca)

dis_mean <- mean(crime$dis) nox_mean <- mean(crime$nox) age_mean <- mean(crime$age) crime_df$dis <- crime_df$dis-dis_mean crime_df$nox <- crime_df$nox-nox_mean crime_df$age <- crime_df$age-age_mean dis_nox_age <- data.frame(list(dis=crime_df$dis, nox=crime_df$nox, age=crime_df$age), stringsAsFactors = FALSE) pca2 <- prcomp(dis_nox_age, scale = FALSE) eigen2 <- get_eigenvalue(pca2) dis_nox_age <- as.matrix(dis_nox_age) dis_nox_age <- t(dis_nox_age) dis_nox_age_pca <- pca2$rotation[,1] %*% dis_nox_age dis_nox_age_pca <- t(dis_nox_age_pca) crime_df <- cbind(crime_df, dis_nox_age_pca)

crime_df <-subset(crime_df, select=-c(tax,rad,dis,nox,age, black)) ##BUILD MODELS set.seed(13) n <- nrow(crime_df) shuffle_df <- crime_df[sample(n),] train_indeces <- 1:round(0.6n) train <- shuffle_df[train_indeces,] test_indeces <- (round(.6n)+1):n test <- shuffle_df[test_indeces,]

logit1 <- glm(train$target ~ ., data=train, family=binomial (link=“logit”)) summary(logit1)

logit1 <- update(logit1, .~. -rm, data = train, family=binomial (link=“logit”)) summary(logit1)

logit1 <- update(logit1, .~. -chas, data = train, family=binomial (link=“logit”)) summary(logit1)

logit1 <- update(logit1, .~. -ptratio, data = train, family=binomial (link=“logit”)) summary(logit1)

logit1 <- update(logit1, .~. -indus, data = train, family=binomial (link=“logit”)) summary(logit1)

logit1 <- update(logit1, .~. -lstat, data = train, family=binomial (link=“logit”)) summary(logit1) logitscalar1 <- mean(dlogis(predict(logit1,type=“link”))) logitscalar1*coef(logit1)

set.seed(10) n <- nrow(crime) shuffle_df2 <- crime[sample(n),] train_indeces <- 1:round(0.6n) train2 <- shuffle_df2[train_indeces,] test_indeces <- (round(.6n)+1):n test2 <- shuffle_df2[test_indeces,]

logit2 <- glm(train2$target ~ ., data=train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -rm, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -zn, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -lstat, data = train2, family=binomial (link=“logit”))

logit2 <- update(logit2, .~. -indus, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -dis, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -medv, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -age, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -black, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -ptratio, data = train2, family=binomial (link=“logit”)) summary(logit2)

logit2 <- update(logit2, .~. -tax, data = train2, family=binomial (link=“logit”)) summary(logit2)

crime_df3 <- crime

tax_rad2 <- crime_df3$tax + crime_df3$rad crime_df3 <- cbind(crime_df3, tax_rad2)

dis_nox_age2 <- crime_df3$dis + crime_df3$nox + crime_df3$age crime_df3 <- cbind(crime_df3, dis_nox_age2)

crime_df3 <-subset(crime_df3, select=-c(tax,rad,dis,nox,age, black))

set.seed(15) n <- nrow(crime_df3) shuffle_df3 <- crime_df3[sample(n),] train_indeces <- 1:round(0.6n) train3 <- shuffle_df3[train_indeces,] test_indeces <- (round(.6n)+1):n test3 <- shuffle_df3[test_indeces,]

logit3 <- glm(train3$target ~ ., data=train3, family=binomial (link=“logit”)) summary(logit3)

logit3 <- update(logit3, .~. -rm, data = train3, family=binomial (link=“logit”)) summary(logit3)

logit3 <- update(logit3, .~. -ptratio, data = train3, family=binomial (link=“logit”)) summary(logit3)

logit3 <- update(logit3, .~. -indus, data = train3, family=binomial (link=“logit”)) summary(logit3)

logit3 <- update(logit3, .~. -chas, data = train3, family=binomial (link=“logit”)) summary(logit3)

acc <- function(pred, test){ totalnum <- length(pred) numRight <- length(which(pred==test$target)) accuracy <- numRight/totalnum return(accuracy) }

err <- function(pred, test){ totalnum <- length(pred) numWrong <- length(which(pred!=test$target)) error <- numWrong/totalnum return(error) }

prec <- function(pred, test){ true_pos <- length(which((pred==0)&(test$target==0))) all_pos <- length(which(test$target==0)) + length(which(pred==0)) precision <- true_pos/all_pos return(precision) }

sens <- function(pred, test){ true_pos <- length(which((pred==0)&(test$target==0))) false_neg <- length(which((pred==1)&(test$target==0))) sensitivity <- true_pos/(true_pos+false_neg) return(sensitivity) }

spec <- function(pred, test){ true_neg <- length(which((pred==1)&(test$target==1))) false_pos <- length(which((pred==0)&(test$target==1))) sensitivity <- true_neg/(true_neg+false_pos) return(sensitivity) }

f1 <- function(pred, test){ true_pos <- length(which((pred==0)&(test$target==0))) all_pos <- length(which(test$target==0)) + length(which(pred==0)) precision <- true_pos/all_pos

false_neg <- length(which((pred==1)&(test$target==0))) sensitivity <- true_pos/(true_pos+false_neg)

f1 <- 2precisionsensitivity/(precision+sensitivity) return(f1) } #distance function taken from https://stackoverflow.com/questions/35194048/using-r-how-to-calculate-the-distance-from-one-point-to-a-line dist2d <- function(a,b,c) { v1 <- b - c v2 <- a - b m <- cbind(v1,v2) d2 <- abs(det(m))/sqrt(sum(v1*v1)) return(d2) }

roc <- function(pred,test){ d <- 0 #set distance between point and line y=x to 0 roc_tester <- data.frame(o_m_specificity=NA, sensitivity=NA)[numeric(0), ] auc=0 for (cutoff in seq(0,1.0,0.01)){ test_df <- pred #make a copy of the predictions #set scored (predicted) values in test_df according to whether the probability is above or below the cut-off threshold test_df[test_df < cutoff] <- 0 test_df[test_df >= cutoff] <- 1

spec_val <- spec(test_df, test)
sens_val <- sens(test_df, test)
  
roc_tester <- rbind(roc_tester, list(o_m_specificity=1-spec_val,sensitivity= sens_val))

#calculating Euclidean distance between point and y=x a2 <- c(1-spec_val,sens_val) b2 <- c(0,0) c2 <- c(1,1) d2 <- dist2d(a2,b2,c2) # distance of point a from line (b,c) if (d2>d){ d <- d2 cut_off_val <- cutoff }

#calculating area of trapezoid for each set of data points  
if (cutoff>=0.1){
  num_values = nrow(roc_tester)
  base2 = roc_tester$sensitivity[num_values]
  base1 = roc_tester$sensitivity[num_values-1]
  height2 = roc_tester$o_m_specificity[num_values]
  height1 = roc_tester$o_m_specificity[num_values-1]
  area = .5*(base1+base2)*(height2-height1)
  auc = auc + area
}

}

roc_plot <- ggplot(roc_tester, aes(x = o_m_specificity, y = sensitivity)) + geom_point() + geom_abline(slope=1) + labs(x="False Positive Rate (1-specificity)", y="True Positive Rate (sensitivity)", title="ROC Curve" )

return(list(roc_plot=roc_plot, auc_val=auc, cut_off_val=cut_off_val)) }

pred_logit1 <- predict(logit1, newdata=test, type=“response”)

roc_vals <- roc(pred_logit1,test) roc_vals$roc_plot print(roc_vals$cut_off_val) print(roc_vals$auc_val)

pred_logit1[pred_logit1>=0.33] <- 1 pred_logit1[pred_logit1<0.33] <- 0 table(pred=pred_logit1, true=test$target)

accuracy1 <- acc(round(pred_logit1), test) error1 <- err(round(pred_logit1), test) precision1 <- prec(round(pred_logit1), test) sensitivity1 <- sens(round(pred_logit1), test) specificity1 <- spec(round(pred_logit1), test) f11 <- f1(round(pred_logit1), test)

stat1 <- data.frame(list(accuracy=accuracy1, error=error1, precision=precision1, sensitivity=sensitivity1, specificity=specificity1, f1=f11)) print(stat1)

pred_logit2 <- predict(logit2, newdata=test2, type=“response”) roc_vals2 <- roc(pred_logit2,test2) roc_vals2$roc_plot print(roc_vals2$cut_off_val) print(roc_vals2$auc_val)

pred_logit2[pred_logit2>=0.86] <- 1 pred_logit2[pred_logit2<0.86] <- 0

table(pred=round(pred_logit2), true=test2$target)

accuracy2 <- acc(round(pred_logit2), test2) error2 <- err(round(pred_logit2), test2) precision2 <- prec(round(pred_logit2), test2) sensitivity2 <- sens(round(pred_logit2), test2) specificity2 <- spec(round(pred_logit2), test2) f12 <- f1(round(pred_logit2), test2)

stat2 <- data.frame(list(accuracy=accuracy2, error=error2, precision=precision2, sensitivity=sensitivity2, specificity=specificity2, f1=f12)) print(stat2)

pred_logit3 <- predict(logit3, newdata=test3, type=“response”) roc_vals3 <- roc(pred_logit3,test3) roc_vals3$roc_plot print(roc_vals3$cut_off_val) print(roc_vals3$auc_val)

pred_logit3[pred_logit3>=0.28] <- 1 pred_logit3[pred_logit3<0.28] <- 0

table(pred=round(pred_logit3), true=test3$target)

accuracy3 <- acc(round(pred_logit3), test3) error3 <- err(round(pred_logit3), test3) precision3 <- prec(round(pred_logit3), test3) sensitivity3 <- sens(round(pred_logit3), test3) specificity3 <- spec(round(pred_logit3), test3) f13 <- f1(round(pred_logit3), test3)

stat3 <- data.frame(list(accuracy=accuracy3, error=error3, precision=precision3, sensitivity=sensitivity3, specificity=specificity3, f1=f13)) print(stat3) ##SELECT MODEL eval_data <- read.csv(‘https://raw.githubusercontent.com/swigodsky/Data621/master/crime-evaluation-data.csv’)

pred_eval <- predict(logit2, newdata=eval_data, type=“response”) pred_eval[pred_eval>=0.86] <- 1 pred_eval[pred_eval<0.86] <- 0 pred_eval

DATA 621 HW3

Sarah Wigodsky

October 23, 2018

DATA EXPLORATION

zn - Proportion of Residential Land Zoned For Large Lots

rad - Index of Accessibility to Radial Highways

tax: Full-Value Property-Tax Rate per $10,000

black: 1000(Bk - 0.63)^2 where Bk is the Proportion of Black People by Town

DATA PREPARATION

Correlation of Variables

Build Models

Breaking the data set into a training set and a testing set

Backward Elimination - Logistic Regression Model 1 - Based on combining 2 sets of variables by PCA

Model 2 - Building a Model Without Combining Correlated Variables

Backward Elimination - Logistic Regression Model 2

Model 3 - Combining Correlated Variables by Adding their Values Together

Creating a Test Set and Training Set

Backward Elimination - Logistic Regression Model 3

Using the Test Set To Make and Evaluate Predictions from Each of the 3 Models Built

Prediction from Model 1

Confusion Matrix

Prediction from Model 2:

Confusion Matrix

Prediction from Model 3:

Confusion Matrix

SELECT MODEL

APPENDIX

DATA EXPLORATION

DATA PREPARATION