Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the neighborhood's crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You may use only the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Write Up:

1. Data Exploration (25 Points)

This data set is a collection of independent variables gathered to evaluate crime within Boston neighborhoods. According to Wikipedia, Boston is a city with a population of 665,258 people, a crime rate of 706.8 violent crimes per 100,000 people, and 2316.1 property crimes per 100,000 people. (Reference: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate). Within the Boston metropolitan area, different neighborhoods have differing levels of safety.

Ultimately, we would like to develop a model that predicts whether a neighborhood is above or below the median crime rate for the area. Before doing that, however, we should explore the data set given to us.

Below is the training dataset given to us with the different independent variables as listed above.

zn indus chas nox rm age dis rad tax ptratio lstat medv target
0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0
0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 7.67 26.5 0

How many rows are there in this dataset?

## [1] 466

A good way to start looking at the data is with basic summary statistics, including the mean, median, standard deviation, and so on.

Summary Statistic:

vars n mean sd median trimmed mad min max range skew kurtosis se
zn 1 466 11.5772532 23.3646511 0.00000 5.3542781 0.0000000 0.0000 100.0000 100.0000 2.1768152 3.8135765 1.0823466
indus 2 466 11.1050215 6.8458549 9.69000 10.9082353 9.3403800 0.4600 27.7400 27.2800 0.2885450 -1.2432132 0.3171281
chas 3 466 0.0708155 0.2567920 0.00000 0.0000000 0.0000000 0.0000 1.0000 1.0000 3.3354899 9.1451313 0.0118957
nox 4 466 0.5543105 0.1166667 0.53800 0.5442684 0.1334340 0.3890 0.8710 0.4820 0.7463281 -0.0357736 0.0054045
rm 5 466 6.2906738 0.7048513 6.21000 6.2570615 0.5166861 3.8630 8.7800 4.9170 0.4793202 1.5424378 0.0326516
age 6 466 68.3675966 28.3213784 77.15000 70.9553476 30.0226500 2.9000 100.0000 97.1000 -0.5777075 -1.0098814 1.3119625
dis 7 466 3.7956929 2.1069496 3.19095 3.5443647 1.9144814 1.1296 12.1265 10.9969 0.9988926 0.4719679 0.0976026
rad 8 466 9.5300429 8.6859272 5.00000 8.6978610 1.4826000 1.0000 24.0000 23.0000 1.0102788 -0.8619110 0.4023678
tax 9 466 409.5021459 167.9000887 334.50000 401.5080214 104.5233000 187.0000 711.0000 524.0000 0.6593136 -1.1480456 7.7778214
ptratio 10 466 18.3984979 2.1968447 18.90000 18.5970588 1.9273800 12.6000 22.0000 9.4000 -0.7542681 -0.4003627 0.1017669
lstat 11 466 12.6314592 7.1018907 11.35000 11.8809626 7.0720020 1.7300 37.9700 36.2400 0.9055864 0.5033688 0.3289887
medv 12 466 22.5892704 9.2396814 21.20000 21.6304813 6.0045300 5.0000 50.0000 45.0000 1.0766920 1.3737825 0.4280200
target 13 466 0.4914163 0.5004636 0.00000 0.4893048 0.0000000 0.0000 1.0000 1.0000 0.0342293 -2.0031131 0.0231835

Data visualization with boxplots showing the median, quartiles, and outliers:

We will also take a look at the log scale as this will be used later for data transformation.

Histograms are great for revealing skewness or departures from a normal distribution:

Histogram:

As we can see, some of the independent variables appear roughly normally distributed, while others are skewed. The skewed variables will likely need to be transformed before they can be used effectively for prediction.

Another assumption we would like to check is whether the independent variables have some form of linear relationship with the response variable. The best way to examine this is with scatter plots:

Scatterplot (to target):

As you can see, though, scatter plots against a binary response variable are not the most intuitive way to visualize the data. We will leave this for now and move on to examine the correlations among the independent variables and between the independent variables and the dependent variable.

Visualization of the Correlations:

zn indus chas nox rm age dis rad tax ptratio lstat medv target
zn 1.0000000 -0.5382664 -0.0401620 -0.5170452 0.3198141 -0.5725805 0.6601243 -0.3154812 -0.3192841 -0.3910357 -0.4329925 0.3767171 -0.4316818
indus -0.5382664 1.0000000 0.0611832 0.7596301 -0.3927118 0.6395818 -0.7036189 0.6006284 0.7322292 0.3946898 0.6071102 -0.4961743 0.6048507
chas -0.0401620 0.0611832 1.0000000 0.0974558 0.0905098 0.0788837 -0.0965771 -0.0159004 -0.0467648 -0.1286606 -0.0514232 0.1615653 0.0800419
nox -0.5170452 0.7596301 0.0974558 1.0000000 -0.2954897 0.7351278 -0.7688840 0.5958298 0.6538780 0.1762687 0.5962426 -0.4301227 0.7261062
rm 0.3198141 -0.3927118 0.0905098 -0.2954897 1.0000000 -0.2328125 0.1990158 -0.2084457 -0.2969343 -0.3603471 -0.6320245 0.7053368 -0.1525533
age -0.5725805 0.6395818 0.0788837 0.7351278 -0.2328125 1.0000000 -0.7508976 0.4603143 0.5121245 0.2554479 0.6056200 -0.3781560 0.6301062
dis 0.6601243 -0.7036189 -0.0965771 -0.7688840 0.1990158 -0.7508976 1.0000000 -0.4949919 -0.5342546 -0.2333394 -0.5075280 0.2566948 -0.6186731
rad -0.3154812 0.6006284 -0.0159004 0.5958298 -0.2084457 0.4603143 -0.4949919 1.0000000 0.9064632 0.4714516 0.5031013 -0.3976683 0.6281049
tax -0.3192841 0.7322292 -0.0467648 0.6538780 -0.2969343 0.5121245 -0.5342546 0.9064632 1.0000000 0.4744223 0.5641886 -0.4900329 0.6111133
ptratio -0.3910357 0.3946898 -0.1286606 0.1762687 -0.3603471 0.2554479 -0.2333394 0.4714516 0.4744223 1.0000000 0.3773560 -0.5159153 0.2508489
lstat -0.4329925 0.6071102 -0.0514232 0.5962426 -0.6320245 0.6056200 -0.5075280 0.5031013 0.5641886 0.3773560 1.0000000 -0.7358008 0.4691270
medv 0.3767171 -0.4961743 0.1615653 -0.4301227 0.7053368 -0.3781560 0.2566948 -0.3976683 -0.4900329 -0.5159153 -0.7358008 1.0000000 -0.2705507
target -0.4316818 0.6048507 0.0800419 0.7261062 -0.1525533 0.6301062 -0.6186731 0.6281049 0.6111133 0.2508489 0.4691270 -0.2705507 1.0000000

We can certainly see that different variables relate to each other differently, and some correlate more strongly than others. If we look at the target column, we can see how the independent variables correlate with the response variable.

Let's take a closer look at the target response variable.

What are the Positive and Negative Correlation to the target variable?

Of note, I have arbitrarily chosen any correlation value greater than 0.2 to indicate a positive correlation, any value less than -0.2 a negative correlation, and any value between -0.2 and 0.2 a neutral one.

## [1] "Correlation to Target"
##              target
## zn      -0.43168176
## indus    0.60485074
## chas     0.08004187
## nox      0.72610622
## rm      -0.15255334
## age      0.63010625
## dis     -0.61867312
## rad      0.62810492
## tax      0.61111331
## ptratio  0.25084892
## lstat    0.46912702
## medv    -0.27055071
## target   1.00000000
## [1] "Positive Correlative Factors:"
## [1] "indus"   "nox"     "age"     "rad"     "tax"     "ptratio" "lstat"  
## [8] "target"
## [1] "Negative Correlative Factors:"
## [1] "zn"   "dis"  "medv"
## [1] "Neutral Correlative Factors"
## [1] "chas" "rm"

There are many interesting correlations within this data set. The underlying theme, though, seems to be that the higher the socioeconomic status of an area, the less likely it is to have an above-median crime rate.

For instance, industrial areas tend to be less desirable to the wealthy, given concerns about chemical exposure, noise, and so on. In this data set, industrial land share is correlated with nitrogen oxide concentration. Homes with many surrounding radial highways also tend to be in more urban, inner-city areas. Poorer areas also tend to have less funding for public institutions; a higher student-to-teacher ratio can therefore serve as a proxy for underfunded schools, and here the pupil-teacher ratio correlates positively with crime.

Not surprisingly, the more expensive the homes, the less likely the area is to have above-median crime. However, it is unclear at this point why age and tax correlate positively with the target. We will need to keep this in mind as we create our models.

On a side note, proximity to the Charles River and the average number of rooms per dwelling both appear to have only a neutral correlation with the target.

Let's take a look at the variables most highly correlated with the target. I have arbitrarily set values greater than 0.7 or less than -0.7 as highly correlated (whether positive or negative).

## [1] "Highly Positive Correlated Variables"
##        target
## nox 0.7261062
## [1] "Highly Negative Correlated Variables"
## [1] target
## <0 rows> (or 0-length row.names)

It appears that nitrogen oxide concentration is the most highly correlated variable. This makes sense given the discussion above.

Before we move onto Data Preparation, we should ensure that there are no missing values that would require imputation.

Are there any missing values?

## [1] 0

2. Data Preparation (25 Points)

In preparing the data, our goal is to create or transform independent variables that could potentially improve our models. We may bin some variables into buckets or apply transformations (e.g., logarithms) to improve predictive ability. While we may not use all of these, they are worth considering and looking into.

Let’s start by taking some of the independent variables and dividing them into buckets.

According to https://nces.ed.gov/programs/coe/indicator_clr.asp, we can take a look at typical classroom sizes. As discussed before, we assume that poorer, lower-socioeconomic neighborhoods tend to have higher crime. Of course, the student-to-teacher ratio itself is not what is responsible for a neighborhood's crime rate; the underlying notion is that an underfunded school (and hence, in theory, a higher student-to-teacher ratio) suggests the neighborhood may simply have fewer funds (i.e., be poorer). While we may lose some information by converting a continuous variable into a categorical one (such as a bucket), it is still worthwhile to examine and see whether we ultimately use this variable in the end.

Let’s start to bucket the student to teacher ratio.

##    
##     Small Medium Large
##   0    35    128    74
##   1    40     29   160
## [1] "Percentage where Crime is above the Median (1)"

##   stud_teach_ratio freq
## 1            Small   75
## 2           Medium  157
## 3            Large  234

This information isn't all that revealing on its own, but we will investigate it further in the model-building section.

The next independent variable we can evaluate is the size of the home. Though we saw above that it may have little correlation with the target, it may still be worthwhile to bucket home sizes and incorporate them into our model building.

Like the student-to-teacher ratio, home size could in theory serve as a proxy for wealth: the larger the home, the wealthier the owner is likely to be.

Again, while these are not great estimates, we will create the buckets regardless.

##    
##     Small Medium Large
##   0     4    207    26
##   1    37    167    25
## [1] "Percentage where Crime is above the Median (1)"

While we may ultimately not need the buckets, we will still take a look at them when we are evaluating the models.

Transforming the predictor variables:

Let’s look at zn: proportion of residential land zoned for large lots

This may benefit from a logarithmic transformation.

hist(log(crime_train$zn), breaks=20, main="Percentage of Logarithmic Land Noted as 'Residential'", xlab="Zoning Size", col = "blue")

While not perfect, this is better.

Let's apply the same logarithmic transformation idea to the rest of the independent variables. Several of the predictors are skewed and would likely benefit from it. Given the numerous zeros in zn, and the fact that chas is binary, we will not include these two in the logarithmic transformations.

Let's create histograms and scatterplots of these newly log-transformed variables.

Lastly, let's create some interaction terms.

Let's see which independent variables are most highly correlated with each other (apart from their correlations with the target variable).

## [1] "Maximum Positive Correlations amongst independent variables: "
##        zn     indus      chas       nox        rm       age       dis 
## 0.6601243 0.7596301 0.1615653 0.7596301 0.7053368 0.7351278 0.6601243 
##       rad       tax   ptratio     lstat      medv 
## 0.9064632 0.9064632 0.4744223 0.6071102 0.7053368
## [1] "Maximum Negative Correlations amongst independent variables: "
##         zn      indus       chas        nox         rm        age 
## -0.5725805 -0.7036189 -0.1286606 -0.7688840 -0.6320245 -0.7508976 
##        dis        rad        tax    ptratio      lstat       medv 
## -0.7688840 -0.4949919 -0.5342546 -0.5159153 -0.7358008 -0.7358008

For each variable, find its most strongly correlated partner:

var1 var2 max_cor
zn dis 0.6601243
indus nox 0.7596301
chas medv 0.1615653
nox indus 0.7596301
rm medv 0.7053368
age nox 0.7351278
dis zn 0.6601243
rad tax 0.9064632
tax rad 0.9064632
ptratio tax 0.4744223
lstat indus 0.6071102
medv rm 0.7053368

It looks like the largest correlation between two independent variables is between tax and rad. We can build an interaction term for this pair when we build our models, since the two are likely highly dependent on each other.
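A minimal sketch of how the pairwise maxima shown above could have been computed, assuming the cormatrix object created in the appendix:

# Drop the target row/column and blank out the diagonal (self-correlation of 1)
pred_cor <- cormatrix[rownames(cormatrix) != "target", colnames(cormatrix) != "target"]
diag(pred_cor) <- NA

print("Maximum Positive Correlations amongst independent variables: ")
apply(pred_cor, 1, max, na.rm = TRUE)
print("Maximum Negative Correlations amongst independent variables: ")
apply(pred_cor, 1, min, na.rm = TRUE)

# For each predictor, the partner it is most strongly (positively) correlated with
data.frame(var1    = rownames(pred_cor),
           var2    = colnames(pred_cor)[apply(pred_cor, 1, which.max)],
           max_cor = apply(pred_cor, 1, max, na.rm = TRUE))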

3. Build Models (25 Points)

In this section, we will build several models to see whether a certain set (or subset) of independent variables can adequately predict whether an area will be above or below the median crime rate.

The first model that we will entertain is the binary logistic model utilizing all of the predictor variables (untransformed). Let’s take a look at the first model:
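For reference, the fit behind the summary below could be produced as in this sketch; model1 is a hypothetical object name that later sketches reuse:

# Model 1: binary logistic regression on all untransformed predictors
model1 <- glm(target ~ ., family = binomial, data = crime_train)
summary(model1)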

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8464  -0.1445  -0.0017   0.0029   3.4665  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
## zn           -0.065946   0.034656  -1.903  0.05706 .  
## indus        -0.064614   0.047622  -1.357  0.17485    
## chas          0.910765   0.755546   1.205  0.22803    
## nox          49.122297   7.931706   6.193 5.90e-10 ***
## rm           -0.587488   0.722847  -0.813  0.41637    
## age           0.034189   0.013814   2.475  0.01333 *  
## dis           0.738660   0.230275   3.208  0.00134 ** 
## rad           0.666366   0.163152   4.084 4.42e-05 ***
## tax          -0.006171   0.002955  -2.089  0.03674 *  
## ptratio       0.402566   0.126627   3.179  0.00148 ** 
## lstat         0.045869   0.054049   0.849  0.39608    
## medv          0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9

The residual deviance is 192.05 and the AIC is 218.05. We will treat this as the baseline for all models and create other models that might improve on the fit while minimizing complexity.

The second model is essentially the first: it includes all of the independent variables, but adds the interaction term rad:tax.

## 
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8793  -0.1343  -0.0012   0.0379   3.5325  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.194e+01  6.775e+00  -6.191 5.97e-10 ***
## zn          -7.049e-02  3.475e-02  -2.029 0.042494 *  
## indus       -5.376e-02  4.856e-02  -1.107 0.268288    
## chas         7.575e-01  7.786e-01   0.973 0.330605    
## nox          4.872e+01  7.922e+00   6.150 7.74e-10 ***
## rm          -6.517e-01  7.343e-01  -0.888 0.374762    
## age          3.501e-02  1.395e-02   2.511 0.012050 *  
## dis          7.168e-01  2.305e-01   3.110 0.001871 ** 
## rad          1.035e+00  3.088e-01   3.351 0.000806 ***
## tax         -2.626e-03  3.238e-03  -0.811 0.417249    
## ptratio      4.096e-01  1.271e-01   3.223 0.001270 ** 
## lstat        4.812e-02  5.451e-02   0.883 0.377391    
## medv         1.868e-01  6.960e-02   2.684 0.007281 ** 
## rad:tax     -9.755e-04  5.384e-04  -1.812 0.069988 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 190.24  on 452  degrees of freedom
## AIC: 218.24
## 
## Number of Fisher Scoring iterations: 9

As we can see, the AIC remained essentially the same at 218.24, with a modest improvement in goodness of fit (residual deviance 190.24).

With the third model, we will take a different approach altogether and use the logarithmically transformed independent variables from Section 2. The hope is that normalizing the distributions of the predictors will improve the model's performance. While some of the variables in Section 2 did become more normal after transformation, a good number did not, so we should keep this in mind as we build the log-transformed predictor model.
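A corresponding sketch for the log-transformed fit (model3_log is a hypothetical name; crime_train_log is constructed in the appendix):

# Model 3: binary logistic regression on the log-transformed predictors
model3_log <- glm(target ~ ., family = binomial, data = crime_train_log)
summary(model3_log)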

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = crime_train_log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90738  -0.16285  -0.00161   0.08746   3.13907  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -15.96595   11.72606  -1.362  0.17333    
## zn           -0.03216    0.02687  -1.197  0.23145    
## chas          0.86893    0.73997   1.174  0.24028    
## log_indus     0.24931    0.52151   0.478  0.63262    
## log_nox      24.13637    3.91814   6.160 7.27e-10 ***
## log_rm        2.11032    3.70451   0.570  0.56891    
## log_age       0.85132    0.63686   1.337  0.18131    
## log_dis       2.69643    0.84395   3.195  0.00140 ** 
## log_rad       2.98142    0.68476   4.354 1.34e-05 ***
## log_tax      -1.51088    1.12043  -1.348  0.17750    
## log_ptratio   5.52739    2.12059   2.607  0.00915 ** 
## log_lstat     0.17708    0.66971   0.264  0.79146    
## log_medv      2.32688    1.48654   1.565  0.11751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 206.10  on 453  degrees of freedom
## AIC: 232.1
## 
## Number of Fisher Scoring iterations: 8

It appears that the logarithmic model may not have been the optimal choice: both the residual deviance and the AIC have worsened, though not terribly so. We will hold onto this model, but I suspect it ultimately will not be the one we want.

Before we continue creating additional models, we should look at the models built so far from a different angle. We can use a stepwise approach, building a model iteratively in either the "forward" or "backward" direction, and see whether this improves on the models.
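A minimal sketch of the stepwise calls behind the traces below, reusing the hypothetical model1 and model3_log objects from the earlier sketches:

# Backward elimination from Model 1 and from the log-transformed Model 3
model1_back <- step(model1, direction = "backward")
model3_back <- step(model3_log, direction = "backward")

# Forward selection from the intercept-only model, with all predictors as the upper scope
null_model <- glm(target ~ 1, family = binomial, data = crime_train)
model1_fwd <- step(null_model, direction = "forward",
                   scope = ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
                             ptratio + lstat + medv)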

Model Number 1 with “Backwards” Direction

## Start:  AIC=218.05
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + lstat + medv
## 
##           Df Deviance    AIC
## - rm       1   192.71 216.71
## - lstat    1   192.77 216.77
## - chas     1   193.53 217.53
## - indus    1   193.99 217.99
## <none>         192.05 218.05
## - tax      1   196.59 220.59
## - zn       1   196.89 220.89
## - age      1   198.73 222.73
## - medv     1   199.95 223.95
## - ptratio  1   203.32 227.32
## - dis      1   203.84 227.84
## - rad      1   233.74 257.74
## - nox      1   265.05 289.05
## 
## Step:  AIC=216.71
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio + 
##     lstat + medv
## 
##           Df Deviance    AIC
## - chas     1   194.24 216.24
## - lstat    1   194.32 216.32
## - indus    1   194.58 216.58
## <none>         192.71 216.71
## - tax      1   197.59 219.59
## - zn       1   198.07 220.07
## - age      1   199.11 221.11
## - ptratio  1   203.53 225.53
## - dis      1   203.85 225.85
## - medv     1   205.35 227.35
## - rad      1   233.81 255.81
## - nox      1   265.14 287.14
## 
## Step:  AIC=216.24
## target ~ zn + indus + nox + age + dis + rad + tax + ptratio + 
##     lstat + medv
## 
##           Df Deviance    AIC
## - indus    1   195.51 215.51
## <none>         194.24 216.24
## - lstat    1   196.33 216.33
## - zn       1   200.59 220.59
## - tax      1   200.75 220.75
## - age      1   201.00 221.00
## - ptratio  1   203.94 223.94
## - dis      1   204.83 224.83
## - medv     1   207.12 227.12
## - rad      1   241.41 261.41
## - nox      1   265.19 285.19
## 
## Step:  AIC=215.51
## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat + 
##     medv
## 
##           Df Deviance    AIC
## - lstat    1   197.32 215.32
## <none>         195.51 215.51
## - zn       1   202.05 220.05
## - age      1   202.23 220.23
## - ptratio  1   205.01 223.01
## - dis      1   205.96 223.96
## - tax      1   206.60 224.60
## - medv     1   208.13 226.13
## - rad      1   249.55 267.55
## - nox      1   270.59 288.59
## 
## Step:  AIC=215.32
## target ~ zn + nox + age + dis + rad + tax + ptratio + medv
## 
##           Df Deviance    AIC
## <none>         197.32 215.32
## - zn       1   203.45 219.45
## - ptratio  1   206.27 222.27
## - age      1   207.13 223.13
## - tax      1   207.62 223.62
## - dis      1   207.64 223.64
## - medv     1   208.65 224.65
## - rad      1   250.98 266.98
## - nox      1   273.18 289.18
## 
## Call:  glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     medv, family = binomial, data = crime_train)
## 
## Coefficients:
## (Intercept)           zn          nox          age          dis  
##  -37.415922    -0.068648    42.807768     0.032950     0.654896  
##         rad          tax      ptratio         medv  
##    0.725109    -0.007756     0.323628     0.110472  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  457 Residual
## Null Deviance:       645.9 
## Residual Deviance: 197.3     AIC: 215.3

Model Number 3 (Logarithmic Model) with “Backwards” direction.

## Start:  AIC=232.1
## target ~ zn + chas + log_indus + log_nox + log_rm + log_age + 
##     log_dis + log_rad + log_tax + log_ptratio + log_lstat + log_medv
## 
##               Df Deviance    AIC
## - log_lstat    1   206.18 230.18
## - log_indus    1   206.33 230.33
## - log_rm       1   206.43 230.43
## - chas         1   207.52 231.52
## - zn           1   207.78 231.78
## - log_tax      1   207.90 231.90
## - log_age      1   208.05 232.05
## <none>             206.10 232.10
## - log_medv     1   208.66 232.66
## - log_ptratio  1   213.37 237.37
## - log_dis      1   217.09 241.09
## - log_rad      1   243.10 267.10
## - log_nox      1   270.57 294.57
## 
## Step:  AIC=230.17
## target ~ zn + chas + log_indus + log_nox + log_rm + log_age + 
##     log_dis + log_rad + log_tax + log_ptratio + log_medv
## 
##               Df Deviance    AIC
## - log_indus    1   206.42 228.42
## - log_rm       1   206.43 228.43
## - chas         1   207.70 229.70
## - zn           1   207.79 229.79
## - log_tax      1   208.00 230.00
## <none>             206.18 230.18
## - log_age      1   208.55 230.55
## - log_medv     1   208.75 230.75
## - log_ptratio  1   213.43 235.43
## - log_dis      1   217.09 239.09
## - log_rad      1   243.72 265.72
## - log_nox      1   270.66 292.66
## 
## Step:  AIC=228.42
## target ~ zn + chas + log_nox + log_rm + log_age + log_dis + log_rad + 
##     log_tax + log_ptratio + log_medv
## 
##               Df Deviance    AIC
## - log_rm       1   206.64 226.64
## - log_tax      1   208.02 228.02
## - zn           1   208.35 228.35
## - chas         1   208.40 228.40
## <none>             206.42 228.42
## - log_age      1   208.73 228.73
## - log_medv     1   208.99 228.99
## - log_ptratio  1   213.97 233.97
## - log_dis      1   217.39 237.39
## - log_rad      1   244.37 264.37
## - log_nox      1   280.85 300.85
## 
## Step:  AIC=226.64
## target ~ zn + chas + log_nox + log_age + log_dis + log_rad + 
##     log_tax + log_ptratio + log_medv
## 
##               Df Deviance    AIC
## - log_tax      1   208.16 226.16
## - zn           1   208.40 226.40
## - chas         1   208.52 226.52
## <none>             206.64 226.64
## - log_age      1   209.88 227.88
## - log_ptratio  1   215.35 233.35
## - log_medv     1   216.50 234.50
## - log_dis      1   218.30 236.30
## - log_rad      1   245.33 263.33
## - log_nox      1   282.10 300.10
## 
## Step:  AIC=226.16
## target ~ zn + chas + log_nox + log_age + log_dis + log_rad + 
##     log_ptratio + log_medv
## 
##               Df Deviance    AIC
## <none>             208.16 226.16
## - zn           1   210.21 226.21
## - chas         1   210.51 226.51
## - log_age      1   211.56 227.56
## - log_ptratio  1   216.44 232.44
## - log_medv     1   224.12 240.12
## - log_dis      1   225.23 241.23
## - log_rad      1   262.88 278.88
## - log_nox      1   282.96 298.96
## 
## Call:  glm(formula = target ~ zn + chas + log_nox + log_age + log_dis + 
##     log_rad + log_ptratio + log_medv, family = binomial, data = crime_train_log)
## 
## Coefficients:
## (Intercept)           zn         chas      log_nox      log_age  
##    -22.7058      -0.0321       1.0496      24.4943       0.9951  
##     log_dis      log_rad  log_ptratio     log_medv  
##      3.0797       2.4928       5.6817       3.0752  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  457 Residual
## Null Deviance:       645.9 
## Residual Deviance: 208.2     AIC: 226.2

Model Number 1 with “Forward” direction.

## Start:  AIC=647.88
## target ~ 1
## 
##           Df Deviance    AIC
## + nox      1   292.01 296.01
## + rad      1   404.16 408.16
## + dis      1   409.50 413.50
## + age      1   424.75 428.75
## + tax      1   442.38 446.38
## + indus    1   453.23 457.23
## + zn       1   518.46 522.46
## + lstat    1   528.01 532.01
## + medv     1   609.62 613.62
## + ptratio  1   615.64 619.64
## + rm       1   634.82 638.82
## + chas     1   642.86 646.86
## <none>         645.88 647.88
## 
## Step:  AIC=296.01
## target ~ nox
## 
##           Df Deviance    AIC
## + rad      1   239.51 245.51
## + rm       1   284.63 290.63
## + medv     1   285.86 291.86
## + indus    1   288.11 294.11
## + zn       1   288.29 294.29
## + tax      1   288.40 294.40
## + chas     1   288.47 294.47
## <none>         292.01 296.01
## + ptratio  1   290.13 296.13
## + age      1   290.62 296.62
## + dis      1   290.91 296.91
## + lstat    1   291.93 297.93
## 
## Step:  AIC=245.51
## target ~ nox + rad
## 
##           Df Deviance    AIC
## + tax      1   224.47 232.47
## + indus    1   233.09 241.09
## + zn       1   235.19 243.19
## + rm       1   236.60 244.60
## + age      1   236.76 244.76
## + medv     1   236.86 244.86
## + ptratio  1   237.33 245.33
## <none>         239.51 245.51
## + chas     1   237.64 245.64
## + dis      1   237.96 245.96
## + lstat    1   239.47 247.47
## 
## Step:  AIC=232.47
## target ~ nox + rad + tax
## 
##           Df Deviance    AIC
## + ptratio  1   218.70 228.70
## + zn       1   219.94 229.94
## + age      1   220.44 230.44
## <none>         224.47 232.47
## + dis      1   223.30 233.30
## + indus    1   223.40 233.40
## + chas     1   223.63 233.63
## + lstat    1   223.71 233.71
## + rm       1   223.75 233.75
## + medv     1   224.27 234.27
## 
## Step:  AIC=228.7
## target ~ nox + rad + tax + ptratio
## 
##         Df Deviance    AIC
## + age    1   214.46 226.46
## + medv   1   215.23 227.23
## + rm     1   216.12 228.12
## + zn     1   216.32 228.32
## <none>       218.70 228.70
## + chas   1   216.81 228.81
## + dis    1   217.79 229.79
## + indus  1   217.82 229.82
## + lstat  1   218.57 230.57
## 
## Step:  AIC=226.46
## target ~ nox + rad + tax + ptratio + age
## 
##         Df Deviance    AIC
## + medv   1   209.55 223.55
## + rm     1   212.31 226.31
## + dis    1   212.40 226.40
## <none>       214.46 226.46
## + zn     1   212.67 226.67
## + chas   1   213.24 227.24
## + indus  1   213.38 227.38
## + lstat  1   214.35 228.35
## 
## Step:  AIC=223.55
## target ~ nox + rad + tax + ptratio + age + medv
## 
##         Df Deviance    AIC
## + dis    1   203.45 219.45
## <none>       209.55 223.55
## + zn     1   207.64 223.64
## + lstat  1   208.07 224.07
## + chas   1   208.33 224.33
## + indus  1   208.58 224.58
## + rm     1   208.79 224.79
## 
## Step:  AIC=219.45
## target ~ nox + rad + tax + ptratio + age + medv + dis
## 
##         Df Deviance    AIC
## + zn     1   197.32 215.32
## + chas   1   201.29 219.29
## + rm     1   201.35 219.35
## <none>       203.45 219.45
## + lstat  1   202.05 220.05
## + indus  1   202.23 220.23
## 
## Step:  AIC=215.32
## target ~ nox + rad + tax + ptratio + age + medv + dis + zn
## 
##         Df Deviance    AIC
## <none>       197.32 215.32
## + lstat  1   195.51 215.51
## + rm     1   195.75 215.75
## + chas   1   195.97 215.97
## + indus  1   196.33 216.33
## 
## Call:  glm(formula = target ~ nox + rad + tax + ptratio + age + medv + 
##     dis + zn, family = "binomial")
## 
## Coefficients:
## (Intercept)          nox          rad          tax      ptratio  
##  -37.415922    42.807768     0.725109    -0.007756     0.323628  
##         age         medv          dis           zn  
##    0.032950     0.110472     0.654896    -0.068648  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  457 Residual
## Null Deviance:       645.9 
## Residual Deviance: 197.3     AIC: 215.3

As we can see, the backward and forward stepwise searches converge on the same reduced version of model number 1 (dropping rm, lstat, chas, and indus), with a slightly better AIC (215.3 versus 218.05) at the cost of a small increase in residual deviance (197.3 versus 192.05). The backward search on the logarithmic model similarly trims it to eight predictors and lowers its AIC to 226.2. In other words, the stepwise approach prunes the models somewhat but does not change the overall picture.

For model number 4, let's include a few transformed variables and an interaction term. After looking at the independent variables, those most likely to benefit from a log transformation are lstat, dis, and medv; we will also keep the rad:tax interaction.

Let's add the log transformations of these predictors to our data set and remove the original lstat, dis, and medv variables.
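A minimal sketch of how this partially transformed data set could be assembled (new_crime is the name used in the model call further below; the exact construction is an assumption):

# Replace lstat, dis, and medv with their logarithms; keep the remaining predictors as-is
new_crime <- crime_train %>%
  dplyr::mutate(log_stat = log(lstat),
                log_dis  = log(dis),
                log_medv = log(medv)) %>%
  dplyr::select(-lstat, -dis, -medv)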

zn indus chas nox rm age rad tax ptratio target log_stat log_dis log_medv
0 19.58 0 0.605 7.929 96.2 5 403 14.7 1 1.308333 0.7158378 3.912023
0 19.58 1 0.871 5.403 100.0 5 403 14.7 1 3.289148 0.2788431 2.595255
0 18.10 0 0.740 6.485 100.0 24 666 20.2 1 2.936513 0.6822884 2.734367
30 4.93 0 0.428 6.393 7.8 6 300 16.6 0 1.646734 1.9509688 3.165475
0 2.46 0 0.488 7.155 92.2 3 193 17.8 0 1.572774 0.9934740 3.634951
0 8.56 0 0.520 6.781 71.3 5 384 20.9 0 2.037317 1.0494571 3.277145

Now that we have our partially transformed dataset, let us build model number 4 (the partially log transformed model).

## 
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = new_crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8899  -0.1546  -0.0026   0.0417   3.4155  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.248e+01  1.020e+01  -5.147 2.65e-07 ***
## zn          -4.868e-02  3.011e-02  -1.617 0.105983    
## indus       -3.284e-02  4.863e-02  -0.675 0.499477    
## chas         8.426e-01  7.919e-01   1.064 0.287341    
## nox          5.033e+01  7.993e+00   6.297 3.04e-10 ***
## rm          -1.339e-01  6.414e-01  -0.209 0.834600    
## age          3.696e-02  1.355e-02   2.728 0.006365 ** 
## rad          9.322e-01  3.027e-01   3.080 0.002073 ** 
## tax         -1.934e-03  3.341e-03  -0.579 0.562581    
## ptratio      3.829e-01  1.281e-01   2.990 0.002793 ** 
## log_stat     3.411e-02  6.891e-01   0.050 0.960521    
## log_dis      3.192e+00  9.245e-01   3.453 0.000555 ***
## log_medv     3.266e+00  1.717e+00   1.902 0.057225 .  
## rad:tax     -8.294e-04  5.492e-04  -1.510 0.130993    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 191.69  on 452  degrees of freedom
## AIC: 219.69
## 
## Number of Fisher Scoring iterations: 9

Model number 4 appears comparable to model number 1 (residual deviance 191.69, AIC 219.69).

Let's add the bucketed variables and see if we can improve the AIC and/or residual deviance with model number 5 (the bucket model).
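A minimal sketch of the data set behind this model (full_df is the name used in the call below; combining the partially log-transformed predictors with the two buckets, and dropping rm and ptratio in favor of their bucketed versions, is an assumption based on the coefficient names):

# Buckets from Section 2 added to the partially log-transformed predictors;
# rm and ptratio are replaced by home_size and stud_teach_ratio
full_df <- new_crime %>%
  dplyr::mutate(stud_teach_ratio = cut(ptratio, breaks = 3,
                                       labels = c("Small", "Medium", "Large")),
                home_size        = cut(rm, breaks = 3,
                                       labels = c("Small", "Medium", "Large"))) %>%
  dplyr::select(-rm, -ptratio)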

## 
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = full_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9330  -0.1211  -0.0005   0.0279   3.7792  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.058e+01  9.658e+00  -4.202 2.64e-05 ***
## zn                     -8.588e-02  4.002e-02  -2.146 0.031870 *  
## indus                  -3.225e-02  5.499e-02  -0.586 0.557613    
## chas                    5.846e-03  8.440e-01   0.007 0.994473    
## nox                     5.535e+01  1.022e+01   5.413 6.19e-08 ***
## age                     4.131e-02  1.325e-02   3.118 0.001818 ** 
## rad                     1.075e+00  3.592e-01   2.991 0.002777 ** 
## tax                    -4.619e-03  3.628e-03  -1.273 0.202988    
## log_stat               -3.493e-01  6.623e-01  -0.527 0.597971    
## log_dis                 3.805e+00  1.060e+00   3.589 0.000332 ***
## log_medv                1.478e+00  1.426e+00   1.037 0.299858    
## stud_teach_ratioMedium  7.783e-01  8.548e-01   0.910 0.362582    
## stud_teach_ratioLarge   1.192e+00  7.966e-01   1.497 0.134494    
## home_sizeMedium        -3.651e+00  1.071e+00  -3.409 0.000652 ***
## home_sizeLarge         -1.876e+00  1.638e+00  -1.146 0.251982    
## rad:tax                -9.403e-04  6.343e-04  -1.482 0.138249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 179.58  on 450  degrees of freedom
## AIC: 211.58
## 
## Number of Fisher Scoring iterations: 10

The bucket model improves both the residual deviance (179.58) and the AIC (211.58) quite a bit. Let's apply the stepwise ("backward") approach and see if we can improve on it further.

## Start:  AIC=211.58
## target ~ zn + indus + chas + nox + age + rad + tax + log_stat + 
##     log_dis + log_medv + stud_teach_ratio + home_size + rad:tax
## 
##                    Df Deviance    AIC
## - chas              1   179.58 209.58
## - log_stat          1   179.85 209.85
## - indus             1   179.93 209.93
## - stud_teach_ratio  2   182.03 210.03
## - log_medv          1   180.67 210.67
## - rad:tax           1   180.76 210.76
## <none>                  179.58 211.58
## - zn                1   185.49 215.49
## - age               1   190.50 220.50
## - log_dis           1   194.29 224.29
## - home_size         2   200.39 228.39
## - nox               1   228.46 258.46
## 
## Step:  AIC=209.58
## target ~ zn + indus + nox + age + rad + tax + log_stat + log_dis + 
##     log_medv + stud_teach_ratio + home_size + rad:tax
## 
##                    Df Deviance    AIC
## - log_stat          1   179.85 207.85
## - indus             1   179.94 207.94
## - stud_teach_ratio  2   182.05 208.05
## - log_medv          1   180.70 208.70
## - rad:tax           1   180.80 208.80
## <none>                  179.58 209.58
## - zn                1   185.65 213.65
## - age               1   190.98 218.98
## - log_dis           1   194.30 222.30
## - home_size         2   200.89 226.89
## - nox               1   228.61 256.61
## 
## Step:  AIC=207.86
## target ~ zn + indus + nox + age + rad + tax + log_dis + log_medv + 
##     stud_teach_ratio + home_size + rad:tax
## 
##                    Df Deviance    AIC
## - indus             1   180.28 206.28
## - stud_teach_ratio  2   182.44 206.44
## - rad:tax           1   181.13 207.13
## <none>                  179.85 207.85
## - log_medv          1   182.08 208.08
## - zn                1   186.10 212.10
## - age               1   191.06 217.06
## - log_dis           1   194.84 220.84
## - home_size         2   200.89 224.89
## - nox               1   228.77 254.77
## 
## Step:  AIC=206.28
## target ~ zn + nox + age + rad + tax + log_dis + log_medv + stud_teach_ratio + 
##     home_size + rad:tax
## 
##                    Df Deviance    AIC
## - stud_teach_ratio  2   182.62 204.62
## - rad:tax           1   181.80 205.80
## <none>                  180.28 206.28
## - log_medv          1   182.29 206.29
## - zn                1   187.58 211.58
## - age               1   191.11 215.11
## - log_dis           1   196.00 220.00
## - home_size         2   201.05 223.05
## - nox               1   231.42 255.42
## 
## Step:  AIC=204.62
## target ~ zn + nox + age + rad + tax + log_dis + log_medv + home_size + 
##     rad:tax
## 
##             Df Deviance    AIC
## - log_medv   1   183.15 203.15
## - rad:tax    1   184.22 204.22
## <none>           182.62 204.62
## - age        1   191.16 211.16
## - zn         1   193.65 213.65
## - log_dis    1   198.44 218.44
## - home_size  2   202.55 220.55
## - nox        1   256.34 276.34
## 
## Step:  AIC=203.15
## target ~ zn + nox + age + rad + tax + log_dis + home_size + rad:tax
## 
##             Df Deviance    AIC
## - rad:tax    1   184.83 202.83
## <none>           183.15 203.15
## - age        1   191.23 209.23
## - zn         1   194.10 212.10
## - log_dis    1   199.12 217.12
## - home_size  2   205.54 221.54
## - nox        1   257.26 275.26
## 
## Step:  AIC=202.83
## target ~ zn + nox + age + rad + tax + log_dis + home_size
## 
##             Df Deviance    AIC
## <none>           184.83 202.83
## - age        1   192.68 208.68
## - zn         1   194.90 210.90
## - tax        1   198.15 214.15
## - log_dis    1   201.91 217.91
## - home_size  2   208.05 222.05
## - rad        1   245.78 261.78
## - nox        1   259.34 275.34
## 
## Call:  glm(formula = target ~ zn + nox + age + rad + tax + log_dis + 
##     home_size, family = binomial, data = full_df)
## 
## Coefficients:
##     (Intercept)               zn              nox              age  
##      -30.294374        -0.098534        48.348341         0.027545  
##             rad              tax          log_dis  home_sizeMedium  
##        0.780517        -0.009425         3.309187        -2.842727  
##  home_sizeLarge  
##       -0.246644  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  457 Residual
## Null Deviance:       645.9 
## Residual Deviance: 184.8     AIC: 202.8

The stepwise approach on this bucket data set slightly increases the residual deviance (184.8) but improves the AIC (202.8). Let's call this stepwise bucket model model number 6.

One feature we noticed about the data set is that the independent variables are considerably correlated with each other. Given that, we can try principal components analysis (PCA) to develop our next model (the PCA model).
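A minimal sketch of the decomposition (prcrime is the name used below; running prcomp on the unscaled predictors is an assumption, but it is consistent with the variance pattern and loadings reported here):

# PCA on the twelve predictors (column 13 is the target); unscaled, so high-variance
# variables such as tax dominate the first component
prcrime <- prcomp(crime_train[, -13])
summary(prcrime)                    # proportion of variance explained per component
round(prcrime$rotation[, 1], 2)     # loadings of the first component (shown further below)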

## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5     PC6
## Standard deviation     169.0899 28.76861 16.35008 8.60415 4.25551 3.76253
## Proportion of Variance   0.9592  0.02777  0.00897 0.00248 0.00061 0.00047
## Cumulative Proportion    0.9592  0.98700  0.99597 0.99846 0.99906 0.99954
##                            PC7    PC8     PC9    PC10   PC11    PC12
## Standard deviation     3.08576 1.6837 1.06770 0.45869 0.2465 0.05494
## Proportion of Variance 0.00032 0.0001 0.00004 0.00001 0.0000 0.00000
## Cumulative Proportion  0.99986 1.0000 0.99999 1.00000 1.0000 1.00000

It appears that the first principal component explains about 95.9% of the variance (largely because the PCA is run on unscaled predictors, so the high-variance tax variable dominates), so let us use just the first principal component for our model.

Let's see what the first PC is composed of:

##      zn   indus    chas     nox      rm     age     dis     rad     tax 
##    0.05   -0.03    0.00    0.00    0.00   -0.09    0.01   -0.05   -0.99 
## ptratio   lstat    medv 
##   -0.01   -0.02    0.03
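A binomial fit on the first component score would look like the sketch below (prcrime as above); note that the summary that follows appears to have been produced with glm's default gaussian family instead, so its deviance and AIC are on a different scale from the logistic models.

# Logistic regression of the target on the first principal component score
pc_model <- glm(crime_train$target ~ prcrime$x[, 1], family = binomial)
summary(pc_model)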
## 
## Call:
## glm(formula = crime_train$target ~ prcrime$x[, 1])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.04633  -0.27195  -0.09360   0.04062   0.85272  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.4914163  0.0182520   26.92   <2e-16 ***
## prcrime$x[, 1] -0.0018282  0.0001081  -16.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1552412)
## 
##     Null deviance: 116.466  on 465  degrees of freedom
## Residual deviance:  72.032  on 464  degrees of freedom
## AIC: 458.39
## 
## Number of Fisher Scoring iterations: 2

The reported residual deviance (72.03) looks dramatically better, but because this fit used the gaussian family, its deviance and AIC (458.39) are not on the same scale as the binomial models and are not directly comparable.

Let's try the same approach with the modified data set (the one with some of the independent variables logarithmically transformed).

## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5     PC6
## Standard deviation     168.9765 28.54780 16.30897 4.31628 3.19509 1.83312
## Proportion of Variance   0.9625  0.02747  0.00897 0.00063 0.00034 0.00011
## Cumulative Proportion    0.9625  0.98992  0.99889 0.99952 0.99986 0.99997
##                            PC7    PC8    PC9   PC10   PC11    PC12
## Standard deviation     0.71502 0.3197 0.2652 0.2465 0.1855 0.05275
## Proportion of Variance 0.00002 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion  0.99999 1.0000 1.0000 1.0000 1.0000 1.00000
## 
## Call:
## glm(formula = new_crime$target ~ prcrime1$x[, 1])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.04563  -0.27182  -0.09292   0.03988   0.85321  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.4914163  0.0182537   26.92   <2e-16 ***
## prcrime1$x[, 1] -0.0018291  0.0001081  -16.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1552707)
## 
##     Null deviance: 116.466  on 465  degrees of freedom
## Residual deviance:  72.046  on 464  degrees of freedom
## AIC: 458.48
## 
## Number of Fisher Scoring iterations: 2

This model does not appear to be that much better than the PCA model with the original dataset.

Let’s move onto selecting the appropriate (or best fit) model for prediction.

4. Select Models (25 Points)

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

Now that we have several models at our disposal, we should evaluate them with respect to their predictive ability. For these binary logistic regression models, the metrics we will use are:

  1. AIC
  2. AUC/ROC
  3. Accuracy
  4. Classification Error Rate
  5. Precision
  6. Sensitivity
  7. Specificity
  8. F1 Score
  9. Confusion Matrix

After looking at the residual deviance scores and AIC scores in the previous section, we’ll take the most promising ones and evaluate them here.

Let us evaluate model number 1 (the baseline model). We will partition the training data into 70% for training and 30% for validation, then build a confusion matrix and compute our evaluation metrics from it.
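A minimal sketch of this evaluation workflow, using caret and pROC (both loaded in the appendix); the seed, the refit on the 70% split, and the 0.5 cutoff are assumptions:

set.seed(123)   # hypothetical seed; the actual split is not shown
train_idx <- caret::createDataPartition(crime_train$target, p = 0.7, list = FALSE)
train_set <- crime_train[train_idx, ]
valid_set <- crime_train[-train_idx, ]

# Refit Model 1 on the 70% split and classify the held-out 30% at a 0.5 cutoff
m1    <- glm(target ~ ., family = binomial, data = train_set)
probs <- predict(m1, newdata = valid_set, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix with 1 (above-median crime) as the positive class, plus the AUC
caret::confusionMatrix(factor(preds, levels = c(0, 1)),
                       factor(valid_set$target, levels = c(0, 1)),
                       positive = "1")
pROC::auc(pROC::roc(valid_set$target, probs))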

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 64  8
##          1  6 61
##                                           
##                Accuracy : 0.8993          
##                  95% CI : (0.8368, 0.9438)
##     No Information Rate : 0.5036          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7985          
##  Mcnemar's Test P-Value : 0.7893          
##                                           
##             Sensitivity : 0.8841          
##             Specificity : 0.9143          
##          Pos Pred Value : 0.9104          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.4964          
##          Detection Rate : 0.4388          
##    Detection Prevalence : 0.4820          
##       Balanced Accuracy : 0.8992          
##                                           
##        'Positive' Class : 1               
## 
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8840580            0.9142857            0.9104478 
##       Neg Pred Value            Precision               Recall 
##            0.8888889            0.9104478            0.8840580 
##                   F1           Prevalence       Detection Rate 
##            0.8970588            0.4964029            0.4388489 
## Detection Prevalence    Balanced Accuracy 
##            0.4820144            0.8991718

## Area under the curve: 0.9647

Overall, this model appears to perform quite well: the values we care about, such as sensitivity, specificity, and AUC, are all quite high. Could this partly reflect overfitting? Possibly. But let's continue and evaluate the other models.

Let's evaluate the third model (the one with all of the log-transformed independent variables).

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39 33
##          1 36 31
##                                           
##                Accuracy : 0.5036          
##                  95% CI : (0.4176, 0.5894)
##     No Information Rate : 0.5396          
##     P-Value [Acc > NIR] : 0.8254          
##                                           
##                   Kappa : 0.0044          
##  Mcnemar's Test P-Value : 0.8097          
##                                           
##             Sensitivity : 0.4844          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.4627          
##          Neg Pred Value : 0.5417          
##              Prevalence : 0.4604          
##          Detection Rate : 0.2230          
##    Detection Prevalence : 0.4820          
##       Balanced Accuracy : 0.5022          
##                                           
##        'Positive' Class : 1               
## 
##          Sensitivity          Specificity       Pos Pred Value 
##            0.4843750            0.5200000            0.4626866 
##       Neg Pred Value            Precision               Recall 
##            0.5416667            0.4626866            0.4843750 
##                   F1           Prevalence       Detection Rate 
##            0.4732824            0.4604317            0.2230216 
## Detection Prevalence    Balanced Accuracy 
##            0.4820144            0.5021875

## Area under the curve: 0.9853

This model performed significantly worse on the validation split, despite its AIC and residual deviance not being drastically worse than the first model's. We will not use this model as our final choice.

If transforming all of the variables logarithmically was not the best option, then what if we only transform the few we thought would most improve the model? In this next model, we also include the buckets (home size and student-to-teacher ratio).

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 32
##          1 35 32
##                                           
##                Accuracy : 0.518           
##                  95% CI : (0.4317, 0.6035)
##     No Information Rate : 0.5396          
##     P-Value [Acc > NIR] : 0.7247          
##                                           
##                   Kappa : 0.0332          
##  Mcnemar's Test P-Value : 0.8070          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.5333          
##          Pos Pred Value : 0.4776          
##          Neg Pred Value : 0.5556          
##              Prevalence : 0.4604          
##          Detection Rate : 0.2302          
##    Detection Prevalence : 0.4820          
##       Balanced Accuracy : 0.5167          
##                                           
##        'Positive' Class : 1               
## 
##          Sensitivity          Specificity       Pos Pred Value 
##            0.5000000            0.5333333            0.4776119 
##       Neg Pred Value            Precision               Recall 
##            0.5555556            0.4776119            0.5000000 
##                   F1           Prevalence       Detection Rate 
##            0.4885496            0.4604317            0.2302158 
## Detection Prevalence    Balanced Accuracy 
##            0.4820144            0.5166667

## Area under the curve: 0.96

Again, the numbers suggest that this model may not be the best fit either. Let's see whether the results change if we use the stepwise ("backward") version of this model instead.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34 38
##          1 37 30
##                                          
##                Accuracy : 0.4604         
##                  95% CI : (0.3756, 0.547)
##     No Information Rate : 0.5108         
##     P-Value [Acc > NIR] : 0.8984         
##                                          
##                   Kappa : -0.08          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.4412         
##             Specificity : 0.4789         
##          Pos Pred Value : 0.4478         
##          Neg Pred Value : 0.4722         
##              Prevalence : 0.4892         
##          Detection Rate : 0.2158         
##    Detection Prevalence : 0.4820         
##       Balanced Accuracy : 0.4600         
##                                          
##        'Positive' Class : 1              
## 
##          Sensitivity          Specificity       Pos Pred Value 
##            0.4411765            0.4788732            0.4477612 
##       Neg Pred Value            Precision               Recall 
##            0.4722222            0.4477612            0.4411765 
##                   F1           Prevalence       Detection Rate 
##            0.4444444            0.4892086            0.2158273 
## Detection Prevalence    Balanced Accuracy 
##            0.4820144            0.4600249

## Area under the curve: 0.9785

Again, there is not much improvement.

Let's look at one more model before finally deciding: the principal components model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37 35
##          1 28 39
##                                           
##                Accuracy : 0.5468          
##                  95% CI : (0.4602, 0.6313)
##     No Information Rate : 0.5324          
##     P-Value [Acc > NIR] : 0.4001          
##                                           
##                   Kappa : 0.0956          
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.5270          
##             Specificity : 0.5692          
##          Pos Pred Value : 0.5821          
##          Neg Pred Value : 0.5139          
##              Prevalence : 0.5324          
##          Detection Rate : 0.2806          
##    Detection Prevalence : 0.4820          
##       Balanced Accuracy : 0.5481          
##                                           
##        'Positive' Class : 1               
## 
##          Sensitivity          Specificity       Pos Pred Value 
##            0.5270270            0.5692308            0.5820896 
##       Neg Pred Value            Precision               Recall 
##            0.5138889            0.5820896            0.5270270 
##                   F1           Prevalence       Detection Rate 
##            0.5531915            0.5323741            0.2805755 
## Detection Prevalence    Balanced Accuracy 
##            0.4820144            0.5481289

I suspect that these models, including the PCA one, add a degree of complexity that may lead to overfitting or an overall poor fit.

Out of all the models, the original (baseline) model appears to have the best numbers, including sensitivity and specificity. As discussed before, several factors may explain this. For instance, any time we take a continuous variable (such as the student-to-teacher ratio, or the number of rooms as a measure of house size) and convert it into a categorical variable, we run the risk of losing valuable information. In addition, transforming variables does not necessarily improve predictive power. Ultimately, Occam's Razor usually wins: the least complex, most straightforward model typically performs best. Given these circumstances, we will make predictions on the evaluation set with the original baseline model.
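A minimal sketch of how the classifications below could be generated, assuming model1 (the baseline fit) and a 0.5 probability cutoff:

# Predicted probabilities and 0/1 classifications for the evaluation data set
eval_probs <- predict(model1, newdata = crime_test, type = "response")
eval_class <- ifelse(eval_probs > 0.5, 1, 0)
eval_class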

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  0  1  1  1  0  0  0  0  0  0  0  0  1  1  1  0  0  1  0  0  0  0  0  0  0 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
##  1  0  1  1  1  1  1  1  1  1  1  1  1  1  0

Appendix

# Loading Packages
requiredpackages <- c('knitr', 'kableExtra', 'psych', 'ggplot2', 'reshape2', 'corrplot', 'tidyr', 'dplyr', 'plyr', 'MASS', 'caret', 'pscl', 'lmtest', 'pROC', 'ROCR')
for (i in requiredpackages){
  if(!require(i,character.only=T)) install.packages(i)
  library(i,character.only=T)
}

# Loading in datasets
url_train <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%203/crime-training-data_modified.csv'
url_test <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%203/crime-evaluation-data_modified.csv'
crime_train <- read.csv(url_train, header = TRUE)
crime_test <- read.csv(url_test, header = TRUE)
kable(head(crime_train), "html") %>% kable_styling(bootstrap_options = c("striped", "hover"))

# Number of Rows in Training Dataset
nrow(crime_train) 

# Summary Statistic:
describe(crime_train) %>%  kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))

# Data Visualization using BoxPlot Distribution
crime_melt <- melt(crime_train)
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + xlab("") + ylab("Values")
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + coord_flip() + xlab("") + ylab("Values")

# Scale Y Log
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + scale_y_log10() + coord_flip() + xlab("Log Transformation of Independent Variables") + ylab("")

# Histogram Visualizations
ggplot(crime_melt, aes(x = value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free") + xlab("") + ylab("Frequency")

# Scatterplot of independent variables to 'target' variable
meltTarget <- melt(crime_train, id.vars = c("target"))
ggplot(meltTarget, aes(x=value, y=target)) + geom_point() + facet_wrap(~variable, scale="free")

# Correlations and its visualizations
cormatrix <- cor(crime_train, method="pearson", use="complete.obs")
kable(cormatrix, "html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
corrplot(cormatrix, method="circle")

# More correlations
cor_df <- as.data.frame(cormatrix) %>% dplyr::select(target)
pos_cor <- subset(cor_df, cor_df[,'target'] > .2)
neg_cor <- subset(cor_df, cor_df[,'target'] < -.2)
neut_cor <- subset(cor_df, cor_df[,'target'] > -.2 & cor_df[,'target'] < .2)
print("Correlation to Target")
cor_df

print("Positive Correlative Factors:")
row.names(pos_cor) 

print("Negative Correlative Factors:")
row.names(neg_cor) 

print("Neutral Correlative Factors")
row.names(neut_cor) 

# Looking for very highly correlated, arbitrarily set at > .7 or < -.7
highly_pos_correlated <- subset(cor_df, cor_df[,'target'] > .7 & cor_df[,'target'] < 1)
highly_neg_correlated <- subset(cor_df, cor_df[,'target'] < -.7 & cor_df[,'target'] > -1)

print("Highly Positive Correlated Variables")
highly_pos_correlated

print("Highly Negative Correlated Variables")
highly_neg_correlated

# Missing Values?
sum(is.na(crime_train))

# Bucketing student to teaching ratio
stud_teach_ratio <- crime_train$ptratio
stud_teach_ratio <- cut(stud_teach_ratio, breaks = 3, labels=c("Small", "Medium", "Large"))

crime_train_1 <- cbind(crime_train, stud_teach_ratio)

x <- table(crime_train_1$target, crime_train_1$stud_teach_ratio)
x

# Percentage
print("Percentage where Crime is above the Median (1)")
x1 <- as.data.frame(x)
ggplot(x1, aes(x=Var2,y=Freq)) + geom_col(aes(fill=Var1), position="dodge") + theme_bw() + xlab("Student to Teacher Ratio Size") + ylab("Count")

ggplot(x1, aes(x=Var2,y=Freq)) + geom_col(aes(fill=Var1), position="fill") + theme_bw() + ylab("Percentage of 1 (Above Crime Median) vs 0 (Below Crime Median)") + xlab("")

count(crime_train_1, 'stud_teach_ratio')

# Bucketing Home Sizes by counting rooms
home_size <- crime_train$rm
home_size <- cut(home_size, breaks = 3, labels=c("Small", "Medium", "Large"))

crime_train_1 <- cbind(crime_train_1, home_size)

x1 <- table(crime_train_1$target, crime_train_1$home_size)
x1

# Percentage
print("Percentage where Crime is above the Median (1)")
x2 <- as.data.frame(x1)
ggplot(x2, aes(x=Var2,y=Freq)) + geom_col(aes(fill=Var1), position="dodge") + theme_bw() + xlab("Size of the Home") + ylab("Count")

# Log transformation of independent variables
hist(crime_train$zn,breaks=20, main="Percentage of Land Noted as 'Residential'", xlab="Zoning Size", col = "lightgreen")

hist(log(crime_train$zn), breaks=20, main="Percentage of Logarithmic Land Noted as 'Residential'", xlab="Zoning Size", col = "blue")

# Creating a dataset with the logarithmic variables
crime_train_log <- cbind(crime_train$target, crime_train$zn, crime_train$chas, log(crime_train[,c(2,4,5,6,7,8,9,10,11,12)]))
colnames(crime_train_log) <- c('target', 'zn', 'chas', 'log_indus', 'log_nox', 'log_rm', 'log_age', 'log_dis', 'log_rad', 'log_tax', 'log_ptratio', 'log_lstat', 'log_medv')

# Histogram and Scatterplot with the new logarithmic transformations
ggplot(melt(crime_train_log), aes(x = value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")

ggplot(melt(crime_train_log, id.vars=c('target')), aes(x=value, y=target)) + geom_point() + facet_wrap(~variable, scale="free")

# Looking at the most correlated terms and attempting to find the interactive terms
all_cor <- as.data.frame(cormatrix)
all_cor <- all_cor[!row.names(all_cor) == 'target', !colnames(all_cor) == 'target']

# Eliminate all 1's from the diagonal and replace it with zero temporarily 
all_cor[all_cor == 1] <- 0 

print("Maximum Positive Correlations amongst independent variables: ")
max_cor <- apply(all_cor, 1, max)
max_cor

print("Maximum Negative Correlations amongst independent variables: ")
min_cor <- apply(all_cor, 1, min)
min_cor

var1 <- names(max_cor)
var2 <- colnames(all_cor)[max.col(all_cor,ties.method="first")]
cor_values <- as.numeric(max_cor)

cor_df1 <- cbind(var1, var2, max_cor)
cor_df1 %>% kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))

# Building models

# Model 1
logitmod <- glm(target ~ ., family = binomial, data=crime_train)
summary(logitmod)

# Model 2
logitmod_int <- glm(target ~ . + rad:tax, family = binomial, data=crime_train)
summary(logitmod_int)

# Model 3
logitmod_log <- glm(target ~ ., family = binomial, data=crime_train_log)
summary(logitmod_log)

#Model Number 1 with "Backwards" Direction
step(logitmod, direction="backward")


# Model Number 3 (Logarithmic Model) with "Backwards" direction.
step(logitmod_log, direction="backward")

# Model Number 1 with "Forward" direction.
attach(crime_train)
step(glm(target ~ 1, family="binomial"), target ~ zn + indus + chas + nox + rm + age + dis + 
    rad + tax + ptratio + lstat + medv,direction="forward")
detach(crime_train)

# Creating a new dataset with some predictor variables with log transformations
new_crime <- crime_train[,!names(crime_train) %in% c('lstat','dis','medv')]
log_crime <- data.frame(log(crime_train[,'lstat']), log(crime_train[,'dis']), log(crime_train[,'medv']))
new_crime <- cbind(new_crime, log_crime)

colnames(new_crime)[11:13] <- c('log_stat','log_dis','log_medv')
head(new_crime) %>%  kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))

# Building model 4 from the above dataset
logitmod_log_transformed <- glm(target ~ . + rad:tax, family = binomial, data=new_crime)
summary(logitmod_log_transformed) 

# Model 5
full_df <- cbind(new_crime, stud_teach_ratio)
full_df <- cbind(full_df, home_size)
full_df <- full_df[,!names(full_df) %in% c('rm','ptratio')]

full_logitmod <- glm(target ~ . + rad:tax, family = binomial, data=full_df)
summary(full_logitmod)

# Backward Approach (stepwise) with model 5
full_logitmod_back <- step(full_logitmod, direction="backward")
full_logitmod_back

# Model 6 (PCA model)
ccrime <- crime_train[,-13]
prcrime <- prcomp(ccrime)
summary(prcrime)

# First PC
round(prcrime$rotation[,1],2)

lmodpcr <- glm(crime_train$target ~ prcrime$x[,1], family = binomial)
summary(lmodpcr)

# Same PCA approach on the modified dataset (the one with some independent variables log-transformed)
ccrime1 <- new_crime[,-10]
prcrime1 <- prcomp(ccrime1)
summary(prcrime1)

lmodpcr1 <- glm(new_crime$target ~ prcrime1$x[,1], family = binomial)
summary(lmodpcr1)

# Evaluating the models

# Model 1

# Reference: https://www.r-bloggers.com/evaluating-logistic-regression-models/
# logitmod

#length(coef(logitmod))
# Let's create a partition
Train <- createDataPartition(crime_train$target, p=0.7, list=FALSE)
training <- crime_train[Train, ]
testing <- crime_train[-Train, ]

pred <- round(predict(logitmod, newdata=testing, type="response"), 3)
pred1 <- ifelse(pred > 0.5, 1, 0)

#Confusion Matrix
ans <- confusionMatrix(data=factor(pred1, levels=c(0,1)), reference=factor(testing$target, levels=c(0,1)), positive='1')
ans
ans$byClass

plot(roc(testing$target, pred), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)

# Area Under the Curve
auc(roc(testing$target, pred))

# Evaluation of the 4th model (log-transformed predictors)
Train <- createDataPartition(new_crime$target, p=0.7, list=FALSE)
training <- new_crime[Train, ]
testing <- new_crime[-Train, ]

pred2 <- round(predict(logitmod_log_transformed, newdata=testing, type="response"), 3)
pred21 <- ifelse(pred2 > 0.5, 1, 0)

#Confusion Matrix
ans2 <- confusionMatrix(data=factor(pred21, levels=c(0,1)), reference=factor(testing$target, levels=c(0,1)), positive='1')
ans2
ans2$byClass

# Evaluating the model with buckets and log transformed data
Train <- createDataPartition(full_df$target, p=0.7, list=FALSE)
training <- full_df[Train, ]
testing <- full_df[-Train, ]

pred3 <- round(predict(full_logitmod, newdata=testing, type="response"), 3)
pred31 <- ifelse(pred3 > 0.5, 1, 0)

#Confusion Matrix
ans3 <- confusionMatrix(data=factor(pred31, levels=c(0,1)), reference=factor(testing$target, levels=c(0,1)), positive='1')
ans3
ans3$byClass

plot(roc(testing$target, pred3), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)

# Area Under the Curve
auc(roc(testing$target, pred3))

# Stepwise approach backward with the above model
Train <- createDataPartition(full_df$target, p=0.7, list=FALSE)
training <- full_df[Train, ]
testing <- full_df[-Train, ]

pred4 <- round(predict(full_logitmod_back, newdata=testing, type="response"), 3)
pred41 <- ifelse(pred4 > 0.5, 1, 0)

#Confusion Matrix
ans4 <- confusionMatrix(data=factor(pred41, levels=c(0,1)), reference=factor(testing$target, levels=c(0,1)), positive='1')
ans4
ans4$byClass

plot(roc(testing$target, pred4), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)

# Area Under the Curve
auc(roc(testing$target, pred4))

# 6th model: PCA model
# Note: lmodpcr is specified with prcrime$x[,1] directly in its formula, so
# predict() cannot score newdata; instead, take its fitted probabilities for
# the held-out rows
Train <- createDataPartition(crime_train$target, p=0.7, list=FALSE)
testing <- crime_train[-Train, ]

pred5 <- round(predict(lmodpcr, type="response"), 3)[-Train]
pred51 <- ifelse(pred5 > 0.5, 1, 0)

#Confusion Matrix
ans5 <- confusionMatrix(data=factor(pred51, levels=c(0,1)), reference=factor(testing$target, levels=c(0,1)), positive='1')
ans5
ans5$byClass

# Making the prediction
test_ans <- round(predict(logitmod, newdata=crime_test, type="response"), 3)
test_ans <- ifelse(test_ans > 0.5, 1, 0)
test_ans
# You may un-comment this section if you would like to download a csv file of the predictions
# write.csv(test_ans, file="Prediction.csv")
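# A further hedged sketch (not in the original): write out both the predicted
# probabilities and the 0/1 classifications together; the file name below is an assumption
# test_prob <- round(predict(logitmod, newdata=crime_test, type="response"), 3)
# write.csv(data.frame(probability = test_prob, class = ifelse(test_prob > 0.5, 1, 0)),
#           file = "Prediction_with_Probabilities.csv", row.names = FALSE)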