In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or, variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
zn: proportion of residential land zoned for large lots (over 25,000 square feet)
indus: proportion of non-retail business acres per suburb
chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
1. Data Exploration (25 Points)
The data set is a collection of independent variables collected to help evaluate crime within Boston neighborhoods. According to Wikipedia, Boston is a city with a population of 665,258 people, a violent crime rate of 706.8 per 100,000 people, and a property crime rate of 2316.1 per 100,000 people. (Reference: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate). Within the Boston metropolitan area, different neighborhoods have differing levels of safety.
Ultimately, we would like to develop a model that predicts whether a neighborhood will be above or below the median crime rate for the area. Before we can do that, however, we should explore the data set given to us.
Below is the training dataset given to us with the different independent variables as listed above.
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 5 | 384 | 20.9 | 7.67 | 26.5 | 0 |
How many rows are there in this dataset?
## [1] 466
The best way to start looking at the data is to look at the basic summary statistics including mean, median, standard deviations, etc.
Summary Statistics:
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 |
indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 |
chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 |
nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 |
rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 |
age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 |
dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 |
rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 |
tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 |
ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 |
lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 |
medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 |
target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 |
Data visualization with boxplot distributions demonstrating the median, quartiles, etc.:
We will also take a look at the log scale as this will be used later for data transformation.
Histogram plots are great at demonstrating skewness or violations of normal distributions via visualizations:
Histogram:
As we can see, some of the independent variables appear to have a normal distribution, while others are skewed. The skewed variables will likely need to undergo transformation to be more useful for prediction.
Another assumption that we would like to check is whether the independent variables appear to have some form of a linear relationship with the response variable. The best way to demonstrate this is with a scatter plot:
Scatterplot (against target):
As you can see, though, a binary response variable may not be the most intuitive thing to visualize this way. We will leave this here for now and move on to see whether there are correlations among the independent variables and between the independent variables and the dependent variable.
Visualization of the Correlations:
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zn | 1.0000000 | -0.5382664 | -0.0401620 | -0.5170452 | 0.3198141 | -0.5725805 | 0.6601243 | -0.3154812 | -0.3192841 | -0.3910357 | -0.4329925 | 0.3767171 | -0.4316818 |
indus | -0.5382664 | 1.0000000 | 0.0611832 | 0.7596301 | -0.3927118 | 0.6395818 | -0.7036189 | 0.6006284 | 0.7322292 | 0.3946898 | 0.6071102 | -0.4961743 | 0.6048507 |
chas | -0.0401620 | 0.0611832 | 1.0000000 | 0.0974558 | 0.0905098 | 0.0788837 | -0.0965771 | -0.0159004 | -0.0467648 | -0.1286606 | -0.0514232 | 0.1615653 | 0.0800419 |
nox | -0.5170452 | 0.7596301 | 0.0974558 | 1.0000000 | -0.2954897 | 0.7351278 | -0.7688840 | 0.5958298 | 0.6538780 | 0.1762687 | 0.5962426 | -0.4301227 | 0.7261062 |
rm | 0.3198141 | -0.3927118 | 0.0905098 | -0.2954897 | 1.0000000 | -0.2328125 | 0.1990158 | -0.2084457 | -0.2969343 | -0.3603471 | -0.6320245 | 0.7053368 | -0.1525533 |
age | -0.5725805 | 0.6395818 | 0.0788837 | 0.7351278 | -0.2328125 | 1.0000000 | -0.7508976 | 0.4603143 | 0.5121245 | 0.2554479 | 0.6056200 | -0.3781560 | 0.6301062 |
dis | 0.6601243 | -0.7036189 | -0.0965771 | -0.7688840 | 0.1990158 | -0.7508976 | 1.0000000 | -0.4949919 | -0.5342546 | -0.2333394 | -0.5075280 | 0.2566948 | -0.6186731 |
rad | -0.3154812 | 0.6006284 | -0.0159004 | 0.5958298 | -0.2084457 | 0.4603143 | -0.4949919 | 1.0000000 | 0.9064632 | 0.4714516 | 0.5031013 | -0.3976683 | 0.6281049 |
tax | -0.3192841 | 0.7322292 | -0.0467648 | 0.6538780 | -0.2969343 | 0.5121245 | -0.5342546 | 0.9064632 | 1.0000000 | 0.4744223 | 0.5641886 | -0.4900329 | 0.6111133 |
ptratio | -0.3910357 | 0.3946898 | -0.1286606 | 0.1762687 | -0.3603471 | 0.2554479 | -0.2333394 | 0.4714516 | 0.4744223 | 1.0000000 | 0.3773560 | -0.5159153 | 0.2508489 |
lstat | -0.4329925 | 0.6071102 | -0.0514232 | 0.5962426 | -0.6320245 | 0.6056200 | -0.5075280 | 0.5031013 | 0.5641886 | 0.3773560 | 1.0000000 | -0.7358008 | 0.4691270 |
medv | 0.3767171 | -0.4961743 | 0.1615653 | -0.4301227 | 0.7053368 | -0.3781560 | 0.2566948 | -0.3976683 | -0.4900329 | -0.5159153 | -0.7358008 | 1.0000000 | -0.2705507 |
target | -0.4316818 | 0.6048507 | 0.0800419 | 0.7261062 | -0.1525533 | 0.6301062 | -0.6186731 | 0.6281049 | 0.6111133 | 0.2508489 | 0.4691270 | -0.2705507 | 1.0000000 |
We can certainly see that different variables relate to each other differently, and some correlate more strongly than others. If we look at the target column, we can see how the independent variables correlate with the response variable.
Let’s take a closer look at the target response variable.
What are the positive and negative correlations with the target variable?
Of note, I have arbitrarily chosen any correlation value greater than 0.2 to be a positive correlation, any value less than -0.2 to be a negative correlation, and any value between -0.2 and 0.2 to be neutral.
## [1] "Correlation to Target"
## target
## zn -0.43168176
## indus 0.60485074
## chas 0.08004187
## nox 0.72610622
## rm -0.15255334
## age 0.63010625
## dis -0.61867312
## rad 0.62810492
## tax 0.61111331
## ptratio 0.25084892
## lstat 0.46912702
## medv -0.27055071
## target 1.00000000
## [1] "Positive Correlative Factors:"
## [1] "indus" "nox" "age" "rad" "tax" "ptratio" "lstat"
## [8] "target"
## [1] "Negative Correlative Factors:"
## [1] "zn" "dis" "medv"
## [1] "Neutral Correlative Factors"
## [1] "chas" "rm"
There are a lot of interesting correlations within this data set. The underlying theme seems to be that the higher the socioeconomic status of a region, the less likely it is to have a high crime rate.
For instance, industrial areas tend to be less desirable to the wealthy, given concerns about chemical exposure, noise, etc. In this data set, industrial areas correlate with higher nitrogen oxides concentrations. Homes surrounded by many radial highways also tend to be in areas that are more urban and likely inner city. Also of note, poorer areas tend to have less funding for public institutions; a higher student-to-teacher ratio, for instance, could be a proxy for an underfunded school. In this data set, a higher pupil-teacher ratio correlates positively with crime.
Not surprisingly, the more expensive the homes, the less likely the area is to have high crime. However, it is unclear at this point why age and tax have positive correlations. We will need to keep this in the back of our minds as we create our models.
On a side note, living near the Charles River and the average number of rooms per dwelling both appear to have neutral correlations with the target.
Let’s take a look at the variables most highly correlated with target. I have arbitrarily set greater than 0.7 or less than -0.7 as highly correlated (whether positive or negative).
## [1] "Highly Positive Correlated Variables"
## target
## nox 0.7261062
## [1] "Highly Negative Correlated Variables"
## [1] target
## <0 rows> (or 0-length row.names)
It appears that the nitrogen oxides concentration is the most highly correlated variable. This makes sense given the discussion above.
Before we move onto Data Preparation, we should ensure that there are no missing values that would require imputation.
Are there any missing values?
## [1] 0
2. Data Preparation (25 Points)
In preparing the data, our goal is to create or transform independent variables that could potentially improve our models. We may try to divide some of the variables into buckets or perform transformations (e.g., logarithmic) to improve predictive ability. While we may not use all of these, they are worth considering and looking into.
Let’s start by taking some of the independent variables and dividing them into buckets.
According to https://nces.ed.gov/programs/coe/indicator_clr.asp, we can take a look at typical classroom sizes. As discussed before, we assume that poorer and/or lower-socioeconomic neighborhoods tend to have higher crime. Of course, the student-to-teacher ratio in itself is not what is responsible for the crime rate in a neighborhood; the underlying notion is that an underfunded school (and hence, in theory, a higher student-to-teacher ratio) implies that the neighborhood may simply have fewer funds (i.e., may be poorer). While we may lose some information by taking a continuous variable and converting it into a categorical variable (such as a bucket), it is still worthwhile to look into it and see whether we ultimately use this variable at the very end.
Let’s start to bucket the student to teacher ratio.
##
## Small Medium Large
## 0 35 128 74
## 1 40 29 160
## [1] "Percentage where Crime is above the Median (1)"
## stud_teach_ratio freq
## 1 Small 75
## 2 Medium 157
## 3 Large 234
This information isn’t exactly revealing, but we will investigate it further in the model-building section.
The next independent variable we can evaluate is the size of the home. Though we saw above that there may be no correlation, it may still be worthwhile to create buckets of home sizes and incorporate them into our model building.
Like the student-to-teacher ratio, home size could in theory serve as a proxy for wealth: the larger the home, the wealthier the owner is likely to be.
Small Home: for example, a one-bedroom or studio apartment. On average, these would consist of a living room, dining room, kitchen, bathroom, and bedroom (or no separate bedroom for a studio). We will consider a small home to have fewer than 6 rooms.
Medium Home: between 6 and 7 rooms.
Large Home: more than 7 rooms.
Again, while these are not precise estimates, we will create the buckets regardless.
##
## Small Medium Large
## 0 4 207 26
## 1 37 167 25
## [1] "Percentage where Crime is above the Median (1)"
While we may ultimately not need the buckets, we will still take a look at them when we are evaluating the models.
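As a side note, the appendix code builds these buckets with three equal-width cut() breaks rather than the exact room counts described above. A minimal sketch of how the stated thresholds could be encoded explicitly (home_size_explicit is a hypothetical variable name, not used in the models below) is:
# Bucket dwelling size by the room-count thresholds described in the text
# (hypothetical alternative to the equal-width cut() used in the appendix)
home_size_explicit <- cut(crime_train$rm,
                          breaks = c(-Inf, 6, 7, Inf),
                          labels = c("Small", "Medium", "Large"))
table(crime_train$target, home_size_explicit)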
Transforming the predictor variables:
Let’s look at zn (the proportion of residential land zoned for large lots). This variable may benefit from a logarithmic transformation.
hist(log(crime_train$zn), breaks=20, main="Percentage of Logarithmic Land Noted as 'Residential'", xlab="Zoning Size", col = "blue")
Though not perfect, this is better.
Let’s take the logarithmic transformation concept and see if we can apply it to the rest of the independent variables. Several of the predictors are skewed and would likely benefit from a logarithmic transformation. Given the numerous zeros in zn, and the fact that chas is binary, we will not include these two in the logarithmic transformations.
Let’s create a histogram and scatterplot with these new variables for the logarithmic transformations.
Lastly, let’s create some interaction terms.
Let’s see which independent variables have the highest correlation amongst each other (apart from its correlation with the target variable).
## [1] "Maximum Positive Correlations amongst independent variables: "
## zn indus chas nox rm age dis
## 0.6601243 0.7596301 0.1615653 0.7596301 0.7053368 0.7351278 0.6601243
## rad tax ptratio lstat medv
## 0.9064632 0.9064632 0.4744223 0.6071102 0.7053368
## [1] "Maximum Negative Correlations amongst independent variables: "
## zn indus chas nox rm age
## -0.5725805 -0.7036189 -0.1286606 -0.7688840 -0.6320245 -0.7508976
## dis rad tax ptratio lstat medv
## -0.7688840 -0.4949919 -0.5342546 -0.5159153 -0.7358008 -0.7358008
Next, find each variable’s most strongly correlated counterpart.
var1 | var2 | max_cor | |
---|---|---|---|
zn | zn | dis | 0.660124338286325 |
indus | indus | nox | 0.759630083272316 |
chas | chas | medv | 0.16156528065359 |
nox | nox | indus | 0.759630083272316 |
rm | rm | medv | 0.705336793853476 |
age | age | nox | 0.735127819507785 |
dis | dis | zn | 0.660124338286325 |
rad | rad | tax | 0.906463228913755 |
tax | tax | rad | 0.906463228913755 |
ptratio | ptratio | tax | 0.474422290254675 |
lstat | lstat | indus | 0.607110229872569 |
medv | medv | rm | 0.705336793853476 |
It looks like the largest correlation between two independent variables is between tax and rad. We can build an interaction term for these two when we build our models, as they are likely highly dependent on each other, with one affecting the other.
3. Build Models (25 Points)
In this section, we will build several models to see whether a certain set (or subset) of independent variables can adequately predict whether an area will be above or below the median crime rate.
The first model that we will entertain is the binary logistic model utilizing all of the predictor variables (untransformed). Let’s take a look at the first model:
##
## Call:
## glm(formula = target ~ ., family = binomial, data = crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
The residual deviance is 192.05 and the AIC is 218.05. We will consider this the baseline for all models and create other models that could improve the fit while minimizing complexity.
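As an additional check on fit, we could also compute McFadden’s pseudo-R² for this baseline model. Below is a minimal sketch using the pscl package (already loaded in the appendix) and the appendix’s logitmod object:
# McFadden's pseudo-R-squared for the baseline model (model 1)
# 'logitmod' is the appendix's glm(target ~ ., family = binomial, data = crime_train)
library(pscl)
pR2(logitmod)["McFadden"]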
The second model is essentially the first model. It includes all of the independent variables, but in this case it also includes the interaction term rad:tax.
##
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8793 -0.1343 -0.0012 0.0379 3.5325
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.194e+01 6.775e+00 -6.191 5.97e-10 ***
## zn -7.049e-02 3.475e-02 -2.029 0.042494 *
## indus -5.376e-02 4.856e-02 -1.107 0.268288
## chas 7.575e-01 7.786e-01 0.973 0.330605
## nox 4.872e+01 7.922e+00 6.150 7.74e-10 ***
## rm -6.517e-01 7.343e-01 -0.888 0.374762
## age 3.501e-02 1.395e-02 2.511 0.012050 *
## dis 7.168e-01 2.305e-01 3.110 0.001871 **
## rad 1.035e+00 3.088e-01 3.351 0.000806 ***
## tax -2.626e-03 3.238e-03 -0.811 0.417249
## ptratio 4.096e-01 1.271e-01 3.223 0.001270 **
## lstat 4.812e-02 5.451e-02 0.883 0.377391
## medv 1.868e-01 6.960e-02 2.684 0.007281 **
## rad:tax -9.755e-04 5.384e-04 -1.812 0.069988 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 190.24 on 452 degrees of freedom
## AIC: 218.24
##
## Number of Fisher Scoring iterations: 9
As we can see, the complexity of the model (AIC) remained essentially the same at 218.24, with a modest improvement in the Goodness of Fit (Residual deviance 190.24).
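Since the interaction model nests the baseline model, one way to formally compare the two is a likelihood ratio test. Below is a minimal sketch using the lmtest package (already loaded in the appendix) and the appendix’s logitmod and logitmod_int objects:
# Likelihood ratio test: baseline model vs. baseline + rad:tax interaction
# 'logitmod' and 'logitmod_int' are models 1 and 2 from the appendix
library(lmtest)
lrtest(logitmod, logitmod_int)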
With the third model, we will take a completely different approach by using all of the logarithmically transformed independent variables (from Section 2). We hope that normalizing the distributions of the independent variables will improve the model’s performance. While some of the variables, as seen in Section 2, certainly became more normal, a good number did not, so we should keep this in the back of our minds as we create our log-transformed predictor model.
##
## Call:
## glm(formula = target ~ ., family = binomial, data = crime_train_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90738 -0.16285 -0.00161 0.08746 3.13907
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.96595 11.72606 -1.362 0.17333
## zn -0.03216 0.02687 -1.197 0.23145
## chas 0.86893 0.73997 1.174 0.24028
## log_indus 0.24931 0.52151 0.478 0.63262
## log_nox 24.13637 3.91814 6.160 7.27e-10 ***
## log_rm 2.11032 3.70451 0.570 0.56891
## log_age 0.85132 0.63686 1.337 0.18131
## log_dis 2.69643 0.84395 3.195 0.00140 **
## log_rad 2.98142 0.68476 4.354 1.34e-05 ***
## log_tax -1.51088 1.12043 -1.348 0.17750
## log_ptratio 5.52739 2.12059 2.607 0.00915 **
## log_lstat 0.17708 0.66971 0.264 0.79146
## log_medv 2.32688 1.48654 1.565 0.11751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 206.10 on 453 degrees of freedom
## AIC: 232.1
##
## Number of Fisher Scoring iterations: 8
It appears that the logarithmic model may not have been the most optimal choice. The residual deviance and AIC have worsened, though not terribly so. We will hold onto this model, though I suspect it may ultimately not be what we want.
Before we continue creating additional models, we should look at the models built so far from a different angle. We will take a stepwise approach, building each model iteratively either “forward” or “backward”, and see whether this could potentially improve the models.
Model Number 1 with “Backwards” Direction
## Start: AIC=218.05
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv
##
## Df Deviance AIC
## - rm 1 192.71 216.71
## - lstat 1 192.77 216.77
## - chas 1 193.53 217.53
## - indus 1 193.99 217.99
## <none> 192.05 218.05
## - tax 1 196.59 220.59
## - zn 1 196.89 220.89
## - age 1 198.73 222.73
## - medv 1 199.95 223.95
## - ptratio 1 203.32 227.32
## - dis 1 203.84 227.84
## - rad 1 233.74 257.74
## - nox 1 265.05 289.05
##
## Step: AIC=216.71
## target ~ zn + indus + chas + nox + age + dis + rad + tax + ptratio +
## lstat + medv
##
## Df Deviance AIC
## - chas 1 194.24 216.24
## - lstat 1 194.32 216.32
## - indus 1 194.58 216.58
## <none> 192.71 216.71
## - tax 1 197.59 219.59
## - zn 1 198.07 220.07
## - age 1 199.11 221.11
## - ptratio 1 203.53 225.53
## - dis 1 203.85 225.85
## - medv 1 205.35 227.35
## - rad 1 233.81 255.81
## - nox 1 265.14 287.14
##
## Step: AIC=216.24
## target ~ zn + indus + nox + age + dis + rad + tax + ptratio +
## lstat + medv
##
## Df Deviance AIC
## - indus 1 195.51 215.51
## <none> 194.24 216.24
## - lstat 1 196.33 216.33
## - zn 1 200.59 220.59
## - tax 1 200.75 220.75
## - age 1 201.00 221.00
## - ptratio 1 203.94 223.94
## - dis 1 204.83 224.83
## - medv 1 207.12 227.12
## - rad 1 241.41 261.41
## - nox 1 265.19 285.19
##
## Step: AIC=215.51
## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat +
## medv
##
## Df Deviance AIC
## - lstat 1 197.32 215.32
## <none> 195.51 215.51
## - zn 1 202.05 220.05
## - age 1 202.23 220.23
## - ptratio 1 205.01 223.01
## - dis 1 205.96 223.96
## - tax 1 206.60 224.60
## - medv 1 208.13 226.13
## - rad 1 249.55 267.55
## - nox 1 270.59 288.59
##
## Step: AIC=215.32
## target ~ zn + nox + age + dis + rad + tax + ptratio + medv
##
## Df Deviance AIC
## <none> 197.32 215.32
## - zn 1 203.45 219.45
## - ptratio 1 206.27 222.27
## - age 1 207.13 223.13
## - tax 1 207.62 223.62
## - dis 1 207.64 223.64
## - medv 1 208.65 224.65
## - rad 1 250.98 266.98
## - nox 1 273.18 289.18
##
## Call: glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial, data = crime_train)
##
## Coefficients:
## (Intercept) zn nox age dis
## -37.415922 -0.068648 42.807768 0.032950 0.654896
## rad tax ptratio medv
## 0.725109 -0.007756 0.323628 0.110472
##
## Degrees of Freedom: 465 Total (i.e. Null); 457 Residual
## Null Deviance: 645.9
## Residual Deviance: 197.3 AIC: 215.3
Model Number 3 (Logarithmic Model) with “Backwards” direction.
## Start: AIC=232.1
## target ~ zn + chas + log_indus + log_nox + log_rm + log_age +
## log_dis + log_rad + log_tax + log_ptratio + log_lstat + log_medv
##
## Df Deviance AIC
## - log_lstat 1 206.18 230.18
## - log_indus 1 206.33 230.33
## - log_rm 1 206.43 230.43
## - chas 1 207.52 231.52
## - zn 1 207.78 231.78
## - log_tax 1 207.90 231.90
## - log_age 1 208.05 232.05
## <none> 206.10 232.10
## - log_medv 1 208.66 232.66
## - log_ptratio 1 213.37 237.37
## - log_dis 1 217.09 241.09
## - log_rad 1 243.10 267.10
## - log_nox 1 270.57 294.57
##
## Step: AIC=230.17
## target ~ zn + chas + log_indus + log_nox + log_rm + log_age +
## log_dis + log_rad + log_tax + log_ptratio + log_medv
##
## Df Deviance AIC
## - log_indus 1 206.42 228.42
## - log_rm 1 206.43 228.43
## - chas 1 207.70 229.70
## - zn 1 207.79 229.79
## - log_tax 1 208.00 230.00
## <none> 206.18 230.18
## - log_age 1 208.55 230.55
## - log_medv 1 208.75 230.75
## - log_ptratio 1 213.43 235.43
## - log_dis 1 217.09 239.09
## - log_rad 1 243.72 265.72
## - log_nox 1 270.66 292.66
##
## Step: AIC=228.42
## target ~ zn + chas + log_nox + log_rm + log_age + log_dis + log_rad +
## log_tax + log_ptratio + log_medv
##
## Df Deviance AIC
## - log_rm 1 206.64 226.64
## - log_tax 1 208.02 228.02
## - zn 1 208.35 228.35
## - chas 1 208.40 228.40
## <none> 206.42 228.42
## - log_age 1 208.73 228.73
## - log_medv 1 208.99 228.99
## - log_ptratio 1 213.97 233.97
## - log_dis 1 217.39 237.39
## - log_rad 1 244.37 264.37
## - log_nox 1 280.85 300.85
##
## Step: AIC=226.64
## target ~ zn + chas + log_nox + log_age + log_dis + log_rad +
## log_tax + log_ptratio + log_medv
##
## Df Deviance AIC
## - log_tax 1 208.16 226.16
## - zn 1 208.40 226.40
## - chas 1 208.52 226.52
## <none> 206.64 226.64
## - log_age 1 209.88 227.88
## - log_ptratio 1 215.35 233.35
## - log_medv 1 216.50 234.50
## - log_dis 1 218.30 236.30
## - log_rad 1 245.33 263.33
## - log_nox 1 282.10 300.10
##
## Step: AIC=226.16
## target ~ zn + chas + log_nox + log_age + log_dis + log_rad +
## log_ptratio + log_medv
##
## Df Deviance AIC
## <none> 208.16 226.16
## - zn 1 210.21 226.21
## - chas 1 210.51 226.51
## - log_age 1 211.56 227.56
## - log_ptratio 1 216.44 232.44
## - log_medv 1 224.12 240.12
## - log_dis 1 225.23 241.23
## - log_rad 1 262.88 278.88
## - log_nox 1 282.96 298.96
##
## Call: glm(formula = target ~ zn + chas + log_nox + log_age + log_dis +
## log_rad + log_ptratio + log_medv, family = binomial, data = crime_train_log)
##
## Coefficients:
## (Intercept) zn chas log_nox log_age
## -22.7058 -0.0321 1.0496 24.4943 0.9951
## log_dis log_rad log_ptratio log_medv
## 3.0797 2.4928 5.6817 3.0752
##
## Degrees of Freedom: 465 Total (i.e. Null); 457 Residual
## Null Deviance: 645.9
## Residual Deviance: 208.2 AIC: 226.2
Model Number 1 with “Forward” direction.
## Start: AIC=647.88
## target ~ 1
##
## Df Deviance AIC
## + nox 1 292.01 296.01
## + rad 1 404.16 408.16
## + dis 1 409.50 413.50
## + age 1 424.75 428.75
## + tax 1 442.38 446.38
## + indus 1 453.23 457.23
## + zn 1 518.46 522.46
## + lstat 1 528.01 532.01
## + medv 1 609.62 613.62
## + ptratio 1 615.64 619.64
## + rm 1 634.82 638.82
## + chas 1 642.86 646.86
## <none> 645.88 647.88
##
## Step: AIC=296.01
## target ~ nox
##
## Df Deviance AIC
## + rad 1 239.51 245.51
## + rm 1 284.63 290.63
## + medv 1 285.86 291.86
## + indus 1 288.11 294.11
## + zn 1 288.29 294.29
## + tax 1 288.40 294.40
## + chas 1 288.47 294.47
## <none> 292.01 296.01
## + ptratio 1 290.13 296.13
## + age 1 290.62 296.62
## + dis 1 290.91 296.91
## + lstat 1 291.93 297.93
##
## Step: AIC=245.51
## target ~ nox + rad
##
## Df Deviance AIC
## + tax 1 224.47 232.47
## + indus 1 233.09 241.09
## + zn 1 235.19 243.19
## + rm 1 236.60 244.60
## + age 1 236.76 244.76
## + medv 1 236.86 244.86
## + ptratio 1 237.33 245.33
## <none> 239.51 245.51
## + chas 1 237.64 245.64
## + dis 1 237.96 245.96
## + lstat 1 239.47 247.47
##
## Step: AIC=232.47
## target ~ nox + rad + tax
##
## Df Deviance AIC
## + ptratio 1 218.70 228.70
## + zn 1 219.94 229.94
## + age 1 220.44 230.44
## <none> 224.47 232.47
## + dis 1 223.30 233.30
## + indus 1 223.40 233.40
## + chas 1 223.63 233.63
## + lstat 1 223.71 233.71
## + rm 1 223.75 233.75
## + medv 1 224.27 234.27
##
## Step: AIC=228.7
## target ~ nox + rad + tax + ptratio
##
## Df Deviance AIC
## + age 1 214.46 226.46
## + medv 1 215.23 227.23
## + rm 1 216.12 228.12
## + zn 1 216.32 228.32
## <none> 218.70 228.70
## + chas 1 216.81 228.81
## + dis 1 217.79 229.79
## + indus 1 217.82 229.82
## + lstat 1 218.57 230.57
##
## Step: AIC=226.46
## target ~ nox + rad + tax + ptratio + age
##
## Df Deviance AIC
## + medv 1 209.55 223.55
## + rm 1 212.31 226.31
## + dis 1 212.40 226.40
## <none> 214.46 226.46
## + zn 1 212.67 226.67
## + chas 1 213.24 227.24
## + indus 1 213.38 227.38
## + lstat 1 214.35 228.35
##
## Step: AIC=223.55
## target ~ nox + rad + tax + ptratio + age + medv
##
## Df Deviance AIC
## + dis 1 203.45 219.45
## <none> 209.55 223.55
## + zn 1 207.64 223.64
## + lstat 1 208.07 224.07
## + chas 1 208.33 224.33
## + indus 1 208.58 224.58
## + rm 1 208.79 224.79
##
## Step: AIC=219.45
## target ~ nox + rad + tax + ptratio + age + medv + dis
##
## Df Deviance AIC
## + zn 1 197.32 215.32
## + chas 1 201.29 219.29
## + rm 1 201.35 219.35
## <none> 203.45 219.45
## + lstat 1 202.05 220.05
## + indus 1 202.23 220.23
##
## Step: AIC=215.32
## target ~ nox + rad + tax + ptratio + age + medv + dis + zn
##
## Df Deviance AIC
## <none> 197.32 215.32
## + lstat 1 195.51 215.51
## + rm 1 195.75 215.75
## + chas 1 195.97 215.97
## + indus 1 196.33 216.33
##
## Call: glm(formula = target ~ nox + rad + tax + ptratio + age + medv +
## dis + zn, family = "binomial")
##
## Coefficients:
## (Intercept) nox rad tax ptratio
## -37.415922 42.807768 0.725109 -0.007756 0.323628
## age medv dis zn
## 0.032950 0.110472 0.654896 -0.068648
##
## Degrees of Freedom: 465 Total (i.e. Null); 457 Residual
## Null Deviance: 645.9
## Residual Deviance: 197.3 AIC: 215.3
As we can see from the residual deviance and AIC, the stepwise approach yields only modest improvements: forward and backward selection on model 1 converge to the same reduced model (AIC 215.3 versus 218.05 for the full model), and backward selection on the logarithmic model reduces its AIC from 232.1 to 226.2.
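For completeness, stepwise selection can also be run in both directions at once. Below is a minimal sketch using MASS::stepAIC (MASS is already loaded in the appendix) on the appendix’s logitmod object:
# Stepwise selection in both directions, starting from the full baseline model
library(MASS)
step_both <- stepAIC(logitmod, direction = "both", trace = FALSE)
summary(step_both)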
For model number 4, let’s include a few transformed variables and interaction terms. After looking into the independent variables, the variables that would most likely benefit from a log transformation are lstat, dis, and medv; we will also keep the rad:tax interaction term.
Let’s add the log transformations of these predictors to our data set and remove the original lstat, dis, and medv variables.
zn | indus | chas | nox | rm | age | rad | tax | ptratio | target | log_stat | log_dis | log_medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 5 | 403 | 14.7 | 1 | 1.308333 | 0.7158378 | 3.912023 |
0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 5 | 403 | 14.7 | 1 | 3.289148 | 0.2788431 | 2.595255 |
0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 24 | 666 | 20.2 | 1 | 2.936513 | 0.6822884 | 2.734367 |
30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 6 | 300 | 16.6 | 0 | 1.646734 | 1.9509688 | 3.165475 |
0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 3 | 193 | 17.8 | 0 | 1.572774 | 0.9934740 | 3.634951 |
0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 5 | 384 | 20.9 | 0 | 2.037317 | 1.0494571 | 3.277145 |
Now that we have our partially transformed dataset, let us build model number 4 (the partially log transformed model).
##
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = new_crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8899 -0.1546 -0.0026 0.0417 3.4155
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.248e+01 1.020e+01 -5.147 2.65e-07 ***
## zn -4.868e-02 3.011e-02 -1.617 0.105983
## indus -3.284e-02 4.863e-02 -0.675 0.499477
## chas 8.426e-01 7.919e-01 1.064 0.287341
## nox 5.033e+01 7.993e+00 6.297 3.04e-10 ***
## rm -1.339e-01 6.414e-01 -0.209 0.834600
## age 3.696e-02 1.355e-02 2.728 0.006365 **
## rad 9.322e-01 3.027e-01 3.080 0.002073 **
## tax -1.934e-03 3.341e-03 -0.579 0.562581
## ptratio 3.829e-01 1.281e-01 2.990 0.002793 **
## log_stat 3.411e-02 6.891e-01 0.050 0.960521
## log_dis 3.192e+00 9.245e-01 3.453 0.000555 ***
## log_medv 3.266e+00 1.717e+00 1.902 0.057225 .
## rad:tax -8.294e-04 5.492e-04 -1.510 0.130993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 191.69 on 452 degrees of freedom
## AIC: 219.69
##
## Number of Fisher Scoring iterations: 9
Model number 4 appears comparable to model number 1.
Let’s look at the bucketed variables and see if we can improve the AIC and/or residual deviance for model number 5 (the bucket model).
##
## Call:
## glm(formula = target ~ . + rad:tax, family = binomial, data = full_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9330 -0.1211 -0.0005 0.0279 3.7792
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.058e+01 9.658e+00 -4.202 2.64e-05 ***
## zn -8.588e-02 4.002e-02 -2.146 0.031870 *
## indus -3.225e-02 5.499e-02 -0.586 0.557613
## chas 5.846e-03 8.440e-01 0.007 0.994473
## nox 5.535e+01 1.022e+01 5.413 6.19e-08 ***
## age 4.131e-02 1.325e-02 3.118 0.001818 **
## rad 1.075e+00 3.592e-01 2.991 0.002777 **
## tax -4.619e-03 3.628e-03 -1.273 0.202988
## log_stat -3.493e-01 6.623e-01 -0.527 0.597971
## log_dis 3.805e+00 1.060e+00 3.589 0.000332 ***
## log_medv 1.478e+00 1.426e+00 1.037 0.299858
## stud_teach_ratioMedium 7.783e-01 8.548e-01 0.910 0.362582
## stud_teach_ratioLarge 1.192e+00 7.966e-01 1.497 0.134494
## home_sizeMedium -3.651e+00 1.071e+00 -3.409 0.000652 ***
## home_sizeLarge -1.876e+00 1.638e+00 -1.146 0.251982
## rad:tax -9.403e-04 6.343e-04 -1.482 0.138249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 179.58 on 450 degrees of freedom
## AIC: 211.58
##
## Number of Fisher Scoring iterations: 10
The bucket model has improved the residual deviance quite a bit (179.58) as well as the AIC (211.58). Let’s take a stepwise (“backward”) approach and see if we can improve on this further.
## Start: AIC=211.58
## target ~ zn + indus + chas + nox + age + rad + tax + log_stat +
## log_dis + log_medv + stud_teach_ratio + home_size + rad:tax
##
## Df Deviance AIC
## - chas 1 179.58 209.58
## - log_stat 1 179.85 209.85
## - indus 1 179.93 209.93
## - stud_teach_ratio 2 182.03 210.03
## - log_medv 1 180.67 210.67
## - rad:tax 1 180.76 210.76
## <none> 179.58 211.58
## - zn 1 185.49 215.49
## - age 1 190.50 220.50
## - log_dis 1 194.29 224.29
## - home_size 2 200.39 228.39
## - nox 1 228.46 258.46
##
## Step: AIC=209.58
## target ~ zn + indus + nox + age + rad + tax + log_stat + log_dis +
## log_medv + stud_teach_ratio + home_size + rad:tax
##
## Df Deviance AIC
## - log_stat 1 179.85 207.85
## - indus 1 179.94 207.94
## - stud_teach_ratio 2 182.05 208.05
## - log_medv 1 180.70 208.70
## - rad:tax 1 180.80 208.80
## <none> 179.58 209.58
## - zn 1 185.65 213.65
## - age 1 190.98 218.98
## - log_dis 1 194.30 222.30
## - home_size 2 200.89 226.89
## - nox 1 228.61 256.61
##
## Step: AIC=207.86
## target ~ zn + indus + nox + age + rad + tax + log_dis + log_medv +
## stud_teach_ratio + home_size + rad:tax
##
## Df Deviance AIC
## - indus 1 180.28 206.28
## - stud_teach_ratio 2 182.44 206.44
## - rad:tax 1 181.13 207.13
## <none> 179.85 207.85
## - log_medv 1 182.08 208.08
## - zn 1 186.10 212.10
## - age 1 191.06 217.06
## - log_dis 1 194.84 220.84
## - home_size 2 200.89 224.89
## - nox 1 228.77 254.77
##
## Step: AIC=206.28
## target ~ zn + nox + age + rad + tax + log_dis + log_medv + stud_teach_ratio +
## home_size + rad:tax
##
## Df Deviance AIC
## - stud_teach_ratio 2 182.62 204.62
## - rad:tax 1 181.80 205.80
## <none> 180.28 206.28
## - log_medv 1 182.29 206.29
## - zn 1 187.58 211.58
## - age 1 191.11 215.11
## - log_dis 1 196.00 220.00
## - home_size 2 201.05 223.05
## - nox 1 231.42 255.42
##
## Step: AIC=204.62
## target ~ zn + nox + age + rad + tax + log_dis + log_medv + home_size +
## rad:tax
##
## Df Deviance AIC
## - log_medv 1 183.15 203.15
## - rad:tax 1 184.22 204.22
## <none> 182.62 204.62
## - age 1 191.16 211.16
## - zn 1 193.65 213.65
## - log_dis 1 198.44 218.44
## - home_size 2 202.55 220.55
## - nox 1 256.34 276.34
##
## Step: AIC=203.15
## target ~ zn + nox + age + rad + tax + log_dis + home_size + rad:tax
##
## Df Deviance AIC
## - rad:tax 1 184.83 202.83
## <none> 183.15 203.15
## - age 1 191.23 209.23
## - zn 1 194.10 212.10
## - log_dis 1 199.12 217.12
## - home_size 2 205.54 221.54
## - nox 1 257.26 275.26
##
## Step: AIC=202.83
## target ~ zn + nox + age + rad + tax + log_dis + home_size
##
## Df Deviance AIC
## <none> 184.83 202.83
## - age 1 192.68 208.68
## - zn 1 194.90 210.90
## - tax 1 198.15 214.15
## - log_dis 1 201.91 217.91
## - home_size 2 208.05 222.05
## - rad 1 245.78 261.78
## - nox 1 259.34 275.34
##
## Call: glm(formula = target ~ zn + nox + age + rad + tax + log_dis +
## home_size, family = binomial, data = full_df)
##
## Coefficients:
## (Intercept) zn nox age
## -30.294374 -0.098534 48.348341 0.027545
## rad tax log_dis home_sizeMedium
## 0.780517 -0.009425 3.309187 -2.842727
## home_sizeLarge
## -0.246644
##
## Degrees of Freedom: 465 Total (i.e. Null); 457 Residual
## Null Deviance: 645.9
## Residual Deviance: 184.8 AIC: 202.8
So the stepwise approach on this bucketed data set has a slightly higher residual deviance but an improved AIC. Let’s call this stepwise bucket model model number 6.
One of the features we noticed about this data set is that the independent variables have considerable correlations with each other. That being said, we can try Principal Components Analysis to develop our next model (the PCA model).
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 169.0899 28.76861 16.35008 8.60415 4.25551 3.76253
## Proportion of Variance 0.9592 0.02777 0.00897 0.00248 0.00061 0.00047
## Cumulative Proportion 0.9592 0.98700 0.99597 0.99846 0.99906 0.99954
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 3.08576 1.6837 1.06770 0.45869 0.2465 0.05494
## Proportion of Variance 0.00032 0.0001 0.00004 0.00001 0.0000 0.00000
## Cumulative Proportion 0.99986 1.0000 0.99999 1.00000 1.0000 1.00000
It appears that the first principal component explains about 95.9% of the variance, so let us use just the first principal component for our model.
Let’s see what the first PC is composed of:
## zn indus chas nox rm age dis rad tax
## 0.05 -0.03 0.00 0.00 0.00 -0.09 0.01 -0.05 -0.99
## ptratio lstat medv
## -0.01 -0.02 0.03
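The first component is dominated by tax (loading of -0.99), largely because the predictors are on very different scales and the PCA here is run on unscaled data. A minimal sketch of a standardized alternative (prcrime_scaled is a hypothetical object, not used in the models below) would be:
# PCA on standardized predictors; the appendix's prcomp() call uses unscaled data
prcrime_scaled <- prcomp(crime_train[, -13], center = TRUE, scale. = TRUE)
summary(prcrime_scaled)  # compare the variance explained with the unscaled fit above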
##
## Call:
## glm(formula = crime_train$target ~ prcrime$x[, 1])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.04633 -0.27195 -0.09360 0.04062 0.85272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4914163 0.0182520 26.92 <2e-16 ***
## prcrime$x[, 1] -0.0018282 0.0001081 -16.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1552412)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 72.032 on 464 degrees of freedom
## AIC: 458.39
##
## Number of Fisher Scoring iterations: 2
The residual deviance appears much lower, though at the expense of a more complex model (AIC: 458.39). Note, however, that this PCA regression was fit with glm’s default gaussian family (see the appendix code), so its deviance and AIC are not on the same scale as the binomial models above.
Let’s try the same approach with the modified data set (the one with some of the independent variables logarithmically transformed).
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 168.9765 28.54780 16.30897 4.31628 3.19509 1.83312
## Proportion of Variance 0.9625 0.02747 0.00897 0.00063 0.00034 0.00011
## Cumulative Proportion 0.9625 0.98992 0.99889 0.99952 0.99986 0.99997
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.71502 0.3197 0.2652 0.2465 0.1855 0.05275
## Proportion of Variance 0.00002 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion 0.99999 1.0000 1.0000 1.0000 1.0000 1.00000
##
## Call:
## glm(formula = new_crime$target ~ prcrime1$x[, 1])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.04563 -0.27182 -0.09292 0.03988 0.85321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4914163 0.0182537 26.92 <2e-16 ***
## prcrime1$x[, 1] -0.0018291 0.0001081 -16.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1552707)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 72.046 on 464 degrees of freedom
## AIC: 458.48
##
## Number of Fisher Scoring iterations: 2
This model does not appear to be that much better than the PCA model with the original dataset.
Let’s move onto selecting the appropriate (or best fit) model for prediction.
4. Select Models (25 Points)
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
Now that we have several models at our disposal, we should evaluate their ability to predict. For these binary logistic regression models, the metrics we will use are accuracy, sensitivity, specificity, precision, the F1 score, and the area under the ROC curve (AUC).
After looking at the residual deviance scores and AIC scores in the previous section, we’ll take the most promising ones and evaluate them here.
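Before diving into the confusion matrices, a compact way to recap the in-sample comparison is to tabulate the AICs of the candidate models side by side. Below is a minimal sketch using the model objects from the appendix:
# Side-by-side AIC comparison of the candidate models (appendix object names)
AIC(logitmod,                  # model 1: baseline
    logitmod_int,              # model 2: baseline + rad:tax
    logitmod_log,              # model 3: all-log model
    logitmod_log_transformed,  # model 4: partial log transforms
    full_logitmod)             # model 5: bucket model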
Let us evaluate model number 1 (the baseline model). To evaluate it, we partition the training data set into 70% for training and 30% for evaluation. From there, we build a confusion matrix and compute our evaluation metrics.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 64 8
## 1 6 61
##
## Accuracy : 0.8993
## 95% CI : (0.8368, 0.9438)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7985
## Mcnemar's Test P-Value : 0.7893
##
## Sensitivity : 0.8841
## Specificity : 0.9143
## Pos Pred Value : 0.9104
## Neg Pred Value : 0.8889
## Prevalence : 0.4964
## Detection Rate : 0.4388
## Detection Prevalence : 0.4820
## Balanced Accuracy : 0.8992
##
## 'Positive' Class : 1
##
## Sensitivity Specificity Pos Pred Value
## 0.8840580 0.9142857 0.9104478
## Neg Pred Value Precision Recall
## 0.8888889 0.9104478 0.8840580
## F1 Prevalence Detection Rate
## 0.8970588 0.4964029 0.4388489
## Detection Prevalence Balanced Accuracy
## 0.4820144 0.8991718
## Area under the curve: 0.9647
Overall, this model appears to perform quite well. All of the values that we are concerned about, such as sensitivity, specificity, and AUC, are quite high. Could this also be because it is overfitting? Possibly. But let’s continue on and evaluate the other models.
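One way to check the overfitting concern would be k-fold cross-validation of the baseline model. Below is a minimal sketch using caret (already loaded in the appendix) and the appendix’s crime_train data frame:
# 10-fold cross-validation of the baseline logistic model as an overfitting check
# (the factor() wrapper makes caret treat this as a classification problem)
library(caret)
set.seed(123)
cv_fit <- train(factor(target) ~ ., data = crime_train,
                method = "glm", family = "binomial",
                trControl = trainControl(method = "cv", number = 10))
cv_fit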
Let’s evaluate the third model (the model with all of the independent variables with log transformations).
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39 33
## 1 36 31
##
## Accuracy : 0.5036
## 95% CI : (0.4176, 0.5894)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : 0.8254
##
## Kappa : 0.0044
## Mcnemar's Test P-Value : 0.8097
##
## Sensitivity : 0.4844
## Specificity : 0.5200
## Pos Pred Value : 0.4627
## Neg Pred Value : 0.5417
## Prevalence : 0.4604
## Detection Rate : 0.2230
## Detection Prevalence : 0.4820
## Balanced Accuracy : 0.5022
##
## 'Positive' Class : 1
##
## Sensitivity Specificity Pos Pred Value
## 0.4843750 0.5200000 0.4626866
## Neg Pred Value Precision Recall
## 0.5416667 0.4626866 0.4843750
## F1 Prevalence Detection Rate
## 0.4732824 0.4604317 0.2230216
## Detection Prevalence Balanced Accuracy
## 0.4820144 0.5021875
## Area under the curve: 0.9853
This model performed significantly worse despite not having a drastically worse AIC or residual deviance than the first model. We will ultimately not use this model as our final choice.
If transforming all of the variables logarithmically was not the best option, then what if we only transformed the few that we thought might best improve the model? In this next model, we also evaluate the version with the buckets (home size and student-to-teacher ratio).
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 32
## 1 35 32
##
## Accuracy : 0.518
## 95% CI : (0.4317, 0.6035)
## No Information Rate : 0.5396
## P-Value [Acc > NIR] : 0.7247
##
## Kappa : 0.0332
## Mcnemar's Test P-Value : 0.8070
##
## Sensitivity : 0.5000
## Specificity : 0.5333
## Pos Pred Value : 0.4776
## Neg Pred Value : 0.5556
## Prevalence : 0.4604
## Detection Rate : 0.2302
## Detection Prevalence : 0.4820
## Balanced Accuracy : 0.5167
##
## 'Positive' Class : 1
##
## Sensitivity Specificity Pos Pred Value
## 0.5000000 0.5333333 0.4776119
## Neg Pred Value Precision Recall
## 0.5555556 0.4776119 0.5000000
## F1 Prevalence Detection Rate
## 0.4885496 0.4604317 0.2302158
## Detection Prevalence Balanced Accuracy
## 0.4820144 0.5166667
## Area under the curve: 0.96
Again, the numbers suggest that this model may not be the best fit either. Let’s see whether the results change if we instead take the stepwise “backward” version of this model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 38
## 1 37 30
##
## Accuracy : 0.4604
## 95% CI : (0.3756, 0.547)
## No Information Rate : 0.5108
## P-Value [Acc > NIR] : 0.8984
##
## Kappa : -0.08
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.4412
## Specificity : 0.4789
## Pos Pred Value : 0.4478
## Neg Pred Value : 0.4722
## Prevalence : 0.4892
## Detection Rate : 0.2158
## Detection Prevalence : 0.4820
## Balanced Accuracy : 0.4600
##
## 'Positive' Class : 1
##
## Sensitivity Specificity Pos Pred Value
## 0.4411765 0.4788732 0.4477612
## Neg Pred Value Precision Recall
## 0.4722222 0.4477612 0.4411765
## F1 Prevalence Detection Rate
## 0.4444444 0.4892086 0.2158273
## Detection Prevalence Balanced Accuracy
## 0.4820144 0.4600249
## Area under the curve: 0.9785
Again, there is not much improvement.
Let’s look at one more before finally deciding on a model: the Principal Components model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 35
## 1 28 39
##
## Accuracy : 0.5468
## 95% CI : (0.4602, 0.6313)
## No Information Rate : 0.5324
## P-Value [Acc > NIR] : 0.4001
##
## Kappa : 0.0956
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.5270
## Specificity : 0.5692
## Pos Pred Value : 0.5821
## Neg Pred Value : 0.5139
## Prevalence : 0.5324
## Detection Rate : 0.2806
## Detection Prevalence : 0.4820
## Balanced Accuracy : 0.5481
##
## 'Positive' Class : 1
##
## Sensitivity Specificity Pos Pred Value
## 0.5270270 0.5692308 0.5820896
## Neg Pred Value Precision Recall
## 0.5138889 0.5820896 0.5270270
## F1 Prevalence Detection Rate
## 0.5531915 0.5323741 0.2805755
## Detection Prevalence Balanced Accuracy
## 0.4820144 0.5481289
I suspect that these models, including this one with PCA, add a degree of complexity that may lead to overfitting or overall poor fitting.
Out of all of the models, the original (baseline) model appears to have the best numbers, including sensitivity and specificity. As we discussed before, there may be some factors that explain this. For instance, any time we take a continuous variable (such as the student-to-teacher ratio, or the number of rooms as a proxy for home size) and convert it into a categorical variable, we run the risk of losing valuable information. In addition, transforming variables does not necessarily improve predictive power either. Ultimately, Occam’s Razor usually wins, and the least complex, most straightforward model typically works best. Given these circumstances, we will make predictions on the evaluation set with the original baseline model.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
# Loading Packages
requiredpackages <- c('knitr', 'kableExtra', 'psych', 'ggplot2', 'reshape2', 'corrplot', 'tidyr', 'dplyr', 'plyr', 'MASS', 'caret', 'pscl', 'lmtest', 'pROC', 'ROCR')
for (i in requiredpackages){
if(!require(i,character.only=T)) install.packages(i)
library(i,character.only=T)
}
# Loading in datasets
url_train <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%203/crime-training-data_modified.csv'
url_test <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%203/crime-evaluation-data_modified.csv'
crime_train <- read.csv(url_train, header = TRUE)
crime_test <- read.csv(url_test, header = TRUE)
kable(head(crime_train), "html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
# Number of Rows in Training Dataset
nrow(crime_train)
# Summary Statistic:
describe(crime_train) %>% kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
# Data Visualization using BoxPlot Distribution
crime_melt <- melt(crime_train)
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + xlab("") + ylab("Values")
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + coord_flip() + xlab("") + ylab("Values")
# Scale Y Log
ggplot(crime_melt, aes(x = factor(variable), y=value)) + geom_boxplot() + scale_y_log10() + coord_flip() + xlab("Log Transformation of Independent Variables") + ylab("")
# Histogram Visualizations
ggplot(crime_melt, aes(x = value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free") + xlab("") + ylab("Frequency")
# Scatterplot of independent variables to 'target' variable
meltTarget <- melt(crime_train, id.vars = c("target"))
ggplot(meltTarget, aes(x=value, y=target)) + geom_point() + facet_wrap(~variable, scale="free")
# Correlations and its visualizations
cormatrix <- cor(crime_train, method="pearson", use="complete.obs")
kable(cormatrix, "html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
corrplot(cormatrix, method="circle")
# More correlations
cor_df <- as.data.frame(cormatrix) %>% dplyr::select(target)
pos_cor <- subset(cor_df, cor_df[,'target'] > .2)
neg_cor <- subset(cor_df, cor_df[,'target'] < -.2)
neut_cor <- subset(cor_df, cor_df[,'target'] > -.2 & cor_df[,'target'] < .2)
print("Correlation to Target")
cor_df
print("Positive Correlative Factors:")
row.names(pos_cor)
print("Negative Correlative Factors:")
row.names(neg_cor)
print("Neutral Correlative Factors")
row.names(neut_cor)
# Looking for very highly correlated, arbitrarily set at > .7 or < -.7
highly_pos_correlated <- subset(cor_df, cor_df[,'target'] > .7 & cor_df[,'target'] < 1)
highly_neg_correlated <- subset(cor_df, cor_df[,'target'] < -.7 & cor_df[,'target'] > -1)
print("Highly Positive Correlated Variables")
highly_pos_correlated
print("Highly Negative Correlated Variables")
highly_neg_correlated
# Missing Values?
sum(is.na(crime_train))
# Bucketing student to teaching ratio
stud_teach_ratio <- crime_train$ptratio
stud_teach_ratio <- cut(stud_teach_ratio, breaks = 3, labels=c("Small", "Medium", "Large"))
crime_train_1 <- cbind(crime_train, stud_teach_ratio)
x <- table(crime_train_1$target, crime_train_1$stud_teach_ratio)
x
# Percentage
print("Percentage where Crime is above the Median (1)")
x1 <- as.data.frame(x)
ggplot(x1, aes(x=Var2,y=Freq)) + geom_histogram(stat="identity", aes(fill=Var1), position="dodge") + theme_bw() + xlab("Student to Teacher Ratio Size") + ylab("Count")
ggplot(x1, aes(x=Var2,y=Freq)) + geom_histogram(stat="identity", aes(fill=Var1), position="fill") + theme_bw() + ylab("Percentage of 1 (Above Crime Median) vs 0 (Below Crime Median") + xlab("")
count(crime_train_1, 'stud_teach_ratio')
# Bucketing Home Sizes by counting rooms
home_size <- crime_train$rm
home_size <- cut(home_size, breaks = 3, labels=c("Small", "Medium", "Large"))
crime_train_1 <- cbind(crime_train_1, home_size)
x1 <- table(crime_train_1$target, crime_train_1$home_size)
x1
# Percentage
print("Percentage where Crime is above the Median (1)")
x2 <- as.data.frame(x1)
ggplot(x2, aes(x=Var2,y=Freq)) + geom_histogram(stat="identity", aes(fill=Var1), position="dodge") + theme_bw() + xlab("Size of the Home") + ylab("Count")
# Log transformation of independent variables
hist(crime_train$zn,breaks=20, main="Percentage of Land Noted as 'Residential'", xlab="Zoning Size", col = "lightgreen")
hist(log(crime_train$zn), breaks=20, main="Percentage of Logarithmic Land Noted as 'Residential'", xlab="Zoning Size", col = "blue")
# Creating a dataset with the logarithmic variables
crime_train_log <- cbind(crime_train$target, crime_train$zn, crime_train$chas, log(crime_train[,c(2,4,5,6,7,8,9,10,11,12)]))
colnames(crime_train_log) <- c('target', 'zn', 'chas', 'log_indus', 'log_nox', 'log_rm', 'log_age', 'log_dis', 'log_rad', 'log_tax', 'log_ptratio', 'log_lstat', 'log_medv')
# Histogram and Scatterplot with the new logarithmic transformations
ggplot(melt(crime_train_log), aes(x = value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
ggplot(melt(crime_train_log, id.vars=c('target')), aes(x=value, y=target)) + geom_point() + facet_wrap(~variable, scale="free")
# Looking at the most correlated terms and attempting to find the interactive terms
all_cor <- as.data.frame(cormatrix)
all_cor <- all_cor[!row.names(all_cor) == 'target', !colnames(all_cor) == 'target']
# Eliminate all 1's from the diagonal and replace it with zero temporarily
all_cor[all_cor == 1] <- 0
print("Maximum Positive Correlations amongst independent variables: ")
max_cor <- apply(all_cor, 1, max)
max_cor
print("Maximum Negative Correlations amongst independent variables: ")
min_cor <- apply(all_cor, 1, min)
min_cor
var1 <- names(max_cor)
var2 <- colnames(all_cor)[max.col(all_cor,ties.method="first")]
cor_values <- as.numeric(max_cor)
cor_df1 <- cbind(var1, var2, max_cor)
cor_df1 %>% kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
# Building models
# Model 1
logitmod <- glm(target ~ ., family = binomial, data=crime_train)
summary(logitmod)
# Model 2
logitmod_int <- glm(target ~ . + rad:tax, family = binomial, data=crime_train)
summary(logitmod_int)
# Model 3
logitmod_log <- glm(target ~ ., family = binomial, data=crime_train_log)
summary(logitmod_log)
#Model Number 1 with "Backwards" Direction
step(logitmod, direction="backward")
# Model Number 3 (Logarithmic Model) with "Backwards" direction.
step(logitmod_log, direction="backward")
# Model Number 1 with "Forward" direction.
attach(crime_train)
step(glm(target ~ 1, family="binomial"), target ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + lstat + medv,direction="forward")
detach(crime_train)
# Creating a new dataset with some predictor variables with log transformations
new_crime <- crime_train[,!names(crime_train) %in% c('lstat','dis','medv')]
log_crime <- data.frame(log(crime_train[,'lstat']), log(crime_train[,'dis']), log(crime_train[,'medv']))
new_crime <- cbind(new_crime, log_crime)
colnames(new_crime)[11:13] <- c('log_stat','log_dis','log_medv')
head(new_crime) %>% kable("html") %>% kable_styling(bootstrap_options = c("striped", "hover"))
# Building model 4 from the above dataset
logitmod_log_transformed <- glm(target ~ . + rad:tax, family = binomial, data=new_crime)
summary(logitmod_log_transformed)
# Model 5
full_df <- cbind(new_crime, stud_teach_ratio)
full_df <- cbind(full_df, home_size)
full_df <- full_df[,!names(full_df) %in% c('rm','ptratio')]
full_logitmod <- glm(target ~ . + rad:tax, family = binomial, data=full_df)
summary(full_logitmod)
# Backward Approach (stepwise) with model 5
full_logitmod_back <- step(full_logitmod, direction="backward")
full_logitmod_back
# Model 6 (PCA model)
ccrime <- crime_train[,-13]
prcrime <- prcomp(ccrime)
summary(prcrime)
# First PC
round(prcrime$rot[,1],2)
lmodpcr <- glm(crime_train$target ~ prcrime$x[,1])
summary(lmodpcr)
# Modified dataset (the dataset with some of the independent variables logarithmically transformed.).
ccrime1 <- new_crime[,-10]
prcrime1 <- prcomp(ccrime1)
summary(prcrime1)
lmodpcr1 <- glm(new_crime$target ~ prcrime1$x[,1])
summary(lmodpcr1)
# Evaluating the models
# Model 1
# Reference: https://www.r-bloggers.com/evaluating-logistic-regression-models/
# logitmod
#length(coef(logitmod))
# Let's create a partition
Train <- createDataPartition(crime_train$target, p=0.7, list=FALSE)
training <- crime_train[Train, ]
testing <- crime_train[-Train, ]
pred <- round(predict(logitmod, newdata=testing, type="response"), 3)
pred1 <- ifelse(pred > 0.5, 1, 0)
#Confusion Matrix
ans <- confusionMatrix(data=pred1, testing$target, positive='1')
ans
ans$byClass
plot(roc(testing$target, pred), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)
# Area Under the Curve
auc(roc(testing$target, pred))
# Evaluation of the 3rd model
Train <- createDataPartition(new_crime$target, p=0.7, list=FALSE)
training <- new_crime[Train, ]
testing <- new_crime[-Train, ]
pred2 <- round(predict(logitmod_log_transformed, newdata=testing, type="response"), 3)
pred21 <- ifelse(pred2 > 0.5, 1, 0)
#Confusion Matrix
ans2 <- confusionMatrix(data=pred21, testing$target, positive='1')
ans2
ans2$byClass
# Evaluating the model with buckets and log transformed data
Train <- createDataPartition(full_df$target, p=0.7, list=FALSE)
training <- full_df[Train, ]
testing <- full_df[-Train, ]
pred3 <- round(predict(full_logitmod, newdata=testing, type="response"), 3)
pred31 <- ifelse(pred3 > 0.5, 1, 0)
#Confusion Matrix
ans3 <- confusionMatrix(data=pred31, testing$target, positive='1')
ans3
ans3$byClass
plot(roc(testing$target, pred3), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)
# Area Under the Curve
auc(roc(testing$target, pred3))
# Stepwise approach backward with the above model
Train <- createDataPartition(full_df$target, p=0.7, list=FALSE)
training <- full_df[Train, ]
testing <- full_df[-Train, ]
pred4 <- round(predict(full_logitmod_back, newdata=testing, type="response"), 3)
pred41 <- ifelse(pred4 > 0.5, 1, 0)
#Confusion Matrix
ans4 <- confusionMatrix(data=pred41, testing$target, positive='1')
ans4
ans4$byClass
plot(roc(testing$target, pred4), main="ROC Curve from pROC Package")
# Please note that the X axis is in Specificity (as opposed to 1 - Specificity in the above function)
# Area Under the Curve
auc(roc(testing$target, pred4))
# 6th model: PCA model
Train <- createDataPartition(crime_train$target, p=0.7, list=FALSE)
training <- full_df[Train, ]
testing <- full_df[-Train, ]
pred5 <- round(predict(lmodpcr, newdata=testing, type="response"), 3)
pred51 <- ifelse(pred > 0.5, 1, 0)
#Confusion Matrix
ans5 <- confusionMatrix(data=pred51, testing$target, positive='1')
ans5
ans5$byClass
# Making the prediction
test_ans <- round(predict(logitmod, newdata=crime_test, type="response"), 3)
test_ans <- ifelse(test_ans > 0.5, 1, 0)
test_ans
# You may un-comment this section if you would like to download a csv file of the predictions
# write.csv(test_ans, file="Prediction.csv")