The objective of this assignment is to predict if a given neighborhood in Boston is likely to experience a high crime rate. For categorical purposes, we define a neighborhood crime rate as “high” if it exceeds the citywide median crime rate.
Because we are interested in making simple yes/no determinations, we will build various binary logistic regression models to make our predictions.
We will train our models using a prescribed data set that captures a variety of metrics for each Boston neighborhood.
The training data comprises 466 observations and 13 variables.
Below is a brief description of the variables in our data set:
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.58 | 0 | 0.605 | 7.93 | 96.2 | 2.05 | 5 | 403 | 14.7 | 369 | 3.70 | 50.0 | 1 |
0 | 19.58 | 1 | 0.871 | 5.40 | 100.0 | 1.32 | 5 | 403 | 14.7 | 397 | 26.82 | 13.4 | 1 |
0 | 18.10 | 0 | 0.740 | 6.49 | 100.0 | 1.98 | 24 | 666 | 20.2 | 387 | 18.85 | 15.4 | 1 |
30 | 4.93 | 0 | 0.428 | 6.39 | 7.8 | 7.04 | 6 | 300 | 16.6 | 375 | 5.19 | 23.7 | 0 |
0 | 2.46 | 0 | 0.488 | 7.16 | 92.2 | 2.70 | 3 | 193 | 17.8 | 394 | 4.82 | 37.9 | 0 |
0 | 8.56 | 0 | 0.520 | 6.78 | 71.3 | 2.86 | 5 | 384 | 20.9 | 396 | 7.67 | 26.5 | 0 |
The response variable, target
, is binary. There is one binary variable, chas
, among our predictor variables. Four of the predictors are expressed as proportions, with values ranging from 0 to 100:
indus
indust
age
lstat
Below is the descriptive summary of all of our variables:
zn indus chas nox rm age
Min. : 0 Min. : 0.5 Min. :0.00 Min. :0.39 Min. :3.9 Min. : 3
1st Qu.: 0 1st Qu.: 5.1 1st Qu.:0.00 1st Qu.:0.45 1st Qu.:5.9 1st Qu.: 44
Median : 0 Median : 9.7 Median :0.00 Median :0.54 Median :6.2 Median : 77
Mean : 12 Mean :11.1 Mean :0.07 Mean :0.55 Mean :6.3 Mean : 68
3rd Qu.: 16 3rd Qu.:18.1 3rd Qu.:0.00 3rd Qu.:0.62 3rd Qu.:6.6 3rd Qu.: 94
Max. :100 Max. :27.7 Max. :1.00 Max. :0.87 Max. :8.8 Max. :100
dis rad tax ptratio black lstat medv
Min. : 1.1 Min. : 1.0 Min. :187 Min. :12.6 Min. : 0 Min. : 2 Min. : 5
1st Qu.: 2.1 1st Qu.: 4.0 1st Qu.:281 1st Qu.:16.9 1st Qu.:376 1st Qu.: 7 1st Qu.:17
Median : 3.2 Median : 5.0 Median :334 Median :18.9 Median :391 Median :11 Median :21
Mean : 3.8 Mean : 9.5 Mean :410 Mean :18.4 Mean :357 Mean :13 Mean :23
3rd Qu.: 5.2 3rd Qu.:24.0 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:396 3rd Qu.:17 3rd Qu.:25
Max. :12.1 Max. :24.0 Max. :711 Max. :22.0 Max. :397 Max. :38 Max. :50
target
Min. :0.00
1st Qu.:0.00
Median :0.00
Mean :0.49
3rd Qu.:1.00
Max. :1.00
There are no missing values in a data set.
Let’s take a closer look at each variable individually.
The zn
variable has a significant positive skew, as evident in the histogram, box plot, and qq plots below. We also note that 73% of observations (339 of 466 total) have a value of 0. Furthermore, there appears to be relationship between crime rates and zn
: 93% of high crime areas have no land zoned for large lots. Also, about half of low crime areas exclude large-lot zoning. The boxplots by crime type indicate a much higher variance of zn
values for low crime neighborhoods compared to high crime areas.
Below is the summarized table of observed zn
observations:
0 10 20 25 30 35 40 45 50 55 60 70 75 80 85 90 95 100 Sum
339 10 36 8 9 9 7 6 3 3 4 3 3 15 2 4 4 1 466
Below are summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.0 0.0 0.0 11.6 16.2 100.0 23.4 2.2 6.8
below are the relevant plots:
The variable indus
represents the proportion of non-retail business acres per suburb. The histogram below indicates a bi-modal quality to the variable’s distribution. The distribution has a very mild positive skew, the kurtosis is significantly below that of a normal distribution. In the boxplot below, we see that high crime areas tend to have a higher industry concentration compared to low crime areas.
Below is the summary table, with values rounded to the nearest percentage:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 18 20 22 26 28 Sum
1 9 31 29 30 27 45 21 26 11 27 14 4 6 9 3 121 28 14 5 5 466
Here are the summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.46 5.15 9.69 11.11 18.10 27.74 6.85 0.29 1.76
Below are the plots:
The variable chas
is binary with the value 1 indicating that the neighborhood borders the Charles River. Only 7% of all neighborhoods (33 in total) border the river.
Let’s look at the summary of observations:
0 1 Sum
433 33 466
Below is the scatter plot for ‘chas’ and ‘target’:
0 1 Sum
0 225 208 433
1 12 21 33
Sum 237 229 466
Below for the t-test details:
2-sample test for equality of proportions with continuity correction
data: table(crime$chas, crime$target)
X-squared = 2, df = 1, p-value = 0.1
alternative hypothesis: two.sided
95 percent confidence interval:
-0.031 0.343
sample estimates:
prop 1 prop 2
0.52 0.36
This variable exhibits a moderate, positive skew, with a kurtosis value similar to that of a normally distributed variable. The final boxplot below indicates higher median nox
values in high crime areas vis-a-vis the low crime counterparts. We also see moderately higher nox
variance in high crime areas.
Here is a summary of observed values, rounded to the nearest 0.1 parts per 10 million:
0.4 0.5 0.6 0.7 0.8 0.9 Sum
122 149 95 76 8 16 466
Below are summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
0.39 0.45 0.54 0.55 0.62 0.87 0.12 0.75 2.98
Below are the plots:
The predictor rm
is count measure describing the average number of rooms per dwelling. It appears that lower crime areas, on average, are associated with a higher number of rooms per dwelling; however, the room counts by crime type are fairly close in value.
Here is a distribution of room counts, rounded to the nearest whole number:
4 5 6 7 8 9 Sum
4 37 284 115 23 3 466
Summary statistics are below:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
3.86 5.89 6.21 6.29 6.63 8.78 0.70 0.48 4.56
Below are the plots:
The predictor ‘age’ has a significant left skew. In the boxplots below, we see a significantly higher mean percentage of older homes in high crime areas.
Here is a table of age
values rounded to the nearest 5th percentage.
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 Sum
6 7 9 20 6 25 22 18 17 15 16 18 14 19 22 22 33 43 64 70 466
Below is the summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
2.90 43.88 77.15 68.37 94.10 100.00 28.32 -0.58 2.00
Below are the plots:
The predictor dist
describes the average distance to Boston employment centers. The variable is moderately right skewed. Based on the boxplots below, we see see that low crime areas are associated with higher average distances to employment centers.
Here is a table of distance values, rounded to the nearest unit:
1 2 3 4 5 6 7 8 9 11 12 Sum
28 145 83 65 47 44 24 16 9 4 1 466
Summary statistics are as follows:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.1 2.1 3.2 3.8 5.2 12.1 2.1 1.0 3.5
Below are the plots:
The rad
variable is an integer-valued index measure indicating an area’s accessibility to radial highways. In the boxplots below, there appears to be a significant positive association between high crime rates and rad
value.
The distribution of this variable is tri-modal, with values clustering around index values of 4, 5, and 24.
Here is a numerical distribution of the index values:
1 2 3 4 5 6 7 8 24 Sum
17 20 36 103 109 25 15 20 121 466
Below are summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.0 4.0 5.0 9.5 24.0 24.0 8.7 1.0 2.1
Below are the plots:
The tax
variable refers to the the tax rate per $10k of property value. High crime areas also appear to have a strong, positive association with the tax
value–see the boxplots below.
Here is distribution of tax
values, rounded the nearest $25:
175 200 225 250 275 300 325 350 375 400 425 475 675 700 Sum
1 14 35 20 61 82 24 11 13 51 27 1 121 5 466
Below is the summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
187.00 281.00 334.50 409.50 666.00 711.00 167.90 0.66 1.86
Below are the plots:
The predictor ptratio
indicates the average school, pupil-to-student ratio, and has a right skewed distribution. The boxplots below indicate a positive relationship between ptratio
and high crime.
Here is a distribution of ptratio
values, rounded to the nearest whole number:
13 14 15 16 17 18 19 20 21 22 Sum
15 2 55 21 45 67 65 145 49 2 466
Below are the summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
12.60 16.90 18.90 18.40 20.20 22.00 2.20 -0.76 2.61
Below are the plots:
The variable lstat
indicates the proportion of the population deemed to be of lower status. The variable is right skewed, and high crime areas tend to have be associated with larger lstat
values.
Here is a summary table of lstat
values rounded to the nearest 2:
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 Sum
9 48 59 51 60 44 45 35 38 17 14 16 8 4 12 1 3 1 1 466
Summary statistics are provided below:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
1.73 7.04 11.35 12.63 16.93 37.97 7.10 0.91 3.52
Below are the plots:
The last feature variable in our data set is medv
, which represents the median value of residential homes in a given area, in $1,000s. The variable is right skewed, and high values of medv
appear to be associated with lower crime rates.
Let’s look at medv
values, rounded to the nearest $2k:
6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 Sum
2 2 16 11 18 37 31 38 73 58 65 10 19 14 14 12 11 5 1 4 5 3 2 15 466
Below is the summary statistics:
Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew Kurt
5.0 17.0 21.2 22.6 25.0 50.0 9.2 1.1 4.4
Below are the plots:
The binary response variable, target
, indicates whether a particular area has a crime rate above the median Boston crime rate.
0 1 Sum
237 229 466
The split is very close to half and half (50/50).
Let’s look at scatter plot/correlation matrix to get a better feel for the relationships between pairs of variables.
In this section, we describe data modifications and transformations applied before fitting our regression models.
As mentioned previously, our data set contains no missing elements; so we do not need to use any imputation procedures.
In binary logistic regression, it is desirable to have predictor variables that are normally distributed, whenever possible. Let’s review predictor variables and discuss potential transformation procedures.
zn
The zn
predictor is highly right skewed. This variable also has many zero values–over 70% of observations. Let’s compare the percentage of high crime areas for zero-valued zn
vs. values 1 or higher:
zn_zero zn_1_plus
[1,] 0.631 0.118
2-sample test for equality of proportions with continuity correction
data: table(crime$zn_zero, crime$target)
X-squared = 100, df = 1, p-value <2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.432 0.595
sample estimates:
prop 1 prop 2
0.882 0.369
The difference in these proportions is statistically significant.
Let’s look at a jittered scatter plot of nonzero zn
by target value:
indus
Let’s create a new categorical variable, indus_high
, with a value of 1 if the industry proportion is close to the higher valued mode, and value of 0 if the industry value is close to the lower mode.
Comparing the high crime rates for each value of the new indust_high
variable:
val_0 val_1
[1,] 0.231 0.92
The percentage of high crime areas are very different for each value of indus_high
.
nox
The variable nox
is moderately right skewed. let’s perform the box-cox procedure to determine an appropriate transformation:
Fitted parameters:
lambda beta sigmasq
-0.949 -0.862 0.126
Convergence code returned by optim: 0
Based on the box-cox procedure output, We will create a new variable nox_mod
, that is the reciprocal of the raw nox
value. We then multiply the reciprocal by -1 to preserve the direction of the original relationship.
The transformed variable is more symmetrical, with a skewness value closer to zero:
[1] 0.0487
The variances for each target value are also similar–see below:
rm
The variable rm
has a mild positive skew and high kurtosis value. Let’s look at the suggested box cox transformation:
Fitted parameters:
lambda beta sigmasq
0.2038 2.2239 0.0263
Convergence code returned by optim: 0
Based on this output, we will transform the variable by taking the quarter root of the raw value.
The transformed variable is more symmetric, with a skewness value of:
[1] 0.0416
The variances of rm_mod
for each target value appear to be fairly similar:
age
Age has a moderate left skew. Let’s review the suggested box-cox transformation:
Fitted parameters:
lambda beta sigmasq
1.32 205.70 10492.78
Convergence code returned by optim: 0
We apply the suggested power transformation of 1.3 and store in a new variable, age_mod
.
dis
The predictor dis
has a moderate positive skew. Let’s transform using the box-cox transformation:
Fitted parameters:
lambda beta sigmasq
-0.147 1.072 0.205
Convergence code returned by optim: 0
Given that the value of the lambda parameter is fairly close to 0, we will use the log transformation and save to results to a new variable, dis_mod
.
The transformed distribution has improved skew:
[1] 0.143
The transformed variable has similar variances across each target value.
rad
val_0 val_1
[1,] 0.313 1
tax
The variable tax
also has a bi-modal shape, with values densely clustered around 300 and 700–with no values recording in the training data between between 470 and 665. It assigns a value of 1 when the tax value is greater than or equal to 500, and 0 otherwise. The 500 cutoff reflects an approximate halfway point between the two modal centers.
Let’s perform another sanity check to determine if there is significant relationship with our target variable:
val_0 val_1
[1,] 0.318 0.96
ptratio
The predictor variable ptratio
has a moderate negative skew. Let’s perform box-cox transformations:
Fitted parameters:
lambda beta sigmasq
4.14e+00 4.57e+04 3.61e+08
Convergence code returned by optim: 0
lstat
The lstat
variable has a moderate positive skew.
Let’s look the suggested power transformation from the box-cox procedure:
Fitted parameters:
lambda beta sigmasq
0.233 3.235 1.055
Convergence code returned by optim: 0
Based on this output, we create a new variable lstat_mod
, that applies a quarter root transformation to the original variable.
The transformed variable is fairly symmetric, with a skewness value of:
[1] -0.00564
The variances of the transformed variable are also similar across each target variable value.
medv
The predictor medv
has a moderate, positive skew. Let’s look a the suggested box-cox transformation:
Fitted parameters:
lambda beta sigmasq
0.235 4.469 0.693
Convergence code returned by optim: 0
Based on this output, we will apply a quarter root transformation and assign to a new variable, medv_mod
.
The newly transformed variable is virtually symmetric, with a skewness value of:
[1] 0.0402
Let’s take a look at the variables that we will consider for our models:
zn_zero
: a transformed, binary variableindus_high
: a transformed, binary variablechas
: a binary variablenox_mod
: a transformed, continuous variablerm_mod
: a transformed, continuous variabledis_mod
: a transformed, continuous variablerad_high
: a transformed, binary variable. We should be not include this variable and tax_high
together, given the high correlation between these variables.tax_high
: a transformed, binary variable. We should not include this variable in the same model as the rad_high
variable.lstat_mod
: a transformed, continuous variablemedv_mod
: a transformed, continuous variable. We will consider adding a quadratic term because the variances appear to be different for each target variable value.age
: this variable (whether transformed or not) is asymmetric and has different variances for each target value.ptratio
: this predictor and its transformed counterpart have moderate skew and different variances for each value of the target predictor.In the first model, we will only include binary variables and continuous variables that are approximately normally distributed. The goal is to have a relatively simple model with interpretable coefficients.
Let’s review the VIFs for the included predictors:
target_mod zn_zero indus_high nox_mod rm_mod tax_high lstat_mod chas
2.49 1.65 4.21 4.54 1.98 2.75 2.96 1.04
All VIF values are below 5.
Let’s review our model output:
Call:
glm(formula = target ~ zn_zero + indus_high + chas + nox_mod +
rm_mod + tax_high + lstat_mod + medv_mod, family = "binomial",
data = crime)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4574 -0.3228 -0.0541 0.3044 3.0534
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.906 10.622 -0.93 0.3510
zn_zero 0.810 0.521 1.56 0.1199
indus_high -0.249 0.583 -0.43 0.6692
chas 0.894 0.593 1.51 0.1321
nox_mod 7.935 1.162 6.83 8.6e-12 ***
rm_mod 10.534 6.049 1.74 0.0816 .
tax_high 1.748 0.674 2.59 0.0095 **
lstat_mod 1.371 1.191 1.15 0.2497
medv_mod 2.125 1.558 1.36 0.1727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 268.71 on 457 degrees of freedom
AIC: 286.7
Number of Fisher Scoring iterations: 6
Based on the Wald tests of significance, only two of our predictors, nox_mod
and tax_high
, are statistically significant.
Let’s visually examine whether or not there is significant interaction between these two variables:
Here is output from our revised model:
Call:
glm(formula = target ~ zn_zero + indus_high + chas + nox_mod +
rm_mod + tax_high + lstat_mod + medv_mod + rm_mod:lstat_mod,
family = "binomial", data = crime)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5923 -0.3265 -0.0496 0.3115 3.1440
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -91.598 42.313 -2.16 0.030 *
zn_zero 0.818 0.521 1.57 0.116
indus_high -0.288 0.583 -0.49 0.621
chas 0.935 0.584 1.60 0.109
nox_mod 8.131 1.171 6.94 3.8e-12 ***
rm_mod 64.066 27.522 2.33 0.020 *
tax_high 1.622 0.674 2.41 0.016 *
lstat_mod 49.826 24.723 2.02 0.044 *
medv_mod 1.255 1.600 0.78 0.433
rm_mod:lstat_mod -30.954 15.778 -1.96 0.050 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 264.27 on 456 degrees of freedom
AIC: 284.3
Number of Fisher Scoring iterations: 6
With the addition of the interaction term, we now see statistically significant p-values for rm_mod
, lstat_stat_mod
, and the interaction term.
Now, let’s create the marginal plots from earlier for our revised model:
First model included a variety of predictors that produced statistically insignificant z scores.
Call:
glm(formula = target ~ nox_mod + rm_mod + lstat_mod + tax_high +
rm_mod:lstat_mod, family = "binomial", data = crime)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.725 -0.362 -0.080 0.296 2.884
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -94.141 40.418 -2.33 0.020 *
nox_mod 7.930 0.958 8.27 <2e-16 ***
rm_mod 68.026 25.647 2.65 0.008 **
lstat_mod 52.244 23.980 2.18 0.029 *
tax_high 1.377 0.547 2.52 0.012 *
rm_mod:lstat_mod -32.700 15.231 -2.15 0.032 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 270.69 on 460 degrees of freedom
AIC: 282.7
Number of Fisher Scoring iterations: 6
Model 2 has a few benefits as compared to model 1:
In the ast model, we’ll we’ll fit a binary logistic regression using a stepwise regression procedure, with variable selection occurring in both forward and backward directions.
We’ll exclude rad_high
due to its extremely high correlation with tax_high
.
Call:
glm(formula = target ~ nox_mod + tax_high + dis_mod + medv_mod +
age_mod + ptratio + chas, family = "binomial", data = crime)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7953 -0.2878 -0.0246 0.3008 2.8622
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.00960 3.98159 -0.50 0.61375
nox_mod 12.73699 1.73709 7.33 2.3e-13 ***
tax_high 2.28549 0.65937 3.47 0.00053 ***
dis_mod 3.18955 0.74094 4.30 1.7e-05 ***
medv_mod 6.47191 1.37538 4.71 2.5e-06 ***
age_mod 0.00646 0.00219 2.95 0.00314 **
ptratio 0.31871 0.09854 3.23 0.00122 **
chas 1.36917 0.60292 2.27 0.02315 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 244.91 on 458 degrees of freedom
AIC: 260.9
Number of Fisher Scoring iterations: 7
The stepwise procedure resulted in the selection of 7 predictor variables.
The stepwise regression model has a lower AIC value than Model 2, which may indicate a better fit.
Let’s compute the associated p-value for the difference in deviance for these two models:
[1] 0.17
The p-value does not indicate a statistically significant difference. This is additional evidence that Model 2 is superior to Model 1:
Now let’s compare all three model fits, using AIC, corrected AIC, BIC, and loglikehood values:
AIC AICc BIC loglik
[1,] 284 285 326 -132
[2,] 283 283 308 -135
[3,] 261 261 294 -122
The first three terms, AIC, AICc, and BIC, provide a consistent interpretation of model fits:
Binary logistic regression models are fit via maximum likelihood estimation, and higher log likelihoods are associated with better model fits. One of the drawbacks of standard log-likelihood values is that is does not penalize for model size (i.e. number of parameters). Using log-likelihood measures alone to assess model fit could bias preference in favor of larger, over fitting models. There are alternative log-likelihood measures that do penalize for model size, but for now we’ll review the non-penalizing measure:
Moving forward, we choose Model 3 as the best of the models examined in this assignment. The model strikes a middle ground in terms of size in relation to the other two models, and is superior in all goodness-of-fit measures examined.
Below is a ROC curve of our selected model and area under the curve:
$auc
Area under the curve: 0.958
As a brief aside, let’s compare Model 3’s AUC to the AUCs for Model 1 and Model 2:
[1] "Model 1: 0.95198349087023"
[1] "Model 2: 0.947082342969801"
The AUC for Model 1 is identical to Model 3’s AUC, while model 2’s AUC is lower.
We now need to choose an appropriate cutoff probability measure for predicting whether a neighborhood has a high or low crime rate. One common measure used is called Youden’s index, which attempts to maximize both sensitivity and specificity:
Using the coords()
function in the pRoc package, the optimal measure is:
[1] 0.484
From a practical standpoint, this value is very close to 0.50; so we will use a probability of 50% as our cutoff.
We can now produce a confusion matrix, and various classification metrics:
Reference
Prediction 0 1
0 208 23
1 29 206
$accuracy
[1] 0.888
$error_rt
[1] 0.112
$precision
[1] 0.877
$sensitivity
[1] 0.9
$specificity
[1] 0.878
$F1
[1] 0.888
Our model has relatively high values of sensitivity and specificity. We conclude that our model is fairly strong. Our last step is to judge our model against a test data set.
Let’s make crime level predictions using the evaluation data set.
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | zn_zero | indus_high | nox_mod | rm_mod | age_mod | dis_mod | rad_high | tax_high | lstat_mod | medv_mod | predict_prob | predict_target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.07 | 0 | 0.469 | 7.18 | 61.1 | 4.97 | 2 | 242 | 17.8 | 392.8 | 4.03 | 34.7 | 1 | 0 | -2.13 | 1.64 | 209.8 | 1.603 | 0 | 0 | 1.42 | 2.43 | 0.211 | 0 |
0 | 8.14 | 0 | 0.538 | 6.10 | 84.5 | 4.46 | 4 | 307 | 21.0 | 380.0 | 10.26 | 18.2 | 1 | 0 | -1.86 | 1.57 | 319.8 | 1.496 | 0 | 0 | 1.79 | 2.06 | 0.771 | 1 |
0 | 8.14 | 0 | 0.538 | 6.50 | 94.4 | 4.46 | 4 | 307 | 21.0 | 387.9 | 12.80 | 18.4 | 1 | 0 | -1.86 | 1.60 | 369.4 | 1.494 | 0 | 0 | 1.89 | 2.07 | 0.827 | 1 |
0 | 8.14 | 0 | 0.538 | 5.95 | 82.0 | 3.99 | 4 | 307 | 21.0 | 232.6 | 27.71 | 13.2 | 1 | 0 | -1.86 | 1.56 | 307.6 | 1.384 | 0 | 0 | 2.29 | 1.91 | 0.437 | 0 |
0 | 5.96 | 0 | 0.499 | 5.85 | 41.5 | 3.93 | 5 | 279 | 19.2 | 396.9 | 8.77 | 21.0 | 1 | 0 | -2.00 | 1.55 | 126.9 | 1.370 | 0 | 0 | 1.72 | 2.14 | 0.085 | 0 |
25 | 5.13 | 0 | 0.453 | 5.74 | 66.2 | 7.22 | 8 | 284 | 19.7 | 395.1 | 13.15 | 18.7 | 0 | 0 | -2.21 | 1.55 | 232.9 | 1.978 | 0 | 0 | 1.90 | 2.08 | 0.071 | 0 |
25 | 5.13 | 0 | 0.453 | 5.97 | 93.4 | 6.82 | 8 | 284 | 19.7 | 378.1 | 14.44 | 16.0 | 0 | 0 | -2.21 | 1.56 | 364.3 | 1.920 | 0 | 0 | 1.95 | 2.00 | 0.081 | 0 |
0 | 4.49 | 0 | 0.449 | 6.63 | 56.1 | 4.44 | 3 | 247 | 18.5 | 392.3 | 6.53 | 26.6 | 1 | 0 | -2.23 | 1.60 | 187.8 | 1.490 | 0 | 0 | 1.60 | 2.27 | 0.022 | 0 |
0 | 4.49 | 0 | 0.449 | 6.12 | 56.8 | 3.75 | 3 | 247 | 18.5 | 395.1 | 8.44 | 22.2 | 1 | 0 | -2.23 | 1.57 | 190.8 | 1.321 | 0 | 0 | 1.70 | 2.17 | 0.007 | 0 |
0 | 2.89 | 0 | 0.445 | 6.16 | 69.6 | 3.50 | 2 | 276 | 18.0 | 391.8 | 11.34 | 21.4 | 1 | 0 | -2.25 | 1.58 | 248.5 | 1.251 | 0 | 0 | 1.83 | 2.15 | 0.005 | 0 |
0 | 25.65 | 0 | 0.581 | 5.86 | 97.0 | 1.94 | 2 | 188 | 19.1 | 370.3 | 25.41 | 17.3 | 1 | 1 | -1.72 | 1.56 | 382.7 | 0.665 | 0 | 0 | 2.25 | 2.04 | 0.487 | 0 |
0 | 25.65 | 0 | 0.581 | 5.61 | 95.6 | 1.76 | 2 | 188 | 19.1 | 359.3 | 27.26 | 15.7 | 1 | 1 | -1.72 | 1.54 | 375.5 | 0.564 | 0 | 0 | 2.29 | 1.99 | 0.323 | 0 |
0 | 21.89 | 0 | 0.624 | 5.64 | 94.7 | 1.98 | 4 | 437 | 21.2 | 396.9 | 18.34 | 14.3 | 1 | 1 | -1.60 | 1.54 | 370.9 | 0.683 | 0 | 0 | 2.07 | 1.95 | 0.817 | 1 |
0 | 19.58 | 0 | 0.605 | 6.10 | 93.0 | 2.28 | 5 | 403 | 14.7 | 240.2 | 9.81 | 25.0 | 1 | 1 | -1.65 | 1.57 | 362.3 | 0.826 | 0 | 0 | 1.77 | 2.24 | 0.744 | 1 |
0 | 19.58 | 0 | 0.605 | 5.88 | 97.3 | 2.39 | 5 | 403 | 14.7 | 348.1 | 12.03 | 19.1 | 1 | 1 | -1.65 | 1.56 | 384.2 | 0.871 | 0 | 0 | 1.86 | 2.09 | 0.601 | 1 |
0 | 10.59 | 1 | 0.489 | 5.96 | 92.1 | 3.88 | 4 | 277 | 18.6 | 393.2 | 17.27 | 21.7 | 1 | 0 | -2.04 | 1.56 | 357.7 | 1.355 | 0 | 0 | 2.04 | 2.16 | 0.461 | 0 |
0 | 6.20 | 0 | 0.504 | 6.55 | 21.4 | 3.38 | 8 | 307 | 17.4 | 380.3 | 3.76 | 31.5 | 1 | 0 | -1.98 | 1.60 | 53.6 | 1.216 | 0 | 0 | 1.39 | 2.37 | 0.102 | 0 |
0 | 6.20 | 0 | 0.507 | 8.25 | 70.4 | 3.65 | 8 | 307 | 17.4 | 378.9 | 3.95 | 48.3 | 1 | 0 | -1.97 | 1.70 | 252.3 | 1.295 | 0 | 0 | 1.41 | 2.64 | 0.775 | 1 |
22 | 5.86 | 0 | 0.431 | 6.96 | 6.8 | 8.91 | 7 | 330 | 19.1 | 386.1 | 3.53 | 29.6 | 0 | 0 | -2.32 | 1.62 | 12.1 | 2.187 | 0 | 0 | 1.37 | 2.33 | 0.035 | 0 |
90 | 2.97 | 0 | 0.400 | 7.09 | 20.8 | 7.31 | 1 | 285 | 15.3 | 394.7 | 7.85 | 32.2 | 0 | 0 | -2.50 | 1.63 | 51.7 | 1.989 | 0 | 0 | 1.67 | 2.38 | 0.001 | 0 |
80 | 1.76 | 0 | 0.385 | 6.23 | 31.5 | 9.09 | 1 | 241 | 18.2 | 341.6 | 12.93 | 20.1 | 0 | 0 | -2.60 | 1.58 | 88.7 | 2.207 | 0 | 0 | 1.90 | 2.12 | 0.000 | 0 |
33 | 2.18 | 0 | 0.472 | 6.62 | 58.1 | 3.37 | 7 | 222 | 18.4 | 393.4 | 8.93 | 28.4 | 0 | 0 | -2.12 | 1.60 | 196.5 | 1.215 | 0 | 0 | 1.73 | 2.31 | 0.045 | 0 |
0 | 9.90 | 0 | 0.544 | 6.12 | 52.8 | 2.64 | 4 | 304 | 18.4 | 396.9 | 5.98 | 22.1 | 1 | 0 | -1.84 | 1.57 | 173.6 | 0.971 | 0 | 0 | 1.56 | 2.17 | 0.213 | 0 |
0 | 7.38 | 0 | 0.493 | 6.42 | 40.1 | 4.72 | 5 | 287 | 19.6 | 396.9 | 6.12 | 25.0 | 1 | 0 | -2.03 | 1.59 | 121.4 | 1.552 | 0 | 0 | 1.57 | 2.24 | 0.199 | 0 |
0 | 7.38 | 0 | 0.493 | 6.31 | 28.9 | 5.42 | 5 | 287 | 19.6 | 396.9 | 6.15 | 23.0 | 1 | 0 | -2.03 | 1.58 | 79.3 | 1.689 | 0 | 0 | 1.57 | 2.19 | 0.179 | 0 |
0 | 5.19 | 0 | 0.515 | 5.89 | 59.6 | 5.62 | 5 | 224 | 20.2 | 394.8 | 10.56 | 18.5 | 1 | 0 | -1.94 | 1.56 | 203.2 | 1.725 | 0 | 0 | 1.80 | 2.07 | 0.484 | 0 |
80 | 2.01 | 0 | 0.435 | 6.63 | 29.7 | 8.34 | 4 | 280 | 17.0 | 390.9 | 5.99 | 24.5 | 0 | 0 | -2.30 | 1.60 | 82.1 | 2.122 | 0 | 0 | 1.56 | 2.23 | 0.015 | 0 |
0 | 18.10 | 0 | 0.718 | 3.56 | 87.9 | 1.61 | 24 | 666 | 20.2 | 354.7 | 7.12 | 27.5 | 1 | 1 | -1.39 | 1.37 | 336.7 | 0.478 | 1 | 1 | 1.63 | 2.29 | 0.999 | 1 |
0 | 18.10 | 1 | 0.631 | 7.02 | 97.5 | 1.20 | 24 | 666 | 20.2 | 392.1 | 2.96 | 50.0 | 1 | 1 | -1.58 | 1.63 | 385.2 | 0.184 | 1 | 1 | 1.31 | 2.66 | 1.000 | 1 |
0 | 18.10 | 0 | 0.584 | 6.35 | 86.1 | 2.05 | 24 | 666 | 20.2 | 83.5 | 17.64 | 14.5 | 1 | 1 | -1.71 | 1.59 | 327.7 | 0.719 | 1 | 1 | 2.05 | 1.95 | 0.875 | 1 |
0 | 18.10 | 0 | 0.740 | 5.93 | 87.9 | 1.82 | 24 | 666 | 20.2 | 69.0 | 34.02 | 8.4 | 1 | 1 | -1.35 | 1.56 | 336.7 | 0.599 | 1 | 1 | 2.42 | 1.70 | 0.990 | 1 |
0 | 18.10 | 0 | 0.740 | 5.63 | 93.9 | 1.82 | 24 | 666 | 20.2 | 396.9 | 22.88 | 12.8 | 1 | 1 | -1.35 | 1.54 | 366.8 | 0.597 | 1 | 1 | 2.19 | 1.89 | 0.998 | 1 |
0 | 18.10 | 0 | 0.740 | 5.82 | 92.4 | 1.87 | 24 | 666 | 20.2 | 391.4 | 22.11 | 10.5 | 1 | 1 | -1.35 | 1.55 | 359.2 | 0.624 | 1 | 1 | 2.17 | 1.80 | 0.996 | 1 |
0 | 18.10 | 0 | 0.740 | 6.22 | 100.0 | 2.00 | 24 | 666 | 20.2 | 395.7 | 16.59 | 18.4 | 1 | 1 | -1.35 | 1.58 | 398.1 | 0.696 | 1 | 1 | 2.02 | 2.07 | 1.000 | 1 |
0 | 18.10 | 0 | 0.740 | 5.85 | 96.6 | 1.90 | 24 | 666 | 20.2 | 240.5 | 23.79 | 10.8 | 1 | 1 | -1.35 | 1.55 | 380.6 | 0.640 | 1 | 1 | 2.21 | 1.81 | 0.997 | 1 |
0 | 18.10 | 0 | 0.713 | 6.53 | 86.5 | 2.44 | 24 | 666 | 20.2 | 50.9 | 18.13 | 14.1 | 1 | 1 | -1.40 | 1.60 | 329.7 | 0.890 | 1 | 1 | 2.06 | 1.94 | 0.998 | 1 |
0 | 18.10 | 0 | 0.713 | 6.38 | 88.4 | 2.57 | 24 | 666 | 20.2 | 391.4 | 14.65 | 17.7 | 1 | 1 | -1.40 | 1.59 | 339.1 | 0.943 | 1 | 1 | 1.96 | 2.05 | 0.999 | 1 |
0 | 18.10 | 0 | 0.655 | 6.21 | 65.4 | 2.96 | 24 | 666 | 20.2 | 396.9 | 13.22 | 21.4 | 1 | 1 | -1.53 | 1.58 | 229.2 | 1.086 | 1 | 1 | 1.91 | 2.15 | 0.998 | 1 |
0 | 9.69 | 0 | 0.585 | 5.79 | 70.6 | 2.89 | 6 | 391 | 19.2 | 396.9 | 14.10 | 18.3 | 1 | 0 | -1.71 | 1.55 | 253.2 | 1.062 | 0 | 0 | 1.94 | 2.07 | 0.678 | 1 |
0 | 11.93 | 0 | 0.573 | 6.98 | 91.0 | 2.17 | 1 | 273 | 21.0 | 396.9 | 5.64 | 23.9 | 1 | 0 | -1.75 | 1.62 | 352.2 | 0.774 | 0 | 0 | 1.54 | 2.21 | 0.819 | 1 |
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache=FALSE, comment=NA,
fig.align='center',options(width = 110, digits=3))
library(knitr)
library(ggplot2)
library(moments)
library(ggthemes)
library(qqplotr)
library(gridExtra)
library(car)
library(geoR)
library(alr3)
library(rcompanion)
library(sme)
library(caret)
library(pROC)
# read in data
crime <- read.csv('https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/
dataset/crime-training-data.csv')
kable(head(crime))
options(digits=2, width=100)
(summary(crime))
addmargins(table(5*round(crime$zn/5,0)))
with(crime, c(summary(zn), SD=sd(zn), Skew=skewness(zn), Kurt=kurtosis(zn)))
h <- ggplot(crime, aes(zn)) + geom_histogram(fill = 'salmon', binwidth = 20, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of zn') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=zn)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of zn") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", zn)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of zn', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), zn)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of zn by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$indus,0)))
with(crime, c(summary(indus), SD=sd(indus), Skew=skewness(indus), Kurt=kurtosis(indus)))
h <- ggplot(crime, aes(indus)) + geom_histogram(fill = 'salmon', binwidth = 5, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of indus') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=indus)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of indus") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", indus)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of indus', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), indus)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of indus by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(crime$chas))
ggplot(crime, aes(x=target, y=chas)) + geom_jitter(color='salmon') + theme_classic() +
labs(title ='Jittered Scatter Plot of chas vs.target') + theme(plot.title = element_text(hjust = 0.5))
addmargins(table(crime$chas, crime$target))
prop.test(table(crime$chas, crime$target))
addmargins(table(round(crime$nox,1)))
options(digits=2)
with(crime, c(summary(nox), SD=sd(nox), Skew=skewness(nox), Kurt=kurtosis(nox)))
h <- ggplot(crime, aes(nox)) + geom_histogram(fill = 'salmon', binwidth = 0.05, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of nox') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=nox)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of nox") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", nox)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of nox', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), nox)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of nox by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$rm,0)))
options(digits=2)
with(crime, c(summary(rm), SD=sd(rm), Skew=skewness(rm), Kurt=kurtosis(rm)))
h <- ggplot(crime, aes(rm)) + geom_histogram(fill = 'salmon', binwidth = 0.5, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of rm') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=rm)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of rm") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", rm)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of rm', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), rm)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of rm by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(5*round(crime$age/5,0)))
options(digits=2)
with(crime, c(summary(age), SD=sd(age), Skew=skewness(age), Kurt=kurtosis(age)))
h <- ggplot(crime, aes(age)) + geom_histogram(fill = 'salmon', binwidth = 5, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of age') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=age)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of age") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", age)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of age', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), age)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of age by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$dis,0)))
options(digits=2)
with(crime, c(summary(dis), SD=sd(dis), Skew=skewness(dis), Kurt=kurtosis(dis)))
h <- ggplot(crime, aes(dis)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of dis') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=dis)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of dis") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", dis)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of dis', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), dis)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of dis by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$rad,0)))
options(digits=2)
with(crime, c(summary(rad), SD=sd(rad), Skew=skewness(rad), Kurt=kurtosis(rad)))
h <- ggplot(crime, aes(rad)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of rad') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=rad)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of rad") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", rad)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of rad', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), rad)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of rad by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(25*round(crime$tax/25,0)))
options(digits=2)
with(crime, c(summary(tax), SD=sd(tax), Skew=skewness(tax), Kurt=kurtosis(tax)))
h <- ggplot(crime, aes(tax)) + geom_histogram(fill = 'salmon', binwidth = 25, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of tax') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=tax)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of tax") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", tax)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of tax', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), tax)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of tax by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(round(crime$ptratio,0)))
options(digits=2)
with(crime, c(summary(ptratio), SD=sd(ptratio), Skew=skewness(ptratio), Kurt=kurtosis(ptratio)))
h <- ggplot(crime, aes(ptratio)) + geom_histogram(fill = 'salmon', binwidth = 1, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of ptratio') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=ptratio)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of ptratio") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", ptratio)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of ptratio', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), ptratio)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of ptratio by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(2*round(crime$lstat/2,0)))
options(digits=2)
with(crime, c(summary(lstat), SD=sd(lstat), Skew=skewness(lstat), Kurt=kurtosis(lstat)))
h <- ggplot(crime, aes(lstat)) + geom_histogram(fill = 'salmon', binwidth = 2 , color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of lstat') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=lstat)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of lstat") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", lstat)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of lstat', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), lstat)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of lstat by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(3*round(crime$medv/2,0)))
options(digits=2)
with(crime, c(summary(medv), SD=sd(medv), Skew=skewness(medv), Kurt=kurtosis(medv)))
h <- ggplot(crime, aes(medv)) + geom_histogram(fill = 'salmon', binwidth = 3, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of medv') + theme(plot.title = element_text(hjust = 0.5))
q <- ggplot(crime, aes(sample=medv)) + stat_qq_point(color='salmon') + stat_qq_line(color='grey69') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of medv") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
b1 <- ggplot(crime, aes(x="", medv)) + geom_boxplot(fill='salmon', color='grey69')+ theme_classic() +
labs(title = 'Boxplot of medv', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
b2 <- ggplot(crime, aes(x=factor(target), medv)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of medv by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h, q, b1, b2, ncol=2)
addmargins(table(crime$target))
mycols = c('green3','blue')
crime$target_mod <- factor(ifelse(crime$target ==1 , 'y', 'n'))
panel.cor <- function(x,y, digits = 2, cex.cor, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- cor(x,y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
text(0.5,0.5, txt, cex =sqrt(abs(r*6)))
}
pairs(crime[,1:13], col= mycols[crime$target_mod], gap=0.3, lower.panel = panel.smooth, upper.panel = panel.cor, cex.labels = 3)
options(digits=3)
cbind(zn_zero=mean(crime$target[crime$zn==0]), zn_1_plus=mean(crime$target[crime$zn != 0]))
crime$zn_zero <- ifelse(crime$zn == 0,1,0)
prop.test(table(crime$zn_zero,crime$target))
ggplot(crime[crime$zn != 0,], aes(x=zn, target)) + geom_jitter(color='salmon', width=0,
height=0.05) +
labs(title = 'Scatter of zn (greater than 0) by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
crime$indus_high <- ifelse(crime$indus <= 14, 0, 1)
options(digits=3)
cbind(val_0=mean(crime$target[crime$indus_high==0]),val_1=mean(crime$target[crime$indus_high==1]))
boxcoxfit(crime$nox)
crime$nox_mod <- -1 / crime$nox
skewness(crime$nox_mod)
h <- ggplot(crime, aes(nox_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.15, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of nox_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), nox_mod)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of nox_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol= 2)
boxcoxfit(crime$rm)
crime$rm_mod <- crime$rm ^ 0.25
options(digits=3)
skewness(crime$rm_mod)
h <- ggplot(crime, aes(rm_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.02, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of rm_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), rm_mod)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of rm_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol=2)
boxcoxfit(crime$age)
crime$age_mod <- crime$age^1.3
h <- ggplot(crime, aes(age_mod)) + geom_histogram(fill = 'salmon', binwidth = 50, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of age_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), age_mod)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of rm_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol=2)
boxcoxfit(crime$dis)
crime$dis_mod <- log(crime$dis)
skewness(crime$dis_mod)
h <- ggplot(crime, aes(dis_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of dis_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), dis_mod)) + geom_boxplot(fill='salmon', color='grey69') +
labs(x='target', title = 'Boxplot of dis_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol=2)
crime$rad_high <- ifelse(crime$rad >= 15, 1, 0)
options(digits=3)
cbind(val_0=mean(crime$target[crime$rad_high==0]),val_1=mean(crime$target[crime$rad_high==1]))
crime$tax_high <- ifelse(crime$tax >= 500, 1,0)
options(digits=3)
cbind(val_0=mean(crime$target[crime$tax_high==0]),val_1=mean(crime$target[crime$tax_high==1]))
boxcoxfit(crime$ptratio)
boxcoxfit(crime$lstat)
crime$lstat_mod <- crime$lstat^0.25
options(digits=3)
skewness(crime$lstat_mod)
h <- ggplot(crime, aes(lstat_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of lstat_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), lstat_mod)) + geom_boxplot(fill='salmon', color='grey69') + labs(x='target', title = 'Boxplot of lstat_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol=2)
boxcoxfit(crime$medv)
crime$medv_mod <- crime$medv^0.25
options(digits=3)
skewness(crime$medv_mod)
h <- ggplot(crime, aes(medv_mod)) + geom_histogram(fill = 'salmon', binwidth = 0.2, color = 'grey69' ) +
theme_classic() + labs(title = 'Histogram of medv_mod') + theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(crime, aes(x=factor(target), medv_mod)) + geom_boxplot(fill='salmon', color='grey69') + labs(x='target', title = 'Boxplot of medv_mod by target') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h,b, ncol=2)
options(digits=3)
mylm <- lm(target ~ . - age_mod - rad_high - ptratio - dis_mod ,data=crime[,c(15:22,10, 23:24, 14,3)])
vif(mylm)
mylogit1 <- glm(target~zn_zero + indus_high + chas + nox_mod +
rm_mod + tax_high + lstat_mod +
medv_mod ,family="binomial",data=crime)
summary(mylogit1)
mmps(mylogit1, terms = ~ . - zn_zero - indus_high - chas -tax_high, layout=c(3,2), key=FALSE)
par(mfrow=c(1,1))
plot(crime$rm_mod,crime$lstat_mod,pch=crime$target+1,col=crime$target+1,
xlab='rm_mod',ylab='lstat_mod', main="Test Interaction of lstat_mod and rm_mod")
abline(lsfit(crime$rm_mod[crime$target==0],crime$lstat_mod[crime$target==0]),lty=1,col=1)
abline(lsfit(crime$rm_mod[crime$target==1],crime$lstat_mod[crime$target==1]),lty=2,col=2)
legend(1.65, 2.2,legend=c("No","Yes"),pch=1:2,col=1:2,title="High Crime?")
mylogit1 <- glm(target~zn_zero + indus_high + chas + nox_mod +
rm_mod + tax_high + lstat_mod +
medv_mod + rm_mod:lstat_mod ,family="binomial",data=crime)
summary(mylogit1)
mmps(mylogit1, terms = ~ . - zn_zero - indus_high - chas -tax_high, layout=c(3,2), key=FALSE)
mylogit2 <- glm(target~ nox_mod + rm_mod + lstat_mod + tax_high +
rm_mod:lstat_mod ,family="binomial",data=crime)
summary(mylogit2)
model.upper <- glm(target~zn_zero + indus_high + chas + nox_mod +
rm_mod + dis_mod + tax_high + lstat_mod +
medv_mod + ptratio+ age_mod ,family="binomial",data=crime)
model.null = glm(target ~ 1,
data=crime,
family = "binomial")
mylogit3 <- step(model.null, scope = list(upper=model.upper, lower = model.null),
trace = 0, direction = 'both')
summary(mylogit3)
pchisq(mylogit2$deviance - mylogit1$deviance,mylogit2$df.residual - mylogit1$df.residual ,lower=FALSE)
m1 <- cbind(AIC=AIC(mylogit1), AICc=AICc(mylogit1), BIC = BIC(mylogit1), loglik=logLik(mylogit1))
m2 <- cbind(AIC=AIC(mylogit2), AICc=AICc(mylogit2), BIC = BIC(mylogit2), loglik=logLik(mylogit2))
m3 <- cbind(AIC=AIC(mylogit3), AICc=AICc(mylogit3), BIC = BIC(mylogit3), loglik=logLik(mylogit3))
rbind(m1,m2,m3)
crime$predict <- predict(mylogit3, crime, type='response')
myroc <- roc(crime$target, crime$predict, plot=T, asp=NA,
legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")
myroc["auc"]
# models 1 and 2 prediction probs
m1_pred <- predict(mylogit1, crime, type="response")
m2_pred <- predict(mylogit2, crime, type="response")
#AUC
paste("Model 1:",roc(crime$target, m1_pred)["auc"])
paste("Model 2:",roc(crime$target, m2_pred)["auc"])
coords(myroc, "best", ret='threshold', best.method="youden")
crime$predict_target <- ifelse(crime$predict >= 0.5, 1, 0)
cm <- confusionMatrix(data=as.factor(crime$predict_target), reference=as.factor(crime$target), positive='1')
cm$table
a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1
# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)
test_crime <- read.csv('https://raw.githubusercontent.com/niteen11/MSDS/master/
DATA621/dataset/crime-evaluation-data.csv')
test_crime$zn_zero <- ifelse(test_crime$zn == 0,1,0)
test_crime$indus_high <- ifelse(test_crime$indus <= 14, 0, 1)
test_crime$nox_mod <- -1 / test_crime$nox
test_crime$rm_mod <- test_crime$rm ^ 0.25
test_crime$age_mod <- test_crime$age^1.3
test_crime$dis_mod <- log(test_crime$dis)
test_crime$rad_high <- ifelse(test_crime$rad >= 15, 1, 0)
test_crime$tax_high <- ifelse(test_crime$tax >= 500, 1,0)
test_crime$lstat_mod <- test_crime$lstat^0.25
test_crime$medv_mod <- test_crime$medv^0.25
test_crime$predict_prob <- predict(mylogit3, test_crime, type='response')
test_crime$predict_target <- ifelse(test_crime$predict_prob >= 0.50, 1,0)
#write.csv(test_crime,"test_predictions.csv", row.names=FALSE)
kable(test_crime)