Overview
1 Data Exploration Analysis
2. Data Preparation
3 Build Models
4 Model Selection
5 Prediction using final model on evaluation data set
Appendix A: Binary Logistic Regression: Crime Prediction: R Code

Overview

The data set contains approximately 466 records and 14 variables. Each record has information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).

The objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. In addition, we will provide classifications and probabilities for the evaluation data set using the binary logistic regression model.

1 Data Exploration Analysis

In section we will explore and gain some insights into the dataset by pursuing the below high level steps and inquiries:
-Variable identification
-Variable Relationships
-Data summary analysis
-Outliers and Missing Values Identification

1.1 Variable identification

First let’s display and examine the data dictionary or the data columns as shown in table 1

Variable Description
Variable	Description	Datatype	Role
zn	proportion of residential land zoned for large lots (over 25000 square feet)	numeric	predictor
indus	proportion of non-retail business acres per suburb	numeric	predictor
chas	a dummy var. for whether the suburb borders the Charles River (1) or not (0)	binary	predictor
nox	nitrogen oxides concentration (parts per 10 million)	numeric	predictor
rm	average number of rooms per dwelling	numeric	predictor
age	proportion of owner-occupied units built prior to 1940	numeric	predictor
dis	weighted mean of distances to five Boston employment centers	numeric	predictor
rad	index of accessibility to radial highways	integer	predictor
tax	full-value property-tax rate per $10,000	integer	predictor
ptratio	pupil-teacher ratio by town	numeric	predictor
black	1000(Bk - 0.63)2 where Bk is the proportion of blacks by town	numeric	predictor
lstat	lower status of the population (percent)	numeric	predictor
medv	median value of owner-occupied homes in $1000s	numeric	predictor
target	whether the crime rate is above the median crime rate (1) or not (0)	binary	response

We notice that all variables are numeric except for two variables: the response variable “target” which is binary and the predictor variable “chas” which is a dummy binary variable indicating whether the suburb borders the Charles River (1) or not (0).

Based on the original dataset, our predictor input is made of 13 variables. And our response variable is one variable called target.

1.2 Variable Relationships

The variables seem to not have any arithmetic relations. In other words, there are no symmetricity or transitivity relationships between any two variable in the independent variable set.
In addition, since this is Logistic Regression, we will be making the below assumptions on the variables:-The dependent variable need not to be normally distributed
-Errors need to be independent but not normally distributed.
- We will be using GLM and GLM does not assume a linear relationship between dependent and independent variables. However, it assumes a linear relationship between link function and independent variables in logit model.
- Also does not use OLS (Ordinary Least Square) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE)

1.3 Data Summary Analysis

In this section, we will create summary data to better understand the initial relationship variables have with our dependent variable using correlation, central tendency, and dispersion As shown in table 2.

Now we will produce the correlation table between the independent variables and the dependent variable

Variable Correlation
	x
target	1.0000000
nox	0.7290920
rad	0.6307187
age	0.6275762
indus	0.6034795
tax	0.6021403
lstat	0.4808888
ptratio	0.2198922
chas	0.0579716
rm	-0.1605913
medv	-0.2724789
black	-0.3463425
zn	-0.4239382
dis	-0.6167264

Correlation analysis suggests that there are strong positive and negative between the independent variables and the dependent variable. For instance, we notice that there is a strong correlation of .73 between the concentration of nitrogen oxides and crime rate being above average. We will need to perform more investigations about this correlation as it is not obvious the concentration of nitrogen oxides would results in high crime rate; perhaps it impacts the crime rate indirectly by impacting other independent variables that we may or may not have in our data set.
In addition, we noticed that accessibility to radial highways also has a strong correlation with the crime rate being average average. Again we will investigate such correlation. We also noticed that unit or house age, property tax, and non-retail businesses having a positive impact on the crime rate being above average.

It is also worth noting that that distances to five Boston employment centers, large residential lots, the proportion of blacks by town, median value of owner-occupied homes, and the average number of rooms per dwelling, all have negative correlation to the crime rate being above crime rate average. In other words, the closer people are to the five Boston employment centers, the more likely the crime rate will be below the crime average.

1.4 Outliers and Missing Values Identification

1.4.1 Missing Values

As per Table .3 below, we see that we have no missing values which is good thing as we don’t have to carry out any imputation tasks.

Missing Values
	x
zn	0
indus	0
chas	0
nox	0
rm	0
age	0
dis	0
rad	0
tax	0
ptratio	0
black	0
lstat	0
medv	0
target	0

Also, as per Table .4 below, we can confirm that our target variable is binary as expected.

Unique Values
	x
zn	26
indus	73
chas	2
nox	79
rm	419
age	333
dis	380
rad	9
tax	63
ptratio	46
black	331
lstat	424
medv	218
target	2

1.4.2 Outliers identification

In this section univariate analysis is being carried out and boxplots diagrams are being used to determine the outliers in variables and decide on whether to act on the outliers

From the “Outliers identification” plot above, we see that we have few outliers that we need to treat. We see that: zn (residential land zoned), rm (average number of rooms per dwelling), dis(weighted mean of distances to five Boston employment centers), black (the proportion of blacks by town),lstat (lower status of the population), and medv median value of owner-occupied homes in $1000s all need to be trated

1.4.3 Analysis the link function

In this section, we will investigate how our initial data aligns with a typical logistic model plot.

Recall the Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (glm). The fundamental equation of generalized linear model is: $g(E(y)) = a+ Bx_1+B_2x_2+ B_3x_3+...$
where, g() is the link function, E(y) is the expectation of target variable and $B_0 + B_1x_1 + B_2x_2+B_3x_3$ is the linear predictor ( $B_0,B_1,B_2, B_3$ to be predicted). The role of link function is to ‘link’ the expectation of y to linear predictor.

In logistic regression, we are only concerned about the probability of outcome dependent variable ( success or failure). As described above, g() is the link function. This function is established using two things: Probability of Success (p) and Probability of Failure (1-p). p should meet following criteria: It must always be positive (since p >= 0) It must always be less than equals to 1 (since p <= 1).

Now let’s investigate how our initial data model aligns with the above criteria. In other words, we will plot regression model plots for each variable and compare it to a typical logistic model plot:

Interpretation

You can see clearly that the probability of crime being above average increases as we get closer to the “1” classification for the indus,nox,age,rad,tax,and lstat variables. In the middle, the probability changes at the highest rate, while it tails off at each end in order to bound it between 0 and 1.
You can see clearly that the probability of crime being above average decreases as we get closer to the “1” classification for the zn, dis,black, and mdev variables. In the middle, the probability changes at the lowest rate. However, it does not tails off at each end for all of the variables.

2. Data Preparation

Now that we have completed the preliminary analysis, we will be cleaning and consolidating data into one dataset for use in analysis and modeling. We will be following the below steps as guidelines:
- Outliers treatment
- Missing values treatment
- Data transformation

2.1 Outliers treatment

For outliers, we will create 2 sets of variables.

The first set uses the capping method. In this method, we will replace all outliers that lie outside the 1.5 times of IQR limits. We will cap it by replacing those observations less than the lower limit with the value of 5th %ile and those that lie above the upper limit with the value of 95th %ile.

Accordingly we create the following new variables while retaining the original variables.

city_crime_train$tax\ city_crime_train$medv
city_crime_train$lstat

Lets see how the new variables look in boxplots.

In the second set, we will use the sin transformation and create the following variables:

city_crime_train_mod$rm_new\ city_crime_train_mod$dis_new

2.3 Tranformation for Variables

In this section, we will analyze few transformation options using Sin, Log, Sqrt, nth transformations. Using histogram and boxplots to evaluate best transformation to handle outliers.
First we will start with variable, zn (proportion of residential land zoned for large lots) as per below-

zn Transformation

Please note that we have created similar figures to figure 1 above for each remaining variable. However, we hid the remaining figures for ease of streamlining the report as they have similar shapes. However, we have drawn the below observations from each remaining figure.

2.3.1 Variable Transformation Interpretation

Logit function has been used in building models in this assignment. Any transformation like sin,sqrt,nth etc for logit model brings additional complexity while doing the interpretation of the model. For that purpose any such transformation has not been used for the variables as normality is not a criteria for logit model.
The following are some of the transformation of the variables that have been used here:
1. zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable). For zn, we can see that there are large number of values with 0. Option to transformed zn to a variable of binary values 0 and 1. Values with 0 value to be transformed to 0 other values will be bucketed to 1.
2. chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable). This variable will be converted to a factor with values of 0 and 1 in the model.
3. target: whether the crime rate is above the median crime rate (1) or not (0) (response variable). Target variable will also be converted to a factor variable before using that in the model.

3 Build Models

In this section, we will create six models. Aside from using original and transformed data, we will also using different methods and functions such as Linear Discriminant Analysis, step function, and logit function to enhance our models. Below is our model definition:
-Model 1- This model will be created using the original variables in train data set with logit function GLM.
-Model 2- This model will be created using original variables; however using step function instead of GLM.
-Model 3- This model will be created using transformed variables using GLM function.
-Model 4- This model will be created using transformed variables using step function instead of GLM.
-Model 5: this model will be created using original variables using Linear Discriminant Analysis function lda in ISLR package.
-Model 6- This model will be created using transformed variables using Linear Discriminant Analysis

Below is a summary table showing models and their respective variables.

Variables used in different models
Variables	Model.1	Model.2	Model.3	Model.4	Model.5	Model.6
zn	y	y			y	y
indus	y	y	y	y	y	y
chas	y	y	y	y	y	y
nox	y	y	y	y	y	y
rm	y	y	y	y	y	y
age	y	y	y	y	y	y
dis	y	y	y	y	y	y
rad	y	y	y	y	y	y
tax	y	y			y	y
ptratio	y	y	y	y	y	y
black	y	y	y	y	y	y
lstat	y	y			y	y
medv	y	y			y	y
tax_new			y	y		y
medv_new			y	y		y
lstat_new			y	y		y
zn_new			y	y		y

3.1.1 Model One

In this model, we will be using all the given variables in train data set. We will create model using logit function and we will highlight the summary of the model.

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = city_crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8791  -0.1299  -0.0025   0.0011   3.4785  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -41.462153   8.250799  -5.025 5.03e-07 ***
## zn           -0.060580   0.039153  -1.547 0.121799    
## indus        -0.063885   0.059335  -1.077 0.281618    
## chas          0.789391   0.865818   0.912 0.361912    
## nox          53.413503  10.013666   5.334 9.60e-08 ***
## rm           -0.647942   0.904430  -0.716 0.473739    
## age           0.028835   0.015680   1.839 0.065915 .  
## dis           0.800917   0.268877   2.979 0.002894 ** 
## rad           0.721751   0.195662   3.689 0.000225 ***
## tax          -0.007065   0.003490  -2.024 0.042948 *  
## ptratio       0.440768   0.159366   2.766 0.005679 ** 
## black        -0.009591   0.006025  -1.592 0.111412    
## lstat         0.096941   0.062429   1.553 0.120469    
## medv          0.236940   0.091276   2.596 0.009436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 514.63  on 371  degrees of freedom
## Residual deviance: 140.71  on 358  degrees of freedom
## AIC: 168.71
## 
## Number of Fisher Scoring iterations: 9

Interpretation for model 1

$\newline$ (i) Based on the outcome, it can be seen that indus, chas, rm, age, black, and lstat are not statistically significant.

(ii)As for the statistically significant variables, nox has the lowest p-value suggesting a strong association of the nox to the target variable. Other important variables are dis, rad, tax, ptratio, and medv. The AIC value for the model1 =168.71.

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variables.

For every one unit change in nox, the log odds of crime rate above median value increases by 53.41.
For a one unit increase in dis, the log odds of crime rate above median value increases by 0.80.
For a one unit increase in rad, the log odds of crime rate above median value increases by 0.72.
For a one unit increase in tax, the log odds of crime rate above median value increases by -0.007. Tax has a negative impact on crime rate.
For a one unit increase in ptratio, the log odds of crime rate above median value increases by 0.44.
For a one unit increase in medv , the log odds of crime rate above median value increases by 0.23.

No. of iterations are 9 before lowest value of AIC was derived for this model.

3.1.2 Model Two

This model, we will be using original variables; however using step function (backward process) instead of GLM.

## 
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     black + lstat + medv, family = "binomial", data = city_crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9258  -0.1459  -0.0024   0.0013   3.3934  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -39.282116   7.705519  -5.098 3.43e-07 ***
## zn           -0.064656   0.037414  -1.728 0.083964 .  
## nox          46.617168   8.074920   5.773 7.78e-09 ***
## age           0.025273   0.013545   1.866 0.062065 .  
## dis           0.710480   0.249767   2.845 0.004447 ** 
## rad           0.775881   0.182072   4.261 2.03e-05 ***
## tax          -0.009144   0.003082  -2.967 0.003011 ** 
## ptratio       0.359297   0.135081   2.660 0.007817 ** 
## black        -0.008384   0.005737  -1.462 0.143871    
## lstat         0.110624   0.055650   1.988 0.046829 *  
## medv          0.181460   0.053572   3.387 0.000706 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 514.63  on 371  degrees of freedom
## Residual deviance: 142.85  on 361  degrees of freedom
## AIC: 164.85
## 
## Number of Fisher Scoring iterations: 9

Interpretation for model 2

$\newline$ (i)It can be seen that zn, age, and black are not statistically significant.

(ii)As for the statistically significant variables, nox has the lowest p-value suggesting a strong association of the nox of the target variable. other important variables are dis, rad, tax, ptratio, medv, and lstat. The AIC value for the model1 =164.85.

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variables.

For every one unit change in nox, the log odds of crime rate above median value increases by 46.61.
For a one unit increase in dis, the log odds of crime rate above median value increases by 0.71.
For a one unit increase in rad, the log odds of crime rate above median value increases by 0.77.
For a one unit increase in tax, the log odds of crime rate above median value increases by -0.009.
For a one unit increase in ptratio, the log odds of crime rate above median value increases by 0.35.
For a one unit increase in medv , the log odds of crime rate above median value increases by 0.18

there were 9 iterations in backward steps before final model was selected

3.1.3 Model Three

In this model, we will be using transformed variables with the logit function GLM.

## 
## Call:
## glm(formula = target ~ . - zn - tax - lstat - medv, family = "binomial", 
##     data = city_crime_train_mod)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7883  -0.1410  -0.0026   0.0005   3.3645  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -68.319369  16.418997  -4.161 3.17e-05 ***
## indus        -0.001867   0.067017  -0.028 0.977778    
## chas          0.366993   0.849076   0.432 0.665577    
## nox          56.080643  10.147964   5.526 3.27e-08 ***
## rm            2.995884   2.385419   1.256 0.209147    
## age           0.043435   0.018166   2.391 0.016805 *  
## dis           0.472036   0.331312   1.425 0.154231    
## rad           0.838409   0.237364   3.532 0.000412 ***
## ptratio       0.468316   0.176293   2.656 0.007896 ** 
## black        -0.010739   0.005922  -1.813 0.069782 .  
## tax_new      -0.005285   0.003663  -1.443 0.149151    
## medv_new      0.283102   0.106228   2.665 0.007698 ** 
## lstat_new     0.050027   0.074958   0.667 0.504515    
## rm_new       -5.052053   2.830695  -1.785 0.074304 .  
## dis_new      -1.886385   0.552223  -3.416 0.000636 ***
## zn_new       -0.363834   1.036508  -0.351 0.725574    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 514.63  on 371  degrees of freedom
## Residual deviance: 124.11  on 356  degrees of freedom
## AIC: 156.11
## 
## Number of Fisher Scoring iterations: 9

Interpretation for model 3

$\newline$ (i)From this model it can be seen that the following variables are relevant for this model: nox, dis, rad, ptratio, tax_new, medv_new, and lstat_new.

The number of integration is 9 and AIC value =169.71.
nox and rad are the two most important variables. The new variables tax_new, medv_new, lstat_new are having minor impact on the model.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variables.

For every one unit change in nox, the log odds of crime rate above median value increases by 56.02.
For a one unit increase in rad, the log odds of crime rate above median value increases by 0.72.
For a one unit increase in dis, the log odds of crime rate above median value increases by 0.82.

3.1.4 Model Four

In this model we will be using transformed variables using backward step function instead of GLM

## 
## Call:
## glm(formula = target ~ nox + age + dis + rad + ptratio + black + 
##     tax_new + medv_new + rm_new + dis_new, family = "binomial", 
##     data = city_crime_train_mod)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0158  -0.1472  -0.0031   0.0005   3.1030  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -52.779764   9.739144  -5.419 5.98e-08 ***
## nox          56.509319   9.188179   6.150 7.74e-10 ***
## age           0.051467   0.016215   3.174 0.001503 ** 
## dis           0.564992   0.255943   2.207 0.027280 *  
## rad           0.849127   0.212643   3.993 6.52e-05 ***
## ptratio       0.533319   0.159365   3.347 0.000818 ***
## black        -0.010960   0.005943  -1.844 0.065147 .  
## tax_new      -0.004534   0.003144  -1.442 0.149355    
## medv_new      0.342778   0.095427   3.592 0.000328 ***
## rm_new       -2.358513   1.028472  -2.293 0.021835 *  
## dis_new      -1.865533   0.488896  -3.816 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 514.63  on 371  degrees of freedom
## Residual deviance: 126.80  on 361  degrees of freedom
## AIC: 148.8
## 
## Number of Fisher Scoring iterations: 9

Interpretation for model 4

$\newline$ (i)From this model it can be seen that the following variables are relevant for this model: nox, dis, rad, ptratio , tax_new, medv_new,and lstat_new
(ii) The number of integration is 9 and AIC value =165.8.

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variables.

For every one unit change in nox, the log odds of crime rate above median value increases by 48.61.
For a one unit increase in rad, the log odds of crime rate above median value increases by 0.79.
For a one unit increase in dis, the log odds of crime rate above median value increases by 0.74.

same variables as model3 are being marked as relevant for model 4 after backward elimination process.

3.1,5 Model Five

In this model we will be using original variables; however the Linear Discriminant Analysis function lda in ISLR package.

## Call:
## lda(target ~ ., data = city_crime_train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5268817 0.4731183 
## 
## Group means:
##          zn     indus       chas       nox       rm      age      dis
## 0 22.012755  6.956327 0.05102041 0.4689730 6.401296 50.37398 5.086538
## 1  1.613636 15.291193 0.07954545 0.6428523 6.176631 86.38864 2.459868
##         rad      tax  ptratio    black     lstat     medv
## 0  4.107143 308.4949 17.76990 388.6647  9.199235 25.18724
## 1 14.880682 509.6932 18.74773 327.2894 15.959148 20.24148
## 
## Coefficients of linear discriminants:
##                   LD1
## zn      -0.0047914631
## indus    0.0281044279
## chas    -0.0556293189
## nox      7.9109306913
## rm       0.1658180998
## age      0.0131973114
## dis      0.0840623852
## rad      0.1027832012
## tax     -0.0019152605
## ptratio  0.0090391049
## black   -0.0009160458
## lstat    0.0248449648
## medv     0.0425514709

Interpretation for model 5

$\newline$ (i)The Classification boundary equation for our model 5 is below:
$-0.004 * zn + 0.0281 * indus - 0.055 * chas + 7.910 * nox + 0.165 * rm + 0.013 * age + 0.084 * dis + 0.102 * rad - 0.001 * tax + 0.009 * ptratio - 0.0009 * black + 0.024 * lstat + 0.042 * medv =0$

From summary table, we also have the prior probability of of success is 0.4731183 and failure is 0.5268817.
Group means provides mean values for each variable with respect to target variable values 0 and 1.
This model has accuracy value 82.97 % which is less compare to the other models. LDA model assumes normality of the variables used in the model and theere are some variables which are not normally distributed and have outliers that is impacting the result out of this model.

3.1.6 Model six

In this model we be using transformed variables using Linear Discriminant Analysis

## Call:
## lda(target ~ . - zn - rm - dis - tax - lstat - medv, data = city_crime_train_mod)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5268817 0.4731183 
## 
## Group means:
##       indus       chas       nox      age       rad  ptratio    black
## 0  6.956327 0.05102041 0.4689730 50.37398  4.107143 17.76990 388.6647
## 1 15.291193 0.07954545 0.6428523 86.38864 14.880682 18.74773 327.2894
##    tax_new medv_new lstat_new      rm_new    dis_new     zn_new
## 0 308.4949 25.04528  9.199235  0.08333182 -0.0504096 0.46938776
## 1 509.6932 19.86151 15.724247 -0.11166891  0.5106930 0.07954545
## 
## Coefficients of linear discriminants:
##                    LD1
## indus      0.022452946
## chas      -0.186416323
## nox        7.970446650
## age        0.015169354
## rad        0.100159450
## ptratio   -0.014404341
## black     -0.001159202
## tax_new   -0.001196341
## medv_new   0.047596449
## lstat_new  0.016840318
## rm_new    -0.008946209
## dis_new   -0.340985994
## zn_new    -0.001832533

Interpretation for model 6

$\newline$ (i)The Classification boundary equation for our model 6 is below:
$0.022 * indus -0.186 * chas + 7.970 * nox + 0.015 * age + 0.100 * rad -0.001 * tax - 0.014 * ptratio -0.001 * black + 0.016 * lstat_new + 0.047 * medv_new -0.008 * rm_new -0.034 * dis_new -0.001 * zn_new =0$

From summary table, we also have the prior probability of of success is 0.4731183 and failure is 0.5268817.
Group means provides mean values for each variable with respect to target variable values 0 and 1.
This model has accuracy value 82.97 % which is less compare to the other models.LDA model assumes normality of the variables used in the model and theere are some variables which are not normally distributed and have outliers that is impacting the result out of this model. This model with transformed variables seems to performed margaianlly better.

4 Model Selection

In section we will further examine all six models. We will also follow the below steps to select our final model:
-Model selection strategy
-Model1 Evaluation
-Final Model Selection
-Inference of Final Model

4.1 Model selection strategy:

The below model selection strategy will be used in our model evaluation:

Compare the Accuracy & Confusion Matrix of the models.
Compare the Precision, Sensitivity, Specificity, and F1 score.
Compare AUC curve for the models.

4.1.1 Model One Evaluation

Model 1 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
1	0.9042553	0.0957447	0.9245283	0.9074074	0.9	0.9283174	0.9549011

From the key metrics table above, we can see that the model has high accuracy of 0.9042553 and low error rate 0.0957447. The AUC curve for this model is 0.9549011 which is very good as the optimal value for AUC is between 0 and 1 and the closer it goes to 1, the better the model outcome is..

4.1.2 Model Two Evaluation

Model 2 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
2	0.8723404	0.1276596	0.9056604	0.8727273	0.8717949	0.9061444	0.9553613

From the key metrics table above, we can see that the model has high accuracy of 0.8723404 and low error rate 0.12765957. The AUC curve for this model is 0.9553 which is very good.

4.1.3 Model Three Evaluation

Model 3 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
3	0.893617	0.106383	0.9245283	0.8909091	0.8974359	0.9211541	0.9677865

Looking at the key metrics this can be concluded this model has high accuracy 0.893617 and low error rate 0.106383. The AUC curve for this model is 0.9677865 which is very good.

4.1.4 Model Four Evaluation

Model 4 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
4	0.8829787	0.1170213	0.9056604	0.8888889	0.875	0.9127916	0.9687069

Looking at the key metrics this can be concluded this model has high accuracy 0.8829787 and low error rate 0.1170213. The AUC curve for this model is 0.9687069 which is very good.

4.1.5 Model Five Evaluation

Model 5 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
5	0.8297872	0.1702128	0.7358491	0.9512195	0.7358491	0.8297872	0.9263691

Looking at the key metrics this can be concluded this model has high accuracy 0.8297872 and low error rate 0.1702128. The AUC curve for this model is 0.9263691 which is very good.

4.1.6 Model Six Evaluation

Model 6 evaluation KPIs
	Accuracy	Error_Rate	Precision	sensitivity	specificity	F1_Score	AUC
6	0.8297872	0.1702128	0.7358491	0.9512195	0.7358491	0.8297872	0.9406351

Looking at the key metrics this can be concluded this model has high accuracy 0.8297872 and low error rate 0.1702128. The AUC curve for this model is 0.9406351 which is very good.

4.2 Final Model Seletion

Following is the comparison of various metrics for above 6 models

Model Performance Metrics Comparison
Model_No	Accuracy	Error_Rate	AUC	Precision	sensitivity	specificity	F1_Score
1	0.9042553	0.0957447	0.9549011	0.9245283	0.9074074	0.9000000	0.9283174
2	0.8723404	0.1276596	0.9553613	0.9056604	0.8727273	0.8717949	0.9061444
3	0.8936170	0.1063830	0.9677865	0.9245283	0.8909091	0.8974359	0.9211541
4	0.8829787	0.1170213	0.9687069	0.9056604	0.8888889	0.8750000	0.9127916
5	0.8297872	0.1702128	0.9263691	0.7358491	0.9512195	0.7358491	0.8297872
6	0.8297872	0.1702128	0.9406351	0.7358491	0.9512195	0.7358491	0.8297872

From the comparison table, we see that Model 1 and Mode 3 are very close from accuracy and AUC perspective. Model 1 has little better accuracy rate 90.42% compare to Model’s 89.36%. However, Model 3 is the best in terms of AUC value which is .9677 which is closer to model 1’s value.
The AUC provides the best score on probability correctly identifying the patterns at various cut off values. The Accuracy, on the other hand, is calculated as specific cut off value. For this assignment we will go with cut off value of 0.5 and choose the Model 1 based on Accuracy value for further prediction on evaluation data set.

4.2.1 Inference for Final Model

The following analysis will be carried out on the final model:
(i) Relevant variables in the model
(ii) Estimate confidence interval for coefficient
(iii) odds ratios and 95% CI
(iv) AUC curve
(v) Distribution of prediction

4.2.2 Most important variables in the model

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = city_crime_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8791  -0.1299  -0.0025   0.0011   3.4785  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -41.462153   8.250799  -5.025 5.03e-07 ***
## zn           -0.060580   0.039153  -1.547 0.121799    
## indus        -0.063885   0.059335  -1.077 0.281618    
## chas          0.789391   0.865818   0.912 0.361912    
## nox          53.413503  10.013666   5.334 9.60e-08 ***
## rm           -0.647942   0.904430  -0.716 0.473739    
## age           0.028835   0.015680   1.839 0.065915 .  
## dis           0.800917   0.268877   2.979 0.002894 ** 
## rad           0.721751   0.195662   3.689 0.000225 ***
## tax          -0.007065   0.003490  -2.024 0.042948 *  
## ptratio       0.440768   0.159366   2.766 0.005679 ** 
## black        -0.009591   0.006025  -1.592 0.111412    
## lstat         0.096941   0.062429   1.553 0.120469    
## medv          0.236940   0.091276   2.596 0.009436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 514.63  on 371  degrees of freedom
## Residual deviance: 140.71  on 358  degrees of freedom
## AIC: 168.71
## 
## Number of Fisher Scoring iterations: 9

Following are the most relevant variables for the model: indus, nox, dis, rad, ptratio, and medv.
we can write the equation of the Model 1 as:
$log(y) = -41.426 + 53.41 * nox + 0.80 * dis + 0.721 * rad -0.007 * tax + 0.44 * Ptratio + 0.23 * medv$

4.2.3 Analysis of odds ratios of variables 95% CI

##                       OR        2.5 %       97.5 %
## (Intercept) 9.844998e-19 9.335179e-26 1.038266e-11
## zn          9.412183e-01 8.716922e-01 1.016290e+00
## indus       9.381125e-01 8.351201e-01 1.053807e+00
## chas        2.202054e+00 4.034991e-01 1.201748e+01
## nox         1.574670e+23 4.715650e+14 5.258208e+31
## rm          5.231212e-01 8.886894e-02 3.079319e+00
## age         1.029255e+00 9.981051e-01 1.061378e+00
## dis         2.227583e+00 1.315121e+00 3.773133e+00
## rad         2.058033e+00 1.402507e+00 3.019950e+00
## tax         9.929600e-01 9.861908e-01 9.997758e-01
## ptratio     1.553900e+00 1.137027e+00 2.123615e+00
## black       9.904547e-01 9.788273e-01 1.002220e+00
## lstat       1.101795e+00 9.749020e-01 1.245205e+00
## medv        1.267365e+00 1.059759e+00 1.515641e+00

The following points can be made for the important variables in the model:

In keeping all other variables same, the odds of having crime rate above median value increases as follow: 0.875 for per unit change in indus, 2.50 per unit change in dis, 1.74 for per unit change in rad, 1.51 for per unit change in ptratio, and 1.30 for per unit change in medv. Any value which is less than 1, it means that there is less chance of an event with the per unit increase of the variable.
The nox variable has very high odd ratio and the reason is, in given data set there are records where for range of nox values there is only one outcome of target (either 0 or 1) as shown in the chart below with vertical lines. That makes easy for prediction of target variable and increases the accuracy of the model.

Please note that the outlier treatment for the nox variable would result in 50% reduction on train data set. However, for this assignment we have kept this variable as is.

4.2.4 AUC curve for the selected model

4.2.5 Distribution of the Predictions

Considering the target has value 1 (crime above median) and 0 when crime rate is below median, then the above plot illustrates the tradeoff of choosing a reasonable threshold. In other words, if the threshold is increased, the number of false positive (FP) results is lowered; while the number of false negative (FN) results increases.

5 Prediction using final model on evaluation data set

In this section, Model 1 has been used to predict the outcome on the evaluation dataset.

Outcome on evaluation data set
Var1	Freq
FALSE	19
TRUE	21

Based on the outcome, we can conclude that in evaluation data set of 40, around 21 records are where the crime rate is above the median and 19 records where the crime rate is below the median.

Appendix A: Binary Logistic Regression: Crime Prediction: R Code

if (!require("ggplot2",character.only = TRUE)) (install.packages("ggplot2",dep=TRUE))
if (!require("MASS",character.only = TRUE)) (install.packages("MASS",dep=TRUE))
if (!require("knitr",character.only = TRUE)) (install.packages("knitr",dep=TRUE))
if (!require("xtable",character.only = TRUE)) (install.packages("xtable",dep=TRUE))
if (!require("dplyr",character.only = TRUE)) (install.packages("dplyr",dep=TRUE))
if (!require("psych",character.only = TRUE)) (install.packages("psych",dep=TRUE))
if (!require("stringr",character.only = TRUE)) (install.packages("stringr",dep=TRUE))
#if (!require("car",character.only = TRUE)) (install.packages("car",dep=TRUE))
if (!require("faraway",character.only = TRUE)) (install.packages("faraway",dep=TRUE))

library(ggplot2)
library(MASS)
library(knitr)
library(xtable)
library(dplyr)
library(psych)
library(stringr)
#library(car)
library(faraway)
library(aod)
library(Rcpp)
library(leaps)
library(ISLR)
library(AUC)
library(ROCR)
library(Amelia)
library(pROC)

city_crime_train_full <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/crime-training-data.csv")

variables<- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/variables2.csv")
kable(variables, caption = "Variable Description") 




city_crime_test <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/crime-evaluation-data.csv")

#str(city_crime_test)




smp_size <- floor(0.80 * nrow(city_crime_train_full))

## set the seed to make your partition reproductible
set.seed(123)

train_ind <- sample(seq_len(nrow(city_crime_train_full)), size = smp_size)

city_crime_train<- city_crime_train_full[train_ind, ]
train_test <- city_crime_train_full[-train_ind, ]

ds_stats <- psych::describe(city_crime_train, skew = TRUE, na.rm = TRUE)
#ds_stats
kable(ds_stats[1:7], caption= "Data Summary")
kable(ds_stats[8:13], caption= "Data Summary (Cont)")

fun1 <- function(a, y) cor(y, a)
x<-city_crime_train[,]
Correlation <- sapply(x, FUN = fun1, y=city_crime_train$target) 

#kable(data.frame(Correlation), caption = "Correlation between target and predictor variable")
Correlation <- sort(Correlation, decreasing = TRUE)
kable(Correlation, caption = "Variable Correlation")

missings<- sapply(city_crime_train_full,function(x) sum(is.na(x)))
kable(missings, caption = "Missing Values")

#missmap(city_crime_train_full, main = "Missing values vs observed")

### finding unique values


### % break up of target variable 

#prop.table(table(city_crime_train_full$target))

uniques<- sapply(city_crime_train_full, function(x) length(unique(x)))
kable(uniques, caption = "Unique Values")


library(ggplot2)
library(reshape2)
#
mdata<- select(city_crime_train, -(target))
mdata2 <- melt(mdata)
# Output the boxplot
p <- ggplot(data = mdata2, aes(x=variable, y=value)) + 
  geom_boxplot() + ggtitle("Outliers identification")
p + facet_wrap( ~ variable, scales="free", ncol=5)

library(popbio)
par(mfrow=c(2,3))
logi.hist.plot(x$zn,x$target,logi.mod = 1, type="hist", boxp=FALSE,col="gray", mainlabel = "zn")
logi.hist.plot(x$indus, x$target,logi.mod = 1, type="hist",boxp=FALSE,col="gray", mainlabel = "indus")
logi.hist.plot(x$chas,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "chas")
logi.hist.plot(x$nox, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "nox")
logi.hist.plot(x$rm,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "rm")
logi.hist.plot(x$age, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "age")
logi.hist.plot(x$dis,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "dis")
logi.hist.plot(x$rad, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "rad")
logi.hist.plot(x$tax,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "tax")
logi.hist.plot(x$ptratio, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "ptratio")
logi.hist.plot(x$black,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "black")
logi.hist.plot(x$lstat, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "lstat")
logi.hist.plot(x$medv, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "medv")

# function for removing outliers - http://r-statistics.co/Outlier-Treatment-With-R.html

treat_outliers <- function(x) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
 
return(x)
}

city_crime_train_mod<-city_crime_train
train_test_mod<-train_test

city_crime_train_mod$tax_new <- treat_outliers(city_crime_train$tax)
city_crime_train_mod$medv_new <- treat_outliers(city_crime_train$medv)
city_crime_train_mod$lstat_new <- treat_outliers(city_crime_train$lstat)


train_test_mod$tax_new <- treat_outliers(train_test$tax)
train_test_mod$medv_new <- treat_outliers(train_test$medv)
train_test_mod$lstat_new <- treat_outliers(train_test$lstat)


par(mfrow=c(1,3))

boxplot(city_crime_train_mod$tax_new,main="tax_new")
boxplot(city_crime_train_mod$medv_new,main="medv_new")
boxplot(city_crime_train_mod$lstat_new,main="lstat_new")


city_crime_train_mod$rm_new<-sin(city_crime_train$rm)

city_crime_train_mod$dis_new<-sin(city_crime_train$dis)

train_test_mod$rm_new<-sin(train_test$rm)

train_test_mod$dis_new<-sin(train_test$dis)


par(mfrow=c(1,2))

#boxplot(city_crime_train_mod$rm_new,main="rm_new")
#boxplot(city_crime_train_mod$dis_new,main="dis_new")



show_charts <- function(x, ...) {
    
    par(mfrow=c(2,3))
    
    xlabel <- unlist(str_split(deparse(substitute(x)), pattern = "\\$"))[2]
    ylabel <- unlist(str_split(deparse(substitute(y)), pattern = "\\$"))[2]
    
    hist(x,main=xlabel)
    boxplot(x,main=xlabel)

    y<-log(x)
    boxplot(y,main='log transform')
    y<-sqrt(x)
    boxplot(y,main='sqrt transform')
    y<-sin(x)
    boxplot(y,main='sin transform')
    y<-(x)^(1/9)
    boxplot(y,main='ninth transform')
   
}

# attach file to R so that only variable can be called

attach(city_crime_train)

show_charts(zn)


show_charts(indus)


show_charts(nox)


show_charts(rm)



show_charts(age)

show_charts(dis)


show_charts(rad)



show_charts(tax)



show_charts(ptratio)


show_charts(black)


show_charts(lstat)


show_charts(medv)


city_crime_train_mod$zn_new<-city_crime_train_mod$zn

city_crime_train_mod$zn_new [city_crime_train_mod$zn_new > 0] <- 1
city_crime_train_mod$zn_new [city_crime_train_mod$zn_new== 0] <- 0
#city_crime_train_mod$zn_new<-factor(city_crime_train_mod$zn_new)



train_test_mod$zn_new<-train_test$zn

train_test_mod$zn_new [train_test_mod$zn_new > 0] <- 1
train_test_mod$zn_new [train_test_mod$zn_new== 0] <- 0
#city_crime_train_mod$zn_new<-factor(train_test_mod$zn_new)

#city_crime_train$chas <-factor(city_crime_train$chas)

#city_crime_train$target<-factor(city_crime_train$target)

#city_crime_train_mod$chas <-factor(city_crime_train_mod$chas)

#city_crime_train_mod$target<-factor(city_crime_train_mod$target)





#Correlation <- cor(city_crime_train$rm_new,(as.numeric(city_crime_train$target)))

#Correlation <- cor(city_crime_train$dis_new,(as.numeric(city_crime_train$target)))




model_var <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/model_var.csv")

kable(model_var,caption="Variables used in different models")


model1 <- glm(target ~ ., data = city_crime_train, family = "binomial")

summary(model1)


stepmodel1<- step(model1, direction="backward")


summary(stepmodel1)


model2 <- glm(target ~ .-zn-tax-lstat-medv, data = city_crime_train_mod, family = "binomial")
summary(model2)


stepmodel2<- step(model2, direction="backward")


summary(stepmodel2)



model3=lda(target~.,data=city_crime_train)

model3




model3_mod=lda(target~.-zn-rm-dis-tax-lstat-medv,data=city_crime_train_mod)

model3_mod


#Following function Eval() will be used to calculate various metrics related to the model like Accuracy, Sensitivity, #Precision , Specificity, and F1 score
Eval<-function(x){
 TP<-x$Freq[x$metrics=="TRUE_1"]
 FP<-x$Freq[x$metrics=="FALSE_1"]
 TN<-x$Freq[x$metrics=="FALSE_0"]
 FN<-x$Freq[x$metrics=="TRUE_0"]
 Accuracy <-(TP+TN)/(TP+TN+FP+FN)
  Error_Rate<-(FP+FN)/(TP+TN+FP+FN)
  Precision<-TP/(TP+FP)
  sensitivity<-TP/(TP+FN)
  specificity<-TN/(TN+FP)
  F1_Score=2*Precision*sensitivity/(sensitivity+specificity)
 eval_result<-data.frame(Accuracy=c(0),Error_Rate=c(0),Precision=c(0),sensitivity=c(0),specificity=c(0),F1_Score=c(0))
 
 eval_result[1,1]<-Accuracy
 eval_result[1,2]<-Error_Rate
 eval_result[1,3]<- Precision
 eval_result[1,4]<-sensitivity
 eval_result[1,5]<-specificity
 eval_result[1,6]<-F1_Score
 
 eval_result

 }


#confusion matrix
pre_train1 <- predict(model1, newdata=train_test, type="response")
df_pre_train1<-as.data.frame(table(pre_train1>0.5,train_test$target))
df_pre_train1$metrics <- paste(df_pre_train1$Var1,df_pre_train1$Var2, sep = '_') 

#Eval(df_pre_train1)

model_comparison<-data.frame(Accuracy=c(0),Error_Rate=c(0),Precision=c(0),sensitivity=c(0),specificity=c(0),F1_Score=c(0))
model_comparison[1,]<-Eval(df_pre_train1)
 #model_comparison<-c(1)
 
 #row.names(model_comparison[1,])<-"model 1"

 #AUC 

train_test$pre_train1<-c(pre_train1)
#x<-data.frame(auc(train_test$target, train_test$pre_train1))
model_comparison$AUC<-c(0)
model_comparison[1,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))

kable(model_comparison,row.names = TRUE, caption = " Model 1 evaluation KPIs")

#model_comparison

#confusion matrix

pre_train1_step<-predict(stepmodel1,type="response",newdata=train_test)
df_pre_train1_step<-as.data.frame(table(pre_train1_step>0.5,train_test$target))
df_pre_train1_step$metrics <- paste(df_pre_train1_step$Var1,df_pre_train1_step$Var2, sep = '_') 
#Eval(df_pre_train1_step)
model_comparison[2,]<-Eval(df_pre_train1_step)

#AUC

train_test$pre_train1<-c(pre_train1_step)
model_comparison[2,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))

kable(model_comparison[2,],caption = "Model 2 evaluation KPIs")


#confusion matrix

pre_train2<-predict(model2,type="response",newdata=train_test_mod)

df_pre_train2<-as.data.frame(table(pre_train2>0.5,train_test_mod$target))
df_pre_train2$metrics <- paste(df_pre_train2$Var1,df_pre_train2$Var2, sep = '_') 
#Eval(df_pre_train2)
model_comparison[3,]<-Eval(df_pre_train2)

#AUC

train_test_mod$pre_train1<-c(pre_train2)
model_comparison[3,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))

kable(model_comparison[3,],caption = " Model 3 evaluation KPIs")



#confusion matrix


pre_train2_step<-predict(stepmodel2,type="response",newdata=train_test_mod)

df_pre_train2_step<-as.data.frame(table(pre_train2_step>0.5,train_test_mod$target))
df_pre_train2_step$metrics <- paste(df_pre_train2_step$Var1,df_pre_train2_step$Var2, sep = '_') 
#Eval(df_pre_train2_step)

model_comparison[4,]<-Eval(df_pre_train2_step)

#AUC

train_test_mod$pre_train1<-c(pre_train2_step)
model_comparison[4,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))
kable(model_comparison[4,],caption = " Model 4 evaluation KPIs")



#confusion matrix


pre_train3<-data.frame(predict(model3,type="response",newdata=train_test))

df_pre_train3<-as.data.frame(table(pre_train3$class,train_test$target))
df_pre_train3$metrics <- paste(df_pre_train3$Var1,df_pre_train3$Var2, sep = '_') 

df_pre_train3$metrics[df_pre_train3$metrics=="0_0"]<-"FALSE_0"
df_pre_train3$metrics[df_pre_train3$metrics=="1_0"]<-"TRUE_0"
df_pre_train3$metrics[df_pre_train3$metrics=="0_1"]<-"FALSE_1"
df_pre_train3$metrics[df_pre_train3$metrics=="1_1"]<-"TRUE_1"


#Eval(df_pre_train3)

model_comparison[5,]<-Eval(df_pre_train3)

#AUC
train_test$pre_train1<-c(pre_train3$posterior.1)
model_comparison[5,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))
kable(model_comparison[5,],caption = " Model 5 evaluation KPIs")


#confusion matrix


pre_train3_mod<-data.frame(predict(model3_mod,type="response",newdata=train_test_mod))

df_pre_train3_mod<-data.frame(table(pre_train3_mod$class,train_test_mod$target))
df_pre_train3_mod$metrics <- paste(df_pre_train3_mod$Var1,df_pre_train3_mod$Var2, sep = '_') 

df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="0_0"]<-"FALSE_0"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="1_0"]<-"TRUE_0"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="0_1"]<-"FALSE_1"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="1_1"]<-"TRUE_1"

#Eval(df_pre_train3_mod)

model_comparison[6,]<-Eval(df_pre_train3_mod)

#AUC

train_test_mod$pre_train1<-c(pre_train3_mod$posterior.1)
model_comparison[6,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))


kable(model_comparison[6,],caption = " Model 6 evaluation KPIs")

model_comparison$Model_No<-c(1:6)
kable(model_comparison[,c("Model_No","Accuracy","Error_Rate","AUC","Precision","sensitivity","specificity","F1_Score")],
      caption = "Model Performance Metrics Comparison")


summary(model1)


exp(cbind(OR = coef(model1), confint.default(model1)))


plot (city_crime_train$nox,city_crime_train$target, main="nox vs target")
abline(v=0.425)
abline(v=0.635)

#AUC
pred <- prediction(pre_train1, train_test$target)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]

roc.data <- data.frame(fpr=unlist(perf@x.values),
                       tpr=unlist(perf@y.values),
                       model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
    geom_ribbon(alpha=0.2) +
    geom_line(aes(y=tpr)) +
    ggtitle(paste0("ROC Curve w/ AUC=", auc))

plot_pred_type_distribution <- function(df, threshold) {
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pre_train1 >= threshold & df$target == 1, "TP", v)
  v <- ifelse(df$pre_train1 >= threshold & df$target == 0, "FP", v)
  v <- ifelse(df$pre_train1 < threshold & df$target == 1, "FN", v)
  v <- ifelse(df$pre_train1 < threshold & df$target == 0, "TN", v)
  
  df$pred_type <- v
  
  ggplot(data=df, aes(x=target, y=pre_train1)) + 
    geom_violin(fill=rgb(1,1,1,alpha=0.6), color=NA) + 
    geom_jitter(aes(color=pred_type), alpha=0.6) +
    geom_hline(yintercept=threshold, color="red", alpha=0.6) +
    scale_color_discrete(name = "type") +
    labs(title=sprintf("Threshold at %.2f", threshold))
}

train_test$pre_train1 <- predict(model1, newdata=train_test, type="response")

plot_pred_type_distribution (train_test,0.5)

#city_crime_test$chas<-factor(city_crime_test$chas)

Predict_final<-predict(model1,type="response", newdata=city_crime_test)

kable(data.frame(table(Predict_final>0.5)),,caption="Outcome on evaluation data set")

#code=readLines(knitr::purl('https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW1/HW3_Final_03.Rmd', #documentation = 0)), eval = FALSE}

Binary Logistic Regression: Crime Prediction

Simon Elmoudni