The data set contains approximately 466 records and 14 variables. Each record contains crime-related information for a neighborhood of a major city, along with a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
The objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. In addition, we will provide classifications and probabilities for the evaluation data set using the binary logistic regression model.
In this section we will explore and gain some insights into the dataset by pursuing the following high-level steps and inquiries:
- Variable identification
- Variable relationships
- Data summary analysis
- Outliers and missing values identification
First, let's display and examine the data dictionary (the data columns), as shown in Table 1.
Variable | Description | Datatype | Role |
---|---|---|---|
zn | proportion of residential land zoned for large lots (over 25000 square feet) | numeric | predictor |
indus | proportion of non-retail business acres per suburb | numeric | predictor |
chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) | binary | predictor |
nox | nitrogen oxides concentration (parts per 10 million) | numeric | predictor |
rm | average number of rooms per dwelling | numeric | predictor |
age | proportion of owner-occupied units built prior to 1940 | numeric | predictor |
dis | weighted mean of distances to five Boston employment centers | numeric | predictor |
rad | index of accessibility to radial highways | integer | predictor |
tax | full-value property-tax rate per $10,000 | integer | predictor |
ptratio | pupil-teacher ratio by town | numeric | predictor |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | numeric | predictor |
lstat | lower status of the population (percent) | numeric | predictor |
medv | median value of owner-occupied homes in $1000s | numeric | predictor |
target | whether the crime rate is above the median crime rate (1) or not (0) | binary | response |
We notice that all variables are numeric except for two variables: the response variable “target” which is binary and the predictor variable “chas” which is a dummy binary variable indicating whether the suburb borders the Charles River (1) or not (0).
Based on the original dataset, our predictor input consists of 13 variables, and our response is a single variable called target.
The variables do not appear to have any arithmetic relationships with one another; in other words, there are no symmetry or transitivity relationships between any two variables in the independent variable set.
In addition, since this is logistic regression, we will be making the following assumptions about the variables:

- The dependent variable need not be normally distributed.
- Errors need to be independent, but need not be normally distributed.
- We will be using GLM, and GLM does not assume a linear relationship between the dependent and independent variables. However, the logit model does assume a linear relationship between the link function and the independent variables.
- GLM does not use OLS (ordinary least squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE), as sketched below.
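A minimal sketch of this last point is shown below: R's glm() with a binomial family fits the coefficients by maximum likelihood (via iteratively reweighted least squares) rather than by OLS. The data frame name city_crime_train matches the one used in the appendix code, and this is essentially the same call used later for Model 1.

```r
# Minimal sketch: logistic regression is fit by MLE via glm(), not by OLS via lm().
# Assumes a data frame city_crime_train with a binary 'target' column, as in the appendix code.
fit <- glm(target ~ ., data = city_crime_train, family = binomial(link = "logit"))
summary(fit)   # z-tests on the MLE coefficient estimates
logLik(fit)    # maximized log-likelihood found by the fitting routine
```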
In this section, we will create summary data to better understand the initial relationships the variables have with our dependent variable, using correlation, central tendency, and dispersion, as shown in Table 2.
Now we will produce the correlation table between the independent variables and the dependent variable.
Variable | Correlation with target |
---|---|
target | 1.0000000 |
nox | 0.7290920 |
rad | 0.6307187 |
age | 0.6275762 |
indus | 0.6034795 |
tax | 0.6021403 |
lstat | 0.4808888 |
ptratio | 0.2198922 |
chas | 0.0579716 |
rm | -0.1605913 |
medv | -0.2724789 |
black | -0.3463425 |
zn | -0.4239382 |
dis | -0.6167264 |
Correlation analysis suggests that there are strong positive and negative correlations between the independent variables and the dependent variable. For instance, we notice a strong correlation of 0.73 between the concentration of nitrogen oxides and the crime rate being above average. We will need to investigate this correlation further, as it is not obvious that the concentration of nitrogen oxides would result in a high crime rate; perhaps it affects the crime rate indirectly by affecting other independent variables that we may or may not have in our data set.
In addition, we noticed that accessibility to radial highways also has a strong correlation with the crime rate being above average. Again, we will investigate this correlation. We also noticed that unit or house age, property tax, and non-retail businesses have positive correlations with the crime rate being above average.
It is also worth noting that the distance to five Boston employment centers, large residential lots, the proportion of blacks by town, the median value of owner-occupied homes, and the average number of rooms per dwelling all have negative correlations with the crime rate being above the median. In other words, the farther people are from the five Boston employment centers, the more likely the crime rate is to be below the median.
As per Table 3 below, we see that we have no missing values, which is a good thing as we don't have to carry out any imputation tasks.
Variable | Missing Values |
---|---|
zn | 0 |
indus | 0 |
chas | 0 |
nox | 0 |
rm | 0 |
age | 0 |
dis | 0 |
rad | 0 |
tax | 0 |
ptratio | 0 |
black | 0 |
lstat | 0 |
medv | 0 |
target | 0 |
Also, as per Table 4 below, which shows the number of unique values per variable, we can confirm that our target variable is binary as expected.
Variable | Unique Values |
---|---|
zn | 26 |
indus | 73 |
chas | 2 |
nox | 79 |
rm | 419 |
age | 333 |
dis | 380 |
rad | 9 |
tax | 63 |
ptratio | 46 |
black | 331 |
lstat | 424 |
medv | 218 |
target | 2 |
In this section, univariate analysis is carried out and boxplot diagrams are used to identify the outliers in each variable and to decide whether to act on them.
From the "Outliers identification" plot above, we see that we have a few outliers that need to be treated. We see that zn (residential land zoned), rm (average number of rooms per dwelling), dis (weighted mean of distances to five Boston employment centers), black (the proportion of blacks by town), lstat (lower status of the population), and medv (median value of owner-occupied homes in $1000s) all need to be treated.
In this section, we will investigate how our initial data aligns with a typical logistic model plot.
Recall that logistic regression is part of a larger class of algorithms known as Generalized Linear Models (GLM). The fundamental equation of the generalized linear model is: \(g(E(y)) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \dots\)
where \(g()\) is the link function, \(E(y)\) is the expectation of the target variable, and \(\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3\) is the linear predictor (the coefficients \(\beta_0, \beta_1, \beta_2, \beta_3\) are to be estimated). The role of the link function is to 'link' the expectation of y to the linear predictor.
In logistic regression, we are only concerned with the probability of the outcome of the dependent variable (success or failure). As described above, g() is the link function. This function is established using two things: the probability of success (p) and the probability of failure (1 - p). p should meet the following criteria: it must always be positive (since p >= 0), and it must always be less than or equal to 1 (since p <= 1).
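As a minimal sketch of these criteria, the inverse of the logit link maps any value of the linear predictor to a probability strictly between 0 and 1; the object names eta and p below are purely illustrative.

```r
# Minimal sketch: the inverse logit (logistic) function keeps probabilities in (0, 1).
eta <- seq(-10, 10, by = 0.1)   # hypothetical values of the linear predictor
p   <- 1 / (1 + exp(-eta))      # equivalently plogis(eta)
range(p)                        # always strictly between 0 and 1
plot(eta, p, type = "l", main = "Inverse logit link")
```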
Now let's investigate how our initial data aligns with the above criteria. In other words, we will plot a logistic regression plot for each variable and compare it to a typical logistic model plot:
You can see clearly that the probability of crime being above average increases as we get closer to the "1" classification for the indus, nox, age, rad, tax, and lstat variables. In the middle, the probability changes at the highest rate, while it tails off at each end in order to stay bounded between 0 and 1.
You can see clearly that the probability of crime being above average decreases as we get closer to the "1" classification for the zn, dis, black, and medv variables. In the middle, the probability changes at the lowest rate. However, it does not tail off at each end for all of these variables.
Now that we have completed the preliminary analysis, we will be cleaning and consolidating data into one dataset for use in analysis and modeling. We will be following the below steps as guidelines:
- Outliers treatment
- Missing values treatment
- Data transformation
For outliers, we will create two sets of variables.
The first set uses the capping method. In this method, we replace all outliers that lie outside 1.5 times the IQR limits: observations below the lower limit are replaced with the value of the 5th percentile, and those above the upper limit with the value of the 95th percentile, as sketched below.
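A minimal sketch of this capping rule is shown below; it mirrors the treat_outliers() helper in the appendix code, and the column name tax is used only as an example.

```r
# Minimal sketch: cap values outside Q1 - 1.5*IQR / Q3 + 1.5*IQR at the 5th / 95th percentiles.
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)   # Q1 and Q3
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)   # replacement values
  h    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - h)] <- caps[1]   # low outliers -> 5th percentile
  x[x > (qnt[2] + h)] <- caps[2]   # high outliers -> 95th percentile
  x
}
# Example usage on the tax column of the training data:
# city_crime_train_mod$tax_new <- cap_outliers(city_crime_train$tax)
```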
Accordingly, we create the following new variables while retaining the original variables: tax_new, medv_new, and lstat_new in city_crime_train_mod, derived from tax, medv, and lstat.
Let's see how the new variables look in boxplots.
In the second set, we will use the sin transformation and create the following variables: rm_new and dis_new in city_crime_train_mod.
In this section, we will analyze a few transformation options (sin, log, sqrt, and nth-root transformations), using histograms and boxplots to evaluate the best transformation to handle outliers.
First, we will start with the variable zn (proportion of residential land zoned for large lots), as shown below.
zn Transformation
The logit function has been used in building the models in this assignment. Transformations such as sin, sqrt, nth-root, etc. bring additional complexity when interpreting a logit model. For that reason, such transformations have generally not been used for the variables, as normality is not a criterion for the logit model.
The following are some of the transformations of the variables that have been used here:
1. zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable). For zn, we can see that there are a large number of values equal to 0. One option is to transform zn into a binary variable: values of 0 remain 0, and all other values are bucketed to 1 (see the sketch after this list).
2. chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable). This variable will be converted to a factor with values of 0 and 1 in the model.
3. target: whether the crime rate is above the median crime rate (1) or not (0) (response variable). The target variable will also be converted to a factor variable before being used in the model.
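A minimal sketch of these three transformations is shown below; the names zn_new and city_crime_train_mod follow the appendix code, and the factor() conversions are written out as described in the text even though the appendix leaves some of them commented out.

```r
# Minimal sketch of the transformations described above (assumes city_crime_train_mod
# as created in the appendix code).
# 1. zn: bucket into a binary indicator (0 stays 0, anything above 0 becomes 1)
city_crime_train_mod$zn_new <- ifelse(city_crime_train_mod$zn > 0, 1, 0)
# 2. chas and 3. target: treat as factors when used in a model
city_crime_train_mod$chas   <- factor(city_crime_train_mod$chas)
city_crime_train_mod$target <- factor(city_crime_train_mod$target)
```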
In this section, we will create six models. Aside from using original and transformed data, we will also use different methods and functions, such as linear discriminant analysis, the step function, and the logit function, to enhance our models. Below are our model definitions:
- Model 1: This model will be created using the original variables in the train data set with the logit function (GLM).
- Model 2: This model will be created using the original variables, applying the backward step function to the GLM fit.
- Model 3: This model will be created using the transformed variables with the GLM function.
- Model 4: This model will be created using the transformed variables, applying the backward step function to the GLM fit.
- Model 5: This model will be created using the original variables with the linear discriminant analysis function lda in the MASS package.
- Model 6: This model will be created using the transformed variables with linear discriminant analysis.
Below is a summary table showing models and their respective variables.
Variables | Model.1 | Model.2 | Model.3 | Model.4 | Model.5 | Model.6 |
---|---|---|---|---|---|---|
zn | y | y | y | y | ||
indus | y | y | y | y | y | y |
chas | y | y | y | y | y | y |
nox | y | y | y | y | y | y |
rm | y | y | y | y | y | y |
age | y | y | y | y | y | y |
dis | y | y | y | y | y | y |
rad | y | y | y | y | y | y |
tax | y | y | y | y | ||
ptratio | y | y | y | y | y | y |
black | y | y | y | y | y | y |
lstat | y | y | y | y | ||
medv | y | y | y | y | ||
tax_new | | | y | y | | y |
medv_new | | | y | y | | y |
lstat_new | | | y | y | | y |
zn_new | | | y | y | | y |
In this model, we will be using all the given variables in the train data set. We will create the model using the logit function and highlight the summary of the model below.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = city_crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8791 -0.1299 -0.0025 0.0011 3.4785
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -41.462153 8.250799 -5.025 5.03e-07 ***
## zn -0.060580 0.039153 -1.547 0.121799
## indus -0.063885 0.059335 -1.077 0.281618
## chas 0.789391 0.865818 0.912 0.361912
## nox 53.413503 10.013666 5.334 9.60e-08 ***
## rm -0.647942 0.904430 -0.716 0.473739
## age 0.028835 0.015680 1.839 0.065915 .
## dis 0.800917 0.268877 2.979 0.002894 **
## rad 0.721751 0.195662 3.689 0.000225 ***
## tax -0.007065 0.003490 -2.024 0.042948 *
## ptratio 0.440768 0.159366 2.766 0.005679 **
## black -0.009591 0.006025 -1.592 0.111412
## lstat 0.096941 0.062429 1.553 0.120469
## medv 0.236940 0.091276 2.596 0.009436 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 140.71 on 358 degrees of freedom
## AIC: 168.71
##
## Number of Fisher Scoring iterations: 9
\(\newline\) (i) Based on the outcome, it can be seen that zn, indus, chas, rm, black, and lstat are not statistically significant, and age is only marginally significant.
(ii) As for the statistically significant variables, nox has the lowest p-value, suggesting a strong association between nox and the target variable. Other important variables are dis, rad, tax, ptratio, and medv. The AIC value for Model 1 is 168.71.
In this model, we will be using the original variables, but applying the backward step function to the GLM fit.
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## black + lstat + medv, family = "binomial", data = city_crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9258 -0.1459 -0.0024 0.0013 3.3934
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.282116 7.705519 -5.098 3.43e-07 ***
## zn -0.064656 0.037414 -1.728 0.083964 .
## nox 46.617168 8.074920 5.773 7.78e-09 ***
## age 0.025273 0.013545 1.866 0.062065 .
## dis 0.710480 0.249767 2.845 0.004447 **
## rad 0.775881 0.182072 4.261 2.03e-05 ***
## tax -0.009144 0.003082 -2.967 0.003011 **
## ptratio 0.359297 0.135081 2.660 0.007817 **
## black -0.008384 0.005737 -1.462 0.143871
## lstat 0.110624 0.055650 1.988 0.046829 *
## medv 0.181460 0.053572 3.387 0.000706 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 142.85 on 361 degrees of freedom
## AIC: 164.85
##
## Number of Fisher Scoring iterations: 9
\(\newline\) (i) It can be seen that zn, age, and black are not statistically significant.
(ii) As for the statistically significant variables, nox has the lowest p-value, suggesting a strong association between nox and the target variable. Other important variables are dis, rad, tax, ptratio, medv, and lstat. The AIC value for Model 2 is 164.85.
In this model, we will be using transformed variables with the logit function GLM.
##
## Call:
## glm(formula = target ~ . - zn - tax - lstat - medv, family = "binomial",
## data = city_crime_train_mod)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7883 -0.1410 -0.0026 0.0005 3.3645
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -68.319369 16.418997 -4.161 3.17e-05 ***
## indus -0.001867 0.067017 -0.028 0.977778
## chas 0.366993 0.849076 0.432 0.665577
## nox 56.080643 10.147964 5.526 3.27e-08 ***
## rm 2.995884 2.385419 1.256 0.209147
## age 0.043435 0.018166 2.391 0.016805 *
## dis 0.472036 0.331312 1.425 0.154231
## rad 0.838409 0.237364 3.532 0.000412 ***
## ptratio 0.468316 0.176293 2.656 0.007896 **
## black -0.010739 0.005922 -1.813 0.069782 .
## tax_new -0.005285 0.003663 -1.443 0.149151
## medv_new 0.283102 0.106228 2.665 0.007698 **
## lstat_new 0.050027 0.074958 0.667 0.504515
## rm_new -5.052053 2.830695 -1.785 0.074304 .
## dis_new -1.886385 0.552223 -3.416 0.000636 ***
## zn_new -0.363834 1.036508 -0.351 0.725574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 124.11 on 356 degrees of freedom
## AIC: 156.11
##
## Number of Fisher Scoring iterations: 9
\(\newline\) (i) From this model, it can be seen that the following variables are relevant: nox, age, rad, ptratio, medv_new, and dis_new.
(ii) nox and rad are the two most important variables. The new variables tax_new, lstat_new, and zn_new have only a minor impact on the model.
The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable, as illustrated in the sketch below.
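As a minimal sketch of this interpretation, exponentiating a fitted coefficient converts the change in log odds into an odds ratio. The object name model2 below follows the appendix code, where it holds the transformed-variable GLM that this report calls Model 3; any fitted binomial glm would work the same way.

```r
# Minimal sketch: convert log-odds coefficients into odds ratios.
log_odds    <- coef(model2)    # change in log odds per one-unit increase in each predictor
odds_ratios <- exp(log_odds)   # multiplicative change in the odds per one-unit increase
cbind(log_odds, odds_ratios)
```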
In this model, we will be using the transformed variables and applying the backward step function to the GLM fit.
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + ptratio + black +
## tax_new + medv_new + rm_new + dis_new, family = "binomial",
## data = city_crime_train_mod)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0158 -0.1472 -0.0031 0.0005 3.1030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -52.779764 9.739144 -5.419 5.98e-08 ***
## nox 56.509319 9.188179 6.150 7.74e-10 ***
## age 0.051467 0.016215 3.174 0.001503 **
## dis 0.564992 0.255943 2.207 0.027280 *
## rad 0.849127 0.212643 3.993 6.52e-05 ***
## ptratio 0.533319 0.159365 3.347 0.000818 ***
## black -0.010960 0.005943 -1.844 0.065147 .
## tax_new -0.004534 0.003144 -1.442 0.149355
## medv_new 0.342778 0.095427 3.592 0.000328 ***
## rm_new -2.358513 1.028472 -2.293 0.021835 *
## dis_new -1.865533 0.488896 -3.816 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 126.80 on 361 degrees of freedom
## AIC: 148.8
##
## Number of Fisher Scoring iterations: 9
\(\newline\) (i) From this model, it can be seen that the following variables are relevant: nox, age, dis, rad, ptratio, medv_new, rm_new, and dis_new.
(ii) The number of Fisher scoring iterations is 9, and the AIC value is 148.8.
In this model, we will be using the original variables with the linear discriminant analysis function lda in the MASS package.
## Call:
## lda(target ~ ., data = city_crime_train)
##
## Prior probabilities of groups:
## 0 1
## 0.5268817 0.4731183
##
## Group means:
## zn indus chas nox rm age dis
## 0 22.012755 6.956327 0.05102041 0.4689730 6.401296 50.37398 5.086538
## 1 1.613636 15.291193 0.07954545 0.6428523 6.176631 86.38864 2.459868
## rad tax ptratio black lstat medv
## 0 4.107143 308.4949 17.76990 388.6647 9.199235 25.18724
## 1 14.880682 509.6932 18.74773 327.2894 15.959148 20.24148
##
## Coefficients of linear discriminants:
## LD1
## zn -0.0047914631
## indus 0.0281044279
## chas -0.0556293189
## nox 7.9109306913
## rm 0.1658180998
## age 0.0131973114
## dis 0.0840623852
## rad 0.1027832012
## tax -0.0019152605
## ptratio 0.0090391049
## black -0.0009160458
## lstat 0.0248449648
## medv 0.0425514709
\(\newline\) (i) The classification boundary equation for our Model 5 is below:
\(-0.004 * zn + 0.0281 * indus - 0.055 * chas + 7.910 * nox + 0.165 * rm + 0.013 * age + 0.084 * dis + 0.102 * rad - 0.001 * tax + 0.009 * ptratio - 0.0009 * black + 0.024 * lstat + 0.042 * medv =0\)
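As a minimal sketch of how this discriminant is used for classification, MASS::lda() returns an object whose predict() method yields both the predicted class and the posterior probabilities; note that the appendix code stores this fit under the name model3 even though the report refers to it as Model 5.

```r
# Minimal sketch: fit an LDA classifier with MASS and score the hold-out split.
library(MASS)
lda_fit  <- lda(target ~ ., data = city_crime_train)
lda_pred <- predict(lda_fit, newdata = train_test)
head(lda_pred$class)       # predicted class (0/1) for each hold-out record
head(lda_pred$posterior)   # posterior probabilities, later used for the AUC
```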
In this model, we will be using the transformed variables with linear discriminant analysis.
## Call:
## lda(target ~ . - zn - rm - dis - tax - lstat - medv, data = city_crime_train_mod)
##
## Prior probabilities of groups:
## 0 1
## 0.5268817 0.4731183
##
## Group means:
## indus chas nox age rad ptratio black
## 0 6.956327 0.05102041 0.4689730 50.37398 4.107143 17.76990 388.6647
## 1 15.291193 0.07954545 0.6428523 86.38864 14.880682 18.74773 327.2894
## tax_new medv_new lstat_new rm_new dis_new zn_new
## 0 308.4949 25.04528 9.199235 0.08333182 -0.0504096 0.46938776
## 1 509.6932 19.86151 15.724247 -0.11166891 0.5106930 0.07954545
##
## Coefficients of linear discriminants:
## LD1
## indus 0.022452946
## chas -0.186416323
## nox 7.970446650
## age 0.015169354
## rad 0.100159450
## ptratio -0.014404341
## black -0.001159202
## tax_new -0.001196341
## medv_new 0.047596449
## lstat_new 0.016840318
## rm_new -0.008946209
## dis_new -0.340985994
## zn_new -0.001832533
\(\newline\) (i) The classification boundary equation for our Model 6 is below:
\(0.022 * indus - 0.186 * chas + 7.970 * nox + 0.015 * age + 0.100 * rad - 0.014 * ptratio - 0.001 * black - 0.001 * tax\_new + 0.047 * medv\_new + 0.016 * lstat\_new - 0.008 * rm\_new - 0.341 * dis\_new - 0.001 * zn\_new = 0\)
In this section we will further examine all six models. We will follow the below steps to select our final model:
- Model selection strategy
- Model evaluation
- Final model selection
- Inference of the final model
The below model selection strategy will be used in our model evaluation: each model is scored on the 20% hold-out portion of the training data using accuracy, error rate, precision, sensitivity, specificity, F1 score, and AUC, and the best-performing model is then selected.
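For reference, these metrics are computed from the confusion matrix counts (TP, FP, TN, FN) obtained at a given probability cutoff, using the standard definitions: \(Accuracy = \frac{TP+TN}{TP+TN+FP+FN}\), \(Error\ Rate = \frac{FP+FN}{TP+TN+FP+FN}\), \(Precision = \frac{TP}{TP+FP}\), \(Sensitivity = \frac{TP}{TP+FN}\), \(Specificity = \frac{TN}{TN+FP}\), and \(F1 = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity}\).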
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
1 | 0.9042553 | 0.0957447 | 0.9245283 | 0.9074074 | 0.9 | 0.9283174 | 0.9549011 |
From the key metrics table above, we can see that the model has a high accuracy of 0.9042553 and a low error rate of 0.0957447. The AUC for this model is 0.9549011, which is very good: AUC ranges between 0 and 1, and the closer it is to 1, the better the model outcome.
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
2 | 0.8723404 | 0.1276596 | 0.9056604 | 0.8727273 | 0.8717949 | 0.9061444 | 0.9553613 |
From the key metrics table above, we can see that the model has a high accuracy of 0.8723404 and a low error rate of 0.1276596. The AUC for this model is 0.9553613, which is very good.
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
3 | 0.893617 | 0.106383 | 0.9245283 | 0.8909091 | 0.8974359 | 0.9211541 | 0.9677865 |
Looking at the key metrics, it can be concluded that this model has a high accuracy of 0.893617 and a low error rate of 0.106383. The AUC for this model is 0.9677865, which is very good.
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
4 | 0.8829787 | 0.1170213 | 0.9056604 | 0.8888889 | 0.875 | 0.9127916 | 0.9687069 |
Looking at the key metrics, it can be concluded that this model has a high accuracy of 0.8829787 and a low error rate of 0.1170213. The AUC for this model is 0.9687069, which is very good.
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
5 | 0.8297872 | 0.1702128 | 0.7358491 | 0.9512195 | 0.7358491 | 0.8297872 | 0.9263691 |
Looking at the key metrics, it can be concluded that this model has a high accuracy of 0.8297872 and a low error rate of 0.1702128. The AUC for this model is 0.9263691, which is very good.
Model | Accuracy | Error_Rate | Precision | sensitivity | specificity | F1_Score | AUC |
---|---|---|---|---|---|---|---|
6 | 0.8297872 | 0.1702128 | 0.7358491 | 0.9512195 | 0.7358491 | 0.8297872 | 0.9406351 |
Looking at the key metrics, it can be concluded that this model has a high accuracy of 0.8297872 and a low error rate of 0.1702128. The AUC for this model is 0.9406351, which is very good.
Following is a comparison of various metrics for the above six models.
Model_No | Accuracy | Error_Rate | AUC | Precision | sensitivity | specificity | F1_Score |
---|---|---|---|---|---|---|---|
1 | 0.9042553 | 0.0957447 | 0.9549011 | 0.9245283 | 0.9074074 | 0.9000000 | 0.9283174 |
2 | 0.8723404 | 0.1276596 | 0.9553613 | 0.9056604 | 0.8727273 | 0.8717949 | 0.9061444 |
3 | 0.8936170 | 0.1063830 | 0.9677865 | 0.9245283 | 0.8909091 | 0.8974359 | 0.9211541 |
4 | 0.8829787 | 0.1170213 | 0.9687069 | 0.9056604 | 0.8888889 | 0.8750000 | 0.9127916 |
5 | 0.8297872 | 0.1702128 | 0.9263691 | 0.7358491 | 0.9512195 | 0.7358491 | 0.8297872 |
6 | 0.8297872 | 0.1702128 | 0.9406351 | 0.7358491 | 0.9512195 | 0.7358491 | 0.8297872 |
From the comparison table, we see that Model 1 and Model 3 are very close in terms of accuracy and AUC. Model 1 has a slightly better accuracy rate of 90.42% compared to Model 3's 89.36%, while Models 3 and 4 have the best AUC values (0.9678 and 0.9687, respectively), close to Model 1's value.
The AUC summarizes how well the model ranks observations (i.e., the probability of correctly identifying the patterns) across all possible cutoff values, while accuracy is calculated at a specific cutoff value. For this assignment we will go with a cutoff value of 0.5 and choose Model 1 based on its accuracy for further prediction on the evaluation data set, as sketched below.
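A minimal sketch of scoring the hold-out data at that 0.5 cutoff is shown below; model1 and train_test are the fitted Model 1 and the hold-out split from the appendix code.

```r
# Minimal sketch: classify hold-out observations at a 0.5 probability cutoff.
probs <- predict(model1, newdata = train_test, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
mean(preds == train_test$target)                       # accuracy at the 0.5 cutoff
table(predicted = preds, actual = train_test$target)   # confusion matrix
```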
The following analysis will be carried out on the final model:
(i) Relevant variables in the model
(ii) Estimated confidence intervals for the coefficients
(iii) Odds ratios and 95% CI
(iv) AUC curve
(v) Distribution of predictions
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = city_crime_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8791 -0.1299 -0.0025 0.0011 3.4785
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -41.462153 8.250799 -5.025 5.03e-07 ***
## zn -0.060580 0.039153 -1.547 0.121799
## indus -0.063885 0.059335 -1.077 0.281618
## chas 0.789391 0.865818 0.912 0.361912
## nox 53.413503 10.013666 5.334 9.60e-08 ***
## rm -0.647942 0.904430 -0.716 0.473739
## age 0.028835 0.015680 1.839 0.065915 .
## dis 0.800917 0.268877 2.979 0.002894 **
## rad 0.721751 0.195662 3.689 0.000225 ***
## tax -0.007065 0.003490 -2.024 0.042948 *
## ptratio 0.440768 0.159366 2.766 0.005679 **
## black -0.009591 0.006025 -1.592 0.111412
## lstat 0.096941 0.062429 1.553 0.120469
## medv 0.236940 0.091276 2.596 0.009436 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 514.63 on 371 degrees of freedom
## Residual deviance: 140.71 on 358 degrees of freedom
## AIC: 168.71
##
## Number of Fisher Scoring iterations: 9
Following are the most relevant variables for the model: nox, dis, rad, tax, ptratio, and medv.
We can write the equation of Model 1 (restricted to the most relevant variables) as:
\(\log\left(\frac{p}{1-p}\right) = -41.462 + 53.41 * nox + 0.80 * dis + 0.72 * rad - 0.007 * tax + 0.44 * ptratio + 0.24 * medv\), where \(p\) is the probability that the crime rate is above the median.
## OR 2.5 % 97.5 %
## (Intercept) 9.844998e-19 9.335179e-26 1.038266e-11
## zn 9.412183e-01 8.716922e-01 1.016290e+00
## indus 9.381125e-01 8.351201e-01 1.053807e+00
## chas 2.202054e+00 4.034991e-01 1.201748e+01
## nox 1.574670e+23 4.715650e+14 5.258208e+31
## rm 5.231212e-01 8.886894e-02 3.079319e+00
## age 1.029255e+00 9.981051e-01 1.061378e+00
## dis 2.227583e+00 1.315121e+00 3.773133e+00
## rad 2.058033e+00 1.402507e+00 3.019950e+00
## tax 9.929600e-01 9.861908e-01 9.997758e-01
## ptratio 1.553900e+00 1.137027e+00 2.123615e+00
## black 9.904547e-01 9.788273e-01 1.002220e+00
## lstat 1.101795e+00 9.749020e-01 1.245205e+00
## medv 1.267365e+00 1.059759e+00 1.515641e+00
The following points can be made for the important variables in the model:
Keeping all other variables the same, the odds of the crime rate being above the median change as follows: they are multiplied by roughly 0.94 per unit change in indus, 2.23 per unit change in dis, 2.06 per unit change in rad, 1.55 per unit change in ptratio, and 1.27 per unit change in medv. Any odds ratio below 1 means there is a lower chance of the event with a per-unit increase in that variable.
Please note that outlier treatment for the nox variable would result in roughly a 50% reduction of the training data set; however, for this assignment we have kept this variable as is.
Considering that the target has value 1 when the crime rate is above the median and 0 when it is below, the above plot illustrates the tradeoff involved in choosing a reasonable threshold: if the threshold is increased, the number of false positive (FP) results is lowered, while the number of false negative (FN) results increases.
In this section, Model 1 has been used to predict the outcome on the evaluation dataset.
Var1 | Freq |
---|---|
FALSE | 19 |
TRUE | 21 |
Based on the outcome, we can conclude that in the evaluation data set of 40 records, 21 records are predicted to have a crime rate above the median and 19 records below the median.
if (!require("ggplot2",character.only = TRUE)) (install.packages("ggplot2",dep=TRUE))
if (!require("MASS",character.only = TRUE)) (install.packages("MASS",dep=TRUE))
if (!require("knitr",character.only = TRUE)) (install.packages("knitr",dep=TRUE))
if (!require("xtable",character.only = TRUE)) (install.packages("xtable",dep=TRUE))
if (!require("dplyr",character.only = TRUE)) (install.packages("dplyr",dep=TRUE))
if (!require("psych",character.only = TRUE)) (install.packages("psych",dep=TRUE))
if (!require("stringr",character.only = TRUE)) (install.packages("stringr",dep=TRUE))
#if (!require("car",character.only = TRUE)) (install.packages("car",dep=TRUE))
if (!require("faraway",character.only = TRUE)) (install.packages("faraway",dep=TRUE))
library(ggplot2)
library(MASS)
library(knitr)
library(xtable)
library(dplyr)
library(psych)
library(stringr)
#library(car)
library(faraway)
library(aod)
library(Rcpp)
library(leaps)
library(ISLR)
library(AUC)
library(ROCR)
library(Amelia)
library(pROC)
city_crime_train_full <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/crime-training-data.csv")
variables<- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/variables2.csv")
kable(variables, caption = "Variable Description")
city_crime_test <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/crime-evaluation-data.csv")
#str(city_crime_test)
smp_size <- floor(0.80 * nrow(city_crime_train_full))
## set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(city_crime_train_full)), size = smp_size)
city_crime_train<- city_crime_train_full[train_ind, ]
train_test <- city_crime_train_full[-train_ind, ]
ds_stats <- psych::describe(city_crime_train, skew = TRUE, na.rm = TRUE)
#ds_stats
kable(ds_stats[1:7], caption= "Data Summary")
kable(ds_stats[8:13], caption= "Data Summary (Cont)")
fun1 <- function(a, y) cor(y, a)
x<-city_crime_train[,]
Correlation <- sapply(x, FUN = fun1, y=city_crime_train$target)
#kable(data.frame(Correlation), caption = "Correlation between target and predictor variable")
Correlation <- sort(Correlation, decreasing = TRUE)
kable(Correlation, caption = "Variable Correlation")
missings<- sapply(city_crime_train_full,function(x) sum(is.na(x)))
kable(missings, caption = "Missing Values")
#missmap(city_crime_train_full, main = "Missing values vs observed")
### finding unique values
### % break up of target variable
#prop.table(table(city_crime_train_full$target))
uniques<- sapply(city_crime_train_full, function(x) length(unique(x)))
kable(uniques, caption = "Unique Values")
library(ggplot2)
library(reshape2)
#
mdata<- select(city_crime_train, -(target))
mdata2 <- melt(mdata)
# Output the boxplot
p <- ggplot(data = mdata2, aes(x=variable, y=value)) +
geom_boxplot() + ggtitle("Outliers identification")
p + facet_wrap( ~ variable, scales="free", ncol=5)
library(popbio)
par(mfrow=c(2,3))
logi.hist.plot(x$zn,x$target,logi.mod = 1, type="hist", boxp=FALSE,col="gray", mainlabel = "zn")
logi.hist.plot(x$indus, x$target,logi.mod = 1, type="hist",boxp=FALSE,col="gray", mainlabel = "indus")
logi.hist.plot(x$chas,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "chas")
logi.hist.plot(x$nox, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "nox")
logi.hist.plot(x$rm,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "rm")
logi.hist.plot(x$age, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "age")
logi.hist.plot(x$dis,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "dis")
logi.hist.plot(x$rad, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "rad")
logi.hist.plot(x$tax,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "tax")
logi.hist.plot(x$ptratio, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "ptratio")
logi.hist.plot(x$black,x$target,logi.mod = 1,boxp=FALSE,type="hist",col="gray", mainlabel = "black")
logi.hist.plot(x$lstat, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "lstat")
logi.hist.plot(x$medv, x$target,logi.mod = 1,type="hist",boxp=FALSE,col="gray", mainlabel = "medv")
# function for removing outliers - http://r-statistics.co/Outlier-Treatment-With-R.html
treat_outliers <- function(x) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
return(x)
}
city_crime_train_mod<-city_crime_train
train_test_mod<-train_test
city_crime_train_mod$tax_new <- treat_outliers(city_crime_train$tax)
city_crime_train_mod$medv_new <- treat_outliers(city_crime_train$medv)
city_crime_train_mod$lstat_new <- treat_outliers(city_crime_train$lstat)
train_test_mod$tax_new <- treat_outliers(train_test$tax)
train_test_mod$medv_new <- treat_outliers(train_test$medv)
train_test_mod$lstat_new <- treat_outliers(train_test$lstat)
par(mfrow=c(1,3))
boxplot(city_crime_train_mod$tax_new,main="tax_new")
boxplot(city_crime_train_mod$medv_new,main="medv_new")
boxplot(city_crime_train_mod$lstat_new,main="lstat_new")
city_crime_train_mod$rm_new<-sin(city_crime_train$rm)
city_crime_train_mod$dis_new<-sin(city_crime_train$dis)
train_test_mod$rm_new<-sin(train_test$rm)
train_test_mod$dis_new<-sin(train_test$dis)
par(mfrow=c(1,2))
#boxplot(city_crime_train_mod$rm_new,main="rm_new")
#boxplot(city_crime_train_mod$dis_new,main="dis_new")
show_charts <- function(x, ...) {
par(mfrow=c(2,3))
xlabel <- unlist(str_split(deparse(substitute(x)), pattern = "\\$"))[2]
ylabel <- unlist(str_split(deparse(substitute(y)), pattern = "\\$"))[2]
hist(x,main=xlabel)
boxplot(x,main=xlabel)
y<-log(x)
boxplot(y,main='log transform')
y<-sqrt(x)
boxplot(y,main='sqrt transform')
y<-sin(x)
boxplot(y,main='sin transform')
y<-(x)^(1/9)
boxplot(y,main='ninth transform')
}
# attach the data frame so that variables can be referenced by name
attach(city_crime_train)
show_charts(zn)
show_charts(indus)
show_charts(nox)
show_charts(rm)
show_charts(age)
show_charts(dis)
show_charts(rad)
show_charts(tax)
show_charts(ptratio)
show_charts(black)
show_charts(lstat)
show_charts(medv)
city_crime_train_mod$zn_new<-city_crime_train_mod$zn
city_crime_train_mod$zn_new [city_crime_train_mod$zn_new > 0] <- 1
city_crime_train_mod$zn_new [city_crime_train_mod$zn_new== 0] <- 0
#city_crime_train_mod$zn_new<-factor(city_crime_train_mod$zn_new)
train_test_mod$zn_new<-train_test$zn
train_test_mod$zn_new [train_test_mod$zn_new > 0] <- 1
train_test_mod$zn_new [train_test_mod$zn_new== 0] <- 0
#city_crime_train_mod$zn_new<-factor(train_test_mod$zn_new)
#city_crime_train$chas <-factor(city_crime_train$chas)
#city_crime_train$target<-factor(city_crime_train$target)
#city_crime_train_mod$chas <-factor(city_crime_train_mod$chas)
#city_crime_train_mod$target<-factor(city_crime_train_mod$target)
#Correlation <- cor(city_crime_train$rm_new,(as.numeric(city_crime_train$target)))
#Correlation <- cor(city_crime_train$dis_new,(as.numeric(city_crime_train$target)))
model_var <- read.csv("https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW3/model_var.csv")
kable(model_var,caption="Variables used in different models")
model1 <- glm(target ~ ., data = city_crime_train, family = "binomial")
summary(model1)
stepmodel1<- step(model1, direction="backward")
summary(stepmodel1)
model2 <- glm(target ~ .-zn-tax-lstat-medv, data = city_crime_train_mod, family = "binomial")
summary(model2)
stepmodel2<- step(model2, direction="backward")
summary(stepmodel2)
model3=lda(target~.,data=city_crime_train)
model3
model3_mod=lda(target~.-zn-rm-dis-tax-lstat-medv,data=city_crime_train_mod)
model3_mod
# The following function Eval() will be used to calculate various metrics related to the model, like Accuracy, Sensitivity, Precision, Specificity, and F1 score
Eval<-function(x){
TP<-x$Freq[x$metrics=="TRUE_1"]
FP<-x$Freq[x$metrics=="FALSE_1"]
TN<-x$Freq[x$metrics=="FALSE_0"]
FN<-x$Freq[x$metrics=="TRUE_0"]
Accuracy <-(TP+TN)/(TP+TN+FP+FN)
Error_Rate<-(FP+FN)/(TP+TN+FP+FN)
Precision<-TP/(TP+FP)
sensitivity<-TP/(TP+FN)
specificity<-TN/(TN+FP)
F1_Score<-2*Precision*sensitivity/(Precision+sensitivity) # harmonic mean of precision and sensitivity
eval_result<-data.frame(Accuracy=c(0),Error_Rate=c(0),Precision=c(0),sensitivity=c(0),specificity=c(0),F1_Score=c(0))
eval_result[1,1]<-Accuracy
eval_result[1,2]<-Error_Rate
eval_result[1,3]<- Precision
eval_result[1,4]<-sensitivity
eval_result[1,5]<-specificity
eval_result[1,6]<-F1_Score
eval_result
}
#confusion matrix
pre_train1 <- predict(model1, newdata=train_test, type="response")
df_pre_train1<-as.data.frame(table(pre_train1>0.5,train_test$target))
df_pre_train1$metrics <- paste(df_pre_train1$Var1,df_pre_train1$Var2, sep = '_')
#Eval(df_pre_train1)
model_comparison<-data.frame(Accuracy=c(0),Error_Rate=c(0),Precision=c(0),sensitivity=c(0),specificity=c(0),F1_Score=c(0))
model_comparison[1,]<-Eval(df_pre_train1)
#model_comparison<-c(1)
#row.names(model_comparison[1,])<-"model 1"
#AUC
train_test$pre_train1<-c(pre_train1)
#x<-data.frame(auc(train_test$target, train_test$pre_train1))
model_comparison$AUC<-c(0)
model_comparison[1,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))
kable(model_comparison,row.names = TRUE, caption = " Model 1 evaluation KPIs")
#model_comparison
#confusion matrix
pre_train1_step<-predict(stepmodel1,type="response",newdata=train_test)
df_pre_train1_step<-as.data.frame(table(pre_train1_step>0.5,train_test$target))
df_pre_train1_step$metrics <- paste(df_pre_train1_step$Var1,df_pre_train1_step$Var2, sep = '_')
#Eval(df_pre_train1_step)
model_comparison[2,]<-Eval(df_pre_train1_step)
#AUC
train_test$pre_train1<-c(pre_train1_step)
model_comparison[2,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))
kable(model_comparison[2,],caption = "Model 2 evaluation KPIs")
#confusion matrix
pre_train2<-predict(model2,type="response",newdata=train_test_mod)
df_pre_train2<-as.data.frame(table(pre_train2>0.5,train_test_mod$target))
df_pre_train2$metrics <- paste(df_pre_train2$Var1,df_pre_train2$Var2, sep = '_')
#Eval(df_pre_train2)
model_comparison[3,]<-Eval(df_pre_train2)
#AUC
train_test_mod$pre_train1<-c(pre_train2)
model_comparison[3,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))
kable(model_comparison[3,],caption = " Model 3 evaluation KPIs")
#confusion matrix
pre_train2_step<-predict(stepmodel2,type="response",newdata=train_test_mod)
df_pre_train2_step<-as.data.frame(table(pre_train2_step>0.5,train_test_mod$target))
df_pre_train2_step$metrics <- paste(df_pre_train2_step$Var1,df_pre_train2_step$Var2, sep = '_')
#Eval(df_pre_train2_step)
model_comparison[4,]<-Eval(df_pre_train2_step)
#AUC
train_test_mod$pre_train1<-c(pre_train2_step)
model_comparison[4,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))
kable(model_comparison[4,],caption = " Model 4 evaluation KPIs")
#confusion matrix
pre_train3<-data.frame(predict(model3,type="response",newdata=train_test))
df_pre_train3<-as.data.frame(table(pre_train3$class,train_test$target))
df_pre_train3$metrics <- paste(df_pre_train3$Var1,df_pre_train3$Var2, sep = '_')
df_pre_train3$metrics[df_pre_train3$metrics=="0_0"]<-"FALSE_0"
df_pre_train3$metrics[df_pre_train3$metrics=="1_0"]<-"TRUE_0"
df_pre_train3$metrics[df_pre_train3$metrics=="0_1"]<-"FALSE_1"
df_pre_train3$metrics[df_pre_train3$metrics=="1_1"]<-"TRUE_1"
#Eval(df_pre_train3)
model_comparison[5,]<-Eval(df_pre_train3)
#AUC
train_test$pre_train1<-c(pre_train3$posterior.1)
model_comparison[5,c("AUC")]<-c(auc(train_test$target, train_test$pre_train1))
kable(model_comparison[5,],caption = " Model 5 evaluation KPIs")
#confusion matrix
pre_train3_mod<-data.frame(predict(model3_mod,type="response",newdata=train_test_mod))
df_pre_train3_mod<-data.frame(table(pre_train3_mod$class,train_test_mod$target))
df_pre_train3_mod$metrics <- paste(df_pre_train3_mod$Var1,df_pre_train3_mod$Var2, sep = '_')
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="0_0"]<-"FALSE_0"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="1_0"]<-"TRUE_0"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="0_1"]<-"FALSE_1"
df_pre_train3_mod$metrics[df_pre_train3_mod$metrics=="1_1"]<-"TRUE_1"
#Eval(df_pre_train3_mod)
model_comparison[6,]<-Eval(df_pre_train3_mod)
#AUC
train_test_mod$pre_train1<-c(pre_train3_mod$posterior.1)
model_comparison[6,c("AUC")]<-c(auc(train_test_mod$target, train_test_mod$pre_train1))
kable(model_comparison[6,],caption = " Model 6 evaluation KPIs")
model_comparison$Model_No<-c(1:6)
kable(model_comparison[,c("Model_No","Accuracy","Error_Rate","AUC","Precision","sensitivity","specificity","F1_Score")],
caption = "Model Performance Metrics Comparison")
summary(model1)
exp(cbind(OR = coef(model1), confint.default(model1)))
plot (city_crime_train$nox,city_crime_train$target, main="nox vs target")
abline(v=0.425)
abline(v=0.635)
#AUC
pred <- prediction(pre_train1, train_test$target)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
roc.data <- data.frame(fpr=unlist(perf@x.values),
tpr=unlist(perf@y.values),
model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2) +
geom_line(aes(y=tpr)) +
ggtitle(paste0("ROC Curve w/ AUC=", auc))
plot_pred_type_distribution <- function(df, threshold) {
v <- rep(NA, nrow(df))
v <- ifelse(df$pre_train1 >= threshold & df$target == 1, "TP", v)
v <- ifelse(df$pre_train1 >= threshold & df$target == 0, "FP", v)
v <- ifelse(df$pre_train1 < threshold & df$target == 1, "FN", v)
v <- ifelse(df$pre_train1 < threshold & df$target == 0, "TN", v)
df$pred_type <- v
ggplot(data=df, aes(x=target, y=pre_train1)) +
geom_violin(fill=rgb(1,1,1,alpha=0.6), color=NA) +
geom_jitter(aes(color=pred_type), alpha=0.6) +
geom_hline(yintercept=threshold, color="red", alpha=0.6) +
scale_color_discrete(name = "type") +
labs(title=sprintf("Threshold at %.2f", threshold))
}
train_test$pre_train1 <- predict(model1, newdata=train_test, type="response")
plot_pred_type_distribution (train_test,0.5)
#city_crime_test$chas<-factor(city_crime_test$chas)
Predict_final<-predict(model1,type="response", newdata=city_crime_test)
kable(data.frame(table(Predict_final>0.5)), caption="Outcome on evaluation data set")
# code=readLines(knitr::purl('https://raw.githubusercontent.com/kishkp/data621-ctg5/master/HW1/HW3_Final_03.Rmd', documentation = 0)), eval = FALSE