Overview

In this homework assignment, you will explore, analyze, and model a data set containing crime information for various neighborhoods of a major city. Each record has a binary response variable indicating whether the neighborhood's crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You may only use the variables given to you (or variables that you derive from them). Below is a short description of the variables of interest in the data set.

Data Exploration

In this section, we explore the data and familiarize ourselves with the features we can use during model building. The training data has 466 observations, 1 target variable, and 12 predictor variables. First, we look at the distribution of each predictor variable so we can assess the steps needed to generate a good model fit later on.

Distribution of Each Predictor Variable

# histogram of every column (target included), laid out in a 4 x 4 grid
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)) {
    hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}

par(mfrow = c(1, 1))

kable(tidy(train_df), "pipe")
| column  |   n | mean        | sd          | median    | trimmed     | mad     | min      | max      | range    | skew       | kurtosis  | se        |
|---------|-----|-------------|-------------|-----------|-------------|---------|----------|----------|----------|------------|-----------|-----------|
| zn      | 466 | 11.5772532  | 23.3646511  | 0.00000   | 5.3542781   | 0.0000  | 0.0000   | 100.0000 | 100.0000 | 2.1838409  | 6.842914  | 1.0823466 |
| indus   | 466 | 11.1050215  | 6.8458549   | 9.69000   | 10.9082353  | 6.3000  | 0.4600   | 27.7400  | 27.2800  | 0.2894763  | 1.764351  | 0.3171281 |
| chas    | 466 | 0.0708155   | 0.2567920   | 0.00000   | 0.0000000   | 0.0000  | 0.0000   | 1.0000   | 1.0000   | 3.3462553  | 12.197425 | 0.0118957 |
| nox     | 466 | 0.5543105   | 0.1166667   | 0.53800   | 0.5442684   | 0.0900  | 0.3890   | 0.8710   | 0.4820   | 0.7487369  | 2.976990  | 0.0054045 |
| rm      | 466 | 6.2906738   | 0.7048513   | 6.21000   | 6.2570615   | 0.3485  | 3.8630   | 8.7800   | 4.9170   | 0.4808673  | 4.561996  | 0.0326516 |
| age     | 466 | 68.3675966  | 28.3213784  | 77.15000  | 70.9553476  | 20.2500 | 2.9000   | 100.0000 | 97.1000  | -0.5795721 | 1.998687  | 1.3119625 |
| dis     | 466 | 3.7956929   | 2.1069496   | 3.19095   | 3.5443647   | 1.2913  | 1.1296   | 12.1265  | 10.9969  | 1.0021166  | 3.486917  | 0.0976026 |
| rad     | 466 | 9.5300429   | 8.6859272   | 5.00000   | 8.6978610   | 1.0000  | 1.0000   | 24.0000  | 23.0000  | 1.0135395  | 2.147295  | 0.4023678 |
| tax     | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 70.5000 | 187.0000 | 711.0000 | 524.0000 | 0.6614416  | 1.859928  | 7.7778214 |
| ptratio | 466 | 18.3984979  | 2.1968447   | 18.90000  | 18.5970588  | 1.3000  | 12.6000  | 22.0000  | 9.4000   | -0.7567025 | 2.610831  | 0.1017669 |
| lstat   | 466 | 12.6314592  | 7.1018907   | 11.35000  | 11.8809626  | 4.7700  | 1.7300   | 37.9700  | 36.2400  | 0.9085092  | 3.518453  | 0.3289887 |
| medv    | 466 | 22.5892704  | 9.2396814   | 21.20000  | 21.6304813  | 4.0500  | 5.0000   | 50.0000  | 45.0000  | 1.0801670  | 4.392615  | 0.4280200 |
| target  | 466 | 0.4914163   | 0.5004636   | 0.00000   | 0.4893048   | 0.0000  | 0.0000   | 1.0000   | 1.0000   | 0.0343398  | 1.001179  | 0.0231835 |
# print the dimensions, column names, column classes, and missing-value counts of a data frame
show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    cat(names(df), "\n")
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    cat(colSums(is.na(df)), "\n")
}

show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DIMENSIONS : (466, 13)
##  
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## COLUMNS:
##  
## zn indus chas nox rm age dis rad tax ptratio lstat medv target 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DATA INFO:
##  
## numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## MISSING VALUES:
##  
## 0 0 0 0 0 0 0 0 0 0 0 0 0

The data are already fairly clean: there are no missing values, and all of the explanatory variables have appropriate data types. The next step is to check for multicollinearity.

# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

We want to remove highly correlated predictors; any pairwise correlation above 0.7 in absolute value is treated as a sign of multicollinearity. For example, dis has strong negative correlations with indus, nox, and age. Based on this correlation plot, we will remove the following variables in order to minimize multicollinearity: dis, tax, medv, and nox.
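Rather than reading the pairs off the plot by eye, the same |r| > 0.7 rule can be checked programmatically. A minimal base-R sketch reusing the correlation_Matrix computed above (the 0.7 cutoff is our own choice of threshold, not a fixed rule):

# list predictor pairs whose absolute correlation exceeds 0.7
high_corr <- which(abs(correlation_Matrix) > 0.7 & upper.tri(correlation_Matrix),
    arr.ind = TRUE)
data.frame(var1 = rownames(correlation_Matrix)[high_corr[, 1]],
    var2 = colnames(correlation_Matrix)[high_corr[, 2]],
    r = correlation_Matrix[high_corr])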

train_df <- train_df |>
    mutate(crime = ifelse(target == 1, "high", "low"))

df_2 <- train_df |>
    select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))

train_df <- train_df |>
    select(-dis, -tax, -medv, -nox)

Data Preparation

After removing the highly correlated variables, the updated correlation plot shows minimal collinearity among the remaining independent predictor variables, which is what we want to achieve.

correlation_Matrix <- cor(train_df[, 1:8])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

train_df <- train_df |>
    select(-crime)

Model Building

We start with a simple logistic model and then work our way up to the best-fitting model.

Simple/Base Logistic Model

This model establishes a baseline AIC of 522.46 using a single predictor, the zn variable. We expect this first model to have the worst AIC value.

simple_model <- glm(target ~ zn, data = train_df, family = "binomial")
summary(simple_model)
## 
## Call:
## glm(formula = target ~ zn, family = "binomial", data = train_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.54503    0.11170   4.879 1.06e-06 ***
## zn          -0.09176    0.01349  -6.804 1.02e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 518.46  on 464  degrees of freedom
## AIC: 522.46
## 
## Number of Fisher Scoring iterations: 6

Everything Model

For the second model, we include all of the predictor variables that remain after dropping the highly correlated ones. This gives an improved AIC of 294.49, which may still not be the best we can achieve.

everything_model <- glm(target ~ ., data = train_df, family = "binomial")
summary(everything_model)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.640771   3.035549  -2.847  0.00442 ** 
## zn          -0.056399   0.019159  -2.944  0.00324 ** 
## indus        0.049239   0.027236   1.808  0.07063 .  
## chas         0.168888   0.584261   0.289  0.77253    
## rm           0.693054   0.349282   1.984  0.04723 *  
## age          0.035598   0.008583   4.148 3.36e-05 ***
## rad          0.483674   0.113763   4.252 2.12e-05 ***
## ptratio     -0.102886   0.071228  -1.444  0.14861    
## lstat        0.043899   0.041213   1.065  0.28680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 276.49  on 457  degrees of freedom
## AIC: 294.49
## 
## Number of Fisher Scoring iterations: 8

After fitting everything_model, we double-check collinearity with the vif() function. The table below shows relatively low variance inflation factors for every predictor, so little collinearity remains among them.

data.frame(vif(everything_model))
##         vif.everything_model.
## zn                   1.464757
## indus                1.494374
## chas                 1.063072
## rm                   2.501189
## age                  1.560749
## rad                  1.142231
## ptratio              1.289813
## lstat                2.698502
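A common rule of thumb (though not a hard cutoff) is to treat a VIF above 5 as a sign of problematic collinearity. A quick check on the fitted model, which flags nothing here:

# predictors whose VIF exceeds 5 (returns character(0) for this model)
vif_values <- vif(everything_model)
names(vif_values)[vif_values > 5]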

Backward Selection

step(everything_model, direction = "backward", scope = formula(everything_model))
## Start:  AIC=294.49
## target ~ zn + indus + chas + rm + age + rad + ptratio + lstat
## 
##           Df Deviance    AIC
## - chas     1   276.57 292.57
## - lstat    1   277.62 293.62
## <none>         276.49 294.49
## - ptratio  1   278.61 294.61
## - indus    1   279.81 295.81
## - rm       1   280.49 296.49
## - zn       1   288.67 304.67
## - age      1   297.28 313.28
## - rad      1   367.79 383.79
## 
## Step:  AIC=292.57
## target ~ zn + indus + rm + age + rad + ptratio + lstat
## 
##           Df Deviance    AIC
## - lstat    1   277.73 291.73
## <none>         276.57 292.57
## - ptratio  1   278.95 292.95
## - indus    1   279.99 293.99
## - rm       1   280.57 294.57
## - zn       1   289.10 303.10
## - age      1   297.37 311.37
## - rad      1   368.22 382.22
## 
## Step:  AIC=291.73
## target ~ zn + indus + rm + age + rad + ptratio
## 
##           Df Deviance    AIC
## <none>         277.73 291.73
## - ptratio  1   280.04 292.04
## - rm       1   280.74 292.74
## - indus    1   281.64 293.64
## - zn       1   289.48 301.48
## - age      1   312.25 324.25
## - rad      1   373.55 385.55
## 
## Call:  glm(formula = target ~ zn + indus + rm + age + rad + ptratio, 
##     family = "binomial", data = train_df)
## 
## Coefficients:
## (Intercept)           zn        indus           rm          age          rad  
##    -6.83783     -0.05417      0.05294      0.43959      0.03994      0.48680  
##     ptratio  
##    -0.10514  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
## Null Deviance:       645.9 
## Residual Deviance: 277.7     AIC: 291.7

Forward Selection

step(simple_model, direction = "forward", scope = formula(everything_model))
## Start:  AIC=522.46
## target ~ zn
## 
##           Df Deviance    AIC
## + rad      1   344.89 350.89
## + age      1   407.45 413.45
## + indus    1   432.03 438.03
## + lstat    1   473.70 479.70
## + chas     1   516.06 522.06
## <none>         518.46 522.46
## + ptratio  1   517.42 523.42
## + rm       1   518.37 524.37
## 
## Step:  AIC=350.89
## target ~ zn + rad
## 
##           Df Deviance    AIC
## + age      1   286.93 294.93
## + indus    1   325.11 333.11
## + ptratio  1   334.65 342.65
## + lstat    1   336.31 344.31
## + chas     1   342.50 350.50
## + rm       1   342.80 350.80
## <none>         344.89 350.89
## 
## Step:  AIC=294.93
## target ~ zn + rad + age
## 
##           Df Deviance    AIC
## + ptratio  1   283.11 293.11
## + rm       1   284.19 294.19
## + indus    1   284.80 294.80
## <none>         286.93 294.93
## + chas     1   286.28 296.28
## + lstat    1   286.80 296.80
## 
## Step:  AIC=293.11
## target ~ zn + rad + age + ptratio
## 
##         Df Deviance    AIC
## + indus  1   280.74 292.74
## <none>       283.11 293.11
## + rm     1   281.64 293.64
## + chas   1   282.94 294.94
## + lstat  1   283.11 295.11
## 
## Step:  AIC=292.74
## target ~ zn + rad + age + ptratio + indus
## 
##         Df Deviance    AIC
## + rm     1   277.73 291.73
## <none>       280.74 292.74
## + lstat  1   280.57 294.57
## + chas   1   280.67 294.67
## 
## Step:  AIC=291.73
## target ~ zn + rad + age + ptratio + indus + rm
## 
##         Df Deviance    AIC
## <none>       277.73 291.73
## + lstat  1   276.57 292.57
## + chas   1   277.62 293.62
## 
## Call:  glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
##     family = "binomial", data = train_df)
## 
## Coefficients:
## (Intercept)           zn          rad          age      ptratio        indus  
##    -6.83783     -0.05417      0.48680      0.03994     -0.10514      0.05294  
##          rm  
##     0.43959  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
## Null Deviance:       645.9 
## Residual Deviance: 277.7     AIC: 291.7

With both forward and backward selection, the resulting logit models perform identically, each with an AIC of 291.7. Both directions also arrive at the same formula: target ~ zn + rad + age + ptratio + indus + rm.
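As a sanity check, step() can also search in both directions at once. A minimal sketch, which we would expect to land on the same formula given the agreement above (trace = 0 simply suppresses the step-by-step printout):

both_model <- step(everything_model, direction = "both", scope = formula(everything_model),
    trace = 0)
formula(both_model)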


Model Selection

The step model has the lowest AIC value, meaning it has the lowest estimated prediction error relative to the other models. As expected, the first model had the worst AIC, since a single predictor cannot explain most of the variation in the target. The second model, which used all of the remaining variables, had a substantially better AIC, but it is still not the best we can do: the third model, built through stepwise selection, improved on it further and, somewhat surprisingly, forward selection and backward elimination converged on the same result. The step model is therefore the best model we obtained from the predictor variables in this data set, so we will use it to predict the target variable in eval_df.

step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
    data = train_df)

predictions <- predict(step_mod, eval_df, type = "response")

threshold <- 0.5

binary_predictions <- ifelse(predictions >= threshold, 1, 0)

eval_df$target_pred <- binary_predictions
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)

kable(cbind(model_names, aic_values), col.names = c("Model Name", "AIC Value")) |>
    kable_styling(full_width = T)
| Model Name | AIC Value        |
|------------|------------------|
| simple     | 522.464452017651 |
| everything | 294.486574605442 |
| step       | 291.726687224588 |
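Before applying the 0.5 cutoff to the evaluation data, it is worth seeing how the chosen step model classifies the training data it was fit on. A minimal sketch; the confusion matrix and in-sample accuracy below are diagnostics of our own, not part of the required output:

# in-sample confusion matrix and accuracy at the 0.5 threshold
train_probs <- predict(step_mod, type = "response")
train_preds <- ifelse(train_probs >= threshold, 1, 0)
table(predicted = train_preds, actual = train_df$target)
mean(train_preds == train_df$target)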

Predictions

kable(eval_df) |>
    kable_styling(full_width = T)
| zn | indus | chas | nox   | rm    | age   | dis    | rad | tax | ptratio | lstat | medv | target_pred |
|----|-------|------|-------|-------|-------|--------|-----|-----|---------|-------|------|-------------|
| 0  | 7.07  | 0    | 0.469 | 7.185 | 61.1  | 4.9671 | 2   | 242 | 17.8    | 4.03  | 34.7 | 0 |
| 0  | 8.14  | 0    | 0.538 | 6.096 | 84.5  | 4.4619 | 4   | 307 | 21.0    | 10.26 | 18.2 | 0 |
| 0  | 8.14  | 0    | 0.538 | 6.495 | 94.4  | 4.4547 | 4   | 307 | 21.0    | 12.80 | 18.4 | 0 |
| 0  | 8.14  | 0    | 0.538 | 5.950 | 82.0  | 3.9900 | 4   | 307 | 21.0    | 27.71 | 13.2 | 0 |
| 0  | 5.96  | 0    | 0.499 | 5.850 | 41.5  | 3.9342 | 5   | 279 | 19.2    | 8.77  | 21.0 | 0 |
| 25 | 5.13  | 0    | 0.453 | 5.741 | 66.2  | 7.2254 | 8   | 284 | 19.7    | 13.15 | 18.7 | 0 |
| 25 | 5.13  | 0    | 0.453 | 5.966 | 93.4  | 6.8185 | 8   | 284 | 19.7    | 14.44 | 16.0 | 1 |
| 0  | 4.49  | 0    | 0.449 | 6.630 | 56.1  | 4.4377 | 3   | 247 | 18.5    | 6.53  | 26.6 | 0 |
| 0  | 4.49  | 0    | 0.449 | 6.121 | 56.8  | 3.7476 | 3   | 247 | 18.5    | 8.44  | 22.2 | 0 |
| 0  | 2.89  | 0    | 0.445 | 6.163 | 69.6  | 3.4952 | 2   | 276 | 18.0    | 11.34 | 21.4 | 0 |
| 0  | 25.65 | 0    | 0.581 | 5.856 | 97.0  | 1.9444 | 2   | 188 | 19.1    | 25.41 | 17.3 | 0 |
| 0  | 25.65 | 0    | 0.581 | 5.613 | 95.6  | 1.7572 | 2   | 188 | 19.1    | 27.26 | 15.7 | 0 |
| 0  | 21.89 | 0    | 0.624 | 5.637 | 94.7  | 1.9799 | 4   | 437 | 21.2    | 18.34 | 14.3 | 1 |
| 0  | 19.58 | 0    | 0.605 | 6.101 | 93.0  | 2.2834 | 5   | 403 | 14.7    | 9.81  | 25.0 | 1 |
| 0  | 19.58 | 0    | 0.605 | 5.880 | 97.3  | 2.3887 | 5   | 403 | 14.7    | 12.03 | 19.1 | 1 |
| 0  | 10.59 | 1    | 0.489 | 5.960 | 92.1  | 3.8771 | 4   | 277 | 18.6    | 17.27 | 21.7 | 1 |
| 0  | 6.20  | 0    | 0.504 | 6.552 | 21.4  | 3.3751 | 8   | 307 | 17.4    | 3.76  | 31.5 | 0 |
| 0  | 6.20  | 0    | 0.507 | 8.247 | 70.4  | 3.6519 | 8   | 307 | 17.4    | 3.95  | 48.3 | 1 |
| 22 | 5.86  | 0    | 0.431 | 6.957 | 6.8   | 8.9067 | 7   | 330 | 19.1    | 3.53  | 29.6 | 0 |
| 90 | 2.97  | 0    | 0.400 | 7.088 | 20.8  | 7.3073 | 1   | 285 | 15.3    | 7.85  | 32.2 | 0 |
| 80 | 1.76  | 0    | 0.385 | 6.230 | 31.5  | 9.0892 | 1   | 241 | 18.2    | 12.93 | 20.1 | 0 |
| 33 | 2.18  | 0    | 0.472 | 6.616 | 58.1  | 3.3700 | 7   | 222 | 18.4    | 8.93  | 28.4 | 0 |
| 0  | 9.90  | 0    | 0.544 | 6.122 | 52.8  | 2.6403 | 4   | 304 | 18.4    | 5.98  | 22.1 | 0 |
| 0  | 7.38  | 0    | 0.493 | 6.415 | 40.1  | 4.7211 | 5   | 287 | 19.6    | 6.12  | 25.0 | 0 |
| 0  | 7.38  | 0    | 0.493 | 6.312 | 28.9  | 5.4159 | 5   | 287 | 19.6    | 6.15  | 23.0 | 0 |
| 0  | 5.19  | 0    | 0.515 | 5.895 | 59.6  | 5.6150 | 5   | 224 | 20.2    | 10.56 | 18.5 | 0 |
| 80 | 2.01  | 0    | 0.435 | 6.635 | 29.7  | 8.3440 | 4   | 280 | 17.0    | 5.99  | 24.5 | 0 |
| 0  | 18.10 | 0    | 0.718 | 3.561 | 87.9  | 1.6132 | 24  | 666 | 20.2    | 7.12  | 27.5 | 1 |
| 0  | 18.10 | 1    | 0.631 | 7.016 | 97.5  | 1.2024 | 24  | 666 | 20.2    | 2.96  | 50.0 | 1 |
| 0  | 18.10 | 0    | 0.584 | 6.348 | 86.1  | 2.0527 | 24  | 666 | 20.2    | 17.64 | 14.5 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.935 | 87.9  | 1.8206 | 24  | 666 | 20.2    | 34.02 | 8.4  | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.627 | 93.9  | 1.8172 | 24  | 666 | 20.2    | 22.88 | 12.8 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.818 | 92.4  | 1.8662 | 24  | 666 | 20.2    | 22.11 | 10.5 | 1 |
| 0  | 18.10 | 0    | 0.740 | 6.219 | 100.0 | 2.0048 | 24  | 666 | 20.2    | 16.59 | 18.4 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.854 | 96.6  | 1.8956 | 24  | 666 | 20.2    | 23.79 | 10.8 | 1 |
| 0  | 18.10 | 0    | 0.713 | 6.525 | 86.5  | 2.4358 | 24  | 666 | 20.2    | 18.13 | 14.1 | 1 |
| 0  | 18.10 | 0    | 0.713 | 6.376 | 88.4  | 2.5671 | 24  | 666 | 20.2    | 14.65 | 17.7 | 1 |
| 0  | 18.10 | 0    | 0.655 | 6.209 | 65.4  | 2.9634 | 24  | 666 | 20.2    | 13.22 | 21.4 | 1 |
| 0  | 9.69  | 0    | 0.585 | 5.794 | 70.6  | 2.8927 | 6   | 391 | 19.2    | 14.10 | 18.3 | 0 |
| 0  | 11.93 | 0    | 0.573 | 6.976 | 91.0  | 2.1675 | 1   | 273 | 21.0    | 5.64  | 23.9 | 0 |
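
Finally, since the assignment asks for both classifications and probabilities, the predicted probabilities can be attached alongside target_pred and the results written out. A minimal sketch; the column name target_prob and the output file name are our own choices:

# attach predicted probabilities and export the evaluation predictions
eval_df$target_prob <- predictions
write.csv(eval_df, "crime_eval_predictions.csv", row.names = FALSE)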