DATA 621 - Assignment 3

1 Background
2 Data Preparation
- 2.1 Data Import
- 2.2 Data Exploration and Cleaning
3 Data Preparation
- 3.1 a (creating categorical variables)
- 3.2 Transform skewed predictors
4 Build Models
5 Select Models
6 Predictions on evaluated data

1 Background

In this homework assignment, we will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0). Our objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. We will provide classifications and probabilities for the evaluation data set using your binary logistic regression model.Below is a short description of the variables of interest in the data set:

variables <- data.frame(
  Variable = c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", 
               "tax", "ptratio", "lstat", "medv", "target"),
  Description = c(
    "Proportion of residential land zoned for large lots (over 25,000 sq. ft.)",
    "Proportion of non-retail business acres per suburb",
    "Whether the suburb borders the Charles River (1 = yes, 0 = no)",
    "Nitrogen oxides concentration (parts per 10 million)",
    "Average number of rooms per dwelling",
    "Proportion of owner-occupied units built before 1940",
    "Weighted mean of distances to five Boston employment centers",
    "Index of accessibility to radial highways",
    "Full-value property-tax rate per $10,000",
    "Pupil-teacher ratio by town",
    "Percentage of population with lower status",
    "Median value of owner-occupied homes (in $1000s)",
    "Whether the crime rate is above the median (1 = yes, 0 = no)"),
  `Type` = c(
    rep("Predictor", 12),
    "Response" ))

kable(variables)

Variable	Description	Type
zn	Proportion of residential land zoned for large lots (over 25,000 sq. ft.)	Predictor
indus	Proportion of non-retail business acres per suburb	Predictor
chas	Whether the suburb borders the Charles River (1 = yes, 0 = no)	Predictor
nox	Nitrogen oxides concentration (parts per 10 million)	Predictor
rm	Average number of rooms per dwelling	Predictor
age	Proportion of owner-occupied units built before 1940	Predictor
dis	Weighted mean of distances to five Boston employment centers	Predictor
rad	Index of accessibility to radial highways	Predictor
tax	Full-value property-tax rate per $10,000	Predictor
ptratio	Pupil-teacher ratio by town	Predictor
lstat	Percentage of population with lower status	Predictor
medv	Median value of owner-occupied homes (in $1000s)	Predictor
target	Whether the crime rate is above the median (1 = yes, 0 = no)	Response

2 Data Preparation

2.1 Data Import

data_train <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/crime-training-data_modified.csv")
data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/crime-evaluation-data_modified.csv")

2.2 Data Exploration and Cleaning

The crime dataset contains 466 observations and 13 variables, each representing the characteristics of different neighborhoods in a major city, including socioeconomic, environmental, and housing-related factors. The response variable, target, is a binary indicator that denotes whether the crime rate in a neighborhood is above the median (1) or not (0). The goal is to predict whether a neighborhood is at risk for high crime levels, as defined by whether its crime rate is above the median (target = 1). By analyzing patterns in the provided predictor variables, we aim to develop a binary logistic regression model that can estimate the probability of high crime and classify neighborhoods accordingly. This model will later be used to generate both class predictions for an evaluation dataset.

2.21 Summary Statistics and Distribution

The table below provides some summary statistics of the predictor variables in the training dataset. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable. Some interesting things to note here is that zn, proportion of residential land zoned for large lots (over 25000 square feet), has a large spread but many zero values (median is zero). ptratio, pupil-teacher ratio by town, also has a median which is larger, closer to the max value, indicating potential skewness.

Additionally, there are no duplicates in this dataset.

# Structure of the data 
str(data_train)

## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...

# Glimpse of the data
head(data_train)

##   zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1  0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2  0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3  0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4 30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5  0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6  0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0

# Summary statistics
stats <- data_train %>%
  dplyr::select(-target) %>%
  summarise(across(everything(), list(
    Min = min,
    Max = max,
    Mean = mean,
    Median = median,
    SD = sd
  ))) %>%
  pivot_longer(cols = everything(),
               names_to = c("Variable", ".value"),
               names_sep = "_")
kable(stats, caption = "Descriptive Statistics of Predictor Variables")

Descriptive Statistics of Predictor Variables
Variable	Min	Max	Mean	Median	SD
zn	0.0000	100.0000	11.5772532	0.00000	23.3646511
indus	0.4600	27.7400	11.1050215	9.69000	6.8458549
chas	0.0000	1.0000	0.0708155	0.00000	0.2567920
nox	0.3890	0.8710	0.5543105	0.53800	0.1166667
rm	3.8630	8.7800	6.2906738	6.21000	0.7048513
age	2.9000	100.0000	68.3675966	77.15000	28.3213784
dis	1.1296	12.1265	3.7956929	3.19095	2.1069496
rad	1.0000	24.0000	9.5300429	5.00000	8.6859272
tax	187.0000	711.0000	409.5021459	334.50000	167.9000887
ptratio	12.6000	22.0000	18.3984979	18.90000	2.1968447
lstat	1.7300	37.9700	12.6314592	11.35000	7.1018907
medv	5.0000	50.0000	22.5892704	21.20000	9.2396814

# Check for duplicates
duplicates <- duplicated(data_train)

# Print the duplicates
print(data_train[duplicates, ])

##  [1] zn      indus   chas    nox     rm      age     dis     rad     tax    
## [10] ptratio lstat   medv    target 
## <0 rows> (or 0-length row.names)

There is a fairly even distribution of the target variable (high vs low crime). No missing values were found.

Information about the distribution: - age (age) is left skewed - pupil-teacher ratio (ptratio) is a slightly right-skewed distribution
- Zoning (zn),Pollution (nox), lower status of the population (lstat) and distance to employment (dis) is right skewed. - Average number of rooms (rn) and median value of owner-occupied homes (medv) is somewhat symmetrical and normally distributed

The distributions for the remaining variables are multimodal, including the distribution for chas.

# Plot histograms of all numeric variables in data_train
data_train |>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
  theme_minimal()

# Bar plot of the target variable (0 = low crime, 1 = high crime)
ggplot(data_train, aes(x = factor(target))) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Target Variable",
    x = "Crime Level (0 = Low, 1 = High)",
    y = "Count"
  ) +
  theme_minimal()

# Count of missing Values per Variable
missing <- sapply(data_train, function(x) sum(is.na(x)))
kable(data.frame(Variable = names(missing), Missing = missing), caption = "Missing Values in Each Variable")

Missing Values in Each Variable
	Variable	Missing
zn	zn	0
indus	indus	0
chas	chas	0
nox	nox	0
rm	rm	0
age	age	0
dis	dis	0
rad	rad	0
tax	tax	0
ptratio	ptratio	0
lstat	lstat	0
medv	medv	0
target	target	0

2.22 Identifying Outliers and Correlations

Boxplots show significant variation in some predictors, indicating potential outliers or skewness, which will be considered during modeling.

For example, age and tax both have a much wider spread than the other predictors.

Variables such as nox, age, rad, dis, and zn have a correlation with crime rates:

Top positive correlations: nox 0.73: Higher air pollution is associated with higher crime areas. Specifically, when nox is 0.624 or greater it has a greater correlation with the crime rate being above the median crime rate. age 0.63: Older housing stock tends to be in high crime areas. Specifically, when the age is 77.15 or greater there is a greater correlation with the crime rate being above the median crime rate. rad 0.63: Greater accessibility to radial highways correlates with higher crime rates. Specifically, when rad is 5 or greater it has a greater correlation with the crime rate being above the median crime rate.

Top negative correlations: dis -0.62: Neighborhoods further from employment centers tend to have lower crime. Specifically, when dis is 5.3 or greater it has a lower correlation with the crime rate being above the median crime rate. zn -0.43: More residential zoning for large lots is linked with lower crime. Specifically, when zn is 16.25 or greater it has a lower correlation with the crime rate being above the median crime rate. medv -0.27: Even though this absolute value correlation is not too high, when medv is specifically lower than 17, it has a higher correlation with the crime rate being above the median crime rate.

Through these correlations, we already have some solid insights before even creating an models.

Correlation Among Predictors:

It is clear that the predictor variables rad and tax are very highly correlated (more than 0.9). We will attempt to account for this when we prepare the data.

There are some smaller, but still high correlations between other predictor variables as well. indus is highly correlated (more than 0.7) with nox, dis, and tax. nox is also highly correlated with age and dis. rm is highly correlated with medv. lstat is highly correlated with medv.

Chas has no correlation to the response variable.

# Boxplot of predictors
data_train %>%
  dplyr::select(-target) %>%
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Predictor, y = Value)) +
  geom_boxplot(fill = "darkgrey", outlier.color = "red", outlier.shape = 16, outlier.size = 1.5) +
  coord_cartesian(ylim = c(0, 1000)) +  # Adjust this based on your outliers
  labs(title = "Boxplots of Predictor Variables", x = "Predictors", y = "Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Blox plot of predictors by crime level
data_train %>%
  pivot_longer(cols = -target, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = factor(target), y = Value, fill = factor(target))) +
  geom_boxplot(outlier.shape = 1, outlier.alpha = 0.5) +
  facet_wrap(~ Predictor, scales = "free", ncol = 4) +
  labs(title = "Boxplots of Predictors by Crime Level", x = "Crime Level", y = "Value") +
  theme_minimal() +
  theme(legend.position = "none")

#Correlation matrix with target
cor_matrix <- cor(data_train)
corr_target <- cor_matrix[,"target"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Target"), digits = 2)

	Correlation with Target
target	1.00
nox	0.73
age	0.63
rad	0.63
tax	0.61
indus	0.60
lstat	0.47
ptratio	0.25
chas	0.08
rm	-0.15
medv	-0.27
zn	-0.43
dis	-0.62

#Correlation heatmap
numeric_vars <- data_train %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_vars, use = "complete.obs")

corrplot::corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         number.cex = 0.7, diag = FALSE)

# Correlation top 5 positive and negative
corr_target <- cor_matrix[, "target"]
corr_target <- corr_target[names(corr_target) != "target"]  # Remove self-correlation

top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)) %>%
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with High Crime",
       x = "Feature",
       y = "Correlation with Target",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
head(sorted_cor)

##      Var1 Var2      value
## 112   rad  tax  0.9064632
## 82    nox  dis -0.7688840
## 41  indus  nox  0.7596301
## 84    age  dis -0.7508976
## 154 lstat medv -0.7358008
## 69    nox  age  0.7351278

# Correlation Funnel using plot_correlation_funnel
# This shows us exactly which prospects groups have a higher correlation with the target variable

# This binarizes the features so it includes categorical features
data_train_binarized <- data_train %>%
    binarize(n_bins = 4, thresh_infreq = 0.01)
data_train_correlated_table <- data_train_binarized %>%
    correlate(target = target__1)
data_train_correlated_table %>%
    plot_correlation_funnel(interactive = FALSE)

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
##   Please report the issue at
##   <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
##   Please report the issue at
##   <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

3 Data Preparation

There was no missing data.

3.1 a (creating categorical variables)

# Binning categories
data_train2 <- data_train %>%
  mutate(
    age_bin = cut(age,
                  breaks = quantile(age, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                  labels = c("Young", "Middle", "Old"),
                  include.lowest = TRUE),
    lstat_bin = cut(lstat,
                    breaks = quantile(lstat, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                    labels = c("Low", "Medium", "High"),
                    include.lowest = TRUE),
    medv_bin = cut(medv,
                   breaks = quantile(medv, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                   labels = c("Low", "Medium", "High"),
                   include.lowest = TRUE),
    tax_bin = cut(tax,
                  breaks = quantile(tax, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE),
    dis_group = case_when(
      ntile(dis, 3) == 1 ~ "Close",
      ntile(dis, 3) == 2 ~ "Medium",
      ntile(dis, 3) == 3 ~ "Far"
    ),
    ptratio_bin = case_when(
      ntile(ptratio, 3) == 1 ~ "Low",
      ntile(ptratio, 3) == 2 ~ "Medium",
      ntile(ptratio, 3) == 3 ~ "High"
    ),
    rm_bin = case_when(
      ntile(rm, 3) == 1 ~ "Small",
      ntile(rm, 3) == 2 ~ "Medium",
      ntile(rm, 3) == 3 ~ "Large"
    )
  )

# Convert to factor
data_train3 <- data_train2 %>%
  mutate(across(ends_with("_bin") | ends_with("_group"), ~factor(.)))

3.2 Transform skewed predictors

# Log and sqrt transformation
data_train4 <- data_train3 %>%
  mutate(
    sqrt_zn = sqrt(zn),
    log_tax = log(tax + 1),
    log_rad = log(rad + 1)
  )

sum(is.na(data_train4))

## [1] 0

# Apply Box-Cox transformation on lstat, dis, and nox

# Define a function 
  apply_boxcox <- function(data, variable_name) {
  formula <- as.formula(paste(variable_name, "~ 1"))
  boxcox_result <- boxcox(formula, data = data, plotit = FALSE)
  
  lambda <- boxcox_result$x[which.max(boxcox_result$y)]
  print(paste("Optimal lambda for", variable_name, "Box-Cox transformation is:", lambda))
  
  transformed_variable_name <- paste("box", variable_name, sep = "_")
  data[[transformed_variable_name]] <- (data[[variable_name]]^lambda - 1) / lambda
  
  return(data)
}

# Apply Box-Cox transformation to 'lstat' and 'dis'
data_train4 <- apply_boxcox(data_train4, "lstat")

## [1] "Optimal lambda for lstat Box-Cox transformation is: 0.2"

data_train4 <- apply_boxcox(data_train4, "dis")

## [1] "Optimal lambda for dis Box-Cox transformation is: -0.0999999999999999"

data_train4 <- apply_boxcox(data_train4, "nox")

## [1] "Optimal lambda for nox Box-Cox transformation is: -0.9"

data_train4|>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
  theme_minimal()

4 Build Models

To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.

# Split data for validation
set.seed(123)

train_idx <- createDataPartition(y = data_train$target, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
train <- data_train[train_idx, ]
test <- data_train[-train_idx, ]

# Transformed predictors dataset
train_transformed <- data_train4[train_idx, ]
test_transformed <- data_train4[-train_idx, ]

4.1 Model 1: Baseline/Original Model (All Continuous Predictors)

Includes all original numeric predictors (no transformations or binning).

# Original Model
set.seed(123)
model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
             data = train, family = binomial)

summary(model1)

## 
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + lstat + medv, family = binomial, data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.553836   7.376053  -5.498 3.84e-08 ***
## zn           -0.066245   0.037617  -1.761  0.07823 .  
## indus        -0.061614   0.051685  -1.192  0.23322    
## chas          0.146493   0.881575   0.166  0.86802    
## nox          48.484952   8.483912   5.715 1.10e-08 ***
## rm           -0.210856   0.814387  -0.259  0.79570    
## age           0.030010   0.014946   2.008  0.04465 *  
## dis           0.751882   0.240194   3.130  0.00175 ** 
## rad           0.583758   0.179120   3.259  0.00112 ** 
## tax          -0.004762   0.003069  -1.552  0.12078    
## ptratio       0.332629   0.133255   2.496  0.01255 *  
## lstat         0.010455   0.062146   0.168  0.86639    
## medv          0.159638   0.077352   2.064  0.03904 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 515.67  on 372  degrees of freedom
## Residual deviance: 164.56  on 360  degrees of freedom
## AIC: 190.56
## 
## Number of Fisher Scoring iterations: 9

# Original Model - using the full dataset to train the model
set.seed(123)
model1_full <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
             data = data_train, family = binomial)

summary(model1_full)

## 
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + lstat + medv, family = binomial, data = data_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
## zn           -0.065946   0.034656  -1.903  0.05706 .  
## indus        -0.064614   0.047622  -1.357  0.17485    
## chas          0.910765   0.755546   1.205  0.22803    
## nox          49.122297   7.931706   6.193 5.90e-10 ***
## rm           -0.587488   0.722847  -0.813  0.41637    
## age           0.034189   0.013814   2.475  0.01333 *  
## dis           0.738660   0.230275   3.208  0.00134 ** 
## rad           0.666366   0.163152   4.084 4.42e-05 ***
## tax          -0.006171   0.002955  -2.089  0.03674 *  
## ptratio       0.402566   0.126627   3.179  0.00148 ** 
## lstat         0.045869   0.054049   0.849  0.39608    
## medv          0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9

4.2 Model 2: Stepwise Selection (Both directions) on Model 1

set.seed(123)
model2 <- stepAIC(model1, direction = "both", trace = 0)
summary(model2)

## 
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     medv, family = binomial, data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -38.703649   6.720204  -5.759 8.45e-09 ***
## zn           -0.068356   0.035230  -1.940 0.052348 .  
## nox          44.091914   7.397764   5.960 2.52e-09 ***
## age           0.027659   0.011918   2.321 0.020305 *  
## dis           0.720303   0.229818   3.134 0.001723 ** 
## rad           0.649291   0.165416   3.925 8.67e-05 ***
## tax          -0.006033   0.002695  -2.239 0.025168 *  
## ptratio       0.308474   0.120618   2.557 0.010544 *  
## medv          0.139526   0.042310   3.298 0.000975 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 515.67  on 372  degrees of freedom
## Residual deviance: 166.18  on 364  degrees of freedom
## AIC: 184.18
## 
## Number of Fisher Scoring iterations: 9

set.seed(123)
model2_full <- stepAIC(model1_full, direction = "both", trace = 0)

4.3 Model 3: Transformed

set.seed(123)
model3 <- glm(target ~ sqrt_zn + indus + chas + box_nox + rm + age + box_dis + log_rad + log_tax + ptratio + box_lstat + medv,
                      data = train_transformed, family = binomial)

summary(model3)

## 
## Call:
## glm(formula = target ~ sqrt_zn + indus + chas + box_nox + rm + 
##     age + box_dis + log_rad + log_tax + ptratio + box_lstat + 
##     medv, family = binomial, data = train_transformed)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.09432    7.67852  -0.924 0.355529    
## sqrt_zn     -0.15833    0.16062  -0.986 0.324244    
## indus       -0.01077    0.05263  -0.205 0.837786    
## chas         0.15399    0.86390   0.178 0.858525    
## box_nox     15.06836    2.49372   6.043 1.52e-09 ***
## rm          -0.38557    0.84657  -0.455 0.648783    
## age          0.03578    0.01540   2.324 0.020134 *  
## box_dis      4.34129    1.14921   3.778 0.000158 ***
## log_rad      3.44773    0.92599   3.723 0.000197 ***
## log_tax     -0.49984    1.19020  -0.420 0.674514    
## ptratio      0.38155    0.14009   2.724 0.006458 ** 
## box_lstat   -0.11429    0.49634  -0.230 0.817890    
## medv         0.20189    0.08383   2.408 0.016028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 515.67  on 372  degrees of freedom
## Residual deviance: 163.68  on 360  degrees of freedom
## AIC: 189.68
## 
## Number of Fisher Scoring iterations: 8

set.seed(123)
model3_full <- glm(target ~ sqrt_zn + indus + chas + box_nox + rm + age + box_dis + log_rad + log_tax + ptratio + box_lstat + medv,
                      data = data_train4, family = binomial)

4.4 Model 4: Binned Model (categorical)

Used binned variables

set.seed(123)
model4 <- glm(target ~ indus + chas + zn + nox + rad + age_bin + lstat_bin + medv_bin + tax_bin +
                       dis_group + rm_bin + ptratio_bin, data = train_transformed, family = binomial)

summary(model4)

## 
## Call:
## glm(formula = target ~ indus + chas + zn + nox + rad + age_bin + 
##     lstat_bin + medv_bin + tax_bin + dis_group + rm_bin + ptratio_bin, 
##     family = binomial, data = train_transformed)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -25.70322    4.92077  -5.223 1.76e-07 ***
## indus              -0.06526    0.06877  -0.949 0.342669    
## chas                1.00399    1.13036   0.888 0.374433    
## zn                 -0.02733    0.02798  -0.977 0.328632    
## nox                41.98241    8.00307   5.246 1.56e-07 ***
## rad                 0.32906    0.13002   2.531 0.011376 *  
## age_binMiddle      -0.01526    0.75582  -0.020 0.983894    
## age_binOld          1.07032    0.98353   1.088 0.276489    
## lstat_binMedium    -0.72671    0.71667  -1.014 0.310579    
## lstat_binHigh      -1.56766    1.01293  -1.548 0.121709    
## medv_binMedium     -0.08556    0.63220  -0.135 0.892348    
## medv_binHigh        1.77639    0.99381   1.787 0.073862 .  
## tax_binMedium       2.19475    0.63681   3.447 0.000568 ***
## tax_binHigh         0.43702    1.10176   0.397 0.691619    
## dis_groupFar        1.68912    1.09646   1.541 0.123434    
## dis_groupMedium     1.02326    0.84276   1.214 0.224681    
## rm_binMedium        0.10159    0.73776   0.138 0.890482    
## rm_binSmall         0.69452    0.87702   0.792 0.428417    
## ptratio_binLow     -1.57733    0.73206  -2.155 0.031190 *  
## ptratio_binMedium  -0.97548    0.57613  -1.693 0.090427 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 515.67  on 372  degrees of freedom
## Residual deviance: 148.81  on 353  degrees of freedom
## AIC: 188.81
## 
## Number of Fisher Scoring iterations: 9

set.seed(123)
model4_full <- glm(target ~ indus + chas + zn + nox + rad + age_bin + lstat_bin + medv_bin + tax_bin +
                       dis_group + rm_bin + ptratio_bin, data = data_train4, family = binomial)

4.4 Model 5: Lasso and ridge model

set.seed(123)
x_train <- model.matrix(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = train)[, -1]
y_train <- train$target
 
lasso_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
ridge_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)

set.seed(123)
x_train_full <- model.matrix(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = data_train)[, -1]
y_train_full <- data_train4$target

lasso_model_full <- cv.glmnet(x_train_full, y_train_full, family = "binomial", alpha = 1)
ridge_model_full <- cv.glmnet(x_train_full, y_train_full, family = "binomial", alpha = 0)

4.5 Model 6: Cross-validation glm

# Set up cross-validation (10-fold)
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(factor(target) ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
                  data = train,
                  method = "glm",
                  family = "binomial",
                  trControl = train_control)

# View cross-validation results
print(cv_model)

## Generalized Linear Model 
## 
## 373 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 335, 335, 335, 336, 336, 337, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8902402  0.7787411

set.seed(123)
cv_model_full <- train(factor(target) ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
                  data = data_train,
                  method = "glm",
                  family = "binomial",
                  trControl = train_control)

4.6 Check for multicollinearity

Model 1 has predictors all under 5 except for medv. Model 2 and 4’s predictors are all under 5 so there are no signs of serious multicollinearity (looking at GVIF^(1/(2*Df)) column for Model 4). Model 3 has predictors all under 5 except for rm and medv.

Therefore, medv should be removed from Model 1, and rm and medv should be removed from Model 3.

cat("=== VIF for Model 1 (All Predictors) ===\n")

## === VIF for Model 1 (All Predictors) ===

car::vif(model1)

##       zn    indus     chas      nox       rm      age      dis      rad 
## 1.962161 2.847886 1.279575 4.266630 4.984115 2.674398 4.013696 1.988631 
##      tax  ptratio    lstat     medv 
## 2.150796 2.100827 2.457361 6.781809

cat("=== VIF for Model 2 (Stepwise Selection) ===\n")

## === VIF for Model 2 (Stepwise Selection) ===

car::vif(model2)

##       zn      nox      age      dis      rad      tax  ptratio     medv 
## 1.909515 3.433703 1.764280 3.839024 1.718483 1.739576 1.771036 2.068256

cat("=== VIF for Model 3 (Transformed) ===\n")

## === VIF for Model 3 (Transformed) ===

car::vif(model3)

##   sqrt_zn     indus      chas   box_nox        rm       age   box_dis   log_rad 
##  1.758147  2.974680  1.248300  4.089883  5.086441  2.845874  4.328683  1.899500 
##   log_tax   ptratio box_lstat      medv 
##  2.856757  2.369010  3.123333  7.448702

cat("=== VIF for Model 4 (Binned Categorical) ===\n")

## === VIF for Model 4 (Binned Categorical) ===

car::vif(model4)

##                 GVIF Df GVIF^(1/(2*Df))
## indus       4.451754  1        2.109918
## chas        1.438910  1        1.199546
## zn          1.996860  1        1.413103
## nox         4.002862  1        2.000715
## rad         1.430448  1        1.196013
## age_bin     4.482481  2        1.455056
## lstat_bin   4.485162  2        1.455273
## medv_bin    4.674577  2        1.470400
## tax_bin     4.563485  2        1.461585
## dis_group   5.722877  2        1.546691
## rm_bin      3.169475  2        1.334280
## ptratio_bin 2.820375  2        1.295916

Removing variables which show high multicollinearity:

# Remove medv from Model 1
set.seed(123)
model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat,
             data = train, family = binomial)
cat("=== UPDATED VIF for Model 1 (VERSION 2) ===\n")

## === UPDATED VIF for Model 1 (VERSION 2) ===

car::vif(model1)

##       zn    indus     chas      nox       rm      age      dis      rad 
## 1.843492 2.711253 1.257539 3.870959 2.183782 2.095609 3.027371 1.862203 
##      tax  ptratio    lstat 
## 1.944040 1.450484 2.448452

# Remove medv and rm from Model 3
set.seed(123)
model3 <- glm(target ~ sqrt_zn + indus + chas + box_nox + age + box_dis + log_rad + log_tax + ptratio + box_lstat,
                      data = train_transformed, family = binomial)
cat("=== UPDATED VIF for Model 3 (VERSION 2) ===\n")

## === UPDATED VIF for Model 3 (VERSION 2) ===

car::vif(model3)

##   sqrt_zn     indus      chas   box_nox       age   box_dis   log_rad   log_tax 
##  1.593900  2.891091  1.210848  3.527924  1.832794  3.128395  1.765819  2.544521 
##   ptratio box_lstat 
##  1.712071  1.657669

All predictors look good now.

4.7 Interpretation of coefficients

# Interpretation of coefficients (Model 1)
cat("\n=== Coefficient Interpretation (Model 1) ===\n")

## 
## === Coefficient Interpretation (Model 1) ===

coef_exp <- exp(coef(model1))
print(round(coef_exp, 4))

##  (Intercept)           zn        indus         chas          nox           rm 
## 0.000000e+00 9.398000e-01 9.447000e-01 1.106000e+00 5.120047e+19 3.124400e+00 
##          age          dis          rad          tax      ptratio        lstat 
## 1.016300e+00 1.742600e+00 1.796000e+00 9.942000e-01 1.204100e+00 1.002400e+00

exp(cbind(OR = coef(model1), confint(model1)))

## Waiting for profiling to be done...

##                       OR        2.5 %       97.5 %
## (Intercept) 1.339951e-17 3.995636e-24 6.325113e-12
## zn          9.398163e-01 8.698073e-01 9.983483e-01
## indus       9.446569e-01 8.504886e-01 1.038875e+00
## chas        1.105964e+00 1.987315e-01 6.084096e+00
## nox         5.120047e+19 2.335527e+13 1.458298e+27
## rm          3.124363e+00 1.164671e+00 8.971466e+00
## age         1.016316e+00 9.911503e-01 1.042961e+00
## dis         1.742596e+00 1.166807e+00 2.686720e+00
## rad         1.795954e+00 1.341720e+00 2.579966e+00
## tax         9.942307e-01 9.880696e-01 9.998517e-01
## ptratio     1.204097e+00 9.789906e-01 1.492546e+00
## lstat       1.002386e+00 8.882310e-01 1.128222e+00

# Interpretation of coefficients (Model 2)
cat("\n=== Coefficient Interpretation (Model 2) ===\n")

## 
## === Coefficient Interpretation (Model 2) ===

coef_exp <- exp(coef(model2))
print(round(coef_exp, 4))

##  (Intercept)           zn          nox          age          dis          rad 
## 0.000000e+00 9.339000e-01 1.408883e+19 1.028000e+00 2.055100e+00 1.914200e+00 
##          tax      ptratio         medv 
## 9.940000e-01 1.361300e+00 1.149700e+00

exp(cbind(OR = coef(model2), confint(model2)))

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                       OR        2.5 %       97.5 %
## (Intercept) 1.553170e-17 8.296760e-24 2.751666e-12
## zn          9.339277e-01 8.650580e-01 9.920167e-01
## nox         1.408883e+19 2.063381e+13 9.656341e+25
## age         1.028045e+00 1.004871e+00 1.053272e+00
## dis         2.055055e+00 1.333280e+00 3.306993e+00
## rad         1.914184e+00 1.417883e+00 2.711077e+00
## tax         9.939851e-01 9.882596e-01 9.989619e-01
## ptratio     1.361346e+00 1.081946e+00 1.741766e+00
## medv        1.149729e+00 1.063397e+00 1.255596e+00

# Interpretation of coefficients (Model 3)
cat("\n=== Coefficient Interpretation (Model 3) ===\n")

## 
## === Coefficient Interpretation (Model 3) ===

coef_exp <- exp(coef(model3))
print(round(coef_exp, 4))

## (Intercept)     sqrt_zn       indus        chas     box_nox         age 
##    353.0446      0.9271      0.9768      1.2964 591856.6819      1.0308 
##     box_dis     log_rad     log_tax     ptratio   box_lstat 
##     15.4588     37.5716      0.2820      1.2237      0.3877

exp(cbind(OR = coef(model3), confint(model3)))

## Waiting for profiling to be done...

##                       OR        2.5 %       97.5 %
## (Intercept) 3.530446e+02 1.990470e-03 3.737712e+07
## sqrt_zn     9.270792e-01 6.922799e-01 1.211593e+00
## indus       9.767896e-01 8.809233e-01 1.073748e+00
## chas        1.296402e+00 2.357781e-01 7.349952e+00
## box_nox     5.918567e+05 9.622542e+03 7.639261e+07
## age         1.030814e+00 1.006989e+00 1.056733e+00
## box_dis     1.545882e+01 2.488980e+00 1.094425e+02
## log_rad     3.757159e+01 8.912691e+00 2.705988e+02
## log_tax     2.819801e-01 3.241522e-02 2.778774e+00
## ptratio     1.223655e+00 9.832892e-01 1.531685e+00
## box_lstat   3.876717e-01 1.908239e-01 7.287988e-01

# Interpretation of coefficients (Model 4)
cat("\n=== Coefficient Interpretation (Model 4) ===\n")

## 
## === Coefficient Interpretation (Model 4) ===

coef_exp <- exp(coef(model4))
print(round(coef_exp, 4))

##       (Intercept)             indus              chas                zn 
##       0.00000e+00       9.36800e-01       2.72910e+00       9.73000e-01 
##               nox               rad     age_binMiddle        age_binOld 
##       1.70895e+18       1.38970e+00       9.84900e-01       2.91630e+00 
##   lstat_binMedium     lstat_binHigh    medv_binMedium      medv_binHigh 
##       4.83500e-01       2.08500e-01       9.18000e-01       5.90850e+00 
##     tax_binMedium       tax_binHigh      dis_groupFar   dis_groupMedium 
##       8.97780e+00       1.54810e+00       5.41470e+00       2.78220e+00 
##      rm_binMedium       rm_binSmall    ptratio_binLow ptratio_binMedium 
##       1.10690e+00       2.00270e+00       2.06500e-01       3.77000e-01

exp(cbind(OR = coef(model4), confint(model4)))

## Waiting for profiling to be done...

##                             OR        2.5 %       97.5 %
## (Intercept)       6.874350e-12 1.713695e-16 5.033104e-08
## indus             9.368254e-01 8.149676e-01 1.070675e+00
## chas              2.729139e+00 2.924899e-01 2.486098e+01
## zn                9.730379e-01 9.152451e-01 1.023090e+00
## nox               1.708950e+18 8.928399e+11 5.036969e+25
## rad               1.389664e+00 1.168387e+00 1.990123e+00
## age_binMiddle     9.848581e-01 2.112166e-01 4.244191e+00
## age_binOld        2.916306e+00 4.161854e-01 2.050067e+01
## lstat_binMedium   4.834962e-01 1.131297e-01 1.941861e+00
## lstat_binHigh     2.085333e-01 2.651914e-02 1.449017e+00
## medv_binMedium    9.179994e-01 2.629866e-01 3.195360e+00
## medv_binHigh      5.908514e+00 8.809663e-01 4.512682e+01
## tax_binMedium     8.977791e+00 2.703570e+00 3.411511e+01
## tax_binHigh       1.548093e+00 1.680008e-01 1.304128e+01
## dis_groupFar      5.414723e+00 6.879042e-01 5.342988e+01
## dis_groupMedium   2.782248e+00 5.667518e-01 1.623084e+01
## rm_binMedium      1.106924e+00 2.538354e-01 4.703324e+00
## rm_binSmall       2.002741e+00 3.610490e-01 1.150066e+01
## ptratio_binLow    2.065268e-01 4.541192e-02 8.273234e-01
## ptratio_binMedium 3.770113e-01 1.189391e-01 1.155709e+00

5 Select Models

Recommended: Model 4 Key Strengths:

Highest accuracy (97.85%) Highest performance metrics overall (Precision, Sensitivity, Specificity, F1 Score) Second Lowest AIC (188.81) - Although not the highest, compared to other models, good balance between model fit and complexity High AUC (0.9872) - Excellent discriminatory power, only slightly lower than Model 1, 2 and 3 No unusual data points Statistically significant predictors - All variables retained are meaningful Categorical predictors make the model simpler and more interpretable

Trade-offs:

Binned residuals test performed the poorest. Only about 32% of the residuals are inside the error bounds, yet many of the red points are close to the ends of the distribution. There is no real pattern (seems random), which is good.

Reasoning on not choosing other models:

Model 1 (All Predictors):

Second highest accuracy (94.62%) BUT: Lowest AUC (0.9848) and second highest AIC (193.197) - includes potentially unnecessary variables Linearity log-odds test did not perform well, with many predictors showing non-linearity (to be expected). Only about 47% of the residuals are inside the error bounds. Unusual data point do exist. May have multicollinearity issues or non-significant predictors Less interpretable with 11 variables (ended up removing one after VIF check)

Model 2 (Stepwise Selection on Model 1):

Highest AUC (0.9877) Lowest AIC (184.183) BUT: Lowest accuracy (92.47%) Linearity log-odds test did not perform well, with many predictors showing non-linearity (to be expected). Only about 47% of the residuals are inside the error bounds. Unusual data point do exist. May have multicollinearity issues or non-significant predictors. This would’ve been a solid option for the final model if the accuracy was better.

Model 3 (Transformed Variables):

Highest AIC (196.256) despite transformations. Lower performance metrics across the board. Transformations didn’t improve the model. Linearity log-odds test performed okay, with a few predictors showing non-linearity. Only about 37% of the residuals are inside the error bounds. Unusual data points do exist.

Final Recommendation: Select Model 4 for its optimal balance of:

Statistical rigor (2nd lowest AIC) Predictive performance (High AUC and highest accuracy) Model simplicity and interpretability Practical usability

Some takeaways from Model 4’s coefficients are:

lstat_binHigh (percentage of population with lower status) odds ratio is 0.2 (less than 1) meaning the odds of being at risk for crime are 0.2 lower than lstat_binLow (reference variable). This does not match with our correlation funnel.
medv_binHigh (Median value of owner-occupied homes (in $1000s)) odds ratio is 0.9 (less than 1) meaning the odds of being at risk for crime are 0.2 lower than medv_binLow
ptratio_binLow (Pupil-teacher ratio by town) odds ratio is 0.2 (less than 1) meaning the odds of being at risk for crime are 0.2 lower than ptratio_binHigh (reference variable)
dis_groupFar (Weighted mean of distances to five Boston employment centers) odds ratio is 5.4 (greater than 1) meaning the odds of being at risk for crime are 5.4 higher than dis_groupClose (reference variable)
tax_binHigh (Full-value property-tax rate per $10,000) odds ratio is 1.5 (greater than 1) meaning the odds of being at risk for crime are 1.5 higher than tax_binLow (reference variable)

Some of the coefficients match our initial insights, yet some do not. This model doesn’t fully capture all of the relationships between the predictors, so there could be some improvement.

# Model 1 Evaluation
pred1_prob <- predict(model1, test, type = "response")
pred1_class <- ifelse(pred1_prob > 0.5, 1, 0)

cm1 <- table(Predicted = pred1_class, Actual = test$target)
acc1 <- sum(diag(cm1)) / sum(cm1)
prec1 <- cm1[2,2] / sum(cm1[2,])
sens1 <- cm1[2,2] / sum(cm1[,2])
spec1 <- cm1[1,1] / sum(cm1[,1])
f1_1 <- 2 * (prec1 * sens1) / (prec1 + sens1)
roc1 <- roc(test$target, pred1_prob, quiet = TRUE)
auc1 <- auc(roc1)

output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
                 "=== Model 1 Evaluation ===\n",
                 "Confusion Matrix:\n",
                 paste(capture.output(print(cm1)), collapse = "\n"), "\n",
                 "Accuracy:", round(acc1, 4), "| Precision:", round(prec1, 4), 
                 "| Sensitivity:", round(sens1, 4), "| Specificity:", round(spec1, 4), "\n",
                 "F1 Score:", round(f1_1, 4), "| AUC:", round(auc1, 4), 
                 "| AIC:", round(AIC(model1), 4), "\n\n", sep = " ")
cat(output1)

## 
## === Model Selection and Evaluation ===
## 
##  === Model 1 Evaluation ===
##  Confusion Matrix:
##           Actual
## Predicted  0  1
##         0 38  4
##         1  1 50 
##  Accuracy: 0.9462 | Precision: 0.9804 | Sensitivity: 0.9259 | Specificity: 0.9744 
##  F1 Score: 0.9524 | AUC: 0.9848 | AIC: 193.1973

# Diagnostics
# Logistic Regression Assumption Check
# This chunk of code is from Statistical tools for high-throughput data analysis (STHDA)
predictors <- colnames(test)
test_copy <- test
test_copy <- test_copy %>%
  mutate(logit = log(pred1_prob/(1-pred1_prob))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(test_copy, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")

## `geom_smooth()` using formula = 'y ~ x'

# Deviance Residuals Plot
deviance_residuals <- sign(test$target - pred1_prob) * 
 sqrt(-2 * (test$target * log(pred1_prob) + 
 (1 - test$target) * log(1 - pred1_prob)))
plot(pred1_prob, deviance_residuals,
  xlab = "Fitted Probabilities",
  ylab = "Deviance Residuals",
  main = "Deviance Residuals vs. Fitted Probabilities")

# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)

## [1] 0.1126005

plot(model1, which = 4, id.n = 5)

qqnorm(residuals(model1))

halfnorm(hatvalues(model1))

# x <- predict(model1)
# y <- resid(model1)
# binnedplot(x,y)

# Binned residuals
binned_result <- binned_residuals(model1)
binned_result

## Warning: Probably bad model fit. Only about 47% of the residuals are inside the error bounds.

plot(binned_residuals(model1), show_dots = TRUE)

# Model 2 Evaluation
pred2_prob <- predict(model2, test, type = "response")
pred2_class <- ifelse(pred2_prob > 0.5, 1, 0)
cm2 <- table(Predicted = pred2_class, Actual = test$target)
acc2 <- sum(diag(cm2)) / sum(cm2)
prec2 <- cm2[2,2] / sum(cm2[2,])
sens2 <- cm2[2,2] / sum(cm2[,2])
spec2 <- cm2[1,1] / sum(cm2[,1])
f1_2 <- 2 * (prec2 * sens2) / (prec2 + sens2)
roc2 <- roc(test$target, pred2_prob, quiet = TRUE)
auc2 <- auc(roc2)

output2 <- paste("=== Model 2 Evaluation ===\n",
                 "Confusion Matrix:\n",
                 paste(capture.output(print(cm2)), collapse = "\n"), "\n",
                 "Accuracy:", round(acc2, 4), "| Precision:", round(prec2, 4), 
                 "| Sensitivity:", round(sens2, 4), "| Specificity:", round(spec2, 4), "\n",
                 "F1 Score:", round(f1_2, 4), "| AUC:", round(auc2, 4), 
                 "| AIC:", round(AIC(model2), 4), "\n\n", sep = "")
cat(output2)

## === Model 2 Evaluation ===
## Confusion Matrix:
##          Actual
## Predicted  0  1
##         0 39  7
##         1  0 47
## Accuracy:0.9247| Precision:1| Sensitivity:0.8704| Specificity:1
## F1 Score:0.9307| AUC:0.9877| AIC:184.183

# Logistic Regression Assumption Check
test_copy2 <- test
test_copy2 <- test_copy2 %>%
  mutate(logit = log(pred2_prob/(1-pred2_prob))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(test_copy2, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")

## `geom_smooth()` using formula = 'y ~ x'

# Deviance Residuals Plot
deviance_residuals <- sign(test$target - pred2_prob) * 
 sqrt(-2 * (test$target * log(pred2_prob) + 
 (1 - test$target) * log(1 - pred2_prob)))
plot(pred2_prob, deviance_residuals,
  xlab = "Fitted Probabilities",
  ylab = "Deviance Residuals",
  main = "Deviance Residuals vs. Fitted Probabilities")

# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)

## [1] 0.1126005

plot(model2, which = 4, id.n = 5)

qqnorm(residuals(model2))

halfnorm(hatvalues(model2))

# Binned residuals
binned_result <- binned_residuals(model2)
binned_result

## Warning: Probably bad model fit. Only about 47% of the residuals are inside the error bounds.

plot(binned_residuals(model2), show_dots = TRUE)

# Model 3 Evaluation
pred3_prob <- predict(model3, test_transformed, type = "response")
pred3_class <- ifelse(pred3_prob > 0.5, 1, 0)
cm3 <- table(Predicted = pred3_class, Actual = test_transformed$target)
acc3 <- sum(diag(cm3)) / sum(cm3)
prec3 <- cm3[2,2] / sum(cm3[2,])
sens3 <- cm3[2,2] / sum(cm3[,2])
spec3 <- cm3[1,1] / sum(cm3[,1])
f1_3 <- 2 * (prec3 * sens3) / (prec3 + sens3)
roc3 <- roc(test_transformed$target, pred3_prob, quiet = TRUE)
auc3 <- auc(roc3)

output3 <- paste("=== Model 3 Evaluation ===\n",
                 "Confusion Matrix:\n",
                 paste(capture.output(print(cm3)), collapse = "\n"), "\n",
                 "Accuracy:", round(acc3, 4), "| Precision:", round(prec3, 4), 
                 "| Sensitivity:", round(sens3, 4), "| Specificity:", round(spec3, 4), "\n",
                 "F1 Score:", round(f1_3, 4), "| AUC:", round(auc3, 4), 
                 "| AIC:", round(AIC(model3), 4), "\n\n", sep = "")
cat(output3)

## === Model 3 Evaluation ===
## Confusion Matrix:
##          Actual
## Predicted  0  1
##         0 38  7
##         1  1 47
## Accuracy:0.914| Precision:0.9792| Sensitivity:0.8704| Specificity:0.9744
## F1 Score:0.9216| AUC:0.9877| AIC:196.2563

# Logistic Regression Assumption Check
predictors <- colnames(test_transformed)
test_transformed_copy <- test_transformed |>
  dplyr::select_if(is.numeric)
test_transformed_copy <- test_transformed_copy %>%
  mutate(logit = log(pred3_prob/(1-pred3_prob))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(test_transformed_copy, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess", span = 0.9) +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")

## `geom_smooth()` using formula = 'y ~ x'

# Deviance Residuals Plot
deviance_residuals <- sign(test_transformed$target - pred3_prob) * 
 sqrt(-2 * (test_transformed$target * log(pred3_prob) + 
 (1 - test_transformed$target) * log(1 - pred3_prob)))
plot(pred3_prob, deviance_residuals,
  xlab = "Fitted Probabilities",
  ylab = "Deviance Residuals",
  main = "Deviance Residuals vs. Fitted Probabilities")

# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)

## [1] 0.1126005

plot(model3, which = 4, id.n = 5)

qqnorm(residuals(model3))

halfnorm(hatvalues(model3))

# Binned residuals
binned_result <- binned_residuals(model3)
binned_result

## Warning: Probably bad model fit. Only about 37% of the residuals are inside the error bounds.

plot(binned_residuals(model3), show_dots = TRUE)

# Model 4 Evaluation
pred4_prob <- predict(model4, test_transformed, type = "response")
pred4_class <- ifelse(pred4_prob > 0.5, 1, 0)
cm4 <- table(Predicted = pred4_class, Actual = test_transformed$target)
acc4 <- sum(diag(cm4)) / sum(cm4)
prec4 <- cm4[2,2] / sum(cm4[2,])
sens4 <- cm4[2,2] / sum(cm4[,2])
spec4 <- cm4[1,1] / sum(cm4[,1])
f1_4 <- 2 * (prec4 * sens4) / (prec4 + sens4)
roc4 <- roc(test_transformed$target, pred4_prob, quiet = TRUE)
auc4 <- auc(roc4)

output4 <- paste("=== Model 4 Evaluation ===\n",
                 "Confusion Matrix:\n",
                 paste(capture.output(print(cm4)), collapse = "\n"), "\n",
                 "Accuracy:", round(acc4, 4), "| Precision:", round(prec4, 4), 
                 "| Sensitivity:", round(sens4, 4), "| Specificity:", round(spec4, 4), "\n",
                 "F1 Score:", round(f1_4, 4), "| AUC:", round(auc4, 4), 
                 "| AIC:", round(AIC(model4), 4), "\n\n", sep = "")
cat(output4)

## === Model 4 Evaluation ===
## Confusion Matrix:
##          Actual
## Predicted  0  1
##         0 39  2
##         1  0 52
## Accuracy:0.9785| Precision:1| Sensitivity:0.963| Specificity:1
## F1 Score:0.9811| AUC:0.9872| AIC:188.8146

# Logistic Regression Assumption Check
predictors <- colnames(test_transformed)
test_transformed_copy <- test_transformed |>
  dplyr::select_if(is.numeric)
test_transformed_copy <- test_transformed_copy %>%
  mutate(logit = log(pred4_prob/(1-pred4_prob))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(test_transformed_copy, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess", span = 0.9) +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")

## `geom_smooth()` using formula = 'y ~ x'

# Deviance Residuals Plot
deviance_residuals <- sign(test_transformed$target - pred4_prob) * 
 sqrt(-2 * (test_transformed$target * log(pred4_prob) + 
 (1 - test_transformed$target) * log(1 - pred4_prob)))
plot(pred3_prob, deviance_residuals,
  xlab = "Fitted Probabilities",
  ylab = "Deviance Residuals",
  main = "Deviance Residuals vs. Fitted Probabilities")

# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)

## [1] 0.1126005

plot(model4, which = 4, id.n = 5)

qqnorm(residuals(model4))

halfnorm(hatvalues(model4))

# Binned residuals
binned_result <- binned_residuals(model4)
binned_result

## Warning: Probably bad model fit. Only about 32% of the residuals are inside the error bounds.

plot(binned_residuals(model4), show_dots = TRUE)

# Create comparison table
comparison <- data.frame(
  Model = c("Model1", "Model2", "Model3", "Model4"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3), AIC(model4)),
  Accuracy = c(acc1, acc2, acc3, acc4),
  Error_Rate = c(1-acc1, 1-acc2, 1-acc3, 1-acc4),
  Precision = c(prec1, prec2, prec3, prec4),
  Sensitivity = c(sens1, sens2, sens3, sens4),
  Specificity = c(spec1, spec2, spec3, spec4),
  F1_Score = c(f1_1, f1_2, f1_3, f1_4),
  AUC = c(auc1, auc2, auc3, auc4)
)

cat("\n=== Model Comparison ===\n")

## 
## === Model Comparison ===

print(comparison)

##    Model      AIC  Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Model1 193.1973 0.9462366 0.05376344 0.9803922   0.9259259    0.974359
## 2 Model2 184.1830 0.9247312 0.07526882 1.0000000   0.8703704    1.000000
## 3 Model3 196.2563 0.9139785 0.08602151 0.9791667   0.8703704    0.974359
## 4 Model4 188.8146 0.9784946 0.02150538 1.0000000   0.9629630    1.000000
##    F1_Score       AUC
## 1 0.9523810 0.9848053
## 2 0.9306931 0.9876543
## 3 0.9215686 0.9876543
## 4 0.9811321 0.9871795

# Plot ROC curves
plot(roc1, col = "blue", main = "ROC Curves Comparison")
plot(roc2, col = "red", add = TRUE)
plot(roc3, col = "green", add = TRUE)
plot(roc4, col = "purple", add = TRUE)
legend("bottomright", 
       legend = paste0(c("Model1", "Model2", "Model3", "Model4"), " (AUC=", 
                      round(c(auc1, auc2, auc3, auc4), 3), ")"),
       col = c("blue", "red", "green", "purple"),
       lwd = 2)

# Select best model (Model 4)
output_text <- paste(
"\n=== SELECTED MODEL: Model 4 ===\n",
"Reasons for selection:\n",
"1. Low AIC indicating good model fit\n",
"2. Highest accuracy and very high AUC, displaying good discriminatory power, only slightly lower than the other models\n",
"3. Creating categorical variables made the model simpler and easier to interpret\n",
"4. All predictors are statistically significant\n\n"
)

cat(output_text)

## 
## === SELECTED MODEL: Model 4 ===
##  Reasons for selection:
##  1. Low AIC indicating good model fit
##  2. Highest accuracy and very high AUC, displaying good discriminatory power, only slightly lower than the other models
##  3. Creating categorical variables made the model simpler and easier to interpret
##  4. All predictors are statistically significant

final_model <- model4
summary(final_model)

## 
## Call:
## glm(formula = target ~ indus + chas + zn + nox + rad + age_bin + 
##     lstat_bin + medv_bin + tax_bin + dis_group + rm_bin + ptratio_bin, 
##     family = binomial, data = train_transformed)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -25.70322    4.92077  -5.223 1.76e-07 ***
## indus              -0.06526    0.06877  -0.949 0.342669    
## chas                1.00399    1.13036   0.888 0.374433    
## zn                 -0.02733    0.02798  -0.977 0.328632    
## nox                41.98241    8.00307   5.246 1.56e-07 ***
## rad                 0.32906    0.13002   2.531 0.011376 *  
## age_binMiddle      -0.01526    0.75582  -0.020 0.983894    
## age_binOld          1.07032    0.98353   1.088 0.276489    
## lstat_binMedium    -0.72671    0.71667  -1.014 0.310579    
## lstat_binHigh      -1.56766    1.01293  -1.548 0.121709    
## medv_binMedium     -0.08556    0.63220  -0.135 0.892348    
## medv_binHigh        1.77639    0.99381   1.787 0.073862 .  
## tax_binMedium       2.19475    0.63681   3.447 0.000568 ***
## tax_binHigh         0.43702    1.10176   0.397 0.691619    
## dis_groupFar        1.68912    1.09646   1.541 0.123434    
## dis_groupMedium     1.02326    0.84276   1.214 0.224681    
## rm_binMedium        0.10159    0.73776   0.138 0.890482    
## rm_binSmall         0.69452    0.87702   0.792 0.428417    
## ptratio_binLow     -1.57733    0.73206  -2.155 0.031190 *  
## ptratio_binMedium  -0.97548    0.57613  -1.693 0.090427 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 515.67  on 372  degrees of freedom
## Residual deviance: 148.81  on 353  degrees of freedom
## AIC: 188.81
## 
## Number of Fisher Scoring iterations: 9

6 Predictions on evaluated data

After selecting the best-performing model, we applied it to the evaluation dataset. The model was then used to predict whether the crime rate is above the median crime rate. The distribution of the target variable can be seen below:

data_eval <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/crime-evaluation-data_modified.csv")


# No missing data
data_eval %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))

## # A tibble: 0 × 2
## # ℹ 2 variables: variable <chr>, missing_count <int>

# Binning categories
data_eval2 <- data_eval %>%
  mutate(
    age_bin = cut(age,
                  breaks = quantile(age, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                  labels = c("Young", "Middle", "Old"),
                  include.lowest = TRUE),
    lstat_bin = cut(lstat,
                    breaks = quantile(lstat, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                    labels = c("Low", "Medium", "High"),
                    include.lowest = TRUE),
    medv_bin = cut(medv,
                   breaks = quantile(medv, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                   labels = c("Low", "Medium", "High"),
                   include.lowest = TRUE),
    tax_bin = cut(tax,
                  breaks = quantile(tax, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE),
    dis_group = case_when(
      ntile(dis, 3) == 1 ~ "Close",
      ntile(dis, 3) == 2 ~ "Medium",
      ntile(dis, 3) == 3 ~ "Far"
    ),
    ptratio_bin = case_when(
      ntile(ptratio, 3) == 1 ~ "Low",
      ntile(ptratio, 3) == 2 ~ "Medium",
      ntile(ptratio, 3) == 3 ~ "High"
    ),
    rm_bin = case_when(
      ntile(rm, 3) == 1 ~ "Small",
      ntile(rm, 3) == 2 ~ "Medium",
      ntile(rm, 3) == 3 ~ "Large"
    )
  )

# Convert to factor
data_eval3 <- data_eval2 %>%
  mutate(across(ends_with("_bin") | ends_with("_group"), ~factor(.)))


# Make predictions
eval_predictions <- predict(final_model, data_eval3, type = "response")
eval_classifications <- ifelse(eval_predictions > 0.5, 1, 0)
 
# # Create output dataframe
eval_output <- cbind(data_eval, 
Probability = round(eval_predictions, 4),
Classification = eval_classifications
)

# Bar plot of the target variable (0 = low crime, 1 = high crime)
ggplot(eval_output, aes(x = factor(Classification))) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Target Variable",
    x = "Crime Level (0 = Low, 1 = High)",
    y = "Count"
  ) +
  theme_minimal()

# # Save predictions
write.csv(eval_output, "crime_predictions.csv", row.names = FALSE)