In this homework assignment, we will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0). Our objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. We will provide classifications and probabilities for the evaluation data set using your binary logistic regression model.Below is a short description of the variables of interest in the data set:
variables <- data.frame(
Variable = c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad",
"tax", "ptratio", "lstat", "medv", "target"),
Description = c(
"Proportion of residential land zoned for large lots (over 25,000 sq. ft.)",
"Proportion of non-retail business acres per suburb",
"Whether the suburb borders the Charles River (1 = yes, 0 = no)",
"Nitrogen oxides concentration (parts per 10 million)",
"Average number of rooms per dwelling",
"Proportion of owner-occupied units built before 1940",
"Weighted mean of distances to five Boston employment centers",
"Index of accessibility to radial highways",
"Full-value property-tax rate per $10,000",
"Pupil-teacher ratio by town",
"Percentage of population with lower status",
"Median value of owner-occupied homes (in $1000s)",
"Whether the crime rate is above the median (1 = yes, 0 = no)"),
`Type` = c(
rep("Predictor", 12),
"Response" ))
kable(variables)
| Variable | Description | Type |
|---|---|---|
| zn | Proportion of residential land zoned for large lots (over 25,000 sq. ft.) | Predictor |
| indus | Proportion of non-retail business acres per suburb | Predictor |
| chas | Whether the suburb borders the Charles River (1 = yes, 0 = no) | Predictor |
| nox | Nitrogen oxides concentration (parts per 10 million) | Predictor |
| rm | Average number of rooms per dwelling | Predictor |
| age | Proportion of owner-occupied units built before 1940 | Predictor |
| dis | Weighted mean of distances to five Boston employment centers | Predictor |
| rad | Index of accessibility to radial highways | Predictor |
| tax | Full-value property-tax rate per $10,000 | Predictor |
| ptratio | Pupil-teacher ratio by town | Predictor |
| lstat | Percentage of population with lower status | Predictor |
| medv | Median value of owner-occupied homes (in $1000s) | Predictor |
| target | Whether the crime rate is above the median (1 = yes, 0 = no) | Response |
data_train <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/crime-training-data_modified.csv")
data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/crime-evaluation-data_modified.csv")
The crime dataset contains 466 observations and 13 variables, each representing the characteristics of different neighborhoods in a major city, including socioeconomic, environmental, and housing-related factors. The response variable, target, is a binary indicator that denotes whether the crime rate in a neighborhood is above the median (1) or not (0). The goal is to predict whether a neighborhood is at risk for high crime levels, as defined by whether its crime rate is above the median (target = 1). By analyzing patterns in the provided predictor variables, we aim to develop a binary logistic regression model that can estimate the probability of high crime and classify neighborhoods accordingly. This model will later be used to generate both class predictions for an evaluation dataset.
The table below provides some summary statistics of the predictor variables in the training dataset. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable. Some interesting things to note here is that zn, proportion of residential land zoned for large lots (over 25000 square feet), has a large spread but many zero values (median is zero). ptratio, pupil-teacher ratio by town, also has a median which is larger, closer to the max value, indicating potential skewness.
Additionally, there are no duplicates in this dataset.
# Structure of the data
str(data_train)
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
# Glimpse of the data
head(data_train)
## zn indus chas nox rm age dis rad tax ptratio lstat medv target
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
## 2 0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
## 3 0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
## 4 30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
## 5 0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0
## 6 0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 7.67 26.5 0
# Summary statistics
stats <- data_train %>%
dplyr::select(-target) %>%
summarise(across(everything(), list(
Min = min,
Max = max,
Mean = mean,
Median = median,
SD = sd
))) %>%
pivot_longer(cols = everything(),
names_to = c("Variable", ".value"),
names_sep = "_")
kable(stats, caption = "Descriptive Statistics of Predictor Variables")
| Variable | Min | Max | Mean | Median | SD |
|---|---|---|---|---|---|
| zn | 0.0000 | 100.0000 | 11.5772532 | 0.00000 | 23.3646511 |
| indus | 0.4600 | 27.7400 | 11.1050215 | 9.69000 | 6.8458549 |
| chas | 0.0000 | 1.0000 | 0.0708155 | 0.00000 | 0.2567920 |
| nox | 0.3890 | 0.8710 | 0.5543105 | 0.53800 | 0.1166667 |
| rm | 3.8630 | 8.7800 | 6.2906738 | 6.21000 | 0.7048513 |
| age | 2.9000 | 100.0000 | 68.3675966 | 77.15000 | 28.3213784 |
| dis | 1.1296 | 12.1265 | 3.7956929 | 3.19095 | 2.1069496 |
| rad | 1.0000 | 24.0000 | 9.5300429 | 5.00000 | 8.6859272 |
| tax | 187.0000 | 711.0000 | 409.5021459 | 334.50000 | 167.9000887 |
| ptratio | 12.6000 | 22.0000 | 18.3984979 | 18.90000 | 2.1968447 |
| lstat | 1.7300 | 37.9700 | 12.6314592 | 11.35000 | 7.1018907 |
| medv | 5.0000 | 50.0000 | 22.5892704 | 21.20000 | 9.2396814 |
# Check for duplicates
duplicates <- duplicated(data_train)
# Print the duplicates
print(data_train[duplicates, ])
## [1] zn indus chas nox rm age dis rad tax
## [10] ptratio lstat medv target
## <0 rows> (or 0-length row.names)
There is a fairly even distribution of the target variable (high vs low crime). No missing values were found.
Information about the distribution: - age (age) is left skewed -
pupil-teacher ratio (ptratio) is a slightly right-skewed
distribution
- Zoning (zn),Pollution (nox), lower status of the population (lstat)
and distance to employment (dis) is right skewed. - Average number of
rooms (rn) and median value of owner-occupied homes (medv) is somewhat
symmetrical and normally distributed
The distributions for the remaining variables are multimodal, including the distribution for chas.
# Plot histograms of all numeric variables in data_train
data_train |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
theme_minimal()
# Bar plot of the target variable (0 = low crime, 1 = high crime)
ggplot(data_train, aes(x = factor(target))) +
geom_bar(fill = "steelblue", color = "black") +
labs(
title = "Distribution of Target Variable",
x = "Crime Level (0 = Low, 1 = High)",
y = "Count"
) +
theme_minimal()
# Count of missing Values per Variable
missing <- sapply(data_train, function(x) sum(is.na(x)))
kable(data.frame(Variable = names(missing), Missing = missing), caption = "Missing Values in Each Variable")
| Variable | Missing | |
|---|---|---|
| zn | zn | 0 |
| indus | indus | 0 |
| chas | chas | 0 |
| nox | nox | 0 |
| rm | rm | 0 |
| age | age | 0 |
| dis | dis | 0 |
| rad | rad | 0 |
| tax | tax | 0 |
| ptratio | ptratio | 0 |
| lstat | lstat | 0 |
| medv | medv | 0 |
| target | target | 0 |
Boxplots show significant variation in some predictors, indicating potential outliers or skewness, which will be considered during modeling.
For example, age and tax both have a much wider spread than the other predictors.
Variables such as nox, age, rad, dis, and zn have a correlation with crime rates:
Top positive correlations: nox 0.73: Higher air pollution is associated with higher crime areas. Specifically, when nox is 0.624 or greater it has a greater correlation with the crime rate being above the median crime rate. age 0.63: Older housing stock tends to be in high crime areas. Specifically, when the age is 77.15 or greater there is a greater correlation with the crime rate being above the median crime rate. rad 0.63: Greater accessibility to radial highways correlates with higher crime rates. Specifically, when rad is 5 or greater it has a greater correlation with the crime rate being above the median crime rate.
Top negative correlations: dis -0.62: Neighborhoods further from employment centers tend to have lower crime. Specifically, when dis is 5.3 or greater it has a lower correlation with the crime rate being above the median crime rate. zn -0.43: More residential zoning for large lots is linked with lower crime. Specifically, when zn is 16.25 or greater it has a lower correlation with the crime rate being above the median crime rate. medv -0.27: Even though this absolute value correlation is not too high, when medv is specifically lower than 17, it has a higher correlation with the crime rate being above the median crime rate.
Through these correlations, we already have some solid insights before even creating an models.
Correlation Among Predictors:
It is clear that the predictor variables rad and tax are very highly correlated (more than 0.9). We will attempt to account for this when we prepare the data.
There are some smaller, but still high correlations between other predictor variables as well. indus is highly correlated (more than 0.7) with nox, dis, and tax. nox is also highly correlated with age and dis. rm is highly correlated with medv. lstat is highly correlated with medv.
Chas has no correlation to the response variable.
# Boxplot of predictors
data_train %>%
dplyr::select(-target) %>%
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Predictor, y = Value)) +
geom_boxplot(fill = "darkgrey", outlier.color = "red", outlier.shape = 16, outlier.size = 1.5) +
coord_cartesian(ylim = c(0, 1000)) + # Adjust this based on your outliers
labs(title = "Boxplots of Predictor Variables", x = "Predictors", y = "Values") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Blox plot of predictors by crime level
data_train %>%
pivot_longer(cols = -target, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = factor(target), y = Value, fill = factor(target))) +
geom_boxplot(outlier.shape = 1, outlier.alpha = 0.5) +
facet_wrap(~ Predictor, scales = "free", ncol = 4) +
labs(title = "Boxplots of Predictors by Crime Level", x = "Crime Level", y = "Value") +
theme_minimal() +
theme(legend.position = "none")
#Correlation matrix with target
cor_matrix <- cor(data_train)
corr_target <- cor_matrix[,"target"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)
kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Target"), digits = 2)
| Correlation with Target | |
|---|---|
| target | 1.00 |
| nox | 0.73 |
| age | 0.63 |
| rad | 0.63 |
| tax | 0.61 |
| indus | 0.60 |
| lstat | 0.47 |
| ptratio | 0.25 |
| chas | 0.08 |
| rm | -0.15 |
| medv | -0.27 |
| zn | -0.43 |
| dis | -0.62 |
#Correlation heatmap
numeric_vars <- data_train %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot::corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, diag = FALSE)
# Correlation top 5 positive and negative
corr_target <- cor_matrix[, "target"]
corr_target <- corr_target[names(corr_target) != "target"] # Remove self-correlation
top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]
combined_df <- data.frame(
Feature = c(names(top_pos), names(top_neg)),
Correlation = c(top_pos, top_neg)) %>%
mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Correlated with High Crime",
x = "Feature",
y = "Correlation with Target",
fill = "Direction") +
theme_minimal()
# Correlation among predictors
melt_cor <- melt(cor_matrix)
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
head(sorted_cor)
## Var1 Var2 value
## 112 rad tax 0.9064632
## 82 nox dis -0.7688840
## 41 indus nox 0.7596301
## 84 age dis -0.7508976
## 154 lstat medv -0.7358008
## 69 nox age 0.7351278
# Correlation Funnel using plot_correlation_funnel
# This shows us exactly which prospects groups have a higher correlation with the target variable
# This binarizes the features so it includes categorical features
data_train_binarized <- data_train %>%
binarize(n_bins = 4, thresh_infreq = 0.01)
data_train_correlated_table <- data_train_binarized %>%
correlate(target = target__1)
data_train_correlated_table %>%
plot_correlation_funnel(interactive = FALSE)
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
## Please report the issue at
## <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
## Please report the issue at
## <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
There was no missing data.
# Binning categories
data_train2 <- data_train %>%
mutate(
age_bin = cut(age,
breaks = quantile(age, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Young", "Middle", "Old"),
include.lowest = TRUE),
lstat_bin = cut(lstat,
breaks = quantile(lstat, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
medv_bin = cut(medv,
breaks = quantile(medv, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
tax_bin = cut(tax,
breaks = quantile(tax, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
dis_group = case_when(
ntile(dis, 3) == 1 ~ "Close",
ntile(dis, 3) == 2 ~ "Medium",
ntile(dis, 3) == 3 ~ "Far"
),
ptratio_bin = case_when(
ntile(ptratio, 3) == 1 ~ "Low",
ntile(ptratio, 3) == 2 ~ "Medium",
ntile(ptratio, 3) == 3 ~ "High"
),
rm_bin = case_when(
ntile(rm, 3) == 1 ~ "Small",
ntile(rm, 3) == 2 ~ "Medium",
ntile(rm, 3) == 3 ~ "Large"
)
)
# Convert to factor
data_train3 <- data_train2 %>%
mutate(across(ends_with("_bin") | ends_with("_group"), ~factor(.)))
# Log and sqrt transformation
data_train4 <- data_train3 %>%
mutate(
sqrt_zn = sqrt(zn),
log_tax = log(tax + 1),
log_rad = log(rad + 1)
)
sum(is.na(data_train4))
## [1] 0
# Apply Box-Cox transformation on lstat, dis, and nox
# Define a function
apply_boxcox <- function(data, variable_name) {
formula <- as.formula(paste(variable_name, "~ 1"))
boxcox_result <- boxcox(formula, data = data, plotit = FALSE)
lambda <- boxcox_result$x[which.max(boxcox_result$y)]
print(paste("Optimal lambda for", variable_name, "Box-Cox transformation is:", lambda))
transformed_variable_name <- paste("box", variable_name, sep = "_")
data[[transformed_variable_name]] <- (data[[variable_name]]^lambda - 1) / lambda
return(data)
}
# Apply Box-Cox transformation to 'lstat' and 'dis'
data_train4 <- apply_boxcox(data_train4, "lstat")
## [1] "Optimal lambda for lstat Box-Cox transformation is: 0.2"
data_train4 <- apply_boxcox(data_train4, "dis")
## [1] "Optimal lambda for dis Box-Cox transformation is: -0.0999999999999999"
data_train4 <- apply_boxcox(data_train4, "nox")
## [1] "Optimal lambda for nox Box-Cox transformation is: -0.9"
data_train4|>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
theme_minimal()
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
# Split data for validation
set.seed(123)
train_idx <- createDataPartition(y = data_train$target, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train <- data_train[train_idx, ]
test <- data_train[-train_idx, ]
# Transformed predictors dataset
train_transformed <- data_train4[train_idx, ]
test_transformed <- data_train4[-train_idx, ]
Includes all original numeric predictors (no transformations or binning).
# Original Model
set.seed(123)
model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
data = train, family = binomial)
summary(model1)
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + lstat + medv, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.553836 7.376053 -5.498 3.84e-08 ***
## zn -0.066245 0.037617 -1.761 0.07823 .
## indus -0.061614 0.051685 -1.192 0.23322
## chas 0.146493 0.881575 0.166 0.86802
## nox 48.484952 8.483912 5.715 1.10e-08 ***
## rm -0.210856 0.814387 -0.259 0.79570
## age 0.030010 0.014946 2.008 0.04465 *
## dis 0.751882 0.240194 3.130 0.00175 **
## rad 0.583758 0.179120 3.259 0.00112 **
## tax -0.004762 0.003069 -1.552 0.12078
## ptratio 0.332629 0.133255 2.496 0.01255 *
## lstat 0.010455 0.062146 0.168 0.86639
## medv 0.159638 0.077352 2.064 0.03904 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 515.67 on 372 degrees of freedom
## Residual deviance: 164.56 on 360 degrees of freedom
## AIC: 190.56
##
## Number of Fisher Scoring iterations: 9
# Original Model - using the full dataset to train the model
set.seed(123)
model1_full <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
data = data_train, family = binomial)
summary(model1_full)
##
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + lstat + medv, family = binomial, data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
set.seed(123)
model2 <- stepAIC(model1, direction = "both", trace = 0)
summary(model2)
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.703649 6.720204 -5.759 8.45e-09 ***
## zn -0.068356 0.035230 -1.940 0.052348 .
## nox 44.091914 7.397764 5.960 2.52e-09 ***
## age 0.027659 0.011918 2.321 0.020305 *
## dis 0.720303 0.229818 3.134 0.001723 **
## rad 0.649291 0.165416 3.925 8.67e-05 ***
## tax -0.006033 0.002695 -2.239 0.025168 *
## ptratio 0.308474 0.120618 2.557 0.010544 *
## medv 0.139526 0.042310 3.298 0.000975 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 515.67 on 372 degrees of freedom
## Residual deviance: 166.18 on 364 degrees of freedom
## AIC: 184.18
##
## Number of Fisher Scoring iterations: 9
set.seed(123)
model2_full <- stepAIC(model1_full, direction = "both", trace = 0)
set.seed(123)
model3 <- glm(target ~ sqrt_zn + indus + chas + box_nox + rm + age + box_dis + log_rad + log_tax + ptratio + box_lstat + medv,
data = train_transformed, family = binomial)
summary(model3)
##
## Call:
## glm(formula = target ~ sqrt_zn + indus + chas + box_nox + rm +
## age + box_dis + log_rad + log_tax + ptratio + box_lstat +
## medv, family = binomial, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.09432 7.67852 -0.924 0.355529
## sqrt_zn -0.15833 0.16062 -0.986 0.324244
## indus -0.01077 0.05263 -0.205 0.837786
## chas 0.15399 0.86390 0.178 0.858525
## box_nox 15.06836 2.49372 6.043 1.52e-09 ***
## rm -0.38557 0.84657 -0.455 0.648783
## age 0.03578 0.01540 2.324 0.020134 *
## box_dis 4.34129 1.14921 3.778 0.000158 ***
## log_rad 3.44773 0.92599 3.723 0.000197 ***
## log_tax -0.49984 1.19020 -0.420 0.674514
## ptratio 0.38155 0.14009 2.724 0.006458 **
## box_lstat -0.11429 0.49634 -0.230 0.817890
## medv 0.20189 0.08383 2.408 0.016028 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 515.67 on 372 degrees of freedom
## Residual deviance: 163.68 on 360 degrees of freedom
## AIC: 189.68
##
## Number of Fisher Scoring iterations: 8
set.seed(123)
model3_full <- glm(target ~ sqrt_zn + indus + chas + box_nox + rm + age + box_dis + log_rad + log_tax + ptratio + box_lstat + medv,
data = data_train4, family = binomial)
Used binned variables
set.seed(123)
model4 <- glm(target ~ indus + chas + zn + nox + rad + age_bin + lstat_bin + medv_bin + tax_bin +
dis_group + rm_bin + ptratio_bin, data = train_transformed, family = binomial)
summary(model4)
##
## Call:
## glm(formula = target ~ indus + chas + zn + nox + rad + age_bin +
## lstat_bin + medv_bin + tax_bin + dis_group + rm_bin + ptratio_bin,
## family = binomial, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.70322 4.92077 -5.223 1.76e-07 ***
## indus -0.06526 0.06877 -0.949 0.342669
## chas 1.00399 1.13036 0.888 0.374433
## zn -0.02733 0.02798 -0.977 0.328632
## nox 41.98241 8.00307 5.246 1.56e-07 ***
## rad 0.32906 0.13002 2.531 0.011376 *
## age_binMiddle -0.01526 0.75582 -0.020 0.983894
## age_binOld 1.07032 0.98353 1.088 0.276489
## lstat_binMedium -0.72671 0.71667 -1.014 0.310579
## lstat_binHigh -1.56766 1.01293 -1.548 0.121709
## medv_binMedium -0.08556 0.63220 -0.135 0.892348
## medv_binHigh 1.77639 0.99381 1.787 0.073862 .
## tax_binMedium 2.19475 0.63681 3.447 0.000568 ***
## tax_binHigh 0.43702 1.10176 0.397 0.691619
## dis_groupFar 1.68912 1.09646 1.541 0.123434
## dis_groupMedium 1.02326 0.84276 1.214 0.224681
## rm_binMedium 0.10159 0.73776 0.138 0.890482
## rm_binSmall 0.69452 0.87702 0.792 0.428417
## ptratio_binLow -1.57733 0.73206 -2.155 0.031190 *
## ptratio_binMedium -0.97548 0.57613 -1.693 0.090427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 515.67 on 372 degrees of freedom
## Residual deviance: 148.81 on 353 degrees of freedom
## AIC: 188.81
##
## Number of Fisher Scoring iterations: 9
set.seed(123)
model4_full <- glm(target ~ indus + chas + zn + nox + rad + age_bin + lstat_bin + medv_bin + tax_bin +
dis_group + rm_bin + ptratio_bin, data = data_train4, family = binomial)
set.seed(123)
x_train <- model.matrix(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = train)[, -1]
y_train <- train$target
lasso_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
ridge_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)
set.seed(123)
x_train_full <- model.matrix(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = data_train)[, -1]
y_train_full <- data_train4$target
lasso_model_full <- cv.glmnet(x_train_full, y_train_full, family = "binomial", alpha = 1)
ridge_model_full <- cv.glmnet(x_train_full, y_train_full, family = "binomial", alpha = 0)
# Set up cross-validation (10-fold)
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(factor(target) ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
data = train,
method = "glm",
family = "binomial",
trControl = train_control)
# View cross-validation results
print(cv_model)
## Generalized Linear Model
##
## 373 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 335, 335, 335, 336, 336, 337, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8902402 0.7787411
set.seed(123)
cv_model_full <- train(factor(target) ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
data = data_train,
method = "glm",
family = "binomial",
trControl = train_control)
Model 1 has predictors all under 5 except for medv. Model 2 and 4’s predictors are all under 5 so there are no signs of serious multicollinearity (looking at GVIF^(1/(2*Df)) column for Model 4). Model 3 has predictors all under 5 except for rm and medv.
Therefore, medv should be removed from Model 1, and rm and medv should be removed from Model 3.
cat("=== VIF for Model 1 (All Predictors) ===\n")
## === VIF for Model 1 (All Predictors) ===
car::vif(model1)
## zn indus chas nox rm age dis rad
## 1.962161 2.847886 1.279575 4.266630 4.984115 2.674398 4.013696 1.988631
## tax ptratio lstat medv
## 2.150796 2.100827 2.457361 6.781809
cat("=== VIF for Model 2 (Stepwise Selection) ===\n")
## === VIF for Model 2 (Stepwise Selection) ===
car::vif(model2)
## zn nox age dis rad tax ptratio medv
## 1.909515 3.433703 1.764280 3.839024 1.718483 1.739576 1.771036 2.068256
cat("=== VIF for Model 3 (Transformed) ===\n")
## === VIF for Model 3 (Transformed) ===
car::vif(model3)
## sqrt_zn indus chas box_nox rm age box_dis log_rad
## 1.758147 2.974680 1.248300 4.089883 5.086441 2.845874 4.328683 1.899500
## log_tax ptratio box_lstat medv
## 2.856757 2.369010 3.123333 7.448702
cat("=== VIF for Model 4 (Binned Categorical) ===\n")
## === VIF for Model 4 (Binned Categorical) ===
car::vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## indus 4.451754 1 2.109918
## chas 1.438910 1 1.199546
## zn 1.996860 1 1.413103
## nox 4.002862 1 2.000715
## rad 1.430448 1 1.196013
## age_bin 4.482481 2 1.455056
## lstat_bin 4.485162 2 1.455273
## medv_bin 4.674577 2 1.470400
## tax_bin 4.563485 2 1.461585
## dis_group 5.722877 2 1.546691
## rm_bin 3.169475 2 1.334280
## ptratio_bin 2.820375 2 1.295916
Removing variables which show high multicollinearity:
# Remove medv from Model 1
set.seed(123)
model1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat,
data = train, family = binomial)
cat("=== UPDATED VIF for Model 1 (VERSION 2) ===\n")
## === UPDATED VIF for Model 1 (VERSION 2) ===
car::vif(model1)
## zn indus chas nox rm age dis rad
## 1.843492 2.711253 1.257539 3.870959 2.183782 2.095609 3.027371 1.862203
## tax ptratio lstat
## 1.944040 1.450484 2.448452
# Remove medv and rm from Model 3
set.seed(123)
model3 <- glm(target ~ sqrt_zn + indus + chas + box_nox + age + box_dis + log_rad + log_tax + ptratio + box_lstat,
data = train_transformed, family = binomial)
cat("=== UPDATED VIF for Model 3 (VERSION 2) ===\n")
## === UPDATED VIF for Model 3 (VERSION 2) ===
car::vif(model3)
## sqrt_zn indus chas box_nox age box_dis log_rad log_tax
## 1.593900 2.891091 1.210848 3.527924 1.832794 3.128395 1.765819 2.544521
## ptratio box_lstat
## 1.712071 1.657669
All predictors look good now.
# Interpretation of coefficients (Model 1)
cat("\n=== Coefficient Interpretation (Model 1) ===\n")
##
## === Coefficient Interpretation (Model 1) ===
coef_exp <- exp(coef(model1))
print(round(coef_exp, 4))
## (Intercept) zn indus chas nox rm
## 0.000000e+00 9.398000e-01 9.447000e-01 1.106000e+00 5.120047e+19 3.124400e+00
## age dis rad tax ptratio lstat
## 1.016300e+00 1.742600e+00 1.796000e+00 9.942000e-01 1.204100e+00 1.002400e+00
exp(cbind(OR = coef(model1), confint(model1)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.339951e-17 3.995636e-24 6.325113e-12
## zn 9.398163e-01 8.698073e-01 9.983483e-01
## indus 9.446569e-01 8.504886e-01 1.038875e+00
## chas 1.105964e+00 1.987315e-01 6.084096e+00
## nox 5.120047e+19 2.335527e+13 1.458298e+27
## rm 3.124363e+00 1.164671e+00 8.971466e+00
## age 1.016316e+00 9.911503e-01 1.042961e+00
## dis 1.742596e+00 1.166807e+00 2.686720e+00
## rad 1.795954e+00 1.341720e+00 2.579966e+00
## tax 9.942307e-01 9.880696e-01 9.998517e-01
## ptratio 1.204097e+00 9.789906e-01 1.492546e+00
## lstat 1.002386e+00 8.882310e-01 1.128222e+00
# Interpretation of coefficients (Model 2)
cat("\n=== Coefficient Interpretation (Model 2) ===\n")
##
## === Coefficient Interpretation (Model 2) ===
coef_exp <- exp(coef(model2))
print(round(coef_exp, 4))
## (Intercept) zn nox age dis rad
## 0.000000e+00 9.339000e-01 1.408883e+19 1.028000e+00 2.055100e+00 1.914200e+00
## tax ptratio medv
## 9.940000e-01 1.361300e+00 1.149700e+00
exp(cbind(OR = coef(model2), confint(model2)))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## OR 2.5 % 97.5 %
## (Intercept) 1.553170e-17 8.296760e-24 2.751666e-12
## zn 9.339277e-01 8.650580e-01 9.920167e-01
## nox 1.408883e+19 2.063381e+13 9.656341e+25
## age 1.028045e+00 1.004871e+00 1.053272e+00
## dis 2.055055e+00 1.333280e+00 3.306993e+00
## rad 1.914184e+00 1.417883e+00 2.711077e+00
## tax 9.939851e-01 9.882596e-01 9.989619e-01
## ptratio 1.361346e+00 1.081946e+00 1.741766e+00
## medv 1.149729e+00 1.063397e+00 1.255596e+00
# Interpretation of coefficients (Model 3)
cat("\n=== Coefficient Interpretation (Model 3) ===\n")
##
## === Coefficient Interpretation (Model 3) ===
coef_exp <- exp(coef(model3))
print(round(coef_exp, 4))
## (Intercept) sqrt_zn indus chas box_nox age
## 353.0446 0.9271 0.9768 1.2964 591856.6819 1.0308
## box_dis log_rad log_tax ptratio box_lstat
## 15.4588 37.5716 0.2820 1.2237 0.3877
exp(cbind(OR = coef(model3), confint(model3)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 3.530446e+02 1.990470e-03 3.737712e+07
## sqrt_zn 9.270792e-01 6.922799e-01 1.211593e+00
## indus 9.767896e-01 8.809233e-01 1.073748e+00
## chas 1.296402e+00 2.357781e-01 7.349952e+00
## box_nox 5.918567e+05 9.622542e+03 7.639261e+07
## age 1.030814e+00 1.006989e+00 1.056733e+00
## box_dis 1.545882e+01 2.488980e+00 1.094425e+02
## log_rad 3.757159e+01 8.912691e+00 2.705988e+02
## log_tax 2.819801e-01 3.241522e-02 2.778774e+00
## ptratio 1.223655e+00 9.832892e-01 1.531685e+00
## box_lstat 3.876717e-01 1.908239e-01 7.287988e-01
# Interpretation of coefficients (Model 4)
cat("\n=== Coefficient Interpretation (Model 4) ===\n")
##
## === Coefficient Interpretation (Model 4) ===
coef_exp <- exp(coef(model4))
print(round(coef_exp, 4))
## (Intercept) indus chas zn
## 0.00000e+00 9.36800e-01 2.72910e+00 9.73000e-01
## nox rad age_binMiddle age_binOld
## 1.70895e+18 1.38970e+00 9.84900e-01 2.91630e+00
## lstat_binMedium lstat_binHigh medv_binMedium medv_binHigh
## 4.83500e-01 2.08500e-01 9.18000e-01 5.90850e+00
## tax_binMedium tax_binHigh dis_groupFar dis_groupMedium
## 8.97780e+00 1.54810e+00 5.41470e+00 2.78220e+00
## rm_binMedium rm_binSmall ptratio_binLow ptratio_binMedium
## 1.10690e+00 2.00270e+00 2.06500e-01 3.77000e-01
exp(cbind(OR = coef(model4), confint(model4)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 6.874350e-12 1.713695e-16 5.033104e-08
## indus 9.368254e-01 8.149676e-01 1.070675e+00
## chas 2.729139e+00 2.924899e-01 2.486098e+01
## zn 9.730379e-01 9.152451e-01 1.023090e+00
## nox 1.708950e+18 8.928399e+11 5.036969e+25
## rad 1.389664e+00 1.168387e+00 1.990123e+00
## age_binMiddle 9.848581e-01 2.112166e-01 4.244191e+00
## age_binOld 2.916306e+00 4.161854e-01 2.050067e+01
## lstat_binMedium 4.834962e-01 1.131297e-01 1.941861e+00
## lstat_binHigh 2.085333e-01 2.651914e-02 1.449017e+00
## medv_binMedium 9.179994e-01 2.629866e-01 3.195360e+00
## medv_binHigh 5.908514e+00 8.809663e-01 4.512682e+01
## tax_binMedium 8.977791e+00 2.703570e+00 3.411511e+01
## tax_binHigh 1.548093e+00 1.680008e-01 1.304128e+01
## dis_groupFar 5.414723e+00 6.879042e-01 5.342988e+01
## dis_groupMedium 2.782248e+00 5.667518e-01 1.623084e+01
## rm_binMedium 1.106924e+00 2.538354e-01 4.703324e+00
## rm_binSmall 2.002741e+00 3.610490e-01 1.150066e+01
## ptratio_binLow 2.065268e-01 4.541192e-02 8.273234e-01
## ptratio_binMedium 3.770113e-01 1.189391e-01 1.155709e+00
Recommended: Model 4 Key Strengths:
Highest accuracy (97.85%) Highest performance metrics overall (Precision, Sensitivity, Specificity, F1 Score) Second Lowest AIC (188.81) - Although not the highest, compared to other models, good balance between model fit and complexity High AUC (0.9872) - Excellent discriminatory power, only slightly lower than Model 1, 2 and 3 No unusual data points Statistically significant predictors - All variables retained are meaningful Categorical predictors make the model simpler and more interpretable
Trade-offs:
Binned residuals test performed the poorest. Only about 32% of the residuals are inside the error bounds, yet many of the red points are close to the ends of the distribution. There is no real pattern (seems random), which is good.
Reasoning on not choosing other models:
Model 1 (All Predictors):
Second highest accuracy (94.62%) BUT: Lowest AUC (0.9848) and second highest AIC (193.197) - includes potentially unnecessary variables Linearity log-odds test did not perform well, with many predictors showing non-linearity (to be expected). Only about 47% of the residuals are inside the error bounds. Unusual data point do exist. May have multicollinearity issues or non-significant predictors Less interpretable with 11 variables (ended up removing one after VIF check)
Model 2 (Stepwise Selection on Model 1):
Highest AUC (0.9877) Lowest AIC (184.183) BUT: Lowest accuracy (92.47%) Linearity log-odds test did not perform well, with many predictors showing non-linearity (to be expected). Only about 47% of the residuals are inside the error bounds. Unusual data point do exist. May have multicollinearity issues or non-significant predictors. This would’ve been a solid option for the final model if the accuracy was better.
Model 3 (Transformed Variables):
Highest AIC (196.256) despite transformations. Lower performance metrics across the board. Transformations didn’t improve the model. Linearity log-odds test performed okay, with a few predictors showing non-linearity. Only about 37% of the residuals are inside the error bounds. Unusual data points do exist.
Final Recommendation: Select Model 4 for its optimal balance of:
Statistical rigor (2nd lowest AIC) Predictive performance (High AUC and highest accuracy) Model simplicity and interpretability Practical usability
Some takeaways from Model 4’s coefficients are:
Some of the coefficients match our initial insights, yet some do not. This model doesn’t fully capture all of the relationships between the predictors, so there could be some improvement.
# Model 1 Evaluation
pred1_prob <- predict(model1, test, type = "response")
pred1_class <- ifelse(pred1_prob > 0.5, 1, 0)
cm1 <- table(Predicted = pred1_class, Actual = test$target)
acc1 <- sum(diag(cm1)) / sum(cm1)
prec1 <- cm1[2,2] / sum(cm1[2,])
sens1 <- cm1[2,2] / sum(cm1[,2])
spec1 <- cm1[1,1] / sum(cm1[,1])
f1_1 <- 2 * (prec1 * sens1) / (prec1 + sens1)
roc1 <- roc(test$target, pred1_prob, quiet = TRUE)
auc1 <- auc(roc1)
output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cm1)), collapse = "\n"), "\n",
"Accuracy:", round(acc1, 4), "| Precision:", round(prec1, 4),
"| Sensitivity:", round(sens1, 4), "| Specificity:", round(spec1, 4), "\n",
"F1 Score:", round(f1_1, 4), "| AUC:", round(auc1, 4),
"| AIC:", round(AIC(model1), 4), "\n\n", sep = " ")
cat(output1)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 38 4
## 1 1 50
## Accuracy: 0.9462 | Precision: 0.9804 | Sensitivity: 0.9259 | Specificity: 0.9744
## F1 Score: 0.9524 | AUC: 0.9848 | AIC: 193.1973
# Diagnostics
# Logistic Regression Assumption Check
# This chunk of code is from Statistical tools for high-throughput data analysis (STHDA)
predictors <- colnames(test)
test_copy <- test
test_copy <- test_copy %>%
mutate(logit = log(pred1_prob/(1-pred1_prob))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(test_copy, aes(logit, predictor.value)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula = 'y ~ x'
# Deviance Residuals Plot
deviance_residuals <- sign(test$target - pred1_prob) *
sqrt(-2 * (test$target * log(pred1_prob) +
(1 - test$target) * log(1 - pred1_prob)))
plot(pred1_prob, deviance_residuals,
xlab = "Fitted Probabilities",
ylab = "Deviance Residuals",
main = "Deviance Residuals vs. Fitted Probabilities")
# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)
## [1] 0.1126005
plot(model1, which = 4, id.n = 5)
qqnorm(residuals(model1))
halfnorm(hatvalues(model1))
# x <- predict(model1)
# y <- resid(model1)
# binnedplot(x,y)
# Binned residuals
binned_result <- binned_residuals(model1)
binned_result
## Warning: Probably bad model fit. Only about 47% of the residuals are inside the error bounds.
plot(binned_residuals(model1), show_dots = TRUE)
# Model 2 Evaluation
pred2_prob <- predict(model2, test, type = "response")
pred2_class <- ifelse(pred2_prob > 0.5, 1, 0)
cm2 <- table(Predicted = pred2_class, Actual = test$target)
acc2 <- sum(diag(cm2)) / sum(cm2)
prec2 <- cm2[2,2] / sum(cm2[2,])
sens2 <- cm2[2,2] / sum(cm2[,2])
spec2 <- cm2[1,1] / sum(cm2[,1])
f1_2 <- 2 * (prec2 * sens2) / (prec2 + sens2)
roc2 <- roc(test$target, pred2_prob, quiet = TRUE)
auc2 <- auc(roc2)
output2 <- paste("=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cm2)), collapse = "\n"), "\n",
"Accuracy:", round(acc2, 4), "| Precision:", round(prec2, 4),
"| Sensitivity:", round(sens2, 4), "| Specificity:", round(spec2, 4), "\n",
"F1 Score:", round(f1_2, 4), "| AUC:", round(auc2, 4),
"| AIC:", round(AIC(model2), 4), "\n\n", sep = "")
cat(output2)
## === Model 2 Evaluation ===
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 39 7
## 1 0 47
## Accuracy:0.9247| Precision:1| Sensitivity:0.8704| Specificity:1
## F1 Score:0.9307| AUC:0.9877| AIC:184.183
# Logistic Regression Assumption Check
test_copy2 <- test
test_copy2 <- test_copy2 %>%
mutate(logit = log(pred2_prob/(1-pred2_prob))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(test_copy2, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula = 'y ~ x'
# Deviance Residuals Plot
deviance_residuals <- sign(test$target - pred2_prob) *
sqrt(-2 * (test$target * log(pred2_prob) +
(1 - test$target) * log(1 - pred2_prob)))
plot(pred2_prob, deviance_residuals,
xlab = "Fitted Probabilities",
ylab = "Deviance Residuals",
main = "Deviance Residuals vs. Fitted Probabilities")
# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)
## [1] 0.1126005
plot(model2, which = 4, id.n = 5)
qqnorm(residuals(model2))
halfnorm(hatvalues(model2))
# Binned residuals
binned_result <- binned_residuals(model2)
binned_result
## Warning: Probably bad model fit. Only about 47% of the residuals are inside the error bounds.
plot(binned_residuals(model2), show_dots = TRUE)
# Model 3 Evaluation
pred3_prob <- predict(model3, test_transformed, type = "response")
pred3_class <- ifelse(pred3_prob > 0.5, 1, 0)
cm3 <- table(Predicted = pred3_class, Actual = test_transformed$target)
acc3 <- sum(diag(cm3)) / sum(cm3)
prec3 <- cm3[2,2] / sum(cm3[2,])
sens3 <- cm3[2,2] / sum(cm3[,2])
spec3 <- cm3[1,1] / sum(cm3[,1])
f1_3 <- 2 * (prec3 * sens3) / (prec3 + sens3)
roc3 <- roc(test_transformed$target, pred3_prob, quiet = TRUE)
auc3 <- auc(roc3)
output3 <- paste("=== Model 3 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cm3)), collapse = "\n"), "\n",
"Accuracy:", round(acc3, 4), "| Precision:", round(prec3, 4),
"| Sensitivity:", round(sens3, 4), "| Specificity:", round(spec3, 4), "\n",
"F1 Score:", round(f1_3, 4), "| AUC:", round(auc3, 4),
"| AIC:", round(AIC(model3), 4), "\n\n", sep = "")
cat(output3)
## === Model 3 Evaluation ===
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 38 7
## 1 1 47
## Accuracy:0.914| Precision:0.9792| Sensitivity:0.8704| Specificity:0.9744
## F1 Score:0.9216| AUC:0.9877| AIC:196.2563
# Logistic Regression Assumption Check
predictors <- colnames(test_transformed)
test_transformed_copy <- test_transformed |>
dplyr::select_if(is.numeric)
test_transformed_copy <- test_transformed_copy %>%
mutate(logit = log(pred3_prob/(1-pred3_prob))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(test_transformed_copy, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess", span = 0.9) +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula = 'y ~ x'
# Deviance Residuals Plot
deviance_residuals <- sign(test_transformed$target - pred3_prob) *
sqrt(-2 * (test_transformed$target * log(pred3_prob) +
(1 - test_transformed$target) * log(1 - pred3_prob)))
plot(pred3_prob, deviance_residuals,
xlab = "Fitted Probabilities",
ylab = "Deviance Residuals",
main = "Deviance Residuals vs. Fitted Probabilities")
# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)
## [1] 0.1126005
plot(model3, which = 4, id.n = 5)
qqnorm(residuals(model3))
halfnorm(hatvalues(model3))
# Binned residuals
binned_result <- binned_residuals(model3)
binned_result
## Warning: Probably bad model fit. Only about 37% of the residuals are inside the error bounds.
plot(binned_residuals(model3), show_dots = TRUE)
# Model 4 Evaluation
pred4_prob <- predict(model4, test_transformed, type = "response")
pred4_class <- ifelse(pred4_prob > 0.5, 1, 0)
cm4 <- table(Predicted = pred4_class, Actual = test_transformed$target)
acc4 <- sum(diag(cm4)) / sum(cm4)
prec4 <- cm4[2,2] / sum(cm4[2,])
sens4 <- cm4[2,2] / sum(cm4[,2])
spec4 <- cm4[1,1] / sum(cm4[,1])
f1_4 <- 2 * (prec4 * sens4) / (prec4 + sens4)
roc4 <- roc(test_transformed$target, pred4_prob, quiet = TRUE)
auc4 <- auc(roc4)
output4 <- paste("=== Model 4 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cm4)), collapse = "\n"), "\n",
"Accuracy:", round(acc4, 4), "| Precision:", round(prec4, 4),
"| Sensitivity:", round(sens4, 4), "| Specificity:", round(spec4, 4), "\n",
"F1 Score:", round(f1_4, 4), "| AUC:", round(auc4, 4),
"| AIC:", round(AIC(model4), 4), "\n\n", sep = "")
cat(output4)
## === Model 4 Evaluation ===
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 39 2
## 1 0 52
## Accuracy:0.9785| Precision:1| Sensitivity:0.963| Specificity:1
## F1 Score:0.9811| AUC:0.9872| AIC:188.8146
# Logistic Regression Assumption Check
predictors <- colnames(test_transformed)
test_transformed_copy <- test_transformed |>
dplyr::select_if(is.numeric)
test_transformed_copy <- test_transformed_copy %>%
mutate(logit = log(pred4_prob/(1-pred4_prob))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(test_transformed_copy, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess", span = 0.9) +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula = 'y ~ x'
# Deviance Residuals Plot
deviance_residuals <- sign(test_transformed$target - pred4_prob) *
sqrt(-2 * (test_transformed$target * log(pred4_prob) +
(1 - test_transformed$target) * log(1 - pred4_prob)))
plot(pred3_prob, deviance_residuals,
xlab = "Fitted Probabilities",
ylab = "Deviance Residuals",
main = "Deviance Residuals vs. Fitted Probabilities")
# Unusual points check
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model4))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
print(high_leverage)
## [1] 0.1126005
plot(model4, which = 4, id.n = 5)
qqnorm(residuals(model4))
halfnorm(hatvalues(model4))
# Binned residuals
binned_result <- binned_residuals(model4)
binned_result
## Warning: Probably bad model fit. Only about 32% of the residuals are inside the error bounds.
plot(binned_residuals(model4), show_dots = TRUE)
# Create comparison table
comparison <- data.frame(
Model = c("Model1", "Model2", "Model3", "Model4"),
AIC = c(AIC(model1), AIC(model2), AIC(model3), AIC(model4)),
Accuracy = c(acc1, acc2, acc3, acc4),
Error_Rate = c(1-acc1, 1-acc2, 1-acc3, 1-acc4),
Precision = c(prec1, prec2, prec3, prec4),
Sensitivity = c(sens1, sens2, sens3, sens4),
Specificity = c(spec1, spec2, spec3, spec4),
F1_Score = c(f1_1, f1_2, f1_3, f1_4),
AUC = c(auc1, auc2, auc3, auc4)
)
cat("\n=== Model Comparison ===\n")
##
## === Model Comparison ===
print(comparison)
## Model AIC Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Model1 193.1973 0.9462366 0.05376344 0.9803922 0.9259259 0.974359
## 2 Model2 184.1830 0.9247312 0.07526882 1.0000000 0.8703704 1.000000
## 3 Model3 196.2563 0.9139785 0.08602151 0.9791667 0.8703704 0.974359
## 4 Model4 188.8146 0.9784946 0.02150538 1.0000000 0.9629630 1.000000
## F1_Score AUC
## 1 0.9523810 0.9848053
## 2 0.9306931 0.9876543
## 3 0.9215686 0.9876543
## 4 0.9811321 0.9871795
# Plot ROC curves
plot(roc1, col = "blue", main = "ROC Curves Comparison")
plot(roc2, col = "red", add = TRUE)
plot(roc3, col = "green", add = TRUE)
plot(roc4, col = "purple", add = TRUE)
legend("bottomright",
legend = paste0(c("Model1", "Model2", "Model3", "Model4"), " (AUC=",
round(c(auc1, auc2, auc3, auc4), 3), ")"),
col = c("blue", "red", "green", "purple"),
lwd = 2)
# Select best model (Model 4)
output_text <- paste(
"\n=== SELECTED MODEL: Model 4 ===\n",
"Reasons for selection:\n",
"1. Low AIC indicating good model fit\n",
"2. Highest accuracy and very high AUC, displaying good discriminatory power, only slightly lower than the other models\n",
"3. Creating categorical variables made the model simpler and easier to interpret\n",
"4. All predictors are statistically significant\n\n"
)
cat(output_text)
##
## === SELECTED MODEL: Model 4 ===
## Reasons for selection:
## 1. Low AIC indicating good model fit
## 2. Highest accuracy and very high AUC, displaying good discriminatory power, only slightly lower than the other models
## 3. Creating categorical variables made the model simpler and easier to interpret
## 4. All predictors are statistically significant
final_model <- model4
summary(final_model)
##
## Call:
## glm(formula = target ~ indus + chas + zn + nox + rad + age_bin +
## lstat_bin + medv_bin + tax_bin + dis_group + rm_bin + ptratio_bin,
## family = binomial, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.70322 4.92077 -5.223 1.76e-07 ***
## indus -0.06526 0.06877 -0.949 0.342669
## chas 1.00399 1.13036 0.888 0.374433
## zn -0.02733 0.02798 -0.977 0.328632
## nox 41.98241 8.00307 5.246 1.56e-07 ***
## rad 0.32906 0.13002 2.531 0.011376 *
## age_binMiddle -0.01526 0.75582 -0.020 0.983894
## age_binOld 1.07032 0.98353 1.088 0.276489
## lstat_binMedium -0.72671 0.71667 -1.014 0.310579
## lstat_binHigh -1.56766 1.01293 -1.548 0.121709
## medv_binMedium -0.08556 0.63220 -0.135 0.892348
## medv_binHigh 1.77639 0.99381 1.787 0.073862 .
## tax_binMedium 2.19475 0.63681 3.447 0.000568 ***
## tax_binHigh 0.43702 1.10176 0.397 0.691619
## dis_groupFar 1.68912 1.09646 1.541 0.123434
## dis_groupMedium 1.02326 0.84276 1.214 0.224681
## rm_binMedium 0.10159 0.73776 0.138 0.890482
## rm_binSmall 0.69452 0.87702 0.792 0.428417
## ptratio_binLow -1.57733 0.73206 -2.155 0.031190 *
## ptratio_binMedium -0.97548 0.57613 -1.693 0.090427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 515.67 on 372 degrees of freedom
## Residual deviance: 148.81 on 353 degrees of freedom
## AIC: 188.81
##
## Number of Fisher Scoring iterations: 9
After selecting the best-performing model, we applied it to the evaluation dataset. The model was then used to predict whether the crime rate is above the median crime rate. The distribution of the target variable can be seen below:
data_eval <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/crime-evaluation-data_modified.csv")
# No missing data
data_eval %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 0 × 2
## # ℹ 2 variables: variable <chr>, missing_count <int>
# Binning categories
data_eval2 <- data_eval %>%
mutate(
age_bin = cut(age,
breaks = quantile(age, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Young", "Middle", "Old"),
include.lowest = TRUE),
lstat_bin = cut(lstat,
breaks = quantile(lstat, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
medv_bin = cut(medv,
breaks = quantile(medv, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
tax_bin = cut(tax,
breaks = quantile(tax, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE),
dis_group = case_when(
ntile(dis, 3) == 1 ~ "Close",
ntile(dis, 3) == 2 ~ "Medium",
ntile(dis, 3) == 3 ~ "Far"
),
ptratio_bin = case_when(
ntile(ptratio, 3) == 1 ~ "Low",
ntile(ptratio, 3) == 2 ~ "Medium",
ntile(ptratio, 3) == 3 ~ "High"
),
rm_bin = case_when(
ntile(rm, 3) == 1 ~ "Small",
ntile(rm, 3) == 2 ~ "Medium",
ntile(rm, 3) == 3 ~ "Large"
)
)
# Convert to factor
data_eval3 <- data_eval2 %>%
mutate(across(ends_with("_bin") | ends_with("_group"), ~factor(.)))
# Make predictions
eval_predictions <- predict(final_model, data_eval3, type = "response")
eval_classifications <- ifelse(eval_predictions > 0.5, 1, 0)
# # Create output dataframe
eval_output <- cbind(data_eval,
Probability = round(eval_predictions, 4),
Classification = eval_classifications
)
# Bar plot of the target variable (0 = low crime, 1 = high crime)
ggplot(eval_output, aes(x = factor(Classification))) +
geom_bar(fill = "steelblue", color = "black") +
labs(
title = "Distribution of Target Variable",
x = "Crime Level (0 = Low, 1 = High)",
y = "Count"
) +
theme_minimal()
# # Save predictions
write.csv(eval_output, "crime_predictions.csv", row.names = FALSE)