Q16: Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
Solution:
We begin by loading the Boston data set from the MASS package, which records housing and neighborhood characteristics for 506 census tracts, such as house age, proximity to highways, and property tax rates.
We create a binary response variable by comparing each tract’s crime rate to the median rate across all tracts, labeling each as a “High” or “Low” crime area accordingly.
To facilitate modeling, we convert these crime labels into a factor variable with levels “Low” and “High”, the structured format that R’s classification functions expect.
This preparatory step sets the stage for further exploration and statistical modeling aimed at understanding the factors associated with crime rates in different areas of Boston.
# Load the Boston dataset and libraries
library(MASS)
library(e1071)
library(class)
data(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
# Create a binary response variable indicating whether the crime rate is above or below the median
Boston$high_crime <- ifelse(Boston$crim > median(Boston$crim), "High", "Low")
# Convert to factor
Boston$high_crime <- factor(Boston$high_crime, levels = c("Low", "High"))
1. Logistic Regression (logit_model)
A logistic regression model is fit to the data set using all available predictors, including crim itself, to model the binary high-crime outcome.
Logistic regression is specifically designed for binary responses, making it a natural first choice here. After fitting the model, a summary is generated, presenting the coefficients, standard errors, z-values, and p-values for each predictor.
Because the response was constructed directly from crim, however, the two classes are perfectly separable: the algorithm fails to converge (see the warnings below), the residual deviance collapses to essentially zero, and the enormous standard errors render every p-value meaningless. A corrected refit is sketched after the output.
# Fit logistic regression model using all predictors
logit_model <- glm(high_crime ~ ., data = Boston, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit_model)
##
## Call:
## glm(formula = high_crime ~ ., family = binomial, data = Boston)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.437e+01 1.202e+05 0.000 1.000
## crim 1.083e+03 1.773e+04 0.061 0.951
## zn 2.194e+00 5.856e+01 0.037 0.970
## indus -2.510e+00 9.002e+02 -0.003 0.998
## chas 4.489e+00 1.014e+04 0.000 1.000
## nox -2.585e+02 1.458e+05 -0.002 0.999
## rm -3.953e+01 1.653e+03 -0.024 0.981
## age 3.437e-01 5.798e+01 0.006 0.995
## dis -1.742e+01 2.146e+03 -0.008 0.994
## rad -5.933e+00 2.642e+03 -0.002 0.998
## tax 1.639e-01 1.078e+02 0.002 0.999
## ptratio 5.525e+00 3.640e+03 0.002 0.999
## black 3.266e-02 1.208e+01 0.003 0.998
## lstat -1.687e+00 3.560e+02 -0.005 0.996
## medv 2.358e+00 5.382e+02 0.004 0.997
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.0146e+02 on 505 degrees of freedom
## Residual deviance: 2.8371e-05 on 491 degrees of freedom
## AIC: 30
##
## Number of Fisher Scoring iterations: 25
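The warnings and the near-zero residual deviance (2.8371e-05) confirm the perfect separation: crim determines high_crime exactly. A minimal sketch of a usable refit simply drops crim from the formula:
# Refit excluding crim, the variable the response was built from
logit_model2 <- glm(high_crime ~ . - crim, data = Boston, family = binomial)
summary(logit_model2)
# The fit should now converge, although quasi-separation warnings can
# still arise if a remaining predictor splits the classes nearly cleanly
Predictors such as nox and rad are natural candidates to examine in the refitted summary, since their group means differ sharply between the two classes (see the LDA output below).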
2. LDA (lda_model)
Linear Discriminant Analysis (LDA) is applied to the data set using all available predictors. LDA serves the dual purpose of reducing dimensionality and classifying observations. It assumes that the predictors follow a multivariate normal distribution within each class, with a common covariance matrix across classes.
Printing the fitted model shows the prior probabilities, the group means of each predictor, and the coefficients of the linear discriminant, offering insight into how the two classes differ and which variables drive the separation.
# Fit LDA model using all predictors
lda_model <- lda(high_crime ~ ., data = Boston)
lda_model
## Call:
## lda(high_crime ~ ., data = Boston)
##
## Prior probabilities of groups:
## Low High
## 0.5 0.5
##
## Group means:
## crim zn indus chas nox rm age
## Low 0.0955715 21.525692 7.002292 0.05138340 0.4709711 6.394395 51.31028
## High 7.1314756 1.201581 15.271265 0.08695652 0.6384190 6.174874 85.83953
## dis rad tax ptratio black lstat medv
## Low 5.091596 4.158103 305.7431 17.90711 388.7061 9.419486 24.94941
## High 2.498489 14.940711 510.7312 19.00395 324.6420 15.886640 20.11621
##
## Coefficients of linear discriminants:
## LD1
## crim 0.0046376592
## zn -0.0056431194
## indus 0.0126159626
## chas -0.0592836851
## nox 8.1826206579
## rm 0.0874007870
## age 0.0112829040
## dis 0.0453643651
## rad 0.0699133176
## tax -0.0008444666
## ptratio 0.0513806507
## black -0.0009892799
## lstat 0.0143945059
## medv 0.0386990631
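The group means already tell a story: high-crime tracts have markedly higher nox, age, rad, tax, and lstat, and lower dis and medv. To quantify the fit, one can compute in-sample predictions; a quick sketch (resubstitution accuracy is optimistic, and note that crim is still among the predictors here):
# Confusion matrix and resubstitution accuracy for the LDA fit
lda_pred <- predict(lda_model, Boston)$class
table(Predicted = lda_pred, Actual = Boston$high_crime)
mean(lda_pred == Boston$high_crime)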
3. Naive Bayes (naive_bayes_model)
Naive Bayes classification is performed next. Naive Bayes is a probabilistic classifier based on Bayes’ theorem that assumes the predictors are conditionally independent given the class.
Note that the call below builds a fresh 0/1 label but keeps every column of Boston, including crim and the high_crime factor itself, as predictors. The leakage is visible at the end of the output, where the conditional probability table for high_crime is all 0s and 1s; a corrected sketch follows the output.
crime_rate_above_median <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
naive_bayes_model <- naiveBayes(as.factor(crime_rate_above_median) ~ ., data = Boston)
naive_bayes_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## crim
## Y [,1] [,2]
## 0 0.0955715 0.06281773
## 1 7.1314756 11.10912294
##
## zn
## Y [,1] [,2]
## 0 21.525692 29.319808
## 1 1.201581 4.798611
##
## indus
## Y [,1] [,2]
## 0 7.002292 5.514454
## 1 15.271265 5.439010
##
## chas
## Y [,1] [,2]
## 0 0.05138340 0.2212161
## 1 0.08695652 0.2823299
##
## nox
## Y [,1] [,2]
## 0 0.4709711 0.05559789
## 1 0.6384190 0.09870365
##
## rm
## Y [,1] [,2]
## 0 6.394395 0.5556856
## 1 6.174874 0.8101381
##
## age
## Y [,1] [,2]
## 0 51.31028 25.88190
## 1 85.83953 17.87423
##
## dis
## Y [,1] [,2]
## 0 5.091596 2.081304
## 1 2.498489 1.085521
##
## rad
## Y [,1] [,2]
## 0 4.158103 1.659121
## 1 14.940711 9.529843
##
## tax
## Y [,1] [,2]
## 0 305.7431 87.4837
## 1 510.7312 167.8553
##
## ptratio
## Y [,1] [,2]
## 0 17.90711 1.811216
## 1 19.00395 2.346947
##
## black
## Y [,1] [,2]
## 0 388.7061 22.83774
## 1 324.6420 118.83084
##
## lstat
## Y [,1] [,2]
## 0 9.419486 4.923497
## 1 15.886640 7.546922
##
## medv
## Y [,1] [,2]
## 0 24.94941 7.232047
## 1 20.11621 10.270362
##
## high_crime
## Y Low High
## 0 1 0
## 1 0 1
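Because high_crime (and crim, from which it was derived) leaked into the predictor set, this fit is degenerate. A minimal corrected sketch using the x/y interface, dropping both columns:
# Drop crim (defines the label) and high_crime (the label itself)
nb_vars <- setdiff(names(Boston), c("crim", "high_crime"))
nb_model2 <- naiveBayes(x = Boston[, nb_vars], y = Boston$high_crime)
nb_pred <- predict(nb_model2, Boston[, nb_vars])
mean(nb_pred == Boston$high_crime) # in-sample accuracy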
4. KNN (knn_model)
First, the data set is divided into two parts: one for training the model and one for testing its performance.
A k-nearest neighbors (KNN) model is then trained on a small subset of predictors: age, rad, and tax. KNN makes no assumptions about the underlying data distribution; it classifies an observation by the majority class among its k nearest neighbors in the feature space.
Finally, the test-set accuracy of the KNN model is computed and displayed, showing how well the model predicts whether held-out tracts have above-median crime.
# Split the dataset into train and test sets
set.seed(123) # for reproducibility
train_index <- sample(1:nrow(Boston), 0.7*nrow(Boston)) # 70% train, 30% test
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
# Fit KNN model using selected predictors
k <- 5 # value of k for KNN
knn_model <- knn(train = train_data[, c("age", "rad", "tax")],
test = test_data[, c("age", "rad", "tax")],
cl = train_data$high_crime,
k = k)
# Print the accuracy of the KNN model
accuracy <- mean(knn_model == test_data$high_crime)*100
formatted_mean_accuracy <- sprintf("%.2f", accuracy)
cat("Mean Accuracy:", formatted_mean_accuracy,"% \n")
## Mean Accuracy: 91.45 %
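Because KNN is distance-based, unscaled predictors let tax, whose values are in the hundreds, dominate age and rad in the distance computation. A sketch of the same model with standardized features, applying the training means and standard deviations to both sets:
# Standardize the three predictors before running KNN
vars <- c("age", "rad", "tax")
train_X <- scale(train_data[, vars])
test_X <- scale(test_data[, vars],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))
knn_scaled <- knn(train = train_X, test = test_X,
                  cl = train_data$high_crime, k = 5)
mean(knn_scaled == test_data$high_crime) # test accuracy after scaling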
CONCLUSION
In this analysis of the Boston data set, several classification methods were applied to predict whether the crime rate in a census tract is above or below the median.
Logistic regression, Linear Discriminant Analysis (LDA), naive Bayes, and k-nearest neighbors (KNN) were implemented. The full-predictor logistic regression and naive Bayes fits illustrate an important pitfall: when the response is derived from crim, leaving crim (or the response itself) among the predictors produces degenerate, perfectly separated models, so those columns should be excluded. The LDA group means highlight nox, rad, tax, age, dis, lstat, and medv as variables that differ sharply between the two classes.
The KNN model achieved an accuracy of 91.45% on the test set. Together these approaches illustrate a range of techniques for crime-rate classification in urban areas.
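For a like-for-like comparison, the other three classifiers can also be evaluated on the same 70/30 split used for KNN. A sketch (not run above), excluding crim throughout to avoid leaking the response:
# Evaluate logistic regression, LDA, and naive Bayes on the held-out 30%
test_acc <- function(pred) mean(pred == test_data$high_crime) * 100
glm_fit <- glm(high_crime ~ . - crim, data = train_data, family = binomial)
glm_pred <- ifelse(predict(glm_fit, test_data, type = "response") > 0.5,
                   "High", "Low")
lda_fit <- lda(high_crime ~ . - crim, data = train_data)
lda_pred <- predict(lda_fit, test_data)$class
nb_vars <- setdiff(names(train_data), c("crim", "high_crime"))
nb_fit <- naiveBayes(x = train_data[, nb_vars], y = train_data$high_crime)
nb_pred <- predict(nb_fit, test_data[, nb_vars])
c(logistic = test_acc(glm_pred), LDA = test_acc(lda_pred),
  NB = test_acc(nb_pred))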