In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from them). Below is a short description of the variables of interest in the data set:
zn: proportion of residential land zoned for large lots
indus: proportion of industry
chas: Charles River indicator
nox: nitrogen oxide (NOx) concentration
rm: average number of rooms
age: dwelling age
dis: weighted mean of distances to five Boston employment centers
rad: index of radial highway accessibility
tax: property tax rate
ptratio: pupil-teacher ratio
lstat: percent of population of lower socioeconomic status
medv: median home value
target: whether the crime rate is above the median (1) or not (0)
We started by loading the necessary packages for this project, which allow us to create visuals and perform various types of analyses.
Here we preview both datasets prior to any manipulation. They contain the same variables, except that the evaluation dataset does not include the target variable.
## Dimensions of the crime training dataset:
## [1] 466 13
##
## Summary of variables in the crime training dataset:
## 'data.frame': 466 obs. of 13 variables:
## $ zn : num 0 0 0 30 0 0 0 0 0 80 ...
## $ indus : num 19.58 19.58 18.1 4.93 2.46 ...
## $ chas : int 0 1 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
## $ rm : num 7.93 5.4 6.49 6.39 7.16 ...
## $ age : num 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
## $ dis : num 2.05 1.32 1.98 7.04 2.7 ...
## $ rad : int 5 5 24 6 3 5 24 24 5 1 ...
## $ tax : int 403 403 666 300 193 384 666 666 224 315 ...
## $ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
## $ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
## $ medv : num 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
## $ target : int 1 1 1 0 0 0 1 1 0 0 ...
The crime training dataset comprises 466 observations and 13 variables, including factors like land zoning, environmental indicators, socioeconomic status, and housing features. The target variable denotes whether the crime rate is above the median, aiding in predictive modeling to identify neighborhoods at risk for higher crime levels.
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
This summary provides descriptive statistics for each variable in the dataset. For example, the zn variable, the proportion of residential land zoned for large lots, ranges from 0 to 100 with a mean of 11.58, while the binary target variable has a mean of 0.4914, indicating roughly equal representation of neighborhoods above and below the median crime rate. Further analysis is needed for more detail.
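As a quick check of this balance, a minimal sketch (using the original column names, before any renaming):
# Share of neighborhoods below (0) and above (1) the median crime rate
prop.table(table(crimetrainingdata$target))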
By visualizing the frequencies of the predictor variables, we can see that several of the distributions are positively skewed, especially the ‘dis’ variable, the weighted mean of distances to five Boston employment centers, and the ‘rad’ variable, an index of radial highway accessibility. We can attempt to transform them toward a more normal-like distribution.
## Warning in scale_y_continuous(trans = "log10", breaks = c(1, 10, 100, 1000), :
## log-10 transformation introduced infinite values.
## Warning: Removed 772 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In this boxplot, we have separated the variables by whether the ‘target’ variable is 0 or 1. Values corresponding to neighborhoods with a crime rate below the median (0) are displayed in red, while those with a crime rate above the median (1) are displayed in green. The warnings above arise because the log10 scale cannot display the many zero values (for example, in zn and chas). Looking at this boxplot, we can see that both groups are widely distributed.
Lastly, before manipulating any data, we created a heat map to check how correlated these variables are with one another. The ‘target’ variable has strong positive correlations with a few predictors: ‘nox’ (nitrogen oxide concentration), ‘age’, ‘rad’ (accessibility to radial highways), and the property tax rate. It is negatively correlated with ‘dis’ (distance to employment centers) and ‘zn’ (residential zoning lots).
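A quick way to quantify these relationships is to sort each predictor's correlation with the target; a small sketch using the original column names:
# Correlation of every variable with the target, sorted ascending
sort(cor(crimetrainingdata)[, "target"])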
## zn indus chas nox rm age dis rad tax ptratio
## 0 0 0 0 0 0 0 0 0 0
## lstat medv target
## 0 0 0
Since there is no missing data, it will not be necessary to impute or remove any data.
As seen in the boxplot, the outliers appear to be legitimate values rather than data errors, so we kept the data as-is.
For readability, we renamed several of the variables, as this makes it easier for a viewer to interpret the data and the visuals.
Because some of the predictor variables are highly skewed, we transformed them in an attempt to create a more normal distribution. The log() function is applied to Rad_Highway_Accessibility and dis after adding 1 to each variable; the +1 shift is a standard guard against taking the log of zero (here both variables are at least 1, so the shift is harmless). Overall, this reduces the right skew in the distributions of these variables and yields something closer to a normal distribution, making them more suitable for statistical analyses.
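A minimal sketch of the transformation, mirroring the code in the appendix:
# log(x + 1) transform of the two right-skewed predictors
crimetrainingdata_transformed <- crimetrainingdata %>%
  mutate(across(c(Rad_Highway_Accessibility, dis), ~ log(.x + 1)))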
On the left-hand side, we can see the distributions of the ‘Rad_Highway_Accessibility’ and ‘dis’ variables before transformation. On the right-hand side, we can see the same two variables after the logarithmic transformation; their distributions are somewhat closer to normal. We can now begin to build models for the data!
##
## Call:
## glm(formula = Crime_Above_Median ~ ., family = binomial, data = crimetrainingdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## Residential_Zoned_Lots -0.065946 0.034656 -1.903 0.05706 .
## Industry_Proportion -0.064614 0.047622 -1.357 0.17485
## Charles_River 0.910765 0.755546 1.205 0.22803
## NOx_Concentration 49.122297 7.931706 6.193 5.90e-10 ***
## Avg_Rooms -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## Rad_Highway_Accessibility 0.666366 0.163152 4.084 4.42e-05 ***
## Property_Tax -0.006171 0.002955 -2.089 0.03674 *
## Pupil_Teacher_Ratio 0.402566 0.126627 3.179 0.00148 **
## Percent_Lower_Status 0.045869 0.054049 0.849 0.39608
## Median_Home_Value 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
The logistic regression model estimates the relationship between the predictor variables and the likelihood of a neighborhood having a crime rate above the median. We first fit the model without manipulating any variables, to look at the data as a whole. In this model, the significant predictors (at the 0.05 level) are NOx concentration, dwelling age, distance to employment centers, radial highway accessibility, property tax, pupil-teacher ratio, and median home value; residential zoned lots is only marginally significant (p ≈ 0.057). For instance, an increase in NOx concentration or radial highway accessibility is associated with higher odds of a neighborhood having a crime rate above the median. In contrast, industry proportion, proximity to the Charles River, average number of rooms, and percent lower status do not show significant associations. Overall, the model suggests that environmental factors, socioeconomic indicators, and neighborhood characteristics play crucial roles in determining crime rates.
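To make the "higher odds" statement concrete, each coefficient can be converted to an odds ratio by exponentiating it; a short sketch using model_1 from the appendix:
# Odds ratios for model 1; e.g., exp(0.666) ~ 1.95 means each one-unit
# increase in the highway accessibility index nearly doubles the odds
exp(coef(model_1))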
##
## Call:
## glm(formula = Crime_Above_Median ~ ., family = binomial, data = crimetrainingdata_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -51.310805 7.756344 -6.615 3.71e-11 ***
## Residential_Zoned_Lots -0.052757 0.031585 -1.670 0.094862 .
## Industry_Proportion -0.013695 0.050268 -0.272 0.785288
## Charles_River 0.764349 0.747062 1.023 0.306242
## NOx_Concentration 50.077439 7.919249 6.324 2.56e-10 ***
## Avg_Rooms -0.637495 0.750640 -0.849 0.395732
## age 0.038494 0.014335 2.685 0.007244 **
## dis 4.252370 1.205621 3.527 0.000420 ***
## Rad_Highway_Accessibility 4.422877 0.961691 4.599 4.24e-06 ***
## Property_Tax -0.007034 0.003246 -2.167 0.030242 *
## Pupil_Teacher_Ratio 0.465753 0.132312 3.520 0.000431 ***
## Percent_Lower_Status 0.039874 0.054779 0.728 0.466666
## Median_Home_Value 0.204941 0.072832 2.814 0.004895 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 187.04 on 453 degrees of freedom
## AIC: 213.04
##
## Number of Fisher Scoring iterations: 8
The same logistic regression is refit using the two transformed variables to see whether their more normal distributions affect the model. The set of significant predictors is essentially unchanged: NOx concentration, dwelling age, distance to employment centers, radial highway accessibility, property tax, pupil-teacher ratio, and median home value. The null deviance is identical (the null model is unchanged), but the residual deviance (187.04 vs. 192.05) and AIC (213.04 vs. 218.05) are both lower, indicating a slightly better fit, and the fit converged in fewer Fisher scoring iterations. Overall, the model again suggests that environmental factors, socioeconomic indicators, and neighborhood characteristics play crucial roles in determining crime rates.
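A direct way to compare the two fits is by AIC; a one-line sketch:
# Lower AIC is better; AIC penalizes models for extra parameters
AIC(model_1, model_2)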
## Start: AIC=218.05
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion +
## Charles_River + NOx_Concentration + Avg_Rooms + age + dis +
## Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio +
## Percent_Lower_Status + Median_Home_Value
##
## Df Deviance AIC
## - Avg_Rooms 1 192.71 216.71
## - Percent_Lower_Status 1 192.77 216.77
## - Charles_River 1 193.53 217.53
## - Industry_Proportion 1 193.99 217.99
## <none> 192.05 218.05
## - Property_Tax 1 196.59 220.59
## - Residential_Zoned_Lots 1 196.89 220.89
## - age 1 198.73 222.73
## - Median_Home_Value 1 199.95 223.95
## - Pupil_Teacher_Ratio 1 203.32 227.32
## - dis 1 203.84 227.84
## - Rad_Highway_Accessibility 1 233.74 257.74
## - NOx_Concentration 1 265.05 289.05
##
## Step: AIC=216.71
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion +
## Charles_River + NOx_Concentration + age + dis + Rad_Highway_Accessibility +
## Property_Tax + Pupil_Teacher_Ratio + Percent_Lower_Status +
## Median_Home_Value
##
## Df Deviance AIC
## - Charles_River 1 194.24 216.24
## - Percent_Lower_Status 1 194.32 216.32
## - Industry_Proportion 1 194.58 216.58
## <none> 192.71 216.71
## + Avg_Rooms 1 192.05 218.05
## - Property_Tax 1 197.59 219.59
## - Residential_Zoned_Lots 1 198.07 220.07
## - age 1 199.11 221.11
## - Pupil_Teacher_Ratio 1 203.53 225.53
## - dis 1 203.85 225.85
## - Median_Home_Value 1 205.35 227.35
## - Rad_Highway_Accessibility 1 233.81 255.81
## - NOx_Concentration 1 265.14 287.14
##
## Step: AIC=216.24
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion +
## NOx_Concentration + age + dis + Rad_Highway_Accessibility +
## Property_Tax + Pupil_Teacher_Ratio + Percent_Lower_Status +
## Median_Home_Value
##
## Df Deviance AIC
## - Industry_Proportion 1 195.51 215.51
## <none> 194.24 216.24
## - Percent_Lower_Status 1 196.33 216.33
## + Charles_River 1 192.71 216.71
## + Avg_Rooms 1 193.53 217.53
## - Residential_Zoned_Lots 1 200.59 220.59
## - Property_Tax 1 200.75 220.75
## - age 1 201.00 221.00
## - Pupil_Teacher_Ratio 1 203.94 223.94
## - dis 1 204.83 224.83
## - Median_Home_Value 1 207.12 227.12
## - Rad_Highway_Accessibility 1 241.41 261.41
## - NOx_Concentration 1 265.19 285.19
##
## Step: AIC=215.51
## Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration +
## age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio +
## Percent_Lower_Status + Median_Home_Value
##
## Df Deviance AIC
## - Percent_Lower_Status 1 197.32 215.32
## <none> 195.51 215.51
## + Industry_Proportion 1 194.24 216.24
## + Charles_River 1 194.58 216.58
## + Avg_Rooms 1 194.86 216.86
## - Residential_Zoned_Lots 1 202.05 220.05
## - age 1 202.23 220.23
## - Pupil_Teacher_Ratio 1 205.01 223.01
## - dis 1 205.96 223.96
## - Property_Tax 1 206.60 224.60
## - Median_Home_Value 1 208.13 226.13
## - Rad_Highway_Accessibility 1 249.55 267.55
## - NOx_Concentration 1 270.59 288.59
##
## Step: AIC=215.32
## Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration +
## age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio +
## Median_Home_Value
##
## Df Deviance AIC
## <none> 197.32 215.32
## + Percent_Lower_Status 1 195.51 215.51
## + Avg_Rooms 1 195.75 215.75
## + Charles_River 1 195.97 215.97
## + Industry_Proportion 1 196.33 216.33
## - Residential_Zoned_Lots 1 203.45 219.45
## - Pupil_Teacher_Ratio 1 206.27 222.27
## - age 1 207.13 223.13
## - Property_Tax 1 207.62 223.62
## - dis 1 207.64 223.64
## - Median_Home_Value 1 208.65 224.65
## - Rad_Highway_Accessibility 1 250.98 266.98
## - NOx_Concentration 1 273.18 289.18
With the stepwise logistic regression (direction = "both"), at each step the procedure considers dropping or re-adding a single predictor and keeps whichever change most reduces the AIC; this helps identify the predictors most relevant for predicting the probability of a crime rate above the median. The process continues until no single addition or removal decreases the AIC. Here, the final model includes Residential_Zoned_Lots, NOx_Concentration, age, dis, Rad_Highway_Accessibility, Property_Tax, Pupil_Teacher_Ratio, and Median_Home_Value; these variables are most associated with the crime rate, as indicated by the lowest AIC value of 215.32.
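The reported AIC values can be verified by hand: AIC = residual deviance + 2k, where k is the number of estimated coefficients. For the final model, k = 9 (intercept plus 8 predictors):
# Residual deviance 197.32 with 9 estimated coefficients
197.32 + 2 * 9  # = 215.32, matching the reported AIC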
##
## Call:
## glm(formula = Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration +
## age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio +
## Median_Home_Value, family = binomial, data = crimetrainingdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## Residential_Zoned_Lots -0.068648 0.032019 -2.144 0.03203 *
## NOx_Concentration 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## Rad_Highway_Accessibility 0.725109 0.149788 4.841 1.29e-06 ***
## Property_Tax -0.007756 0.002653 -2.924 0.00346 **
## Pupil_Teacher_Ratio 0.323628 0.111390 2.905 0.00367 **
## Median_Home_Value 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
In this model, all retained predictors are significant at the 0.05 level: residential zoned lots, NOx concentration, dwelling age, distance to employment centers, radial highway accessibility, property tax, pupil-teacher ratio, and median home value. This is consistent with the previous models and narrows the specification down to the most relevant variables.
Null Hypothesis: There is no association between the observed variables and the reported crime rate.
Alternative Hypothesis: There is an association between at least one of the observed variables and the reported crime rate.
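These hypotheses can be tested with a likelihood-ratio (chi-square) test on the drop in deviance from the null model; a sketch for the stepwise model:
# 645.88 - 197.32 = 448.56 on 465 - 457 = 8 degrees of freedom
pchisq(645.88 - 197.32, df = 465 - 457, lower.tail = FALSE)
# p-value is essentially zero, so we reject the null hypothesis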
With model 1, we created an ROC curve. The AUC, or area under the curve, is 0.951. Being close to 1, this indicates that the logistic regression model discriminates well: there is a 95.1% chance that a randomly chosen positive instance (above-median crime) receives a higher predicted probability than a randomly chosen negative instance. In other words, the model has a high probability of distinguishing between crime rates below and above the median.
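This interpretation can be checked directly from the predicted scores; a hedged sketch, assuming predictions and y_test as computed inside get_performance in the appendix:
# Fraction of (positive, negative) pairs ranked correctly; this equals
# the AUC (ties, if any, are ignored here)
pos <- predictions[y_test == 1]
neg <- predictions[y_test == 0]
mean(outer(pos, neg, ">"))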
With model 2, the AUC is 0.984. Being close to 1, this again indicates strong discrimination: there is a 98.4% chance that a randomly chosen positive instance receives a higher predicted probability than a randomly chosen negative one. Compared to the first model, it discriminates slightly better; we can infer that this may be due to the logarithmic transformations performed on the two skewed variables.
With model 3, the AUC is 0.972. Being close to 1, this also indicates strong discrimination: a 97.2% chance that a randomly chosen positive instance outranks a randomly chosen negative one. It performs slightly better than the first model but not as well as the second; all three models discriminate well.
## Model AUC Classification_Error Precision Sensitivity Specificity
## 1 Model 1 0.9509804 0.15053763 0.8627451 0.8627451 0.8333333
## 2 Model 2 0.9842593 0.09677419 0.8679245 0.9583333 0.8444444
## 3 Model 3 0.9717593 0.09677419 0.8979592 0.9166667 0.8888889
The confusion matrix is a table of counts of true negatives (actual 0, predicted 0), false positives (actual 0, predicted 1), false negatives (actual 1, predicted 0), and true positives (actual 1, predicted 1). In the first confusion matrix, there are 47 true negatives, 7 false positives, 2 false negatives, and 37 true positives; in the second, 47 true negatives, 4 false positives, 2 false negatives, and 40 true positives; and in the third, 45 true negatives, 2 false positives, 6 false negatives, and 40 true positives.
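The summary metrics follow directly from these counts; a sketch of the standard formulas (the metrics helper below is illustrative, and note that caret treats "0" as the positive class by default, which affects which class each metric refers to):
# Illustrative helper: metrics from confusion-matrix counts
metrics <- function(tn, fp, fn, tp) {
  c(error       = (fp + fn) / (tn + fp + fn + tp),
    precision   = tp / (tp + fp),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
metrics(tn = 47, fp = 4, fn = 2, tp = 40)  # model 2 counts quoted above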
We also made a dataframe comparing the performance of the three models. Model 2 emerges as the top performer, boasting the highest AUC value of 0.9842593, indicating the strongest ability to discriminate between classes. Additionally, along with Model 3, it has the lowest classification error rate of 0.09677419, suggesting minimal misclassification. Models 2 and 3 both have high precision, at 0.8679245 and 0.8979592 respectively, reflecting their accuracy in positive predictions. Model 2 demonstrates the highest sensitivity at 0.9583333, indicating its effectiveness in correctly identifying true positives, while Model 3 stands out with the highest specificity at 0.8888889, meaning it is most proficient at recognizing true negatives. Overall, while all models perform strongly, Model 2 emerges as the best choice, followed by Model 3.
In conclusion, we conducted an in-depth analysis and modeling of datasets containing information on crime rates across various neighborhoods of a major city. Three logistic regression models were built to predict whether a neighborhood would be at risk for high crime levels, utilizing different sets of predictor variables and transformations. After thorough evaluation using performance metrics such as the area under the ROC curve (AUC), classification error, precision, sensitivity, and specificity, Model 2 emerged as the top performer. This model, which incorporated logarithmic transformations of skewed variables, demonstrated the highest AUC of 0.984 and the lowest classification error rate of 0.097 among the three models. Additionally, Model 2 exhibited strong precision, sensitivity, and specificity, indicating its effectiveness in accurately predicting both positive and negative outcomes. Overall, the findings suggest that environmental factors, socioeconomic indicators, and neighborhood characteristics play significant roles in determining crime rates, and Model 2 provides the most reliable predictive performance for identifying neighborhoods at risk for higher crime levels.
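As a final step toward the stated objective, the chosen model can produce classifications and probabilities for the evaluation set; a hedged sketch, assuming the evaluation data receive the same renaming and log transformations as the training data:
# Prepare the evaluation data with the same pipeline as the training data
crimeevaldata_prepped <- crimeevaldata %>%
  rename(Residential_Zoned_Lots = zn, Industry_Proportion = indus,
         Charles_River = chas, NOx_Concentration = nox, Avg_Rooms = rm,
         Rad_Highway_Accessibility = rad, Pupil_Teacher_Ratio = ptratio,
         Property_Tax = tax, Percent_Lower_Status = lstat,
         Median_Home_Value = medv) %>%
  mutate(across(c(Rad_Highway_Accessibility, dis), ~ log(.x + 1)))
# Predicted probabilities and 0/1 classifications from model 2
eval_probs <- predict(model_2, crimeevaldata_prepped, type = "response")
eval_class <- as.numeric(eval_probs > 0.5)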
# Setup: hide code in the rendered report, load packages, fix the RNG seed
knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(ggplot2)
library(tidyr)
library(reshape2)
library(cowplot)
library(caret)
library(pROC)
set.seed(10)
# Load the evaluation and training datasets
crimeevaldata <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA621_HW3/main/crime-evaluation-data_modified.csv')
head(crimeevaldata)
crimetrainingdata <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA621_HW3/main/crime-training-data_modified.csv')
head(crimetrainingdata)
cat("Dimensions of the crime training dataset:\n")
dim(crimetrainingdata)
cat("\nSummary of variables in the crime training dataset:\n")
str(crimetrainingdata)
summary(crimetrainingdata)
# Helper to draw a histogram for a single variable
create_histogram <- function(data, variable_name) {
  ggplot(data, aes(x = !!sym(variable_name))) +
    geom_histogram(fill = "red", color = "black", bins = 30) +
    labs(title = paste("Histogram of", variable_name), x = variable_name, y = "Frequency") +
    theme_minimal()
}
# Draw histograms for every predictor (everything except the target)
plot_histograms <- function(data) {
  variables <- names(data)[-which(names(data) == "target")]
  histograms <- lapply(variables, function(var) create_histogram(data, var))
  plot_grid(plotlist = histograms, ncol = 3)
}
plot_histograms(crimetrainingdata)
melted_data <- melt(crimetrainingdata, id.vars = "target")
ggplot(melted_data, aes(x = variable, y = value, fill = factor(target))) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10", breaks = c(1, 10, 100, 1000), labels = c(1, 10, 100, 1000)) +
  labs(title = "Box Plot of Observed Variables by Neighborhood",
       x = "Variables",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
correlation_matrix <- cor(crimetrainingdata)
correlation_df <- as.data.frame(as.table(correlation_matrix))
colnames(correlation_df) <- c("Variable_1", "Variable_2", "Correlation")
ggplot(correlation_df, aes(x = Variable_1, y = Variable_2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap",
       x = "Variables",
       y = "Variables")
missing_data <- sapply(crimetrainingdata, function(x) sum(is.na(x)))
print(missing_data)
missing_percentage <- colMeans(is.na(crimetrainingdata)) * 100
ggplot(data.frame(variable = names(missing_percentage), percentage = missing_percentage),
       aes(x = variable, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Missing Data",
       x = "Variables",
       y = "Percentage")
# Rename columns to descriptive names for readability
crimetrainingdata <- crimetrainingdata %>%
  rename(Residential_Zoned_Lots = zn,
         Industry_Proportion = indus,
         Charles_River = chas,
         NOx_Concentration = nox,
         Avg_Rooms = rm,
         Rad_Highway_Accessibility = rad,
         Pupil_Teacher_Ratio = ptratio,
         Property_Tax = tax,
         Percent_Lower_Status = lstat,
         Median_Home_Value = medv,
         Crime_Above_Median = target)
head(crimetrainingdata)
# Log-transform the two right-skewed predictors (+1 guards against zeros)
crimetrainingdata_transformed <- crimetrainingdata %>%
  mutate(across(c("Rad_Highway_Accessibility", "dis"), ~log(. + 1)))
# create_histogram() was defined above; reuse it for before/after plots
hist_rad_before <- create_histogram(crimetrainingdata, "Rad_Highway_Accessibility")
hist_dis_before <- create_histogram(crimetrainingdata, "dis")
hist_rad_after <- create_histogram(crimetrainingdata_transformed, "Rad_Highway_Accessibility")
hist_dis_after <- create_histogram(crimetrainingdata_transformed, "dis")
plot_grid(hist_rad_before, hist_rad_after, hist_dis_before, hist_dis_after, ncol = 2)
# Model 1: logistic regression on all (untransformed) predictors
model_1 <- glm(Crime_Above_Median ~ ., data = crimetrainingdata, family = binomial)
summary(model_1)
plot(model_1)
# Model 2: same specification on the log-transformed data
model_2 <- glm(Crime_Above_Median ~ ., data = crimetrainingdata_transformed, family = binomial)
summary(model_2)
plot(model_2)
# Model 3: stepwise AIC selection (both directions), starting from model 1
stepwise_model <- step(model_1, direction = "both", scope = list(upper = ~ ., lower = ~ 1))
summary(stepwise_model)
plot(stepwise_model)
# Hold out 20% of the data, predict, and report the confusion matrix and ROC
get_performance <- function(data_frame, model, split = 0.8) {
  n <- ncol(data_frame)  # target is the last column
  trainIndex <- createDataPartition(data_frame[, n], p = split, list = FALSE)
  data_test <- data_frame[-trainIndex, ]
  x_test <- data_test[, -n]
  y_test <- data_test[, n]
  predictions <- predict(model, x_test, type = 'response')
  cm <- confusionMatrix(data = as.factor(as.numeric(predictions > 0.5)),
                        reference = as.factor(y_test))
  roc_result <- roc(y_test, predictions)
  plot(roc_result, print.auc = TRUE)
  # Return a list so callers can access $confusion_matrix directly
  # (previously cm was returned bare, so $confusion_matrix was NULL)
  return(list(confusion_matrix = cm$table))
}
model_performance <- get_performance(crimetrainingdata, model_1)
model_performance <- get_performance(crimetrainingdata_transformed, model_2)
model_performance$confusion_matrix
model_performance <- get_performance(crimetrainingdata, stepwise_model)
model_performance$confusion_matrix
set.seed(10)
# Extended version: also compute AUC, classification error, precision,
# sensitivity, and specificity on the held-out split
get_performance <- function(data_frame, model, split = 0.8) {
  n <- ncol(data_frame)
  trainIndex <- createDataPartition(data_frame[, n], p = split, list = FALSE)
  data_test <- data_frame[-trainIndex, ]
  x_test <- data_test[, -n]
  y_test <- data_test[, n]
  predictions <- predict(model, x_test, type = 'response')
  cm <- confusionMatrix(data = as.factor(as.numeric(predictions > 0.5)),
                        reference = as.factor(y_test))
  roc_result <- roc(y_test, predictions)
  classification_error <- 1 - cm$overall['Accuracy']
  precision <- cm$byClass['Pos Pred Value']
  sensitivity <- cm$byClass['Sensitivity']
  specificity <- cm$byClass['Specificity']
  return(list(
    confusion_matrix = cm$table,
    auc = auc(roc_result),
    classification_error = classification_error,
    precision = precision,
    sensitivity = sensitivity,
    specificity = specificity
  ))
}
performance_model_1 <- get_performance(crimetrainingdata, model_1)
performance_model_2 <- get_performance(crimetrainingdata_transformed, model_2)
performance_model_3 <- get_performance(crimetrainingdata, stepwise_model)
model_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  AUC = c(performance_model_1$auc, performance_model_2$auc, performance_model_3$auc),
  Classification_Error = c(performance_model_1$classification_error,
                           performance_model_2$classification_error,
                           performance_model_3$classification_error),
  Precision = c(performance_model_1$precision, performance_model_2$precision, performance_model_3$precision),
  Sensitivity = c(performance_model_1$sensitivity, performance_model_2$sensitivity, performance_model_3$sensitivity),
  Specificity = c(performance_model_1$specificity, performance_model_2$specificity, performance_model_3$specificity)
)
print(model_comparison)