Homework 4: Logit Models

rm(list = ls())
gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 524288 28.0    1167772 62.4   660385 35.3
## Vcells 951845  7.3    8388608 64.0  1770057 13.6

# Set working directory
directory <- "C:/Users/mikem/iCloudDrive/Spring 2025/DATA 712/Data sets"
setwd(directory)

# Set seed for reproducibility
set.seed(123)

# Load Data set
shootings <- read.csv("NYPD_Shooting_Data.csv", stringsAsFactors = FALSE)

For this assignment, you will use the NYPD_Shooting_Data.csv data set. This data set contains information on shootings in New York City from 2006 to 2023. The reason I chose this data set opposed to the NYC Disadvantaged Neighborhood Data Set that I had submitted for the previous assignment is that this data contains more binary variables, thus making it ideal to work with for the this assignment. This is the other major data set that I used for my ESS presentation, which will eventually be turned into my graduate thesis.

This data set was provided by the NYPD and is available on the NYC Open Data website.The link to the data set, along with other information pertaining to the data set, can be found here.

Of the 21 variables within this data set, 3 are binary variables. The variables are:

STATISTICAL_MURDER_FLAG Shooting resulted in the victim’s death which would be counted as a murder
- TRUE
- FALSE
PERP_SEX Perpetrator’s sex description
- MALE
- FEMALE
VIV_SEX Victim’s sex description
- MALE
- FEMALE

Likelhood Ration Test

I had to install the below package due to my GitHub Co Pilot not working correctly. I am currently troubleshooting it and should have it working by the time the next assignment is due.

install.packages("remotes", repos = "https://cran.rstudio.com/")  # If not installed

## Installing package into 'C:/Users/mikem/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'remotes' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\mikem\AppData\Local\Temp\Rtmp2fAbU3\downloaded_packages

remotes::install_github("IQSS/Zelig")

## Skipping install of 'Zelig' from a github remote, the SHA1 (4cfde99c) has not changed since last install.
##   Use `force = TRUE` to force installation

install.packages("radiant.data", dependencies = TRUE, repos = "https://cran.rstudio.com/")

## Installing package into 'C:/Users/mikem/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'radiant.data' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\mikem\AppData\Local\Temp\Rtmp2fAbU3\downloaded_packages

library(radiant.data)

## Warning: package 'radiant.data' was built under R version 4.3.3

## Loading required package: magrittr

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.3

## Loading required package: lubridate

## Warning: package 'lubridate' was built under R version 4.3.2

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

## Loading required package: tidyr

## Warning: package 'tidyr' was built under R version 4.3.3

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

## Loading required package: dplyr

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'radiant.data'

## The following objects are masked from 'package:lubridate':
## 
##     month, wday

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following object is masked from 'package:magrittr':
## 
##     set_attr

## The following object is masked from 'package:base':
## 
##     date

head(shootings)

##   INCIDENT_KEY OCCUR_DATE OCCUR_TIME     BORO LOC_OF_OCCUR_DESC PRECINCT
## 1    231974218 08/09/2021   01:06:00    BRONX                         40
## 2    177934247 04/07/2018   19:48:00 BROOKLYN                         79
## 3    255028563 12/02/2022   22:57:00    BRONX           OUTSIDE       47
## 4     25384540 11/19/2006   01:50:00 BROOKLYN                         66
## 5     72616285 05/09/2010   01:58:00    BRONX                         46
## 6     85875439 07/22/2012   21:35:00    BRONX                         42
##   JURISDICTION_CODE LOC_CLASSFCTN_DESC             LOCATION_DESC
## 1                 0                                             
## 2                 0                                             
## 3                 0             STREET            GROCERY/BODEGA
## 4                 0                                    PVT HOUSE
## 5                 0                      MULTI DWELL - APT BUILD
## 6                 2                    MULTI DWELL - PUBLIC HOUS
##   STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX      PERP_RACE VIC_AGE_GROUP
## 1                   false                                                18-24
## 2                    true          25-44        M WHITE HISPANIC         25-44
## 3                   false         (null)   (null)         (null)         25-44
## 4                    true        UNKNOWN        U        UNKNOWN         18-24
## 5                    true          25-44        M          BLACK           <18
## 6                   false          18-24        M          BLACK         18-24
##   VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD Latitude Longitude
## 1       M    BLACK  1006343.0   234270.0 40.80967 -73.92019
## 2       M    BLACK  1000082.9   189064.7 40.68561 -73.94291
## 3       M    BLACK  1020691.0   257125.0 40.87235 -73.86823
## 4       M    BLACK   985107.3   173349.8 40.64249 -73.99691
## 5       F    BLACK  1009853.5   247502.6 40.84598 -73.90746
## 6       M    BLACK  1011046.7   239814.2 40.82488 -73.90318
##                                         Lon_Lat
## 1  POINT (-73.92019278899994 40.80967347200004)
## 2 POINT (-73.94291302299996 40.685609672000055)
## 3                  POINT (-73.868233 40.872349)
## 4 POINT (-73.99691224999998 40.642489932000046)
## 5  POINT (-73.90746098599993 40.84598358900007)
## 6  POINT (-73.90317908399999 40.82487781900005)

Outcome Variable Table looking at Murder Flag based on Sex, Age and Race

library(texreg)

## Warning: package 'texreg' was built under R version 4.3.3

## Version:  1.39.4
## Date:     2024-07-23
## Author:   Philip Leifeld (University of Manchester)
## 
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").

## 
## Attaching package: 'texreg'

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following object is masked from 'package:magrittr':
## 
##     extract

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.3.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(dplyr)

# Convert necessary variables to factors
shootings <- shootings %>%
  mutate(
    STATISTICAL_MURDER_FLAG = as.factor(STATISTICAL_MURDER_FLAG),
    VIC_SEX = as.factor(VIC_SEX),
    VIC_AGE_GROUP = as.factor(VIC_AGE_GROUP),
    VIC_RACE = as.factor(VIC_RACE)
  )

# Model 1: Null model (Intercept only)
m0 <- glm(STATISTICAL_MURDER_FLAG ~ 1, data = shootings, family = binomial)

# Model 2: Adding VIC_SEX
m1 <- glm(STATISTICAL_MURDER_FLAG ~ VIC_SEX, data = shootings, family = binomial)

# Model 3: Adding VIC_AGE_GROUP
m2 <- glm(STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP, data = shootings, family = binomial)

# Model 4: Adding VIC_RACE
m3 <- glm(STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP + VIC_RACE, data = shootings, family = binomial)

# Conduct Likelihood Ratio Tests (Comparing nested models)
lrt_m1 <- lrtest(m1, m0)  # VIC_SEX vs Null
lrt_m2 <- lrtest(m2, m1)  # Adding VIC_AGE_GROUP
lrt_m3 <- lrtest(m3, m2)  # Adding VIC_RACE

# Print LRT results
print(lrt_m1)

## Likelihood ratio test
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ VIC_SEX
## Model 2: STATISTICAL_MURDER_FLAG ~ 1
##   #Df LogLik Df  Chisq Pr(>Chisq)
## 1   3 -14029                     
## 2   1 -14030 -2 2.4651     0.2915

print(lrt_m2)

## Likelihood ratio test
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP
## Model 2: STATISTICAL_MURDER_FLAG ~ VIC_SEX
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   9 -13909                         
## 2   3 -14029 -6 240.13  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(lrt_m3)

## Likelihood ratio test
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP + VIC_RACE
## Model 2: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1  15 -13886                         
## 2   9 -13909 -6 46.115  2.808e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Compute AIC and BIC for all models (fixing the print issue)
aic_values <- c(AIC(m0), AIC(m1), AIC(m2), AIC(m3))
bic_values <- c(BIC(m0), BIC(m1), BIC(m2), BIC(m3))

# Print AIC and BIC values properly
cat("AIC Values:\n", paste(aic_values, collapse = ", "), "\n")

## AIC Values:
##  28062.5113949385, 28064.0462947588, 27835.9139313256, 27801.7984486193

cat("BIC Values:\n", paste(bic_values, collapse = ", "), "\n")

## BIC Values:
##  28070.7712273806, 28088.8257920848, 27910.2524233037, 27925.6959352495

# Generate a well-organized HTML table for the models
htmlreg(list(m0, m1, m2, m3), 
        doctype = FALSE,            # Removes unnecessary HTML tags
        caption = "Progressive Logistic Regression Models",
        custom.model.names = c("Null Model", "Model 1", 
                               "Model 2", 
                               "Model 3"),
        file = "regression_comparison.html")  # Saves the table to an HTML file

## The table was written to the file 'regression_comparison.html'.

# Open the results in a web browser
browseURL("regression_comparison.html")

The table presents the results of a logistic regression analysis examining factors associated with whether a shooting resulted in a murder. Model 1 includes victim sex, but the coefficients for male and Female (labeled as VIC_SEXU) sex are not statistically significant, suggesting that victim sex alone does not strongly predict the outcome. Model 2 adds victim age group, showing that older victims, particularly those aged 65 and above, have a significantly higher likelihood of the shooting resulting in a murder, while younger victims (18-24, 25-44, and 45-64) also show increased risk compared to the reference group. Model 3 introduces victim race, but the coefficients for racial groups do not appear statistically significant, potentially due to high standard errors. Across models, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values decrease, indicating improved model fit with additional predictors. The likelihood ratio tests between models suggest that age group contributes significantly to predicting the outcome, whereas victim sex and race do not appear to have strong independent effects.

Lilkelhood Ratio Test Results

# Perform likelihood ratio test (LRT) using anova()
anova(m0, m1, test = "Chisq")  # Compare Model 1 (VIC_SEX) to Null Model

## Analysis of Deviance Table
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ 1
## Model 2: STATISTICAL_MURDER_FLAG ~ VIC_SEX
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     28561      28061                     
## 2     28559      28058  2   2.4651   0.2915

anova(m1, m2, test = "Chisq")  # Compare Model 2 (+VIC_AGE_GROUP) to Model 1

## Analysis of Deviance Table
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ VIC_SEX
## Model 2: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     28559      28058                          
## 2     28553      27818  6   240.13 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m2, m3, test = "Chisq")  # Compare Model 3 (+VIC_RACE) to Model 2

## Analysis of Deviance Table
## 
## Model 1: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP
## Model 2: STATISTICAL_MURDER_FLAG ~ VIC_SEX + VIC_AGE_GROUP + VIC_RACE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     28553      27818                          
## 2     28547      27772  6   46.115 2.808e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the results of the likelihood ratio tests, we can see that the p-values for all the tests are less than 0.05, which indicates that the models are statistically significant. This means that the models are better than the null model. The best choice of model is the one with the lowest AIC and BIC values. In this case, Model 3 has the lowest AIC and BIC values, which means that it is the best model among the four models. This model includes the variables VIC_SEX, VIC_AGE and VIC_RACE when predicting the STATISTICAL_MURDER_FLAG.

DATA 712 Week 4 Homework

2025-03-08

Homework 4: Logit Models

Likelhood Ration Test

Outcome Variable Table looking at Murder Flag based on Sex, Age and Race

Lilkelhood Ratio Test Results