Temperature affects many of our day-to-day decisions. Consider the simple example of choosing an outfit: on a hot day, a person is more likely to wear shorts and less likely to wear a sweater. But is complex behavior, such as gun violence, also affected by temperature?
To determine whether there is a relationship between temperature and gun violence, data on total shootings in New York City from 2006 to 2021 were analyzed.
The data were retrieved from data.gov on August 24, 2022.
# Loading data into R and replacing blank values with NA; nys stands for New York Shooting
nys <- read.csv('https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD',
                header = TRUE, na.strings = c("", "NA"))
# Printing summary of data
summary(nys)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO
## Min. : 9953245 Length:25596 Length:25596 Length:25596
## 1st Qu.: 61593633 Class :character Class :character Class :character
## Median : 86437258 Mode :character Mode :character Mode :character
## Mean :112382648
## 3rd Qu.:166660833
## Max. :238490103
##
## PRECINCT JURISDICTION_CODE LOCATION_DESC STATISTICAL_MURDER_FLAG
## Min. : 1.00 Min. :0.0000 Length:25596 Length:25596
## 1st Qu.: 44.00 1st Qu.:0.0000 Class :character Class :character
## Median : 69.00 Median :0.0000 Mode :character Mode :character
## Mean : 65.87 Mean :0.3316
## 3rd Qu.: 81.00 3rd Qu.:0.0000
## Max. :123.00 Max. :2.0000
## NA's :2
## PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## Length:25596 Length:25596 Length:25596 Length:25596
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD
## Length:25596 Length:25596 Min. : 914928 Min. :125757
## Class :character Class :character 1st Qu.:1000011 1st Qu.:182782
## Mode :character Mode :character Median :1007715 Median :194038
## Mean :1009455 Mean :207894
## 3rd Qu.:1016838 3rd Qu.:239429
## Max. :1066815 Max. :271128
##
## Latitude Longitude Lon_Lat
## Min. :40.51 Min. :-74.25 Length:25596
## 1st Qu.:40.67 1st Qu.:-73.94 Class :character
## Median :40.70 Median :-73.92 Mode :character
## Mean :40.74 Mean :-73.91
## 3rd Qu.:40.82 3rd Qu.:-73.88
## Max. :40.91 Max. :-73.70
##
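The effect of the na.strings argument used when loading the data can be illustrated on a small scale. The sketch below uses a hypothetical three-row CSV string (csv_text and toy are illustrative names, and the values are invented) rather than the downloaded file.

```r
# Toy CSV text standing in for the downloaded file (HYPOTHETICAL values)
csv_text <- "BORO,PERP_SEX\nBRONX,\nQUEENS,M\nBROOKLYN,NA\n"

# na.strings lists the raw strings that read.csv should treat as missing;
# without "" in the list, blank character fields would stay as empty strings
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("", "NA"))

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Counting missing values per column in this way is one option for checking how complete each variable is before deciding which to keep.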
The output of the summary() function was used to identify variables to include/exclude from the analysis, and to ensure the data were in the appropriate format. After reviewing the output, 17 variables were removed from the data set. The variables were removed because they were outside the scope of the analysis (e.g., race of perpetrator, sex of victim) or had limited utility (e.g., longitude, latitude).
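Dates were converted from character strings to date-time objects and split into separate day, month, and year columns. The resulting data frame was then stacked onto the original using the bind_rows() function. Because bind_rows() appends rows rather than columns, the appended rows carry NA in the new day, month, and year columns.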
# Removing excluded categories from data frame
# nys_rmvd stands for New York Shooting Data with variables removed
nys_rmvd <- subset(nys, select = -c(INCIDENT_KEY, JURISDICTION_CODE, X_COORD_CD,
Y_COORD_CD, Latitude, Longitude, Lon_Lat,
PERP_RACE, PERP_SEX, PRECINCT,
PERP_AGE_GROUP, LOCATION_DESC, OCCUR_TIME,
STATISTICAL_MURDER_FLAG, VIC_AGE_GROUP,
VIC_SEX, VIC_RACE))
# Changing format of date
nys_rmvd$OCCUR_DATE <- as.POSIXct(nys_rmvd$OCCUR_DATE, format = "%m/%d/%Y")
# Creating a new data frame with separated date data
# (dplyr is attached here so that the %>% pipe is available)
library(dplyr)
df <- nys_rmvd %>%
  dplyr::mutate(YEAR = lubridate::year(OCCUR_DATE),
                MONTH = lubridate::month(OCCUR_DATE),
                DAY = lubridate::day(OCCUR_DATE))
# Binding two data frames; bind_rows() stacks rows, so the rows taken from
# nys_rmvd have NA for YEAR, MONTH, and DAY
nys_mod <- bind_rows(df, nys_rmvd)
# Removing old date data
nys_mod <- subset(nys_mod, select = -c(OCCUR_DATE))
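As a point of comparison, the date components can also be extracted with base R alone and added as columns directly, which avoids introducing NA rows. The following is a minimal sketch under that assumption; the date strings are hypothetical and only mimic the month/day/year layout of OCCUR_DATE.

```r
# HYPOTHETICAL date strings in the same month/day/year layout as OCCUR_DATE
occur_date <- c("01/15/2006", "08/24/2021", "12/31/2019")

# Parse to Date; the format string must match the raw text exactly
parsed <- as.Date(occur_date, format = "%m/%d/%Y")

# Extract each component as an integer and store it in its own column
toy <- data.frame(
  YEAR  = as.integer(format(parsed, "%Y")),
  MONTH = as.integer(format(parsed, "%m")),
  DAY   = as.integer(format(parsed, "%d"))
)
print(toy)
```

A date that does not match the format string is parsed as NA, so checking for NA after parsing is a quick way to catch malformed entries.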
The data were then piped into the count() function to determine the total shootings per month. Data scored as NA (i.e., not associated with a month) were excluded from the analysis and removed from the data frame.
Data on the monthly average temperature were not included in the original data set; therefore, they were retrieved from an external source. Monthly average high and low temperatures for 1991 to 2020 were retrieved from Current Results and manually entered into R. The data can be accessed using the hyperlinked text.
After the data were manually entered into an R vector, they were attached to a data frame containing the total number of shootings organized by month. The new data frame was used to generate scatter plots and build simple linear regression models.
# Data frame with total shootings organized by month
nys_count <- nys_mod %>% count(MONTH)
# Removing the NA row from data frame (filtering on MONTH rather than on row position)
nys_count <- nys_count[!is.na(nys_count$MONTH), ]
# Average high temperature for NYC by month
average_high <- c(39, 42, 50, 63, 73, 80, 86, 84, 78, 66, 54, 45)
average_low <- c(27, 29, 35, 46, 56, 64, 71, 69, 63, 53, 41, 35)
# Adding average high and low temperatures to data frame
nys_count$average_high <- average_high
nys_count$average_low <- average_low
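Attaching the temperature vectors by position works only when the count table has exactly twelve rows in calendar order. One alternative, sketched below with base R, is to key the temperatures by month and join on that key. The month values are hypothetical; the temperatures are the ones entered above, and counts, temps, and toy_count are illustrative names.

```r
# HYPOTHETICAL month values, including NAs like those left by the row bind
month <- c(1, 1, 7, 7, 7, 12, NA, NA)

# table() drops NA by default, so only real months are counted
counts <- as.data.frame(table(MONTH = month),
                        responseName = "n", stringsAsFactors = FALSE)
counts$MONTH <- as.integer(counts$MONTH)

# Temperature lookup keyed by month (values from the vectors above)
temps <- data.frame(
  MONTH = 1:12,
  average_high = c(39, 42, 50, 63, 73, 80, 86, 84, 78, 66, 54, 45),
  average_low  = c(27, 29, 35, 46, 56, 64, 71, 69, 63, 53, 41, 35)
)

# Join on MONTH so each count is matched to its own month's temperatures
toy_count <- merge(counts, temps, by = "MONTH")
print(toy_count)
```

Joining on a key keeps the pairing correct even if a month is absent from the counts or the rows arrive in a different order.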
The nys_count data frame and ggplot2 package were used to create two scatter plots. The purpose of creating the scatter plots was to determine if there was a correlation between temperature and total number of shootings.
# Scatter plot for average high temperature and total shootings
# (ggplot2 is attached here so that ggplot() is available)
library(ggplot2)
ggplot(data = nys_count, mapping = aes(x = average_high, y = n)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Total Shootings") +
  xlab("Average High Temperature (F)") +
  ggtitle("Average High Temperature and Shootings")
## `geom_smooth()` using formula 'y ~ x'
The data displayed above indicate a positive correlation between average high temperature and total shootings.
# Scatter plot for average low temperature and total shootings
ggplot(data = nys_count, mapping = aes(x = average_low, y = n)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Total Shootings") +
  xlab("Average Low Temperature (F)") +
  ggtitle("Average Low Temperature and Shootings")
## `geom_smooth()` using formula 'y ~ x'
Similar to the scatter plot for average high temperature and total shootings, the scatter plot for average low temperature and total shootings also shows a positive correlation.
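The visual impression from a scatter plot can be backed by a numeric estimate. The sketch below computes a Pearson correlation coefficient and its test with base R. The temperatures are the average highs used in the analysis, but the counts in toy_n are HYPOTHETICAL placeholders, not the New York City totals, so the printed values say nothing about the real data.

```r
# Average high temperatures by month, as entered in the analysis
average_high <- c(39, 42, 50, 63, 73, 80, 86, 84, 78, 66, 54, 45)

# HYPOTHETICAL monthly counts, invented only to make the example runnable
toy_n <- c(1200, 1000, 1400, 1700, 2200, 2500,
           2800, 2700, 2300, 2000, 1600, 1500)

# Pearson correlation coefficient (ranges from -1 to 1)
r <- cor(average_high, toy_n)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(average_high, toy_n)

print(r)
print(ct$p.value)
```

Substituting nys_count$n for toy_n would apply the same calculation to the actual monthly totals.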
# Linear model with average high temperature as the response and
# monthly shootings (n) as the predictor
linear_mod <- lm(average_high ~ n, data = nys_count)
# Summary of linear model
summary(linear_mod)
##
## Call:
## lm(formula = average_high ~ n, data = nys_count)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.708 -2.069 1.808 3.365 8.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.202113 7.235995 0.581 0.574
## n 0.027722 0.003285 8.439 7.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.26 on 10 degrees of freedom
## Multiple R-squared: 0.8769, Adjusted R-squared: 0.8646
## F-statistic: 71.22 on 1 and 10 DF, p-value: 7.351e-06
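The summary output shows a multiple R-squared of 0.8769 (adjusted R-squared of 0.8646), meaning that average high temperature and monthly shooting totals share roughly 88% of their variance. Note that the formula treats average high temperature as the response and shootings as the predictor; in a simple linear regression, R-squared is the same in either direction.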
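The adjusted R-squared in the output above can be reproduced from the multiple R-squared, the number of observations (12 months), and the number of predictors (1) using the standard adjustment formula. The helper name adjusted_r2 below is hypothetical and introduced only for this sketch.

```r
# Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# adjusted_r2 is a hypothetical helper, not part of the original analysis
adjusted_r2 <- function(r2, n_obs, n_pred) {
  1 - (1 - r2) * (n_obs - 1) / (n_obs - n_pred - 1)
}

# Multiple R-squared from the average high temperature model output
adj_high <- adjusted_r2(0.8769, n_obs = 12, n_pred = 1)
print(round(adj_high, 4))
```

With only 12 observations the adjustment is noticeable, which is one reason to state which of the two statistics is being quoted.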
# Linear model with average low temperature as the response and
# monthly shootings (n) as the predictor
linear_mod <- lm(average_low ~ n, data = nys_count)
# Summary of linear model
summary(linear_mod)
##
## Call:
## lm(formula = average_low ~ n, data = nys_count)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1188 -1.4885 0.5415 2.9339 6.5021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.800901 5.497538 -1.237 0.244
## n 0.026200 0.002496 10.498 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.756 on 10 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9085
## F-statistic: 110.2 on 1 and 10 DF, p-value: 1.016e-06
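The summary output shows a multiple R-squared of 0.9168 (adjusted R-squared of 0.9085), meaning that average low temperature and monthly shooting totals share roughly 92% of their variance. As with the previous model, the formula treats temperature as the response and shootings as the predictor.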
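Both lm() calls above place temperature on the left of the formula and shootings on the right. For a model with a single predictor, swapping the two sides changes the coefficients but not R-squared, which equals the squared Pearson correlation. The sketch below checks this with the average low temperatures from the analysis and HYPOTHETICAL counts (toy_n), so the specific numbers it prints do not describe the real data.

```r
# Average low temperatures by month, as entered in the analysis
average_low <- c(27, 29, 35, 46, 56, 64, 71, 69, 63, 53, 41, 35)

# HYPOTHETICAL monthly counts, invented only to make the example runnable
toy_n <- c(1200, 1000, 1400, 1700, 2200, 2500,
           2800, 2700, 2300, 2000, 1600, 1500)
toy <- data.frame(average_low = average_low, n = toy_n)

# Temperature as response (the direction used in the analysis)
r2_temp_on_n <- summary(lm(average_low ~ n, data = toy))$r.squared

# Shootings as response (the direction implied by the research question)
fit_n_on_temp <- lm(n ~ average_low, data = toy)
r2_n_on_temp <- summary(fit_n_on_temp)$r.squared

print(c(r2_temp_on_n, r2_n_on_temp, cor(toy$average_low, toy$n)^2))

# Slope: change in shootings per one-degree change in average low
print(coef(fit_n_on_temp)[["average_low"]])
```

Fitting shootings as the response gives a slope expressed in shootings per degree, which may be easier to interpret for the question posed here.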
Based on the results of the analysis, there appears to be a relationship between temperature and gun violence. Both scatter plots indicate a positive correlation between temperature and total shootings, and both models account for a large amount of the variance in the data.
The information from the analysis might be useful to law enforcement agencies and other stakeholder groups in New York City. For instance, it can inform when to allocate limited resources (e.g., utilizing additional resources during months with higher average temperatures).
Although the data from this analysis could be used to address gun violence, there are at least three limitations that affect its utility. First, the analysis used monthly temperature averages aggregated across 1991 to 2020, a period that begins before the shooting data start in 2006. Second, the analysis did not identify areas (e.g., public housing) associated with gun violence. Third, the data collection system may have introduced some bias, since some shootings go unreported (e.g., shootings in some areas may be under-reported). Ideally, future iterations of this project will address these limitations.