Data 101 Project

Author

Asher Scott

INTRO: For my final project I decided to use the “Index, Violent, Property, and Firearm Rates By County” dataset, which came directly from the free New York data government website. The question I set out to answer was a two parter: Was there a correlation between firearm rate and overall crime, as well as which County has the most crime? The parts that were necessary to answer this question were Population, Year, Firearm Rate: how many firearm related instances were reported every year per 100,000 people, Violent Rate: how many murders, rapes, robberies and aggravated assaults were reported every year per 100,000, and Property Rate: how many burglaries, larcenies and motor vehicle thefts were reported every year per 100,000 people. (data:https://data.ny.gov/Public-Safety/Index-Violent-Property-and-Firearm-Rates-By-County/34dd-6g2j/about_data )

From lines 12-18 I load in the dataset and the appropriate packages I’ll need to perform all my visualizations.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(forcats)
library(leaflet)
setwd("/Users/asherscott/Desktop/Data 110")
Index <- read_csv("Index.csv")
Rows: 2108 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): County
dbl (10): Year, Population, Index Count, Index Rate, Violent Count, Violent ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

From lines 22-24 I use the filter and !is.na commands to remove any NAs from the dataset.

IDX <- Index %>% filter(if_all(c(Year, Population, `Violent Count`, `Violent Rate`, `Property Count`, `Property Rate`, `Firearm Count`, `Firearm Rate`), ~ !is.na(.)))

From lines 28-32 I summon a table to see how many values are in each County. I decided to use the five Counties Bronx, Kings, New York, Queens, and Richmond, as they correspond with the five boroughs of NYC. For whatever reason these counties were missing data from the 2006-2014, so I just filtered to only include 2015-2022.

table(IDX$County)

     Albany    Allegany       Bronx      Broome Cattaraugus      Cayuga 
         34          34          21          34          34          34 
 Chautauqua     Chemung    Chenango     Clinton    Columbia    Cortland 
         34          34          34          34          34          34 
   Delaware    Dutchess        Erie       Essex    Franklin      Fulton 
         34          34          34          34          34          34 
    Genesee      Greene    Hamilton    Herkimer   Jefferson       Kings 
         34          34          34          34          34          21 
      Lewis  Livingston     Madison      Monroe  Montgomery      Nassau 
         34          34          34          34          34          34 
   New York     Niagara      Oneida    Onondaga     Ontario      Orange 
         21          34          34          34          34          34 
    Orleans      Oswego      Otsego      Putnam      Queens  Rensselaer 
         34          34          34          34          21          34 
   Richmond    Rockland    Saratoga Schenectady   Schoharie    Schuyler 
         21          34          34          34          34          34 
     Seneca St Lawrence     Steuben     Suffolk    Sullivan       Tioga 
         34          34          34          34          34          34 
   Tompkins      Ulster      Warren  Washington       Wayne Westchester 
         34          34          34          34          34          34 
    Wyoming       Yates 
         34          34 
IDX2 <- IDX %>% filter(County %in% c("Bronx", "Kings", "New York", "Queens", "Richmond")) %>%
  filter(Year %in% c("2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022"))

Regression: From lines 35-38 I performed a linear regression comparing Firearm rate to Violent Rate and Property Rate ith P-values of 2.23e-14 and 0.0525, respectfully. Based on this, I hypothesize that there is a strong correlation between Firearm reports and Violent/Property Crime.

lm_model <- lm(`Firearm Rate` ~ `Violent Rate` + `Property Rate`, data = IDX2)
summary(lm_model)

Call:
lm(formula = `Firearm Rate` ~ `Violent Rate` + `Property Rate`, 
    data = IDX2)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.814  -7.078   0.549   9.311  35.654 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -10.440434   8.286313  -1.260   0.2156    
`Violent Rate`    0.165634   0.013740  12.055 2.23e-14 ***
`Property Rate`  -0.010629   0.005305  -2.004   0.0525 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.52 on 37 degrees of freedom
Multiple R-squared:  0.8235,    Adjusted R-squared:  0.814 
F-statistic: 86.31 on 2 and 37 DF,  p-value: 1.161e-14

From lines 42-50 I created a scatterplot showing the correlation between Violent Crime and Firearm Rates in the five counties. As we can see there is a substantial correlation between the two variables.

ggplot(data = IDX2, aes(x = `Firearm Rate`, y =`Violent Rate`)) +
  geom_point(color = "black", alpha = 0.6) +  
  geom_smooth(method = "lm", color = "red", se = FALSE) + 
  labs( title = "Firearm Rate vs Violent Crime Rate in New York City Counties",
    x = "Insidences involving Firearms",
    y = "Violent Crime Rate") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

From lines 54-62 I created a second scatterplot showing the correlation between Property Crime and Firearm Rates in the five counties. As we can see there is a relatively strong correlation between the two variables, but not as strong as with the Violent Crimes Scatterplot.

ggplot(data = IDX2, aes(x = `Firearm Rate`, y =`Property Rate`)) +
  geom_point(color = "black", alpha = 0.6) +  
  geom_smooth(method = "lm", color = "blue", se = FALSE) + 
  labs(title = "Firearm Rate vs Violent Crime Rate in New York City Counties",
    x = "Insidences involving Firearms",
    y = "Property Crime Rate") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Regression: From lines 66-70 I performed a linear regression comparing Firearm rate between the counties. To do this I had to convert County to a factor using the as.factor command, hence why the county Bronx is missing, as it’s the reference level for the regression. Because Bronx (Intercept) has the smallest p-value, 5.34e-15, I hypothesize it will have the most overall crime.

IDX2$County <- as.factor(IDX2$County)
lm_model2 <- lm(`Firearm Rate` ~ County, data = IDX2)
summary(lm_model2)

Call:
lm(formula = `Firearm Rate` ~ County, data = IDX2)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.962 -14.838  -3.125   9.981  89.937 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     124.262      9.528  13.041 5.34e-15 ***
CountyKings     -44.250     13.475  -3.284  0.00233 ** 
CountyNew York  -66.787     13.475  -4.956 1.83e-05 ***
CountyQueens    -75.225     13.475  -5.583 2.74e-06 ***
CountyRichmond  -89.325     13.475  -6.629 1.16e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.95 on 35 degrees of freedom
Multiple R-squared:  0.605, Adjusted R-squared:  0.5598 
F-statistic:  13.4 on 4 and 35 DF,  p-value: 1.011e-06

From lines 74-78 I created a line chart that shows the differences in Firearm Rate by each County by year. All are in an upward trend since 2019, but the Bronx is the clear standout with an additional 100+ reports per 100k compared to the next County.

hc1 <- IDX2 %>% hchart('line', hcaes(x = Year, y = `Firearm Rate`, group = County))  %>%
  hc_colors(c("#00AFBB", "green", "#FC4E07", "orange", "hotpink"))
hc1

From lines 83-86 I altered the dataset to include the ten most populous counties to use for the following graph.

IDX3 <- IDX %>% filter(County %in% c("Bronx", "Kings", "New York", "Queens", "Richmond", "Suffolk", "Westchester", "Nassau", "Erie", "Monroe")) %>%
  filter(Year %in% c("2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022"))

From lines 91 to 110 I created a lollipop graph to demonstrate the differences in overall crime between the counties. First I used the mean() function to combine Property Rate and Violent Rate. Second I used the group_by() function to group the counties together. Third I used the summarise() function to Property and Violent Rate and then I used that summary to mutate the column County. Fourth I added a y-intercept to show the mean. Lastly I used geom_point in conjunction geom_segement to show the lines and points on the graph. I added extra counties for the visualization because it works better with 10 points. As we can see New York County has the most overall crime.

overall_mean_crime <- mean(IDX3$`Property Rate` + IDX3$`Violent Rate`, na.rm = TRUE)

Ovrl_Cr <- IDX3 %>%
  group_by(County) %>%
  summarise(mean_crime = mean(`Property Rate` + `Violent Rate`, na.rm = TRUE)) %>%
  mutate(County = fct_reorder(County, mean_crime)) %>%
  ggplot(aes(x = County, y = mean_crime)) +
  labs(title = "Overall Crime Rate Per County",
       x = "Counties",
       y = "Rate") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), 
        axis.text.y = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5, size = 20, face = "bold")) +
  geom_hline(yintercept = overall_mean_crime, color = "lightblue", size = 1) +
  geom_segment(aes(xend = County, y = overall_mean_crime, yend = mean_crime), color = "violet") +
  geom_point(aes(color = mean_crime), size = 4) +  
  scale_color_gradient(low = "navy", high = "lightgreen")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Ovrl_Cr

From lines 115-126 I made two codes. First I used the filter function to solely get 2022 data. For the second code I used the function data.frame() to manually add the coordinates of each county I got the Longitude and Latitude from a website (check references).

IDX4 <- IDX2 %>% filter(Year %in% c(2022)) 

IDX5 <- data.frame(
  County = c("Bronx", "Kings", "Queens", "Richmond", "New York", 
             "Nassau", "Suffolk", "Westchester", "Erie", "Monroe"),
  Latitude = c(40.837048, 40.65, 40.73, 40.583333, 40.78, 
               40.747222, 40.933333, 41.15, 42.7, 43.3),
  Longitude = c(-73.865433, -73.96, -73.79, -74.15, -73.97, 
                -73.638056, -72.683333, -73.766667, -78.75, -77.683333))
print(IDX5)
        County Latitude Longitude
1        Bronx 40.83705 -73.86543
2        Kings 40.65000 -73.96000
3       Queens 40.73000 -73.79000
4     Richmond 40.58333 -74.15000
5     New York 40.78000 -73.97000
6       Nassau 40.74722 -73.63806
7      Suffolk 40.93333 -72.68333
8  Westchester 41.15000 -73.76667
9         Erie 42.70000 -78.75000
10      Monroe 43.30000 -77.68333

From lines 130-133 I joined the two previously created datasets into one called “merged”. By doing this I’ve matched the coordinates to the proper County.

merged <- IDX4 %>%
  left_join(IDX5, by = "County")

From lines 138-144 I created a popup for the final visualization.

IndexPop <- paste0(
  "<b>County: </b>", merged$County, "<br>",
  "<b>Year: </b>", merged$Year, "<br>",
  "<b>Population: </b>", merged$Population, "<br>",
   "<b>Property Crime: </b>", merged$`Property Rate`, "<br>",
  "<b>Violent Crime: </b>", merged$`Violent Rate`, "<br>",
  "<b>Firearm Incidents: </b>", merged$`Firearm Rate`, "<br>")

From lines 148-158 I created a map to show the location of each county. The radius is also based off of the Firearm rate, meaning larger the circle the higher the firearm rate. This shows that Bronx County has the most cases of Violent and Firearm Rate, showing that it has the most violent County.

leaflet(merged) %>%
  setView(lng =  -74.0060, lat = 40.7128, zoom =9) %>%
  addProviderTiles("Esri.WorldStreetMap") %>%
  addCircles(
    data = merged,
    radius = merged$`Firearm Rate`*25,
    color = "navy",
    popup = IndexPop)
Assuming "Longitude" and "Latitude" are longitude and latitude, respectively

CONCLUSION: Based on the analysis of the “Index, Violent, Property, and Firearm Rates By County” dataset, several significant insights were drawn regarding crime patterns in the five counties of New York City. The analysis confirms that firearm rates are a significant predictor of violent crimes in New York City’s boroughs, with the Bronx being disproportionately affected. While New York County leads in overall crime rates due to property crimes, the Bronx faces the highest levels of violence and firearm-related incidents. These findings highlight the need for targeted interventions in the Bronx to address violent crime and firearm usage while continuing to monitor overall crime trends across the city. Future research could include investigating the relationship between socioeconomic variables such as median income, unemployment rates and education levels to see how they correspond to crime rate. Additionally, comparing New York to other large cities and their subsequent counties could further show how New York compares to the rest of the country on these issues.

References: New York State Index Data: https://data.ny.gov/Public-Safety/Index-Violent-Property-and-Firearm-Rates-By-County/34dd-6g2j/about_data U.S County coordinates data: https://public.opendatasoft.com/explore/dataset/us-county-boundaries/table/?flg=en-us&disjunctive.statefp&disjunctive.countyfp&disjunctive.name&disjunctive.namelsad&disjunctive.stusab&disjunctive.state_name