On January 15th, 2013, the New York Secure Ammunition and Firearms Enforcement (SAFE) Act was signed into law. The Act broadened the umbrella category of assault weapons to include semi-automatic rifles and pistols, made magazines that could hold over 7 rounds illegal, and required ammunition dealers to conduct background checks, among many other steps meant to curb tragedies like the Sandy Hook massacre, which occurred just a month before the Act was signed. Though SAFE was created to prevent tragedies like Sandy Hook, an interesting question to ask is whether the law was also effective at reducing general shootings in New York City.
This project aims to determine whether the SAFE Act reduced the number of shootings that occurred in NYC once it was passed. In addition, this project looks at the arrest counts of other crimes in the city and demographic data from the U.S. Census to better qualify the impact SAFE had on the number of shootings in NYC.
In order to achieve its goal, this project combines two unique data sets and creates another by querying census data. The primary data set used for this project contains data on every single shooting in New York City between the years 2006 and 2019. Variables in this data set include date, time, latitude and longitude of the location, age of the perpetrator, age of the victim, etc. This particular data set was found on Data.gov, a website committed to providing open data from the U.S. government, and can be found here. The second data set that will be mentioned in this project is of a similar nature to the first but focuses more broadly on crime in New York City. Specifically, the data set contains every arrest made by the NYPD between 2006 and 2019. Both data sets were created by the NYPD, so the two have a very similar structure with similar variables. I found the historic arrest data here. The final data set that this project uses is one constructed through queries of the American Community Survey (ACS), a nationwide survey conducted by the Census Bureau to gather data on socioeconomic and demographic factors.
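As a point of reference, here is a minimal sketch of how the two NYPD data sets might be loaded. The file names are placeholders for the downloaded CSVs, and the date column names are assumptions that may need adjusting to match the actual files.

```r
# Minimal loading sketch; file names and date columns are assumptions.
library(readr)
library(dplyr)
library(lubridate)

shootings <- read_csv("NYPD_Shooting_Incident_Data_Historic.csv") %>%
  mutate(OCCUR_DATE = mdy(OCCUR_DATE),   # dates appear as MM/DD/YYYY
         year = year(OCCUR_DATE))

arrests <- read_csv("NYPD_Arrests_Data_Historic.csv") %>%
  mutate(ARREST_DATE = mdy(ARREST_DATE),
         year = year(ARREST_DATE))
```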
Before determining whether SAFE had an impact on the number of shooting incidents in NYC, it is important to look at the trend of shootings to see whether there is enough evidence to suggest SAFE may have had one.
From the plot and table above, we find that within our data, the number of shootings in NYC peaked in 2006 with 2,051 incidents. Between 2006 and 2011 there was a general downward trend, with some years like 2008 and 2010 seeing upticks, but there was a dramatic decline in shooting incidents between 2011 and 2012 (-12%) with an even greater decline between 2012 and 2013 (-28%). There was an uptick in shootings in 2014, but then shootings again declined until they stabilized around 1,000 shootings per year after 2017. Given that the biggest percent change in shootings came in 2013 and was followed by another steep downward trend later, there seems to be potential for a causal link between the passage of SAFE and the observed decrease in shooting incidents.
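The yearly counts and year-over-year percent changes described here can be tabulated with a few lines of dplyr, assuming the shootings data carry a year column as in the loading sketch above.

```r
# Yearly shooting counts and year-over-year percent change.
yearly_shootings <- shootings %>%
  count(year, name = "shootings") %>%
  arrange(year) %>%
  mutate(pct_change = 100 * (shootings - lag(shootings)) / lag(shootings))

yearly_shootings
```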
To try to determine whether the passage of SAFE had some type of causal influence on the decrease in shootings in the city, a difference-in-differences (DID) analysis came to mind. Typically, a DID analysis can be used to assess the impact of a policy by looking at two different regions: one region where the policy was implemented and another where it was not. It should also be noted that the two regions should be as similar as possible in terms of demographics, characteristics, etc. With that in mind, I first tried to find a data set that contained shooting incidents for cities like Los Angeles or Chicago. Unfortunately, I was unable to find such a data set, but I was instead able to find a data set of the same structure as the shootings data set but for arrests for other crimes. Thus, I thought it would be clever to conduct a DID analysis by using data from a different crime instead of a different region. For this to work, the crime used for comparison could not involve guns, since such a crime would fall under the same treatment from SAFE being signed into law as the count of shootings. To determine which crime could be used to conduct such an analysis, it’s important to know the assumptions that must be satisfied for DID.
While the treatment status of a group can vary over time, the consistency assumption states that there can only be two classifications of treatment status: units that were never treated (the control group) and units treated only in the post-intervention time period (the treated group). As far as I am aware, no significant changes or new laws concerning prostitution, later found to be the best crime for the analysis, were enacted between 2006 and 2019, meaning that prostitution would not have a “treatment” the way shootings do.
DID analysis rests on the assumption that the change in outcomes between the pre- and post-intervention periods for the control group is a fair representation of what the change in outcomes would have been for the treated group, had it not been treated. The primary way to check this assumption is to look at the differences in the dependent variable between the treatment and control groups in the pre-treatment phase. Under the parallel trends assumption, the differences between the two should be constant. With this assumption primarily in mind, I aimed to find a crime in the general crime data set whose arrests exhibited a similar trend to the number of shootings in the pre-treatment phase (2006 - 2012).
By plotting the arrests of each crime per year in the crime data set, one can first visually check which plots seem similar to that of the shooting incidents between 2006 and 2012. Within the plot of shooting incidents, there is a clear downward trend between 2006 and 2007. By this fact alone, we are able to rule out many of the crimes that see an uptick in arrests within that same time period. For example, arrests for Dangerous Drugs, Criminal Trespass, Felony Assault, and many others all see more arrests in 2007 than in 2006. By doing a similar visual check for the other years, I found that arrests for prostitution and related offenses would be the best candidate crime for a DID analysis.
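A quick way to make this visual comparison is to facet arrest counts per year by offense. The sketch below assumes the arrest data use an offense description column named OFNS_DESC, as in the public NYPD arrest data; the column name should be verified against the actual file.

```r
# Arrests per year for each offense, faceted for visual comparison with the
# 2006-2012 shooting trend. OFNS_DESC is assumed to be the offense column.
library(ggplot2)

arrests %>%
  count(year, OFNS_DESC) %>%
  ggplot(aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(~ OFNS_DESC, scales = "free_y") +
  labs(x = "Year", y = "Arrests")
```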
To do a more thorough check of the parallel trends assumption, I begin by plotting the shooting incidents and prostitution arrests next to each other to make a visual comparison of the trends easier. While the difference in prostitution arrests between 2006 and 2007 seems to be greater than the difference in shooting incidents between those two years, this seems to be the only potentially significant challenge to the parallel trends assumption. Above the joint plot is a table representing the difference between the two quantities for each of the years before SAFE was signed into law. As suggested by the visual appearance of the two plots, the difference between the two quantities is much higher in 2006 than in other years. Furthermore, the differences in 2007 and 2011 are slightly off compared to all of the other values, which hover around 2,350, but with all this in mind I think the parallel trends assumption is reasonably satisfied given that we are not dealing with a textbook/fabricated example. For the years that could be straining the assumption (2006, 2007, and 2011), the bias would be in the direction of the assumed difference being smaller than it should be. This would work against SAFE having an effect of reducing shootings, so we will keep that in mind moving forward. For example, we are going with the assumption that the difference should be about 2,350, but if the true difference were 2,900, then an observed difference of 2,400 would make it seem like SAFE reduced shootings by 500 under the 2,900 assumption, yet it would appear to cause an increase of 50 shootings under the 2,350 assumption.
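The regression discussed next can be set up as a standard two-group, two-period DID: shootings form the treated series, prostitution arrests the control, and 2013 onward is the post-treatment period. The sketch below reuses the yearly_shootings table and arrests data from the earlier sketches, and the offense label is written as it appears in the NYPD arrest data, which is an assumption worth verifying.

```r
# Build a small panel: one row per year per group.
did_data <- bind_rows(
  yearly_shootings %>%
    transmute(year, count = shootings, treated = 1),            # treated: shootings
  arrests %>%
    filter(OFNS_DESC == "PROSTITUTION & RELATED OFFENSES") %>%  # assumed label
    count(year, name = "count") %>%
    mutate(treated = 0)                                          # control: prostitution arrests
) %>%
  mutate(post = as.numeric(year >= 2013))                        # SAFE signed January 2013

# Pre-treatment differences (control minus treated) used to eyeball parallel trends.
did_data %>%
  filter(post == 0) %>%
  tidyr::pivot_wider(names_from = treated, values_from = count,
                     names_prefix = "group_") %>%
  mutate(difference = group_0 - group_1)

# The coefficient on treated:post is the DID estimator discussed below.
did_model <- lm(count ~ treated * post, data = did_data)
summary(did_model)
```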
From the regression results above, we find that the coefficient on the difference-in-differences term (the interaction between time and treatment) has a positive value of about 1,300 and is significant at the .01 level. Keeping in mind that we expected a positive bias in the estimator, I do not think the bias would be large enough to push the estimator to 1,300 on its own. As we saw before, the difference furthest from the average of 2,350 was the difference of 2,917 in 2006, which would bias the estimator by about 600. Meanwhile, the potential biases coming from 2007 and 2011 essentially cancel each other out, since they are both around 100 off in opposite directions. Thus, there should be an upward bias in the DID estimator of around 600, yet the DID estimator is more than double that amount. It is fair to say that even without the upward bias, the coefficient for the DID estimator would still be positive. While we do not know whether the unbiased estimate would be significant (which would imply that SAFE ultimately led to an increase in shootings), we are able to say that it is unlikely that SAFE had a significant influence on reducing the number of shootings in New York City.
So, if the signing of SAFE into law cannot explain the decrease in shooting incidents in the city, what other factors could explain it? To try to answer this question, I decided to look at different demographic variables using the American Community Survey (ACS). The ACS, conducted by the Census Bureau, collects social, economic, and housing data nationwide by contacting over 3.5 million households across the nation. While the survey runs every year, the most accurate data come from the ACS 5-year estimates, and the 5-year estimates are the only ones that reveal tract-level data. It should be noted that these 5-year estimates are estimates of demographic variables based on 5 years of collected data (the 2009 5-year ACS, for example, is based on data from 2005-2009).
Census tracts are small subdivisions of a county created by the Census Bureau, with an average population of 4,000, a minimum of 1,200, and a maximum of 8,000. These subdivisions are the most granular level at which demographic variables are available, which gives us the best insight into how the demographics of different regions of NYC influence the number of shootings seen each year.
The first step in plotting spatial data like census data is to find a shapefile of the geographical divisions one is looking to plot. For instance, this project requires a shapefile of the census tracts within NYC. A simple Google search of “NYC census tracts shapefile” suffices to find one, and the one used for this project can be found here. Once the shapefile is downloaded, the rgdal package can be used to read it in and create a basic plot of the map (see below) with the readOGR function.
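A minimal sketch of that step, assuming the downloaded shapefile sits in a folder and layer both named nyct2010 (the names will depend on the file actually downloaded):

```r
# Read the census-tract shapefile into a SpatialPolygonsDataFrame.
library(rgdal)
library(sp)

tracts <- readOGR(dsn = "nyct2010", layer = "nyct2010")

# A basic plot of the tract boundaries.
plot(tracts)
```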
The shapefile is read in as a Spatial Polygons Data Frame (SPDF). An SPDF combines the geographical coordinates of the vertices used to create the boundaries for each of the census tracts seen in the plot above with data about each of the tracts. For instance, this SPDF contains the coordinate information for each of the tracts along with the tract number, borough name, area, etc. All one has to do to plot data by tract is add another column to the data frame part of the SPDF that aligns with the tract each row refers to.
To both serve as an example and further the analysis of this project, I find the number of shootings that occurred in each tract for a given year and plot it using the spplot function from the sp package. Of course, the data containing each of the shooting incidents did not include a variable for which census tract the shooting occurred in. However, the data set does include the coordinates of each incident, which can be converted into spatial points that can then be plotted on top of the tract map. After much searching, I was able to find the over function, also from the sp package, which returns the row of data from an SPDF corresponding to the shape within the map that a spatial point lies in. After assigning the spatial points corresponding to the locations of shooting incidents the same coordinate reference system as the tract map, I was able to extract the census tract for the location of each shooting by using the over function. Now that each shooting incident had an associated tract, I simply had to add a column for the number of shootings each year by tract in order to create plots visualizing where shootings were more frequent within NYC in a given year. The plots below represent shooting incidents within the 3 years that we will be extracting ACS 5-year data for (2009, 2014, and 2019), along with a plot of the total shootings within each tract between 2006 and 2019.
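The sketch below outlines this point-in-polygon step under a few assumptions: the shooting data have Longitude and Latitude columns in WGS84, and the tract SPDF carries a combined borough/tract identifier column named BoroCT2010 (both are guesses that should be checked against the actual files).

```r
# Assign each shooting to a census tract with sp::over().
library(sp)
library(rgdal)   # needed for spTransform

shootings_xy <- shootings %>%
  filter(!is.na(Longitude), !is.na(Latitude))

pts <- SpatialPoints(
  coords = as.matrix(shootings_xy[, c("Longitude", "Latitude")]),
  proj4string = CRS("+proj=longlat +datum=WGS84")
)

# Re-project the points into the tract map's coordinate reference system.
pts <- spTransform(pts, CRS(proj4string(tracts)))

# over() returns, for each point, the row of tract attributes it falls inside.
shootings_xy$tract_id <- over(pts, tracts)$BoroCT2010

# Count total shootings per tract and attach the counts to the SPDF's data slot.
tract_counts <- table(shootings_xy$tract_id)
tracts$total_shootings <- as.numeric(tract_counts[as.character(tracts$BoroCT2010)])
tracts$total_shootings[is.na(tracts$total_shootings)] <- 0

# Choropleth of total shootings by tract.
spplot(tracts, "total_shootings")
```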
The plot of total shootings clearly depicts which general areas of the city see the most shootings. For the most part, Brooklyn and the Bronx are the two boroughs whose tracts tend to have a higher number of shootings than the rest of the city. It should be noted that there are also some high-shooting tracts in Queens and Staten Island, but those two boroughs do not seem as consistent nor as pronounced as Brooklyn and the Bronx. Moreover, these general trends extend to the yearly plots for 2009, 2014, and 2019. The same areas of the Bronx and Brooklyn are often shaded across the three plots. While it looks like there are more shaded areas in the 2019 plot than the others, it should be noted that its scale covers a smaller range than the other plots, making tracts with similar numbers of shootings appear in a darker shade in 2019 than in the other two years. To better examine both how many shootings each borough saw within each year and how the time trend of shootings differed within each borough, a line plot similar to the first figure shown in this project was created, but with a breakdown by borough.
The plot above tells a clear story: shootings in each of the five boroughs of New York City have been steadily decreasing since 2006, aside from Staten Island, where the number of shootings hovered around 50 per year for the most part. Notably, Brooklyn is the borough with the highest number of shootings each year, but this is likely due to the fact that Brooklyn is a bigger borough than the Bronx; the Census estimated the population of Brooklyn at roughly 2.56 million in 2019, while the population of the Bronx was estimated at a comparatively modest 1.48 million.
Having done as much analysis as possible with the data available from the crime and shootings data sets, I thought it would be best to try to find which demographic variables from the ACS could help explain the difference between high- and low-shooting census tracts. After combing through the codebook of variables that the ACS provides, found here, I settled on the following variables:
While those were the variables that I queried from the ACS 5-year estimates for 2009, 2014, and 2019 using the tidycensus package, they were not the final variables that I decided to use in my model. A couple of transformations were made to create a better OLS model. The first was a log transformation of the median income and population variables. Both distributions were heavily skewed right, as can be seen in the histograms below. Left untransformed, these heavily skewed variables would strain the normality assumptions behind linear regression and could distort the results, so new variables were created as log transformations of each. Furthermore, the variables corresponding to the population enrolled in school, living in poverty, and in the labor force were all divided by the total population of the tract to express each as a percentage of the population. I thought this would make for a better model, since the discrepancy in population between boroughs and tracts would not influence the model greatly, and it would also make for an easier interpretation of the regression coefficients of these variables in the OLS model.
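A sketch of those transformations, using hypothetical column names (median_income, population, enrolled_school, poverty_count, labor_force) for the merged ACS variables:

```r
# Hypothetical column names; the log and percentage transformations mirror
# the ones described above.
acs_data <- acs_data %>%
  mutate(
    log_median_income = log(median_income),
    log_population    = log(population),
    pct_enrolled      = enrolled_school / population,
    pct_poverty       = poverty_count / population,
    pct_labor_force   = labor_force / population
  )
```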
The get_acs function within the tidycensus package simplified the process of getting the ACS data. The function returns a data frame of the variables of one’s choice, making the merging of the data with the spatial data frame simple. One caveat is that the identifier of each of the variables was the tract number along with the borough name, so a new variable within the spatial data frame had to be created to match it.
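As an illustration of that step, here is a sketch of a get_acs call for the 2019 5-year estimates. The two variable codes shown (median household income and total population) are examples rather than the project's full variable list, and a Census API key must be registered with tidycensus beforehand.

```r
# Pull tract-level ACS 5-year estimates for the five NYC counties (boroughs).
library(tidycensus)

acs_2019 <- get_acs(
  geography = "tract",
  state     = "NY",
  county    = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  variables = c(median_income = "B19013_001", population = "B01003_001"),
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
)
```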
After merging all of the new variables with the spatial data frame, I plotted many of the variables and did a visual sanity check, which resulted in some final data cleaning. For example, the original plot of the median income revealed that some locations, one of them being JFK airport, reported a median income of $2,499. I am not sure why several tracts reported this same exact number, but I hypothesize that it might be the minimum value that can be reported within the ACS for median income. Furthermore, there were some years where a small population was reported to be living in areas like Central Park. A quick Google search revealed that such occurrences are likely the result of homeless people getting their hands on an ACS form1. I decided to remove any data coming from areas with a listed population below 100 so such instances would not influence the OLS results. There were also two cases where the poverty rate was reported as over 90% in 2009 but at much lower rates (around 50%) in 2014 and 2019. It could be that the high poverty rate was tied to the 2008 market crash, but I thought the drastic drop in the following years suggested the high rates were inaccurate. These two instances were removed from the data set to prevent the influence they would have as outliers on the OLS results. Since there are over 2,000 tracts, which provide plenty of data, I thought it best to play it safe and drop the two possible outliers. After these small modifications were made, each of the variables was plotted. Below are the plots that showed clear variation across the city; the variables omitted were nearly uniform across the city and did not reveal anything interesting.
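A sketch of those cleaning steps, again using the hypothetical column names from the earlier sketches; the identifiers of the two suspect tracts are left as placeholders.

```r
# Drop tracts with fewer than 100 reported residents.
acs_data <- acs_data %>%
  filter(population >= 100)

# The two tracts with implausible 2009 poverty rates would be dropped by ID,
# e.g. (placeholder GEOIDs):
# acs_data <- acs_data %>% filter(!GEOID %in% c("...", "..."))
```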