Introduction

On January 15th, 2013 the New York Secure Ammunition and Firearms Enforcement (SAFE) Act was signed into law. The Act broadened the umbrella category of assault weapons to include semi-automatic rifles and pistols, made magazines that could hold over 7 rounds illegal, and required ammunition dealers to conduct background checks among many other steps meant to curb tragedies like the Sandy Hook massacre that occurred just a month before the Act was signed. Though SAFE was created to prevent tragedies like Sandy Hook, an interesting question to ask is whether the law was also effective at reducing general shootings in New York City.

This project aims to determine whether the SAFE Act reduced the number of shootings in NYC after it was passed. In addition, it looks at arrest counts for other crimes in the city and at demographic data from the U.S. Census to better qualify the impact SAFE had on the number of shootings in NYC.

In order to achieve its goal, this project combines two unique data sets and creates a third by querying census data. The primary data set contains data on every shooting in New York City between 2006 and 2019. Its variables include date, time, latitude and longitude of the location, age of the perpetrator, age of the victim, etc. This data set was found on Data.gov, a website committed to providing open data from the US government, and can be found here. The second data set is of a similar nature to the first but focuses more broadly on crime in New York City: it contains every arrest made by the NYPD between 2006 and 2019. Both data sets were created by the NYPD, so they have a very similar structure with similar variables. I found the historic arrest data here. The final data set is constructed through queries of the American Community Survey (ACS), a nationwide survey conducted by the Census Bureau to gather data on socioeconomic and demographic factors.

Before determining whether SAFE had an impact on the number of shooting incidents in NYC, it is important to look at the trend of shootings to see if there is enough evidence to suggest SAFE may have had an impact.

From the plot and table above, we find that within our data, the number of shootings in NYC peaked in 2006 with 2,051 incidents. Between 2006 and 2011 there was a general downward trend, with some years like 2008 and 2010 seeing upticks, but there was a dramatic decline in shooting incidents between 2011 and 2012 (-12%) and an even greater decline between 2012 and 2013 (-28%). There was an uptick in shootings in 2014, but shootings then declined again until they stabilized around 1,000 per year after 2017. Given that the biggest percent change in shootings came in 2013 and was followed by another steep downward trend later, there seems to be potential for a causal link between the passage of SAFE and the decrease in shooting incidents.
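The percentages quoted above are ordinary year-over-year relative changes. As a rough Python sketch (the project itself relied on R packages), the calculation might look like the following. The function names are illustrative, and the counts are made-up placeholders chosen only so the arithmetic is easy to follow; they are not the project's figures.

```python
def pct_change(previous, current):
    """Relative change from `previous` to `current`, in percent."""
    return (current - previous) / previous * 100.0


def yearly_pct_changes(counts_by_year):
    """Map each year (after the first) to its percent change from the prior year."""
    years = sorted(counts_by_year)
    return {
        year: pct_change(counts_by_year[prev], counts_by_year[year])
        for prev, year in zip(years, years[1:])
    }


# Hypothetical counts for illustration only (not taken from the data set).
example_counts = {2011: 1000, 2012: 880, 2013: 634}
changes = yearly_pct_changes(example_counts)
print({year: round(value, 1) for year, value in changes.items()})
```

Because each change is expressed relative to the previous year's count, the same absolute drop reads as a larger percentage when the base is smaller.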

Difference-in-Differences Analysis

To try to determine whether the passage of SAFE had a causal influence on the decrease of shootings in the city, a difference-in-differences (DID) analysis came to mind. Typically, a DID analysis assesses the impact of a policy by comparing two regions: one where the policy was implemented and another where it was not. The two regions should be as similar as possible in terms of demographics, characteristics, etc. With that in mind, I first tried to find a data set that contained shooting incidents for cities like Los Angeles or Chicago. Unfortunately, I was unable to find such a data set, but I was able to find one with the same structure as the shootings data set that covers arrests for other crimes. Thus, I decided to conduct a DID analysis using data from a different crime instead of a different region. For this to work, the comparison crime must not involve guns; otherwise it would fall under the same treatment as shootings once SAFE was signed into law. To determine which crime could be used for such an analysis, it is important to know the assumptions that must be satisfied for DID.

  1. Consistency:

While the treatment status of a group can vary over time, the consistency assumption states that there can only be two classifications of treatment status. The first corresponds to units that were never treated (control group) and the other is for those treated only in the post-intervention time period (treated group). As far as I am aware, there were no significant changes or new laws signed into existence concerning prostitution, later found to be the best crime for the analysis, between 2006 and 2019 meaning that the area of prostitution would not have a “treatment” like shootings does.

  2. Parallel trends assumption:

DID analysis rests on the assumption that the change in outcomes between the pre- and post-intervention periods for the control group is a fair representation of what the change would have been for the treated group if it had not been treated. The primary way to check this assumption is to look at the differences in the dependent variable between the treatment and control groups in the pre-treatment phase. Under the parallel trends assumption, the difference between the two should be constant. With this assumption primarily in mind, I aimed to find a crime in the general crime data set whose arrests exhibited a similar trend to the number of shootings in the pre-treatment phase (2006 - 2012).

By plotting the arrests of each crime per year in the crime data set, one can first visually check which plots seem similar to that of the shooting incidents between 2006 and 2012. Within the plot of shooting incidents, there is a clear downward trend between 2006 and 2007. By this fact alone, we are able to rule out many of the crimes that see an uptick in arrests within that same time period. For example, arrests for Dangerous Drugs, Criminal Trespass, Felony Assault, and many others all see more arrests in 2007 than in 2006. By doing a similar visual check for the other years, I found that arrests for prostitution and related offenses would be the best candidate crime for a DID analysis.

To do a more thorough check of the parallel trends assumption, I begin by plotting the shooting incidents and prostitution arrests next to each other to make a visual check of the trends easier. While the drop in arrests for prostitution between 2006 and 2007 seems to be greater than the drop in shooting incidents between the two years, this seems to be the only possibly significant challenge to the parallel trends assumption. Above the joint plot is a table showing the difference between the two quantities for each of the years before SAFE was signed into law. As suggested by the visual appearance of the two plots, the difference between the two quantities is much higher in 2006 than in other years. Furthermore, the differences in 2007 and 2011 are slightly off compared to all of the other values, which hover around 2,350, but with all this in mind I think that the parallel trends assumption is reasonably satisfied given that we are not dealing with a textbook or fabricated example. For the years that could be pushing the assumption (2006, 2007, and 2011), there would be a bias in the direction of the assumed difference being smaller than it should be. This would work against SAFE having an effect of reducing shootings, so we will keep that in mind moving forward. For example, we are going with the assumption that the difference should be about 2,350. If the true difference is 2,900, then an observed difference of 2,400 would make it seem like SAFE reduced shootings by 500 under the 2,900 assumption, but it would appear to cause an increase of 50 shootings under the 2,350 assumption.
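The table of pre-period differences described above amounts to subtracting one yearly series from the other and seeing how far each year sits from the typical gap. Below is a minimal Python sketch of that check. The direction of the subtraction (arrests minus shootings), the use of the median as the "typical" gap, the function names, and all of the counts are assumptions made for illustration; none of them are taken from the project's data.

```python
from statistics import median


def pre_period_gaps(control, treated):
    """Year-by-year gap (control minus treated) for years present in both series."""
    return {year: control[year] - treated[year]
            for year in sorted(control) if year in treated}


def deviations_from_typical(gaps):
    """How far each year's gap sits from the median gap."""
    typical = median(gaps.values())
    return {year: gap - typical for year, gap in gaps.items()}


# Hypothetical yearly counts, for illustration only.
arrests = {2006: 5000, 2007: 4300, 2008: 4350, 2009: 4200}
shootings = {2006: 2050, 2007: 1900, 2008: 2000, 2009: 1850}

gaps = pre_period_gaps(arrests, shootings)
deviations = deviations_from_typical(gaps)
print(gaps)
print(deviations)
```

Under parallel trends the deviations would all be close to zero; a year that stands far from the rest, as the first year does in this toy input, is the kind of observation that could bias the estimate.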

From the regression results above, we find that the coefficient for the difference-in-differences term (the interaction term between time and treatment) has a positive value of about 1,300 and is significant at the .01 level. Keeping in mind that we expected a positive bias in the estimator, I do not think that the bias would be large enough to cause the estimator to reach 1,300 on its own. As we saw before, the difference that was furthest from the average difference of 2,350 was the difference of 2,917 in 2006. This would bias the estimator by about 600. Meanwhile, the potential biases from 2007 and 2011 essentially cancel each other out, since they are both around 100 off in opposite directions. Thus, there should be an upward bias in the DID estimator of around 600, yet the DID estimator is more than double that amount. It is fair to say that even without the upward bias, there would still be a positive coefficient for the DID estimator. While we do not know if the unbiased estimate would be significant (which would imply that SAFE ultimately led to an increase in shootings), we are able to say that it is unlikely that SAFE had a significant influence on reducing the number of shootings in New York City.
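For intuition about what the interaction coefficient measures: in a regression that contains only a treatment indicator, a post-period indicator, and their interaction, the interaction coefficient equals the change in the treated group's mean minus the change in the control group's mean. The Python sketch below computes that quantity directly. Whether the project's regression had exactly this form is an assumption, the function name is illustrative, and the yearly counts are invented so that the result lands near the reported value; they are not the project's data.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """(change in treated mean) minus (change in control mean)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical yearly counts, for illustration only.
shootings_pre = [1900, 1800, 1700]
shootings_post = [1300, 1200, 1100]
arrests_pre = [4300, 4200, 4100]
arrests_post = [2500, 2300, 2100]

estimate = did_estimate(shootings_pre, shootings_post, arrests_pre, arrests_post)
print(estimate)
```

In this toy input both series fall after the policy date, but the control series falls by more, so the estimate is positive even though the treated series declined. A positive DID term can arise the same way in real data.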

So, if the signing of SAFE into law is not able to explain the decrease in shooting incidents in the city, what other factors could explain it? To try to answer this question, I decided to look at different demographic variables using the American Community Survey (ACS). The ACS, run by the Census Bureau, collects social, economic, and housing data nationwide by contacting over 3.5 million households. While the survey is conducted every year, the most accurate data come from the ACS 5-year estimates, and the 5-year estimates are the only ones that reveal tract-level data. It should be noted that these 5-year estimates are estimates of demographic variables based on 5 years of collected data (the 5-year ACS of 2009 is based on data from 2005-2009).

Census tracts are small subdivisions of a county created by the Census Bureau which have an average population of 4,000 with a minimum population of 1,200 and a maximum of 8,000. These subdivisions are the most granular available to gather demographic variables which will give us the best insight into how the demographics of different regions of NYC influence the amount of shootings seen each year.

Plotting Spatial Data

The first step in plotting spatial data like census data is to find a shapefile of the geographical divisions one is looking to plot. For instance, this project requires a shapefile of the census tracts within NYC. A simple Google search of “NYC census tracts shapefile” suffices to find one, and the one used for this project can be found here. Once the shapefile is downloaded, the readOGR function from the rgdal package can be used to read it in, after which a basic plot of the map can be created (see below).

The shapefile downloaded is read in as a Spatial Polygons Data Frame (SPDF). An SPDF combines the geographical coordinates of the vertices used to create the boundaries for each of the census tracts seen in the plot above with data about each of the tracts. For instance, this SPDF contains the coordinate information for each of the tracts along with the tract number, borough name, area, etc. for each of the tracts in the plot. All one has to do to plot data by tract is add another column into the data frame part of the SPDF that aligns with the tract that each of the rows refer to.

To both serve as an example and to further the analysis of this project, I find the number of shootings that occurred in each tract for a given year and plot it using the spplot function from the sp package. Of course, the shootings data did not contain a variable for the census tract in which each shooting occurred. However, the data set does include the coordinates of each incident, which can be converted into spatial points that can then be plotted on top of the tract map. After much searching, I was able to find the over function, also found in the sp package, which returns the row of data from an SPDF corresponding to the shape within the map that a spatial point lies in. After assigning the spatial points for the shooting locations the same coordinate reference system as that used for the tract map, I was able to extract the census tract for the location of each shooting by using the over function. Now that each shooting incident had an associated tract, I simply had to add a column for the number of shootings each year by tract in order to create plots visualizing where shootings were more frequent within NYC in a given year. The plots below represent shooting incidents within the 3 years for which we will be extracting ACS 5-year data (2009, 2014, and 2019), along with a plot of the total shootings within each tract between 2006 and 2019.
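Conceptually, matching each incident to a tract is a point-in-polygon lookup followed by a count per polygon. The Python sketch below illustrates the idea with a standard ray-casting test. It is not how the sp package implements over, it ignores coordinate reference systems entirely, and the function names, the two square "tracts", and the incident locations are all made up for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon` (a list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(x, y, tracts):
    """Return the id of the first tract containing the point, or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(x, y, polygon):
            return tract_id
    return None


def count_by_tract(points, tracts):
    """Count how many points fall inside each tract."""
    counts = {tract_id: 0 for tract_id in tracts}
    for x, y in points:
        tract_id = assign_tract(x, y, tracts)
        if tract_id is not None:
            counts[tract_id] += 1
    return counts


# Two made-up square tracts and three made-up incident locations.
tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
incidents = [(0.5, 0.5), (1.5, 0.2), (1.7, 0.9)]
counts = count_by_tract(incidents, tracts)
print(counts)
```

The resulting per-tract counts play the role of the extra column added to the spatial data frame. Points that fall outside every polygon are simply left uncounted in this sketch.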

The plot of total shootings clearly depicts which general areas of the city see the most shootings. For the most part, Brooklyn and the Bronx are the two boroughs that tend to have a higher number of shootings in each tract than the rest of the city. It should be noted that there are also some high-shooting tracts in Queens and Staten Island, but the pattern in those two boroughs is neither as consistent nor as pronounced as in Brooklyn and the Bronx. Moreover, these general trends extend to the yearly plots for 2009, 2014, and 2019. The same areas of the Bronx and Brooklyn are often shaded across the three plots. While it looks like there are more shaded areas in the 2019 plot than in the others, it should be noted that its scale covers a smaller range than the other plots, making tracts with a similar number of shootings across the 3 years appear in a darker shade in 2019 than in the other two years. To better examine both how many shootings each borough saw within each year and how the time trend of shootings differed within each borough, a line plot similar to the first figure shown in this project was created but with a breakdown by borough.

The plot above tells a clear story that shootings in each of the five boroughs of New York City have been steadily decreasing since 2006 aside from Staten Island where the number of shootings hovered around 50 shootings per year for the most part. Notably, Brooklyn is the borough with the highest amount of shootings each year, but this is likely due to the fact that Brooklyn is a bigger borough than the Bronx; the Census estimated the population of Brooklyn at roughly 2.56 million in 2019 while the population of the Bronx was estimated at a paltry 1.48 million.

Census Data Analysis

Having done as much analysis as possible with the data available from the crime and shootings data sets, I thought it would be best to find which demographic variables from the ACS could help explain the difference between high- and low-shooting census tracts. After combing through the codebook of variables that the ACS collects, found here, I settled on the following variables:

While those were the variables that I queried from the ACS 5-year estimates in 2009, 2014, and 2019 using the tidycensus package, they were not the final variables that I decided to use in my model. A couple of transformations were made to create a better OLS model. The first was a log transformation of the median income and population variables. Both distributions were heavily skewed right, as can be seen in the histograms below. Heavily skewed predictors tend to produce a few high-leverage tracts and residuals that depart from the normality assumption of linear regression, which could distort the results. To guard against this, new variables were created as the logs of both variables. Furthermore, the variables corresponding to the population enrolled in school, in poverty, and in the labor force were each divided by the total population of the tract to express them as a percent of population. I thought this would make for a better model, since the discrepancy in population between boroughs and tracts would then not influence the model greatly. I also thought that this would make the regression coefficients of these variables easier to interpret in the OLS model.
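The two kinds of transformation described above can be sketched in a few lines of Python. This is an illustration only: the field names are hypothetical, the natural log is assumed (the text just says "log"), and the example tract is made up.

```python
import math


def transform_tract(row):
    """Return logged and percent-of-population versions of raw tract values."""
    population = row["population"]
    return {
        # Natural log is an assumption; the text only says "log".
        "log_median_income": math.log(row["median_income"]),
        "log_population": math.log(population),
        # Counts rescaled to a percent of the tract's population.
        "pct_in_school": 100.0 * row["in_school"] / population,
        "pct_in_poverty": 100.0 * row["in_poverty"] / population,
        "pct_in_labor_force": 100.0 * row["in_labor_force"] / population,
    }


# Made-up tract, for illustration only.
example_tract = {
    "median_income": 60000,
    "population": 4000,
    "in_school": 1000,
    "in_poverty": 600,
    "in_labor_force": 2400,
}
transformed = transform_tract(example_tract)
print(transformed)
```

Expressing the counts as percentages puts large and small tracts on the same scale, which is the motivation given in the text for dividing by population.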

The get_acs function within the tidycensus package simplified the process of getting the ACS data. The function returns a data frame of the variables of one’s choice, making the merging of the data with the spatial data frame simple. One caveat is that the identifier of each of the variables was the tract number along with the borough name, so a new variable within the spatial data frame had to be created to match it.

After merging all of the new variables with the spatial data frame, I plotted many of the variables and did a visual sanity check of them, which resulted in some final data cleaning. For example, the original plot of the median income revealed that some locations, one of them being JFK airport, reported a median income of $2,499. I am not sure why there were tracts that reported this same exact number, but I hypothesize that it might be the minimum value that can be reported within the ACS for median income. Furthermore, there were some years where a small population was reported to be living in areas like Central Park. A quick Google search revealed that such occurrences are likely the result of homeless people getting their hands on an ACS form [1]. I decided to remove any data coming from areas that had a listed population below 100 so that such instances would not influence the OLS results. In addition, there were two cases where the poverty percent was reported as over 90% in 2009 but much lower (around 50%) in 2014 and 2019. It could be the case that the high poverty rate was tied to the 2008 market crash, but I thought the drastic drop in the following years suggested that the high rates were inaccurate. These two instances were removed from the data set to prevent them from acting as outliers in the OLS results. Since there are over 2,000 tracts, which provide plenty of data, I thought it would be safest to drop the two possible outliers. After these small modifications were made, each of the variables was plotted. Below are the plots that showed a clear difference in the variable around the city; the variables omitted were nearly uniform across the city and did not reveal anything interesting.
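The cleaning rules described above reduce to a simple filter. In the Python sketch below, the population threshold of 100 comes from the text, while the field names, the tract ids, and the example rows are hypothetical.

```python
MIN_POPULATION = 100  # threshold stated in the text


def clean_tracts(rows, flagged_ids=()):
    """Drop tracts below the population threshold and any explicitly flagged tract ids."""
    flagged = set(flagged_ids)
    return [
        row for row in rows
        if row["population"] >= MIN_POPULATION and row["tract_id"] not in flagged
    ]


# Made-up rows, for illustration only.
rows = [
    {"tract_id": "park", "population": 25, "pct_in_poverty": 10.0},
    {"tract_id": "t1", "population": 3900, "pct_in_poverty": 18.0},
    {"tract_id": "t2", "population": 4100, "pct_in_poverty": 93.0},
    {"tract_id": "t3", "population": 5200, "pct_in_poverty": 22.0},
]
# "t2" stands in for a tract judged to have an implausible poverty rate.
cleaned = clean_tracts(rows, flagged_ids=["t2"])
print([row["tract_id"] for row in cleaned])
```

Passing the suspect tracts as an explicit list, rather than applying a blanket cutoff on the poverty rate, mirrors the case-by-case judgment described in the text.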

As mentioned before, the variables in the plots above were chosen because they portrayed the most identifiable patterns between areas of NYC. Starting with median income, the tracts that have a median income of over $200,000 can always be found in Manhattan and the area of Brooklyn that is closest to the bottom tip of Manhattan. This makes sense, as Manhattan is home to many high-paying companies like Goldman Sachs, McKinsey, Pimco, etc., and many of those living in Manhattan likely work at such companies. The darker areas of the plots are the areas where the median income is particularly low, and these areas nearly overlap with the areas in the Bronx and Brooklyn that were found to have a higher number of shootings than the rest. Staten Island can be seen as the middle ground between the very wealthy Manhattan and the poverty-ridden areas of the Bronx and Brooklyn, with many tracts having a median income above $100,000. While poverty is related to a low median income, the plot of percent in poverty highlights the parts of the lower-income areas that are suffering the most. Not surprisingly, within the low-income areas, the tracts with a higher percent in poverty appear to be strongly correlated with a higher number of shootings. For instance, there is a band of Brooklyn, starting where the eastern portion of Manhattan juts out, that has a higher poverty percentage than much of the borough and also sees more shootings each year. In addition, the area of the Bronx nearest to Manhattan has a higher percentage of people in poverty and also sees more shootings each year. These same areas also see about a 20% reduction in the percentage of people in the work force, as can be seen by the slightly darker shading given to these areas in the plots of employment percentage.
It should be noted that the employment variable is consistent between 2014 and 2019, but the same variable was not available in 2009, which is likely why 2009 reports a higher employment percentage; the 2009 variable seems to be the total number of people who reported any employment information, whether in the labor force or not, while the variable for the other years represents only those in the labor force.

Now that some clear differences in demographic data have been identified, an OLS regression relating the number of shootings in the three years with the demographics data available would be useful in determining which demographic factors have a significant influence on the number of shooting incidents seen each year.

Regression Results:

In Table 5 we find that there are some significant coefficients within each of the models. Each of the three models corresponds to one of the years for which demographic data was gathered, and the variables included in each model are the demographic variables along with a fixed effect for borough, with the Bronx as the reference level.

The model for 2009 reports that the log of median income, log of population, percentage of people in school, and percentage of people in poverty were all significant at the .05 level. For every unit increase in the log of median income, the number of shootings in the year goes down by about .7. Similarly, for every unit increase in the log of population, the number of shootings in the year goes up by about .33. For every 1 percentage point increase in the percent of people enrolled in school, the number of shootings goes up by .0155. The effect for poverty is similar but slightly lower, with an increase of .0119 shootings for every percentage point increase in the percentage of people in poverty. I find it interesting that the percentage of people enrolled in school would have a bigger effect on the number of shootings than the percentage of people in poverty, but I think this variable may be speaking to aspects of the area that are not captured by the other variables in this model. For instance, areas with a higher percentage of people enrolled in school may be areas where parents have many kids and are not as attentive to them. Since Manhattan, an area filled with young professionals who likely do not have kids yet, is one of the comparison areas, having a larger proportion of the population in school may be linked to low-income, poorly educated areas that see more crime. Finally, we note that all of the boroughs have a negative coefficient, which is to be expected since the Bronx was one of the more shooting-heavy areas in the city. Of the other boroughs, the only significant difference comes from Queens, suggesting that Queens is, on average, significantly safer than the Bronx, with about .436 fewer shootings per tract.
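A "unit increase in the log" of a variable is a large change in the variable itself: if natural logs are used, it means multiplying the value by about 2.72. For a model of the form shootings = a + b * log(x), multiplying x by a factor k changes the predicted number of shootings by b * log(k). The Python sketch below applies this to the approximate 2009 income coefficient quoted above. The use of natural logs is an assumption, and the function name is illustrative.

```python
import math


def change_in_outcome(beta, factor):
    """Predicted change in the outcome when a logged predictor is multiplied by `factor`."""
    return beta * math.log(factor)


beta_income_2009 = -0.7  # approximate coefficient quoted in the text

scenarios = [("10% higher income", 1.10),
             ("doubled income", 2.0),
             ("one log unit (income times e)", math.e)]
for label, factor in scenarios:
    print(label, round(change_in_outcome(beta_income_2009, factor), 3))
```

Read this way, even a doubling of a tract's median income corresponds to roughly half a shooting fewer per year under this coefficient, which helps put the size of the estimate in context.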

Moving on to 2014, a year after the passage of SAFE, the variables that are now significant at the .05 level are the log of median income, log of population, and percentage in school. While the percentage of people in the labor force is significant, it is only significant at the .10 level, so we will attribute it to chance. This year, the log of median income has a larger effect than it did in 2009, with every unit increase in the log of median income decreasing the number of shootings by about 1. The effect of the log of population also becomes stronger, increasing the number of shootings by about .39 for each unit increase in the log of population. Meanwhile, the effect of percentage in school has also become stronger; every 1 percentage point increase in the percent of people enrolled in school increases the number of shootings by .0208. Once again, the only borough significantly different from the Bronx is Queens, where on average there are .308 fewer shootings per tract.

Finally, we look to 2019 and find that the only two demographic variables that are significant are the log of median income and the log of population. The effect of the log of median income is weaker than in 2014 but slightly stronger than in 2009, while the log of population has a weaker effect than in both 2009 and 2014. Now, every unit increase in the log of median income decreases the number of shootings by about .76, while every unit increase in the log of population increases shootings by about .22. Interestingly, the borough that is significantly different from the Bronx is now Manhattan, with a positive coefficient. If we look back to the plot of shootings by borough, we find that in 2019 the Bronx had a dip in shootings while shootings in Manhattan had a sizable increase. The coefficient suggests that on average a tract in Manhattan saw .184 more shootings than a tract in the Bronx did. Again, comparing the plots of shootings in 2014 and 2019, it does seem that the shading spills over more into Manhattan from the Bronx than it did before and that in general more areas of Manhattan are shaded in 2019 than in 2014.

More discussion of these results will come in the conclusion, but we will first check the assumptions of the model.

Model Assumptions

The plots above help to determine whether the assumptions of a linear model are met by the models we specified. The first plot is a plot of the residuals vs. the fitted values. In a perfect scenario, we would want to see the residuals randomly bounce around 0 without any of them standing out. However, the plots for the three models depict diagonal lines that roughly form around 0. The reason for this is that the dependent variable (number of shootings) had a lot of values that were constant but likely had different fitted values. For example, a lot of the tracts that we looked at had no shootings in a given year, yet the model likely fitted different values for these variables causing them to form a line in the plot. Furthermore, there are some residuals that are much higher than others that can be attributed to outlier values in the data. While it is not the textbook example of a residual vs fitted plot, the imperfectness of it can likely be attributed to the real-world nature of it and overall may not invalidate the linear model.
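The diagonal lines described above follow directly from the definition of a residual. For every tract with the same observed count k, the residual equals k minus the fitted value, so those points fall on a straight line with slope -1 in the residuals vs. fitted plot. The Python sketch below shows this with made-up counts and fitted values; it does not use the project's model output.

```python
def residuals(observed, fitted):
    """Residual = observed value minus fitted value."""
    return [y - f for y, f in zip(observed, fitted)]


# Made-up shooting counts and fitted values, for illustration only.
observed = [0, 0, 0, 1, 1, 2]
fitted = [0.2, 0.5, 0.9, 0.4, 1.1, 0.8]
res = residuals(observed, fitted)

# Group (fitted, residual) pairs by the observed count they came from.
lines = {}
for y, f, r in zip(observed, fitted, res):
    lines.setdefault(y, []).append((f, r))

for count, points in sorted(lines.items()):
    # Every point in a group satisfies residual = count - fitted.
    print(count, [(f, round(r, 2)) for f, r in points])
```

Each observed count produces its own parallel line, and since most tracts have zero or very few shootings, a handful of such lines dominates the plot.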

The second plot for each of the models is a Q-Q plot of the residuals, which would ideally follow a straight line. These plots check the normality assumption of linear regression, namely that the model's residuals are approximately normally distributed. In an effort to create a model that would fit this assumption better, I transformed both the population and median income variables by taking their logs. Overall, the Q-Q plots for each of the models are not terrible, but future exploration could look at other transformations that might fit the model assumptions better.

Conclusions

In trying to determine whether the signing of SAFE into law influenced the amount of shootings in different parts of New York City, two different analyses were conducted.

The first was a difference-in-differences (DID) analysis that took advantage of a data set similar to the shootings data, but for arrests for different crimes in NYC. While typical DID analyses look at two different populations, one that receives treatment and another that does not, I was not able to find a comparable shooting data set from a similar city that would allow for that type of analysis. Instead, I went forward under the assumption that the signing of SAFE into law would not be considered a treatment for crimes that had seemingly no relationship with guns. Under this assumption, I plotted the trend of arrests for each specific crime and visually checked the plots to see if any of them would meet the parallel trends assumption that is crucial to DID. Luckily, the arrests for prostitution seemed to have a similar trend to the incidents of shootings before the signing of SAFE into law and did not see any change in law in the time period looked at. Of course, there were some years where the difference between the two was not perfectly constant, but we kept the possible bias in mind when conducting the analysis. From the DID, we found that we could not say that the signing of SAFE into law decreased the number of shooting incidents in later years. To further our analysis, we turned to a regression analysis with Census data from the ACS 5-year estimates of various demographic information.

After creating a column for the number of shootings that occurred in each of the tracts present in our shapefile within each year of data we had, we used the tidycensus package to query data from the ACS for variables like median income, median age, population, etc. within each of the tracts we were looking at for the years 2009, 2014, and 2019. The demographic data was then merged into the spatial data frame so that we could conduct a regression of the number of shootings within each tract by its demographic variables. Overall, we found that in 2009 many of the demographic variables were significant: log of median income, log of population, percentage of people in school and percentage of people in poverty were all significant at the .05 level. However, more and more of the variables were found to not be significant as time went on, leaving us with just the log of median income and the log of population within each tract being significant. It is possible that as time went on, the areas of New York City started to become more homogeneous, which could explain the lack of significance that started to arise later for many of the demographic variables. It may also be the case that while shootings steadily decreased over time, they also became less concentrated, making it so that differences in demographic variables became less significant. Overall, when looking at the log of median income, it seems that New York City as a whole has seen an increase in its median income, indicated by more lighter-colored tracts in the 2019 map than in the 2009 map. Of course, one could hesitate and say that this is a result of inflation, but the amounts reported by the Census are in inflation-adjusted dollars. There is not much of a trend when it comes to the population of New York City over time. If anything, the population has been increasing gradually since 2009.

In summary, this project combined two unique data sets and created another to try to tackle the question of what factors influence the number of shootings in New York City in a given year. After a DID analysis failed to show that the signing of SAFE into law reduced shootings, we turned to demographic data provided by the U.S. Census Bureau in its ACS. Ultimately, we found that the log of median income and the log of population in a tract were the only predictors of shootings that were significant in all of the years that we looked at. Earlier plots of the median income in New York City provided evidence that the city as a whole saw an increase in median income over time, which is consistent with the lower number of shootings observed, but there are likely more factors at play than that. A future extension of this project would likely benefit from merging in data sets that contain variables like median years of education, number of gun stores in each tract, etc. to see which other factors could be significant. Furthermore, other transformations of the variables could be used to see if a model could be created that fits the assumptions of a linear model better. If that is not possible, one could look into the possibility of a different type of relationship between some of these variables and shootings. While this analysis was not able to explore all of these possibilities, it provides a good foundation of data and techniques that can be used for future research.


  1. Feuer, Alan. “Census Apparently Did Check Behind Every Tree.” The New York Times, 26 Mar. 2011, www.nytimes.com/2011/03/26/nyregion/26census.html.