Formulating the Belief
The US is one of the few countries in which the right to bear arms is constitutionally protected. A consequence of this is that the level of gun voilence is very high compared to other countries. But America’s relationship with guns is unique in another crucial way: Among developed nations, the US is far and away the most homicidal. For instance, America has six times as many firearm homicides as Canada, and nearly 16 times as many as Germany.
This unique characterstic of US provokes us to dive deeper into the gun violence issues and understand the very nature of it through data. This forms our belief to do the investigation through data visulazation.
Objective:
The objective is to understand and analyze the gun violence data to extract insightful information embedded in the data to better understand the geographics of gun violence, it’s temporal beheaviour and draw conclusions it’s nature.
Data Source and Description
The date in this dataset was downloaded from gunviolencearchive.org.The CSV file contains data for all recorded gun violence incidents in the US between January 2013 and March 2018. The data consist of 240k incidents, with detailed information about each incident.
Importing Data
library(readr)
Gun_Violence <- read_csv("/Users/Gurtej/Downloads/gun-violence-data_01-2013_03-2018.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## incident_id = col_integer(),
## date = col_date(format = ""),
## n_killed = col_integer(),
## n_injured = col_integer(),
## congressional_district = col_integer(),
## latitude = col_double(),
## longitude = col_double(),
## n_guns_involved = col_integer(),
## state_house_district = col_integer(),
## state_senate_district = col_integer()
## )
## See spec(...) for full column specifications.
#Exploring the dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(Gun_Violence)
## Observations: 239,677
## Variables: 29
## $ incident_id <int> 461105, 460726, 478855, 478925, 47...
## $ date <date> 2013-01-01, 2013-01-01, 2013-01-0...
## $ state <chr> "Pennsylvania", "California", "Ohi...
## $ city_or_county <chr> "Mckeesport", "Hawthorne", "Lorain...
## $ address <chr> "1506 Versailles Avenue and Coursi...
## $ n_killed <int> 0, 1, 1, 4, 2, 4, 5, 0, 0, 1, 1, 1...
## $ n_injured <int> 4, 3, 3, 0, 2, 0, 0, 5, 4, 6, 3, 3...
## $ incident_url <chr> "http://www.gunviolencearchive.org...
## $ source_url <chr> "http://www.post-gazette.com/local...
## $ incident_url_fields_missing <chr> "False", "False", "False", "False"...
## $ congressional_district <int> 14, 43, 9, 6, 6, 1, 1, 2, 9, 7, 3,...
## $ gun_stolen <chr> NA, NA, "0::Unknown||1::Unknown", ...
## $ gun_type <chr> NA, NA, "0::Unknown||1::Unknown", ...
## $ incident_characteristics <chr> "Shot - Wounded/Injured||Mass Shoo...
## $ latitude <dbl> 40.3467, 33.9090, 41.4455, 39.6518...
## $ location_description <chr> NA, NA, "Cotton Club", NA, NA, "Fa...
## $ longitude <dbl> -79.8559, -118.3330, -82.1377, -10...
## $ n_guns_involved <int> NA, NA, 2, NA, 2, NA, 2, NA, NA, N...
## $ notes <chr> "Julian Sims under investigation: ...
## $ participant_age <chr> "0::20", "0::20", "0::25||1::31||2...
## $ participant_age_group <chr> "0::Adult 18+||1::Adult 18+||2::Ad...
## $ participant_gender <chr> "0::Male||1::Male||3::Male||4::Fem...
## $ participant_name <chr> "0::Julian Sims", "0::Bernard Gill...
## $ participant_relationship <chr> NA, NA, NA, NA, "3::Family", NA, "...
## $ participant_status <chr> "0::Arrested||1::Injured||2::Injur...
## $ participant_type <chr> "0::Victim||1::Victim||2::Victim||...
## $ sources <chr> "http://pittsburgh.cbslocal.com/20...
## $ state_house_district <int> NA, 62, 56, 40, 62, 72, 10, 93, 11...
## $ state_senate_district <int> NA, 35, 13, 28, 27, 11, 14, 5, 7, ...
#Reformatting the date variable
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Gun_Violence$date <- ymd(Gun_Violence$date)
str(Gun_Violence$date)
## Date[1:239677], format: "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-05" "2013-01-07" ...
In editiorial thinking we establish what we really want the audience to take out of the set of visualizations that we are going to put in front of them. For this project following are the stats that we want our audience to take home once this go through the story telling of gun violence dataset created through set of visulazation graphs in stage 4:
How does gun violence has increased or decreased over the years, variation of gun violence thoughout the year and the reason for it, variation of violence on weekely level and reasoning for it.
Impact of gun violence on various location throughout US and diving to the granularity of state and cities.
How does population of a location plays role in increase or decrease in gun violence, diving to the granularity of state and cities in this case as well.
library(knitr)
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(stringr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(lubridate)
library(ggrepel)
library(leaflet)
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-4, (SVN revision 766)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/proj
## Linking to sp version: 1.3-1
summary(Gun_Violence$date)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2013-01-01" "2015-03-07" "2016-04-05" "2016-03-13" "2017-04-03"
## Max.
## "2018-03-31"
It seems as if most incidents in 2013 were not recorded. In addition, 2018 is not a full year of data as the latest recorded incident was on May 31st 2018. As this means exactly one quarter, we are interested if the trend is still upward if we only look at the Q1s of 2014-2018. We will look into this in the next section.
Gun_Violence$year <- year(Gun_Violence$date) #extract year from date using the lubridate year() function
Gun_Violence %>%
ggplot(aes(x=as.factor(year))) + geom_bar(stat='count', fill='blue', color ="red") +
scale_y_continuous(labels=comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..)) + labs(title="Gun Violence Count (Yearly)") +
labs(x="Year", y="Number of Incidents")
Gun_Violence$quarter <- quarter(Gun_Violence$date) #extract Quarters from date
q1 <- Gun_Violence %>% filter(year!=2013) %>% select(year, quarter) %>% group_by(year) %>% count(quarter) %>%
ggplot(aes(x=as.factor(quarter), y=n)) + geom_bar(stat='identity', fill='blue', color ="red") +
scale_y_continuous(labels=comma) + facet_grid(.~year) + labs(x='Quarter', y='Number of incidents')
q2 <- Gun_Violence %>% filter(year!=2013 & quarter==1) %>% select(year, quarter) %>%
group_by(year) %>% count(quarter) %>%
ggplot(aes(x=as.factor(year), y=n)) + geom_bar(stat='identity', fill='blue', color ="red") +
scale_y_continuous(labels=comma) + labs(x='Incidents in Q1 of each year', y='Number of incidents')
grid.arrange(q1, q2)
The analysis of quarters shows that more incidents occur in the warmer spring and summer periods. This seems worth diving into a little deeper. In order to compare months We will exclude 2018, as only the first three months have been recorded.
Gun_Violence$month <- month(Gun_Violence$date, label=TRUE)
Below, the most visibile ‘seasonality’ effect seems to us that the colder months seem to have less incidents. November, December, and February are the 3 months with the lowest number of incidents (February also only has 28 days of course). The exception seems to be January, which is worth investigating later on. My first idea is that possibly incidents on new years eve contribute to January having a high number of incidents.
The other peak is the July/August period. Likely because many people will go on holidays in this period.
#only taking the complete years 2014-2017
plotly::ggplotly(Gun_Violence %>% filter(year!=c(2013, 2018)) %>% count(month) %>%
ggplot(aes(x=month, y=n)) + geom_bar(stat='identity', fill='blue', color="red") +
scale_y_continuous(labels=comma) +
labs(x='Month', y='Number of incidents', title='Incidents by Month'))
## Warning in year != c(2013, 2018): longer object length is not a multiple of
## shorter object length
As we mentioned new years eve as a possible explanation of the high January numbers, while the colder month Nov-Feb generally seem to hold less incidents, we wanted to check what the dates with most incidents are.
Gun_Violence$day <- day(Gun_Violence$date)
Gun_Violence <- Gun_Violence %>% mutate(date2=paste(month, day))
Gun_Violence %>% filter(year!=c(2013, 2018)) %>% count(date2) %>% top_n(10) %>% arrange(desc(n))
## Warning in year != c(2013, 2018): longer object length is not a multiple of
## shorter object length
## # A tibble: 10 x 2
## date2 n
## <chr> <int>
## 1 Jan 1 1115
## 2 Jul 4 876
## 3 Jul 5 820
## 4 Jul 30 788
## 5 Oct 25 742
## 6 Jul 17 740
## 7 Jul 19 740
## 8 Aug 13 734
## 9 Aug 1 730
## 10 Jul 25 730
The numbers above are actually the totals of 4 dates, as they are aggregated over 4 years (for instance: 1-1-2014, 1-1-2015, 1-1-2016, 1-1-2017). With the average being 618 (the total number of incidents in 2014-2017 divived by 365 calendar days; 225598/365), there are not many dates that really stand out. Most of the dates in this top 10 seem ‘ordinary’ days in July/August. However, January 1st indeed partially explains the higher incidents numbers in January. In addition, independence day (July 4th) is also dangerous when it comes down to gun incidents. We assume that the high number on July 5th is due to people continuing to celebrate after midnight.
Gun_Violence$weekday <- wday(Gun_Violence$date, label=TRUE)
Weekends (Saturday and Sunday) are more dangerous than working days. My best guess is that this is due to most people not having to work on these days, and also very likely due to nightlife related violence. Voilence happening on Friday nightlife will likely happen after midnight to a large extend, and therefore be recorded as event on Saturdays.
Gun_Violence %>% count(weekday) %>%
ggplot(aes(x=weekday, y=n)) + geom_bar(stat='identity', fill='blue', color="red") +
scale_y_continuous(labels=comma) +
labs(x='Weekday', y='Number of incidents', title='Incidents by Weekday')
First of all, we have to convert State into a factor variable. As we will also analyze city_or_county later on, we will convert them both at the same time.
Gun_Violence[, c('state', 'city_or_county')] <- lapply(Gun_Violence[, c('state', 'city_or_county')], as.factor)
str(Gun_Violence$state)
## Factor w/ 51 levels "Alabama","Alaska",..: 39 5 36 6 34 37 32 19 5 21 ...
str(Gun_Violence$city_or_county)
## Factor w/ 12821 levels "Abbeville","Abbotsford",..: 7132 4917 6564 446 4586 11661 88 8009 1250 526 ...
Below, we are displaying the absolute numbers of incidents by state. However, these numbers should be related to the numbers of inhabitants to get a good view of the relative numbers of incidents. For instance, California is a state with a very large population. Therefore, it is no surprise to see California as the number two in absolute numbers.
plotly::ggplotly(Gun_Violence %>% count(state) %>%
ggplot(aes(x=reorder(state, n), y=n, fill=n, text=state)) +
geom_bar(stat='identity', fill='red') + coord_flip() +
labs(x='', y='Number of incidents'),
tooltip=c("text", "y"))
Although we expected Chicago to have a lot of incidents, we are still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, we will relate those absolute numbers to the population by city.
In order be able to relate the numbers above to the population sizes, we had to look for a good source. Source of the file is mentioned with the code.
#source file location: https://www2.census.gov/programs-surveys/popest/datasets/2010-2017/state/asrh/scprc-est2017-18+pop-res.csv
state_population <- read_csv("scprc-est2017-18+pop-res.csv")
## Parsed with column specification:
## cols(
## SUMLEV = col_character(),
## REGION = col_character(),
## DIVISION = col_character(),
## STATE = col_character(),
## NAME = col_character(),
## POPESTIMATE2017 = col_integer(),
## POPEST18PLUS2017 = col_integer(),
## PCNT_POPEST18PLUS = col_double()
## )
state_population <- state_population %>% select(NAME, POPESTIMATE2017)
state_population <- state_population %>% filter(!NAME %in% c("United States", "Puerto Rico Commonwealth"))
state_population <- state_population %>% rename(state= NAME)
state_population$state <- as.factor(state_population$state)
Below, we are creating a small “states” dateframe, that eventually holds the number of incidents relative to the population of each state (per 100,000 inhabitants).
incidentsByState <- Gun_Violence %>% group_by(state) %>% summarize(stateIncidents=n())
incidentsByState <-left_join(incidentsByState, state_population, by="state")
incidentsByState$Per100000 <- round((incidentsByState$stateIncidents/incidentsByState$POPESTIMATE2017)*100000)
head(incidentsByState)
## # A tibble: 6 x 4
## state stateIncidents POPESTIMATE2017 Per100000
## <fct> <int> <int> <dbl>
## 1 Alabama 5471 4874747 112
## 2 Alaska 1349 739795 182
## 3 Arizona 2328 7016270 33
## 4 Arkansas 2842 3004279 95
## 5 California 16306 39536653 41
## 6 Colorado 3201 5607154 57
city_population <- read_csv( "uscitiesv1.4.csv", col_types = cols())
city_population <- city_population %>% select(city, state_name, population_proper) %>% rename(state=state_name, population=population_proper) %>% filter(population>600000)
In the figure below, you can see that the enriched data, which are related to the population of each state, paint a very different picture. As the numbers of incidents are related to the population sizes, these numbers now represent ‘real’ gun danger levels. To show this visually, we have used color codes. Red means a high danger level in terms of relative numbers of incidents, and yellow means that a state is relatively safe.
Now, Alaska, Louisiana and Delaware are showing the highest relative incident numbers. Hawaii seems the safest state to live in (regarding gun related incidents), and the large state of California drops from second state in terms of absolute incidents to a low position when corrected for the large population.
plotly::ggplotly(incidentsByState%>% filter(state!="District of Columbia") %>%
ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
geom_bar(stat='identity') + coord_flip() +
labs(x='', y='Incidents per 100,000 inhabitants') + scale_fill_gradient(low="yellow", high="red") +
theme(legend.position="none"),
tooltip=c("text", "y"))
Alaska being the state with relatively most incidents came as a surprise to us! Of course, We wanted to find out where this high level of gun violence in Alaska might come from. The folowing link explains why that the data in this dataset are not lying: Here’s why Alaska’s gun problem is so bad (https://www.businessinsider.com/the-state-where-youre-most-likely-to-be-killed-by-a-gun-is-one-of-the-most-beautiful-places-on-earth-2015-6?international=true&r=US&IR=T). In summary, Alaska has very liberal gun laws, very high gun possession as a consequence (60% of homes), and also a very high level of suicides (80% of firearm deaths are self-inflicted).
To get an idea on how these relative figures are composed, We are displaying the underlying numbers of a few states below. As you can see, DoC and Alaska have few inhabitants but a relatively high number of incidents. Hawaii just has low numbers of incidents, and the large state of California comes out with a relatively low number of incidents when taking the population into account.
incidentsByState %>%
filter(state %in% c('District of Columbia', 'Alaska', 'California', 'Hawaii'))
## # A tibble: 4 x 4
## state stateIncidents POPESTIMATE2017 Per100000
## <fct> <int> <int> <dbl>
## 1 Alaska 1349 739795 182
## 2 California 16306 39536653 41
## 3 District of Columbia 3195 693972 460
## 4 Hawaii 289 1427538 20
#source file location: "http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip"
states <- readOGR(dsn ="/Users/Gurtej/Downloads/cb_2017_us_state_500k", layer = "cb_2017_us_state_500k", encoding = "UTF-8")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/Gurtej/Downloads/cb_2017_us_state_500k", layer: "cb_2017_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
class(states)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(states@data)
## STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD
## 0 54 01779805 0400000US54 54 WV West Virginia 00
## 1 17 01779784 0400000US17 17 IL Illinois 00
## 2 24 01714934 0400000US24 24 MD Maryland 00
## 3 16 01779783 0400000US16 16 ID Idaho 00
## 4 50 01779802 0400000US50 50 VT Vermont 00
## 5 09 01779780 0400000US09 09 CT Connecticut 00
## ALAND AWATER
## 0 62265662566 489840834
## 1 143784114293 6211277447
## 2 25150696145 6980371026
## 3 214048160737 2393355752
## 4 23873457570 1031134839
## 5 12542619303 1815495323
addPer100k <- data.frame(id=states$GEOID, name=states$NAME)
addPer100k <- left_join(addPer100k, incidentsByState %>% select(state, Per100000) %>% rename(name=state), by="name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000
Now it is time to create the map. Please feel free to hover over the states (this shows a popup with state name and the exact number of incidents per 100,000 inhibitants), and zoom in and out. By zooming out, you will be able to see Alaska and Hawaii. By zooming in, you can also find the District of Columbia (between Maryland and Virginia).
bins <- c(0, 50, 75, 100, 150, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)
state_popup <- paste0("<strong>State: </strong>",
states$NAME,
"<br><strong>Incidents per 100,000 inhabitants </strong>",
states$per100k) %>% lapply(htmltools::HTML)
leaflet(data = states) %>%
setView(lng=-96, lat=37.8, zoom=4) %>%
addProviderTiles("MapBox", options = providerTileOptions(id = "mapbox.light",
accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) %>%
addPolygons(
fillColor = ~pal(per100k),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE),
label = state_popup,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")) %>%
addLegend(pal = pal, values = ~per100k, opacity = 0.7, title = "Incidents", position = "bottomright")
First of all, we will create a new variable that adds up the number of killed and injured people for each incident.
Gun_Violence$victims <- Gun_Violence$n_killed + Gun_Violence$n_injured
head(Gun_Violence)
## # A tibble: 6 x 36
## incident_id date state city_or_county address n_killed n_injured
## <int> <date> <fct> <fct> <chr> <int> <int>
## 1 461105 2013-01-01 Penn… Mckeesport 1506 V… 0 4
## 2 460726 2013-01-01 Cali… Hawthorne 13500 … 1 3
## 3 478855 2013-01-01 Ohio Lorain 1776 E… 1 3
## 4 478925 2013-01-05 Colo… Aurora 16000 … 4 0
## 5 478959 2013-01-07 Nort… Greensboro 307 Mo… 2 2
## 6 478948 2013-01-07 Okla… Tulsa 6000 b… 4 0
## # ... with 29 more variables: incident_url <chr>, source_url <chr>,
## # incident_url_fields_missing <chr>, congressional_district <int>,
## # gun_stolen <chr>, gun_type <chr>, incident_characteristics <chr>,
## # latitude <dbl>, location_description <chr>, longitude <dbl>,
## # n_guns_involved <int>, notes <chr>, participant_age <chr>,
## # participant_age_group <chr>, participant_gender <chr>,
## # participant_name <chr>, participant_relationship <chr>,
## # participant_status <chr>, participant_type <chr>, sources <chr>,
## # state_house_district <int>, state_senate_district <int>, year <dbl>,
## # quarter <int>, month <ord>, day <int>, date2 <chr>, weekday <ord>,
## # victims <int>
VictimsByState <- Gun_Violence %>% group_by(state) %>% summarize(sumVic=sum(victims), sumInj=sum(n_injured), sumDeath=sum(n_killed), PercDeath=round(sumDeath/sumVic,2), sumIncidents=n(), vicPerInc=round(sumVic/sumIncidents,2))
head(VictimsByState)
## # A tibble: 6 x 7
## state sumVic sumInj sumDeath PercDeath sumIncidents vicPerInc
## <fct> <int> <int> <int> <dbl> <int> <dbl>
## 1 Alabama 4878 2998 1880 0.39 5471 0.89
## 2 Alaska 592 325 267 0.45 1349 0.44
## 3 Arizona 2190 1096 1094 0.5 2328 0.94
## 4 Arkansas 2120 1347 773 0.36 2842 0.75
## 5 California 13206 7644 5562 0.42 16306 0.81
## 6 Colorado 1929 1133 796 0.41 3201 0.6
#library(gganimate)
p <- VictimsByState %>% filter(vicPerInc>0.8) %>%
ggplot(aes(x=reorder(state, -vicPerInc), y=vicPerInc)) + geom_bar(stat='identity', fill='blue') +
labs(x='State', y='Victims per incidents') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
#p + transition_states((VictimsByState %>% filter(vicPerInc>0.8))$state, transition_length = 1, state_length = 1)
VictimsByState <-left_join(VictimsByState, state_population, by="state")
VictimsByState$Per100000 <- round((VictimsByState$sumVic/VictimsByState$POPESTIMATE2017)*100000)
In the graph below, which again uses the color codes to show danger levels, you can see that Louisiana is the most dangerous state when it comes down to the number of victims relative to the population size. This is no surprise, as the state has a high relative number of incidents and also ranks high regarding the victims per incident ratio. Alaska drops to 9th place, which is due to the relatively low number of victims per incident.
plotly::ggplotly(VictimsByState%>% filter(state!="District of Columbia") %>%
ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
geom_bar(stat='identity') + coord_flip() +
labs(x='', y='Victims per 100,000 inhabitants') + scale_fill_gradient(low="yellow", high="red") +
theme(legend.position="none"),
tooltip=c("text", "y"))
incidentsByCity <- Gun_Violence %>% select(city_or_county, state) %>% rename(city=city_or_county) %>% group_by(city, state) %>% summarize(cityIncidents=n())
It seems likely that a lot of violence occurs in the big cities of the states. For instance, we expected to see a lot incidents in the big cities of Illinois (Chicago) and Louisiana (New Orleans). The biggest city in Alaska (Anchorage) is unlikely to appear in the list of cities with the highest absolute numbers of incidents as the city is relatively small.
incidentsByCity <- Gun_Violence %>% select(city_or_county, state) %>% rename(city=city_or_county) %>% group_by(city, state) %>% summarize(cityIncidents=n())
New York City did not seem to be recorded as a whole in the datset. It turns out that the boroughs are recorded separately. Before we continue, we are adding New York as a city.
incidentsByCity[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York',]
## # A tibble: 5 x 3
## # Groups: city [5]
## city state cityIncidents
## <fct> <fct> <int>
## 1 Bronx New York 845
## 2 Brooklyn New York 1405
## 3 New York (Manhattan) New York 360
## 4 Queens New York 130
## 5 Staten Island New York 411
sumNewYork <- sum(incidentsByCity$cityIncidents[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York'])
NewYork <- data.frame(city='New York', state='New York', cityIncidents=sumNewYork)
incidentsByCity <- as.tibble(rbind(as.data.frame(incidentsByCity), NewYork))
incidentsByCity %>% top_n(20, wt=cityIncidents) %>%
ggplot(aes(x=reorder(city, cityIncidents), y=cityIncidents)) + geom_bar(stat='identity', fill='red') +
labs(x='City', y='Number of incidents') + coord_flip()
Although we expected Chicago to have a lot of incidents, we are still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, we will relate those absolute numbers to the population by city.
#source file location: https://simplemaps.com/data/us-cities
citiesPop <- read_csv("uscitiesv1.4.csv", col_types = cols())
citiesPop <- citiesPop %>% select(city, state_name, population_proper) %>% rename(state=state_name, population=population_proper) %>% filter(population>600000)
we are only focussing on cities with population more than 600000.
citiesPop <-left_join(citiesPop, incidentsByCity, by=c("city", "state"))
## Warning: Column `city` joining character vector and factor, coercing into
## character vector
## Warning: Column `state` joining character vector and factor, coercing into
## character vector
citiesPop$Per100000 <- round((citiesPop$cityIncidents/citiesPop$population)*100000)
citiesPop$citystate <- str_c(citiesPop$city, " - " ,citiesPop$state)
incidentsByState <- incidentsByState %>% rename(state_avg=Per100000)
citiesPop <- left_join(citiesPop, incidentsByState %>% select(state, state_avg), by="state")
## Warning: Column `state` joining character vector and factor, coercing into
## character vector
citiesPop1 <- citiesPop %>% select(citystate, Per100000, state, state_avg) %>% rename(city_avg=Per100000)
gathercols <- c("city_avg", "state_avg")
CitiesStatesLong <- tidyr::gather(citiesPop1, city_or_state, per100k, gathercols)
citiesTop20 <- CitiesStatesLong %>% filter(city_or_state=='city_avg') %>% arrange(per100k) %>% top_n(20, wt=per100k)
Top20names <- citiesTop20$citystate
CitiesStatesLong <- CitiesStatesLong[CitiesStatesLong$citystate %in% Top20names,]
ggplot(CitiesStatesLong, aes(x=factor(citystate), y=per100k, fill=city_or_state)) +
geom_bar(stat="identity", position = position_dodge2(reverse=TRUE, padding=0.1)) + coord_flip() +
scale_fill_manual(values = c("state_avg"="yellow", "city_avg"="red")) +
scale_x_discrete(limits=Top20names) + labs(y='Incidents per 100,000 inhabitants', x="")
As you can see, the city averages of the cities with high incident numbers are much higher than their associated state averages. At the bottom of this list (not displayed) a few cities have lower numbers than the state , but generally cities are more dangerous than the country side.
Below, we are displaying the 10 incidents with the highest numbers of victims. Although any single victim is one too many, the shooting in Las Vegas was by far the worst incident with over 470 victims.
Top10 <- Gun_Violence %>% select(incident_id, date, n_killed, n_injured, victims, location_description, city_or_county, state, latitude, longitude) %>% rename(Incident_Id=incident_id, Date=date, Killed=n_killed, Injured=n_injured, Victims=victims, Location=location_description, City=city_or_county) %>%
arrange(desc(Victims)) %>% top_n(n=13, wt=Victims)
kable(Top10 %>% select(-longitude, -latitude))
Incident_Id | Date | Killed | Injured | Victims | Location | City | state |
---|---|---|---|---|---|---|---|
577157 | 2016-06-12 | 50 | 53 | 103 | Pulse | Orlando | Florida |
980577 | 2017-11-05 | 27 | 20 | 47 | First Baptist Church | Sutherland Springs | Texas |
456893 | 2015-12-02 | 16 | 19 | 35 | Inland Regional Center | San Bernardino | California |
1049217 | 2018-02-14 | 17 | 17 | 34 | Marjory Stoneman Douglas High School | Pompano Beach (Parkland) | Florida |
341622 | 2015-05-17 | 9 | 18 | 27 | Twin Peaks Restaurant | Waco | Texas |
879953 | 2017-07-01 | 0 | 25 | 25 | Power Ultra Lounge | Little Rock | Arkansas |
611479 | 2016-07-25 | 2 | 19 | 21 | Club Blu | Fort Myers | Florida |
121031 | 2014-04-02 | 4 | 16 | 20 | NA | Fort Hood | Texas |
486209 | 2013-05-12 | 0 | 19 | 19 | NA | New Orleans | Louisiana |
423223 | 2015-10-01 | 10 | 9 | 19 | Umpqua Community College | Roseburg | Oregon |
493842 | 2013-11-09 | 2 | 16 | 18 | NA | Cypress | Texas |
511436 | 2016-02-25 | 4 | 14 | 18 | Excel Industries | Hesston | Kansas |
987611 | 2017-11-14 | 6 | 12 | 18 | Rancho Tehama Elementary School | Corning | California |
TopMap <- Top10 %>% select(latitude, longitude, Victims, City, Location)
labels <- paste0("<strong>City: </strong>", TopMap$City,
"<br><strong>Location: </strong>", TopMap$Location,
"<br><strong>Victims </strong>", TopMap$Victims) %>% lapply(htmltools::HTML)
leaflet(TopMap) %>%
setView(lng=-96, lat=37.8, zoom=4) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(Victims), label = labels)
The trend is unfortunately upward. The number of incidents increased year-on-year from 2013 to 2017. The number of incidents in Q1 2018 is lower that this number in Q1 2017, which can be seen as a cautiously positive sign.
On average, the number of incidents is lower in autumn/winter than in spring/summer. Weekends are more dangerous than weekdays. The most dangerous dates are January 1st (New Year’s Day), and July 4th (Independence Day, with many incidents also happening the day after, July 5th).
After enriching the data with population figures from the US Census Bureau, Alaska came out as the (to us surprising) state with the relatively highest number of incidents. Besides Alaska, Louisiana and Delaware are also states with high average incident numbers. The safest state regarding the relative number of incidents is Hawaii. However, Alaska has a relatively low ratio of victims (number of people killed + injured) per incident. This seems to be caused by a high percentage of “non-shooting incidents” (We assume this mostly means incidents where people have only threatened to shoot). When ranking the states on the number of victims relative to the population size, Louisiana is the most dangerous state. In this statistic, Illinois jumps to second place, which is due to the state having the highest ratio of victims per incident.
Of the big cities with at least 600,000 inhabitants, Baltimore, Washington DC, and Chicago have the highest relative numbers of incidents. Big cities are generally more dangerous than the country side regarding gun violence, as the incident rates are (much) higher.
The most severe incident by far was the shooting at the Strip in Las Vegas (from the Mandalay Bay hotel), with a total of 470 people that were killed or injured.