0.1 Formulating brief: planning, defining and initiating the project

Formulating the Belief

The US is one of the few countries in which the right to bear arms is constitutionally protected. A consequence of this is that the level of gun voilence is very high compared to other countries. But America’s relationship with guns is unique in another crucial way: Among developed nations, the US is far and away the most homicidal. For instance, America has six times as many firearm homicides as Canada, and nearly 16 times as many as Germany.

This unique characterstic of US provokes us to dive deeper into the gun violence issues and understand the very nature of it through data. This forms our belief to do the investigation through data visulazation.

Objective:

The objective is to understand and analyze the gun violence data to extract insightful information embedded in the data to better understand the geographics of gun violence, it’s temporal beheaviour and draw conclusions it’s nature.

0.2 Working with data: going through the mechanics of gathering, handling and preparing data

Data Source and Description

The date in this dataset was downloaded from gunviolencearchive.org.The CSV file contains data for all recorded gun violence incidents in the US between January 2013 and March 2018. The data consist of 240k incidents, with detailed information about each incident.

Importing Data

library(readr)
Gun_Violence <- read_csv("/Users/Gurtej/Downloads/gun-violence-data_01-2013_03-2018.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   incident_id = col_integer(),
##   date = col_date(format = ""),
##   n_killed = col_integer(),
##   n_injured = col_integer(),
##   congressional_district = col_integer(),
##   latitude = col_double(),
##   longitude = col_double(),
##   n_guns_involved = col_integer(),
##   state_house_district = col_integer(),
##   state_senate_district = col_integer()
## )
## See spec(...) for full column specifications.
#Exploring the dataset
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(Gun_Violence)
## Observations: 239,677
## Variables: 29
## $ incident_id                 <int> 461105, 460726, 478855, 478925, 47...
## $ date                        <date> 2013-01-01, 2013-01-01, 2013-01-0...
## $ state                       <chr> "Pennsylvania", "California", "Ohi...
## $ city_or_county              <chr> "Mckeesport", "Hawthorne", "Lorain...
## $ address                     <chr> "1506 Versailles Avenue and Coursi...
## $ n_killed                    <int> 0, 1, 1, 4, 2, 4, 5, 0, 0, 1, 1, 1...
## $ n_injured                   <int> 4, 3, 3, 0, 2, 0, 0, 5, 4, 6, 3, 3...
## $ incident_url                <chr> "http://www.gunviolencearchive.org...
## $ source_url                  <chr> "http://www.post-gazette.com/local...
## $ incident_url_fields_missing <chr> "False", "False", "False", "False"...
## $ congressional_district      <int> 14, 43, 9, 6, 6, 1, 1, 2, 9, 7, 3,...
## $ gun_stolen                  <chr> NA, NA, "0::Unknown||1::Unknown", ...
## $ gun_type                    <chr> NA, NA, "0::Unknown||1::Unknown", ...
## $ incident_characteristics    <chr> "Shot - Wounded/Injured||Mass Shoo...
## $ latitude                    <dbl> 40.3467, 33.9090, 41.4455, 39.6518...
## $ location_description        <chr> NA, NA, "Cotton Club", NA, NA, "Fa...
## $ longitude                   <dbl> -79.8559, -118.3330, -82.1377, -10...
## $ n_guns_involved             <int> NA, NA, 2, NA, 2, NA, 2, NA, NA, N...
## $ notes                       <chr> "Julian Sims under investigation: ...
## $ participant_age             <chr> "0::20", "0::20", "0::25||1::31||2...
## $ participant_age_group       <chr> "0::Adult 18+||1::Adult 18+||2::Ad...
## $ participant_gender          <chr> "0::Male||1::Male||3::Male||4::Fem...
## $ participant_name            <chr> "0::Julian Sims", "0::Bernard Gill...
## $ participant_relationship    <chr> NA, NA, NA, NA, "3::Family", NA, "...
## $ participant_status          <chr> "0::Arrested||1::Injured||2::Injur...
## $ participant_type            <chr> "0::Victim||1::Victim||2::Victim||...
## $ sources                     <chr> "http://pittsburgh.cbslocal.com/20...
## $ state_house_district        <int> NA, 62, 56, 40, 62, 72, 10, 93, 11...
## $ state_senate_district       <int> NA, 35, 13, 28, 27, 11, 14, 5, 7, ...
#Reformatting the date variable
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
Gun_Violence$date <- ymd(Gun_Violence$date)
str(Gun_Violence$date)
##  Date[1:239677], format: "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-05" "2013-01-07" ...

0.3 Establishing editorial thinking: defining what to show to audience

In editiorial thinking we establish what we really want the audience to take out of the set of visualizations that we are going to put in front of them. For this project following are the stats that we want our audience to take home once this go through the story telling of gun violence dataset created through set of visulazation graphs in stage 4:

0.4 Developing design solution: considering all the design options and beginning the production cycle

0.4.1 Loading R Packages

library(knitr)
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(stringr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(lubridate)
library(ggrepel)
library(leaflet)
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-4, (SVN revision 766)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
##  Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/gdal
##  GDAL binary built with GEOS: FALSE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/proj
##  Linking to sp version: 1.3-1

0.4.2 Comparing number of incidents by year, quarter, month, and weekday

summary(Gun_Violence$date)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-01-01" "2015-03-07" "2016-04-05" "2016-03-13" "2017-04-03" 
##         Max. 
## "2018-03-31"

0.4.2.1 By year

It seems as if most incidents in 2013 were not recorded. In addition, 2018 is not a full year of data as the latest recorded incident was on May 31st 2018. As this means exactly one quarter, we are interested if the trend is still upward if we only look at the Q1s of 2014-2018. We will look into this in the next section.

Gun_Violence$year <- year(Gun_Violence$date) #extract year from date using the lubridate year() function

Gun_Violence %>%
        ggplot(aes(x=as.factor(year))) + geom_bar(stat='count', fill='blue', color ="red") +
        scale_y_continuous(labels=comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) + labs(title="Gun Violence Count (Yearly)") +
  labs(x="Year", y="Number of Incidents")

0.4.2.2 By quarter; what does Q1 2018 tell us?

Gun_Violence$quarter <- quarter(Gun_Violence$date) #extract Quarters from date
q1 <- Gun_Violence %>% filter(year!=2013) %>% select(year, quarter) %>% group_by(year) %>% count(quarter) %>%
        ggplot(aes(x=as.factor(quarter), y=n)) + geom_bar(stat='identity', fill='blue', color ="red") +
        scale_y_continuous(labels=comma) + facet_grid(.~year) + labs(x='Quarter', y='Number of incidents')

q2 <- Gun_Violence %>% filter(year!=2013 & quarter==1) %>% select(year, quarter) %>%
        group_by(year) %>% count(quarter) %>%
        ggplot(aes(x=as.factor(year), y=n)) + geom_bar(stat='identity', fill='blue', color ="red") +
        scale_y_continuous(labels=comma) + labs(x='Incidents in Q1 of each year', y='Number of incidents')

grid.arrange(q1, q2)

0.4.2.3 By Month

The analysis of quarters shows that more incidents occur in the warmer spring and summer periods. This seems worth diving into a little deeper. In order to compare months We will exclude 2018, as only the first three months have been recorded.

Gun_Violence$month <- month(Gun_Violence$date, label=TRUE)

Below, the most visibile ‘seasonality’ effect seems to us that the colder months seem to have less incidents. November, December, and February are the 3 months with the lowest number of incidents (February also only has 28 days of course). The exception seems to be January, which is worth investigating later on. My first idea is that possibly incidents on new years eve contribute to January having a high number of incidents.

The other peak is the July/August period. Likely because many people will go on holidays in this period.

#only taking the complete years 2014-2017
plotly::ggplotly(Gun_Violence %>% filter(year!=c(2013, 2018)) %>% count(month) %>%
        ggplot(aes(x=month, y=n)) + geom_bar(stat='identity', fill='blue', color="red") +
        scale_y_continuous(labels=comma) +
        labs(x='Month', y='Number of incidents', title='Incidents by Month'))
## Warning in year != c(2013, 2018): longer object length is not a multiple of
## shorter object length
0.4.2.3.1 Dates with most incidents

As we mentioned new years eve as a possible explanation of the high January numbers, while the colder month Nov-Feb generally seem to hold less incidents, we wanted to check what the dates with most incidents are.

Gun_Violence$day <- day(Gun_Violence$date)
Gun_Violence <- Gun_Violence %>% mutate(date2=paste(month, day))
Gun_Violence %>% filter(year!=c(2013, 2018)) %>% count(date2) %>% top_n(10) %>% arrange(desc(n))
## Warning in year != c(2013, 2018): longer object length is not a multiple of
## shorter object length
## # A tibble: 10 x 2
##    date2      n
##    <chr>  <int>
##  1 Jan 1   1115
##  2 Jul 4    876
##  3 Jul 5    820
##  4 Jul 30   788
##  5 Oct 25   742
##  6 Jul 17   740
##  7 Jul 19   740
##  8 Aug 13   734
##  9 Aug 1    730
## 10 Jul 25   730

The numbers above are actually the totals of 4 dates, as they are aggregated over 4 years (for instance: 1-1-2014, 1-1-2015, 1-1-2016, 1-1-2017). With the average being 618 (the total number of incidents in 2014-2017 divived by 365 calendar days; 225598/365), there are not many dates that really stand out. Most of the dates in this top 10 seem ‘ordinary’ days in July/August. However, January 1st indeed partially explains the higher incidents numbers in January. In addition, independence day (July 4th) is also dangerous when it comes down to gun incidents. We assume that the high number on July 5th is due to people continuing to celebrate after midnight.

0.4.2.4 By weekday

Gun_Violence$weekday <- wday(Gun_Violence$date, label=TRUE)

Weekends (Saturday and Sunday) are more dangerous than working days. My best guess is that this is due to most people not having to work on these days, and also very likely due to nightlife related violence. Voilence happening on Friday nightlife will likely happen after midnight to a large extend, and therefore be recorded as event on Saturdays.

Gun_Violence %>% count(weekday) %>%
        ggplot(aes(x=weekday, y=n)) + geom_bar(stat='identity', fill='blue', color="red") +
        scale_y_continuous(labels=comma) +
        labs(x='Weekday', y='Number of incidents', title='Incidents by Weekday')

0.4.3 Comparing number of incidents and victims by location

0.4.3.1 Incidents by State

First of all, we have to convert State into a factor variable. As we will also analyze city_or_county later on, we will convert them both at the same time.

Gun_Violence[, c('state', 'city_or_county')] <- lapply(Gun_Violence[, c('state', 'city_or_county')], as.factor)

str(Gun_Violence$state)
##  Factor w/ 51 levels "Alabama","Alaska",..: 39 5 36 6 34 37 32 19 5 21 ...
str(Gun_Violence$city_or_county)
##  Factor w/ 12821 levels "Abbeville","Abbotsford",..: 7132 4917 6564 446 4586 11661 88 8009 1250 526 ...

Below, we are displaying the absolute numbers of incidents by state. However, these numbers should be related to the numbers of inhabitants to get a good view of the relative numbers of incidents. For instance, California is a state with a very large population. Therefore, it is no surprise to see California as the number two in absolute numbers.

plotly::ggplotly(Gun_Violence %>% count(state) %>%
        ggplot(aes(x=reorder(state, n), y=n, fill=n, text=state)) +
        geom_bar(stat='identity', fill='red') + coord_flip() +
        labs(x='', y='Number of incidents'),
        tooltip=c("text", "y"))

Although we expected Chicago to have a lot of incidents, we are still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, we will relate those absolute numbers to the population by city.

0.4.3.1.1 Incidents relative to the State population size

In order be able to relate the numbers above to the population sizes, we had to look for a good source. Source of the file is mentioned with the code.

#source file location: https://www2.census.gov/programs-surveys/popest/datasets/2010-2017/state/asrh/scprc-est2017-18+pop-res.csv

state_population <- read_csv("scprc-est2017-18+pop-res.csv")
## Parsed with column specification:
## cols(
##   SUMLEV = col_character(),
##   REGION = col_character(),
##   DIVISION = col_character(),
##   STATE = col_character(),
##   NAME = col_character(),
##   POPESTIMATE2017 = col_integer(),
##   POPEST18PLUS2017 = col_integer(),
##   PCNT_POPEST18PLUS = col_double()
## )
state_population <- state_population %>% select(NAME, POPESTIMATE2017)
state_population <- state_population %>% filter(!NAME %in% c("United States", "Puerto Rico Commonwealth"))
state_population <- state_population %>% rename(state= NAME)
state_population$state <- as.factor(state_population$state)

Below, we are creating a small “states” dateframe, that eventually holds the number of incidents relative to the population of each state (per 100,000 inhabitants).

incidentsByState <- Gun_Violence %>% group_by(state) %>% summarize(stateIncidents=n())
incidentsByState <-left_join(incidentsByState, state_population, by="state")
incidentsByState$Per100000 <- round((incidentsByState$stateIncidents/incidentsByState$POPESTIMATE2017)*100000)
head(incidentsByState)
## # A tibble: 6 x 4
##   state      stateIncidents POPESTIMATE2017 Per100000
##   <fct>               <int>           <int>     <dbl>
## 1 Alabama              5471         4874747       112
## 2 Alaska               1349          739795       182
## 3 Arizona              2328         7016270        33
## 4 Arkansas             2842         3004279        95
## 5 California          16306        39536653        41
## 6 Colorado             3201         5607154        57
city_population <- read_csv( "uscitiesv1.4.csv", col_types = cols())
city_population <- city_population %>% select(city, state_name, population_proper) %>% rename(state=state_name, population=population_proper) %>% filter(population>600000)

In the figure below, you can see that the enriched data, which are related to the population of each state, paint a very different picture. As the numbers of incidents are related to the population sizes, these numbers now represent ‘real’ gun danger levels. To show this visually, we have used color codes. Red means a high danger level in terms of relative numbers of incidents, and yellow means that a state is relatively safe.

Now, Alaska, Louisiana and Delaware are showing the highest relative incident numbers. Hawaii seems the safest state to live in (regarding gun related incidents), and the large state of California drops from second state in terms of absolute incidents to a low position when corrected for the large population.

plotly::ggplotly(incidentsByState%>% filter(state!="District of Columbia") %>%
        ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
        geom_bar(stat='identity') + coord_flip() +
        labs(x='', y='Incidents per 100,000 inhabitants') + scale_fill_gradient(low="yellow", high="red") +
        theme(legend.position="none"),
        tooltip=c("text", "y"))

Alaska being the state with relatively most incidents came as a surprise to us! Of course, We wanted to find out where this high level of gun violence in Alaska might come from. The folowing link explains why that the data in this dataset are not lying: Here’s why Alaska’s gun problem is so bad (https://www.businessinsider.com/the-state-where-youre-most-likely-to-be-killed-by-a-gun-is-one-of-the-most-beautiful-places-on-earth-2015-6?international=true&r=US&IR=T). In summary, Alaska has very liberal gun laws, very high gun possession as a consequence (60% of homes), and also a very high level of suicides (80% of firearm deaths are self-inflicted).

To get an idea on how these relative figures are composed, We are displaying the underlying numbers of a few states below. As you can see, DoC and Alaska have few inhabitants but a relatively high number of incidents. Hawaii just has low numbers of incidents, and the large state of California comes out with a relatively low number of incidents when taking the population into account.

incidentsByState %>%
        filter(state %in% c('District of Columbia', 'Alaska', 'California', 'Hawaii'))
## # A tibble: 4 x 4
##   state                stateIncidents POPESTIMATE2017 Per100000
##   <fct>                         <int>           <int>     <dbl>
## 1 Alaska                         1349          739795       182
## 2 California                    16306        39536653        41
## 3 District of Columbia           3195          693972       460
## 4 Hawaii                          289         1427538        20
0.4.3.1.2 An interactive map of incidents by state
#source file location: "http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip"
states <- readOGR(dsn ="/Users/Gurtej/Downloads/cb_2017_us_state_500k", layer = "cb_2017_us_state_500k", encoding = "UTF-8")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/Gurtej/Downloads/cb_2017_us_state_500k", layer: "cb_2017_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
class(states)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(states@data)
##   STATEFP  STATENS    AFFGEOID GEOID STUSPS          NAME LSAD
## 0      54 01779805 0400000US54    54     WV West Virginia   00
## 1      17 01779784 0400000US17    17     IL      Illinois   00
## 2      24 01714934 0400000US24    24     MD      Maryland   00
## 3      16 01779783 0400000US16    16     ID         Idaho   00
## 4      50 01779802 0400000US50    50     VT       Vermont   00
## 5      09 01779780 0400000US09    09     CT   Connecticut   00
##          ALAND     AWATER
## 0  62265662566  489840834
## 1 143784114293 6211277447
## 2  25150696145 6980371026
## 3 214048160737 2393355752
## 4  23873457570 1031134839
## 5  12542619303 1815495323
addPer100k <- data.frame(id=states$GEOID, name=states$NAME)
addPer100k <- left_join(addPer100k, incidentsByState %>% select(state, Per100000) %>% rename(name=state), by="name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000

Now it is time to create the map. Please feel free to hover over the states (this shows a popup with state name and the exact number of incidents per 100,000 inhibitants), and zoom in and out. By zooming out, you will be able to see Alaska and Hawaii. By zooming in, you can also find the District of Columbia (between Maryland and Virginia).

bins <- c(0, 50, 75, 100, 150, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)

state_popup <- paste0("<strong>State: </strong>", 
                      states$NAME, 
                      "<br><strong>Incidents per 100,000 inhabitants </strong>", 
                      states$per100k) %>% lapply(htmltools::HTML)

leaflet(data = states) %>%
        setView(lng=-96, lat=37.8, zoom=4) %>%
        addProviderTiles("MapBox", options = providerTileOptions(id = "mapbox.light",
                accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) %>%
        addPolygons(
                fillColor = ~pal(per100k),
                weight = 2,
                opacity = 1,
                color = "white",
                dashArray = "3",
                fillOpacity = 0.7,
                highlight = highlightOptions(
                        weight = 5,
                        color = "#666",
                        dashArray = "",
                        fillOpacity = 0.7,
                        bringToFront = TRUE),
                label = state_popup,
                labelOptions = labelOptions(
                        style = list("font-weight" = "normal", padding = "3px 8px"),
                        textsize = "15px",
                        direction = "auto")) %>%
        addLegend(pal = pal, values = ~per100k, opacity = 0.7, title = "Incidents", position = "bottomright")

0.4.3.2 Victims by State

First of all, we will create a new variable that adds up the number of killed and injured people for each incident.

Gun_Violence$victims <- Gun_Violence$n_killed + Gun_Violence$n_injured
head(Gun_Violence)
## # A tibble: 6 x 36
##   incident_id date       state city_or_county address n_killed n_injured
##         <int> <date>     <fct> <fct>          <chr>      <int>     <int>
## 1      461105 2013-01-01 Penn… Mckeesport     1506 V…        0         4
## 2      460726 2013-01-01 Cali… Hawthorne      13500 …        1         3
## 3      478855 2013-01-01 Ohio  Lorain         1776 E…        1         3
## 4      478925 2013-01-05 Colo… Aurora         16000 …        4         0
## 5      478959 2013-01-07 Nort… Greensboro     307 Mo…        2         2
## 6      478948 2013-01-07 Okla… Tulsa          6000 b…        4         0
## # ... with 29 more variables: incident_url <chr>, source_url <chr>,
## #   incident_url_fields_missing <chr>, congressional_district <int>,
## #   gun_stolen <chr>, gun_type <chr>, incident_characteristics <chr>,
## #   latitude <dbl>, location_description <chr>, longitude <dbl>,
## #   n_guns_involved <int>, notes <chr>, participant_age <chr>,
## #   participant_age_group <chr>, participant_gender <chr>,
## #   participant_name <chr>, participant_relationship <chr>,
## #   participant_status <chr>, participant_type <chr>, sources <chr>,
## #   state_house_district <int>, state_senate_district <int>, year <dbl>,
## #   quarter <int>, month <ord>, day <int>, date2 <chr>, weekday <ord>,
## #   victims <int>
VictimsByState <- Gun_Violence %>% group_by(state) %>% summarize(sumVic=sum(victims), sumInj=sum(n_injured), sumDeath=sum(n_killed), PercDeath=round(sumDeath/sumVic,2), sumIncidents=n(), vicPerInc=round(sumVic/sumIncidents,2))
head(VictimsByState)
## # A tibble: 6 x 7
##   state      sumVic sumInj sumDeath PercDeath sumIncidents vicPerInc
##   <fct>       <int>  <int>    <int>     <dbl>        <int>     <dbl>
## 1 Alabama      4878   2998     1880      0.39         5471      0.89
## 2 Alaska        592    325      267      0.45         1349      0.44
## 3 Arizona      2190   1096     1094      0.5          2328      0.94
## 4 Arkansas     2120   1347      773      0.36         2842      0.75
## 5 California  13206   7644     5562      0.42        16306      0.81
## 6 Colorado     1929   1133      796      0.41         3201      0.6
#library(gganimate)
p <- VictimsByState %>% filter(vicPerInc>0.8) %>%
        ggplot(aes(x=reorder(state, -vicPerInc), y=vicPerInc)) + geom_bar(stat='identity', fill='blue') +
        labs(x='State', y='Victims per incidents') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

p

#p + transition_states((VictimsByState %>% filter(vicPerInc>0.8))$state, transition_length = 1, state_length = 1) 
0.4.3.2.1 Victims relative to the state population sizes
VictimsByState <-left_join(VictimsByState, state_population, by="state")
VictimsByState$Per100000 <- round((VictimsByState$sumVic/VictimsByState$POPESTIMATE2017)*100000)

In the graph below, which again uses the color codes to show danger levels, you can see that Louisiana is the most dangerous state when it comes down to the number of victims relative to the population size. This is no surprise, as the state has a high relative number of incidents and also ranks high regarding the victims per incident ratio. Alaska drops to 9th place, which is due to the relatively low number of victims per incident.

plotly::ggplotly(VictimsByState%>% filter(state!="District of Columbia") %>%
        ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
        geom_bar(stat='identity') + coord_flip() +
        labs(x='', y='Victims per 100,000 inhabitants') + scale_fill_gradient(low="yellow", high="red") +
        theme(legend.position="none"),
        tooltip=c("text", "y"))

0.4.3.3 Incidents by city

incidentsByCity <- Gun_Violence %>% select(city_or_county, state) %>% rename(city=city_or_county) %>% group_by(city, state) %>% summarize(cityIncidents=n())

It seems likely that a lot of violence occurs in the big cities of the states. For instance, we expected to see a lot incidents in the big cities of Illinois (Chicago) and Louisiana (New Orleans). The biggest city in Alaska (Anchorage) is unlikely to appear in the list of cities with the highest absolute numbers of incidents as the city is relatively small.

incidentsByCity <- Gun_Violence %>% select(city_or_county, state) %>% rename(city=city_or_county) %>% group_by(city, state) %>% summarize(cityIncidents=n())

New York City did not seem to be recorded as a whole in the datset. It turns out that the boroughs are recorded separately. Before we continue, we are adding New York as a city.

incidentsByCity[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York',]
## # A tibble: 5 x 3
## # Groups:   city [5]
##   city                 state    cityIncidents
##   <fct>                <fct>            <int>
## 1 Bronx                New York           845
## 2 Brooklyn             New York          1405
## 3 New York (Manhattan) New York           360
## 4 Queens               New York           130
## 5 Staten Island        New York           411
sumNewYork <- sum(incidentsByCity$cityIncidents[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York'])

NewYork <- data.frame(city='New York', state='New York', cityIncidents=sumNewYork)
incidentsByCity <- as.tibble(rbind(as.data.frame(incidentsByCity), NewYork))
incidentsByCity %>% top_n(20, wt=cityIncidents) %>%
        ggplot(aes(x=reorder(city, cityIncidents), y=cityIncidents)) + geom_bar(stat='identity', fill='red') +
        labs(x='City', y='Number of incidents') + coord_flip()

Although we expected Chicago to have a lot of incidents, we are still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, we will relate those absolute numbers to the population by city.

0.4.3.3.1 Incidents relative to the City population size
#source file location: https://simplemaps.com/data/us-cities

citiesPop <- read_csv("uscitiesv1.4.csv", col_types = cols())
citiesPop <- citiesPop %>% select(city, state_name, population_proper) %>% rename(state=state_name, population=population_proper) %>% filter(population>600000)

we are only focussing on cities with population more than 600000.

citiesPop <-left_join(citiesPop, incidentsByCity, by=c("city", "state"))
## Warning: Column `city` joining character vector and factor, coercing into
## character vector
## Warning: Column `state` joining character vector and factor, coercing into
## character vector
citiesPop$Per100000 <- round((citiesPop$cityIncidents/citiesPop$population)*100000)
citiesPop$citystate <- str_c(citiesPop$city, " - " ,citiesPop$state)

incidentsByState <- incidentsByState %>% rename(state_avg=Per100000)
citiesPop <- left_join(citiesPop, incidentsByState %>% select(state, state_avg), by="state")
## Warning: Column `state` joining character vector and factor, coercing into
## character vector
citiesPop1 <- citiesPop %>% select(citystate, Per100000, state, state_avg) %>% rename(city_avg=Per100000)
gathercols <- c("city_avg", "state_avg")
CitiesStatesLong <- tidyr::gather(citiesPop1, city_or_state, per100k, gathercols)
citiesTop20 <- CitiesStatesLong %>% filter(city_or_state=='city_avg') %>% arrange(per100k) %>% top_n(20, wt=per100k)
Top20names <- citiesTop20$citystate
CitiesStatesLong <- CitiesStatesLong[CitiesStatesLong$citystate %in% Top20names,]
ggplot(CitiesStatesLong, aes(x=factor(citystate), y=per100k, fill=city_or_state)) +
        geom_bar(stat="identity", position = position_dodge2(reverse=TRUE, padding=0.1)) + coord_flip() +
        scale_fill_manual(values = c("state_avg"="yellow", "city_avg"="red")) +
        scale_x_discrete(limits=Top20names) + labs(y='Incidents per 100,000 inhabitants', x="")

As you can see, the city averages of the cities with high incident numbers are much higher than their associated state averages. At the bottom of this list (not displayed) a few cities have lower numbers than the state , but generally cities are more dangerous than the country side.

0.4.4 Incidents with highest numbers of victims

Below, we are displaying the 10 incidents with the highest numbers of victims. Although any single victim is one too many, the shooting in Las Vegas was by far the worst incident with over 470 victims.

Top10 <- Gun_Violence %>% select(incident_id, date, n_killed, n_injured, victims, location_description, city_or_county, state, latitude, longitude) %>% rename(Incident_Id=incident_id, Date=date, Killed=n_killed, Injured=n_injured, Victims=victims, Location=location_description, City=city_or_county) %>%
         arrange(desc(Victims)) %>% top_n(n=13, wt=Victims)
kable(Top10 %>% select(-longitude, -latitude))
Incident_Id Date Killed Injured Victims Location City state
577157 2016-06-12 50 53 103 Pulse Orlando Florida
980577 2017-11-05 27 20 47 First Baptist Church Sutherland Springs Texas
456893 2015-12-02 16 19 35 Inland Regional Center San Bernardino California
1049217 2018-02-14 17 17 34 Marjory Stoneman Douglas High School Pompano Beach (Parkland) Florida
341622 2015-05-17 9 18 27 Twin Peaks Restaurant Waco Texas
879953 2017-07-01 0 25 25 Power Ultra Lounge Little Rock Arkansas
611479 2016-07-25 2 19 21 Club Blu Fort Myers Florida
121031 2014-04-02 4 16 20 NA Fort Hood Texas
486209 2013-05-12 0 19 19 NA New Orleans Louisiana
423223 2015-10-01 10 9 19 Umpqua Community College Roseburg Oregon
493842 2013-11-09 2 16 18 NA Cypress Texas
511436 2016-02-25 4 14 18 Excel Industries Hesston Kansas
987611 2017-11-14 6 12 18 Rancho Tehama Elementary School Corning California

0.4.4.1 An interactive map of the incidents with highest numbers of victims

TopMap <- Top10 %>% select(latitude, longitude, Victims, City, Location)

labels <- paste0("<strong>City: </strong>", TopMap$City, 
                 "<br><strong>Location: </strong>", TopMap$Location,
                 "<br><strong>Victims </strong>", TopMap$Victims) %>% lapply(htmltools::HTML)

leaflet(TopMap) %>%
        setView(lng=-96, lat=37.8, zoom=4) %>%
        addTiles() %>%
        addProviderTiles("CartoDB.Positron") %>%
        addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(Victims), label = labels)

0.5 Conclusions