1. Overview

This document attempts to use publicly available dataset from DataSF, which includes police incident reports filed by officers and by individuals through self-service online reporting for non-emergency cases in San Francisco from 2018 to present, to uncover any insights using Exploratory Data Analysis (EDA).

2. Data Wrangling

2.1. Install and Load Packages

packages <- c('tidyverse', 'DT', 'plotly', 'lubridate', 'zoo', 'plyr', 'sf','tmap')

for (p in packages){
  if (!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2. Read in the data

data <- read_csv("data/Police_Department_Incident_Reports__2018_to_Present.csv")

Take a look at the data types.

glimpse(data)

## Rows: 336,517
## Columns: 36
## $ `Incident Datetime`                                    <chr> "2019/05/01 01…
## $ `Incident Date`                                        <date> 2019-05-01, 2…
## $ `Incident Time`                                        <time> 01:00:00, 07:…
## $ `Incident Year`                                        <dbl> 2019, 2019, 20…
## $ `Incident Day of Week`                                 <chr> "Wednesday", "…
## $ `Report Datetime`                                      <chr> "2019/06/12 08…
## $ `Row ID`                                               <dbl> 81097515200, 8…
## $ `Incident ID`                                          <dbl> 810975, 814655…
## $ `Incident Number`                                      <chr> "190424067", "…
## $ `CAD Number`                                           <dbl> 191634131, 191…
## $ `Report Type Code`                                     <chr> "II", "II", "I…
## $ `Report Type Description`                              <chr> "Initial", "In…
## $ `Filed Online`                                         <lgl> NA, NA, NA, NA…
## $ `Incident Code`                                        <chr> "15200", "6402…
## $ `Incident Category`                                    <chr> "Offences Agai…
## $ `Incident Subcategory`                                 <chr> "Other", "Othe…
## $ `Incident Description`                                 <chr> "Domestic Viol…
## $ Resolution                                             <chr> "Open or Activ…
## $ Intersection                                           <chr> "40TH AVE \\ I…
## $ CNN                                                    <dbl> 27789000, 2428…
## $ `Police District`                                      <chr> "Taraval", "So…
## $ `Analysis Neighborhood`                                <chr> "Sunset/Parksi…
## $ `Supervisor District`                                  <dbl> 4, 6, 10, 3, 2…
## $ Latitude                                               <dbl> 37.76257, 37.7…
## $ Longitude                                              <dbl> -122.4996, -12…
## $ point                                                  <chr> "(37.762569397…
## $ `SF Find Neighborhoods`                                <dbl> 39, 32, 88, 10…
## $ `Current Police Districts`                             <dbl> 10, 1, 2, 6, 4…
## $ `Current Supervisor Districts`                         <dbl> 7, 10, 9, 3, 6…
## $ `Analysis Neighborhoods`                               <dbl> 35, 34, 1, 6, …
## $ `HSOC Zones as of 2018-06-05`                          <dbl> NA, 1, NA, NA,…
## $ `OWED Public Spaces`                                   <dbl> NA, NA, NA, 18…
## $ `Central Market/Tenderloin Boundary Polygon - Updated` <dbl> NA, 1, NA, NA,…
## $ `Parks Alliance CPSI (27+TL sites)`                    <dbl> NA, NA, NA, NA…
## $ `ESNCAG - Boundary File`                               <dbl> NA, NA, NA, NA…
## $ `Areas of Vulnerability, 2016`                         <dbl> 1, 2, 2, 2, 1,…

Some data preparation

Filter for year 2018 and 2019.
Filter out “Out of SF” under Police District column and “NA” under Incident Category.
Change data types.

char_cols <- c("Row ID", "Incident ID")
data_1 <- data %>% filter(`Incident Year` %in% c(2018, 2019),
                          `Police District`!="Out of SF",
                          !(`Incident Category` %in% c('NA', 'Case Closure'))) %>%
                    mutate_at(char_cols, as.character)

Given that there might be multiple Incident IDs in the dataset that represents the one to many relationship when they exist, we would not want to count the duplicate incident reports. It is also noted that for Incident IDs with duplicates, the report datetime are the same for all the duplicated rows. As such, we will remove duplicates by ‘Incident ID’.

data_2 <- data_1 %>% distinct(`Incident ID`,`Incident Number`, .keep_all= TRUE)

To explore the dataset further, we can look at the number of incidents by “Incident Year” and “Incident Category”.

data_cat <- data_2 %>% group_by(`Incident Year`,`Incident Category`) %>%
                        dplyr::summarise(Incidents = dplyr::n()) %>%
                        filter(!is.na(`Incident Category`))

Check the levels of “Incident Category”.

data_cat$`Incident Category` <- as.factor(data_cat$`Incident Category`)
levels(data_cat$`Incident Category`)

##  [1] "Arson"                                     
##  [2] "Assault"                                   
##  [3] "Burglary"                                  
##  [4] "Civil Sidewalks"                           
##  [5] "Courtesy Report"                           
##  [6] "Disorderly Conduct"                        
##  [7] "Drug Offense"                              
##  [8] "Drug Violation"                            
##  [9] "Embezzlement"                              
## [10] "Family Offense"                            
## [11] "Fire Report"                               
## [12] "Forgery And Counterfeiting"                
## [13] "Fraud"                                     
## [14] "Gambling"                                  
## [15] "Homicide"                                  
## [16] "Human Trafficking (A), Commercial Sex Acts"
## [17] "Human Trafficking, Commercial Sex Acts"    
## [18] "Larceny Theft"                             
## [19] "Liquor Laws"                               
## [20] "Lost Property"                             
## [21] "Malicious Mischief"                        
## [22] "Miscellaneous Investigation"               
## [23] "Missing Person"                            
## [24] "Motor Vehicle Theft"                       
## [25] "Motor Vehicle Theft?"                      
## [26] "Non-Criminal"                              
## [27] "Offences Against The Family And Children"  
## [28] "Other"                                     
## [29] "Other Miscellaneous"                       
## [30] "Other Offenses"                            
## [31] "Prostitution"                              
## [32] "Rape"                                      
## [33] "Recovered Vehicle"                         
## [34] "Robbery"                                   
## [35] "Sex Offense"                               
## [36] "Stolen Property"                           
## [37] "Suicide"                                   
## [38] "Suspicious"                                
## [39] "Suspicious Occ"                            
## [40] "Traffic Collision"                         
## [41] "Traffic Violation Arrest"                  
## [42] "Vandalism"                                 
## [43] "Vehicle Impounded"                         
## [44] "Vehicle Misplaced"                         
## [45] "Warrant"                                   
## [46] "Weapons Carrying Etc"                      
## [47] "Weapons Offence"                           
## [48] "Weapons Offense"

Looking at the levels of “Incident Category”, there are some categories that can be grouped together.

For example:

“Weapons Offence”, “Weapons Offense” and “Weapons Carrying Etc”.
“Human Trafficking, Commercial Sex Acts” and “Human Trafficking (A), Commercial Sex Acts”.
“Suspicious” and “Suspicious Occ”.
“Motor Vehicle Theft” and “Motor Vehicle Theft?”.
“Other”, “Other Miscellaneous” and “Other Offenses”.

After we have group and sum these categories, we can plot a basic horizontal bar chart to see the count of Incidents for each “Incident Category” as seen below.

Here is a pie chart showing the proportion of Incident Report cases that were filed online. Most incident cases were not filed online. Also, there are negligible changes in proportion between 2018 and 2019.

library(ggpubr)
fig <- ggarrange(p1, p2, nrow=1, common.legend = TRUE, legend="right")
annotate_figure(
  fig, top=text_grob("Proportion of San Francisco Police Department Incident Reports Filed online\n", 
  color = "black", face = "bold", size = 12)
                )

3. Viz 1: Calendar Heatmap

Calculate the number of incidents for each “Incident Date”.

data_calendar <- data_2 %>% group_by(`Incident Date`) %>%
                            dplyr::summarise(Incidents = dplyr::n())

Plot the Calendar Heatmap.

ggplotly(ggplot(data_calendar, aes(Week_of_month, Day_of_week, fill = Incidents)) + 
        geom_tile(colour = "white") + 
        facet_grid(year(data_calendar$Group)~monthf) + 
        scale_fill_gradient(low="#ffffff", high="#0052cc") + 
        labs(fill = "Incidents", x="", y="") +
        ggtitle("Calendar Heatmap of San Francisco Police Department Incident Reports 2018 - 2019") +
        theme_bw(base_size=10)+
        theme(plot.title = element_text(family = 'Arial', 
                                        size=12, face="bold", hjust=0.5),
              axis.text = element_text(family = 'Arial'),
              legend.text = element_text(family = 'Arial'),
              legend.title=element_text(family = 'Arial'),
              strip.text = element_text(family = 'Arial', size = 10),
              panel.border=element_blank(),
              axis.ticks=element_blank(),
              strip.background=element_blank(),
              legend.position="top",
              legend.justification="right",
              legend.direction="horizontal",
              legend.key.size=unit(0.3,"cm"),
              legend.spacing.x=unit(0.1,"cm"))) %>% 
        layout(annotations = c(list(list(text = "Week of Month",
                                         x = 0.5,
                                         y = -0.1,
                                         xref = "paper",
                                         yref = "paper",
                                         showarrow = F,
                                         font=list(family='Arial', size=13))),
                                list(list(text = "Day of Week",
                                          x = -0.1,
                                          y = 0.51,
                                          xref = "paper",
                                          yref = "paper",
                                          textangle = -90,
                                          showarrow = F,
                                          font=list(family='Arial', size=13)))))

From the Calendar Heatmap, we see that there is generally a higher number of incidents in July and August for both years compared to other months.

We can also narrow down to a particular Incident Category, for instance “Larceny Theft”.

data_calendar_larceny <- data_2 %>% group_by(`Incident Date`, `Incident Category`) %>%
                            filter(`Incident Category`=='Larceny Theft') %>%
                            dplyr::summarise(Incidents = dplyr::n())

4. Viz 2: Choropleth Map

Download the geometry information of San Fransico Neighbourhoods from DataSF.

map_sf <- st_read(dsn = "data/Analysis_Neighborhoods/Analysis_Neighborhoods.shp")

5. Overall Key Insights

Majority of San Francisco’s Police Department Incident reports are not filed online in 2018-2019 by officers and by individuals.
Larceny Theft contributes to the highest count in San Francisco’s Police Department Incident reports in 2018 and 2019.
The distribution of larceny theft cases in across San Francisco neighbourhoods is consistent in 2018 and 2019 and Financial District/South Beach has the highest number of Larcency Theft in San Francisco in 2018 and 2019.

Deciphering San Francisco Police Department Incident Reports

Author: Cherie Wong

Date: 09 Jul 2020