Digging into the Seattle and San Francisco datasets released

We have been provided with datasets from cities of Seattle and San Francisco reporting criminal incidents in summer 2014.
This paper is about analyzing those datasets, visualazing and comparing patterns accross those two cities.
This analysis allowed us to infer that :

We will go trhough all those statements in detail in this paper.

If you want to see the report without code, it’s all available at this link : report without code or at the link under each figure.

Datasets overview

First, we load the datasets, recorded in a .csv format, and see what we are provided with.

sea = read.csv('seattle_incidents_summer_2014.csv');
dim(sea) # 32779 x 19
## [1] 32779    19
str(sea)
## 'data.frame':    32779 obs. of  19 variables:
##  $ RMS.CDW.ID                       : int  483839 481252 481375 481690 478198 480485 470170 465137 461710 456091 ...
##  $ General.Offense.Number           : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Offense.Code                     : Factor w/ 92 levels "1201","1202",..: 19 42 31 37 32 30 30 38 42 39 ...
##  $ Offense.Code.Extension           : int  0 0 0 0 3 0 0 0 0 1 ...
##  $ Offense.Type                     : Factor w/ 147 levels "[INC - CASE DC USE ONLY]",..: 15 43 121 20 122 117 117 42 43 40 ...
##  $ Summary.Offense.Code             : Factor w/ 26 levels "1200","1300",..: 3 8 5 7 5 5 5 8 8 8 ...
##  $ Summarized.Offense.Description   : Factor w/ 48 levels "[INC - CASE DC USE ONLY]",..: 6 20 26 9 29 29 29 20 20 20 ...
##  $ Date.Reported                    : Factor w/ 15766 levels "01/02/2015 11:12:00 AM",..: 4720 3991 3617 3468 3304 3125 629 39 38 36 ...
##  $ Occurred.Date.or.Date.Range.Start: Factor w/ 11557 levels "06/01/2014 01:00:00 AM",..: 3526 116 11508 2393 115 2275 116 10420 7872 6327 ...
##  $ Occurred.Date.Range.End          : Factor w/ 5935 levels "","01/18/2015 09:00:00 AM",..: 1801 1509 1 1 5913 2533 60 10 1 1 ...
##  $ Hundred.Block.Location           : Factor w/ 7924 levels "1 AV / BATTERY ST",..: 5508 2248 6086 5522 5840 3305 2432 2187 557 3091 ...
##  $ District.Sector                  : Factor w/ 19 levels "","99","B","C",..: 9 4 7 12 9 16 15 3 5 5 ...
##  $ Zone.Beat                        : Factor w/ 53 levels "","99","B1","B2",..: 22 7 17 31 23 44 40 3 11 9 ...
##  $ Census.Tract.2000                : num  2900 6300 11301 8200 2700 ...
##  $ Longitude                        : num  -122 -122 -122 -122 -122 ...
##  $ Latitude                         : num  47.7 47.6 47.5 47.6 47.7 ...
##  $ Location                         : Factor w/ 12970 levels "(0.0, 0.0)","(47.465062973, -122.337868218)",..: 10611 7628 872 5213 10823 2191 7640 9804 6525 6000 ...
##  $ Month                            : int  6 6 8 6 6 6 6 8 8 7 ...
##  $ Year                             : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
names(sea)
##  [1] "RMS.CDW.ID"                       
##  [2] "General.Offense.Number"           
##  [3] "Offense.Code"                     
##  [4] "Offense.Code.Extension"           
##  [5] "Offense.Type"                     
##  [6] "Summary.Offense.Code"             
##  [7] "Summarized.Offense.Description"   
##  [8] "Date.Reported"                    
##  [9] "Occurred.Date.or.Date.Range.Start"
## [10] "Occurred.Date.Range.End"          
## [11] "Hundred.Block.Location"           
## [12] "District.Sector"                  
## [13] "Zone.Beat"                        
## [14] "Census.Tract.2000"                
## [15] "Longitude"                        
## [16] "Latitude"                         
## [17] "Location"                         
## [18] "Month"                            
## [19] "Year"
san = read.csv('sanfrancisco_incidents_summer_2014.csv');
dim(san) # 28993 x 13
## [1] 28993    13
str(san)
## 'data.frame':    28993 obs. of  13 variables:
##  $ IncidntNum: int  140734311 140736317 146177923 146177531 140734220 140734349 140734349 140734349 140738147 140734258 ...
##  $ Category  : Factor w/ 34 levels "ARSON","ASSAULT",..: 1 20 16 16 20 7 7 6 21 30 ...
##  $ Descript  : Factor w/ 368 levels "ABANDONMENT OF CHILD",..: 15 179 143 143 132 247 239 93 107 347 ...
##  $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Date      : Factor w/ 92 levels "06/01/2014","06/02/2014",..: 92 92 92 92 92 92 92 92 92 92 ...
##  $ Time      : Factor w/ 1379 levels "00:01","00:02",..: 1370 1365 1351 1351 1344 1334 1334 1334 1321 1321 ...
##  $ PdDistrict: Factor w/ 10 levels "BAYVIEW","CENTRAL",..: 1 4 8 7 7 8 8 8 3 2 ...
##  $ Resolution: Factor w/ 16 levels "ARREST, BOOKED",..: 12 12 12 12 12 1 1 1 12 2 ...
##  $ Address   : Factor w/ 8055 levels "0 Block of 10TH ST",..: 6843 4022 1098 6111 5096 1263 1263 1263 1575 5236 ...
##  $ X         : num  -122 -122 -122 -122 -123 ...
##  $ Y         : num  37.7 37.8 37.8 37.8 37.8 ...
##  $ Location  : Factor w/ 8732 levels "(37.7080829769301, -122.419241455854)",..: 1970 3730 5834 4802 4777 4993 4993 4993 2543 7598 ...
##  $ PdId      : num  1.41e+13 1.41e+13 1.46e+13 1.46e+13 1.41e+13 ...
names(san)
##  [1] "IncidntNum" "Category"   "Descript"   "DayOfWeek"  "Date"      
##  [6] "Time"       "PdDistrict" "Resolution" "Address"    "X"         
## [11] "Y"          "Location"   "PdId"

What we can observe from those values :

First, let’s transform those variables and create a single dataframe so that to compare values for those two cities.

# SEATTLE
# Let's first retain only the variables we are interested in
sea_new <- sea[,c("Summarized.Offense.Description", "Occurred.Date.or.Date.Range.Start", "Year", "Month")]
# rename the columns for clarity
names(sea_new) <- c("offenseDescription", "dateRaw", "Year", "Month")
# transform features so they can be used to perform analysis
sea_new$offenseDescription <- as.character(sea_new$offenseDescription)
sea_new$dateRaw <- as.character(sea_new$dateRaw)
sea_new$dateRaw <- as.POSIXct(sea_new$dateRaw, format = "%m/%d/%Y %H:%M:%S")
sea_new$Date <- as.Date(as.POSIXct(sea_new$dateRaw))

# SAN FRANCISCO
# Let's first retain only the variables we are interested in
san_new <- san[,c("Category", "Date", "Time")]
# rename the columns for clarity
names(san_new) <- c("offenseDescription", "dateRaw", "Time")
# transform features so they can be used to perform analysis
san_new$offenseDescription <- as.character(san_new$offenseDescription)
san_new$dateRaw <- as.character(san_new$dateRaw)
san_new$Time <- as.character(san_new$Time)
san_new$dateRaw <- paste(san_new$dateRaw, san_new$Time)
san_new$dateRaw <- as.POSIXct(san_new$dateRaw, format = "%m/%d/%Y %H:%M")
san_new$Date <- as.Date(as.POSIXct(san_new$dateRaw))
san_new$Year <- as.integer(format(san_new$dateRaw, "%Y"))
san_new$Month <- as.integer(format(san_new$dateRaw, "%m"))
san_new <- san_new[, c("offenseDescription", "dateRaw", "Year", "Month", "Date")]

In order to create this signle dataframe, we have to convert all the crimes descriptions in each dataset so that both cities present the same categories.
To do so, we have collected the National Incident-Based Reporting System (NIBRS) data, which categorize crimes into universal categories.
This dataset can be found here : http://data.denvergov.org/download/gis/crime/csv/offense_codes.csv
The NIBRS data contains 15 high categories of crimes, and 294 sub-categories.
As we can observe from the Seattle and the San Francisco datasets :

nibrs <- read.csv("NIBRS.csv")

# SEATTLE
for (i in 1:nrow(sea_new)){
    sea_new$offenseDescription[i] <- switch(sea_new$offenseDescription[i], 
           "BURGLARY" = "Burglary",
           "FRAUD" = "White Collar Crime",
           "MAIL THEFT" = "Larceny",
           "COUNTERFEIT" = "White Collar Crime",
           "OTHER PROPERTY" = "All Other Crimes", 
           "EMBEZZLE" = "White Collar Crime",
           "CAR PROWL" = "All Other Crimes",
           "THREATS" = "Public Disorder",
           "PROPERTY DAMAGE" = "Public Disorder",
           "LOST PROPERTY" = "All Other Crimes",
           "FORGERY" = "White Collar Crime",
           "VEHICLE THEFT" = "Auto Theft",
           "BURGLARY-SECURE PARKING-RES" = "Burglary",
           "PICKPOCKET" = "Robbery",
           "BIKE THEFT" = "Larceny",
           "NARCOTICS" = "Drug & Alcohol",
           "DISPUTE" = "Public Disorder",
           "ASSAULT" = "Aggravated Assault",
           "STOLEN PROPERTY" = "All Other Crimes",
           "WARRANT ARREST" = "All Other Crimes",
           "TRAFFIC" = "Traffic Accident",
           "SHOPLIFTING" = "Larceny",
           "DISTURBANCE" = "Public Disorder",
           "VIOLATION OF COURT ORDER" = "All Other Crimes",
           "ILLEGAL DUMPING" = "All Other Crimes",
           "PROSTITUTION" = "Public Disorder",
           "ROBBERY" = "Robbery",
           "TRESPASS" = "All Other Crimes",
           "LIQUOR VIOLATION" = "Drug & Alcohol",
           "BIAS INCIDENT" = "All Other Crimes",
           "THEFT OF SERVICES" = "Larceny",
           "HOMICIDE" = "Murder",
           "RECOVERED PROPERTY" = "All Other Crimes",
           "OBSTRUCT" = "All Other Crimes",
           "RECKLESS BURNING" = "All Other Crimes",
           "INJURY" = "Aggravated Assault",
           "WEAPON" = "Aggravated Assault",
           "PURSE SNATCH" = "Robbery",
           "FALSE REPORT" = "All Other Crimes",
           "ELUDING" = "All Other Crimes",
           "ANIMAL COMPLAINT" = "All Other Crimes",
           "PORNOGRAPHY" = "All Other Crimes",
           "DUI" = "Traffic Accident",
           "FIREWORK" = "All Other Crimes",
           "[INC - CASE DC USE ONLY]" = "All Other Crimes",
           "PUBLIC NUISANCE" = "Public Disorder",
           "DISORDERLY CONDUCT" = "Public Disorder",
           "ESCAPE" = "All Other Crimes"
           )
}

# SAN FRANCISCO
for (i in 1:nrow(san_new)){
    san_new$offenseDescription[i] <- switch(san_new$offenseDescription[i], 
           "ARSON" = "Arson",
           "NON-CRIMINAL" = "All Other Crimes",
           "LARCENY/THEFT" = "Larceny",
           "DRUG/NARCOTIC" = "Drug & Alcohol",
           "DRIVING UNDER THE INFLUENCE" = "Drug & Alcohol", 
           "OTHER OFFENSES" = "All Other Crimes",
           "TRESPASS" = "All Other Crimes",
           "VEHICLE THEFT" = "Auto Theft",
           "ASSAULT" = "Aggravated Assault",
           "FRAUD" = "White Collar Crime",
           "SUSPICIOUS OCC" = "All Other Crimes",
           "SECONDARY CODES" = "All Other Crimes",
           "WEAPON LAWS" = "Aggravated Assault",
           "MISSING PERSON" = "All Other Crimes",
           "WARRANTS" = "All Other Crimes",
           "ROBBERY" = "Robbery",
           "DRUNKENNESS" = "Drug & Alcohol",
           "PROSTITUTION" = "Public Disorder",
           "LIQUOR LAWS" = "Drug & Alcohol",
           "KIDNAPPING" = "All Other Crimes",
           "FAMILY OFFENSES" = "Aggravated Assault",
           "LOITERING" = "Public Disorder",
           "DISORDERLY CONDUCT" = "Public Disorder",
           "FORGERY/COUNTERFEITING" = "White Collar Crime",
           "EMBEZZLEMENT" = "White Collar Crime",
           "BURGLARY" = "Burglary",
           "SUICIDE" = "All Other Crimes",
           "VANDALISM" = "All Other Crimes",
           "STOLEN PROPERTY" = "All Other Crimes",
           "RUNAWAY" = "All Other Crimes",
           "GAMBLING" = "All Other Crimes",
           "EXTORTION" = "All Other Crimes",
           "PORNOGRAPHY/OBSCENE MAT" = "All Other Crimes",
           "BRIBERY" = "All Other Crimes"
           )
}

# We now create a column "city" in each dataset
sea_new$City <- "Seattle"
san_new$City <- "San Francisco"
# Now we can create a new whole dataset binding rows
alldata <- rbind(sea_new, san_new)
alldata$Date <- as.Date(as.POSIXct(alldata$dateRaw, tz="CET"), tz="CET")

Let’s make a plot of this whole new dataset.

ofDate <- as.data.frame(table(alldata$Date, alldata$City))
colnames(ofDate) <- c("Date", "City", "Freq")
ofDate$Date <- as.Date(ofDate$Date)
library(ggplot2)

g <- ggplot(data = ofDate, aes(x = Date,y = Freq, colour = City)) + geom_line()
g <- g + xlab("Date") + ylab("Number of offenses") + ggtitle("Number of offenses in summer 2014 \n in Seattle and San Francisco")
g

report without code

This plot seems to tell us that the number of offenses is higher in Seattle than in San Francisco. Indeed, we could have noticed before that the number of records in the Seattle dataset (32 779) is higher than in the San Francisco dataset (28 993). This trend may be just due to the fact that the number of people living in Seattle is greater than the ones living in San Francisco.
Let’s check.

So if we plot again the evolution of the number of offenses over time, but this time taking into account the population, and so plotting this time the number of offenses per capita, here is what we get.

pop_sea <- 668342
pop_san <- 805235

ofDate$FreqperC <- ifelse(ofDate$City == "Seattle", ofDate$Freq/pop_sea, ofDate$Freq/pop_san)
library(ggplot2)

g <- ggplot(data = ofDate, aes(x = Date,y = FreqperC, colour = City)) + geom_line() + geom_smooth()
g <- g + xlab("Date") + ylab("Number of offenses per Capita") + ggtitle("Number of offenses per capita in summer 2014 \n in Seattle and San Francisco")
g

report without code

Let’s notice that we have here plotted a smooth curve representing the trend for each city.
This plot allows us to understand that there are more offenses happening in summer 2014 in Seatlle than in San Francisco, this not being due to the demographics. Indeed, Seattle counts less inhabitants than San Francisco, that fact deeping the gap between the two cities in terms of number of offenses.

Let’s dig deeper.

Types of offenses

Now we are going to watch the different types of offenses, trying to understand what kind happens the most. Let’s have alook first at the general distribution, accross both cities.

ofType0 <- as.data.frame(table(alldata$offenseDescription))
colnames(ofType0) <- c("offenseType", "Freq")
ofType0$offenseType <- as.character(ofType0$offenseType)
library(ggplot2)

g <- ggplot(data = ofType0, aes(offenseType, Freq)) + geom_bar(stat = "identity", fill = "dark blue")
g <- g + ylab("Number of offenses") + xlab("Type of offenses") + ggtitle("Offenses in summer 2014") + coord_flip()
g

report without code

So, with this plot, we figure out that the category of offenses occuring the most is “All Other Crimes”. That’s not surprising because this category regroups a lot of sub-categories. The second most important one is “Larceny”. Globally, we can print the top 5 most occuring offenses this way.

ofType0 <- ofType0[order(ofType0$Freq, decreasing = TRUE),]
head(ofType0,5)
##          offenseType  Freq
## 2   All Other Crimes 25056
## 7            Larceny 11567
## 1 Aggravated Assault  5472
## 9    Public Disorder  5401
## 4         Auto Theft  5023

Do those categories are the same in both cities?

ofType <- as.data.frame(table(alldata$offenseDescription, alldata$City))
colnames(ofType) <- c("offenseType", "City", "Freq")
ofType$offenseType <- as.character(ofType$offenseType)
ofType <- ofType[order(ofType$Freq, decreasing = TRUE),]
ofType$FreqperC <- ifelse(ofType$City == "Seattle", ofType$Freq/pop_sea, ofType$Freq/pop_san)
library(ggplot2)

g <- ggplot(data = ofType, aes(offenseType, FreqperC, fill = City)) + geom_bar(stat = "identity", position="dodge")
g <- g + xlab("Type of offenses") + ylab("Frequency per Capita") + ggtitle("Number of offenses per capita, per type, \n in summer 2014 in Seattle and in San Francisco") + theme(axis.text.x = element_text(angle = 75, hjust = 1))
g

report without code

Here we see clearly that, appart from the category “All Other Crimes” for which in both cities the frequency is high, trends depend upon cities.
In Seattle, we observe that “Larceny” is far less occuring than in San Francisco. On the opposite, “Public Disorder” is far more occuring in Seattle while in San Francisco it stays relatively marginal.

Time evolution of offenses

We can go further and watch those types of offenses month by month.

ofTypeM <- as.data.frame(table(alldata$offenseDescription, alldata$City, alldata$Month))
colnames(ofTypeM) <- c("offenseType", "City", "Month", "Freq")
ofTypeM$offenseType <- as.character((ofTypeM$offenseType))
ofTypeM$FreqperC <- ifelse(ofTypeM$City == "Seattle", ofTypeM$Freq/pop_sea, ofTypeM$Freq/pop_san)
ofTypeMSe <- subset(ofTypeM, ofTypeM$City == "Seattle")
ofTypeMSa <- subset(ofTypeM, ofTypeM$City == "San Francisco")
require(gridExtra)
library(ggplot2)

g <- ggplot(data = ofTypeMSe, aes(offenseType, FreqperC, fill = Month)) + geom_bar(stat = "identity", position="dodge")
g <- g + xlab("Type of offenses") + ylab("Frequency per capita") + ggtitle("Number of offenses per capita \n by Month in 2014, in Seattle") + theme(axis.text.x = element_text(angle = 75, hjust = 1))

h <- ggplot(data = ofTypeMSa, aes(offenseType, FreqperC, fill = Month)) + geom_bar(stat = "identity", position="dodge")
h <- h + xlab("Type of offenses") + ylab("Frequency per capita") + ggtitle("Number of offenses per capita by Month \n in 2014, in San Francisco") + theme(axis.text.x = element_text(angle = 75, hjust = 1))

grid.arrange(g,h,ncol=2)

report without code

This plot gives us an overview of the offenses’evolution month by month, but it would be much clearer to plot lines. We will here focus on “Larceny” and “Public Disorder” as they are the most different categories between both cities. Let’s try it.

ofTypeM_new <- as.data.frame(table(alldata$offenseDescription, alldata$City, alldata$Date))
colnames(ofTypeM_new) <- c("offenseType", "City", "Date", "Freq")
ofTypeM_new$offenseType <- as.character((ofTypeM_new$offenseType))
ofTypeM_new$Date <- as.Date(ofTypeM_new$Date)
ofTypeM_new <- subset(ofTypeM_new, ofTypeM_new$offenseType == "Larceny" | ofTypeM_new$offenseType == "Public Disorder")
ofTypeM_new$FreqperC <- ifelse(ofTypeM_new$City == "Seattle", ofTypeM_new$Freq/pop_sea, ofTypeM_new$Freq/pop_san)
ofTypeM_newSe <- subset(ofTypeM_new, ofTypeM_new$City == "Seattle")
ofTypeM_newSa <- subset(ofTypeM_new, ofTypeM_new$City == "San Francisco")
library(ggplot2)

g <- ggplot(data = ofTypeM_newSe, aes(Date, FreqperC, colour = offenseType)) + geom_line() + geom_smooth()
g <- g + xlab("Date") + ylab("Frequency per Capita") + ggtitle("Number of offenses per capita \n over time in summer 2014 in Seattle")

h <- ggplot(data = ofTypeM_newSa, aes(Date, FreqperC, colour = offenseType)) + geom_line() + geom_smooth()
h <- g + xlab("Date") + ylab("Frequency per Capita") + ggtitle("Number of offenses per capita \n over time in summer 2014 in San Francisco")

grid.arrange(g,h,ncol=2)

report without code

In this plot, we can observe that there is a speficic trend, the same one in both cities :

There is certainly here more to investigate, which we would do if we had more time.