Introduction

This is an easy way to share our exploratory data analysis for the Kaggle Competition - https://www.kaggle.com/c/sf-crime.

Data Load

Normally I’d load the data in from the website in the script, but it’s not working so I’ll just assume that it’s in the working directory.

Note: I experimented with setting StringsAsFactors = FALSE, but I ran into some crashes, so am sticking with the default.

train <- read.csv("train.csv")
#test <- read.csv("test.csv")

Data Cleaning

Description and Resolution fields do not exist in the test dataset. Leave them in for exploration to see whether we can find any patterns in the remaining fields which could be good features. We’ll remove them from dataset before fitting a model.

summary(train[train$Y == 90,])
##                  Dates              Category 
##  2003-06-24 18:35:00: 2   OTHER OFFENSES:17  
##  2003-06-25 07:30:00: 2   LARCENY/THEFT :15  
##  2004-03-17 01:16:00: 2   VEHICLE THEFT : 9  
##  2004-06-24 00:03:00: 2   WARRANTS      : 8  
##  2005-08-13 17:00:00: 2   ASSAULT       : 4  
##  2005-08-25 09:26:00: 2   NON-CRIMINAL  : 4  
##  (Other)            :55   (Other)       :10  
##                                   Descript      DayOfWeek       PdDistrict
##  GRAND THEFT FROM LOCKED AUTO         : 9   Friday   :11   TENDERLOIN:15  
##  DRIVERS LICENSE, SUSPENDED OR REVOKED: 6   Monday   :11   NORTHERN  :12  
##  STOLEN AUTOMOBILE                    : 4   Saturday : 6   BAYVIEW   : 9  
##  VEHICLE, RECOVERED, AUTO             : 4   Sunday   : 6   SOUTHERN  : 8  
##  ENROUTE TO OUTSIDE JURISDICTION      : 3   Thursday :11   RICHMOND  : 7  
##  MISCELLANEOUS INVESTIGATION          : 3   Tuesday  :11   CENTRAL   : 5  
##  (Other)                              :38   Wednesday:11   (Other)   :11  
##                             Resolution
##  NONE                            :33  
##  ARREST, BOOKED                  :19  
##  ARREST, CITED                   :10  
##  COMPLAINANT REFUSES TO PROSECUTE: 2  
##  JUVENILE BOOKED                 : 1  
##  PSYCHOPATHIC CASE               : 1  
##  (Other)                         : 1  
##                             Address         X                Y     
##  AUSTIN ST / LARKIN ST          : 6   Min.   :-120.5   Min.   :90  
##  LARKIN ST / AUSTIN ST          : 6   1st Qu.:-120.5   1st Qu.:90  
##  5THSTNORTH ST / EDDY ST        : 3   Median :-120.5   Median :90  
##  7THSTNORTH ST / MCALLISTER ST  : 3   Mean   :-120.5   Mean   :90  
##  BRANNAN ST / 1ST ST            : 3   3rd Qu.:-120.5   3rd Qu.:90  
##  AVENUE OF THE PALMS / EUCLID AV: 2   Max.   :-120.5   Max.   :90  
##  (Other)                        :44
train2 <- train[train$Y < 90,] 

Feature Engineering

Time and Date

Lets split the date into three features - Year, Month and Hour. This will allow us to ignore the date field.

DayOfWeek can also be simplified from 7 levels by creating Weekend as a boolean. Saturday and Sunday would be true.

One possible alternative would be TimeOfDay. Splitting hour into six hour segments:

  • Midnight - 6am: Overnight
  • 6am - Midday: Morning
  • Midday - 6pm: Afternoon
  • 6pm - Midnight: Evening
library(lubridate)
train2$Year <- year(train2$Dates)
train2$Month <- month(train2$Dates)
train2$Hour <- hour(train2$Dates)

train2$Weekend <- train2$DayOfWeek %in% c("Saturday", "Sunday")

train2$TimeOfDay <- cut(train2$Hour, 
                        c(0, 6, 12, 18, 24), 
                        labels = c("Overnight", "Morning", "Afternoon", "Evening"),
                        right = FALSE)

A quick inspection of totals in each of these categories shows that splitting at hours 1, 7, 13 and 19 might be more representative. This is a fine detail that we can investigate later if time permits.

table(train2$Hour, train2$TimeOfDay)
##     
##      Overnight Morning Afternoon Evening
##   0      44858       0         0       0
##   1      26168       0         0       0
##   2      22296       0         0       0
##   3      14013       0         0       0
##   4       9863       0         0       0
##   5       8637       0         0       0
##   6          0   13133         0       0
##   7          0   22046         0       0
##   8          0   32896         0       0
##   9          0   35550         0       0
##   10         0   37804         0       0
##   11         0   38371         0       0
##   12         0       0     51931       0
##   13         0       0     43143       0
##   14         0       0     44423       0
##   15         0       0     48056       0
##   16         0       0     50133       0
##   17         0       0     53548       0
##   18         0       0         0   55095
##   19         0       0         0   49472
##   20         0       0         0   44689
##   21         0       0         0   43659
##   22         0       0         0   45741
##   23         0       0         0   42457

Address

One simple approach would be to categorise addresses by Block or Corner. Output below shows that all addresses are either block or corner.

train2$Block <- grepl("Block of", train2$Address)

# Check that all other rows are corners
block <- sum(train2$Block)
corner <- sum(grepl("/", train2$Address))
block + corner
## [1] 877982
nrow(train2)
## [1] 877982

Subset of data

To enable quick plotting of trends or experimentation with different models, create a subset of the data, removing the rows which we won’t use in the model.

set.seed(42)
sample.rows <- sample(1:nrow(train2), 5000)
train3 <- train2[sample.rows,]
train3$Dates <- NULL
train3$Descript <- NULL
train3$Resolution <- NULL
train3$Address <- NULL

Exploratory Data Analysis

To begin with we’ll order the categories, with the most frequent first. Then create summary dataframes and charts for the proportion of each category, by each address and date-time factor. We’ll calculate a ratio to highlight trends in the data.

Incident Category

suppressMessages(library(dplyr))
library(ggplot2)

total.category <- table(train2$Category)
order.category <- order(total.category, decreasing = TRUE)
train2$Category <- factor(train2$Category, levels = names(total.category[order.category]))

ggplot(train2, aes(Category)) +
  geom_histogram(aes(fill = Category)) +
  scale_y_continuous() +
  scale_x_discrete(limits = levels(train2$Category)[39:1]) +
  theme(legend.position="none", axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

Incident Category by Hour

Inspection of the plots below shows that there is significant variation in the type of crime depending on the time. Therefore we will not group the time of day into larger blocks of time.

ratio.Hour <- train2 %>%
  group_by(Category, Hour) %>%
  summarize(count = n()) %>%
  group_by(Hour) %>%
  mutate (hc = sum(count)) %>%
  ungroup() %>%
  mutate (ratio = count / hc)

ggplot(ratio.Hour, aes(Hour, ratio)) + 
  geom_line(aes(colour = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  scale_x_continuous(breaks=c(0, 6, 12, 18, 24)) +
  expand_limits(y = 0) + 
  theme(legend.position="none")

Incident Category by Day

Inspection of the plots below indicate that there isn’t as much variation as in Incident Category by Hour, but there can still be up to 25% variation in the top 15 crimes. To reduce the number of variables the obvious alternative is a two level factor with “Weekday” and “Weekend”. This might be a reasonable approximation, but isn’t clear cut. Let’s start with a 7 level factor, and try a 2 level factor as a possible improvement to the model.

ratio.Day <- train2 %>%
  group_by(Category, DayOfWeek) %>%
  summarize(count = n()) %>%
  group_by(DayOfWeek) %>%
  mutate(dc = sum(count)) %>%
  ungroup() %>%
  mutate(ratio = count / dc) 

ratio.Day$DayOfWeek <- factor(ratio.Day$DayOfWeek,
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday", "Sunday"),
                              labels = c("M", "Tu", "W", "Th", "F", "Sa", "Su"),
                              ordered = TRUE)

ggplot(ratio.Day, aes(DayOfWeek, ratio)) + 
  geom_line(aes(colour = Category, group = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  expand_limits(y = 0) + 
  theme(legend.position="none")

Incident Category by Month

The differences between the monthly totals are marginal for the top crimes. Suggest dropping this factor from the model.

ratio.Month <- train2 %>%
  group_by(Category, Month) %>%
  summarize(count = n()) %>%
  group_by(Month) %>%
  mutate(dc = sum(count)) %>%
  ungroup() %>%
  mutate(ratio = count / dc) 

ggplot(ratio.Month, aes(Month, ratio)) + 
  geom_line(aes(colour = Category, group = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  scale_x_continuous(breaks=c(1, 4, 7, 10), 
                     minor_breaks = c(2, 3, 5, 6, 8, 9, 11, 12),
                     labels = c("Jan", "Apr", "Jul", "Oct")) +
  expand_limits(y = 0) + 
  theme(legend.position="none")

Incident Category by Year

Inspection of the plots below indicate that year is an important factor for use in the model.

ratio.Year <- train2 %>%
  group_by(Category, Year) %>%
  summarize(count = n()) %>%
  group_by(Year) %>%
  mutate(dc = sum(count)) %>%
  ungroup() %>%
  mutate(ratio = count / dc) 

ggplot(ratio.Year, aes(Year, ratio)) + 
  geom_line(aes(colour = Category, group = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  scale_x_continuous(breaks=c(2003, 2007, 2011, 2015)) +  
  expand_limits(y = 0) + 
  theme(legend.position="none")

Incident Category by PdDistrict

Inspection of the plots below indicate that PdDistrict is a valuable factor in the model.

ratio.PdDistrict <- train2 %>%
  group_by(Category, PdDistrict) %>%
  summarize(count = n()) %>%
  group_by(PdDistrict) %>%
  mutate(dc = sum(count)) %>%
  ungroup() %>%
  mutate(ratio = count / dc) 

ggplot(ratio.PdDistrict, aes(PdDistrict, ratio)) + 
  geom_line(aes(colour = Category, group = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  expand_limits(y = 0) + 
  theme(legend.position="none", axis.text.x = element_text(angle = 45, hjust = 1))

Incident Category by Block

Inspection of the plots below indicate that there is a very significant difference between block and corner. This factor should be retained in the model.

ratio.Block <- train2 %>%
  group_by(Category, Block) %>%
  summarize(count = n()) %>%
  group_by(Block) %>%
  mutate(dc = sum(count)) %>%
  ungroup() %>%
  mutate(ratio = count / dc) 

ggplot(ratio.Block, aes(Block, ratio)) + 
  geom_boxplot(aes(colour = Category)) + 
  facet_wrap(~ Category, ncol = 5, scales = "free_y") +
  scale_x_discrete(labels = c("Corner", "Block")) +
  expand_limits(y = 0) + 
  theme(legend.position="none")