1. Introduction

Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world’s most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition’s dataset provides nearly 12 years of crime reports from across all of San Francisco’s neighborhoods. Given time and location, you must predict the category of crime that occurred.

We’re also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes.

- Acknowledgements

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset is brought to you by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco.

- Evaluation

Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,

\(logloss= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y^{ij}log(p_{ij})\)

where N is the number of cases in the test set, M is the number of class labels, log is the natural logarithm, yij is 1 if observation i is in class j and 0 otherwise, and pij is the predicted probability that observation i belongs to class j.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with max(min(p,1−10−15),10−15).

2. Setting

2.1. Import Libraries & Load Dataset

- Import Libraries :

suppressPackageStartupMessages({
library(dplyr)
library(readr)
library(lubridate)  # date
library(sqldf)      # sql
library(ggplot2)    # visualization
library(ggthemes)   # ggplot theme
library(data.table) # fread
library(DT)
})

fillColor = "#FFA07A"
fillColor2 = "#F1C40F"

- Load Dataset :

coltypes <-
    list(Dates = col_datetime("%Y-%m-%d %H:%M:%S"))

train <-
    read_csv(file="./input/train.csv",
                  col_types=coltypes)
test <-
    read_csv(file="./input/test.csv",
                  col_types=coltypes) 

- Reshape Data :

train <-
    train %>%
    mutate(Year  = factor(year(Dates), levels=2003:2015),
           Month = factor(month(Dates), levels=1:12),
           Day   = day(Dates),
           Hour  = factor(hour(Dates), levels=0:23),
           dayDate = as.POSIXct(round(Dates, units = "days")),
           DayOfWeek = factor(DayOfWeek, levels=c("Monday",
                                                  "Tuesday",
                                                  "Wednesday",
                                                  "Thursday",
                                                  "Friday",
                                                  "Saturday",
                                                  "Sunday"))
           )
           
#unique(train$Category)

2.2. Peek into the data

- Overview Data :

train_head <- train %>% head(1000)

datatable(train_head, style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))

3. Exploratary Data Analysis

    1. 기초 통계량
    1. 시각화

3.1. 범죄 종류

- 범죄 종류 :
- 총 39개의 범죄가 존재한다.

#length(unique(train$Category))
unique(train$Category)
##  [1] "WARRANTS"                    "OTHER OFFENSES"             
##  [3] "LARCENY/THEFT"               "VEHICLE THEFT"              
##  [5] "VANDALISM"                   "NON-CRIMINAL"               
##  [7] "ROBBERY"                     "ASSAULT"                    
##  [9] "WEAPON LAWS"                 "BURGLARY"                   
## [11] "SUSPICIOUS OCC"              "DRUNKENNESS"                
## [13] "FORGERY/COUNTERFEITING"      "DRUG/NARCOTIC"              
## [15] "STOLEN PROPERTY"             "SECONDARY CODES"            
## [17] "TRESPASS"                    "MISSING PERSON"             
## [19] "FRAUD"                       "KIDNAPPING"                 
## [21] "RUNAWAY"                     "DRIVING UNDER THE INFLUENCE"
## [23] "SEX OFFENSES FORCIBLE"       "PROSTITUTION"               
## [25] "DISORDERLY CONDUCT"          "ARSON"                      
## [27] "FAMILY OFFENSES"             "LIQUOR LAWS"                
## [29] "BRIBERY"                     "EMBEZZLEMENT"               
## [31] "SUICIDE"                     "LOITERING"                  
## [33] "SEX OFFENSES NON FORCIBLE"   "EXTORTION"                  
## [35] "GAMBLING"                    "BAD CHECKS"                 
## [37] "TREA"                        "RECOVERED VEHICLE"          
## [39] "PORNOGRAPHY/OBSCENE MAT"

3.2. 범죄 발생률

- 범죄 발생률 :
- 범죄 발생 빈도를 보면, 강도와 폭행, 약물 관련된 사건 순으로 많이 발생함을 알 수 있다.
- 도시에서 발생하는 범죄의 대다수는 절도, 폭행, 약물에 관련있다는 것을 알 수 있다.
- 상위 5개가 전체 비중의 절반 이상(59.71%)을 차지한다.
- 자세히 살펴보면 범죄가 일어나지 않는 케이스(NON-CRIMINAL)가 3위 (전체의 10%)를 차지한다.
- NON-CRIMINAL를 제외한 상위 4개의 비중은 절반(49.2%)을 차지한다.

summ_ct <- sqldf("
select
  category as category
  , count(category) as freq
from train
group by category
order by count(category) desc
;")

summ_ct
##                       category   freq
## 1                LARCENY/THEFT 174900
## 2               OTHER OFFENSES 126182
## 3                 NON-CRIMINAL  92304
## 4                      ASSAULT  76876
## 5                DRUG/NARCOTIC  53971
## 6                VEHICLE THEFT  53781
## 7                    VANDALISM  44725
## 8                     WARRANTS  42214
## 9                     BURGLARY  36755
## 10              SUSPICIOUS OCC  31414
## 11              MISSING PERSON  25989
## 12                     ROBBERY  23000
## 13                       FRAUD  16679
## 14      FORGERY/COUNTERFEITING  10609
## 15             SECONDARY CODES   9985
## 16                 WEAPON LAWS   8555
## 17                PROSTITUTION   7484
## 18                    TRESPASS   7326
## 19             STOLEN PROPERTY   4540
## 20       SEX OFFENSES FORCIBLE   4388
## 21          DISORDERLY CONDUCT   4320
## 22                 DRUNKENNESS   4280
## 23           RECOVERED VEHICLE   3138
## 24                  KIDNAPPING   2341
## 25 DRIVING UNDER THE INFLUENCE   2268
## 26                     RUNAWAY   1946
## 27                 LIQUOR LAWS   1903
## 28                       ARSON   1513
## 29                   LOITERING   1225
## 30                EMBEZZLEMENT   1166
## 31                     SUICIDE    508
## 32             FAMILY OFFENSES    491
## 33                  BAD CHECKS    406
## 34                     BRIBERY    289
## 35                   EXTORTION    256
## 36   SEX OFFENSES NON FORCIBLE    148
## 37                    GAMBLING    146
## 38     PORNOGRAPHY/OBSCENE MAT     22
## 39                        TREA      6
ct_prob <- sqldf("
select
    category
    , round(100. * c_tot/tot, 2) as prob
from (
    select
        category
        , count(*) as c_tot
    from train
    group by category

    ) as a, (
    select
        count(*) as tot
    from train
    ) as b
order by prob desc
")

ct_prob
##                       category  prob
## 1                LARCENY/THEFT 19.92
## 2               OTHER OFFENSES 14.37
## 3                 NON-CRIMINAL 10.51
## 4                      ASSAULT  8.76
## 5                DRUG/NARCOTIC  6.15
## 6                VEHICLE THEFT  6.13
## 7                    VANDALISM  5.09
## 8                     WARRANTS  4.81
## 9                     BURGLARY  4.19
## 10              SUSPICIOUS OCC  3.58
## 11              MISSING PERSON  2.96
## 12                     ROBBERY  2.62
## 13                       FRAUD  1.90
## 14      FORGERY/COUNTERFEITING  1.21
## 15             SECONDARY CODES  1.14
## 16                 WEAPON LAWS  0.97
## 17                PROSTITUTION  0.85
## 18                    TRESPASS  0.83
## 19             STOLEN PROPERTY  0.52
## 20       SEX OFFENSES FORCIBLE  0.50
## 21          DISORDERLY CONDUCT  0.49
## 22                 DRUNKENNESS  0.49
## 23           RECOVERED VEHICLE  0.36
## 24                  KIDNAPPING  0.27
## 25 DRIVING UNDER THE INFLUENCE  0.26
## 26                 LIQUOR LAWS  0.22
## 27                     RUNAWAY  0.22
## 28                       ARSON  0.17
## 29                   LOITERING  0.14
## 30                EMBEZZLEMENT  0.13
## 31             FAMILY OFFENSES  0.06
## 32                     SUICIDE  0.06
## 33                  BAD CHECKS  0.05
## 34                     BRIBERY  0.03
## 35                   EXTORTION  0.03
## 36                    GAMBLING  0.02
## 37   SEX OFFENSES NON FORCIBLE  0.02
## 38     PORNOGRAPHY/OBSCENE MAT  0.00
## 39                        TREA  0.00

3.3. 범죄 발생률 시각화

- 범죄 발생빈도와 발생률 상위 10개 시각화 :

summary(summ_ct)
##    category              freq       
##  Length:39          Min.   :     6  
##  Class :character   1st Qu.:  1196  
##  Mode  :character   Median :  4388  
##                     Mean   : 22514  
##                     3rd Qu.: 28702  
##                     Max.   :174900
y <- 22514

p1 <- ggplot(data = summ_ct %>% head(10), aes(x = reorder(category, freq), y = freq)) +
  geom_col(fill=fillColor2) +
  ggtitle("Frequency of Crime") +
  xlab("Category") + ylab("Freq") + 
    geom_text(aes(label=paste0(round(freq), sep="")),
              hjust=0.5, vjust=.5, size = 4, colour = 'black',
              fontface = 'bold'
              ) +
  geom_hline(yintercept=y, col="white", lty=2, size=1) +
  theme_economist() +
  coord_flip()

p1

summary(ct_prob)
##    category              prob       
##  Length:39          Min.   : 0.000  
##  Class :character   1st Qu.: 0.135  
##  Mode  :character   Median : 0.500  
##                     Mean   : 2.565  
##                     3rd Qu.: 3.270  
##                     Max.   :19.920
y <- 2.565

p2 <- ggplot(data = ct_prob %>% head(10), aes(x = reorder(category, prob), y = prob)) +
  geom_col(fill=fillColor2) +
  ggtitle("Crime Rate") +
  xlab("Category") + ylab("Prob") + 
    geom_text(aes(label=paste0(round(prob), "%", sep="")),
              hjust=0.5, vjust=.5, size = 4, colour = 'black',
              fontface = 'bold'
             ) +
  geom_hline(yintercept=y, col="white", lty=2, size=1) +
  theme_economist() +
  coord_flip()

p2

3.4. 범죄가 많이 일어나는 요일

#ggplot(data=train, aes(x=DayOfWeek)) +
#  geom_bar(colour="black", fill=fillColor2) +
#  xlab("Day of Week") + ylab('Count') + 
#  theme_economist()
day_of_week <- train %>%
  group_by(DayOfWeek) %>%
  summarise(cnt = n())

# sqldf("
# select 
#     DayOfWeek
#     , count(*) as cnt
# from train
# group by DayOfWeek
# ")

day_of_week
## # A tibble: 7 x 2
##   DayOfWeek    cnt
##   <fct>      <int>
## 1 Monday    121584
## 2 Tuesday   124965
## 3 Wednesday 129211
## 4 Thursday  125038
## 5 Friday    133734
## 6 Saturday  126810
## 7 Sunday    116707
ggplot(data=day_of_week, aes(x=DayOfWeek, y = cnt)) +
  geom_bar(stat = "identity", colour="black", fill=fillColor2) +
  xlab("Day of Week") + ylab('Count') + 
  geom_text(aes(x = DayOfWeek, y = 1, label=paste0(round(cnt), sep="")),
              hjust=0.5, vjust=-.5, size = 4, colour = 'black',
              fontface = 'bold'
             ) +
  theme_economist()

3.5. 요일별 많이 일어나는 범죄

task1 <- sqldf("
select 
    DayOfWeek
    , category
    , count(*) as cnt
from train
group by DayOfWeek, category
order by cnt desc
")


task2 <- sqldf("
select
    t1.DayOfWeek, t1.Category, t1.cnt
from
    task1 as t1
    , (select DayOfWeek, max(cnt) as max_sort from task1 group by DayOfWeek) as t2
where
    t1.cnt = t2.max_sort and t1.DayOfWeek = t2.DayOfWeek
")
task2
##   DayOfWeek      Category   cnt
## 1  Saturday LARCENY/THEFT 27217
## 2    Friday LARCENY/THEFT 27104
## 3 Wednesday LARCENY/THEFT 24487
## 4  Thursday LARCENY/THEFT 24415
## 5    Sunday LARCENY/THEFT 24150
## 6   Tuesday LARCENY/THEFT 23957
## 7    Monday LARCENY/THEFT 23570
ggplot(data=task2, aes(x=DayOfWeek, y = cnt)) +
  geom_bar(stat = "identity", colour="black", fill=fillColor2) +
  xlab("Day of Week") + ylab('Count') + 
  geom_text(aes(x = DayOfWeek, y = 20000, label=paste0(round(cnt), "\n", Category, sep="")),
              hjust=0.5, vjust=-.1, size = 3, colour = 'black',
              fontface = 'bold'
             ) +
  theme_economist() +
  coord_flip()