Predict the category of crimes that occurred in the city by the bay
From 1934 to 1963, San Francisco was infamous for housing some of the world’s most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
From Sunset to SOMA, and Marina to Excelsior, this competition’s dataset provides nearly 12 years of crime reports from across all of San Francisco’s neighborhoods. Given time and location, you must predict the category of crime that occurred.
We’re also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes.
- Acknowledgements
Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset is brought to you by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco.
- Evaluation
Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,
\(logloss= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y^{ij}log(p_{ij})\)
where N is the number of cases in the test set, M is the number of class labels, log is the natural logarithm, yij is 1 if observation i is in class j and 0 otherwise, and pij is the predicted probability that observation i belongs to class j.
The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with max(min(p,1−10−15),10−15).
- Import Libraries :
suppressPackageStartupMessages({
library(dplyr)
library(readr)
library(lubridate) # date
library(sqldf) # sql
library(ggplot2) # visualization
library(ggthemes) # ggplot theme
library(data.table) # fread
library(DT)
})
fillColor = "#FFA07A"
fillColor2 = "#F1C40F"
- Load Dataset :
coltypes <-
list(Dates = col_datetime("%Y-%m-%d %H:%M:%S"))
train <-
read_csv(file="./input/train.csv",
col_types=coltypes)
test <-
read_csv(file="./input/test.csv",
col_types=coltypes)
- Reshape Data :
train <-
train %>%
mutate(Year = factor(year(Dates), levels=2003:2015),
Month = factor(month(Dates), levels=1:12),
Day = day(Dates),
Hour = factor(hour(Dates), levels=0:23),
dayDate = as.POSIXct(round(Dates, units = "days")),
DayOfWeek = factor(DayOfWeek, levels=c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
)
#unique(train$Category)
- Overview Data :
train_head <- train %>% head(1000)
datatable(train_head, style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
- 범죄 종류 :
- 총 39개의 범죄가 존재한다.
#length(unique(train$Category))
unique(train$Category)
## [1] "WARRANTS" "OTHER OFFENSES"
## [3] "LARCENY/THEFT" "VEHICLE THEFT"
## [5] "VANDALISM" "NON-CRIMINAL"
## [7] "ROBBERY" "ASSAULT"
## [9] "WEAPON LAWS" "BURGLARY"
## [11] "SUSPICIOUS OCC" "DRUNKENNESS"
## [13] "FORGERY/COUNTERFEITING" "DRUG/NARCOTIC"
## [15] "STOLEN PROPERTY" "SECONDARY CODES"
## [17] "TRESPASS" "MISSING PERSON"
## [19] "FRAUD" "KIDNAPPING"
## [21] "RUNAWAY" "DRIVING UNDER THE INFLUENCE"
## [23] "SEX OFFENSES FORCIBLE" "PROSTITUTION"
## [25] "DISORDERLY CONDUCT" "ARSON"
## [27] "FAMILY OFFENSES" "LIQUOR LAWS"
## [29] "BRIBERY" "EMBEZZLEMENT"
## [31] "SUICIDE" "LOITERING"
## [33] "SEX OFFENSES NON FORCIBLE" "EXTORTION"
## [35] "GAMBLING" "BAD CHECKS"
## [37] "TREA" "RECOVERED VEHICLE"
## [39] "PORNOGRAPHY/OBSCENE MAT"
- 범죄 발생률 :
- 범죄 발생 빈도를 보면, 강도와 폭행, 약물 관련된 사건 순으로 많이 발생함을 알 수 있다.
- 도시에서 발생하는 범죄의 대다수는 절도, 폭행, 약물에 관련있다는 것을 알 수 있다.
- 상위 5개가 전체 비중의 절반 이상(59.71%)을 차지한다.
- 자세히 살펴보면 범죄가 일어나지 않는 케이스(NON-CRIMINAL)가 3위 (전체의 10%)를 차지한다.
- NON-CRIMINAL를 제외한 상위 4개의 비중은 절반(49.2%)을 차지한다.
summ_ct <- sqldf("
select
category as category
, count(category) as freq
from train
group by category
order by count(category) desc
;")
summ_ct
## category freq
## 1 LARCENY/THEFT 174900
## 2 OTHER OFFENSES 126182
## 3 NON-CRIMINAL 92304
## 4 ASSAULT 76876
## 5 DRUG/NARCOTIC 53971
## 6 VEHICLE THEFT 53781
## 7 VANDALISM 44725
## 8 WARRANTS 42214
## 9 BURGLARY 36755
## 10 SUSPICIOUS OCC 31414
## 11 MISSING PERSON 25989
## 12 ROBBERY 23000
## 13 FRAUD 16679
## 14 FORGERY/COUNTERFEITING 10609
## 15 SECONDARY CODES 9985
## 16 WEAPON LAWS 8555
## 17 PROSTITUTION 7484
## 18 TRESPASS 7326
## 19 STOLEN PROPERTY 4540
## 20 SEX OFFENSES FORCIBLE 4388
## 21 DISORDERLY CONDUCT 4320
## 22 DRUNKENNESS 4280
## 23 RECOVERED VEHICLE 3138
## 24 KIDNAPPING 2341
## 25 DRIVING UNDER THE INFLUENCE 2268
## 26 RUNAWAY 1946
## 27 LIQUOR LAWS 1903
## 28 ARSON 1513
## 29 LOITERING 1225
## 30 EMBEZZLEMENT 1166
## 31 SUICIDE 508
## 32 FAMILY OFFENSES 491
## 33 BAD CHECKS 406
## 34 BRIBERY 289
## 35 EXTORTION 256
## 36 SEX OFFENSES NON FORCIBLE 148
## 37 GAMBLING 146
## 38 PORNOGRAPHY/OBSCENE MAT 22
## 39 TREA 6
ct_prob <- sqldf("
select
category
, round(100. * c_tot/tot, 2) as prob
from (
select
category
, count(*) as c_tot
from train
group by category
) as a, (
select
count(*) as tot
from train
) as b
order by prob desc
")
ct_prob
## category prob
## 1 LARCENY/THEFT 19.92
## 2 OTHER OFFENSES 14.37
## 3 NON-CRIMINAL 10.51
## 4 ASSAULT 8.76
## 5 DRUG/NARCOTIC 6.15
## 6 VEHICLE THEFT 6.13
## 7 VANDALISM 5.09
## 8 WARRANTS 4.81
## 9 BURGLARY 4.19
## 10 SUSPICIOUS OCC 3.58
## 11 MISSING PERSON 2.96
## 12 ROBBERY 2.62
## 13 FRAUD 1.90
## 14 FORGERY/COUNTERFEITING 1.21
## 15 SECONDARY CODES 1.14
## 16 WEAPON LAWS 0.97
## 17 PROSTITUTION 0.85
## 18 TRESPASS 0.83
## 19 STOLEN PROPERTY 0.52
## 20 SEX OFFENSES FORCIBLE 0.50
## 21 DISORDERLY CONDUCT 0.49
## 22 DRUNKENNESS 0.49
## 23 RECOVERED VEHICLE 0.36
## 24 KIDNAPPING 0.27
## 25 DRIVING UNDER THE INFLUENCE 0.26
## 26 LIQUOR LAWS 0.22
## 27 RUNAWAY 0.22
## 28 ARSON 0.17
## 29 LOITERING 0.14
## 30 EMBEZZLEMENT 0.13
## 31 FAMILY OFFENSES 0.06
## 32 SUICIDE 0.06
## 33 BAD CHECKS 0.05
## 34 BRIBERY 0.03
## 35 EXTORTION 0.03
## 36 GAMBLING 0.02
## 37 SEX OFFENSES NON FORCIBLE 0.02
## 38 PORNOGRAPHY/OBSCENE MAT 0.00
## 39 TREA 0.00
- 범죄 발생빈도와 발생률 상위 10개 시각화 :
summary(summ_ct)
## category freq
## Length:39 Min. : 6
## Class :character 1st Qu.: 1196
## Mode :character Median : 4388
## Mean : 22514
## 3rd Qu.: 28702
## Max. :174900
y <- 22514
p1 <- ggplot(data = summ_ct %>% head(10), aes(x = reorder(category, freq), y = freq)) +
geom_col(fill=fillColor2) +
ggtitle("Frequency of Crime") +
xlab("Category") + ylab("Freq") +
geom_text(aes(label=paste0(round(freq), sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold'
) +
geom_hline(yintercept=y, col="white", lty=2, size=1) +
theme_economist() +
coord_flip()
p1
summary(ct_prob)
## category prob
## Length:39 Min. : 0.000
## Class :character 1st Qu.: 0.135
## Mode :character Median : 0.500
## Mean : 2.565
## 3rd Qu.: 3.270
## Max. :19.920
y <- 2.565
p2 <- ggplot(data = ct_prob %>% head(10), aes(x = reorder(category, prob), y = prob)) +
geom_col(fill=fillColor2) +
ggtitle("Crime Rate") +
xlab("Category") + ylab("Prob") +
geom_text(aes(label=paste0(round(prob), "%", sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold'
) +
geom_hline(yintercept=y, col="white", lty=2, size=1) +
theme_economist() +
coord_flip()
p2
#ggplot(data=train, aes(x=DayOfWeek)) +
# geom_bar(colour="black", fill=fillColor2) +
# xlab("Day of Week") + ylab('Count') +
# theme_economist()
day_of_week <- train %>%
group_by(DayOfWeek) %>%
summarise(cnt = n())
# sqldf("
# select
# DayOfWeek
# , count(*) as cnt
# from train
# group by DayOfWeek
# ")
day_of_week
## # A tibble: 7 x 2
## DayOfWeek cnt
## <fct> <int>
## 1 Monday 121584
## 2 Tuesday 124965
## 3 Wednesday 129211
## 4 Thursday 125038
## 5 Friday 133734
## 6 Saturday 126810
## 7 Sunday 116707
ggplot(data=day_of_week, aes(x=DayOfWeek, y = cnt)) +
geom_bar(stat = "identity", colour="black", fill=fillColor2) +
xlab("Day of Week") + ylab('Count') +
geom_text(aes(x = DayOfWeek, y = 1, label=paste0(round(cnt), sep="")),
hjust=0.5, vjust=-.5, size = 4, colour = 'black',
fontface = 'bold'
) +
theme_economist()
task1 <- sqldf("
select
DayOfWeek
, category
, count(*) as cnt
from train
group by DayOfWeek, category
order by cnt desc
")
task2 <- sqldf("
select
t1.DayOfWeek, t1.Category, t1.cnt
from
task1 as t1
, (select DayOfWeek, max(cnt) as max_sort from task1 group by DayOfWeek) as t2
where
t1.cnt = t2.max_sort and t1.DayOfWeek = t2.DayOfWeek
")
task2
## DayOfWeek Category cnt
## 1 Saturday LARCENY/THEFT 27217
## 2 Friday LARCENY/THEFT 27104
## 3 Wednesday LARCENY/THEFT 24487
## 4 Thursday LARCENY/THEFT 24415
## 5 Sunday LARCENY/THEFT 24150
## 6 Tuesday LARCENY/THEFT 23957
## 7 Monday LARCENY/THEFT 23570
ggplot(data=task2, aes(x=DayOfWeek, y = cnt)) +
geom_bar(stat = "identity", colour="black", fill=fillColor2) +
xlab("Day of Week") + ylab('Count') +
geom_text(aes(x = DayOfWeek, y = 20000, label=paste0(round(cnt), "\n", Category, sep="")),
hjust=0.5, vjust=-.1, size = 3, colour = 'black',
fontface = 'bold'
) +
theme_economist() +
coord_flip()