1 Background

1.1 About this Data

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data range from 1/1/2003 to 5/13/2015.

There are 9 variables:

  • Dates - timestamp of the crime incident
  • Category - category of the crime incident (only in train.csv).
  • Descript - detailed description of the crime incident (only in train.csv)
  • DayOfWeek - the day of the week
  • PdDistrict - name of the Police Department District
  • Resolution - how the crime incident was resolved (only in train.csv)
  • Address - the approximate street address of the crime incident
  • X - Longitude
  • Y - Latitude

1.2 Libraries needed

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(leaflet)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggridges)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
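The install step described above can be sketched as a small check for packages that are not yet installed. The helper name `missing_packages()` is hypothetical (it is not part of any package), and the actual `install.packages()` call is left commented out so that nothing is downloaded by accident.

```r
# Packages loaded in this notebook
pkgs <- c("dplyr", "ggplot2", "leaflet", "lubridate", "ggridges", "hrbrthemes")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
# Note that requireNamespace() loads a namespace as a side effect when it is found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```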

2 Read Data

The first thing we do is read the train.csv data that we saved, using the read.csv() function. The result is stored under the name crime, which gives the following data frame:

crime <- read.csv("train.csv")
head(crime)

3 Data Exploration

  • Check for missing values in the data using anyNA() :
anyNA(crime)
## [1] FALSE

This means that our dataset does not have any missing values. In addition, I want to check the structure of the dataset so that I know what each column holds and can judge whether its data type is correct.
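As a side note on the missing-value check, anyNA() only answers yes or no for the whole data frame. A minimal sketch of a per-column count is shown here on a made-up toy data frame (the `toy` object and its values are assumptions, not rows from train.csv). Keep in mind that blank strings in character columns are not NA and are not counted by either function.

```r
# Toy data frame standing in for `crime` (values are made up)
toy <- data.frame(
  Category = c("ASSAULT", NA, "WARRANTS"),
  X = c(-122.4, -122.4, NA),
  Y = c(37.7, 37.8, 37.8),
  stringsAsFactors = FALSE
)

# TRUE when at least one value anywhere in the data frame is missing
anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```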

  • Check the structure of the data using str() :
str(crime)
## 'data.frame':    878049 obs. of  9 variables:
##  $ Dates     : chr  "2015-05-13 23:53:00" "2015-05-13 23:53:00" "2015-05-13 23:33:00" "2015-05-13 23:30:00" ...
##  $ Category  : chr  "WARRANTS" "OTHER OFFENSES" "OTHER OFFENSES" "LARCENY/THEFT" ...
##  $ Descript  : chr  "WARRANT ARREST" "TRAFFIC VIOLATION ARREST" "TRAFFIC VIOLATION ARREST" "GRAND THEFT FROM LOCKED AUTO" ...
##  $ DayOfWeek : chr  "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ PdDistrict: chr  "NORTHERN" "NORTHERN" "NORTHERN" "NORTHERN" ...
##  $ Resolution: chr  "ARREST, BOOKED" "ARREST, BOOKED" "ARREST, BOOKED" "NONE" ...
##  $ Address   : chr  "OAK ST / LAGUNA ST" "OAK ST / LAGUNA ST" "VANNESS AV / GREENWICH ST" "1500 Block of LOMBARD ST" ...
##  $ X         : num  -122 -122 -122 -122 -122 ...
##  $ Y         : num  37.8 37.8 37.8 37.8 37.8 ...
  • Fix the incorrect data types.

    1. Change the Dates column to POSIXct
crime$Dates <- ymd_hms(crime$Dates)
    2. Change the categorical columns to factor
crime[,c("Category","Descript","DayOfWeek","PdDistrict","Resolution")] <-
  lapply(crime[,c("Category","Descript","DayOfWeek","PdDistrict","Resolution")],
         as.factor)
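The same factor conversion can also be written with dplyr, which this notebook already loads. This is only a sketch of an alternative on a made-up toy data frame (`toy` and `cat_cols` are illustrative names), not a replacement for the lapply() call above.

```r
library(dplyr)

# Toy data frame standing in for `crime` (values are made up)
toy <- data.frame(
  Category   = c("ASSAULT", "WARRANTS", "ASSAULT"),
  PdDistrict = c("NORTHERN", "MISSION", "NORTHERN"),
  Address    = c("OAK ST", "LOMBARD ST", "OAK ST"),
  stringsAsFactors = FALSE
)

# Columns that should become factors
cat_cols <- c("Category", "PdDistrict")

# Convert only the listed columns; every other column is left untouched
toy <- toy %>%
  mutate(across(all_of(cat_cols), as.factor))

str(toy)
```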

Now, make sure the types from each column are correct.

str(crime)
## 'data.frame':    878049 obs. of  9 variables:
##  $ Dates     : POSIXct, format: "2015-05-13 23:53:00" "2015-05-13 23:53:00" ...
##  $ Category  : Factor w/ 39 levels "ARSON","ASSAULT",..: 38 22 22 17 17 17 37 37 17 17 ...
##  $ Descript  : Factor w/ 879 levels "ABANDONMENT OF CHILD",..: 867 811 811 405 405 407 740 740 405 405 ...
##  $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ PdDistrict: Factor w/ 10 levels "BAYVIEW","CENTRAL",..: 5 5 5 5 6 3 3 1 7 2 ...
##  $ Resolution: Factor w/ 17 levels "ARREST, BOOKED",..: 1 1 1 12 12 12 12 12 12 12 ...
##  $ Address   : chr  "OAK ST / LAGUNA ST" "OAK ST / LAGUNA ST" "VANNESS AV / GREENWICH ST" "1500 Block of LOMBARD ST" ...
##  $ X         : num  -122 -122 -122 -122 -122 ...
##  $ Y         : num  37.8 37.8 37.8 37.8 37.8 ...

The crime data contain a DayOfWeek column, which we have already converted to a factor. Now I want to see whether its levels are in a sensible order.

levels(crime$DayOfWeek)
## [1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
## [7] "Wednesday"

The levels are sorted alphabetically rather than in calendar order, so I change the order of the DayOfWeek levels as follows:

crime$DayOfWeek <- factor(crime$DayOfWeek,
                          levels = c("Monday","Tuesday","Wednesday",
                                     "Thursday", "Friday", 
                                     "Saturday", "Sunday"),
                          ordered = TRUE)
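To illustrate what the ordered factor changes, here is a small sketch on a made-up vector of day names (`days` and `day_levels` are illustrative names). With ordered levels, sorting and comparisons follow calendar order instead of alphabetical order.

```r
# Made-up day names in arbitrary order
days <- c("Friday", "Monday", "Sunday", "Wednesday")

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

days_ord <- factor(days, levels = day_levels, ordered = TRUE)

# Sorting follows the level order: Monday, Wednesday, Friday, Sunday
sort(days_ord)

# Comparisons are allowed on ordered factors: only Monday is before Wednesday
days_ord < "Wednesday"
```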

It’s time to check the summary() of the data to understand the statistical summary on each column.

summary(crime)
##      Dates                               Category     
##  Min.   :2003-01-06 00:01:00   LARCENY/THEFT :174900  
##  1st Qu.:2006-01-11 03:00:00   OTHER OFFENSES:126182  
##  Median :2009-03-07 16:00:00   NON-CRIMINAL  : 92304  
##  Mean   :2009-03-16 08:25:41   ASSAULT       : 76876  
##  3rd Qu.:2012-06-11 10:13:00   DRUG/NARCOTIC : 53971  
##  Max.   :2015-05-13 23:53:00   VEHICLE THEFT : 53781  
##                                (Other)       :300035  
##                                   Descript          DayOfWeek     
##  GRAND THEFT FROM LOCKED AUTO         : 60022   Monday   :121584  
##  LOST PROPERTY                        : 31729   Tuesday  :124965  
##  BATTERY                              : 27441   Wednesday:129211  
##  STOLEN AUTOMOBILE                    : 26897   Thursday :125038  
##  DRIVERS LICENSE, SUSPENDED OR REVOKED: 26839   Friday   :133734  
##  WARRANT ARREST                       : 23754   Saturday :126810  
##  (Other)                              :681367   Sunday   :116707  
##       PdDistrict                 Resolution       Address         
##  SOUTHERN  :157182   NONE             :526790   Length:878049     
##  MISSION   :119908   ARREST, BOOKED   :206403   Class :character  
##  NORTHERN  :105296   ARREST, CITED    : 77004   Mode  :character  
##  BAYVIEW   : 89431   LOCATED          : 17101                     
##  CENTRAL   : 85460   PSYCHOPATHIC CASE: 14534                     
##  TENDERLOIN: 81809   UNFOUNDED        :  9585                     
##  (Other)   :238963   (Other)          : 26632                     
##        X                Y        
##  Min.   :-122.5   Min.   :37.71  
##  1st Qu.:-122.4   1st Qu.:37.75  
##  Median :-122.4   Median :37.78  
##  Mean   :-122.4   Mean   :37.77  
##  3rd Qu.:-122.4   3rd Qu.:37.78  
##  Max.   :-120.5   Max.   :90.00  
## 
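The summary shows a maximum Y of 90.00 and a maximum X of -120.5, both far from the quartiles, which suggests that a few rows carry placeholder coordinates. A minimal sketch for counting and dropping such rows follows; the bounding box limits are rough assumed values chosen for illustration, and `toy` is a made-up stand-in for crime.

```r
# Toy coordinates standing in for crime$X and crime$Y (values are made up)
toy <- data.frame(
  X = c(-122.42, -122.40, -120.50),
  Y = c(37.77, 37.78, 90.00)
)

# Assumed rough bounding box around San Francisco (illustrative limits)
in_box <- toy$X > -123 & toy$X < -122 & toy$Y > 37 & toy$Y < 38

# Number of rows that fall outside the box
sum(!in_box)

# Keep only the rows inside the box
toy_clean <- toy[in_box, ]
toy_clean
```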

4 Case Study

In this part, I pose some business questions and try to answer each one with a chart type I already know.

  1. Which crime category occurs most often in San Francisco?
common_crimes <- as.data.frame(table(crime$Category))
colnames(common_crimes) <- c("Category", "Freq")
common_crimes <- common_crimes[order(common_crimes$Freq, decreasing = T),]
common_crimes
  • visualize the data
library(ggplot2)
ggplot(data = common_crimes, mapping = aes(x= Freq, y= reorder(Category, Freq)))+
  geom_col(aes(fill = Category))+
  geom_text(data = common_crimes[c(1,39),],mapping = aes(label = Freq))+
   theme_minimal()+
  labs(title = "Common Crime Category in San Francisco",
       y = NULL,
       x = "Frequency")+
 theme(legend.position = "none")

Interpretation :

This graph shows the number of crimes in each category, from the most frequent to the rarest, during 2003 - 2015. The most common crime category in San Francisco is LARCENY/THEFT, with 174900 cases, while the rarest is TREA, with 6 cases.
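The frequency table built with table() and order() for this question can also be produced in one step with dplyr's count(). The sketch below uses a made-up toy data frame (`toy` and `common_toy` are illustrative names), so the counts are not the real ones.

```r
library(dplyr)

# Toy data frame standing in for `crime` (values are made up)
toy <- data.frame(
  Category = c("ASSAULT", "LARCENY/THEFT", "LARCENY/THEFT",
               "WARRANTS", "LARCENY/THEFT", "ASSAULT"),
  stringsAsFactors = FALSE
)

# Count rows per category, sorted from most to least frequent
common_toy <- toy %>%
  count(Category, sort = TRUE, name = "Freq")

common_toy
```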

  2. How many crimes occur in San Francisco each year?
  • Because the data contain the Dates column, we can extract the year from it and assign it to the variable years.
crime <- crime %>% 
  mutate(years = year(Dates))
head(crime)
  • preparing data
crime_per_year <- crime %>% 
  group_by(years) %>% 
  summarise(total = n())
crime_per_year
  • visualize the data
ggplot(crime_per_year, aes(x = years, y = total))+
  geom_line(color = "grey")+
  geom_point(size = 3, color = "firebrick4")+
  theme_minimal()+
  labs(title = "Crime per Year in San Francisco",
       x = NULL,
       y = "Frequency")

Interpretation :

From the graph, we know

  • The frequency of crime in San Francisco decreased from 2003 to 2007.
  • The frequency of crime in San Francisco increased from 2011 to 2013.
  • The frequency of crime in San Francisco appears to drop drastically in 2015, but the data end on 5/13/2015, so 2015 is only a partial year.
  • The highest crime frequency in San Francisco occurred in 2013 with a total of 75606 cases.
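Because the data end on 5/13/2015, it helps to check how much of each year is actually covered before comparing yearly totals. The following is a minimal base R sketch on made-up timestamps (`dates` and the derived names are illustrative, not the real crime$Dates values).

```r
# Made-up timestamps standing in for crime$Dates
dates <- as.POSIXct(c("2014-01-15 10:00:00", "2014-07-04 12:00:00",
                      "2014-12-30 23:00:00", "2015-01-02 08:00:00",
                      "2015-05-13 23:53:00"), tz = "UTC")

yr  <- format(dates, "%Y")   # year as text, e.g. "2014"
mon <- format(dates, "%m")   # zero-padded month, e.g. "05"

# Last month with any record, per year
last_month <- tapply(mon, yr, max)
last_month

# Number of distinct months with records, per year
months_covered <- tapply(mon, yr, function(m) length(unique(m)))
months_covered
```

A year whose last month is well before "12" is incomplete, so its total is not directly comparable with full years.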
  3. How many crimes occur in San Francisco in each month?
  • data preparation
crime <- crime %>% 
  mutate(month = month(Dates, label = T))


monthly_crimes <- crime %>% 
  group_by(month) %>% 
  summarise(count = n())
monthly_crimes
  • visualize the data
ggplot(monthly_crimes, aes(x = count, y = month))+
  geom_col(fill= "salmon")+
  geom_text(aes(label = count), col = "black")+
  theme_minimal()+
  labs(title = "Monthly Crime in San Francisco",
       y = NULL,
       x = "Frequency")

Interpretation :

This graph shows the number of crimes in each month in San Francisco, summed over 2003 to 2015. Every month has a high total, over 60,000 cases. October has the most, with a total of 80,274 cases, while December has the fewest.
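These monthly totals are summed over all years, and since the data end on 5/13/2015, the later months of the year are observed in one fewer year than the earlier ones. One way to account for this, sketched here on a made-up toy data frame (`toy`, `cases`, and `monthly_avg` are illustrative names), is to average the monthly counts per year instead of summing them.

```r
library(dplyr)

# Toy data standing in for the years and month columns of `crime` (made up)
toy <- data.frame(
  years = c(2014, 2014, 2014, 2015, 2015, 2014),
  month = c("Jan", "Jan", "Oct", "Jan", "Jan", "Oct"),
  stringsAsFactors = FALSE
)

monthly_avg <- toy %>%
  count(years, month, name = "cases") %>%    # cases per year and month
  group_by(month) %>%
  summarise(avg_per_year = mean(cases),      # average cases per observed year
            n_years = n())                   # how many years include this month

monthly_avg
```

In the toy data, January has twice the raw total of October only because it appears in two years; the per-year average is the same for both months.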

  4. How many crimes occurred on each day of the week in each district in October 2013?
  • data preparation
daily_crime <- crime %>% 
  filter(years == 2013, month == "Oct") %>% 
  group_by(DayOfWeek, PdDistrict) %>% 
  summarise(total_crime = n())
## `summarise()` has grouped output by 'DayOfWeek'. You can override using the `.groups` argument.
daily_crime
  • visualize the data
ggplot(daily_crime, aes(x = DayOfWeek, y = PdDistrict))+
  geom_count(aes(size = total_crime), col = "goldenrod3")+
  theme_minimal()+
  labs(
    title = "Daily Crime in San Francisco - 2013",
    subtitle = "October Crime",
    x= NULL,
    y = "District"
  )

Interpretation :

  • The highest frequency of crimes occurred in the SOUTHERN district.
  • The lowest frequency of crimes occurred in the RICHMOND district.
  • From this graph, it can be seen that crime occurs more often on weekdays than on weekends.
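The `summarise()` message printed during data preparation for this question says the result is still grouped by DayOfWeek. A minimal sketch of the `.groups` argument it mentions is shown below on a made-up toy data frame (`toy` and `daily_toy` are illustrative names); `.groups = "drop"` returns an ungrouped result and silences the message.

```r
library(dplyr)

# Toy data standing in for `crime` (values are made up)
toy <- data.frame(
  DayOfWeek  = c("Monday", "Monday", "Friday", "Friday", "Friday"),
  PdDistrict = c("SOUTHERN", "SOUTHERN", "SOUTHERN", "RICHMOND", "RICHMOND"),
  stringsAsFactors = FALSE
)

daily_toy <- toy %>%
  group_by(DayOfWeek, PdDistrict) %>%
  summarise(total_crime = n(), .groups = "drop")  # drop all grouping afterwards

daily_toy
```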
  5. How many crimes occurred in San Francisco at each hour of the day in October 2013?
  • Data preparation
crime$hours <- hour(crime$Dates)
crime_perhours <- crime %>% 
  filter(years == 2013, month == "Oct") %>%
  group_by(hours) %>% 
  summarise(TotalCrime = n())
crime_perhours
  • visualize the data
ggplot(crime_perhours, aes(x = hours, y = TotalCrime))+
  geom_col(fill = "firebrick3")+
  theme_minimal()+
  labs(
    title = "Crime per hour, San Francisco - 2013",
    subtitle = "October Crime",
    x = "Hours",
    y = "Total Crime"
  )

Interpretation :

  • Crime was most frequent around 5 pm.
  • Crime was least frequent around 4 am.
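The busiest and quietest hours can also be read from the summary table directly rather than from the chart. This is a small base R sketch on made-up counts (`crime_perhours_toy` is an illustrative stand-in for crime_perhours, and the numbers are not the real totals).

```r
# Made-up hourly totals standing in for crime_perhours
crime_perhours_toy <- data.frame(
  hours      = c(4, 12, 17, 18),
  TotalCrime = c(5, 40, 62, 55)
)

# Row with the highest total
peak <- crime_perhours_toy[which.max(crime_perhours_toy$TotalCrime), ]
peak

# Row with the lowest total
quiet <- crime_perhours_toy[which.min(crime_perhours_toy$TotalCrime), ]
quiet
```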
  6. From the previous analysis, we know that the most common crime category is LARCENY/THEFT. How often did this category of crime occur in each district in 2013?
  • data preparation
top_crime <- crime %>% 
  filter(Category == "LARCENY/THEFT", years == 2013) %>% 
  group_by(PdDistrict) %>% 
  summarise(n = n())
head(top_crime)
  • visualize the data
ggplot(top_crime, aes(x = n, y = reorder(PdDistrict,n)))+
  geom_col(fill = "aquamarine3")+
  geom_text(aes(label = n), col= "azure4")+
  geom_vline(xintercept = mean(top_crime$n))+
  # annotate() draws the label once instead of once per row of the data
  annotate("label", label = paste("Mean ", round(mean(top_crime$n))),
           x = mean(top_crime$n),
           y = 9)+
  labs(
    title = "LARCENY/THEFT Crime in District",
    subtitle = "in 2013",
    x = "Total Crime",
    y = NULL
  )+
  theme_minimal()

Interpretation :

  • LARCENY/THEFT most commonly occurred in the SOUTHERN district, with a total of 4,414 cases.
  • In 2013 the districts averaged 1,815 LARCENY/THEFT cases each.
head(crime)
  7. At what time of day were crimes in the top 10 categories committed in San Francisco?
top10_cat <- unique(common_crimes$Category)[1:10]
top10_cat <- droplevels(top10_cat)
top10_cat
##  [1] LARCENY/THEFT  OTHER OFFENSES NON-CRIMINAL   ASSAULT        DRUG/NARCOTIC 
##  [6] VEHICLE THEFT  VANDALISM      WARRANTS       BURGLARY       SUSPICIOUS OCC
## 10 Levels: ASSAULT BURGLARY DRUG/NARCOTIC LARCENY/THEFT ... WARRANTS

We will classify the time of each crime into 3 categories as follows:

# Map an hour of the day (0-23) to one of three 8-hour periods
pw <- function(x){ 
    if(x < 8){
      "12am to 8am"
    }else if(x < 16){
      "8am to 4pm"
    }else{
      "4pm to 12am"
    }
}
crime$hour_cat <- sapply(crime$hours, pw)
crime$hour_cat <- as.factor(crime$hour_cat)
head(crime)
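Calling pw() through sapply() handles one hour at a time. A vectorized alternative using base R's cut() is sketched below on a made-up vector of hours (`hours` and `hour_cat` here are illustrative, not the real columns). Note that cut() keeps the labels in the order given, so the resulting factor levels are chronological rather than alphabetical.

```r
# Made-up hours of the day (0-23)
hours <- c(0, 7, 8, 15, 16, 23)

# right = FALSE makes the intervals [0, 8), [8, 16), [16, 24)
hour_cat <- cut(hours,
                breaks = c(0, 8, 16, 24),
                right  = FALSE,
                labels = c("12am to 8am", "8am to 4pm", "4pm to 12am"))

hour_cat
levels(hour_cat)
```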
  • data preparation
crime_when <- crime %>% 
  filter(Category %in% top10_cat, years == 2013) %>% 
  group_by(Category, hour_cat) %>% 
  summarise(n = n())
## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.
crime_when
  • visualize the data
ggplot(data = crime_when, mapping = aes(x = n, y = reorder(Category, n))) +
  geom_col(mapping = aes(fill = hour_cat), position = "dodge") + 
  labs(x = "Crime Count", y = NULL,
       fill = NULL,
       title = "Categories with Highest Total Crime in San Francisco - 2013") +
  scale_fill_brewer(palette = 1) +
  theme_minimal() +
  theme(legend.position = "top")

Interpretation :

  • Crimes in the LARCENY/THEFT category occurred most frequently between 4pm and 12am.
  • From this graph it can be concluded that between 12am and 8am the crime count tends to be low.
  8. How are the yearly totals of the 3 categories LARCENY/THEFT, DRUG/NARCOTIC, and ASSAULT distributed?
  • data preparation
# Named selected_cat to avoid masking base::list()
selected_cat <- c("LARCENY/THEFT","DRUG/NARCOTIC","ASSAULT")
crime_density <- crime %>% 
  filter(Category %in% selected_cat) %>% 
  group_by(Category, years) %>% 
  summarise(Total = n())
## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.
crime_density
  • visualize the data
ggplot(crime_density, aes(x =Total, y= Category, fill = Category))+
  geom_density_ridges()+
  theme_ipsum(extrafont::loadfonts(device="win"))+
  theme(legend.position = "none")
## Picking joint bandwidth of 418

Interpretation :

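From this graph it can be seen that ASSAULT has the highest density peak of the 3 categories and LARCENY/THEFT the lowest, even though LARCENY/THEFT has more cases than ASSAULT. Each ridge is the distribution of a category's yearly totals, so a taller peak means the yearly totals are more tightly clustered, not that the category has more cases.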
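To see why a category with fewer cases can have the taller ridge, the sketch below estimates two densities with base R's density() on made-up yearly totals (`narrow` and `wide` are illustrative values, not the real counts). It reuses the joint bandwidth of 418 reported in the output above so that both curves are smoothed equally.

```r
# Made-up yearly totals: little year-to-year variation vs. a lot of variation
narrow <- c(6000, 6100, 6200, 6300, 6400)
wide   <- c(10000, 12000, 14000, 16000, 18000)

bw <- 418  # same bandwidth for both groups, as in a ridgeline plot

d_narrow <- density(narrow, bw = bw)
d_wide   <- density(wide, bw = bw)

# The tightly clustered totals give the higher peak, despite the smaller counts
max(d_narrow$y)
max(d_wide$y)
```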

  9. Create a facet plot of the top 5 crime categories in San Francisco.
  • data preparation
top5_cat <- c("LARCENY/THEFT", "OTHER OFFENSES","NON-CRIMINAL","ASSAULT","DRUG/NARCOTIC")
crime_spread <- crime %>% 
  filter(Category %in% top5_cat) %>% 
  group_by(Category,years, DayOfWeek) %>% 
  summarise(totalcrime = n()) %>% 
  ungroup()
## `summarise()` has grouped output by 'Category', 'years'. You can override using the `.groups` argument.
head(crime_spread)
  • visualize the data
ggplot(data=crime_spread, aes(x=years, y=totalcrime))+
  geom_point(col = "skyblue")+
  facet_grid(DayOfWeek~Category)+
  scale_color_gradient(low="red3", high="green2")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))

Interpretation :

  • The number of ASSAULT cases tends to stay about the same every year.
  • The number of LARCENY/THEFT cases tends to increase over the years on every day of the week.
  10. How are the DRUG/NARCOTIC cases in the NORTHERN district with a NONE resolution in October spread across the map?
  • data preparation
crime <- rename(crime, "lng" = "X", "lat" = "Y")
head(crime)
map_drug <- crime %>% 
  filter(Category == "DRUG/NARCOTIC",
         month == "Oct", 
         PdDistrict == "NORTHERN",
         Resolution == "NONE") %>% 
  select(Address, lng, lat)
map_drug
ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png",iconWidth=47/2, iconHeight=41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[,"Address"])
## Assuming "lng" and "lat" are longitude and latitude, respectively
map2
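The same kind of map can be built as a single pipeline, with the coordinate and popup columns named explicitly through leaflet's formula syntax so that leaflet does not have to guess them. The sketch below uses a made-up toy data frame (`map_toy` and its coordinates are assumptions for illustration) and the default marker instead of the custom icon.

```r
library(leaflet)

# Toy data standing in for map_drug (coordinates are made up)
map_toy <- data.frame(
  Address = c("OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST"),
  lng     = c(-122.426, -122.427),
  lat     = c(37.774, 37.801),
  stringsAsFactors = FALSE
)

map_sketch <- leaflet(data = map_toy) %>%
  addTiles() %>%                      # default OpenStreetMap tiles
  addMarkers(lng = ~lng, lat = ~lat,  # columns named explicitly
             popup = ~Address)        # popup text taken from the Address column

map_sketch
```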