This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015.
There are 9 variables: Dates, Category, Descript, DayOfWeek, PdDistrict, Resolution, Address, X, and Y.
You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(leaflet)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggridges)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
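As a minimal sketch of the install-then-load step described above, the snippet below lists which of the packages used in this analysis are not yet installed. The helper name `missing_packages` is hypothetical, and the install call is left commented out so that nothing is downloaded unintentionally.

```r
# Packages loaded in this analysis.
required <- c("dplyr", "ggplot2", "leaflet", "lubridate", "ggridges", "hrbrthemes")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Running this before the library() calls avoids reinstalling packages that are already present.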
First, we read the saved train.csv file with the read.csv() function and store it under the name crime, which gives the following data frame:
crime <- read.csv("train.csv")
head(crime)
Check for missing values with anyNA():
anyNA(crime)
## [1] FALSE
This means our dataset has no missing values. In addition, I want to inspect the structure of the dataset so that I can see each column and judge whether its data type is correct.
Inspect the structure with str():
str(crime)
## 'data.frame': 878049 obs. of 9 variables:
## $ Dates : chr "2015-05-13 23:53:00" "2015-05-13 23:53:00" "2015-05-13 23:33:00" "2015-05-13 23:30:00" ...
## $ Category : chr "WARRANTS" "OTHER OFFENSES" "OTHER OFFENSES" "LARCENY/THEFT" ...
## $ Descript : chr "WARRANT ARREST" "TRAFFIC VIOLATION ARREST" "TRAFFIC VIOLATION ARREST" "GRAND THEFT FROM LOCKED AUTO" ...
## $ DayOfWeek : chr "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
## $ PdDistrict: chr "NORTHERN" "NORTHERN" "NORTHERN" "NORTHERN" ...
## $ Resolution: chr "ARREST, BOOKED" "ARREST, BOOKED" "ARREST, BOOKED" "NONE" ...
## $ Address : chr "OAK ST / LAGUNA ST" "OAK ST / LAGUNA ST" "VANNESS AV / GREENWICH ST" "1500 Block of LOMBARD ST" ...
## $ X : num -122 -122 -122 -122 -122 ...
## $ Y : num 37.8 37.8 37.8 37.8 37.8 ...
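anyNA() only reports whether at least one NA exists anywhere in the data frame, and it does not flag empty strings in chr columns like the ones listed above. A per-column count could be sketched as follows. The data frame `toy` is a small hypothetical stand-in for `crime`, not real data.

```r
# Hypothetical toy data standing in for `crime`.
toy <- data.frame(
  Category = c("ASSAULT", "", "WARRANTS", NA),
  Address  = c("OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST", "", ""),
  X        = c(-122.4, NA, -122.4, -122.4),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_count <- colSums(is.na(toy))

# Number of empty strings in each character column (NA is not counted).
empty_count <- vapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
}, integer(1))

print(na_count)
print(empty_count)
```

Applied to the real data, the same two calls would take `crime` in place of `toy`.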
Fix the incorrect data types.
Convert the Dates column to POSIXct:
crime$Dates <- ymd_hms(crime$Dates)
Convert the categorical columns to factor:
crime[, c("Category", "Descript", "DayOfWeek", "PdDistrict", "Resolution")] <-
  lapply(crime[, c("Category", "Descript", "DayOfWeek", "PdDistrict", "Resolution")],
         as.factor)
Now, make sure the type of each column is correct.
str(crime)
## 'data.frame': 878049 obs. of 9 variables:
## $ Dates : POSIXct, format: "2015-05-13 23:53:00" "2015-05-13 23:53:00" ...
## $ Category : Factor w/ 39 levels "ARSON","ASSAULT",..: 38 22 22 17 17 17 37 37 17 17 ...
## $ Descript : Factor w/ 879 levels "ABANDONMENT OF CHILD",..: 867 811 811 405 405 407 740 740 405 405 ...
## $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ PdDistrict: Factor w/ 10 levels "BAYVIEW","CENTRAL",..: 5 5 5 5 6 3 3 1 7 2 ...
## $ Resolution: Factor w/ 17 levels "ARREST, BOOKED",..: 1 1 1 12 12 12 12 12 12 12 ...
## $ Address : chr "OAK ST / LAGUNA ST" "OAK ST / LAGUNA ST" "VANNESS AV / GREENWICH ST" "1500 Block of LOMBARD ST" ...
## $ X : num -122 -122 -122 -122 -122 ...
## $ Y : num 37.8 37.8 37.8 37.8 37.8 ...
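The same two conversions can also be written as a single dplyr step. The sketch below uses a small hypothetical data frame `toy`. By default ymd_hms() parses timestamps as UTC, so the example passes `tz = "America/Los_Angeles"` on the assumption that the timestamps are San Francisco local time; the source does not state this.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with the same column names as `crime`.
toy <- data.frame(
  Dates     = c("2015-05-13 23:53:00", "2015-05-13 23:33:00"),
  Category  = c("WARRANTS", "OTHER OFFENSES"),
  DayOfWeek = c("Wednesday", "Wednesday"),
  stringsAsFactors = FALSE
)

factor_cols <- c("Category", "DayOfWeek")

toy <- toy %>%
  mutate(
    # Assumption: timestamps are local San Francisco time.
    Dates = ymd_hms(Dates, tz = "America/Los_Angeles"),
    # Convert every listed column to factor in one call.
    across(all_of(factor_cols), as.factor)
  )

str(toy)
```

The time zone does not change the extracted hour or year, but it matters if the timestamps are later compared with data recorded in another zone.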
The crime data contains a DayOfWeek column, which we have already converted to a factor. Now I want to see whether its levels are in a sensible order.
levels(crime$DayOfWeek)
## [1] "Friday" "Monday" "Saturday" "Sunday" "Thursday" "Tuesday"
## [7] "Wednesday"
Because the levels are in alphabetical order rather than calendar order, I want to change the DayOfWeek levels to the following order:
crime$DayOfWeek <- factor(crime$DayOfWeek,
                          levels = c("Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday",
                                     "Saturday", "Sunday"),
                          ordered = TRUE)
It's time to check the summary() of the data to get a statistical summary of each column.
summary(crime)
## Dates Category
## Min. :2003-01-06 00:01:00 LARCENY/THEFT :174900
## 1st Qu.:2006-01-11 03:00:00 OTHER OFFENSES:126182
## Median :2009-03-07 16:00:00 NON-CRIMINAL : 92304
## Mean :2009-03-16 08:25:41 ASSAULT : 76876
## 3rd Qu.:2012-06-11 10:13:00 DRUG/NARCOTIC : 53971
## Max. :2015-05-13 23:53:00 VEHICLE THEFT : 53781
## (Other) :300035
## Descript DayOfWeek
## GRAND THEFT FROM LOCKED AUTO : 60022 Monday :121584
## LOST PROPERTY : 31729 Tuesday :124965
## BATTERY : 27441 Wednesday:129211
## STOLEN AUTOMOBILE : 26897 Thursday :125038
## DRIVERS LICENSE, SUSPENDED OR REVOKED: 26839 Friday :133734
## WARRANT ARREST : 23754 Saturday :126810
## (Other) :681367 Sunday :116707
## PdDistrict Resolution Address
## SOUTHERN :157182 NONE :526790 Length:878049
## MISSION :119908 ARREST, BOOKED :206403 Class :character
## NORTHERN :105296 ARREST, CITED : 77004 Mode :character
## BAYVIEW : 89431 LOCATED : 17101
## CENTRAL : 85460 PSYCHOPATHIC CASE: 14534
## TENDERLOIN: 81809 UNFOUNDED : 9585
## (Other) :238963 (Other) : 26632
## X Y
## Min. :-122.5 Min. :37.71
## 1st Qu.:-122.4 1st Qu.:37.75
## Median :-122.4 Median :37.78
## Mean :-122.4 Mean :37.77
## 3rd Qu.:-122.4 3rd Qu.:37.78
## Max. :-120.5 Max. :90.00
##
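The DayOfWeek counts in the summary above are listed from Monday to Sunday because the levels were set explicitly. A minimal sketch on a hypothetical vector `days` shows the difference between the default alphabetical levels and an ordered factor:

```r
# Hypothetical sample of weekday names.
days <- c("Friday", "Monday", "Sunday", "Wednesday", "Monday")

# Default: levels are sorted alphabetically.
alphabetical <- factor(days)
print(levels(alphabetical))

# Explicit calendar order, stored as an ordered factor.
calendar_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")
ordered_days <- factor(days, levels = calendar_order, ordered = TRUE)

# Tables (and plot axes) now follow calendar order,
# and comparisons between days become meaningful.
print(table(ordered_days))
print(ordered_days < "Friday")
```

Levels that do not occur in the data, such as Tuesday here, still appear in the table with a count of zero.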
In this part, I pose some business questions and try to answer each one with a chart type I already know.
common_crimes <- as.data.frame(table(crime$Category))
colnames(common_crimes) <- c("Category", "Freq")
common_crimes <- common_crimes[order(common_crimes$Freq, decreasing = TRUE), ]
common_crimes
library(ggplot2)
ggplot(data = common_crimes, mapping = aes(x = Freq, y = reorder(Category, Freq))) +
  geom_col(aes(fill = Category)) +
  geom_text(data = common_crimes[c(1, 39), ], mapping = aes(label = Freq)) +
  theme_minimal() +
  labs(title = "Common Crime Category in San Francisco",
       y = NULL,
       x = "Frequency") +
  theme(legend.position = "none")
Interpretation:
This graph shows the number of crimes per category from 2003 to 2015, ordered from most to least frequent. The most common crime category in San Francisco is LARCENY/THEFT, with 174,900 cases, while the rarest is TREA, with 6 cases.
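The frequency table behind this chart can also be built with dplyr's count(), which tallies and sorts in one call. This is a sketch on a hypothetical data frame `toy`, not a replacement for the table() approach used above:

```r
library(dplyr)

# Hypothetical toy data with a single Category column.
toy <- data.frame(
  Category = c("LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT",
               "WARRANTS", "LARCENY/THEFT", "ASSAULT"),
  stringsAsFactors = FALSE
)

# Count rows per category, most frequent first; name the count column "Freq".
common_toy <- toy %>%
  count(Category, sort = TRUE, name = "Freq")

print(common_toy)
```

The result has the same two columns, Category and Freq, as the common_crimes data frame.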
From the Dates column, we can extract the year and assign it to a new variable, years.
crime <- crime %>%
  mutate(years = year(Dates))
head(crime)
crime_per_year <- crime %>%
  group_by(years) %>%
  summarise(total = n())
crime_per_year
ggplot(crime_per_year, aes(x = years, y = total)) +
  geom_line(color = "grey") +
  geom_point(size = 3, color = "firebrick4") +
  theme_minimal() +
  labs(title = "Crime per Year in San Francisco",
       x = NULL,
       y = "Frequency")
Interpretation:
From the graph, we know
crime <- crime %>%
  mutate(month = month(Dates, label = TRUE))
monthly_crimes <- crime %>%
  group_by(month) %>%
  summarise(count = n())
monthly_crimes
ggplot(monthly_crimes, aes(x = count, y = month)) +
  geom_col(fill = "salmon") +
  geom_text(aes(label = count), col = "black") +
  theme_minimal() +
  labs(title = "Monthly Crime in San Francisco",
       y = NULL,
       x = "Frequency")
Interpretation:
This graph shows the number of crimes in each calendar month in San Francisco, summed over 2003 to 2015. Every month totals more than 60,000 cases. October has the most, with 80,274 cases, while December has the fewest.
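Because the data ends on 5/13/2015, the months are not all covered by the same number of years, so summed totals are not strictly comparable. One way to check is to average the monthly counts per year instead of summing them. The sketch below uses a hypothetical data frame `toy` with made-up rows:

```r
library(dplyr)

# Hypothetical toy data: one row per incident, year and month already extracted.
toy <- data.frame(
  years = c(2013, 2013, 2013, 2014, 2014, 2015, 2015, 2015),
  month = c("Jan", "Jan", "Oct", "Jan", "Oct", "Jan", "Jan", "Jan"),
  stringsAsFactors = FALSE
)

monthly_avg <- toy %>%
  count(years, month, name = "count") %>%   # incidents per year and month
  group_by(month) %>%
  summarise(
    years_covered = n(),                    # how many years include this month
    avg_per_year  = mean(count)             # average incidents per covered year
  )

print(monthly_avg)
```

The years_covered column makes it visible when a month is backed by fewer years than the others.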
daily_crime <- crime %>%
  filter(years == 2013, month == "Oct") %>%
  group_by(DayOfWeek, PdDistrict) %>%
  summarise(total_crime = n())
## `summarise()` has grouped output by 'DayOfWeek'. You can override using the `.groups` argument.
daily_crime
ggplot(daily_crime, aes(x = DayOfWeek, y = PdDistrict)) +
  geom_count(aes(size = total_crime), col = "goldenrod3") +
  theme_minimal() +
  labs(
    title = "Daily Crime in San Francisco - 2013",
    subtitle = "October Crime",
    x = NULL,
    y = "District"
  )
Interpretation:
crime$hours <- hour(crime$Dates)
crime_perhours <- crime %>%
  filter(years == 2013, month == "Oct") %>%
  group_by(hours) %>%
  summarise(TotalCrime = n())
crime_perhours
ggplot(crime_perhours, aes(x = hours, y = TotalCrime)) +
  geom_col(fill = "firebrick3") +
  theme_minimal() +
  labs(
    title = "Crime per hour, San Francisco - 2013",
    subtitle = "October Crime",
    x = "Hours",
    y = "Total Crime"
  )
Interpretation:
top_crime <- crime %>%
  filter(Category == "LARCENY/THEFT", years == 2013) %>%
  group_by(PdDistrict) %>%
  summarise(n = n())
head(top_crime)
ggplot(top_crime, aes(x = n, y = reorder(PdDistrict, n))) +
  geom_col(fill = "aquamarine3") +
  geom_text(aes(label = n), col = "azure4") +
  geom_vline(xintercept = mean(top_crime$n)) +
  geom_label(label = paste("Mean ", round(mean(top_crime$n))),
             x = mean(top_crime$n),
             y = 9) +
  labs(
    title = "LARCENY/THEFT Crime in District",
    subtitle = "in 2013",
    x = "Total Crime",
    y = NULL
  ) +
  theme_minimal()
Interpretation:
head(crime)
top10_cat <- unique(common_crimes$Category)[1:10]
top10_cat <- droplevels(top10_cat)
top10_cat
## [1] LARCENY/THEFT OTHER OFFENSES NON-CRIMINAL ASSAULT DRUG/NARCOTIC
## [6] VEHICLE THEFT VANDALISM WARRANTS BURGLARY SUSPICIOUS OCC
## 10 Levels: ASSAULT BURGLARY DRUG/NARCOTIC LARCENY/THEFT ... WARRANTS
We will classify the time of each crime into 3 categories (12am to 8am, 8am to 4pm, and 4pm to 12am) as follows:
# Map an hour of the day (0-23) to one of three 8-hour periods.
pw <- function(x) {
  if (x < 8) {
    "12am to 8am"
  } else if (x < 16) {
    "8am to 4pm"
  } else {
    "4pm to 12am"
  }
}
crime$hour_cat <- sapply(crime$hours, pw)
crime$hour_cat <- as.factor(crime$hour_cat)
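The classification above calls a function once per row. As an alternative sketch, base R's cut() bins the whole vector in one vectorized call and returns a factor directly. The vector `hours` below is a hypothetical sample, and `right = FALSE` makes each interval include its lower bound, matching the `<` comparisons in pw().

```r
# Hypothetical sample of hours, including every boundary value.
hours <- c(0, 7, 8, 15, 16, 23)

# Intervals are [0, 8), [8, 16), [16, 24) because right = FALSE.
hour_cat <- cut(
  hours,
  breaks = c(0, 8, 16, 24),
  labels = c("12am to 8am", "8am to 4pm", "4pm to 12am"),
  right  = FALSE
)

print(data.frame(hours, hour_cat))
```

One difference to keep in mind: cut() keeps the levels in the order the labels are given, whereas as.factor() sorts them alphabetically, which affects legend order in plots.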
head(crime)
crime_when <- crime %>%
  filter(Category %in% top10_cat, years == 2013) %>%
  group_by(Category, hour_cat) %>%
  summarise(n = n())
## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.
crime_when
ggplot(data = crime_when, mapping = aes(x = n, y = reorder(Category, n))) +
  geom_col(mapping = aes(fill = hour_cat), position = "dodge") +
  labs(x = "Crime Count", y = NULL,
       fill = NULL,
       title = "Categories with Highest Total Crime in San Francisco - 2013") +
  scale_fill_brewer(palette = 1) +
  theme_minimal() +
  theme(legend.position = "top")
Interpretation:
list <- c("LARCENY/THEFT", "DRUG/NARCOTIC", "ASSAULT")
crime_density <- crime %>%
  filter(Category %in% list) %>%
  group_by(Category, years) %>%
  summarise(Total = n())
## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.
crime_density
ggplot(crime_density, aes(x = Total, y = Category, fill = Category)) +
  geom_density_ridges() +
  theme_ipsum(extrafont::loadfonts(device="win")) +
  theme(legend.position = "none")
## Picking joint bandwidth of 418
Interpretation:
From this graph it can be seen that ASSAULT has the highest density peak of the 3 categories, which means its yearly totals are the most tightly clustered, and LARCENY/THEFT has the lowest peak. Even so, LARCENY/THEFT has more cases than ASSAULT.
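A ridge's height reflects how concentrated the yearly totals are, not how large they are. A numeric check of that reading could be sketched by comparing the mean and standard deviation of the yearly totals per category. The numbers in `toy` below are hypothetical and only illustrate the idea:

```r
library(dplyr)

# Hypothetical yearly totals for two categories (not the real figures).
toy <- data.frame(
  Category = rep(c("ASSAULT", "LARCENY/THEFT"), each = 4),
  years    = rep(2010:2013, times = 2),
  Total    = c(6000, 6100, 5900, 6000, 12000, 15000, 18000, 15000)
)

# Mean shows the typical level; sd shows how spread out the years are.
spread <- toy %>%
  group_by(Category) %>%
  summarise(mean_total = mean(Total), sd_total = sd(Total))

print(spread)
```

In this toy example the category with the smaller standard deviation would show the taller, narrower ridge even though its mean is lower.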
top5_cat <- c("LARCENY/THEFT", "OTHER OFFENSES", "NON-CRIMINAL", "ASSAULT", "DRUG/NARCOTIC")
crime_spread <- crime %>%
  filter(Category %in% top5_cat) %>%
  group_by(Category, years, DayOfWeek) %>%
  summarise(totalcrime = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Category', 'years'. You can override using the `.groups` argument.
head(crime_spread)
ggplot(data = crime_spread, aes(x = years, y = totalcrime)) +
  geom_point(col = "skyblue") +
  facet_grid(DayOfWeek ~ Category) +
  scale_color_gradient(low = "red3", high = "green2") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))
Interpretation:
crime <- rename(crime, "lng" = "X", "lat" = "Y")
head(crime)
map_drug <- crime %>%
  filter(Category == "DRUG/NARCOTIC",
         month == "Oct",
         PdDistrict == "NORTHERN",
         Resolution == "NONE") %>%
  select(Address, lng, lat)
map_drug
ico <- makeIcon(iconUrl = "https://cdn.iconscout.com/icon/free/png-256/drugs-26-129384.png", iconWidth = 47/2, iconHeight = 41/2)
map2 <- leaflet()
map2 <- addTiles(map2)
map2 <- addMarkers(map2, data = map_drug, icon = ico, popup = map_drug[, "Address"])
## Assuming "lng" and "lat" are longitude and latitude, respectively
map2
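The earlier summary() output shows a maximum X of -120.5 and a maximum Y of 90.00, both far from the medians, so some rows likely carry placeholder coordinates. Before mapping, such rows could be dropped with a bounding-box filter. In the sketch below the box limits, the data frame `toy`, and its coordinate values are all assumptions chosen for illustration:

```r
library(dplyr)

# Hypothetical toy data; the last row mimics the suspicious maximum values.
toy <- data.frame(
  Address = c("OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST", "UNKNOWN"),
  lng     = c(-122.4258, -122.4269, -120.5),
  lat     = c(37.7745, 37.8008, 90.0),
  stringsAsFactors = FALSE
)

# Assumed rough bounding box around San Francisco.
valid <- toy %>%
  filter(between(lng, -123, -122), between(lat, 37, 38))

print(valid)
```

Passing the filtered data to addMarkers() keeps the map from zooming out to fit points that are not in the city.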