The Seattle Police Department is the largest municipal law enforcement agency in Washington State. It pledges to ensure that police services are delivered in a manner that fully complies with the Constitution and laws of the United States. Crime data is used as a performance measure for the policing plans implemented within specific neighborhoods in Seattle. Analyzing crime data and the patterns it follows can serve as an important tool to help prevent crime in the future. This project identifies the major crime areas in Seattle and the top trends in crime by hour of the day and by date. Insights drawn from the analysis are explained in the sections that follow.
Because the analysis relies on plotting and data manipulation, several R packages are used: prettydoc, jsonlite, DT, ggmap, dplyr, lubridate, and knitr.
The Seattle Crime Data contains information about incidents reported in various areas of Seattle. The information was recorded by the officers who responded to the incidents. The data set is released by the Department of Information Technology, Seattle Police Department, to ensure public safety. The link to the data set (as published) and the code book is given here. The published data is refreshed every 6 hours. To allow analysis from a static data source, a snapshot taken on 18 November 2016 was uploaded to Google Drive and is fetched from there.
library(prettydoc)
# Steps followed for the initial data fetch.
#download.file("https://data.seattle.gov/resource/7ais-f98f.json",destfile="R/data/data.json")
#library(jsonlite)
#crimedata<-fromJSON("R/data/data.json",flatten=TRUE)
# It was then uploaded to google drive and every new fetch now happens from this drive.
SeattleCrimedata <- read.csv("https://drive.google.com/uc?export=download&id=0Bx4ZGqIvRp0ndllIcVgyNkhqMXc", header=TRUE)
The data contains 21 variables and 1000 rows of observations, with 612 missing values. The data, originally in JSON form, is imported into R; hence all variables are of type character except location.needs_recoding, which is of logical type.
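The figures quoted above (rows, variables, missing values, data types) can be checked with a few base R calls. The following is a minimal sketch on a small invented data frame; `toy` and its values are hypothetical stand-ins, not rows from the Seattle data.

```r
# Toy stand-in for the crime data; the values are invented for illustration.
toy <- data.frame(
  offense_code = c("2303", "X", "1313"),
  occurred_date_range_end = c("2016-11-17T10:00:00", NA, NA),
  stringsAsFactors = FALSE
)

dim(toy)              # number of rows, then number of variables
sum(is.na(toy))       # total number of missing (NA) cells
colSums(is.na(toy))   # missing cells per variable
sapply(toy, class)    # data type of each variable
```

Note that `is.na()` counts only true NA cells; placeholder values such as 'X' are not included until they are converted to NA.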
The details of the variables are given below:
Year : Year the crime was reported (2016). The data type is character.
Zone_beat : Has detailed information (a code) about the district where the crime incident occurred.
Latitude : The latitude at which the incident occurred. The data type is character. Has 444 levels, indicating that crimes have occurred along the same latitude multiple times.
Offense_code_extension : Data entered for internal purposes. Has 18 unique values within the range 0-91.
Summarized_offense_description : Gives a generalized offense description. It is of type character data.
Date_reported : Gives the report date, as the name suggests. There are 477 days on which various crime incidents across the city were reported.
Offense_type : Gives a broader description of the offense. There are 74 unique types in this variable.
Occurred_date_or_date_range_start : Date the offense occurred or started. This variable has 432 levels.
Summary_offense_code : Summarizes the offense_code. This variable has 21 levels. It has 96 observations with the value ‘X’ entered.
Occurred_date_range_end : Date when the crime was reported to end. This variable has 167 levels. NA is entered in 93 observations.
Month : The data set has crime incidents reported on November 17 and November 18 of this year (2016).
General_offense_number : Gives the offense number as recorded by the police department. There are 495 unique values.
Census_tract_2000 : Has information about the census tract of that particular area. This variable has 434 levels.
Offense_code : There are 51 values assigned according to the crimes. 97 of the 1000 observations have ‘X’ recorded.
Hundred_block_location : Has information about the block where the crime incident occurred and was reported. Has 443 unique locations, indicating that crimes have occurred repeatedly in the same blocks.
rms_cdw_id : Every row is given a unique number to identify the observation. Hence there are 1000 unique values in this variable.
district_sector : Has 17 different letters assigned according to the district, plus a single observation with the value 99.
longitude : This variable gives the longitude of the crime incidents. It has 430 different values, indicating that crime has occurred along the same longitude several times.
location.latitude : This variable is a duplicate of the ‘latitude’ variable. It has the same values entered in the ‘latitude’ variable.
location.needs_recoding : This is a logical variable. However, FALSE is present in every row, indicating that no observation needs to be recoded. This variable can be discarded from future analysis of the data set.
location.longitude : This variable is a duplicate of the ‘longitude’ variable. It has the same values entered in the ‘longitude’ variable.
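The level counts and ‘X’ counts listed above can be obtained by counting distinct values per column. The following is a minimal sketch on invented data; `toy`, `n_levels` and `n_x` are hypothetical names, and the column values are made up.

```r
# Toy stand-in with two of the variables described above (invented values).
toy <- data.frame(
  zone_beat = c("B3", "D2", "B3", "U3"),
  summary_offense_code = c("2300", "X", "2300", "X"),
  stringsAsFactors = FALSE
)

# Number of distinct values ("levels") in each variable
n_levels <- sapply(toy, function(col) length(unique(col)))
n_levels

# Number of observations holding the placeholder value 'X'
n_x <- sum(toy$summary_offense_code == "X")
n_x
```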
The original data is cleaned to drop 3 variables. location.needs_recoding is a logical variable that contains only FALSE, so it is not used in any further analysis. The duplicate variables location.longitude and location.latitude are also eliminated from the data set. As mentioned above, the variables summary_offense_code and offense_code contain the value ‘X’ (which is equivalent to NA), and occurred_date_range_end contains NA. The number of observations that have ‘X’ in both code variables and NA in the end date is obtained first.
#Check total number of observations with missing data
length(which(SeattleCrimedata$summary_offense_code=='X' & SeattleCrimedata$offense_code=='X' & is.na(SeattleCrimedata$occurred_date_range_end)))
## [1] 92
#Drop columns 1, 20, 21 and 22 by position
UpdatedCrimeData<-SeattleCrimedata[c(-1,-20,-21,-22)]
#Convert 'X' to NA
UpdatedCrimeData$summary_offense_code[UpdatedCrimeData$summary_offense_code=='X'] <- NA
UpdatedCrimeData$offense_code[UpdatedCrimeData$offense_code=='X'] <- NA
92 observations have irrelevant data (‘X’ or NA) in all three columns. Since they form a small portion of the whole data set, these observations are ignored. For the sake of consistency, the ‘X’ values in summary_offense_code and offense_code are converted to NA.
library(DT)
Table<-datatable(UpdatedCrimeData)
Table
To summarize, all the variables are of character type. They are converted to date or numeric types during the analysis as needed. The SeattleCrime data set provides data about the occurrences of crime across various parts of the city. The longitude and latitude data, as recorded by the Seattle police, provide the exact location of each crime incident. This is further backed by the Zone Beat/District Sector information recorded. Each type of offense has a unique summary code allotted to it. The crime data considered for the analysis spans 14 November 2016 to 18 November 2016.
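The conversion from character to numeric and date types mentioned above can be sketched in base R as follows. This is an illustration on invented values; `toy` is a hypothetical data frame, and the ISO-style timestamp format (`%Y-%m-%dT%H:%M:%S`) is an assumption about how the dates are stored.

```r
# Toy stand-in; the coordinates and timestamps are invented.
toy <- data.frame(
  latitude = c("47.6062", "47.6205"),
  date_reported = c("2016-11-18T15:04:00", "2016-11-17T09:30:00"),
  stringsAsFactors = FALSE
)

# Character -> numeric
toy$latitude <- as.numeric(toy$latitude)

# Character -> date-time (the timestamp format is an assumption)
toy$date_reported <- as.POSIXct(toy$date_reported,
                                format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Hour of the day, extracted from the parsed date-time
reported_hour <- as.integer(format(toy$date_reported, "%H"))
reported_hour
```

The analysis itself uses lubridate (`ymd_hms()` and `hour()`) for the same purpose; the base R calls here only avoid an extra dependency in the sketch.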
The graph below shows a map of Seattle; the points on the map mark the reported crime incidents.
library(ggmap)
SeattleCrimeMap<-get_map("Seattle",zoom=12,source="google",maptype="terrain")
ggmap(SeattleCrimeMap)+
  geom_point(data=UpdatedCrimeData,aes(x=longitude,y=latitude),alpha=0.9,color="darkgreen")
The number of crimes in Seattle by hour of the day is analyzed next. The graph shows that crime incidents are lowest in the mornings and evenings.
library(dplyr)
library(lubridate)
DateReported<-ymd_hms(UpdatedCrimeData$date_reported)
Crimehour <- UpdatedCrimeData %>% group_by(chour=hour(DateReported)) %>%
summarise(nfr=n()) %>% arrange(desc(nfr))
Crimehour <- Crimehour %>% mutate(dn=(chour>5&chour<21))
# Hours are labelled 1 to 24; midnight (0) is recoded to 24 so that it sorts last
hours <- as.character(1:24)
Crimehour$chour[Crimehour$chour==0]<-24
Crimehour$hours <-factor(Crimehour$chour, levels=hours)
ggplot(data=Crimehour, aes(x=chour, y=nfr)) + geom_point(colour = "darkgreen", size = 2) + geom_line(colour="darkgreen", size=0.75)+ labs(title= "Crime Hour", x="Hour of the Day Crime was reported", y="Total reported Crime Incidents")
The next graph shows the occurrences of various crimes on specific dates. The date range here is from 14 November 2016 to 18 November 2016.
library(knitr)
topOffensesSeattle <- UpdatedCrimeData %>% group_by(summarized_offense_description,x=day(date_reported)) %>% summarize(nf=n())%>% arrange(desc(nf),desc(x))
graph<- ggplot(topOffensesSeattle, aes(x=x, y=nf,color=summarized_offense_description))+ ylim(0,60) +
geom_point(size = 2) +
labs(title= "Components of Crime Incidents", x="Day of the month", y="Reported Crime Incidents") +
facet_wrap(~summarized_offense_description)+theme(legend.position="none")
graph + theme(strip.background = element_rect(fill="yellowgreen"))
A bar graph is drawn to show the crime types with the most occurrences within the four dates considered.
library(knitr)
top50<-topOffensesSeattle %>% head(50)
# y is the incident count (nf), so bar height matches the "Total occurrences" label
ggplot(data=top50,aes(x=summarized_offense_description, y=nf,fill=summarized_offense_description))+
  geom_bar(stat = "identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+labs(title= "Crime incidents by count", x="Type of Offense", y="Total occurrences") + theme(legend.position = "none")
The table below shows the six zones most prone to crime.
library(knitr)
library(dplyr)
MaximumCrimeProneZone<-UpdatedCrimeData %>% group_by(zone_beat) %>% count(zone_beat) %>% arrange(desc(n))
kable(head(MaximumCrimeProneZone))
| zone_beat | n |
|---|---|
| B3 | 43 |
| D2 | 36 |
| U3 | 36 |
| U2 | 31 |
| E1 | 30 |
| E2 | 29 |
The scope of the project is to analyze crime patterns in the city of Seattle. Crime patterns were observed for four days in the month of November. Major crime areas, the time of day with the most crime, and the most frequent crime types were analyzed to observe the pattern of crime events across the city. Graphs were plotted and conclusions drawn from this analysis. From the graphs plotted, we discover that:
The central area of Seattle is affected by crime the most.
The maximum crime across the city takes place around 3 PM.
Vehicle Theft occurred the most in Seattle.
Zone B3 is prone to crimes more than the other zones.
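Findings such as the peak hour and the most crime-prone zone can also be read directly from frequency tables rather than from the plots. The following is a minimal sketch on invented values; `hours`, `zones` and the derived names are hypothetical and do not come from the Seattle data.

```r
# Invented reporting hours and zone codes, for illustration only.
hours <- c(15, 15, 9, 22, 15, 9)
zones <- c("B3", "D2", "B3", "U3", "B3", "D2")

# Hour with the most reported incidents
hour_counts <- table(hours)
peak_hour <- as.integer(names(hour_counts)[which.max(hour_counts)])
peak_hour

# Zone with the most reported incidents
zone_counts <- sort(table(zones), decreasing = TRUE)
top_zone <- names(zone_counts)[1]
top_zone
```

`which.max()` returns only the first maximum, so ties between hours or zones would need separate handling.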