library(prettydoc)
Data Description
The Seattle Crime Data has information about the incidents that were reported in the various areas of Seattle. This information was recorded by the officers who responded to the incidents that occurred. The data set is released by the Department of Information Technology, Seattle Police Department, to ensure public safety. The link to the dataset (as published) and the codebook are given here.
The data contains 21 variables and 1000 rows of observations. However, 612 values are missing in this dataset. The data, which is in JSON form, is imported into R. Hence all the variables are of type character except for location.needs_recoding, which is of logical data type.
The details of the variables are given below:
Year : Year the crime was reported (2016). The data type is character.
Zone_beat : Code giving detailed information about the district where the crime incident occurred.
Latitude : The latitude of the location where the incident occurred. The data type is character. It has 444 levels, indicating that crimes have occurred along the same latitude multiple times.
Offense_code_extension : Data entered for internal purposes. It has 18 unique values within the range of 0-91.
Summarized_offense_description : Gives a generalized description of the offense. The data type is character.
Date_reported : Gives the report date and time, as the name suggests. There are 477 distinct report timestamps at which crime incidents across the city were reported.
Offense_type : Gives a broader description of the offense. There are 74 unique types in this variable.
Occurred_date_or_date_range_start : Date the offense occurred or started. This variable has 432 levels.
Summary_offense_code : Summarizes the offense_code. This variable has 21 levels. It has 96 observations with the value ‘X’ entered.
Occurred_date_range_end : Date when the crime was reported to have ended. This variable has 167 levels. NA is entered in 93 observations.
Month : The dataset has crime incidents reported on November 17 and November 18 of this year (2016).
General_offense_number : Gives the offense number as recorded by the police department. There are 495 unique values.
Census_tract_2000 : Has information about the census tract of that particular area. This variable has 434 levels.
Offense_code : There are 51 values assigned according to the crimes. 97 of the 1000 observations have ‘X’ recorded.
Hundred_block_location : Has information about the block where the crime incident occurred and was reported. It has 443 unique locations, indicating that crimes have occurred repeatedly in the same blocks.
rms_cdw_id : Every row is given a unique number to identify the observation. Hence there are 1000 unique values in this variable.
district_sector : Has 17 different letters assigned according to the district, plus a single observation with the value 99.
longitude : This variable gives the longitude of the location of the crime incident. It has 430 different values, indicating that crimes have occurred along the same longitude several times.
location.latitude : This variable is a duplicate of the ‘latitude’ variable. It has the same values entered in the ‘latitude’ variable.
location.needs_recoding : This is a logical variable. However, FALSE is present for all the rows, indicating that no observation needs to be recoded. This variable can be discarded from future analysis on the data set.
location.longitude : This variable is a duplicate of the ‘longitude’ variable. It has the same values entered in the ‘longitude’ variable.
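The kinds of counts quoted in the variable descriptions (missing values, unique levels, number of ‘X’ entries) can be obtained with a few lines of base R. The following is only a minimal sketch on a small made-up data frame: `toy` and its values are hypothetical stand-ins for `crimedata`, not rows taken from the dataset.

```r
# Hypothetical stand-in for crimedata (values invented for illustration)
toy <- data.frame(
  offense_code = c("3512", "5015", "X", "5015"),
  district_sector = c("M", "G", "G", "99"),
  occurred_date_range_end = c(NA, "2016-11-18T16:41:00", NA, NA),
  stringsAsFactors = FALSE
)

# Total number of missing values across all columns
total_missing <- sum(is.na(toy))

# Number of distinct non-missing values ("levels") per column
n_levels <- sapply(toy, function(col) length(unique(col[!is.na(col)])))

# Number of observations holding the placeholder 'X' in offense_code
n_x <- sum(toy$offense_code == "X", na.rm = TRUE)

print(total_missing)
print(n_levels)
print(n_x)
```

Running the same three expressions on `crimedata` instead of `toy` would be one way to check the figures listed above.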
Importing Data
The data, which is in JSON format, is imported. The dataset is found at the link given on this page. The dataset at that link gets updated every 6-12 hours.
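# Download once (kept commented out so the saved snapshot is reused)
# download.file("https://data.seattle.gov/resource/7ais-f98f.json", destfile = "data/data1.json")
library(jsonlite)
library(dplyr)
library(knitr)
library(tidyr)
library(tidyverse)
# Read the saved JSON; flatten = TRUE flattens the nested location data into location.* columns
crimedata <- fromJSON("data/data1.json", flatten = TRUE)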
as_tibble(crimedata)
## # A tibble: 1,000 × 21
## year zone_beat latitude offense_code_extension
## * <chr> <chr> <chr> <chr>
## 1 2016 M2 47.614387512 0
## 2 2016 G3 47.597877502 2
## 3 2016 G3 47.597877502 3
## 4 2016 G1 47.604362488 1
## 5 2016 B3 47.659385681 1
## 6 2016 S3 47.519851685 0
## 7 2016 C1 47.625267029 1
## 8 2016 C1 47.617610931 4
## 9 2016 L1 47.719306946 0
## 10 2016 Q3 47.623924255 5
## # ... with 990 more rows, and 17 more variables:
## # summarized_offense_description <chr>, date_reported <chr>,
## # offense_type <chr>, occurred_date_or_date_range_start <chr>,
## # summary_offense_code <chr>, month <chr>, general_offense_number <chr>,
## # census_tract_2000 <chr>, offense_code <chr>,
## # hundred_block_location <chr>, rms_cdw_id <chr>, district_sector <chr>,
## # longitude <chr>, occurred_date_range_end <chr>,
## # location.latitude <chr>, location.needs_recoding <lgl>,
## # location.longitude <chr>
kable(head(crimedata), options = list(scrollX = TRUE))
| year | zone_beat | latitude | offense_code_extension | summarized_offense_description | date_reported | offense_type | occurred_date_or_date_range_start | summary_offense_code | month | general_offense_number | census_tract_2000 | offense_code | hundred_block_location | rms_cdw_id | district_sector | longitude | occurred_date_range_end | location.latitude | location.needs_recoding | location.longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | M2 | 47.614387512 | 0 | NARCOTICS | 2016-11-19T00:15:00 | NARC-POSSESS-HEROIN | 2016-11-19T00:15:00 | 3500 | 11 | 2016417980 | 7200.1059 | 3512 | 6 AV / VIRGINIA ST | 1077344 | M | -122.338317871 | NA | 47.614387512 | FALSE | -122.338317871 |
| 2016 | G3 | 47.597877502 | 2 | WARRANT ARREST | 2016-11-18T22:31:00 | WARRARR-MISDEMEANOR | 2016-11-18T22:31:00 | 5000 | 11 | 2016417868 | 9000.1007 | 5015 | 23 AV S / S KING ST | 1077342 | G | -122.302230835 | NA | 47.597877502 | FALSE | -122.302230835 |
| 2016 | G3 | 47.597877502 | 3 | WARRANT ARREST | 2016-11-18T22:31:00 | WARRANT-FUGITIVE | 2016-11-18T22:31:00 | 5000 | 11 | 2016417932 | 9000.1007 | 5015 | 23 AV S / S KING ST | 1077343 | G | -122.302230835 | NA | 47.597877502 | FALSE | -122.302230835 |
| 2016 | G1 | 47.604362488 | 1 | WARRANT ARREST | 2016-11-18T20:23:00 | WARRARR-FELONY | 2016-11-18T16:24:00 | 5000 | 11 | 2016417465 | 8600.3011 | 5015 | 9XX BLOCK OF E ALDER ST | 1077341 | G | -122.319992065 | 2016-11-18T16:41:00 | 47.604362488 | FALSE | -122.319992065 |
| 2016 | B3 | 47.659385681 | 1 | VEHICLE THEFT | 2016-11-18T19:56:00 | VEH-THEFT-AUTO | 2016-11-18T09:30:00 | 2400 | 11 | 2016417502 | 5000.3019 | 2404 | 43XX BLOCK OF WOODLAND PARK AV N | 1077323 | B | -122.344581604 | 2016-11-18T10:40:00 | 47.659385681 | FALSE | -122.344581604 |
| 2016 | S3 | 47.519851685 | 0 | SHOPLIFTING | 2016-11-18T18:13:00 | THEFT-SHOPLIFT | 2016-11-18T18:13:00 | 2300 | 11 | 2016417614 | 11800.6020 | 2303 | 92XX BLOCK OF RAINIER AV S | 1077332 | S | -122.268028259 | NA | 47.519851685 | FALSE | -122.268028259 |
Please note that the above data gets updated every 6-12 hours. Hence the observations change depending on the time of day at which the data was imported. For the final project, the dataset that was imported for this project proposal will be used for the analysis. That is, the 1000 crime incidents reported on 11/17 and 11/18 will be analyzed.
Data Cleaning
The following are the major issues that have to be handled in this dataset.
1. There are 612 missing values in the dataset.
2. Two variables, offense_code and summary_offense_code, have ‘X’ as values in them.
3. The variable longitude appears twice in the dataset (as longitude and location.longitude). Both variables contain the same values for the respective observations.
4. The location.needs_recoding variable is FALSE throughout, indicating that no observation needs recoding. This variable is not of much use in this dataset and hence can be discarded for future analysis.
5. The district_sector variable has one single observation with the value 99. This particular observation can be discarded for all analysis involving this variable.
The above are the major issues which have to be handled in the dataset. The observations with abnormal values for all of the variables offense_code (X), summary_offense_code (X) and occurred_date_range_end (NA) are discarded. This sums up to 93 observations which need to be discarded. All the other observations with missing/‘X’ values are imputed. The location.needs_recoding and the location.longitude variables are dropped from the imported data set.
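The row and column removal described above could look roughly like the sketch below. It uses base R on a small invented data frame; `toy`, `all_abnormal`, `cleaned` and `by_sector` are hypothetical names, and the imputation step is left out because the method is not specified here. The comparison with "X" assumes the two code columns contain no NA.

```r
# Hypothetical stand-in for crimedata (values invented for illustration)
toy <- data.frame(
  offense_code = c("3512", "X", "X", "2404"),
  summary_offense_code = c("3500", "X", "X", "2400"),
  occurred_date_range_end = c(NA, NA, "2016-11-18T16:41:00",
                              "2016-11-18T10:40:00"),
  district_sector = c("M", "G", "99", "B"),
  longitude = c("-122.338", "-122.302", "-122.320", "-122.345"),
  location.longitude = c("-122.338", "-122.302", "-122.320", "-122.345"),
  location.needs_recoding = c(FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Flag rows where all three fields are abnormal: 'X', 'X' and NA
all_abnormal <- toy$offense_code == "X" &
  toy$summary_offense_code == "X" &
  is.na(toy$occurred_date_range_end)

# Drop those rows, then drop the two redundant location columns
drop_cols <- c("location.needs_recoding", "location.longitude")
cleaned <- toy[!all_abnormal, !(names(toy) %in% drop_cols)]

# For analyses that use district_sector, exclude the lone '99' code
by_sector <- cleaned[cleaned$district_sector != "99", ]

print(cleaned)
print(nrow(by_sector))
```

Keeping the district filter in a separate object means the row with the value 99 is excluded only from analyses that involve district_sector, as stated in item 5.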
Planned Analysis
The goal of the project is to analyze the following patterns in the crime data.
1. The area or location where the maximum number of crime incidents occur
2. The time of day (morning/noon/midnight) when the maximum number of incidents occur across the city
3. The top crimes the city is prone to
4. The top districts prone to these top crimes (as obtained from the previous analysis)
The above are the main areas of analysis and study in this project. The scope of this project is limited to the incidents reported by the police in the month of November 2016.
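As a rough illustration of how items 2 to 4 might be approached, the sketch below buckets incidents by time of day and tabulates the most frequent offenses and their districts. It runs on an invented data frame `toy`. The four time-of-day buckets and their cut points are assumptions made for this example, not definitions taken from the project.

```r
# Hypothetical stand-in for the cleaned data (values invented for illustration)
toy <- data.frame(
  summarized_offense_description = c("NARCOTICS", "WARRANT ARREST",
                                     "WARRANT ARREST", "VEHICLE THEFT",
                                     "SHOPLIFTING", "WARRANT ARREST"),
  district_sector = c("M", "G", "G", "B", "S", "G"),
  occurred_date_or_date_range_start = c("2016-11-19T00:15:00",
                                        "2016-11-18T22:31:00",
                                        "2016-11-18T22:31:00",
                                        "2016-11-18T09:30:00",
                                        "2016-11-18T18:13:00",
                                        "2016-11-18T16:24:00"),
  stringsAsFactors = FALSE
)

# Parse the timestamp strings and extract the hour of day (0-23)
ts <- as.POSIXct(toy$occurred_date_or_date_range_start,
                 format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
hour <- as.integer(format(ts, "%H"))

# Bucket hours into parts of the day (cut points are an assumption)
toy$time_of_day <- cut(hour, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                       labels = c("night", "morning", "afternoon", "evening"))

# Offense counts, most frequent first
top_crimes <- sort(table(toy$summarized_offense_description),
                   decreasing = TRUE)

# Districts in which the single most frequent offense occurs
top_name <- names(top_crimes)[1]
top_districts <- sort(
  table(toy$district_sector[toy$summarized_offense_description == top_name]),
  decreasing = TRUE
)

print(table(toy$time_of_day))
print(top_crimes)
print(top_districts)
```

The same tabulation applied to hundred_block_location instead of summarized_offense_description would give a starting point for item 1.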