library(prettydoc)

Data Description

The Seattle Crime Data contains information about incidents reported in various areas of Seattle. This information was recorded by the officers who responded to the incidents. The data set is released by the Department of Information Technology, Seattle Police Department, to ensure public safety. The link to the dataset (as published) and the codebook is given here.

The data contains 21 variables and 1000 rows of observations. However, 612 values are missing in this dataset. The data, which is in JSON format, is imported into R. As a result, all variables are of type character except location.needs_recoding, which is of logical type.
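The dimensions, missing-value count and column types quoted above can be checked with a few base R calls. The following is a minimal sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not the real records); the same calls would apply to the imported data frame.

```r
# Toy stand-in for the imported data (hypothetical values, not real records)
toy <- data.frame(
  year = c("2016", "2016", "2016"),
  latitude = c("47.61", NA, "47.60"),
  occurred_date_range_end = c(NA, NA, "2016-11-18T16:41:00"),
  location.needs_recoding = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dimensions <- dim(toy)

# Total number of missing values across the whole data frame
total_missing <- sum(is.na(toy))

# Missing values broken down by column
missing_by_column <- colSums(is.na(toy))

# Storage type of each column (character, logical, ...)
column_types <- vapply(toy, function(col) class(col)[1], character(1))

print(dimensions)
print(total_missing)
print(missing_by_column)
print(column_types)
```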

The details of the variables are given below -

year : Year the crime was reported (2016). The data type is character.

zone_beat : Code giving detailed information about the district where the incident occurred.

latitude : The latitude of the incident location. The data type is character. It has 444 levels, indicating that crimes have occurred at the same latitude multiple times.

offense_code_extension : Data entered for internal purposes. It has 18 unique values within the range 0-91.

summarized_offense_description : Gives a generalized offense description. It is of type character.

date_reported : Gives the report date, as the name suggests. It has 477 distinct values; since each value records both date and time, these are distinct report timestamps rather than distinct days.

offense_type : Gives a more specific description of the offense than summarized_offense_description (for example, NARC-POSSESS-HEROIN under NARCOTICS). There are 74 unique types in this variable.

occurred_date_or_date_range_start : Date the offense occurred or started. This variable has 432 levels.

summary_offense_code : Summarizes the offense_code. This variable has 21 levels, and 96 observations have the value ‘X’.

occurred_date_range_end : Date when the crime was reported to have ended. This variable has 167 levels. NA is entered in 93 observations.

month : Month the crime was reported (11). The dataset has crime incidents reported on November 17 and November 18, 2016.

general_offense_number : Gives the offense number as recorded by the police department. There are 495 unique values.

census_tract_2000 : Has information about the census tract of the area. This variable has 434 levels.

offense_code : There are 51 distinct codes, assigned according to the crime. 97 of the 1000 observations have ‘X’ recorded.

hundred_block_location : Has information about the block where the incident occurred and was reported. It has 443 unique locations, indicating that crimes have occurred repeatedly in the same blocks.

rms_cdw_id : Every row is given a unique number that identifies the observation. Hence there are 1000 unique values in this variable.

district_sector : Has 17 different letters assigned according to the district, plus a single observation with the value 99.

longitude : Gives the longitude of the incident location. It has 430 different values, indicating that crimes have occurred at the same longitude several times.

location.latitude : This variable is a duplicate of the ‘latitude’ variable. It has the same values entered in the ‘latitude’ variable.

location.needs_recoding : This is a logical variable. However, FALSE is present for all observations, indicating that no observation needs to be recoded. This variable can be discarded from future analysis of the data set.

location.longitude : This variable is a duplicate of the ‘longitude’ variable. It has the same values entered in the ‘longitude’ variable.
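The level counts and the duplicate-column observations in the list above can be reproduced along the following lines. This is a sketch in base R on made-up values (`toy` is an assumption for illustration); on the real data the same expressions would be applied to the imported data frame.

```r
# Toy stand-in with the coordinate columns (hypothetical values)
toy <- data.frame(
  latitude = c("47.61", "47.60", "47.60"),
  longitude = c("-122.34", "-122.30", "-122.30"),
  location.latitude = c("47.61", "47.60", "47.60"),
  location.longitude = c("-122.34", "-122.30", "-122.30"),
  location.needs_recoding = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing values in each column
distinct_counts <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])),
  integer(1)
)

# Check whether the location.* columns are exact copies of the originals
lat_is_duplicate <- identical(toy$latitude, toy$location.latitude)
lon_is_duplicate <- identical(toy$longitude, toy$location.longitude)

# Columns holding a single value carry no information for the analysis
constant_columns <- names(distinct_counts)[distinct_counts <= 1]

print(distinct_counts)
print(c(latitude = lat_is_duplicate, longitude = lon_is_duplicate))
print(constant_columns)
```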

Importing Data

The data, which is in JSON format, is imported. The dataset is found at the link given on this page. The dataset behind the link is updated every 6-12 hours.

#download.file("https://data.seattle.gov/resource/7ais-f98f.json",destfile="data/data1.json")
library(jsonlite)
library(dplyr)
library(knitr)
library(tidyr)
library(tidyverse)
crimedata<-fromJSON("data/data1.json",flatten=TRUE)
as_tibble(crimedata)
## # A tibble: 1,000 × 21
##     year zone_beat     latitude offense_code_extension
## *  <chr>     <chr>        <chr>                  <chr>
## 1   2016        M2 47.614387512                      0
## 2   2016        G3 47.597877502                      2
## 3   2016        G3 47.597877502                      3
## 4   2016        G1 47.604362488                      1
## 5   2016        B3 47.659385681                      1
## 6   2016        S3 47.519851685                      0
## 7   2016        C1 47.625267029                      1
## 8   2016        C1 47.617610931                      4
## 9   2016        L1 47.719306946                      0
## 10  2016        Q3 47.623924255                      5
## # ... with 990 more rows, and 17 more variables:
## #   summarized_offense_description <chr>, date_reported <chr>,
## #   offense_type <chr>, occurred_date_or_date_range_start <chr>,
## #   summary_offense_code <chr>, month <chr>, general_offense_number <chr>,
## #   census_tract_2000 <chr>, offense_code <chr>,
## #   hundred_block_location <chr>, rms_cdw_id <chr>, district_sector <chr>,
## #   longitude <chr>, occurred_date_range_end <chr>,
## #   location.latitude <chr>, location.needs_recoding <lgl>,
## #   location.longitude <chr>
kable(head(crimedata))
| year | zone_beat | latitude | offense_code_extension | summarized_offense_description | date_reported | offense_type | occurred_date_or_date_range_start | summary_offense_code | month | general_offense_number | census_tract_2000 | offense_code | hundred_block_location | rms_cdw_id | district_sector | longitude | occurred_date_range_end | location.latitude | location.needs_recoding | location.longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | M2 | 47.614387512 | 0 | NARCOTICS | 2016-11-19T00:15:00 | NARC-POSSESS-HEROIN | 2016-11-19T00:15:00 | 3500 | 11 | 2016417980 | 7200.1059 | 3512 | 6 AV / VIRGINIA ST | 1077344 | M | -122.338317871 | NA | 47.614387512 | FALSE | -122.338317871 |
| 2016 | G3 | 47.597877502 | 2 | WARRANT ARREST | 2016-11-18T22:31:00 | WARRARR-MISDEMEANOR | 2016-11-18T22:31:00 | 5000 | 11 | 2016417868 | 9000.1007 | 5015 | 23 AV S / S KING ST | 1077342 | G | -122.302230835 | NA | 47.597877502 | FALSE | -122.302230835 |
| 2016 | G3 | 47.597877502 | 3 | WARRANT ARREST | 2016-11-18T22:31:00 | WARRANT-FUGITIVE | 2016-11-18T22:31:00 | 5000 | 11 | 2016417932 | 9000.1007 | 5015 | 23 AV S / S KING ST | 1077343 | G | -122.302230835 | NA | 47.597877502 | FALSE | -122.302230835 |
| 2016 | G1 | 47.604362488 | 1 | WARRANT ARREST | 2016-11-18T20:23:00 | WARRARR-FELONY | 2016-11-18T16:24:00 | 5000 | 11 | 2016417465 | 8600.3011 | 5015 | 9XX BLOCK OF E ALDER ST | 1077341 | G | -122.319992065 | 2016-11-18T16:41:00 | 47.604362488 | FALSE | -122.319992065 |
| 2016 | B3 | 47.659385681 | 1 | VEHICLE THEFT | 2016-11-18T19:56:00 | VEH-THEFT-AUTO | 2016-11-18T09:30:00 | 2400 | 11 | 2016417502 | 5000.3019 | 2404 | 43XX BLOCK OF WOODLAND PARK AV N | 1077323 | B | -122.344581604 | 2016-11-18T10:40:00 | 47.659385681 | FALSE | -122.344581604 |
| 2016 | S3 | 47.519851685 | 0 | SHOPLIFTING | 2016-11-18T18:13:00 | THEFT-SHOPLIFT | 2016-11-18T18:13:00 | 2300 | 11 | 2016417614 | 11800.6020 | 2303 | 92XX BLOCK OF RAINIER AV S | 1077332 | S | -122.268028259 | NA | 47.519851685 | FALSE | -122.268028259 |
Please note that the above data is updated every 6-12 hours. Hence the observations change depending on the time of day at which the data was imported. For the final project, the dataset imported for this project proposal will be used for the analysis. That is, the 1000 crime incidents reported on 11/17 and 11/18 will be analyzed.
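Because the feed changes every few hours, re-running the download would silently replace the snapshot used for this proposal. One way to guard against that, sketched here as an assumption rather than as the code actually used, is a small helper that downloads only when no local copy exists. The name `fetch_once` is hypothetical.

```r
# Download a file only if no local copy exists yet, so that an earlier
# snapshot is never overwritten by a later version of the feed.
# fetch_once() is a hypothetical helper, not part of the original code.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target directory if needed, then download
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = destfile, mode = "wb")
  }
  # Return the path invisibly so the call can be piped into fromJSON()
  invisible(destfile)
}

# Usage with the URL and path from this document (commented out, as above):
# fetch_once("https://data.seattle.gov/resource/7ais-f98f.json", "data/data1.json")
```

With this guard in place, the existing data/data1.json stays frozen until it is deleted by hand.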

Data Cleaning

The following are the major issues that have to be handled in this dataset.

1. There are 612 missing values in the dataset.

2. Two variables, offense_code and summary_offense_code, have ‘X’ as values in them.

3. The variable longitude appears twice in the dataset (as longitude and location.longitude). Both variables contain the same values for the respective observations.

4. The location.needs_recoding variable has FALSE for every observation, indicating that no observation needs recoding. This variable is not of much use in this dataset and hence can be discarded for future analysis.

5. The district_sector variable has a single observation with the value 99. This particular observation can be discarded for all analysis involving this variable.

The above are the major issues which have to be handled in the dataset. The observations with abnormal values in all three of the variables offense_code (X), summary_offense_code (X) and occurred_date_range_end (NA) are discarded. This amounts to 93 observations that need to be discarded. All the other observations with missing or ‘X’ values are imputed. The location.needs_recoding and location.longitude variables are dropped from the imported data set.
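The cleaning steps described above could look roughly like this in base R. It is a sketch on made-up rows (`toy` and its values are assumptions). The imputation method is not specified in this proposal, so the sketch only marks the remaining ‘X’ codes as NA, ready for whichever method is chosen.

```r
# Toy stand-in for the imported data (hypothetical values)
toy <- data.frame(
  offense_code = c("X", "2404", "X", "5015"),
  summary_offense_code = c("X", "2400", "5000", "5000"),
  occurred_date_range_end = c(NA, "2016-11-18T10:40:00", NA, NA),
  location.needs_recoding = FALSE,
  location.longitude = c("-122.33", "-122.34", "-122.30", "-122.31"),
  stringsAsFactors = FALSE
)

# Rows that are abnormal in all three variables at once.
# %in% is used instead of == so that an NA code never yields an NA index.
abnormal_in_all <- toy$offense_code %in% "X" &
  toy$summary_offense_code %in% "X" &
  is.na(toy$occurred_date_range_end)

# Discard those rows
cleaned <- toy[!abnormal_in_all, ]

# Drop the uninformative and duplicated columns
drop_columns <- c("location.needs_recoding", "location.longitude")
cleaned <- cleaned[, !(names(cleaned) %in% drop_columns), drop = FALSE]

# Mark the remaining 'X' codes as missing so they can be imputed later
cleaned$offense_code[cleaned$offense_code %in% "X"] <- NA
cleaned$summary_offense_code[cleaned$summary_offense_code %in% "X"] <- NA

print(cleaned)
```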

Planned Analysis

The goal of the project is to analyze the patterns in the crime data.

1. The area or location where the most crime incidents occur

2. The time of day (morning/noon/midnight) when the most incidents occur across the city

3. The top crimes the city is prone to

4. The top districts prone to these top crimes (as obtained from the preceding analysis).

The above are the main areas of analysis and study in this project. The scope of this project is limited to the incidents reported by the police in the month of November 2016.
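As a rough illustration of how these questions might be approached, the sketch below tabulates incidents by time of day, by offense and by district in base R. The rows are a small subset modelled on the sample shown earlier. The four time-of-day buckets and their boundaries are assumptions made for this example, not a decision taken in the proposal.

```r
# Small sample modelled on the rows shown earlier in this document
toy <- data.frame(
  occurred_date_or_date_range_start = c(
    "2016-11-18T09:30:00", "2016-11-18T18:13:00", "2016-11-18T22:31:00",
    "2016-11-18T22:31:00", "2016-11-19T00:15:00"
  ),
  summarized_offense_description = c(
    "VEHICLE THEFT", "SHOPLIFTING", "WARRANT ARREST",
    "WARRANT ARREST", "NARCOTICS"
  ),
  district_sector = c("B", "S", "G", "G", "M"),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 timestamps (stored as character) and extract the hour
started <- as.POSIXct(
  toy$occurred_date_or_date_range_start,
  format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"
)
hour_of_day <- as.integer(format(started, "%H"))

# Assumed buckets: [0,6) night, [6,12) morning, [12,18) afternoon, [18,24) evening
time_of_day <- cut(
  hour_of_day,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("night", "morning", "afternoon", "evening"),
  right = FALSE
)

# Incident counts per time-of-day bucket, most frequent first
incidents_by_time <- sort(table(time_of_day), decreasing = TRUE)

# Offense counts, most frequent first
top_crimes <- sort(table(toy$summarized_offense_description), decreasing = TRUE)
top_crime <- names(top_crimes)[1]

# Districts where the most frequent offense occurs
is_top_crime <- toy$summarized_offense_description == top_crime
districts_for_top_crime <- sort(
  table(toy$district_sector[is_top_crime]),
  decreasing = TRUE
)

print(incidents_by_time)
print(top_crimes)
print(districts_for_top_crime)
```

The same pattern applied to hundred_block_location would give the locations with the most incidents.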