library(prettydoc)

Data Description

The Seattle Crime Data has information about the incidents that were reported around the various areas in Seattle. This information was recorded by the officers who responded to the incidents that occurred. The data set is released by the Department of Information Technology, Seattle Police Department, to ensure public safety. The link to the dataset (as published) and the codebook is given here.

The data contains 21 variables and 1000 rows of observations. However, 612 values are missing in this dataset. The data, which is in JSON format, is imported into R. Hence all the variables are of type character except location.needs_recoding, which is of logical data type.
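Counts like the ones above can be reproduced with a few base R calls. The following is a minimal sketch on a made-up three-row data frame (the `toy` object and its values are assumptions for illustration, not the real data); the same calls would apply to the imported data frame.

```r
# Toy stand-in for the imported data: the column names mirror the real
# dataset, but the values are invented for illustration.
toy <- data.frame(
  year = c("2016", "2016", "2016"),
  occurred_date_range_end = c(NA, "2016-11-18T16:41:00", NA),
  location.needs_recoding = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Total number of missing values across the whole data frame.
total_missing <- sum(is.na(toy))

# Missing values per column, to see where they are concentrated.
missing_by_column <- colSums(is.na(toy))

# Storage type of each column (character, logical, ...).
column_types <- vapply(toy, function(col) class(col)[1], character(1))

print(total_missing)
print(missing_by_column)
print(column_types)
```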

The details of the variables are given below:

Year : Year the crime was reported (2016). The data type is character.

Zone_beat : Has detailed information (code) about the district where the crime incident occurred.

Latitude : The latitudinal location of the occurrence of the incident. The data type is character. Has 444 levels, indicating that crimes have occurred along the same latitude multiple times.

Offense_code_extension : Data entered for internal purposes. Has 18 unique values within the range of 0-91.

Summarized_offense_description : Gives a generalized offense description. It is of type character data.

Date_reported : Gives the report date and time, as the name suggests. There are 477 distinct report dates and times on which crime incidents across the city were reported.

Offense_type : Gives a more detailed description of the offense than the summarized description. There are 74 unique types in this variable.

Occurred_date_or_date_range_start : Date the offense occurred or started. This variable has 432 levels.

Summary_offense_code : Summarizes the offense_code. This variable has 21 levels. 96 observations have the value ‘X’.

Occurred_date_range_end : Date when the crime was reported to end. This variable has 167 levels. NA is entered in 93 observations.

Month : Month in which the crime was reported. The dataset has crime incidents reported on November 17 and November 18 of this year (2016).

General_offense_number : Gives the offense number as recorded by the police department. There are 495 unique values.

Census_tract_2000 : Has information about the census in that particular area. This variable has 434 levels.

Offense_code : There are 51 distinct values, assigned according to the crimes. 97 of 1000 observations have ‘X’ recorded.

Hundred_block_location : Has information about the block where the crime incident occurred and was reported. Has 443 unique locations, indicating that crimes have occurred repeatedly in the same blocks.

rms_cdw_id : Every row is given a unique number to identify the observation. Hence there are 1000 unique values in this variable.

district_sector : Has 17 different letters assigned according to the district. Has a single observation with the value 99.

longitude : This variable gives the longitudinal location of the crime incidents. It has 430 different values, indicating that crimes have occurred along the same longitude several times.

location.latitude : This variable is a duplicate of the ‘latitude’ variable. It holds the same values as the ‘latitude’ variable.

location.needs_recoding : This is a logical variable. However, FALSE is present for all the observations, indicating that no observation needs to be recoded. This variable can be discarded from future analysis on the data set.

location.longitude : This variable is a duplicate of the ‘longitude’ variable. It holds the same values as the ‘longitude’ variable.
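The level counts, the duplicate-column claims and the ‘X’ counts in the descriptions above can each be checked with one line of base R. The sketch below runs on a made-up data frame (the `toy` object and its values are assumptions for illustration); applied to the imported data, the same expressions would be expected to give the figures quoted above.

```r
# Toy stand-in with a few of the real column names; values are invented.
toy <- data.frame(
  latitude = c("47.61", "47.59", "47.59"),
  location.latitude = c("47.61", "47.59", "47.59"),
  summary_offense_code = c("3500", "X", "5000"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column (an NA, if present, counts as one value).
distinct_counts <- vapply(toy, function(col) length(unique(col)), integer(1))

# TRUE when the two columns hold exactly the same values row by row.
latitude_is_duplicated <- identical(toy$latitude, toy$location.latitude)

# Number of observations that carry the placeholder code 'X'.
x_count <- sum(toy$summary_offense_code == "X", na.rm = TRUE)

print(distinct_counts)
print(latitude_is_duplicated)
print(x_count)
```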

Importing Data

The data, which is in JSON format, is imported. The dataset is found at the link given on this page. The data behind the link gets updated every 6-12 hours.

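# Download step, left commented out so that the saved copy in data/data1.json is reused.
#download.file("https://data.seattle.gov/resource/7ais-f98f.json",destfile="data/data1.json")
library(jsonlite)   # fromJSON()
library(dplyr)
library(knitr)      # kable()
library(tidyr)
library(tidyverse)
# flatten = TRUE turns the nested location object into the location.* columns.
crimedata <- fromJSON("data/data1.json", flatten = TRUE)
as_tibble(crimedata)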

## # A tibble: 1,000 × 21
##     year zone_beat     latitude offense_code_extension
## *  <chr>     <chr>        <chr>                  <chr>
## 1   2016        M2 47.614387512                      0
## 2   2016        G3 47.597877502                      2
## 3   2016        G3 47.597877502                      3
## 4   2016        G1 47.604362488                      1
## 5   2016        B3 47.659385681                      1
## 6   2016        S3 47.519851685                      0
## 7   2016        C1 47.625267029                      1
## 8   2016        C1 47.617610931                      4
## 9   2016        L1 47.719306946                      0
## 10  2016        Q3 47.623924255                      5
## # ... with 990 more rows, and 17 more variables:
## #   summarized_offense_description <chr>, date_reported <chr>,
## #   offense_type <chr>, occurred_date_or_date_range_start <chr>,
## #   summary_offense_code <chr>, month <chr>, general_offense_number <chr>,
## #   census_tract_2000 <chr>, offense_code <chr>,
## #   hundred_block_location <chr>, rms_cdw_id <chr>, district_sector <chr>,
## #   longitude <chr>, occurred_date_range_end <chr>,
## #   location.latitude <chr>, location.needs_recoding <lgl>,
## #   location.longitude <chr>

kable(head(crimedata), options = list(scrollX = TRUE))

year	zone_beat	latitude	offense_code_extension	summarized_offense_description	date_reported	offense_type	occurred_date_or_date_range_start	summary_offense_code	month	general_offense_number	census_tract_2000	offense_code	hundred_block_location	rms_cdw_id	district_sector	longitude	occurred_date_range_end	location.latitude	location.needs_recoding	location.longitude
2016	M2	47.614387512	0	NARCOTICS	2016-11-19T00:15:00	NARC-POSSESS-HEROIN	2016-11-19T00:15:00	3500	11	2016417980	7200.1059	3512	6 AV / VIRGINIA ST	1077344	M	-122.338317871	NA	47.614387512	FALSE	-122.338317871
2016	G3	47.597877502	2	WARRANT ARREST	2016-11-18T22:31:00	WARRARR-MISDEMEANOR	2016-11-18T22:31:00	5000	11	2016417868	9000.1007	5015	23 AV S / S KING ST	1077342	G	-122.302230835	NA	47.597877502	FALSE	-122.302230835
2016	G3	47.597877502	3	WARRANT ARREST	2016-11-18T22:31:00	WARRANT-FUGITIVE	2016-11-18T22:31:00	5000	11	2016417932	9000.1007	5015	23 AV S / S KING ST	1077343	G	-122.302230835	NA	47.597877502	FALSE	-122.302230835
2016	G1	47.604362488	1	WARRANT ARREST	2016-11-18T20:23:00	WARRARR-FELONY	2016-11-18T16:24:00	5000	11	2016417465	8600.3011	5015	9XX BLOCK OF E ALDER ST	1077341	G	-122.319992065	2016-11-18T16:41:00	47.604362488	FALSE	-122.319992065
2016	B3	47.659385681	1	VEHICLE THEFT	2016-11-18T19:56:00	VEH-THEFT-AUTO	2016-11-18T09:30:00	2400	11	2016417502	5000.3019	2404	43XX BLOCK OF WOODLAND PARK AV N	1077323	B	-122.344581604	2016-11-18T10:40:00	47.659385681	FALSE	-122.344581604
2016	S3	47.519851685	0	SHOPLIFTING	2016-11-18T18:13:00	THEFT-SHOPLIFT	2016-11-18T18:13:00	2300	11	2016417614	11800.6020	2303	92XX BLOCK OF RAINIER AV S	1077332	S	-122.268028259	NA	47.519851685	FALSE	-122.268028259

Please note that the above data gets updated every 6-12 hours. Hence the observations change according to the time of day at which the data was imported. For the final project, the dataset that was imported for this project proposal will be used for the analysis. That is, the 1000 crime incidents reported on 11/17 and 11/18 will be analyzed.
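Because the source keeps changing, the analysis is only reproducible if the file is downloaded once and then reused. One way to guard against accidental re-downloads is sketched below; `snapshot_once` is a hypothetical helper written for this illustration, not part of any package.

```r
# Download `url` to `destfile` only when no local snapshot exists yet.
# Returns TRUE if a download happened and FALSE if the existing file was kept.
# (`snapshot_once` is a hypothetical helper, not a library function.)
snapshot_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  # Create the target folder if needed, then fetch the file.
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Usage (not run here because it needs network access):
# snapshot_once("https://data.seattle.gov/resource/7ais-f98f.json", "data/data1.json")
```

With this guard, re-knitting the document leaves data/data1.json untouched, so the 1000 observations under study stay fixed.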

Data Cleaning

The following are the major issues that have to be handled in this dataset.

1. There are 612 missing values in the dataset.

2. Two variables, offense_code and summary_offense_code, have ‘X’ as values in them.

3. The variable longitude appears twice in the dataset. Both variables contain the same values for the respective observations.

4. The location.needs_recoding variable is FALSE throughout, indicating that no observation needs recoding. This variable is not of much use in this dataset and hence can be discarded for future analysis.

5. The district_sector variable has one single observation with the value 99. This particular observation can be discarded for all analysis involving this variable.

The above are the major issues which have to be handled in the dataset. Observations with abnormal values in all three of offense_code (‘X’), summary_offense_code (‘X’) and occurred_date_range_end (NA) are discarded. This amounts to 93 observations. All other observations with missing or ‘X’ values are imputed. The location.needs_recoding and location.longitude variables are dropped from the imported data set.
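The cleaning steps described above could look roughly like the following dplyr sketch. It runs on a made-up four-row tibble (the `toy` object is an assumption for illustration) and assumes dplyr 1.0.0 or later for `across()`. The imputation method is not specified in this proposal, so the sketch only marks the remaining ‘X’ placeholders as NA and leaves imputation out.

```r
library(dplyr)

# Toy stand-in with the columns involved in the cleaning; values are invented.
toy <- tibble(
  offense_code = c("3512", "X", "X", "2404"),
  summary_offense_code = c("3500", "X", "X", "2400"),
  occurred_date_range_end = c(NA, NA, "2016-11-18T16:41:00", "2016-11-18T10:40:00"),
  district_sector = c("M", "G", "99", "B"),
  location.longitude = c("-122.33", "-122.30", "-122.31", "-122.34"),
  location.needs_recoding = c(FALSE, FALSE, FALSE, FALSE)
)

cleaned <- toy %>%
  # Drop rows where all three fields are abnormal at once.
  # %in% is used instead of == so that an NA code never turns the test into NA.
  filter(!(offense_code %in% "X" &
             summary_offense_code %in% "X" &
             is.na(occurred_date_range_end))) %>%
  # Drop the two redundant columns named in the text.
  select(-location.needs_recoding, -location.longitude) %>%
  # Mark remaining 'X' placeholders as NA so a later imputation step can find them.
  mutate(across(c(offense_code, summary_offense_code), ~ na_if(.x, "X")))

# Rows usable for analyses that involve district_sector (the value 99 is excluded).
by_district <- cleaned %>% filter(district_sector != "99")

print(cleaned)
print(by_district)
```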

Planned Analysis

The goal of the project is to analyze the patterns in the crime data.

1. The area or location where the maximum number of crime incidents occurs

2. Time of day (morning/noon/midnight) when the maximum number of incidents occurs throughout the city

3. Top crimes the city is prone to

4. Top districts prone to these top crimes (as obtained from the preceding analysis).

The above are the main areas of analysis and study in this project. The scope of this project is limited to the incidents reported by the police in the month of November, 2016.
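As a rough sketch of how the planned questions might be approached, the dplyr code below counts incidents by offense, by time of day and by district on a made-up four-row tibble. The `toy` data, the four time-of-day bins and the use of `tz = "UTC"` (chosen only so the parsed clock time is not shifted) are assumptions for illustration, not decisions made in this proposal.

```r
library(dplyr)

# Toy stand-in; the column names match the dataset, the rows are invented.
toy <- tibble(
  district_sector = c("M", "G", "G", "B"),
  summarized_offense_description = c("NARCOTICS", "WARRANT ARREST",
                                     "WARRANT ARREST", "VEHICLE THEFT"),
  occurred_date_or_date_range_start = c(
    "2016-11-19T00:15:00", "2016-11-18T22:31:00",
    "2016-11-18T22:31:00", "2016-11-18T09:30:00"
  )
)

with_time <- toy %>%
  mutate(
    # Parse the character timestamp; tz = "UTC" only keeps the clock time as written.
    occurred_at = as.POSIXct(occurred_date_or_date_range_start,
                             format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
    hour = as.integer(format(occurred_at, "%H")),
    # Assumed bins: night 0-5, morning 6-11, afternoon 12-17, evening 18-23.
    time_of_day = cut(hour, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                      labels = c("night", "morning", "afternoon", "evening"))
  )

# Most frequent offenses and busiest times of day, most frequent first.
top_offenses <- count(with_time, summarized_offense_description, sort = TRUE)
incidents_by_time <- count(with_time, time_of_day, sort = TRUE)

# Districts most affected by the single most frequent offense.
top_districts <- with_time %>%
  filter(summarized_offense_description ==
           top_offenses$summarized_offense_description[1]) %>%
  count(district_sector, sort = TRUE)

print(top_offenses)
print(incidents_by_time)
print(top_districts)
```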

Project Proposal

Prarthana Rajendra

November 17, 2016
