Overview

Crime is an issue that escapes no city’s police force. Here, we look at crime data from the city of Seattle, WA collected during the summer of 2014. The subset of the Seattle incident data that we use can be found here, and the full data set is here. Visit the Seattle data portal for more information on their open data projects.

The purpose of this analysis is to answer four questions using the subset of Seattle crime data taken from summer 2014:

  1. How many crimes were committed each day? Skip to: analysis | answer
  2. What hour of the day did the most crimes occur? Skip to: analysis | answer
  3. What were the most common crimes? Skip to: analysis | answer
  4. Which crimes went unreported the longest? Skip to: analysis | answer

Let’s begin by downloading and loading the data into an R object called res.

fileURL <- "https://github.com/uwescience/datasci_course_materials/raw/master/assignment6/seattle_incidents_summer_2014.csv"
if (!file.exists("crimedata.csv")) {
        download.file(fileURL, destfile="crimedata.csv", method="auto")
}
res <- read.csv("crimedata.csv")

Now, let’s look at the summary statistics, the structure, and the head of res to get a better feel for it.

summary(res)
##    RMS.CDW.ID      General.Offense.Number  Offense.Code  
##  Min.   :  12470   Min.   :2.015e+07      2305   : 6230  
##  1st Qu.:  47246   1st Qu.:2.014e+09      X      : 3305  
##  Median : 999859   Median :2.014e+09      2399   : 3034  
##  Mean   : 578650   Mean   :2.020e+09      2404   : 2871  
##  3rd Qu.:1030116   3rd Qu.:2.014e+09      2903   : 1638  
##  Max.   :1125252   Max.   :2.014e+11      1313   : 1368  
##                                           (Other):14333  
##  Offense.Code.Extension                         Offense.Type  
##  Min.   : 0.0           THEFT-CARPROWL                : 6230  
##  1st Qu.: 0.0           VEH-THEFT-AUTO                : 2588  
##  Median : 0.0           THEFT-OTH                     : 2221  
##  Mean   : 4.2           PROPERTY DAMAGE-NON RESIDENTIA: 1638  
##  3rd Qu.: 1.0           ASSLT-NONAGG                  : 1320  
##  Max.   :98.0           DISTURBANCE-OTH               : 1295  
##                         (Other)                       :17487  
##  Summary.Offense.Code Summarized.Offense.Description
##  2300   :12161        CAR PROWL      : 6230         
##  X      : 3305        OTHER PROPERTY : 3755         
##  2200   : 3212        BURGLARY       : 3212         
##  2400   : 2871        VEHICLE THEFT  : 3057         
##  1300   : 2639        PROPERTY DAMAGE: 2365         
##  2900   : 2365        ASSAULT        : 2018         
##  (Other): 6226        (Other)        :12142         
##                 Date.Reported        Occurred.Date.or.Date.Range.Start
##  08/04/2014 07:00:00 PM:   11   07/01/2014 12:00:00 AM:   46          
##  07/10/2014 01:16:00 PM:    9   08/01/2014 12:00:00 PM:   33          
##  06/25/2014 05:26:00 PM:    8   07/25/2014 08:00:00 PM:   24          
##  07/07/2014 08:00:00 PM:    8   08/01/2014 12:00:00 AM:   23          
##  07/24/2014 11:05:00 AM:    8   06/01/2014 12:00:00 AM:   22          
##  06/19/2014 02:52:00 PM:    7   08/02/2014 12:00:00 PM:   22          
##  (Other)               :32728   (Other)               :32609          
##            Occurred.Date.Range.End                  Hundred.Block.Location
##                        :16280      4XX BLOCK OF NE NORTHGATE WY:  194     
##  07/07/2014 07:00:00 AM:   21      16XX BLOCK OF 11 AV         :  138     
##  06/05/2014 08:00:00 AM:   18      5XX BLOCK OF 3 AV           :  129     
##  07/14/2014 08:00:00 AM:   17      4XX BLOCK OF PINE ST        :  119     
##  07/19/2014 09:00:00 AM:   17      84XX BLOCK OF SEAVIEW PL NW :  117     
##  06/30/2014 08:00:00 AM:   16      1XX BLOCK OF S WASHINGTON ST:  110     
##  (Other)               :16410      (Other)                     :31972     
##  District.Sector   Zone.Beat     Census.Tract.2000   Longitude     
##  M      : 2642   E2     : 1089   Min.   :  100.1   Min.   :-122.4  
##  K      : 2379   M3     :  968   1st Qu.: 4600.1   1st Qu.:-122.3  
##  B      : 2338   B2     :  845   Median : 7500.4   Median :-122.3  
##  N      : 2333   K2     :  841   Mean   : 6737.7   Mean   :-114.7  
##  E      : 2255   M1     :  837   3rd Qu.: 9200.2   3rd Qu.:-122.3  
##  J      : 2204   M2     :  837   Max.   :26500.1   Max.   :   0.0  
##  (Other):18628   (Other):27362   NA's   :134                       
##     Latitude                               Location         Month      
##  Min.   : 0.00   (0.0, 0.0)                    : 2050   Min.   :6.000  
##  1st Qu.:47.58   (47.7086028, -122.324615155)  :  120   1st Qu.:6.000  
##  Median :47.61   (47.708602801, -122.324615156):   87   Median :7.000  
##  Mean   :44.65   (47.717174472, -122.344896099):   85   Mean   :6.981  
##  3rd Qu.:47.66   (47.615838397, -122.318168851):   77   3rd Qu.:8.000  
##  Max.   :47.75   (47.602412415, -122.331082197):   67   Max.   :8.000  
##                  (Other)                       :30293                  
##       Year     
##  Min.   :2014  
##  1st Qu.:2014  
##  Median :2014  
##  Mean   :2014  
##  3rd Qu.:2014  
##  Max.   :2014  
## 
str(res)
## 'data.frame':    32779 obs. of  19 variables:
##  $ RMS.CDW.ID                       : int  483839 481252 481375 481690 478198 480485 470170 465137 461710 456091 ...
##  $ General.Offense.Number           : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Offense.Code                     : Factor w/ 92 levels "1201","1202",..: 19 42 31 37 32 30 30 38 42 39 ...
##  $ Offense.Code.Extension           : int  0 0 0 0 3 0 0 0 0 1 ...
##  $ Offense.Type                     : Factor w/ 147 levels "[INC - CASE DC USE ONLY]",..: 15 43 121 20 122 117 117 42 43 40 ...
##  $ Summary.Offense.Code             : Factor w/ 26 levels "1200","1300",..: 3 8 5 7 5 5 5 8 8 8 ...
##  $ Summarized.Offense.Description   : Factor w/ 48 levels "[INC - CASE DC USE ONLY]",..: 6 20 26 9 29 29 29 20 20 20 ...
##  $ Date.Reported                    : Factor w/ 15766 levels "01/02/2015 11:12:00 AM",..: 4720 3991 3617 3468 3304 3125 629 39 38 36 ...
##  $ Occurred.Date.or.Date.Range.Start: Factor w/ 11557 levels "06/01/2014 01:00:00 AM",..: 3526 116 11508 2393 115 2275 116 10420 7872 6327 ...
##  $ Occurred.Date.Range.End          : Factor w/ 5935 levels "","01/18/2015 09:00:00 AM",..: 1801 1509 1 1 5913 2533 60 10 1 1 ...
##  $ Hundred.Block.Location           : Factor w/ 7924 levels "1 AV / BATTERY ST",..: 5508 2248 6086 5522 5840 3305 2432 2187 557 3091 ...
##  $ District.Sector                  : Factor w/ 19 levels "","99","B","C",..: 9 4 7 12 9 16 15 3 5 5 ...
##  $ Zone.Beat                        : Factor w/ 53 levels "","99","B1","B2",..: 22 7 17 31 23 44 40 3 11 9 ...
##  $ Census.Tract.2000                : num  2900 6300 11301 8200 2700 ...
##  $ Longitude                        : num  -122 -122 -122 -122 -122 ...
##  $ Latitude                         : num  47.7 47.6 47.5 47.6 47.7 ...
##  $ Location                         : Factor w/ 12970 levels "(0.0, 0.0)","(47.465062973, -122.337868218)",..: 10611 7628 872 5213 10823 2191 7640 9804 6525 6000 ...
##  $ Month                            : int  6 6 8 6 6 6 6 8 8 7 ...
##  $ Year                             : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
head(res)
##   RMS.CDW.ID General.Offense.Number Offense.Code Offense.Code.Extension
## 1     483839             2015218538         2202                      0
## 2     481252             2015213067         2610                      0
## 3     481375             2015210301         2316                      0
## 4     481690             2015209327         2599                      0
## 5     478198             2015207880         2399                      3
## 6     480485             2015904103         2308                      0
##           Offense.Type Summary.Offense.Code Summarized.Offense.Description
## 1   BURGLARY-FORCE-RES                 2200                       BURGLARY
## 2 FRAUD-IDENTITY THEFT                 2600                          FRAUD
## 3           THEFT-MAIL                 2300                     MAIL THEFT
## 4          COUNTERFEIT                 2500                    COUNTERFEIT
## 5            THEFT-OTH                 2300                 OTHER PROPERTY
## 6       THEFT-BUILDING                 2300                 OTHER PROPERTY
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 1 06/28/2015 10:31:00 AM            06/28/2014 10:31:00 AM
## 2 06/24/2015 11:09:00 AM            06/01/2014 12:00:00 AM
## 3 06/22/2015 09:22:00 AM            08/31/2014 09:00:00 AM
## 4 06/21/2015 03:52:00 PM            06/20/2014 01:38:00 PM
## 5 06/20/2015 11:59:00 AM            06/01/2014 11:59:00 AM
## 6 06/19/2015 02:55:00 PM            06/19/2014 02:45:00 PM
##   Occurred.Date.Range.End       Hundred.Block.Location District.Sector
## 1  06/28/2015 10:31:00 AM        6XX BLOCK OF NW 74 ST               J
## 2  06/24/2015 11:09:00 AM        23XX BLOCK OF 43 AV E               C
## 3                               81XX BLOCK OF 11 AV SW               F
## 4                                 6XX BLOCK OF PINE ST               M
## 5  11/01/2014 12:00:00 PM 77XX BLOCK OF SUNNYSIDE AV N               J
## 6  07/10/2014 02:45:00 PM 35XX BLOCK OF S FERDINAND ST               R
##   Zone.Beat Census.Tract.2000 Longitude Latitude
## 1        J2          2900.301 -122.3647 47.68252
## 2        C2          6300.100 -122.2771 47.63990
## 3        F3         11300.501 -122.3493 47.52923
## 4        M2          8200.100 -122.3348 47.61237
## 5        J3          2700.202 -122.3294 47.68596
## 6        R3         10300.401 -122.2875 47.55785
##                         Location Month Year
## 1  (47.68252427, -122.364671996)     6 2014
## 2 (47.639900761, -122.277080248)     6 2014
## 3 (47.529232299, -122.349312181)     8 2014
## 4 (47.612368448, -122.334817763)     6 2014
## 5 (47.685959879, -122.329378505)     6 2014
## 6 (47.557854802, -122.287477902)     6 2014

Notice that there are two variables that describe the offense (i.e. crime) type, Offense.Type and summarised.Offense.Description. The str function showed us that the summarised.Offense.Description column only had 48 levels, while Offense.Type had 147. For simplicity’s sake, we use the summarised.Offense.Description variable to categorize crime. Because its name is long, we use the plyr package to change it to Offense to clean up our code.

library(plyr)
res <- rename(res, c("Summarized.Offense.Description" = "Offense"))

Data Cleaning and Preprocessing

Next, let’s prepare the data for analysis. We first convert the date variables to POSIXlt/ct format using the lubridate library. Then, we convert the offense type variables to lowercase letters for improved readability. We coerce the offense type variable back to a factor after the conversion.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
res$Date.Reported                       <- mdy_hms(res$Date.Reported)
res$Occurred.Date.or.Date.Range.Start   <- mdy_hms(res$Occurred.Date.or.Date.Range.Start) 
res$Occurred.Date.Range.End             <- mdy_hms(res$Occurred.Date.Range.End)
res$Offense                             <- as.factor(tolower(res$Offense))

Let’s see a few of the crimes described in Offense.

head(table(res$Offense), 20)
## 
##    [inc - case dc use only]            animal complaint 
##                           5                          96 
##                     assault               bias incident 
##                        2018                          20 
##                  bike theft                    burglary 
##                         797                        3212 
## burglary-secure parking-res                   car prowl 
##                         388                        6230 
##                 counterfeit          disorderly conduct 
##                         159                           2 
##                     dispute                 disturbance 
##                         171                        1333 
##                         dui                     eluding 
##                          34                           8 
##                    embezzle                      escape 
##                          57                           3 
##                false report                    firework 
##                          23                           9 
##                     forgery                       fraud 
##                          59                        1473

We’ll return to this variable later.

Data Analysis

Question 1. How many crimes were committed each day?

We do a litle feature engineering to answer this question, and we approach it in two ways. Namely, we look at crimes committed per true date and per weekday.

First, we create a continuous time variable called Date.Occurred which is a Date object in R. Then, we use this new variable to create an aggregated dataframe, evPerDay, using the plyr package.

res$Date.Occurred <- as.Date(res$Occurred.Date.or.Date.Range.Start)
evPerDay <- ddply(res, .(Date.Occurred), summarise, Count=length(Offense))
head(evPerDay)
##   Date.Occurred Count
## 1    2014-06-01   397
## 2    2014-06-02   419
## 3    2014-06-03   357
## 4    2014-06-04   392
## 5    2014-06-05   386
## 6    2014-06-06   401

In the second way, we label each date with its appropriate weekday using the lubridate library. We store this information in a new variable called Weekday.

evPerDay$Weekday <- wday(evPerDay$Date.Occurred, label=TRUE)
wdayAgg <- aggregate(Count~Weekday, data=evPerDay, FUN=mean)
print(wdayAgg)
##   Weekday    Count
## 1     Sun 336.7857
## 2     Mon 352.8462
## 3    Tues 354.7692
## 4     Wed 358.6154
## 5   Thurs 353.5385
## 6     Fri 381.5385
## 7     Sat 357.4615

Question 2: What hour of the day did the most crimes occur?

To answer this question, we we extract Time.Occurred using format() and use the chron package to extract hours from Occurred.Date.or.Date.Range.Start.

res$Time.Occurred <- format(res$Occurred.Date.or.Date.Range.Start, format="%H:%M:%S")
class(res$Time.Occurred)
## [1] "character"
library(chron)
## 
## Attaching package: 'chron'
## The following objects are masked from 'package:lubridate':
## 
##     days, hours, minutes, seconds, years
res$Time.Occurred <- times(res$Time.Occurred)
res$Hour.Occurred <- hours(res$Occurred.Date.or.Date.Range.Start) ## "%H:%M:%S"
table(res$Hour.Occurred)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 2073  989  827  589  475  393  550  811 1130 1155 1286 1281 2020 1589 1481 
##   15   16   17   18   19   20   21   22   23 
## 1697 1669 1844 1838 1738 1866 1942 1931 1605
class(res$Hour.Occurred)
## [1] "numeric"

We also create a new variable that is the POSIXct version of Hour.Occurred called Date.Hour.Occurred.

res$Date.Hour.Occurred <- strptime(res$Occurred.Date.or.Date.Range.Start, "%Y-%m-%d %H")
res$Date.Hour.Occurred <- as.POSIXct(res$Date.Hour.Occurred)

We also approach this question from a few ways using the dplyr package. In the first appraoch, we reduce all timestamps to hour and date (with no minutes), count the number of crimes committed, and name the dataframe, evPerHrDate. We calculate the standard deviation of crimes committed per hour and store it in tmp. Before moving on, we extract just the Hour using the stringr package to use later for plotting.
In a second analysis, we create two dataframes: evPerHr by aggregating the total number of crimes committed each hour, and avgPerHr, the average number of crimes committed per hour over the summer. To this dataframe, we bind the standard deviation of crimes committed per hour from tmp as the variable sd.
For the third dataframe, evPerTs, we use the full timestamp (date, hour, and minute) and counts all crimes that occurred during that single timestamp.

evPerHrDate <- ddply(res, .(Date.Hour.Occurred), summarise, Count=length(Offense))
print(head(evPerHrDate, 10))
##     Date.Hour.Occurred Count
## 1  2014-06-01 00:00:00    51
## 2  2014-06-01 01:00:00    17
## 3  2014-06-01 02:00:00    16
## 4  2014-06-01 03:00:00    17
## 5  2014-06-01 04:00:00     4
## 6  2014-06-01 05:00:00    10
## 7  2014-06-01 06:00:00     5
## 8  2014-06-01 07:00:00     6
## 9  2014-06-01 08:00:00    15
## 10 2014-06-01 09:00:00     8
## Create a new variable for Hour 
evPerHrDate$HourFull <- format(evPerHrDate$Date.Hour.Occurred, format="%H:%M:%S")
print(head(evPerHrDate, 10))
##     Date.Hour.Occurred Count HourFull
## 1  2014-06-01 00:00:00    51 00:00:00
## 2  2014-06-01 01:00:00    17 01:00:00
## 3  2014-06-01 02:00:00    16 02:00:00
## 4  2014-06-01 03:00:00    17 03:00:00
## 5  2014-06-01 04:00:00     4 04:00:00
## 6  2014-06-01 05:00:00    10 05:00:00
## 7  2014-06-01 06:00:00     5 06:00:00
## 8  2014-06-01 07:00:00     6 07:00:00
## 9  2014-06-01 08:00:00    15 08:00:00
## 10 2014-06-01 09:00:00     8 09:00:00
tmp <- tapply(evPerHrDate$Count, evPerHrDate$Hour, sd)

## Use a regular expression to extract only the hour 
library(stringr)
reg.ex <- "[0-9]{2}"
evPerHrDate$Hour <- str_extract(evPerHrDate$HourFull, reg.ex)
evPerHrDate$Hour <- as.integer(evPerHrDate$Hour)
str(evPerHrDate)
## 'data.frame':    2173 obs. of  4 variables:
##  $ Date.Hour.Occurred: POSIXct, format: "2014-06-01 00:00:00" "2014-06-01 01:00:00" ...
##  $ Count             : int  51 17 16 17 4 10 5 6 15 8 ...
##  $ HourFull          : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
##  $ Hour              : int  0 1 2 3 4 5 6 7 8 9 ...
evPerHr <- aggregate(Count ~ Hour, evPerHrDate, FUN=sum)
print(head(evPerHr))
##   Hour Count
## 1    0  2073
## 2    1   989
## 3    2   827
## 4    3   589
## 5    4   475
## 6    5   393
avgPerHr <- aggregate(Count ~ Hour, evPerHrDate, FUN=mean)
print(head(avgPerHr))
##   Hour     Count
## 1    0 22.532609
## 2    1 10.868132
## 3    2  9.292135
## 4    3  6.770115
## 5    4  5.654762
## 6    5  4.974684
avgPerHr$sd <- tmp

evPerTs <- ddply(res, .(Occurred.Date.or.Date.Range.Start), summarise, Count=length(Offense))
head(evPerTs)
##   Occurred.Date.or.Date.Range.Start Count
## 1               2014-06-01 00:00:00    22
## 2               2014-06-01 00:01:00    13
## 3               2014-06-01 00:13:00     1
## 4               2014-06-01 00:20:00     2
## 5               2014-06-01 00:22:00     2
## 6               2014-06-01 00:26:00     1

Question 3. What were the most common crimes?

We address this question in a few ways. First, we use the dplyr package to count the total number of occurrences by Offense and arrange it in descending order. Then, we store the top 10 crimes in a new dataframe.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
arrange <- dplyr::arrange 
count <- dplyr::count 
crimes_tbl <- arrange(count(res, Offense), desc(n))

colnames(crimes_tbl) <- c("Offense", "Count")
head(crimes_tbl)
## # A tibble: 6 × 2
##           Offense Count
##            <fctr> <int>
## 1       car prowl  6230
## 2  other property  3755
## 3        burglary  3212
## 4   vehicle theft  3057
## 5 property damage  2365
## 6         assault  2018
top10crimes <- head(crimes_tbl, 10)

Now, let’s look at this question in the context of the previous questions. Which are the most common crimes each day, and for which hour of the day? Let’s begin with crime type per day. We create a new dataframe, eventTypePerDay, using the plyr package. Then, we subset it to include only the top 10 crimes in Seattle.

eventTypePerDay <- ddply(res, Date.Occurred~Offense, summarise, Count=length(Offense))
head(eventTypePerDay)
##   Date.Occurred                  Offense Count
## 1    2014-06-01 [inc - case dc use only]     1
## 2    2014-06-01         animal complaint     4
## 3    2014-06-01                  assault    24
## 4    2014-06-01            bias incident     2
## 5    2014-06-01               bike theft     4
## 6    2014-06-01                 burglary    54
top10EventTypePerDay <- subset(eventTypePerDay, (eventTypePerDay$Offense %in% top10crimes$Offense)==TRUE)
head(top10EventTypePerDay)
##    Date.Occurred        Offense Count
## 3     2014-06-01        assault    24
## 6     2014-06-01       burglary    54
## 8     2014-06-01      car prowl    86
## 11    2014-06-01    disturbance    14
## 12    2014-06-01          fraud     9
## 17    2014-06-01 other property    31

Now, we look at how the top 10 crimes differ across each hour with eventTypePerHr.

eventTypePerHr <- ddply(res, Hour.Occurred~Offense, summarise, Count=length(Offense))
head(eventTypePerHr)
##   Hour.Occurred                  Offense Count
## 1             0 [inc - case dc use only]     1
## 2             0         animal complaint     9
## 3             0                  assault    95
## 4             0            bias incident     2
## 5             0               bike theft    44
## 6             0                 burglary   211
top10EventTypePerHr <- subset(eventTypePerHr, (eventTypePerHr$Offense %in% top10crimes$Offense)==TRUE)
head(top10EventTypePerHr)
##    Hour.Occurred        Offense Count
## 3              0        assault    95
## 6              0       burglary   211
## 8              0      car prowl   394
## 11             0    disturbance    46
## 16             0          fraud   212
## 23             0 other property   289

Question 4. Which crimes went unreported the longest?

Not every crime that occurs is detected immediately. Even more concerning is the notion that some people wait hours or even days before reporting a crime to the police. Let’s find out which types of crimes in our data went unreported the longest.
We begin with a bit more feature engineering using our res dataframe. We create two new variables: Time.Btw.Report represents the time lapsed between Occurred.Date.or.Date.Range.Start and Date.Reported, and Time.Range represents the length of a crime event. We also check for negative values (potential errors) in these new variables.

res$Time.Btw.Report <- difftime(res$Date.Reported, res$Occurred.Date.or.Date.Range.Start, units = "days")
which(res$Time.Btw.Report<0)
## integer(0)
res$Time.Range <- difftime(res$Occurred.Date.Range.End, res$Occurred.Date.or.Date.Range.Start, units = "days")
which(res$Time.Range<0)
##  [1]   463   464  7261  7262  8113  8114  8115  8116 12567 12568 19189
## [12] 19190 20132 20133 28443 28444 30415 30416 31192 31193

There are some negative values for Time.Range. Let’s find out why.

negative <- which(res$Time.Range<0)
head(res[c(negative), c(8:10, 24:25)])
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 463  2014-09-09 16:13:00               2014-07-27 12:00:00
## 464  2014-09-09 16:13:00               2014-07-27 12:00:00
## 7261 2014-08-13 05:41:00               2014-08-13 05:41:00
## 7262 2014-08-13 05:41:00               2014-08-13 05:41:00
## 8113 2014-08-10 14:32:00               2014-08-10 14:32:00
## 8114 2014-08-10 14:32:00               2014-08-10 14:32:00
##      Occurred.Date.Range.End Time.Btw.Report         Time.Range
## 463      2014-07-27 10:00:00   44.17569 days -0.0833333333 days
## 464      2014-07-27 10:00:00   44.17569 days -0.0833333333 days
## 7261     2014-08-13 05:40:00    0.00000 days -0.0006944444 days
## 7262     2014-08-13 05:40:00    0.00000 days -0.0006944444 days
## 8113     2014-08-10 14:30:00    0.00000 days -0.0013888889 days
## 8114     2014-08-10 14:30:00    0.00000 days -0.0013888889 days

It appears that the end date was reported as occurring before the start date for a few crimes. However, the start date for these was the same as the time reported. This must be a data entry error, and because there are so few, we will ignore these errors for the next calculation.
Now, we create another variable called Time.Went.Unreported. Here, we use the ifelse function to check if a row has an end range date. If so, then the time unreported is the difference between the Time.Btw.Report and Time.Range variables. If there is no end range date (i.e. the crime was instantaneous), then we simply use the Time.Btw.Report variable.

res$Time.Went.Unreported <- with(res, ifelse(is.na(res$Occurred.Date.Range.End)==F, res$Time.Btw.Report-res$Time.Range, res$Time.Btw.Report))
which(res$Time.Went.Unreported<0)
##  [1]  1323  1324  3669  3670  7461  9677  9678 11686 18792 18793 19325
## [12] 19327 21331 23066 23067 23326 23327 23328 24182 24183 26901 30957

We still see a few values less than 0. Let’s take a closer look at them.

negative <- which(res$Time.Went.Unreported<0)
head(res[negative, c(8:10, 24:26)])
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 1323 2014-08-30 14:00:00               2014-08-29 13:00:00
## 1324 2014-08-30 14:00:00               2014-08-29 13:00:00
## 3669 2014-08-23 16:37:00               2014-08-23 16:37:00
## 3670 2014-08-23 16:37:00               2014-08-23 16:37:00
## 7461 2014-08-12 13:38:00               2014-08-12 12:30:00
## 9677 2014-08-05 20:52:00               2014-07-02 13:00:00
##      Occurred.Date.Range.End  Time.Btw.Report       Time.Range
## 1323     2014-08-30 15:09:00  1.04166667 days  1.08958333 days
## 1324     2014-08-30 15:09:00  1.04166667 days  1.08958333 days
## 3669     2014-08-23 19:20:00  0.00000000 days  0.11319444 days
## 3670     2014-08-23 19:20:00  0.00000000 days  0.11319444 days
## 7461     2014-08-12 13:45:00  0.04722222 days  0.05208333 days
## 9677     2014-08-05 21:00:00 34.32777778 days 34.33333333 days
##      Time.Went.Unreported
## 1323         -0.047916667
## 1324         -0.047916667
## 3669         -0.113194444
## 3670         -0.113194444
## 7461         -0.004861111
## 9677         -0.005555556

Here, we see that a negative Time.Went.Unreported value means that the crime was ongoing when it was reported.
Next, we aggregate the average time crimes go unreported by offense type. Then, using the dplyr package, we arrange it in descending order. Then, we take the top 10 results and store them in a new dataframe, top10Times. The units are in days.

timeNoReport <- arrange(aggregate(Time.Went.Unreported ~ Offense, res, FUN=mean, na.rm=TRUE), desc(Time.Went.Unreported))
top10Times <- head(timeNoReport, 10)
print(top10Times)
##           Offense Time.Went.Unreported
## 1         forgery            12.560817
## 2      mail theft             9.433820
## 3           fraud             8.503972
## 4   lost property             7.865201
## 5    purse snatch             4.509105
## 6        embezzle             4.351206
## 7  other property             3.951514
## 8     counterfeit             3.915754
## 9      pickpocket             3.616952
## 10     bike theft             1.931416

Results

Answer to Question 1, “How many crimes were committed each day?”

Let’s look at evPerDay.

head(evPerDay, 10)
##    Date.Occurred Count Weekday
## 1     2014-06-01   397     Sun
## 2     2014-06-02   419     Mon
## 3     2014-06-03   357    Tues
## 4     2014-06-04   392     Wed
## 5     2014-06-05   386   Thurs
## 6     2014-06-06   401     Fri
## 7     2014-06-07   389     Sat
## 8     2014-06-08   353     Sun
## 9     2014-06-09   358     Mon
## 10    2014-06-10   398    Tues
summary(evPerDay$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   255.0   328.0   353.0   356.3   388.2   447.0

For the time frame of summer 2014 in Seattle, there was a maximum of 447 and a minimum of 255 crimes committed per day, with a mean of about 356 crimes.

We plot this data and all future data using the ggplot2 package.

library(ggplot2)
ggplot(data=evPerDay, aes(x=Date.Occurred, y=Count)) + geom_point() + geom_smooth(method='loess', col='red', alpha=0.2, size=1) + ggtitle("Seattle Total Crimes Committed Each Day
Summer 2014") + theme_bw()

We can see from the regression line that total crimes per day decreased as the summer continued. It appears that total crimes per day peaked at the end of June, and our lowest point, which seems to occur on July 4th (Independence Day in the U.S.), shortly followed it. Let’s check.

maximum <- max(evPerDay$Count)
evPerDay[which(evPerDay[,2]==maximum),]
##    Date.Occurred Count Weekday
## 25    2014-06-25   447     Wed
minimum <- min(evPerDay$Count)
evPerDay[which(evPerDay[,2]==minimum),]
##    Date.Occurred Count Weekday
## 33    2014-07-03   255   Thurs

Sure enough, crimes per day in Seattle peaked on June 25 and bottomed on July 3, 2014 in our time frame.

Now let’s look at our second model using weekday. Recall our aggregate data frame, wdayAgg.

print(wdayAgg)
##   Weekday    Count
## 1     Sun 336.7857
## 2     Mon 352.8462
## 3    Tues 354.7692
## 4     Wed 358.6154
## 5   Thurs 353.5385
## 6     Fri 381.5385
## 7     Sat 357.4615
summary(wdayAgg$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   336.8   353.2   354.8   356.5   358.0   381.5

On a weekday basis, an average minimum of about 337 crimes occurred on Sundays, and an average maximum of 382 crimes occurred on Fridays. The mean number of crimes per weekday was about 357, less than one crime higher than the mean reported in our previous model.

Now, let’s look at a visual representation of this weekday-centric data.

ggplot(data=evPerDay, aes(x=Weekday, y=Count)) + geom_boxplot() + ggtitle("Seattle Total Crimes Committed per Weekday
Summer 2014")

If you look closely, you can see that the maximum number of crimes per weekday occurs on a Wednesday, and the minimum occurrs on a Thursday. If you recall our previous model, you will remember that our global max, June 25, 2014 was a Wednesday, and our global min, July 3, 2014, was on a Thursday. Our data are consistent.

Answer to Question 2, “What hour of the day did the most crimes occur?”

ggplot(data=evPerHrDate, aes(x=Hour, y=Count)) + geom_point(alpha=0.25) + geom_smooth(method="loess", col="green", size=1.2, alpha=.8) + ggtitle("Seattle Total Crimes per Hour, Summer 2014") + theme_bw()

The data smoothed across the entire time frame shows us that less crimes tended to occur at hour 5 (5 A.M.), and gradually increased as the day went on until midnight, when crimes dropped.

Let’s look again at our small dataframe of average crimes per hour, evPerHr.

print(evPerHr)
##    Hour Count
## 1     0  2073
## 2     1   989
## 3     2   827
## 4     3   589
## 5     4   475
## 6     5   393
## 7     6   550
## 8     7   811
## 9     8  1130
## 10    9  1155
## 11   10  1286
## 12   11  1281
## 13   12  2020
## 14   13  1589
## 15   14  1481
## 16   15  1697
## 17   16  1669
## 18   17  1844
## 19   18  1838
## 20   19  1738
## 21   20  1866
## 22   21  1942
## 23   22  1931
## 24   23  1605
summary(evPerHr$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   393.0   948.5  1535.0  1366.0  1840.0  2073.0

Across our entire time frame of summer 2014, midnight (hour 0) had the maximum total number of crimes reported, at 2073. Our least popular time for crimes was at 5 am, where only 393 crimes were reported in our time frame. The mean number of total crimes reported for each hour in our time frame was 1366.
Note: It is worth mentioning that hour 0 could have the highest number of total crimes due to data entry error; that is, it isn’t uncommon in data entry for an unknown time to be recorded with the correct date and the hour 0. For example, if someone reports a crime having occurred on June 5th but cannot remember the time, the data entry technician may report the timestamp as “2014-06-05 0:00:00”. Furthermore, the estimator “around midnight” is commonly used when people are not exactly sure of the time of a past event.

Let’s look at our next highest value for total crimes committed during any hour.

evPerHrNo0 <- evPerHr[-1,] ## Remove hour 0
summary(evPerHrNo0$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     393     908    1481    1335    1788    2020

Here, we get a new maximum of hour 12 with 2020 crimes total for our time frame. Similarly to midnight, “around noon” is a commonly used estimator by people who do not know the exact hour of a past event’s occurrence. This is highly suspicious. We can only cautiously determine from these data that the most crimes occured “around midnight”, and the second most common time was “around noon” in our time frame. We will deal with this shortly.

Here, we take a look at the average number of crimes per hour by plotting avgPerHr including error bars.

g1 <- ggplot(data=avgPerHr, aes(x=Hour, y=Count, ymin=Count-sd, ymax=Count+sd)) + geom_point() + geom_smooth(method="loess", col="green", size=.6, alpha=.2) + geom_errorbar(size=.2) + theme_bw()
g1 + ggtitle("Seattle Average Crimes Committed Each Hour of Day  
Summer 2014")

Again, we see peaks at hours 0 and 12. Let’s remove all timestamps that report a crime as having occurred at exactly midnight (0:00:00) and exactly at noon (12:00:00) to test the robustness of our data. In this section, we begin with recalculating the total crimes per hour. Then we revisit average number of crimes per hour with standard deviation. Finally, we preemptively recalculate the number of crimes per each timestamp with exact midnight and noon values removed.

resRmNM <- subset(res, (Time.Occurred != "0:00:00") & (Time.Occurred != "12:00:00"))
evPerHrDateRmNM <- ddply(resRmNM, .(Date.Hour.Occurred), summarise, Count=length(Offense))
evPerHrDateRmNM$HourFull <- format(evPerHrDateRmNM$Date.Hour.Occurred, format="%H:%M:%S")
print(head(evPerHrDateRmNM, 10))
##     Date.Hour.Occurred Count HourFull
## 1  2014-06-01 00:00:00    29 00:00:00
## 2  2014-06-01 01:00:00    17 01:00:00
## 3  2014-06-01 02:00:00    16 02:00:00
## 4  2014-06-01 03:00:00    17 03:00:00
## 5  2014-06-01 04:00:00     4 04:00:00
## 6  2014-06-01 05:00:00    10 05:00:00
## 7  2014-06-01 06:00:00     5 06:00:00
## 8  2014-06-01 07:00:00     6 07:00:00
## 9  2014-06-01 08:00:00    15 08:00:00
## 10 2014-06-01 09:00:00     8 09:00:00
tmpRmNM <- tapply(evPerHrDateRmNM$Count, evPerHrDateRmNM$Hour, sd)

evPerHrDateRmNM$Hour <- str_extract(evPerHrDateRmNM$HourFull, reg.ex)
evPerHrDateRmNM$Hour <- as.integer(evPerHrDateRmNM$Hour)
str(evPerHrDateRmNM)
## 'data.frame':    2173 obs. of  4 variables:
##  $ Date.Hour.Occurred: POSIXct, format: "2014-06-01 00:00:00" "2014-06-01 01:00:00" ...
##  $ Count             : int  29 17 16 17 4 10 5 6 15 8 ...
##  $ HourFull          : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
##  $ Hour              : int  0 1 2 3 4 5 6 7 8 9 ...
evPerHrRmNM <- aggregate(Count ~ Hour, evPerHrDateRmNM, FUN=sum)
print(evPerHrRmNM)
##    Hour Count
## 1     0  1202
## 2     1   989
## 3     2   827
## 4     3   589
## 5     4   475
## 6     5   393
## 7     6   550
## 8     7   811
## 9     8  1130
## 10    9  1155
## 11   10  1286
## 12   11  1281
## 13   12  1153
## 14   13  1589
## 15   14  1481
## 16   15  1697
## 17   16  1669
## 18   17  1844
## 19   18  1838
## 20   19  1738
## 21   20  1866
## 22   21  1942
## 23   22  1931
## 24   23  1605
evPerHrRmNM[which.max(evPerHrRmNM$Count),]
##    Hour Count
## 22   21  1942

We see that hour 0 and hour 12 no longer possess the maximum values in our total crimes for each hour data. After removing all timestamps that occurred exactly at 0:00:00 or 12:00:00, our new maximum time for crime is hour 21, or the hour of 9 pm. Did removing these timestamps affect our polynomial regression for average crimes over hour? Let’s see.

avgPerHrRmNM <- aggregate(Count ~ Hour, evPerHrDateRmNM, FUN=mean)
avgPerHrRmNM$sd <- tmpRmNM
print(head(avgPerHrRmNM))
##   Hour     Count       sd
## 1    0 13.065217 6.025230
## 2    1 10.868132 6.031784
## 3    2  9.292135 4.439498
## 4    3  6.770115 4.589425
## 5    4  5.654762 3.334717
## 6    5  4.974684 2.541828
evPerTsRmNM <- ddply(resRmNM, .(Occurred.Date.or.Date.Range.Start), summarise, Count=length(Offense))
head(evPerTsRmNM)
##   Occurred.Date.or.Date.Range.Start Count
## 1               2014-06-01 00:01:00    13
## 2               2014-06-01 00:13:00     1
## 3               2014-06-01 00:20:00     2
## 4               2014-06-01 00:22:00     2
## 5               2014-06-01 00:26:00     1
## 6               2014-06-01 00:30:00     4
g2 <- ggplot(data=avgPerHrRmNM, aes(x=Hour, y=Count, ymin=Count-sd, ymax=Count+sd)) + geom_point() + geom_smooth(method="loess", col="green", size=.6, alpha=.2) + geom_errorbar(size=.2) + theme_bw()

library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(g1 + ggtitle("Including 0:00:00 and 12:00:00"), g2 + ggtitle("0:00:00 and 12:00:00 removed"), ncol=2, top="Seattle Total Crimes Committed Each Hour of Day  
Summer 2014")

We still see a positive trend between hour of day and frequency of crimes committed. Therefore, our data are robust.

Now, let’s take a look at our data using the third approach. In the evPerTs dataframe, we combined all crime occurrence timestamps across our data frame and counted the total number of crimes to occur at each reported minute and hour of each day.

head(evPerTs, 10)
##    Occurred.Date.or.Date.Range.Start Count
## 1                2014-06-01 00:00:00    22
## 2                2014-06-01 00:01:00    13
## 3                2014-06-01 00:13:00     1
## 4                2014-06-01 00:20:00     2
## 5                2014-06-01 00:22:00     2
## 6                2014-06-01 00:26:00     1
## 7                2014-06-01 00:30:00     4
## 8                2014-06-01 00:39:00     2
## 9                2014-06-01 00:42:00     2
## 10               2014-06-01 00:45:00     2
summary(evPerTs$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.836   2.000  46.000

Here is a visualization of the data.

g3 <- ggplot(data=evPerTs, aes(x=Occurred.Date.or.Date.Range.Start, y=Count)) + geom_path() + ggtitle("Seattle Total Crimes Committed per Each Reported Timestamp
Summer 2014")
g3 + ggtitle("Seattle Total Crimes Committed for Each Reported Timestamp
Summer 2014")

The summary statistics tell us that the minimum number of crimes to occur at any timestamp was 1, which is not interesting. Looking at the graph, we immediately see two outliers: one at the beginning of July, and another at the beginning of August. The summary stats tell us the maximum number of crimes committed in any single timestamp was 46. Let’s find out what that exact timestamp was.

which.max(evPerTs$Count)
## [1] 3817
evPerTs[3817,]
##      Occurred.Date.or.Date.Range.Start Count
## 3817                        2014-07-01    46

On July 1, 2014, a whopping number of 46 crimes were committed at midnight. However, remembering our previous caveat of data entry and “around midnight” estimators, we should be careful in extrapolating too much from the data. Let’s remove July 1st and look at our second highest peak, which we can see from the graph occurs in the beginning of August.

evPerTsNoJul1 <- evPerTs[-3817,] ## Remove July 1
which.max(evPerTsNoJul1$Count)
## [1] 7802
evPerTsNoJul1[7802,]
##      Occurred.Date.or.Date.Range.Start Count
## 7803               2014-08-01 12:00:00    33

Consistent with our data, the next highest peak of crimes committed during any timestamp occurred at noon on August 1, 2014, with a total number of 33 crimes. Now, let’s re-run our analysis as we did previously, removing our 0:00:00 and 12:00:00 timestamps.

evPerTsRmNM <- ddply(resRmNM, .(Occurred.Date.or.Date.Range.Start), summarise, Count=length(Offense))
head(evPerTsRmNM)
##   Occurred.Date.or.Date.Range.Start Count
## 1               2014-06-01 00:01:00    13
## 2               2014-06-01 00:13:00     1
## 3               2014-06-01 00:20:00     2
## 4               2014-06-01 00:22:00     2
## 5               2014-06-01 00:26:00     1
## 6               2014-06-01 00:30:00     4
summary(evPerTsRmNM$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.729   2.000  24.000

After removing our 0:00:00 and 12:00:00 timestamps from our dataframe, we see that our new maximum number of crimes to occur at any reported timestamp was 24. Let’s visualize this and see where it lies.

g4 <- ggplot(data=evPerTsRmNM, aes(x=Occurred.Date.or.Date.Range.Start, y=Count)) + geom_path()

grid.arrange(g3 + ggtitle("Including 0:00:00 and 12:00:00"), g4 + ggtitle("0:00:00 and 12:00:00 removed"), ncol=2, top="Seattle Total Crimes Committed for Each Reported Time Frame
Summer 2014")

which.max(evPerTsRmNM$Count)
## [1] 6817
evPerTsRmNM[6817,]
##      Occurred.Date.or.Date.Range.Start Count
## 6817               2014-07-25 20:00:00    24

We find that our next most believable maximum timestamp for total crimes occurred happened at 20:00:00 (8 pm) on July 25.

Answer to Question 3. “What were the most common crimes?”

print(top10crimes)
## # A tibble: 10 × 2
##            Offense Count
##             <fctr> <int>
## 1        car prowl  6230
## 2   other property  3755
## 3         burglary  3212
## 4    vehicle theft  3057
## 5  property damage  2365
## 6          assault  2018
## 7            fraud  1473
## 8      disturbance  1333
## 9          threats  1178
## 10 stolen property  1136

Car prowl, or the theft of personal items from cars, was the number one crime across the time frame. Now, we visualize the top Seattle crimes from the summer of 2014 per day.

ggplot(data=top10EventTypePerDay, aes(x=Date.Occurred, y=Count, fill=Offense)) + geom_area(position="stack") + theme(legend.position="bottom") + ggtitle("Breakdown of Top 10 Overall Crimes per Day in Seattle
Summer 2014") 

Let’s take a look at the top Seattle crimes per hour of day. Here, we construct two graphs: one stacked and one unstacked to view the breakdown of crimes per hour.

ggplot(data=top10EventTypePerHr, aes(x=Hour.Occurred, y=Count, fill=Offense)) + geom_area(position="stack") + theme(legend.position="bottom") + ggtitle("Seattle Top 10 Overall Crimes per Each Hour of Day
Summer 2014")

ggplot(data=top10EventTypePerHr, aes(x=Hour.Occurred, y=Count, col=Offense)) + geom_line() + theme(legend.position="bottom") + ggtitle("Seattle Top 10 Overall Crimes per Each Hour of Day
Summer 2014")

Car prowl, the top crime for our time frame, increased much more rapidly as the day progressed than the other top crimes. However, car prowl was surpassed at least once by a different crime. The most discernable place is at hour 12 which, if you recall, is our second highest rate of crime occurrence each day.

hour12CrimeTypes <- subset(top10EventTypePerHr, Hour.Occurred==12)
which.max(hour12CrimeTypes$Count)
## [1] 6
hour12CrimeTypes[6,]
##     Hour.Occurred        Offense Count
## 376            12 other property   290

At noon, the most common crime in Seattle for our time frame was “other property”. If we look at the Offense.Type variable found in the original res dataframe, we can see that ‘other property’ pertains to theft of various belongings.

prop_other <- subset(res, Offense=="other property")
prop_other$Offense.Type <- factor(prop_other$Offense.Type)
table(prop_other$Offense.Type)
## 
##    THEFT-AUTO PARTS       THEFT-AUTOACC      THEFT-BUILDING 
##                  72                 252                 846 
##        THEFT-COINOP THEFT-LICENSE PLATE           THEFT-OTH 
##                   6                 358                2221

Answer to Question 4, “Which crimes went unreported the longest?”

Here is our top10Times dataframe from earlier.

print(top10Times)
##           Offense Time.Went.Unreported
## 1         forgery            12.560817
## 2      mail theft             9.433820
## 3           fraud             8.503972
## 4   lost property             7.865201
## 5    purse snatch             4.509105
## 6        embezzle             4.351206
## 7  other property             3.951514
## 8     counterfeit             3.915754
## 9      pickpocket             3.616952
## 10     bike theft             1.931416
summary(res$Time.Went.Unreported)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -0.1132   0.0000   0.0285   1.7480   0.2944 366.1000
ggplot(data=top10Times, aes(x=Offense, y=Time.Went.Unreported)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Seattle Top 10 Crimes Grouped by Average Days Gone Unreported 
Summer 2014") 

The data show that during the summer of 2014 in Seattle, forgery was by far the crime that went unreported the longest, with an average of almost 13 days between the end of the crime and the police report. The next closest crime was mail theft, with an average of 9 days unreported. Looking at the summary data, the mean number of days a crime went unreported for our entire dataset was about 2 days, and the maximum was 366 days or more than one year. Which crime avoided getting reported for an entire year? Let’s look.

which.max(res$Time.Went.Unreported)
## [1] 4
res$Offense[4]
## [1] counterfeit
## 48 Levels: [inc - case dc use only] animal complaint ... weapon

The answer is counterfeit. It is possible that, in the cases where crimes go unreported for an extended period of time, the victims did not realize that the crime had taken place.

Conclusion

In answering our analysis questions using the Seattle crime data of summer 2014 dataset, we learned a few things.
We first saw that, on average, 356 crimes per day were committed with a high of 447 crimes on June 25th and a low of 255 crimes on July 3rd. In general, there was a downward trend in the number of daily crimes committed as the summer progressed. Additionally, from a weekday standpoint, an average of 357 crimes were committed for any given weekday, with a high of 382 crimes for Fridays and a low of 337 crimes for Sundays.
Then, we explored crimes as they related to time of day. We saw that, summed across all days, there were 2073 crimes that occurred during hour 0 and 2020 crimes that occurred during hour 12. The average total number of crimes for any hour was 1335. Hour 5 had the fewest number of crime occurrences for the summer with 393 crimes total. The total number of crimes committed increased as a day progressed. From a different perspective, we saw that the maximum number of crimes to occurr at any single time was 46 crimes on July 1 at midnight (0:00:00). The second highest number of crimes that happened at once was 33 on August 1 at noon (12:00:00). However, keeping in mind possible data entry and witness error, we could argue that the true most popular hour of the day for crime was during hour 21, where a total of 1942 crimes occurred that summer, and that the true maximum crime time frame occurred at 20:00:00 on July 25 when 24 crimes were reported. We found these last results by removing all timestamps exactly equal to 0:00:00 and 12:00:00 in our dataframe. Regardless, our positive correlation of total crime frequency increasing as the day progressed upheld.
Next, we asked which crimes were the most common. We saw that car prowl, or the theft of personal belongings from a car, was the most commonly reported crime for the summer in Seattle. The likelihood of car prowl occurring increased dramatically as the night went on. The most common crime to occur at noon for this dataset was described as ‘other property’ (i.e. theft).
Finally, we calculated the length of time a crime went unreported. On average, forgery was the crime that took longest to report, with about 13 days in between the conclusion of the crime and the report of it. The longest crime to go unreported in the data was conterfeit, with 366 in between the conclusion of it and its reporting.
There are a lot of interesting inferences to be drawn from the data, but they must be done carefully. While analyzing the data, we found a few potential caveats in its collection. A high number of crimes were reported to have occurred at midnight and at noon, which is suspicious considering people’s tendencies to round to familiar estimators when precise details are unknown. Furthermore, we found that some crimes had a negative time range between the crime event end date and start date. This suggests that there were some errors in the data.
Despite this, we found relatively few errors in the Seattle summer 2014 crime dataset of 32,779 observations. All in all, it seems that Seattle had a mostly nonviolent summer in 2014 with mostly theft crimes taking place between noon and midnight of any given day and total crimes declining day to day as time went on.