Overview

Crime is an issue that escapes no city’s police force. Here, we look at crime data from the city of Seattle, WA collected during the summer of 2014. The subset of the Seattle incident data that we use can be found here, and the full data set is here. Visit the Seattle data portal for more information on their open data projects.

The purpose of this analysis is to answer four questions using the subset of Seattle crime data taken from summer 2014:

  1. How many crimes were committed each day? Skip to: analysis | answer
  2. What hour of the day did the most crimes occur? Skip to: analysis | answer
  3. What were the most common crimes? Skip to: analysis | answer
  4. Which crimes went unreported the longest? Skip to: analysis | answer

Let’s begin by downloading and loading the data into an R object called res.

fileURL <- "https://github.com/uwescience/datasci_course_materials/raw/master/assignment6/seattle_incidents_summer_2014.csv"
if (!file.exists("crimedata.csv")) {
        download.file(fileURL, destfile="crimedata.csv", method="auto")
}
res <- read.csv("crimedata.csv")

Now, let’s look at the summary statistics, the structure, and the head of res to get a better feel for it.

summary(res)
##    RMS.CDW.ID      General.Offense.Number  Offense.Code  
##  Min.   :  12470   Min.   :2.015e+07      2305   : 6230  
##  1st Qu.:  47246   1st Qu.:2.014e+09      X      : 3305  
##  Median : 999859   Median :2.014e+09      2399   : 3034  
##  Mean   : 578650   Mean   :2.020e+09      2404   : 2871  
##  3rd Qu.:1030116   3rd Qu.:2.014e+09      2903   : 1638  
##  Max.   :1125252   Max.   :2.014e+11      1313   : 1368  
##                                           (Other):14333  
##  Offense.Code.Extension                         Offense.Type  
##  Min.   : 0.0           THEFT-CARPROWL                : 6230  
##  1st Qu.: 0.0           VEH-THEFT-AUTO                : 2588  
##  Median : 0.0           THEFT-OTH                     : 2221  
##  Mean   : 4.2           PROPERTY DAMAGE-NON RESIDENTIA: 1638  
##  3rd Qu.: 1.0           ASSLT-NONAGG                  : 1320  
##  Max.   :98.0           DISTURBANCE-OTH               : 1295  
##                         (Other)                       :17487  
##  Summary.Offense.Code Summarized.Offense.Description
##  2300   :12161        CAR PROWL      : 6230         
##  X      : 3305        OTHER PROPERTY : 3755         
##  2200   : 3212        BURGLARY       : 3212         
##  2400   : 2871        VEHICLE THEFT  : 3057         
##  1300   : 2639        PROPERTY DAMAGE: 2365         
##  2900   : 2365        ASSAULT        : 2018         
##  (Other): 6226        (Other)        :12142         
##                 Date.Reported        Occurred.Date.or.Date.Range.Start
##  08/04/2014 07:00:00 PM:   11   07/01/2014 12:00:00 AM:   46          
##  07/10/2014 01:16:00 PM:    9   08/01/2014 12:00:00 PM:   33          
##  06/25/2014 05:26:00 PM:    8   07/25/2014 08:00:00 PM:   24          
##  07/07/2014 08:00:00 PM:    8   08/01/2014 12:00:00 AM:   23          
##  07/24/2014 11:05:00 AM:    8   06/01/2014 12:00:00 AM:   22          
##  06/19/2014 02:52:00 PM:    7   08/02/2014 12:00:00 PM:   22          
##  (Other)               :32728   (Other)               :32609          
##            Occurred.Date.Range.End                  Hundred.Block.Location
##                        :16280      4XX BLOCK OF NE NORTHGATE WY:  194     
##  07/07/2014 07:00:00 AM:   21      16XX BLOCK OF 11 AV         :  138     
##  06/05/2014 08:00:00 AM:   18      5XX BLOCK OF 3 AV           :  129     
##  07/14/2014 08:00:00 AM:   17      4XX BLOCK OF PINE ST        :  119     
##  07/19/2014 09:00:00 AM:   17      84XX BLOCK OF SEAVIEW PL NW :  117     
##  06/30/2014 08:00:00 AM:   16      1XX BLOCK OF S WASHINGTON ST:  110     
##  (Other)               :16410      (Other)                     :31972     
##  District.Sector   Zone.Beat     Census.Tract.2000   Longitude     
##  M      : 2642   E2     : 1089   Min.   :  100.1   Min.   :-122.4  
##  K      : 2379   M3     :  968   1st Qu.: 4600.1   1st Qu.:-122.3  
##  B      : 2338   B2     :  845   Median : 7500.4   Median :-122.3  
##  N      : 2333   K2     :  841   Mean   : 6737.7   Mean   :-114.7  
##  E      : 2255   M1     :  837   3rd Qu.: 9200.2   3rd Qu.:-122.3  
##  J      : 2204   M2     :  837   Max.   :26500.1   Max.   :   0.0  
##  (Other):18628   (Other):27362   NA's   :134                       
##     Latitude                               Location         Month      
##  Min.   : 0.00   (0.0, 0.0)                    : 2050   Min.   :6.000  
##  1st Qu.:47.58   (47.7086028, -122.324615155)  :  120   1st Qu.:6.000  
##  Median :47.61   (47.708602801, -122.324615156):   87   Median :7.000  
##  Mean   :44.65   (47.717174472, -122.344896099):   85   Mean   :6.981  
##  3rd Qu.:47.66   (47.615838397, -122.318168851):   77   3rd Qu.:8.000  
##  Max.   :47.75   (47.602412415, -122.331082197):   67   Max.   :8.000  
##                  (Other)                       :30293                  
##       Year     
##  Min.   :2014  
##  1st Qu.:2014  
##  Median :2014  
##  Mean   :2014  
##  3rd Qu.:2014  
##  Max.   :2014  
## 
str(res)
## 'data.frame':    32779 obs. of  19 variables:
##  $ RMS.CDW.ID                       : int  483839 481252 481375 481690 478198 480485 470170 465137 461710 456091 ...
##  $ General.Offense.Number           : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Offense.Code                     : Factor w/ 92 levels "1201","1202",..: 19 42 31 37 32 30 30 38 42 39 ...
##  $ Offense.Code.Extension           : int  0 0 0 0 3 0 0 0 0 1 ...
##  $ Offense.Type                     : Factor w/ 147 levels "[INC - CASE DC USE ONLY]",..: 15 43 121 20 122 117 117 42 43 40 ...
##  $ Summary.Offense.Code             : Factor w/ 26 levels "1200","1300",..: 3 8 5 7 5 5 5 8 8 8 ...
##  $ Summarized.Offense.Description   : Factor w/ 48 levels "[INC - CASE DC USE ONLY]",..: 6 20 26 9 29 29 29 20 20 20 ...
##  $ Date.Reported                    : Factor w/ 15766 levels "01/02/2015 11:12:00 AM",..: 4720 3991 3617 3468 3304 3125 629 39 38 36 ...
##  $ Occurred.Date.or.Date.Range.Start: Factor w/ 11557 levels "06/01/2014 01:00:00 AM",..: 3526 116 11508 2393 115 2275 116 10420 7872 6327 ...
##  $ Occurred.Date.Range.End          : Factor w/ 5935 levels "","01/18/2015 09:00:00 AM",..: 1801 1509 1 1 5913 2533 60 10 1 1 ...
##  $ Hundred.Block.Location           : Factor w/ 7924 levels "1 AV / BATTERY ST",..: 5508 2248 6086 5522 5840 3305 2432 2187 557 3091 ...
##  $ District.Sector                  : Factor w/ 19 levels "","99","B","C",..: 9 4 7 12 9 16 15 3 5 5 ...
##  $ Zone.Beat                        : Factor w/ 53 levels "","99","B1","B2",..: 22 7 17 31 23 44 40 3 11 9 ...
##  $ Census.Tract.2000                : num  2900 6300 11301 8200 2700 ...
##  $ Longitude                        : num  -122 -122 -122 -122 -122 ...
##  $ Latitude                         : num  47.7 47.6 47.5 47.6 47.7 ...
##  $ Location                         : Factor w/ 12970 levels "(0.0, 0.0)","(47.465062973, -122.337868218)",..: 10611 7628 872 5213 10823 2191 7640 9804 6525 6000 ...
##  $ Month                            : int  6 6 8 6 6 6 6 8 8 7 ...
##  $ Year                             : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
head(res)
##   RMS.CDW.ID General.Offense.Number Offense.Code Offense.Code.Extension
## 1     483839             2015218538         2202                      0
## 2     481252             2015213067         2610                      0
## 3     481375             2015210301         2316                      0
## 4     481690             2015209327         2599                      0
## 5     478198             2015207880         2399                      3
## 6     480485             2015904103         2308                      0
##           Offense.Type Summary.Offense.Code Summarized.Offense.Description
## 1   BURGLARY-FORCE-RES                 2200                       BURGLARY
## 2 FRAUD-IDENTITY THEFT                 2600                          FRAUD
## 3           THEFT-MAIL                 2300                     MAIL THEFT
## 4          COUNTERFEIT                 2500                    COUNTERFEIT
## 5            THEFT-OTH                 2300                 OTHER PROPERTY
## 6       THEFT-BUILDING                 2300                 OTHER PROPERTY
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 1 06/28/2015 10:31:00 AM            06/28/2014 10:31:00 AM
## 2 06/24/2015 11:09:00 AM            06/01/2014 12:00:00 AM
## 3 06/22/2015 09:22:00 AM            08/31/2014 09:00:00 AM
## 4 06/21/2015 03:52:00 PM            06/20/2014 01:38:00 PM
## 5 06/20/2015 11:59:00 AM            06/01/2014 11:59:00 AM
## 6 06/19/2015 02:55:00 PM            06/19/2014 02:45:00 PM
##   Occurred.Date.Range.End       Hundred.Block.Location District.Sector
## 1  06/28/2015 10:31:00 AM        6XX BLOCK OF NW 74 ST               J
## 2  06/24/2015 11:09:00 AM        23XX BLOCK OF 43 AV E               C
## 3                               81XX BLOCK OF 11 AV SW               F
## 4                                 6XX BLOCK OF PINE ST               M
## 5  11/01/2014 12:00:00 PM 77XX BLOCK OF SUNNYSIDE AV N               J
## 6  07/10/2014 02:45:00 PM 35XX BLOCK OF S FERDINAND ST               R
##   Zone.Beat Census.Tract.2000 Longitude Latitude
## 1        J2          2900.301 -122.3647 47.68252
## 2        C2          6300.100 -122.2771 47.63990
## 3        F3         11300.501 -122.3493 47.52923
## 4        M2          8200.100 -122.3348 47.61237
## 5        J3          2700.202 -122.3294 47.68596
## 6        R3         10300.401 -122.2875 47.55785
##                         Location Month Year
## 1  (47.68252427, -122.364671996)     6 2014
## 2 (47.639900761, -122.277080248)     6 2014
## 3 (47.529232299, -122.349312181)     8 2014
## 4 (47.612368448, -122.334817763)     6 2014
## 5 (47.685959879, -122.329378505)     6 2014
## 6 (47.557854802, -122.287477902)     6 2014

Notice that there are two variables that describe the offense (i.e. crime) type, Offense.Type and summarised.Offense.Description. The str function showed us that the summarised.Offense.Description column only had 48 levels, while Offense.Type had 147. For simplicity’s sake, we use the summarised.Offense.Description variable to categorize crime. Because its name is long, we use the plyr package to change it to Offense to clean up our code.

library(plyr)
res <- rename(res, c("Summarized.Offense.Description" = "Offense"))

Data Cleaning and Preprocessing

Next, let’s prepare the data for analysis. We first convert the date variables to POSIXlt/ct format using the lubridate library. Then, we convert the offense type variables to lowercase letters for improved readability. We coerce the offense type variable back to a factor after the conversion.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
res$Date.Reported                       <- mdy_hms(res$Date.Reported)
res$Occurred.Date.or.Date.Range.Start   <- mdy_hms(res$Occurred.Date.or.Date.Range.Start) 
res$Occurred.Date.Range.End             <- mdy_hms(res$Occurred.Date.Range.End)
res$Offense                             <- as.factor(tolower(res$Offense))

Let’s see a few of the crimes described in Offense.

head(table(res$Offense), 20)
## 
##    [inc - case dc use only]            animal complaint 
##                           5                          96 
##                     assault               bias incident 
##                        2018                          20 
##                  bike theft                    burglary 
##                         797                        3212 
## burglary-secure parking-res                   car prowl 
##                         388                        6230 
##                 counterfeit          disorderly conduct 
##                         159                           2 
##                     dispute                 disturbance 
##                         171                        1333 
##                         dui                     eluding 
##                          34                           8 
##                    embezzle                      escape 
##                          57                           3 
##                false report                    firework 
##                          23                           9 
##                     forgery                       fraud 
##                          59                        1473

We’ll return to this variable later.

Data Analysis

Question 1. How many crimes were committed each day?

We do a litle feature engineering to answer this question, and we approach it in two ways. Namely, we look at crimes committed per true date and per weekday.

First, we create a continuous time variable called Date.Occurred which is a Date object in R. Then, we use this new variable to create an aggregated dataframe, evPerDay, using the plyr package.

res$Date.Occurred <- as.Date(res$Occurred.Date.or.Date.Range.Start)
evPerDay <- ddply(res, .(Date.Occurred), summarise, Count=length(Offense))
head(evPerDay)
##   Date.Occurred Count
## 1    2014-06-01   397
## 2    2014-06-02   419
## 3    2014-06-03   357
## 4    2014-06-04   392
## 5    2014-06-05   386
## 6    2014-06-06   401

In the second way, we label each date with its appropriate weekday using the lubridate library. We store this information in a new variable called Weekday.

evPerDay$Weekday <- wday(evPerDay$Date.Occurred, label=TRUE)
wdayAgg <- aggregate(Count~Weekday, data=evPerDay, FUN=mean)
print(wdayAgg)
##   Weekday    Count
## 1     Sun 336.7857
## 2     Mon 352.8462
## 3    Tues 354.7692
## 4     Wed 358.6154
## 5   Thurs 353.5385
## 6     Fri 381.5385
## 7     Sat 357.4615

Question 2: What hour of the day did the most crimes occur?

To answer this question, we we extract Time.Occurred using format() and use the chron package to extract hours from Occurred.Date.or.Date.Range.Start.

res$Time.Occurred <- format(res$Occurred.Date.or.Date.Range.Start, format="%H:%M:%S")
class(res$Time.Occurred)
## [1] "character"
library(chron)
## 
## Attaching package: 'chron'
## The following objects are masked from 'package:lubridate':
## 
##     days, hours, minutes, seconds, years
res$Time.Occurred <- times(res$Time.Occurred)
res$Hour.Occurred <- hours(res$Occurred.Date.or.Date.Range.Start) ## "%H:%M:%S"
table(res$Hour.Occurred)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 2073  989  827  589  475  393  550  811 1130 1155 1286 1281 2020 1589 1481 
##   15   16   17   18   19   20   21   22   23 
## 1697 1669 1844 1838 1738 1866 1942 1931 1605
class(res$Hour.Occurred)
## [1] "numeric"

We also create a new variable that is the POSIXct version of Hour.Occurred called Date.Hour.Occurred.

res$Date.Hour.Occurred <- strptime(res$Occurred.Date.or.Date.Range.Start, "%Y-%m-%d %H")
res$Date.Hour.Occurred <- as.POSIXct(res$Date.Hour.Occurred)

We also approach this question from a few ways using the dplyr package. In the first appraoch, we reduce all timestamps to hour and date (with no minutes), count the number of crimes committed, and name the dataframe, evPerHrDate. We calculate the standard deviation of crimes committed per hour and store it in tmp. Before moving on, we extract just the Hour using the stringr package to use later for plotting.
In a second analysis, we create two dataframes: evPerHr by aggregating the total number of crimes committed each hour, and avgPerHr, the average number of crimes committed per hour over the summer. To this dataframe, we bind the standard deviation of crimes committed per hour from tmp as the variable sd.
For the third dataframe, evPerTs, we use the full timestamp (date, hour, and minute) and counts all crimes that occurred during that single timestamp.

evPerHrDate <- ddply(res, .(Date.Hour.Occurred), summarise, Count=length(Offense))
print(head(evPerHrDate, 10))
##     Date.Hour.Occurred Count
## 1  2014-06-01 00:00:00    51
## 2  2014-06-01 01:00:00    17
## 3  2014-06-01 02:00:00    16
## 4  2014-06-01 03:00:00    17
## 5  2014-06-01 04:00:00     4
## 6  2014-06-01 05:00:00    10
## 7  2014-06-01 06:00:00     5
## 8  2014-06-01 07:00:00     6
## 9  2014-06-01 08:00:00    15
## 10 2014-06-01 09:00:00     8
## Create a new variable for Hour 
evPerHrDate$HourFull <- format(evPerHrDate$Date.Hour.Occurred, format="%H:%M:%S")
print(head(evPerHrDate, 10))
##     Date.Hour.Occurred Count HourFull
## 1  2014-06-01 00:00:00    51 00:00:00
## 2  2014-06-01 01:00:00    17 01:00:00
## 3  2014-06-01 02:00:00    16 02:00:00
## 4  2014-06-01 03:00:00    17 03:00:00
## 5  2014-06-01 04:00:00     4 04:00:00
## 6  2014-06-01 05:00:00    10 05:00:00
## 7  2014-06-01 06:00:00     5 06:00:00
## 8  2014-06-01 07:00:00     6 07:00:00
## 9  2014-06-01 08:00:00    15 08:00:00
## 10 2014-06-01 09:00:00     8 09:00:00
tmp <- tapply(evPerHrDate$Count, evPerHrDate$Hour, sd)

## Use a regular expression to extract only the hour 
library(stringr)
reg.ex <- "[0-9]{2}"
evPerHrDate$Hour <- str_extract(evPerHrDate$HourFull, reg.ex)
evPerHrDate$Hour <- as.integer(evPerHrDate$Hour)
str(evPerHrDate)
## 'data.frame':    2173 obs. of  4 variables:
##  $ Date.Hour.Occurred: POSIXct, format: "2014-06-01 00:00:00" "2014-06-01 01:00:00" ...
##  $ Count             : int  51 17 16 17 4 10 5 6 15 8 ...
##  $ HourFull          : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
##  $ Hour              : int  0 1 2 3 4 5 6 7 8 9 ...
evPerHr <- aggregate(Count ~ Hour, evPerHrDate, FUN=sum)
print(head(evPerHr))
##   Hour Count
## 1    0  2073
## 2    1   989
## 3    2   827
## 4    3   589
## 5    4   475
## 6    5   393
avgPerHr <- aggregate(Count ~ Hour, evPerHrDate, FUN=mean)
print(head(avgPerHr))
##   Hour     Count
## 1    0 22.532609
## 2    1 10.868132
## 3    2  9.292135
## 4    3  6.770115
## 5    4  5.654762
## 6    5  4.974684
avgPerHr$sd <- tmp

evPerTs <- ddply(res, .(Occurred.Date.or.Date.Range.Start), summarise, Count=length(Offense))
head(evPerTs)
##   Occurred.Date.or.Date.Range.Start Count
## 1               2014-06-01 00:00:00    22
## 2               2014-06-01 00:01:00    13
## 3               2014-06-01 00:13:00     1
## 4               2014-06-01 00:20:00     2
## 5               2014-06-01 00:22:00     2
## 6               2014-06-01 00:26:00     1

Question 3. What were the most common crimes?

We address this question in a few ways. First, we use the dplyr package to count the total number of occurrences by Offense and arrange it in descending order. Then, we store the top 10 crimes in a new dataframe.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
arrange <- dplyr::arrange 
count <- dplyr::count 
crimes_tbl <- arrange(count(res, Offense), desc(n))

colnames(crimes_tbl) <- c("Offense", "Count")
head(crimes_tbl)
## # A tibble: 6 × 2
##           Offense Count
##            <fctr> <int>
## 1       car prowl  6230
## 2  other property  3755
## 3        burglary  3212
## 4   vehicle theft  3057
## 5 property damage  2365
## 6         assault  2018
top10crimes <- head(crimes_tbl, 10)

Now, let’s look at this question in the context of the previous questions. Which are the most common crimes each day, and for which hour of the day? Let’s begin with crime type per day. We create a new dataframe, eventTypePerDay, using the plyr package. Then, we subset it to include only the top 10 crimes in Seattle.

eventTypePerDay <- ddply(res, Date.Occurred~Offense, summarise, Count=length(Offense))
head(eventTypePerDay)
##   Date.Occurred                  Offense Count
## 1    2014-06-01 [inc - case dc use only]     1
## 2    2014-06-01         animal complaint     4
## 3    2014-06-01                  assault    24
## 4    2014-06-01            bias incident     2
## 5    2014-06-01               bike theft     4
## 6    2014-06-01                 burglary    54
top10EventTypePerDay <- subset(eventTypePerDay, (eventTypePerDay$Offense %in% top10crimes$Offense)==TRUE)
head(top10EventTypePerDay)
##    Date.Occurred        Offense Count
## 3     2014-06-01        assault    24
## 6     2014-06-01       burglary    54
## 8     2014-06-01      car prowl    86
## 11    2014-06-01    disturbance    14
## 12    2014-06-01          fraud     9
## 17    2014-06-01 other property    31

Now, we look at how the top 10 crimes differ across each hour with eventTypePerHr.

eventTypePerHr <- ddply(res, Hour.Occurred~Offense, summarise, Count=length(Offense))
head(eventTypePerHr)
##   Hour.Occurred                  Offense Count
## 1             0 [inc - case dc use only]     1
## 2             0         animal complaint     9
## 3             0                  assault    95
## 4             0            bias incident     2
## 5             0               bike theft    44
## 6             0                 burglary   211
top10EventTypePerHr <- subset(eventTypePerHr, (eventTypePerHr$Offense %in% top10crimes$Offense)==TRUE)
head(top10EventTypePerHr)
##    Hour.Occurred        Offense Count
## 3              0        assault    95
## 6              0       burglary   211
## 8              0      car prowl   394
## 11             0    disturbance    46
## 16             0          fraud   212
## 23             0 other property   289

Question 4. Which crimes went unreported the longest?

Not every crime that occurs is detected immediately. Even more concerning is the notion that some people wait hours or even days before reporting a crime to the police. Let’s find out which types of crimes in our data went unreported the longest.
We begin with a bit more feature engineering using our res dataframe. We create two new variables: Time.Btw.Report represents the time lapsed between Occurred.Date.or.Date.Range.Start and Date.Reported, and Time.Range represents the length of a crime event. We also check for negative values (potential errors) in these new variables.

res$Time.Btw.Report <- difftime(res$Date.Reported, res$Occurred.Date.or.Date.Range.Start, units = "days")
which(res$Time.Btw.Report<0)
## integer(0)
res$Time.Range <- difftime(res$Occurred.Date.Range.End, res$Occurred.Date.or.Date.Range.Start, units = "days")
which(res$Time.Range<0)
##  [1]   463   464  7261  7262  8113  8114  8115  8116 12567 12568 19189
## [12] 19190 20132 20133 28443 28444 30415 30416 31192 31193

There are some negative values for Time.Range. Let’s find out why.

negative <- which(res$Time.Range<0)
head(res[c(negative), c(8:10, 24:25)])
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 463  2014-09-09 16:13:00               2014-07-27 12:00:00
## 464  2014-09-09 16:13:00               2014-07-27 12:00:00
## 7261 2014-08-13 05:41:00               2014-08-13 05:41:00
## 7262 2014-08-13 05:41:00               2014-08-13 05:41:00
## 8113 2014-08-10 14:32:00               2014-08-10 14:32:00
## 8114 2014-08-10 14:32:00               2014-08-10 14:32:00
##      Occurred.Date.Range.End Time.Btw.Report         Time.Range
## 463      2014-07-27 10:00:00   44.17569 days -0.0833333333 days
## 464      2014-07-27 10:00:00   44.17569 days -0.0833333333 days
## 7261     2014-08-13 05:40:00    0.00000 days -0.0006944444 days
## 7262     2014-08-13 05:40:00    0.00000 days -0.0006944444 days
## 8113     2014-08-10 14:30:00    0.00000 days -0.0013888889 days
## 8114     2014-08-10 14:30:00    0.00000 days -0.0013888889 days

It appears that the end date was reported as occurring before the start date for a few crimes. However, the start date for these was the same as the time reported. This must be a data entry error, and because there are so few, we will ignore these errors for the next calculation.
Now, we create another variable called Time.Went.Unreported. Here, we use the ifelse function to check if a row has an end range date. If so, then the time unreported is the difference between the Time.Btw.Report and Time.Range variables. If there is no end range date (i.e. the crime was instantaneous), then we simply use the Time.Btw.Report variable.

res$Time.Went.Unreported <- with(res, ifelse(is.na(res$Occurred.Date.Range.End)==F, res$Time.Btw.Report-res$Time.Range, res$Time.Btw.Report))
which(res$Time.Went.Unreported<0)
##  [1]  1323  1324  3669  3670  7461  9677  9678 11686 18792 18793 19325
## [12] 19327 21331 23066 23067 23326 23327 23328 24182 24183 26901 30957

We still see a few values less than 0. Let’s take a closer look at them.

negative <- which(res$Time.Went.Unreported<0)
head(res[negative, c(8:10, 24:26)])
##            Date.Reported Occurred.Date.or.Date.Range.Start
## 1323 2014-08-30 14:00:00               2014-08-29 13:00:00
## 1324 2014-08-30 14:00:00               2014-08-29 13:00:00
## 3669 2014-08-23 16:37:00               2014-08-23 16:37:00
## 3670 2014-08-23 16:37:00               2014-08-23 16:37:00
## 7461 2014-08-12 13:38:00               2014-08-12 12:30:00
## 9677 2014-08-05 20:52:00               2014-07-02 13:00:00
##      Occurred.Date.Range.End  Time.Btw.Report       Time.Range
## 1323     2014-08-30 15:09:00  1.04166667 days  1.08958333 days
## 1324     2014-08-30 15:09:00  1.04166667 days  1.08958333 days
## 3669     2014-08-23 19:20:00  0.00000000 days  0.11319444 days
## 3670     2014-08-23 19:20:00  0.00000000 days  0.11319444 days
## 7461     2014-08-12 13:45:00  0.04722222 days  0.05208333 days
## 9677     2014-08-05 21:00:00 34.32777778 days 34.33333333 days
##      Time.Went.Unreported
## 1323         -0.047916667
## 1324         -0.047916667
## 3669         -0.113194444
## 3670         -0.113194444
## 7461         -0.004861111
## 9677         -0.005555556

Here, we see that a negative Time.Went.Unreported value means that the crime was ongoing when it was reported.
Next, we aggregate the average time crimes go unreported by offense type. Then, using the dplyr package, we arrange it in descending order. Then, we take the top 10 results and store them in a new dataframe, top10Times. The units are in days.

timeNoReport <- arrange(aggregate(Time.Went.Unreported ~ Offense, res, FUN=mean, na.rm=TRUE), desc(Time.Went.Unreported))
top10Times <- head(timeNoReport, 10)
print(top10Times)
##           Offense Time.Went.Unreported
## 1         forgery            12.560817
## 2      mail theft             9.433820
## 3           fraud             8.503972
## 4   lost property             7.865201
## 5    purse snatch             4.509105
## 6        embezzle             4.351206
## 7  other property             3.951514
## 8     counterfeit             3.915754
## 9      pickpocket             3.616952
## 10     bike theft             1.931416

Results

Answer to Question 1, “How many crimes were committed each day?”

Let’s look at evPerDay.

head(evPerDay, 10)
##    Date.Occurred Count Weekday
## 1     2014-06-01   397     Sun
## 2     2014-06-02   419     Mon
## 3     2014-06-03   357    Tues
## 4     2014-06-04   392     Wed
## 5     2014-06-05   386   Thurs
## 6     2014-06-06   401     Fri
## 7     2014-06-07   389     Sat
## 8     2014-06-08   353     Sun
## 9     2014-06-09   358     Mon
## 10    2014-06-10   398    Tues
summary(evPerDay$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   255.0   328.0   353.0   356.3   388.2   447.0

For the time frame of summer 2014 in Seattle, there was a maximum of 447 and a minimum of 255 crimes committed per day, with a mean of about 356 crimes.

We plot this data and all future data using the ggplot2 package.

library(ggplot2)
ggplot(data=evPerDay, aes(x=Date.Occurred, y=Count)) + geom_point() + geom_smooth(method='loess', col='red', alpha=0.2, size=1) + ggtitle("Seattle Total Crimes Committed Each Day
Summer 2014") + theme_bw()

We can see from the regression line that total crimes per day decreased as the summer continued. It appears that total crimes per day peaked at the end of June, and our lowest point, which seems to occur on July 4th (Independence Day in the U.S.), shortly followed it. Let’s check.

maximum <- max(evPerDay$Count)
evPerDay[which(evPerDay[,2]==maximum),]
##    Date.Occurred Count Weekday
## 25    2014-06-25   447     Wed
minimum <- min(evPerDay$Count)
evPerDay[which(evPerDay[,2]==minimum),]
##    Date.Occurred Count Weekday
## 33    2014-07-03   255   Thurs

Sure enough, crimes per day in Seattle peaked on June 25 and bottomed on July 3, 2014 in our time frame.

Now let’s look at our second model using weekday. Recall our aggregate data frame, wdayAgg.

print(wdayAgg)
##   Weekday    Count
## 1     Sun 336.7857
## 2     Mon 352.8462
## 3    Tues 354.7692
## 4     Wed 358.6154
## 5   Thurs 353.5385
## 6     Fri 381.5385
## 7     Sat 357.4615
summary(wdayAgg$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   336.8   353.2   354.8   356.5   358.0   381.5

On a weekday basis, an average minimum of about 337 crimes occurred on Sundays, and an average maximum of 382 crimes occurred on Fridays. The mean number of crimes per weekday was about 357, less than one crime higher than the mean reported in our previous model.

Now, let’s look at a visual representation of this weekday-centric data.

ggplot(data=evPerDay, aes(x=Weekday, y=Count)) + geom_boxplot() + ggtitle("Seattle Total Crimes Committed per Weekday
Summer 2014")

If you look closely, you can see that the maximum number of crimes per weekday occurs on a Wednesday, and the minimum occurrs on a Thursday. If you recall our previous model, you will remember that our global max, June 25, 2014 was a Wednesday, and our global min, July 3, 2014, was on a Thursday. Our data are consistent.

Answer to Question 2, “What hour of the day did the most crimes occur?”

ggplot(data=evPerHrDate, aes(x=Hour, y=Count)) + geom_point(alpha=0.25) + geom_smooth(method="loess", col="green", size=1.2, alpha=.8) + ggtitle("Seattle Total Crimes per Hour, Summer 2014") + theme_bw()