Libraries Used

library(dplyr)
library(stringr)
library(data.table)
library(utils)

Dataset

The dataset consists of approximately 2.4 million records of speed camera citations issued in the City of Baltimore from 2013 to the present: https://data.baltimorecity.gov/Transportation/2013-Speed-Camera-Citations/pzza-s46z.

The following code downloads a copy of the data from the Open Baltimore website, stores it as a CSV file, and then loads it into a tibble for use with the dplyr package.

# Full CSV export of the dataset from Open Baltimore
fileURL = "https://data.baltimorecity.gov/api/views/pzza-s46z/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
# Save a local copy of the file
download.file(url = fileURL, destfile = "datasets/citations.csv", method = "curl")
# Read it with data.table::fread (fast for large files) and convert to a tibble for dplyr
citations = as_tibble(fread("datasets/citations.csv"))
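
Note that download.file() fails if the datasets/ folder does not already exist. A small guard like the one below (an optional convenience, not part of the original code) creates the folder and skips the download when a cached copy is present:

# Create the target folder if needed and avoid re-downloading the large file on every run
if (!dir.exists("datasets")) dir.create("datasets")
if (!file.exists("datasets/citations.csv")) {
  download.file(url = fileURL, destfile = "datasets/citations.csv", method = "curl")
}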

Exploring the Data

No documentation is provided for the dataset, so we will have to perform some exploratory analysis to see what we are dealing with. We begin by examining the head of the dataset and running some summary statistics:

head(citations)
## # A tibble: 6 x 17
##   Tag   Citation ExpMM ExpYY Make  State Address ViolCode Description
##   <chr>    <int> <chr> <int> <chr> <chr> <chr>      <int> <chr>      
## 1 6994… 82691651 04       19 MAZD  MD    1300 B…       32 Fixed Spee…
## 2 T720… 85937226 12       18 ACUR  MD    300 BL…       32 Fixed Spee…
## 3 5BA2… 82691115 03       19 TOYT  MD    5000 B…       32 Fixed Spee…
## 4 5BT8… 82694689 01       18 FORD  MD    2400 B…       32 Fixed Spee…
## 5 1BV0… 82691214 11       18 SUBA  MD    1300 B…       32 Fixed Spee…
## 6 5CZ9… 82688467 08       19 NISS  MD    3400 B…       32 Fixed Spee…
## # ... with 8 more variables: ViolFine <dbl>, ViolDate <chr>,
## #   Balance <chr>, PenaltyDate <lgl>, OpenFine <dbl>, OpenPenalty <chr>,
## #   NoticeDate <chr>, ImportDate <chr>
summary(citations)
##      Tag               Citation           ExpMM               ExpYY     
##  Length:2438544     Min.   :   13425   Length:2438544     Min.   : 0.0  
##  Class :character   1st Qu.:61960544   Class :character   1st Qu.:16.0  
##  Mode  :character   Median :82329306   Mode  :character   Median :18.0  
##                     Mean   :74881715                      Mean   :17.1  
##                     3rd Qu.:93757147                      3rd Qu.:19.0  
##                     Max.   :99999998                      Max.   :99.0  
##      Make              State             Address             ViolCode    
##  Length:2438544     Length:2438544     Length:2438544     Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.:18.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :22.00  
##                                                           Mean   :24.71  
##                                                           3rd Qu.:32.00  
##                                                           Max.   :99.00  
##  Description           ViolFine        ViolDate           Balance         
##  Length:2438544     Min.   : 17.00   Length:2438544     Length:2438544    
##  Class :character   1st Qu.: 32.00   Class :character   Class :character  
##  Mode  :character   Median : 40.00   Mode  :character   Mode  :character  
##                     Mean   : 50.04                                        
##                     3rd Qu.: 52.00                                        
##                     Max.   :502.00                                        
##  PenaltyDate       OpenFine      OpenPenalty         NoticeDate       
##  Mode:logical   Min.   :  0.00   Length:2438544     Length:2438544    
##  NA's:2438544   1st Qu.:  0.00   Class :character   Class :character  
##                 Median :  0.00   Mode  :character   Mode  :character  
##                 Mean   : 20.84                                        
##                 3rd Qu.: 32.00                                        
##                 Max.   :502.00                                        
##   ImportDate       
##  Length:2438544    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Looking at the summary, we find that two numeric variables, Balance and OpenPenalty, were coerced to character. This is because values in the thousands for these fields are formatted with thousands-separator commas. We will use the dplyr mutate function, along with gsub to strip the commas, to convert these fields to numeric:

citations = citations %>% mutate(
  Balance = as.numeric(gsub(",","",Balance)),
  OpenPenalty = as.numeric(gsub(",","",OpenPenalty))
)
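
As a quick sanity check (an addition to the original write-up), we can confirm that the conversion did not silently introduce NAs where a value failed to parse:

# Count NAs introduced by the character-to-numeric conversion; both should be 0
citations %>% summarise(
  BalanceNAs = sum(is.na(Balance)),
  OpenPenaltyNAs = sum(is.na(OpenPenalty))
)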

Each observation appears to be uniquely identified by the citation number. We can verify this by running dplyr’s n_distinct function on the Citation field. The results show 2,438,544 distinct values, the same as the number of observations in the dataset.

citations %>% summarise(DistinctCitations = n_distinct(Citation))
## # A tibble: 1 x 1
##   DistinctCitations
##               <int>
## 1           2438544
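
An equivalent base R check is anyDuplicated(), which returns 0 when no value repeats (an aside, not part of the original analysis):

# 0 means every Citation number is unique; otherwise the index of the first duplicate is returned
anyDuplicated(citations$Citation)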

Other fields include the Tag (if available) of the car receiving a citation, the expiration month and year of its registration, the Make of the vehicle (if available), the State of the vehicle's plate, the Address of the camera where the violation occurred, as well as the violation code, description, fines, and penalties. Below is a sampling of records with missing values for the Tag and Make fields.

citations %>% filter(is.na(Tag))
## # A tibble: 298 x 17
##    Tag   Citation ExpMM ExpYY Make  State Address ViolCode Description
##    <chr>    <int> <chr> <int> <chr> <chr> <chr>      <int> <chr>      
##  1 <NA>  90804015 ""        0 NO PA MD    1800 M…       99 All Other …
##  2 <NA>  91266362 ""        0 HONDA MD    900 FE…       99 All Other …
##  3 <NA>  91266370 ""        0 HONDA MD    900 FE…       12 No Stoppin…
##  4 <NA>  91790460 ""        0 CADIL XX    3000 O…       99 All Other …
##  5 <NA>  92329086 ""        0 HONDA XX    1800 M…       99 All Other …
##  6 <NA>  92329094 ""        0 NISSA XX    1800 M…       99 All Other …
##  7 <NA>  92329078 ""        0 HONDA XX    1800 M…        5 Obstruct/I…
##  8 <NA>  92343020 ""        0 VOLKS MD    3300 O…       99 All Other …
##  9 <NA>  92460337 ""        0 MERCE MD    1800 M…       99 All Other …
## 10 <NA>  22718423 02       15 PONTI MD    5900 B…       99 All Other …
## # ... with 288 more rows, and 8 more variables: ViolFine <dbl>,
## #   ViolDate <chr>, Balance <dbl>, PenaltyDate <lgl>, OpenFine <dbl>,
## #   OpenPenalty <dbl>, NoticeDate <chr>, ImportDate <chr>
citations %>% filter(is.na(Make))
## # A tibble: 15 x 17
##    Tag   Citation ExpMM ExpYY Make  State Address ViolCode Description
##    <chr>    <int> <chr> <int> <chr> <chr> <chr>      <int> <chr>      
##  1 0579… 91435777 07       15 <NA>  MD    3100 B…       10 Commercial…
##  2 4311… 24081283 07       15 <NA>  MD    200 BA…       18 All Other …
##  3 Y481… 60747012 ""        0 <NA>  TN    1500 N…       99 All Other …
##  4 1090… 93686344 05       17 <NA>  MD    3300 P…       99 All Other …
##  5 RA41… 61284650 05       18 <NA>  VA    U/B CA…        8 No Stoppin…
##  6 0109… 99912347 07       13 <NA>  MD    700 WH…        5 Obstruct/I…
##  7 5AB3… 96633715 10       10 <NA>  MD    3600 C…       43 Obstructin…
##  8 4420… 65581325 ""        0 <NA>  MD    1700 L…        3 Obstruct/I…
##  9 2BW5… 95315405 02       17 <NA>  MD    U/B CR…       18 All Other …
## 10 CIQB… 96254793 08       16 <NA>  FL    400 PA…       12 No Stoppin…
## 11 301E… 96728622 04       17 <NA>  MD    U/B BA…        8 No Stoppin…
## 12 UH 5… 96770822 ""        0 <NA>  OR    1300 N…       99 All Other …
## 13 YYT9… 98587307 09       17 <NA>  PA    3800 F…       12 No Stoppin…
## 14 6CX0… 63359898 05       17 <NA>  MD    2500 T…       22 Expired Ta…
## 15 KNS2… 63415161 05       18 <NA>  MD    1600 T…       12 No Stoppin…
## # ... with 8 more variables: ViolFine <dbl>, ViolDate <chr>,
## #   Balance <dbl>, PenaltyDate <lgl>, OpenFine <dbl>, OpenPenalty <dbl>,
## #   NoticeDate <chr>, ImportDate <chr>

We are interested in seeing what values are available for the State field. We can do this by grouping records by State and counting the number of records in each group. The results show 102 distinct values, spanning US states and DC (e.g. MD, VA, DC), territories (e.g. AS, PR), Canadian provinces (e.g. OT, BC, NB), and country codes.

citations %>%
  select(State) %>%
  group_by(State) %>%
  tally(sort = T)
## # A tibble: 102 x 2
##    State       n
##    <chr>   <int>
##  1 MD    2121766
##  2 VA      59755
##  3 PA      55845
##  4 NJ      22514
##  5 NY      19394
##  6 FL      14605
##  7 NC      14170
##  8 TN      13687
##  9 DC      12827
## 10 DE      11805
## # ... with 92 more rows
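
As an aside, dplyr’s count() collapses the group_by()/tally() pair into a single call; the sketch below should produce the same table:

# count() is shorthand for group_by() + tally(); sort = TRUE orders by frequency
citations %>% count(State, sort = TRUE)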

Removing Bad Tags

One question we would like to ask is which vehicle/Tag has received the most citations in the dataset. We can answer this question with dplyr by grouping records by Tag and tallying the observations:

citations %>%
  select(Tag) %>% 
  group_by(Tag) %>%
  tally(sort = T)
## # A tibble: 1,016,851 x 2
##    Tag          n
##    <chr>    <int>
##  1 NOTAGS T   714
##  2 NO TAGST   599
##  3 <NA>       298
##  4 5BF6174    202
##  5 2216618    192
##  6 2AK8974    177
##  7 8767631    158
##  8 3DGREES    148
##  9 3BV0242    147
## 10 3CM9052    146
## # ... with 1,016,841 more rows

We can see that the top three listings are for tags that were blank (i.e. NA) or entered as a variant of ‘NO TAGS’. There are likely to be similar entries indicating that no tag was available, so we would like to obtain a listing of these tags for removal.

First, we get a listing of the Tags with missing values. As previously observed, there are 298 citations with a blank/missing value.

MISSING_TAGS = citations %>%
  select(Tag, Make) %>%
  filter(is.na(Tag)) %>%
  group_by(Tag) %>%
  tally(sort = T) 
MISSING_TAGS
## # A tibble: 1 x 2
##   Tag       n
##   <chr> <int>
## 1 <NA>    298

Next, we would like to look for tags that contain the string ‘TAG’. We can do this by using the str_detect function from the stringr package to determine whether ‘TAG’ is present, and then grouping by Tag and tallying the results. However, we can see that some legitimate tags (e.g., WTTAGW1, V1NTAGE, LITAG8R) are mixed into the results, so we need a way to make sure that the list of tags we flag contains only bad tags. One general assumption might be that a legitimate Tag is tied to a single distinct vehicle, and therefore a single Make. We can therefore group the Tags with ‘TAG’ present, count the number of distinct Makes tied to each Tag, and keep only those with more than one distinct Make.

citations %>%
  select(Tag, Make) %>% 
  filter(str_detect(Tag, "TAG")) %>%   
  group_by(Tag) %>%
  tally() %>%
  arrange(desc(Tag))
## # A tibble: 68 x 2
##    Tag          n
##    <chr>    <int>
##  1 XNO TAG      5
##  2 X NO TAG     5
##  3 WTTAGW1      1
##  4 V1NTAGE      1
##  5 TEMP TAG     1
##  6 TAGNO        1
##  7 TAGDAT       3
##  8 TAG1         2
##  9 TAG NO       1
## 10 TAG DAT      2
## # ... with 58 more rows
# Note: DistinctTags here counts the number of distinct Makes associated with each Tag
TAG_MULTIPLE_MAKES = citations %>%
  select(Tag, Make) %>% 
  filter(str_detect(Tag, "TAG")) %>%   
  group_by(Tag) %>%
  summarise(DistinctTags = n_distinct(Make)) %>%
  filter(DistinctTags > 1) %>%
  arrange(desc(Tag))

TAG_MULTIPLE_MAKES
## # A tibble: 27 x 2
##    Tag      DistinctTags
##    <chr>           <int>
##  1 XNO TAG             5
##  2 X NO TAG            4
##  3 TAGDAT              2
##  4 NT TAG              2
##  5 NOTTAGS             2
##  6 NOTAGSDI           29
##  7 NOTAGS T           39
##  8 NOTAGGS             3
##  9 NOTAGG              2
## 10 NOTAGA              2
## # ... with 17 more rows

We would still like to catch bad tags that are tied to only one distinct Make, so we need a way to identify a bad tag other than the presence of ‘TAG’. Since the bad tags appear to be variants of the string ‘NO TAGS’, we can use the adist function from the utils package to compute the string/edit distance between each Tag and ‘NO TAGS’. This measure counts the number of insertions, deletions, and substitutions needed to transform ‘NO TAGS’ into the Tag being examined. We will filter on distances less than or equal to 3 to find tags that are very similar to ‘NO TAGS’.
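
As a quick standalone illustration of how adist behaves (the first two strings are hypothetical near-variants; 5BF6174 is a real plate from the tally above), near-misses of ‘NO TAGS’ score small distances while an ordinary plate scores a large one:

# adist() computes the generalized Levenshtein distance.
# "N0TAGS" needs one substitution (0 -> O) and one insertion (the space): distance 2.
# "NO TA" needs two insertions (G and S): distance 2.
# "5BF6174" shares no characters with "NO TAGS", so its distance is large.
adist(c("N0TAGS", "NO TA", "5BF6174"), "NO TAGS")

Applying the same idea across the full dataset, we keep candidates within distance 3 that are tied to a single Make: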

NO_TAGS_DIFF_BY_DIST = citations %>%
  select(Tag, Make) %>% 
  # adist() returns a matrix of edit distances; with a single target string it has
  # one column holding the distance from "NO TAGS" for each Tag
  mutate(dist = adist(Tag, "NO TAGS")) %>% 
  filter(dist <= 3) %>% 
  group_by(Tag) %>%
  summarise(DistinctTags = n_distinct(Make)) %>%
  filter(DistinctTags == 1)

NO_TAGS_DIFF_BY_DIST
## # A tibble: 48 x 2
##    Tag      DistinctTags
##    <chr>           <int>
##  1 B NOTAGS            1
##  2 NC TNNS             1
##  3 NE PATS             1
##  4 NETTAS              1
##  5 NIOTAGS             1
##  6 NJOTAGS             1
##  7 NO AGS              1
##  8 NO STAGS            1
##  9 NO T AGS            1
## 10 NO TA               1
## # ... with 38 more rows

We can then bind these 3 sets of bad tags together to obtain a single listing of bad tag names. The full listing contains 76 identified bad tags.

BAD_TAGS = bind_rows(MISSING_TAGS, TAG_MULTIPLE_MAKES, NO_TAGS_DIFF_BY_DIST) %>%
  arrange(desc(DistinctTags))

BAD_TAGS
## # A tibble: 76 x 3
##    Tag          n DistinctTags
##    <chr>    <int>        <int>
##  1 NOTAGS T    NA           39
##  2 NO TAGST    NA           38
##  3 NOTAGSDI    NA           29
##  4 NOTAG  T    NA           21
##  5 N0TAGS      NA           20
##  6 NO TAG T    NA           12
##  7 NO  TAG     NA            9
##  8 NO  TAGS    NA            7
##  9 XNO TAG     NA            5
## 10 NOT TAGS    NA            5
## # ... with 66 more rows

This listing can then be used as a filtering criterion on the citations data frame to remove bad tags. It is by no means an exhaustive list of bad tags, but it handles the majority of cases.

citations = citations %>% filter(!Tag %in% BAD_TAGS$Tag)
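
An equivalent way to express this filter is dplyr’s anti_join, which drops rows whose Tag appears in BAD_TAGS; since dplyr joins treat NA keys as matching by default, the NA tags are removed as well. This is sketched below as an alternative, not the approach used above:

# Filtering join: keep only citations whose Tag is absent from BAD_TAGS
citations %>% anti_join(BAD_TAGS, by = "Tag")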

We can then re-run the original query to find the Tag with the most citations.

citations %>%
  select(Tag) %>% 
  group_by(Tag) %>%
  tally(sort = T)
## # A tibble: 1,016,775 x 2
##    Tag         n
##    <chr>   <int>
##  1 5BF6174   202
##  2 2216618   192
##  3 2AK8974   177
##  4 8767631   158
##  5 3DGREES   148
##  6 3BV0242   147
##  7 3CM9052   146
##  8 1CG7343   139
##  9 JRA4646   122
## 10 LVF001    122
## # ... with 1,016,765 more rows
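
If only the single most-cited tag is wanted, the sorted count can be trimmed directly; the sketch below assumes dplyr >= 1.0 for slice_head():

# Return just the top row of the sorted tally
citations %>%
  count(Tag, sort = TRUE) %>%
  slice_head(n = 1)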