library(dplyr)
library(stringr)
library(data.table)
library(utils)
Consists of approximately 2.4 million records for speed camera citations in the City of Baltimore from 2013 to the present https://data.baltimorecity.gov/Transportation/2013-Speed-Camera-Citations/pzza-s46z.
The following code downloads a copy of the data from the Open Baltimore website and stores as a CSV file, then loads it into a tibble for use by the dplyr package.
fileURL = "https://data.baltimorecity.gov/api/views/pzza-s46z/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
download.file(url = fileURL, destfile = "datasets/citations.csv", method = "curl")
citations = as_tibble(fread("datasets/citations.csv"))
No documentation is provided for the dataset so we will have to perform some exploratory analysis to see what we are dealing with. First we begin by examining the head of the dataset and running some summary statistics:
head(citations)
## # A tibble: 6 x 17
## Tag Citation ExpMM ExpYY Make State Address ViolCode Description
## <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <chr>
## 1 6994… 82691651 04 19 MAZD MD 1300 B… 32 Fixed Spee…
## 2 T720… 85937226 12 18 ACUR MD 300 BL… 32 Fixed Spee…
## 3 5BA2… 82691115 03 19 TOYT MD 5000 B… 32 Fixed Spee…
## 4 5BT8… 82694689 01 18 FORD MD 2400 B… 32 Fixed Spee…
## 5 1BV0… 82691214 11 18 SUBA MD 1300 B… 32 Fixed Spee…
## 6 5CZ9… 82688467 08 19 NISS MD 3400 B… 32 Fixed Spee…
## # ... with 8 more variables: ViolFine <dbl>, ViolDate <chr>,
## # Balance <chr>, PenaltyDate <lgl>, OpenFine <dbl>, OpenPenalty <chr>,
## # NoticeDate <chr>, ImportDate <chr>
summary(citations)
## Tag Citation ExpMM ExpYY
## Length:2438544 Min. : 13425 Length:2438544 Min. : 0.0
## Class :character 1st Qu.:61960544 Class :character 1st Qu.:16.0
## Mode :character Median :82329306 Mode :character Median :18.0
## Mean :74881715 Mean :17.1
## 3rd Qu.:93757147 3rd Qu.:19.0
## Max. :99999998 Max. :99.0
## Make State Address ViolCode
## Length:2438544 Length:2438544 Length:2438544 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:18.00
## Mode :character Mode :character Mode :character Median :22.00
## Mean :24.71
## 3rd Qu.:32.00
## Max. :99.00
## Description ViolFine ViolDate Balance
## Length:2438544 Min. : 17.00 Length:2438544 Length:2438544
## Class :character 1st Qu.: 32.00 Class :character Class :character
## Mode :character Median : 40.00 Mode :character Mode :character
## Mean : 50.04
## 3rd Qu.: 52.00
## Max. :502.00
## PenaltyDate OpenFine OpenPenalty NoticeDate
## Mode:logical Min. : 0.00 Length:2438544 Length:2438544
## NA's:2438544 1st Qu.: 0.00 Class :character Class :character
## Median : 0.00 Mode :character Mode :character
## Mean : 20.84
## 3rd Qu.: 32.00
## Max. :502.00
## ImportDate
## Length:2438544
## Class :character
## Mode :character
##
##
##
Looking at the summary, we find that two numeric variables, Balance and OpenPenalty, were coerced to character variables. This is because the values for these fields in the thousands are formatted with commas. We will use the dplyr mutate function to convert these fields to numeric:
citations = citations %>% mutate(
Balance = as.numeric(gsub(",","",Balance)),
OpenPenalty = as.numeric(gsub(",","",OpenPenalty))
)
Each observation appears to be uniquely identified by the citation number. We can verify this by running dplyr’s n_distinct function on the Citation field. The results show that there are 2,438,544 distinct values - the same number of observations in the dataset.
citations %>% summarise(DistinctCitations = n_distinct(Citation))
## # A tibble: 1 x 1
## DistinctCitations
## <int>
## 1 2438544
Other fields include the Tag (if available) of the car receiving a citation, expiration month and year of the registration, Make of the vehicle (if available), State of the plate on the vehicle, Address of the camera where the violation occurred, as well as Violation code, description, fines and penalties. Below are a sampling of records with missing values for Tag and Make fields.
citations %>% filter(is.na(Tag))
## # A tibble: 298 x 17
## Tag Citation ExpMM ExpYY Make State Address ViolCode Description
## <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <chr>
## 1 <NA> 90804015 "" 0 NO PA MD 1800 M… 99 All Other …
## 2 <NA> 91266362 "" 0 HONDA MD 900 FE… 99 All Other …
## 3 <NA> 91266370 "" 0 HONDA MD 900 FE… 12 No Stoppin…
## 4 <NA> 91790460 "" 0 CADIL XX 3000 O… 99 All Other …
## 5 <NA> 92329086 "" 0 HONDA XX 1800 M… 99 All Other …
## 6 <NA> 92329094 "" 0 NISSA XX 1800 M… 99 All Other …
## 7 <NA> 92329078 "" 0 HONDA XX 1800 M… 5 Obstruct/I…
## 8 <NA> 92343020 "" 0 VOLKS MD 3300 O… 99 All Other …
## 9 <NA> 92460337 "" 0 MERCE MD 1800 M… 99 All Other …
## 10 <NA> 22718423 02 15 PONTI MD 5900 B… 99 All Other …
## # ... with 288 more rows, and 8 more variables: ViolFine <dbl>,
## # ViolDate <chr>, Balance <dbl>, PenaltyDate <lgl>, OpenFine <dbl>,
## # OpenPenalty <dbl>, NoticeDate <chr>, ImportDate <chr>
citations %>% filter(is.na(Make))
## # A tibble: 15 x 17
## Tag Citation ExpMM ExpYY Make State Address ViolCode Description
## <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <chr>
## 1 0579… 91435777 07 15 <NA> MD 3100 B… 10 Commercial…
## 2 4311… 24081283 07 15 <NA> MD 200 BA… 18 All Other …
## 3 Y481… 60747012 "" 0 <NA> TN 1500 N… 99 All Other …
## 4 1090… 93686344 05 17 <NA> MD 3300 P… 99 All Other …
## 5 RA41… 61284650 05 18 <NA> VA U/B CA… 8 No Stoppin…
## 6 0109… 99912347 07 13 <NA> MD 700 WH… 5 Obstruct/I…
## 7 5AB3… 96633715 10 10 <NA> MD 3600 C… 43 Obstructin…
## 8 4420… 65581325 "" 0 <NA> MD 1700 L… 3 Obstruct/I…
## 9 2BW5… 95315405 02 17 <NA> MD U/B CR… 18 All Other …
## 10 CIQB… 96254793 08 16 <NA> FL 400 PA… 12 No Stoppin…
## 11 301E… 96728622 04 17 <NA> MD U/B BA… 8 No Stoppin…
## 12 UH 5… 96770822 "" 0 <NA> OR 1300 N… 99 All Other …
## 13 YYT9… 98587307 09 17 <NA> PA 3800 F… 12 No Stoppin…
## 14 6CX0… 63359898 05 17 <NA> MD 2500 T… 22 Expired Ta…
## 15 KNS2… 63415161 05 18 <NA> MD 1600 T… 12 No Stoppin…
## # ... with 8 more variables: ViolFine <dbl>, ViolDate <chr>,
## # Balance <dbl>, PenaltyDate <lgl>, OpenFine <dbl>, OpenPenalty <dbl>,
## # NoticeDate <chr>, ImportDate <chr>
We are interested in seeing what values are available for the State field. We can do this by grouping records by State and counting the number of records in each group. The results show that there are 102 distinct states (e.g. MD, VA, DC), territories (e.g. AS, PR), Canadian provinces (e.g. OT, BC, NB), and country codes.
citations %>%
select(State) %>%
group_by(State) %>%
tally(sort = T)
## # A tibble: 102 x 2
## State n
## <chr> <int>
## 1 MD 2121766
## 2 VA 59755
## 3 PA 55845
## 4 NJ 22514
## 5 NY 19394
## 6 FL 14605
## 7 NC 14170
## 8 TN 13687
## 9 DC 12827
## 10 DE 11805
## # ... with 92 more rows