library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'ggplot2' was built under R version 4.2.2
library(lubridate)
library(httr)
library(jsonlite)
Tidyverse contains many packages within it that allows users to work with strings, mutate and rearange dataframes and access data through APIs or websites. We can see a few of these packages listed below.
tidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "ggplot2"
## [9] "googledrive" "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"
To demonstrate the capabilities of the lubridate package within the Tidyverse, I will be using a data set on requested film permits in NYC. As per its documentation “Lubridate provides tools that make it easier to parse and manipulate dates.”
Description of Data: The Film Office issues permits to productions filming on location in the City of New York and provides free police assistance, free parking privileges and access to most exterior locations free of charge. Not all filming activity requires a permit. These permits are generally required when asserting the exclusive use of city property, like a sidewalk, a street, or a park.
Source: NYC Filming Permits
import_data <- as.data.frame(fromJSON('https://data.cityofnewyork.us/resource/tg4x-b46p.json', simplifyVector = TRUE))
After importing the data using jsonlite, we are left we nineteen (19) columns which includes the start and end dates within different zip codes in New York City. We can see it uses the ISO 8601 standard format of date and time separated from each other with the usage of the string literal “T”.
rmarkdown::paged_table(head(import_data, 5))
Lubridate has fantastic documentation you can find here. To transform our date columns to be much more useful, I will start by using the ymd_hms() function. Here, it will transform the original datetime columns from ISO format to POSIXct.
update_df <-
import_data |>
mutate(start_date = ymd_hms(startdatetime),
end_date = ymd_hms(enddatetime))
Now that we have dates in a workable datetime format, I can extract the day of the week and the hour from the timestamps to perform some exploratory analysis. To keep it simple, the start date will only be used.
update_df <-
update_df |>
mutate(start_day_of_week = wday(start_date, label = TRUE, abbr = TRUE),
start_hour = hour(start_date))
Lets reduce the number of columns to look at for our exploration. This will contain the start date, start time (hour), category, event type, and borough.
| borough | category | eventtype | start_date | start_day_of_week | start_hour |
|---|---|---|---|---|---|
| Queens | Television | Shooting Permit | 2023-01-20 06:00:00 | Fri | 6 |
| Bronx | Television | Shooting Permit | 2023-01-20 09:00:00 | Fri | 9 |
| Manhattan | Television | Shooting Permit | 2023-01-20 11:30:00 | Fri | 11 |
| Brooklyn | Television | Shooting Permit | 2023-01-20 02:30:00 | Fri | 2 |
| Manhattan | Television | Shooting Permit | 2023-01-20 13:00:00 | Fri | 13 |
| Manhattan | Television | Shooting Permit | 2023-01-20 06:00:00 | Fri | 6 |
Lastly, two plots will be created filtering out for only “shooting permits” where we can see the day of the week filming begins within each borough and what hour does filming begin based on the category such as television or commercial.
| borough | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|---|
| Bronx | 1 | 4 | 6 | 5 | 5 | 6 | 1 |
| Brooklyn | 3 | 49 | 60 | 71 | 62 | 50 | 12 |
| Manhattan | 16 | 49 | 53 | 60 | 62 | 48 | 20 |
| Queens | 1 | 27 | 36 | 38 | 33 | 33 | 0 |
| Staten Island | 0 | 1 | 2 | 1 | 1 | 1 | 0 |
| start_hour | Commercial | Documentary | Film | Music Video | Still Photography | Student | Television | Theater | WEB |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 2 | 0 | 0 | 2 | 0 | 2 | 0 | 11 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 |
| 6 | 39 | 2 | 13 | 1 | 16 | 0 | 200 | 0 | 10 |
| 7 | 9 | 0 | 5 | 0 | 5 | 0 | 250 | 0 | 7 |
| 8 | 2 | 0 | 10 | 2 | 3 | 0 | 47 | 1 | 1 |
| 9 | 1 | 0 | 3 | 0 | 0 | 1 | 32 | 0 | 6 |
| 10 | 0 | 0 | 1 | 0 | 0 | 0 | 27 | 0 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| 12 | 0 | 0 | 2 | 0 | 0 | 0 | 15 | 2 | 0 |
| 13 | 0 | 0 | 2 | 0 | 0 | 0 | 10 | 0 | 0 |
| 14 | 0 | 0 | 2 | 0 | 0 | 0 | 7 | 0 | 0 |
| 15 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 |
| 16 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 |
| 17 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| 19 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Another functionality of the lubridate package is
the ability to extract just the date from a timestamp, or to convert
from a character type date to a date data type using
as_date().
First, let’s use as_date() to extract the date from each
row of startdatetime and create a new variable
start_date:
import_data |>
mutate(start_date = as_date(startdatetime)) |>
select(borough, category, eventtype, startdatetime, start_date) |>
head(5) |>
knitr::kable()
| borough | category | eventtype | startdatetime | start_date |
|---|---|---|---|---|
| Queens | Television | Shooting Permit | 2023-01-20T06:00:00.000 | 2023-01-20 |
| Bronx | Television | Shooting Permit | 2023-01-20T09:00:00.000 | 2023-01-20 |
| Manhattan | Television | Shooting Permit | 2023-01-20T11:30:00.000 | 2023-01-20 |
| Brooklyn | Television | Shooting Permit | 2023-01-20T02:30:00.000 | 2023-01-20 |
| Manhattan | Television | Shooting Permit | 2023-01-20T13:00:00.000 | 2023-01-20 |
Now, let’s use as_date() to convert dates from a
character data type to a date data type. If the character date is not in
the form “yyyy-mm-dd”, we can use the format argument to
tell the function how the date is formatted so that it knows how to
convert it.
Let’s convert a few different dates from character to date data types. Note that you do not need to provide a 2 digit date, as the function will supply a starting zero for single digit months or days. You do, however, need to provide a four digit year.
# format = yyyy-mm-dd
dates1 <- c("2021-4-7", "2021-9-13", "2021-10-15", "2021-11-14")
as_date(dates1)
## [1] "2021-04-07" "2021-09-13" "2021-10-15" "2021-11-14"
# format = mm-dd-yyyy
dates2 <- c("4-7-2021", "9-13-2021", "10-15-2021", "11-14-2021")
as_date(dates2, format = '%m-%d-%Y')
## [1] "2021-04-07" "2021-09-13" "2021-10-15" "2021-11-14"
# format = mm/dd/yyyy
dates3 <- c("4/7/2021", "9/13/2021", "10/15/2021", "11/14/2021")
as_date(dates3, format = '%m/%d/%Y')
## [1] "2021-04-07" "2021-09-13" "2021-10-15" "2021-11-14"
If the date is improperly formatted, the function will coerce those values to NA.
dates4 <- c("12-1-2021", "1-5-2023", "2/14/2023")
as_date(dates4, format = '%m-%d-%Y')
## [1] "2021-12-01" "2023-01-05" NA
Additional documentation for lubridate package can be found here.