Clone the provided repository. Write a vignette using one TidyVerse package. Write a vignette using more than one TidyVerse package. For this assignment I use the lubridate TidyVerse package with NJ Transit train data for the Northeast Corridor line for the month of January 2020.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(RCurl)
library(DT)
The data for this vignette was obtained from Kaggle, which hosts NJ Transit train data from 03/2018 to 05/2020. Since this is a large dataset, I decided to focus on the last month I was taking the train to work and on the line I rode, the Northeast Corridor. To do so, I filtered the original 2020_01 file from Kaggle on column L (line), keeping only “Northeast Corrdr”. This also reduced the file size enough to save it to GitHub (34,426 KB to 4,723 KB).
x <- url("https://raw.githubusercontent.com/gabbypaola/DATA-607/main/2020_01%20NEC.csv")
NEC <- read_csv(x)
##
## -- Column specification --------------------------------------------------------
## cols(
## date = col_character(),
## train_id = col_double(),
## stop_sequence = col_double(),
## from = col_character(),
## from_id = col_double(),
## to = col_character(),
## to_id = col_double(),
## scheduled_time = col_character(),
## actual_time = col_character(),
## delay_minutes = col_double(),
## status = col_character(),
## line = col_character(),
## type = col_character()
## )
head(NEC, 5)
## # A tibble: 5 x 13
## date train_id stop_sequence from from_id to to_id scheduled_time
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 1/1/2~ 7867 1 New York~ 105 New York~ 105 1/1/2020 19:03
## 2 1/1/2~ 7867 2 New York~ 105 Secaucus~ 38187 1/1/2020 19:12
## 3 1/1/2~ 7867 3 Secaucus~ 38187 Newark P~ 107 1/1/2020 19:21
## 4 1/1/2~ 7867 4 Newark P~ 107 Newark A~ 37953 1/1/2020 19:26
## 5 1/1/2~ 7867 5 Newark A~ 37953 Metropark 83 1/1/2020 19:40
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## # status <chr>, line <chr>, type <chr>
tail(NEC, 5)
## # A tibble: 5 x 13
## date train_id stop_sequence from from_id to to_id scheduled_time
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 1/31/2~ 3837 10 Metuchen 84 Edison 38 1/31/2020 11:~
## 2 1/31/2~ 3837 11 Edison 38 New Bru~ 103 1/31/2020 12:~
## 3 1/31/2~ 3837 12 New Brun~ 103 Princet~ 125 1/31/2020 12:~
## 4 1/31/2~ 3837 13 Princeto~ 125 Hamilton 32905 1/31/2020 12:~
## 5 1/31/2~ 3837 14 Hamilton 32905 Trenton 148 1/31/2020 12:~
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## # status <chr>, line <chr>, type <chr>
The lubridate package offers many conveniences for working with dates. For example, to get the current date or date-time you can use today() or now(). The system time zone is available through base R's Sys.timezone(); as you can see, I am in the New York time zone. Lubridate also accounts for daylight saving time and will print “EDT” rather than “EST” while it is in effect. A list of known time zone names can be requested with base R's OlsonNames(), as seen below. On my system it contains a total of 593 time zone names:
today()
## [1] "2021-04-25"
Sys.timezone()
## [1] "America/New_York"
head(OlsonNames())
## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
length(OlsonNames())
## [1] 593
Using the NJ Transit data for the Northeast Corridor line for the month of January 2020, we can experiment with dates and times using the lubridate package. The dmy, mdy, and ymd functions parse a date whose components appear in the named order (months may also be written out as words) and return it in YYYY-MM-DD form. Here mdy converts the date 01/31/2020 from the dataset:
mdy(NEC$date[37692:37697])
## [1] "2020-01-31" "2020-01-31" "2020-01-31" "2020-01-31" "2020-01-31"
## [6] "2020-01-31"
Additional examples using dmy, mdy, and ymd functions with different input values:
dmy(26051965)
## [1] "1965-05-26"
mdy("April 16 1924")
## [1] "1924-04-16"
ymd("2026 February 14")
## [1] "2026-02-14"
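As a small sketch of how forgiving these parsers are, the example below passes mdy() the same date written three different ways. The input strings are made up for illustration and are not taken from the NJ Transit file, and the month names are assumed to be parsed under an English locale.

```r
library(lubridate)

# Illustrative inputs (not from the NJ Transit data): one date, three spellings
raw_dates <- c("1/31/2020", "January 31, 2020", "Jan 31 2020")

# mdy() expects month, then day, then year, and tolerates different separators
parsed <- mdy(raw_dates)
parsed

# Every element should come back as the same Date
all(parsed == as.Date("2020-01-31"))
```

A string that cannot be read in month-day-year order comes back as NA with a warning, so checking for NA values after parsing is a cheap sanity test.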
Lubridate also has two time zone functions: with_tz, which shows the same moment in time on the clock of another time zone, and force_tz, which keeps the clock time but changes the time zone attached to it (and therefore the moment it refers to). It is important to specify the time zone early on when working with time-sensitive data, especially when converting dates with functions such as mdy_hm(), dmy_hms(), and ymd_hms(), because they default to UTC (Coordinated Universal Time). If the times in the data were recorded in Eastern time but are parsed as UTC and then converted to Eastern time, the result is an unintentional time shift.
timezone <- force_tz(mdy_hm(NEC$scheduled_time[17441]), "America/New_York")
timezone
## [1] "2020-01-15 13:39:00 EST"
# Changes printing
with_tz(timezone, "America/Chicago")
## [1] "2020-01-15 12:39:00 CST"
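To make the difference between the two functions concrete, here is a minimal sketch that starts from a timestamp parsed with the default UTC time zone. The literal string reuses the clock time from the example above; the variable names are only illustrative.

```r
library(lubridate)

# Parsed without a tz argument, so lubridate assumes UTC
t_utc <- mdy_hm("1/15/2020 13:39")

# force_tz(): same clock time, new zone, so the underlying instant moves
forced <- force_tz(t_utc, "America/New_York")

# with_tz(): same instant, shown on a different clock
shown <- with_tz(t_utc, "America/New_York")

# Supplying tz at parse time is equivalent to parsing and then calling force_tz()
direct <- mdy_hm("1/15/2020 13:39", tz = "America/New_York")

hour(forced)      # 13: the clock time is preserved
hour(shown)       # 8: 13:39 UTC is 08:39 in New York in January
forced == direct  # TRUE
shown == t_utc    # TRUE: still the same instant
```

Passing tz directly to the parsing function is probably the simplest way to avoid the accidental shift described above, since the value never exists as a UTC timestamp in the first place.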
You can also pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). Such functions can come in handy if, for example, you want to split the scheduled_time and actual_time columns into their time components. Below we use the example date-time 1/29/2020 14:11. First the mdy_hm function converts it to YYYY-MM-DD HH:MM:SS TZ format, and then each of the functions mentioned above is applied to the result.
datetime <- force_tz(mdy_hm(NEC$scheduled_time[33832]), "America/New_York")
year(datetime)
## [1] 2020
month(datetime)
## [1] 1
mday(datetime)
## [1] 29
yday(datetime) #day of the year
## [1] 29
wday(datetime, label=TRUE, abbr = FALSE)
## [1] Wednesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hour(datetime)
## [1] 14
minute(datetime)
## [1] 11
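The column-splitting idea mentioned above could look roughly like the following sketch. It uses a two-row toy tibble in the same "m/d/yyyy hh:mm" layout as scheduled_time rather than the full NEC data, and the new column names (sched_day, sched_wday, and so on) are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy rows in the same layout as scheduled_time (values are illustrative)
toy <- tibble(scheduled_time = c("1/29/2020 14:11", "1/1/2020 19:03"))

parts <- toy %>%
  mutate(
    # Parse once, with the time zone set at parse time
    scheduled_time = mdy_hm(scheduled_time, tz = "America/New_York"),
    # Then derive one column per component
    sched_day    = mday(scheduled_time),
    sched_wday   = wday(scheduled_time, label = TRUE, abbr = FALSE),
    sched_hour   = hour(scheduled_time),
    sched_minute = minute(scheduled_time)
  )
parts
```

Parsing the column once and deriving the components inside the same mutate() call keeps the original timestamp available for later arithmetic.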
It is also possible to nest functions. For example, to see what day of the year 03/12/2020 is, first parse it with the mdy function and then pass the result to yday().
yday(mdy(03122020))
## [1] 72
A function such as wday() is useful for plotting which day of the week has the most train rides. Based on the plot below, it looks like Thursday and Friday, with Friday having the most train rides in and out of New York City. Part of this is a calendar effect: January 2020 contained five Wednesdays, Thursdays, and Fridays but only four of every other weekday. It may also be because NJ Transit knows people want to get home on time for the weekend!
NEC$scheduled_time <- force_tz(mdy_hm(NEC$scheduled_time), "America/New_York")
NEC %>%
  mutate(wday = wday(scheduled_time, label = TRUE)) %>%
  ggplot(aes(x = wday)) +
  geom_bar(fill="#f68060", alpha=.6, width=.4) +
  xlab("Day of the Week") +
  ggtitle("Train counts by day for January 2020")
### Analyzing Delays
A new dataset named NEC3 was created from rows 1 to 15 of the NEC dataset to analyze delays for two specific train IDs, 7858 and 7867. Comparing their delay times shows that train 7867 has more delay minutes than train 7858.
NEC3<-NEC[1:15,]
head(NEC3)
## # A tibble: 6 x 13
## date train_id stop_sequence from from_id to to_id scheduled_time
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dttm>
## 1 1/1/2~ 7867 1 New Yo~ 105 New Y~ 105 2020-01-01 19:03:00
## 2 1/1/2~ 7867 2 New Yo~ 105 Secau~ 38187 2020-01-01 19:12:00
## 3 1/1/2~ 7867 3 Secauc~ 38187 Newar~ 107 2020-01-01 19:21:00
## 4 1/1/2~ 7867 4 Newark~ 107 Newar~ 37953 2020-01-01 19:26:00
## 5 1/1/2~ 7867 5 Newark~ 37953 Metro~ 83 2020-01-01 19:40:00
## 6 1/1/2~ 7867 6 Metrop~ 83 Metuc~ 84 2020-01-01 19:45:00
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## # status <chr>, line <chr>, type <chr>
p <- ggplot(NEC3, aes(x = delay_minutes, fill = train_id)) +
  geom_histogram() +
  facet_wrap(~train_id)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Histogram 2
# Refer to columns by bare name inside aes(); NEC3$... triggers ggplot2 warnings
p1 <- ggplot(NEC3, aes(x = delay_minutes, fill = train_id)) +
  geom_histogram() +
  facet_wrap(~train_id, ncol = 1)
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
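The histograms rely on the delay_minutes column that ships with the data. As a hedged sketch of how a comparable figure could be derived with lubridate, the example below computes a delay from scheduled and actual timestamps and summarises it per train. The train labels, times, and resulting delays are invented for illustration, and it is an assumption that actual_time in the real file uses the same "m/d/yyyy hh:mm" layout as scheduled_time.

```r
library(dplyr)
library(lubridate)

# Toy stop-level rows; train labels and times are invented for illustration
stops <- tibble(
  train_id       = c("A", "A", "B", "B"),
  scheduled_time = c("1/1/2020 19:03", "1/1/2020 19:12",
                     "1/1/2020 20:00", "1/1/2020 20:10"),
  actual_time    = c("1/1/2020 19:05", "1/1/2020 19:18",
                     "1/1/2020 20:00", "1/1/2020 20:11")
)

delays <- stops %>%
  mutate(
    scheduled_time = mdy_hm(scheduled_time, tz = "America/New_York"),
    actual_time    = mdy_hm(actual_time, tz = "America/New_York"),
    # Minutes between the actual and the scheduled time
    delay = as.numeric(difftime(actual_time, scheduled_time, units = "mins"))
  ) %>%
  group_by(train_id) %>%
  summarise(mean_delay = mean(delay), max_delay = max(delay))
delays
```

A per-train summary like this gives a numeric counterpart to the faceted histograms when only two or three trains are being compared.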
Additionally, we can use the update function to update a specified date such as 1/29/2020.
date <- mdy(NEC$date[33832])
date <- update(date, year = 2020, month = 2, mday = 26) #changes month to 2 and date to 26
date
## [1] "2020-02-26"
Functions such as dseconds(), dminutes(), dhours(), ddays(), dweeks(), and dyears() create durations, which lubridate stores as an exact number of seconds and prints alongside a more readable approximation of the original input. As an example, the delay_minutes column is converted with dminutes() into a duration in seconds, stored in the new delay_min_to_sec column below.
NEC <- NEC %>%
  mutate(delay_min_to_sec = dminutes(delay_minutes))
NECtable <- NEC[1:100, ]
datatable(NECtable, options = list(pageLength = 5, dom = 'tip'), rownames = FALSE)
Time periods are another feature offered by the lubridate package. As seen in the example below, periods provide a way to add days, weeks, months, and years to a date.
date2 <- mdy(NEC$date[1741])
date2
## [1] "2020-01-02"
date2 + days(1)
## [1] "2020-01-03"
date2 + weeks(2)
## [1] "2020-01-16"
date2 + months(3)
## [1] "2020-04-02"
date2 + years(4)
## [1] "2024-01-02"
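Periods and durations often give the same answer, but they differ when the clock changes or a month is short. The following is a minimal sketch of both cases; the dates are chosen for illustration (US daylight saving time began on March 8, 2020) and are not taken from the NJ Transit file.

```r
library(lubridate)

# Noon on the day before daylight saving time began in 2020
start <- ymd_hms("2020-03-07 12:00:00", tz = "America/New_York")

start + days(1)   # period: same clock time the next day, 12:00 EDT
start + ddays(1)  # duration: exactly 86400 seconds later, 13:00 EDT

# Month arithmetic with periods can land on a date that does not exist
jan31 <- ymd("2020-01-31")
jan31 + months(1)     # NA, because there is no February 31
jan31 %m+% months(1)  # rolls back to the last valid day, 2020-02-29
```

Periods are usually the better fit for calendar questions such as "same time tomorrow", while durations suit elapsed-time questions such as how long a delay lasted.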