Assignment Overview

Clone the provided repository. Write a vignette using one TidyVerse package, or write a vignette using more than one TidyVerse package. For this assignment I use the lubridate package with NJ Transit train data for the Northeast Corridor line for January 2020.

Packages

library(tidyverse)
library(lubridate)
library(ggplot2)
library(RCurl)
library(DT)

Load the Data

The data for this vignette was obtained from Kaggle, which hosts NJ Transit train data from 03/2018 to 05/2020. Since this is a large dataset, I decided to focus on January 2020, the last month I was taking the train to work, and on the line I rode, the Northeast Corridor. To do so, I filtered the original 2020_01 file from Kaggle on column L (line) for “Northeast Corrdr”, which also reduced the file size enough to save it to GitHub (34,426 KB to 4,723 KB).

x <- url("https://raw.githubusercontent.com/gabbypaola/DATA-607/main/2020_01%20NEC.csv")
NEC <- read_csv(x)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   date = col_character(),
##   train_id = col_double(),
##   stop_sequence = col_double(),
##   from = col_character(),
##   from_id = col_double(),
##   to = col_character(),
##   to_id = col_double(),
##   scheduled_time = col_character(),
##   actual_time = col_character(),
##   delay_minutes = col_double(),
##   status = col_character(),
##   line = col_character(),
##   type = col_character()
## )
head(NEC, 5)
## # A tibble: 5 x 13
##   date   train_id stop_sequence from      from_id to        to_id scheduled_time
##   <chr>     <dbl>         <dbl> <chr>       <dbl> <chr>     <dbl> <chr>         
## 1 1/1/2~     7867             1 New York~     105 New York~   105 1/1/2020 19:03
## 2 1/1/2~     7867             2 New York~     105 Secaucus~ 38187 1/1/2020 19:12
## 3 1/1/2~     7867             3 Secaucus~   38187 Newark P~   107 1/1/2020 19:21
## 4 1/1/2~     7867             4 Newark P~     107 Newark A~ 37953 1/1/2020 19:26
## 5 1/1/2~     7867             5 Newark A~   37953 Metropark    83 1/1/2020 19:40
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## #   status <chr>, line <chr>, type <chr>
tail(NEC,5)
## # A tibble: 5 x 13
##   date    train_id stop_sequence from      from_id to       to_id scheduled_time
##   <chr>      <dbl>         <dbl> <chr>       <dbl> <chr>    <dbl> <chr>         
## 1 1/31/2~     3837            10 Metuchen       84 Edison      38 1/31/2020 11:~
## 2 1/31/2~     3837            11 Edison         38 New Bru~   103 1/31/2020 12:~
## 3 1/31/2~     3837            12 New Brun~     103 Princet~   125 1/31/2020 12:~
## 4 1/31/2~     3837            13 Princeto~     125 Hamilton 32905 1/31/2020 12:~
## 5 1/31/2~     3837            14 Hamilton    32905 Trenton    148 1/31/2020 12:~
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## #   status <chr>, line <chr>, type <chr>

The lubridate package offers many conveniences for working with dates. For example, today() returns the current date and now() returns the current date-time. Base R's Sys.timezone() reports the system timezone; as you can see, I am in the New York time zone. Date-times printed in that zone account for daylight saving time, showing “EDT” rather than “EST” while it is in effect. The list of recognized timezone names is returned by OlsonNames(), as seen below. On my system it contains 593 names:

today()
## [1] "2021-04-25"
Sys.timezone()
## [1] "America/New_York"
head(OlsonNames())
## [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
## [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"
length(OlsonNames())
## [1] 593
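
To illustrate the EST/EDT behavior described above, here is a minimal sketch that uses two made-up date-times (not taken from the NJ Transit data), one in winter and one in summer. It relies on lubridate's dst() and on base R's format() with %Z; the abbreviations printed depend on the time zone database installed on your system.

```r
library(lubridate)

# Two illustrative date-times in the New York time zone (hypothetical values)
winter <- ymd_hms("2020-01-15 12:00:00", tz = "America/New_York")
summer <- ymd_hms("2020-07-15 12:00:00", tz = "America/New_York")

# dst() reports whether daylight saving time is in effect at that instant
dst(winter)
dst(summer)

# %Z prints the zone abbreviation in effect (typically "EST" and "EDT")
format(winter, "%Z")
format(summer, "%Z")
```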

NJ Transit Northeast Corridor January 2020

Using the NJ Transit data for the Northeast Corridor line for January 2020, we can experiment with dates and times using the lubridate package. The dmy(), mdy(), and ymd() functions parse a date whose components appear in the stated order (months may also be spelled out) and return it in YYYY-MM-DD form. Here mdy() converts the date 1/31/2020 from the dataset:

mdy(NEC$date[37692:37697])
## [1] "2020-01-31" "2020-01-31" "2020-01-31" "2020-01-31" "2020-01-31"
## [6] "2020-01-31"

Additional examples using dmy, mdy, and ymd functions with different input values:

dmy(26051965)
## [1] "1965-05-26"
mdy("April 16 1924")
## [1] "1924-04-16"
ymd("2026 February 14")
## [1] "2026-02-14"
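
The parsers tolerate different separators and spelled-out months, but they return NA (with a warning) when a string cannot be read in the stated order. A short sketch with made-up inputs:

```r
library(lubridate)

# Different spellings of the same date all parse with mdy()
same_day <- mdy(c("01/31/2020", "1-31-2020", "January 31, 2020"))
same_day

# An impossible date fails to parse and becomes NA; suppressWarnings() hides
# the parse warning for this demonstration
bad <- suppressWarnings(mdy("13/45/2020"))
is.na(bad)
```

Checking for NA after parsing a whole column is a cheap way to spot rows that are in an unexpected format.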

Time Zones

Lubridate also has timezone-related functions: with_tz(), which displays the same instant in the specified timezone (only the printed clock time changes), and force_tz(), which keeps the clock time but assigns it to the specified timezone (the underlying instant changes). It is important to specify the timezone early when working with time-sensitive data, especially when parsing with mdy_hm(), dmy_hms(), ymd_hms(), and similar functions, because they default to Coordinated Universal Time (UTC). If the data was recorded in Eastern time but parsed as UTC, a later conversion from UTC to Eastern time will shift the clock time unintentionally.

timezone <- force_tz(mdy_hm(NEC$scheduled_time[17441]), "America/New_York")
timezone
## [1] "2020-01-15 13:39:00 EST"
# Changes printing
with_tz(timezone, "America/Chicago")
## [1] "2020-01-15 12:39:00 CST"
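
As an alternative to parsing first and forcing the zone afterwards, the parsing functions accept a tz argument. The sketch below types in the clock time shown in the output above by hand (so it does not need the NEC data) and compares the two routes, along with the effect of with_tz().

```r
library(lubridate)

# Clock time from the example above, entered as a literal string
stamp <- "1/15/2020 13:39"

utc       <- mdy_hm(stamp)                           # parsed as UTC by default
ny_direct <- mdy_hm(stamp, tz = "America/New_York")  # parsed as Eastern time
ny_forced <- force_tz(utc, "America/New_York")       # clock time kept, zone replaced

# Parsing with tz = ... and forcing afterwards give the same instant
ny_direct == ny_forced

# with_tz() keeps the instant and changes only the display:
# 13:39 UTC is shown as 08:39 in New York
ny_view <- with_tz(utc, "America/New_York")
ny_view == utc
hour(ny_view)

# force_tz() moved the instant by the UTC offset (5 hours in January)
difftime(ny_forced, utc, units = "hours")
```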

Accessor Functions

You can also pull out individual parts of a date-time with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). These come in handy if, for example, you want to split the scheduled_time and actual_time columns by time component. Below we take the example date-time 1/29/2020 14:11. First we use mdy_hm() to convert it to YYYY-MM-DD HH:MM:SS TZ format, and then we apply each of the functions above.

datetime <- force_tz(mdy_hm(NEC$scheduled_time[33832]), "America/New_York")

year(datetime)
## [1] 2020
month(datetime)
## [1] 1
mday(datetime) 
## [1] 29
yday(datetime) #day of the year
## [1] 29
wday(datetime, label=TRUE, abbr = FALSE)
## [1] Wednesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hour(datetime)
## [1] 14
minute(datetime)
## [1] 11
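
Splitting a column into components, as suggested above, might look like the following sketch. It uses a tiny hypothetical tibble named stops with the same column name and date format as the NEC data; the new column names (sched_day and so on) are made up for illustration.

```r
library(dplyr)
library(lubridate)

# A tiny stand-in for the NEC data (hypothetical rows)
stops <- tibble(scheduled_time = c("1/29/2020 14:11", "1/31/2020 12:05"))

parts <- stops %>%
  mutate(
    scheduled_time = mdy_hm(scheduled_time, tz = "America/New_York"),
    sched_day      = mday(scheduled_time),
    sched_wday     = wday(scheduled_time),  # 1 = Sunday, ..., 7 = Saturday by default
    sched_hour     = hour(scheduled_time),
    sched_minute   = minute(scheduled_time)
  )
parts
```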

It is also possible to nest functions. For example, to see what day of the year 03/12/2020 is, first parse the date with mdy() and then pass the result to yday().

# Pass the date as a string: the bare number 03122020 loses its leading zero
yday(mdy("03/12/2020"))
## [1] 72

A function such as wday() is useful for plotting which day of the week sees the most train activity. Each row of the data is one stop of one train, so the plot below counts stops per weekday. Thursday and Friday rank highest, with Friday having the most stops in and out of New York City. Part of this is a calendar effect: January 2020 had five Wednesdays, Thursdays, and Fridays but only four of every other weekday. Perhaps NJ Transit also knows people want to get home on time for the weekend!

NEC$scheduled_time <- force_tz(mdy_hm(NEC$scheduled_time), "America/New_York")
NEC %>% 
  mutate(wday = wday(scheduled_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
  geom_bar(fill="#f68060", alpha=.6, width=.4) +
  xlab("Day of the Week")+
  ggtitle("Train counts by day for January 2020")
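
When comparing weekdays within a single month, it helps to know how often each weekday occurs, since a weekday that falls five times will collect more rows than one that falls four times. A minimal sketch that counts the weekdays of January 2020 from the calendar alone (no NJ Transit data involved):

```r
library(lubridate)

# Every calendar day in January 2020
jan <- seq(ymd("2020-01-01"), ymd("2020-01-31"), by = "day")

# Number of times each weekday occurs (1 = Sunday, ..., 7 = Saturday)
days_per_wday <- tabulate(wday(jan), nbins = 7)
days_per_wday

# The same counts with weekday labels
table(wday(jan, label = TRUE))
```

Dividing the per-weekday row counts by these numbers would give an average per calendar day; that step is a suggestion, not part of the original analysis.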

Analyzing Delays

A new dataset named NEC3 is created from rows 1 to 15 of the NEC dataset to compare delays for two specific trains, train_id 7858 and 7867. Of the two, train 7867 accumulates more delay minutes than train 7858.

NEC3<-NEC[1:15,]
head(NEC3)
## # A tibble: 6 x 13
##   date   train_id stop_sequence from    from_id to     to_id scheduled_time     
##   <chr>     <dbl>         <dbl> <chr>     <dbl> <chr>  <dbl> <dttm>             
## 1 1/1/2~     7867             1 New Yo~     105 New Y~   105 2020-01-01 19:03:00
## 2 1/1/2~     7867             2 New Yo~     105 Secau~ 38187 2020-01-01 19:12:00
## 3 1/1/2~     7867             3 Secauc~   38187 Newar~   107 2020-01-01 19:21:00
## 4 1/1/2~     7867             4 Newark~     107 Newar~ 37953 2020-01-01 19:26:00
## 5 1/1/2~     7867             5 Newark~   37953 Metro~    83 2020-01-01 19:40:00
## 6 1/1/2~     7867             6 Metrop~      83 Metuc~    84 2020-01-01 19:45:00
## # ... with 5 more variables: actual_time <chr>, delay_minutes <dbl>,
## #   status <chr>, line <chr>, type <chr>

Histogram 1

# Convert train_id to a factor so that fill maps to discrete colors
p <- ggplot(NEC3, aes(x = delay_minutes, fill = factor(train_id))) + geom_histogram() + facet_wrap(~train_id)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram 2

# Refer to columns by bare name inside aes() rather than with NEC3$
p1 <- ggplot(NEC3, aes(x = delay_minutes, fill = factor(train_id))) + geom_histogram() + facet_wrap(~train_id, ncol = 1)
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
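
Since this vignette is about lubridate, it is worth noting that a delay can also be derived from two parsed timestamps. The sketch below uses made-up rows with the same column names as the data. It assumes that actual_time uses the same format as scheduled_time and that a delay is simply actual_time minus scheduled_time; whether the delay_minutes column in the Kaggle file was computed exactly this way is not confirmed here.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows; the timestamps are invented for illustration
stops <- tibble(
  train_id       = c(7867, 7867, 7858),
  scheduled_time = c("1/1/2020 19:03", "1/1/2020 19:12", "1/1/2020 20:00"),
  actual_time    = c("1/1/2020 19:05", "1/1/2020 19:17", "1/1/2020 20:01")
)

delays <- stops %>%
  mutate(
    scheduled_time = mdy_hm(scheduled_time, tz = "America/New_York"),
    actual_time    = mdy_hm(actual_time, tz = "America/New_York"),
    # Explicit units keep difftime() from choosing seconds, minutes, or hours on its own
    delay = as.numeric(difftime(actual_time, scheduled_time, units = "mins"))
  ) %>%
  group_by(train_id) %>%
  summarise(total_delay = sum(delay), mean_delay = mean(delay))
delays
```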

Additionally, the update() function changes selected components of a date, such as 1/29/2020 below.

date <- mdy(NEC$date[33832])
date <- update(date, year = 2020, month = 2, mday = 26) # changes the month to 2 and the day to 26
date
## [1] "2020-02-26"

Durations

Functions such as dseconds(), dminutes(), dhours(), ddays(), dweeks(), and dyears() create durations, which are stored as an exact number of seconds and printed alongside an approximation in larger units. As an example, the delay_minutes column is converted to seconds with dminutes() and stored in the new delay_min_to_sec column below.

NEC <- NEC %>% 
  mutate(delay_min_to_sec = dminutes(delay_minutes))
NECtable<-NEC[1:100,] 
datatable(NECtable,options = list(pageLength = 5, dom = 'tip'), rownames = FALSE)
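
To see what a duration looks like without loading the full table, here is a small sketch with hypothetical delay values. It also shows that a duration can be added to a date-time.

```r
library(lubridate)

# Hypothetical delays in minutes
delay_minutes <- c(0, 2, 5.5)

d <- dminutes(delay_minutes)
d  # printed as seconds, with an approximation in larger units

# as.numeric() extracts the underlying number of seconds
as.numeric(d)

# A duration shifts a date-time by that exact amount of elapsed time
scheduled <- ymd_hm("2020-01-29 14:11", tz = "America/New_York")
scheduled + dminutes(5)
```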

Periods

Periods are another feature of the lubridate package. As seen in the example below, period functions such as days(), weeks(), months(), and years() let you add calendar units to a date.

date2 <- mdy(NEC$date[1741])
date2
## [1] "2020-01-02"
date2 + days(1)
## [1] "2020-01-03"
date2 + weeks(2)
## [1] "2020-01-16"
date2 + months(3)
## [1] "2020-04-02"
date2 + years(4)
## [1] "2024-01-02"
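
Periods follow the calendar, whereas durations count exact seconds, and the two can disagree. The sketch below uses illustrative dates (not drawn from the dataset) to show the difference across a daylight saving change, and what happens when a month is added to January 31.

```r
library(lubridate)

# Daylight saving time began on 2020-03-08 in New York
before <- ymd_hms("2020-03-07 12:00:00", tz = "America/New_York")

before + days(1)   # period: same clock time on the next calendar day
before + ddays(1)  # duration: exactly 86400 seconds later, one clock hour further on

# February 31 does not exist, so adding a one-month period gives NA
jan31 <- ymd("2020-01-31")
jan31 + months(1)

# %m+% rolls back to the last valid day of the target month instead
jan31 %m+% months(1)
```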
