a3.Rmd

As a prerequisite, I load any required packages and set stringsAsFactors to false:

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

library(ggplot2)

options(stringsAsFactors=FALSE)

I used the provided script a3_get_ttc_data.R to create a serialized data object of data wrangled from Toronto’s open data portal for ttc subway delay data; the object was saved as a3_ttc_subway_delay. Here, I use load() on that data object and assign it to “subwaydata”.

subwaydata <- load("C:/users/ruffs/Desktop/csc275/a3/a3_ttc_subway_delay.RData")

Note, “subwaydata” contains 3 tables: the first, “ttcSubwayDelayCodebook”, contains explanatory annotations for various delay fields.

ttcSubwayDelayCodebook

## # A tibble: 10 x 3
##    `Field Name` Description                               Example              
##    <chr>        <chr>                                     <chr>                
##  1 Date         Date (YYYY/MM/DD)                         42735                
##  2 Time         Time (24h clock)                          8.2638888888888887E-2
##  3 Day          Name of the day of the week               Saturday             
##  4 Station      TTC subway station name                   Rosedale Station     
##  5 Code         TTC delay code                            MUIS                 
##  6 Min Delay    Delay (in minutes) to subway service      5                    
##  7 Min Gap      Time length (in minutes) between trains   9                    
##  8 Bound        Direction of train dependant on the line  N                    
##  9 Line         TTC subway line i.e. YU, BD, SHP, and SRT YU                   
## 10 Vehicle      TTC train number                          5961

The second, “ttcSubwayDelay2019”, has data on every delay in 2019.

ttcSubwayDelay2019

## # A tibble: 18,473 x 9
##    Day   Station Code  Min.Delay Min.Gap Bound Line  Vehicle date_time          
##    <chr> <chr>   <chr>     <dbl>   <dbl> <chr> <chr>   <dbl> <dttm>             
##  1 Mond~ SHERBO~ SUDP          4       8 E     BD       5245 2019-04-01 00:02:00
##  2 Mond~ WILSON~ MUPAA         0       0 N     YU       5541 2019-04-01 01:11:00
##  3 Mond~ KIPLIN~ MUIS          0       0 W     BD         NA 2019-04-01 01:48:00
##  4 Mond~ DAVISV~ PUTOE         0       0 <NA>  YU         NA 2019-04-01 02:18:00
##  5 Mond~ ST AND~ MUNCA         0       0 <NA>  YU         NA 2019-04-01 05:26:00
##  6 Mond~ FINCH ~ TUSC          0       0 <NA>  YU       5511 2019-04-01 05:36:00
##  7 Mond~ DAVISV~ EUYRD         4       8 N     YU       5746 2019-04-01 05:46:00
##  8 Mond~ GREENW~ EUSC          0       0 E     BD       5121 2019-04-01 05:54:00
##  9 Mond~ COXWEL~ MUSC          0       0 E     BD       5296 2019-04-01 06:08:00
## 10 Mond~ DUPONT~ MUATC         9      12 N     YU       5456 2019-04-01 06:44:00
## # ... with 18,463 more rows

The third, “ttcSubwayDelayCodes”, has codes which explain the reasons for delays as listed in the above table.

ttcSubwayDelayCodes

## # A tibble: 129 x 2
##    Code  description                               
##    <chr> <chr>                                     
##  1 EUAC  Air Conditioning                          
##  2 EUAL  Alternating Current                       
##  3 EUATC ATC RC&S Equipment                        
##  4 EUBK  Brakes                                    
##  5 EUBO  Body                                      
##  6 EUCA  Compressed Air                            
##  7 EUCD  Consequential Delay (2nd Delay Same Fault)
##  8 EUCH  Chopper Control                           
##  9 EUCO  Couplers                                  
## 10 EUDO  Door Problems - Faulty Equipment          
## # ... with 119 more rows

From this information, Question 1 asks: “Which delays are the longest? How long are the most frequent delays?” Below, I break this question down into two parts to address it with greater ease.

Question 1a: Which delays are the longest?

First, I create an object “actualDelays” which contains only delays that are not equal to 0 minutes.

actualDelays <- 
  ttcSubwayDelay2019 %>%
  filter(Min.Delay != 0) %>%
  arrange(desc(Min.Delay))

Then, to analyze which types of delays are the longest, I group them by “code,” i.e. the type of delay experienced, then determine the means of each group, and arrange them in descending order for ease of use.

delays_means_by_code_df <-
    data.frame(
    actualDelays %>%
    group_by(Code) %>%
    summarize(mean_delay=mean(Min.Delay)) %>%
    arrange(desc(mean_delay)) %>%
    ungroup())

I do the same here to determine standard deviations instead of means.

delays_sd_by_code_df <-
    data.frame(
    actualDelays %>%
    group_by(Code) %>%
    summarize(sd_delay=sd(Min.Delay)) %>%
    ungroup())

Then, I join them together in one dataframe so I can access both means and std. devs.

delaysDf <- left_join(delays_means_by_code_df, delays_sd_by_code_df, by="Code")

Using the ttcSubwayDelayCodes dataframe, I join with the dataframe I created above so that each listed object has a code and also a description.

delaysDfwithNames <- left_join(delaysDf, ttcSubwayDelayCodes, by="Code")

With this information all in one dataframe–the delays, their description, mean, and std. dev., I can create a bar chart with error bars to represent the data visually.

ggplot(head(delaysDfwithNames, 6)) +
  geom_bar(aes(x = description, y=mean_delay), stat="identity", fill="black") +
  geom_errorbar(aes(x = description, ymin=mean_delay-sd_delay, ymax=mean_delay+sd_delay), width=0.5, size=2, color = "skyblue") +
  geom_text(aes(x = description, y = mean_delay + 15, label = (round(mean_delay, 1))))+
  labs(title = "Causes of Longest Delays in the Toronto Subway System in 2019")+
  xlab("Cause")+
  theme(axis.text.x = element_text(angle = 90))+
  ylab("Mean Delay Time (Minutes)")

Here we can see the top 6 causes of long delays in the Toronto Subway System in 2019. The extreme error bars on the chart above hint that some of the delay types in the top 6 are present due to extreme outliers. To get a more holistic view of what is going on, I decided to also create a small scatter plot with the 20 longest delays and their causes.

To do this, I first join my fullDelayData df with the less restricted actualDelays df to regain information I need to create the plot.

fullDelayData <- inner_join(delaysDfwithNames, actualDelays, by="Code") %>%
  arrange(desc(Min.Delay))

Then, creating the plot itself:

ggplot(data = head(fullDelayData, 20), mapping=aes(x = date_time, y=Min.Delay, size = 3))+
  geom_point(mapping=aes(color=as.factor(description)))+
  xlab("Date")+
  ylab("Delay (Minutes)")+
  labs(title = "20 Longest Delays in the Toronto Subway System in 2019 and their Causes")+
  labs(color = "Cause")

Although the rankings provided by the bar chart above are 1)Traction Power Rail Related, 2)Structure Related Problem, 3)Rail Related Problem, 4) Priority One - Train in Contact with Person, 5)Fire/Smoke Plan A, and 6)Bomb Threat, this graph helps us parse through that data further. For example, we can see that 3)Rail Related Problems do not appear in high frequency in the top 20 delays, and that the one appearance has a very high resulting delay. This would suggest that rail related problems do not normally cause such extreme delays, and that one outlier has placed this issue in the top 6 ranking above. Conversely, “Held By Police - Non-TTC Related” appears to have a high frequency in the top 20 delays despite not being listed in the top 6.

Question 1b: How long are the most frequent delays?

To answer this question, I first group and arrange delay types by their frequency:

most_freq_delays <-
  fullDelayData %>%
  group_by(description) %>%
  count()%>%
  arrange(desc(n))%>%
  ungroup()

Then, I join this with the “fullDelayData” df to regain lost info:

most_freq_delays_full <-
  left_join(most_freq_delays, fullDelayData, by="description")

Next, I find the means and std. devs. again for these groupings (it’s a redundant step as I have done this for 1a, but these datasets are sorted by frequency instead of delay length). Again, I bring together mean and std. dev. dfs.

most_freq_delays_full_means_df <-
  data.frame(
    (most_freq_delays_full) %>%
      group_by(description, n) %>%
      summarize(mean_delay=mean(Min.Delay)) %>%
      arrange(desc(n))%>%
      ungroup())

most_freq_delays_full_sd_df <-
  data.frame(
    (most_freq_delays_full) %>%
      group_by(description, n) %>%
      summarize(sd_delay=sd(Min.Delay)) %>%
      arrange(desc(n))%>%
      ungroup())

most_freq_delaysDf <- inner_join(most_freq_delays_full_means_df, most_freq_delays_full_sd_df, by="description")

Using this df, I create a plot of delay time vs. frequency, with points colored to represent the different types of delay (causes).

ggplot(data = head(most_freq_delaysDf, 10), mapping=aes(x = mean_delay, y=n.x, size = 3))+
  geom_point(mapping=aes(color=as.factor(description)))+
  xlab("Mean Delay (Minutes)")+
  ylab("Frequency of Cause")+
  labs(title = "20 Most Frequent Causes of Delay in the Toronto Subway System in 2019")+
  labs(color = "Cause")

From this plot we can see that the top 20 most frequent causes of delay average in length from ~3 to 11 minutes, with most falling in the 4-6 minute range. Interestingly, the most frequent delay type is due to Disorderly Patrons, and averages about 5.5 minutes per delay. The longest delay in the top 20 most frequent is an Injured or ill Customer (On Train) that is Transported to safety; these delays appear to average about 10.5 minutes.

Question 2: Which stations have the most delays? Where are these stations?

To answer this, I count how many delays there are by each group (stations) and which line each station belongs to.

descending_station_delay_frequencies <-
actualDelays %>%
  group_by(Station, Line)%>%
  count()%>%
  arrange(desc(n))

Then, I use this information to create a bar plot to visualize the top 10 stations with the most delays:

ggplot(head(descending_station_delay_frequencies,10)) +
  geom_bar(aes(x = Station, y=n, fill=Line), stat="identity") +
  geom_text(aes(x = Station, y = n + 15, label = (n)))+
  labs(title = "Stations in the Toronto Subway System with the top 10 most Delays in 2019")+
  xlab("Station")+
  theme(axis.text.x = element_text(angle = 90))+
  ylab("Number of Delays")

Here we can see the top 10 most delayed subway stations in the Toronto system in 2019. Most of these stations (70%) are found on YU line, while others (30%) are on the BD line. This may have to do with the inherent efficiency of the lines, but could also be explained by one line being more popular than the other.

Question 3: What time of the day (hr) has the greatest number of delays? How do rush hour and weekends affect this?

First, I count how many delays there are by hour of day across the whole year.

time_of_delay <- actualDelays %>%
  mutate(delay_hour = hour(date_time))%>%
  group_by(delay_hour)%>%
  count()%>%
  arrange(desc(n))

Then, I do the same thing but remove weekends as a possibility.

time_of_delay_no_weekends <- actualDelays %>%
  filter(Day != "Saturday")%>%
  filter(Day != "Sunday")%>%
  mutate(delay_hour = hour(date_time))%>%
  group_by(delay_hour)%>%
  count()%>%
  arrange(desc(n))

I combine the two dfs so that I can plot both atonce.

timesDf <- full_join(time_of_delay, time_of_delay_no_weekends, by = "delay_hour")

By plotting the data with and without weekends included, I can determine the effect of weekends on delays by hour in 2019.

ggplot(timesDf, aes(delay_hour)) +                   
  geom_line(aes(y=n.x, colour="red")) + 
  geom_line(aes(y=n.y, colour="green"))+
  xlab("Time of Day")+
  ylab("Frequency of Delays")+
  labs(title = "Frequency of 2019 Delays in the Toronto Subway by Time of Day")

From the above, we can compare the frequency/time of day of the full week (green) and that of just the work week, with no weekend (red). On both lines, we see two spikes in delays: the first begins ~5AM and ends ~8:30AM. The second begins ~3PM and ends ~6PM. These two spikes correspond to morning and evening rush hour, to and from school/work, etc. Comparing the two lines shows us that delay patterns do not differ greatly during the weekend, as the general shape of the line stays the same in both graphs; a net decrease in hours is to be expected with the loss of the weekend. Some minor differences can be noticed: for example, delay rates after ~7PM decrease at a greater rate in the red (no weekend) graph, suggesting that delays (and travel) is significantly greater past 7PM on the weekend relative to weekdays.

a3.Rmd

David Ruff

4/15/2020