Data Source: NYC MTA Turnstile Dataset

Notes/Caveats on data definition:

  • Date Range: Q3 2013.
  • Comparing the data files before and after 10/18/2014, as well as the ‘Remote Unit/Control Area/Station Name Key’, shows that the Station variable is not coded uniquely. Hence the definitions below.
  • Station is defined by the author as control_area and unit combination.
  • As an exception, 34 ST-PENN STA station is identified by multiple codes.
  • Turnstile is defined by the author as control_area, unit, and scp combination.
  • Unless a question concerns ‘closed’ or ‘not at full capacity’ stations, only records whose description variable is ‘Regular’ are considered for analysis.
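
The definitions above can be sketched roughly as follows. This is a minimal illustration, not the original processing code: the column names (control_area, unit, scp, description, stationid, turnstile, datetime) are taken from the printed output later in this report, while the toy rows and the `raw` and `keyed` object names are hypothetical.

```r
library(dplyr)

# Toy rows (made-up values) with the column names seen in the printed output
raw <- data.frame(
  control_area = c("R138", "R138", "N063A"),
  unit         = c("R293", "R293", "R011"),
  scp          = c("00-03-06", "00-03-06", "00-00-00"),
  date         = c("2013-08-01", "2013-08-01", "2013-08-01"),
  time         = c("02:00:00", "06:00:00", "04:00:00"),
  description  = c("REGULAR", "DOOR OPEN", "REGULAR"),
  stringsAsFactors = FALSE
)

keyed <- raw %>%
  filter(description == "REGULAR") %>%            # keep regular records only
  mutate(
    stationid = paste(control_area, unit),        # station = control_area + unit
    turnstile = paste(control_area, unit, scp),   # turnstile = station + scp
    datetime  = as.POSIXct(paste(date, time), tz = "UTC")
  )

print(keyed[, c("stationid", "turnstile", "datetime")])
```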

System Prep: Set working directory and load libraries

library(tidyr)
library(data.table)
library(magrittr)
library(gdata)
library(readr)
library(lubridate)
library(dplyr)
library(ggplot2)
library(reshape2)
library(quantreg)

Download multiple data files

Data processing + cleaning (analysis is provided below while answering questions)

Load cached datasets to save time

new_data <- read.csv('new_data.csv')
mta.full <- read_csv('mta.full.csv')
mta.data <- read_csv('mta.data.csv')

Data Analysis

1. Which station has the most units?

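# Read the Remote/Booth/Station key file (read.xls comes from gdata)
stations <- read.xls("Remote-Booth-Station.xls", header = TRUE)
# Count key-file rows per station name, largest first
stations %>% group_by(Station) %>% summarise(count = n()) %>% arrange(desc(count))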
## Source: local data frame [395 x 2]
## 
##            Station count
##             (fctr) (int)
## 1   34 ST-PENN STA    14
## 2        FULTON ST    12
## 3            86 ST    10
## 4            23 ST     9
## 5   42 ST-TIMES SQ     9
## 6         CANAL ST     9
## 7  42 ST-GRD CNTRL     8
## 8     CORTLANDT ST     8
## 9          WALL ST     8
## 10          116 ST     6
## ..             ...   ...
  • 34 ST-PENN STA station has the most units (‘Remote’ variable), with 14 entries in the key file.

2. What is the total number of entries & exits across the subway system for August 1, 2013?

  • Plotting the data reveals extreme outliers on 2013/7/29 and 2013/9/28, which were investigated further.

  • Finding: 293 turnstiles (approximately 6%) had their counters reset, which produced abnormal count differences.

  • Solution: Because the sample is large, these outliers are removed to avoid skewing the results.

## [1] 8297954
  • The total number of entries & exits across the subway system for August 1, 2013 is 8,297,954.
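
Because the entries and exits columns are cumulative counters, daily usage comes from differences between consecutive readings, and a counter reset shows up as an abnormal difference. A minimal sketch of that idea on toy data follows. The exact removal rule used in this analysis is not shown, so flagging any negative difference as a reset and dropping the whole turnstile is an assumption, as are the toy values and the `readings`, `diffs`, and `clean` names.

```r
library(dplyr)

# Toy cumulative readings for two hypothetical turnstiles, every four hours
readings <- data.frame(
  turnstile = rep(c("A 1 00-00-00", "B 2 00-00-00"), each = 4),
  datetime  = rep(as.POSIXct("2013-08-01 00:00:00", tz = "UTC") + 3600 * 4 * (0:3),
                  times = 2),
  entries   = c(1000, 1100, 1250, 1300, 5000000, 5000040, 10, 60),
  exits     = c(2000, 2050, 2150, 2200, 7000000, 7000030, 5, 45),
  stringsAsFactors = FALSE
)

# Difference of consecutive counter readings within each turnstile
diffs <- readings %>%
  arrange(turnstile, datetime) %>%
  group_by(turnstile) %>%
  mutate(entry.count = entries - dplyr::lag(entries),
         exits.count = exits - dplyr::lag(exits)) %>%
  ungroup() %>%
  filter(!is.na(entry.count))        # first reading has no predecessor

# Assumption: a negative difference marks a counter reset
reset_turnstiles <- diffs %>%
  filter(entry.count < 0 | exits.count < 0) %>%
  distinct(turnstile) %>%
  pull(turnstile)

# Drop the reset turnstiles, then total entries and exits
clean <- diffs %>% filter(!turnstile %in% reset_turnstiles)
daily_total <- sum(clean$entry.count + clean$exits.count)
print(daily_total)
```

Dropping whole turnstiles is the bluntest option; removing only the affected rows would keep more data, at the cost of a slightly more involved filter.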

3. Let’s define the busyness as sum of entry & exit count. What station was the busiest on August 1, 2013? What turnstile was the busiest on that date?

## Source: local data frame [4,187 x 4]
## 
##              turnstile entry.count exits.count   sum
##                 (fctr)       (dbl)       (dbl) (dbl)
## 1  N063A R011 00-00-00        1694        8852 10546
## 2   R240 R047 00-00-00        3282        6800 10082
## 3   R238 R046 00-00-01        1763        7766  9529
## 4  R241A R048 00-00-00        2836        6686  9522
## 5  N063A R011 00-00-01        2783        6405  9188
## 6   N083 R138 01-00-00        1396        7777  9173
## 7   R240 R047 00-03-08        7429        1429  8858
## 8   R238 R046 00-03-00        2242        6612  8854
## 9   R533 R055 00-03-00        4561        4088  8649
## 10  R221 R170 01-00-00        4689        3792  8481
## ..                 ...         ...         ...   ...
## Source: local data frame [1 x 1]
## 
##             turnstile
##                (fctr)
## 1 N063A R011 00-00-00
##   Remote Booth         Station Line.Name Division
## 1   R046  R238 42 ST-GRD CNTRL     4567S      IRT
  • 42 ST-GRD CNTRL station is the busiest on August 1, 2013.
  • N063A R011 00-00-00 turnstile is the busiest on that date.
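
The busiest turnstile need not belong to the busiest station, which is why the two rankings are computed separately. A small sketch of the two aggregations, assuming per-turnstile entry.count and exits.count columns as in the output above; the station and turnstile labels and the numbers are hypothetical.

```r
library(dplyr)

# Toy per-turnstile counts for one day (hypothetical stations S1 and S2)
counts <- data.frame(
  stationid   = c("S1", "S1", "S2", "S2", "S2"),
  turnstile   = c("S1 00-00-00", "S1 00-00-01",
                  "S2 00-00-00", "S2 00-00-01", "S2 00-00-02"),
  entry.count = c(1694, 2783, 3282, 4000, 3500),
  exits.count = c(8852, 6405, 6800, 3000, 3500),
  stringsAsFactors = FALSE
)

# Busyness per turnstile = entries + exits
by_turnstile <- counts %>%
  mutate(total = entry.count + exits.count) %>%
  arrange(desc(total))

# Busyness per station = sum over its turnstiles
by_station <- by_turnstile %>%
  group_by(stationid) %>%
  summarise(total = sum(total)) %>%
  arrange(desc(total))

busiest_turnstile <- by_turnstile$turnstile[1]
busiest_station   <- by_station$stationid[1]
print(c(busiest_turnstile, busiest_station))
```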

4. What stations have seen the most usage growth/decline in Q3 2013?

##   Remote Booth  Station Line.Name Division
## 1   R431  R730 DYRE AVE         5      IRT
##   Remote Booth Station Line.Name Division
## 1   R443  N208  170 ST        BD      IND
  • DYRE AVE station has seen the most usage growth in Q3 2013.
  • 170 ST station has seen the most usage decline in Q3 2013.
  • Note: according to the usage column, DYRE AVE station shows approximately 100-fold growth, so further research may be necessary.
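
How the usage column was derived is not shown here. One plausible reading, sketched below purely as an assumption, is the ratio of a station's September total to its July total. The `monthly` and `growth` names and all values are hypothetical.

```r
library(dplyr)

# Toy monthly totals per station (hypothetical values)
monthly <- data.frame(
  stationid = rep(c("S1", "S2", "S3"), each = 2),
  month     = rep(c(7, 9), times = 3),
  total     = c(1000, 100000, 50000, 52000, 80000, 40000),
  stringsAsFactors = FALSE
)

# Assumed growth measure: September total divided by July total
growth <- monthly %>%
  group_by(stationid) %>%
  summarise(usage = total[month == 9] / total[month == 7]) %>%
  arrange(desc(usage))

most_growth  <- growth$stationid[1]
most_decline <- growth$stationid[nrow(growth)]
print(growth)
```

Under a ratio like this, a very small baseline month inflates the result, so an extreme value such as a 100-fold increase may say more about missing July data than about ridership.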

5. What dates are the least busy in Q3 2013? Could you identify days on which stations were not operating at full capacity or closed entirely?

## Source: local data frame [1 x 1]
## 
##         date
##       (time)
## 1 2013-07-07

  • 2013-07-07 is the least busy date in Q3 2013.
  • The interquartile graph shows that on weekends some stations were not operating at full capacity or were closed entirely for maintenance.
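
Finding the least busy date and checking the weekend pattern can be sketched in base R as below. The `daily` data frame and its totals are made up; only the calendar dates are real.

```r
# Toy daily system totals for two weeks starting Monday 2013-07-01
daily <- data.frame(
  date  = as.Date("2013-07-01") + 0:13,
  total = c(900, 950, 940, 500, 700, 450, 400,
            910, 960, 955, 945, 930, 480, 430)
)

# Least busy date = date with the smallest daily total
least_busy <- daily$date[which.min(daily$total)]

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which does not depend on the locale
is_weekend <- format(daily$date, "%u") %in% c("6", "7")
weekend_mean <- mean(daily$total[is_weekend])
weekday_mean <- mean(daily$total[!is_weekend])

print(least_busy)
print(c(weekend = weekend_mean, weekday = weekday_mean))
```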

Visualization:

1. Plot the daily row counts for data files in Q3 2013.

  • The surge clearly reflects the travel recorded around the 4th of July holiday.

2. Plot the daily total number of entries & exits across the system for Q3 2013.

  • Graph shows daily Entries (blue) and Exits (green) in Q3 2013.

  • Entries exceed exits because some riders leave the subway through emergency exits, which bypass the turnstile counters.

  • The peaks correspond to 4th of July and Labor Day travel.

3. Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q3 2013 for station 34 ST-PENN STA.

## Source: local data frame [4,483 x 2]
## 
##             turnstile     diff
##                (fctr)    (dbl)
## 1  R138 R293 00-03-06 13958165
## 2  N506 R022 00-03-02  2844642
## 3  H012 R268 01-00-01     6706
## 4  H009 R235 00-00-03     6185
## 5  H012 R268 01-00-00     5985
## 6  R250 R179 00-00-0B     4889
## 7  N327 R254 00-05-01     4735
## 8  R529 R208 00-00-04     4587
## 9  R513 R093 00-03-00     4528
## 10 R512 R092 00-03-00     4526
## ..                ...      ...
##   control_area unit      scp       date     time description  entries
## 1         R138 R293 00-03-06 2013-09-28 02:00:00     REGULAR 13959730
## 2         R138 R293 00-03-06 2013-09-28 06:00:00     REGULAR 13959758
## 3         R138 R293 00-03-06 2013-09-28 10:00:00     REGULAR     1593
## 4         R138 R293 00-03-06 2013-09-28 14:00:00     REGULAR     2145
## 5         R138 R293 00-03-06 2013-09-28 18:00:00     REGULAR     2851
## 6         R138 R293 00-03-06 2013-09-28 22:00:00     REGULAR     3361
##      exits stationid          turnstile            datetime
## 1 18211780 R138 R293 R138 R293 00-03-06 2013-09-28 02:00:00
## 2 18211821 R138 R293 R138 R293 00-03-06 2013-09-28 06:00:00
## 3       40 R138 R293 R138 R293 00-03-06 2013-09-28 10:00:00
## 4      601 R138 R293 R138 R293 00-03-06 2013-09-28 14:00:00
## 5     1168 R138 R293 R138 R293 00-03-06 2013-09-28 18:00:00
## 6     1669 R138 R293 R138 R293 00-03-06 2013-09-28 22:00:00

  • The mean (dot) and one standard deviation (error bar) of the daily total of entries & exits for station 34 ST-PENN STA are shown for each month in Q3 2013, drawn in different colors for clarity.
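
The summary behind such a plot can be sketched as follows, assuming a data frame with one row per day and month and total columns (the column names used by good.penn below). The `penn_toy` data and the `mean_total`, `sd_total`, `lower`, and `upper` names are hypothetical.

```r
library(dplyr)

# Toy daily totals, three days per month (hypothetical values)
penn_toy <- data.frame(
  month = rep(c("Jul", "Aug", "Sep"), each = 3),
  total = c(100, 110, 120, 90, 100, 110, 130, 150, 170),
  stringsAsFactors = FALSE
)

# Mean and one standard deviation of the daily total, per month
penn_summary <- penn_toy %>%
  group_by(month) %>%
  summarise(mean_total = mean(total), sd_total = sd(total)) %>%
  mutate(lower = mean_total - sd_total,   # bottom of the error bar
         upper = mean_total + sd_total)   # top of the error bar

print(penn_summary)
```

A table like penn_summary could then be passed to ggplot with geom_point() for the mean and geom_errorbar(aes(ymin = lower, ymax = upper)) for the one-standard-deviation band.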

4. Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q3 2013 for station 34 ST-PENN STA.

  • The median (X) and interquartile range (error bar) of the daily total of entries & exits for station 34 ST-PENN STA are shown for each month in Q3 2013.
good.penn %>% ggplot(aes(month, total)) + geom_boxplot() +
  ggtitle('34 ST-PENN Station 25/50/75 Percentile In Q3 2013 (boxplot)') +
  xlab('Month In Q3 2013') + ylab('Daily Total') +
  theme(plot.title = element_text(size = rel(1.25), colour = "black"),
        axis.title.x = element_text(size = rel(1.1), colour = "black"),
        axis.title.y = element_text(size = rel(1.1), colour = "black"))

  • The median (bold line inside the box) and interquartile range (upper and lower box edges) are shown as a classic boxplot.
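
For reference, the three percentiles that the box edges and median line represent can also be computed directly. A minimal sketch on toy data with the same month and total columns; `penn_toy` and its values are hypothetical, and quantile() is used with its default interpolation method.

```r
library(dplyr)

# Toy daily totals, three days per month (hypothetical values)
penn_toy <- data.frame(
  month = rep(c("Jul", "Aug", "Sep"), each = 3),
  total = c(100, 110, 120, 90, 100, 110, 130, 150, 170),
  stringsAsFactors = FALSE
)

# 25th, 50th and 75th percentiles of the daily total, per month
penn_quartiles <- penn_toy %>%
  group_by(month) %>%
  summarise(q25 = unname(quantile(total, 0.25)),
            q50 = unname(quantile(total, 0.50)),
            q75 = unname(quantile(total, 0.75)))

print(penn_quartiles)
```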

5. Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q3 2013.

  • Again, weekends show a similar trend of lower subway usage.