library(tidyr)
library(data.table)
library(magrittr)
library(gdata)
library(readr)
library(lubridate)
library(dplyr)
library(ggplot2)
library(reshape2)
library(quantreg)
new_data <- read.csv('new_data.csv')
mta.full <- read_csv('mta.full.csv')
mta.data <- read_csv('mta.data.csv')
stations <- read.xls("Remote-Booth-Station.xls", header = TRUE)
stations %>% group_by(Station) %>% summarise(count=n()) %>% arrange(desc(count))
## Source: local data frame [395 x 2]
##
##            Station count
##             (fctr) (int)
## 1   34 ST-PENN STA    14
## 2        FULTON ST    12
## 3            86 ST    10
## 4            23 ST     9
## 5   42 ST-TIMES SQ     9
## 6         CANAL ST     9
## 7  42 ST-GRD CNTRL     8
## 8     CORTLANDT ST     8
## 9          WALL ST     8
## 10          116 ST     6
## ..             ...   ...
Plotting the data reveals extreme outliers on 2013-07-29 and 2013-09-28, which prompted further investigation.
Finding: 293 turnstiles (approximately 6%) had their counters reset, which produced abnormal difference counts.
Solution: Because the sample is large, these outliers are removed so that they do not skew the results.
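The finding above can be illustrated with a minimal sketch. It is not necessarily the code used for this analysis: the `readings` data frame, its values, and the choice to drop every reading from an affected turnstile are assumptions made for illustration. The idea is that a cumulative counter should never decrease, so a negative difference between consecutive readings marks a reset.

```r
library(dplyr)

# Toy cumulative counter readings (values invented for illustration).
readings <- data.frame(
  turnstile = rep(c("A 00-00-00", "B 00-00-01"), each = 4),
  reading   = rep(1:4, times = 2),
  entries   = c(100, 150, 210, 260,    # A counts up normally
                9000, 9040, 12, 55),   # B resets before its 3rd reading
  stringsAsFactors = FALSE
)

# Difference each turnstile's counter against its previous reading.
diffs <- readings %>%
  arrange(turnstile, reading) %>%
  group_by(turnstile) %>%
  mutate(entry.count = entries - lag(entries)) %>%
  ungroup()

# A negative difference means the counter went backwards, i.e. a reset.
reset.turnstiles <- diffs %>%
  filter(!is.na(entry.count), entry.count < 0) %>%
  distinct(turnstile)

# Assumption: drop every reading from the affected turnstiles.
clean <- diffs %>% anti_join(reset.turnstiles, by = "turnstile")
```

Dropping only the single interval that spans the reset would keep more data. Whether that is preferable depends on how trustworthy the remaining readings from a reset turnstile are.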
## [1] 8297954
## Source: local data frame [4,187 x 4]
##
##              turnstile entry.count exits.count   sum
##                 (fctr)       (dbl)       (dbl) (dbl)
## 1  N063A R011 00-00-00        1694        8852 10546
## 2   R240 R047 00-00-00        3282        6800 10082
## 3   R238 R046 00-00-01        1763        7766  9529
## 4  R241A R048 00-00-00        2836        6686  9522
## 5  N063A R011 00-00-01        2783        6405  9188
## 6   N083 R138 01-00-00        1396        7777  9173
## 7   R240 R047 00-03-08        7429        1429  8858
## 8   R238 R046 00-03-00        2242        6612  8854
## 9   R533 R055 00-03-00        4561        4088  8649
## 10  R221 R170 01-00-00        4689        3792  8481
## ..                 ...         ...         ...   ...
## Source: local data frame [1 x 1]
##
## turnstile
## (fctr)
## 1 N063A R011 00-00-00
##   Remote Booth         Station Line.Name Division
## 1   R046  R238 42 ST-GRD CNTRL     4567S      IRT
##   Remote Booth  Station Line.Name Division
## 1   R431  R730 DYRE AVE         5      IRT
##   Remote Booth Station Line.Name Division
## 1   R443  N208  170 ST        BD      IND
## Source: local data frame [1 x 1]
##
## date
## (time)
## 1 2013-07-07
The graph shows daily entries (blue) and exits (green) in Q3 2013.
Entries exceed exits, likely because some riders leave through emergency exits, which the turnstile counters do not record.
The peaks correspond to Fourth of July and Labor Day travel.
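A graph of this kind could be produced roughly as follows. This is a sketch under assumptions, not the original plotting code: the `counts` data frame and its values are invented, and only the column names `entry.count` and `exits.count` are borrowed from the outputs in this document.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical per-interval counts (values invented for illustration).
counts <- data.frame(
  date        = as.Date(c("2013-07-03", "2013-07-03", "2013-07-04", "2013-07-04")),
  entry.count = c(120, 80, 150, 90),
  exits.count = c(100, 60, 110, 70)
)

# Sum the interval counts into one row per day.
daily <- counts %>%
  group_by(date) %>%
  summarise(entries = sum(entry.count), exits = sum(exits.count))

# Reshape to long form so that both series share one colour aesthetic.
daily.long <- daily %>%
  gather(key = "direction", value = "riders", entries, exits)

# Entries in blue, exits in green, matching the description above.
p <- ggplot(daily.long, aes(date, riders, colour = direction)) +
  geom_line() +
  scale_colour_manual(values = c(entries = "blue", exits = "green")) +
  xlab("Date") + ylab("Daily total")
```

Printing `p` draws the plot. The gap between the two lines on each day is the number of entries with no matching recorded exit.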
## Source: local data frame [4,483 x 2]
##
##             turnstile     diff
##                (fctr)    (dbl)
## 1  R138 R293 00-03-06 13958165
## 2  N506 R022 00-03-02  2844642
## 3  H012 R268 01-00-01     6706
## 4  H009 R235 00-00-03     6185
## 5  H012 R268 01-00-00     5985
## 6  R250 R179 00-00-0B     4889
## 7  N327 R254 00-05-01     4735
## 8  R529 R208 00-00-04     4587
## 9  R513 R093 00-03-00     4528
## 10 R512 R092 00-03-00     4526
## ..                ...      ...
##   control_area unit      scp       date     time description  entries
## 1         R138 R293 00-03-06 2013-09-28 02:00:00     REGULAR 13959730
## 2         R138 R293 00-03-06 2013-09-28 06:00:00     REGULAR 13959758
## 3         R138 R293 00-03-06 2013-09-28 10:00:00     REGULAR     1593
## 4         R138 R293 00-03-06 2013-09-28 14:00:00     REGULAR     2145
## 5         R138 R293 00-03-06 2013-09-28 18:00:00     REGULAR     2851
## 6         R138 R293 00-03-06 2013-09-28 22:00:00     REGULAR     3361
##      exits stationid          turnstile            datetime
## 1 18211780 R138 R293 R138 R293 00-03-06 2013-09-28 02:00:00
## 2 18211821 R138 R293 R138 R293 00-03-06 2013-09-28 06:00:00
## 3       40 R138 R293 R138 R293 00-03-06 2013-09-28 10:00:00
## 4      601 R138 R293 R138 R293 00-03-06 2013-09-28 14:00:00
## 5     1168 R138 R293 R138 R293 00-03-06 2013-09-28 18:00:00
## 6     1669 R138 R293 R138 R293 00-03-06 2013-09-28 22:00:00
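The reset is visible directly in the rows above: the cumulative `entries` counter falls from 13959758 at 06:00 to 1593 at 10:00. As a small worked check, the consecutive differences for these six readings can be computed with base R. How the `diff` column in the earlier table was actually derived is an assumption here, but the size of the drop matches the 13958165 listed for this turnstile.

```r
# Cumulative entries for turnstile R138 R293 00-03-06 on 2013-09-28,
# copied from the six rows printed above.
entries <- c(13959730, 13959758, 1593, 2145, 2851, 3361)

# Difference between consecutive readings; a negative value marks a reset.
step <- diff(entries)
step

# Size of the backwards jump caused by the reset.
reset.size <- abs(step[step < 0])
reset.size
```

Every other interval in this window differs by a few hundred at most, so the single interval spanning the reset dominates any total for this turnstile.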
good.penn %>%
  ggplot(aes(month, total)) +
  geom_boxplot() +
  ggtitle('34 ST-PENN Station 25/50/75 Percentile In Q3 2013 (boxplot)') +
  xlab('Month In Q3 2013') + ylab('Daily Total') +
  theme(plot.title = element_text(size = rel(1.25), colour = "black"),
        axis.title.x = element_text(size = rel(1.1), colour = "black"),
        axis.title.y = element_text(size = rel(1.1), colour = "black"))