Data Source

I will use the turnstile data published on the MTA website (mta.info).

http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt

For now this is the only dataset I have. I may add other datasets if I find any that would help produce useful findings for the project.

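# Read the comma-separated turnstile file.
# read.delim2() already returns a data frame, so no further conversion is needed.
mta <- read.delim2("http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt", sep = ",")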

head(mta)
##    C.A UNIT      SCP STATION LINENAME DIVISION       DATE     TIME    DESC
## 1 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 00:00:00 REGULAR
## 2 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 04:00:00 REGULAR
## 3 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 08:00:00 REGULAR
## 4 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 12:00:00 REGULAR
## 5 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 16:00:00 REGULAR
## 6 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 20:00:00 REGULAR
##   ENTRIES   EXITS
## 1 7238905 2452500
## 2 7238924 2452505
## 3 7238945 2452536
## 4 7239029 2452602
## 5 7239280 2452651
## 6 7239629 2452702
#summary
summary(mta)
##       C.A              UNIT              SCP        
##  PTH22  :  1800   R549   :  2859   00-00-00: 19459  
##  PTH02  :  1080   R014   :  2079   00-00-01: 19295  
##  R610   :   946   R057   :  2038   00-00-02: 17232  
##  R238   :   924   R540   :  2000   00-00-03:  8703  
##  PTH07  :   918   R029   :  1974   00-03-00:  7289  
##  PTH16  :   898   R550   :  1816   00-03-01:  6977  
##  (Other):199029   (Other):192829   (Other) :126640  
##             STATION          LINENAME      DIVISION            DATE      
##  34 ST-PENN STA :  4209   1      : 25878   BMT:43896   10/19/2019:29282  
##  FULTON ST      :  4009   6      : 11913   IND:72199   10/20/2019:29263  
##  GRD CNTRL-42 ST:  3126   7      :  9518   IRT:74458   10/21/2019:29517  
##  23 ST          :  3061   F      :  7737   PTH:13255   10/22/2019:29268  
##  86 ST          :  2662   25     :  6506   RIT:  420   10/23/2019:29551  
##  CANAL ST       :  2435   A      :  5959   SRT: 1367   10/24/2019:29320  
##  (Other)        :186093   (Other):138084               10/25/2019:29394  
##        TIME                DESC           ENTRIES         
##  04:00:00: 17421   RECOVR AUD:   517   Min.   :0.000e+00  
##  16:00:00: 17405   REGULAR   :205078   1st Qu.:2.979e+05  
##  20:00:00: 17401                       Median :1.980e+06  
##  08:00:00: 17397                       Mean   :4.107e+07  
##  12:00:00: 17397                       3rd Qu.:6.634e+06  
##  00:00:00: 17390                       Max.   :2.129e+09  
##  (Other) :101184                                          
##      EXITS          
##  Min.   :0.000e+00  
##  1st Qu.:1.381e+05  
##  Median :1.160e+06  
##  Mean   :3.413e+07  
##  3rd Qu.:4.532e+06  
##  Max.   :2.124e+09  
## 
library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter
#skim(mta)
mta %>%
  skim()
## Skim summary statistics
##  n obs: 205595 
##  n variables: 11 
## 
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n n_unique
##       C.A       0   205595 205595      745
##      DATE       0   205595 205595        7
##      DESC       0   205595 205595        2
##  DIVISION       0   205595 205595        6
##  LINENAME       0   205595 205595      113
##       SCP       0   205595 205595      223
##   STATION       0   205595 205595      377
##      TIME       0   205595 205595    12705
##      UNIT       0   205595 205595      468
##                                      top_counts ordered
##        PTH: 1800, PTH: 1080, R61: 946, R23: 924   FALSE
##  10/: 29551, 10/: 29517, 10/: 29394, 10/: 29320   FALSE
##                    REG: 205078, REC: 517, NA: 0   FALSE
##  IRT: 74458, IND: 72199, BMT: 43896, PTH: 13255   FALSE
##            1: 25878, 6: 11913, 7: 9518, F: 7737   FALSE
##   00-: 19459, 00-: 19295, 00-: 17232, 00-: 8703   FALSE
##      34 : 4209, FUL: 4009, GRD: 3126, 23 : 3061   FALSE
##  04:: 17421, 16:: 17405, 20:: 17401, 08:: 17397   FALSE
##      R54: 2859, R01: 2079, R05: 2038, R54: 2000   FALSE
## 
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n    mean      sd p0   p25   p50     p75
##   ENTRIES       0   205595 205595 4.1e+07 2.1e+08  0 3e+05 2e+06 6634000
##     p100     hist
##  2.1e+09 ▇▁▁▁▁▁▁▁
## 
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n    mean    sd p0      p25     p50
##     EXITS       0   205595 205595 3.4e+07 2e+08  0 138113.5 1159635
##        p75    p100     hist
##  4531500.5 2.1e+09 ▇▁▁▁▁▁▁▁
hist(mta$ENTRIES)

I plan to produce ggplot2 charts, histograms, and similar plots, but first I will have to convert the relevant data columns to numeric values.

I will continue working on that.
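As a rough sketch of that conversion step, the snippet below uses three rows copied from the head(mta) output above rather than the full download. It assumes the columns arrive as factors (as the skim output suggests), and the time zone is my assumption, not something stated in the data. The name `sample_rows` is only for illustration.

```r
# Three rows copied from the head(mta) output; stored as factors on purpose
# to mimic how the file was read in this report.
sample_rows <- data.frame(
  DATE    = c("10/19/2019", "10/19/2019", "10/19/2019"),
  TIME    = c("00:00:00", "04:00:00", "08:00:00"),
  ENTRIES = c("7238905", "7238924", "7238945"),
  stringsAsFactors = TRUE
)

# Go through as.character() first: as.numeric() applied directly to a factor
# returns the internal level codes, not the printed values.
sample_rows$ENTRIES <- as.numeric(as.character(sample_rows$ENTRIES))

# Combine DATE and TIME into a single date-time column.
# The time zone is an assumption (New York local time).
sample_rows$DATETIME <- as.POSIXct(
  paste(sample_rows$DATE, sample_rows$TIME),
  format = "%m/%d/%Y %H:%M:%S",
  tz = "America/New_York"
)

str(sample_rows)
```

A single date-time column should make it easier to sort readings and to plot them over time than separate DATE and TIME factors.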

For now, the following is a sample of the kind of work I will carry out.

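library(ggplot2)

# Bar chart of the number of records per division.
# DIVISION is categorical, so geom_bar() (which counts rows per category)
# is used instead of geom_histogram(), which expects a continuous variable.
ggplot(mta, aes(x = DIVISION)) +
  geom_bar() +
  labs(x = "Division", y = "Count")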

Analysis

The main goal of this project is to see what kinds of improvements the MTA is making. Have there been improvements or not?

If I cannot find this information in the given dataset, my alternative project will be to find the busiest and slowest train stations, the busiest trains, the stations most used for entering and exiting, and so on. I may find other things as well, which I will include in this project.
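One possible way to rank stations by traffic is sketched below. It rests on an assumption I have not verified against MTA documentation: that ENTRIES is a running counter per turnstile, which is what the steadily rising values in the head(mta) output suggest. Under that assumption, traffic in an interval is the difference between consecutive readings, not the raw value. The first three rows come from the head(mta) output; the second turnstile ("X001", "EXAMPLE ST") and its numbers are made up purely for illustration.

```r
# Toy data: rows 1-3 are from head(mta); rows 4-6 are hypothetical.
toy <- data.frame(
  C.A     = c(rep("A002", 3), rep("X001", 3)),
  UNIT    = c(rep("R051", 3), rep("R999", 3)),
  SCP     = c(rep("02-00-00", 3), rep("00-00-00", 3)),
  STATION = c(rep("59 ST", 3), rep("EXAMPLE ST", 3)),
  ENTRIES = c(7238905, 7238924, 7238945, 1000, 1500, 2100),
  stringsAsFactors = FALSE
)

# One group per physical turnstile (assumed to be identified by these columns).
turnstile_id <- paste(toy$C.A, toy$UNIT, toy$SCP, toy$STATION)

# Difference between consecutive readings within each turnstile.
# Rows are assumed to be in time order; the first reading of each
# turnstile has no predecessor, so its delta is NA.
toy$ENTRY_DELTA <- ave(toy$ENTRIES, turnstile_id,
                       FUN = function(x) c(NA, diff(x)))

# Sum the deltas per station (rows with NA deltas are dropped by the
# formula interface), then sort from busiest to least busy.
station_totals <- aggregate(ENTRY_DELTA ~ STATION, data = toy, FUN = sum)
station_totals <- station_totals[order(-station_totals$ENTRY_DELTA), ]

print(station_totals)
```

On the real file, negative or implausibly large deltas (for example from a counter reset) would probably need to be filtered out before summing. The same approach should apply to EXITS.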