I will be using the mta.info website to get the data.
http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt
For now I have the dataset above. I may add other datasets later if I find any that would help produce good findings for the project.
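# The file is comma-separated, so read.csv() fits better than read.delim2(), which assumes "," as the decimal mark.
# read.csv() already returns a data frame, so no extra data.frame() call is needed.
mta <- read.csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt")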
head(mta)
## C.A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC
## 1 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 00:00:00 REGULAR
## 2 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 04:00:00 REGULAR
## 3 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 08:00:00 REGULAR
## 4 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 12:00:00 REGULAR
## 5 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 16:00:00 REGULAR
## 6 A002 R051 02-00-00 59 ST NQR456W BMT 10/19/2019 20:00:00 REGULAR
## ENTRIES EXITS
## 1 7238905 2452500
## 2 7238924 2452505
## 3 7238945 2452536
## 4 7239029 2452602
## 5 7239280 2452651
## 6 7239629 2452702
#summary
summary(mta)
## C.A UNIT SCP
## PTH22 : 1800 R549 : 2859 00-00-00: 19459
## PTH02 : 1080 R014 : 2079 00-00-01: 19295
## R610 : 946 R057 : 2038 00-00-02: 17232
## R238 : 924 R540 : 2000 00-00-03: 8703
## PTH07 : 918 R029 : 1974 00-03-00: 7289
## PTH16 : 898 R550 : 1816 00-03-01: 6977
## (Other):199029 (Other):192829 (Other) :126640
## STATION LINENAME DIVISION DATE
## 34 ST-PENN STA : 4209 1 : 25878 BMT:43896 10/19/2019:29282
## FULTON ST : 4009 6 : 11913 IND:72199 10/20/2019:29263
## GRD CNTRL-42 ST: 3126 7 : 9518 IRT:74458 10/21/2019:29517
## 23 ST : 3061 F : 7737 PTH:13255 10/22/2019:29268
## 86 ST : 2662 25 : 6506 RIT: 420 10/23/2019:29551
## CANAL ST : 2435 A : 5959 SRT: 1367 10/24/2019:29320
## (Other) :186093 (Other):138084 10/25/2019:29394
## TIME DESC ENTRIES
## 04:00:00: 17421 RECOVR AUD: 517 Min. :0.000e+00
## 16:00:00: 17405 REGULAR :205078 1st Qu.:2.979e+05
## 20:00:00: 17401 Median :1.980e+06
## 08:00:00: 17397 Mean :4.107e+07
## 12:00:00: 17397 3rd Qu.:6.634e+06
## 00:00:00: 17390 Max. :2.129e+09
## (Other) :101184
## EXITS
## Min. :0.000e+00
## 1st Qu.:1.381e+05
## Median :1.160e+06
## Mean :3.413e+07
## 3rd Qu.:4.532e+06
## Max. :2.124e+09
##
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
##
## filter
#skim(mta)
mta %>%
skim()
## Skim summary statistics
## n obs: 205595
## n variables: 11
##
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique
## C.A 0 205595 205595 745
## DATE 0 205595 205595 7
## DESC 0 205595 205595 2
## DIVISION 0 205595 205595 6
## LINENAME 0 205595 205595 113
## SCP 0 205595 205595 223
## STATION 0 205595 205595 377
## TIME 0 205595 205595 12705
## UNIT 0 205595 205595 468
## top_counts ordered
## PTH: 1800, PTH: 1080, R61: 946, R23: 924 FALSE
## 10/: 29551, 10/: 29517, 10/: 29394, 10/: 29320 FALSE
## REG: 205078, REC: 517, NA: 0 FALSE
## IRT: 74458, IND: 72199, BMT: 43896, PTH: 13255 FALSE
## 1: 25878, 6: 11913, 7: 9518, F: 7737 FALSE
## 00-: 19459, 00-: 19295, 00-: 17232, 00-: 8703 FALSE
## 34 : 4209, FUL: 4009, GRD: 3126, 23 : 3061 FALSE
## 04:: 17421, 16:: 17405, 20:: 17401, 08:: 17397 FALSE
## R54: 2859, R01: 2079, R05: 2038, R54: 2000 FALSE
##
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## ENTRIES 0 205595 205595 4.1e+07 2.1e+08 0 3e+05 2e+06 6634000
## p100 hist
## 2.1e+09 ▇▁▁▁▁▁▁▁
##
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## EXITS 0 205595 205595 3.4e+07 2e+08 0 138113.5 1159635
## p75 p100 hist
## 4531500.5 2.1e+09 ▇▁▁▁▁▁▁▁
hist(mta$ENTRIES)
I will be making some ggplot charts, histograms, and so on, but for that I will first have to convert the current data columns to numeric values.
I will work on that further.
For now, this is a sample of the kind of work I will conduct.
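As a rough sketch of that conversion step: the head() output above suggests that ENTRIES is a running counter per turnstile, so one option is to parse DATE and TIME into a single timestamp and take differences within each turnstile. The toy rows, the `DATETIME` and `ENTRIES_DIFF` column names, and the choice of C.A + UNIT + SCP as the turnstile key are assumptions made for illustration, not something the dataset documentation confirms here.

```r
# Toy rows with the same column names as the turnstile file.
# The values are made up for illustration, not taken from the real data.
toy <- data.frame(
  C.A = "A002", UNIT = "R051", SCP = "02-00-00",
  DATE = c("10/19/2019", "10/19/2019", "10/19/2019"),
  TIME = c("00:00:00", "04:00:00", "08:00:00"),
  ENTRIES = c(100, 119, 140),
  stringsAsFactors = TRUE  # mimic the factor columns shown by skim()
)

# Factors must go through as.character() before they can be parsed as dates.
toy$DATETIME <- as.POSIXct(
  paste(as.character(toy$DATE), as.character(toy$TIME)),
  format = "%m/%d/%Y %H:%M:%S", tz = "America/New_York"
)

# Assumption: ENTRIES is cumulative per turnstile, identified by C.A + UNIT + SCP.
toy <- toy[order(toy$C.A, toy$UNIT, toy$SCP, toy$DATETIME), ]
turnstile_id <- paste(toy$C.A, toy$UNIT, toy$SCP)

# Per-interval entries: difference from the previous reading of the same turnstile.
# The first reading of each turnstile has no predecessor, so it becomes NA.
toy$ENTRIES_DIFF <- ave(toy$ENTRIES, turnstile_id,
                        FUN = function(x) c(NA, diff(x)))
```

On the real data, negative or very large differences would probably need extra handling (for example counter resets), which this sketch does not attempt.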
library(ggplot2)
# Number of records per division. DIVISION is categorical, so geom_bar() (which counts rows) is used instead of geom_histogram().
ggplot(data = mta, aes(x = DIVISION)) +
  geom_bar()
The main goal of this project is to see what kinds of improvements the MTA is making. Have there been improvements or not?
If I cannot find this information in the given dataset, my alternative project will be to find the busiest and slowest train stations, the busiest train lines, the stations most used to get on and off, and so on. I may find other things as well, which I will include in this project.
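A minimal sketch of how the "busiest stations" question might be approached, assuming a hypothetical column `ENTRIES_DIFF` that already holds per-interval entry counts derived from the cumulative ENTRIES values. The toy rows below are invented for illustration and are not real ridership figures.

```r
# Toy per-interval counts; ENTRIES_DIFF is a hypothetical derived column.
toy <- data.frame(
  STATION = c("59 ST", "59 ST", "CANAL ST", "CANAL ST", "86 ST"),
  ENTRIES_DIFF = c(19, 21, 5, 7, 50)
)

# Total entries per station.
by_station <- aggregate(ENTRIES_DIFF ~ STATION, data = toy, FUN = sum)

# Sort from busiest to least busy and show the top rows.
by_station <- by_station[order(-by_station$ENTRIES_DIFF), ]
head(by_station, 3)
```

The same pattern could be applied to LINENAME or to an exits column to compare lines or to see where riders get off.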