Data Source

I will be using the mta.info website to get the data.

http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt

For now I have the dataset above. I may add other datasets if I find any that could help produce useful findings for the project.

# The file is comma-separated, so read.csv() reads it directly
mta <- read.csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_191026.txt")

mta <- data.frame(mta)

head(mta)

##    C.A UNIT      SCP STATION LINENAME DIVISION       DATE     TIME    DESC
## 1 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 00:00:00 REGULAR
## 2 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 04:00:00 REGULAR
## 3 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 08:00:00 REGULAR
## 4 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 12:00:00 REGULAR
## 5 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 16:00:00 REGULAR
## 6 A002 R051 02-00-00   59 ST  NQR456W      BMT 10/19/2019 20:00:00 REGULAR
##   ENTRIES   EXITS
## 1 7238905 2452500
## 2 7238924 2452505
## 3 7238945 2452536
## 4 7239029 2452602
## 5 7239280 2452651
## 6 7239629 2452702

#summary
summary(mta)

##       C.A              UNIT              SCP        
##  PTH22  :  1800   R549   :  2859   00-00-00: 19459  
##  PTH02  :  1080   R014   :  2079   00-00-01: 19295  
##  R610   :   946   R057   :  2038   00-00-02: 17232  
##  R238   :   924   R540   :  2000   00-00-03:  8703  
##  PTH07  :   918   R029   :  1974   00-03-00:  7289  
##  PTH16  :   898   R550   :  1816   00-03-01:  6977  
##  (Other):199029   (Other):192829   (Other) :126640  
##             STATION          LINENAME      DIVISION            DATE      
##  34 ST-PENN STA :  4209   1      : 25878   BMT:43896   10/19/2019:29282  
##  FULTON ST      :  4009   6      : 11913   IND:72199   10/20/2019:29263  
##  GRD CNTRL-42 ST:  3126   7      :  9518   IRT:74458   10/21/2019:29517  
##  23 ST          :  3061   F      :  7737   PTH:13255   10/22/2019:29268  
##  86 ST          :  2662   25     :  6506   RIT:  420   10/23/2019:29551  
##  CANAL ST       :  2435   A      :  5959   SRT: 1367   10/24/2019:29320  
##  (Other)        :186093   (Other):138084               10/25/2019:29394  
##        TIME                DESC           ENTRIES         
##  04:00:00: 17421   RECOVR AUD:   517   Min.   :0.000e+00  
##  16:00:00: 17405   REGULAR   :205078   1st Qu.:2.979e+05  
##  20:00:00: 17401                       Median :1.980e+06  
##  08:00:00: 17397                       Mean   :4.107e+07  
##  12:00:00: 17397                       3rd Qu.:6.634e+06  
##  00:00:00: 17390                       Max.   :2.129e+09  
##  (Other) :101184                                          
##      EXITS          
##  Min.   :0.000e+00  
##  1st Qu.:1.381e+05  
##  Median :1.160e+06  
##  Mean   :3.413e+07  
##  3rd Qu.:4.532e+06  
##  Max.   :2.124e+09  
##

library(skimr)

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:stats':
## 
##     filter

#skim(mta)
mta %>%
  skim()

## Skim summary statistics
##  n obs: 205595 
##  n variables: 11 
## 
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n n_unique
##       C.A       0   205595 205595      745
##      DATE       0   205595 205595        7
##      DESC       0   205595 205595        2
##  DIVISION       0   205595 205595        6
##  LINENAME       0   205595 205595      113
##       SCP       0   205595 205595      223
##   STATION       0   205595 205595      377
##      TIME       0   205595 205595    12705
##      UNIT       0   205595 205595      468
##                                      top_counts ordered
##        PTH: 1800, PTH: 1080, R61: 946, R23: 924   FALSE
##  10/: 29551, 10/: 29517, 10/: 29394, 10/: 29320   FALSE
##                    REG: 205078, REC: 517, NA: 0   FALSE
##  IRT: 74458, IND: 72199, BMT: 43896, PTH: 13255   FALSE
##            1: 25878, 6: 11913, 7: 9518, F: 7737   FALSE
##   00-: 19459, 00-: 19295, 00-: 17232, 00-: 8703   FALSE
##      34 : 4209, FUL: 4009, GRD: 3126, 23 : 3061   FALSE
##  04:: 17421, 16:: 17405, 20:: 17401, 08:: 17397   FALSE
##      R54: 2859, R01: 2079, R05: 2038, R54: 2000   FALSE
## 
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n    mean      sd p0   p25   p50     p75
##   ENTRIES       0   205595 205595 4.1e+07 2.1e+08  0 3e+05 2e+06 6634000
##     p100     hist
##  2.1e+09 ▇▁▁▁▁▁▁▁
## 
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n    mean    sd p0      p25     p50
##     EXITS       0   205595 205595 3.4e+07 2e+08  0 138113.5 1159635
##        p75    p100     hist
##  4531500.5 2.1e+09 ▇▁▁▁▁▁▁▁

hist(mta$ENTRIES)
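The ENTRIES values in the head() output grow from row to row for the same turnstile, which suggests the column is a running counter rather than a count per time interval. If that reading is correct, a histogram of the raw values mostly reflects how large each counter is. The sketch below shows one possible way to turn the counter into per-interval counts with base R. The data frame `sample_counts`, the second turnstile, and the column `NEW_ENTRIES` are hypothetical names used only for illustration, and the rows are assumed to be sorted by time within each turnstile.

```r
# Hypothetical sample: the first three ENTRIES values are taken from the
# head() output, the second turnstile is made up for illustration.
sample_counts <- data.frame(
  SCP     = c("02-00-00", "02-00-00", "02-00-00", "02-00-01", "02-00-01"),
  ENTRIES = c(7238905, 7238924, 7238945, 100, 130)
)

# Difference between consecutive readings within each turnstile (SCP).
# The first reading of every turnstile has no predecessor, so it is NA.
sample_counts$NEW_ENTRIES <- ave(
  sample_counts$ENTRIES,
  sample_counts$SCP,
  FUN = function(x) c(NA, diff(x))
)

print(sample_counts)
```

In the full dataset a turnstile would probably need to be identified by more than SCP alone (for example C.A, UNIT and SCP together), which is also an assumption here.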

I will be producing ggplot2 charts, histograms, and similar plots, but first I will have to convert the current data columns to numeric values.

I will be working on that further.
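As a rough sketch of what that conversion could look like, the example below combines DATE and TIME into a single date-time value and derives a numeric hour from it. It runs on a small made-up data frame, not on the real file. The names `sample_mta`, `DATETIME` and `HOUR`, and the choice of time zone, are assumptions made for illustration.

```r
# Hypothetical sample that mimics the factor columns shown by skim().
sample_mta <- data.frame(
  DATE = factor(c("10/19/2019", "10/19/2019", "10/19/2019")),
  TIME = factor(c("00:00:00", "04:00:00", "08:00:00"))
)

# paste() converts the factors to their text labels, so the date and time
# can be parsed together. The time zone is an assumption.
sample_mta$DATETIME <- as.POSIXct(
  paste(sample_mta$DATE, sample_mta$TIME),
  format = "%m/%d/%Y %H:%M:%S",
  tz = "America/New_York"
)

# Numeric hour of day, usable on a continuous plot axis.
sample_mta$HOUR <- as.numeric(format(sample_mta$DATETIME, "%H"))

str(sample_mta)
```

Calling as.numeric() directly on a factor returns its internal integer codes, not the labels, which is why the conversion goes through text first.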

For now, this is a sample of the work I will conduct.

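library(ggplot2)
# Bar chart of the number of rows per division; geom_bar() supplies the count for the y axis
ggplot(data = mta, aes(x = DIVISION)) + geom_bar() + labs(x = "division", y = "count")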

library(ggplot2)
# Basic histogram; geom_histogram() needs a continuous variable, so the categorical DIVISION column is counted with geom_bar()
ggplot(mta, aes(x = DIVISION)) + geom_bar()

FinalProjectProposal

Sudhan Maharjan

11/21/2019
