This is the approach I took for the analysis of NYC taxi data

I will be taking you through the different steps one by one.

Data Set Acquisition

The data set that was given for this task was from

I took the December 2013 data as per the suggestions for this exercise.

Let us read the data in a data frame

setwd("D:/Vibs/Work/Tech-Prep/Profile/Elula")

#clear the environment of all existing variables
rm(list=ls())

# setting the time zone UTC
Sys.setenv(TZ='UTC')

#read data
trip_data <- read.csv("D:/Vibs/Work/Tech-Prep/Profile/Elula/trip_data/trip_data_12.csv", header=TRUE)

Let us now check the data

head(trip_data)
##                          medallion                     hack_license
## 1 D7D598CD99978BD012A87A76A7C891B7 82F90D5EFE52FDFD2FDEC3EAD6D5771D
## 2 5455D5FF2BD94D10B304A15D4B7F2735 177B80B867CEC990DA166BA1D0FCAF82
## 3 93D6821F86A12B537C5EADBDFB432CA7 28B0AA10202F83FEB0F4E69340CA8F86
## 4 0C107B532C1207A74F0D8609B9E092FF 66C2CECD93E395CB9B875E9B382DB5D9
## 5 801C69A08B51470871A8110F8B0505EE 91A07EEF642E8590C2EFD631C3DF89C9
## 6 0FCAD8D2B0D6CAFB4CA81DD5137BE34E 307AD6E7FFB24B72779C91A0D682E42B
##   vendor_id rate_code store_and_fwd_flag     pickup_datetime
## 1       VTS         1                    2013-12-01 00:13:00
## 2       VTS         1                    2013-12-01 00:40:00
## 3       VTS         1                    2013-12-01 02:21:00
## 4       VTS         1                    2013-12-01 02:14:00
## 5       VTS         1                    2013-12-01 04:45:00
## 6       VTS         1                    2013-12-01 04:45:00
##      dropoff_datetime passenger_count trip_time_in_secs trip_distance
## 1 2013-12-01 00:31:00               1              1080          3.90
## 2 2013-12-01 00:48:00               6               480          3.20
## 3 2013-12-01 02:30:00               5               540          3.28
## 4 2013-12-01 02:22:00               1               480          1.84
## 5 2013-12-01 04:50:00               1               300          1.02
## 6 2013-12-01 04:50:00               3               300          0.99
##   pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## 1        -73.97934        40.77665         -73.98186         40.73428
## 2        -73.93967        40.72615         -73.98558         40.71807
## 3        -73.95875        40.76808         -73.95875         40.76808
## 4        -73.97884        40.72419         -73.97974         40.74341
## 5        -73.99136        40.73507         -73.97894         40.73460
## 6        -73.98717        40.76076         -73.99380         40.76552
tail(trip_data)
##                                 medallion                     hack_license
## 13971113 19BF1BB516C4E992EA3FBAEDA73D6262 DC545D5F2A5176371D585C7FC7A0835D
## 13971114 803FA45BD7CC003F460410D5D9EE4D3A 6C1B453B66ACE118D95802CA27053C2B
## 13971115 DBE7422E79192B5F654FE1FE02526125 3158E46011E95D376541A461895A7730
## 13971116 ABC075C8871353E9F12ADEB5F1E27C2A 2E24AB48335EDC44EADE1CE29F6BCBF4
## 13971117 1FB2BF12B504498BC8B7B860294CD372 03D43975C87C8B5CFBDA1FF33B3AD743
## 13971118 FC2D556EBF3EC3043221010139A16D70 FD028AE01E96C2518503E60A463ABAEE
##          vendor_id rate_code store_and_fwd_flag     pickup_datetime
## 13971113       VTS         1                    2013-12-06 07:00:00
## 13971114       VTS         1                    2013-12-03 17:12:00
## 13971115       VTS         1                    2013-12-06 06:55:00
## 13971116       VTS         1                    2013-12-06 06:57:00
## 13971117       VTS         1                    2013-12-06 07:04:00
## 13971118       VTS         1                    2013-12-06 07:02:00
##             dropoff_datetime passenger_count trip_time_in_secs
## 13971113 2013-12-06 07:02:00               1               120
## 13971114 2013-12-03 17:42:00               1              1800
## 13971115 2013-12-06 07:04:00               1               540
## 13971116 2013-12-06 07:06:00               3               540
## 13971117 2013-12-06 07:06:00               1               120
## 13971118 2013-12-06 07:08:00               1               360
##          trip_distance pickup_longitude pickup_latitude dropoff_longitude
## 13971113          0.62        -73.99340        40.74733         -73.99194
## 13971114         10.01        -73.87308        40.77405         -73.97300
## 13971115          2.17        -73.99634        40.73305         -73.97856
## 13971116          1.69        -73.97218        40.79441         -73.97234
## 13971117          0.70          0.00000         0.00000           0.00000
## 13971118          1.35        -73.96292        40.77075         -73.98126
##          dropoff_latitude
## 13971113         40.74163
## 13971114         40.75688
## 13971115         40.75028
## 13971116         40.79640
## 13971117          0.00000
## 13971118         40.77111
nrow(trip_data)
## [1] 13971118

There are a total of 14 columns in the data set with 13971118 rows. Also it is quite visible that there is a definite need of imputation as there is an empty column (store_and_fwd_flag).

Let us look at the column types

str(trip_data)
## 'data.frame':    13971118 obs. of  14 variables:
##  $ medallion         : Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 11374 4404 7810 644 6728 866 8555 9641 3480 1079 ...
##  $ hack_license      : Factor w/ 33381 levels "0002555BBE359440D6CEB34B699D3932",..: 17028 3027 5256 13410 18952 6267 21701 32914 5852 21985 ...
##  $ vendor_id         : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ rate_code         : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pickup_datetime   : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 758 2322 8058 7665 15307 15307 18047 22994 25497 25844 ...
##  $ dropoff_datetime  : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 1456 2445 8238 7791 15303 15303 18522 23256 25826 26176 ...
##  $ passenger_count   : int  1 6 5 1 1 3 1 6 6 5 ...
##  $ trip_time_in_secs : int  1080 480 540 480 300 300 1140 600 600 600 ...
##  $ trip_distance     : num  3.9 3.2 3.28 1.84 1.02 ...
##  $ pickup_longitude  : num  -74 -73.9 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.7 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.7 40.7 40.8 40.7 40.7 ...

Medallion column has approximately 13k levels which means that it could be one of the id’s of the taxi drivers.

Similarly, hack_license also has only 33k levels which could potentially be another id column.

Vendor_id could represent the vendor of the taxi and has only 2 levels. We need to check if all the records are having a vendor id or there are few records without vendor as well.

Let us go column by column and understand each one.

medallion

str(trip_data$medallion)
##  Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 11374 4404 7810 644 6728 866 8555 9641 3480 1079 ...
#table(trip_data$medallion)

This shows that this column is a categorical variable as expected with 13k levels.

Let us try and filter records for one of the medaliion id with very less number of records and analyze further.

medaliion “0F4216F3AF4B80598DB77C558700AFF1” has only 8 records

#0F4216F3AF4B80598DB77C558700AFF1
test <- subset(trip_data, trip_data$medallion=="0F4216F3AF4B80598DB77C558700AFF1")

test
##                                medallion                     hack_license
## 54496   0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 59644   0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 60382   0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 69860   0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 69996   0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 2224407 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 2225122 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 3824519 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
##         vendor_id rate_code store_and_fwd_flag     pickup_datetime
## 54496         VTS         1                    2013-12-31 11:56:00
## 59644         VTS         1                    2013-12-31 11:08:00
## 60382         VTS         1                    2013-12-31 10:03:00
## 69860         VTS         1                    2013-12-31 09:40:00
## 69996         VTS         1                    2013-12-31 09:53:00
## 2224407       VTS         1                    2013-12-11 10:52:00
## 2225122       VTS         1                    2013-12-10 20:19:00
## 3824519       VTS         1                    2013-12-20 12:44:00
##            dropoff_datetime passenger_count trip_time_in_secs
## 54496   2013-12-31 12:02:00               2               360
## 59644   2013-12-31 11:17:00               2               540
## 60382   2013-12-31 10:32:00               1              1740
## 69860   2013-12-31 09:51:00               2               660
## 69996   2013-12-31 09:54:00               1                60
## 2224407 2013-12-11 10:54:00               1               120
## 2225122 2013-12-10 20:19:00               1                 0
## 3824519 2013-12-20 12:52:00               2               480
##         trip_distance pickup_longitude pickup_latitude dropoff_longitude
## 54496               0                0               0                 0
## 59644               0                0               0                 0
## 60382               0                0               0                 0
## 69860               0                0               0                 0
## 69996               0                0               0                 0
## 2224407             0                0               0                 0
## 2225122             0                0               0                 0
## 3824519             0                0               0                 0
##         dropoff_latitude
## 54496                  0
## 59644                  0
## 60382                  0
## 69860                  0
## 69996                  0
## 2224407                0
## 2225122                0
## 3824519                0

Strangely, the hack_license remains constant for the medallion and hence let us see few more examples.

But we also saw that few trips are less than 1 minute duration (60 secs.) Let us see how many of the trips are less than 60 seconds and if few then they could be outliers and we can remove them from the data set.

test <- subset(trip_data, trip_data$trip_time_in_secs <=60)

nrow(test) * 100/nrow(trip_data)
## [1] 1.085697

Hence almost 1% of the records could be ignored.

Let us now perform a similar exercise on the other end.

summary(trip_data$trip_time_in_secs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   370.0   636.0   788.2  1020.0 10800.0

The maximum time is also suspicious and hence let us see how many records are thera bove 1000 secs trip duration.

test1 <- subset(trip_data, trip_data$trip_time_in_secs >=1000)

nrow(test1) * 100/nrow(trip_data)
## [1] 26.16098

Around 26% of the records are above 1000 seconds.

Hence it seems to be ok. So let us move to another interesting column trip_distance.

summary(trip_data$trip_distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.010   1.750   2.925   3.200 100.000
test2 <- subset(trip_data, trip_data$trip_distance ==0)

nrow(test2) * 100/nrow(trip_data)
## [1] 0.5609501

Around 0.5% of the records have trip_distance = 0 which is not possible, hence we will ignore all those records which have trip_distance == 0 or trip_time_in_secs <= 60

let us filter out these 2 sets and continue the analysis.

trip_data_filtered <- subset(trip_data, trip_data$trip_time_in_secs>60)

trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$trip_distance > 0)

Let us understand the relation between medallion and hack_license columns.

med1 <- subset(trip_data_filtered, trip_data_filtered$medallion=="EBAF059FA9FE6F082A50F4C09C362B5D")

#drop the unused levels
med1$hack_license <- droplevels(med1$hack_license)

str(med1)
## 'data.frame':    800 obs. of  14 variables:
##  $ medallion         : Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 12407 12407 12407 12407 12407 12407 12407 12407 12407 12407 ...
##  $ hack_license      : Factor w/ 2 levels "AC945E7E46CD4EB5B11816777AF2CC14",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ vendor_id         : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ rate_code         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pickup_datetime   : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 194981 194082 197409 192570 198816 201118 202598 201950 203722 204840 ...
##  $ dropoff_datetime  : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 195793 194626 197491 193411 200496 201500 202562 202206 204159 205110 ...
##  $ passenger_count   : int  6 6 5 6 6 6 5 6 6 6 ...
##  $ trip_time_in_secs : int  1200 900 420 1260 2040 720 300 600 780 600 ...
##  $ trip_distance     : num  1.37 1.18 0.6 9.2 3.66 1.72 0.62 1.53 1.65 0.89 ...
##  $ pickup_longitude  : num  -74 -74 -74 -73.9 -74 ...
##  $ pickup_latitude   : num  40.8 40.8 40.8 40.8 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.8 40.8 40.8 40.7 ...

It shows 1 medallion had more than 1 hack_license. let us perform a reverse analysis and see if one hack_license has a map to more than 1 medallion.

hack1 <- subset(trip_data_filtered, trip_data_filtered$hack_license=="A69D2180076DCD9954F5EB66E2A747F7")

#drop the unused levels
hack1$medallion <- droplevels(hack1$medallion)

str(hack1)
## 'data.frame':    507 obs. of  14 variables:
##  $ medallion         : Factor w/ 1 level "A18CC3E9191D21F604DFC2423916E6A2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hack_license      : Factor w/ 33381 levels "0002555BBE359440D6CEB34B699D3932",..: 21701 21701 21701 21701 21701 21701 21701 21701 21701 21701 ...
##  $ vendor_id         : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ rate_code         : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pickup_datetime   : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 18047 2177678 2169877 2240384 2180710 2243383 2171192 2167570 2190066 2172971 ...
##  $ dropoff_datetime  : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 18522 2180922 2172029 2242460 2182759 2245370 2173476 2169533 2191376 2174400 ...
##  $ passenger_count   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trip_time_in_secs : int  1140 2220 1080 1080 960 960 1200 840 120 300 ...
##  $ trip_distance     : num  22.04 11.68 4.8 8.84 1.82 ...
##  $ pickup_longitude  : num  -74 -73.9 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.8 40.7 40.7 40.8 ...
##  $ dropoff_longitude : num  -73.8 -74 -74 -73.9 -74 ...
##  $ dropoff_latitude  : num  40.6 40.8 40.7 40.8 40.8 ...
#' @description This function returns the sets which have repeated records
#' 
#' @param x object - the data frame's column which needs to be analyzed for duplicacy
#' 
d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)

# One to one
onetoOne <- trip_data_filtered[!d(trip_data_filtered$medallion) & !d(trip_data_filtered$hack_license),]

# One to many
onetoMany <- trip_data_filtered[d(trip_data_filtered$medallion) & !d(trip_data_filtered$hack_license),]

# Many to one
ManytoOne <- trip_data_filtered[!d(trip_data_filtered$medallion) & d(trip_data_filtered$hack_license),]

nrow(onetoOne)
## [1] 0
nrow(onetoMany)
## [1] 164
nrow(ManytoOne)
## [1] 1

It seems that most of the relation between medallion and hack_license is manytomany.

Let us now filter the trips where passenger count <0.

summary(trip_data_filtered$passenger_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.725   2.000   9.000
test3 <- subset(trip_data_filtered, trip_data_filtered$passenger_count ==0)

nrow(test3) * 100/nrow(trip_data_filtered)
## [1] 0.000376655
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$passenger_count >0)

Analysis for Busiest Location

Let us now create location categories. 1 for Pickup location with combinations of pickup longitude and latitude & 1 for drop location with drop longitude and drop latitude.

trip_data_filtered$pickup_longitude <- as.factor(trip_data_filtered$pickup_longitude)
trip_data_filtered$pickup_latitude <- as.factor(trip_data_filtered$pickup_latitude)
trip_data_filtered <- cbind(trip_data_filtered, pickupLocation = 'NA')
trip_data_filtered$pickupLocation <- as.character(trip_data_filtered$pickupLocation)
trip_data_filtered$pickupLocation <- paste0("Lon: ", trip_data_filtered$pickup_longitude," Lat: ",trip_data_filtered$pickup_latitude)

trip_data_filtered$pickupLocation <- as.factor(trip_data_filtered$pickupLocation)

Now we need to perform the same for the drop off location.

trip_data_filtered$dropoff_longitude <- as.factor(trip_data_filtered$dropoff_longitude)
trip_data_filtered$dropoff_latitude <- as.factor(trip_data_filtered$dropoff_latitude)
trip_data_filtered <- cbind(trip_data_filtered, dropoffLocation = 'NA')
trip_data_filtered$dropoffLocation <- as.character(trip_data_filtered$dropoffLocation)
trip_data_filtered$dropoffLocation <- paste0("Lon: ",trip_data_filtered$dropoff_longitude," Lat: ",trip_data_filtered$dropoff_latitude)

trip_data_filtered$dropoffLocation <- as.factor(trip_data_filtered$dropoffLocation)

Now since we have got the pickup and dropoff locations let us do a sort on the most frequent pickup and most frequent drop off.

library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
topMostPickup <- as.data.frame(table(trip_data_filtered$pickupLocation))

topMostPickup <- topMostPickup[order(topMostPickup$Freq),]

#see the maximum pickup location
tail(topMostPickup,10)
##                                    Var1   Freq
## 1815594  Lon: -73.961014 Lat: 40.802319     83
## 6309725  Lon: -73.985985 Lat: 40.684994    130
## 423645   Lon: -73.937714 Lat: 40.758278    155
## 10142185 Lon: -74.011147 Lat: 40.701656    220
## 9549985  Lon: -74.004982 Lat: 40.723339    243
## 136589    Lon: -73.83461 Lat: 40.765968   1052
## 123286   Lon: -73.793983 Lat: 40.656971   1249
## 680007   Lon: -73.948723 Lat: 40.744843   1302
## 1890952   Lon: -73.96167 Lat: 40.774151   1432
## 10336530                  Lon: 0 Lat: 0 207268

Let us now see the most frequent drop off

topMostDropOff <- as.data.frame(table(trip_data_filtered$dropoffLocation))

topMostDropOff <- topMostDropOff[order(topMostDropOff$Freq),]

#see the maximum dropoff location
tail(topMostDropOff,10)
##                                    Var1   Freq
## 4505254  Lon: -73.974739 Lat: 40.616863     77
## 2537889  Lon: -73.961014 Lat: 40.802319     84
## 7368436  Lon: -73.985985 Lat: 40.684994    132
## 11137097 Lon: -74.011147 Lat: 40.701656    221
## 10541025 Lon: -74.004982 Lat: 40.723339    243
## 137126    Lon: -73.83461 Lat: 40.765968   1053
## 91500    Lon: -73.793983 Lat: 40.656971   1254
## 1183003  Lon: -73.948723 Lat: 40.744843   1302
## 2620855   Lon: -73.96167 Lat: 40.774151   1429
## 11406567                  Lon: 0 Lat: 0 203607

Thus, it seems we have some data which has lognitude and latitude mentioed as zero. It could be filtered out.

trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$pickup_latitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$pickup_longitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$dropoff_latitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$dropoff_longitude !=0)

Let us now perform the same study again and capture the most frequent pickup location.

topMostPickup <- as.data.frame(table(trip_data_filtered$pickupLocation))

topMostPickup <- topMostPickup[order(topMostPickup$Freq),]

#see the maximum pickup location
tail(topMostPickup,10)
##                                    Var1 Freq
## 3650380  Lon: -73.974739 Lat: 40.616863   77
## 1815594  Lon: -73.961014 Lat: 40.802319   83
## 423645   Lon: -73.937714 Lat: 40.758278   91
## 6309725  Lon: -73.985985 Lat: 40.684994  129
## 10142185 Lon: -74.011147 Lat: 40.701656  220
## 9549985  Lon: -74.004982 Lat: 40.723339  243
## 136589    Lon: -73.83461 Lat: 40.765968 1052
## 123286   Lon: -73.793983 Lat: 40.656971 1247
## 680007   Lon: -73.948723 Lat: 40.744843 1302
## 1890952   Lon: -73.96167 Lat: 40.774151 1432

Let us visualize it in a graph.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
topMostPickup <- tail(topMostPickup,10)

topMostPickup$Var1 <- droplevels(topMostPickup$Var1)

g <- ggplot(topMostPickup,aes(Var1)) + geom_bar(aes(weight = Freq)) +
  xlab("Pickup Location") + ylab("Frequency") + ggtitle("Topmost Pickup Locations") + coord_flip()

g

Let us perform the same exercise for topmost dropoff location.

topMostDropOff <- as.data.frame(table(trip_data_filtered$dropoffLocation))

topMostDropOff <- topMostDropOff[order(topMostDropOff$Freq),]

#see the maximum dropoff location
tail(topMostDropOff,10)
##                                    Var1 Freq
## 1182999  Lon: -73.948723 Lat: 40.743874   75
## 4505254  Lon: -73.974739 Lat: 40.616863   77
## 2537889  Lon: -73.961014 Lat: 40.802319   83
## 7368436  Lon: -73.985985 Lat: 40.684994  129
## 11137097 Lon: -74.011147 Lat: 40.701656  220
## 10541025 Lon: -74.004982 Lat: 40.723339  243
## 137126    Lon: -73.83461 Lat: 40.765968 1052
## 91500    Lon: -73.793983 Lat: 40.656971 1247
## 1183003  Lon: -73.948723 Lat: 40.744843 1302
## 2620855   Lon: -73.96167 Lat: 40.774151 1429
topMostDropOff <- tail(topMostDropOff,10)

topMostDropOff$Var1 <- droplevels(topMostDropOff$Var1)

g1 <- ggplot(topMostDropOff,aes(Var1)) + geom_bar(aes(weight = Freq)) +
  xlab("Dropoff Location") + ylab("Frequency") + ggtitle("Topmost Dropoff Locations") + coord_flip()

g1

Hence we can see that there are 4 places which are frequently visited (both dropoff & pickup)

  1. Lon: -73.96167 Lat: 40.774151
  2. Lon: -73.948723 Lat: 40.744843
  3. Lon: -73.793983 Lat: 40.656971
  4. Lon: -73.83461 Lat: 40.765968

Analysis for Busiest Hour

Similarly let us now try and find out the busiest hours.

trip_data_filtered <- cbind(trip_data_filtered, pickUpTime = 'NA')
trip_data_filtered$pickUpTime <- as.character(trip_data_filtered$pickUpTime)
trip_data_filtered$pickUpTime <- strftime(trip_data_filtered$pickup_datetime, format="%H")

topMostPickupTime <- as.data.frame(table(trip_data_filtered$pickUpTime))

topMostPickupTime <- topMostPickupTime[order(topMostPickupTime$Freq),]

#see the maximum pickup time
tail(topMostPickupTime,10)
##    Var1   Freq
## 12   11 667282
## 24   23 685069
## 14   13 693292
## 13   12 698435
## 15   14 709397
## 23   22 764756
## 22   21 783211
## 19   18 786744
## 21   20 789235
## 20   19 819789
topMostPickupTime <- tail(topMostPickupTime,10)

topMostPickupTime$Var1 <- droplevels(topMostPickupTime$Var1)

options("scipen"=100, "digits"=4)

g2 <- ggplot(topMostPickupTime,aes(Var1)) + geom_bar(aes(weight = Freq)) +
  xlab("Pickup Time") + ylab("Frequency") + ggtitle("Topmost Pickup time") + coord_flip()

g2

Thus it shows that maximum pickups are between 1900 hrs and 2000 hrs.

Let us now see the dropoff times.

trip_data_filtered <- cbind(trip_data_filtered, dropOffTime = 'NA')
trip_data_filtered$dropOffTime <- as.character(trip_data_filtered$dropOffTime)
trip_data_filtered$dropOffTime <- strftime(trip_data_filtered$dropoff_datetime, format="%H")

topMostDropOffTime <- as.data.frame(table(trip_data_filtered$dropOffTime))

topMostDropOffTime <- topMostDropOffTime[order(topMostDropOffTime$Freq),]

#see the maximum dropoff time
tail(topMostDropOffTime,10)
##    Var1   Freq
## 16   15 683964
## 14   13 688020
## 13   12 696222
## 15   14 697974
## 24   23 709897
## 19   18 772995
## 23   22 775022
## 22   21 783579
## 21   20 804042
## 20   19 838968
topMostDropOffTime <- tail(topMostDropOffTime,10)

topMostDropOffTime$Var1 <- droplevels(topMostDropOffTime$Var1)

options("scipen"=100, "digits"=4)

g3 <- ggplot(topMostDropOffTime,aes(Var1)) + geom_bar(aes(weight = Freq)) +
  xlab("DropOff Time") + ylab("Frequency") + ggtitle("Topmost DropOff time") + coord_flip()

g3

Thus here also it is seen that the busiest hours were between 1900 hrs to 2000 hrs.

Passerngers Per Trip Distribution Analysis

There are few dimensions across which passerenger per trip could be distributed: 1. Per Vendor 2. Per Pickup Hour 3. Per Dropoff Hour 4. For each of the top 4 busiest locations 5. Per Rate Code 6. Per Store_Fwd_Flag

Passenger per trip per vendor

trip_data_filtered$passenger_count <- as.factor(trip_data_filtered$passenger_count)

g5 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=vendor_id)) + geom_bar(position ="stack") +  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per Vendor") + coord_flip()

g5

Inference: There is an equal distribution of passenger count till 4 between CMT & VTS vendors.

However, once passenger count reaches 5, VTS is the preferred vendor of choice!

Also most of the trips are single passengers.

Passenger per trip per pickup hour

Next let us perform analysis of passenger per trip per pickup hour.

g6 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=pickUpTime)) + geom_bar(position ="stack") +  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per Pickup Hour") + coord_flip()

g6

Inference: Not much can be inferred from here, hence we will analyze the most busiest hour (when pickup time is between 1900 to 2000 hrs.)

MostBusyHourTrips <- subset(trip_data_filtered, trip_data_filtered$pickUpTime==19)

g7 <- ggplot(MostBusyHourTrips,aes(MostBusyHourTrips$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for 1900 hours") + coord_flip()

g7

Inference: at 1900 hours maximum single passenger trips are made and least is 4 passenger trips.

Passenger per trip for dropoff hour analysis at 1900 hours

Next let us analyze passenger count distribution per drop off hour at busiest 1900 hours

MostBusyHourTripsDropOff <- subset(trip_data_filtered, trip_data_filtered$dropOffTime==19)

g8 <- ggplot(MostBusyHourTripsDropOff,aes(MostBusyHourTripsDropOff$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for 1900 hours") + coord_flip()

g8

Inference: At 1900 hours maximum single passenger trips are made and least is 4 passenger trips.

Passenger per Trip for each of the busiest locations

Next let us analyze the Passenger per trip distribution for each of the busiest location

BusyLocation1 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.96167 Lat: 40.774151")

g9 <- ggplot(BusyLocation1,aes(BusyLocation1$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.96167 Lat: 40.774151") + coord_flip()

g9

Inference: Maximum Passenger for Lon: -73.96167 Lat: 40.774151 is 4 passengers & most of them went for single trip.

BusyLocation2 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.948723 Lat: 40.744843")

g10 <- ggplot(BusyLocation2,aes(BusyLocation2$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.948723 Lat: 40.744843") + coord_flip()

g10

Inference: Maximum Passenger for Lon: -73.948723 Lat: 40.744843 is 5 passengers & most of them went for single trip.

BusyLocation3 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.793983 Lat: 40.656971")

g11 <- ggplot(BusyLocation3,aes(BusyLocation3$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.793983 Lat: 40.656971") + coord_flip()

g11

Inference: Maximum Passenger for Lon: -73.793983 Lat: 40.656971 is 4 passengers & most of them went for single trip.

BusyLocation4 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.83461 Lat: 40.765968")

g12 <- ggplot(BusyLocation4,aes(BusyLocation4$passenger_count)) + geom_bar() +
  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.83461 Lat: 40.765968") + coord_flip()

g12

Inference: Maximum Passenger for Lon: -73.83461 Lat: 40.765968 is 4 passengers & most of them went for single trip.

Passenger per trip analysis per rate code analysis

Next we will analyze the passenger per trip per rate code

trip_data_filtered$rate_code <- as.factor(trip_data_filtered$rate_code)

g13 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=rate_code)) + geom_bar(position ="stack") +  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per rate code") + coord_flip()

g13

Inference: Maximum used rate code is “1”

Passenger per trip as per store & forward flag

Next we will analyze the passenger per trip distribution for the store & forward flag

g14 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=store_and_fwd_flag)) + geom_bar(position ="stack") +  xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per store_and_fwd_flag") + coord_flip()

g14

Inference: Close to half of this flag is blank. Passengers for more than 4 always have this flag as balnk. Single trip and double passenger trips have either flag as N or blank.

In repect of time & computational resources of my machine, here I will pause on the exploratory analysis of Trip Data.