I will be taking you through the different steps one by one.
The data set that was given for this task was from
I took the December 2013 data as per the suggestions for this exercise.
Let us read the data in a data frame
setwd("D:/Vibs/Work/Tech-Prep/Profile/Elula")
#clear the environment of all existing variables
rm(list=ls())
# setting the time zone UTC
Sys.setenv(TZ='UTC')
#read data
trip_data <- read.csv("D:/Vibs/Work/Tech-Prep/Profile/Elula/trip_data/trip_data_12.csv", header=TRUE)
Let us now check the data
head(trip_data)
## medallion hack_license
## 1 D7D598CD99978BD012A87A76A7C891B7 82F90D5EFE52FDFD2FDEC3EAD6D5771D
## 2 5455D5FF2BD94D10B304A15D4B7F2735 177B80B867CEC990DA166BA1D0FCAF82
## 3 93D6821F86A12B537C5EADBDFB432CA7 28B0AA10202F83FEB0F4E69340CA8F86
## 4 0C107B532C1207A74F0D8609B9E092FF 66C2CECD93E395CB9B875E9B382DB5D9
## 5 801C69A08B51470871A8110F8B0505EE 91A07EEF642E8590C2EFD631C3DF89C9
## 6 0FCAD8D2B0D6CAFB4CA81DD5137BE34E 307AD6E7FFB24B72779C91A0D682E42B
## vendor_id rate_code store_and_fwd_flag pickup_datetime
## 1 VTS 1 2013-12-01 00:13:00
## 2 VTS 1 2013-12-01 00:40:00
## 3 VTS 1 2013-12-01 02:21:00
## 4 VTS 1 2013-12-01 02:14:00
## 5 VTS 1 2013-12-01 04:45:00
## 6 VTS 1 2013-12-01 04:45:00
## dropoff_datetime passenger_count trip_time_in_secs trip_distance
## 1 2013-12-01 00:31:00 1 1080 3.90
## 2 2013-12-01 00:48:00 6 480 3.20
## 3 2013-12-01 02:30:00 5 540 3.28
## 4 2013-12-01 02:22:00 1 480 1.84
## 5 2013-12-01 04:50:00 1 300 1.02
## 6 2013-12-01 04:50:00 3 300 0.99
## pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## 1 -73.97934 40.77665 -73.98186 40.73428
## 2 -73.93967 40.72615 -73.98558 40.71807
## 3 -73.95875 40.76808 -73.95875 40.76808
## 4 -73.97884 40.72419 -73.97974 40.74341
## 5 -73.99136 40.73507 -73.97894 40.73460
## 6 -73.98717 40.76076 -73.99380 40.76552
tail(trip_data)
## medallion hack_license
## 13971113 19BF1BB516C4E992EA3FBAEDA73D6262 DC545D5F2A5176371D585C7FC7A0835D
## 13971114 803FA45BD7CC003F460410D5D9EE4D3A 6C1B453B66ACE118D95802CA27053C2B
## 13971115 DBE7422E79192B5F654FE1FE02526125 3158E46011E95D376541A461895A7730
## 13971116 ABC075C8871353E9F12ADEB5F1E27C2A 2E24AB48335EDC44EADE1CE29F6BCBF4
## 13971117 1FB2BF12B504498BC8B7B860294CD372 03D43975C87C8B5CFBDA1FF33B3AD743
## 13971118 FC2D556EBF3EC3043221010139A16D70 FD028AE01E96C2518503E60A463ABAEE
## vendor_id rate_code store_and_fwd_flag pickup_datetime
## 13971113 VTS 1 2013-12-06 07:00:00
## 13971114 VTS 1 2013-12-03 17:12:00
## 13971115 VTS 1 2013-12-06 06:55:00
## 13971116 VTS 1 2013-12-06 06:57:00
## 13971117 VTS 1 2013-12-06 07:04:00
## 13971118 VTS 1 2013-12-06 07:02:00
## dropoff_datetime passenger_count trip_time_in_secs
## 13971113 2013-12-06 07:02:00 1 120
## 13971114 2013-12-03 17:42:00 1 1800
## 13971115 2013-12-06 07:04:00 1 540
## 13971116 2013-12-06 07:06:00 3 540
## 13971117 2013-12-06 07:06:00 1 120
## 13971118 2013-12-06 07:08:00 1 360
## trip_distance pickup_longitude pickup_latitude dropoff_longitude
## 13971113 0.62 -73.99340 40.74733 -73.99194
## 13971114 10.01 -73.87308 40.77405 -73.97300
## 13971115 2.17 -73.99634 40.73305 -73.97856
## 13971116 1.69 -73.97218 40.79441 -73.97234
## 13971117 0.70 0.00000 0.00000 0.00000
## 13971118 1.35 -73.96292 40.77075 -73.98126
## dropoff_latitude
## 13971113 40.74163
## 13971114 40.75688
## 13971115 40.75028
## 13971116 40.79640
## 13971117 0.00000
## 13971118 40.77111
nrow(trip_data)
## [1] 13971118
There are a total of 14 columns in the data set with 13971118 rows. Also it is quite visible that there is a definite need of imputation as there is an empty column (store_and_fwd_flag).
Let us look at the column types
str(trip_data)
## 'data.frame': 13971118 obs. of 14 variables:
## $ medallion : Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 11374 4404 7810 644 6728 866 8555 9641 3480 1079 ...
## $ hack_license : Factor w/ 33381 levels "0002555BBE359440D6CEB34B699D3932",..: 17028 3027 5256 13410 18952 6267 21701 32914 5852 21985 ...
## $ vendor_id : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ rate_code : int 1 1 1 1 1 1 2 1 1 1 ...
## $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ pickup_datetime : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 758 2322 8058 7665 15307 15307 18047 22994 25497 25844 ...
## $ dropoff_datetime : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 1456 2445 8238 7791 15303 15303 18522 23256 25826 26176 ...
## $ passenger_count : int 1 6 5 1 1 3 1 6 6 5 ...
## $ trip_time_in_secs : int 1080 480 540 480 300 300 1140 600 600 600 ...
## $ trip_distance : num 3.9 3.2 3.28 1.84 1.02 ...
## $ pickup_longitude : num -74 -73.9 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.7 40.8 40.7 40.7 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.7 40.7 40.8 40.7 40.7 ...
Medallion column has approximately 13k levels which means that it could be one of the id’s of the taxi drivers.
Similarly, hack_license also has only 33k levels which could potentially be another id column.
Vendor_id could represent the vendor of the taxi and has only 2 levels. We need to check if all the records are having a vendor id or there are few records without vendor as well.
Let us go column by column and understand each one.
str(trip_data$medallion)
## Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 11374 4404 7810 644 6728 866 8555 9641 3480 1079 ...
#table(trip_data$medallion)
This shows that this column is a categorical variable as expected with 13k levels.
Let us try and filter records for one of the medaliion id with very less number of records and analyze further.
medaliion “0F4216F3AF4B80598DB77C558700AFF1” has only 8 records
#0F4216F3AF4B80598DB77C558700AFF1
test <- subset(trip_data, trip_data$medallion=="0F4216F3AF4B80598DB77C558700AFF1")
test
## medallion hack_license
## 54496 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 59644 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 60382 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 69860 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 69996 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 2224407 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 2225122 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## 3824519 0F4216F3AF4B80598DB77C558700AFF1 7B19DE6D4D54999531BEB27F758F71F6
## vendor_id rate_code store_and_fwd_flag pickup_datetime
## 54496 VTS 1 2013-12-31 11:56:00
## 59644 VTS 1 2013-12-31 11:08:00
## 60382 VTS 1 2013-12-31 10:03:00
## 69860 VTS 1 2013-12-31 09:40:00
## 69996 VTS 1 2013-12-31 09:53:00
## 2224407 VTS 1 2013-12-11 10:52:00
## 2225122 VTS 1 2013-12-10 20:19:00
## 3824519 VTS 1 2013-12-20 12:44:00
## dropoff_datetime passenger_count trip_time_in_secs
## 54496 2013-12-31 12:02:00 2 360
## 59644 2013-12-31 11:17:00 2 540
## 60382 2013-12-31 10:32:00 1 1740
## 69860 2013-12-31 09:51:00 2 660
## 69996 2013-12-31 09:54:00 1 60
## 2224407 2013-12-11 10:54:00 1 120
## 2225122 2013-12-10 20:19:00 1 0
## 3824519 2013-12-20 12:52:00 2 480
## trip_distance pickup_longitude pickup_latitude dropoff_longitude
## 54496 0 0 0 0
## 59644 0 0 0 0
## 60382 0 0 0 0
## 69860 0 0 0 0
## 69996 0 0 0 0
## 2224407 0 0 0 0
## 2225122 0 0 0 0
## 3824519 0 0 0 0
## dropoff_latitude
## 54496 0
## 59644 0
## 60382 0
## 69860 0
## 69996 0
## 2224407 0
## 2225122 0
## 3824519 0
Strangely, the hack_license remains constant for the medallion and hence let us see few more examples.
But we also saw that few trips are less than 1 minute duration (60 secs.) Let us see how many of the trips are less than 60 seconds and if few then they could be outliers and we can remove them from the data set.
test <- subset(trip_data, trip_data$trip_time_in_secs <=60)
nrow(test) * 100/nrow(trip_data)
## [1] 1.085697
Hence almost 1% of the records could be ignored.
Let us now perform a similar exercise on the other end.
summary(trip_data$trip_time_in_secs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 370.0 636.0 788.2 1020.0 10800.0
The maximum time is also suspicious and hence let us see how many records are thera bove 1000 secs trip duration.
test1 <- subset(trip_data, trip_data$trip_time_in_secs >=1000)
nrow(test1) * 100/nrow(trip_data)
## [1] 26.16098
Around 26% of the records are above 1000 seconds.
Hence it seems to be ok. So let us move to another interesting column trip_distance.
summary(trip_data$trip_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.010 1.750 2.925 3.200 100.000
test2 <- subset(trip_data, trip_data$trip_distance ==0)
nrow(test2) * 100/nrow(trip_data)
## [1] 0.5609501
Around 0.5% of the records have trip_distance = 0 which is not possible, hence we will ignore all those records which have trip_distance == 0 or trip_time_in_secs <= 60
let us filter out these 2 sets and continue the analysis.
trip_data_filtered <- subset(trip_data, trip_data$trip_time_in_secs>60)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$trip_distance > 0)
Let us understand the relation between medallion and hack_license columns.
med1 <- subset(trip_data_filtered, trip_data_filtered$medallion=="EBAF059FA9FE6F082A50F4C09C362B5D")
#drop the unused levels
med1$hack_license <- droplevels(med1$hack_license)
str(med1)
## 'data.frame': 800 obs. of 14 variables:
## $ medallion : Factor w/ 13460 levels "00005007A9F30E289E760362F69E4EAD",..: 12407 12407 12407 12407 12407 12407 12407 12407 12407 12407 ...
## $ hack_license : Factor w/ 2 levels "AC945E7E46CD4EB5B11816777AF2CC14",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ vendor_id : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ rate_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ pickup_datetime : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 194981 194082 197409 192570 198816 201118 202598 201950 203722 204840 ...
## $ dropoff_datetime : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 195793 194626 197491 193411 200496 201500 202562 202206 204159 205110 ...
## $ passenger_count : int 6 6 5 6 6 6 5 6 6 6 ...
## $ trip_time_in_secs : int 1200 900 420 1260 2040 720 300 600 780 600 ...
## $ trip_distance : num 1.37 1.18 0.6 9.2 3.66 1.72 0.62 1.53 1.65 0.89 ...
## $ pickup_longitude : num -74 -74 -74 -73.9 -74 ...
## $ pickup_latitude : num 40.8 40.8 40.8 40.8 40.8 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.8 40.8 40.8 40.7 ...
It shows 1 medallion had more than 1 hack_license. let us perform a reverse analysis and see if one hack_license has a map to more than 1 medallion.
hack1 <- subset(trip_data_filtered, trip_data_filtered$hack_license=="A69D2180076DCD9954F5EB66E2A747F7")
#drop the unused levels
hack1$medallion <- droplevels(hack1$medallion)
str(hack1)
## 'data.frame': 507 obs. of 14 variables:
## $ medallion : Factor w/ 1 level "A18CC3E9191D21F604DFC2423916E6A2": 1 1 1 1 1 1 1 1 1 1 ...
## $ hack_license : Factor w/ 33381 levels "0002555BBE359440D6CEB34B699D3932",..: 21701 21701 21701 21701 21701 21701 21701 21701 21701 21701 ...
## $ vendor_id : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ rate_code : int 2 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag: Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ pickup_datetime : Factor w/ 2289920 levels "2013-12-01 00:00:00",..: 18047 2177678 2169877 2240384 2180710 2243383 2171192 2167570 2190066 2172971 ...
## $ dropoff_datetime : Factor w/ 2291803 levels "2013-12-01 00:00:00",..: 18522 2180922 2172029 2242460 2182759 2245370 2173476 2169533 2191376 2174400 ...
## $ passenger_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trip_time_in_secs : int 1140 2220 1080 1080 960 960 1200 840 120 300 ...
## $ trip_distance : num 22.04 11.68 4.8 8.84 1.82 ...
## $ pickup_longitude : num -74 -73.9 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.8 40.7 40.7 40.8 ...
## $ dropoff_longitude : num -73.8 -74 -74 -73.9 -74 ...
## $ dropoff_latitude : num 40.6 40.8 40.7 40.8 40.8 ...
#' @description This function returns the sets which have repeated records
#'
#' @param x object - the data frame's column which needs to be analyzed for duplicacy
#'
d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
# One to one
onetoOne <- trip_data_filtered[!d(trip_data_filtered$medallion) & !d(trip_data_filtered$hack_license),]
# One to many
onetoMany <- trip_data_filtered[d(trip_data_filtered$medallion) & !d(trip_data_filtered$hack_license),]
# Many to one
ManytoOne <- trip_data_filtered[!d(trip_data_filtered$medallion) & d(trip_data_filtered$hack_license),]
nrow(onetoOne)
## [1] 0
nrow(onetoMany)
## [1] 164
nrow(ManytoOne)
## [1] 1
It seems that most of the relation between medallion and hack_license is manytomany.
Let us now filter the trips where passenger count <0.
summary(trip_data_filtered$passenger_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.725 2.000 9.000
test3 <- subset(trip_data_filtered, trip_data_filtered$passenger_count ==0)
nrow(test3) * 100/nrow(trip_data_filtered)
## [1] 0.000376655
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$passenger_count >0)
Let us now create location categories. 1 for Pickup location with combinations of pickup longitude and latitude & 1 for drop location with drop longitude and drop latitude.
trip_data_filtered$pickup_longitude <- as.factor(trip_data_filtered$pickup_longitude)
trip_data_filtered$pickup_latitude <- as.factor(trip_data_filtered$pickup_latitude)
trip_data_filtered <- cbind(trip_data_filtered, pickupLocation = 'NA')
trip_data_filtered$pickupLocation <- as.character(trip_data_filtered$pickupLocation)
trip_data_filtered$pickupLocation <- paste0("Lon: ", trip_data_filtered$pickup_longitude," Lat: ",trip_data_filtered$pickup_latitude)
trip_data_filtered$pickupLocation <- as.factor(trip_data_filtered$pickupLocation)
Now we need to perform the same for the drop off location.
trip_data_filtered$dropoff_longitude <- as.factor(trip_data_filtered$dropoff_longitude)
trip_data_filtered$dropoff_latitude <- as.factor(trip_data_filtered$dropoff_latitude)
trip_data_filtered <- cbind(trip_data_filtered, dropoffLocation = 'NA')
trip_data_filtered$dropoffLocation <- as.character(trip_data_filtered$dropoffLocation)
trip_data_filtered$dropoffLocation <- paste0("Lon: ",trip_data_filtered$dropoff_longitude," Lat: ",trip_data_filtered$dropoff_latitude)
trip_data_filtered$dropoffLocation <- as.factor(trip_data_filtered$dropoffLocation)
Now since we have got the pickup and dropoff locations let us do a sort on the most frequent pickup and most frequent drop off.
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
topMostPickup <- as.data.frame(table(trip_data_filtered$pickupLocation))
topMostPickup <- topMostPickup[order(topMostPickup$Freq),]
#see the maximum pickup location
tail(topMostPickup,10)
## Var1 Freq
## 1815594 Lon: -73.961014 Lat: 40.802319 83
## 6309725 Lon: -73.985985 Lat: 40.684994 130
## 423645 Lon: -73.937714 Lat: 40.758278 155
## 10142185 Lon: -74.011147 Lat: 40.701656 220
## 9549985 Lon: -74.004982 Lat: 40.723339 243
## 136589 Lon: -73.83461 Lat: 40.765968 1052
## 123286 Lon: -73.793983 Lat: 40.656971 1249
## 680007 Lon: -73.948723 Lat: 40.744843 1302
## 1890952 Lon: -73.96167 Lat: 40.774151 1432
## 10336530 Lon: 0 Lat: 0 207268
Let us now see the most frequent drop off
topMostDropOff <- as.data.frame(table(trip_data_filtered$dropoffLocation))
topMostDropOff <- topMostDropOff[order(topMostDropOff$Freq),]
#see the maximum dropoff location
tail(topMostDropOff,10)
## Var1 Freq
## 4505254 Lon: -73.974739 Lat: 40.616863 77
## 2537889 Lon: -73.961014 Lat: 40.802319 84
## 7368436 Lon: -73.985985 Lat: 40.684994 132
## 11137097 Lon: -74.011147 Lat: 40.701656 221
## 10541025 Lon: -74.004982 Lat: 40.723339 243
## 137126 Lon: -73.83461 Lat: 40.765968 1053
## 91500 Lon: -73.793983 Lat: 40.656971 1254
## 1183003 Lon: -73.948723 Lat: 40.744843 1302
## 2620855 Lon: -73.96167 Lat: 40.774151 1429
## 11406567 Lon: 0 Lat: 0 203607
Thus, it seems we have some data which has lognitude and latitude mentioed as zero. It could be filtered out.
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$pickup_latitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$pickup_longitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$dropoff_latitude !=0)
trip_data_filtered <- subset(trip_data_filtered, trip_data_filtered$dropoff_longitude !=0)
Let us now perform the same study again and capture the most frequent pickup location.
topMostPickup <- as.data.frame(table(trip_data_filtered$pickupLocation))
topMostPickup <- topMostPickup[order(topMostPickup$Freq),]
#see the maximum pickup location
tail(topMostPickup,10)
## Var1 Freq
## 3650380 Lon: -73.974739 Lat: 40.616863 77
## 1815594 Lon: -73.961014 Lat: 40.802319 83
## 423645 Lon: -73.937714 Lat: 40.758278 91
## 6309725 Lon: -73.985985 Lat: 40.684994 129
## 10142185 Lon: -74.011147 Lat: 40.701656 220
## 9549985 Lon: -74.004982 Lat: 40.723339 243
## 136589 Lon: -73.83461 Lat: 40.765968 1052
## 123286 Lon: -73.793983 Lat: 40.656971 1247
## 680007 Lon: -73.948723 Lat: 40.744843 1302
## 1890952 Lon: -73.96167 Lat: 40.774151 1432
Let us visualize it in a graph.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
topMostPickup <- tail(topMostPickup,10)
topMostPickup$Var1 <- droplevels(topMostPickup$Var1)
g <- ggplot(topMostPickup,aes(Var1)) + geom_bar(aes(weight = Freq)) +
xlab("Pickup Location") + ylab("Frequency") + ggtitle("Topmost Pickup Locations") + coord_flip()
g
Let us perform the same exercise for topmost dropoff location.
topMostDropOff <- as.data.frame(table(trip_data_filtered$dropoffLocation))
topMostDropOff <- topMostDropOff[order(topMostDropOff$Freq),]
#see the maximum dropoff location
tail(topMostDropOff,10)
## Var1 Freq
## 1182999 Lon: -73.948723 Lat: 40.743874 75
## 4505254 Lon: -73.974739 Lat: 40.616863 77
## 2537889 Lon: -73.961014 Lat: 40.802319 83
## 7368436 Lon: -73.985985 Lat: 40.684994 129
## 11137097 Lon: -74.011147 Lat: 40.701656 220
## 10541025 Lon: -74.004982 Lat: 40.723339 243
## 137126 Lon: -73.83461 Lat: 40.765968 1052
## 91500 Lon: -73.793983 Lat: 40.656971 1247
## 1183003 Lon: -73.948723 Lat: 40.744843 1302
## 2620855 Lon: -73.96167 Lat: 40.774151 1429
topMostDropOff <- tail(topMostDropOff,10)
topMostDropOff$Var1 <- droplevels(topMostDropOff$Var1)
g1 <- ggplot(topMostDropOff,aes(Var1)) + geom_bar(aes(weight = Freq)) +
xlab("Dropoff Location") + ylab("Frequency") + ggtitle("Topmost Dropoff Locations") + coord_flip()
g1
Hence we can see that there are 4 places which are frequently visited (both dropoff & pickup)
Similarly let us now try and find out the busiest hours.
trip_data_filtered <- cbind(trip_data_filtered, pickUpTime = 'NA')
trip_data_filtered$pickUpTime <- as.character(trip_data_filtered$pickUpTime)
trip_data_filtered$pickUpTime <- strftime(trip_data_filtered$pickup_datetime, format="%H")
topMostPickupTime <- as.data.frame(table(trip_data_filtered$pickUpTime))
topMostPickupTime <- topMostPickupTime[order(topMostPickupTime$Freq),]
#see the maximum pickup time
tail(topMostPickupTime,10)
## Var1 Freq
## 12 11 667282
## 24 23 685069
## 14 13 693292
## 13 12 698435
## 15 14 709397
## 23 22 764756
## 22 21 783211
## 19 18 786744
## 21 20 789235
## 20 19 819789
topMostPickupTime <- tail(topMostPickupTime,10)
topMostPickupTime$Var1 <- droplevels(topMostPickupTime$Var1)
options("scipen"=100, "digits"=4)
g2 <- ggplot(topMostPickupTime,aes(Var1)) + geom_bar(aes(weight = Freq)) +
xlab("Pickup Time") + ylab("Frequency") + ggtitle("Topmost Pickup time") + coord_flip()
g2
Thus it shows that maximum pickups are between 1900 hrs and 2000 hrs.
Let us now see the dropoff times.
trip_data_filtered <- cbind(trip_data_filtered, dropOffTime = 'NA')
trip_data_filtered$dropOffTime <- as.character(trip_data_filtered$dropOffTime)
trip_data_filtered$dropOffTime <- strftime(trip_data_filtered$dropoff_datetime, format="%H")
topMostDropOffTime <- as.data.frame(table(trip_data_filtered$dropOffTime))
topMostDropOffTime <- topMostDropOffTime[order(topMostDropOffTime$Freq),]
#see the maximum dropoff time
tail(topMostDropOffTime,10)
## Var1 Freq
## 16 15 683964
## 14 13 688020
## 13 12 696222
## 15 14 697974
## 24 23 709897
## 19 18 772995
## 23 22 775022
## 22 21 783579
## 21 20 804042
## 20 19 838968
topMostDropOffTime <- tail(topMostDropOffTime,10)
topMostDropOffTime$Var1 <- droplevels(topMostDropOffTime$Var1)
options("scipen"=100, "digits"=4)
g3 <- ggplot(topMostDropOffTime,aes(Var1)) + geom_bar(aes(weight = Freq)) +
xlab("DropOff Time") + ylab("Frequency") + ggtitle("Topmost DropOff time") + coord_flip()
g3
Thus here also it is seen that the busiest hours were between 1900 hrs to 2000 hrs.
There are few dimensions across which passerenger per trip could be distributed: 1. Per Vendor 2. Per Pickup Hour 3. Per Dropoff Hour 4. For each of the top 4 busiest locations 5. Per Rate Code 6. Per Store_Fwd_Flag
trip_data_filtered$passenger_count <- as.factor(trip_data_filtered$passenger_count)
g5 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=vendor_id)) + geom_bar(position ="stack") + xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per Vendor") + coord_flip()
g5
Inference: There is an equal distribution of passenger count till 4 between CMT & VTS vendors.
However, once passenger count reaches 5, VTS is the preferred vendor of choice!
Also most of the trips are single passengers.
Next let us perform analysis of passenger per trip per pickup hour.
g6 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=pickUpTime)) + geom_bar(position ="stack") + xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per Pickup Hour") + coord_flip()
g6
Inference: Not much can be inferred from here, hence we will analyze the most busiest hour (when pickup time is between 1900 to 2000 hrs.)
MostBusyHourTrips <- subset(trip_data_filtered, trip_data_filtered$pickUpTime==19)
g7 <- ggplot(MostBusyHourTrips,aes(MostBusyHourTrips$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for 1900 hours") + coord_flip()
g7
Inference: at 1900 hours maximum single passenger trips are made and least is 4 passenger trips.
Next let us analyze passenger count distribution per drop off hour at busiest 1900 hours
MostBusyHourTripsDropOff <- subset(trip_data_filtered, trip_data_filtered$dropOffTime==19)
g8 <- ggplot(MostBusyHourTripsDropOff,aes(MostBusyHourTripsDropOff$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for 1900 hours") + coord_flip()
g8
Inference: At 1900 hours maximum single passenger trips are made and least is 4 passenger trips.
Next let us analyze the Passenger per trip distribution for each of the busiest location
BusyLocation1 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.96167 Lat: 40.774151")
g9 <- ggplot(BusyLocation1,aes(BusyLocation1$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.96167 Lat: 40.774151") + coord_flip()
g9
Inference: Maximum Passenger for Lon: -73.96167 Lat: 40.774151 is 4 passengers & most of them went for single trip.
BusyLocation2 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.948723 Lat: 40.744843")
g10 <- ggplot(BusyLocation2,aes(BusyLocation2$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.948723 Lat: 40.744843") + coord_flip()
g10
Inference: Maximum Passenger for Lon: -73.948723 Lat: 40.744843 is 5 passengers & most of them went for single trip.
BusyLocation3 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.793983 Lat: 40.656971")
g11 <- ggplot(BusyLocation3,aes(BusyLocation3$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.793983 Lat: 40.656971") + coord_flip()
g11
Inference: Maximum Passenger for Lon: -73.793983 Lat: 40.656971 is 4 passengers & most of them went for single trip.
BusyLocation4 <- subset(trip_data_filtered, trip_data_filtered$pickupLocation =="Lon: -73.83461 Lat: 40.765968")
g12 <- ggplot(BusyLocation4,aes(BusyLocation4$passenger_count)) + geom_bar() +
xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution for Lon: -73.83461 Lat: 40.765968") + coord_flip()
g12
Inference: Maximum Passenger for Lon: -73.83461 Lat: 40.765968 is 4 passengers & most of them went for single trip.
Next we will analyze the passenger per trip per rate code
trip_data_filtered$rate_code <- as.factor(trip_data_filtered$rate_code)
g13 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=rate_code)) + geom_bar(position ="stack") + xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per rate code") + coord_flip()
g13
Inference: Maximum used rate code is “1”
Next we will analyze the passenger per trip distribution for the store & forward flag
g14 <- ggplot(trip_data_filtered,aes(trip_data_filtered$passenger_count, fill=store_and_fwd_flag)) + geom_bar(position ="stack") + xlab("Passenger Count") + ylab("Frequency") + ggtitle("Passenger Distribution per store_and_fwd_flag") + coord_flip()
g14
Inference: Close to half of this flag is blank. Passengers for more than 4 always have this flag as balnk. Single trip and double passenger trips have either flag as N or blank.
In repect of time & computational resources of my machine, here I will pause on the exploratory analysis of Trip Data.