This is the approach I took for the analysis of NYC taxi data

I will be taking you through the different steps one by one.

Data Set Acquisition

The data set that was given for this task was from

I took the December 2013 data as per the suggestions for this exercise.

Let us read the data in a data frame

setwd("D:/Vibs/Work/Tech-Prep/Profile/Elula")

#clear the environment of all existing variables
rm(list=ls())

# setting the time zone UTC
Sys.setenv(TZ='UTC')

trip_fare <- read.csv("D:/Vibs/Work/Tech-Prep/Profile/Elula/trip_fare/trip_fare_12.csv", header=TRUE)

payment type distribution analysis

Let us now analyze the payment type from the other data set (trip_fare)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
summary(trip_fare$payment_type)
##     CRD     CSH     DIS     NOC     UNK 
## 7444798 6426107   10100   35184   54929
g15 <- ggplot(trip_fare,aes(trip_fare$payment_type)) + geom_bar() +
  xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution") + coord_flip()

g15

There are 5 types of payments of which CRD is the most popular followed by CSH & DIS is the least popular.

We can further analyze distribution of payment type by grouping the total_amount into bins or by the tip_amount.

summary(trip_fare$total_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.50    8.00   11.30   15.13   17.00  650.00
trip_fare <- cbind(trip_fare, total_amount_category = 'NA')

b <- c(2.5, 8, 11, 17, 650)
names <- c("Low", "Medium", "High","Very High")
trip_fare$total_amount_category <- cut(trip_fare$total_amount, breaks = b, labels = names)
trip_fare$total_amount_category <- as.factor(trip_fare$total_amount_category)

table(trip_fare$total_amount_category)
## 
##       Low    Medium      High Very High 
##   3695808   3187403   3675488   3412399
g16 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=total_amount_category)) + geom_bar(position ="stack") +  xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over total amount category") + coord_flip()

g16

Inference: The CRD & CSG payment types are almost equally distributed among all total amount categories.

summary(trip_fare$tip_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.414   2.000 200.000
trip_fare <- cbind(trip_fare, tip_amount_category = 'NA')

b <- c(-Inf, 0, 1, 2, 200)
names <- c("Zero", "One", "Two","High")
trip_fare$tip_amount_category <- cut(trip_fare$tip_amount, breaks = b, labels = names)
trip_fare$tip_amount_category <- as.factor(trip_fare$tip_amount_category)

table(trip_fare$tip_amount_category)
## 
##    Zero     One     Two    High 
## 6720363 1343343 2675992 3231420
g17 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=tip_amount_category)) + geom_bar(position ="stack") +  xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over tip amount category") + coord_flip()

g17

Inference: If the payment is by cash, then the tip amount is zero. High tip is given in case of CRD payment as well as tip of 1$ & 2$ is popular amoung CRD payments.

g19 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=vendor_id)) + geom_bar(position ="stack") +  xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over vendor") + coord_flip()

g19

Inference: CSH & CRD modes are almost equally divided between CMT & VMS vendors. However, UNK payment type is only done for VTS & NOC and DIS payment type is only done for CMT.

fare distribution analysis

Let us now understand how the fare is distributed.

summary(trip_fare$fare_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.50    6.50    9.50   12.63   14.50  500.00
trip_fare <- cbind(trip_fare, fare_amount_category = 'NA')

b <- c(2, 6.5, 9.5, 14.5, 500)
names <- c("Low", "Medium", "High","very High")
trip_fare$fare_amount_category <- cut(trip_fare$fare_amount, breaks = b, labels = names)
trip_fare$fare_amount_category <- as.factor(trip_fare$fare_amount_category)

table(trip_fare$fare_amount_category)
## 
##       Low    Medium      High very High 
##   3748826   3539946   3293220   3389126
g18 <- ggplot(trip_fare,aes(trip_fare$fare_amount_category, fill=payment_type)) + geom_bar(position ="stack") +  xlab("Fare Amount") + ylab("Frequency") + ggtitle("Fare Amount Distribution over payment type") + coord_flip()

g18

Inference: Whatever is the Fare Amount, it is mostly paid by CRD or CSH.

g20 <- ggplot(trip_fare,aes(trip_fare$fare_amount_category, fill=vendor_id)) + geom_bar(position ="stack") +  xlab("fare amount") + ylab("Frequency") + ggtitle("Fare Amount Distribution over vendor") + coord_flip()

g20

Inference: Fare amounts are almost equally distributed between CMT & VTS Vendors.

Let us now see the distribution of fare amount over pickup time.

trip_fare <- cbind(trip_fare, pickUpTime = 'NA')
trip_fare$pickUpTime <- as.character(trip_fare$pickUpTime)
trip_fare$pickUpTime <- strftime(trip_fare$pickup_datetime, format="%H")
trip_fare$pickUpTime <- as.factor(trip_fare$pickUpTime)

g21 <- ggplot(trip_fare,aes(trip_fare$pickUpTime, fill=fare_amount_category)) + geom_bar(position ="stack") +  xlab("Pick Up Time") + ylab("Frequency") + ggtitle("Fare Amount Distribution over pick up time") + coord_flip()

g21

Inference: Very High fare amount starts from 12 noon to 12 midnight.

Tip amount distribution analysis

summary(trip_fare$tip_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.414   2.000 200.000
g24 <- ggplot(trip_fare,aes(trip_fare$pickUpTime, fill=tip_amount_category)) + geom_bar(position ="stack") +  xlab("Pick Up Time") + ylab("Frequency") + ggtitle("Tip Amount Distribution over pick up time") + coord_flip()

g24

Inference: Highest Tip amount is between 6 pm to midnight.

In repect of time, here I will pause on the exploratory analysis and start with the Open Questions section.