This is the approach I took for the analysis of NYC Taxi Fare & Tip Analysis for Prediction

This is in continuation with the exploratory analysis that we performed on Trip data.

In this exercise the focus would be on fare & tip amounts and the factors affecting them.

We will also try to build a model (if possible) to predict the amounts.

Data Set Acquisition

The data set that was given for this task was from

I took the December 2013 data as per the suggestions for this exercise.

Let us read the data in a data frame

setwd("D:/Vibs/Work/Tech-Prep/Profile/Elula")

#clear the environment of all existing variables
rm(list=ls())

# setting the time zone UTC
Sys.setenv(TZ='UTC')

trip_fare <- read.csv("D:/Vibs/Work/Tech-Prep/Profile/Elula/trip_fare/trip_fare_12.csv", header=TRUE)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
options("scipen"=100, "digits"=4)

Tip amount distribution analysis

summary(trip_fare$tip_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    1.41    2.00  200.00
trip_fare <- cbind(trip_fare, tip_amount_category = 'NA')

b <- c(-Inf, 0, 1, 2, 200)
names <- c("Zero", "One", "Two","High")
trip_fare$tip_amount_category <- cut(trip_fare$tip_amount, breaks = b, labels = names)
trip_fare$tip_amount_category <- as.factor(trip_fare$tip_amount_category)

table(trip_fare$tip_amount_category)
## 
##    Zero     One     Two    High 
## 6720363 1343343 2675992 3231420
g17 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=tip_amount_category)) + geom_bar(position ="stack") +  xlab("payment type") + ylab("Frequency") + ggtitle("tip amount category Distribution over Payment type") + coord_flip()

g17

Inference: If the payment is by CSH, NOC or DIS, then the tip amount is zero. High tip is given in case of CRD payment as well as tip of 1$ & 2$ is popular amoung CRD payments.

summary(trip_fare$tip_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    1.41    2.00  200.00
trip_fare <- cbind(trip_fare, pickUpTime = 'NA')
trip_fare$pickUpTime <- as.character(trip_fare$pickUpTime)
trip_fare$pickUpTime <- strftime(trip_fare$pickup_datetime, format="%H")
trip_fare$pickUpTime <- as.factor(trip_fare$pickUpTime)

g24 <- ggplot(trip_fare,aes(trip_fare$pickUpTime, fill=tip_amount_category)) + geom_bar(position ="stack") +  xlab("Pick Up Time") + ylab("Frequency") + ggtitle("Tip Amount Distribution over pick up time") + coord_flip()

g24

Inference: Highest Tip amount is between 6 pm to midnight.

Let us analyze the tip amount w.r.t vendors

g25 <- ggplot(trip_fare,aes(trip_fare$vendor_id, fill=tip_amount_category)) + geom_bar(position ="stack") +  xlab("Vendor") + ylab("Frequency") + ggtitle("Tip Amount Distribution over vendor") + coord_flip()

g25

Inference: vendor does not seem to affect the tip amounts.

Let us now analyze the tip amount distribution on fare amount.

summary(trip_fare$fare_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.5     6.5     9.5    12.6    14.5   500.0
trip_fare <- cbind(trip_fare, fare_amount_category = 'NA')

b <- c(2, 6.5, 9.5, 14.5, 500)
names <- c("Low", "Medium", "High","very High")
trip_fare$fare_amount_category <- cut(trip_fare$fare_amount, breaks = b, labels = names)
trip_fare$fare_amount_category <- as.factor(trip_fare$fare_amount_category)

table(trip_fare$fare_amount_category)
## 
##       Low    Medium      High very High 
##   3748826   3539946   3293220   3389126
g26 <- ggplot(trip_fare,aes(trip_fare$tip_amount_category, fill=fare_amount_category)) + geom_bar(position ="stack") +  xlab("Tip Amount") + ylab("Frequency") + ggtitle("Tip Amount Distribution over Fare Amount") + coord_flip()

g26

Inference: When Fare Amount is high, Tip Amount is also high. But it contradicts with Fare Amount being high but tip being zero. Hence, we can reject this hypothesis.

Also, when fare amount is low, then also the tip amount does not show any trend.

Similarly for fare amount being medium.

Hence, the fare amount does not seem to have any conclusive affect on tip amount.

Let us now analyze the tip amount over surcharge.

summary(trip_fare$surcharge)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.316   0.500  11.110
trip_fare <- cbind(trip_fare, surcharge_category = 'NA')

b <- c(-Inf, 0.3, 12)
names <- c("LessThanThirtyCents", "MoreThanThirtyCents")
trip_fare$surcharge_category <- cut(trip_fare$surcharge, breaks = b, labels = names)
trip_fare$surcharge_category <- as.factor(trip_fare$surcharge_category)

table(trip_fare$surcharge_category)
## 
## LessThanThirtyCents MoreThanThirtyCents 
##             7107075             6864043
g27 <- ggplot(trip_fare,aes(trip_fare$tip_amount_category, fill=surcharge_category)) + geom_bar(position ="stack") +  xlab("Tip Amount") + ylab("Frequency") + ggtitle("Tip Amount Distribution over Surcharge") + coord_flip()

g27

Inference: Tip amount is equally distributed and surcharge does not seem to have any effect on Tip amount.

Let us now analyze the mta tax and it’s effect on tip (if any)

summary(trip_fare$mta_tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.500   0.500   0.498   0.500   0.500

Inference: MTA Tax seems to be stagnant with 50 cents value and hence we can ignore this feature for any effect on tip amount.

Let us now analyze the toll amount and it’s effect on tip amount.

summary(trip_fare$tolls_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.269   0.000  20.000
trip_fare <- cbind(trip_fare, tolls_amount_category = 'NA')

b <- c(-Inf, 0, 20)
names <- c("Zero", "NonZero")
trip_fare$tolls_amount_category <- cut(trip_fare$tolls_amount, breaks = b, labels = names)
trip_fare$tolls_amount_category <- as.factor(trip_fare$tolls_amount_category)

table(trip_fare$tolls_amount_category)
## 
##     Zero  NonZero 
## 13309116   662002
g28 <- ggplot(trip_fare,aes(trip_fare$tip_amount_category, fill=tolls_amount_category)) + geom_bar(position ="stack") +  xlab("Tip Amount") + ylab("Frequency") + ggtitle("Tip Amount Distribution over Tolls Amount") + coord_flip()

g28

Inference: If the toll amount is Non Zero, then tip amount is either zero or high. Which are contradictory and hence we can ignore this feature as part of the feature engineering.

Conclusion: The factors which seem to affect the tip amount are: 1. payment_type 2. pickUpTime

Prediction Model for Tip Amount

To simplify the problem I would rather take this as a multi-class calssification problem where I will try to predict the tip amount category (0\(, 1\), 2$ or High) based on the 2 factors (payment_type & pickupTime)

dataForPrediction <- trip_fare[,c("pickUpTime","payment_type","tip_amount_category")]

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# train a random forest with the default parameters using pickUpTime & payment_type

# We will take 30% of the data for training purposes (Owing to the lack of computational resources)

train <- dataForPrediction[1:3779783, c("pickUpTime","payment_type","tip_amount_category")]

rf.train.1 <- train[,c("pickUpTime","payment_type")]

rf.label <-as.factor(train$tip_amount_category)

set.seed(1234)

#rf.1 <- randomForest(x=rf.train.1, y=rf.label, importance = TRUE, ntree = 20)

Unfortunately, I was not able to craete a model using random forest, owing to lack of computational resources.

Let us try and go for random forest model. Again owing to the rsources amd just taking 7000 records.

train <- dataForPrediction[1:7000, c("pickUpTime","payment_type","tip_amount_category")]

rf.train.1 <- train[,c("pickUpTime","payment_type")]

rf.label <-as.factor(train$tip_amount_category)

set.seed(1234)
#use the default parameters and marked importance= TRUE, so that the algo keeps track of the features and # their importance during training, so that it can be reported out.
rf.1 <- randomForest(x=rf.train.1, y=rf.label, importance = TRUE, ntree = 500)

rf.1
## 
## Call:
##  randomForest(x = rf.train.1, y = rf.label, ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 27.77%
## Confusion matrix:
##      Zero One Two High class.error
## Zero 3672   0  57  119     0.04574
## One     0   0 205  374     1.00000
## Two     0   0 404  831     0.67287
## High    0   0 358  980     0.26756

OOB estimates, tells that each of the decision tree has a certain accuracy of prediction. Sometimes few rows and few features get selected repeatedly and hence some rows / columns might never get selected for training purposes. This algo keeps a track of the same and comes up with a matrix for comparing the results on the unselected rows/columns. Number of variables tried at each split can be changed using the mtry paramater.

If we have long list of variables, then they come on the y-axis of varImpPlot() The farther the dots for a variable, the more important it is for that classification.

The metrics are not very good.

Let us try and use k-fold cross validation and see how it performs.

library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
library(doSNOW)
## Warning: package 'doSNOW' was built under R version 3.4.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.4
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 3.4.4
## Loading required package: snow
## Warning: package 'snow' was built under R version 3.4.4
set.seed(2348)

cv.10.folds <- createMultiFolds(rf.label, k=10, times = 10)

table(rf.label)
## rf.label
## Zero  One  Two High 
## 3848  579 1235 1338
#check for stratification (Every fold has the same ratio of classes)
table(rf.label[cv.10.folds[[33]]])
## 
## Zero  One  Two High 
## 3464  521 1112 1204
#train the model using repeated cross validation, use 10 folds and repeat it 10 times and use the 100 #collection of indexes
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)

#Setup the doSNOW Package for multi-core training which is helpful for training the trees.
#socket server are the simplest and it is being run on a single machine
# 6 child processes
cl <- makeCluster(6, type = "SOCK")

registerDoSNOW(cl)

set.seed(34324)

#tune length allows the caret to use a maximum combination of 3 values (e.g. 3 Mtry values)
rf.1.cv.1 <- train(x= rf.train.1, y=rf.label, method="rf", tuneLength=3, ntree=1000, trControl=ctrl.1)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
stopCluster(cl)

rf.1.cv.1
## Random Forest 
## 
## 7000 samples
##    2 predictor
##    4 classes: 'Zero', 'One', 'Two', 'High' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 6299, 6300, 6299, 6302, 6300, 6299, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.7172    0.5467
## 
## Tuning parameter 'mtry' was held constant at a value of 2

In respect of time, I would close the model building exercise here as there is lot of improvements that can be performed, provided the computational resources. We can try several other classification algorithms such as SVM, Decision trees or even neural netwworks etc. and measure & improve the metrics. Also we can increase the training data to get better results once we have bigger and better computational resources.

Fare Amount Analysis

Similarly, the approach for prediction of Fare Amount can be showcased.

Suggestion (Actionable Insight) for a Taxi owner to increase the earning

Tip is one of the factor which can increase your earnings. As per the exploratory analysis performed, it can be suggested to accept payments by CRD to get better tips as well as working during the evening hours and picking up the passengers from 6 pm to midnight.

Suggestion for a Taxi Company with 10 taxis to maximize earnings.

Since the tip is for the driver, hence tip increase factors will not help your earnings. Also number of rides are very less between 1 AM & 7 AM, hence if we have the data of passengers waiting for taxis during these hours, we can use surge pricing to maximize earnings. Or even we can segment the passengers who are price sensitive and give them lower fare rates during off peak hours so as to utilize the taxi to its maximum throughput. We can have surge pricing as it boosts supply because more drivers are incentivised to pick up passengers.

Issues in data impacting the approach

Few Data was skewed e.g. toll amount, mta tax which had led myself to ignore their effects. The Data set was very huge and needed far more computational resources to analyze. Data from trip_data & trip_fare were not joined (there was no identification of trip like a trip id) which affected my approach to analyze the features present in cross sheets. Data did not have passenger demographics with which we can segment them and use for surge pricing for taxi companies to maximize earnings.

Potential Drawbacks in the model that I chose

Random forest is based on bagging, the idea of bagging is to build many models with low bias and high variance and then average them. Therefore to achieve good performance with random forest you need very deep trees because only deep trees have a low bias. But deep trees require a lot of computational resources to evaluate, because you need to traverse every tree in the forest from the top to the bottom, evaluating splitting conditions on every level.

Random Forests fail when there are rare outcomes or rare predictors, as the algorithm is based on bootstrap sampling. Requires too many computations and too many training examples. Need to choose number of trees. Less Interpretability. Hence we need to carefully perform a cross validation exercise to avoid it. Also, the hyper parameters were not explored to the fullest. That can also be done to come up with better results. Also we can use a smaller number of folds in the cross validation to avoid over fitting. Because it will get less data for training and hence less prone to overfit. Also, the test set will become bigger in size.

Neural networks or SVM could have been explored to measure the metrics and see if they behaved better. A deep neural network needs more samples to deliver the same level of accuracy, but it will benefit from massive amounts of data, and continuously improve the accuracy.