1 Introduction

Taxicabs are an integral part of the New York City (NYC) experience. Widely recognized by their yellow color and checkered boxes, roughly 13,500 taxicabs are in service today, worth about $800,000 each, with 50,000 drivers serving 236 million passengers per year (Devaraj & Patel, 2017).

Our study examines what affects the amount of tip paid to the driver. This is relevant for a few reasons. Tips are an important part of the service industry, which includes taxicab drivers (Devaraj & Patel, 2017; Azar, 2007; Azar, 2010; Lynn, 2006; Flynn & Greenberg, 2012). In general, tips average around 15% and generate billions of dollars each year (Azar, 2007). Thus, knowing the likelihood of tipping has the potential to provide greater financial certainty for workers. In addition, despite New York law preventing taxicab drivers from refusing service, service refusal still happens (New York City Taxi and Limousine Commission, n.d.; Rivoli & Jorgensen, 2018). It is possible that drivers are making assumptions about how likely a customer is to tip (Ayres, Vars, & Zakariya, 2005). Finally, taxicab data can also provide valuable information on city life, city culture, human behavior, and various socioeconomic variables such as economic activity (Ferreira, Poco, Vo, Freire, & Silva, 2013).

2 Literature Review

Tips are voluntary payments, made in addition to the obligatory transaction amount and usually in the form of money, from a customer to the service worker who performs a service for them (Devaraj & Patel, 2017). There has been substantial research on tipping in the restaurant industry, mostly through empirical studies (Azar, 2007).

Tipping behavior can be explained by both economic and noneconomic factors (Devaraj & Patel, 2017), such as providing an economic incentive for a higher quality of service (Flath, 2012), wealth (Harris, 1995), the idea of tipping being a social norm (Azar, 2010), gratitude or appreciation of service (Azar, 2010; Lynn, 2001), and weather (Flynn & Greenberg, 2012). The impact of these variables, however, is not agreed upon by researchers (Azar, 2010; Harris, 1995; Lynn, 2001; Azar, 2007; Flynn & Greenberg, 2012). Most of these studies were conducted through interviews, and Azar (2007) stresses the need to distinguish what a customer says during an interview from what that customer actually does. While a customer may agree in principle that tips should depend on the quality of service, in an actual situation the social norm of tipping may take precedence over that principle.

2.1 Taxicab Tipping

Non-tipping work related to the NYC dataset has been conducted, especially in regard to pickup and dropoff locations (Zhan, Hasan, Ukkusuri, & Kamga, 2013; Neutens, Delafontaine, Scott, & De Maeyer, 2012) and to supply and demand (Qian & Ukkusuri, 2015; Gonzales, Yang, Morgul, & Ozbay, 2014). Accessibility/location (Qian & Ukkusuri, 2015), income, population, age, and number of jobs were strongly related to demand (Qian & Ukkusuri, 2015; Gonzales, Yang, Morgul, & Ozbay, 2014). Sun and McIntosh (2016) found that taxicab ridership is lowest in the early mornings and highest in the evenings. They also note that for short distances, trip time, distance, and fare are all linearly related; however, this relationship grows weaker at greater distances due to the influence of other potential factors.

There exists a paucity of research on tipping in other industries, such as hotel and transportation (Azar, 2007; Devaraj & Patel, 2017). Devaraj and Patel (2017) and Flath (2012) argue that the factors previously discussed have less influence in a taxi setting given the standardized nature of the taxicab service and the minimal interaction between the driver and the passengers. Instead, taxicab tips may be influenced by weather (Devaraj & Patel, 2017), race (Ayres et al., 2005), or wealth (Ayres et al., 2005).

2.2 Relationship to Current Project

In this context, given the lack of research on taxicab tipping, this project seeks to address the gap by identifying whether factors other than weather may affect the amount tipped by passengers who paid their fares by credit card. The variables provided in the NYC taxicab dataset mirror the information that drivers “interact” with: drivers cannot see wealth or know a passenger’s background, but they can make inferences (and subsequently guess the tip amount) based on perception and on pickup and dropoff locations (Ayres et al., 2005). Moreover, pickup and dropoff locations may provide some information on a passenger’s background, since they may be where the passengers live (Lee, Shin, & Park, 2008) or be indicative of people’s lifestyles (Lynn, 2006; Kwan, 1999; Neutens et al., 2012; van Ham & Tammaru, 2015).

3 Research Question

In this context, our research question is as follows: Using a subset of NYC Yellow Taxi data from June 2019, what factors play a significant role in the prediction of tip-fare ratio (also referred to as tip ratio) and tip amount paid via credit card?

4 Data

The source dataset used for the project was taken from the New York City Taxi and Limousine Commission (2019) website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The dataset was collected and provided to the agency by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).

The dataset is available for Yellow Taxi, Green Taxi and For-Hire Vehicle Records from 2009 to 2019. The total size of the complete dataset is over 10 GB. We have randomly sampled and selected 20,000 observations from the latest dataset, June 2019, from Yellow Taxi Records (New York City Taxi and Limousine Commission, 2019), due to hardware limitations. The dataset also comes with an associated data dictionary for download (New York City Taxi and Limousine Commission, 2018).

In the previous project, we loaded the data into R and cleaned (preprocessed) it so that it could be used for further analysis. The main cleaning steps were filtering out irrelevant columns and values and converting the relevant columns to their correct types. We also checked whether the data frame contained any NAs or duplicate values that might affect the final results.

Below are the summary statistics for the columns selected for analysis:

##  tip_fare_ratio     tip_amount     VendorID passenger_count
##  Min.   :0.1091   Min.   : 0.440   1:3893   Min.   :1.000  
##  1st Qu.:0.2274   1st Qu.: 1.865   2:7353   1st Qu.:1.000  
##  Median :0.2660   Median : 2.460            Median :1.000  
##  Mean   :0.2636   Mean   : 2.962            Mean   :1.607  
##  3rd Qu.:0.3083   3rd Qu.: 3.460            3rd Qu.:2.000  
##  Max.   :0.4283   Max.   :47.650            Max.   :6.000  
##                                                            
##  trip_distance     fare_amount     congestion_surcharge     Borough_pu   
##  Min.   : 0.030   Min.   :  3.00   Min.   :0.000        Bronx    :    1  
##  1st Qu.: 1.060   1st Qu.:  7.00   1st Qu.:2.500        Brooklyn :  104  
##  Median : 1.665   Median :  9.50   Median :2.500        Manhattan:10618  
##  Mean   : 2.523   Mean   : 11.81   Mean   :2.385        Queens   :  434  
##  3rd Qu.: 2.800   3rd Qu.: 14.00   3rd Qu.:2.500        Unknown  :   89  
##  Max.   :25.100   Max.   :238.00   Max.   :2.750                         
##                                                                          
##          Borough_do      pickup_period     drop_period   trip_duration  
##  Bronx        :   32   Afternoon:2716   Afternoon:2713   Min.   : 1.00  
##  Brooklyn     :  377   Evening  :3356   Evening  :3334   1st Qu.: 7.00  
##  EWR          :   12   Morning  :2895   Morning  :2797   Median :11.00  
##  Manhattan    :10335   Night    :2279   Night    :2402   Mean   :13.26  
##  Queens       :  384                                     3rd Qu.:18.00  
##  Staten Island:    1                                     Max.   :37.00  
##  Unknown      :  105

5 Exploratory Data Analysis

With the data cleaned, we began exploratory data analysis by looking for relationships between our dependent and independent variables.

5.1 Distribution

The tip-fare ratio is approximately normally distributed, so it can be used for analysis without transformation. There are somewhat fewer observations to the left of the mean than to the right.

# Histogram of tip_fare_ratio for the processed data frame, with a fitted normal curve overlaid
library(dplyr)
library(ggplot2)
df_sub %>%
  ggplot(aes(x = tip_fare_ratio)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "#66B2FF", binwidth = 0.01) +
  stat_function(fun = dnorm,
                args = list(mean = mean(df_sub$tip_fare_ratio), sd = sd(df_sub$tip_fare_ratio))) +
  ggtitle("Distribution of NYC taxi tip-fare ratio")

5.2 Hypothesis Testing

5.2.1 Trip distance

Comparing the tip-fare ratio of short-distance (less than 2.4 miles) and long-distance (greater than 2.4 miles) NYC yellow cab trips, the box plot below shows that short-distance passengers tip a larger share of their fare than passengers who travel longer distances within NYC.

Beyond the box plots, we also performed an independent two-sample t-test, a significance test that estimates whether a difference in means between two groups is the result of random variation or of specific characteristics of the groups.

Our hypotheses for this test are as follows:

Null Hypothesis (H_0): The average tip-fare ratio is the same for short- and long-distance passengers
Alternate Hypothesis (H_a): The average tip-fare ratio is NOT the same for short- and long-distance passengers

At a significance level of 0.05, we obtained a p-value very close to zero (2.22e-176). We therefore have enough evidence to reject the null hypothesis in favor of the alternate hypothesis: passengers travelling by yellow cab in NYC tip differently depending on the distance travelled.
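The comparison in this subsection can be sketched in R as follows. This is a minimal illustration on simulated data, not the original analysis: the data frame `sim`, the group sizes, and the group means are assumptions, and only the 2.4-mile cutoff comes from the text. Note that `t.test()` runs Welch's version of the two-sample test by default.

```r
# Simulated stand-in for the cleaned data (all values here are hypothetical)
set.seed(1)
sim <- data.frame(trip_distance = c(runif(500, 0.1, 2.4), runif(500, 2.4, 25)))
sim$tip_fare_ratio <- ifelse(sim$trip_distance < 2.4,
                             rnorm(1000, mean = 0.28, sd = 0.06),
                             rnorm(1000, mean = 0.23, sd = 0.06))

# Label each trip as short or long using the 2.4-mile cutoff from the text
sim$distance_group <- factor(ifelse(sim$trip_distance < 2.4, "short", "long"))

# Independent two-sample t-test (Welch's version by default)
tt <- t.test(tip_fare_ratio ~ distance_group, data = sim)
tt$p.value < 0.05   # TRUE means H_0 is rejected at the 0.05 level
```

A box plot of the same grouping, for example with `boxplot(tip_fare_ratio ~ distance_group, data = sim)`, gives the visual counterpart of the test.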

5.2.2 Trip duration

We hypothesize that there is a relationship between the duration of the trip and the tip-fare ratio. Trip duration is calculated by subtracting the pickup time from the drop-off time and is expressed in minutes.

For this test, our hypotheses were:

Null Hypothesis (H_0): The average tip-fare ratio is the same irrespective of trip duration
Alternate Hypothesis (H_a): The average tip-fare ratio is NOT the same across trip durations

The reported p-value is effectively 0, which is less than 0.05. At a significance level of 0.05, we therefore reject the null hypothesis in favor of the alternative hypothesis: passengers travelling by yellow cab in NYC tip differently depending on the duration of the trip.
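The trip-duration calculation described in this subsection might look like the sketch below. The two example trips are made up; the column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` are the ones used in the TLC Yellow Taxi files, but the exact code the original analysis used is not shown in the report, so treat this as an assumption.

```r
# Two hypothetical trips with pickup and drop-off timestamps
trips <- data.frame(
  tpep_pickup_datetime  = as.POSIXct(c("2019-06-01 08:00:00", "2019-06-01 17:30:00"), tz = "UTC"),
  tpep_dropoff_datetime = as.POSIXct(c("2019-06-01 08:12:00", "2019-06-01 18:05:00"), tz = "UTC")
)

# Trip duration in minutes: drop-off time minus pickup time
trips$trip_duration <- as.numeric(difftime(trips$tpep_dropoff_datetime,
                                           trips$tpep_pickup_datetime,
                                           units = "mins"))
trips$trip_duration
```

Passing `units = "mins"` explicitly avoids `difftime()` choosing a unit automatically based on the size of the differences.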

5.2.3 Number of passengers

We hypothesize that there is a relationship between the number of passengers in the car and the tip-fare ratio. We treat passenger count as a categorical variable with 6 levels, ranging from 1 to 6.

Our hypotheses are:

Null Hypothesis (H_0): The average tip-fare ratio is the same across all classes of passenger count
Alternate Hypothesis (H_a): The average tip-fare ratio is NOT the same across all classes of passenger count

The group means do not vary much across passenger counts. Nevertheless, the ANOVA gives a p-value of 0.00115, which is less than 0.05; hence, we reject the null hypothesis. This means that the tip-fare ratio does differ with the number of passengers in the car.

5.2.4 Vendor ID

Similarly, we checked the distribution by vendor ID and ran a t-test to test the hypothesis.

The p-value of 0.000175 is less than 0.05; hence, we reject the null hypothesis and conclude that the vendor does affect tip amount.

5.2.5 Location

For tip ratio and location, the ANOVA tests give p-values of 3.97e-20 for pickup location and 2.8e-25 for dropoff location. We reject the null hypothesis that the means are the same at a significance level of 0.05. The Tukey HSD test shows that the following pickup pairs differ significantly in tip ratio: Manhattan and Brooklyn, and Queens and Manhattan. For dropoff pairs, Manhattan and Bronx, Manhattan and Brooklyn, and Manhattan and Queens are significant.
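A one-way ANOVA followed by Tukey's HSD, as used for the location tests, can be sketched in base R as follows. The data are simulated and the group means are assumptions chosen only to make the example produce a visible effect; the column names follow the summary table in the Data section.

```r
# Simulated tip-fare ratios for three pickup boroughs (hypothetical values)
set.seed(2)
sim <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 200)),
  tip_fare_ratio = c(rnorm(200, 0.24, 0.06),
                     rnorm(200, 0.27, 0.06),
                     rnorm(200, 0.23, 0.06))
)

# One-way ANOVA: is the mean tip-fare ratio the same across boroughs?
fit <- aov(tip_fare_ratio ~ Borough_pu, data = sim)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey HSD: which pairs of boroughs differ?
tukey <- TukeyHSD(fit)
tukey$Borough_pu   # one row per pair, with an adjusted p-value in column "p adj"
```

ANOVA only says that at least one group mean differs; the Tukey step is what identifies the specific pairs while adjusting for multiple comparisons.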

From the previous research, we ran various tests to gain a basic understanding of the data, which can be summarized in this table:

Feature (variable)	Test	P-value	Null Hypothesis (H0)	Decision on H0
pickup location	ANOVA	3.97e-20	means are equal	reject H0
dropoff location	ANOVA	2.8e-25	means are equal	reject H0
distance	T-Test	2.22e-176	means are equal	reject H0
pickup time	ANOVA	1.45e-09	means are equal	reject H0
dropoff time	ANOVA	3.94e-10	means are equal	reject H0
passenger count	ANOVA	0.00115	means are equal	reject H0
vendor ID	T-test	0.000175	means are equal	reject H0

*based on a significance level of 0.05

It appears that the following variables affect the tip ratio for those who paid by credit card: pickup location, dropoff location, distance travelled in miles, time of day, and passenger count. Using these variables, we will try to predict the tip-fare ratio.

5.3 Correlation

Our next step was to look at correlations to see how the variables relate to each other. For this step, we used Pearson’s correlation coefficient, which indicates the extent to which two variables are linearly related. Here, the y-variable is the tip-fare ratio.

The results show that trip distance, fare amount, and trip duration are negatively (and weakly) correlated whereas passenger count and congestion surcharge are not correlated at all with tip-fare ratio.

6 Model Building

We first chose linear (univariate and multivariate) regression models to see how well they fit our data. Linear regression is a standard method that may do a good job of explaining the relationship with tip, so we decided to try it first. We also applied stepwise feature selection, lasso, ridge, elastic net, and principal component analysis (PCA) to improve our linear model. Finally, we used decision trees. One advantage of decision trees is their ability to approximate non-linear relationships. As our literature review suggests, for short distances, trip time, distance, and fare are all linearly related, yet this relationship grows weaker at greater distances due to the influence of other potential factors (Sun & McIntosh, 2016). Consequently, there may in fact be a non-linear relationship with tip, too.

6.1 Model Building Preparation

Before building our models, we prepared our data through one-hot encoding, creating training and testing sets, and scaling.

6.1.1 One hot encoding

To convert factor columns to numerical columns, we used one-hot encoding, which converts each level of a factor into a separate boolean column.

After one-hot encoding we get a total of 29 columns.
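As an illustration of what one-hot encoding does, the sketch below encodes a toy data frame with base R's `model.matrix()`. This is an assumption for illustration only: the column names in the report (for example `pickup_period_Night`) suggest a different encoding tool was used, and the toy values are made up.

```r
# Toy data frame with one numeric column and one factor column (hypothetical values)
toy <- data.frame(
  trip_distance = c(1.2, 3.4, 0.8, 5.0),
  pickup_period = factor(c("Morning", "Evening", "Night", "Morning"))
)

# "~ . - 1" drops the intercept, so the first factor gets one 0/1 column per level
encoded <- as.data.frame(model.matrix(~ . - 1, data = toy))
names(encoded)
```

With several factor columns, `model.matrix()` gives a full set of indicator columns only to the first factor and drops one level from each of the others, so a dedicated encoding helper is often more convenient when every level must get its own column.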

6.1.2 Test train split

To avoid leaking information from the training data into the test data, the train-test split should be performed before (most) data preparation steps, such as scaling. To simulate a train and test set, we randomly split the dataset into 80% train and 20% test.

6.1.3 Scaling variables

After splitting, we scaled the numerical variables in our train and test datasets because they are measured on very different scales. We first calculated the mean and standard deviation of each numerical column for comparison purposes.

All the variables have different means and standard deviation; thus, we scaled all the variables for our analysis to prevent variables with larger variance from inadvertently impacting the analysis.
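A minimal sketch of the split-then-scale order described in this section is given below. The toy data and the seed are assumptions, and so is the choice to scale the test set with the training set's means and standard deviations (a common way to keep test information out of the preprocessing; the report does not say which statistics were used).

```r
# Toy numeric data (hypothetical values)
set.seed(3)
toy <- data.frame(fare_amount = runif(100, 3, 60), trip_duration = runif(100, 1, 37))

# 80/20 random split
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Scaling parameters are estimated on the training set only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# The same parameters are then applied to both sets
train_scaled <- as.data.frame(scale(train, center = mu, scale = sigma))
test_scaled  <- as.data.frame(scale(test,  center = mu, scale = sigma))
```

After this step each training column has mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.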

6.2 Tip Ratio

Now that our EDA with tip-fare ratio as the dependent variable is complete, the next step is to apply a variety of regression-based algorithms and see how well they predict our target variable from the training data.

The model evaluation metrics used for all models are r-squared and mean absolute percentage error (MAPE). Mean absolute error (MAE) and mean squared error (MSE) are scale-dependent metrics. We wanted a scale-independent evaluation criterion; hence, we decided to use MAPE. MAPE is measured as follows:

\[ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]

where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of observations.
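Both evaluation metrics can be computed in a few lines of R. The helper names `mape()` and `r_squared()` and the three example values are hypothetical, introduced here only for illustration.

```r
# Mean absolute percentage error, in percent (assumes no actual value is zero)
mape <- function(actual, predicted) {
  100 * mean(abs((actual - predicted) / actual))
}

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up tip-fare ratios and predictions
actual    <- c(0.20, 0.25, 0.30)
predicted <- c(0.22, 0.25, 0.27)

mape(actual, predicted)       # about 6.67
r_squared(actual, predicted)  # about 0.74
```

Because MAPE divides by the actual value, it is undefined when an actual value is zero; that is not an issue for the tip-fare ratio here, whose minimum in the summary statistics is above zero.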

6.2.1 Linear regression

We built linear models of tip-fare ratio using the three variables most highly correlated with it: fare_amount (-0.31), trip_distance (-0.27), and trip_duration (-0.23).

We subsequently performed ANOVA tests on the three nested models.

The summary for three linear models is:

Fit	Feature (variable)	R^2	ANOVA P-value
1	lm(tip_fare_ratio ~ trip_duration)	0.0952	–
2	lm(tip_fare_ratio ~ trip_duration+fare_amount)	0.0967	0.000187
3	lm(tip_fare_ratio ~ trip_duration+fare_amount+trip_distance)	0.0967	0.755

At a significance level of 0.05, the p-value for the third fit (0.755) is greater than 0.05, which implies that adding trip_distance does not significantly improve on fit 2.
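The nested-model comparison summarized in the table can be reproduced in outline with `anova()`. The sketch below uses simulated data, so the coefficients, sample size, and resulting p-values are assumptions and will not match the table.

```r
# Simulated data in which only trip_duration drives the response (hypothetical)
set.seed(4)
n <- 500
sim <- data.frame(trip_duration = runif(n, 1, 37))
sim$fare_amount    <- 2.5 + 0.8 * sim$trip_duration + rnorm(n, sd = 2)
sim$trip_distance  <- 0.2 * sim$trip_duration + rnorm(n, sd = 0.5)
sim$tip_fare_ratio <- 0.30 - 0.002 * sim$trip_duration + rnorm(n, sd = 0.06)

# Three nested fits, each adding one predictor
fit1 <- lm(tip_fare_ratio ~ trip_duration, data = sim)
fit2 <- lm(tip_fare_ratio ~ trip_duration + fare_amount, data = sim)
fit3 <- lm(tip_fare_ratio ~ trip_duration + fare_amount + trip_distance, data = sim)

# Each row after the first tests whether the added predictor reduces the RSS significantly
cmp <- anova(fit1, fit2, fit3)
cmp
```

The first row has no p-value because it is the baseline model, which matches the dash in the table for fit 1.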

Removing outliers from the data did not improve the results for tip-fare ratio. For the linear model with the three most highly correlated variables, MAPE is 21.97 and r-squared is 0.108. For the same model fitted after outlier treatment, MAPE is 22.02 and r-squared is 0.106.

Since the results are not satisfactory, we perform stepwise feature selection.

6.2.2 Stepwise regression

We performed stepwise feature selection, using adjusted r-squared as the selection criterion.

# build stepwise model using regsubsets() from the leaps package
library(leaps)
mod_tip_ratio <- regsubsets(tip_fare_ratio ~ . - tip_amount, data = train, nvmax = 14, nbest = 1, method = "backward")

## Reordering variables and trying again:

# plot stepwise results using adj r2 as the model evaluation criterion
plot(mod_tip_ratio, scale = "adjr2", main = "Adjusted R^2")

The graph above shows that trip_duration, congestion_surcharge, and Borough_do_Unknown are the three features that give the maximum adjusted r-squared of 0.11, after which adjusted r-squared remains constant.

Below is the summary of a linear model fitted with these three selected features.

## 
## Call:
## lm(formula = tip_fare_ratio ~ trip_duration + congestion_surcharge + 
##     Borough_do_Unknown, data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.184485 -0.027109  0.006471  0.034159  0.215660 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.2635334  0.0006844 385.044  < 2e-16 ***
## trip_duration        -0.0207052  0.0006826 -30.335  < 2e-16 ***
## congestion_surcharge  0.0052224  0.0006961   7.503 6.85e-14 ***
## Borough_do_Unknown    0.0033330  0.0073110   0.456    0.648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06461 on 8994 degrees of freedom
## Multiple R-squared:  0.101,  Adjusted R-squared:  0.1007 
## F-statistic: 336.7 on 3 and 8994 DF,  p-value: < 2.2e-16

##        trip_duration congestion_surcharge   Borough_do_Unknown 
##             1.003949             1.044124             1.040229

Below is the graph for predicted versus actual values of tip-fare ratio:

We next tried to improve the model using ridge, lasso, elastic net, and PCA.

6.2.3 Ridge regression

Ordinary least squares (OLS) finds the coefficients that best fit the data, but it does not consider whether some independent variables are more important than others. For a given dataset, there is only one set of betas to be found: the one resulting in the lowest residual sum of squares (RSS).

The question then becomes “Is a model with the lowest RSS truly the best model?”. An OLS model becomes more complex as new variables are added, and it can be said that OLS provides the model with the lowest bias and the highest variance. OLS offers no way to adjust this trade-off, but we want a model with low bias as well as low variance. Ridge regression can fill this gap through what is referred to as regularization. Ridge regression penalizes the size of the coefficients, so that those contributing least to the estimation “shrink” the fastest. We can tune the lambda parameter (the penalizing factor) to control how strongly the coefficients are shrunk.

Because ridge regression shrinks coefficients by penalizing them, the features were scaled so that the “starting conditions” are fair. Next, we iterated through a range of lambda values. The plot below shows the shrinking of the coefficients. Only the five attributes with the largest coefficient values are labeled for better visualization.
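To make the shrinkage effect concrete, the sketch below computes ridge coefficients from the closed-form solution in base R for a few lambda values. It is an illustration on simulated data, not the procedure used in the report: the helper `ridge_coef()`, the data, and the lambda grid are assumptions, and packaged implementations scale lambda differently, so these values are not comparable to the lambda reported in this section.

```r
# Simulated standardized predictors and a centered response (hypothetical)
set.seed(5)
n <- 200
X <- scale(matrix(rnorm(n * 3), ncol = 3))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)   # centering the response removes the need for an intercept

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Coefficients for increasing penalties; lambda = 0 reproduces OLS
lambdas <- c(0, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
round(coefs, 3)
```

Reading the columns from left to right shows every coefficient moving toward zero as lambda grows, without any of them becoming exactly zero.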

To choose the best lambda, we consult the MSE versus lambda plot shown below. The best lambda value in our case turns out to be 0.0021026.

However, the r-squared (i.e., the proportion of variance that our model accounts for) is only 12%, which means there is a lot of room for improvement.

6.2.4 Lasso regression

Lasso is another regularization technique (L1) that can lead to zeroed-out coefficients (in other words, some of the features are completely ignored when computing the output). Thus, lasso regression not only helps reduce overfitting but can also help with feature selection.

Again, we iterated through a range of lambda values. The plot below shows how lasso sets the coefficients of irrelevant attributes to 0. Only the five attributes with the largest coefficient values are labeled for better visualization.

To choose the best lambda, we looked at the MSE versus lambda plot given below. The best lambda value in our case is 0.0021026.

As before, the r-squared is only 12% (as shown in the graph below), which means there is a lot of room for improvement.

6.2.5 Elastic net

The last penalized regression model that we used for our problem is elastic net, which linearly combines the L1 and L2 penalties of the lasso and ridge methods. Unfortunately, it also did not improve the r-squared value.

6.2.6 Principal component analysis

As most of our variables were correlated with each other, and there were 27 features, we decided to use PCA as a variable reduction technique.

Around 92% of the variance is explained by first 7 components (shown below).

We then ran a simple linear regression model on those components to check if this increased the variability explained as indicated by r-squared.

## 
## Call:
## lm(formula = tip_fare_ratio ~ ., data = pca_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.180186 -0.027542  0.004196  0.037583  0.246940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.636e-01  6.831e-04 385.840  < 2e-16 ***
## PC1          1.207e-02  4.182e-04  28.849  < 2e-16 ***
## PC2         -1.443e-03  6.738e-04  -2.141   0.0323 *  
## PC3         -1.745e-05  6.866e-04  -0.025   0.9797    
## PC4          5.026e-03  9.215e-04   5.455 5.03e-08 ***
## PC5         -4.787e-03  9.709e-04  -4.931 8.34e-07 ***
## PC6         -7.103e-03  1.008e-03  -7.049 1.94e-12 ***
## PC7          4.392e-03  1.049e-03   4.187 2.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0648 on 8990 degrees of freedom
## Multiple R-squared:  0.09631,    Adjusted R-squared:  0.09561 
## F-statistic: 136.9 on 7 and 8990 DF,  p-value: < 2.2e-16

As seen from the model summary, all components are significant except component 3. The residuals are not symmetric around the median. The r-squared (0.0963) and adjusted r-squared (0.0956) did not improve. Thus, even if the components explain around 90% of the variance in the data, that does not necessarily mean that a good r-squared or large coefficients will result. We also evaluated the model using MAPE, which is around 22%; there is no significant improvement over the previous results.

The above graph shows predicted versus actual tip-fare ratio. It is evident that the predicted values vary significantly from the actual values.
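The PCA-then-regression workflow in this subsection can be sketched with base R's `prcomp()`. The data are simulated, and the 90% cumulative-variance cutoff used to pick the number of components is an assumption for illustration (the report keeps the first 7 components).

```r
# Simulated predictors, two of which are strongly correlated (hypothetical)
set.seed(6)
n <- 300
z <- rnorm(n)
X <- data.frame(a = z + rnorm(n, sd = 0.3), b = z + rnorm(n, sd = 0.3),
                c = rnorm(n), d = rnorm(n))
y <- 0.26 + 0.01 * X$a + rnorm(n, sd = 0.06)

# PCA on centered and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.90)[1]   # smallest number of components reaching 90%

# Regress the response on the first k component scores
pca_train <- data.frame(tip_fare_ratio = y, pca$x[, 1:k, drop = FALSE])
fit <- lm(tip_fare_ratio ~ ., data = pca_train)
summary(fit)$r.squared
```

The components are chosen without looking at the response, which is why a set of components can capture most of the predictor variance and still explain little of the variation in the tip-fare ratio.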

6.2.7 Decision tree

Finally, we decided to grow a decision tree for our dataset.

To grow our decision tree, the minimum within node deviance (essentially the minimum deviance for a node to be split) was set to 0.001.

Here is the decision tree for tip ratio:

One disadvantage of decision trees is their potential to overfit the training data. While the fit may be strong for the training data, overfitting results in the model not fitting the test set particularly well. As such, it is important to prune back the tree by reducing the number of leaves (or terminal nodes).

Here is the best pruned decision tree with only 5 leaves, down from our original of 14 leaves.

Another way to prune back the leaves is to use cross-validation. Cross-validation is where the data is broken into multiple sets (or folds). Generally, the folds are divided to train and test the model, with the goal of generalizing the results.

Here is the pruned tree by 10-fold CV:

The accuracy of each can be summarized in this table:

Variable	Original Tree	Pruned Tree	Cross-Validated Tree
variables used	fare_amount, pickup_period_Night, pickup_period_Evening, congestion_surcharge, drop_period_Evening, trip_distance	fare_amount	fare_amount
leaves	14	5	4
in-sample MSE	0.0039953	0.004061	0.0040748
out-of-sample MSE	0.0038471	0.0038912	0.0039024
out-of-sample R2	0.1516613	0.1417781	0.1393065
MAPE	21.4091472	21.7536222	21.7401352

As expected, the original decision tree is the best of the three, given that it has the lowest MAPE and highest r-squared. This is because the tree was not pruned back to prevent overfitting, so it fits the data the best (out of the three). The pruned tree has a higher MAPE than the cross-validated tree, but the pruned tree has a higher r-squared. All three models, however, do not differ by much when looking at MAPE and r-squared. This indicates that there may not be much difference between the models.

Fare amount appears to be the main predictor of tip ratio. In the three models, the first node split was fare amount, suggesting that there is a difference in tip ratio according to whether a passenger’s fare amount is high or low. This notion is further solidified by the fact that the tip ratio terminal node values descend from left to right. This means that the lower the fare amount, the higher proportion of tip is paid to the driver. In this context, passengers with higher fare amounts tend to tip less proportional to their fare amount.

Although the models have roughly 79% accuracy (taking accuracy as 100% minus MAPE), they explain only roughly 13-15% of the variation in tip ratio. This suggests that the models are not comprehensive.

6.2.8 Summary and analysis

The best model, when looking for a low MAPE and a high r-squared, is the unpruned decision tree. It is important to note that, across the various models, the MAPE value is mostly the same at around 21 or 22, which indicates 78-79% accuracy. The r-squared values for all the models, however, are extremely low, explaining anywhere from 7% to 15% of the variation in our dependent variable. Consequently, this indicates that the models are neither comprehensive nor good, reliable fits.

This also suggests that a null model may be the “best fit” for tip ratio. In this case, tip ratio remains constant despite changes in the independent variables. This may make sense because based on our literature review, tips in the service industry generally hover around 15%; the tips for taxis may also be hovering around a certain percentage.

6.3 Tip Amount

From the analysis shown above, it appears that either there is no relationship (so the null model works just fine and the tip-to-fare ratio is roughly constant overall), or we do not have the predictors needed to capture the relationship. Thus, we may need to redefine our dependent variable. We replaced tip-fare ratio with tip amount as the dependent variable in our models to see whether it makes any difference to our results.

6.3.1 Distribution

The tip amount has a right-skewed distribution.

6.3.2 Correlation

##                              [,1]
## tip_amount            1.000000000
## passenger_count      -0.007277463
## trip_distance         0.826695026
## fare_amount           0.909907704
## congestion_surcharge -0.148518053
## trip_duration         0.712704191

This time, we get stronger, positive correlation coefficients between tip amount and trip distance, fare amount, and trip duration. Passenger count, however, is essentially uncorrelated with tip amount (-0.007), and congestion surcharge is only weakly negatively correlated (-0.15).

The same relationships can be confirmed visually through the scatter plots below, which show the correlation and regression line for each numerical variable:

# Plot correlations and regression line for each variable
# (corr_eqn() is a helper not shown in this section; ggarrange() and rremove() come from ggpubr)
Q1 <- ggplot(df_sub, aes(x=trip_distance, y=tip_amount)) +
  geom_point(color = "blue")+
  geom_smooth(method=lm, color = "black") +
  labs(title="Variation in trip distance and tip amount",
       x="Trip distance", y = "Tip amount") +
  geom_text(x = 23, y = 40, label = corr_eqn(df_sub$trip_distance,
                             df_sub$tip_amount), parse = TRUE)

Q2 <- ggplot(df_sub, aes(x=trip_duration, y=tip_amount)) +
  geom_point(color = "red")+
  geom_smooth(method=lm) +
  labs(title="Variation in trip duration and tip amount",
       x="Trip duration", y = "Tip amount")+
  geom_text(x = 30, y = 40, label = corr_eqn(df_sub$trip_duration,
                             df_sub$tip_amount), parse = TRUE)


Q3 <- ggplot(df_sub, aes(x=congestion_surcharge, y=tip_amount)) +
  geom_point(color = "red")+
  geom_smooth(method=lm) +
  labs(title="Variation in congestion surcharge and tip amount",
       x="Congestion surcharge", y = "Tip amount")+
  geom_text(x = 2, y = 40, label = corr_eqn(df_sub$congestion_surcharge,
                             df_sub$tip_amount), parse = TRUE)

Q4 <- ggplot(df_sub, aes(x=passenger_count, y=tip_amount)) +
  geom_point(color = "red")+
  geom_smooth(method=lm) +
  labs(title="Variation in passenger count and tip amount",
       x="Passenger count", y = "Tip amount")+
  geom_text(x = 2, y = 40, label = corr_eqn(df_sub$passenger_count,
                             df_sub$tip_amount), parse = TRUE)
ggarrange(Q1, Q2, Q3,Q4 + rremove("x.text"),
          ncol = 2, nrow = 2)

6.3.3 Linear regression

We built linear models of tip amount using the three variables most highly correlated with tip_amount: fare_amount (0.91), trip_distance (0.83), and trip_duration (0.71).

We then performed ANOVA tests on the three nested models.

Here’s a summary of the results:

Fit	Feature (variable)	R^2	ANOVA P-value
1	lm(tip_amount ~ fare_amount)	0.826	–
2	lm(tip_amount ~ fare_amount+trip_distance)	0.827	5.38e-08
3	lm(tip_amount ~ fare_amount+trip_distance+trip_duration)	0.827	0.0503

With a significance level of 0.05, the p-value for the third fit (0.0503) is marginally greater than 0.05, which means that adding trip_duration does not significantly improve on the second fit.

Below is the graph for predicted versus actual values of tip_amount:

The results are good, with an r-squared of about 0.83. To check whether any other combination of features gives better results, we performed stepwise regression.

6.3.4 Stepwise regression

We performed backward stepwise regression, and variables were selected based on adjusted r-squared:

## Reordering variables and trying again:

From the graph above, we see that adjusted r-squared remains constant throughout. Hence, we select fare_amount, congestion_surcharge, and Borough_do_Bronx as our first three features, which give the maximum adjusted r-squared of 0.83.

Below is the summary from lm with the first 3 features selected with feature selection:

## 
## Call:
## lm(formula = tip_amount ~ fare_amount + congestion_surcharge + 
##     Borough_do_Bronx, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1211 -0.2287  0.0915  0.2216  8.8039 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.952390   0.008418 350.708  < 2e-16 ***
## fare_amount          1.756174   0.008609 203.982  < 2e-16 ***
## congestion_surcharge 0.065920   0.008597   7.668 1.93e-14 ***
## Borough_do_Bronx     0.006297   0.160542   0.039    0.969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7974 on 8994 degrees of freedom
## Multiple R-squared:  0.8271, Adjusted R-squared:  0.8271 
## F-statistic: 1.435e+04 on 3 and 8994 DF,  p-value: < 2.2e-16

Below is the graph for predicted versus actual values of tip_amount:

To try to improve the model further, we fit ridge, lasso, and elastic net regressions, followed by a regression on principal components.

6.3.5 Ridge

So far, replacing our dependent variable with tip amount gives a good r-squared. We then test whether a ridge model gives comparable results. We perform all the steps in the same manner as we did earlier for tip-fare ratio and visualize the coefficients below. Each curve corresponds to a variable and shows the path of its coefficient against the L2-norm of the whole coefficient vector as λ varies. The axis above indicates the number of nonzero coefficients at the current λ.

Then, we plot the cross-validation curve (red dotted line below), and upper and lower standard deviation curves along the λ sequence (error bars). Two selected λ’s are indicated by the vertical dotted lines.

Finally, after training and testing, the ridge model performs much better than before: R^2 is 0.8186755 for tip amount, compared with only about 0.12 for tip-fare ratio.
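To make the shrinkage concrete, here is a base-R sketch of the ridge estimator in closed form. It illustrates the technique only and is not the fitting routine used in the report; the data are synthetic, ridge_coef() is a hypothetical helper, and packaged implementations may scale λ differently.

```r
# Synthetic data (invented for illustration): two correlated predictors
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
y  <- 1.5 * x1 + 0.5 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2))   # standardize predictors
yc <- y - mean(y)            # center the response so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 10, 100, 1000)
coefs   <- sapply(lambdas, function(l) ridge_coef(X, yc, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# The L2 norm of the coefficient vector shrinks as lambda grows,
# but the coefficients never become exactly zero
l2_norm <- sqrt(colSums(coefs^2))
print(round(l2_norm, 3))
```

With lambda = 0 the estimate coincides with ordinary least squares, which is why the coefficient paths start at the unpenalized fit.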

6.3.6 Lasso

The second penalized regression model used is lasso. We first plot the coefficients. Each curve corresponds to a variable and shows the path of its coefficient against the L1-norm of the whole coefficient vector as λ varies. The axis above indicates the number of nonzero coefficients at the current λ. In this case, coefficients are driven exactly to zero rather than merely shrunk.

Then we plot the MSE cross-validation curve (red dotted line below), and upper and lower standard deviation curves along the λ sequence (error bars). Two selected λ’s are indicated by the vertical dotted lines.

Finally, after training and testing, the lasso model also performs much better than before: R^2 is 0.8299322 for tip amount, compared with only about 0.12 for tip-fare ratio.

6.3.7 Elastic net

The elastic-net penalty is controlled by α, and bridges the gap between lasso (α=1, the default) and ridge (α=0). The elastic-net penalty mixes these two; if predictors are correlated in groups, an α=0.5 tends to select the groups in or out together. We plot the coefficients against the log-lambda value for each feature in our set.
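For reference, the elastic-net objective for a Gaussian response is commonly written as below. The symbols (N observations, intercept \beta_0, coefficient vector \beta, predictors x_i, response y_i) are notation introduced here, not taken from the report.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting α = 1 leaves only the L1 term (lasso), and α = 0 leaves only the squared L2 term (ridge); intermediate values mix the two.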

When we plot against percent deviance, we get a very different picture. This is percent deviance explained on the training data. Here, toward the end of the path, this value is not changing much, but the coefficients are “blowing up” a bit. This lets us focus attention on the parts of the fit that matter.

We perform all the steps in exactly the same manner as we did earlier for tip ratio. The results for the elastic net model are a clear improvement: R^2 is 0.8299175 for tip amount, compared with only about 0.12 before.

6.3.8 Principal component analysis

The principal components that we created earlier are now used to model tip amount. We built a linear regression model using 7 principal components.

## 
## Call:
## lm(formula = tip_amount ~ ., data = pca_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8697 -0.3149  0.0868  0.3081 28.1832 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  2.952407   0.009560  308.840  < 2e-16 ***
## PC1         -1.027057   0.005853 -175.471  < 2e-16 ***
## PC2          0.026404   0.009429    2.800  0.00512 ** 
## PC3          0.184681   0.009608   19.221  < 2e-16 ***
## PC4          0.025901   0.012895    2.009  0.04462 *  
## PC5         -0.087060   0.013587   -6.408 1.55e-10 ***
## PC6         -0.066462   0.014102   -4.713 2.48e-06 ***
## PC7          0.046735   0.014679    3.184  0.00146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9068 on 8990 degrees of freedom
## Multiple R-squared:  0.7766, Adjusted R-squared:  0.7764 
## F-statistic:  4464 on 7 and 8990 DF,  p-value: < 2.2e-16

As seen from the model summary, all seven components are significant at the 0.05 level, although component 4 is only marginally so (p = 0.045). The residuals are not symmetric around the median value. The r-squared (0.7765614) and adjusted r-squared (0.7763874) are good, as with the other models, though somewhat lower. We evaluated the model using MAPE, which is around 22%; there is no significant improvement over the previous results.

The graph above shows predicted tip amount versus actual tip amount. Predicted values do not deviate much from the actual values.
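A minimal sketch of regression on principal components in base R, using prcomp() and lm(). The data are synthetic (names mirror the report's variables, values are invented), and the choice of k = 2 components and the 450/150 split are assumptions made only to keep the example small.

```r
# Synthetic data (invented for illustration): correlated trip features
set.seed(4)
n <- 600
distance <- runif(n, 0.5, 15)
feats <- data.frame(
  trip_distance = distance,
  trip_duration = 4 * distance + rnorm(n, sd = 4),
  fare_amount   = 2.5 + 2.5 * distance + rnorm(n, sd = 2),
  noise         = rnorm(n)
)
tip_amount <- 2 + 0.2 * feats$fare_amount + rnorm(n, sd = 0.4)

train_idx <- 1:450
test_idx  <- 451:n

# Fit PCA on the training predictors only, then project both splits
pca <- prcomp(feats[train_idx, ], center = TRUE, scale. = TRUE)
k   <- 2  # number of components kept (the report keeps 7)
pca_train <- data.frame(tip_amount = tip_amount[train_idx], pca$x[, 1:k])
pca_test  <- data.frame(predict(pca, newdata = feats[test_idx, ])[, 1:k])

# Regress tip amount on the retained components
fit  <- lm(tip_amount ~ ., data = pca_train)
pred <- predict(fit, newdata = pca_test)

# Mean absolute percentage error on the held-out rows
mape <- mean(abs((tip_amount[test_idx] - pred) / tip_amount[test_idx])) * 100
print(summary(fit)$r.squared)
print(mape)
```

Fitting the rotation on the training rows alone keeps the held-out rows from influencing the components they are later scored on.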

6.3.9 Decision tree

Here is the decision tree for tip amount:

We then prune the tree to 5 leaves.

Finally, we prune the tree to 5 leaves using cross-validation.

The accuracy of each can be summarized in this table:

Variable	Original Tree	Pruned Tree	Cross-Validated Tree
variables used	fare_amount, trip_distance, congestion_surcharge, Borough_do_Unknown	fare_amount	fare_amount
leaves	18	5	5
in-sample MSE	0.7077294	0.9935942	0.9935942
out-of-sample MSE	0.681556	0.9550914	0.9550914
out-of-sample R^2	0.8280845	0.7590768	0.7590768
MAPE	22.0991441	28.1193944	28.1193944

As with tip ratio, the unpruned, original decision tree is the best model of the three, with the lowest MAPE and the highest r-squared. The other two (pruned and cross-validated) are worse than the original decision tree and identical to each other. Regardless, the r-squared values of these three models indicate that the decision tree models can explain anywhere from 75% to 83% of the variation in tip amount. These numbers are fairly high, which indicates that the models are reasonably reliable.

In addition, fare amount appears to be the main predictor of tip amount. In all three models, the first node split was on fare amount, suggesting that the tipped amount differs according to whether a passenger’s fare amount is high or low. The terminal leaves ascend from left to right, which means that the higher the fare amount, the higher the tip paid to the driver.

6.3.10 Summary and analysis

The best model in our analysis for tip amount is the elastic net regularized regression model. It had the lowest MAPE value and the highest r-squared, explaining roughly 84% of the variation in the dependent variable. The models are nonetheless close to one another: with the exception of the pruned and cross-validated decision trees, the MAPE values are around 21 to 22, indicating an accuracy of 78-79%, and the r-squared values range from 0.79 to 0.84. The high r-squared values suggest that the models are good and reliable.

In short, most of the models produce nearly the same results, and the improvement from one to another is minimal. Thus, it can be argued that the relationship with tip amount is quite stable.
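The two metrics used throughout can be computed as in the sketch below. The vectors are illustrative values, not taken from the taxi data, and "accuracy" follows the text's convention of 100 minus MAPE (so a MAPE of 21 to 22 corresponds to 78-79%).

```r
# Illustrative values only (not taken from the taxi data)
actual    <- c(2.00, 3.50, 1.00, 5.00, 4.00)
predicted <- c(2.50, 3.00, 1.20, 4.50, 4.40)

# Mean absolute percentage error, in percent
mape <- mean(abs((actual - predicted) / actual)) * 100

# "Accuracy" as used in the text: 100 minus MAPE
accuracy <- 100 - mape

# Out-of-sample R^2: 1 - SSE / SST
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

print(c(mape = mape, accuracy = accuracy, r2 = r2))
```

Note that MAPE is undefined when an actual value is zero, so it is only meaningful here because the analysis is restricted to trips with a recorded tip.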

7 Conclusion

In summary, we initially chose a dependent variable (tip-fare ratio) that our predictors described poorly, so the models failed to predict it well. This meant that either we did not have the predictors needed to capture the relationship or we needed to redefine our dependent variable. We therefore replaced tip-fare ratio with tip amount as the dependent variable and obtained significantly better results.

With the variables available (distance, duration, congestion surcharge, location), we could not explain any relationship with tip-fare ratio. That means tip-fare ratio is roughly a constant value: there were no significant predictors in the data to explain variation in it. On the other hand, tip amount has a very strong relationship with fare amount, distance, and duration. These results are not contradictory but rather complement each other. Tip amount increasing with fare amount means that the tip as a percentage of the fare amount is roughly the same. Our analysis, however, does suggest that there may be other variables (that we did not analyze) that affect the tip amount and tip ratio.

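library(formattable)

# Sort results by dependent variable, then by ascending R-squared
results_df <- results_df[order(results_df$dependent, results_df$Rsquare), ]
rownames(results_df) <- 1:nrow(results_df)

# Render the comparison table: one alignment entry per column,
# with the technique name styled in bold grey
formattable(results_df,
            align = c("l", "c", "c", "c"),
            list(technique = formatter(
              "span", style = ~ style(color = "grey", font.weight = "bold"))
              # `Rsquared` = color_bar("pink")
))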

technique	dependent	mape	Rsquare
PCR	tip_fare_ratio	21.95279	0.09630752
Linear-treated outlier	tip_fare_ratio	22.01764	0.10567551
Linear(3 vars with best cor-coeffs)	tip_fare_ratio	21.97484	0.10757866
Linear-Stepwise	tip_fare_ratio	21.85252	0.11401387
Ridge	tip_fare_ratio	21.68116	0.12132190
Lasso	tip_fare_ratio	21.67026	0.12231857
Decision Tree (CV)	tip_fare_ratio	21.74014	0.13930652
Decision Tree (Prune)	tip_fare_ratio	21.75362	0.14177812
Decision Tree	tip_fare_ratio	21.40915	0.15166127
Decision Tree (Prune)	tip_amount	28.11939	0.75907684
Decision Tree (CV)	tip_amount	28.11939	0.75907684
PCR	tip_amount	22.56076	0.79578495
Decision Tree	tip_amount	22.09914	0.82808449
Ridge	tip_amount	21.17501	0.83085809
Linear(3 vars with best cor-coeffs)	tip_amount	21.64672	0.83596548
Linear-Stepwise	tip_amount	21.14823	0.83614911
Lasso	tip_amount	21.15167	0.83776544
Elastic Net	tip_amount	21.14815	0.83776887

7.1 Project Limitations and Future Work

Our limitations remain the same as in the previous project. We ran into hardware limitations given the large amount of data available for processing. The data was over 10 GB, which forced us to take a small sample of the dataset for processing. Moreover, some of the data may have been entered incorrectly (e.g., a passenger count greater than 6, which is illegal in NYC). In addition, our analysis only focused on those who paid tips by credit card, as cash tips were not recorded. Thus, our analysis does not account for those who choose to pay with cash.

Our literature review suggests that weather may play an important role in tip amount, so future work on refining the model can utilize weather data. Our literature review also indicates that location may be a proxy for race, which affects tip amount. Combining demographic information with location and incorporating both into the model may provide new information. Likewise, additional information on the driver (including rating, gender, and race) may or may not help refine our model. Given the highly fluid nature of passenger-driver interactions and the taxicab experience, and the inability to quantify some of these instances, conducting empirical experiments may be the only effective way to refine the model.

8 References

Ayres, I., Vars, F. E., & Zakariya, N. (2005). To insure prejudice: Racial disparities in taxicab tipping. Yale Law Journal, 114, 1613–1674. doi: 10.2139/ssrn.401201

Azar, O. H. (2007). The social norm of tipping: A review. Journal of Applied Social Psychology, 37(2), 380–402. doi: 10.1111/j.0021-9029.2007.00165.x

Azar, O. H. (2010). Tipping motivations and behavior in the U.S. and Israel. Journal of Applied Social Psychology, 40(2), 421–457. doi: 10.1111/j.1559-1816.2009.00581.x

Devaraj, S., & Patel, P. C. (2017). Taxicab tipping and sunlight. PLoS One, 12(6). doi: 10.1371/journal.pone.0179193

Ferreira, N., Poco, J., Vo, H. T., Freire, J., & Silva, C. T. (2013). Visual exploration of big spatio-temporal urban data: A study of New York City taxi trips. IEEE Transactions on Visualization and Computer Graphics, 19(12), 2149–2158. doi: 10.1109/TVCG.2013.226

Flath, D. (2012). Why do we tip taxicab drivers? Japanese Economy, 39(3), 69–76. doi: 10.2753/JES1097-203X390304

Flynn, S. M., & Greenberg, A. E. (2012). Does weather actually affect tipping? An empirical analysis of time-series data. Journal of Applied Social Psychology, 42(3), 702–716. doi: 10.2139/ssrn.1617465

Gonzales, E. J., Yang, C., Morgul, E. F., & Ozbay, K. (2014). Modeling taxi demand with GPS data from taxis and transit. Mineta National Transit Research Consortium. Retrieved from https://transweb.sjsu.edu/sites/default/files/1141-modeling-taxi-demand-gps-transit-data.pdf.

Harris, M. B. (1995). Waiters, customers, and service: Some tips about tipping. Journal of Applied Social Psychology, 25(8), 725–744. doi: 10.1111/j.1559-1816.1995.tb01771.x

Kwan, M. (1999). Gender and individual access to urban opportunities: A study using space–time measures. The Professional Geographer, 51, 211–227. doi: 10.1111/0033-0124.00158

Lee, J., Shin, I., & Park, G. (2008). Analysis of the passenger pick-up pattern for taxi location recommendation. 2008 Fourth International Conference on Networked Computing and Advanced Information Management. doi: 10.1109/NCM.2008.24

Lynn, M. (2001). Restaurant tipping and service quality: A tenuous relationship. The Cornell Hotel and Restaurant Administration Quarterly, 42(1), 14–20. doi: 10.1016/s0010-8804(01)90006-0

Lynn, M. (2006). Tipping in restaurants and around the globe: An interdisciplinary review. Handbook of Contemporary Behavioral Economics: Foundations and Developments. Retrieved from https://ssrn.com/abstract=465942.

New York City Taxi and Limousine Commission. (n.d.). Passenger frequently asked questions. Retrieved October 14, 2019, from https://www1.nyc.gov/site/tlc/passengers/passenger-frequently-asked-questions.page.

New York City Taxi and Limousine Commission. (2018, May 1). Data dictionary - Yellow taxi trip records. Retrieved October 14, 2019, from https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf.

New York City Taxi and Limousine Commission. (2019). Yellow taxi trip records, June 2019 [Data set]. Retrieved from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

Neutens, T., Delafontaine, M., Scott, D. M., & De Maeyer, P. (2012). An analysis of day-to-day variations in individual space-time accessibility. Journal of Transport Geography, 23, 81–91. doi: 10.1016/j.jtrangeo.2012.04.001

Qian, X., & Ukkusuri, S. (2015). Spatial variation of the urban taxi ridership using GPS data. Applied Geography, 59, 31–42. doi: 10.1016/j.apgeog.2015.02.011

Rivoli, D., & Jorgensen, J. (2018, July 31). City vows to crack down on taxis refusing service as it also looks to cap Uber. Retrieved October 14, 2019, from https://www.nydailynews.com/news/politics/ny-pol-tax-uber-refuse-service-20180731-story.html.

Sun, H., & McIntosh, S. (2016). Big data mobile services for New York City taxi riders and drivers. 2016 IEEE International Conference on Mobile Services, 57–64. doi: 10.1109/MoSb.2S0e1r6v.128016.19

van Ham, M., & Tammaru, T. (2015). New perspectives on ethnic segregation over time and space. A domains approach. Urban Geography, 37(7), 953–962. doi: 10.1080/02723638.2016.1142152

Zhan, X., Hasan, S., Ukkusuri, S., & Kamga, C. (2013). Urban link travel time estimation using large-scale taxi data with partial information. Transportation Research Part C, 33, 37–49. doi: 10.1016/j.trc.2013.04.001

DATS6101_FTT_Taxi_Analysis-II_WriteUp

Steven Chao, Tanaya Kavathekar, Madhuri Yadav, Amna Gul

2021-07-30