1 Exordium

One of the most famous images of New York is the wave of yellow taxis flooding the streets. So, where better to study taxi cab data than New York City? That is exactly what we intended to do. From 2009 until the present, the NYC Taxi and Limousine Commission (TLC) has gathered massive amounts of data for every taxi trip in New York City. We set out to get our hands dirty and put the sophisticated analyses we learnt over the semester to work.
We wanted to see how parameters like pick-up location, distance, number of passengers, and drop-off location impact the tipping behavior of NYC taxi riders.

2 Data Preparation

2.1 Data Gathering

## 'data.frame':    3558124 obs. of  19 variables:
##  $ VendorID             : int  1 1 2 1 1 2 2 1 2 2 ...
##  $ tpep_pickup_datetime : POSIXct, format: "2022-05-31 20:25:41" "2022-05-31 20:44:40" ...
##  $ tpep_dropoff_datetime: POSIXct, format: "2022-05-31 20:48:22" "2022-05-31 21:01:48" ...
##  $ passenger_count      : num  1 1 1 2 0 1 1 1 1 1 ...
##  $ trip_distance        : num  11 4.2 9.49 12.1 1.8 2.02 8.08 4.3 8.78 1.76 ...
##  $ RatecodeID           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : chr  "N" "N" "N" "N" ...
##  $ PULocationID         : int  70 170 264 132 140 148 158 246 197 48 ...
##  $ DOLocationID         : int  48 226 113 17 163 158 116 262 191 186 ...
##  $ payment_type         : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ fare_amount          : num  32 14 26 37 9 9 26.5 15 26.5 7.5 ...
##  $ extra                : num  3 3 0.5 1.75 3 0.5 0.5 3 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  2 0 5 0 2.55 0.64 7.58 3.75 5.56 2.26 ...
##  $ tolls_amount         : num  6.55 0 6.55 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  44.4 17.8 42.6 39.5 15.3 ...
##  $ congestion_surcharge : num  2.5 2.5 2.5 0 2.5 2.5 2.5 2.5 0 2.5 ...
##  $ airport_fee          : num  0 0 1.25 1.25 0 0 0 0 0 0 ...
Comments: At first glance, there are 3558124 observations across 19 variables (67604356 data points in total), of which 7 are categorical and 12 are numerical. The data was procured from the NYC Open Source GIS website - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

2.2 Data Descriptors

Field Descriptions
Field.Name Description
VendorID A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc. 
tpep_pickup_datetime The date and time when the meter was engaged.
tpep_dropoff_datetime The date and time when the meter was disengaged.
Passenger_count The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance The elapsed trip distance in miles reported by the taximeter.
PULocationID TLC Taxi Zone in which the taximeter was engaged
DOLocationID TLC Taxi Zone in which the taximeter was disengaged
RateCodeID The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip
Payment_type A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
Fare_amount The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount Total amount of all tolls paid in trip.
Total_amount The total amount charged to passengers. Does not include cash tips.
Congestion_Surcharge Total amount collected in trip for NYS congestion surcharge.
Airport_fee $1.25 for pick up only at LaGuardia and John F. Kennedy Airports
Comments: There are 19 variables in total, but not all are used in our analysis; we will shortly remove the irrelevant columns. Some major columns vital for this analysis are trip distance, trip duration, fare amount, tip amount, passenger count, and vendor ID.
Zones
LocationID Borough Zone service_zone
1 EWR Newark Airport EWR
2 Queens Jamaica Bay Boro Zone
3 Bronx Allerton/Pelham Gardens Boro Zone
4 Manhattan Alphabet City Yellow Zone
5 Staten Island Arden Heights Boro Zone
6 Staten Island Arrochar/Fort Wadsworth Boro Zone
Comments: For this analysis, NYC has been divided into 6 boroughs and 261 distinct zones.

2.3 Data Statistics

## 'data.frame':    265 obs. of  4 variables:
##  $ LocationID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Borough     : chr  "EWR" "Queens" "Bronx" "Manhattan" ...
##  $ Zone        : chr  "Newark Airport" "Jamaica Bay" "Allerton/Pelham Gardens" "Alphabet City" ...
##  $ service_zone: chr  "EWR" "Boro Zone" "Boro Zone" "Yellow Zone" ...

2.4 Data Manipulation

2.4.1 Look up values

Looking up location names for the corresponding location IDs, such as 1-EWR, 2-Queens, 3-Bronx.
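A minimal sketch of how this lookup might be done with dplyr, assuming the trip records live in a data frame `trips` and the zone lookup table in `zones` (both names are hypothetical):

```r
library(dplyr)

# Attach pick-up and drop-off borough names by joining the zone lookup twice.
trips <- trips %>%
  left_join(zones %>% select(LocationID, PULocation = Borough),
            by = c("PULocationID" = "LocationID")) %>%
  left_join(zones %>% select(LocationID, DOLocation = Borough),
            by = c("DOLocationID" = "LocationID"))
```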

2.4.2 Calculated column of interest

Calculating columns of interest such as trip duration, tip percentage, and day of the week.
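A sketch of the derived columns, assuming the datetime columns have already been renamed to `pickup_time` and `dropoff_time` as in the cleaned summary shown later:

```r
library(dplyr)

trips <- trips %>%
  mutate(
    # trip duration in minutes
    trip_duration = as.numeric(difftime(dropoff_time, pickup_time, units = "mins")),
    # tip as a percentage of the metered fare
    tip_perc = round(100 * tip_amount / fare_amount),
    # day of the week of the pick-up
    day = weekdays(pickup_time, abbreviate = TRUE)
  )
```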

2.4.3 Missing values

Dealing with missing values
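One simple way to handle them is to drop incomplete rows; a minimal sketch:

```r
# Keep only rows with no missing values in any column.
trips <- trips[complete.cases(trips), ]
```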

2.4.4 Defining categorical variables

Defining variables such as vendor ID, passenger count, and pick-up and drop-off location as categorical variables.
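A sketch of the conversions, assuming the level sets shown in the cleaned-data summary below:

```r
trips <- trips %>%
  mutate(
    VendorID        = factor(VendorID),
    passenger_count = factor(passenger_count),
    RatecodeID      = factor(RatecodeID, levels = c(1:6, 99)),
    payment_type    = factor(payment_type),
    PULocation      = factor(PULocation),
    DOLocation      = factor(DOLocation)
  )
```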

2.4.5 Outliers

Investigating Outliers:
Looking at the summary above, the following rules are applied when dealing with outliers (see the sketch after this list):
  1. Passenger_count goes up to nine, which appears incorrect; hence, we only consider trips with up to 6 passengers.
  2. The trip distance has a maximum value of 184341 miles, longer than the width of the entire United States. Therefore, we only consider trip distances up to 40 miles.
  3. According to the data dictionary, RateCodeID can take only six values; however, the data contains values beyond six. These records are discarded.
  4. The fare amount ranges from -907 to 395845. Unless someone is extraordinarily generous, these extremes are incorrect. The fare amount is restricted to the range 0 to 150.
  5. Similarly, the range 0-100 is used for the tip amount and tolls collected.
  6. Records with unknown pick-up or drop-off locations are dropped.
  7. Only credit card payments are considered, because we do not have data for cash tips.
  8. Finally, tip percentages beyond 500% look implausible; hence we ignore those values and consider tips of up to 60% of the fare.
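A sketch of these filters with dplyr, assuming the derived and renamed columns introduced above; the thresholds follow the rules just listed:

```r
trips_clean <- trips %>%
  filter(
    passenger_count %in% 1:6,                          # rule 1
    trip_distance > 0, trip_distance <= 40,            # rule 2
    RatecodeID %in% 1:6,                               # rule 3
    fare_amount > 0, fare_amount <= 150,               # rule 4
    tip_amount >= 0, tip_amount <= 100,                # rule 5
    tolls_amount >= 0, tolls_amount <= 100,            # rule 5
    PULocation != "Unknown", DOLocation != "Unknown",  # rule 6
    payment_type == 1,                                 # rule 7: credit card only
    tip_perc <= 60                                     # rule 8
  )
```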

Summary for Cleaned Data

##  VendorID   pickup_time                      dropoff_time                   
##  1:28040   Min.   :2002-10-21 05:50:44.00   Min.   :2002-10-21 06:17:17.00  
##  2:80292   1st Qu.:2022-06-08 10:40:14.00   1st Qu.:2022-06-08 11:12:00.50  
##            Median :2022-06-15 15:27:32.00   Median :2022-06-15 15:54:23.00  
##            Mean   :2022-06-15 07:48:12.17   Mean   :2022-06-15 14:37:32.81  
##            3rd Qu.:2022-06-23 04:57:50.00   3rd Qu.:2022-06-23 05:27:29.75  
##            Max.   :2022-06-30 19:59:43.00   Max.   :2022-06-30 20:35:20.00  
##                                                                             
##  passenger_count trip_distance  RatecodeID payment_type  fare_amount   
##  Min.   :1.00    Min.   : 0.0   1 :79853   1:108332     Min.   :  3.5  
##  1st Qu.:1.00    1st Qu.: 8.8   2 :24087   2:     0     1st Qu.: 27.5  
##  Median :1.00    Median :10.3   3 : 3427   3:     0     Median : 32.5  
##  Mean   :1.45    Mean   :11.6   4 :   35   4:     0     Mean   : 36.2  
##  3rd Qu.:2.00    3rd Qu.:15.5   5 :  930                3rd Qu.: 52.0  
##  Max.   :6.00    Max.   :37.5   6 :    0                Max.   :149.5  
##                                 99:    0                               
##    tip_amount    tolls_amount          PULocation            DOLocation   
##  Min.   : 0.1   Min.   : 0.0   Bronx        :   49   Bronx        : 2579  
##  1st Qu.: 7.0   1st Qu.: 6.6   Brooklyn     :  271   Brooklyn     : 6741  
##  Median : 8.5   Median : 6.6   EWR          :   18   EWR          : 3622  
##  Mean   : 8.9   Mean   : 6.9   Manhattan    :46274   Manhattan    :60995  
##  3rd Qu.:10.8   3rd Qu.: 6.6   Queens       :61716   Queens       :34297  
##  Max.   :50.0   Max.   :56.0   Staten Island:    4   Staten Island:   98  
##                                Unknown      :    0   Unknown      :    0  
##     tip_perc    trip_duration   day          PU_time_of_day    DO_time_of_day 
##  Min.   : 1.0   Min.   : 1.0   Sun:17205   Night    :28670   Night    :26102  
##  1st Qu.:24.0   1st Qu.:22.0   Mon:17248   Morning  :31547   Morning  :32640  
##  Median :26.0   Median :28.0   Tue:14027   Afternoon:34259   Afternoon:33344  
##  Mean   :25.3   Mean   :27.7   Wed:16971   Evening  :13856   Evening  :16246  
##  3rd Qu.:28.0   3rd Qu.:34.0   Thu:16985                                      
##  Max.   :59.0   Max.   :40.0   Fri:13831                                      
##                                Sat:12065
The number of observations after data cleaning is **1841644**.

2.5 Distribution Check

The distribution of the primary variable of this study (tip) is approximately normal, apart from some missing mass on the left side.

3 Exploratory Data Analysis

3.1 Parameter Visualization

The graph above indicates that trips with fewer passengers are more common: the number of journeys increases as the number of passengers decreases. The right-skewed distribution corroborates this observation.

We had expected evenings to have the most trips, but our data revealed that afternoons were the busiest in terms of number of trips, followed by mornings.

3.2 Relationship Exploration

Observations - We initially skimmed the correlation coefficients of our continuous variables, such as trip distance and trip duration, to see whether they were related to tip amount. As seen in the correlation plot above, the results were 0.65 and 0.33, which suggest a moderate association. However, since correlation does not imply causation, statistical tests must be performed on these variables to establish their relationship.

3.3 Location Analysis

Location Distribution
Pick-up Location Drop-off Location No. of Trips Avg. Fare ($)
Queens Manhattan 59411 36.9
Manhattan Queens 33523 33.8
Manhattan Brooklyn 6681 25.9
Manhattan EWR 3587 66.1
Queens Bronx 1466 39.6
Manhattan Manhattan 1328 30.5
Manhattan Bronx 1094 31.7
Queens Queens 744 46.3
Brooklyn Manhattan 232 23.6
Manhattan Staten Island 61 47.2
Now that we have looked at the insights from our location vs tip percentage data, we explore the statistical significance of the two variables.

This graph illustrates that Queens has the highest share of tipping passengers among pick-up locations, followed by Manhattan. Similarly, Manhattan has the highest share of tipping passengers among drop-off locations.
Observations - The p-value for both variables is 0.2×10^−15, or 0.0000000000000002, which is infinitesimal compared to the significance level of 0.05; we therefore reject the null hypothesis that the means of the two groups are the same, making them statistically different.

3.4 Trip Duration Impact on Tips

Observations - The data is approximately normally distributed with slight skewness to the right; that is, the majority of our trips are 20-40 minutes long.
Observations - A t-test between travel time and customer tip percentage reveals a p-value of 0.2×10^−15 for the relationship between the variables. Since this is far below the significance level of 0.05, we can state that yellow taxi passengers tip differently depending on the length of the trip, and we reject the null hypothesis that the means of the two groups are equal.

3.5 Trip Length and Tips

We are attempting to determine whether passengers on shorter journeys are more likely to leave larger gratuities, or whether those taking longer trips are more generous.

To study these two groups separately, we divide the trip distance data into two categories: short and long trips. When we plot the journey distance against the number of tips paid, we notice that passengers tip more often on shorter rides than on longer ones.

Declaring hypotheses
Null hypothesis (H0): the tip amount is the same for short- and long-distance passengers.
Alternative hypothesis (Ha): the tip amount is NOT the same for short- and long-distance passengers.
Observations - A simple two-sample test yields a p-value far below the significance level of 0.05, so we reject the null hypothesis. (A z-test cannot be used because the population mean and standard deviation are unknown, so we rely on a t-test.)
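A minimal sketch of this test, with a hypothetical 10-mile cut-off standing in for the short/long split (the report does not state the exact threshold used):

```r
# Label each ride short or long; the 10-mile cut-off is illustrative only.
trips_clean$trip_type <- ifelse(trips_clean$trip_distance <= 10, "short", "long")

# Welch two-sample t-test of tip amount between the two groups.
t.test(tip_amount ~ trip_type, data = trips_clean)
```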

3.6 Importance of passenger count and vendor

An ANOVA test between tip percentage and passenger count shows a significant relationship between the number of passengers and the tip percentage: the p-value is 0.00006, which is less than the significance level (0.05). Hence, we can reject the null hypothesis (H0).
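A sketch of how this one-way ANOVA might be run, assuming `passenger_count` has been converted to a factor as described earlier:

```r
# One-way ANOVA of tip percentage across passenger-count groups.
summary(aov(tip_perc ~ passenger_count, data = trips_clean))
```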

3.7 Summary

Feature (variable) Test P-value Null Hypothesis (H0) Decision on H0
pickup location ANOVA 1.79e-140 means are equal reject H0
dropoff location ANOVA 0 means are equal reject H0
distance T-Test 0 means are equal reject H0
passenger count ANOVA 0.0000604 means are equal reject H0
vendor ID T-test 0.264 means are equal failed to reject H0

4 Model Building

We began by selecting linear (univariate and multivariate) regression models to examine how they fit our data. Linear regression is a conventional, common approach that may explain the association with tip well, so we chose to test it first. To strengthen our linear model, we also used lasso, ridge, and principal component analysis (PCA). We also made use of decision trees and random forests for regression. The capacity of decision trees to model non-linear relationships is one of their advantages. According to our EDA, trip duration, distance, and fare are all linearly related over short distances, but this relationship weakens over longer distances due to the involvement of other possible factors. Consequently, there may in fact be a non-linear relationship with tip, too.

4.1 Preparation

We prepped our data for modeling before developing our models by using one hot encoding, establishing training and testing sets, and scaling our data.

4.1.1 One hot encoding (OHE)

We employed one hot encoding to convert factor columns to numerical columns. Each factor level is converted into a distinct boolean column.
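A minimal sketch using the fastDummies package (one of several ways to one hot encode in R); the column selection follows the glimpse output below:

```r
library(fastDummies)

ohe <- dummy_cols(
  trips_clean,
  select_columns = c("VendorID", "passenger_count", "RatecodeID",
                     "PULocation", "DOLocation", "day",
                     "PU_time_of_day", "DO_time_of_day"),
  remove_selected_columns = TRUE  # drop the original factor columns
)
```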

## Rows: 108,332
## Columns: 48
## $ pickup_time                <chr> "2022-05-31 20:25:41", "2022-05-31 20:21:00…
## $ dropoff_time               <chr> "2022-05-31 20:48:22", "2022-05-31 20:59:50…
## $ trip_distance              <dbl> 11.00, 18.18, 10.60, 10.40, 12.33, 6.88, 18…
## $ fare_amount                <dbl> 32.0, 52.0, 31.0, 30.0, 37.5, 21.5, 52.0, 5…
## $ tip_amount                 <dbl> 2.00, 12.37, 10.65, 12.10, 11.96, 6.62, 12.…
## $ tolls_amount               <dbl> 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6…
## $ tip_perc                   <int> 6, 24, 34, 40, 32, 31, 24, 24, 19, 36, 29, …
## $ trip_duration              <int> 22, 38, 22, 22, 33, 14, 39, 23, 29, 33, 31,…
## $ VendorID_1                 <int> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ VendorID_2                 <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ passenger_count_1          <int> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ passenger_count_2          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_3          <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_4          <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_5          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_6          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_1               <int> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1…
## $ RatecodeID_2               <int> 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0…
## $ RatecodeID_3               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_4               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_5               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Bronx           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Brooklyn        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_EWR             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Manhattan       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Queens          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ `PULocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Bronx           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Brooklyn        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_EWR             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Manhattan       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DOLocation_Queens          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `DOLocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Fri                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Mon                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sat                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sun                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Thu                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Tue                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day_Wed                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Afternoon   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Evening     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PU_time_of_day_Morning     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Night       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Afternoon   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Evening     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DO_time_of_day_Morning     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Night       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

After one hot encoding (OHE), we are left with 48 columns.

4.1.2 Scaling variables

Because the magnitudes of the values are not comparable across columns, we must scale the numerical variables in our dataset. For this, we compute the mean and standard deviation of each numerical column.
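A sketch of z-score scaling for the numeric columns (leaving the target unscaled here is an assumption, not something the report states):

```r
num_cols <- c("trip_distance", "fare_amount", "tolls_amount", "trip_duration")

# Centre each column on its mean and divide by its standard deviation.
ohe[num_cols] <- scale(ohe[num_cols])
```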

4.1.3 Test train split

In order to eliminate any bias in test results, the train-test split should be performed before (most) data modeling. We randomly divided the dataset into a 70% train set and a 30% test set.
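A minimal sketch of such a split:

```r
set.seed(42)  # hypothetical seed, for reproducibility only

train_idx <- sample(nrow(ohe), size = floor(0.7 * nrow(ohe)))
train <- ohe[train_idx, ]
test  <- ohe[-train_idx, ]
```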

After the split, the training dataset has 75799 rows of observations and the testing dataset has 32533.

4.2 Evaluation Metrics

From Wikipedia

  1. Mean Absolute Percentage Error (MAPE)
              The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of the prediction
              accuracy of a forecasting method in statistics. It expresses the accuracy as a ratio defined by the formula
              MAPE = (1/n) Σ |(A_t − F_t) / A_t|, where A_t is the actual value and F_t is the forecast value (often multiplied by 100%).

  2. Akaike information criterion (AIC)
              The Akaike information criterion is an estimator of prediction error and thereby the relative quality of statistical models for a
              given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the
              other models.

  3. R squared (R2)
              In statistics, the coefficient of determination, denoted R2 or r2 and pronounced “R squared”, is the proportion of the variation
              in the dependent variable that is predictable from the independent variable(s).
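For concreteness, minimal sketches of how the first and third metrics might be computed in R (AIC comes directly from `stats::AIC()` for fitted models):

```r
# Mean absolute percentage error between actual and predicted values.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual))
}

# Coefficient of determination (R squared).
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
```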

4.3 Tip Percentage

With our data prepared, the next step is to employ a number of regression-based methods to extract insights from the data, which we can then use to predict which result is likely to hold true for our target variable based on the training data.

4.3.1 Principal component analysis

We chose PCA as a variable reduction strategy because the majority of our variables were associated with one another and there were 48 features.
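A minimal sketch of the decomposition with base R's `prcomp` (the variables graph discussed below can be drawn with, e.g., factoextra::fviz_pca_var):

```r
# PCA on the numeric predictor columns of the training set, target excluded.
num  <- sapply(train, is.numeric)
pred <- train[, num & names(train) != "tip_perc"]

pca <- prcomp(pred, center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component
```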

Variables graph: Variables that are positively associated point to the same side of the plot. Negatively associated variables point to the graph’s opposing sides.

Observations: Even if the first three components explain 94.1% of the variance in the data, that does not necessarily mean a good R2 or high coefficients will result. However, this gives us enough statistical basis for deciding which variables to go after. Hence, we proceed to build our linear regression model with these variables.

4.3.2 Linear regression

From the results of the principal component analysis, we constructed a linear model of tip percentage using the first three high-variability explainers and their correlations with tip percentage: trip_distance (-0.28), trip_duration (-0.175), and tolls_amount (-0.04).
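A sketch of the three nested fits and their comparison, assuming the scaled training set `train`:

```r
fit1 <- lm(tip_perc ~ trip_duration, data = train)
fit2 <- lm(tip_perc ~ trip_duration + fare_amount, data = train)
fit3 <- lm(tip_perc ~ trip_duration + fare_amount + trip_distance, data = train)

anova(fit1, fit2, fit3)  # ANOVA p-values reported in the table below
AIC(fit1, fit2, fit3)
```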

ANOVA tests on all the three models

The summary for three linear models is:

Fit Model Equation R^2 ANOVA P-value AIC
1 tip_perc ~ trip_duration 0.03 509392.627
2 tip_perc ~ trip_duration+fare_amount 0.0305 0.000000000145 513199.554
3 tip_perc ~ trip_duration+fare_amount+trip_distance 0.078 0 513236.639

Observations: Looking at the combination of p-value and r-squared, we conclude that fit3 performs slightly better than the other two fits. Hence, we check if there is an improvement in model 3 in the absence of outliers.

Treating Outliers and Modeling

Even after treating the outliers in our model three fit, there is little to no difference in the results: the r-squared and MAPE values remain 0.0835 and 0.185.

4.3.3 Lasso regression

Lasso regression is a form of regularization (L1) approach that might result in coefficients that are canceled out (in other words, some of the features are completely neglected for the evaluation of output). As a result, it not only helps to reduce over-fitting, but it may also aid in feature selection.

As we increase the value of lambda, the bias increases and the variance decreases, so we iterated through a set of lambda values to find the optimum, as sketched below. The graph below shows how the lasso reduces the coefficients of unnecessary attributes to 0. Only the five attributes with the largest coefficient values are labeled, for better visibility.
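A sketch of this search with glmnet, whose cross-validation picks the lambda that minimizes the test MSE (`alpha = 1` selects the lasso; `alpha = 0` would give the ridge fit of the next section):

```r
library(glmnet)

# Predictor matrix and response (numeric columns only, target excluded).
num <- sapply(train, is.numeric)
x <- as.matrix(train[, num & names(train) != "tip_perc"])
y <- train$tip_perc

cv_fit <- cv.glmnet(x, y, alpha = 1)
cv_fit$lambda.min                                       # optimal lambda
plot(cv_fit$glmnet.fit, xvar = "lambda", label = TRUE)  # coefficient paths
```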

It is interesting that the trip distance (short and long trips) and the standard rate applied to the rides (rate code 1) survive the longest. It is also surprising to see how long the drop-off location Bronx prevails.

The lambda value that minimizes the test MSE turns out to be 0.002.

There is a slight improvement in the r-squared and MAPE values compared to the base linear model; however, the r-squared is only 0.0991, which leaves a lot of room for improvement.

4.3.4 Ridge regression

Ordinary least squares (OLS) finds the coefficients that best fit the data, with the further requirement that they be unbiased; here, unbiased refers to the fact that OLS ignores the relative importance of the independent variables. For a given data set, there is exactly one set of betas that yields the lowest Residual Sum of Squares (RSS). This raises the question of whether the model with the lowest RSS is actually the better model.

In a sense, OLS offers the model with the highest variance and the lowest bias, and it becomes more complex as the number of variables rises; the estimate itself is fixed once computed. We would instead like a model with both little bias and little variance. This void can be filled by ridge regression, a form of regularization. Since ridge regression penalizes coefficients, the least effective ones in the estimation will “shrink” the quickest. In ridge regression, the lambda parameter (the penalizing factor) can be adjusted to alter the model coefficients.

Again, only the five attributes with the largest coefficient values are labeled, for better visibility.

Observations: The plot shows the whole path of the variables as they shrink towards zero as lambda increases. The pick-up locations Staten Island and Newark Airport survive the longest before shrinking to zero.

The lambda value that minimizes the test MSE turns out to be 0.978.

The plot shows that all the variables together explain ~9.30% (the ~0.0930 point on the plot) of the variance in the data. This is corroborated by the R2 value of the model.

4.3.5 Decision tree

The classification and regression tree (CART) methodology is one of the earliest methods for creating regression trees, although many others exist. A basic regression tree divides a data set into smaller subgroups and then fits a simple constant to the observations in each segment. The partitioning is accomplished via successive binary partitions (also known as recursive partitioning) based on several predictors.

Cost complexity criterion

To achieve good prediction performance on unseen data, a balance between the depth and complexity of the tree is generally required. To achieve this balance, we typically grow a very large tree and then prune it back to identify an optimal subtree, as sketched below. We identify the best subtree by applying a cost complexity parameter (α) that penalizes our objective function for the number of terminal nodes in the tree.
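A sketch of this grow-then-prune procedure with rpart, assuming identifier and datetime columns have been dropped from `train`:

```r
library(rpart)

# Grow a deliberately large tree by setting the complexity penalty to zero.
big_tree <- rpart(tip_perc ~ ., data = train, method = "anova",
                  control = rpart.control(cp = 0))

plotcp(big_tree)  # cross-validated error against cp / tree size

# Prune back to the cp with the lowest cross-validated error.
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)
```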

When we consider all the variables while building our decision tree, the model quickly becomes overfitted.

The plot above compares the error over the range of α’s (the cost complexity cp value on the bottom X-axis). The upper X-axis gives the number of nodes. We can see that returns diminish after around 10 leaves (dashed vertical line).

Pruned Decision Tree

Pruning the decision tree back to 10 leaves gives a much better model, as seen below.

The plot above confirms that only the first ten variables actually contribute towards reducing the relative error.

4.3.6 Random Forest

Finally, we apply our last model to obtain further improved results. The random forest builds on the classical decision tree via a method called bagging.

Note: Due to limitation in computation power, the number of trees are limited to 100.

## 
## Call:
##  randomForest(formula = tip_perc ~ trip_distance + tolls_amount +      trip_duration + VendorID_1 + VendorID_2 + passenger_count_1 +      passenger_count_2 + passenger_count_3 + passenger_count_4 +      passenger_count_5 + passenger_count_6 + RatecodeID_1 + RatecodeID_2 +      RatecodeID_3 + RatecodeID_4 + RatecodeID_5 + PULocation_Bronx +      PULocation_Brooklyn + PULocation_EWR + PULocation_Manhattan +      PULocation_Queens + PULocation_Staten_Island + DOLocation_Bronx +      DOLocation_Brooklyn + DOLocation_EWR + DOLocation_Manhattan +      DOLocation_Queens + DOLocation_Staten_Island + day_Fri +      day_Mon + day_Sat + day_Sun + day_Thu + day_Tue + day_Wed +      PU_time_of_day_Afternoon + PU_time_of_day_Evening + PU_time_of_day_Morning +      PU_time_of_day_Night + DO_time_of_day_Afternoon + DO_time_of_day_Evening +      DO_time_of_day_Morning + DO_time_of_day_Night, data = train,      ntree = 100, keep.forest = FALSE, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 14
## 
##           Mean of squared residuals: 50
##                     % Var explained: 5.1

Node purity is the total decrease in the residual sum of squares from splitting on a variable, averaged over all trees (i.e., how well a predictor decreases variance). Importance reflects what the model has learnt. The plot above shows, for each variable, how important it is in classifying the data. The Mean Decrease Accuracy plot expresses how much accuracy the model loses by excluding each variable; the more the accuracy suffers, the more important the variable is for successful classification. The variables are presented in descending order of importance.

Since the results of the random forest were so poor, we decided to exclude it from our model selection.

4.3.7 Summary and analysis

The pruned decision tree is the optimal model when aiming for a low MAPE and a high r-squared. It is crucial to remember that, when comparing the different models, the MAPE is roughly the same, hovering around 0.18-0.22, which denotes 78-82% accuracy. But all of the models’ r-squared values are quite low, explaining only about 5% to 10% of the variance in our dependent variable. As a result, this suggests that the models are neither thorough nor accurate fits.

4.4 Tip Amount

As before, the next step after preparing our data is to employ a number of regression-based methods to extract insights from the data, which we can then use to predict which result is likely to hold true for our target variable based on the training data.

4.4.1 Principal component analysis

We chose PCA as a variable reduction strategy because the majority of our variables were associated with one another and there were 48 features.

Variables graph: Variables that are positively associated point to the same side of the plot. Negatively associated variables point to the graph’s opposing sides.

Observations: Even if the first three components explain 94% of the variance in the data, that does not necessarily mean a good R2 or high coefficients will result. However, this gives us enough statistical basis for deciding which variables to go after. Hence, we proceed to build our linear regression model with these variables.

4.4.2 Linear regression

From the results of the principal component analysis, we constructed a linear model of tip amount using the first three high-variability explainers and their correlations with ‘Tip_Amount’: fare_amount (0.65), trip_distance (0.54), and trip_duration (0.36).

ANOVA tests on all the three models

The summary for three linear models is:

Fit Model Equation R^2 ANOVA P-value AIC
1 tip_amount ~ trip_duration 0.13 172650.796
2 tip_amount ~ trip_duration+fare_amount 0.432 0 172813.743
3 tip_amount ~ trip_duration+fare_amount+trip_distance 0.433 9.44×10^−38 205075.254

Observations: Looking at the combination of p-value and r-squared, we conclude that fit3 performs slightly better than the other two fits. Hence, we check if there is an improvement in model 3 in the absence of outliers.

Treating Outliers and Modeling

Even after treating the outliers in our model three fit, there is little to no difference in the results: the r-squared and MAPE values remain 0.439 and 4.662.

4.4.3 Lasso regression

As expected, the fare amount survives the longest. However, it is surprising to see how long the toll amount and the drop-off location Bronx prevail.

The lambda value that minimizes the test MSE turns out to be 0.

As before, the r-squared is around 0.4463, which leaves a lot of room for improvement.

4.4.4 Ridge regression

Observations: The plot shows the whole path of the variables as they shrink towards zero as lambda increases. The pick-up location Staten Island and the Nassau or Westchester rate (rate code 4) survive the longest before shrinking to zero.

The lambda value that minimizes the test MSE turns out to be 0.282.

The plot shows that all the variables together explain ~42% (the ~0.4288 point on the plot) of the variance in the data. This is corroborated by the R2 value of the model.

4.4.5 Decision tree

Cost complexity criterion

When we consider all the variables while building our decision tree, the model quickly becomes overfitted.

The plot above compares the error over the range of α’s (the cost complexity cp value on the bottom X-axis). The upper X-axis gives the number of nodes. We can see that returns diminish after around 13 leaves (dashed vertical line).

Pruned Decision Tree

Pruning the decision tree back to 13 leaves gives a much better model, as seen below.

4.4.6 Random Forest

Finally, we apply our last model to obtain further improved results. The random forest builds on the classical decision tree via a method called bagging.

## 
## Call:
##  randomForest(formula = tip_amount ~ trip_distance + fare_amount +      tolls_amount + trip_duration + VendorID_1 + VendorID_2 +      passenger_count_1 + passenger_count_2 + passenger_count_3 +      passenger_count_4 + passenger_count_5 + passenger_count_6 +      RatecodeID_1 + RatecodeID_2 + RatecodeID_3 + RatecodeID_4 +      RatecodeID_5 + PULocation_Bronx + PULocation_Brooklyn + PULocation_EWR +      PULocation_Manhattan + PULocation_Queens + PULocation_Staten_Island +      DOLocation_Bronx + DOLocation_Brooklyn + DOLocation_EWR +      DOLocation_Manhattan + DOLocation_Queens + DOLocation_Staten_Island +      day_Fri + day_Mon + day_Sat + day_Sun + day_Thu + day_Tue +      day_Wed + PU_time_of_day_Afternoon + PU_time_of_day_Evening +      PU_time_of_day_Morning + PU_time_of_day_Night + DO_time_of_day_Afternoon +      DO_time_of_day_Evening + DO_time_of_day_Morning + DO_time_of_day_Night,      data = train, ntree = 100, keep.forest = FALSE, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 14
## 
##           Mean of squared residuals: 0.602
##                     % Var explained: 40.2

As seen from the plot above, the trip distance, trip duration, and fare amount would have the highest impact on the model if they were to be removed.

4.4.7 Summary and analysis

The pruned decision tree is the optimal model when aiming for a low MAPE and a high r-squared. It is crucial to remember that, when comparing the different models for tip amount, the MAPE varies widely, from roughly 2.0 to 8.1. The r-squared values are moderate, explaining about 34% to 45% of the variance in our dependent variable; these models fit better than the tip-percentage models but are still far from accurate.

4.5 Model Evaluation Summary

Summary for All Models’ Evaluation Metrics
Technique Dependent MAPE R-squared AIC
Linear(3 vars with best cor-coeffs) tip_perc 0.185 0.0835 513236.64
Linear-treated outlier tip_perc 0.185 0.0835 509392.63
Lasso tip_perc 0.183 0.0991 -372632.60 *
Ridge tip_perc 0.183 0.0930 -344041.95 *
Decision Tree tip_perc 0.226 0.0463
Decision Tree (Prune) tip_perc 0.183 0.1020
Linear(3 vars with best cor-coeffs) tip_amount 4.662 0.4394 205075.25
Linear-treated outlier tip_amount 4.662 0.4394 172650.80
Lasso tip_amount 6.322 0.4463 -33544.65 *
Ridge tip_amount 8.136 0.4288 -31818.35 *
Decision Tree tip_amount 6.521 0.3444
Decision Tree (Prune) tip_amount 2.049 0.4349

5 Conclusion

We conclude our analysis by discussing the limitations and future scope of this project.

Limitations

  1. Daylight saving time is not considered.
  2. The full dataset could not be analyzed because of hardware limitations.
  3. Cash tips are not recorded in the data, so they are not considered.
  4. There are more variables to consider, such as gender and weather.
  5. Human error in the data (e.g., driver-entered passenger counts).
  6. The Central Limit Theorem (CLT) states that the sample means of moderately large samples are often well approximated by a normal distribution even if the data are not normally distributed. Our dataset contains a significant number of observations, qualifying it as approximately normal under the CLT.

Future scope

The dataset produced for this project will serve as the foundation for future study. More insights could be obtained by analyzing at least a year’s worth of data. If weather impacts the number of rides, hourly weather data combined with weather events may provide further information. A forecast based on zones as well as boroughs would make it much easier for drivers to be in the right place at any particular moment. Additional variables such as gender, driver rating, and trip rating could also be incorporated.