## 'data.frame': 3558124 obs. of 19 variables:
## $ VendorID : int 1 1 2 1 1 2 2 1 2 2 ...
## $ tpep_pickup_datetime : POSIXct, format: "2022-05-31 20:25:41" "2022-05-31 20:44:40" ...
## $ tpep_dropoff_datetime: POSIXct, format: "2022-05-31 20:48:22" "2022-05-31 21:01:48" ...
## $ passenger_count : num 1 1 1 2 0 1 1 1 1 1 ...
## $ trip_distance : num 11 4.2 9.49 12.1 1.8 2.02 8.08 4.3 8.78 1.76 ...
## $ RatecodeID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag : chr "N" "N" "N" "N" ...
## $ PULocationID : int 70 170 264 132 140 148 158 246 197 48 ...
## $ DOLocationID : int 48 226 113 17 163 158 116 262 191 186 ...
## $ payment_type : int 1 1 1 2 1 1 1 1 1 1 ...
## $ fare_amount : num 32 14 26 37 9 9 26.5 15 26.5 7.5 ...
## $ extra : num 3 3 0.5 1.75 3 0.5 0.5 3 0.5 0.5 ...
## $ mta_tax : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip_amount : num 2 0 5 0 2.55 0.64 7.58 3.75 5.56 2.26 ...
## $ tolls_amount : num 6.55 0 6.55 0 0 0 0 0 0 0 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 44.4 17.8 42.6 39.5 15.3 ...
## $ congestion_surcharge : num 2.5 2.5 2.5 0 2.5 2.5 2.5 2.5 0 2.5 ...
## $ airport_fee : num 0 0 1.25 1.25 0 0 0 0 0 0 ...
Field Name | Description |
---|---|
VendorID | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc. |
tpep_pickup_datetime | The date and time when the meter was engaged. |
tpep_dropoff_datetime | The date and time when the meter was disengaged. |
Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
PULocationID | TLC Taxi Zone in which the taximeter was engaged |
DOLocationID | TLC Taxi Zone in which the taximeter was disengaged |
RateCodeID | The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride |
Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip |
Payment_type | A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip |
Fare_amount | The time-and-distance fare calculated by the meter. |
Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
Improvement_surcharge | $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
Tolls_amount | Total amount of all tolls paid in trip. |
Total_amount | The total amount charged to passengers. Does not include cash tips. |
Congestion_Surcharge | Total amount collected in trip for NYS congestion surcharge. |
Airport_fee | $1.25 for pick up only at LaGuardia and John F. Kennedy Airports |
LocationID | Borough | Zone | service_zone |
---|---|---|---|
1 | EWR | Newark Airport | EWR |
2 | Queens | Jamaica Bay | Boro Zone |
3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
4 | Manhattan | Alphabet City | Yellow Zone |
5 | Staten Island | Arden Heights | Boro Zone |
6 | Staten Island | Arrochar/Fort Wadsworth | Boro Zone |
## 'data.frame': 265 obs. of 4 variables:
## $ LocationID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Borough : chr "EWR" "Queens" "Bronx" "Manhattan" ...
## $ Zone : chr "Newark Airport" "Jamaica Bay" "Allerton/Pelham Gardens" "Alphabet City" ...
## $ service_zone: chr "EWR" "Boro Zone" "Boro Zone" "Yellow Zone" ...
Summary for Cleaned Data
## VendorID pickup_time dropoff_time
## 1:28040 Min. :2002-10-21 05:50:44.00 Min. :2002-10-21 06:17:17.00
## 2:80292 1st Qu.:2022-06-08 10:40:14.00 1st Qu.:2022-06-08 11:12:00.50
## Median :2022-06-15 15:27:32.00 Median :2022-06-15 15:54:23.00
## Mean :2022-06-15 07:48:12.17 Mean :2022-06-15 14:37:32.81
## 3rd Qu.:2022-06-23 04:57:50.00 3rd Qu.:2022-06-23 05:27:29.75
## Max. :2022-06-30 19:59:43.00 Max. :2022-06-30 20:35:20.00
##
## passenger_count trip_distance RatecodeID payment_type fare_amount
## Min. :1.00 Min. : 0.0 1 :79853 1:108332 Min. : 3.5
## 1st Qu.:1.00 1st Qu.: 8.8 2 :24087 2: 0 1st Qu.: 27.5
## Median :1.00 Median :10.3 3 : 3427 3: 0 Median : 32.5
## Mean :1.45 Mean :11.6 4 : 35 4: 0 Mean : 36.2
## 3rd Qu.:2.00 3rd Qu.:15.5 5 : 930 3rd Qu.: 52.0
## Max. :6.00 Max. :37.5 6 : 0 Max. :149.5
## 99: 0
## tip_amount tolls_amount PULocation DOLocation
## Min. : 0.1 Min. : 0.0 Bronx : 49 Bronx : 2579
## 1st Qu.: 7.0 1st Qu.: 6.6 Brooklyn : 271 Brooklyn : 6741
## Median : 8.5 Median : 6.6 EWR : 18 EWR : 3622
## Mean : 8.9 Mean : 6.9 Manhattan :46274 Manhattan :60995
## 3rd Qu.:10.8 3rd Qu.: 6.6 Queens :61716 Queens :34297
## Max. :50.0 Max. :56.0 Staten Island: 4 Staten Island: 98
## Unknown : 0 Unknown : 0
## tip_perc trip_duration day PU_time_of_day DO_time_of_day
## Min. : 1.0 Min. : 1.0 Sun:17205 Night :28670 Night :26102
## 1st Qu.:24.0 1st Qu.:22.0 Mon:17248 Morning :31547 Morning :32640
## Median :26.0 Median :28.0 Tue:14027 Afternoon:34259 Afternoon:33344
## Mean :25.3 Mean :27.7 Wed:16971 Evening :13856 Evening :16246
## 3rd Qu.:28.0 3rd Qu.:34.0 Thu:16985
## Max. :59.0 Max. :40.0 Fri:13831
## Sat:12065
Pick up Location | Drop off Location | No. of Trips | Avg. Fare ($) |
---|---|---|---|
Queens | Manhattan | 59411 | 36.9 |
Manhattan | Queens | 33523 | 33.8 |
Manhattan | Brooklyn | 6681 | 25.9 |
Manhattan | EWR | 3587 | 66.1 |
Queens | Bronx | 1466 | 39.6 |
Manhattan | Manhattan | 1328 | 30.5 |
Manhattan | Bronx | 1094 | 31.7 |
Queens | Queens | 744 | 46.3 |
Brooklyn | Manhattan | 232 | 23.6 |
Manhattan | Staten Island | 61 | 47.2 |
Feature (variable) | Test | P-value | Null Hypothesis (H0) | Decision on H0 |
---|---|---|---|---|
pickup location | ANOVA | 1.79e-140 | means are equal | reject H0 |
dropoff location | ANOVA | 0 | means are equal | reject H0 |
distance | T-Test | 0 | means are equal | reject H0 |
passenger count | ANOVA | 0.0000604 | means are equal | reject H0 |
vendor ID | T-test | 0.264 | means are equal | failed to reject H0 |
We began by selecting linear regression models (univariate and multivariate) to examine how well they fit our data. Linear regression is a conventional, widely used approach that may explain the association with tip well, so we chose to test it first. To strengthen the linear model, we also used lasso, ridge, and principal component analysis (PCA). In addition, we used decision trees and random forests for regression. One advantage of decision trees is their capacity to capture non-linear relationships. According to our EDA, trip duration, distance, and fare are linearly related over short distances, but this relationship weakens over longer distances as other factors come into play. Consequently, the relationship with tip may in fact be non-linear as well.
Before developing our models, we prepared the data by one-hot encoding the factor columns, establishing training and testing sets, and scaling the numerical variables.
We used one-hot encoding to convert factor columns into numerical columns: each level of a factor becomes a distinct boolean (0/1) column.
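As a rough illustration of this step (the report does not state which implementation was used), the encoding could be done with the fastDummies package; the data frame name `taxi` and the package choice are assumptions here.

```r
# Hedged sketch: one-hot encode the factor columns of a cleaned data frame
# `taxi` (hypothetical name) with fastDummies; the report's actual code may differ.
library(fastDummies)

taxi_ohe <- dummy_cols(
  taxi,
  select_columns = c("VendorID", "passenger_count", "RatecodeID",
                     "PULocation", "DOLocation", "day",
                     "PU_time_of_day", "DO_time_of_day"),
  remove_selected_columns = TRUE  # drop the original factor columns
)
ncol(taxi_ohe)  # 48 columns after encoding, as in the glimpse below
```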
## Rows: 108,332
## Columns: 48
## $ pickup_time <chr> "2022-05-31 20:25:41", "2022-05-31 20:21:00…
## $ dropoff_time <chr> "2022-05-31 20:48:22", "2022-05-31 20:59:50…
## $ trip_distance <dbl> 11.00, 18.18, 10.60, 10.40, 12.33, 6.88, 18…
## $ fare_amount <dbl> 32.0, 52.0, 31.0, 30.0, 37.5, 21.5, 52.0, 5…
## $ tip_amount <dbl> 2.00, 12.37, 10.65, 12.10, 11.96, 6.62, 12.…
## $ tolls_amount <dbl> 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6…
## $ tip_perc <int> 6, 24, 34, 40, 32, 31, 24, 24, 19, 36, 29, …
## $ trip_duration <int> 22, 38, 22, 22, 33, 14, 39, 23, 29, 33, 31,…
## $ VendorID_1 <int> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ VendorID_2 <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ passenger_count_1 <int> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ passenger_count_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_3 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_4 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_1 <int> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1…
## $ RatecodeID_2 <int> 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0…
## $ RatecodeID_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Bronx <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Brooklyn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_EWR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Manhattan <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Queens <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ `PULocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Bronx <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Brooklyn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_EWR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Manhattan <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DOLocation_Queens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `DOLocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Fri <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Mon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Thu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Tue <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day_Wed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Afternoon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Evening <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PU_time_of_day_Morning <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Night <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Afternoon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Evening <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DO_time_of_day_Morning <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Night <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
After one-hot encoding (OHE), we are left with 48 columns.
Because the numerical variables are on very different scales, we scale them before modeling. To do this, we compute the mean and standard deviation of each numerical column.
To avoid biasing the test results with information from the training data, the train-test split should be performed before (most of) the modeling. We randomly split the dataset into 70% training and 30% testing data.
After the split, the training dataset contains 75,799 observations and the testing dataset 32,533.
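A minimal sketch of the split and scaling, assuming the encoded data frame `taxi_ohe` from the sketch above; the seed and the set of scaled columns are illustrative, not the report's exact choices.

```r
set.seed(42)                                   # illustrative seed
n         <- nrow(taxi_ohe)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- taxi_ohe[train_idx, ]
test  <- taxi_ohe[-train_idx, ]

# Standardize numeric columns with the *training* mean and standard deviation,
# so that no information from the test set leaks into the transformation.
num_cols <- c("trip_distance", "fare_amount", "tolls_amount", "trip_duration")
mu  <- sapply(train[num_cols], mean)
sdv <- sapply(train[num_cols], sd)

train[num_cols] <- scale(train[num_cols], center = mu, scale = sdv)
test[num_cols]  <- scale(test[num_cols],  center = mu, scale = sdv)
```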
Having prepared the data, the next step is to apply a number of regression-based methods to extract insights that we can then use to predict the target variable from the training data.
We chose PCA as a variable-reduction strategy because many of our 48 features are correlated with one another.
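As a sketch of this step, PCA can be run with base R's prcomp and the variables graph drawn with factoextra; the plotting package is an assumption, as the report does not name it. The sketch builds on the `train` data frame from the split above.

```r
library(factoextra)

# Use only numeric predictor columns; drop the targets and any constant columns.
pca_input <- Filter(is.numeric, train)
pca_input <- pca_input[ , setdiff(names(pca_input), c("tip_perc", "tip_amount"))]
pca_input <- pca_input[ , sapply(pca_input, sd) > 0]

pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)
summary(pca_fit)                 # proportion of variance explained per component

# Variables graph: correlated variables point to the same side of the plot.
fviz_pca_var(pca_fit, col.var = "contrib", repel = TRUE)
```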
Variables graph: Variables that are positively associated point to the same side of the plot. Negatively associated variables point to the graph’s opposing sides.
Observations: Even though the first three components explain 94.1% of the variance in the data, that does not necessarily mean a good R2 or large coefficients will result. However, it gives us a statistical basis for choosing which variables to pursue, so we proceed to build our linear regression model with these variables.
From the results of the principal component analysis, we constructed a linear model for tip percentage using the first three high-variability explainers and their correlations with tip percentage: trip_distance (-0.28), trip_duration (-0.175), and tolls_amount (-0.04).
ANOVA tests on all three models
The summary for three linear models is:
Fit | Model Equation | R^2 | ANOVA P-value | AIC |
---|---|---|---|---|
1 | tip_perc ~ trip_duration | 0.03 | – | 509392.627 |
2 | tip_perc ~ trip_duration+fare_amount | 0.0305 | 1.45e-10 | 513199.554 |
3 | tip_perc ~ trip_duration+fare_amount+trip_distance | 0.078 | 0 | 513236.639 |
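A sketch of how the three nested fits in the table could be produced and compared; variable and object names follow the encoded training data from the earlier sketches.

```r
fit1 <- lm(tip_perc ~ trip_duration,                               data = train)
fit2 <- lm(tip_perc ~ trip_duration + fare_amount,                 data = train)
fit3 <- lm(tip_perc ~ trip_duration + fare_amount + trip_distance, data = train)

anova(fit1, fit2, fit3)              # sequential F-tests between the nested models
sapply(list(fit1, fit2, fit3), AIC)  # AIC for each fit
summary(fit3)$r.squared              # R^2 of the largest model
```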
Observations: Looking at the combination of p-value and r-squared, we conclude that fit3 performs slightly better than the other two fits. Hence, we check if there is an improvement in model 3 in the absence of outliers.
Even after treating the outliers in our model 3 fit, there is little to no difference in the results: the r-squared and MAPE values remain 0.0835 and 0.185, respectively.
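MAPE is used as an evaluation metric throughout the model comparison; a minimal helper and its use on the test set might look like the following (the report's exact implementation is not shown).

```r
# Mean absolute percentage error on the held-out test set.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual))
}

pred3 <- predict(fit3, newdata = test)
mape(test$tip_perc, pred3)
```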
Lasso regression is an L1 regularization approach that can shrink some coefficients exactly to zero (in other words, some features are completely dropped from the model). As a result, it not only helps reduce overfitting but can also aid feature selection.
As lambda increases, the bias increases and the variance decreases, so we iterated through a set of lambda values to find the optimum. The graph below shows how lasso shrinks the coefficients of unnecessary attributes to zero. Only the five attributes with the largest coefficient values are labelled, for readability.
It is interesting that trip distance (short and long trips) and the standard rate (Rate Code 1) survive the longest. It is also notable how long the drop-off location Bronx persists.
The lambda value that minimizes the test MSE turns out to be 0.002.
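A sketch of the lasso fit, assuming the glmnet package (a standard choice, though the report does not name it); the predictors go into a numeric matrix and cv.glmnet searches a grid of lambda values.

```r
library(glmnet)

drop_cols <- c("tip_perc", "tip_amount", "pickup_time", "dropoff_time")
x_train   <- as.matrix(train[ , setdiff(names(train), drop_cols)])
y_train   <- train$tip_perc

cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)   # alpha = 1 -> lasso (L1)
plot(cv_lasso)                                       # cross-validated MSE across lambda
cv_lasso$lambda.min                                  # lambda minimising the test MSE
coef(cv_lasso, s = "lambda.min")                     # surviving (non-zero) coefficients
```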
There is a slight improvement in the r-squared and MAPE values compared to the base linear model; however, the r-squared is only 0.0991, which leaves a lot of room for improvement.
Ordinary least squares (OLS) finds the coefficients that best fit the data by minimizing the residual sum of squares (RSS), and it does so while producing unbiased estimates. Unbiased here means that OLS does not weigh the independent variables by their relative importance; for a given data set there is exactly one set of betas that yields the lowest RSS. This raises the question of whether the model with the lowest RSS is actually the better model.
In a sense, OLS gives the model with the lowest bias and the highest variance, and that variance grows as the number of variables rises. What we want instead is a model with both low bias and low variance. Ridge regression, a form of regularization, can fill this gap: it penalizes coefficients, so the least informative ones "shrink" the fastest. In ridge regression, the lambda parameter (the penalizing factor) can be adjusted to alter the model coefficients.
Again, only the five attributes with the largest coefficient values are labelled, for readability.
Observations: The plot shows the full path of the coefficients as they shrink towards zero when lambda increases. The pick-up locations Staten Island and Newark Airport (EWR) survive the longest before shrinking to zero.
The lambda value that minimizes the test MSE turns out to be 0.978.
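Ridge regression mirrors the lasso sketch above, the only change being alpha = 0, so coefficients shrink towards zero but are never set exactly to zero (again assuming glmnet and the same `x_train`/`y_train`).

```r
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)   # alpha = 0 -> ridge (L2)
cv_ridge$lambda.min                                  # lambda minimising the test MSE
coef(cv_ridge, s = "lambda.min")                     # shrunken, but non-zero, coefficients
```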
The plot shows that all the variables together explain ~9.30% (the ~0.0930 point on the plot) of the variance in the data, which is consistent with the model's R2 value.
The classification and regression tree (CART) methodology is one of the earliest methods for building regression trees, although many others exist. Basic regression trees divide the data set into smaller subgroups and then fit a simple constant to the observations in each segment. The partitioning is accomplished through successive binary partitions (recursive partitioning) based on the predictors.
Cost complexity criterion
To achieve good prediction performance on unseen data, a balance between the depth and the complexity of the tree is generally required. To strike this balance, we grow a very large tree and then prune it back to identify an optimal subtree, which we find by applying a cost complexity parameter (α) that penalizes the objective function for the number of terminal nodes in the tree.
When we consider all the variables while building our decision tree, the model quickly becomes overfitted.
The plot above compares the error over a range of α values (the cost complexity parameter cp on the lower X-axis); the upper X-axis gives the number of terminal nodes. Returns diminish after around 10 leaves (dashed vertical line).
Pruning the decision tree to 10 leaves gives a much better model, as seen below.
The plot above confirms that only the first ten variables actually contribute towards reducing the relative error.
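A sketch of the regression tree and the cost-complexity pruning described above, assuming the rpart package; the small starting cp and the exclusion of the timestamp columns are illustrative choices, not the report's exact settings.

```r
library(rpart)

tree_data <- subset(train, select = -c(tip_amount, pickup_time, dropoff_time))
tree_full <- rpart(tip_perc ~ ., data = tree_data,
                   method = "anova", cp = 0.0001)    # grow a deliberately large tree

plotcp(tree_full)                                    # cross-validated error vs cp / tree size

# Prune back to the subtree with the lowest cross-validated error.
best_cp     <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_full, cp = best_cp)
```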
Finally, we apply our last model to obtain further improved results. The random forest builds on the classical decision tree through a method called bagging (bootstrap aggregation).
Note: due to limited computation power, the number of trees is limited to 100.
##
## Call:
## randomForest(formula = tip_perc ~ trip_distance + tolls_amount + trip_duration + VendorID_1 + VendorID_2 + passenger_count_1 + passenger_count_2 + passenger_count_3 + passenger_count_4 + passenger_count_5 + passenger_count_6 + RatecodeID_1 + RatecodeID_2 + RatecodeID_3 + RatecodeID_4 + RatecodeID_5 + PULocation_Bronx + PULocation_Brooklyn + PULocation_EWR + PULocation_Manhattan + PULocation_Queens + PULocation_Staten_Island + DOLocation_Bronx + DOLocation_Brooklyn + DOLocation_EWR + DOLocation_Manhattan + DOLocation_Queens + DOLocation_Staten_Island + day_Fri + day_Mon + day_Sat + day_Sun + day_Thu + day_Tue + day_Wed + PU_time_of_day_Afternoon + PU_time_of_day_Evening + PU_time_of_day_Morning + PU_time_of_day_Night + DO_time_of_day_Afternoon + DO_time_of_day_Evening + DO_time_of_day_Morning + DO_time_of_day_Night, data = train, ntree = 100, keep.forest = FALSE, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 14
##
## Mean of squared residuals: 50
## % Var explained: 5.1
Node purity is the total decrease in the residual sum of squares from splitting on a variable, averaged over all trees (i.e. how well a predictor decreases variance). Importance reflects what the model has learnt: the plot above shows, for each variable, how important it is for predicting the response. The Mean Decrease Accuracy plot expresses how much accuracy the model loses by excluding each variable; the more the accuracy suffers, the more important the variable. The variables are presented in descending order of importance.
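These importance measures can be read directly off the fitted randomForest object (here called `rf_fit`, a hypothetical name for the model fitted in the call above):

```r
importance(rf_fit)   # %IncMSE (mean decrease in accuracy) and IncNodePurity per variable
varImpPlot(rf_fit)   # the two importance plots described above
```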
Since the results of the random forest were so poor (only ~5% of variance explained), we decided to exclude it from our model selection.
When aiming for a low MAPE, a low AIC, and a high r-squared, the pruned decision tree is the best of these models. It is worth remembering that, across the different models, the MAPE is roughly the same, hovering around 0.18 - 0.22, which corresponds very roughly to 78-82% accuracy. However, all of the models' r-squared values are quite low, explaining only about 8% to 11% of the variance in our dependent variable, so none of the models is a thorough or accurate fit.
We now repeat the same regression pipeline with tip amount as the target variable. As before, PCA is used as the variable-reduction strategy because most of the 48 features are correlated with one another; in the variables graph, positively associated variables point to the same side of the plot and negatively associated variables point to opposite sides.
Observations: Even though the first three components again explain 94% of the variance in the data, that does not necessarily mean a good R2 or large coefficients will result. However, it gives us a statistical basis for choosing which variables to pursue, so we proceed to build our linear regression model with these variables.
From the results of the principal component analysis, we constructed a linear model for tip amount using the first three high-variability explainers and their correlations with tip amount: fare_amount (0.65), trip_distance (0.54), and trip_duration (0.36).
ANOVA tests on all three models
The summary for three linear models is:
Fit | Model Equation | R^2 | ANOVA P-value | AIC |
---|---|---|---|---|
1 | tip_amount ~ trip_duration | 0.13 | – | 172650.796 |
2 | tip_amount ~ trip_duration+fare_amount | 0.432 | 0 | 172813.743 |
3 | tip_amount ~ trip_duration+fare_amount+trip_distance | 0.433 | 9.44e-38 | 205075.254 |
Observations: Looking at the combination of p-value and r-squared, we conclude that fit3 performs slightly better than the other two fits. Hence, we check if there is an improvement in model 3 in the absence of outliers.
Even after treating the outliers in our model 3 fit, there is little to no difference in the results: the r-squared and MAPE values remain 0.439 and 4.662, respectively.
As expected, the fare amount survives the longest. However, it is surprising how long the tolls amount and the drop-off location Bronx persist.
The lambda value that minimizes the test MSE turns out to be 0.
As before, the r-squared is around 0.4463, which still leaves a lot of room for improvement.
Observations: The plot shows the full path of the coefficients as they shrink towards zero when lambda increases. The pick-up location Staten Island and the Nassau or Westchester rate (Rate Code 4) survive the longest before shrinking to zero.
The lambda value that minimizes the test MSE turns out to be 0.282.
The plot shows that all the variables together explain ~42% (the ~0.4288 point on the plot) of the variance in the data, which is consistent with the model's R2 value.
Cost complexity criterion
When we consider all the variables while building our decision tree, the model quickly becomes overfitted.
The plot above compares the error over a range of α values (the cost complexity parameter cp on the lower X-axis); the upper X-axis gives the number of terminal nodes. Returns diminish after around 13 leaves (dashed vertical line).
Pruning the decision tree to 13 leaves gives a much better model, as seen below.
Finally, we apply our last model to obtain further improved results. The random forest builds on the classical decision tree through a method called bagging (bootstrap aggregation).
##
## Call:
## randomForest(formula = tip_amount ~ trip_distance + fare_amount + tolls_amount + trip_duration + VendorID_1 + VendorID_2 + passenger_count_1 + passenger_count_2 + passenger_count_3 + passenger_count_4 + passenger_count_5 + passenger_count_6 + RatecodeID_1 + RatecodeID_2 + RatecodeID_3 + RatecodeID_4 + RatecodeID_5 + PULocation_Bronx + PULocation_Brooklyn + PULocation_EWR + PULocation_Manhattan + PULocation_Queens + PULocation_Staten_Island + DOLocation_Bronx + DOLocation_Brooklyn + DOLocation_EWR + DOLocation_Manhattan + DOLocation_Queens + DOLocation_Staten_Island + day_Fri + day_Mon + day_Sat + day_Sun + day_Thu + day_Tue + day_Wed + PU_time_of_day_Afternoon + PU_time_of_day_Evening + PU_time_of_day_Morning + PU_time_of_day_Night + DO_time_of_day_Afternoon + DO_time_of_day_Evening + DO_time_of_day_Morning + DO_time_of_day_Night, data = train, ntree = 100, keep.forest = FALSE, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 14
##
## Mean of squared residuals: 0.602
## % Var explained: 40.2
As seen from the plot above, the trip distance, trip duration, and fare amount would have the highest impact on the model if they were removed.
The pruned decision tree is again the best of these models when aiming for a low MAPE and a high r-squared: it has by far the lowest MAPE (about 2.0) and an r-squared (0.4349) close to the best achieved. For tip amount, the MAPE values are much larger than for tip percentage, ranging from roughly 2 to 8, while the r-squared values are considerably higher, at about 0.34 to 0.45. Even so, the models still leave more than half of the variance in the dependent variable unexplained.
Technique | Dependent | MAPE | R^2 | AIC |
---|---|---|---|---|
Linear (3 vars with best corr. coefficients) | tip_perc | 0.185 | 0.0835 | 513236.639 |
Linear, outliers treated | tip_perc | 0.185 | 0.0835 | 509392.627 |
Lasso | tip_perc | 0.183 | 0.0991 | -372632.599 * |
Ridge | tip_perc | 0.183 | 0.0930 | -344041.946 * |
Decision Tree | tip_perc | 0.226 | 0.0463 | – |
Decision Tree (pruned) | tip_perc | 0.183 | 0.1020 | – |
Linear (3 vars with best corr. coefficients) | tip_amount | 4.662 | 0.4394 | 205075.254 |
Linear, outliers treated | tip_amount | 4.662 | 0.4394 | 172650.796 |
Lasso | tip_amount | 6.322 | 0.4463 | -33544.646 * |
Ridge | tip_amount | 8.136 | 0.4288 | -31818.353 * |
Decision Tree | tip_amount | 6.521 | 0.3444 | – |
Decision Tree (pruned) | tip_amount | 2.049 | 0.4349 | – |
Comments: At first glance, there are 67,604,356 data points in total, across 3,558,124 observations and 19 variables, of which 7 are categorical and 12 are numerical. The data was procured from the NYC Taxi & Limousine Commission (TLC) trip record data website: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.