Demystifying Components of Ride Hailing

1 Exordium

One of the most famous images of New York is the wave of yellow taxis flooding its streets. So where better to study taxi cab data than New York City? That is exactly what we set out to do. Since 2009, the NYC Taxi and Limousine Commission (TLC) has collected massive amounts of data on every taxi trip in New York City. We wanted to get our hands dirty and put the analysis techniques we learnt over the semester to work.

We wanted to see how parameters like pick-up location, distance, number of passengers, and drop-off location impact the tipping behavior of NYC taxi passengers.

2 Data Preparation

2.1 Data Gathering

## 'data.frame':    3558124 obs. of  19 variables:
##  $ VendorID             : int  1 1 2 1 1 2 2 1 2 2 ...
##  $ tpep_pickup_datetime : POSIXct, format: "2022-05-31 20:25:41" "2022-05-31 20:44:40" ...
##  $ tpep_dropoff_datetime: POSIXct, format: "2022-05-31 20:48:22" "2022-05-31 21:01:48" ...
##  $ passenger_count      : num  1 1 1 2 0 1 1 1 1 1 ...
##  $ trip_distance        : num  11 4.2 9.49 12.1 1.8 2.02 8.08 4.3 8.78 1.76 ...
##  $ RatecodeID           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : chr  "N" "N" "N" "N" ...
##  $ PULocationID         : int  70 170 264 132 140 148 158 246 197 48 ...
##  $ DOLocationID         : int  48 226 113 17 163 158 116 262 191 186 ...
##  $ payment_type         : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ fare_amount          : num  32 14 26 37 9 9 26.5 15 26.5 7.5 ...
##  $ extra                : num  3 3 0.5 1.75 3 0.5 0.5 3 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  2 0 5 0 2.55 0.64 7.58 3.75 5.56 2.26 ...
##  $ tolls_amount         : num  6.55 0 6.55 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  44.4 17.8 42.6 39.5 15.3 ...
##  $ congestion_surcharge : num  2.5 2.5 2.5 0 2.5 2.5 2.5 2.5 0 2.5 ...
##  $ airport_fee          : num  0 0 1.25 1.25 0 0 0 0 0 0 ...

Comments: At first glance, the data holds 67,604,356 values in total, spread over 3,558,124 observations of 19 variables, of which 7 are categorical and 12 are numerical. The data was obtained from the NYC TLC trip record data page - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

2.2 Data Descriptors

Field Descriptions
Field.Name	Description
VendorID	A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
tpep_pickup_datetime	The date and time when the meter was engaged.
tpep_dropoff_datetime	The date and time when the meter was disengaged.
Passenger_count	The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance	The elapsed trip distance in miles reported by the taximeter.
PULocationID	TLC Taxi Zone in which the taximeter was engaged
DOLocationID	TLC Taxi Zone in which the taximeter was disengaged
RateCodeID	The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
Store_and_fwd_flag	This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip
Payment_type	A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
Fare_amount	The time-and-distance fare calculated by the meter.
Extra	Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax	$0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge	$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount	Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount	Total amount of all tolls paid in trip.
Total_amount	The total amount charged to passengers. Does not include cash tips.
Congestion_Surcharge	Total amount collected in trip for NYS congestion surcharge.
Airport_fee	$1.25 for pick up only at LaGuardia and John F. Kennedy Airports

Comments: There are 19 variables in total, but not all are used in our analysis; we will shortly remove the irrelevant columns. The columns most vital to this analysis are trip distance, trip duration, fare amount, tip amount, passenger count, and vendor ID.

Zones
LocationID	Borough	Zone	service_zone
1	EWR	Newark Airport	EWR
2	Queens	Jamaica Bay	Boro Zone
3	Bronx	Allerton/Pelham Gardens	Boro Zone
4	Manhattan	Alphabet City	Yellow Zone
5	Staten Island	Arden Heights	Boro Zone
6	Staten Island	Arrochar/Fort Wadsworth	Boro Zone

Comments: For this analysis, NYC is divided into 6 boroughs and 261 distinct zones.

2.3 Data Statistics

## 'data.frame':    3558124 obs. of  19 variables:
##  $ VendorID             : int  1 1 2 1 1 2 2 1 2 2 ...
##  $ tpep_pickup_datetime : POSIXct, format: "2022-05-31 20:25:41" "2022-05-31 20:44:40" ...
##  $ tpep_dropoff_datetime: POSIXct, format: "2022-05-31 20:48:22" "2022-05-31 21:01:48" ...
##  $ passenger_count      : num  1 1 1 2 0 1 1 1 1 1 ...
##  $ trip_distance        : num  11 4.2 9.49 12.1 1.8 2.02 8.08 4.3 8.78 1.76 ...
##  $ RatecodeID           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : chr  "N" "N" "N" "N" ...
##  $ PULocationID         : int  70 170 264 132 140 148 158 246 197 48 ...
##  $ DOLocationID         : int  48 226 113 17 163 158 116 262 191 186 ...
##  $ payment_type         : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ fare_amount          : num  32 14 26 37 9 9 26.5 15 26.5 7.5 ...
##  $ extra                : num  3 3 0.5 1.75 3 0.5 0.5 3 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  2 0 5 0 2.55 0.64 7.58 3.75 5.56 2.26 ...
##  $ tolls_amount         : num  6.55 0 6.55 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  44.4 17.8 42.6 39.5 15.3 ...
##  $ congestion_surcharge : num  2.5 2.5 2.5 0 2.5 2.5 2.5 2.5 0 2.5 ...
##  $ airport_fee          : num  0 0 1.25 1.25 0 0 0 0 0 0 ...

## 'data.frame':    265 obs. of  4 variables:
##  $ LocationID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Borough     : chr  "EWR" "Queens" "Bronx" "Manhattan" ...
##  $ Zone        : chr  "Newark Airport" "Jamaica Bay" "Allerton/Pelham Gardens" "Alphabet City" ...
##  $ service_zone: chr  "EWR" "Boro Zone" "Boro Zone" "Yellow Zone" ...

2.4 Data Manipulation

2.4.1 Look up values

Looking up the borough name for each location ID, e.g. 1 = EWR, 2 = Queens, 3 = Bronx.
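The lookup step can be sketched in base R as follows. This is a minimal illustration on toy data, not the code used for the report: the data frame names `trips` and `zones` are assumptions, while the column names follow the structures printed in section 2.3 and the cleaned summary.

```r
# Toy trip records and a slice of the zone lookup table (values are illustrative).
trips <- data.frame(
  PULocationID = c(4L, 2L, 1L),
  DOLocationID = c(1L, 4L, 3L),
  tip_amount   = c(2.00, 5.00, 3.75)
)
zones <- data.frame(
  LocationID = 1:4,
  Borough    = c("EWR", "Queens", "Bronx", "Manhattan"),
  stringsAsFactors = FALSE
)

# match() finds the lookup row for each ID and keeps the trips in their original order.
trips$PULocation <- zones$Borough[match(trips$PULocationID, zones$LocationID)]
trips$DOLocation <- zones$Borough[match(trips$DOLocationID, zones$LocationID)]

print(trips)
```

Using `match()` rather than `merge()` avoids reordering the rows, which is convenient when two columns are looked up against the same table.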

2.4.2 Calculated column of interest

Calculating columns of interest such as trip duration, tip percentage, and day of the week.
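A sketch of how such derived columns could be computed in base R is given below. The report does not state the formula for tip percentage or the boundaries of the time-of-day bins, so both are assumptions here: the tip is taken relative to `fare_amount`, and the day is split into four six-hour bins. The output column names mirror those in the cleaned summary.

```r
trips <- data.frame(
  tpep_pickup_datetime  = as.POSIXct(c("2022-06-01 08:15:00", "2022-06-04 22:40:00"), tz = "UTC"),
  tpep_dropoff_datetime = as.POSIXct(c("2022-06-01 08:45:00", "2022-06-04 23:05:00"), tz = "UTC"),
  fare_amount = c(32.5, 20.0),
  tip_amount  = c(6.5, 5.0)
)

# Trip duration in minutes, from meter engaged to meter disengaged.
trips$trip_duration <- as.numeric(difftime(trips$tpep_dropoff_datetime,
                                           trips$tpep_pickup_datetime,
                                           units = "mins"))

# ASSUMPTION: tip percentage is the tip relative to the metered fare.
trips$tip_perc <- round(100 * trips$tip_amount / trips$fare_amount)

# Day of the week; format "%w" returns 0 for Sunday through 6 for Saturday.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_index  <- as.integer(format(trips$tpep_pickup_datetime, "%w")) + 1
trips$day  <- factor(day_labels[day_index], levels = day_labels)

# ASSUMPTION: six-hour bins starting at midnight.
pickup_hour <- as.integer(format(trips$tpep_pickup_datetime, "%H"))
trips$PU_time_of_day <- cut(pickup_hour, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                            labels = c("Night", "Morning", "Afternoon", "Evening"))

print(trips[, c("trip_duration", "tip_perc", "day", "PU_time_of_day")])
```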

2.4.3 Missing values

Dealing with missing values
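The report does not say how missing values were treated. One common option, shown here purely as an assumption, is to count the missing entries per column and then drop incomplete rows (listwise deletion):

```r
trips <- data.frame(
  passenger_count = c(1, NA, 2, 1),
  fare_amount     = c(12.5, 9.0, NA, 30.0),
  tip_amount      = c(2.5, 1.0, 4.0, 6.0)
)

# Count missing values per column before deciding how to treat them.
na_counts <- colSums(is.na(trips))
print(na_counts)

# ASSUMPTION: rows with any missing field are dropped.
trips_complete <- trips[complete.cases(trips), ]
print(nrow(trips_complete))
```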

2.4.4 Defining categorical variables

Defining variables such as vendor ID, passenger count, and pick-up and drop-off location as categorical variables.
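In R, categorical variables are represented as factors. A minimal sketch of the conversion on toy data (the vector `categorical_cols` is a name chosen for this illustration):

```r
trips <- data.frame(
  VendorID        = c(1L, 2L, 2L, 1L),
  passenger_count = c(1, 2, 1, 6),
  PULocation      = c("Queens", "Manhattan", "Queens", "Bronx"),
  stringsAsFactors = FALSE
)

# Convert the chosen columns to factors in a single step.
categorical_cols <- c("VendorID", "passenger_count", "PULocation")
trips[categorical_cols] <- lapply(trips[categorical_cols], factor)

str(trips)
```

Once these columns are factors, `summary()` reports counts per level instead of quartiles, which is the layout seen for VendorID and RatecodeID in the cleaned summary.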

2.4.5 Outliers

Investigating Outliers:

Based on the above summary, the following observations guide how we deal with outliers:

Passenger_count goes up to nine, which seems incorrect; hence, we consider trips with up to 6 passengers.
The trip distance has a maximum value of 184341 miles, far more than the width of the entire United States. Therefore, we consider trip distances up to 40 miles.
RatecodeID, according to the data dictionary, can take only six values; however, the data contains values beyond six. These records are dropped.
The fare amount ranges from -907 to 395845. Unless someone is too generous, these values are incorrect. We keep fare amounts from 0 to 150.
Similarly, we keep the range 0-100 for the tip amount and the tolls collected.
Records with unknown pick-up or drop-off locations are dropped.
For payment type, only credit card payments are kept because we have no data for cash tips.
Finally, tip percentages reaching beyond 500% look suspicious; hence, we ignore these values and keep tip percentages up to 60%.
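The rules above translate into a single row filter. The sketch below applies them to toy data with `subset()`; the thresholds come from the list, while treating every bound as inclusive is an assumption, since the report does not say whether the limits themselves are kept.

```r
trips <- data.frame(
  passenger_count = c(1, 9, 2, 1, 3),
  trip_distance   = c(10.3, 5.0, 184341, 8.8, 15.5),
  RatecodeID      = c(1, 1, 2, 99, 2),
  payment_type    = c(1, 1, 1, 1, 2),
  fare_amount     = c(32.5, 20.0, 52.0, 27.5, 45.0),
  tip_amount      = c(8.5, 4.0, 10.0, 7.0, 0.0),
  tolls_amount    = c(6.55, 0, 6.55, 0, 0),
  tip_perc        = c(26, 20, 19, 25, 0),
  PULocation      = c("Queens", "Manhattan", "Queens", "Queens", "Queens"),
  DOLocation      = c("Manhattan", "Queens", "Manhattan", "Bronx", "Manhattan"),
  stringsAsFactors = FALSE
)

cleaned <- subset(
  trips,
  passenger_count <= 6 &                             # up to 6 passengers
    trip_distance <= 40 &                            # up to 40 miles
    RatecodeID %in% 1:6 &                            # only documented rate codes
    fare_amount >= 0 & fare_amount <= 150 &          # fare between 0 and 150
    tip_amount >= 0 & tip_amount <= 100 &            # tip between 0 and 100
    tolls_amount >= 0 & tolls_amount <= 100 &        # tolls between 0 and 100
    PULocation != "Unknown" & DOLocation != "Unknown" &
    payment_type == 1 &                              # credit card only
    tip_perc <= 60                                   # tip percentage up to 60%
)

print(nrow(cleaned))
```

In this toy example only the first row survives: the others fail on passenger count, distance, rate code, and payment type respectively.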

Summary for Cleaned Data

##  VendorID   pickup_time                      dropoff_time                   
##  1:28040   Min.   :2002-10-21 05:50:44.00   Min.   :2002-10-21 06:17:17.00  
##  2:80292   1st Qu.:2022-06-08 10:40:14.00   1st Qu.:2022-06-08 11:12:00.50  
##            Median :2022-06-15 15:27:32.00   Median :2022-06-15 15:54:23.00  
##            Mean   :2022-06-15 07:48:12.17   Mean   :2022-06-15 14:37:32.81  
##            3rd Qu.:2022-06-23 04:57:50.00   3rd Qu.:2022-06-23 05:27:29.75  
##            Max.   :2022-06-30 19:59:43.00   Max.   :2022-06-30 20:35:20.00  
##                                                                             
##  passenger_count trip_distance  RatecodeID payment_type  fare_amount   
##  Min.   :1.00    Min.   : 0.0   1 :79853   1:108332     Min.   :  3.5  
##  1st Qu.:1.00    1st Qu.: 8.8   2 :24087   2:     0     1st Qu.: 27.5  
##  Median :1.00    Median :10.3   3 : 3427   3:     0     Median : 32.5  
##  Mean   :1.45    Mean   :11.6   4 :   35   4:     0     Mean   : 36.2  
##  3rd Qu.:2.00    3rd Qu.:15.5   5 :  930                3rd Qu.: 52.0  
##  Max.   :6.00    Max.   :37.5   6 :    0                Max.   :149.5  
##                                 99:    0                               
##    tip_amount    tolls_amount          PULocation            DOLocation   
##  Min.   : 0.1   Min.   : 0.0   Bronx        :   49   Bronx        : 2579  
##  1st Qu.: 7.0   1st Qu.: 6.6   Brooklyn     :  271   Brooklyn     : 6741  
##  Median : 8.5   Median : 6.6   EWR          :   18   EWR          : 3622  
##  Mean   : 8.9   Mean   : 6.9   Manhattan    :46274   Manhattan    :60995  
##  3rd Qu.:10.8   3rd Qu.: 6.6   Queens       :61716   Queens       :34297  
##  Max.   :50.0   Max.   :56.0   Staten Island:    4   Staten Island:   98  
##                                Unknown      :    0   Unknown      :    0  
##     tip_perc    trip_duration   day          PU_time_of_day    DO_time_of_day 
##  Min.   : 1.0   Min.   : 1.0   Sun:17205   Night    :28670   Night    :26102  
##  1st Qu.:24.0   1st Qu.:22.0   Mon:17248   Morning  :31547   Morning  :32640  
##  Median :26.0   Median :28.0   Tue:14027   Afternoon:34259   Afternoon:33344  
##  Mean   :25.3   Mean   :27.7   Wed:16971   Evening  :13856   Evening  :16246  
##  3rd Qu.:28.0   3rd Qu.:34.0   Thu:16985                                      
##  Max.   :59.0   Max.   :40.0   Fri:13831                                      
##                                Sat:12065

After cleaning, the data contains 108,332 observations of 17 variables, i.e. 1,841,644 values in total.

2.5 Distribution Check

The distribution of the primary variable of this study (tip) is approximately normal, with some values missing from the left tail.

3 Exploratory Data Analysis

3.1 Parameter Visualization

We start by visualizing crucial parameters to determine their importance in our data.

We first look at the number of trips by day of the week and find that most trips occurred on Sunday, Monday, Wednesday, and Thursday.

The graph above indicates that the number of trips increases as the number of passengers decreases. The right-skewed distribution corroborates this observation.

We had expected evenings to have the most trips, but our data revealed that afternoons were the busiest, followed by mornings.

3.2 Relationship Exploration

Observations - We first examined the correlation coefficients between tip amount and our continuous variables, trip distance and trip duration. As seen in the correlation plot above, the coefficients were 0.65 and 0.33 respectively, which suggest a moderate association. However, since correlation does not imply causation, statistical tests must be performed on these variables to examine their relationship.
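A correlation check of this kind can be sketched in base R. The data below is simulated, so the coefficients it prints are not those of the report; the slopes and noise levels are arbitrary assumptions chosen only to produce a positive association.

```r
set.seed(42)
n <- 500

# Simulated stand-ins for the continuous variables.
trip_distance <- runif(n, min = 1, max = 40)
trip_duration <- 5 + 1.5 * trip_distance + rnorm(n, sd = 10)
tip_amount    <- 1 + 0.25 * trip_distance + rnorm(n, sd = 3)
toy <- data.frame(trip_distance, trip_duration, tip_amount)

# Pearson correlation matrix of all three variables.
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# cor.test() adds a significance test for a single pair of variables.
ct <- cor.test(toy$trip_distance, toy$tip_amount)
print(ct$p.value)
```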

3.3 Location Analysis

Location Distribution
Pick up Location	Drop off Location	No. of Trips	Avg. Fare
Queens	Manhattan	59411	36.9
Manhattan	Queens	33523	33.8
Manhattan	Brooklyn	6681	25.9
Manhattan	EWR	3587	66.1
Queens	Bronx	1466	39.6
Manhattan	Manhattan	1328	30.5
Manhattan	Bronx	1094	31.7
Queens	Queens	744	46.3
Brooklyn	Manhattan	232	23.6
Manhattan	Staten Island	61	47.2

Observations - At first glance, Queens has the most pickups, followed by Manhattan and Brooklyn. Similarly, Manhattan, Queens, and Brooklyn make up the top three drop-off locations. Further investigation showed that the highest number of trips was from Queens to Manhattan, followed by Manhattan to Queens and Manhattan to Brooklyn. These results are credible for yellow cabs (the focus of this analysis), as they primarily serve these regions, in contrast to green cabs, which serve areas where yellow cabs do not operate.

The highest average fare, $102, is for trips from Staten Island to EWR; however, we do not have sufficient data for these locations, so we consider only the top 10 pick-up and drop-off borough pairs by number of trips. Among these, a trip from Manhattan to EWR costs around $66 on average, and a trip within Queens around $46.

Now that we have looked at the insights from our location vs. tip percentage data, we explore the statistical significance of the relationship between the two.

This graph illustrates that Queens has the most tipping passengers among pick-up locations, followed by Manhattan. Similarly, Manhattan has more tipping passengers than any other drop-off location.
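A table of trips and average fare per borough pair, restricted to the busiest pairs, could be built along these lines. This is a base R sketch on toy data; the intermediate names (`n_trips`, `avg_fare`, `routes`, `top_routes`) are chosen for illustration only.

```r
trips <- data.frame(
  PULocation  = c("Queens", "Queens", "Manhattan", "Manhattan", "Queens", "Manhattan"),
  DOLocation  = c("Manhattan", "Manhattan", "Queens", "EWR", "Manhattan", "Queens"),
  fare_amount = c(35, 39, 33, 66, 37, 35),
  stringsAsFactors = FALSE
)

# Number of trips and mean fare for every pick-up/drop-off pair.
n_trips  <- aggregate(fare_amount ~ PULocation + DOLocation, data = trips, FUN = length)
avg_fare <- aggregate(fare_amount ~ PULocation + DOLocation, data = trips, FUN = mean)

routes <- merge(n_trips, avg_fare, by = c("PULocation", "DOLocation"),
                suffixes = c("_n", "_avg"))
names(routes) <- c("PULocation", "DOLocation", "n_trips", "avg_fare")

# Keep only the busiest pairs so that thinly populated routes do not dominate.
routes     <- routes[order(-routes$n_trips), ]
top_routes <- head(routes, 10)
print(top_routes)
```

Sorting by trip count before looking at fares is what keeps a sparsely observed pair, such as Staten Island to EWR in the report, from topping the table on the strength of a handful of trips.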

##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## PULocation       5    7691    1538     132 <2e-16 ***
## Residuals   108326 1257827      12                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## DOLocation       5  174314   34863    3461 <2e-16 ***
## Residuals   108326 1091205      10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observations - The p-value for both variables is below 2*10^-16 (R prints this as <2e-16), which is negligible compared to the significance level of 0.05. We therefore reject, for both pick-up and drop-off location, the null hypothesis that the mean tip is the same across locations; the group means are statistically different.
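Tables of this form are what `summary()` prints for a one-way `aov()` fit. The sketch below reproduces the mechanics on simulated data; the group means and spread are arbitrary, and using tip percentage as the response is an assumption based on the text above.

```r
set.seed(1)

# Simulated tip percentages for three pick-up boroughs (illustrative values only).
toy <- data.frame(
  PULocation = factor(rep(c("Bronx", "Manhattan", "Queens"), each = 50)),
  tip_perc   = c(rnorm(50, mean = 20, sd = 4),
                 rnorm(50, mean = 25, sd = 4),
                 rnorm(50, mean = 27, sd = 4))
)

# One-way ANOVA. H0: the mean tip percentage is equal across boroughs.
fit <- aov(tip_perc ~ PULocation, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Extract the p-value of the PULocation term and compare it with alpha = 0.05.
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value < 0.05)
```

The degrees of freedom follow the same pattern as in the output above: the number of groups minus one for the factor, and the number of observations minus the number of groups for the residuals.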

3.4 Trip Duration Impact on Tips

Observations - The data is approximately normally distributed with a slight right skew. The majority of our trips are 20-40 minutes long.

Observations - A t-test between trip duration and tip percentage yields a p-value below 2*10^-16. Since this is much lower than the significance level of 0.05, we reject the null hypothesis that the means are equal and conclude that yellow taxi passengers tip differently depending on the duration of the trip.

3.5 Trip Length and Tips

We want to determine whether passengers on shorter trips are more likely to leave larger tips, or whether those on longer trips are more generous.

To study these two groups separately, we divide the data by trip distance into two categories: short and long trips. When we plot trip distance against the number of tips paid, we notice that passengers tip more often on shorter rides than on longer ones.

Observations - A simple two-sided t-test gives a p-value far below the significance level of 0.05, so we can reject the null hypothesis. A Z-test cannot be used because we do not know the population mean and standard deviation.

Declaring hypotheses

Null Hypothesis (H0): The tip amount is the same for short- and long-distance passengers.

Alternate Hypothesis (Ha): The tip amount is NOT the same for short- and long-distance passengers.
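These hypotheses can be tested with a two-sided, two-sample t-test. The sketch below uses simulated data; the 10-mile cut-off between short and long trips is a hypothetical value, as the report does not state the threshold it used.

```r
set.seed(7)

# Simulated trips: 100 below and 100 above the cut-off.
toy <- data.frame(trip_distance = c(runif(100, 0.5, 9.9), runif(100, 10.1, 37)))
toy$tip_amount <- 2 + 0.3 * toy$trip_distance + rnorm(200, sd = 2)

# ASSUMPTION: trips under 10 miles count as "short".
toy$trip_type <- ifelse(toy$trip_distance < 10, "short", "long")

# Welch two-sample t-test (the default of t.test), two-sided alternative.
res <- t.test(tip_amount ~ trip_type, data = toy, alternative = "two.sided")
print(res$p.value)
print(res$p.value < 0.05)   # TRUE means H0 is rejected at the 0.05 level
```

`t.test()` estimates the standard deviations from the samples, which is why it applies here where a Z-test, requiring known population parameters, does not.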

3.6 Importance of passenger count and vendor

An ANOVA test between tip percentage and passenger count shows a significant relationship between the number of passengers and the tip, because the p-value is 0.00006, which is less than the significance level (0.05). Hence, we can reject the null hypothesis (H0).

Observation - Finally, we explore the relationship between the vendor and our response variable, tip. Unsurprisingly, a two-sided t-test between the two shows no significant difference, with a p-value of 0.264.

4 Conclusion

We conclude our analysis by discussing the limitations and future scope of this project.

Limitations

Daylight saving time is not taken into account.
Hardware constraints prevented us from analyzing a larger dataset.
Cash tips are not recorded, so they are not considered.
There are more variables to consider, such as gender and weather.
The data may contain human error.
The Central Limit Theorem (CLT) states that sample means of moderately large samples are often well approximated by a normal distribution even if the data itself is not normally distributed. Our dataset contains a large number of observations, so we treat its sample means as approximately normal under the CLT.
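The CLT argument in the last point can be illustrated with a short simulation. The exponential "population" below is only an assumed stand-in for a right-skewed tip distribution, and `skewness` is a helper defined here for the illustration, not a base R function.

```r
set.seed(123)

# A strongly right-skewed population (exponential with mean 8.9, an arbitrary choice).
population <- rexp(100000, rate = 1 / 8.9)

# Means of 2000 random samples of 100 observations each.
sample_means <- replicate(2000, mean(sample(population, size = 100)))

# Simple moment-based skewness; values near 0 indicate a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(skewness(population))    # clearly skewed
print(skewness(sample_means))  # much closer to 0
```

The sample means are far more symmetric than the raw values, which is the property the tests in this report rely on.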

Future scope

The dataset produced for this project will serve as the foundation for future study. More insights could be obtained by analyzing at least a year's worth of data. If weather affects the number of rides, hourly weather data combined with weather events may provide further information. Forecasts by zone and borough would make it easier for drivers to be in the right place at the right time. Gender, driver, and trip rating are further variables worth including.

Feature (variable)	Test	P-value	Null Hypothesis (H0)	Decision on H0
pickup location	ANOVA	1.79e-140	means are equal	reject H0
dropoff location	ANOVA	0	means are equal	reject H0
distance	T-Test	0	means are equal	reject H0
passenger count	ANOVA	0.0000604	means are equal	reject H0
vendor ID	T-test	0.264	means are equal	failed to reject H0

Chirag Lakhanpal, Shikha Sharma, Abhishek Pradhan

It is observed from the above chart that VeriFone has the larger share of yellow taxi rides, approximately 75%.