1 Exordium

One of the most famous images of New York is the wave of yellow taxis flooding its streets. So where better to study taxi cab data than New York City? That is exactly what we set out to do. Since 2009, the NYC Taxi and Limousine Commission (TLC) has gathered massive amounts of data on every taxi trip in New York City. We wanted to get our hands dirty and put the analysis techniques we learnt over the semester to work.
We wanted to see how parameters such as pick-up location, drop-off location, distance, and number of passengers affect the tipping behavior of NYC taxi passengers.

2 Data Preparation

2.1 Data Gathering

## 'data.frame':    3558124 obs. of  19 variables:
##  $ VendorID             : int  1 1 2 1 1 2 2 1 2 2 ...
##  $ tpep_pickup_datetime : POSIXct, format: "2022-05-31 20:25:41" "2022-05-31 20:44:40" ...
##  $ tpep_dropoff_datetime: POSIXct, format: "2022-05-31 20:48:22" "2022-05-31 21:01:48" ...
##  $ passenger_count      : num  1 1 1 2 0 1 1 1 1 1 ...
##  $ trip_distance        : num  11 4.2 9.49 12.1 1.8 2.02 8.08 4.3 8.78 1.76 ...
##  $ RatecodeID           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : chr  "N" "N" "N" "N" ...
##  $ PULocationID         : int  70 170 264 132 140 148 158 246 197 48 ...
##  $ DOLocationID         : int  48 226 113 17 163 158 116 262 191 186 ...
##  $ payment_type         : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ fare_amount          : num  32 14 26 37 9 9 26.5 15 26.5 7.5 ...
##  $ extra                : num  3 3 0.5 1.75 3 0.5 0.5 3 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  2 0 5 0 2.55 0.64 7.58 3.75 5.56 2.26 ...
##  $ tolls_amount         : num  6.55 0 6.55 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  44.4 17.8 42.6 39.5 15.3 ...
##  $ congestion_surcharge : num  2.5 2.5 2.5 0 2.5 2.5 2.5 2.5 0 2.5 ...
##  $ airport_fee          : num  0 0 1.25 1.25 0 0 0 0 0 0 ...
Comments: At first glance, the data holds 67604356 values in total, spread across 3558124 observations and 19 variables, of which 7 are categorical and 12 are numerical. The data was obtained from the NYC TLC trip record data page: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

2.2 Data Descriptors

Field descriptions
Field name             Description
VendorID               A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
tpep_pickup_datetime   The date and time when the meter was engaged.
tpep_dropoff_datetime  The date and time when the meter was disengaged.
Passenger_count        The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance          The elapsed trip distance in miles reported by the taximeter.
PULocationID           TLC Taxi Zone in which the taximeter was engaged.
DOLocationID           TLC Taxi Zone in which the taximeter was disengaged.
RateCodeID             The final rate code in effect at the end of the trip. 1= Standard rate; 2= JFK; 3= Newark; 4= Nassau or Westchester; 5= Negotiated fare; 6= Group ride
Store_and_fwd_flag     This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip; N= not a store and forward trip
Payment_type           A numeric code signifying how the passenger paid for the trip. 1= Credit card; 2= Cash; 3= No charge; 4= Dispute; 5= Unknown; 6= Voided trip
Fare_amount            The time-and-distance fare calculated by the meter.
Extra                  Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax                $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge  $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount             Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount           Total amount of all tolls paid in trip.
Total_amount           The total amount charged to passengers. Does not include cash tips.
Congestion_Surcharge   Total amount collected in trip for NYS congestion surcharge.
Airport_fee            $1.25 for pick up only at LaGuardia and John F. Kennedy Airports.
Comments: There are 19 variables in total, but not all are used in our analysis; we will shortly remove the irrelevant columns. The columns most vital to this analysis are trip distance, trip duration (derived from the pick-up and drop-off times), fare amount, tip amount, passenger count, and vendor ID.
Zones
LocationID  Borough        Zone                     service_zone
1           EWR            Newark Airport           EWR
2           Queens         Jamaica Bay              Boro Zone
3           Bronx          Allerton/Pelham Gardens  Boro Zone
4           Manhattan      Alphabet City            Yellow Zone
5           Staten Island  Arden Heights            Boro Zone
6           Staten Island  Arrochar/Fort Wadsworth  Boro Zone
Comments: For this analysis, NYC is divided into 6 boroughs and 261 distinct zones.

2.3 Data Statistics

The trip data has the structure already listed in Section 2.1 (3558124 observations of 19 variables). The zone lookup table is structured as follows:
## 'data.frame':    265 obs. of  4 variables:
##  $ LocationID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Borough     : chr  "EWR" "Queens" "Bronx" "Manhattan" ...
##  $ Zone        : chr  "Newark Airport" "Jamaica Bay" "Allerton/Pelham Gardens" "Alphabet City" ...
##  $ service_zone: chr  "EWR" "Boro Zone" "Boro Zone" "Yellow Zone" ...

2.4 Data Manipulation

2.4.1 Look up values

Each pick-up and drop-off location ID is mapped to its borough name using the zone lookup table, for example 1 = EWR, 2 = Queens, 3 = Bronx.
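A minimal sketch of such a lookup in base R. The data frame names `trips` and `zones` and the toy rows are assumptions made for illustration; the new column names `PULocation` and `DOLocation` follow the cleaned-data summary later in this report. `match()` is used here because it keeps the trips in their original order.

```r
# Toy stand-ins for the trip data and the zone lookup table (illustrative rows only).
trips <- data.frame(PULocationID = c(1L, 2L, 3L),
                    DOLocationID = c(3L, 1L, 2L))
zones <- data.frame(LocationID = 1:3,
                    Borough    = c("EWR", "Queens", "Bronx"),
                    stringsAsFactors = FALSE)

# match() gives, for each trip, the row of `zones` holding the same LocationID.
trips$PULocation <- zones$Borough[match(trips$PULocationID, zones$LocationID)]
trips$DOLocation <- zones$Borough[match(trips$DOLocationID, zones$LocationID)]

print(trips)
```

IDs that are absent from the lookup table would come back as `NA`, which makes unmatched locations easy to spot before cleaning.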

2.4.2 Calculated column of interest

Calculating columns of interest such as trip duration, tip percentage, and day of the week.
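The derived columns could be computed along the following lines. This is a sketch under stated assumptions: the report does not say how tip percentage is defined, so it is taken here as tip divided by metered fare, and the time-of-day cut points (0, 6, 12, 18, 24 hours) are likewise assumed. The two toy rows reuse the first timestamps and amounts from the `str()` output in Section 2.1.

```r
# Two toy trips (timestamps and amounts taken from the str() output in Section 2.1).
trips <- data.frame(
  tpep_pickup_datetime  = as.POSIXct(c("2022-05-31 20:25:41", "2022-05-31 20:44:40"), tz = "UTC"),
  tpep_dropoff_datetime = as.POSIXct(c("2022-05-31 20:48:22", "2022-05-31 21:01:48"), tz = "UTC"),
  fare_amount = c(32, 14),
  tip_amount  = c(2, 0)
)

# Trip duration in minutes.
trips$trip_duration <- as.numeric(difftime(trips$tpep_dropoff_datetime,
                                           trips$tpep_pickup_datetime,
                                           units = "mins"))

# Tip percentage: ASSUMED to be the tip as a share of the metered fare.
trips$tip_perc <- 100 * trips$tip_amount / trips$fare_amount

# Day of week from the pick-up time (wday is 0 for Sunday), independent of locale.
pickup_lt <- as.POSIXlt(trips$tpep_pickup_datetime)
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
trips$day <- factor(day_names[pickup_lt$wday + 1], levels = day_names)

# Time of day: the cut points below are an assumption, not taken from the report.
trips$PU_time_of_day <- cut(pickup_lt$hour,
                            breaks = c(0, 6, 12, 18, 24),
                            labels = c("Night", "Morning", "Afternoon", "Evening"),
                            right  = FALSE)

print(trips[, c("trip_duration", "tip_perc", "day", "PU_time_of_day")])
```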

2.4.3 Missing values

Dealing with missing values
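The report does not state how missing values were treated. As one possible approach, and purely as an assumption, the sketch below counts the missing values per column and then keeps only the fully observed rows. The data frame `trips` and its toy values are hypothetical.

```r
# Toy data with a few missing entries (illustrative values only).
trips <- data.frame(passenger_count = c(1, NA, 2, 1),
                    RatecodeID      = c(1, NA, 1, 2),
                    tip_amount      = c(2, 0, 5, NA))

# Number of missing values in each column.
na_counts <- colSums(is.na(trips))
print(na_counts)

# One possible treatment (an assumption): drop every row with a missing value.
trips_complete <- trips[complete.cases(trips), ]
print(nrow(trips_complete))
```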

2.4.4 Defining categorical variables

Defining variables such as vendor ID, passenger count, and pick-up and drop-off location as categorical variables.
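In R this kind of conversion is typically done with `factor()`. The sketch below is illustrative only: the data frame `trips` and its rows are hypothetical, while the level labels are taken from the field descriptions in Section 2.2.

```r
# Toy coded columns (illustrative values only).
trips <- data.frame(VendorID     = c(1L, 1L, 2L, 1L),
                    payment_type = c(1L, 1L, 1L, 2L))

# Attach the labels from the data dictionary to the numeric codes.
trips$VendorID <- factor(trips$VendorID,
                         levels = c(1, 2),
                         labels = c("Creative Mobile Technologies", "VeriFone"))
trips$payment_type <- factor(trips$payment_type,
                             levels = 1:6,
                             labels = c("Credit card", "Cash", "No charge",
                                        "Dispute", "Unknown", "Voided trip"))

# Factors are summarised as counts per level rather than as numeric quantiles.
print(table(trips$VendorID))
print(table(trips$payment_type))
```

Listing all six payment levels explicitly keeps unused categories visible as zero counts, which matches how the cleaned-data summary reports them.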

2.4.5 Outliers

Investigating outliers:
Based on the summary of the raw data, we apply the following rules when dealing with outliers:
  1. Passenger_count goes up to nine, which seems implausible; hence we keep trips with up to 6 passengers.
  2. The trip distance has a maximum value of 184341 miles, far more than the width of the entire United States. We therefore keep trip distances up to 40 miles.
  3. According to the data dictionary, RateCodeID can take only six values; however, the data contains values beyond six. These values are dropped.
  4. The fare amount ranges from -907 to 395845. Unless someone is extremely generous, these values are incorrect. We keep fare amounts from 0 to 150.
  5. Similarly, we keep the range 0-100 for the tip amount and the tolls collected.
  6. Trips with unknown pick-up or drop-off locations are dropped.
  7. For payment type, only credit card payments are kept, because we do not have data for cash tips.
  8. Finally, tip percentages reach beyond 500%, which looks suspicious; we ignore these values and keep tip percentages up to 60%.
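The eight rules above can be expressed as a single logical filter. The sketch below uses a hypothetical data frame `trips` with invented rows, one for each kind of violation, and again assumes that tip percentage is tip divided by fare.

```r
# Toy data: row 1 is valid, rows 2 to 6 each break one rule (illustrative values only).
trips <- data.frame(
  passenger_count = c(1, 9, 2, 1, 1, 1),
  trip_distance   = c(11, 4.2, 184341, 8.08, 9.49, 4.3),
  RatecodeID      = c(1, 1, 1, 99, 2, 1),
  payment_type    = c(1, 1, 1, 1, 2, 1),
  fare_amount     = c(32, 14, 26, 26.5, 37, 15),
  tip_amount      = c(2, 0, 5, 7.58, 0, 3.75),
  tolls_amount    = c(6.55, 0, 6.55, 0, 0, 0),
  PULocation      = c("Queens", "Manhattan", "Queens", "Queens", "Manhattan", "Unknown"),
  DOLocation      = c("Manhattan", "Queens", "Manhattan", "Bronx", "Queens", "Manhattan"),
  stringsAsFactors = FALSE
)
trips$tip_perc <- 100 * trips$tip_amount / trips$fare_amount  # assumed definition

keep <- with(trips,
  passenger_count <= 6 &                              # rule 1
  trip_distance <= 40 &                               # rule 2
  RatecodeID %in% 1:6 &                               # rule 3
  fare_amount >= 0 & fare_amount <= 150 &             # rule 4
  tip_amount >= 0 & tip_amount <= 100 &               # rule 5 (tips)
  tolls_amount >= 0 & tolls_amount <= 100 &           # rule 5 (tolls)
  PULocation != "Unknown" & DOLocation != "Unknown" & # rule 6
  payment_type == 1 &                                 # rule 7 (credit card only)
  tip_perc <= 60                                      # rule 8
)

# which() drops rows where a condition evaluates to NA (e.g. a zero fare).
cleaned <- trips[which(keep), ]
print(nrow(cleaned))
```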

Summary for Cleaned Data

##  VendorID   pickup_time                      dropoff_time                   
##  1:28040   Min.   :2002-10-21 05:50:44.00   Min.   :2002-10-21 06:17:17.00  
##  2:80292   1st Qu.:2022-06-08 10:40:14.00   1st Qu.:2022-06-08 11:12:00.50  
##            Median :2022-06-15 15:27:32.00   Median :2022-06-15 15:54:23.00  
##            Mean   :2022-06-15 07:48:12.17   Mean   :2022-06-15 14:37:32.81  
##            3rd Qu.:2022-06-23 04:57:50.00   3rd Qu.:2022-06-23 05:27:29.75  
##            Max.   :2022-06-30 19:59:43.00   Max.   :2022-06-30 20:35:20.00  
##                                                                             
##  passenger_count trip_distance  RatecodeID payment_type  fare_amount   
##  Min.   :1.00    Min.   : 0.0   1 :79853   1:108332     Min.   :  3.5  
##  1st Qu.:1.00    1st Qu.: 8.8   2 :24087   2:     0     1st Qu.: 27.5  
##  Median :1.00    Median :10.3   3 : 3427   3:     0     Median : 32.5  
##  Mean   :1.45    Mean   :11.6   4 :   35   4:     0     Mean   : 36.2  
##  3rd Qu.:2.00    3rd Qu.:15.5   5 :  930                3rd Qu.: 52.0  
##  Max.   :6.00    Max.   :37.5   6 :    0                Max.   :149.5  
##                                 99:    0                               
##    tip_amount    tolls_amount          PULocation            DOLocation   
##  Min.   : 0.1   Min.   : 0.0   Bronx        :   49   Bronx        : 2579  
##  1st Qu.: 7.0   1st Qu.: 6.6   Brooklyn     :  271   Brooklyn     : 6741  
##  Median : 8.5   Median : 6.6   EWR          :   18   EWR          : 3622  
##  Mean   : 8.9   Mean   : 6.9   Manhattan    :46274   Manhattan    :60995  
##  3rd Qu.:10.8   3rd Qu.: 6.6   Queens       :61716   Queens       :34297  
##  Max.   :50.0   Max.   :56.0   Staten Island:    4   Staten Island:   98  
##                                Unknown      :    0   Unknown      :    0  
##     tip_perc    trip_duration   day          PU_time_of_day    DO_time_of_day 
##  Min.   : 1.0   Min.   : 1.0   Sun:17205   Night    :28670   Night    :26102  
##  1st Qu.:24.0   1st Qu.:22.0   Mon:17248   Morning  :31547   Morning  :32640  
##  Median :26.0   Median :28.0   Tue:14027   Afternoon:34259   Afternoon:33344  
##  Mean   :25.3   Mean   :27.7   Wed:16971   Evening  :13856   Evening  :16246  
##  3rd Qu.:28.0   3rd Qu.:34.0   Thu:16985                                      
##  Max.   :59.0   Max.   :40.0   Fri:13831                                      
##                                Sat:12065
The number of observations after data cleaning is 1841644.

2.5 Distribution Check

The primary variable of this study, the tip, is approximately normally distributed, although some values are missing on the left side of the distribution.

3 Exploratory Data Analysis

3.1 Parameter Visualization

We start by visualizing crucial parameters to determine their importance in our data.
We first look at the number of trips broken down by day of the week and find that most trips occurred on Sunday, Monday, Wednesday, and Thursday.

The chart above shows that VeriFone has the larger share of yellow taxi rides, approximately 75%.

The graph above shows that the number of trips falls as the passenger count rises; the right-skewed shape of the distribution reflects this.

We had expected evenings to have the most trips, but our data revealed that afternoons were the busiest in terms of number of trips, followed by mornings.

3.2 Relationship Exploration

Observations - We first examined the correlation coefficients between the tip amount and our continuous variables, trip distance and trip duration. As seen in the correlation plot above, the coefficients were 0.65 and 0.33, which suggest a moderate association. However, since correlation does not imply causation, statistical tests must be performed on these variables to establish their relationship.
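As an illustration of how such a coefficient is obtained, the sketch below computes the Pearson correlation between trip distance and tip amount with `cor()` and checks its significance with `cor.test()`. It uses only the first ten values printed in the `str()` output of Section 2.1, so the result will not match the coefficients reported above.

```r
# First ten raw values of each column, as printed by str() in Section 2.1.
trip_distance <- c(11, 4.2, 9.49, 12.1, 1.8, 2.02, 8.08, 4.3, 8.78, 1.76)
tip_amount    <- c(2, 0, 5, 0, 2.55, 0.64, 7.58, 3.75, 5.56, 2.26)

# Pearson correlation coefficient.
r <- cor(trip_distance, tip_amount)

# The same coefficient written out from its definition.
dx <- trip_distance - mean(trip_distance)
dy <- tip_amount - mean(tip_amount)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Test of the null hypothesis that the true correlation is zero.
ct <- cor.test(trip_distance, tip_amount)

print(r)
print(ct$p.value)
```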

3.3 Location Analysis

Location Distribution
Pick-up location  Drop-off location  No. of trips  Avg. fare ($)
Queens            Manhattan          59411         36.9
Manhattan         Queens             33523         33.8
Manhattan         Brooklyn           6681          25.9
Manhattan         EWR                3587          66.1
Queens            Bronx              1466          39.6
Manhattan         Manhattan          1328          30.5
Manhattan         Bronx              1094          31.7
Queens            Queens             744           46.3
Brooklyn          Manhattan          232           23.6
Manhattan         Staten Island      61            47.2
Observations - At first glance, Queens has the most pick-ups, followed by Manhattan and Brooklyn in second and third place, respectively. Similarly, Manhattan, Queens, and Brooklyn make up the top three drop-off locations. Further investigation showed that the highest number of trips was from Queens to Manhattan, followed by Manhattan to Queens and Manhattan to Brooklyn. These results are credible for yellow cabs (the focus of this analysis), as they primarily serve these boroughs, in contrast to green cabs, which serve areas where yellow cabs do not operate.
The highest average fare, $102, is for trips from Staten Island to EWR; however, we do not have sufficient data for these locations, so we consider only the top 10 pick-up and drop-off borough pairs by number of trips. A trip from Manhattan to EWR costs around $66 on average, and a trip within Queens around $46.
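A table of this shape, with trip counts and average fares per borough pair ranked by number of trips, could be built with `aggregate()` as sketched below. The data frame `trips` and its five rows are invented for illustration.

```r
# Toy trips (illustrative values only).
trips <- data.frame(
  PULocation  = c("Queens", "Queens", "Manhattan", "Manhattan", "Queens"),
  DOLocation  = c("Manhattan", "Manhattan", "Queens", "EWR", "Manhattan"),
  fare_amount = c(32, 37, 26, 66, 42),
  stringsAsFactors = FALSE
)

# Number of trips and mean fare for every pick-up/drop-off pair.
n_trips  <- aggregate(fare_amount ~ PULocation + DOLocation, data = trips, FUN = length)
avg_fare <- aggregate(fare_amount ~ PULocation + DOLocation, data = trips, FUN = mean)
names(n_trips)[3]  <- "n_trips"
names(avg_fare)[3] <- "avg_fare"

by_pair <- merge(n_trips, avg_fare, by = c("PULocation", "DOLocation"))

# Keep the ten busiest pairs, so that rarely travelled routes do not dominate.
by_pair <- by_pair[order(-by_pair$n_trips), ]
top10   <- head(by_pair, 10)
print(top10)
```

Ranking by trip count before reading off the average fare is what keeps sparsely observed routes, such as Staten Island to EWR, out of the comparison.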
Now that we have looked at the insights from our location and tip data, we explore the statistical significance of the relationship between the two.

This graph illustrates that Queens has the most tipping passengers among pick-up locations, followed by Manhattan. Similarly, Manhattan has more tipping passengers than any other drop-off location.
##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## PULocation       5    7691    1538     132 <2e-16 ***
## Residuals   108326 1257827      12                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## DOLocation       5  174314   34863    3461 <2e-16 ***
## Residuals   108326 1091205      10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observations - The p-value for both variables is reported as below 2 × 10^-16 (less than 0.0000000000000002), which is negligible compared with the significance level of 0.05. We therefore reject the null hypothesis that the mean tip is the same across locations: tips differ by a statistically significant amount between pick-up locations and between drop-off locations.
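For reference, a one-way ANOVA table like the ones above is produced by `aov()` followed by `summary()`. The sketch below runs it on synthetic tips for three pick-up boroughs; the group means, spread, and sample sizes are invented, so only the mechanics carry over, not the numbers.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic tips for three pick-up boroughs (means and spread are invented).
trips <- data.frame(
  PULocation = factor(rep(c("Manhattan", "Queens", "Brooklyn"), each = 50)),
  tip_amount = c(rnorm(50, mean = 8,  sd = 3),
                 rnorm(50, mean = 10, sd = 3),
                 rnorm(50, mean = 6,  sd = 3))
)

# H0: the mean tip is the same in every pick-up borough.
fit <- aov(tip_amount ~ PULocation, data = trips)
tab <- summary(fit)[[1]]          # the ANOVA table as a data frame
print(tab)

# Extract the exact p-value instead of relying on the printed "<2e-16" floor.
p_value <- tab[["Pr(>F)"]][1]
print(p_value)
```

Extracting `Pr(>F)` directly is how a value such as 1.79e-140 can be reported even though the printed table only shows that it is below the display threshold.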

3.4 Trip Duration Impact on Tips

Observations - The data is approximately normally distributed with a slight right skew. That is to say, the majority of our trips are 20-40 minutes long.

Observations - A t-test between trip duration and customer tip percentage gives a p-value of about 0.2 × 10^-15. Since this is far below the significance level of 0.05, we reject the null hypothesis that the means are equal and conclude that yellow taxi passengers tip differently depending on the length of the trip.

3.5 Trip Length and Tips

We attempt to determine whether passengers on shorter trips leave larger tips or whether those on longer trips are more generous.

To study these two groups separately, we split the data by trip distance into two categories: short and long trips. When we plot trip distance against the number of tips paid, we notice that passengers tip more often on shorter rides than on longer ones.

Observations - A simple two-sided t-test gives a p-value far below the significance level of 0.05, so we can reject the null hypothesis. A z-test cannot be used because the population mean and standard deviation are unknown.
Declaring hypothesis
Null Hypothesis: H0 Tip amount is the same for short- and long-distance passengers
Alternate Hypothesis: Ha Tip amount is NOT the same for short- and long-distance passengers
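The hypotheses above can be tested with a two-sided Welch t-test via `t.test()`. The sketch below is built on assumptions: the data is synthetic, and because the report does not give the distance that separates short from long trips, the median distance is used as the cut-off.

```r
set.seed(42)  # reproducible synthetic data

# Synthetic trips: tips loosely increasing with distance (invented relationship).
trips <- data.frame(trip_distance = runif(200, min = 0.5, max = 30))
trips$tip_amount <- 2 + 0.3 * trips$trip_distance + rnorm(200, sd = 1)

# ASSUMED cut-off: trips at or below the median distance count as "short".
cutoff <- median(trips$trip_distance)
trips$trip_type <- factor(ifelse(trips$trip_distance <= cutoff, "short", "long"),
                          levels = c("short", "long"))

# H0: mean tip is the same for short and long trips (two-sided alternative).
# t.test() defaults to the Welch test, which does not assume equal variances.
res <- t.test(tip_amount ~ trip_type, data = trips, alternative = "two.sided")
print(res$p.value)
```

The t-test estimates the standard deviations from the sample itself, which is why it applies here while the z-test, needing known population parameters, does not.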

3.6 Importance of passenger count and vendor

An ANOVA test between tip percentage and passenger count shows a significant relationship between the number of passengers and the tip, because the p-value of 0.00006 is below the significance level (0.05). Hence, we can reject the null hypothesis (H0).

Observation - Finally, we explore the relationship between the vendor and our response variable, the tip. Unsurprisingly, a two-sided t-test between these variables shows no significant difference, with a p-value of 0.264.

4 Conclusion

We conclude our analysis by discussing the limitations and future scope of this project.

Limitations

  1. Daylight saving time is not taken into account.
  2. Hardware constraints prevented us from working with a larger dataset.
  3. Cash tips are not recorded and are therefore not considered.
  4. There are more variables to consider, such as gender and weather.
  5. The data may contain human entry errors.
  6. The Central Limit Theorem (CLT) states that the means of moderately large samples are often well approximated by a normal distribution even if the underlying data is not normally distributed. Our dataset contains a large number of observations, so we rely on the CLT to treat the sample means used in our tests as approximately normal.
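The CLT argument in the last item can be illustrated with a small simulation. The sketch below draws from a strongly right-skewed exponential distribution, chosen only as a stand-in for skewed taxi data, and compares the skewness of the raw values with the skewness of the means of repeated samples. The population size, sample size, and number of repetitions are arbitrary choices.

```r
set.seed(123)  # reproducible simulation

# Right-skewed stand-in population (exponential with mean 3); NOT the taxi data.
population <- rexp(100000, rate = 1 / 3)

# Means of 2000 random samples of 200 observations each.
sample_means <- replicate(2000, mean(sample(population, size = 200)))

# Simple moment-based skewness: close to 0 for a symmetric distribution.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(skew(population))    # strongly skewed raw values
print(skew(sample_means))  # sample means are far closer to symmetric
```

The raw values stay skewed no matter how many there are; it is the distribution of the sample means that approaches normality, which is the property the t-tests and ANOVA rely on.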

Future scope

The dataset produced for this project will serve as the foundation for future study. More insights could be obtained by analyzing at least a year's worth of data. If weather affects the number of rides, hourly weather data combined with weather events may provide further information. A demand forecast by zone and by borough would help drivers be in the right place at any given time. Further variables such as gender, driver, and trip rating could also be included.
Feature (variable)  Test    P-value    Null hypothesis (H0)  Decision on H0
pickup location     ANOVA   1.79e-140  means are equal       reject H0
dropoff location    ANOVA   0          means are equal       reject H0
distance            T-test  0          means are equal       reject H0
passenger count     ANOVA   0.0000604  means are equal       reject H0
vendor ID           T-test  0.264      means are equal       fail to reject H0