1 Introduction

This R Markdown file contains the code used to analyse the New York City yellow taxi dataset. Our objective is to identify which factors contribute to the tip amount for taxicab services.

2 Data preprocessing

2.1 Load data

We took a subset of the most recent yellow taxi data available at the time of download (June 2019). Hardware limitations prevented us from analysing the entire dataset, so we randomly selected 20,000 observations. Although we set a seed, we also exported the subset and used that file for our analysis, to ensure that all group members were working on the same data. The code we used to subset the data is commented out below.

The data was downloaded from the NYC Taxi and Limousine Commission (TLC) trip record data page: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
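The subsetting step described above might look like the following sketch. It is not the report's original code: the seed value, the file names, and the synthetic `full_data` frame are assumptions made so the example runs on its own. Note that `write.csv()` writes row names by default and `read.csv()` reads them back as a column named `X`, which would be consistent with the `X` column in the output below.

```r
set.seed(42)  # arbitrary seed for illustration, not the one used in the report

# Hypothetical stand-in for the full June 2019 trip file. In practice this
# would be read from the downloaded CSV (file name is an assumption):
# full_data <- read.csv("yellow_tripdata_2019-06.csv")
full_data <- data.frame(
  tip_amount  = runif(100000, 0, 10),
  fare_amount = runif(100000, 3, 60)
)

# Draw 20,000 row indices without replacement and keep those rows.
n_keep    <- 20000
keep_rows <- sample(nrow(full_data), n_keep)
subset_data <- full_data[keep_rows, ]

# Export once so that every collaborator loads the identical subset.
out_file <- file.path(tempdir(), "taxi_subset.csv")
write.csv(subset_data, out_file)

# Reading the export back yields an extra column X holding the original row numbers.
unprocessed_data <- read.csv(out_file)
```

Sharing the exported file rather than only the seed avoids differences that can arise when collaborators run different R versions or load the source file in a different row order.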

# Get glimpse / structure of data
glimpse(unprocessed_data)

## Observations: 20,000
## Variables: 19
## $ X                     <int> 4524218, 6458048, 3369795, 18532, 1743670,…
## $ DOLocationID          <int> 211, 249, 161, 4, 107, 246, 237, 125, 142,…
## $ PULocationID          <int> 90, 125, 68, 87, 234, 230, 163, 249, 236, …
## $ VendorID              <int> 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
## $ tpep_pickup_datetime  <fct> 6/16/19 0:15, 6/28/19 0:09, 6/14/19 23:04,…
## $ tpep_dropoff_datetime <fct> 6/16/19 0:28, 6/28/19 0:16, 6/14/19 23:22,…
## $ passenger_count       <int> 1, 1, 2, 1, 1, 6, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ trip_distance         <dbl> 1.60, 1.12, 2.72, 2.90, 0.62, 1.90, 0.96, …
## $ RatecodeID            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ store_and_fwd_flag    <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ payment_type          <int> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, …
## $ fare_amount           <dbl> 10.0, 6.5, 13.5, 11.0, 5.0, 11.0, 7.0, 7.0…
## $ extra                 <dbl> 0.5, 0.5, 0.5, 3.5, 0.5, 0.5, 0.0, 3.0, 0.…
## $ mta_tax               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount            <dbl> 0.00, 2.06, 2.60, 0.00, 1.76, 0.00, 1.00, …
## $ tolls_amount          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.…
## $ total_amount          <dbl> 13.80, 12.36, 19.90, 15.30, 10.56, 14.80, …
## $ congestion_surcharge  <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…

The unprocessed data frame has 20,000 rows and 19 columns: 7 integer columns, 9 double columns, and 3 factor columns. We need to convert columns such as passenger_count, VendorID, and payment_type into factor columns during analysis.
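A minimal sketch of the factor conversion mentioned above, using a small made-up data frame in place of `unprocessed_data`. The column names come from the dataset; the helper name `factor_cols` is an assumption.

```r
# Toy rows standing in for the real data (values are illustrative only).
taxi <- data.frame(
  VendorID        = c(2L, 2L, 1L, 2L),
  passenger_count = c(1L, 2L, 1L, 6L),
  payment_type    = c(2L, 1L, 1L, 2L),
  fare_amount     = c(10.0, 6.5, 13.5, 11.0)
)

# Integer codes that are really categories are converted to factors;
# numeric measurements such as fare_amount are left untouched.
factor_cols <- c("VendorID", "passenger_count", "payment_type")
taxi[factor_cols] <- lapply(taxi[factor_cols], factor)

str(taxi)
```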

# taxi zone mapping file
glimpse(taxi_zones)

## Observations: 265
## Variables: 4
## $ LocationID   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Borough      <fct> EWR, Queens, Bronx, Manhattan, Staten Island, State…
## $ Zone         <fct> Newark Airport, Jamaica Bay, Allerton/Pelham Garden…
## $ service_zone <fct> EWR, Boro Zone, Boro Zone, Yellow Zone, Boro Zone, …

The taxi zone lookup table has 265 rows and 4 columns: one integer column (LocationID) and three factor columns (Borough, Zone, service_zone). As the output above shows, Borough, Zone, and service_zone are already stored as factors.

2.2 Data statistics

2.2.1 Data Summary

Here is a summary of the subset taxi data:

##        X            DOLocationID    PULocationID      VendorID    
##  Min.   :   1016   Min.   :  1.0   Min.   :  3.0   Min.   :1.000  
##  1st Qu.:1725764   1st Qu.:107.0   1st Qu.:114.0   1st Qu.:1.000  
##  Median :3459818   Median :162.0   Median :161.0   Median :2.000  
##  Mean   :3461833   Mean   :160.4   Mean   :161.9   Mean   :1.642  
##  3rd Qu.:5200940   3rd Qu.:233.0   3rd Qu.:233.0   3rd Qu.:2.000  
##  Max.   :6940096   Max.   :265.0   Max.   :265.0   Max.   :4.000  
##                                                                   
##     tpep_pickup_datetime   tpep_dropoff_datetime passenger_count
##  6/11/19 7:56 :    7     6/24/19 18:36:    6     Min.   :0.000  
##  6/14/19 13:28:    6     6/27/19 21:45:    6     1st Qu.:1.000  
##  6/3/19 15:08 :    6     6/29/19 0:26 :    6     Median :1.000  
##  6/11/19 13:53:    5     6/1/19 22:46 :    5     Mean   :1.565  
##  6/14/19 23:16:    5     6/10/19 11:17:    5     3rd Qu.:2.000  
##  6/20/19 9:26 :    5     6/11/19 19:25:    5     Max.   :6.000  
##  (Other)      :19966     (Other)      :19967                    
##  trip_distance      RatecodeID    store_and_fwd_flag  payment_type  
##  Min.   : 0.000   Min.   :1.000   N:19893            Min.   :1.000  
##  1st Qu.: 0.990   1st Qu.:1.000   Y:  107            1st Qu.:1.000  
##  Median : 1.645   Median :1.000                      Median :1.000  
##  Mean   : 3.037   Mean   :1.054                      Mean   :1.291  
##  3rd Qu.: 3.100   3rd Qu.:1.000                      3rd Qu.:2.000  
##  Max.   :51.200   Max.   :5.000                      Max.   :4.000  
##                                                                     
##   fare_amount          extra           mta_tax          tip_amount     
##  Min.   :-160.00   Min.   :-1.000   Min.   :-0.5000   Min.   :  0.000  
##  1st Qu.:   6.50   1st Qu.: 0.000   1st Qu.: 0.5000   1st Qu.:  0.000  
##  Median :   9.50   Median : 0.500   Median : 0.5000   Median :  1.960  
##  Mean   :  13.47   Mean   : 1.163   Mean   : 0.4949   Mean   :  2.277  
##  3rd Qu.:  15.00   3rd Qu.: 2.500   3rd Qu.: 0.5000   3rd Qu.:  3.000  
##  Max.   : 399.20   Max.   : 7.000   Max.   : 0.5000   Max.   :175.000  
##                                                                        
##   tolls_amount     improvement_surcharge  total_amount    
##  Min.   :-6.1200   Min.   :-0.3000       Min.   :-160.80  
##  1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:  11.30  
##  Median : 0.0000   Median : 0.3000       Median :  14.80  
##  Mean   : 0.4059   Mean   : 0.2985       Mean   :  19.56  
##  3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  21.20  
##  Max.   :43.4300   Max.   : 0.3000       Max.   : 400.00  
##                                                           
##  congestion_surcharge
##  Min.   :-2.500      
##  1st Qu.: 2.500      
##  Median : 2.500      
##  Mean   : 2.273      
##  3rd Qu.: 2.500      
##  Max.   : 2.750      
##

Tip amount ranges from 0 to 175 dollars.

Trip distance ranges from 0 to 51.2 miles, with a mean of about 3.04 miles.

The minimum fare amount is -160. Since a fare cannot be negative, this looks like an outlier.

The maximum VendorID is 4. However, according to the data dictionary provided, the data is collected for only two vendors: Creative Mobile Technologies, LLC (ID 1) and VeriFone Inc (ID 2).

2.2.2 Outlier and Normality Check of Unprocessed Data

ggplot(data = unprocessed_data, aes(x = "", y = tip_amount)) +
  geom_boxplot(color = "#00AFBB") +
  # `fun` replaces the `fun.y` argument, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4) +
  labs(x = " ", y = "Tip amount (dollars)") +
  ggtitle("Boxplot of NYC Taxi Tip Amount")

In the graph above, many observations are flagged as outliers. We need to treat these outliers before analysis.

Looking at the distribution of raw tip amount, it is clear that it is not normally distributed and that there are some outliers.
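The points a boxplot flags as outliers can also be listed numerically. The sketch below uses base R's `boxplot.stats()`, which applies the usual rule of 1.5 times the interquartile range beyond the hinges. The `tips` vector is made-up illustrative data, not values from the report.

```r
# Illustrative tip amounts with one extreme value.
tips <- c(0, 0, 1, 1.76, 1.96, 2.06, 2.6, 3, 3.5, 175)

# boxplot.stats() returns the five-number summary in $stats and the
# points lying beyond the whiskers in $out (default coef = 1.5).
stats <- boxplot.stats(tips)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values the boxplot would draw as individual points
```

Applied to `unprocessed_data$tip_amount`, `length(stats$out)` would give the number of flagged observations.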

2.3 Column Creation

2.3.1 Tip Percentage

A normal distribution is an assumption of many statistical analyses. Generally, raw tip amounts vary because the fare amounts vary. One factor that may vary less is the tipping percentage. In the US, there is often a customary percentage that a customer gives (for example, 15% at restaurants). We divided the tip amount by the fare amount to obtain a tipping percentage, stored as tip_fare_ratio.
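As a sketch of this calculation, the ratio can be computed with a single vectorised division. The first three rows below reuse fare and tip values shown earlier in the data. The fourth row is a made-up zero-fare trip, added to show that 0/0 yields `NaN`, which `summary()` counts under `NA's`. This is one possible source of the missing values reported below, not a confirmed one.

```r
taxi <- data.frame(
  fare_amount = c(10.0, 6.5, 13.5, 0),
  tip_amount  = c(0.00, 2.06, 2.60, 0)
)

# Tip as a fraction of the fare (0.2 means a 20% tip).
taxi$tip_fare_ratio <- taxi$tip_amount / taxi$fare_amount

taxi$tip_fare_ratio          # last element is NaN because of 0 / 0
summary(taxi$tip_fare_ratio)
```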

Here is the structure, summary, and the first few rows of tip percentage:

##  num [1:20000] 0 0.317 0.193 0 0.352 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.2267  0.1839  0.2878 11.4400      15

##         X DOLocationID PULocationID VendorID tpep_pickup_datetime
## 1 4524218          211           90        2         6/16/19 0:15
## 2 6458048          249          125        2         6/28/19 0:09
## 3 3369795          161           68        2        6/14/19 23:04
## 4   18532            4           87        1        6/24/19 16:02
## 5 1743670          107          234        2        6/28/19 23:38
##   tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1          6/16/19 0:28               1          1.60          1
## 2          6/28/19 0:16               1          1.12          1
## 3         6/14/19 23:22               2          2.72          1
## 4         6/24/19 16:12               1          2.90          1
## 5         6/28/19 23:43               1          0.62          1
##   store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1                  N            2        10.0   0.5     0.5       0.00
## 2                  N            1         6.5   0.5     0.5       2.06
## 3                  N            1        13.5   0.5     0.5       2.60
## 4                  N            2        11.0   3.5     0.5       0.00
## 5                  N            1         5.0   0.5     0.5       1.76
##   tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1            0                   0.3        13.80                  2.5
## 2            0                   0.3        12.36                  2.5
## 3            0                   0.3        19.90                  2.5
## 4            0                   0.3        15.30                  2.5
## 5            0                   0.3        10.56                  2.5
##   tip_fare_ratio
## 1      0.0000000
## 2      0.3169231
## 3      0.1925926
## 4      0.0000000
## 5      0.3520000

2.3.2 Location Columns

The dataset provides a location ID that corresponds to a taxi zone in one of the five boroughs. These nominal variables do not provide much value in their integer format, since the IDs alone do not tell us the geographical location. We downloaded a taxi zone lookup dataset that provides the borough for each location ID, as well as the specific neighborhood within each borough. We merged that dataset with the taxi dataset to identify the borough for both pickup and dropoff.
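One way to attach boroughs to both location columns is to merge the lookup table twice, once per ID column. This is a sketch under assumptions: the toy `trips` rows are invented, the lookup rows reuse the first five entries of the zone table shown earlier, and the exact merge calls used in the report are not known.

```r
# First five rows of the zone lookup, as shown in the glimpse output.
taxi_zones <- data.frame(
  LocationID = 1:5,
  Borough    = c("EWR", "Queens", "Bronx", "Manhattan", "Staten Island")
)

# Toy trips (illustrative IDs only).
trips <- data.frame(
  PULocationID = c(4L, 2L, 4L),
  DOLocationID = c(1L, 4L, 3L)
)

# Left join on the pickup ID, then rename the borough column.
with_pu <- merge(trips, taxi_zones, by.x = "PULocationID",
                 by.y = "LocationID", all.x = TRUE)
names(with_pu)[names(with_pu) == "Borough"] <- "Borough_pu"

# Left join on the dropoff ID, then rename again.
with_both <- merge(with_pu, taxi_zones, by.x = "DOLocationID",
                   by.y = "LocationID", all.x = TRUE)
names(with_both)[names(with_both) == "Borough"] <- "Borough_do"

with_both
```

`merge()` puts the key column first and sorts rows by it by default, so the result is ordered by `DOLocationID`. Setting `all.x = TRUE` keeps trips whose ID has no match in the lookup, leaving the borough as `NA` for those rows.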

2.3.3 Pickup time column

The dataset provides the pickup datetime column as a factor. For our analysis we create a new factor column, pickup_period, which takes the values “Morning”, “Afternoon”, “Evening” or “Night” based on the pickup hour.

2.3.4 Dropoff time column

The dataset provides the dropoff datetime column as a factor. For our analysis we create a new factor column, drop_period, which takes the values “Morning”, “Afternoon”, “Evening” or “Night” based on the dropoff hour.
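The mapping from hour to period could be sketched as below. The report does not state the cut points, so the boundaries here (Morning 05-11, Afternoon 12-16, Evening 17-21, Night 22-04) are assumptions chosen to agree with the sample rows printed later. The function name `hour_to_period` is hypothetical.

```r
# Map an hour of day (0-23) to an ordered period factor.
# ASSUMED boundaries: Morning 05-11, Afternoon 12-16, Evening 17-21, Night 22-04.
hour_to_period <- function(hrs) {
  period <- ifelse(hrs >= 5  & hrs <= 11, "Morning",
            ifelse(hrs >= 12 & hrs <= 16, "Afternoon",
            ifelse(hrs >= 17 & hrs <= 21, "Evening", "Night")))
  factor(period,
         levels  = c("Morning", "Afternoon", "Evening", "Night"),
         ordered = TRUE)
}

pickup_hrs    <- c(11, 16, 7, 20, 4)
pickup_period <- hour_to_period(pickup_hrs)
pickup_period
```

The same function can be applied to the dropoff hours to build `drop_period`.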

2.3.5 Trip_duration

We suspected that there might also be a correlation between the duration of the trip and the tip amount paid. Since the dataset does not include the trip duration, we calculate it, in minutes, as the difference between the pickup and dropoff times.
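A sketch of the duration calculation using base R, with three pickup and dropoff strings taken from rows shown in this report. The format string and the `UTC` time zone are assumptions. The raw strings carry a two-digit year, so `%y` is used here. The year `0019` in the output that follows is consistent with the year having been parsed as a four-digit year instead, although the report's parsing code is not shown.

```r
pickup_raw  <- c("6/7/19 11:06", "6/18/19 16:03", "6/29/19 7:40")
dropoff_raw <- c("6/7/19 12:21", "6/18/19 17:08", "6/29/19 8:01")

# %y reads a two-digit year ("19" becomes 2019). as.character() guards
# against the columns being factors. The time zone is an assumption.
fmt <- "%m/%d/%y %H:%M"
pickup_datetime  <- as.POSIXct(as.character(pickup_raw),  format = fmt, tz = "UTC")
dropoff_datetime <- as.POSIXct(as.character(dropoff_raw), format = fmt, tz = "UTC")

# Hour of day, used for the period columns.
pickup_hrs <- as.numeric(format(pickup_datetime, "%H"))

# Trip duration in minutes.
trip_duration <- as.numeric(difftime(dropoff_datetime, pickup_datetime,
                                     units = "mins"))
trip_duration
```

Because the duration is a difference between two timestamps parsed the same way, it is unaffected by a consistent error in the year, which may explain why the durations in the output look plausible despite the year shown.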

Here is the structure, summary, and the first few rows of the dataset with new columns:

## 'data.frame':    20000 obs. of  33 variables:
##  $ DOLocationID         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  233 231 186 234 231 161 246 68 50 132 ...
##  $ X                    : int  14365 1376 8878 13985 9970 7812 3427 13195 9857 7422 ...
##  $ VendorID             : int  2 1 1 1 2 1 2 2 2 1 ...
##  $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 13958 4851 10536 8018 11326 1673 8964 12138 12514 12942 ...
##  $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 13903 4833 10443 7946 11238 1655 8895 12052 12466 12912 ...
##  $ passenger_count      : int  1 1 1 1 1 1 4 1 1 1 ...
##  $ trip_distance        : num  23.9 15.5 16.7 14.2 13.5 ...
##  $ RatecodeID           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_type         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fare_amount          : num  96 73.5 71 64 54.5 68 61 61.5 70.5 117 ...
##  $ extra                : num  0 1 0 0 0 0 1 0.5 1 0 ...
##  $ mta_tax              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_amount           : num  29.2 18.4 12 19.1 13.1 ...
##  $ tolls_amount         : num  20.5 17.5 10.5 12.5 10.5 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  146 110.8 93.8 96 78.4 ...
##  $ congestion_surcharge : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_fare_ratio       : num  0.304 0.251 0.169 0.299 0.24 ...
##  $ Borough_pu           : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 5 ...
##  $ Borough_do           : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Zone                 : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
##  $ service_zone         : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ pickup_datetime      : POSIXct, format: "0019-06-07 11:06:00" "0019-06-18 16:03:00" ...
##  $ pickup_time          : chr  "11:06" "16:03" "14:42" "16:22" ...
##  $ pickup_hrs           : num  11 16 14 16 7 12 16 20 17 14 ...
##  $ pickup_period        : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 2 2 1 2 2 3 3 2 ...
##  $ dropoff_datetime     : POSIXct, format: "0019-06-07 12:21:00" "0019-06-18 17:08:00" ...
##  $ dropoff_time         : chr  "12:21" "17:08" "15:38" "17:12" ...
##  $ dropoff_hrs          : num  12 17 15 17 8 12 17 20 18 15 ...
##  $ drop_period          : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 2 3 2 3 1 2 3 3 3 2 ...
##  $ trip_duration        : num  75 65 56 50 21 35 41 27 41 99 ...

##   DOLocationID    PULocationID         X              VendorID    
##  Min.   :  1.0   Min.   :  3.0   Min.   :   1016   Min.   :1.000  
##  1st Qu.:107.0   1st Qu.:114.0   1st Qu.:1725764   1st Qu.:1.000  
##  Median :162.0   Median :161.0   Median :3459818   Median :2.000  
##  Mean   :160.4   Mean   :161.9   Mean   :3461833   Mean   :1.642  
##  3rd Qu.:233.0   3rd Qu.:233.0   3rd Qu.:5200940   3rd Qu.:2.000  
##  Max.   :265.0   Max.   :265.0   Max.   :6940096   Max.   :4.000  
##                                                                   
##     tpep_pickup_datetime   tpep_dropoff_datetime passenger_count
##  6/11/19 7:56 :    7     6/24/19 18:36:    6     Min.   :0.000  
##  6/14/19 13:28:    6     6/27/19 21:45:    6     1st Qu.:1.000  
##  6/3/19 15:08 :    6     6/29/19 0:26 :    6     Median :1.000  
##  6/11/19 13:53:    5     6/1/19 22:46 :    5     Mean   :1.565  
##  6/14/19 23:16:    5     6/10/19 11:17:    5     3rd Qu.:2.000  
##  6/20/19 9:26 :    5     6/11/19 19:25:    5     Max.   :6.000  
##  (Other)      :19966     (Other)      :19967                    
##  trip_distance      RatecodeID    store_and_fwd_flag  payment_type  
##  Min.   : 0.000   Min.   :1.000   N:19893            Min.   :1.000  
##  1st Qu.: 0.990   1st Qu.:1.000   Y:  107            1st Qu.:1.000  
##  Median : 1.645   Median :1.000                      Median :1.000  
##  Mean   : 3.037   Mean   :1.054                      Mean   :1.291  
##  3rd Qu.: 3.100   3rd Qu.:1.000                      3rd Qu.:2.000  
##  Max.   :51.200   Max.   :5.000                      Max.   :4.000  
##                                                                     
##   fare_amount          extra           mta_tax          tip_amount     
##  Min.   :-160.00   Min.   :-1.000   Min.   :-0.5000   Min.   :  0.000  
##  1st Qu.:   6.50   1st Qu.: 0.000   1st Qu.: 0.5000   1st Qu.:  0.000  
##  Median :   9.50   Median : 0.500   Median : 0.5000   Median :  1.960  
##  Mean   :  13.47   Mean   : 1.163   Mean   : 0.4949   Mean   :  2.277  
##  3rd Qu.:  15.00   3rd Qu.: 2.500   3rd Qu.: 0.5000   3rd Qu.:  3.000  
##  Max.   : 399.20   Max.   : 7.000   Max.   : 0.5000   Max.   :175.000  
##                                                                        
##   tolls_amount     improvement_surcharge  total_amount    
##  Min.   :-6.1200   Min.   :-0.3000       Min.   :-160.80  
##  1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:  11.30  
##  Median : 0.0000   Median : 0.3000       Median :  14.80  
##  Mean   : 0.4059   Mean   : 0.2985       Mean   :  19.56  
##  3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  21.20  
##  Max.   :43.4300   Max.   : 0.3000       Max.   : 400.00  
##                                                           
##  congestion_surcharge tip_fare_ratio            Borough_pu   
##  Min.   :-2.500       Min.   : 0.0000   Bronx        :   35  
##  1st Qu.: 2.500       1st Qu.: 0.0000   Brooklyn     :  242  
##  Median : 2.500       Median : 0.2267   EWR          :    0  
##  Mean   : 2.273       Mean   : 0.1839   Manhattan    :18088  
##  3rd Qu.: 2.500       3rd Qu.: 0.2878   Queens       : 1466  
##  Max.   : 2.750       Max.   :11.4400   Staten Island:    1  
##                       NA's   :15        Unknown      :  168  
##          Borough_do                           Zone      
##  Bronx        :  159   Midtown Center           :  791  
##  Brooklyn     :  810   Upper East Side North    :  765  
##  EWR          :   40   Upper East Side South    :  758  
##  Manhattan    :17661   Murray Hill              :  619  
##  Queens       : 1073   Times Sq/Theatre District:  614  
##  Staten Island:    3   (Other)                  :16380  
##  Unknown      :  254   NA's                     :   73  
##       service_zone   pickup_datetime               pickup_time       
##  Airports   :  459   Min.   :0019-05-31 23:58:00   Length:20000      
##  Boro Zone  : 2639   1st Qu.:0019-06-08 10:40:15   Class :character  
##  EWR        :   40   Median :0019-06-15 17:14:00   Mode  :character  
##  N/A        :  254   Mean   :0019-06-15 22:05:19                     
##  Yellow Zone:16608   3rd Qu.:0019-06-23 04:01:15                     
##                      Max.   :0019-06-30 23:57:00                     
##                                                                      
##    pickup_hrs      pickup_period  dropoff_datetime             
##  Min.   : 0.00   Morning  :5092   Min.   :0019-06-01 00:04:00  
##  1st Qu.: 9.00   Afternoon:5336   1st Qu.:0019-06-08 10:43:45  
##  Median :14.00   Evening  :5695   Median :0019-06-15 17:27:00  
##  Mean   :13.75   Night    :3877   Mean   :0019-06-15 22:23:23  
##  3rd Qu.:19.00                    3rd Qu.:0019-06-23 04:19:30  
##  Max.   :23.00                    Max.   :0019-07-01 22:24:00  
##                                                                
##  dropoff_time        dropoff_hrs       drop_period   trip_duration    
##  Length:20000       Min.   : 0.00   Morning  :4870   Min.   :   0.00  
##  Class :character   1st Qu.: 9.00   Afternoon:5274   1st Qu.:   7.00  
##  Mode  :character   Median :14.00   Evening  :5762   Median :  11.00  
##                     Mean   :13.74   Night    :4094   Mean   :  18.06  
##                     3rd Qu.:19.00                    3rd Qu.:  19.00  
##                     Max.   :23.00                    Max.   :1439.00  
##

##   DOLocationID PULocationID     X VendorID tpep_pickup_datetime
## 1            1          233 14365        2         6/7/19 11:06
## 2            1          231  1376        1        6/18/19 16:03
## 3            1          186  8878        1        6/28/19 14:42
## 4            1          234 13985        1        6/23/19 16:22
## 5            1          231  9970        2         6/29/19 7:40
##   tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1          6/7/19 12:21               1         23.93          3
## 2         6/18/19 17:08               1         15.50          3
## 3         6/28/19 15:38               1         16.70          3
## 4         6/23/19 17:12               1         14.20          3
## 5          6/29/19 8:01               1         13.46          3
##   store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1                  N            1        96.0     0       0      29.20
## 2                  N            1        73.5     1       0      18.45
## 3                  N            1        71.0     0       0      12.00
## 4                  N            1        64.0     0       0      19.15
## 5                  N            1        54.5     0       0      13.06
##   tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1         20.5                   0.3       146.00                    0
## 2         17.5                   0.3       110.75                    0
## 3         10.5                   0.3        93.80                    0
## 4         12.5                   0.3        95.95                    0
## 5         10.5                   0.3        78.36                    0
##   tip_fare_ratio Borough_pu Borough_do           Zone service_zone
## 1      0.3041667  Manhattan        EWR Newark Airport          EWR
## 2      0.2510204  Manhattan        EWR Newark Airport          EWR
## 3      0.1690141  Manhattan        EWR Newark Airport          EWR
## 4      0.2992187  Manhattan        EWR Newark Airport          EWR
## 5      0.2396330  Manhattan        EWR Newark Airport          EWR
##       pickup_datetime pickup_time pickup_hrs pickup_period
## 1 0019-06-07 11:06:00       11:06         11       Morning
## 2 0019-06-18 16:03:00       16:03         16     Afternoon
## 3 0019-06-28 14:42:00       14:42         14     Afternoon
## 4 0019-06-23 16:22:00       16:22         16     Afternoon
## 5 0019-06-29 07:40:00       07:40          7       Morning
##      dropoff_datetime dropoff_time dropoff_hrs drop_period trip_duration
## 1 0019-06-07 12:21:00        12:21          12   Afternoon            75
## 2 0019-06-18 17:08:00        17:08          17     Evening            65
## 3 0019-06-28 15:38:00        15:38          15   Afternoon            56
## 4 0019-06-23 17:12:00        17:12          17     Evening            50
## 5 0019-06-29 08:01:00        08:01           8     Morning            21

2.4 Outlier detection and treatment

The following filters are applied to the dataset:

1. VendorID: the data contains vendor ID 4, but according to the data dictionary only 1 and 2 should be present.
2. payment_type: only credit card payments have a corresponding tip amount recorded, so we analyse only credit card payments.
3. fare_amount: some fare amounts are negative; these entries are outliers and are removed.
4. passenger_count: according to the law, a maximum of 7 passengers is allowed in a taxi.
5. trip_distance: some observations have a trip distance of 0, possibly cancelled trips. These observations are excluded from the analysis.
6. trip_duration: values of 0 or less and values above 37 minutes fall outside the boxplot whiskers. These are removed.

Using the boxplot, we removed the outliers from the dataset.
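The filters listed above can be sketched as a single `subset()` call. The toy data frame is invented so that each of the last five rows violates exactly one rule. The threshold of 37 minutes is taken from the text, and the code 1 for credit card payments follows the TLC data dictionary. This is an illustration, not the report's actual filtering code.

```r
# Toy data: row 1 passes every filter; rows 2-6 each fail exactly one.
taxi <- data.frame(
  VendorID        = c(1, 2, 4, 2, 1, 2),
  payment_type    = c(1, 1, 1, 2, 1, 1),
  fare_amount     = c(10, -5, 8, 12, 9, 20),
  passenger_count = c(1, 1, 2, 1, 3, 1),
  trip_distance   = c(1.6, 2.0, 1.1, 3.0, 0, 4.2),
  trip_duration   = c(12, 10, 8, 15, 5, 90)
)

processed <- subset(
  taxi,
  VendorID %in% c(1, 2) &                   # only vendors in the data dictionary
    payment_type == 1 &                     # credit card payments only
    fare_amount >= 0 &                      # drop negative fares
    passenger_count <= 7 &                  # legal passenger limit per the text
    trip_distance > 0 &                     # drop zero-distance trips
    trip_duration > 0 & trip_duration <= 37 # keep durations inside the whiskers
)

nrow(processed)
```

Combining the conditions in one call makes the order of the filters irrelevant and leaves the original data frame untouched for comparison.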

Below is the boxplot after outlier treatment.

Observation: all outliers have been removed by the treatment.

Here is the structure, summary, and the first few rows of the processed dataset:

## 'data.frame':    11881 obs. of  33 variables:
##  $ DOLocationID         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  231 161 68 143 68 125 164 87 230 100 ...
##  $ X                    : int  9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
##  $ VendorID             : int  2 1 2 2 1 1 2 2 2 2 ...
##  $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
##  $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
##  $ passenger_count      : int  1 1 1 1 1 1 1 1 5 2 ...
##  $ trip_distance        : num  13.5 17.3 16.4 17.8 15.3 ...
##  $ RatecodeID           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_type         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fare_amount          : num  54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
##  $ extra                : num  0 0 0.5 0 0 0 0 0.5 0.5 0 ...
##  $ mta_tax              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_amount           : num  13.1 15.8 10 16.7 8 ...
##  $ tolls_amount         : num  10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  78.4 94.5 84.8 100 78.8 ...
##  $ congestion_surcharge : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_fare_ratio       : num  0.24 0.232 0.163 0.254 0.133 ...
##  $ Borough_pu           : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Borough_do           : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Zone                 : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
##  $ service_zone         : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ pickup_datetime      : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
##  $ pickup_time          : chr  "07:40" "12:12" "20:09" "07:11" ...
##  $ pickup_hrs           : num  7 12 20 7 6 16 12 4 3 6 ...
##  $ pickup_period        : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ dropoff_datetime     : POSIXct, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
##  $ dropoff_time         : chr  "08:01" "12:47" "20:36" "07:45" ...
##  $ dropoff_hrs          : num  8 12 20 7 7 16 12 4 3 7 ...
##  $ drop_period          : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ trip_duration        : num  21 35 27 34 26 34 31 31 24 31 ...

##   DOLocationID    PULocationID       X              VendorID    
##  Min.   :  1.0   Min.   :  4   Min.   :   1082   Min.   :1.000  
##  1st Qu.:113.0   1st Qu.:125   1st Qu.:1784279   1st Qu.:1.000  
##  Median :162.0   Median :162   Median :3523917   Median :2.000  
##  Mean   :162.4   Mean   :165   Mean   :3520009   Mean   :1.629  
##  3rd Qu.:234.0   3rd Qu.:234   3rd Qu.:5288564   3rd Qu.:2.000  
##  Max.   :265.0   Max.   :265   Max.   :6939845   Max.   :2.000  
##                                                                 
##     tpep_pickup_datetime   tpep_dropoff_datetime passenger_count
##  6/11/19 7:56 :    6     6/24/19 18:36:    6     Min.   :0.000  
##  6/14/19 13:28:    5     6/6/19 9:40  :    5     1st Qu.:1.000  
##  6/14/19 23:16:    5     6/12/19 10:48:    4     Median :1.000  
##  6/6/19 9:33  :    5     6/12/19 23:05:    4     Mean   :1.559  
##  6/10/19 10:08:    4     6/14/19 12:58:    4     3rd Qu.:2.000  
##  6/11/19 19:34:    4     6/14/19 9:26 :    4     Max.   :6.000  
##  (Other)      :11852     (Other)      :11854                    
##  trip_distance      RatecodeID    store_and_fwd_flag  payment_type
##  Min.   : 0.030   Min.   :1.000   N:11828            Min.   :1    
##  1st Qu.: 1.010   1st Qu.:1.000   Y:   53            1st Qu.:1    
##  Median : 1.600   Median :1.000                      Median :1    
##  Mean   : 2.465   Mean   :1.018                      Mean   :1    
##  3rd Qu.: 2.720   3rd Qu.:1.000                      3rd Qu.:1    
##  Max.   :25.100   Max.   :5.000                      Max.   :1    
##                                                                   
##   fare_amount         extra          mta_tax         tip_amount    
##  Min.   :  3.00   Min.   :0.000   Min.   :0.0000   Min.   : 0.440  
##  1st Qu.:  6.50   1st Qu.:0.000   1st Qu.:0.5000   1st Qu.: 1.860  
##  Median :  9.50   Median :0.500   Median :0.5000   Median : 2.450  
##  Mean   : 11.61   Mean   :1.218   Mean   :0.4989   Mean   : 2.924  
##  3rd Qu.: 13.50   3rd Qu.:2.500   3rd Qu.:0.5000   3rd Qu.: 3.360  
##  Max.   :238.00   Max.   :4.500   Max.   :0.5000   Max.   :47.650  
##                                                                    
##   tolls_amount     improvement_surcharge  total_amount   
##  Min.   : 0.0000   Min.   :0.3           Min.   :  4.50  
##  1st Qu.: 0.0000   1st Qu.:0.3           1st Qu.: 12.30  
##  Median : 0.0000   Median :0.3           Median : 15.35  
##  Mean   : 0.2448   Mean   :0.3           Mean   : 18.30  
##  3rd Qu.: 0.0000   3rd Qu.:0.3           3rd Qu.: 20.75  
##  Max.   :23.0000   Max.   :0.3           Max.   :285.95  
##                                                          
##  congestion_surcharge tip_fare_ratio           Borough_pu   
##  Min.   :0.00         Min.   :0.1091   Bronx        :    1  
##  1st Qu.:2.50         1st Qu.:0.2287   Brooklyn     :  105  
##  Median :2.50         Median :0.2667   EWR          :    0  
##  Mean   :2.39         Mean   :0.2649   Manhattan    :11247  
##  3rd Qu.:2.50         3rd Qu.:0.3086   Queens       :  437  
##  Max.   :2.75         Max.   :0.4283   Staten Island:    0  
##                                        Unknown      :   91  
##          Borough_do                       Zone           service_zone  
##  Bronx        :   32   Upper East Side North: 495   Airports   :  175  
##  Brooklyn     :  383   Upper East Side South: 472   Boro Zone  : 1155  
##  EWR          :   12   Midtown Center       : 451   EWR        :   12  
##  Manhattan    :10959   Murray Hill          : 400   N/A        :  107  
##  Queens       :  387   Midtown East         : 376   Yellow Zone:10432  
##  Staten Island:    1   (Other)              :9664                      
##  Unknown      :  107   NA's                 :  23                      
##  pickup_datetime               pickup_time          pickup_hrs   
##  Min.   :0019-05-31 23:58:00   Length:11881       Min.   : 0.00  
##  1st Qu.:0019-06-08 09:50:00   Class :character   1st Qu.: 9.00  
##  Median :0019-06-15 14:09:00   Mode  :character   Median :14.00  
##  Mean   :0019-06-15 20:32:38                      Mean   :13.82  
##  3rd Qu.:0019-06-23 03:21:00                      3rd Qu.:19.00  
##  Max.   :0019-06-30 23:57:00                      Max.   :23.00  
##                                                                  
##    pickup_period  dropoff_datetime              dropoff_time      
##  Morning  :3106   Min.   :0019-06-01 00:04:00   Length:11881      
##  Afternoon:2873   1st Qu.:0019-06-08 09:59:00   Class :character  
##  Evening  :3507   Median :0019-06-15 14:17:00   Mode  :character  
##  Night    :2395   Mean   :0019-06-15 20:45:40                     
##                   3rd Qu.:0019-06-23 03:48:00                     
##                   Max.   :0019-07-01 00:02:00                     
##                                                                   
##   dropoff_hrs      drop_period   trip_duration  
##  Min.   : 0.0   Morning  :3005   Min.   : 1.00  
##  1st Qu.: 9.0   Afternoon:2871   1st Qu.: 7.00  
##  Median :14.0   Evening  :3481   Median :11.00  
##  Mean   :13.8   Night    :2524   Mean   :13.05  
##  3rd Qu.:19.0                    3rd Qu.:18.00  
##  Max.   :23.0                    Max.   :37.00  
##

##   DOLocationID PULocationID     X VendorID tpep_pickup_datetime
## 1            1          231  9970        2         6/29/19 7:40
## 2            1          161  7812        1        6/12/19 12:12
## 3            1           68 13195        2        6/30/19 20:09
## 5            1          143 12280        2         6/25/19 7:11
## 6            1           68  2391        1         6/22/19 6:38
##   tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1          6/29/19 8:01               1         13.46          3
## 2         6/12/19 12:47               1         17.30          3
## 3         6/30/19 20:36               1         16.41          3
## 5          6/25/19 7:45               1         17.84          3
## 6          6/22/19 7:04               1         15.30          3
##   store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1                  N            1        54.5   0.0       0      13.06
## 2                  N            1        68.0   0.0       0      15.75
## 3                  N            1        61.5   0.5       0      10.00
## 5                  N            1        65.5   0.0       0      16.66
## 6                  N            1        60.0   0.0       0       8.00
##   tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1         10.5                   0.3        78.36                    0
## 2         10.5                   0.3        94.55                    0
## 3         12.5                   0.3        84.80                    0
## 5         17.5                   0.3        99.96                    0
## 6         10.5                   0.3        78.80                    0
##   tip_fare_ratio Borough_pu Borough_do           Zone service_zone
## 1      0.2396330  Manhattan        EWR Newark Airport          EWR
## 2      0.2316176  Manhattan        EWR Newark Airport          EWR
## 3      0.1626016  Manhattan        EWR Newark Airport          EWR
## 5      0.2543511  Manhattan        EWR Newark Airport          EWR
## 6      0.1333333  Manhattan        EWR Newark Airport          EWR
##       pickup_datetime pickup_time pickup_hrs pickup_period
## 1 0019-06-29 07:40:00       07:40          7       Morning
## 2 0019-06-12 12:12:00       12:12         12     Afternoon
## 3 0019-06-30 20:09:00       20:09         20       Evening
## 5 0019-06-25 07:11:00       07:11          7       Morning
## 6 0019-06-22 06:38:00       06:38          6       Morning
##      dropoff_datetime dropoff_time dropoff_hrs drop_period trip_duration
## 1 0019-06-29 08:01:00        08:01           8     Morning            21
## 2 0019-06-12 12:47:00        12:47          12   Afternoon            35
## 3 0019-06-30 20:36:00        20:36          20     Evening            27
## 5 0019-06-25 07:45:00        07:45           7     Morning            34
## 6 0019-06-22 07:04:00        07:04           7     Morning            26

2.4.1 Normality check

Here’s a histogram of the NYC taxi tip data after the removal of outliers:

#ggplot histogram of tip_fare_ratio for processed df
processed_df %>%
  ggplot(aes(tip_fare_ratio)) +
  geom_histogram(aes(y =..density..),  colour = "black", fill = "#66B2FF", binwidth = 0.01) + 
  stat_function(fun = dnorm, args = list(mean = mean(processed_df$tip_fare_ratio), sd = sd(processed_df$tip_fare_ratio))) + ggtitle("Distribution of NYC Taxi Tip Data Post-Outlier Removal")

Observation: The data is approximately normally distributed, with somewhat fewer points on the left side of the mean.
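A Q-Q plot and a skewness estimate can complement the visual check of the histogram. The following is only a sketch: `tip_ratio` is simulated here as a stand-in for `processed_df$tip_fare_ratio` (an assumption, since the processed data is not bundled with this report), and the skewness formula is the simple moment-based estimate.

```r
# Simulated stand-in for processed_df$tip_fare_ratio (assumption, not the real data)
set.seed(1)
tip_ratio <- rnorm(1000, mean = 0.265, sd = 0.068)

# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(tip_ratio, main = "Normal Q-Q plot of tip ratio")
qqline(tip_ratio, col = "red")

# Moment-based skewness: values near 0 indicate a roughly symmetric distribution,
# negative values a longer left tail, positive values a longer right tail
skew <- mean((tip_ratio - mean(tip_ratio))^3) / sd(tip_ratio)^3
skew
```

With the real data, `tip_ratio` would simply be replaced by `processed_df$tip_fare_ratio`.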

3 Exploratory Data Analysis

3.1 Feature visualization

To understand how the different features are distributed in the data, we plotted the graphs below.

# plotting count of drives with passenger count
processed_df %>%
  group_by(passenger_count) %>%
  count() %>%
  ggplot(aes(passenger_count, n, fill = passenger_count)) +
  geom_col() +
  scale_y_sqrt() +
  theme(legend.position = "none")+  
  xlab("Number of passengers") + 
  ylab("Total number of trips") +
  ggtitle('Distribution according to the number of passengers in a taxi')

Observation: 1. Very few trips have zero passengers, and the majority of rides carry 1 passenger. 2. The number of trips falls as the passenger count rises to 4, which may be due to the size of a standard car. 3. The count then rises again at 5 passengers, which may reflect trips in larger cars.

# plotting  count of drives for vendor
processed_df %>%
  group_by(VendorID) %>%
  count() %>%
  ggplot(aes(VendorID, n, fill = VendorID)) +
  geom_col() +
  theme(legend.position = "none") +
  xlab("Vendor ID") + 
  ylab("Total number of trips") +
  ggtitle('Distribution according to vendors of the taxi service')

Vendor IDs 1 and 2 belong to Creative Mobile Technologies, LLC and VeriFone Inc., respectively.

Observation: Vendor 2 has more trips than vendor 1.

# plotting count of pickups per weekday for each vendor
processed_df %>%
  mutate(wday = wday(pickup_datetime)) %>%
  group_by(wday, VendorID) %>%
  count() %>%
  ggplot(aes(wday, n, colour = VendorID)) +
  geom_point(size = 3) +
  labs(x = "Day of the week", y = "Total number of pickups")

Observation: Vendor 2 has more trips than vendor 1 on every day of the week.

# plotting count of rides for each rate code
processed_df %>%
  group_by(RatecodeID) %>%
  count() %>%
  ggplot(aes(RatecodeID, n, fill = RatecodeID)) +
  geom_col() +
  theme(legend.position = "none") +
  xlab("Rate code") + 
  ylab("Count of rides") +
  ggtitle('Distribution according to rate code for the service')

The rate codes are as follows: 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride.

Observation: Most of the trips in the sample were booked at the standard rate. Because the distribution of rate codes is highly skewed, this variable won’t be used in our further analysis for hypothesis testing.

3.2 Impacts of Location on Tip

Location can also affect tipping. First, we count the number of trips in each borough, grouped first by pickup location and then by drop off location.

# Create barplot for trip counts based on pickup location
ggplot(data = processed_df, aes(x = Borough_pu, fill = Borough_pu)) + geom_bar() + ggtitle("Trip Counts Based on Pickup Location") + xlab("Location") + ylab("Trip Frequency")

# Create barplot for trip counts based on dropoff location
ggplot(data = processed_df, aes(x = Borough_do, fill = Borough_do)) + geom_bar() + ggtitle("Trip Counts Based on Dropoff Location") + xlab("Location") + ylab("Trip Frequency")

Observation: These bar charts show that Manhattan has the highest number of both pick ups and drop offs, followed by Queens and Brooklyn in second and third, respectively. We also looked at the frequency of the various pick up and drop off combinations.

# Subset data to get locations only
pu_do <- subset(processed_df, select = c("Borough_pu", "Borough_do"))

# Create table of location combinations
ddply(pu_do, .(Borough_pu, Borough_do), nrow)

##    Borough_pu    Borough_do    V1
## 1       Bronx     Manhattan     1
## 2    Brooklyn      Brooklyn    58
## 3    Brooklyn     Manhattan    45
## 4    Brooklyn        Queens     2
## 5   Manhattan         Bronx    28
## 6   Manhattan      Brooklyn   268
## 7   Manhattan           EWR    12
## 8   Manhattan     Manhattan 10640
## 9   Manhattan        Queens   280
## 10  Manhattan Staten Island     1
## 11  Manhattan       Unknown    18
## 12     Queens         Bronx     3
## 13     Queens      Brooklyn    56
## 14     Queens     Manhattan   262
## 15     Queens        Queens   104
## 16     Queens       Unknown    12
## 17    Unknown         Bronx     1
## 18    Unknown      Brooklyn     1
## 19    Unknown     Manhattan    11
## 20    Unknown        Queens     1
## 21    Unknown       Unknown    77

Observation: This table shows that Manhattan to Manhattan has by far the highest number of trips (10,640), followed by Manhattan to Queens (280), Manhattan to Brooklyn (268), and Queens to Manhattan (262).
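Sorting the combination counts makes this ranking easier to read off. The sketch below uses base R only and a small made-up data frame, `pu_do_toy`, as a stand-in for the `pu_do` subset (the toy rows are an assumption, not the real data).

```r
# Toy stand-in for the pickup/dropoff columns of processed_df (assumption)
pu_do_toy <- data.frame(
  Borough_pu = c("Manhattan", "Manhattan", "Manhattan", "Queens", "Manhattan", "Queens"),
  Borough_do = c("Manhattan", "Manhattan", "Queens", "Manhattan", "Manhattan", "Queens")
)

# Cross-tabulate the two columns and flatten to one row per combination
combo_counts <- as.data.frame(table(pu_do_toy), responseName = "trips")

# Drop combinations that never occur, then sort by descending frequency
combo_counts <- combo_counts[combo_counts$trips > 0, ]
combo_counts <- combo_counts[order(-combo_counts$trips), ]
combo_counts
```

Applied to `pu_do`, the first rows of `combo_counts` would list the most frequent pick up and drop off pairs directly.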

Given that Manhattan and Queens have the highest numbers of pick ups and drop offs, it makes sense that trips within and between these boroughs are also among the most frequent. It is also reasonable that Manhattan ranks highest on both measures, because yellow taxis (the focus of this study) mainly serve Manhattan, whereas green taxis usually serve the other boroughs that have traditionally been underserved by taxis.

With these descriptive statistics in mind, we decided to compare the mean tipping percentage for each borough based on drop off and pick up location to see if they were statistically different.

We used the ANOVA test. Here are the results:

## Call:
##    aov(formula = tip_fare_ratio ~ Borough_pu, data = processed_df)
## 
## Terms:
##                 Borough_pu Residuals
## Sum of Squares     0.48833  54.44540
## Deg. of Freedom          4     11876
## 
## Residual standard error: 0.06770886
## Estimated effects may be unbalanced

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_pu      4   0.49 0.12208   26.63 <2e-16 ***
## Residuals   11876  54.45 0.00458                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Call:
##    aov(formula = tip_fare_ratio ~ Borough_do, data = processed_df)
## 
## Terms:
##                 Borough_do Residuals
## Sum of Squares     0.66484  54.26889
## Deg. of Freedom          6     11874
## 
## Residual standard error: 0.06760471
## Estimated effects may be unbalanced

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_do      6   0.66 0.11081   24.24 <2e-16 ***
## Residuals   11874  54.27 0.00457                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot tip ratio by pickup location
ggplot(data = processed_df, aes(x = Borough_pu, y = tip_fare_ratio, fill = Borough_pu)) + ggtitle("Tip Ratio by Pickup Location") + geom_boxplot() + xlab("Location") + ylab("Tip Ratio")

# Plot tip ratio by dropoff location
ggplot(data = processed_df, aes(x = Borough_do, y = tip_fare_ratio, fill = Borough_do)) + ggtitle("Tip Ratio by Dropoff Location") + geom_boxplot() + xlab("Location") + ylab("Tip Ratio")

Observation: The p-values are 5.0569023 × 10^{-22} and 1.0673827 × 10^{-28} for pick up and drop off, respectively. Both of these are smaller than a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the means are the same and say the means are statistically different at a significance level of 0.05.
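For reference, a one-way ANOVA of this kind and the extraction of its p-value can be sketched as follows. The data frame `toy_df` is simulated (an assumption; it only mimics the column names `Borough_pu` and `tip_fare_ratio`), so the numbers it produces are not the ones reported above.

```r
# Simulated stand-in for processed_df with three boroughs (assumption)
set.seed(42)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 50)),
  tip_fare_ratio = c(rnorm(50, 0.23, 0.07),
                     rnorm(50, 0.27, 0.07),
                     rnorm(50, 0.24, 0.07))
)

# One-way ANOVA: does the mean tip ratio differ between pickup boroughs?
fit <- aov(tip_fare_ratio ~ Borough_pu, data = toy_df)
fit_summary <- summary(fit)

# The first row of the summary table holds the borough term
p_value <- fit_summary[[1]][["Pr(>F)"]][1]

# Decision at the 0.05 significance level
reject_null <- p_value < 0.05
```

Pulling the p-value out of the summary object in this way is what allows it to be quoted inline in the text rather than copied by hand.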

Because both results are significant, the next step is to conduct Tukey’s HSD test, which compares each pair of groups to see whether their means are significantly different. Here are the results from that analysis:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_fare_ratio ~ Borough_pu, data = processed_df)
## 
## $Borough_pu
##                            diff         lwr         upr     p adj
## Brooklyn-Bronx     -0.144985128 -0.33058557  0.04061531 0.2068399
## Manhattan-Bronx    -0.110932826 -0.29566393  0.07379828 0.4727764
## Queens-Bronx       -0.140110246 -0.32504437  0.04482388 0.2345923
## Unknown-Bronx      -0.105217640 -0.29095272  0.08051744 0.5327308
## Manhattan-Brooklyn  0.034052303  0.01594124  0.05216336 0.0000029
## Queens-Brooklyn     0.004874882 -0.01520148  0.02495124 0.9643270
## Unknown-Brooklyn    0.039767488  0.01331093  0.06622405 0.0003988
## Queens-Manhattan   -0.029177420 -0.03818395 -0.02017089 0.0000000
## Unknown-Manhattan   0.005715185 -0.01372722  0.02515759 0.9300852
## Unknown-Queens      0.034892606  0.01360748  0.05617773 0.0000764

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_fare_ratio ~ Borough_do, data = processed_df)
## 
## $Borough_do
##                                 diff           lwr          upr     p adj
## Brooklyn-Bronx           0.002541040 -0.0341428978  0.039224979 0.9999940
## EWR-Bronx               -0.002278280 -0.0697601339  0.065203574 0.9999999
## Manhattan-Bronx          0.035597726  0.0003050629  0.070890389 0.0464445
## Queens-Bronx             0.011394320 -0.0252749973  0.048063637 0.9701039
## Staten Island-Bronx      0.047879679 -0.1545658832  0.250325241 0.9927784
## Unknown-Bronx            0.024362072 -0.0158046879  0.064528832 0.5558605
## EWR-Brooklyn            -0.004819320 -0.0632626320  0.053623992 0.9999830
## Manhattan-Brooklyn       0.033056686  0.0226936677  0.043419704 0.0000000
## Queens-Brooklyn          0.008853279 -0.0055153974  0.023221956 0.5364941
## Staten Island-Brooklyn   0.045338639 -0.1542760545  0.244953332 0.9941996
## Unknown-Brooklyn         0.021821032  0.0000222088  0.043619855 0.0495693
## Manhattan-EWR            0.037876006 -0.0197042117  0.095456223 0.4538348
## Queens-EWR               0.013672599 -0.0447615360  0.072106735 0.9931849
## Staten Island-EWR        0.050157959 -0.1573368967  0.257652814 0.9918773
## Unknown-EWR              0.026640352 -0.0340496637  0.087330368 0.8548017
## Queens-Manhattan        -0.024203406 -0.0345145474 -0.013892265 0.0000000
## Staten Island-Manhattan  0.012281953 -0.1870817511  0.211645657 0.9999970
## Unknown-Manhattan       -0.011235654 -0.0306018472  0.008130539 0.6087610
## Staten Island-Queens     0.036485359 -0.1631266474  0.236097366 0.9982604
## Unknown-Queens           0.012967752 -0.0088064563  0.034741961 0.5779243
## Unknown-Staten Island   -0.023517607 -0.2238016130  0.176766399 0.9998637

Observation: The ANOVA shows that the means are not all the same, and the Tukey results identify which pairs differ. For pick up locations, the following pairs (excluding Unknown) have significant differences in tipping percentage given their small p-values: Manhattan and Brooklyn, and Queens and Manhattan. For drop off locations, Manhattan and Bronx, Manhattan and Brooklyn, and Manhattan and Queens are significant.
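Reading significant pairs off a long Tukey table by eye is error-prone, so it can help to filter the table programmatically. This is a minimal sketch on simulated data (`toy_df` is an assumption standing in for `processed_df`); `TukeyHSD()` returns one matrix per factor with the columns `diff`, `lwr`, `upr` and `p adj`.

```r
# Simulated stand-in for processed_df with three boroughs (assumption)
set.seed(42)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 50)),
  tip_fare_ratio = c(rnorm(50, 0.23, 0.07),
                     rnorm(50, 0.27, 0.07),
                     rnorm(50, 0.24, 0.07))
)

fit <- aov(tip_fare_ratio ~ Borough_pu, data = toy_df)

# Pairwise comparisons with a 95% family-wise confidence level
tukey <- TukeyHSD(fit, conf.level = 0.95)
tukey_mat <- tukey$Borough_pu

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- tukey_mat[tukey_mat[, "p adj"] < 0.05, , drop = FALSE]
significant
```

The row names of `significant` are the borough pairs, in the same `A-B` form as in the output above.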

ANOVA assumes a normal distribution, and as previously highlighted, the tipping amount is not necessarily normally distributed. Nonetheless, a look at the tipping amount can provide some context to the situation. We compared the mean tipping amount for each borough to see if those were statistically different. Here are the results for raw tip amount based on location:

## Call:
##    aov(formula = tip_amount ~ Borough_pu, data = processed_df)
## 
## Terms:
##                 Borough_pu Residuals
## Sum of Squares     8919.59  34065.84
## Deg. of Freedom          4     11876
## 
## Residual standard error: 1.693653
## Estimated effects may be unbalanced

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_pu      4   8920  2229.9   777.4 <2e-16 ***
## Residuals   11876  34066     2.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Call:
##    aov(formula = tip_amount ~ Borough_do, data = processed_df)
## 
## Terms:
##                 Borough_do Residuals
## Sum of Squares     8157.25  34828.18
## Deg. of Freedom          6     11874
## 
## Residual standard error: 1.712643
## Estimated effects may be unbalanced

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_do      6   8157  1359.5   463.5 <2e-16 ***
## Residuals   11874  34828     2.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot tip amount vs pickup location
ggplot(data = processed_df, aes(x = Borough_pu, y = tip_amount, fill = Borough_pu)) + ggtitle("Tip Amount by Pickup Location") + geom_boxplot() + xlab("Location") + ylab("Tip Amount")

# Plot tip amount vs dropoff location
ggplot(data = processed_df, aes(x = Borough_do, y = tip_amount, fill = Borough_do)) + ggtitle("Tip Amount by Dropoff Location") + geom_boxplot() + xlab("Location") + ylab("Tip Amount")

Observation: For these ANOVA analyses, the null hypothesis is that all means for different locations are the same. The p-values are smaller than a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the means are the same and say the means are statistically different at a significance level of 0.05.

As before, because both results are significant, we can run Tukey’s HSD test:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_amount ~ Borough_pu, data = processed_df)
## 
## $Borough_pu
##                          diff        lwr         upr     p adj
## Brooklyn-Bronx      1.8662857 -2.7762790  6.50885042 0.8083314
## Manhattan-Bronx     1.4282324 -3.1925869  6.04905175 0.9170756
## Queens-Bronx        6.0300915  1.4041939 10.65598918 0.0034681
## Unknown-Bronx       1.7268132 -2.9191194  6.37274573 0.8490430
## Manhattan-Brooklyn -0.4380533 -0.8910790  0.01497244 0.0637298
## Queens-Brooklyn     4.1638058  3.6616206  4.66599107 0.0000000
## Unknown-Brooklyn   -0.1394725 -0.8012506  0.52230554 0.9787356
## Queens-Manhattan    4.6018591  4.3765720  4.82714625 0.0000000
## Unknown-Manhattan   0.2985808 -0.1877468  0.78490831 0.4495574
## Unknown-Queens     -4.3032783 -4.8356994 -3.77085728 0.0000000

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_amount ~ Borough_do, data = processed_df)
## 
## $Borough_do
##                                diff         lwr         upr     p adj
## Brooklyn-Bronx           -1.1642632  -2.0935845  -0.2349419 0.0041593
## EWR-Bronx                 8.4019792   6.6924483  10.1115100 0.0000000
## Manhattan-Bronx          -3.2824489  -4.1765248  -2.3883731 0.0000000
## Queens-Bronx              0.1450347  -0.7839162   1.0739856 0.9992899
## Staten Island-Bronx       5.3228125   0.1942200  10.4514050 0.0359711
## Unknown-Bronx            -1.3390567  -2.3566089  -0.3215044 0.0020199
## EWR-Brooklyn              9.5662424   8.0856867  11.0467981 0.0000000
## Manhattan-Brooklyn       -2.1181857  -2.3807140  -1.8556574 0.0000000
## Queens-Brooklyn           1.3092979   0.9452935   1.6733024 0.0000000
## Staten Island-Brooklyn    6.4870757   1.4301982  11.5439532 0.0029651
## Unknown-Brooklyn         -0.1747934  -0.7270272   0.3774403 0.9672358
## Manhattan-EWR           -11.6844281 -13.1431188 -10.2257373 0.0000000
## Queens-EWR               -8.2569444  -9.7372677  -6.7766212 0.0000000
## Staten Island-EWR        -3.0791667  -8.3356738   2.1773405 0.5975372
## Unknown-EWR              -9.7410358 -11.2785077  -8.2035640 0.0000000
## Queens-Manhattan          3.4274837   3.1662695   3.6886978 0.0000000
## Staten Island-Manhattan   8.6052614   3.5547423  13.6557806 0.0000106
## Unknown-Manhattan         1.9433923   1.4527848   2.4339998 0.0000000
## Staten Island-Queens      5.1777778   0.1209683  10.2345872 0.0406846
## Unknown-Queens           -1.4840914  -2.0357016  -0.9324812 0.0000000
## Unknown-Staten Island    -6.6618692 -11.7357025  -1.5880358 0.0020916

Observation: For pick up locations, there is no statistically significant difference at the 0.05 level between the following pairs (excluding Unknown), given their large p-values: Brooklyn and Bronx, Manhattan and Bronx, and Manhattan and Brooklyn (p = 0.064). Every pair involving Queens is significant, so it is being picked up in Queens, rather than in Manhattan, that stands out. For drop off locations, there is no significant difference for Queens and Bronx, or for Staten Island and EWR. The Staten Island comparisons with Bronx (p = 0.036) and Queens (p = 0.041) are only narrowly significant, and their wide confidence intervals reflect how few Staten Island trips are in the sample. All the pairs involving Manhattan drop offs are significant. Based on this analysis, it seems that the tip amount for a drop off in Manhattan is significantly different from that for a drop off in any other location.

3.3 Passenger count and Vendor

We hypothesize that there is a relationship between the number of passengers in the car and the tip paid to the driver. The passenger count varies from 0 to 6.

# Create factor data
processed_df$passenger_count <- as.factor(processed_df$passenger_count)

# Create boxplot
processed_df %>%
  ggplot(aes(passenger_count, tip_fare_ratio, fill = passenger_count)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  labs(y = "Tip ratio", x = "Number of passengers") + 
  ggtitle("Box plot distribution for ratio of tip amount and passenger count")

Observation: The center of the tip ratio distribution does not vary much across passenger counts.

For the ANOVA, our null hypothesis is that the mean tip ratio is the same across all passenger count groups, whereas the alternative hypothesis is that at least one group mean differs.

# ANOVA test
anova_tip_amount = aov(tip_fare_ratio ~ passenger_count, data = processed_df)
summary(anova_tip_amount)

##                    Df Sum Sq  Mean Sq F value Pr(>F)
## passenger_count     6   0.04 0.006672   1.443  0.194
## Residuals       11874  54.89 0.004623

The p-value of 0.1937649 is greater than 0.05, hence we fail to reject the null hypothesis.

Hence, we find no evidence that the number of passengers in the car affects the tip ratio.

Below is the box plot distribution for vendors 1 and 2. We hypothesise that the vendor brand name affects the tip percentage paid to the driver.

# Create factor
processed_df$VendorID <- as.factor(processed_df$VendorID)

# Create boxplot of vendor ID vs tip percentage
processed_df %>%
  ggplot(aes(VendorID, tip_fare_ratio, fill = VendorID)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  labs(y = "Tip ratio", x = "Vendor ID") + 
  ggtitle("Box plot distribution for trip ratio and vendor")

Observation: The distributions for the two vendors look almost the same.

For the t-test, our null hypothesis is that the mean tip ratio is the same for both vendors, whereas the alternative hypothesis is that the means differ.

# t test for vendor
ttest_vendor = t.test(tip_fare_ratio ~ VendorID, data = processed_df)
ttest_vendor

## 
##  Welch Two Sample t-test
## 
## data:  tip_fare_ratio by VendorID
## t = -2.5074, df = 9128, p-value = 0.01218
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0057922981 -0.0007094282
## sample estimates:
## mean in group 1 mean in group 2 
##       0.2628451       0.2660959

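The p-value of 0.0121792 is less than 0.05, hence we reject the null hypothesis.

Hence, we can conclude that the vendor does affect the tip ratio, although the difference in means is small (0.2628 for vendor 1 versus 0.2661 for vendor 2).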

3.4 Impacts of Pickup and Dropoff time on Tip

The amount of tipping can be impacted by pickup time, dropoff time and the duration of the trip itself. For this analysis we have categorised the data into four categories, “Morning”, “Afternoon”, “Evening” and “Night” hours.
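The categorisation of hours into periods can be sketched as below. The hour boundaries in `to_period()` are hypothetical, chosen only for illustration, since the exact cut-offs used for `pickup_period` and `drop_period` are not shown in this section; the helper name is also ours.

```r
# Hypothetical helper: map an hour of the day (0-23) to a time period.
# Assumed boundaries: Morning 6-11, Afternoon 12-16, Evening 17-21, Night otherwise.
to_period <- function(hrs) {
  period <- ifelse(hrs >= 6 & hrs < 12, "Morning",
            ifelse(hrs >= 12 & hrs < 17, "Afternoon",
            ifelse(hrs >= 17 & hrs < 22, "Evening", "Night")))
  # Fix the level order so plots and tables follow the course of the day
  factor(period, levels = c("Morning", "Afternoon", "Evening", "Night"))
}

# Example with a few pickup hours
pickup_hrs <- c(0, 7, 13, 18, 23)
pickup_period <- to_period(pickup_hrs)
table(pickup_period)
```

Storing the result as a factor with explicit levels keeps the periods in chronological rather than alphabetical order in later summaries.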

Let us analyse the pickup period.

Let us analyse the dropoff period.

From the bar charts it is clearly seen that trips are almost uniformly distributed across the time periods, except that fewer trips are observed at night for both pickup and dropoff. However, the box plots suggest that the tip ratios of the various time periods are not equal, so we perform an ANOVA test on each period column to test this statistically.

ANOVA for pickup time periods

## Call:
##    aov(formula = tip_fare_ratio ~ pickup_period, data = processed_df)
## 
## Terms:
##                 pickup_period Residuals
## Sum of Squares        0.18735  54.74638
## Deg. of Freedom             3     11877
## 
## Residual standard error: 0.06789289
## Estimated effects may be unbalanced

##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## pickup_period     3   0.19 0.06245   13.55 8.04e-09 ***
## Residuals     11877  54.75 0.00461                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA for dropoff time periods

## Call:
##    aov(formula = tip_fare_ratio ~ drop_period, data = processed_df)
## 
## Terms:
##                 drop_period Residuals
## Sum of Squares      0.19737  54.73636
## Deg. of Freedom           3     11877
## 
## Residual standard error: 0.06788668
## Estimated effects may be unbalanced

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## drop_period     3   0.20 0.06579   14.28 2.78e-09 ***
## Residuals   11877  54.74 0.00461                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For these ANOVA analyses, the null hypothesis is that the means for the different time periods are all the same. The p-values are 8.0357371 × 10^{-9} and 2.7787489 × 10^{-9} for pick up and drop off, respectively. Both of these are smaller than a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the means are the same and say the means are statistically different at a significance level of 0.05.

As the previous results were significant, we can run Tukey’s HSD test:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_fare_ratio ~ pickup_period, data = processed_df)
## 
## $pickup_period
##                           diff          lwr          upr     p adj
## Afternoon-Morning  0.002486928 -0.002028522 0.0070023772 0.4898287
## Evening-Morning    0.010043145  0.005744952 0.0143413382 0.0000000
## Night-Morning      0.006094137  0.001350379 0.0108378962 0.0053440
## Evening-Afternoon  0.007556218  0.003166567 0.0119458677 0.0000580
## Night-Afternoon    0.003607210 -0.001219571 0.0084339905 0.2195216
## Night-Evening     -0.003949008 -0.008573182 0.0006751668 0.1248348

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_fare_ratio ~ drop_period, data = processed_df)
## 
## $drop_period
##                           diff           lwr          upr     p adj
## Afternoon-Morning  0.002106065 -2.446096e-03  0.006658225 0.6339951
## Evening-Morning    0.010372431  6.029032e-03  0.014715831 0.0000000
## Night-Morning      0.004613450 -9.601819e-05  0.009322919 0.0573673
## Evening-Afternoon  0.008266367  3.868904e-03  0.012663829 0.0000082
## Night-Afternoon    0.002507386 -2.251990e-03  0.007266761 0.5286702
## Night-Evening     -0.005758981 -1.031909e-02 -0.001198871 0.0064709

For pick up period, large p-values are observed for Afternoon-Morning, Night-Afternoon and Night-Evening. For drop off period, large p-values are observed for Afternoon-Morning, Night-Afternoon and Night-Morning (p = 0.057). The p-values for Evening-Morning and Evening-Afternoon are approximately zero for both pick up and drop off, which shows that the tip ratio for evening trips is significantly different from that for morning and afternoon trips.

3.5 Impacts of Duration of trip on Tip

There might be a relationship between the tip paid and the trip duration. We plot a histogram of tip paid against trip duration to observe the pattern.

From the graph it is observed that the tip is highest for trips of 5 to 10 minutes and decreases with duration. We then perform a t-test.

## 
##  Welch Two Sample t-test
## 
## data:  processed_df$tip_fare_ratio and processed_df$trip_duration
## t = -177.85, df = 11882, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.92108 -12.63937
## sample estimates:
## mean of x mean of y 
##  0.264889 13.045114

From the results we observe that the p-value is < 0.05. With the significance level set to 5%, we reject the null hypothesis that the means of tip_fare_ratio and trip_duration are equal. Note that this test compares the mean of a ratio (0.26) with the mean of a duration in minutes (13.05), so by itself it shows only that these two means differ. Together with the pattern in the graph, we take it as an indication that people travelling in yellow cabs in NYC tip differently based on the duration of the trip.
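A test of association addresses the question of whether the tip ratio varies with duration more directly than a comparison of two means does. The sketch below uses `cor.test()` on simulated data; the negative slope built into the toy `tip_fare_ratio` is an assumption made purely so the example has something to detect, and it says nothing about the real dataset.

```r
# Simulated stand-in for the two columns (assumption, not the real data)
set.seed(7)
trip_duration <- sample(1:37, 500, replace = TRUE)          # minutes
tip_fare_ratio <- 0.30 - 0.002 * trip_duration + rnorm(500, sd = 0.06)

# Pearson correlation test: H0 is that the true correlation is zero
assoc <- cor.test(tip_fare_ratio, trip_duration, method = "pearson")

assoc$estimate  # sample correlation coefficient
assoc$p.value   # p-value for H0: correlation = 0
```

On the real data the call would be `cor.test(processed_df$tip_fare_ratio, processed_df$trip_duration)`; a simple linear regression with `lm()` would be another option.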

3.6 Tip paid vs distance travelled by passengers

The last independent factor that we chose for our study is distance travelled, and we investigate whether it has anything to do with the amount tipped. In other words, does the amount tipped vary between passengers who travel shorter distances and those who travel longer distances?

First we take a look at the summary of the trip distance column. We can visually confirm these values through a box plot and histogram.

Next we turn to our second, dependent variable: tip percentage …

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1091  0.2287  0.2667  0.2649  0.3086  0.4283

It looks approximately normal. With a sample this large, the central limit theorem (CLT) also supports treating the sample mean as normally distributed, so with a normal dependent variable we are all set to move forward.

Now that we know our variables, the question is which test to apply, and why, when comparing tip paid with distance travelled. A z-test cannot be used because we don’t know the population mean and standard deviation. We also cannot use a one-sample t-test because we don’t have a pre-determined population mean, or some other theoretically derived value, with which to compare the mean of our observed sample.

So the simplest option that comes to mind is the independent two-sample t-test, a significance test that can give us an estimate as to whether different means between two groups are the result of random variation or the product of specific characteristics within the groups.

But before applying the two-sample t-test, we first need to fulfill some conditions for reliable results: the data must be random, normal, and independent.

Assumptions/Criteria to be fulfilled:

Random sampling data
Normally distributed dependent variable (CLT)
Independence of observations

As always, the first step in significance testing is to state the hypotheses:

Null Hypothesis: H_0 Average tip percentage is the same for both short and long distance passenger(s)
Alternate Hypothesis: H_a Average tip percentage is NOT the same for both short and long distance passenger(s)

## Observations: 11,881
## Variables: 2
## $ tip_fare_ratio <dbl> 0.2396330, 0.2316176, 0.1626016, 0.2543511, 0.133…
## $ trip_distance  <dbl> 13.46, 17.30, 16.41, 17.84, 15.30, 13.30, 14.27, …

Since we are going to work on two variable columns, with tip_fare_ratio as the dependent variable and trip_distance as the independent variable, we subset these variables into a new data frame for our own ease, as shown in the glimpse above.

Since we chose a two-sample t-test, we divide the independent variable, i.e. the distance covered in miles during each ride, into categorical data with two factor levels, Shorter and Longer distances. We split this column into the two levels based on the mean value of distance travelled, as can be seen from the glimpse of the dataset below.

## Observations: 11,881
## Variables: 2
## $ tip_fare_ratio <dbl> 0.2396330, 0.2316176, 0.1626016, 0.2543511, 0.133…
## $ trip_distance  <fct> Longer Distances, Longer Distances, Longer Distan…

Here is a histogram in which we combined values from both of our variables. It shows the frequency of tip percentages paid by both short and long distance travellers. We can get a rough idea of the mean tip percentages paid by the two groups, but we can’t be sure. Judging by the shape of this plot alone, we are unable to say whether there is a relationship between these two variables.


## 'data.frame':    8446 obs. of  2 variables:
##  $ tip_fare_ratio: num  0.25 0.3 0.3 0.301 0.308 ...
##  $ trip_distance : Factor w/ 2 levels "Shorter Distances",..: 1 1 1 1 1 1 1 1 1 1 ...

## 'data.frame':    3435 obs. of  2 variables:
##  $ tip_fare_ratio: num  0.24 0.232 0.163 0.254 0.133 ...
##  $ trip_distance : Factor w/ 2 levels "Shorter Distances",..: 2 2 2 2 2 2 2 2 2 2 ...

## 
##  Welch Two Sample t-test
## 
## data:  short_dist$tip_fare_ratio and long_dist$tip_fare_ratio
## t = 31.414, df = 7773.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03593062 0.04071328
## sample estimates:
## mean of x mean of y 
## 0.2759685 0.2376466

With the significance level set to 5%, we get a p-value very close to zero. We conclude that we have enough evidence to reject the null hypothesis in favor of the alternative hypothesis, meaning that people travelling in yellow cabs in NYC tip differently based on distance travelled.
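The split into two distance groups and the Welch two-sample t-test can be sketched as follows. The data frame `toy` is simulated (an assumption), and whether a trip exactly at the mean counts as shorter or longer is our choice for the example; only the level names `Shorter Distances` and `Longer Distances` are taken from the output above.

```r
# Simulated stand-in for the two-column subset (assumption, not the real data)
set.seed(3)
toy <- data.frame(
  trip_distance = rexp(400, rate = 1 / 3),           # miles
  tip_fare_ratio = rnorm(400, mean = 0.265, sd = 0.068)
)

# Split at the mean distance into a two-level factor
cutoff <- mean(toy$trip_distance)
toy$distance_group <- factor(
  ifelse(toy$trip_distance > cutoff, "Longer Distances", "Shorter Distances"),
  levels = c("Shorter Distances", "Longer Distances")
)

short_dist <- toy[toy$distance_group == "Shorter Distances", ]
long_dist <- toy[toy$distance_group == "Longer Distances", ]

# t.test() defaults to var.equal = FALSE, i.e. the Welch test
result <- t.test(short_dist$tip_fare_ratio, long_dist$tip_fare_ratio)
result$p.value
```

Because trip distances are right-skewed, splitting at the mean puts more trips in the shorter group than in the longer one, which matches the unequal group sizes (8446 and 3435) seen above.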

We can investigate a little further whether one of these two categories of travellers tips more or less than the other, so we plot a box plot to compare the two groups.

We were a bit surprised to see that short distance travellers pay a higher tip, as a percentage of the fare, than those who travelled more miles. We think this result deserves a separate study of its own: why short distance travellers pay more, perhaps for psychological reasons, or why long distance travellers pay less, perhaps for psychological or socio-economic reasons or because they are tired from the commute, which affects their mood. But that is another topic.

DATS6101_Project1_Taxi-Analysis

Steven Chao, Tanaya Kavathekar, Madhuri Yadav, Amna Gul

2021-07-30