This R Markdown file contains the code used to analyze the New York City yellow taxi dataset. Our objective is to identify which factors contribute to the tip amount for taxicab services.
To obtain the data, we took a subset of the most recent yellow taxi dataset available at the time of download (June 2019). We randomly selected 20,000 observations because hardware limitations prevented us from analyzing the entire dataset. Although we set a seed, we exported the subset and used that file for our analysis to ensure that all group members were working on the same data. The code we used to subset the data is commented below.
The data was downloaded from the NYC Taxi and Limousine Commission (TLC) trip record data page: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
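The sampling step described above could look roughly like the following. This is a minimal sketch rather than the code actually used: the seed value, the toy `full_data` frame, the sample size of 20, and the temporary output path are all assumptions made for illustration.

```r
# Minimal sketch of seeded random subsetting (all names and values are illustrative).
set.seed(123)  # assumed seed; the seed used for the report is not stated

# Toy stand-in for the full June 2019 trip file.
full_data <- data.frame(
  trip_distance = round(runif(1000, 0, 20), 2),
  fare_amount   = round(runif(1000, 3, 60), 2)
)

n_sample <- 20  # the report sampled 20000 rows
sampled_rows <- sample(nrow(full_data), n_sample)  # sampling without replacement
taxi_subset <- full_data[sampled_rows, ]

# Export once so every group member loads the identical subset.
out_file <- tempfile(fileext = ".csv")  # hypothetical output path
write.csv(taxi_subset, out_file)        # row names are written as the first column

# Reading the file back names that unnamed first column "X".
reloaded <- read.csv(out_file)
head(reloaded)
```

With this approach the exported row names are the row positions in the full file, which is probably where the `X` column in the output below comes from.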
## Observations: 20,000
## Variables: 19
## $ X <int> 4524218, 6458048, 3369795, 18532, 1743670,…
## $ DOLocationID <int> 211, 249, 161, 4, 107, 246, 237, 125, 142,…
## $ PULocationID <int> 90, 125, 68, 87, 234, 230, 163, 249, 236, …
## $ VendorID <int> 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
## $ tpep_pickup_datetime <fct> 6/16/19 0:15, 6/28/19 0:09, 6/14/19 23:04,…
## $ tpep_dropoff_datetime <fct> 6/16/19 0:28, 6/28/19 0:16, 6/14/19 23:22,…
## $ passenger_count <int> 1, 1, 2, 1, 1, 6, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ trip_distance <dbl> 1.60, 1.12, 2.72, 2.90, 0.62, 1.90, 0.96, …
## $ RatecodeID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ store_and_fwd_flag <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ payment_type <int> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, …
## $ fare_amount <dbl> 10.0, 6.5, 13.5, 11.0, 5.0, 11.0, 7.0, 7.0…
## $ extra <dbl> 0.5, 0.5, 0.5, 3.5, 0.5, 0.5, 0.0, 3.0, 0.…
## $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount <dbl> 0.00, 2.06, 2.60, 0.00, 1.76, 0.00, 1.00, …
## $ tolls_amount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.…
## $ total_amount <dbl> 13.80, 12.36, 19.90, 15.30, 10.56, 14.80, …
## $ congestion_surcharge <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
The unprocessed taxi data has 20,000 rows and 19 columns: 7 integer columns, 9 double columns, and 3 factor columns. Columns such as passenger_count, VendorID, and payment_type are stored as integers and need to be converted to factors for the analysis.
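The factor conversion mentioned above can be sketched as follows, using the first five rows of the three columns as they appear in the output. The data frame name `taxi` is hypothetical.

```r
# First five rows of the three coded columns, copied from the glimpse output above.
taxi <- data.frame(
  VendorID        = c(2L, 2L, 2L, 1L, 2L),
  passenger_count = c(1L, 1L, 2L, 1L, 1L),
  payment_type    = c(2L, 1L, 1L, 2L, 1L)
)

# Integer codes are categories here, not quantities, so store them as factors.
factor_cols <- c("VendorID", "passenger_count", "payment_type")
taxi[factor_cols] <- lapply(taxi[factor_cols], factor)

str(taxi)
```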
## Observations: 265
## Variables: 4
## $ LocationID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Borough <fct> EWR, Queens, Bronx, Manhattan, Staten Island, State…
## $ Zone <fct> Newark Airport, Jamaica Bay, Allerton/Pelham Garden…
## $ service_zone <fct> EWR, Boro Zone, Boro Zone, Yellow Zone, Boro Zone, …
The zone lookup data has 265 rows and 4 columns: 1 integer column (LocationID) and 3 factor columns, with no double columns. Borough, Zone, and service_zone need to be treated as factors during analysis; the output above shows that they were already read in as factors.
Here is a summary of the subset taxi data:
## X DOLocationID PULocationID VendorID
## Min. : 1016 Min. : 1.0 Min. : 3.0 Min. :1.000
## 1st Qu.:1725764 1st Qu.:107.0 1st Qu.:114.0 1st Qu.:1.000
## Median :3459818 Median :162.0 Median :161.0 Median :2.000
## Mean :3461833 Mean :160.4 Mean :161.9 Mean :1.642
## 3rd Qu.:5200940 3rd Qu.:233.0 3rd Qu.:233.0 3rd Qu.:2.000
## Max. :6940096 Max. :265.0 Max. :265.0 Max. :4.000
##
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 6/11/19 7:56 : 7 6/24/19 18:36: 6 Min. :0.000
## 6/14/19 13:28: 6 6/27/19 21:45: 6 1st Qu.:1.000
## 6/3/19 15:08 : 6 6/29/19 0:26 : 6 Median :1.000
## 6/11/19 13:53: 5 6/1/19 22:46 : 5 Mean :1.565
## 6/14/19 23:16: 5 6/10/19 11:17: 5 3rd Qu.:2.000
## 6/20/19 9:26 : 5 6/11/19 19:25: 5 Max. :6.000
## (Other) :19966 (Other) :19967
## trip_distance RatecodeID store_and_fwd_flag payment_type
## Min. : 0.000 Min. :1.000 N:19893 Min. :1.000
## 1st Qu.: 0.990 1st Qu.:1.000 Y: 107 1st Qu.:1.000
## Median : 1.645 Median :1.000 Median :1.000
## Mean : 3.037 Mean :1.054 Mean :1.291
## 3rd Qu.: 3.100 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :51.200 Max. :5.000 Max. :4.000
##
## fare_amount extra mta_tax tip_amount
## Min. :-160.00 Min. :-1.000 Min. :-0.5000 Min. : 0.000
## 1st Qu.: 6.50 1st Qu.: 0.000 1st Qu.: 0.5000 1st Qu.: 0.000
## Median : 9.50 Median : 0.500 Median : 0.5000 Median : 1.960
## Mean : 13.47 Mean : 1.163 Mean : 0.4949 Mean : 2.277
## 3rd Qu.: 15.00 3rd Qu.: 2.500 3rd Qu.: 0.5000 3rd Qu.: 3.000
## Max. : 399.20 Max. : 7.000 Max. : 0.5000 Max. :175.000
##
## tolls_amount improvement_surcharge total_amount
## Min. :-6.1200 Min. :-0.3000 Min. :-160.80
## 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 11.30
## Median : 0.0000 Median : 0.3000 Median : 14.80
## Mean : 0.4059 Mean : 0.2985 Mean : 19.56
## 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 21.20
## Max. :43.4300 Max. : 0.3000 Max. : 400.00
##
## congestion_surcharge
## Min. :-2.500
## 1st Qu.: 2.500
## Median : 2.500
## Mean : 2.273
## 3rd Qu.: 2.500
## Max. : 2.750
##
The tip amount ranges from 0 to 175 dollars.
Trip distance ranges from 0 to 51.2 miles, with a mean of about 3.04 miles.
The minimum fare amount is -160. A negative fare is not a valid charge, so such records are treated as outliers.
The maximum VendorID is 4. However, according to the data dictionary provided, the data is collected from only two vendors: Creative Mobile Technologies, LLC (ID 1) and VeriFone Inc (ID 2).
ggplot(data = unprocessed_data, aes(x = "", y = tip_amount)) +
  geom_boxplot(color = "#00AFBB") + stat_summary(fun.y = mean, geom = "point", shape = 23, size = 4) +
  labs(x = " ", y = "Tip amount (dollars)") + ggtitle("Boxplot of NYC Taxi Tip Amount")
In the graph above, many observations are flagged as outliers. We need to treat the data for outliers before analysis.
Looking at the distribution of the raw tip amount, it is clear that it is not normally distributed and that there are some outliers.
A normal distribution is an assumption of many statistical analyses. Generally, raw tip amounts vary because fare amounts vary. One factor that may vary less is the tipping percentage: in the US, customers often tip a customary percentage (for example, 15% at restaurants). We divided the tip amount by the fare amount to obtain a tip-to-fare ratio (tip_fare_ratio), which we refer to as the tipping percentage:
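As a small worked example of this ratio, the sketch below recomputes it for the first five rows shown in the output. The data frame name `taxi` is hypothetical; the column names match the dataset.

```r
# Fare and tip amounts for the first five rows shown in the output.
taxi <- data.frame(
  fare_amount = c(10.0, 6.5, 13.5, 11.0, 5.0),
  tip_amount  = c(0.00, 2.06, 2.60, 0.00, 1.76)
)

# Tip as a fraction of the fare (tip divided by fare, not the reverse).
taxi$tip_fare_ratio <- taxi$tip_amount / taxi$fare_amount

round(taxi$tip_fare_ratio, 4)
# 0.0000 0.3169 0.1926 0.0000 0.3520

# A zero tip on a zero fare gives 0 / 0 = NaN, which summary() counts as NA.
is.nan(0 / 0)
```

Rows with a zero fare and a zero tip are a likely source of the 15 NA values reported in the summary, although the report does not confirm this.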
Here is the structure, summary, and the first few rows of tip percentage:
## num [1:20000] 0 0.317 0.193 0 0.352 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.2267 0.1839 0.2878 11.4400 15
## X DOLocationID PULocationID VendorID tpep_pickup_datetime
## 1 4524218 211 90 2 6/16/19 0:15
## 2 6458048 249 125 2 6/28/19 0:09
## 3 3369795 161 68 2 6/14/19 23:04
## 4 18532 4 87 1 6/24/19 16:02
## 5 1743670 107 234 2 6/28/19 23:38
## tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1 6/16/19 0:28 1 1.60 1
## 2 6/28/19 0:16 1 1.12 1
## 3 6/14/19 23:22 2 2.72 1
## 4 6/24/19 16:12 1 2.90 1
## 5 6/28/19 23:43 1 0.62 1
## store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1 N 2 10.0 0.5 0.5 0.00
## 2 N 1 6.5 0.5 0.5 2.06
## 3 N 1 13.5 0.5 0.5 2.60
## 4 N 2 11.0 3.5 0.5 0.00
## 5 N 1 5.0 0.5 0.5 1.76
## tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1 0 0.3 13.80 2.5
## 2 0 0.3 12.36 2.5
## 3 0 0.3 19.90 2.5
## 4 0 0.3 15.30 2.5
## 5 0 0.3 10.56 2.5
## tip_fare_ratio
## 1 0.0000000
## 2 0.3169231
## 3 0.1925926
## 4 0.0000000
## 5 0.3520000
The dataset provides location IDs that correspond to taxi zones in each of the five boroughs. In integer form these nominal variables provide little value, since the IDs alone do not tell us the geographical location. We downloaded a taxi zone lookup dataset that provides the borough and the specific zone (neighborhood) for each location ID, and merged it with the taxi dataset to identify the borough for both pickup and drop-off.
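The merge described above can be sketched with base R's `merge()`. The five lookup rows are taken from the zone output shown earlier; the three trips and the `trip_id` column are invented for illustration, and the exact merge call used in the report is not shown, so this is an assumption about the approach.

```r
# First five rows of the zone lookup (LocationID and Borough only).
zones <- data.frame(
  LocationID = 1:5,
  Borough    = c("EWR", "Queens", "Bronx", "Manhattan", "Staten Island")
)

# Hypothetical trips that use only the location IDs above.
trips <- data.frame(
  trip_id      = 1:3,
  PULocationID = c(4L, 2L, 4L),
  DOLocationID = c(1L, 4L, 3L)
)

# Rename the lookup columns so each merge adds a distinctly named borough column.
pu_lookup <- setNames(zones, c("PULocationID", "Borough_pu"))
do_lookup <- setNames(zones, c("DOLocationID", "Borough_do"))

# all.x = TRUE keeps trips whose location ID has no match in the lookup.
trips <- merge(trips, pu_lookup, by = "PULocationID", all.x = TRUE)
trips <- merge(trips, do_lookup, by = "DOLocationID", all.x = TRUE)

trips
```

`merge()` sorts the result by the key column, which would be consistent with the merged output below being ordered by DOLocationID.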
The dataset stores the pickup datetime as a factor. For our analysis we created a new factor column, pickup_period, which takes the value “Morning”, “Afternoon”, “Evening”, or “Night” based on the pickup hour.
Likewise, the dropoff datetime is stored as a factor. We created a new factor column, drop_period, which takes the value “Morning”, “Afternoon”, “Evening”, or “Night” based on the dropoff hour.
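One way to derive such a period from the hour is sketched below. The helper name `period_of_day` and the exact cut-offs are assumptions: they agree with the rows shown in the output (for example 11 is Morning, 12 and 16 are Afternoon, 17 and 20 are Evening, 3 and 4 are Night), but the boundaries at 5 and 21 are guesses.

```r
# Hypothetical helper: map an hour (0-23) to an ordered period-of-day factor.
# Assumed cut-offs: Morning 5-11, Afternoon 12-16, Evening 17-20, Night otherwise.
period_of_day <- function(hrs) {
  label <- ifelse(hrs >= 5 & hrs < 12, "Morning",
           ifelse(hrs >= 12 & hrs < 17, "Afternoon",
           ifelse(hrs >= 17 & hrs < 21, "Evening", "Night")))
  factor(label,
         levels = c("Morning", "Afternoon", "Evening", "Night"),
         ordered = TRUE)
}

pickup_hrs <- c(11, 16, 17, 20, 4, 7)
period_of_day(pickup_hrs)
```

The same helper can be applied to the dropoff hour to build drop_period.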
We thought there might also be a correlation between the duration of the trip and the tip amount. Since the dataset does not include trip duration, we calculated it as the difference between the dropoff and pickup times, in minutes.
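The duration calculation can be sketched as follows for three of the trips shown in the output. The format string and the UTC time zone are assumptions chosen to keep the example deterministic; the report's own parsing code is not shown.

```r
# Raw timestamps for three trips, as they appear in the dataset.
pickup_raw  <- c("6/7/19 11:06", "6/18/19 16:03", "6/29/19 7:40")
dropoff_raw <- c("6/7/19 12:21", "6/18/19 17:08", "6/29/19 8:01")

# "%y" reads a two-digit year (19 becomes 2019); "%Y" would read it as year 19.
fmt <- "%m/%d/%y %H:%M"

# If the columns are factors, wrap them in as.character() before parsing.
pickup_datetime  <- as.POSIXct(pickup_raw,  format = fmt, tz = "UTC")
dropoff_datetime <- as.POSIXct(dropoff_raw, format = fmt, tz = "UTC")

# Trip duration in minutes, plus the pickup hour used for the period columns.
trip_duration <- as.numeric(difftime(dropoff_datetime, pickup_datetime, units = "mins"))
pickup_hrs    <- as.numeric(format(pickup_datetime, "%H"))

trip_duration
# 75 65 21
```

The parsed datetimes in the output below show the year as 0019, which suggests the two-digit year was read with a four-digit-year format. Durations and hours are unaffected by this, but calendar-derived values such as the day of the week may be.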
Here is the structure, summary, and the first few rows of the dataset with new columns:
## 'data.frame': 20000 obs. of 33 variables:
## $ DOLocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 233 231 186 234 231 161 246 68 50 132 ...
## $ X : int 14365 1376 8878 13985 9970 7812 3427 13195 9857 7422 ...
## $ VendorID : int 2 1 1 1 2 1 2 2 2 1 ...
## $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 13958 4851 10536 8018 11326 1673 8964 12138 12514 12942 ...
## $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 13903 4833 10443 7946 11238 1655 8895 12052 12466 12912 ...
## $ passenger_count : int 1 1 1 1 1 1 4 1 1 1 ...
## $ trip_distance : num 23.9 15.5 16.7 14.2 13.5 ...
## $ RatecodeID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 96 73.5 71 64 54.5 68 61 61.5 70.5 117 ...
## $ extra : num 0 1 0 0 0 0 1 0.5 1 0 ...
## $ mta_tax : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_amount : num 29.2 18.4 12 19.1 13.1 ...
## $ tolls_amount : num 20.5 17.5 10.5 12.5 10.5 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 146 110.8 93.8 96 78.4 ...
## $ congestion_surcharge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_fare_ratio : num 0.304 0.251 0.169 0.299 0.24 ...
## $ Borough_pu : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 5 ...
## $ Borough_do : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Zone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
## $ service_zone : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ pickup_datetime : POSIXct, format: "0019-06-07 11:06:00" "0019-06-18 16:03:00" ...
## $ pickup_time : chr "11:06" "16:03" "14:42" "16:22" ...
## $ pickup_hrs : num 11 16 14 16 7 12 16 20 17 14 ...
## $ pickup_period : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 2 2 1 2 2 3 3 2 ...
## $ dropoff_datetime : POSIXct, format: "0019-06-07 12:21:00" "0019-06-18 17:08:00" ...
## $ dropoff_time : chr "12:21" "17:08" "15:38" "17:12" ...
## $ dropoff_hrs : num 12 17 15 17 8 12 17 20 18 15 ...
## $ drop_period : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 2 3 2 3 1 2 3 3 3 2 ...
## $ trip_duration : num 75 65 56 50 21 35 41 27 41 99 ...
## DOLocationID PULocationID X VendorID
## Min. : 1.0 Min. : 3.0 Min. : 1016 Min. :1.000
## 1st Qu.:107.0 1st Qu.:114.0 1st Qu.:1725764 1st Qu.:1.000
## Median :162.0 Median :161.0 Median :3459818 Median :2.000
## Mean :160.4 Mean :161.9 Mean :3461833 Mean :1.642
## 3rd Qu.:233.0 3rd Qu.:233.0 3rd Qu.:5200940 3rd Qu.:2.000
## Max. :265.0 Max. :265.0 Max. :6940096 Max. :4.000
##
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 6/11/19 7:56 : 7 6/24/19 18:36: 6 Min. :0.000
## 6/14/19 13:28: 6 6/27/19 21:45: 6 1st Qu.:1.000
## 6/3/19 15:08 : 6 6/29/19 0:26 : 6 Median :1.000
## 6/11/19 13:53: 5 6/1/19 22:46 : 5 Mean :1.565
## 6/14/19 23:16: 5 6/10/19 11:17: 5 3rd Qu.:2.000
## 6/20/19 9:26 : 5 6/11/19 19:25: 5 Max. :6.000
## (Other) :19966 (Other) :19967
## trip_distance RatecodeID store_and_fwd_flag payment_type
## Min. : 0.000 Min. :1.000 N:19893 Min. :1.000
## 1st Qu.: 0.990 1st Qu.:1.000 Y: 107 1st Qu.:1.000
## Median : 1.645 Median :1.000 Median :1.000
## Mean : 3.037 Mean :1.054 Mean :1.291
## 3rd Qu.: 3.100 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :51.200 Max. :5.000 Max. :4.000
##
## fare_amount extra mta_tax tip_amount
## Min. :-160.00 Min. :-1.000 Min. :-0.5000 Min. : 0.000
## 1st Qu.: 6.50 1st Qu.: 0.000 1st Qu.: 0.5000 1st Qu.: 0.000
## Median : 9.50 Median : 0.500 Median : 0.5000 Median : 1.960
## Mean : 13.47 Mean : 1.163 Mean : 0.4949 Mean : 2.277
## 3rd Qu.: 15.00 3rd Qu.: 2.500 3rd Qu.: 0.5000 3rd Qu.: 3.000
## Max. : 399.20 Max. : 7.000 Max. : 0.5000 Max. :175.000
##
## tolls_amount improvement_surcharge total_amount
## Min. :-6.1200 Min. :-0.3000 Min. :-160.80
## 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 11.30
## Median : 0.0000 Median : 0.3000 Median : 14.80
## Mean : 0.4059 Mean : 0.2985 Mean : 19.56
## 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 21.20
## Max. :43.4300 Max. : 0.3000 Max. : 400.00
##
## congestion_surcharge tip_fare_ratio Borough_pu
## Min. :-2.500 Min. : 0.0000 Bronx : 35
## 1st Qu.: 2.500 1st Qu.: 0.0000 Brooklyn : 242
## Median : 2.500 Median : 0.2267 EWR : 0
## Mean : 2.273 Mean : 0.1839 Manhattan :18088
## 3rd Qu.: 2.500 3rd Qu.: 0.2878 Queens : 1466
## Max. : 2.750 Max. :11.4400 Staten Island: 1
## NA's :15 Unknown : 168
## Borough_do Zone
## Bronx : 159 Midtown Center : 791
## Brooklyn : 810 Upper East Side North : 765
## EWR : 40 Upper East Side South : 758
## Manhattan :17661 Murray Hill : 619
## Queens : 1073 Times Sq/Theatre District: 614
## Staten Island: 3 (Other) :16380
## Unknown : 254 NA's : 73
## service_zone pickup_datetime pickup_time
## Airports : 459 Min. :0019-05-31 23:58:00 Length:20000
## Boro Zone : 2639 1st Qu.:0019-06-08 10:40:15 Class :character
## EWR : 40 Median :0019-06-15 17:14:00 Mode :character
## N/A : 254 Mean :0019-06-15 22:05:19
## Yellow Zone:16608 3rd Qu.:0019-06-23 04:01:15
## Max. :0019-06-30 23:57:00
##
## pickup_hrs pickup_period dropoff_datetime
## Min. : 0.00 Morning :5092 Min. :0019-06-01 00:04:00
## 1st Qu.: 9.00 Afternoon:5336 1st Qu.:0019-06-08 10:43:45
## Median :14.00 Evening :5695 Median :0019-06-15 17:27:00
## Mean :13.75 Night :3877 Mean :0019-06-15 22:23:23
## 3rd Qu.:19.00 3rd Qu.:0019-06-23 04:19:30
## Max. :23.00 Max. :0019-07-01 22:24:00
##
## dropoff_time dropoff_hrs drop_period trip_duration
## Length:20000 Min. : 0.00 Morning :4870 Min. : 0.00
## Class :character 1st Qu.: 9.00 Afternoon:5274 1st Qu.: 7.00
## Mode :character Median :14.00 Evening :5762 Median : 11.00
## Mean :13.74 Night :4094 Mean : 18.06
## 3rd Qu.:19.00 3rd Qu.: 19.00
## Max. :23.00 Max. :1439.00
##
## DOLocationID PULocationID X VendorID tpep_pickup_datetime
## 1 1 233 14365 2 6/7/19 11:06
## 2 1 231 1376 1 6/18/19 16:03
## 3 1 186 8878 1 6/28/19 14:42
## 4 1 234 13985 1 6/23/19 16:22
## 5 1 231 9970 2 6/29/19 7:40
## tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1 6/7/19 12:21 1 23.93 3
## 2 6/18/19 17:08 1 15.50 3
## 3 6/28/19 15:38 1 16.70 3
## 4 6/23/19 17:12 1 14.20 3
## 5 6/29/19 8:01 1 13.46 3
## store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1 N 1 96.0 0 0 29.20
## 2 N 1 73.5 1 0 18.45
## 3 N 1 71.0 0 0 12.00
## 4 N 1 64.0 0 0 19.15
## 5 N 1 54.5 0 0 13.06
## tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1 20.5 0.3 146.00 0
## 2 17.5 0.3 110.75 0
## 3 10.5 0.3 93.80 0
## 4 12.5 0.3 95.95 0
## 5 10.5 0.3 78.36 0
## tip_fare_ratio Borough_pu Borough_do Zone service_zone
## 1 0.3041667 Manhattan EWR Newark Airport EWR
## 2 0.2510204 Manhattan EWR Newark Airport EWR
## 3 0.1690141 Manhattan EWR Newark Airport EWR
## 4 0.2992187 Manhattan EWR Newark Airport EWR
## 5 0.2396330 Manhattan EWR Newark Airport EWR
## pickup_datetime pickup_time pickup_hrs pickup_period
## 1 0019-06-07 11:06:00 11:06 11 Morning
## 2 0019-06-18 16:03:00 16:03 16 Afternoon
## 3 0019-06-28 14:42:00 14:42 14 Afternoon
## 4 0019-06-23 16:22:00 16:22 16 Afternoon
## 5 0019-06-29 07:40:00 07:40 7 Morning
## dropoff_datetime dropoff_time dropoff_hrs drop_period trip_duration
## 1 0019-06-07 12:21:00 12:21 12 Afternoon 75
## 2 0019-06-18 17:08:00 17:08 17 Evening 65
## 3 0019-06-28 15:38:00 15:38 15 Afternoon 56
## 4 0019-06-23 17:12:00 17:12 17 Evening 50
## 5 0019-06-29 08:01:00 08:01 8 Morning 21
The following filters are applied to the dataset:
1. VendorID: the data contains vendor ID 4, but according to the data dictionary only IDs 1 and 2 should be present.
2. payment_type: only credit card payments have a corresponding tip amount recorded, so we analyze only credit card payments.
3. fare_amount: some fare amounts are negative; these entries are outliers and are removed.
4. passenger_count: by law, a maximum of 7 passengers is allowed in a taxi.
5. trip_distance: some observations have a trip distance of 0, possibly cancelled trips. These observations are excluded from the analysis.
6. trip_duration: values greater than 37 minutes or less than or equal to 0 fall outside the boxplot whiskers and are removed.
Using the boxplot, we removed the outliers from the dataset.
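The filters listed above can be expressed as a single logical condition. This is a sketch on six invented rows, not the report's code: the threshold `fare_amount > 0` and the coding of credit card as `payment_type == 1` are assumptions, although the processed summary is consistent with both.

```r
# Six illustrative rows; each of the last four violates exactly one filter.
taxi <- data.frame(
  VendorID        = c(2, 1, 4, 2, 1, 2),
  payment_type    = c(1, 1, 1, 2, 1, 1),
  fare_amount     = c(10, 6.5, 12, 11, -160, 9),
  passenger_count = c(1, 1, 2, 1, 1, 6),
  trip_distance   = c(1.6, 1.12, 2.72, 2.9, 0.62, 0),
  trip_duration   = c(13, 7, 18, 10, 5, 12)
)

keep <- with(taxi,
  VendorID %in% c(1, 2) &                   # 1. only the two documented vendors
  payment_type == 1 &                       # 2. credit card payments (assumed code 1)
  fare_amount > 0 &                         # 3. drop negative fares (assumed threshold)
  passenger_count <= 7 &                    # 4. at most 7 passengers
  trip_distance > 0 &                       # 5. drop zero-distance trips
  trip_duration > 0 & trip_duration <= 37   # 6. keep durations inside the whiskers
)

processed <- taxi[keep, ]
nrow(processed)
# 2
```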
Below is the boxplot after outlier treatment:
Observation: All points previously flagged as outliers have been removed after treatment.
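For reference, the points a boxplot flags as outliers are those beyond 1.5 times the hinge spread from the box, and base R exposes them through `boxplot.stats()`. The sketch below applies that rule to a few illustrative duration values; whether the report used this function or read the limits off the plot is not stated.

```r
# Illustrative trip durations in minutes, including two extreme values.
durations <- c(7, 11, 19, 5, 13, 21, 9, 75, 1439, 0)

# boxplot.stats() returns the whisker statistics and the flagged points.
stats <- boxplot.stats(durations)
stats$out  # values lying beyond the whiskers

# Keep only the observations that were not flagged.
durations_clean <- durations[!durations %in% stats$out]
durations_clean
```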
Here is the structure, summary, and the first few rows of the processed dataset:
## 'data.frame': 11881 obs. of 33 variables:
## $ DOLocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 231 161 68 143 68 125 164 87 230 100 ...
## $ X : int 9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
## $ VendorID : int 2 1 2 2 1 1 2 2 2 2 ...
## $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
## $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
## $ passenger_count : int 1 1 1 1 1 1 1 1 5 2 ...
## $ trip_distance : num 13.5 17.3 16.4 17.8 15.3 ...
## $ RatecodeID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
## $ extra : num 0 0 0.5 0 0 0 0 0.5 0.5 0 ...
## $ mta_tax : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_amount : num 13.1 15.8 10 16.7 8 ...
## $ tolls_amount : num 10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 78.4 94.5 84.8 100 78.8 ...
## $ congestion_surcharge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_fare_ratio : num 0.24 0.232 0.163 0.254 0.133 ...
## $ Borough_pu : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Borough_do : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Zone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
## $ service_zone : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ pickup_datetime : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
## $ pickup_time : chr "07:40" "12:12" "20:09" "07:11" ...
## $ pickup_hrs : num 7 12 20 7 6 16 12 4 3 6 ...
## $ pickup_period : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ dropoff_datetime : POSIXct, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
## $ dropoff_time : chr "08:01" "12:47" "20:36" "07:45" ...
## $ dropoff_hrs : num 8 12 20 7 7 16 12 4 3 7 ...
## $ drop_period : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ trip_duration : num 21 35 27 34 26 34 31 31 24 31 ...
## DOLocationID PULocationID X VendorID
## Min. : 1.0 Min. : 4 Min. : 1082 Min. :1.000
## 1st Qu.:113.0 1st Qu.:125 1st Qu.:1784279 1st Qu.:1.000
## Median :162.0 Median :162 Median :3523917 Median :2.000
## Mean :162.4 Mean :165 Mean :3520009 Mean :1.629
## 3rd Qu.:234.0 3rd Qu.:234 3rd Qu.:5288564 3rd Qu.:2.000
## Max. :265.0 Max. :265 Max. :6939845 Max. :2.000
##
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 6/11/19 7:56 : 6 6/24/19 18:36: 6 Min. :0.000
## 6/14/19 13:28: 5 6/6/19 9:40 : 5 1st Qu.:1.000
## 6/14/19 23:16: 5 6/12/19 10:48: 4 Median :1.000
## 6/6/19 9:33 : 5 6/12/19 23:05: 4 Mean :1.559
## 6/10/19 10:08: 4 6/14/19 12:58: 4 3rd Qu.:2.000
## 6/11/19 19:34: 4 6/14/19 9:26 : 4 Max. :6.000
## (Other) :11852 (Other) :11854
## trip_distance RatecodeID store_and_fwd_flag payment_type
## Min. : 0.030 Min. :1.000 N:11828 Min. :1
## 1st Qu.: 1.010 1st Qu.:1.000 Y: 53 1st Qu.:1
## Median : 1.600 Median :1.000 Median :1
## Mean : 2.465 Mean :1.018 Mean :1
## 3rd Qu.: 2.720 3rd Qu.:1.000 3rd Qu.:1
## Max. :25.100 Max. :5.000 Max. :1
##
## fare_amount extra mta_tax tip_amount
## Min. : 3.00 Min. :0.000 Min. :0.0000 Min. : 0.440
## 1st Qu.: 6.50 1st Qu.:0.000 1st Qu.:0.5000 1st Qu.: 1.860
## Median : 9.50 Median :0.500 Median :0.5000 Median : 2.450
## Mean : 11.61 Mean :1.218 Mean :0.4989 Mean : 2.924
## 3rd Qu.: 13.50 3rd Qu.:2.500 3rd Qu.:0.5000 3rd Qu.: 3.360
## Max. :238.00 Max. :4.500 Max. :0.5000 Max. :47.650
##
## tolls_amount improvement_surcharge total_amount
## Min. : 0.0000 Min. :0.3 Min. : 4.50
## 1st Qu.: 0.0000 1st Qu.:0.3 1st Qu.: 12.30
## Median : 0.0000 Median :0.3 Median : 15.35
## Mean : 0.2448 Mean :0.3 Mean : 18.30
## 3rd Qu.: 0.0000 3rd Qu.:0.3 3rd Qu.: 20.75
## Max. :23.0000 Max. :0.3 Max. :285.95
##
## congestion_surcharge tip_fare_ratio Borough_pu
## Min. :0.00 Min. :0.1091 Bronx : 1
## 1st Qu.:2.50 1st Qu.:0.2287 Brooklyn : 105
## Median :2.50 Median :0.2667 EWR : 0
## Mean :2.39 Mean :0.2649 Manhattan :11247
## 3rd Qu.:2.50 3rd Qu.:0.3086 Queens : 437
## Max. :2.75 Max. :0.4283 Staten Island: 0
## Unknown : 91
## Borough_do Zone service_zone
## Bronx : 32 Upper East Side North: 495 Airports : 175
## Brooklyn : 383 Upper East Side South: 472 Boro Zone : 1155
## EWR : 12 Midtown Center : 451 EWR : 12
## Manhattan :10959 Murray Hill : 400 N/A : 107
## Queens : 387 Midtown East : 376 Yellow Zone:10432
## Staten Island: 1 (Other) :9664
## Unknown : 107 NA's : 23
## pickup_datetime pickup_time pickup_hrs
## Min. :0019-05-31 23:58:00 Length:11881 Min. : 0.00
## 1st Qu.:0019-06-08 09:50:00 Class :character 1st Qu.: 9.00
## Median :0019-06-15 14:09:00 Mode :character Median :14.00
## Mean :0019-06-15 20:32:38 Mean :13.82
## 3rd Qu.:0019-06-23 03:21:00 3rd Qu.:19.00
## Max. :0019-06-30 23:57:00 Max. :23.00
##
## pickup_period dropoff_datetime dropoff_time
## Morning :3106 Min. :0019-06-01 00:04:00 Length:11881
## Afternoon:2873 1st Qu.:0019-06-08 09:59:00 Class :character
## Evening :3507 Median :0019-06-15 14:17:00 Mode :character
## Night :2395 Mean :0019-06-15 20:45:40
## 3rd Qu.:0019-06-23 03:48:00
## Max. :0019-07-01 00:02:00
##
## dropoff_hrs drop_period trip_duration
## Min. : 0.0 Morning :3005 Min. : 1.00
## 1st Qu.: 9.0 Afternoon:2871 1st Qu.: 7.00
## Median :14.0 Evening :3481 Median :11.00
## Mean :13.8 Night :2524 Mean :13.05
## 3rd Qu.:19.0 3rd Qu.:18.00
## Max. :23.0 Max. :37.00
##
## DOLocationID PULocationID X VendorID tpep_pickup_datetime
## 1 1 231 9970 2 6/29/19 7:40
## 2 1 161 7812 1 6/12/19 12:12
## 3 1 68 13195 2 6/30/19 20:09
## 5 1 143 12280 2 6/25/19 7:11
## 6 1 68 2391 1 6/22/19 6:38
## tpep_dropoff_datetime passenger_count trip_distance RatecodeID
## 1 6/29/19 8:01 1 13.46 3
## 2 6/12/19 12:47 1 17.30 3
## 3 6/30/19 20:36 1 16.41 3
## 5 6/25/19 7:45 1 17.84 3
## 6 6/22/19 7:04 1 15.30 3
## store_and_fwd_flag payment_type fare_amount extra mta_tax tip_amount
## 1 N 1 54.5 0.0 0 13.06
## 2 N 1 68.0 0.0 0 15.75
## 3 N 1 61.5 0.5 0 10.00
## 5 N 1 65.5 0.0 0 16.66
## 6 N 1 60.0 0.0 0 8.00
## tolls_amount improvement_surcharge total_amount congestion_surcharge
## 1 10.5 0.3 78.36 0
## 2 10.5 0.3 94.55 0
## 3 12.5 0.3 84.80 0
## 5 17.5 0.3 99.96 0
## 6 10.5 0.3 78.80 0
## tip_fare_ratio Borough_pu Borough_do Zone service_zone
## 1 0.2396330 Manhattan EWR Newark Airport EWR
## 2 0.2316176 Manhattan EWR Newark Airport EWR
## 3 0.1626016 Manhattan EWR Newark Airport EWR
## 5 0.2543511 Manhattan EWR Newark Airport EWR
## 6 0.1333333 Manhattan EWR Newark Airport EWR
## pickup_datetime pickup_time pickup_hrs pickup_period
## 1 0019-06-29 07:40:00 07:40 7 Morning
## 2 0019-06-12 12:12:00 12:12 12 Afternoon
## 3 0019-06-30 20:09:00 20:09 20 Evening
## 5 0019-06-25 07:11:00 07:11 7 Morning
## 6 0019-06-22 06:38:00 06:38 6 Morning
## dropoff_datetime dropoff_time dropoff_hrs drop_period trip_duration
## 1 0019-06-29 08:01:00 08:01 8 Morning 21
## 2 0019-06-12 12:47:00 12:47 12 Afternoon 35
## 3 0019-06-30 20:36:00 20:36 20 Evening 27
## 5 0019-06-25 07:45:00 07:45 7 Morning 34
## 6 0019-06-22 07:04:00 07:04 7 Morning 26
Here’s a histogram of the NYC taxi tip data after the removal of outliers:
#ggplot histogram of tip_fare_ratio for processed df
processed_df %>%
ggplot(aes(tip_fare_ratio)) +
geom_histogram(aes(y =..density..), colour = "black", fill = "#66B2FF", binwidth = 0.01) +
stat_function(fun = dnorm, args = list(mean = mean(processed_df$tip_fare_ratio), sd = sd(processed_df$tip_fare_ratio))) + ggtitle("Distribution of NYC Taxi Tip Data Post-Outlier Removal")
Observation: The data is approximately normally distributed. There are fewer points on the left side of the mean.
In order to understand how different features are distributed in the data we plotted the below graphs.
# plotting count of drives with passenger count
processed_df %>%
group_by(passenger_count) %>%
count() %>%
ggplot(aes(passenger_count, n, fill = passenger_count)) +
geom_col() +
scale_y_sqrt() +
theme(legend.position = "none")+
xlab("Number of passengers") +
ylab("Total number of trips") +
ggtitle('Distribution according to the number of passengers in a taxi')
Observation:
1. There are few trips with zero passengers, and the majority of rides have 1 passenger.
2. The number of trips decreases as the number of passengers increases up to 4. This may be due to the size of the car.
3. The count then rises at 5 passengers, which may be due to larger vehicles.
# plotting count of drives for vendor
processed_df %>%
group_by(VendorID) %>%
count() %>%
ggplot(aes(VendorID, n, fill = VendorID)) +
geom_col() +
theme(legend.position = "none") +
xlab("Vendor ID") +
ylab("Total number of trips") +
ggtitle('Distribution according to vendors of the taxi service')
Vendor IDs 1 and 2 belong to Creative Mobile Technologies, LLC and VeriFone Inc, respectively.
Observation: Vendor 2 has more trips than vendor 1.
# plotting count of pickups per vendor for each day of the week (wday() is from lubridate)
processed_df %>%
  mutate(wday = wday(pickup_datetime)) %>%
  group_by(wday, VendorID) %>%
  count() %>%
  ggplot(aes(wday, n, colour = VendorID)) +
  geom_point(size = 3) +
  labs(x = "Day of the week", y = "Total number of pickups")
Observation: Vendor 2 has more trips than vendor 1 on every day of the week.
# plotting count of rides by rate code
processed_df %>%
group_by(RatecodeID) %>%
count() %>%
ggplot(aes(RatecodeID, n, fill = RatecodeID)) +
geom_col() +
theme(legend.position = "none") +
xlab("Rate code") +
ylab("Count of rides") +
ggtitle('Distribution according to rate code for the service')
The rate codes are as follows: 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride.
Observation: Most of the trips in the sample are booked at the standard rate. Because the distribution of rate codes is highly skewed, this variable will not be used in our further analysis for hypothesis testing.
Location may also affect tipping. First, we count the number of trips in each borough, grouped by pickup location and then by drop-off location.
# Create barplot for trip counts based on pickup location
ggplot(data = processed_df, aes(x = Borough_pu, fill = Borough_pu)) + geom_bar() + ggtitle("Trip Counts Based on Pickup Location") + xlab("Location") + ylab("Trip Frequency")
# Create barplot for trip counts based on dropoff location
ggplot(data = processed_df, aes(x = Borough_do, fill = Borough_do)) + geom_bar() + ggtitle("Trip Counts Based on Dropoff Location") + xlab("Location") + ylab("Trip Frequency")
Observation: These bar charts show that Manhattan has the highest number of both pickups and drop-offs, followed by Queens and Brooklyn in second and third, respectively.
We also looked at the frequency of the various pickup and drop-off combinations.
# Subset data to get locations only
pu_do <- subset(processed_df, select = c("Borough_pu", "Borough_do"))
# Create table of location combinations
ddply(pu_do, .(Borough_pu, Borough_do), nrow)
## Borough_pu Borough_do V1
## 1 Bronx Manhattan 1
## 2 Brooklyn Brooklyn 58
## 3 Brooklyn Manhattan 45
## 4 Brooklyn Queens 2
## 5 Manhattan Bronx 28
## 6 Manhattan Brooklyn 268
## 7 Manhattan EWR 12
## 8 Manhattan Manhattan 10640
## 9 Manhattan Queens 280
## 10 Manhattan Staten Island 1
## 11 Manhattan Unknown 18
## 12 Queens Bronx 3
## 13 Queens Brooklyn 56
## 14 Queens Manhattan 262
## 15 Queens Queens 104
## 16 Queens Unknown 12
## 17 Unknown Bronx 1
## 18 Unknown Brooklyn 1
## 19 Unknown Manhattan 11
## 20 Unknown Queens 1
## 21 Unknown Unknown 77
Observation: This table shows that Manhattan to Manhattan has the highest number of trips (10,640), followed by Manhattan to Queens (280), Manhattan to Brooklyn (268), and Queens to Manhattan (262).
Given that Manhattan and Queens have the highest numbers of pickups and drop-offs, it makes sense that the numbers of trips within and between these boroughs are also among the highest. It is also reasonable that Manhattan ranks highest in both measurements, because yellow taxis (the focus of this study) mainly serve Manhattan, whereas green taxis usually serve the other boroughs, which have traditionally been underserved by taxis.
With these descriptive statistics in mind, we decided to compare the mean tipping percentage for each borough based on drop off and pick up location to see if they were statistically different.
We used the ANOVA test. Here are the results:
## Call:
## aov(formula = tip_fare_ratio ~ Borough_pu, data = processed_df)
##
## Terms:
## Borough_pu Residuals
## Sum of Squares 0.48833 54.44540
## Deg. of Freedom 4 11876
##
## Residual standard error: 0.06770886
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_pu 4 0.49 0.12208 26.63 <2e-16 ***
## Residuals 11876 54.45 0.00458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## aov(formula = tip_fare_ratio ~ Borough_do, data = processed_df)
##
## Terms:
## Borough_do Residuals
## Sum of Squares 0.66484 54.26889
## Deg. of Freedom 6 11874
##
## Residual standard error: 0.06760471
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_do 6 0.66 0.11081 24.24 <2e-16 ***
## Residuals 11874 54.27 0.00457
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
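The ANOVA fits reported above use `aov()` with tip_fare_ratio as the response and the borough as the grouping factor. The sketch below shows the same call pattern on synthetic data; the data frame `toy`, its group means, and the seed are invented for illustration and say nothing about the real results.

```r
# Synthetic data: three boroughs with arbitrary mean tip ratios.
set.seed(1)
toy <- data.frame(
  Borough_pu     = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 30)),
  tip_fare_ratio = c(rnorm(30, 0.23, 0.05),
                     rnorm(30, 0.27, 0.05),
                     rnorm(30, 0.24, 0.05))
)

# One-way ANOVA: does the mean tip ratio differ across pickup boroughs?
fit_pu <- aov(tip_fare_ratio ~ Borough_pu, data = toy)

# summary() returns a list; its first element is the ANOVA table.
anova_table <- summary(fit_pu)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

anova_table
p_value
```

Extracting the p-value this way is one possible route to the exact values quoted in the observation that follows the plots.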
# Plot tip ratio by pickup location
ggplot(data = processed_df, aes(x = Borough_pu, y = tip_fare_ratio, fill = Borough_pu)) + ggtitle("Tip Ratio by Pickup Location") + geom_boxplot() + xlab("Location") + ylab("Tip Ratio")
# Plot tip ratio by dropoff location
ggplot(data = processed_df, aes(x = Borough_do, y = tip_fare_ratio, fill = Borough_do)) + ggtitle("Tip Ratio by Dropoff Location") + geom_boxplot() + xlab("Location") + ylab("Tip Ratio")
Observation: The p-values are $5.0569023 \times 10^{-22}$ and $1.0673827 \times 10^{-28}$ for pickup and drop-off, respectively. Both are smaller than a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the borough means are all equal and conclude that at least one borough's mean tip ratio differs at a significance level of 0.05.
Because the ANOVA results are significant, the next step is to conduct Tukey’s HSD test, which compares each pair of boroughs to see whether their means are significantly different. Here are the results from that analysis:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_fare_ratio ~ Borough_pu, data = processed_df)
##
## $Borough_pu
## diff lwr upr p adj
## Brooklyn-Bronx -0.144985128 -0.33058557 0.04061531 0.2068399
## Manhattan-Bronx -0.110932826 -0.29566393 0.07379828 0.4727764
## Queens-Bronx -0.140110246 -0.32504437 0.04482388 0.2345923
## Unknown-Bronx -0.105217640 -0.29095272 0.08051744 0.5327308
## Manhattan-Brooklyn 0.034052303 0.01594124 0.05216336 0.0000029
## Queens-Brooklyn 0.004874882 -0.01520148 0.02495124 0.9643270
## Unknown-Brooklyn 0.039767488 0.01331093 0.06622405 0.0003988
## Queens-Manhattan -0.029177420 -0.03818395 -0.02017089 0.0000000
## Unknown-Manhattan 0.005715185 -0.01372722 0.02515759 0.9300852
## Unknown-Queens 0.034892606 0.01360748 0.05617773 0.0000764
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_fare_ratio ~ Borough_do, data = processed_df)
##
## $Borough_do
## diff lwr upr p adj
## Brooklyn-Bronx 0.002541040 -0.0341428978 0.039224979 0.9999940
## EWR-Bronx -0.002278280 -0.0697601339 0.065203574 0.9999999
## Manhattan-Bronx 0.035597726 0.0003050629 0.070890389 0.0464445
## Queens-Bronx 0.011394320 -0.0252749973 0.048063637 0.9701039
## Staten Island-Bronx 0.047879679 -0.1545658832 0.250325241 0.9927784
## Unknown-Bronx 0.024362072 -0.0158046879 0.064528832 0.5558605
## EWR-Brooklyn -0.004819320 -0.0632626320 0.053623992 0.9999830
## Manhattan-Brooklyn 0.033056686 0.0226936677 0.043419704 0.0000000
## Queens-Brooklyn 0.008853279 -0.0055153974 0.023221956 0.5364941
## Staten Island-Brooklyn 0.045338639 -0.1542760545 0.244953332 0.9941996
## Unknown-Brooklyn 0.021821032 0.0000222088 0.043619855 0.0495693
## Manhattan-EWR 0.037876006 -0.0197042117 0.095456223 0.4538348
## Queens-EWR 0.013672599 -0.0447615360 0.072106735 0.9931849
## Staten Island-EWR 0.050157959 -0.1573368967 0.257652814 0.9918773
## Unknown-EWR 0.026640352 -0.0340496637 0.087330368 0.8548017
## Queens-Manhattan -0.024203406 -0.0345145474 -0.013892265 0.0000000
## Staten Island-Manhattan 0.012281953 -0.1870817511 0.211645657 0.9999970
## Unknown-Manhattan -0.011235654 -0.0306018472 0.008130539 0.6087610
## Staten Island-Queens 0.036485359 -0.1631266474 0.236097366 0.9982604
## Unknown-Queens 0.012967752 -0.0088064563 0.034741961 0.5779243
## Unknown-Staten Island -0.023517607 -0.2238016130 0.176766399 0.9998637
Observation: Although the means are not all the same, only some pairs differ significantly. Excluding Unknown, the pick up pairs with significant differences in tipping percentage (adjusted p-value below 0.05) are Manhattan and Brooklyn, and Queens and Manhattan. For drop off pairs, Manhattan and Bronx, Manhattan and Brooklyn, and Manhattan and Queens are significant, although Manhattan and Bronx is borderline (adjusted p-value of 0.046).
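The ANOVA followed by Tukey’s HSD workflow described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the report: `toy_df` and its group means are invented stand-ins for `processed_df`, and only the column names are taken from the output above.

```r
# Synthetic stand-in for processed_df; the group means here are invented
set.seed(1)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 200)),
  tip_fare_ratio = c(rnorm(200, mean = 0.23, sd = 0.07),
                     rnorm(200, mean = 0.27, sd = 0.07),
                     rnorm(200, mean = 0.24, sd = 0.07))
)

# One-way ANOVA, then all pairwise comparisons with family-wise 95% confidence
fit <- aov(tip_fare_ratio ~ Borough_pu, data = toy_df)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# The result is a matrix with one row per pair: diff, lwr, upr, p adj
pairs_mat <- tukey$Borough_pu
significant_pairs <- rownames(pairs_mat)[pairs_mat[, "p adj"] < 0.05]
print(pairs_mat)
print(significant_pairs)
```

Filtering on the `p adj` column rather than reading the table by eye reduces the risk of misclassifying a pair.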
ANOVA assumes approximately normally distributed residuals, and as previously highlighted, the tipping amount is not necessarily normally distributed. Nonetheless, a look at the tipping amount can provide some context. We compared the mean tipping amount for each borough to see whether those means were statistically different. Here are the results for raw tip amount based on location:
## Call:
## aov(formula = tip_amount ~ Borough_pu, data = processed_df)
##
## Terms:
## Borough_pu Residuals
## Sum of Squares 8919.59 34065.84
## Deg. of Freedom 4 11876
##
## Residual standard error: 1.693653
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_pu 4 8920 2229.9 777.4 <2e-16 ***
## Residuals 11876 34066 2.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## aov(formula = tip_amount ~ Borough_do, data = processed_df)
##
## Terms:
## Borough_do Residuals
## Sum of Squares 8157.25 34828.18
## Deg. of Freedom 6 11874
##
## Residual standard error: 1.712643
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_do 6 8157 1359.5 463.5 <2e-16 ***
## Residuals 11874 34828 2.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot tip amount vs pickup location
ggplot(data = processed_df, aes(x = Borough_pu, y = tip_amount, fill = Borough_pu)) + ggtitle("Tip Amount by Pickup Location") + geom_boxplot() + xlab("Location") + ylab("Tip Amount")
# Plot tip amount vs dropoff location
ggplot(data = processed_df, aes(x = Borough_do, y = tip_amount, fill = Borough_do)) + ggtitle("Tip Amount by Dropoff Location") + geom_boxplot() + xlab("Location") + ylab("Tip Amount")
Observation: For these ANOVA analyses, the null hypothesis is that the mean tip amount is the same for all locations. Both p-values are smaller than a significance level of 0.05 (a 95% confidence level). Thus, we reject the null hypothesis and conclude that the means are statistically different at a significance level of 0.05.
As before, because both results are significant, we can run Tukey’s HSD test:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_amount ~ Borough_pu, data = processed_df)
##
## $Borough_pu
## diff lwr upr p adj
## Brooklyn-Bronx 1.8662857 -2.7762790 6.50885042 0.8083314
## Manhattan-Bronx 1.4282324 -3.1925869 6.04905175 0.9170756
## Queens-Bronx 6.0300915 1.4041939 10.65598918 0.0034681
## Unknown-Bronx 1.7268132 -2.9191194 6.37274573 0.8490430
## Manhattan-Brooklyn -0.4380533 -0.8910790 0.01497244 0.0637298
## Queens-Brooklyn 4.1638058 3.6616206 4.66599107 0.0000000
## Unknown-Brooklyn -0.1394725 -0.8012506 0.52230554 0.9787356
## Queens-Manhattan 4.6018591 4.3765720 4.82714625 0.0000000
## Unknown-Manhattan 0.2985808 -0.1877468 0.78490831 0.4495574
## Unknown-Queens -4.3032783 -4.8356994 -3.77085728 0.0000000
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_amount ~ Borough_do, data = processed_df)
##
## $Borough_do
## diff lwr upr p adj
## Brooklyn-Bronx -1.1642632 -2.0935845 -0.2349419 0.0041593
## EWR-Bronx 8.4019792 6.6924483 10.1115100 0.0000000
## Manhattan-Bronx -3.2824489 -4.1765248 -2.3883731 0.0000000
## Queens-Bronx 0.1450347 -0.7839162 1.0739856 0.9992899
## Staten Island-Bronx 5.3228125 0.1942200 10.4514050 0.0359711
## Unknown-Bronx -1.3390567 -2.3566089 -0.3215044 0.0020199
## EWR-Brooklyn 9.5662424 8.0856867 11.0467981 0.0000000
## Manhattan-Brooklyn -2.1181857 -2.3807140 -1.8556574 0.0000000
## Queens-Brooklyn 1.3092979 0.9452935 1.6733024 0.0000000
## Staten Island-Brooklyn 6.4870757 1.4301982 11.5439532 0.0029651
## Unknown-Brooklyn -0.1747934 -0.7270272 0.3774403 0.9672358
## Manhattan-EWR -11.6844281 -13.1431188 -10.2257373 0.0000000
## Queens-EWR -8.2569444 -9.7372677 -6.7766212 0.0000000
## Staten Island-EWR -3.0791667 -8.3356738 2.1773405 0.5975372
## Unknown-EWR -9.7410358 -11.2785077 -8.2035640 0.0000000
## Queens-Manhattan 3.4274837 3.1662695 3.6886978 0.0000000
## Staten Island-Manhattan 8.6052614 3.5547423 13.6557806 0.0000106
## Unknown-Manhattan 1.9433923 1.4527848 2.4339998 0.0000000
## Staten Island-Queens 5.1777778 0.1209683 10.2345872 0.0406846
## Unknown-Queens -1.4840914 -2.0357016 -0.9324812 0.0000000
## Unknown-Staten Island -6.6618692 -11.7357025 -1.5880358 0.0020916
Observation: For pick up locations, there is no statistically significant difference between the following pairs (excluding Unknown), given their adjusted p-values above 0.05: Brooklyn and Bronx, Manhattan and Bronx, and Manhattan and Brooklyn (borderline, at 0.064). Every pick up pair involving Queens is significant. For drop off locations, the only pairs without a significant difference (excluding Unknown) are Queens and Bronx, and Staten Island and EWR. The Staten Island comparisons with Bronx (0.036) and Queens (0.041) are significant but close to the threshold, and their confidence intervals are wide. All the Manhattan drop off pairs are significant. Based on this analysis, being dropped off in Manhattan is significantly different from being dropped off in any other location. For pick ups, it is Queens rather than Manhattan that differs from every other borough; Manhattan pick ups differ significantly only from Queens.
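Because raw tip amounts are skewed, a rank-based test can serve as a robustness check on the ANOVA above. The sketch below swaps in the Kruskal-Wallis test, which was not part of the original analysis, and runs it on synthetic, deliberately skewed data; `toy_df` and its rate parameters are hypothetical.

```r
# Synthetic, right-skewed tip amounts; the group scales are invented
set.seed(2)
toy_df <- data.frame(
  Borough_do = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 150)),
  tip_amount = c(rexp(150, rate = 1 / 3),
                 rexp(150, rate = 1 / 2.5),
                 rexp(150, rate = 1 / 6))
)

# Kruskal-Wallis: tests for a location shift without assuming normality
kw <- kruskal.test(tip_amount ~ Borough_do, data = toy_df)
print(kw$p.value)

# Pairwise follow-up with Holm-adjusted p-values (a rank-based analogue of Tukey's HSD)
pw <- pairwise.wilcox.test(toy_df$tip_amount, toy_df$Borough_do,
                           p.adjust.method = "holm")
print(pw$p.value)
```

If the rank-based results agree with the ANOVA, the conclusions are less likely to be an artefact of the normality assumption.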
We hypothesize that there is a relationship between the number of passengers in the car and the tip paid to the driver. The passenger count varies from 0 to 6.
# Create factor data
processed_df$passenger_count <- as.factor(processed_df$passenger_count)
# Create boxplot
processed_df %>%
ggplot(aes(passenger_count, tip_fare_ratio, fill = passenger_count)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(y = "Tip ratio", x = "Number of passengers") +
ggtitle("Box plot distribution for ratio of tip amount and passenger count")
Observation: The typical tip ratio does not vary much across the passenger-count groups.
For the ANOVA, our null hypothesis is that the mean tip ratio is the same across all passenger-count groups; the alternative hypothesis is that at least one group mean differs.
# ANOVA test
anova_tip_amount = aov(tip_fare_ratio ~ passenger_count, data = processed_df)
summary(anova_tip_amount)
## Df Sum Sq Mean Sq F value Pr(>F)
## passenger_count 6 0.04 0.006672 1.443 0.194
## Residuals 11874 54.89 0.004623
The p-value of 0.1937649 is greater than 0.05, hence we fail to reject the null hypothesis.
We therefore find no evidence that the number of passengers in the car affects the tip ratio.
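The p-value quoted above can be pulled out of the ANOVA summary programmatically instead of being copied by hand. A minimal sketch, assuming synthetic data in which the groups share one mean; `toy_df` is a hypothetical stand-in for `processed_df`.

```r
# Synthetic data with no true group effect; values are invented
set.seed(3)
toy_df <- data.frame(
  passenger_count = factor(sample(1:4, 400, replace = TRUE)),
  tip_fare_ratio = rnorm(400, mean = 0.26, sd = 0.07)
)

fit <- aov(tip_fare_ratio ~ passenger_count, data = toy_df)

# summary() returns a list; its first element is the ANOVA table
anova_tbl <- summary(fit)[[1]]
p_value <- anova_tbl[["Pr(>F)"]][1]  # first row is the factor, second is Residuals

alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(p_value)
print(decision)
```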
Below is the box plot distribution for vendors 1 and 2. We hypothesise that the vendor brand name affects the % tip amount paid to the driver
# Create factor
processed_df$VendorID <- as.factor(processed_df$VendorID)
# Create boxplot of vendor ID vs tip percentage
processed_df %>%
ggplot(aes(VendorID, tip_fare_ratio, fill = VendorID)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(y = "Tip ratio", x = "Vendor ID") +
ggtitle("Box plot distribution for tip ratio and vendor")
Observation: The distributions for the two vendors look almost the same.
Because there are only two vendors, we use a two-sample t-test. Our null hypothesis is that the mean tip ratio is the same for both vendors; the alternative hypothesis is that the means differ.
# t test for vendor
ttest_vendor = t.test(tip_fare_ratio ~ VendorID, data = processed_df)
ttest_vendor
##
## Welch Two Sample t-test
##
## data: tip_fare_ratio by VendorID
## t = -2.5074, df = 9128, p-value = 0.01218
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0057922981 -0.0007094282
## sample estimates:
## mean in group 1 mean in group 2
## 0.2628451 0.2660959
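The p-value of 0.0121792 is less than 0.05, hence we reject the null hypothesis.
We can therefore conclude that the vendor does affect the tip ratio, although the difference in means is small (0.2628 for vendor 1 versus 0.2661 for vendor 2).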
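The decision rule and the size of the vendor effect can be checked with a few lines of arithmetic. The numbers below are copied from the t-test output above; the variable names are only illustrative.

```r
# Values reported by the Welch t-test above
p_value <- 0.0121792
mean_vendor_1 <- 0.2628451
mean_vendor_2 <- 0.2660959
alpha <- 0.05

# Reject the null hypothesis only when the p-value is below alpha
reject_null <- p_value < alpha

# Express the difference in means in percentage points of the fare
diff_pct_points <- (mean_vendor_2 - mean_vendor_1) * 100

print(reject_null)
print(round(diff_pct_points, 3))
```

The difference is statistically significant but amounts to roughly a third of a percentage point, which is worth keeping in mind when judging practical importance.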
The amount of tipping can be impacted by pickup time, dropoff time and the duration of the trip itself. For this analysis we have categorised the data into four categories, “Morning”, “Afternoon”, “Evening” and “Night” hours.
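The bucketing into four periods can be sketched with `cut()`. The hour boundaries used here (Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23) are an assumption, since the exact cut points are not stated in this report. Two of the timestamps are taken from the data glimpse at the top of the document and two are invented for illustration.

```r
# Example timestamps in the dataset's "m/d/yy H:MM" format
pickup_raw <- c("6/16/19 0:15",   # from the data glimpse
                "6/28/19 9:09",   # invented
                "6/14/19 23:04",  # from the data glimpse
                "6/3/19 13:30")   # invented

# Parse the strings; UTC is used only to keep the example deterministic
pickup_time <- as.POSIXct(pickup_raw, format = "%m/%d/%y %H:%M", tz = "UTC")
pickup_hour <- as.integer(format(pickup_time, "%H"))

# Assumed boundaries: [0,6) Night, [6,12) Morning, [12,18) Afternoon, [18,24) Evening
pickup_period <- cut(pickup_hour,
                     breaks = c(0, 6, 12, 18, 24),
                     labels = c("Night", "Morning", "Afternoon", "Evening"),
                     right = FALSE)

print(data.frame(pickup_raw, pickup_hour, pickup_period))
```

Setting `right = FALSE` makes each interval include its lower bound, so midnight (hour 0) falls into Night rather than being dropped as `NA`.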
Let us analyse the pickup period.
Let us analyse the dropoff period.
From the bar charts, trips are spread almost uniformly across the time periods, except that fewer trips are observed at night for both pickup and dropoff. The box plots, however, suggest that the mean tip ratio is not equal across time periods, so we perform an ANOVA test on each period column to test this formally.
ANOVA for pickup time periods
## Call:
## aov(formula = tip_fare_ratio ~ pickup_period, data = processed_df)
##
## Terms:
## pickup_period Residuals
## Sum of Squares 0.18735 54.74638
## Deg. of Freedom 3 11877
##
## Residual standard error: 0.06789289
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## pickup_period 3 0.19 0.06245 13.55 8.04e-09 ***
## Residuals 11877 54.75 0.00461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA for dropoff time periods
## Call:
## aov(formula = tip_fare_ratio ~ drop_period, data = processed_df)
##
## Terms:
## drop_period Residuals
## Sum of Squares 0.19737 54.73636
## Deg. of Freedom 3 11877
##
## Residual standard error: 0.06788668
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## drop_period 3 0.20 0.06579 14.28 2.78e-09 ***
## Residuals 11877 54.74 0.00461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For these ANOVA analyses, the null hypothesis is that the means for all time periods are the same. The p-values are 8.0357371 × 10^{-9} and 2.7787489 × 10^{-9} for pick up and drop off, respectively. Both of these are smaller than a significance level of 0.05 (a 95% confidence level). Thus, we reject the null hypothesis and conclude that the means are statistically different at a significance level of 0.05.
As the previous results were significant, we can run Tukey’s HSD test:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_fare_ratio ~ pickup_period, data = processed_df)
##
## $pickup_period
## diff lwr upr p adj
## Afternoon-Morning 0.002486928 -0.002028522 0.0070023772 0.4898287
## Evening-Morning 0.010043145 0.005744952 0.0143413382 0.0000000
## Night-Morning 0.006094137 0.001350379 0.0108378962 0.0053440
## Evening-Afternoon 0.007556218 0.003166567 0.0119458677 0.0000580
## Night-Afternoon 0.003607210 -0.001219571 0.0084339905 0.2195216
## Night-Evening -0.003949008 -0.008573182 0.0006751668 0.1248348
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_fare_ratio ~ drop_period, data = processed_df)
##
## $drop_period
## diff lwr upr p adj
## Afternoon-Morning 0.002106065 -2.446096e-03 0.006658225 0.6339951
## Evening-Morning 0.010372431 6.029032e-03 0.014715831 0.0000000
## Night-Morning 0.004613450 -9.601819e-05 0.009322919 0.0573673
## Evening-Afternoon 0.008266367 3.868904e-03 0.012663829 0.0000082
## Night-Afternoon 0.002507386 -2.251990e-03 0.007266761 0.5286702
## Night-Evening -0.005758981 -1.031909e-02 -0.001198871 0.0064709
For the pick up period, large adjusted p-values (no significant difference) are observed for Afternoon-Morning, Night-Afternoon and Night-Evening. For the drop off period, large adjusted p-values are observed for Afternoon-Morning and Night-Afternoon, and Night-Morning is just above the threshold (0.057). The adjusted p-values for Evening-Morning and Evening-Afternoon are approximately zero for both pick up and drop off, which shows that the tip ratio for evening trips differs significantly from that for morning and afternoon trips.
There might be a relationship between the tip paid and the trip duration. We plot a histogram of tip paid against trip duration to observe the pattern.
From the graph, the tip is highest for trips lasting 5 to 10 minutes and decreases as duration increases. We then perform a t-test.
##
## Welch Two Sample t-test
##
## data: processed_df$tip_fare_ratio and processed_df$trip_duration
## t = -177.85, df = 11882, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.92108 -12.63937
## sample estimates:
## mean of x mean of y
## 0.264889 13.045114
From the results, the p-value is below 0.05, so at a 5% significance level we reject the null hypothesis that the means of tip_fare_ratio and trip_duration are equal. Note, however, that this test compares the averages of two quantities measured in different units (a ratio and a duration in minutes). Rejecting it shows only that those two averages differ; it does not by itself establish that people travelling in yellow cabs in NYC tip differently based on the duration of the trip. The histogram above remains our main evidence for that pattern.
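A more direct way to ask whether the tip ratio changes with trip duration is a correlation test or a simple linear regression. The sketch below is not part of the original analysis: it uses synthetic data with an invented negative slope, and `toy_df` is a hypothetical stand-in for `processed_df`.

```r
# Synthetic data: tip ratio declines slightly with duration (invented relationship)
set.seed(6)
toy_df <- data.frame(trip_duration = runif(500, min = 2, max = 40))
toy_df$tip_fare_ratio <- 0.30 - 0.002 * toy_df$trip_duration +
  rnorm(500, mean = 0, sd = 0.05)

# Pearson correlation test: H0 is that the true correlation is zero
ct <- cor.test(toy_df$trip_duration, toy_df$tip_fare_ratio, method = "pearson")
print(ct$estimate)
print(ct$p.value)

# Simple linear regression gives the estimated change in tip ratio per minute
lm_fit <- lm(tip_fare_ratio ~ trip_duration, data = toy_df)
print(coef(lm_fit))
```

Unlike a two-sample t-test between the two columns, both approaches pair each trip's duration with its own tip ratio.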
The last independent factor that we chose for our study is distance travelled, and we investigate whether it has anything to do with the amount tipped. In other words, does the amount tipped vary between passengers who travel shorter distances and those who travel longer distances?
First we take a look at a summary of the trip distance column. We can visually confirm these values through a box plot and histogram.
Now we turn to our second, dependent variable: tip percentage …
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1091 0.2287 0.2667 0.2649 0.3086 0.4283
The distribution looks approximately normal: the mean (0.2649) and the median (0.2667) nearly coincide. With a sample this large, the central limit theorem also supports treating the sample mean as approximately normal, so we are set to move forward.
Now that we know our variables, the question is which test to apply, and why, when comparing tip paid with distance travelled. A z-test cannot be used because we don't know the population mean and standard deviation. We also cannot use a one-sample t-test because we don't have a pre-determined population mean or some other theoretically derived value against which to compare the mean of our observed sample.
So the simplest option is an independent two-sample t-test, a significance test that estimates whether a difference in means between two groups is the result of random variation or the product of specific characteristics within the groups.
But before applying the two-sample t-test, we first need to fulfil some conditions for reliable results: the samples should be random, approximately normal, and independent.
Assumptions/Criteria to be fulfilled:
## Observations: 11,881
## Variables: 2
## $ tip_fare_ratio <dbl> 0.2396330, 0.2316176, 0.1626016, 0.2543511, 0.133…
## $ trip_distance <dbl> 13.46, 17.30, 16.41, 17.84, 15.30, 13.30, 14.27, …
Since we are going to work with two columns, the dependent variable tip_fare_ratio and the independent variable trip_distance, we subset these variables into a new data frame for our own ease.
Since we chose a two-sample t-test, we divide the independent variable, the distance covered in miles during each ride, into a two-level categorical variable: Shorter Distances and Longer Distances. We split this column into the two levels based on the mean distance travelled, as can be seen from the glimpse of the dataset below.
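The split at the mean can be sketched as follows. The five rows are invented for illustration, the level names are taken from the output in this section, and whether a trip exactly at the mean counts as short or long is an assumption (here it counts as short). The new column name `distance_group` is hypothetical; the report overwrites `trip_distance` instead.

```r
# Small invented example; the real data has 11,881 rows
trip_df <- data.frame(
  tip_fare_ratio = c(0.30, 0.28, 0.25, 0.22, 0.20),
  trip_distance = c(0.6, 1.1, 2.7, 13.5, 17.3)
)

# Cut-off is the mean trip distance
cutoff <- mean(trip_df$trip_distance)

# Two-level factor; a trip exactly at the cut-off is treated as short (assumption)
trip_df$distance_group <- factor(
  ifelse(trip_df$trip_distance <= cutoff, "Shorter Distances", "Longer Distances"),
  levels = c("Shorter Distances", "Longer Distances")
)

print(cutoff)
print(table(trip_df$distance_group))
```

Because trip distances are right-skewed, splitting at the mean puts more trips in the short group than in the long group, which matches the group sizes in the output further down (8446 versus 3435).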
## Observations: 11,881
## Variables: 2
## $ tip_fare_ratio <dbl> 0.2396330, 0.2316176, 0.1626016, 0.2543511, 0.133…
## $ trip_distance <fct> Longer Distances, Longer Distances, Longer Distan…
Here is a histogram that combines values from both of our variables. It shows the frequency of tip percentages paid by short and long distance travellers. We can get a rough idea of the mean tip percentage paid by each group, but we can't be sure. Judging by the shape of this plot alone, we are unable to say whether there is a relationship between these two variables.
## 'data.frame': 8446 obs. of 2 variables:
## $ tip_fare_ratio: num 0.25 0.3 0.3 0.301 0.308 ...
## $ trip_distance : Factor w/ 2 levels "Shorter Distances",..: 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame': 3435 obs. of 2 variables:
## $ tip_fare_ratio: num 0.24 0.232 0.163 0.254 0.133 ...
## $ trip_distance : Factor w/ 2 levels "Shorter Distances",..: 2 2 2 2 2 2 2 2 2 2 ...
##
## Welch Two Sample t-test
##
## data: short_dist$tip_fare_ratio and long_dist$tip_fare_ratio
## t = 31.414, df = 7773.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03593062 0.04071328
## sample estimates:
## mean of x mean of y
## 0.2759685 0.2376466
With the significance level set to 5%, we get a p-value very close to zero. We therefore have enough evidence to reject the null hypothesis in favour of the alternative hypothesis, meaning that people travelling in yellow cabs in NYC tip differently based on distance travelled.
We can investigate a little further whether one of these two categories of travellers tips more than the other, so we plot a box plot to compare the two groups.
Honestly speaking, we were a bit surprised to see that short distance travellers pay a higher tip, as a percentage of the fare, than those who travelled more miles. We think this result deserves a separate study of its own: why short distance travellers pay more, perhaps for psychological reasons, or why long distance travellers pay less, perhaps for psychological or socio-economic reasons, or because they are tired of the commute and it affects their mood. But that is another topic.
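For reference, a Welch two-sample t-test of this kind, and the pieces of its result that support the conclusion, can be sketched as below. The data are synthetic: the group sizes, means and spreads are invented, and `short_tips` and `long_tips` are hypothetical stand-ins for the tip ratios of the two distance groups.

```r
# Synthetic tip ratios for the two groups; all parameters are invented
set.seed(8)
short_tips <- rnorm(300, mean = 0.28, sd = 0.06)
long_tips <- rnorm(120, mean = 0.24, sd = 0.07)

# t.test() uses var.equal = FALSE by default, which is the Welch test
tt <- t.test(short_tips, long_tips)

# Difference in sample means and its 95% confidence interval
observed_diff <- mean(short_tips) - mean(long_tips)
print(tt$method)
print(observed_diff)
print(tt$conf.int)
```

A confidence interval that excludes zero corresponds to rejecting the null hypothesis at the 5% level, and its sign shows which group tips more on average.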