## Introduction

``````path_image <- "C:/Users/Akshay Kher/Desktop/Uber/Data Challenge Related/Akshay Kher - Uber Data Challenge/documents/"
knitr::include_graphics(paste0(path_image,'uber_logo.jpg'))``````

#### Problem Description

Uber has just started offering promotions to riders in city X. The goal of the promotional campaign is to encourage more trips among the riders and increase engagement. Now, the city team wants to make use of data to analyze the campaign’s net benefit, including it’s financial effectiveness. This will help them to run more data-driven promotional campaigns in the future, which should result in increased financial benefits.

#### Steps

1. Data Preparation
2. Data Analysis
1. Data Exploration
2. A/B Testing
3. Summary & Recommendations
4. Future Work

## 1. Data Preparation

Libraries Required

``````library(readxl) # package to read excel files
library(tidyverse) # package to perform data manipulation
library(lubridate) # package to work with date objects
library(kableExtra) # package to output HTML tables
library(DT) # package to output HTML tables
library(ggthemes) # package to change visualization themes
library(FNN) # package to run K-Nearest Neighbour Algorithm``````

``````path <- "C:/Users/Akshay Kher/Desktop/Uber/Data Challenge Related/Akshay Kher - Uber Data Challenge/data/" # data path

driver_rider_trips <- read_excel(paste0(path,"data_set_170705.xlsx"), sheet = "driver_trips") # historical trips for drivers
rider_trips <- read_excel(paste0(path,"data_set_170705.xlsx"), sheet = "rider_trips") # historical trips for riders

Cleaning Data

#### (I) Driver Trips (59854 X 9) | Rider Trips (60000 X 10)

Aggregating Driver and Rider Trips data (59999 X 13):

• Driver Trips table has been outer joined with Rider Trips table
• Trip ID: 94a1-82d8 has 2 values. The one corresponding to trip status: unfulfilled has been removed.
• driver payout has been renamed to driver_payout
``````# remove duplicate trip id
remove_index_driver <- which(driver_rider_trips\$trip_id == '94a1-82d8' & driver_rider_trips\$trip_status == 'unfulfilled')
remove_index_rider <- which(rider_trips\$trip_id == '94a1-82d8' & rider_trips\$trip_status == 'unfulfilled')
driver_rider_trips <- driver_rider_trips[-remove_index_driver,]
rider_trips <- rider_trips[-remove_index_rider,]

# Joining driver and rider trip data
driver_rider_trips <-
driver_rider_trips %>%
full_join(select(rider_trips, trip_id, rider_id, estimated_time_to_arrival,
trip_price_pre_discount, rider_payment), by='trip_id') %>%
select(trip_id, driver_id, rider_id, trip_status, request_time, estimated_time_to_arrival, actual_time_to_arrival,
surge_multiplier, driver_payout = `driver payout` , trip_price_pre_discount, rider_payment, start_geo, end_geo)``````

Data Dictionary

``````text_tbl <- data.frame (
Variable = names(driver_rider_trips),
Description = c(
"Unique identifier for trip",
"Unique identifier for driver",
"Unique identifier for rider",
"Status of the trip",
"Local request time of the trip",
"Minutes estimated at time of request to pick-up time",
"Minutes from request to pick-up",
"The surge multiplier on the regular price of the trip (base fare + time charge + distance charge)",
"The amount the driver was paid for the trip",
"The original price of the trip",
"The amount the rider paid for the trip (includes the discount, which Uber covers)",
"The geo of the request point of the trip",
"The geo of the dropoff point of the trip"
)
)

kable(text_tbl) %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em")``````
Variable Description
trip_id Unique identifier for trip
driver_id Unique identifier for driver
rider_id Unique identifier for rider
trip_status Status of the trip
request_time Local request time of the trip
estimated_time_to_arrival Minutes estimated at time of request to pick-up time
actual_time_to_arrival Minutes from request to pick-up
surge_multiplier The surge multiplier on the regular price of the trip (base fare + time charge + distance charge)
driver_payout The amount the driver was paid for the trip
trip_price_pre_discount The original price of the trip
rider_payment The amount the rider paid for the trip (includes the discount, which Uber covers)
start_geo The geo of the request point of the trip
end_geo The geo of the dropoff point of the trip

First 100 rows

``````kable(head(driver_rider_trips, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")``````
trip_id driver_id rider_id trip_status request_time estimated_time_to_arrival actual_time_to_arrival surge_multiplier driver_payout trip_price_pre_discount rider_payment start_geo end_geo
b3cb-b9bf 4649-2f0a 5ba9-cd29 completed 2012-04-09 06:12:00 4.4833333 3.5166667 1.0 4.2368 5.296 5.2960 Allen Abby Chelsea Court
8569-0fbf 4003-e2e7 b8ed-f739 completed 2012-05-06 05:44:00 4.6833333 2.6333333 2.0 8.2432 10.304 10.3040 Blair Bend Chelsea Court
86f2-e4cc 44af-7b74 a12d-275f completed 2012-04-25 11:28:00 1.9833333 1.3333333 1.0 6.1952 7.744 7.7440 Chelsea Court Blair Bend
aef6-924e 4909-aa90 14af-2a91 completed 2012-04-27 21:34:00 2.7500000 1.7000000 1.0 8.1856 10.232 10.2320 Chelsea Court Daisy Drive
921a-11d3 4e79-a40f a0b5-fbfc completed 2012-04-30 16:44:00 NA NA 1.0 11.8080 14.760 14.7600 Daisy Drive Daisy Drive
be30-1475 4dbd-e6e5 5e2b-bf7c completed 2012-04-19 21:23:00 1.4333333 0.7833333 1.0 4.7744 5.968 5.9680 Chelsea Court Allen Abby
a7af-1f24 4864-277e 8bfc-9289 completed 2012-04-12 18:41:00 2.4833333 1.7000000 1.0 6.8544 8.568 8.5680 Blair Bend Blair Bend
a127-d6cf 4771-14c0 230e-661c completed 2012-05-02 18:58:00 3.8166667 3.4833333 1.0 6.5792 8.224 8.2240 Chelsea Court Chelsea Court
bc53-25f7 4a2f-40dc 040d-e78f completed 2012-04-16 11:26:00 1.9166667 1.4833333 1.0 8.1664 10.208 10.2080 Chelsea Court Chelsea Court
8626-c0f3 424f-f649 42db-5144 completed 2012-04-10 14:19:00 2.2166667 1.9500000 1.0 5.7088 7.136 7.1360 Allen Abby Daisy Drive
b296-e5ab 48dc-464c 5dc8-715a completed 2012-04-14 14:48:00 2.2166667 0.8000000 1.0 4.2368 5.296 5.2960 Blair Bend Chelsea Court
82b8-4bd6 432f-4641 9afa-988a completed 2012-05-04 13:40:00 3.2166667 2.7000000 1.5 6.4640 8.080 7.2720 Chelsea Court Chelsea Court
9a69-0480 4593-b359 c5d4-0a59 completed 2012-04-15 11:29:00 4.3166667 2.4000000 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
808d-46aa 4bf5-8ff9 e62d-39d6 rider_canceled 2012-05-03 10:26:00 4.1500000 NA 1.0 0.0000 0.000 0.0000 Chelsea Court Chelsea Court
a5fb-3df7 4c27-2fdb af7d-5914 completed 2012-04-25 15:45:00 4.5166667 3.3833333 1.0 7.0272 8.784 8.7840 Chelsea Court Chelsea Court
a3ce-4e65 47ad-9ece 765d-3f7c rider_canceled 2012-04-22 18:52:00 3.6333333 1.4833333 1.0 3.2000 4.000 4.0000 Chelsea Court Chelsea Court
b945-bd81 4c35-6423 e708-5959 completed 2012-04-21 18:25:00 8.0166667 6.7500000 1.0 9.3312 11.664 11.6640 Allen Abby Chelsea Court
b38c-f300 4c9f-fd75 4b9e-08cf completed 2012-04-28 15:08:00 8.0500000 5.6166667 1.4 3.9680 4.960 4.9600 Allen Abby Chelsea Court
909f-9acd 42bc-27e4 38e8-eacf completed 2012-04-20 20:25:00 1.1000000 0.8166667 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
87f1-6172 4e84-fe6c ffa8-7fe5 completed 2012-04-24 17:32:00 5.4166667 5.7333333 1.0 4.2368 5.296 5.2960 Allen Abby Allen Abby
ab3b-1a81 4cad-1623 3e7b-e615 completed 2012-05-04 16:30:00 1.1000000 0.1333333 1.0 16.3200 20.400 20.4000 Daisy Drive Blair Bend
8f14-9407 45cf-4b7c afc1-b6a0 rider_canceled 2012-04-09 15:12:00 6.0666667 NA 1.0 0.0000 0.000 0.0000 Blair Bend Chelsea Court
a501-707d 4460-6448 cb31-0bc8 completed 2012-04-19 06:53:00 1.3166667 0.8000000 1.0 5.6960 7.120 7.1200 Allen Abby Chelsea Court
84e0-e094 4267-6351 14bc-638e completed 2012-04-26 17:42:00 1.8166667 0.8666667 1.0 4.2752 5.344 5.3440 Allen Abby Allen Abby
bb19-65ec 4641-9c69 265c-8bb9 completed 2012-04-18 11:02:00 2.0000000 0.8000000 1.0 4.3968 5.496 5.4960 Allen Abby Chelsea Court
b95a-39eb 4da9-89c7 b50b-a57c completed 2012-04-20 15:40:00 1.2833333 0.8000000 2.1 15.2960 19.120 17.2080 Chelsea Court Blair Bend
ba8b-3bd8 4b27-509a 361f-300d completed 2012-04-25 11:33:00 2.3833333 2.8166667 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
a86c-8349 49ad-ff4a 7895-c332 completed 2012-05-06 11:20:00 2.0833333 1.4000000 1.4 9.6064 12.008 12.0080 Blair Bend Blair Bend
ac38-7fff 4b4e-95d4 6cdb-4bf7 completed 2012-04-13 16:56:00 4.8666667 4.0666667 1.4 6.3680 7.960 7.9600 Chelsea Court Chelsea Court
b340-9b21 4511-d0cd 071a-8ec2 completed 2012-04-14 14:54:00 2.3166667 0.7833333 1.0 7.9936 9.992 9.9920 Chelsea Court Chelsea Court
b46e-cf9e 4479-4706 f314-92ea completed 2012-04-10 07:57:00 1.4333333 1.0833333 1.0 7.4560 9.320 9.3200 Allen Abby Chelsea Court
bd02-6fff 462a-5684 6aff-f70e completed 2012-04-17 08:25:00 1.1333333 0.9500000 1.0 4.9216 6.152 6.1520 Allen Abby Chelsea Court
bb42-dda1 4c35-224c 0549-38a6 completed 2012-04-14 15:58:00 2.0000000 2.7666667 1.0 5.3376 6.672 6.6720 Chelsea Court Chelsea Court
987d-9509 4e79-a40f b845-c8da completed 2012-05-06 18:48:00 NA NA 1.0 70.7776 88.472 88.4720 Chelsea Court Chelsea Court
8bd4-b2b0 4402-b26d cc81-d066 completed 2012-05-03 10:08:00 4.1500000 5.3166667 1.0 9.4208 11.776 11.7760 Chelsea Court Chelsea Court
96ee-5a04 4f71-6965 752e-2328 completed 2012-04-27 16:47:00 1.0833333 0.1333333 1.0 6.2336 7.792 7.7920 Allen Abby Daisy Drive
95e0-12ab 46b2-26e5 42c6-9305 completed 2012-04-18 18:53:00 3.9666667 0.7833333 1.0 9.6960 12.120 12.1200 Chelsea Court Chelsea Court
967c-bff6 4d85-f3e1 ee39-25c3 completed 2012-04-14 19:35:00 2.1833333 1.2000000 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
85da-8123 4c66-e75e 6988-c7e0 completed 2012-05-05 04:30:00 2.2666667 1.6000000 1.3 12.8896 16.112 16.1120 Chelsea Court Blair Bend
bf0c-26cf 4e9a-529e 49b0-5f25 completed 2012-04-15 08:43:00 0.9500000 0.7833333 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
8998-a3e8 43e2-f9be 5cf7-44a8 completed 2012-04-25 23:13:00 0.5166667 0.0166667 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
ae7e-3b43 445e-dd18 bf37-b192 rider_canceled 2012-04-22 13:05:00 3.6000000 NA 1.0 0.0000 0.000 0.0000 Blair Bend Blair Bend
bc93-1754 4efd-5d71 11d2-d5b0 completed 2012-04-29 21:02:00 3.1166667 2.2833333 1.0 9.1648 11.456 11.4560 Allen Abby Blair Bend
bee2-3e2a 4ecd-782a c8be-6691 completed 2012-04-11 20:52:00 2.2333333 1.9500000 1.0 7.3408 9.176 9.1760 Chelsea Court Allen Abby
98e9-2919 4519-363a d213-ee7e rider_canceled 2012-04-17 15:12:00 9.6666667 NA 1.0 0.0000 0.000 0.0000 Allen Abby Allen Abby
856a-094d 4c92-800b 1f80-1d93 completed 2012-04-22 20:09:00 1.6666667 0.8000000 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
8b4d-43b8 435c-49a9 a4b1-3d40 completed 2012-04-17 16:11:00 1.6500000 4.4833333 1.7 6.4960 8.120 7.3080 Chelsea Court Chelsea Court
bc20-1b8e 4d87-1393 1b4e-8d74 completed 2012-04-14 15:38:00 4.7833333 4.7333333 1.0 2.5920 3.240 3.2400 Daisy Drive Daisy Drive
9873-5973 4b44-761b e60b-ba1c completed 2012-04-22 01:06:00 1.0666667 0.8166667 1.0 5.8816 7.352 7.3520 Chelsea Court Chelsea Court
8fba-c605 48db-4b6f f6db-330d completed 2012-04-12 13:47:00 4.1333333 4.2500000 1.0 4.8192 6.024 6.0240 Chelsea Court Allen Abby
b27f-d1f5 47ee-c77e 2a4e-fe2e completed 2012-04-27 20:14:00 0.9000000 0.0166667 1.0 4.2368 5.296 5.2960 Allen Abby Allen Abby
9c49-d914 4ce0-d7ad e91a-63e5 completed 2012-04-30 18:51:00 6.3000000 6.9666667 1.0 5.0304 6.288 6.2880 Chelsea Court Chelsea Court
9add-ac73 49ef-1adb b2a6-fa8b completed 2012-05-04 15:30:00 2.6166667 1.9833333 1.8 8.0000 10.000 9.0000 Chelsea Court Blair Bend
9799-0a20 439b-c0fc 1f56-a5b8 completed 2012-04-17 16:34:00 1.8333333 1.2000000 1.3 3.1040 3.880 3.8800 Chelsea Court Chelsea Court
9fed-17b0 4d4b-c4ba 76cd-1b74 completed 2012-04-29 20:56:00 0.4666667 0.0166667 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
92ca-da8e 46df-ce53 4573-311f completed 2012-04-25 19:29:00 3.2000000 2.2000000 1.0 6.1696 7.712 7.7120 Blair Bend Blair Bend
98cb-5e77 4174-ce68 1e72-eee7 completed 2012-04-29 04:38:00 4.5500000 2.7666667 1.0 5.2800 6.600 6.6000 Blair Bend Allen Abby
855c-7ee4 4b2f-ddc0 24d1-f72a completed 2012-04-10 07:09:00 3.5500000 2.8000000 1.0 3.2064 4.008 4.0080 Chelsea Court Chelsea Court
9648-93e3 41a9-7ef0 c287-69ef completed 2012-04-16 19:12:00 6.1000000 9.0833333 1.0 3.0208 3.776 3.7760 Chelsea Court Daisy Drive
bd1a-b59f 4e79-a40f 2626-72af completed 2012-05-05 17:43:00 NA NA 1.0 21.4592 26.824 26.8240 Chelsea Court Chelsea Court
a480-1d32 4e39-92e4 6f80-309a completed 2012-04-18 09:24:00 3.5333333 6.2666667 1.3 5.2032 6.504 6.5040 Allen Abby Allen Abby
8717-22f4 476c-0aa6 0f85-f809 completed 2012-05-03 21:00:00 1.6000000 0.6500000 1.0 4.2368 5.296 5.2960 Chelsea Court Blair Bend
afe2-dd9d 42b9-2d69 abac-3b86 rider_canceled 2012-04-15 10:01:00 1.3666667 0.7833333 1.4 0.0000 0.000 0.0000 Chelsea Court Chelsea Court
9f19-ec51 45b4-d6b4 3d5b-f2e1 rider_canceled 2012-04-09 19:48:00 1.4166667 2.5166667 1.0 0.0000 0.000 0.0000 Chelsea Court Chelsea Court
ae45-15ab 4c52-3174 b5f9-936f rider_canceled 2012-05-05 19:13:00 3.1666667 NA 1.0 0.0000 0.000 0.0000 Chelsea Court Chelsea Court
8cd2-82e0 4542-1ddc 423b-ff47 completed 2012-04-16 16:23:00 1.7666667 1.6166667 1.4 8.6080 10.760 10.7600 Blair Bend Chelsea Court
8d53-041d 42f6-eeba 1eba-d693 completed 2012-04-26 13:05:00 5.2500000 4.2166667 1.0 4.3328 5.416 5.4160 Chelsea Court Blair Bend
90c4-0751 4e79-a40f 10d8-e09a completed 2012-04-23 11:26:00 NA NA 1.0 25.2928 31.616 31.6160 Chelsea Court Chelsea Court
82a2-c586 4e79-a40f 6f66-e3fc completed 2012-04-25 23:33:00 NA NA 1.0 13.7536 17.192 17.1920 Daisy Drive Daisy Drive
ad56-3279 439a-e8ea 0876-e9b4 completed 2012-04-27 11:05:00 4.5500000 6.7166667 1.0 4.6144 5.768 5.7680 Chelsea Court Chelsea Court
b306-eaf3 4e79-a40f d75a-7e57 failed 2012-04-12 21:33:00 NA NA 1.0 8.6272 10.784 10.7840 Chelsea Court Chelsea Court
9591-070c 44fa-ceee f82b-45ea completed 2012-04-19 10:59:00 0.8500000 0.0666667 1.0 4.9024 6.128 6.1280 Chelsea Court Chelsea Court
85a1-9b62 4e79-a40f 4124-e9c0 completed 2012-04-11 11:29:00 NA NA 1.0 81.0560 101.320 101.3200 Chelsea Court Chelsea Court
8d4e-ae57 4fe5-a702 e79e-142d completed 2012-04-13 22:55:00 1.5666667 0.8000000 1.0 4.2368 5.296 5.2960 Chelsea Court Allen Abby
8b04-3ac9 493d-c131 c9a3-8d55 completed 2012-04-14 11:21:00 2.0000000 0.7833333 1.0 4.8512 6.064 6.0640 Chelsea Court Allen Abby
b602-6067 4436-6806 61e0-ed48 completed 2012-05-05 02:02:00 2.6166667 2.5833333 1.0 6.0480 7.560 7.5600 Chelsea Court Allen Abby
b150-c566 4fc7-9d8e ecf0-f3b5 completed 2012-04-27 19:39:00 0.9500000 0.0166667 1.0 10.9568 13.696 13.6960 Chelsea Court Chelsea Court
9004-4d2d 403f-30ea 6274-dae9 completed 2012-04-11 22:30:00 1.5666667 1.6833333 1.5 8.5248 10.656 9.5904 Chelsea Court Chelsea Court
956e-9453 4fa1-ec63 7043-e5dd completed 2012-04-15 17:56:00 2.8666667 1.8500000 1.0 2.6176 3.272 3.2720 Chelsea Court Chelsea Court
97dc-1091 4045-1674 cc4f-433c completed 2012-04-20 20:40:00 2.6000000 0.8000000 1.0 5.2352 6.544 6.5440 Chelsea Court Chelsea Court
8d1e-5277 450e-a1ad 4d88-ab83 completed 2012-04-10 12:03:00 3.7000000 15.5000000 1.0 6.1696 7.712 7.7120 Chelsea Court Blair Bend
96fc-5b63 47a0-36fd 9716-f4ee completed 2012-04-13 22:24:00 1.7333333 2.0166667 1.3 3.1040 3.880 3.4920 Chelsea Court Daisy Drive
a029-dab9 4d91-259f 3379-6df9 completed 2012-04-20 18:43:00 0.7333333 0.8333333 1.0 4.8768 6.096 6.0960 Chelsea Court Allen Abby
ad33-a5b2 4ccf-2716 30f5-5c53 completed 2012-04-11 23:06:00 4.2833333 6.6500000 1.0 2.6176 3.272 3.2720 Chelsea Court Chelsea Court
a228-737a 43fc-ec02 37cd-53c5 completed 2012-04-13 08:53:00 2.7500000 4.0833333 1.2 4.8832 6.104 6.1040 Chelsea Court Chelsea Court
873b-2017 4e79-a40f 2ed4-2247 completed 2012-04-27 17:20:00 NA NA 1.0 34.7520 43.440 43.4400 Chelsea Court Chelsea Court
98cb-68e1 45d3-3ee8 3b1b-d3ca completed 2012-04-22 09:28:00 7.8833333 5.3000000 1.2 4.8832 6.104 6.1040 Chelsea Court Chelsea Court
81e8-29e1 4f76-cd67 51da-a8f5 completed 2012-05-04 18:34:00 2.4166667 2.2166667 1.3 8.8384 11.048 11.0480 Chelsea Court Blair Bend
8cb1-e736 4f36-b57b 4fab-d205 completed 2012-05-06 13:43:00 1.4500000 0.4833333 1.0 5.2096 6.512 6.5120 Allen Abby Chelsea Court
bbf9-7484 429b-035a 6115-1fe0 completed 2012-04-25 16:36:00 1.9666667 NA 1.0 3.9808 4.976 4.9760 Chelsea Court Allen Abby
be71-3052 43d1-6d2c a855-9d88 completed 2012-04-26 20:54:00 0.7666667 0.0166667 1.0 6.0864 7.608 7.6080 Chelsea Court Chelsea Court
99ef-1a75 411a-0a2b 3562-dfa8 completed 2012-05-04 21:53:00 7.3500000 6.4833333 1.0 4.8896 6.112 6.1120 Blair Bend Blair Bend
bdee-3181 4e16-0778 0c7b-ccb4 completed 2012-04-25 18:33:00 0.4666667 0.0166667 1.0 7.0528 8.816 8.8160 Chelsea Court Chelsea Court
8640-f89e 4e00-0fbd b4e6-d79b rider_canceled 2012-04-19 12:44:00 5.6166667 6.3500000 1.0 1.2800 1.600 1.6000 Allen Abby Allen Abby
a16c-165a 400d-b444 8051-8849 completed 2012-04-12 12:38:00 1.6333333 1.1333333 1.4 5.5296 6.912 6.9120 Allen Abby Allen Abby
b389-0113 4952-26d9 ff66-9cbf completed 2012-04-12 03:02:00 4.6666667 3.5666667 1.0 4.2368 5.296 5.2960 Chelsea Court Chelsea Court
8f49-ea9e 40e4-546b 5946-1009 completed 2012-04-16 19:14:00 1.4000000 1.3666667 1.0 4.2368 5.296 5.2960 Chelsea Court Allen Abby
afe4-2852 46c0-abc2 d748-6add completed 2012-04-22 22:49:00 3.1833333 3.6666667 1.0 9.8496 12.312 12.3120 Allen Abby Chelsea Court
8af8-15b9 4d20-2151 2542-20fa rider_canceled 2012-04-13 15:05:00 4.8000000 NA 1.6 0.0000 0.000 0.0000 Chelsea Court Allen Abby
a410-c2a4 4278-f2b6 54cd-c985 completed 2012-04-28 05:25:00 6.0000000 3.8333333 1.0 9.7600 12.200 12.2000 Daisy Drive Daisy Drive

Converting data types:

• Trip status to factor
• Request time to date format
• Start location as factor
• End location as factor
``````driver_rider_trips\$trip_status <- as.factor(driver_rider_trips\$trip_status) # converting to factor
driver_rider_trips\$request_time <- ymd_hms(driver_rider_trips\$request_time) # converting to date
driver_rider_trips\$start_geo <- as.factor(driver_rider_trips\$start_geo) # converting to factor
driver_rider_trips\$end_geo <- as.factor(driver_rider_trips\$end_geo) # converting to factor

glimpse(driver_rider_trips)``````
``````## Observations: 59,999
## Variables: 13
## \$ trip_id                   <chr> "b3cb-b9bf", "8569-0fbf", "86f2-e4cc...
## \$ driver_id                 <chr> "4649-2f0a", "4003-e2e7", "44af-7b74...
## \$ rider_id                  <chr> "5ba9-cd29", "b8ed-f739", "a12d-275f...
## \$ trip_status               <fct> completed, completed, completed, com...
## \$ request_time              <dttm> 2012-04-09 06:12:00, 2012-05-06 05:...
## \$ estimated_time_to_arrival <dbl> 4.483333, 4.683333, 1.983333, 2.7500...
## \$ actual_time_to_arrival    <dbl> 3.5166667, 2.6333333, 1.3333333, 1.7...
## \$ surge_multiplier          <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1...
## \$ driver_payout             <dbl> 4.2368, 8.2432, 6.1952, 8.1856, 11.8...
## \$ trip_price_pre_discount   <dbl> 5.296, 10.304, 7.744, 10.232, 14.760...
## \$ rider_payment             <dbl> 5.296, 10.304, 7.744, 10.232, 14.760...
## \$ start_geo                 <fct> Allen Abby, Blair Bend, Chelsea Cour...
## \$ end_geo                   <fct> Chelsea Court, Chelsea Court, Blair ...``````

Missing Values

``````# calculate missing values
na_table <-
map_dbl(driver_rider_trips, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
actual_time_to_arrival 6416
estimated_time_to_arrival 2500
driver_id 146
trip_status 146
request_time 146
surge_multiplier 146
driver_payout 146
start_geo 146
end_geo 146
trip_id 0
rider_id 0
trip_price_pre_discount 0
rider_payment 0

There are 146 Trip IDs that do not have a corresponding driver assigned. All of these 146 trips were cancelled by the rider. It seems like the trips were cancelled before a driver could be assigned. After removing these 146 Trip IDs the missing value table looks like this:

``````# Removing Trip IDs with no driver assigned
driver_rider_trips <-
filter(driver_rider_trips, !is.na(driver_id))

# calculate missing values
na_table <-
map_dbl(driver_rider_trips, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
actual_time_to_arrival 6270
estimated_time_to_arrival 2354
trip_id 0
driver_id 0
rider_id 0
trip_status 0
request_time 0
surge_multiplier 0
driver_payout 0
trip_price_pre_discount 0
rider_payment 0
start_geo 0
end_geo 0

Only actual time to arrival and estimated time to arrival columns has missing values.

Note: Out of 6270 missing values for actual time to arrival:

• 3800: rider cancelled trips
• 2407: completed trips
• 35: unfulfilled trips
• 29: failed trips

Completed trips should have non-missing values i.e. 2407 trips. Other trips should have missing values.

Note: Out of 2354 missing values for estimated time to arrival:

• 81: rider cancelled trips
• 2210: completed trips
• 34: unfulfilled trips
• 29: failed trips

Trips which are unfulfilled should have missing values. Other trips can have non-missing values i.e. 2325 trips. The updated missing value table looks like:

``````# calculate missing values
na_table <-
map_dbl(filter(driver_rider_trips, trip_status == 'completed'), function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

na_table\$total_missing[2] <- 2325

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
actual_time_to_arrival 2407
estimated_time_to_arrival 2325
trip_id 0
driver_id 0
rider_id 0
trip_status 0
request_time 0
surge_multiplier 0
driver_payout 0
trip_price_pre_discount 0
rider_payment 0
start_geo 0
end_geo 0

Imputing Missing Values

After replacing the missing actual time to arrival values with estimated time to arrival, the missing value table looks like this:

``````# replacing the missing actual time to arrival values with estimated time to arrival
driver_rider_trips\$actual_time_to_arrival <-
ifelse(is.na(driver_rider_trips\$actual_time_to_arrival) & driver_rider_trips\$trip_status == "completed",
driver_rider_trips\$estimated_time_to_arrival, driver_rider_trips\$actual_time_to_arrival)

# calculate missing values
na_table <-
map_dbl(filter(driver_rider_trips, trip_status == 'completed'), function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

na_table\$total_missing[2] <- 2325

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
estimated_time_to_arrival 2210
actual_time_to_arrival 2325
trip_id 0
driver_id 0
rider_id 0
trip_status 0
request_time 0
surge_multiplier 0
driver_payout 0
trip_price_pre_discount 0
rider_payment 0
start_geo 0
end_geo 0

To fill the remaining 2210 actual time to arrival values and 2325 estimated time to arrival values, I have incorporated the following approach:

• Calculated the median actual time/estimated time to arrival values at each hour for all start locations
• Replaced the missing actual time/estimated time to arrival values by the median value for that particular hour and start location respectively

Using the above logic we are able to fill all the valuess.

``````# df1 contains all the median actual time to arrival by hour and start location for all completed driver trips
df1 <-
driver_rider_trips %>%
filter(trip_status == "completed") %>%
group_by(start_geo,  hour(request_time)) %>%
summarize(median_actual_time_to_arrival = median(actual_time_to_arrival, na.rm = TRUE)) %>%
select(start_geo, hour = `hour(request_time)`, median_actual_time_to_arrival)

# Repalacing missing ATA values with the median values for that particular hour and start location, for all completed trips
driver_rider_trips <-
mutate(driver_rider_trips, hour = hour(request_time)) %>%
left_join(df1, by= c('start_geo'='start_geo', 'hour'='hour')) %>%
mutate(actual_time_to_arrival = ifelse(is.na(actual_time_to_arrival)
, median_actual_time_to_arrival
, actual_time_to_arrival)) %>%
select(everything(), -median_actual_time_to_arrival, -hour)

# df2 contains all the median estimated time to arrival by hour and start location
df2 <-
driver_rider_trips %>%
group_by(start_geo,  hour(request_time)) %>%
summarize(median_estimated_time_to_arrival = median(estimated_time_to_arrival, na.rm = TRUE)) %>%
select(start_geo, hour = `hour(request_time)`, median_estimated_time_to_arrival)

# Repalacing missing estimated time to arrival values with the median values for that particular hour and start location
driver_rider_trips <-
mutate(driver_rider_trips, hour = hour(request_time)) %>%
left_join(df2, by= c('start_geo'='start_geo', 'hour'='hour')) %>%
mutate(estimated_time_to_arrival = ifelse(is.na(estimated_time_to_arrival)
, median_estimated_time_to_arrival
, estimated_time_to_arrival)) %>%
select(everything(), -median_estimated_time_to_arrival, -hour)

# Unfulfilled trips should have missing estimated time to arrival values
driver_rider_trips\$estimated_time_to_arrival <-
ifelse(driver_rider_trips\$trip_status == "unfulfilled", NA, driver_rider_trips\$estimated_time_to_arrival)``````

Final missing values table

``````# calculate missing values
na_table <-
map_dbl(filter(driver_rider_trips, trip_status == 'completed'), function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
trip_id 0
driver_id 0
rider_id 0
trip_status 0
request_time 0
estimated_time_to_arrival 0
actual_time_to_arrival 0
surge_multiplier 0
driver_payout 0
trip_price_pre_discount 0
rider_payment 0
start_geo 0
end_geo 0

Outlier Analysis

Observations from the summary table:

• Estimated time to arrival ranges between 0.02 - 33.77 minutes. However, 99% values lie below 9.5 minutes.
• Actual time to arrival ranges between 0.02 - 35.68 minutes. However, 99% values lie below 10.88 minutes.
• Surge multiplier ranges between 1 - 4.8. The mean surge price is 1.16.
• Driver Payout ranges between \$0 - \$175.87. However, 99% values lie below \$24.24. There are 5 completed trips with \$0 driver payout which have been removed.
• Trip price per discount ranges between \$0 - \$219.83. However, 99% values lie below \$30.30.
• Rider payment ranges between \$0 - \$219.83. However, 99% values lie below \$30.27. There are 5 completed trips with \$0 rider payment which have been removed.
• Note: Added a new column Actual Minus Estimated Time which is the difference between actual and estimated time of arrivals. It ranges between 50.95 minutes early to 65.77 minutes late. I have removed all values having an absolute value of over 30 minutes.

All variables seem to be heavily right skewed and top 1% values might be outliers. However, a throrough analysis needs to be done before removing any of the top 1% values.

``````# Difference between the actual and estimated time of arrival
driver_rider_trips\$actual_minus_estimated_time <- driver_rider_trips\$actual_time_to_arrival - driver_rider_trips\$estimated_time_to_arrival

# removing all difference values having an absolute value of over 30 minutes.
driver_rider_trips <- filter(driver_rider_trips, abs(actual_minus_estimated_time) <=30 | is.na(actual_minus_estimated_time))

index1 <- 6 # from column
index2 <- 11 # to column

# summary statistics for numerical variables
summary <- data.frame()
for(i in c(index1:index2, 14))
{
name = colnames(driver_rider_trips)[i]
min = min(driver_rider_trips[,i], na.rm=TRUE) %>% round(2)
percentile_1st = quantile(driver_rider_trips[,i,drop=TRUE], probs = 0.01, na.rm = TRUE) %>% round(2) %>% as.numeric()
mean = mean(driver_rider_trips[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
median = median(driver_rider_trips[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
percentile_99th = quantile(driver_rider_trips[,i,drop=TRUE], probs = 0.99, na.rm = TRUE) %>% round(2) %>% as.numeric()
max = max(driver_rider_trips[,i], na.rm=TRUE) %>% round(2)
count = sum(!is.na(driver_rider_trips[,i]))
df = data.frame(name, min, percentile_1st, mean=mean, median=median, percentile_99th, max, count)
summary <- rbind(summary, df)
}

# printing data
kable(summary) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
name min percentile_1st mean median percentile_99th max count
estimated_time_to_arrival 0.02 0.40 3.25 2.82 9.48 33.77 59805
actual_time_to_arrival 0.02 0.02 2.62 1.95 10.88 35.68 59839
surge_multiplier 1.00 1.00 1.16 1.00 2.50 4.80 59839
driver_payout 0.00 0.00 6.10 4.94 24.24 175.87 59839
trip_price_pre_discount 0.00 0.00 7.62 6.18 30.30 219.83 59839
rider_payment 0.00 0.00 7.57 6.14 30.27 219.83 59839
actual_minus_estimated_time -24.55 -5.83 -0.63 -0.75 4.83 29.93 59805

#### (II) Driver Data (20202 X 5)

Data Dictionary
Variable Description
driver_id Unique identifier for driver
first_completed_trip Timestamp of the driver’s first completed trip

First 100 rows

``````kable(head(driver_data, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")``````
48dc-c2db 2010-11-18 00:36:00 4.816199 8038.104 877
49aa-b176 2011-12-29 15:34:00 4.793532 9428.024 967
4332-47bc 2010-08-13 16:14:00 4.924581 13578.024 1719
4004-efca 2012-02-12 22:40:00 4.408000 2020.952 172
4efa-33c0 2011-04-28 03:56:00 4.605660 15984.752 1755
473e-8b46 2011-01-13 01:11:00 4.585777 38344.432 4481
452d-ee3d 2010-10-25 09:57:00 4.721304 31428.432 3562
4df7-09fd 2010-08-21 12:42:00 4.794983 63105.080 6301
4c98-9a22 2011-06-27 23:12:00 4.758839 30314.240 3283
49bd-0b7c 2012-02-08 16:59:00 4.878992 11082.416 1161
408f-5ff1 2010-11-07 01:18:00 4.779570 19582.576 2698
4403-a9b0 2012-04-03 03:11:00 4.944444 760.960 67
4e36-9cab 2012-02-26 04:35:00 4.457627 2755.000 331
46e0-3688 2012-02-16 14:35:00 4.722488 7188.136 785
4f81-9298 2010-11-09 15:00:00 4.685209 44642.504 4803
4248-98f2 2012-02-25 01:21:00 4.923077 1357.944 129
426c-8516 2010-01-15 02:52:00 4.867809 41737.232 4300
463c-cb98 2011-11-26 17:09:00 4.495413 10920.200 1106
448a-d367 2010-08-29 01:06:00 4.867924 26527.992 935
403e-6c5e 2012-04-27 00:16:00 4.905000 1016.896 120
41bb-7bd6 2009-11-28 17:33:00 4.658796 33244.416 3359
4c01-6f1b 2010-02-07 01:33:00 4.905595 50930.792 5143
48d8-8c58 2010-07-21 00:52:00 4.751610 20325.872 2090
4a7f-a553 2010-06-28 11:33:00 4.871662 17968.040 1884
4332-ca4b 2010-05-09 01:45:00 4.661159 55469.288 6049
41ff-9521 2012-04-24 15:49:00 4.250000 348.904 29
49e1-4706 2011-06-26 03:49:00 4.865598 12608.440 1321
4875-dfb3 2010-11-22 22:08:00 4.723230 43722.752 4544
4a5d-abe8 2011-02-02 11:43:00 4.746875 4676.960 524
447e-71c0 2009-10-13 20:55:00 4.807692 123681.856 12632
47b7-05b3 2010-09-07 00:45:00 4.742366 24589.696 2365
42b6-7805 2012-05-01 14:07:00 4.695652 379.248 36
4447-094e 2011-07-18 22:55:00 4.883408 11966.568 1119
4538-b1ab 2011-03-30 05:02:00 4.800238 33486.888 3912
4a09-5f97 2010-08-05 23:59:00 4.800819 11485.872 1298
4792-1960 2012-03-23 13:11:00 4.540284 3379.808 367
4483-0184 2010-06-04 02:47:00 4.798387 3756.320 329
4300-5615 2012-02-23 19:39:00 4.668293 2456.312 229
4485-97df 2009-09-13 17:32:00 4.619641 94095.456 9313
461e-f95c 2009-04-13 00:26:00 4.784842 31141.512 3119
42f6-eeba 2012-01-06 17:37:00 NA 8518.360 1338
4b33-4989 2012-04-11 00:09:00 4.947368 1000.872 112
45f5-1e40 2010-11-13 03:25:00 4.883598 6696.600 704
4fba-ba6a 2010-03-15 22:11:00 4.857357 79225.704 6877
4934-fe71 2011-12-04 00:28:00 4.845292 12173.544 1152
4840-6a49 2009-12-30 16:22:00 4.840787 40379.048 4260
4fd8-7367 2012-02-24 14:36:00 4.287671 1284.640 162
4385-4173 2010-11-08 14:50:00 4.862913 11518.224 1394
4cf4-243e 2009-09-29 19:56:00 4.782112 95215.728 10635
4b61-61ec 2012-01-25 21:52:00 4.797203 2922.384 376
4cb1-3fbb 2010-09-23 23:08:00 4.828037 22115.768 2585
491b-5297 2010-05-21 01:16:00 4.792157 22537.712 2317
4c44-9b8b 2009-10-31 03:44:00 4.710031 37561.808 4131
4ea3-3b90 2009-11-14 22:55:00 4.246377 5258.984 531
4b3d-deb7 2010-06-25 15:43:00 4.615980 57861.896 5743
4649-8cbc 2012-02-15 20:06:00 4.577061 11147.864 1306
44e9-db58 2012-01-11 00:09:00 4.523046 10212.016 1116
49fa-23f0 2012-02-20 12:51:00 4.607143 5980.040 636
4761-ce55 2012-04-21 02:25:00 4.765957 706.096 70
474c-8b38 2011-09-16 19:26:00 4.666189 13712.352 1298
4dff-1b21 2012-01-11 18:18:00 4.791403 14601.280 1702
4c6d-22db 2012-03-28 19:39:00 4.902778 975.888 89
43f3-ed32 2011-10-17 09:08:00 4.716418 5191.024 537
4d87-1393 2010-06-02 11:29:00 4.779441 37773.800 4119
4f20-ad68 2010-11-09 17:11:00 4.705608 23182.424 2431
4c90-950a 2010-12-11 05:14:00 4.636364 34667.304 3847
4025-dc78 2011-08-29 22:59:00 NA 1174.368 180
4005-68e6 2012-03-15 23:16:00 4.742515 4167.112 360
432d-c81a 2010-08-21 12:00:00 4.848020 27659.056 2875
4a8c-40bf 2010-08-07 03:40:00 4.472719 39607.128 3480
40b3-4b1f 2010-07-16 02:42:00 4.715126 57591.368 5861
40e2-6fee 2008-10-25 20:00:00 4.804990 46995.800 2714
4505-3bc0 2008-08-27 18:11:00 4.815020 258998.496 8761
415c-1c1b 2010-01-08 21:45:00 4.737914 34073.536 3640
4355-d53d 2007-08-04 00:44:00 4.778516 216194.184 9374
448d-9038 2011-11-11 02:06:00 4.686617 28040.936 2972
46aa-6579 2009-07-30 15:55:00 4.579928 66376.664 4239
4b13-fda7 2010-11-03 02:25:00 4.876975 24194.536 2637
45f2-c9c3 2007-09-23 23:29:00 4.776902 164770.504 6505
421d-fd9a 2010-06-22 00:23:00 4.689189 6019.680 606
4f3b-36c5 2011-03-09 23:19:00 4.791579 49847.320 5205
45bb-5847 2009-08-21 03:14:00 4.743920 93875.640 9999
4268-b91c 2011-07-06 19:31:00 4.680431 24333.512 2836
489c-45e3 2011-01-31 22:08:00 4.814552 15406.920 1722
468c-39ae 2012-03-08 16:20:00 4.808362 5371.912 560
42c5-ea43 2010-08-10 19:57:00 4.667339 6801.832 699
4bb8-66bd 2007-06-16 01:13:00 4.710927 226156.208 11710
44a4-f9ed 2011-11-08 14:17:00 4.781931 6973.312 695
47b6-2c8e 2010-12-14 18:20:00 4.798508 11827.168 1182
4153-9769 2011-04-18 16:10:00 4.713287 14204.224 1715
4772-e969 2012-01-17 21:37:00 4.785575 12165.920 1148
4c84-f5dd 2009-11-28 14:54:00 4.725834 52511.080 4903
4542-f3fd 2010-09-20 21:53:00 4.950000 1310.632 152
4374-91bf 2012-04-03 21:27:00 4.870130 3089.688 357
49d4-2b33 2011-01-23 02:00:00 4.743371 17113.592 1905
4b6a-aac7 2011-08-05 13:24:00 4.657407 4343.336 502
47b9-fd57 2010-06-24 21:37:00 4.769859 24953.232 2090
4a02-f1be 2009-03-14 02:27:00 4.812075 20352.768 2042
4da7-b331 2011-01-18 23:08:00 4.851326 33796.904 3734
43e4-b120 2012-03-29 06:05:00 4.718593 2383.360 222

Missing Values

``````# calculate missing values
na_table <-
map_dbl(driver_data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
driver_id 0
first_completed_trip 0

Imputing Missing Values

99% of the drivers have a lifetime rating between 4.38 to 5. Hence, it is reasonable to fill in the missing values using the median value i.e. 4.79. The missing value table after the imputation looks like:

``````# filling the missing lifetime rating with the median value

# calculate missing values
na_table <-
map_dbl(driver_data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
driver_id 0
first_completed_trip 0

Outlier Analysis

• There is a duplicate value for driver id: 4a3c-ac69. The duplicate value has been removed.
• Driver id: 4e79-a40f has had \$805410479 of lifetime fares and 41940330 completed trips. This is clearly an erroneous value which has been removed.
• Note: Added a new column lifetime_fare_per_trip which is the lifetime average fare per trip for the drivers.
``````# removing duplicate driver id
remove_index <- which(driver_data\$driver_id == '4a3c-ac69' & year(driver_data\$first_completed_trip) == 2012)
driver_data <- driver_data[-remove_index,]

# removing outlier driver id
remove_index <- which(driver_data\$driver_id == '4e79-a40f')
driver_data <- driver_data[-remove_index,]

# calculating lifetime fare per trip

Observations from the summary table:

• lifetime rating of the drivers ranges between 2 - 5. However, only 1% of the ratings lie below 4.39. We might want to investigate drivers that have ratings lower than 4.
• lifetime fares of the driver ranges between \$8.78 - \$579564.45.
• lifetime completed trips of the driver ranges between 1 - 24701. However, 99% of the driver have trips lower than 13501.
• life fare per trip ranges from \$5.10 - \$64.78. This value highly depends upon the number of trips undertaken by the drivers.

All variables seem to be heavily right skewed and top 1% values might be outliers. However, a throrough analysis needs to be done before removing any of the top 1% values.

``````index1 <- 3 # from column
index2 <- 6 # to column

# summary statistics for numerical variables
summary <- data.frame()
for(i in index1:index2)
{
name = colnames(driver_data)[i]
min = min(driver_data[,i], na.rm=TRUE) %>% round(2)
percentile_1st = quantile(driver_data[,i,drop=TRUE], probs = 0.01, na.rm = TRUE) %>% round(2) %>% as.numeric()
mean = mean(driver_data[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
median = median(driver_data[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
percentile_99th = quantile(driver_data[,i,drop=TRUE], probs = 0.99, na.rm = TRUE) %>% round(2) %>% as.numeric()
max = max(driver_data[,i], na.rm=TRUE) %>% round(2)
count = sum(!is.na(driver_data[,i]))
df = data.frame(name, min, percentile_1st, mean=mean, median=median, percentile_99th, max, count)
summary <- rbind(summary, df)
}

kable(summary) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
name min percentile_1st mean median percentile_99th max count
lifetime_rating 2.00 4.39 4.77 4.79 4.96 5.00 20200
lifetime_fares 8.78 349.58 31598.88 17657.71 220139.19 579564.45 20200
lifetime_completed_trips 1.00 37.00 2814.53 1802.00 13501.04 24701.00 20200
lifetime_fare_per_trip 5.10 6.52 10.46 9.61 28.63 64.78 20200

#### (III) Rider Data (50436 X 5)

Data Dictionary
Variable Description
rider_id Unique identifier for rider
first_completed_trip Timestamp of the rider’s first completed trip
first_trip_city_id The city ID of the rider’s first trip

First 100 rows

``````kable(head(rider_data, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")``````
9bdf-012c NA 3 1000 101.680
512f-e359 NA 1 1000 33.824
12d7-7db9 NA 2 1000 23.584
ace8-3e1e NA 8 1000 234.920
069a-5579 NA 23 1000 568.688
1f79-dab4 NA 2 1000 36.136
c12d-4dd9 NA 6 1000 131.984
92b5-f5e9 NA 2 1000 35.824
aa7d-646e NA 47 1000 1575.560
32e7-a6fa NA 1 1000 20.696
e061-4636 NA 45 1000 768.576
1704-1339 NA 2 1000 58.056
3e67-0838 NA 1 1000 24.536
02eb-e58c NA 13 1000 368.848
97d1-6b17 NA 26 1000 510.608
c464-e1b3 NA 14 1000 269.232
61ed-8611 NA 7 1000 102.624
a87a-0eab NA 8 1000 274.456
6c65-6177 NA 10 1000 147.880
edc3-5f73 NA 3 1000 42.568
d7b4-c92c NA 18 1000 357.352
2a99-69a3 NA 8 1000 108.952
1457-32e0 NA 17 1000 353.616
81be-dc5c NA 9 1000 147.424
d3c6-5484 NA 3 1000 84.048
85f6-5ac9 NA 4 1000 81.768
c4dd-f470 NA 40 1000 783.192
8787-98f4 NA 22 1000 322.216
3d2b-d3de NA 1 1000 11.080
1eba-9c41 NA NA 1000 NA
3bae-5d1f NA 1 1000 0.000
1d2e-496d NA 4 1000 130.832
d592-e992 NA 5 1000 133.424
f082-f35b NA 16141 1000 92493.840
f8da-3894 NA 4 1000 163.800
f1b2-53f1 NA 3 1000 65.488
7b73-260b NA 1 1000 16.304
e388-07c3 NA 1 1000 23.776
070a-34b4 NA 2 1000 48.424
fb29-0edf NA 5210 1000 41967.144
5505-c9c6 NA 28 1000 695.472
0b4e-40ab NA 1 1000 0.000
4f2c-bbfa NA 26 1000 513.184
bd34-ae53 NA 7 1000 171.048
27ed-b8a5 NA 18 1000 239.816
8652-4c06 NA 6 1000 148.232
1ec4-4903 NA 14 1000 325.832
03b0-475a NA 5 1000 97.272
4874-806d NA 1849 1000 11752.472
f40a-b325 NA 3 1000 53.560
624f-71c9 NA 1 1000 5.192
f56b-f194 NA 5 1000 149.632
8487-9f18 NA 52 1000 868.192
b684-df47 NA 2 1000 56.168
1b4c-b95e NA 13568 1000 90843.888
7ce4-96c9 NA 1 1000 8.296
1e8f-2f8f NA 149 1000 2961.224
22f7-8388 NA 145 1000 2999.760
eae8-b4e2 NA 2 1000 24.688
5971-106e NA 1 1000 24.208
5d23-2634 NA 7 1000 108.408
7d3e-6a83 NA 4 1000 74.272
4c42-87ab NA 20 1000 456.336
5e4b-d2eb NA 4 1000 41.904
b7c8-3fb3 NA 16 1000 248.792
4059-da58 NA 2 1000 33.216
3b92-a81e NA 1280 1000 8700.280
5b30-e575 NA 41 1000 611.328
f617-b153 NA 6 1000 135.240
81ec-7737 NA 1 1000 22.352
e3c2-a4ba NA 11 1000 390.776
469d-4252 NA 11 1000 163.600
5f34-0888 NA 3 1000 51.448
3498-8e17 NA 7 1000 93.880
dbac-f985 NA 1 1000 26.320
c39b-2cf1 NA 2 1000 12.080
bffc-67e2 NA 37 1000 633.792
c7a4-bf0a NA 2 1000 62.504
ccee-ff6d NA 43 1000 983.576
cf4e-046e NA 13 1000 476.768
dc4f-0c80 NA 7 1000 172.512
1232-2663 NA 2 1000 28.464
1139-fa21 NA 58 1000 1015.240
3ee4-c6e0 NA 9 1000 181.360
8541-78a3 NA 5 1000 92.656
78b8-c758 NA 2 1000 28.208
9e52-88ef NA 4 1000 81.488
2bc5-5739 NA 2 1000 39.912
4d52-65f9 NA 18 1000 347.264
c1d8-89e5 NA 1 1000 44.704
104f-a9a9 NA 1 1000 13.600
a559-5c36 NA 5 1000 132.480
bdbd-a262 NA 1 1000 22.624
7ec1-5b70 2012-06-09 13:45:33 NA 1032 NA
c418-0033 2012-05-28 19:19:28 2 1032 71.424
de2e-8a18 2012-05-25 00:42:36 3 1032 36.656
b470-06e5 2012-05-24 19:55:27 37 1032 1029.600
7f34-68f9 2012-05-23 00:49:02 NA 1032 NA
deff-4647 2012-05-20 15:18:27 NA 1578 NA
64df-6848 2012-05-08 11:25:59 NA 1020 NA

Missing Values

``````# calculate missing values
na_table <-
map_dbl(rider_data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
first_completed_trip 93
rider_id 0
first_trip_city_id 0

Imputing/Removing Missing Values

• There are 6 rider IDs that have missing lifetime trips and lifetime payments. These rider IDs have been removed.
• There are 93 rider IDs that have missing first completed trip date. There is not enough data to confidently impute these values, hence I will leave them as missing.

The final missing value table looks like:

``````# removing rider data with missing lifetime trips value

# calculate missing values
na_table <-
map_dbl(rider_data, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
first_completed_trip 92
rider_id 0
first_trip_city_id 0

Outlier Analysis

Observations from the summary table:

• lifetime trips of the rider ranges between 1 - 16141. However, only 1% of the trips lie above 1358. Also, 25% of the riders have just 1 trip.
• lifetime payments of the riders ranges between \$0 - \$139337.62. However, only 1% of the lifetime payments lie above \$17281.77. Also, 25% of the riders have earned less than \$20.
• Note: Added a new column lifetime_payment_per_trip which is the lifetime average payment per trip for the riders. lifetime payment per trip of the riders ranges between \$0 - \$136.61. This value highly depends upon the number of trips undertaken by the riders

All variables seem to be heavily right skewed and top 1% values might be outliers. However, a throrough analysis needs to be done before removing any of the top 1% values.

``````# adding variable lifetime payment per trip

# summary statistics for numerical variables
summary <- data.frame()
for(i in c(3,5,6))
{
name = colnames(rider_data)[i]
min = min(rider_data[,i], na.rm=TRUE) %>% round(2)
percentile_1st = quantile(rider_data[,i,drop=TRUE], probs = 0.01, na.rm = TRUE) %>% round(2) %>% as.numeric()
mean = mean(rider_data[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
median = median(rider_data[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
percentile_99th = quantile(rider_data[,i,drop=TRUE], probs = 0.99, na.rm = TRUE) %>% round(2) %>% as.numeric()
max = max(rider_data[,i], na.rm=TRUE) %>% round(2)
count = sum(!is.na(rider_data[,i]))
df = data.frame(name, min, percentile_1st, mean=mean, median=median, percentile_99th, max, count)
summary <- rbind(summary, df)
}

kable(summary) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
name min percentile_1st mean median percentile_99th max count
lifetime_trips 1 2.00 246.87 154.00 1358.42 16141.00 50430
lifetime_payments 0 20.20 2658.03 1522.42 17281.77 139337.62 50430
lifetime_payment_per_trip 0 5.27 10.97 9.73 30.04 136.61 50430

#### (IV) City Metrics (672 X 5)

Data Dictionary
Variable Description
timestamp Day and hour of data
requests Total requests in the given hour
trips Total completed trips in the given hour
supply_hours Total hours all partners were online, en route, or on trip in the given hour
mean_surge_multipler Mean surge multiplier of completed trips in that hour

All 672 rows

``````kable(head(city_metrics, 100)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>%
scroll_box(width = "100%", height = "500px")``````
timestamp requests trips supply_hours mean_surge_multipler
2012-04-09 00:00:00 41.661679 34.712236 19.522378 1.084495
2012-04-09 01:00:00 19.132005 15.439689 11.533070 1.000209
2012-04-09 02:00:00 17.261499 13.894794 11.143415 1.007107
2012-04-09 03:00:00 6.634934 5.048320 3.741938 1.028779
2012-04-09 04:00:00 11.001874 8.020565 4.482725 1.278058
2012-04-09 05:00:00 24.061376 18.515455 10.156035 1.180672
2012-04-09 06:00:00 43.866352 33.141216 16.480258 1.218928
2012-04-09 07:00:00 81.487893 66.764570 31.549079 1.244724
2012-04-09 08:00:00 123.515560 103.407342 53.024222 1.318618
2012-04-09 09:00:00 106.247002 93.288761 52.659649 1.140532
2012-04-09 10:00:00 105.175772 93.082788 47.339123 1.147106
2012-04-09 11:00:00 146.687485 128.139303 65.609098 1.233137
2012-04-09 12:00:00 101.715101 87.681704 49.037245 1.189434
2012-04-09 13:00:00 86.100557 76.519898 41.544747 1.025907
2012-04-09 14:00:00 97.626823 85.367473 43.920234 1.029580
2012-04-09 15:00:00 110.361569 92.671751 48.391408 1.177816
2012-04-09 16:00:00 112.414947 90.986936 50.288811 1.285126
2012-04-09 17:00:00 170.397026 142.495011 81.644074 1.212854
2012-04-09 18:00:00 192.250490 165.787776 89.408792 1.133506
2012-04-09 19:00:00 185.075182 162.670384 77.018646 1.068151
2012-04-09 20:00:00 164.754040 138.727229 54.448583 1.485569
2012-04-09 21:00:00 139.099337 117.006144 47.329752 1.319643
2012-04-09 22:00:00 97.227191 82.152611 37.819991 1.082898
2012-04-09 23:00:00 83.066461 66.065345 31.050963 1.281126
2012-04-10 00:00:00 37.519886 31.396967 16.431722 1.270626
2012-04-10 01:00:00 16.271995 13.045098 7.782139 1.070253
2012-04-10 02:00:00 13.798056 10.753948 7.523259 1.069072
2012-04-10 03:00:00 3.291288 2.396868 1.711440 1.016734
2012-04-10 04:00:00 4.298336 3.314060 2.088576 1.060679
2012-04-10 05:00:00 17.091401 14.428258 8.243692 1.068999
2012-04-10 06:00:00 48.405218 40.893894 20.997252 1.068619
2012-04-10 07:00:00 111.877500 90.956067 43.916156 1.124463
2012-04-10 08:00:00 160.683784 135.874462 71.599181 1.250776
2012-04-10 09:00:00 141.221054 121.327350 71.606275 1.139207
2012-04-10 10:00:00 83.994252 74.834245 48.164508 1.023742
2012-04-10 11:00:00 98.110818 89.225826 52.017643 1.009254
2012-04-10 12:00:00 124.667966 111.066792 64.037443 1.007562
2012-04-10 13:00:00 108.015320 97.503438 52.815066 1.007598
2012-04-10 14:00:00 104.307765 92.038897 50.089108 1.071421
2012-04-10 15:00:00 98.447122 84.984752 45.263050 1.095435
2012-04-10 16:00:00 111.794213 93.928675 52.447671 1.173888
2012-04-10 17:00:00 151.528621 125.819699 72.013217 1.201485
2012-04-10 18:00:00 141.326002 122.880199 67.864346 1.083575
2012-04-10 19:00:00 152.044346 135.683961 70.256966 1.017518
2012-04-10 20:00:00 154.942614 136.664423 64.730204 1.134743
2012-04-10 21:00:00 155.475170 135.406209 61.214513 1.211015
2012-04-10 22:00:00 99.161838 85.611001 41.847109 1.149590
2012-04-10 23:00:00 74.228706 64.388026 33.703135 1.011853
2012-04-11 00:00:00 31.988072 27.540586 15.720587 1.005723
2012-04-11 01:00:00 24.787250 19.940300 12.060659 1.069858
2012-04-11 02:00:00 5.436583 4.419600 3.047781 1.164685
2012-04-11 03:00:00 6.311691 4.672941 3.519495 1.045805
2012-04-11 04:00:00 2.878763 2.156833 1.289621 1.144991
2012-04-11 05:00:00 12.101487 9.640023 5.477688 1.136393
2012-04-11 06:00:00 36.448765 30.731180 15.422923 1.104682
2012-04-11 07:00:00 102.727902 85.695505 41.937915 1.118716
2012-04-11 08:00:00 135.583971 114.355949 61.993685 1.250200
2012-04-11 09:00:00 110.122026 96.425307 57.185240 1.126102
2012-04-11 10:00:00 91.202981 82.096015 52.573755 1.017868
2012-04-11 11:00:00 115.992101 105.039505 61.765826 1.007470
2012-04-11 12:00:00 105.582881 95.492853 53.791623 1.015996
2012-04-11 13:00:00 100.613013 89.682457 49.716372 1.052343
2012-04-11 14:00:00 108.545802 95.316817 51.040920 1.087843
2012-04-11 15:00:00 94.660878 81.309725 43.753378 1.134780
2012-04-11 16:00:00 138.821583 118.437277 64.037354 1.093981
2012-04-11 17:00:00 138.116971 116.502976 65.476738 1.187833
2012-04-11 18:00:00 171.813908 148.559401 79.942593 1.064210
2012-04-11 19:00:00 176.386360 156.901507 80.863709 1.006714
2012-04-11 20:00:00 176.920911 155.736252 74.357398 1.014118
2012-04-11 21:00:00 159.545750 137.248077 61.916167 1.042825
2012-04-11 22:00:00 147.635508 122.496551 56.524712 1.241235
2012-04-11 23:00:00 80.747277 69.203198 36.907351 1.019032
2012-04-12 00:00:00 45.537811 39.037666 22.484780 1.002882
2012-04-12 01:00:00 34.693095 28.829473 17.397583 1.031549
2012-04-12 02:00:00 15.861861 12.823090 9.379150 1.073576
2012-04-12 03:00:00 5.679946 4.560541 3.656149 1.012044
2012-04-12 04:00:00 4.020391 3.230810 1.995655 1.119639
2012-04-12 05:00:00 16.553737 13.944019 7.919775 1.154306
2012-04-12 06:00:00 39.960340 33.577227 17.102401 1.078916
2012-04-12 07:00:00 133.413518 115.451191 57.051676 1.086048
2012-04-12 08:00:00 211.024285 177.215383 87.350200 1.182949
2012-04-12 09:00:00 200.045260 165.018621 85.554346 1.439102
2012-04-12 10:00:00 141.486902 125.379855 72.418350 1.027904
2012-04-12 11:00:00 174.414947 155.541687 74.397981 1.071213
2012-04-12 12:00:00 176.165111 154.623984 72.942290 1.076016
2012-04-12 13:00:00 151.061630 131.886867 65.011454 1.048181
2012-04-12 14:00:00 151.179910 130.402168 64.002661 1.074171
2012-04-12 15:00:00 123.979337 105.504726 56.499595 1.056710
2012-04-12 16:00:00 192.934719 158.468606 82.030967 1.389474
2012-04-12 17:00:00 160.185008 129.357074 73.981030 1.150256
2012-04-12 18:00:00 196.018837 165.085127 90.173595 1.124442
2012-04-12 19:00:00 292.405366 248.891930 114.487180 1.108367
2012-04-12 20:00:00 188.465187 164.107467 75.941602 1.024212
2012-04-12 21:00:00 229.594319 197.264809 88.996596 1.019440
2012-04-12 22:00:00 154.299237 127.871848 58.955032 1.059989
2012-04-12 23:00:00 148.933552 115.060578 51.841730 1.262545
2012-04-13 00:00:00 91.636084 73.950079 37.698786 1.025263
2012-04-13 01:00:00 54.936409 42.653599 22.712904 1.082638
2012-04-13 02:00:00 33.252139 25.669538 15.623172 1.185732
2012-04-13 03:00:00 14.594441 11.354893 7.746182 1.015877

Missing Values

``````# calculate missing values
na_table <-
map_dbl(city_metrics, function(x) sum(is.na(x))) %>%
sort(decreasing = TRUE) %>%
data.frame()

# rename column
colnames(na_table) <- c("total_missing")

# display missing value table
kable(na_table) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
total_missing
timestamp 0
requests 0
trips 0
supply_hours 0
mean_surge_multipler 0

Outlier Analysis

There seems to be no outlier values in this data set.

``````# summary statistics for numerical variables
summary <- data.frame()
for(i in 2:5)
{
name = colnames(city_metrics)[i]
min = min(city_metrics[,i], na.rm=TRUE) %>% round(2)
percentile_1st = quantile(city_metrics[,i,drop=TRUE], probs = 0.01, na.rm = TRUE) %>% round(2) %>% as.numeric()
mean = mean(city_metrics[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
median = median(city_metrics[,i,drop=TRUE], na.rm=TRUE) %>% round(2)
percentile_99th = quantile(city_metrics[,i,drop=TRUE], probs = 0.99, na.rm = TRUE) %>% round(2) %>% as.numeric()
max = max(city_metrics[,i], na.rm=TRUE) %>% round(2)
count = sum(!is.na(city_metrics[,i]))
df = data.frame(name, min, percentile_1st, mean=mean, median=median, percentile_99th, max, count)
summary <- rbind(summary, df)
}

kable(summary) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F, position='left')``````
name min percentile_1st mean median percentile_99th max count
requests 0 4.76 115.56 116.83 285.22 349.86 672
trips 0 3.49 98.11 101.48 236.84 290.62 672
supply_hours 0 2.62 48.39 51.90 106.44 120.40 672
mean_surge_multipler 1 1.00 1.11 1.07 1.51 2.16 672

## 2. Data Analysis

### Data Exploration

#### Overall

For the given data there are:

• 59,805 trips
• 20,200 unique drivers
• 50,430 unique riders

#### Trip Status

Out of a total of 59,805 trips:

• 90.45% trips are completed
• 9.40% trips are rider cancelled. We shall dig deeper into the reasons behind ride cancellation.
• Less than 1% of the rides either fail or are unfulfilled. This indicates smooth functioning of the Uber App.
``````# reordering factor levels
df <- driver_rider_trips
df\$trip_status = factor(df\$trip_status,levels = c("completed", "rider_canceled", "failed", "unfulfilled"))

# plotting visual
ggplot(data = df, aes(x = trip_status)) +
geom_bar(fill = "#E69F00") +
theme_tufte() +
theme(axis.title.x=element_blank(),
axis.title.y=element_blank())``````

#### Trip Distribution by day and hour

• Saturday experiences highest trips whereas Monday experiences the least.
• 17:00 - 22:00 seems to be the rush period.
• There is virtually no demand from 1am - 6am on Weekdays and 4am - 7am on Weekends.
• 7-9 on Weekdays, there seems to be a higher demand as compared to the same time period on Weekends. This seems logical as people would leave for work in the morning on weekdays and wake up late on weekends.
• Friday and Saturday late evenings experience extremely high demand. Probably people like to party late night on Friday and Saturday.
• An interesting observation: Sunday constantly experiences a low demand throughout the day except from 12 - 1 am.
``````# reordering factor levels and plotting visual
city_metrics %>%
mutate(hour = as.factor(hour(timestamp)),
day = factor(as.factor(weekdays(city_metrics\$timestamp)),
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
group_by(hour, day) %>%
summarise(mean_trips = mean(trips)) %>%
ggplot(aes(day, hour)) +
geom_tile(aes(fill = mean_trips),colour = "white") +
scale_fill_gradient(low = "white",high = "red") +
theme_tufte() +
theme(axis.title.x=element_blank(),
axis.title.y=element_blank()) +
labs(fill="Avg Trips")``````

#### Trip Distribution by Geography

• Chelsea Court experiences highest demand followed by Allen Abby
• Blair Bend and Daisy Drive experience equally low demand.
``````# reordering factor levels and plotting visual
df <- driver_rider_trips
df\$start_geo = factor(df\$start_geo,levels = c("Chelsea Court", "Allen Abby", "Daisy Drive", "Blair Bend"))
ggplot(df, aes(x=start_geo)) +
geom_bar(fill = "#E69F00") +
theme_tufte() +
theme(axis.title.x=element_blank(),
axis.title.y=element_blank()) +
labs(fill="Avg Trips")``````

#### Trip Distribution Week on Week

• The date for week’s are defined as follows:
• Week 1: 2012-04-09 - 2012-04-15
• Week 2: 2012-04-16 - 2012-04-22
• Week 3: 2012-04-23 - 2012-04-29
• Week 4: 2012-04-30 - 2012-05-06
• Saturday experiences highest trips whereas Monday experiences the least. This point was evident in the heat map (trip distribution by day and hour) as well.
• Demand on Saturday’s > Friday’s > Thursday’s > Wednesday’s > Tuesday’s > Monday’s week on week.
• Week 3 has the highest trip demand whereas Week 2 has the lowest. Also, the trip distribution seems to fluctuate week on week. It would be interesting to dig deeper into the exact reason, provided we have more data.
``````# plotting visual
city_metrics %>%
mutate(date = date(timestamp)) %>%
group_by(date) %>%
summarise(count = sum(trips)) %>%
mutate(week = as.factor(c(rep(1,7), rep(2,7), rep(3,7), rep(4,7))),
day_of_week = substr(weekdays(ymd(date)), 1, 2)) %>%
ggplot(aes(x=date, y=count, group=week, color=week)) +
geom_point() +
geom_line() +
theme_tufte() +
geom_text(aes(label=day_of_week),hjust=0, vjust=0) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank()) +
labs(color="Week")``````