Initial setup
library(ggplot2)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(caTools)
setwd('~shifan/Documents/Jobs/Uber')
Load in data, and replace empty data with NA’s.
uber <- read.csv(
"ds_challenge_v2_1_data.csv"
, header=TRUE, na.strings=c("","NA"))
summary(uber)
## id city_name signup_os signup_channel
## Min. : 1 Berton :20117 android web:14944 Organic :13427
## 1st Qu.:13671 Strark :29557 ios web :16632 Paid :23938
## Median :27341 Wrouver: 5007 mac : 5824 Referral:17316
## Mean :27341 other : 3648
## 3rd Qu.:41011 windows : 6776
## Max. :54681 NA's : 6857
##
## signup_date bgc_date vehicle_added_date vehicle_make
## 1/5/16 : 2489 1/29/16: 1125 1/26/16: 377 Toyota : 3219
## 1/4/16 : 2460 1/28/16: 1103 1/28/16: 370 Honda : 1845
## 1/1/16 : 2282 1/27/16: 1071 1/22/16: 336 Nissan : 1311
## 1/6/16 : 2207 1/30/16: 1071 1/29/16: 331 Ford : 778
## 1/7/16 : 2078 1/22/16: 1028 1/24/16: 328 Hyundai: 677
## 1/21/16: 2024 (Other):27498 (Other):11392 (Other): 5393
## (Other):41141 NA's :21785 NA's :41547 NA's :41458
## vehicle_model vehicle_year first_completed_date
## Civic : 689 Min. : 0 1/23/16: 257
## Corolla: 688 1st Qu.:2008 1/30/16: 243
## Camry : 683 Median :2013 1/29/16: 218
## Accord : 595 Mean :2011 1/22/16: 215
## Prius V: 522 3rd Qu.:2015 1/26/16: 209
## (Other):10046 Max. :2017 (Other): 4995
## NA's :41458 NA's :41458 NA's :48544
Convert into appropriate data types.
uber$id = as.integer(uber$id)
uber$city_name = as.factor(uber$city_name)
uber$signup_os = as.factor(uber$signup_os)
uber$signup_channel = as.factor(uber$signup_channel)
uber$signup_date = as.Date(uber$signup_date, format="%m/%d/%Y")
uber$bgc_date = as.Date(uber$bgc_date, format="%m/%d/%Y")
uber$vehicle_added_date = as.Date(uber$vehicle_added_date, format="%m/%d/%Y")
uber$vehicle_make = as.factor(uber$vehicle_make)
uber$vehicle_model = as.factor(uber$vehicle_model)
uber$vehicle_year = as.integer(uber$vehicle_year)
# replace the 0 with NA's in vehicle_year, since it represents the yr car is made
uber$vehicle_year[uber$vehicle_year == 0] <- NA
uber$first_completed_date = as.Date(uber$first_completed_date, format="%m/%d/%Y")
Create a new feature representing whether a driver took a first drive
uber$ynfirst[!is.na(uber$first_completed_date)] <- 1
uber$ynfirst[is.na(uber$first_completed_date)] <- 0
uber$ynfirst <- factor(uber$ynfirst,
levels = c(0,1),
labels = c("No", "Yes"))
We wish to use the dates for predictive modeling in the future, thus we transform the background check date and add vehicle information date into ‘response times’. These represent the time it takes for drivers to take action after signing up.
uber$bgc_response <- as.numeric(uber$bgc_date - uber$signup_date)
uber$vehicle_added_response <- as.numeric(uber$vehicle_added_date - uber$signup_date)
We also notice that some times are drivers have not received background check or entered vehicle information. I thus make the response time very large
uber$bgc_response <- ifelse(is.na(uber$bgc_response),100,uber$bgc_response)
uber$vehicle_added_response <- ifelse(is.na(uber$vehicle_added_response),100,uber$vehicle_added_response)
Calculate fraction of drivers who took a first trip
sum(is.finite(uber$first_completed_date)) / length(uber$first_completed_date)
## [1] 0.1122328
In a total of 54681 who signed up, about 11.22% drivers took a first trip as an Uber driver.
Now we can create some visualizations for the dataset. For instance, first let’s look at which city first started signing up as Uber drivers.
p <- ggplot(uber, aes(fill = city_name, x = signup_date))
p + geom_density(alpha=0.8, size = 0.5) + coord_cartesian(ylim=c(0.020, 0.050)) +
labs(x = "Signup Date") +
scale_fill_manual(values=c("red", "orange", "blue"),
name="City Name") +
theme(text = element_text(size=12), legend.position = c(.95,.8))
We can clearly see from the plot above that in city Wrouver, drivers sign up later than in two other cities.
p <- ggplot(subset(uber, !is.na(vehicle_year) & !is.na(signup_date)),
aes(x = vehicle_year, y = signup_date,
color=ynfirst))
p + geom_point() +
facet_grid(.~city_name) +
scale_color_grey(name = "First Trip", start = 0.8, end = 0.2) +
labs(x = "Vehicle Year", y = "Signup Date")
In comparison, drivers in Wrouver own newer vehicles. But there is no clear difference between percentage of drivers who took a first trip. In addition, some very old vehicles (with Vehicle Year < 2000) did not take a first trip after signing up.
To get a sense of timeline drivers follow from signing up, to possibly taking a first trip.
plotbox <- function(uber, ydata, xdata) {
if (ydata == "first_completed_date") {
p <- ggplot(uber[!is.na(uber[[ydata]]), ],
aes_string(y = ydata, x = xdata, fill = xdata)) +
coord_flip() +
geom_boxplot() +
labs(x = "First Trip") +
scale_fill_manual(guide = FALSE, values=c("#d7191c"), name="First Trip")
} else {
p <- ggplot(uber[!is.na(uber[[ydata]]), ],
aes_string(y = ydata, x = xdata, fill = xdata)) +
coord_flip() +
geom_boxplot() +
labs(x = "First Trip") +
scale_fill_manual(guide = FALSE, values=c("#abd9e9", "#d7191c"), name="First Trip")
}
return (p)
}
date1 <- as.Date("0016-01-01")
date2 <- as.Date("0016-04-01")
p1 <- plotbox(uber, "signup_date", "ynfirst")+
scale_y_date(limits = c(date1, date2)) +
labs(y = "Signup Date") +
theme(text = element_text(size=8))
p2 <- plotbox(uber, "bgc_date", "ynfirst") +
scale_y_date(limits = c(date1, date2)) +
labs(y = "Background Check Date") +
theme(text = element_text(size=8))
p3 <- plotbox(uber, "vehicle_added_date", "ynfirst") +
scale_y_date(limits = c(date1, date2)) +
labs(y = "Vehicle Info. Add Date") +
theme(text = element_text(size=8))
p4 <- plotbox(uber, "first_completed_date", "ynfirst") +
scale_y_date(limits = c(date1, date2)) +
labs(y = "First Trip Date") +
theme(text = element_text(size=8))
grid.arrange(p1, p2, p3, p4, nrow = 4)
We can already see that the response time between signup and vehicle info. add time is much shorter for those who take a first trip.
Next, our goal is to build a machine learning model to predict whether a driver will take a first trip after signing up. We first split the dataset into training set, and test set.
train_rows = sample.split(uber$ynfirst, SplitRatio=0.75)
train = uber[train_rows,]
test = uber[!train_rows,]
In our first model, we use logistic regression to predict whether a driver takes a first trip.
logitfit <- glm(ynfirst ~
city_name + vehicle_added_response + vehicle_year + signup_channel,
family = binomial(link='logit'),
data = train)
summary(logitfit)
##
## Call:
## glm(formula = ynfirst ~ city_name + vehicle_added_response +
## vehicle_year + signup_channel, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1387 -0.6563 -0.1035 0.7499 5.3882
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -73.462847 12.901018 -5.694 1.24e-08 ***
## city_nameStrark -0.129424 0.055600 -2.328 0.0199 *
## city_nameWrouver -0.257817 0.101409 -2.542 0.0110 *
## vehicle_added_response -0.155350 0.003359 -46.254 < 2e-16 ***
## vehicle_year 0.037239 0.006416 5.804 6.47e-09 ***
## signup_channelPaid 0.105274 0.074024 1.422 0.1550
## signup_channelReferral 0.568406 0.066609 8.533 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13653 on 9940 degrees of freedom
## Residual deviance: 8918 on 9934 degrees of freedom
## (31070 observations deleted due to missingness)
## AIC: 8932
##
## Number of Fisher Scoring iterations: 6
We notice that features
vehicle_added_response time
vehicle_year
and 3) signup_channelReferral
have strong impacts to whether a driver will take a first trip.
Next we use decision tree to build a predictive model. First grow initial tree We choose the important features from logistic regression. In particular, we choose features including vehicle information added response time, vehicle year, and signup channel.
set.seed(1221)
control_parms <- rpart.control(cp = .0001)
tree <- rpart(formula = ynfirst~ vehicle_year + vehicle_added_response + signup_channel,
data = uber,
control = control_parms,
parms = list(split = "information"))
plotcp(tree)
tree.pruned = prune(tree, cp = 0.00081)
prp(tree.pruned, extra = 104, type = 2, fallen.leaves = TRUE, main = "Decision Tree")
From the pruned tree, I observe that a shorter response time to add vehicle information indicates a better chance of taking a first trip. In addition, newer vehicles (i.e. more recent vehicle year) also suggests more possible first trip. Finally, if the signup channel is via referral, a driver is more likely to take a first trip.
(Time consideration) Uber should keep reminding the signed-up drivers whether they wish to take a first trip. It can be helpful to follow up with the potential drivers, providing them useful information to get them on the road.
(Potential drivers) Uber should focus on drivers with newer cars. It may help giving bonuses or other forms of incentives to the this type of drivers.
(Signup platform) Uber can reallocate the resources and money to award signup referrals. This will not only grow the Uber driver network, also those who signup via referral are more likely to take a first trip.