Part 3: Data analysis

Step 1. Data Exaction and Visualization

Initial setup

library(ggplot2)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(caTools)
setwd('~shifan/Documents/Jobs/Uber')

Load in data, and replace empty data with NA’s.

uber <- read.csv(
  "ds_challenge_v2_1_data.csv"
  , header=TRUE, na.strings=c("","NA"))
summary(uber)

##        id          city_name           signup_os      signup_channel 
##  Min.   :    1   Berton :20117   android web:14944   Organic :13427  
##  1st Qu.:13671   Strark :29557   ios web    :16632   Paid    :23938  
##  Median :27341   Wrouver: 5007   mac        : 5824   Referral:17316  
##  Mean   :27341                   other      : 3648                   
##  3rd Qu.:41011                   windows    : 6776                   
##  Max.   :54681                   NA's       : 6857                   
##                                                                      
##   signup_date       bgc_date     vehicle_added_date  vehicle_make  
##  1/5/16 : 2489   1/29/16: 1125   1/26/16:  377      Toyota : 3219  
##  1/4/16 : 2460   1/28/16: 1103   1/28/16:  370      Honda  : 1845  
##  1/1/16 : 2282   1/27/16: 1071   1/22/16:  336      Nissan : 1311  
##  1/6/16 : 2207   1/30/16: 1071   1/29/16:  331      Ford   :  778  
##  1/7/16 : 2078   1/22/16: 1028   1/24/16:  328      Hyundai:  677  
##  1/21/16: 2024   (Other):27498   (Other):11392      (Other): 5393  
##  (Other):41141   NA's   :21785   NA's   :41547      NA's   :41458  
##  vehicle_model    vehicle_year   first_completed_date
##  Civic  :  689   Min.   :   0    1/23/16:  257       
##  Corolla:  688   1st Qu.:2008    1/30/16:  243       
##  Camry  :  683   Median :2013    1/29/16:  218       
##  Accord :  595   Mean   :2011    1/22/16:  215       
##  Prius V:  522   3rd Qu.:2015    1/26/16:  209       
##  (Other):10046   Max.   :2017    (Other): 4995       
##  NA's   :41458   NA's   :41458   NA's   :48544

Convert into appropriate data types.

uber$id = as.integer(uber$id)
uber$city_name = as.factor(uber$city_name)
uber$signup_os = as.factor(uber$signup_os)
uber$signup_channel = as.factor(uber$signup_channel)
uber$signup_date = as.Date(uber$signup_date, format="%m/%d/%Y")
uber$bgc_date = as.Date(uber$bgc_date, format="%m/%d/%Y")
uber$vehicle_added_date = as.Date(uber$vehicle_added_date, format="%m/%d/%Y")
uber$vehicle_make = as.factor(uber$vehicle_make)
uber$vehicle_model = as.factor(uber$vehicle_model)
uber$vehicle_year = as.integer(uber$vehicle_year)
# replace the 0 with NA's in vehicle_year, since it represents the yr car is made
uber$vehicle_year[uber$vehicle_year == 0] <- NA
uber$first_completed_date = as.Date(uber$first_completed_date, format="%m/%d/%Y")

Create a new feature representing whether a driver took a first drive

uber$ynfirst[!is.na(uber$first_completed_date)] <- 1
uber$ynfirst[is.na(uber$first_completed_date)] <- 0
uber$ynfirst <- factor(uber$ynfirst,
                       levels = c(0,1), 
                       labels = c("No", "Yes"))

We wish to use the dates for predictive modeling in the future, thus we transform the background check date and add vehicle information date into ‘response times’. These represent the time it takes for drivers to take action after signing up.

uber$bgc_response <- as.numeric(uber$bgc_date - uber$signup_date)
uber$vehicle_added_response <- as.numeric(uber$vehicle_added_date - uber$signup_date)

We also notice that some times are drivers have not received background check or entered vehicle information. I thus make the response time very large

uber$bgc_response <- ifelse(is.na(uber$bgc_response),100,uber$bgc_response)
uber$vehicle_added_response <- ifelse(is.na(uber$vehicle_added_response),100,uber$vehicle_added_response)

Calculate fraction of drivers who took a first trip

sum(is.finite(uber$first_completed_date)) / length(uber$first_completed_date)

## [1] 0.1122328

In a total of 54681 who signed up, about 11.22% drivers took a first trip as an Uber driver.

Now we can create some visualizations for the dataset. For instance, first let’s look at which city first started signing up as Uber drivers.

p <- ggplot(uber, aes(fill = city_name, x = signup_date))
p + geom_density(alpha=0.8, size = 0.5) + coord_cartesian(ylim=c(0.020, 0.050)) +
  labs(x = "Signup Date") + 
  scale_fill_manual(values=c("red", "orange", "blue"), 
                     name="City Name") +
  theme(text = element_text(size=12), legend.position = c(.95,.8))

We can clearly see from the plot above that in city Wrouver, drivers sign up later than in two other cities.

p <- ggplot(subset(uber, !is.na(vehicle_year) & !is.na(signup_date)), 
            aes(x = vehicle_year, y = signup_date,
            color=ynfirst))
p + geom_point() + 
    facet_grid(.~city_name) +
    scale_color_grey(name = "First Trip", start = 0.8, end = 0.2) + 
    labs(x = "Vehicle Year", y = "Signup Date")

In comparison, drivers in Wrouver own newer vehicles. But there is no clear difference between percentage of drivers who took a first trip. In addition, some very old vehicles (with Vehicle Year < 2000) did not take a first trip after signing up.

To get a sense of timeline drivers follow from signing up, to possibly taking a first trip.

plotbox <- function(uber, ydata, xdata) {
  if (ydata == "first_completed_date") {
    p <- ggplot(uber[!is.na(uber[[ydata]]), ], 
                aes_string(y = ydata, x = xdata, fill = xdata)) +
                coord_flip() +
                geom_boxplot() +
                labs(x = "First Trip") +
                scale_fill_manual(guide = FALSE, values=c("#d7191c"), name="First Trip")
  } else {
    p <- ggplot(uber[!is.na(uber[[ydata]]), ], 
                aes_string(y = ydata, x = xdata, fill = xdata)) +
                coord_flip() +
                geom_boxplot() +
                labs(x = "First Trip") +
                scale_fill_manual(guide = FALSE, values=c("#abd9e9", "#d7191c"), name="First Trip")
  }
  return (p)
}

date1 <- as.Date("0016-01-01")
date2 <- as.Date("0016-04-01")
p1 <- plotbox(uber, "signup_date", "ynfirst")+ 
      scale_y_date(limits = c(date1, date2)) + 
      labs(y = "Signup Date") + 
      theme(text = element_text(size=8))
p2 <- plotbox(uber, "bgc_date", "ynfirst") + 
      scale_y_date(limits = c(date1, date2)) +
      labs(y = "Background Check Date") + 
      theme(text = element_text(size=8))
p3 <- plotbox(uber, "vehicle_added_date", "ynfirst") + 
      scale_y_date(limits = c(date1, date2)) + 
      labs(y = "Vehicle Info. Add Date") + 
      theme(text = element_text(size=8))
p4 <- plotbox(uber, "first_completed_date", "ynfirst") + 
      scale_y_date(limits = c(date1, date2)) + 
      labs(y = "First Trip Date") + 
      theme(text = element_text(size=8))
grid.arrange(p1, p2, p3, p4, nrow = 4)

We can already see that the response time between signup and vehicle info. add time is much shorter for those who take a first trip.

Step 2. Predict whether a user will take a first trip

Next, our goal is to build a machine learning model to predict whether a driver will take a first trip after signing up. We first split the dataset into training set, and test set.

train_rows = sample.split(uber$ynfirst, SplitRatio=0.75)
train = uber[train_rows,]
test = uber[!train_rows,]

Logistic regression

In our first model, we use logistic regression to predict whether a driver takes a first trip.

logitfit <- glm(ynfirst ~ 
                  city_name + vehicle_added_response + vehicle_year + signup_channel, 
           family = binomial(link='logit'),
           data = train)
summary(logitfit)

## 
## Call:
## glm(formula = ynfirst ~ city_name + vehicle_added_response + 
##     vehicle_year + signup_channel, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1387  -0.6563  -0.1035   0.7499   5.3882  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -73.462847  12.901018  -5.694 1.24e-08 ***
## city_nameStrark         -0.129424   0.055600  -2.328   0.0199 *  
## city_nameWrouver        -0.257817   0.101409  -2.542   0.0110 *  
## vehicle_added_response  -0.155350   0.003359 -46.254  < 2e-16 ***
## vehicle_year             0.037239   0.006416   5.804 6.47e-09 ***
## signup_channelPaid       0.105274   0.074024   1.422   0.1550    
## signup_channelReferral   0.568406   0.066609   8.533  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13653  on 9940  degrees of freedom
## Residual deviance:  8918  on 9934  degrees of freedom
##   (31070 observations deleted due to missingness)
## AIC: 8932
## 
## Number of Fisher Scoring iterations: 6

We notice that features

vehicle_added_response time
vehicle_year

and 3) signup_channelReferral

have strong impacts to whether a driver will take a first trip.

Decision Tree

Next we use decision tree to build a predictive model. First grow initial tree We choose the important features from logistic regression. In particular, we choose features including vehicle information added response time, vehicle year, and signup channel.

set.seed(1221)
control_parms <- rpart.control(cp = .0001)
tree <- rpart(formula = ynfirst~ vehicle_year + vehicle_added_response + signup_channel, 
             data = uber, 
             control = control_parms, 
             parms = list(split = "information"))
plotcp(tree)

tree.pruned = prune(tree, cp = 0.00081)
prp(tree.pruned, extra = 104, type = 2, fallen.leaves = TRUE, main = "Decision Tree")

From the pruned tree, I observe that a shorter response time to add vehicle information indicates a better chance of taking a first trip. In addition, newer vehicles (i.e. more recent vehicle year) also suggests more possible first trip. Finally, if the signup channel is via referral, a driver is more likely to take a first trip.

Step 3. Insights and actions

(Time consideration) Uber should keep reminding the signed-up drivers whether they wish to take a first trip. It can be helpful to follow up with the potential drivers, providing them useful information to get them on the road.
(Potential drivers) Uber should focus on drivers with newer cars. It may help giving bonuses or other forms of incentives to the this type of drivers.
(Signup platform) Uber can reallocate the resources and money to award signup referrals. This will not only grow the Uber driver network, also those who signup via referral are more likely to take a first trip.

Uber Data Science Challenge

Shifan Mao

January 26, 2017