1 Initialization

1.1 Library Call

library(lubridate)
library(dplyr)
library(tidyr)
library(stringi)
library(keras)
library(ggplot2)
library(ggalt)
library(plotly)

1.2 Data Pre-processing

We’re going to use Scotty dataset provided by Algoritma Academy team for capstone project purposes.

Scotty turns out being a very popular service in Turkey! This is a good thing, but also brings a new problem: “There is no driver”. The demands for Scotty began to overload, at some region and some times, and there was not enough driver at those times and places.

Our goal was to create a report suits a proper analysis and prediction by region and hour for Scotty’s drivers coverage status (Sufficient or Insufficient), by reflecting the problem into “nodrivers” Order Status

scotty.csv <-  read.csv("Scotty.csv")
glimpse(scotty.csv)

## Observations: 366,784
## Variables: 10
## $ timeStamp        <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 201...
## $ driverID         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ riderID          <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760...
## $ orderStatus      <fct> nodrivers, nodrivers, nodrivers, nodrivers, n...
## $ confirmedTimeSec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ srcGeohash       <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ srcLong          <dbl> -60.87779, 0.00000, 0.00000, 0.00000, 0.00000...
## $ srcLat           <dbl> -23.04455, 0.00000, 0.00000, 0.00000, 0.00000...
## $ destLong         <dbl> 40.98718, 40.99750, 40.99750, 41.03570, 41.04...
## $ destLat          <dbl> 29.02819, 28.85056, 28.85056, 29.06256, 29.00...

To classify region with sufficient coverage, we only need 4 columns from the dataset:

timeStamp -> Time of order made by Rider
riderID -> Unique Rider ID
orderStatus -> The status of order made by Rider
srcGeohash -> Region Code

scotty <- scotty.csv[,c(1,3,4,6)]
glimpse(scotty)

## Observations: 366,784
## Variables: 4
## $ timeStamp   <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 2017-10-...
## $ riderID     <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760d9862...
## $ orderStatus <fct> nodrivers, nodrivers, nodrivers, nodrivers, nodriv...
## $ srcGeohash  <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,...

Since the report is to predict coverage by Hour and Region, we should extract the hour value first from the TimeStamp. But in order to ease the data pre-processing, we’re just going to round the timestamp into hour using lubridate round_date and ymd_hms

scotty$timeStamp <- round_date(ymd_hms(scotty$timeStamp), "hour")

Next, we’re going to get rid the data with order status “cancelled”, because it has no meaning to solve this problem

scotty <- scotty %>% filter(orderStatus != "cancelled")

Before we move to the next step, we’re going to remove duplicate orders by identifying the “nodrivers” status and apply some business rules to it.

If there’s no other status from the same Rider in the same Hour window than “nodrivers”, then we’ll keep the data
If there’s another status after “nodrivers” from the same Rider in the same Hour window, then we’ll remove the data, as it identified as duplicate

Both business rules are complimentary, so we only need to apply one business rule.

To apply those business rules, first we’re going to group the data by Rider ID, Timestamp, and srcGeohash. Next we’re going to mark the duplicated row, using ifelse statement using business rule number 2. And finally we take the “unique” data using filter

scotty_nd <- scotty %>%
  group_by(riderID, timeStamp, srcGeohash) %>% 
  mutate(
    final = 
      ifelse(
        orderStatus=="nodrivers" 
        & !is.na(lead(orderStatus)),
        1, 
        0
      )
    ) %>% 
  ungroup() %>% 
  filter(final == 0)

After we delete the duplicates, we can now mark Scotty Coverage at certain Hour and Region, by finding the presence of “nodrivers”. If there’s a presence of “nodrivers” status, then we’re going to assume that at that time and place, the coverage is Insufficient. Insufficient marked as 0, and Sufficient marked as 1.

scotty_nd <- readRDS("scotty_nd.RDS")
scotty_nd <- scotty_nd %>%  
  group_by(timeStamp, srcGeohash) %>% 
  summarise(coverage = ifelse(sum(orderStatus == "nodrivers")>0,0,1)) %>%
  ungroup()

But before we move to the next process, we should check the data completeness

table(scotty_nd$srcGeohash)

## 
##    6    7    9    c    d    e    g    k    q    s    t    u    w 
##    1  180   10    2    7   63   10    4    1 1477    4   36    3

As we can see, the region count is un-even, means the data is not complete. We’re going to fill the blank data with coverage value assumed as Sufficient, since there’s no demand at that Hour and Region

PS: The test dataset is not containing data for region 6, so we’re going to take that data out.

scotty_nd <- scotty_nd %>% 
  filter(srcGeohash != "6") %>% 
  mutate(srcGeohash = factor(srcGeohash)) %>% 
  tidyr::complete(timeStamp, srcGeohash, fill=list(coverage = 1))

table(scotty_nd$srcGeohash)

## 
##    7    9    c    d    e    g    k    q    s    t    u    w 
## 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477

Now the data is complete, other than extracting the hour value, we could also do another features engineering to give the model with better precision and accuracy by extracting the weekdays and week of month from the timestamp. Because there’s a possibility that between a business days or not, and between the pay week or not, it impact the demands of Scotty at certain Regions and Hour

PS: Since the test dataset is using only weekdays and 2^nd week of month, so we’re going to use that data from train.

scotty_nd <- scotty_nd %>% 
  mutate(weekdays = weekdays(scotty_nd$timeStamp)) %>% 
  filter(weekdays != "Sunday") %>% 
  filter(weekdays != "Saturday") %>% 
  mutate(
    weekdays = 
           factor(
             weekdays, 
             levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
             ),
    hour = factor(hour(timeStamp)),
    weekOfMonth = stri_datetime_fields(timeStamp)$WeekOfMonth) %>% 
  filter(weekOfMonth == 2)

table(scotty_nd$coverage)

## 
##    0    1 
##  305 2515

2 Keras Model

2.1 Prepare Data Train

One-hot encoding using model.matrix from column srcGeohash, weekdays, hour, and weeOfMonth

sdm <- model.matrix(~.,scotty_nd[,c(3,2,4,5)])
sdm <- sdm[,-1]
train_x <- sdm[,-1]
train_y <- sdm[,1]
dim(train_x)

## [1] 2820   38

2.2 Creating the model

The data train consist of 38 columns as the input, so we’re going to put the value into the input shape. Since the goal is to predict binary classification, then the last layer should be activated by sigmoid function.

model <- keras_model_sequential()
model %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(38)) %>%
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(model)

## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 128)                   4992        
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 128)                   0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 64)                    8256        
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 1)                     65          
## ===========================================================================
## Total params: 13,313
## Trainable params: 13,313
## Non-trainable params: 0
## ___________________________________________________________________________

I run several trials using different optimizers, but in-train accuracy is not much different (ranging around 96%). binary_crossentropy is used to calculate the loss value because this is binomial classification model.

model %>% compile(
  optimizer = optimizer_rmsprop(), 
  loss = 'binary_crossentropy',
  metrics = c('accuracy')
)

2.3 Train the Model

For every optimizers, i train the model using epoch set to 100, validation split 0.2, and leverage the callback_early_stopping which monitor the accuracy value, and immediately stop the training if the accuracy fail to increase for consecutive 2 epochs

history <- model %>% 
  fit(train_x, train_y, epoch=100, validation_split = 0.2,
      callback=callback_early_stopping(monitor = "acc", mode = "max", patience=2)
      )

history <- readRDS("history.RDS")
model <- load_model_hdf5("model_scotty.hdf5")

ggplotly(
ggplot(
  data.frame(epoch=1:length(history$metrics$acc),
             acc=history$metrics$acc,
             val_acc=history$metrics$val_acc),
  aes(x=epoch))+
  geom_point(aes(y=acc)) +
  geom_xspline(size=0.5, aes(y=acc), col="red") +
  geom_point(aes(y=val_acc)) +
  geom_xspline(size=0.5, aes(y=val_acc), col="green")
)

ggplotly(
ggplot(
  data.frame(epoch=1:length(history$metrics$loss),
             loss=history$metrics$loss,
             val_loss=history$metrics$val_loss),
  aes(x=epoch))+
  geom_point(aes(y=loss)) +
  geom_xspline(size=0.5, aes(y=loss), col="red") +
  geom_point(aes(y=val_loss)) +
  geom_xspline(size=0.5, aes(y=val_loss), col="green")
)

3 Data Test Evaluation

3.1 Data Test Preprocessing

Preprocess the data test to fit our model

test_csv <- read.csv("submissionClassification.csv")
test_x <- test_csv[,-3]
test_x$weekdays <- weekdays(ymd_hms(test_x$timeStamp))
test_x$weekdays <- factor(test_x$weekdays, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
test_x$hour <- factor(hour(ymd_hms(test_x$timeStamp)))
test_sdm <- model.matrix(~.,test_x[,-2])
test_sdm <- test_sdm[,-1]
dim(test_sdm)

3.2 Making predictions

Since the prediction is activated by sigmoid function, then we need to convert the value based on specific cutoff. We’ve found the best cutoff value is 0.29, after doing some test and getting the result from our beloved mentor

pred <- predict(model, x=test_sdm)
table(test_csv$coverage)
test_csv$coverage <- as.factor(ifelse(pred>0.29, "Sufficient", "Insufficient"))

3.3 Test Result

Test Result

Capstone Project - Scotty Classification

Adytiawan Arga Dwitama

March 25, 2019