library(lubridate)
library(dplyr)
library(tidyr)
library(stringi)
library(keras)
library(ggplot2)
library(ggalt)
library(plotly)We’re going to use Scotty dataset provided by Algoritma Academy team for capstone project purposes.
Scotty turns out being a very popular service in Turkey! This is a good thing, but also brings a new problem: “There is no driver”. The demands for Scotty began to overload, at some region and some times, and there was not enough driver at those times and places.
Our goal was to create a report suits a proper analysis and prediction by region and hour for Scotty’s drivers coverage status (Sufficient or Insufficient), by reflecting the problem into “nodrivers” Order Status
scotty.csv <- read.csv("Scotty.csv")
glimpse(scotty.csv)## Observations: 366,784
## Variables: 10
## $ timeStamp <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 201...
## $ driverID <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ riderID <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760...
## $ orderStatus <fct> nodrivers, nodrivers, nodrivers, nodrivers, n...
## $ confirmedTimeSec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ srcGeohash <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ srcLong <dbl> -60.87779, 0.00000, 0.00000, 0.00000, 0.00000...
## $ srcLat <dbl> -23.04455, 0.00000, 0.00000, 0.00000, 0.00000...
## $ destLong <dbl> 40.98718, 40.99750, 40.99750, 41.03570, 41.04...
## $ destLat <dbl> 29.02819, 28.85056, 28.85056, 29.06256, 29.00...
To classify region with sufficient coverage, we only need 4 columns from the dataset:
scotty <- scotty.csv[,c(1,3,4,6)]
glimpse(scotty)## Observations: 366,784
## Variables: 4
## $ timeStamp <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 2017-10-...
## $ riderID <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760d9862...
## $ orderStatus <fct> nodrivers, nodrivers, nodrivers, nodrivers, nodriv...
## $ srcGeohash <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,...
Since the report is to predict coverage by Hour and Region, we should extract the hour value first from the TimeStamp. But in order to ease the data pre-processing, we’re just going to round the timestamp into hour using lubridate round_date and ymd_hms
scotty$timeStamp <- round_date(ymd_hms(scotty$timeStamp), "hour")Next, we’re going to get rid the data with order status “cancelled”, because it has no meaning to solve this problem
scotty <- scotty %>% filter(orderStatus != "cancelled")Before we move to the next step, we’re going to remove duplicate orders by identifying the “nodrivers” status and apply some business rules to it.
Both business rules are complimentary, so we only need to apply one business rule.
To apply those business rules, first we’re going to group the data by Rider ID, Timestamp, and srcGeohash. Next we’re going to mark the duplicated row, using ifelse statement using business rule number 2. And finally we take the “unique” data using filter
scotty_nd <- scotty %>%
group_by(riderID, timeStamp, srcGeohash) %>%
mutate(
final =
ifelse(
orderStatus=="nodrivers"
& !is.na(lead(orderStatus)),
1,
0
)
) %>%
ungroup() %>%
filter(final == 0)After we delete the duplicates, we can now mark Scotty Coverage at certain Hour and Region, by finding the presence of “nodrivers”. If there’s a presence of “nodrivers” status, then we’re going to assume that at that time and place, the coverage is Insufficient. Insufficient marked as 0, and Sufficient marked as 1.
scotty_nd <- readRDS("scotty_nd.RDS")
scotty_nd <- scotty_nd %>%
group_by(timeStamp, srcGeohash) %>%
summarise(coverage = ifelse(sum(orderStatus == "nodrivers")>0,0,1)) %>%
ungroup()But before we move to the next process, we should check the data completeness
table(scotty_nd$srcGeohash)##
## 6 7 9 c d e g k q s t u w
## 1 180 10 2 7 63 10 4 1 1477 4 36 3
As we can see, the region count is un-even, means the data is not complete. We’re going to fill the blank data with coverage value assumed as Sufficient, since there’s no demand at that Hour and Region
PS: The test dataset is not containing data for region 6, so we’re going to take that data out.
scotty_nd <- scotty_nd %>%
filter(srcGeohash != "6") %>%
mutate(srcGeohash = factor(srcGeohash)) %>%
tidyr::complete(timeStamp, srcGeohash, fill=list(coverage = 1))
table(scotty_nd$srcGeohash)##
## 7 9 c d e g k q s t u w
## 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477 1477
Now the data is complete, other than extracting the hour value, we could also do another features engineering to give the model with better precision and accuracy by extracting the weekdays and week of month from the timestamp. Because there’s a possibility that between a business days or not, and between the pay week or not, it impact the demands of Scotty at certain Regions and Hour
PS: Since the test dataset is using only weekdays and 2nd week of month, so we’re going to use that data from train.
scotty_nd <- scotty_nd %>%
mutate(weekdays = weekdays(scotty_nd$timeStamp)) %>%
filter(weekdays != "Sunday") %>%
filter(weekdays != "Saturday") %>%
mutate(
weekdays =
factor(
weekdays,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
),
hour = factor(hour(timeStamp)),
weekOfMonth = stri_datetime_fields(timeStamp)$WeekOfMonth) %>%
filter(weekOfMonth == 2)
table(scotty_nd$coverage)##
## 0 1
## 305 2515
One-hot encoding using model.matrix from column srcGeohash, weekdays, hour, and weeOfMonth
sdm <- model.matrix(~.,scotty_nd[,c(3,2,4,5)])
sdm <- sdm[,-1]
train_x <- sdm[,-1]
train_y <- sdm[,1]
dim(train_x)## [1] 2820 38
The data train consist of 38 columns as the input, so we’re going to put the value into the input shape. Since the goal is to predict binary classification, then the last layer should be activated by sigmoid function.
model <- keras_model_sequential()
model %>%
layer_dense(units = 128, activation = 'relu', input_shape = c(38)) %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 1, activation = "sigmoid")
summary(model)## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 128) 4992
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 64) 8256
## ___________________________________________________________________________
## dense_3 (Dense) (None, 1) 65
## ===========================================================================
## Total params: 13,313
## Trainable params: 13,313
## Non-trainable params: 0
## ___________________________________________________________________________
I run several trials using different optimizers, but in-train accuracy is not much different (ranging around 96%). binary_crossentropy is used to calculate the loss value because this is binomial classification model.
model %>% compile(
optimizer = optimizer_rmsprop(),
loss = 'binary_crossentropy',
metrics = c('accuracy')
)For every optimizers, i train the model using epoch set to 100, validation split 0.2, and leverage the callback_early_stopping which monitor the accuracy value, and immediately stop the training if the accuracy fail to increase for consecutive 2 epochs
history <- model %>%
fit(train_x, train_y, epoch=100, validation_split = 0.2,
callback=callback_early_stopping(monitor = "acc", mode = "max", patience=2)
)history <- readRDS("history.RDS")
model <- load_model_hdf5("model_scotty.hdf5")
ggplotly(
ggplot(
data.frame(epoch=1:length(history$metrics$acc),
acc=history$metrics$acc,
val_acc=history$metrics$val_acc),
aes(x=epoch))+
geom_point(aes(y=acc)) +
geom_xspline(size=0.5, aes(y=acc), col="red") +
geom_point(aes(y=val_acc)) +
geom_xspline(size=0.5, aes(y=val_acc), col="green")
)ggplotly(
ggplot(
data.frame(epoch=1:length(history$metrics$loss),
loss=history$metrics$loss,
val_loss=history$metrics$val_loss),
aes(x=epoch))+
geom_point(aes(y=loss)) +
geom_xspline(size=0.5, aes(y=loss), col="red") +
geom_point(aes(y=val_loss)) +
geom_xspline(size=0.5, aes(y=val_loss), col="green")
)Preprocess the data test to fit our model
test_csv <- read.csv("submissionClassification.csv")
test_x <- test_csv[,-3]
test_x$weekdays <- weekdays(ymd_hms(test_x$timeStamp))
test_x$weekdays <- factor(test_x$weekdays, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
test_x$hour <- factor(hour(ymd_hms(test_x$timeStamp)))
test_sdm <- model.matrix(~.,test_x[,-2])
test_sdm <- test_sdm[,-1]
dim(test_sdm)pred <- predict(model, x=test_sdm)
table(test_csv$coverage)
test_csv$coverage <- as.factor(ifelse(pred>0.29, "Sufficient", "Insufficient"))