Firstly, prepare the libraries
library(tidyverse) #Piping purpose
library(lubridate) #Hour and weekday filter from date
library(ggplot2) #Visualisation purpose
library(plotly) #Visualisation purpose
library(e1071) #for naiveBayes function
library(caret) #for model evaluation
library(padr) #to use padding time interval
library(tidyr) #to tidy the data (is in tidyverse too)
library(glue) #embedding expressions in curly braces which are then evaluated and inserted into the argument string.
library(car)#to use vif function - check multicolinearity in logistic regression model
library(partykit) #to use ctree() function for decision tree model
library(lime) #for lime functionTrain Data
## id trip_id driver_id rider_id
## 0 14901 14900 0
## start_time src_lat src_lon src_area
## 0 0 0 0
## src_sub_area dest_lat dest_lon dest_area
## 0 0 0 0
## dest_sub_area distance status confirmed_time_sec
## 0 0 0 0
## Rows: 229,532
## Columns: 16
## $ id <fct> 59d005e1ffcfa261708ce9cd, 59d0066a3d32b861760d4a01…
## $ trip_id <fct> 59d005e9cb564761a8fe5d3e, 59d00678ffcfa261708ceb37…
## $ driver_id <fct> 59a892c5568be44b2734f276, 59a135565e88a24b11f11994…
## $ rider_id <fct> 59ad2d6efba75a581666b506, 59ce930f3d32b861760a4617…
## $ start_time <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <int> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
From the loaded data above we found out that there are several missing value on trip_id and driver_id, but it looks fine so i will keep it as it is. We also need to change the date type to POSIXct and everything else are good :)
The dataset includes information about:
Test Data
## Rows: 504
## Columns: 3
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <fct> 2017-12-03T00:00:00Z, 2017-12-03T01:00:00Z, 2017-12-03T02:00…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
From the loaded data above we found out that we only need to change the date type to POSIXct.
## src_area datetime coverage
## 0 0 504
In order to fulfill the requirement, we need to do some extra effort on exploring this data test.
We found out the date range and source area as written below
2017-12-03, 00:00:00 to 2017-12-09, 23:00:00.
Area sxk3, sxk8 and sxk9.
As we can see, some of the data type are still inappropiate, lets try to fix it.
Train Data
Test Data
It looks like the time stamp are different, the datetime in TRAIN DATA are down to second and the datetime in TEST DATA are in hourly interval, lets make it equal in an hour.
Floor the time in TRAIN DATA to nearest hour
scotty_clean <- scotty_clean %>%
mutate(datetime = floor_date(datetime, unit = "hour"))
head(scotty_clean)In simple way, if we want to make sure that the coverage status are Sufficient or Insufficient, firstly we need to observe the order status if its confirmed or no drivers, from the data we can see that both status are in the same hour, i dont exactly sure that it happened every hour but lets assume it is, so lets group it so it wont distract the data preprocess.
scotty_clean_agg <- scotty_clean %>%
group_by(src_area, status, datetime) %>%
summarise(ndrivers = n()) %>%
ungroup() %>%
spread(status, ndrivers)## `summarise()` regrouping output by 'src_area', 'status' (override with `.groups` argument)
After doing this, found out that the time was not in a proper interval, some hour missing like 02:00:00, 05:00:00, 06:00:00 are saw on first 6 data, lets make it proper. and also, we are going back to an earlier stage, after we know that the TEST data has 3 different source area, lets break the TRAIN data into three different dataframe based on their area to get a proper observation with each scope.
scotty_clean_sxk3 <- scotty_clean_agg %>%
filter(src_area == "sxk3") %>%
pad('hour') %>%
mutate(src_area = replace_na(src_area,"sxk3"))
scotty_clean_sxk8 <- scotty_clean_agg %>%
filter(src_area == "sxk8") %>%
pad('hour') %>%
mutate(src_area = replace_na(src_area,"sxk8"))
scotty_clean_sxk9 <- scotty_clean_agg %>%
filter(src_area == "sxk9") %>%
pad('hour') %>%
mutate(src_area = replace_na(src_area,"sxk9"))
#rejoin all the dataset back into one data
scotty_clean_agg_all <- rbind(scotty_clean_sxk3,scotty_clean_sxk8,scotty_clean_sxk9)To be able to calculate how many confirmed and nodrivers status, first we need to change NA value to 0. Create a new column to show that the data at that time is sufficient or insufficient
scotty_clean_agg_all <- scotty_clean_agg_all %>%
mutate(confirmed = replace_na(confirmed, 0),
nodrivers = replace_na(nodrivers, 0)) %>%
mutate(coverage = as.factor(ifelse(nodrivers == 0, "sufficient", "insufficient")))
head(scotty_clean_agg_all)#Exploratory Data Analysis
Lets Explore the state in the target distribution by:
Proportion of class of target variable overall
##
## insufficient sufficient
## 0.5410053 0.4589947
Proportion of class of target variable in each area (3 areas)
SXK3
scotty_clean_agg_all_sxk3 <- scotty_clean_agg_all %>%
filter(src_area == "sxk3")
prop.table(table(scotty_clean_agg_all_sxk3$coverage))##
## insufficient sufficient
## 0.6263228 0.3736772
SXK8
scotty_clean_agg_all_sxk8 <- scotty_clean_agg_all %>%
filter(src_area == "sxk8")
prop.table(table(scotty_clean_agg_all_sxk8$coverage))##
## insufficient sufficient
## 0.1706349 0.8293651
SXK9
scotty_clean_agg_all_sxk9 <- scotty_clean_agg_all %>%
filter(src_area == "sxk9")
prop.table(table(scotty_clean_agg_all_sxk9$coverage))##
## insufficient sufficient
## 0.8260582 0.1739418
Explored the relation between the target and the features.
Finding the pattern or correlation between target and features
# sxk3 visual
scotty_viz <- scotty_clean_agg_all_sxk3 %>%
mutate(hourly = hour(datetime)) %>%
group_by(src_area, hourly, coverage) %>%
summarise(nostatus = n()) %>%
spread(coverage, nostatus) %>%
group_by(src_area, hourly) %>%
summarise(sum_sufficient = sum(sufficient),
sum_insufficient = sum(insufficient)) %>%
mutate(sum_sufficient = replace_na(sum_sufficient, 0),
sum_insufficient = replace_na(sum_insufficient, 0))## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)
# dataset for bar-chart
scotty_viz2 <- pivot_longer(data = scotty_viz,
cols = c(sum_sufficient, sum_insufficient),
names_to = "coverage_status",
values_to = "value")
scotty_viz3 <- scotty_viz2 %>%
mutate(coverage_status = case_when(coverage_status == "sum_sufficient" ~ "Sufficient",
coverage_status == "sum_insufficient" ~ "Insufficient"))
# Plot the bar-chart
ggplot(data = scotty_viz3,
aes (x=hourly, y=value)) +
geom_col(aes (fill = coverage_status), position = "dodge")+
labs(title = "SXK3 Area")# sxk8 visual
scotty_viz4 <- scotty_clean_agg_all_sxk8 %>%
mutate(hourly = hour(datetime)) %>%
group_by(src_area, hourly, coverage) %>%
summarise(nostatus = n()) %>%
spread(coverage, nostatus) %>%
group_by(src_area, hourly) %>%
summarise(sum_sufficient = sum(sufficient),
sum_insufficient = sum(insufficient)) %>%
mutate(sum_sufficient = replace_na(sum_sufficient, 0),
sum_insufficient = replace_na(sum_insufficient, 0))## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)
# dataset for bar-chart
scotty_viz5 <- pivot_longer(data = scotty_viz4,
cols = c(sum_sufficient, sum_insufficient),
names_to = "coverage_status",
values_to = "value")
scotty_viz6 <- scotty_viz5 %>%
mutate(coverage_status = case_when(coverage_status == "sum_sufficient" ~ "Sufficient",
coverage_status == "sum_insufficient" ~ "Insufficient"))
# Plot the bar-chart
ggplot(data = scotty_viz6,
aes (x=hourly, y=value)) +
geom_col(aes (fill = coverage_status), position = "dodge")+
labs(title = "SXK8 Area")# sxk9 visual
scotty_viz7 <- scotty_clean_agg_all_sxk9 %>%
mutate(hourly = hour(datetime)) %>%
group_by(src_area, hourly, coverage) %>%
summarise(nostatus = n()) %>%
spread(coverage, nostatus) %>%
group_by(src_area, hourly) %>%
summarise(sum_sufficient = sum(sufficient),
sum_insufficient = sum(insufficient)) %>%
mutate(sum_sufficient = replace_na(sum_sufficient, 0),
sum_insufficient = replace_na(sum_insufficient, 0))## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)
# dataset for bar-chart
scotty_viz8 <- pivot_longer(data = scotty_viz7,
cols = c(sum_sufficient, sum_insufficient),
names_to = "coverage_status",
values_to = "value")
scotty_viz9 <- scotty_viz8 %>%
mutate(coverage_status = case_when(coverage_status == "sum_sufficient" ~ "Sufficient",
coverage_status == "sum_insufficient" ~ "Insufficient"))
# Plot the bar-chart
ggplot(data = scotty_viz9,
aes (x=hourly, y=value)) +
geom_col(aes (fill = coverage_status), position = "dodge")+
labs(title = "SXK9 Area") Based on the bar-chart plot above:
Area sxk3 and sxk9 have more days of insufficient compared to days of Sufficient
Area sxk8 has more days of sufficient compared to days of Insufficient
Area sxk3 has more days of insufficient from 7 am till 12 am
Area sxk9 has more days of insufficient for whole hour (12 am to 11 pm)
Area sxk8 has quite high days of insufficient from 7-9 am and 3-7 pm
Use heatmap of time (hour) and weekdays, grouped by area and find the pattern
scotty_viz11 <- scotty_clean_agg_all %>%
mutate(hourly = hour(datetime),
dayname = wday(datetime, label = TRUE)) %>%
group_by(src_area, hourly, dayname, coverage) %>%
summarise(nstatus = n()) %>%
spread(coverage, nstatus) %>%
group_by(src_area, hourly, dayname) %>%
summarise(sum_sufficient = sum(sufficient),
sum_insufficient = sum(insufficient)) %>%
mutate(sum_sufficient = replace_na(sum_sufficient, 0),
sum_insufficient = replace_na(sum_insufficient, 0))## `summarise()` regrouping output by 'src_area', 'hourly', 'dayname' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
scotty_viz12 <- scotty_viz11 %>%
mutate(coverage_day = sum_sufficient - sum_insufficient) %>%
mutate(coverage_status = case_when(coverage_day >= 0 ~ "Sufficient",
coverage_day < 0 ~ "Insufficient")) %>%
mutate(absolute_coverage_day = abs(coverage_day))
plotheatmap <- ggplot(data = scotty_viz12,
aes (x=dayname, y=hourly, fill = coverage_day,
text = glue("Status: {coverage_status} at {hourly}:00:00 on {dayname}
Sufficient for {sum_sufficient} days and
Insufficient for {sum_insufficient} days"))) +
geom_tile() +
theme(legend.position = "none") +
labs(title = "Daily and Hourly Scotty Rider Coverage Status",
x = "", y = "") +
facet_grid(. ~ src_area)+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
# Plot the heat-map
ggplotly(plotheatmap, tooltip = "text")Area sxk3 has insufficient riders on weekend (Saturday and Sunday) and weekday (Monday to Friday) from around 7 am to 12 am
Area sxk8 has insufficient riders only on weekday from 8 am to 9 am (Monday to Thursday) and 3 pm to 7 pm (Friday)
Area sxk9 has insufficient riders for both weekday and weekend for whole hour (12 am to 11 pm)
-Drop some unuse column
#Data Split
set.seed(100)
index <- sample(nrow(scotty_clean_agg_all), nrow(scotty_clean_agg_all) * 0.8)
data_train <- scotty_clean_agg_all[index,]
data_val <- scotty_clean_agg_all[-index,]
#Change type of of both hour& dayname into factor
data_train <- data_train %>%
mutate(hourly = as.factor(hour(datetime)),
dayname = as.factor(wday(datetime, label = TRUE))) %>%
select(-datetime)
data_val <- data_val %>%
mutate(hourly = as.factor(hour(datetime)),
dayname = as.factor(wday(datetime, label = TRUE))) %>%
select(-datetime)
head(data_train, 2)#Lets recheck the proportion table of the coverage on train data
prop.table(table(scotty_clean_agg_all$coverage))##
## insufficient sufficient
## 0.5410053 0.4589947
We need to equalize between each of it, best practice is using DownSample.
# donwsample training dataset
scotty_train_down <- downSample(x = data_train[,-2],
y = data_train$coverage,
yname = "coverage")
# visualize proportion of downsample training dataset
proportion_area <- prop.table(table(scotty_train_down$coverage))
proportion_area##
## insufficient sufficient
## 0.5 0.5
It’s balance now, lets move to the next step :)
Now it’s the fun part, finding the right model for our data. I always thrive for a humanly way to explain a thing and could be easily understood by them.
There will be 2 model that i am going to use
Logistic Regression
Decision Tree
Logistic Regression is one of kind of classification method which the concept are similiar with linear regression. Although in binary logistic regression are not being specifically calculated but calculating the possibilities on each target class.
Decision tree is a quite simple model and the performance is robust/ powerfull and can be easily interpretable. Decision Tree model is the basic of other tree-based model like Random Forest which well known as a robust model to make a prediction. This model will create a tree decision based on the pattern of the data and the result will be visualized as a Flowchart and from that, it can be easily seen how the model build a prediction.
The other character of decision tree:
Assume that on each predictor variable are dependent
It can handle multicollinearity and outlier
scotty_logreg <- glm(formula = coverage ~ ., data = scotty_train_down, family = "binomial")
summary(scotty_logreg)##
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = scotty_train_down)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68191 -0.73900 0.00614 0.69223 2.86488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.201639 0.229410 -0.879 0.379430
## src_areasxk8 2.423088 0.113839 21.285 < 2e-16 ***
## src_areasxk9 -1.163636 0.107249 -10.850 < 2e-16 ***
## hourly1 0.205163 0.315056 0.651 0.514920
## hourly2 0.781255 0.313253 2.494 0.012631 *
## hourly3 0.501428 0.308420 1.626 0.103993
## hourly4 1.348423 0.309672 4.354 1.33e-05 ***
## hourly5 1.124973 0.312085 3.605 0.000313 ***
## hourly6 0.749114 0.304528 2.460 0.013897 *
## hourly7 -0.695338 0.313962 -2.215 0.026779 *
## hourly8 -2.132189 0.331128 -6.439 1.20e-10 ***
## hourly9 -1.062422 0.318258 -3.338 0.000843 ***
## hourly10 -0.106276 0.307984 -0.345 0.730041
## hourly11 0.026123 0.311322 0.084 0.933128
## hourly12 -0.567000 0.309009 -1.835 0.066521 .
## hourly13 -0.760945 0.313946 -2.424 0.015359 *
## hourly14 -0.622568 0.311359 -2.000 0.045552 *
## hourly15 -0.821045 0.319644 -2.569 0.010210 *
## hourly16 -1.025553 0.321629 -3.189 0.001430 **
## hourly17 -1.206824 0.322380 -3.743 0.000181 ***
## hourly18 -1.291317 0.319109 -4.047 5.20e-05 ***
## hourly19 -1.077813 0.317658 -3.393 0.000691 ***
## hourly20 0.022152 0.308201 0.072 0.942702
## hourly21 -0.064482 0.300702 -0.214 0.830205
## hourly22 0.639626 0.301692 2.120 0.033995 *
## hourly23 0.047613 0.307533 0.155 0.876962
## dayname.L -0.187640 0.118142 -1.588 0.112228
## dayname.Q -0.260470 0.117547 -2.216 0.026700 *
## dayname.C 0.195982 0.117362 1.670 0.094941 .
## dayname^4 0.571191 0.118673 4.813 1.49e-06 ***
## dayname^5 0.264233 0.116753 2.263 0.023625 *
## dayname^6 0.006165 0.116049 0.053 0.957634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4597 on 3315 degrees of freedom
## Residual deviance: 3142 on 3284 degrees of freedom
## AIC: 3206
##
## Number of Fisher Scoring iterations: 5
Lets random check the variable and interpret the value
## hourly4
## 3.851349
It means that, at 2:00, chance of sufficient would be 3,85 times higher than any other hour with the same condition
Check if there is a mistake in creating the model
## GVIF Df GVIF^(1/(2*Df))
## src_area 1.207358 2 1.048236
## hourly 1.209117 23 1.004137
## dayname 1.038939 6 1.003188
Assumption of multicollinearity and independence of observation has not been violated (because the VIF value are lower than 10)
No perfect separation detected.
In Prediction, there is a type of parameter which are response and link.
link = will show value of Log of Odds from Predict (this is default)response = will return the Predict value into ProbabilityTrain Dataset
#lets go with `response` to calculate the probability of the coverage
scotty_train_prob_logreg <- predict(scotty_logreg, newdata = scotty_train_down, type = "response")#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty_train_pred_logreg <- ifelse(scotty_train_prob_logreg > 0.5, "1", "0")
# Change target class into factor
scotty_train_coverage_logreg <- as.factor(ifelse(scotty_train_pred_logreg == 1, "sufficient", "insufficient"))Validate Dataset (in part of overall train data)
#Go with `response` to calculate the probability of the coverage
scotty_val_prob_logreg <- predict(scotty_logreg, newdata = data_val, type = "response")#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty_val_pred_logreg <- ifelse(scotty_val_prob_logreg > 0.5, "1", "0")
# Change target class into factor
scotty_val_coverage_logreg <- as.factor(ifelse(scotty_val_pred_logreg == 1, "sufficient", "insufficient"))# Prepare testing dataset to make it similar with train dataset
scotty_test_logreg <- scotty_test_clean %>%
mutate(src_area = as.factor(src_area),
hourly = as.factor(hour(datetime)),
dayname = as.factor(wday(datetime, label = TRUE)))
# Compute probability of the coverage
scotty_test_prob_logreg <- predict(scotty_logreg, newdata = scotty_test_logreg, type = "response")
#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty_test_pred_logreg <- ifelse(scotty_test_prob_logreg > 0.5, "1", "0")
# Change target class into factor
scotty_test_coverage_logreg <- as.factor(ifelse(scotty_test_pred_logreg == 1, "sufficient", "insufficient"))
# Assign the prediction result on scotty test data
scotty_test_logreg$coverage <- scotty_test_coverage_logreg
# remove unneeded column and save the scotty test result on csv file
scotty_test_logreg <- scotty_test_logreg %>%
select(src_area, datetime, coverage)
write.csv(scotty_test_logreg,"data-output/data-test-result-logreg.csv")After we doing prediction with the model, there must be a chance of error.
In Classification we can use Confusion Matrix to evaluate the model
TP (True Positive) = When we’re predicting Positive Class and it is Correct
TN (True Negative) = When we’re predicting Negative Class and it is Correct
FP (False Positive) = When we’re predicting Positive Class and it is Wrong
FN (False Negative) = When we’re predicting Negative Class and it is Wrong
Evaluate the Train Dataset
confusionMatrix(data = scotty_train_coverage_logreg,
reference = scotty_train_down$coverage,
positive = "insufficient")## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 1374 410
## sufficient 284 1248
##
## Accuracy : 0.7907
## 95% CI : (0.7765, 0.8044)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5814
##
## Mcnemar's Test P-Value : 2.086e-06
##
## Sensitivity : 0.8287
## Specificity : 0.7527
## Pos Pred Value : 0.7702
## Neg Pred Value : 0.8146
## Prevalence : 0.5000
## Detection Rate : 0.4144
## Detection Prevalence : 0.5380
## Balanced Accuracy : 0.7907
##
## 'Positive' Class : insufficient
##
Evaluate the Validation Dataset
confusionMatrix(data = scotty_val_coverage_logreg,
reference = data_val$coverage,
positive = "insufficient")## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 396 113
## sufficient 88 311
##
## Accuracy : 0.7786
## 95% CI : (0.7502, 0.8052)
## No Information Rate : 0.533
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5537
##
## Mcnemar's Test P-Value : 0.09049
##
## Sensitivity : 0.8182
## Specificity : 0.7335
## Pos Pred Value : 0.7780
## Neg Pred Value : 0.7794
## Prevalence : 0.5330
## Detection Rate : 0.4361
## Detection Prevalence : 0.5606
## Balanced Accuracy : 0.7758
##
## 'Positive' Class : insufficient
##
Accuracy: How good the model predict the target class (in global)
Sensitivity/ Recall: Measurement of goodness of the model with positive class
Specificity : Measurement of goodness of the model with negative class
Pos Pred Value/ Precision: Level of precision of the model to predict positive class
As can be shown above, the result between Test Data vs Train Data are quite similar.
We can simply adjust the level of Accuracy, Recall, Precision and Specificity needed by tuning the threshold point (that 0,5 things), it depends on the business case we need and level of goodness on each measurement.
Some said that Decision Tree is a quite simple model but the performance are Robust/ Powerfull and the results are easily interpretable. Decision Tree Model is a base of any other Tree-Based Model such as the famous Random Forest Model which well-known as a robust model to make a prediction.
Decision tree will create several answer based on the pattern from the data. The result then will be visualize in a Flow Chart so it can be easily seen on how the model doing the prediction.
Additional Character in Decision Tree are:
Assuming that each of Predictor Variabel are all dependent
It can be able to overcome multicollinearity and outlier
scotty.dt <- ctree(formula = coverage ~ . , data = scotty_train_down)
plot(scotty.dt, type = "simple")From the decision tree above, some of the interpretation are:
If the src_area is sxk8 and hourly is 13:00:00, the the coverage status is mostly sufficient
If the src_area is sxk3 or sxk9 and hourly is 7, 8, 9, 12, 13, 14, 17, 18 and 19 and dayname is after Monday, the coverage status is mostly insufficient
The process are pretty similar with Logistic Regression Model on Train Dataset
# Get probability of the coverage
scotty.train.prob.dtree <- predict(scotty.dt, scotty_train_down, type = "prob")
#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty.train.pred.dtree <- ifelse(scotty.train.prob.dtree[,2] > 0.5, "1", "0")
# Change target class into factor
scotty.train.coverage.dtree <- as.factor(ifelse(scotty.train.pred.dtree == 1, "sufficient", "insufficient"))on Validation Dataset
# Get probability of the coverage
scotty.val.prob.dtree <- predict(scotty.dt, data_val, type = "prob")
#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty.val.pred.dtree <- ifelse(scotty.val.prob.dtree[,2] > 0.5, "1", "0")
# Change target class into factor
scotty.val.coverage.dtree <- as.factor(ifelse(scotty.val.pred.dtree == 1, "sufficient", "insufficient"))on Test Dataset
# Prepare testing dataset
scotty.test.dtree <- scotty_test_clean %>%
mutate(src_area = as.factor(src_area),
hourly = as.factor(hour(datetime)),
dayname = as.factor(wday(datetime, label = TRUE)))
# Get probability of the coverage
scotty.test.prob.dtree <- predict(scotty.dt, scotty.test.dtree, type = "prob")
#Classify the coverage test with 0.5 threshold
# ifelse(condition, sufficient or insufficient)
scotty.test.pred.dtree <- ifelse(scotty.test.prob.dtree[,2] > 0.5, "1", "0")
scotty.test.coverage.dtree <- as.factor(ifelse(scotty.test.pred.dtree == 1, "sufficient", "insufficient"))
# Assign the prediction result on scotty test data
scotty.test.dtree$coverage <- scotty.test.coverage.dtree
# remove unneeded column and save the scotty test result on csv file
scotty.test.dtree <- scotty.test.dtree %>%
select(src_area, datetime, coverage)
write.csv(scotty.test.dtree,"data-output/data-test-result-DTree.csv")Evaluate the Train Dataset
confusionMatrix(data = scotty.train.coverage.dtree,
reference = scotty_train_down$coverage,
positive = "insufficient")## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 1418 436
## sufficient 240 1222
##
## Accuracy : 0.7961
## 95% CI : (0.782, 0.8097)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5923
##
## Mcnemar's Test P-Value : 6.382e-14
##
## Sensitivity : 0.8552
## Specificity : 0.7370
## Pos Pred Value : 0.7648
## Neg Pred Value : 0.8358
## Prevalence : 0.5000
## Detection Rate : 0.4276
## Detection Prevalence : 0.5591
## Balanced Accuracy : 0.7961
##
## 'Positive' Class : insufficient
##
Evaluate the Validation Dataset
confusionMatrix(data = scotty.val.coverage.dtree,
reference = data_val$coverage,
positive = "insufficient")## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 408 118
## sufficient 76 306
##
## Accuracy : 0.7863
## 95% CI : (0.7582, 0.8126)
## No Information Rate : 0.533
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5682
##
## Mcnemar's Test P-Value : 0.003244
##
## Sensitivity : 0.8430
## Specificity : 0.7217
## Pos Pred Value : 0.7757
## Neg Pred Value : 0.8010
## Prevalence : 0.5330
## Detection Rate : 0.4493
## Detection Prevalence : 0.5793
## Balanced Accuracy : 0.7823
##
## 'Positive' Class : insufficient
##
As shown above, the result between Test Data vs Train Data are still acceptable with note its better than Logistic Regression Model :).
Same as Logistic Regression method, we can simply adjust the level of Accuracy, Recall, Precision and Specificity needed by tuning the threshold point (that 0,5 things), it depends on the business case we need and level of goodness on each measurement.
It’s acceptable to use the Decision Tree model with 0,5 treshold.
Lets trying some other method, LIME. LIME is a Local Interpretable Model-Agnostic Explanations.
It’s an explanation technique that reflected from each part of the name. Local referes to local fidelity , i.e., we want the explanation to really reflect the behaviour of the classifier “around” the instance being predicted. This explanation is useless unless it is interpretable - that is, unless a human can make sense of it. Lime is able to explain any model without needing to ‘peak’ into it, so it is model-agnostic. We now give a high level overview of how lime works.
First, a word about interpretability. Some classifiers use representations that are not intuitive to users at all (e.g. word embeddings). Lime explains those classifiers in terms of interpretable representations (words), even if that is not the representation actually used by the classifier. Further, lime takes human limitations into account: i.e. the explanations are not too long.
In order to be model-agnostic, lime can’t peak into the model. In order to figure out what parts of the interpretable input are contributing to the prediction, we perturb the input around its neighborhood and see how the model’s predictions behave. We then weight these perturbed data points by their proximity to the original example, and learn an interpretable model on those and the associated predictions. For example, if we are trying to explain the prediction for the sentence “I hate this movie”, we will perturb the sentence and get predictions on sentences such as “I hate movie”, “I this movie”, “I movie”, “I hate”, etc. Even if the original classifier takes many more words into account globally, it is reasonable to expect that around this example only the word “hate” will be relevant. Note that if the classifier uses some uninterpretable representation such as word embeddings, this still works: we just represent the perturbed sentences with word embeddings, and the explanation will still be in terms of words such as “hate” or “movie”.
Lets return to this case, shall we? :)
I am going to use randomForest method for the model and implementing k-fold cross validation (k=5) and k-fold set 3 times.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5,repeats = 3)
# scotty.randomforest <- train(coverage ~ ., data = scotty_train_down, method = "rf", trControl= ctrl)# Reduce processing time by saving the processed randomforest model in RDS file
# saveRDS(scotty.randomforest, file = "models/model_rf.RDS")
# Load the randomforest model from RDS file
scotty.randomforest <- readRDS("models/model_rf.RDS")Trying with LIME
# Implement LIME on Random Forest model
set.seed(100)
explainer <- lime(x = scotty_train_down %>% select(-coverage),
model = scotty.randomforest)
explanation <- explain(x = data_val[315:318,] %>% select(-coverage),
labels = "insufficient",
explainer = explainer,
n_features = 5)
plot_features(explanation) The text
Label: yes shows what value of target variable is being explained.
The Probability shows the probability of the observation belong to the label yes.
We can see that for all observations they have little probability, so the model would predict them as no instead of yes.
LIME can be used to analyze the variable predictors used in the model for each observation.
for LIME result:
Observation 1 : Most vmportant variable = sxk8, least important = hourly 9
Observation 4 : Most important variable = sxk8, least important = Wednesday
One or several Machine Learning Model (and with some tuning in the parameter) can create quite good prediction with sufficient data and treatment.
Problem can be solved by Machine Learning. For other else, we need to assure what point/ target that we need to achieve.
2 Models are used in this case for its simplicity of application and interpretable model. Using LIME will add some different perspective in predicting the results.
The process or way of thinking can be used in vast business case, some of it can be targeting customer satisfaction as their top of goal and lowering the chance of customer lost because of decreasement of customer’s trust.