Purpose

Build a model to predict if an incident will be resolved within the SLA time period.

Training set

The training set includes closed tickets from 2017 and 2018.

Data exploration

Visualize and explore the data to highlight its main characteristics (distribution, variation, and relationships) and to pinpoint data quality issues.
An incident represents an IT service task, and the “ticket resolution time task” is the net time from ticket creation until resolution
(excluding weekends, pending time, after-business hours, etc.).

For modeling purposes, we use “total_resolution_time”, the gross time from ticket creation until resolution
(including weekends, pending time, after-business hours, etc.).
We do not use the net time because we do not know how it is calculated and, more importantly, the data used for the calculation is not available.

The “met SLA” outcome is calculated against a 24-hour threshold.
In general this is a sound approach, because our purpose is to identify the underlying factors that determine the variability in the resolution time.

Resolution time distribution

Plot #1 - Shows the complete distribution. What looks unusual is the long tail, composed of tickets that took more than 5 days to resolve.

Plot #2 - Shows the distribution of tickets that took less than 5 days.
The distribution appears to be made up of two or more individual distributions (i.e. the distribution has multiple peaks or modes).

Analyzing the multimodal distribution

Analyzing the long tail

In 2018 there were more long-tail incidents in every month than in 2017, except for Nov-18.

Looking at the distribution across the top 20 subcategories reveals that the Application and Software categories dominate the long tail.

SLA rate

64% of the incidents were resolved within 24 hours.

Analysis approach

The major factors influencing the variability of the resolution time can be grouped as follows.

Complexity

We infer incident complexity using text analytics. The short description was processed, and two features were engineered in addition to the number of words in the description. The first tries to assess the topic of the incident, that is, what it is about. The second tries to assess how difficult it is.

The plot shows the most frequent words across all incident descriptions.

The second plot shows the sentiment analysis, which tags positive and negative words.

Difficulty

Based on negative-sentiment tagging, we can infer how difficult the incident at hand is.

Case type

Case types with 1,000 or more incidents do not vary much in SLA rate; bug cases are the exception. This suggests that case_type is not going to be a strong predictor.

Short Description - Number of words

Assumption: a longer description indicates a more complex problem, which takes more time to resolve.

The assumption is supported by the data.

A clear pattern emerges: the median resolution time grows steadily as the number of words increases, although there are many outliers.

Subcategory

Subcategory w/confidence interval

For each subcategory, the SLA rate is shown with a 95% confidence interval to convey the noise around the estimate. Subcategories whose intervals do not overlap the overall SLA rate deviate from it by more than sampling noise would suggest, and the width of the error bars indicates how much each estimate should be trusted.

Group type

Service desk teams have the highest SLA rate, around 76%; next are the support teams at 64%, and last are the application teams at 55%. The right-hand plot shows the median resolution time per group type; the results match expectations.

Originator Group

Originator group is the first group that handled the ticket, and in many cases the one that resolved the ticket.

Region of assignment group

Urgency

Contact type

Temporal

Tickets opened on a Friday or Saturday are less likely to be resolved on time, which is consistent with the gross resolution time including weekend hours.

Data Splitting

The data is split at random: 90% is used for training and 10% is held out for testing.

Feature Engineering

The features that go into the model:

Model Training - Decision Tree

Test Performance

The decision tree reaches 74.84% accuracy, compared with 64.17% for a majority-vote model, a relative improvement of about 16.6% (10.67 percentage points). The model struggles to correctly predict incidents that will not meet the SLA. To improve on this, we next use a random forest, a bagging (bootstrap aggregation) ensemble of decision trees.

Feature Importance

Decision Tree Plot

Model Training - Random Forest

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   790  342
       Yes  409 1805
                                         
               Accuracy : 0.7756         
                 95% CI : (0.761, 0.7896)
    No Information Rate : 0.6417         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.5058         
 Mcnemar's Test P-Value : 0.01602        
                                         
            Sensitivity : 0.8407         
            Specificity : 0.6589         
         Pos Pred Value : 0.8153         
         Neg Pred Value : 0.6979         
             Prevalence : 0.6417         
         Detection Rate : 0.5395         
   Detection Prevalence : 0.6617         
      Balanced Accuracy : 0.7498         
                                         
       'Positive' Class : Yes            
                                         

With the random forest model, accuracy on the test set increased to 77.56%, compared with 74.84% for a single decision tree.

More importantly, the RF model does a better job of predicting the incidents that will not meet the SLA.

---
title: "SLA Classifier"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

### Purpose
Build a model to predict if an incident will be resolved
within the SLA time period.

```{r include=FALSE}
# load libs
library(tidyverse)
library(lubridate)
library(DBI)
library(RPostgreSQL)
library(caret)
library(Hmisc)
library(gridExtra)
# source helper func
source("Functions.R")
```

### Training set
The training set includes closed tickets from 2017 and 2018.
```{r message=FALSE, warning=FALSE, include=FALSE}
# load data
inc_file_name <- 'source-data/2018-11-19/incident 19112018.csv'

inc <- read_csv(inc_file_name) %>% 
  mutate(opened_at = dmy_hms(opened_at)) %>% 
  mutate(year = year(opened_at)) %>% 
  # gross resolution time in hours; the SLA is met when it is under 24 hours (86,400 s)
  mutate(resolution_time = u_total_resolution_time / 3600,
         sla = ifelse(u_total_resolution_time < 86400, 'Yes', 'No')) %>%
  filter(state=='Closed', year>2016, !is.na(u_met_the_sla)) %>% 
  select(-year)

# append group type and region
inc <- inc %>% 
  left_join(fx_group_type(),by="assignment_group") %>%
  left_join(fx_group_region(), by='assignment_group')
```

### Data exploration
Visualize and explore the data to highlight its main characteristics (distribution, variation, and relationships) and to pinpoint data quality issues.  
An incident represents an IT service task, and the "ticket resolution time task" is the
net time from ticket creation until resolution  
(excluding weekends, pending time, after-business hours, etc.).

For modeling purposes, we use "total_resolution_time", the gross time from ticket creation until resolution  
(including weekends, pending time, after-business hours, etc.).  
We do not use the net time because we do not know how it is calculated
and, more importantly, the data used for the calculation is not available.

The "met SLA" outcome is calculated against a 24-hour threshold.  
In general this is a sound approach, because our purpose is to identify the
underlying factors that determine the variability in the resolution time.
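
As a minimal sketch of the 24-hour rule described above, the snippet below derives the label from a gross resolution time given in seconds. The input values and the variable names (`total_resolution_secs`, `sla_threshold_secs`) are made up for illustration and are not taken from the incident extract.

```r
# Hypothetical gross resolution times in seconds (illustrative values only)
total_resolution_secs <- c(1800, 86399, 86400, 432000)

# 24-hour SLA threshold expressed in seconds
sla_threshold_secs <- 24 * 3600

# Convert to hours for plotting, and label each ticket against the threshold
resolution_hours <- total_resolution_secs / 3600
met_sla <- ifelse(total_resolution_secs < sla_threshold_secs, "Yes", "No")

print(data.frame(resolution_hours, met_sla))
```

Note that the comparison is strict, so a ticket resolved in exactly 24 hours is labeled as not meeting the SLA.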

#### Resolution time distribution
```{r echo=FALSE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
plot1 <- ggplot(inc, aes(x=resolution_time, fill=sla)) +
  geom_histogram(aes(y=..density..), binwidth=1) +
  xlim(0,500) +
  scale_y_continuous(labels = scales::percent, limits = c(0,0.1)) +
  labs(x='resolution time (hours)',
       title ='plot #1',
       subtitle = 'resolution time h <= 500') +
  theme(legend.position="none")

plot2 <- ggplot(inc, aes(x=resolution_time, fill=sla)) +
  geom_histogram(aes(y=..density..), binwidth=1) +
  scale_x_continuous(limits=c(0,120), breaks = c(0,3,5,10,18,24,32,56,72,96,120)) +
  scale_y_continuous(labels = scales::percent, limits = c(0,0.1)) +
  labs(x='resolution time (hours)',
       title = 'plot #2',
       subtitle = 'resolution time h <= 120')

grid.arrange(plot1, plot2, ncol=2)

longtail <- inc %>% filter(resolution_time>120) %>% count() %>% pull()

```
*Plot #1* - Shows the complete distribution. What looks unusual is the
long tail, composed of tickets that took more than 5 days to resolve.

*Plot #2* - Shows the distribution of tickets that took less than 5 days.  
The distribution appears to be made up of two or more individual distributions
(i.e. the distribution has multiple peaks or modes).

#### Analyzing the multimodal distribution
```{r echo=FALSE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
inc %>% 
  ggplot(aes(x=resolution_time, fill=factor(group_type))) +
  geom_histogram(position="stack") +
  scale_x_continuous(limits=c(0,24), breaks = seq(0, 24)) +
  scale_y_continuous(limits = c(0,2500)) +
  labs(x='resolution time (hours)',
       title = 'distribution by group type')
```

\pagebreak

#### Analyzing the long tail
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(resolution_time > 120) %>% 
  mutate(year = year(opened_at),
         month = month(opened_at, label=TRUE)) %>% 
  # count(year, month) %>% 
  ggplot(aes(x=factor(month), fill=factor(year))) +
  geom_bar(position = position_dodge2(preserve = "single")) +
  geom_text(stat="count", aes(label = ..count..), size = 3.0, color='black',
            position = position_dodge(width = 1), vjust=-.8) +
  labs(title = 'longtail - yearly comparison', x='month') +
  scale_fill_discrete(name = "year")
  
```
In 2018 there were more long-tail incidents in every month than in 2017,
except for Nov-18.

\pagebreak

```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(resolution_time > 120) %>% 
  mutate(year = year(opened_at),
         month = month(opened_at, label=TRUE)) %>% 
  count(category, subcategory) %>% 
  top_n(20,n) %>%
  ggplot(aes(x=subcategory, y=n)) +
  geom_col() +
  geom_text(aes(label = n),
            size = 2.5, color='orange', vjust=.2, hjust=1) +
  coord_flip() +
  facet_wrap(~category) +
  labs(title='longtail by subcategory')
```
Looking at the distribution across the top 20 subcategories reveals that the Application
and Software categories dominate the long tail.

```{r eval=FALSE, include=FALSE}
inc %>% 
  filter(resolution_time > 120) %>% 
  count(assignment_group, subcategory) %>% 
  top_n(20,n) %>%
  ggplot(aes(x=reorder(assignment_group, n), y=n, fill=subcategory)) +
  geom_col() +
  coord_flip() +
  labs(x=NULL, title='longtail by group')
# SCM(ISL) and Service Desk (SPTS) contribute more to the long tail than the
# other teams.
```

#### SLA rate
```{r echo=FALSE, fig.height=2, fig.width=8}
h <- fx_overall_sla()

fx_viz_prep(inc %>% count(sla), n) %>% 
ggplot(aes(x = sla, y = n)) +
  geom_bar(stat = "identity", 
           width = 0.75) +
  geom_text(aes(label = bar_label),
            size = 3.5, color='orange', vjust=0, hjust=1.5) +                 
  coord_flip() +
  labs(x=NULL, y='incidents')
```

64% of the incidents were resolved within 24 hours.

\pagebreak

### Analysis approach
The major factors influencing the variability of the resolution time
can be grouped as follows.
  
 * Incident Type - the nature of the incident and the area it belongs to.
 
 * Complexity - how complex the problem at hand is. 
 
 * Urgency - how urgent it is to resolve the problem. 
 
 * Performance - relates to the performance of the assignee and the assignment group. 
 
 * Flow - workload, pending tickets, tickets in process, group transitions, etc.

 * Temporal - for example, day of the week and hour of the day. 

\pagebreak 

#### Complexity

We infer incident complexity using text analytics.
The short description was processed, and two features were engineered in addition to the
number of words in the description.
The first tries to assess the topic of the incident, that is, what it is about.
The second tries to assess how difficult it is.

The plot shows the most frequent words across all incident descriptions.

```{r echo=FALSE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
library(tidytext)
inc_desc <- inc %>% 
  filter(!is.na(short_description)) %>% 
  select(text = short_description) %>% 
  unnest_tokens(word, text)

inc_cleaned <- inc_desc %>%
  anti_join(get_stopwords())

inc_cleaned %>%
  count(word, sort = TRUE) %>%
  filter(n > 1000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
# library(wordcloud)
# inc_cleaned %>%
#   count(word) %>%
#   with(wordcloud(word, n, max.words = 100))
```

The second plot shows the sentiment analysis, which tags positive and negative words.

```{r echo=FALSE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
bing_word_counts <- inc_cleaned %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
# library(reshape2)
# bing <- get_sentiments("bing")
# inc_desc %>%
#   filter(!word %in% c('wise','excel','workagile')) %>% 
#   inner_join(bing) %>%
#   count(word, sentiment, sort = TRUE) %>%
#   acast(word ~ sentiment, value.var = "n", fill = 0) %>%
#   comparison.cloud(colors = c("#F8766D", "#00BFC4"),
#                    max.words = 100)
```

\pagebreak


#### Difficulty
Based on negative-sentiment tagging, we can infer how
difficult the incident at hand is.
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  mutate(difficulty = ifelse(str_detect(str_to_lower(short_description),
    '(error|problem|issue|unable|failed|missing|needs|wrong|slow|locked|disabled|bug)' ) & 
      !is.na(short_description),1,0)) %>% 
  select(short_description, difficulty, sla) %>% 
  mutate(sla = ifelse(sla=='Yes', 1, 0)) %>% 
  ggplot(aes(x = factor(difficulty), y = sla)) +
  stat_summary(fun.data = mean_cl_normal)
```
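
The two text features discussed in this section (the keyword-based difficulty flag and the number of words in the short description) can be sketched in base R as follows. The sample descriptions and the helper name `count_words` are hypothetical; the keyword list is copied from the chunk above.

```r
# Hypothetical short descriptions (illustrative only)
short_description <- c("Unable to login to VPN",
                       "Request new monitor",
                       NA,
                       "Excel is slow and shows an error")

# Keywords treated as signs of a difficult incident
keywords <- c("error", "problem", "issue", "unable", "failed", "missing",
              "needs", "wrong", "slow", "locked", "disabled", "bug")
pattern <- paste(keywords, collapse = "|")

# 1 when any keyword appears (case-insensitive), 0 otherwise or when missing
difficulty <- as.integer(!is.na(short_description) &
                           grepl(pattern, short_description, ignore.case = TRUE))

# Number of whitespace-separated tokens; missing descriptions count as 0 words
count_words <- function(x) {
  ifelse(is.na(x), 0L, lengths(strsplit(trimws(x), "\\s+")))
}
words <- count_words(short_description)

print(data.frame(short_description, difficulty, words))
```

Building the pattern with `paste()` keeps the alternation on a single logical line, so no stray whitespace ends up inside the regular expression.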

\pagebreak

#### Case type
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(!is.na(u_case_type)) %>% 
  select(x=u_case_type, y=resolution_time, sla) %>%
  count(x,sla) %>%
  filter(n>10) %>% 
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
                            bar_percentage)) %>% 
    select(-bar_percentage) %>%
  ggplot(aes(x=reorder(x, sla_rate), y=sla_rate)) +
  geom_col() +
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  scale_y_continuous(labels = scales::percent) + 
  coord_flip() +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = -0.5, hjust = -.2),
            size = 3.0, color='coral3') +
  labs(title='case type sla rates',x=NULL, y='sla rate')
```
Case types with 1,000 or more incidents do not vary much in SLA rate;
bug cases are the exception. This suggests that case_type is not going to be
a strong predictor.

\pagebreak

##### Short Description - Number of words
Assumption: a longer description indicates a more complex problem, which takes more time to resolve.

```{r echo=FALSE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
inc %>% 
  filter(!is.na(short_description)) %>% 
  mutate(desc_words = str_count(short_description, "\\S+")) %>% 
  select(x=desc_words, sla) %>% 
  ggplot(aes(x=x, fill=sla)) +
  geom_histogram(binwidth = 1) +
  labs(title='distribution of number of words in short description',
       x='number of words')
```

The assumption is supported by the data.

```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(!is.na(short_description)) %>% 
  mutate(desc_words = factor(str_count(short_description, "\\S+"))) %>% 
  select(x=desc_words, y=resolution_time, sla) %>% 
  filter(sla=='Yes') %>%
  ggplot(aes(x=x, y=y)) +
  geom_boxplot() +
  ylim(0,24) +
  labs(title='resolution time by number of words in short description',
       subtitle = 'incidents that meet the SLA',
       x='number of words in short description', y='resolution time (hours)')
```
A clear pattern emerges: the median resolution time grows steadily
as the number of words increases, although there are many outliers.

\pagebreak

#### Subcategory
```{r echo=FALSE, fig.height=10, fig.width=8}
subcat <- inc %>% 
  filter(!is.na(subcategory)) %>% 
  left_join(fx_subcategory_t(20), by=c('subcategory')) %>% 
  select(category, subcategory=new_subcategory, sla) %>% 
  count(category, subcategory, sla) %>% 
  filter(n>30) %>% 
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
         bar_percentage)) %>% select(-bar_percentage)
  
subcat %>% 
  filter(sla_rate!=0) %>% 
  ggplot(aes(x=reorder(subcategory, Yes), y=sla_rate)) +
  geom_col() +
  facet_grid(.~category)+
  # Bar label
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = -0.5, hjust = -.2),
            size = 3.0, color='coral3') +
  labs(x = NULL, y = 'sla Rate',
       title ='sla rates by subcategory') 
```

\pagebreak

#### Subcategory w/confidence interval
```{r echo=FALSE, fig.height=10, fig.width=8, message=FALSE, warning=FALSE}
inc %>% 
  filter(!is.na(subcategory)) %>% 
  left_join(fx_subcategory_t(100), by=c('subcategory')) %>% 
  select(category, subcategory=new_subcategory, sla) %>% 
  mutate(sla = ifelse(sla=='Yes', 1, 0)) %>% 
  ggplot(aes(x = subcategory, y = sla)) +
  stat_summary(fun.data = mean_cl_normal) +
  coord_flip() +
  facet_grid(.~category) +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=fx_overall_sla()), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, fx_overall_sla(), label = scales::percent(fx_overall_sla()), vjust = -0.5, hjust = -.2),
            size = 3.0, color='coral3') +
  labs(x = NULL, y = 'sla rate',
       title ='subcategory sla rates') 
```

For each subcategory, the SLA rate is shown with a 95% confidence interval
to convey the noise around the estimate.
Subcategories whose intervals do not overlap the overall SLA rate deviate from it by more
than sampling noise would suggest, and the width of the error bars indicates how much each estimate should be trusted.
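
To make the error bars concrete, here is a small sketch of a t-based 95% confidence interval for the mean of a 0/1 SLA indicator. To the best of our understanding this mirrors what `mean_cl_normal` computes, but treat that as an assumption. The function name `sla_rate_ci` and the two sample vectors are hypothetical.

```r
# t-based confidence interval for the mean of a 0/1 "met SLA" indicator
sla_rate_ci <- function(met_sla, conf_level = 0.95) {
  n <- length(met_sla)
  rate <- mean(met_sla)
  std_err <- sd(met_sla) / sqrt(n)
  margin <- qt((1 + conf_level) / 2, df = n - 1) * std_err
  c(rate = rate, lower = rate - margin, upper = rate + margin)
}

# Two hypothetical subcategories with the same 80% SLA rate but different sizes
small_subcat <- rep(c(1, 0), times = c(16, 4))      # 20 incidents
large_subcat <- rep(c(1, 0), times = c(1600, 400))  # 2,000 incidents

small_ci <- sla_rate_ci(small_subcat)
large_ci <- sla_rate_ci(large_subcat)

print(round(rbind(small = small_ci, large = large_ci), 3))
```

Both subcategories have the same point estimate, but the interval for the smaller one is much wider, which is why low-volume subcategories should be read with more caution.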

\pagebreak

#### Group type
```{r echo=FALSE, fig.height=4, fig.width=8}
p1 <- inc %>%
  mutate(sla = ifelse(sla=='Yes', 1, 0)) %>% 
  ggplot(aes(x = factor(group_type), y = sla)) +
  stat_summary(fun.data = mean_cl_normal) +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=fx_overall_sla()), colour="#BB0000", linetype="dashed") +
  geom_text(aes(0, fx_overall_sla(), label = scales::percent(fx_overall_sla()),
                vjust = -0.5, hjust = -.2), size = 3.0, color='coral3') +
  labs(x = NULL, y = 'SLA rate', title ='group type sla rates')

meds <- inc %>% group_by(group_type) %>% summarise(med=round(median(resolution_time),1))

p2 <- inc %>% 
  ggplot(aes(x=factor(group_type), y=resolution_time )) +
  geom_boxplot() +
  coord_cartesian(ylim=c(0,120)) +
   geom_text(data = meds, aes(y=med, label = med), size = 3, vjust = -1.5) +
  labs(title="median resolution time by group type",
       x="group type", y="resolution time (hours)")

grid.arrange(p1, p2, ncol=2)
```
Service desk teams have the highest SLA rate, around 76%; next are
the support teams at 64%, and last are the application teams at 55%.
The right-hand plot shows the median resolution time per group
type; the results match expectations.

\pagebreak

##### Originator Group
Originator group is the first group that handled the ticket, and in many cases
the one that resolved the ticket.
```{r echo=FALSE, fig.height=8, fig.width=8}
inc %>% 
  # rename(originator_group=u_originator_group) %>% 
  filter(!is.na(u_originator_group)) %>% 
  select(x=u_originator_group, sla) %>%
  count(x, sla) %>% 
  filter(n>10) %>% 
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
         bar_percentage)) %>% select(-bar_percentage) %>% 
  filter(sla_rate > 0) %>% 
  ggplot(aes(x, y=sla_rate)) +
  geom_col() +
  # Bar label
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = 0, hjust = 0),
            size = 3.0, color='coral3') +
  labs(x = NULL, y = 'sla Rate',
       title ='originator group sla rates') 
```

\pagebreak

#### Region of assignment group
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(!is.na(region)) %>% 
  select(x=region, sla) %>%
  count(x, sla) %>% 
  # filter(n>10) %>%
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
         bar_percentage)) %>% select(-bar_percentage) %>% 
  ggplot(aes(x=reorder(x, sla_rate), y=sla_rate)) +
  geom_col() +
  # Bar label
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = 0, hjust = 0),
            size = 3.0, color='coral3') +
  labs(x = NULL, y = 'sla rate',
       title ='region sla rate') 
```

\pagebreak

### Urgency
##### Contact type
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  filter(!is.na(contact_type)) %>% 
  select(x=contact_type, sla) %>%
  count(x, sla) %>% 
  filter(n>10) %>%
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
         bar_percentage)) %>% select(-bar_percentage) %>% 
  ggplot(aes(x, y=sla_rate)) +
  geom_col() +
  # Bar label
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = 0, hjust = 0),
            size = 3.0, color='coral3') +
  labs(x = NULL, y = 'sla rate',
       title ='contact type sla rate') 
 
```

\pagebreak

### Temporal
```{r echo=FALSE, fig.height=4, fig.width=8}
inc %>% 
  mutate(wday = wday(opened_at, label = TRUE)) %>% 
  count(wday, sla) %>% 
  spread(sla, n) %>%
  mutate(No = ifelse(is.na(No),0,No),
         Yes = ifelse(is.na(Yes),0,Yes),
         sla_rate = round(Yes/(Yes+No),2)) %>%
  mutate(bar_percentage = sprintf("%.1f%%", 100*sla_rate)) %>% 
  mutate(bar_label = paste0(format(Yes, big.mark = ","), " ~ ", 
         bar_percentage)) %>% select(-bar_percentage) %>%
  ggplot(aes(wday, y=sla_rate)) +
  geom_col() +
  # Bar label
  geom_text(aes(label = bar_label), hjust=1, size=3, color='orange',
                 show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  # Overall SLA rate
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = scales::percent(h), vjust = 0, hjust = 0),
            size = 3.0, color='coral3') +
  labs(x = 'weekday', y = 'sla rate',
       title ='weekday sla rates')
  
```
Tickets opened on a Friday or Saturday are less likely to be
resolved on time, which is consistent with the gross resolution time including weekend hours.

\pagebreak

### Data Splitting
The data is split at random: 90% is used for training and 10% is held out for testing.
```{r include=FALSE}
inc <- inc %>% 
  left_join(fx_random_split(.90), by='number')
```
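
The helper `fx_random_split()` lives in `Functions.R` and is not shown here. The sketch below is one way such a split could work, assuming it returns one row per incident `number` with a `dataset` column holding `"train"` or `"test"`. The function `random_split`, its arguments, and the ticket ids are hypothetical.

```r
# Hypothetical stand-in for a random train/test assignment keyed by ticket id
random_split <- function(ids, train_fraction = 0.90, seed = 123) {
  set.seed(seed)  # fixed seed so the split is reproducible
  n_train <- round(length(ids) * train_fraction)
  train_ids <- sample(ids, n_train)
  data.frame(number = ids,
             dataset = ifelse(ids %in% train_ids, "train", "test"),
             stringsAsFactors = FALSE)
}

# Illustrative ticket ids (not real incident numbers)
ids <- sprintf("INC%05d", 1:1000)
split_map <- random_split(ids)

print(table(split_map$dataset))
```

Returning a lookup table keyed by `number` lets the assignment be joined back onto the incident data, which matches how the chunk above uses `left_join(..., by='number')`.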

### Feature Engineering
The features that go into the model:

 * Incident Type
    * topic (text analytics)
    * subcategory
    * case type

 * Complexity
    * difficulty (text analytics)
    * words in description (text analytics)
  
 * Urgency
    * contact_type
 
 * Performance
    * grpN - assignment group binned number of tickets resolved.
    * grpT - assignment group binned median resolution time.
    * userN - assignee binned number of tickets resolved.
    * userT - assignee binned median resolution time.
    
 * Flow
    * assignment_group_change indicator
    * assigned_to_change indicator
    
 * Temporal
    * opened_at day of the week.
  
 * Workforce  
    * assignment group
    * group type
    * originator group
    * department
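
The binned performance features (`grpN`, `grpT`, `userN`, `userT`) are produced by helpers in `Functions.R` that are not shown here. As a rough sketch of the idea, assuming quantile-based bins, the snippet below counts tickets and takes the median resolution time per assignment group, then bins both values. The data, the helper `bin_by_quantile`, and the choice of two bins are all hypothetical.

```r
# Illustrative tickets (hypothetical groups and resolution times in hours)
tickets <- data.frame(
  assignment_group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"),
  resolution_time  = c(2, 4, 6, 30, 50, 10, 12, 14, 16, 100)
)

# Per-group ticket count and median resolution time
grp_n   <- tapply(tickets$resolution_time, tickets$assignment_group, length)
grp_med <- tapply(tickets$resolution_time, tickets$assignment_group, median)

grp_perf <- data.frame(assignment_group = names(grp_n),
                       n   = as.vector(grp_n),
                       med = as.vector(grp_med))

# Assign each value to a quantile-based bin (1 = lowest)
bin_by_quantile <- function(x, bins = 2) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

grp_perf$grpN <- bin_by_quantile(grp_perf$n)
grp_perf$grpT <- bin_by_quantile(grp_perf$med)

print(grp_perf)
```

Binning turns the raw counts and medians into a small number of ordered levels, which keeps the resulting factors compact when they are joined back onto the incidents by `assignment_group`.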
 
```{r include=FALSE}
abt <- inc %>%
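  # difficulty flag; the pattern must stay on one line, because a line break
  # inside the string would become part of the 'needs' alternative
  mutate(difficulty = ifelse(str_detect(str_to_lower(short_description),
    '(error|problem|issue|unable|failed|missing|needs|wrong|slow|locked|disabled)' ) &
      !is.na(short_description),1,0)) %>%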
  # topic (text analytics) 
  left_join(fx_topic(), by='number') %>% 
  # Weekday
  mutate(wday = factor(wday(opened_at, label = TRUE), ordered = FALSE )) %>%
  # Words in description (binning)
  mutate(words = str_count(short_description, "\\S+"),
         words = ifelse(is.na(words),0,words)) %>% 
  # Case type (cleansing)
  mutate(case_type = str_replace_all(str_to_lower(u_case_type),
                  '([[:punct:]]|\\s)', '_')) %>% 
  # Originator group (collapse minor occurrences)
  left_join(fx_orig_group_t(20), by=c('u_originator_group')) %>% 
  # subcategory (collapse minor occurrences)
  left_join(fx_subcategory_t(20), by=c('subcategory')) %>% 
  # assignee performance (FE)
  left_join(fx_assignee_perf(), by=c('assigned_to')) %>%
  # group performance (FE)
  left_join(fx_grp_perf(), by=c('assignment_group'='u_resolver_group')) %>% 
  # Selection
  select(difficulty, 
         category,
         case_type,
         words, 
         contact_type, 
         department=u_department,
         orig_group=new_orig_group,
         assignment_group,
         subcategory=new_subcategory,
         grpN, grpT, wday, 
         group_type,
         region,
         userN, userT,
         assignment_group_change=u_assignment_group_change,
         assigned_to_change=u_assigned_to_change,
         sla, dataset, 
         ends_with("T", ignore.case = FALSE)) %>%
  # replace nulls
  replace(., is.na(.), "unknown") %>% 
  # Factorization
  mutate(sla = factor(sla),
         category = factor(category),
         subcategory = factor(subcategory),
         case_type = factor(case_type),
         orig_group = factor(orig_group),
         assignment_group = factor(assignment_group),
         contact_type = factor(contact_type),
         department = factor(department),
         grpN = factor(grpN),
         grpT = factor(grpT),
         userN = factor(userN),
         userT = factor(userT),
         wday = factor(wday),
         group_type = factor(group_type),
         region = factor(region),
         difficulty = factor(difficulty), 
         assignment_group_change = factor(assignment_group_change),
         assigned_to_change = factor(assigned_to_change))
```

```{r include=FALSE}
# for nicer tree labels
abt <- abt %>% 
  rename(case.=case_type,
         cat.=category,
         contact.=contact_type,
         dept.=department,
         origGrp.=orig_group,
         # resolveGrp=resolver_group,
         subCat.=subcategory,
         grpChg.=assignment_group_change,
         userChg.=assigned_to_change,
         grpType.=group_type,
         region.=region)
```

```{r include=FALSE}
#Data Splitting 
train <- abt %>% filter(dataset=='train') %>% select(-dataset)
test <- abt %>% filter(dataset=='test') %>% select(-dataset)
```

#### Model Training - Decision Tree
```{r DECISTION TREE, echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(doParallel))
cl <- makeCluster(detectCores())
registerDoParallel(cl)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           allowParallel = TRUE)
library(rpart)
set.seed(123)
cart <- train(sla ~ . ,
                    data = train,
                    method = "rpart",
                    # tuneGrid=expand.grid(cp=0.02),
                    tuneLength = 10,
                    trControl = fitControl,
                    metric = "ROC",
                    na.action = na.pass)
stopCluster(cl)

# print(cart$results)
```


#### Test Performance
```{r echo=FALSE}
# test prediction
test.pred <- predict(cart, newdata = test, type= 'raw', na.action = na.pass)
# test evaluation
confusionMatrix(test.pred, test$sla, positive='Yes')
```

The decision tree reaches 74.84% accuracy, compared with 64.17% for a majority-vote model,
a relative improvement of about 16.6% (10.67 percentage points).
The model struggles to correctly predict incidents that will not meet the SLA.
To improve on this, we next use a random forest, which is
a bagging (bootstrap aggregation) ensemble of decision trees.
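
The comparison with the majority-vote baseline is simple arithmetic on the two accuracies quoted above. The sketch below separates the absolute gain (in percentage points) from the relative gain; the variable names are illustrative.

```r
# Accuracies quoted in the text
baseline_acc <- 0.6417  # majority-vote model (always predicts "Yes")
tree_acc     <- 0.7484  # single decision tree on the test set

# Absolute gain in percentage points and relative gain over the baseline
abs_gain_pp  <- (tree_acc - baseline_acc) * 100
rel_gain_pct <- (tree_acc - baseline_acc) / baseline_acc * 100

cat(sprintf("absolute gain: %.2f points, relative gain: %.1f%%\n",
            abs_gain_pp, rel_gain_pct))
```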

\pagebreak

#### Feature Importance
```{r FEATURE IMPORTANCE, echo=FALSE, fig.height=8, fig.width=8, message=FALSE, warning=FALSE}
fx_var_imp(cart,1)
```

\pagebreak

#### Decision Tree Plot
```{r echo=FALSE, fig.height=8, fig.width=12, message=FALSE, warning=FALSE}
suppressMessages(library(rpart.plot))
# rpart.plot(cart$finalModel)
# rpart.plot(cart$finalModel, type=3, tweak = 1.2)
rpart.plot(cart$finalModel, yesno = 2, type = 0, extra = 0, tweak = 1.4)
```

```{r TRAIN PREDICTION, eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
####Train Performance
# train prediction
train.pred <- predict(cart, newdata = train, type= 'raw', na.action = na.pass)
# train evaluation
confusionMatrix(train.pred, train$sla, positive='Yes')
```

\pagebreak

#### Model Training - Random Forest
```{r Random Forest, echo=FALSE, message=FALSE, warning=FALSE}
# load the model
suppressMessages(library(ranger))
# rf <- readRDS("./rf_final_model.rds")
suppressMessages(library(doParallel))
cl <- makeCluster(detectCores())
registerDoParallel(cl)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           allowParallel = TRUE)

suppressMessages(library(ranger))
set.seed(123)
rf <- train(sla ~ . ,
            data = train,
            method = "ranger",
            metric = "ROC",
            tuneGrid=expand.grid(mtry=69,
                                 splitrule="extratrees", min.node.size=1),
            # tuneLength = 5,
            importance = "impurity",
            trControl = fitControl)
stopCluster(cl)
# print(rf$results)
# saveRDS(rf$finalModel, "./rf_final_model.rds")
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
# # train prediction
# train.pred <- predict(rf, train)
# # train evaluation
# confusionMatrix(train.pred, train$sla, positive='Yes')
# test prediction
test.pred <- predict(rf, newdata = test)
# test evaluation
confusionMatrix(test.pred, test$sla, positive='Yes')
```

```{r echo=FALSE, fig.height=8, fig.width=8, message=FALSE, warning=FALSE}
fx_var_imp(rf, 10)
```

With the random forest model, accuracy on the test set increased to 77.56%,
compared with 74.84% for a single decision tree.

More importantly, the RF model does a better job of predicting the incidents that will not meet the SLA.
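
The headline metrics can be recomputed by hand from the four cell counts in the rendered random forest confusion matrix (790 and 342 in the predicted "No" row, 409 and 1805 in the predicted "Yes" row). The sketch below does so with "Yes" as the positive class; the variable names are illustrative.

```r
# Test-set confusion matrix for the random forest (rows: prediction, cols: reference)
cm <- matrix(c(790, 409, 342, 1805), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # share of met-SLA tickets caught
specificity <- cm["No", "No"] / sum(cm[, "No"])     # share of missed-SLA tickets caught
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

Specificity is the number to watch here: it is the share of incidents that missed the SLA and were correctly flagged, which is the harder class for both models.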
















