Purpose
Build a model to predict if an incident will be resolved within the SLA time period.
Training set
The training set includes closed tickets from 2017 and 2018.
Data exploration
Visualize and explore the data to highlight its main characteristics (distribution, variation, and relationships) and to pinpoint data quality issues.
An incident represents an IT service task, and “ticket resolution time task” is the net time from ticket creation until resolution
(excluding weekends, pending time, after-business hours, etc.).
For modeling purposes, we use “total_resolution_time”, the gross time from ticket creation until resolution
(including weekends, pending time, after-business hours, etc.).
We do not use the net time because we do not know how it is calculated and, more importantly, the data used for that calculation is not available.
The “meets SLA” outcome is calculated against a 24-hour threshold.
In general this is a sound approach, because our purpose is to identify the underlying factors that determine the variability in the resolution time.
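The labeling rule described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the `resolved_at` argument name and the choice to treat exactly 24 hours as meeting the SLA are assumptions.

```python
from datetime import datetime

SLA_THRESHOLD_HOURS = 24  # threshold stated in the text


def total_resolution_hours(opened_at: datetime, resolved_at: datetime) -> float:
    """Gross elapsed hours from creation to resolution (weekends etc. included)."""
    return (resolved_at - opened_at).total_seconds() / 3600.0


def meets_sla(opened_at: datetime, resolved_at: datetime,
              threshold_hours: float = SLA_THRESHOLD_HOURS) -> bool:
    """True when the gross resolution time is within the threshold.

    Assumption: a ticket resolved in exactly `threshold_hours` counts as met.
    """
    return total_resolution_hours(opened_at, resolved_at) <= threshold_hours
```

Because the gross time is used, a ticket opened on Friday morning and resolved on Monday morning counts as 72 hours and misses the SLA even though little business time elapsed.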
Resolution time distribution

Plot #1 - Shows the complete distribution. What looks unusual is the long tail, made up of tickets that took more than 5 days to resolve.
Plot #2 - Shows the distribution of tickets that took less than 5 days.
The distribution appears to be made up of two or more individual distributions (i.e. the distribution has multiple peaks or modes).
Analyzing the multimodal distribution

Analyzing the long tail

We see that in 2018 there were more long-tail incidents on a monthly basis, except for Nov-18.

Looking at the distribution across the top 20 subcategories reveals that the Application and software category dominates the long tail.
SLA rate

64% of the incidents were resolved within 24 hours.
Analysis approach
The major factors influencing the variability of the resolution time can be grouped as follows.
Incident Type - the nature of the incident and the area it belongs to.
Complexity - how complex the problem at hand is.
Urgency - how urgent it is to resolve the problem.
Performance - relates to the performance of the assignee and the assignment group.
Flow - workload, pending tickets, tickets in process, group transitions, etc.
Temporal - for example, day of the week and hour of the day.
Complexity
Incident complexity is inferred using text analytics. The short description was processed, and two features were engineered in addition to the number of words in the description. The first tries to assess the topic of the incident, that is, what it is about. The second tries to assess how difficult it is.
The first plot shows the most frequent words across all incident descriptions.

The second plot shows the sentiment analysis, which tags positive and negative words.

Difficulty
Based on negative sentiment tagging, we can infer how difficult the incident at hand is.
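One way to picture the two text features (word count and a negative-word difficulty score) is the sketch below. The word list is a hypothetical stand-in, since the report does not name the sentiment lexicon it used, and counting negative hits is only one plausible way to turn the tags into a score.

```python
import re

# Hypothetical negative-word lexicon; the report does not list the words it used.
NEGATIVE_WORDS = {"error", "fail", "failed", "unable", "issue", "crash", "broken"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def description_features(short_description: str) -> dict:
    """Return the word count and the number of negative-tagged words."""
    tokens = tokenize(short_description)
    negative_hits = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return {"n_words": len(tokens), "difficulty": negative_hits}
```

For example, `description_features("Unable to login, error on startup")` yields six words with two of them tagged negative.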

Case type

Case types with 1K incidents or more don’t vary much; Bug cases are the exception. This means that case_type is not going to be a strong predictor.
Short Description - Number of words
A longer description indicates a more complex problem that takes more time to resolve.

The assumption is supported by the data.

We can clearly see a pattern: the median grows incrementally as the number of words increases, but there are many outliers.
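The median-per-word-count summary behind this observation could be computed along these lines. This is a sketch under assumptions: the input shape, a sequence of (word count, resolution hours) pairs, is hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_hours_by_word_count(tickets):
    """Group tickets by description word count and take the median hours.

    `tickets` is an iterable of (n_words, resolution_hours) pairs
    (a hypothetical input shape chosen for illustration).
    """
    groups = defaultdict(list)
    for n_words, hours in tickets:
        groups[n_words].append(hours)
    # Sort by word count so the trend can be read top to bottom.
    return {n: median(values) for n, values in sorted(groups.items())}
```

The median is a reasonable summary here because, as noted, each group contains many outliers that would distort a mean.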
Subcategory

Subcategory w/confidence interval

For each subcategory, the SLA rate is calculated and a 95% confidence interval is shown to indicate how much noise surrounds this value. We can clearly see which subcategories deviate by more than random variation would explain, and the width of the error bars helps the reader understand how much each number should be trusted.
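The report does not say which interval method was used. As an illustration only, the sketch below uses the normal-approximation (Wald) interval for a proportion, p ± z·sqrt(p(1−p)/n), with z = 1.96 for 95% coverage; the function name is hypothetical.

```python
import math


def sla_rate_ci(n_met: int, n_total: int, z: float = 1.96):
    """SLA rate with a normal-approximation (Wald) confidence interval.

    Assumption: the Wald interval stands in for whatever method the
    report actually used. Returns (rate, lower, upper), clipped to [0, 1].
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    rate = n_met / n_total
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_total)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)
```

With the same rate, a subcategory with ten times as many tickets gets a much narrower interval, which is why the error-bar width signals how far each number can be trusted.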
Group type

Service desk teams have the highest SLA rate, around 76%; next are the support teams with an SLA rate of 64%, and last are the application teams with 55%. The right side of the plot shows the median resolution time per group type; the results match expectations.
Originator Group
Originator group is the first group that handled the ticket, and in many cases the one that resolved the ticket.

Region of assignment group

Temporal

Tickets opened on a Friday or Saturday have a lower chance of being resolved on time.
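A per-weekday SLA rate of this kind could be derived from `opened_at` roughly as follows. The input shape, pairs of (opened_at, met_sla), is an assumption made for the sketch.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def sla_rate_by_weekday(tickets):
    """Share of tickets meeting the SLA for each opening weekday.

    `tickets` is an iterable of (opened_at, met_sla) pairs
    (a hypothetical input shape chosen for illustration).
    """
    met = defaultdict(int)
    total = defaultdict(int)
    for opened_at, met_sla in tickets:
        day = DAY_NAMES[opened_at.weekday()]  # weekday(): Monday is 0
        total[day] += 1
        met[day] += 1 if met_sla else 0
    # Keep calendar order and skip days with no tickets.
    return {day: met[day] / total[day] for day in DAY_NAMES if total[day]}
```

Since the target is based on gross time, tickets opened just before the weekend lose most of their 24-hour window to non-working time, which is consistent with the lower rates seen for Friday and Saturday.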
Data Splitting
The training data is split randomly 90/10: 90% is used for training and 10% is held out for testing.
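A random 90/10 split can be sketched as below. The fixed seed and the function name are assumptions added for reproducibility of the illustration; the report does not state how its split was seeded.

```python
import random


def train_test_split(rows, test_fraction=0.10, seed=42):
    """Randomly hold out `test_fraction` of the rows for testing.

    Assumption: a fixed seed is used so the split is repeatable.
    Returns (train_rows, test_rows), each in original row order.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test
```

Every row lands in exactly one of the two sets, so the test-set metrics are computed on tickets the model never saw during training.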
Feature Engineering
The features that go into the model:
- Incident Type
  - topic (text analytics)
  - subcategory
  - case type
- Complexity
  - difficulty (text analytics)
  - words in description (text analytics)
- Urgency
- Performance
  - grpN - assignment group binned number of tickets resolved.
  - grpT - assignment group binned median resolution time.
  - userN - assignee binned number of tickets resolved.
  - userT - assignee binned median resolution time.
- Flow
  - assignment_group_change indicator
  - assigned_to_change indicator
- Temporal
  - opened_at day of the week.
- Workforce
  - assignment group
  - group type
  - originator group
  - department
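As an illustration of the binned performance features, the sketch below derives something like grpT (assignment group binned median resolution time). The bin edges, labels, and input shape are all hypothetical; the report does not state the ones it used.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median

# Hypothetical bin edges (hours) and labels; not taken from the report.
BIN_EDGES_HOURS = [8, 24, 72]
BIN_LABELS = ["<8h", "8-24h", "24-72h", ">=72h"]


def binned_group_median(tickets, edges=BIN_EDGES_HOURS, labels=BIN_LABELS):
    """Map each assignment group to the bin of its median resolution time.

    `tickets` is an iterable of (assignment_group, resolution_hours) pairs
    (a hypothetical input shape chosen for illustration).
    """
    hours_by_group = defaultdict(list)
    for group, hours in tickets:
        hours_by_group[group].append(hours)
    # bisect_right finds the bin index: values equal to an edge go to the upper bin.
    return {
        group: labels[bisect_right(edges, median(hours))]
        for group, hours in hours_by_group.items()
    }
```

The same pattern would apply to userT by keying on the assignee, and to grpN and userN by binning a ticket count instead of a median.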
Model Training - Decision Tree
Feature Importance

Decision Tree Plot

Model Training - Random Forest
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   790  342
       Yes  409 1805

               Accuracy : 0.7756
                 95% CI : (0.761, 0.7896)
    No Information Rate : 0.6417
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.5058
 Mcnemar's Test P-Value : 0.01602

            Sensitivity : 0.8407
            Specificity : 0.6589
         Pos Pred Value : 0.8153
         Neg Pred Value : 0.6979
             Prevalence : 0.6417
         Detection Rate : 0.5395
   Detection Prevalence : 0.6617
      Balanced Accuracy : 0.7498

       'Positive' Class : Yes

Using the Random Forest model, we increased the accuracy on the test set to 77.56%, compared with 74.84% for a single decision tree.
More importantly, the RF model does a better job of predicting the incidents that will not meet the SLA.
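The statistics reported for the Random Forest can be recomputed from the four confusion-matrix counts, which is a useful sanity check. The sketch below uses the standard definitions with "Yes" as the positive class; the function name is hypothetical.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of actual "Yes" predicted "Yes"
    specificity = tn / (tn + fp)  # share of actual "No" predicted "No"
    # Agreement expected by chance, from the predicted and actual marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Counts from the confusion matrix in this report (positive class: "Yes").
rf_metrics = confusion_metrics(tp=1805, fp=409, fn=342, tn=790)
```

Specificity and negative predictive value are the figures that describe how well incidents that will not meet the SLA are identified, so they are the ones to compare between the two models for that claim.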
