June 12, 2016

Objective

The objective of this presentation is to show an example of a machine learning algorithm that predicts the total incident handle time for a Security Operations Center (SOC) ticketing dataset.

The total handle time for an incident is the sum of the time needed to create the incident, follow up on it (transfer, escalate, close) and resolve it.

The dataset used for the analysis is a sample 2014 ArcSight dataset for a major bank in Europe. All client-related information and event descriptions have been redacted for confidentiality purposes.

Introduction to dataset

The original dataset has 231 records and 36 variables as shown below

 [1] "Title"                          "Incident.Handler"              
 [3] "Category"                       "Threat"                        
 [5] "Arcsight.Event"                 "Arcsight.event.found.timestamp"
 [7] "Arcsight.event.Date"            "Arcsight.Event.Time"           
 [9] "Arcsight.Event.Month"           "Incident.Number"               
[11] "Incident.Created"               "Incident.Followup.Date"        
[13] "Incident.Followup.Time"         "Incident.Event.Month"          
[15] "Ticket.creation.delta"          "Resolution.Date"               
[17] "Resolution.Time"                "Incident.Resolved"             
[19] "Incident.Resolution.delta"      "methode"                       
[21] "status"                         "Result"                        
[23] "DitisResolved"                  "DitisNotResolved"              
[25] "Actions"                        "Summary"                       
[27] "MAAND"                          "JAAR"                          
[29] "TIME.TO.TICKET..DAYS."          "TIME.TO.RESOLUTION..DAYS."     
[31] "TOTAL.DAYS"                     "METHOD.GROUPING"               
[33] "PERFORMANCE"                    "STATUS.AT.REPORTING.DATE"      
[35] "PERFORMANCE..P.1."              "STATUS.AT.REPORTING.DATE..P.1."

First row of the dataset

The data quality is poor, so the dataset has to be cleaned and processed before modeling. Here is the first record in the dataset.

                                             Title Incident.Handler
1 SOC37 217 New Public DMZ Source Address Detected                 
  Category Threat Arcsight.Event Arcsight.event.found.timestamp
1     Misc    Low    59211328033                2/01/14 6:03 AM
  Arcsight.event.Date Arcsight.Event.Time Arcsight.Event.Month
1    2/01/14 12:00 AM             6:03:22         januari-2014
  Incident.Number Incident.Created Incident.Followup.Date
1           soc37 2/01/14 10:29 AM       2/01/14 12:00 AM
  Incident.Followup.Time Incident.Event.Month Ticket.creation.delta
1       0/01/00 10:29 AM         januari-2014               25:38.0
   Resolution.Date Resolution.Time Incident.Resolved
1 2/01/14 12:00 AM   0/01/00 10:29  2/01/14 10:29 AM
  Incident.Resolution.delta methode status           Result DitisResolved
1                   00:01.0     SOC Closed Work in Progress             0
  DitisNotResolved
1                0
                                                      Actions Summary
1 Ask GPS for more information over these unknown IP-adresses        
  MAAND JAAR TIME.TO.TICKET..DAYS. TIME.TO.RESOLUTION..DAYS. TOTAL.DAYS
1     1 2014                 0.18                          0       0.18
  METHOD.GROUPING PERFORMANCE STATUS.AT.REPORTING.DATE PERFORMANCE..P.1.
1             SOC     IN TIME          UNRESOLVED OPEN           IN TIME
  STATUS.AT.REPORTING.DATE..P.1.
1                UNRESOLVED OPEN

Data Cleaning

  1. Loading required libraries
  2. Removing all extraneous variables
  3. Data Cleaning
  4. Transforming data types
  5. Creating new variables to calculate event recording time, incident creation time, incident followup time and incident resolve time
  6. Imputing or deleting NAs
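
As a minimal sketch of step 5, the incident creation time for the first record can be derived from its two raw timestamps. The day/month/year ordering is an assumption inferred from the "januari-2014" month field, and the variable names are illustrative, not necessarily those used in the original analysis.

```r
# Sketch only: the timestamp format (day/month/year, 12-hour clock) is assumed.
invisible(Sys.setlocale("LC_TIME", "C"))  # make %p (AM/PM) parsing locale-independent
fmt <- "%d/%m/%y %I:%M %p"

# Raw values taken from the first record shown above
event.found      <- as.POSIXct("2/01/14 6:03 AM",  format = fmt, tz = "UTC")
incident.created <- as.POSIXct("2/01/14 10:29 AM", format = fmt, tz = "UTC")

# Time (in days) between the event being found and the incident being created
time.event.create <- round(as.numeric(difftime(incident.created, event.found,
                                               units = "days")), 3)

# Hour of the day (24-hour clock) when the event occurred
event.hr <- format(event.found, "%H")
```

The gap of 4 hours 26 minutes gives 0.185 days and an event hour of "06".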

The first three rows of the transformed dataset are shown below

  category threat method.grouping event.hr time.event.create
1     Misc    Low             SOC       06             0.185
2     ddos    Low             SOC       05             0.113
3     ddos    Low             SOC       08             0.003
  total.handle.time
1             0.185
2             0.321
3             0.050

Variable Explanation

The transformed dataset has the following variables:

  • category: Event/ Incident Category
    • levels: Anti-Virus, AOL, Data Leakage, ddos, Hacking, Malware, Misc, Privilige escalation
  • threat: Threat type
    • levels: high, Low, Medium
  • method.grouping: Handled by internal staff (SOC) or Supplier
  • event.hr: Hour of the day when the event occurred
  • time.event.create: Time (in days) to create the incident after the event was first noticed
  • total.handle.time: Time (in days) to handle the event, which includes creation, followup and resolve time

Variables Summary

Summary of the dataset is as follows:

                 category      threat    method.grouping    event.hr 
 Malware             :116   high  :  1   SOC     : 37    14     :42  
 Misc                : 34   Low   :164   SUPPLIER:156    09     :23  
 Hacking             : 17   Medium: 28                   10     :23  
 Privilige escalation: 16                                15     :17  
 AOL                 :  4                                08     :16  
 Anti-Virus          :  3                                12     :11  
 (Other)             :  3                                (Other):61  
 time.event.create total.handle.time
 Min.   : 0.0000   Min.   : 0.001   
 1st Qu.: 0.0010   1st Qu.: 1.182   
 Median : 0.0170   Median : 5.150   
 Mean   : 0.2947   Mean   : 8.824   
 3rd Qu.: 0.0350   3rd Qu.:10.779   
 Max.   :10.2920   Max.   :45.922   
                                    

Data Exploration and Visualization 1

First, we look at the distribution of total handle time by group and total events by category

Data Exploration and Visualization 2

Next, we look at the distribution of the total handle time by category

  • Mean: 8.8 days
  • Median: 5.2 days
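
Summary figures of this kind, overall and per category, can be computed in base R. The sketch below uses only the three transformed rows shown earlier as stand-in data, since the full dataset is not reproduced here; the data frame name `soc` is hypothetical.

```r
# Stand-in data: the three transformed rows shown earlier, not the full dataset.
soc <- data.frame(
  category          = factor(c("Misc", "ddos", "ddos")),
  total.handle.time = c(0.185, 0.321, 0.050)
)

# Overall mean and median total handle time (in days)
overall.mean   <- mean(soc$total.handle.time)
overall.median <- median(soc$total.handle.time)

# Mean and median total handle time (in days) per category
category.mean   <- tapply(soc$total.handle.time, soc$category, mean)
category.median <- tapply(soc$total.handle.time, soc$category, median)
```

A mean well above the median, as in the full dataset (8.8 versus 5.2 days), indicates a right-skewed distribution with a few long-running incidents.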

Data Exploration and Visualization 3

Finally, we look at the distribution of events by the hour at which they occurred (24-hour clock)

Prediction Models

We use the following predictors and response variable for the machine learning model

  1. Predictors: Category, Threat type, Handling Group, Event hour, and Time for Event Creation

  2. Response Variable: Total Handle Time

The regression algorithm used for prediction is 'Random Forest', and resampling is performed using repeated cross-validation.

Random Forest Regression Algorithm

  • Step 1: Partition data into training and testing sets
  • Step 2: Preprocess data by centering and scaling
  • Step 3: Tune the model using the settings below

The model is tuned with 10-fold cross-validation repeated 2 times, and each forest is grown with 500 regression trees. The tuning parameter searched is mtry, the number of predictors sampled at each split.

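library(caret)

# Keep 60% of the records for training and hold out the rest for testing
ID = createDataPartition(abnsoc1$total.handle.time, p = 0.6, list = FALSE)

trainsoc = abnsoc1[ID, ]
testsoc = abnsoc1[-ID, ]

# 10-fold cross-validation, repeated 2 times
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 2)
# ntree sets the number of trees passed on to randomForest()
rftrain = train(total.handle.time ~ ., data = trainsoc, method = "rf", trControl = ctrl,
    ntree = 500, preProcess = c("center", "scale"), verbose = FALSE)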
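
To make the resampling scheme concrete, here is a simplified base R sketch of 10-fold cross-validation repeated 2 times. It assumes 117 training rows, the sample count reported in the model output, and assigns folds purely at random; caret's own fold construction also balances folds on the outcome, so its exact assignments will differ.

```r
# Simplified sketch of 10-fold cross-validation repeated 2 times.
# Plain random fold assignment is used here for illustration only.
set.seed(2016)        # arbitrary seed, for reproducibility of the sketch only
n       <- 117        # training rows reported in the model output
k       <- 10
repeats <- 2

train.sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
  n - tabulate(fold, nbins = k)                    # rows left for fitting
}))
```

This yields 20 resamples, each fitted on 105 or 106 rows while the remaining 11 or 12 rows are held out for validation.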

Machine Learning Results

The Random Forest model achieves its best cross-validated performance at mtry = 16, with an R-squared of 45.79% and an RMSE of 7.98 days. The results of the modelling are shown below

Random Forest 

117 samples
  5 predictor

Pre-processing: centered (30), scaled (30) 
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 106, 105, 105, 105, 105, 106, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared 
   2    8.689125  0.4189035
  16    7.977219  0.4579212
  30    8.315141  0.4552052

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 16. 
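
The output reports 5 predictors but 30 centered and scaled columns. This is expected with the formula interface of train(), which expands each factor into indicator (dummy) columns before pre-processing, so mtry ranges over the expanded columns. The sketch below illustrates the expansion with base R on made-up data; the `toy` data frame is hypothetical and much smaller than the real one.

```r
# Hypothetical toy data, used only to show how factors expand into columns.
toy <- data.frame(
  category          = factor(c("Malware", "Misc", "Hacking", "Malware")),
  threat            = factor(c("Low", "Medium", "Low", "Low")),
  time.event.create = c(0.185, 0.113, 0.003, 0.017)
)

# model.matrix() creates one indicator column per non-reference factor level
x <- model.matrix(~ ., data = toy)[, -1]  # drop the intercept column

# (3 - 1) category columns + (2 - 1) threat column + 1 numeric column
n.columns <- ncol(x)
```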

Prediction using the ML model

We now predict the response variable in the test set using the random forest model.
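
With caret, predictions for the held-out rows would typically come from predict(rftrain, newdata = testsoc). Since the fitted model and data are not reproduced here, the sketch below uses made-up observed and predicted values to show how test-set RMSE and R-squared can be computed in base R. The helper names are illustrative, and R-squared is taken as the squared correlation, which is an assumption about how it should be defined here.

```r
# Made-up values for illustration; in the analysis these would be
#   obs  <- testsoc$total.handle.time
#   pred <- predict(rftrain, newdata = testsoc)
obs  <- c(1.0, 2.0, 3.0, 4.0)
pred <- c(1.5, 2.0, 2.5, 4.5)

# Root mean squared error, in the units of the response (days)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared as the squared correlation between observed and predicted values
r.squared <- function(obs, pred) cor(obs, pred)^2

test.rmse <- rmse(obs, pred)
test.r2   <- r.squared(obs, pred)
```

Comparing these test-set figures with the cross-validated ones gives a check on whether the model generalises beyond the training rows.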

Conclusions and Next Steps

We conclude that it is possible to predict the total handle time for a security incident with moderate accuracy: the best model explains about 46% of the variance in cross-validation. The predictive capabilities largely depend on the data quality, number of predictors and model tuning.

The cycle time prediction will be very helpful in SOC Optimization and efficiency studies. It will also impact the incident response capabilities and provide important parameters for incident forensics.

As next steps, the following are proposed

  1. Obtain sample Ticketing database from other clients
  2. Increase number of predictor variables for model building
  3. Increase sampling and cross validation
  4. Build and test models with other machine learning algorithms such as Support Vector Machines and Neural Networks
  5. Compare model performance for accuracy