Finding patterns and a model for intrusion data

Abstract

Financial security is always a concern, especially for online systems, which are vulnerable to money siphoning. Banks process many online microtransactions, creating a two-fold challenge: first, to distinguish between benign activity and intrusions, and second, to distinguish between varying types of intrusions through pattern analysis. The general task was to develop a detection system. This project focuses on the first challenge, distinguishing between benign and anomalous activity, and demonstrates the value of decision tree models, combined with randomization processes, for detecting network intrusions.

Practical considerations include, but are not limited to, interpretability, scalability, real-time monitoring capability, the cost of disrupting general client activity versus the cost of fraudulent activity, and the cost of implementing recommendations. When the goal is preventing theft, it is preferable to stop a theft even at the risk of occasionally disrupting legitimate financial activity.

Introduction

The study involved 3000 records of intrusion data with 22 potentially related attributes, of which 6 contained only a single value and so offered no information. The log contained 300 known intrusions (a rate of 10%). Through a combination of symbolic model analysis, random sampling, and cross validation, the data set was used to identify, develop, and test a model.

An important initial consideration is to distinguish between features that show a high degree of entropy in benign activity and features that show high entropy in intrusion activity. High entropy indicates inconsistency in a feature's values, and studying that inconsistency more deeply can reveal insights and patterns.

Identifying the top influential attributes creates a starting point to delineate which attributes might be important or have correlation value, and equally important, also which are least likely to be helpful.

As it turns out, the top features with high entropy are data from both source to destination and destination to source, service type, flag type, and duration.
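
The entropy idea above can be illustrated with a minimal sketch. It assumes Shannon entropy computed on the observed values of a single feature. The two toy columns are hypothetical placeholders, not data from the study, and the original analysis may have computed entropy differently.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # p * log2(1 / p) summed over every distinct value
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical toy columns (not taken from the study's data).
constant_feature = ["tcp"] * 8  # a single repeated value
mixed_feature = ["http", "ftp", "http", "smtp", "http", "ftp", "smtp", "http"]

print(shannon_entropy(constant_feature))  # single-valued feature: no entropy
print(shannon_entropy(mixed_feature))     # varied feature: higher entropy
```

A feature that holds only one value scores zero, which is consistent with setting aside the 6 single-valued attributes mentioned earlier.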

Conceptually, entropy can be expanded upon by relating features to each other. Studying that relationship and calculating information gain yields a measuring system that scores, on a scale from 0 (no correlation) to 1, how much information can be gained from an attribute or a combination of attributes.

This gives us a combinatorial and more complex view of the potential relationship between the features.
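
As a rough sketch of information gain, the snippet below measures how much the entropy of the intrusion label drops after splitting on one categorical feature. With a two-class label and entropy in bits, the score falls between 0 and 1. The feature names and values (`flag`, `service`) are hypothetical toy data, and the study's own scoring may have been normalized differently.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Label entropy before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 1 = intrusion, 0 = benign.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
flag = ["S0", "S0", "SF", "SF", "SF", "SF", "SF", "SF"]              # separates the classes
service = ["http", "ftp", "http", "ftp", "http", "ftp", "http", "ftp"]  # does not

print(information_gain(flag, labels))
print(information_gain(service, labels))
```

In this toy case the first feature recovers all of the label's entropy, while the second recovers essentially none.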

The data was split into two categories, classified between intrusion data and benign data. The generated output ranked and scored the differing features. From this, it became easy to see where variability might be expected, and find where both intrusions and benign transactions might share a common variability, versus features which only existed in the benign patterns.

Using the apriori function, the features compared as follows:

Several features showed entropy in benign activity that did not occur at all in intrusion activity. This supported the strategy of limiting the number of features used in predictive models. It is generally ideal to limit features to the minimum number that can achieve maximum performance, both for computational purposes and for interpretability.

This can be taken to a much deeper level. Decision trees use logical (yes or no) branching to classify a category and can delve into the attribute characteristics by exploring the full nature of a given arity.

Choosing a Model - The Varying nature of intrusion characteristics

Many industries expect to see logistic regression, and perhaps this is also what one might be expecting for a solution. Logistic regression is simple, fast, easy to calculate, and easy to interpret. It is particularly popular in the medical industry.

This particular data set has nuances, however: the features that delineate the classes range across several possible combinations of attribute values. When fitting a logistic regression, this information is lost.

As can be seen visually from the Residuals vs Leverage graph below, much of the data overlaps across attributes and becomes difficult to classify through logistic regression: most points show low leverage (little individual influence on the fit) alongside high residuals (unexplained activity).

Thus it seemed natural to proceed with a decision tree, which allowed exploration of the deeper feature nuances by changing continuous variables to ranges and categorizing and classifying categorical data.
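
To make "changing continuous variables to ranges" concrete, here is a small sketch that maps a numeric value to a labeled interval. The cut points are hypothetical and were not taken from the study. In practice, a tree algorithm chooses its own split points from the data.

```python
from bisect import bisect_right

def to_range_label(value, edges):
    """Return a label for the half-open interval [low, high) that contains `value`."""
    i = bisect_right(edges, value)
    low = "-inf" if i == 0 else str(edges[i - 1])
    high = "inf" if i == len(edges) else str(edges[i])
    return f"[{low}, {high})"

# Hypothetical cut points for a duration-like feature (units assumed).
duration_edges = [1, 10, 60]

for duration in (0, 10, 75):
    print(duration, to_range_label(duration, duration_edges))
```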

There are several considerations taken into account when designing a tree.

Underfitting - using too few features to map the model

Overfitting - using too many features to map the model

Smaller trees are easier to calculate and reduce risk of overfitting

Smaller trees are much easier to interpret

Method

It quickly became evident that the maximum depth of the tree would strongly influence the model. Various other branching and algorithm criteria were also tweaked and tested but did not show significant influence. Through a strategy called bootstrapping, random samples were constructed by resampling the data (with replacement, as bootstrapping does) to the same sample size, so that each sample has a different random distribution of records. This allows the pattern of dispersion to be studied on a wider scale.
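
A minimal sketch of the resampling step, assuming the standard bootstrap (drawing with replacement until the resample matches the original size). The records are placeholders that only mirror the 3000-row, 10% intrusion shape of the data set. The seed and the number of resamples are arbitrary choices for illustration.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) records uniformly at random, with replacement."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]

rng = random.Random(42)  # fixed seed, for repeatability only

# Placeholder records: 3000 rows, the first 300 marked as intrusions.
records = [{"id": i, "intrusion": int(i < 300)} for i in range(3000)]

resamples = [bootstrap_sample(records, rng) for _ in range(5)]
for sample in resamples:
    rate = sum(r["intrusion"] for r in sample) / len(sample)
    print(len(sample), round(rate, 3))  # same size, slightly different intrusion rate
```

Each resample keeps the original size, but its mix of records varies, so a tree can be fitted on each one to see how stable the chosen depth and splits are.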

Below is a subset of results.
CP (complexity parameter): measures the complexity of the model and is the setting we use to prune the tree.
CV (cross validation): a validation technique that estimates error on data held out from training, which helps detect underfitting and overfitting.
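
For readers unfamiliar with cross validation, the sketch below shows how rows can be divided into folds so that each fold is held out once while the model trains on the rest. The fold count of 10 and the seed are assumptions made for illustration. The source does not state how many folds were used.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Assumed setup: 3000 rows, 10 folds.
splits = list(k_fold_indices(3000, 10))
for train, validation in splits[:3]:
    print(len(train), len(validation))
```

Averaging the validation error across the folds gives the cross-validated error that is compared across candidate CP values.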

Finding the best tree model

The best tree depth seemed consistent with the original data: a depth of around 9-10, with a pruned depth matching a CP score of 0.010. Since the base data also reflected this result, this helped support the numerical criteria for the model. The lowest error, which hovered around 28%, also occurred in larger trees, but as depth went beyond 19, the error rate generally tended to increase.

Why all this testing?

The long-term goal of this task is to construct a robust model that addresses both the present problem and the issue of concept drift. Models which have more static results and cannot train on new data will not respond well to fluctuations in new data. Skilled hackers are often aware of the signature patterns of an intrusion, and will work to develop money siphoning methods which cannot be detected.

The Tree Model

As can be seen above, the detailed breakdown of the tree can take into account the various feature nuances and detail that created high levels of unanalyzed residuals in logistic regression.

Accuracy of findings with learning tree

Of course, we are also interested in the most important question.

What kind of results can we gain from this tree?

For this we plot a confusion matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2624    3
##          1   76  297
##                                                
##                Accuracy : 0.9737               
##                  95% CI : (0.9673, 0.9791)     
##     No Information Rate : 0.9                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.868                
##  Mcnemar's Test P-Value : 0.0000000000000005467
##                                                
##             Sensitivity : 0.9719               
##             Specificity : 0.9900               
##          Pos Pred Value : 0.9989               
##          Neg Pred Value : 0.7962               
##              Prevalence : 0.9000               
##          Detection Rate : 0.8747               
##    Detection Prevalence : 0.8757               
##       Balanced Accuracy : 0.9809               
##                                                
##        'Positive' Class : 0                    
##

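Treating an intrusion (class 1) as the event of interest: 297 = True Positives (intrusions caught), 2624 = True Negatives (benign activity passed), 76 = False Positives (benign activity flagged), 3 = False Negatives (intrusions missed). Note that the output above reports its statistics with class 0 (benign) as the 'Positive' class, so its Specificity (0.9900) is the share of intrusions caught.

The model errs on the side of raising false positives (false alarms), which carry a much lower cost than false negatives. A false negative equates to an undetected intrusion and the compromise of financial records and monies.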
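
As a worked check, the headline figures can be recomputed from the four counts in the matrix, treating an intrusion (class 1) as the event of interest. The variable names are introduced here for illustration. The counts are copied from the output above.

```python
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
benign_passed = 2624      # predicted 0, actual 0
intrusions_missed = 3     # predicted 0, actual 1
benign_flagged = 76       # predicted 1, actual 0
intrusions_caught = 297   # predicted 1, actual 1

total = benign_passed + intrusions_missed + benign_flagged + intrusions_caught

accuracy = (benign_passed + intrusions_caught) / total
recall = intrusions_caught / (intrusions_caught + intrusions_missed)   # share of intrusions caught
precision = intrusions_caught / (intrusions_caught + benign_flagged)   # share of alerts that are real
false_alarm_rate = benign_flagged / (benign_flagged + benign_passed)   # share of benign activity flagged

print(f"accuracy         {accuracy:.4f}")
print(f"recall           {recall:.4f}")
print(f"precision        {precision:.4f}")
print(f"false alarm rate {false_alarm_rate:.4f}")
```

The recall computed this way equals the Specificity reported above (0.9900), and the precision equals the reported Neg Pred Value (0.7962), because the output treats class 0 as positive. Roughly one alert in five is a false alarm, while only 3 of the 300 intrusions go undetected.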

Conclusion and Recommendations

The tree model functions with approximately 97% accuracy. It is designed so that it can be retrained on ongoing data to adjust to the drift that will happen as patterns change over time.

While the tree model is not as fast as logistic regression, it is still possible to create a system that can address real-time activity by applying thresholds developed by the decision tree model independently of the system that calculates the model. The tree model can run concurrent analysis and be retrained on new data over time, allowing it to adapt to new conditions. It can learn from future incidents, which might follow another pattern or involve features currently considered unimportant. Since this model will have a lead time for computation, the most recent thresholds can be implemented and then adjusted with each update.

Standard banking practice for fraud is to put transactions on a temporary hold that can be manually overridden. One potential recommendation is to use the existing customer service call center to manage ambiguous cases, since its staff are already trained in customer care. For large sums of money, higher security and alternative verification restrictions can be put in place, such as keys, identification checks, and holding funds for direct review by bank analysts.
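
One way to picture this separation is sketched below: the thresholds are exported from the trained tree as plain rules, and a lightweight checker applies them to each incoming event without touching the training system. The rule format, the feature names (`src_bytes`, `dst_bytes`, `duration`), and the cut-off values are all hypothetical placeholders, not thresholds from the study.

```python
import operator

# Hypothetical rule set, as it might be exported from a trained tree.
# Each rule is a list of (feature, operator, threshold) conditions that must all hold.
RULES = [
    [("src_bytes", ">", 50_000), ("duration", "<=", 2)],
    [("dst_bytes", ">", 200_000)],
]

OPS = {">": operator.gt, "<=": operator.le}

def is_flagged(event, rules=RULES):
    """Return True if the event satisfies every condition of at least one rule."""
    return any(
        all(OPS[op](event[feature], threshold) for feature, op, threshold in rule)
        for rule in rules
    )

# Illustrative events only.
print(is_flagged({"src_bytes": 60_000, "dst_bytes": 0, "duration": 1}))
print(is_flagged({"src_bytes": 100, "dst_bytes": 10, "duration": 1}))
```

Because the checker only reads a rule list, a retrained tree can publish a new list on its own schedule, and the real-time path picks up the latest thresholds at the next update.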

Further Consultation and Exploration

A data analyst can help guide these implementations both with technical setup and through educating staff and establishing robust internal controls to create best practice behaviors. This can potentially also be used in conjunction with financial experts. In particular, a bank may be interested in assigning internal auditors to work with data analysts to create methods to support compliance with IFRS, GAAP, and GAAS.

A strategy that might be useful to run in parallel, under the guidance of a data analyst team working in conjunction with domain experts, is to run a significantly larger data sample than the one used by the updating training model, to separately confirm the nature of the large data. This will allow larger amounts of data to train through the tree, and work towards even more refined solutions.

Effectively this means that decision trees will be used to set system thresholds and update over specified periods of time to evaluate changing parameters. As new hacking methods are developed, the patterns can be studied and the algorithm will be set to recalculate and update.

Further study is encouraged for both domain experts and consulting. Technology is always expanding boundaries, creating an ever-changing environment and new possibilities for fraud. With ongoing consultation, a data expert can guide this process and keep current with industry practices. While the present model was able to obtain significant accuracy from the subset of data tested, maintaining robust long-term predictions requires that experts perform ongoing studies and provide updates over time.

eschultz

Saturday, May 08, 2016