library(AnomalyDetection)
library(tidyverse)
The function AnomalyDetectionTs is called to detect one or more statistically significant anomalies in the input time series. The documentation of the function AnomalyDetectionTs, which can be seen by using the following command, details the input arguments and the output of the function AnomalyDetectionTs.
help(AnomalyDetectionTs)
The function AnomalyDetectionVec is called to detect one or more statistically significant anomalies in a vector of observations. The documentation of the function AnomalyDetectionVec, which can be seen by using the following command, details the input arguments and the output of the function AnomalyDetectionVec.
help(AnomalyDetectionVec)
data(raw_data)
AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both', plot=TRUE)
$anoms
$plot
# To detect only the anomalies in the last period, run the following:
#AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both',
#only_last=TRUE, plot=TRUE)
From the plot, we observe that the input time series experiences both positive and negative anomalies. Furthermore, many of the anomalies in the time series are local anomalies within the bounds of the time series’ seasonality (hence, cannot be detected using the traditional approaches). The anomalies detected using the proposed technique are annotated on the plot. In case the timestamps for the plot above were not available, anomaly detection could then carried out using the AnomalyDetectionVec function; specifically, one can use the following command:
#To detect only the anomalies in the last period, run the following:
AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both',
only_last=TRUE, plot=TRUE)
$anoms
$plot
#install.packages("anomalyDetection")
# we'll use several tidyverse packages for some common manipulations
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(anomalyDetection)
security_logs%>%head()
security_logs%>%dim()
[1] 300 10
To apply the statistical methods that we’ll see in the sections that follow, we employ the tabulated vector approach. This approach transforms the security log data into unique counts of data attributes based on pre-defined time blocks. Therefore, as each time block is generated, the categorical fields are separated by their levels and a count of occurrences for each level are recorded into a vector. All numerical fields, such as bytes in and bytes out, are recorded as a summation within the time block. The result is what we call a “state vector matrix”.
Thus, for our security_logs data we can create our state vector matrix based on our data being divided into 30 blocks of 10 observations each. What results is the summary of instances for each categorical level in our data for each time block. Consequently, row one represents the first time block and there were 2 instance of ASA as the Device_Product, 5 instances of Attempt as the Device_Action, etc.
tabulate_state_vector(security_logs, 10)%>%head
Some variables contain more than 50 levels. Only the 10 most popular levels of these variables will be tabulated.
tabulate_state_vector(security_logs, 10)%>%dim()
Some variables contain more than 50 levels. Only the 10 most popular levels of these variables will be tabulated.
[1] 30 54
Prior to proceeding with any multivariate statistical analyses we should inspect the state vector for multicollinearity, to avoid issues such as matrix singularity, rank deficiency, and strong correlation values, and remove any columns that pose an issue. We can use mc_adjust to handle issues with multi-collinearity by first removing any columns whose variance is close to or less than a minimum level of variance (min_var). Then, it removes linearly dependent columns. Finally, it removes any columns that have a high absolute correlation value equal to or greater than max_cor. In order to perform multivariate analysis, we need the number of blocks to exceed the number of features, so instead of blocks of size 10, we downsize to blocks of 6 observations each.
(state_vec <- security_logs %>%
tabulate_state_vector(6) %>%
mc_adjust())%>%head()
Some variables contain more than 50 levels. Only the 10 most popular levels of these variables will be tabulated.
By default, mc_adjust removes all columns that violate the variance, dependency, and correlation thresholds. Alternatively, we can use action = “select” as an argument, which provides interactivity where the user can select the variables that violate the correlation threshold that they would like to retain.
Multivariate Statistical Analyses
With our data adjusted to eliminate multicollinearity concerns, we can now proceed with multivariate analyses to identify anomalies. First we’ll use the mahalanobis_distance function to compare the distance between each observation by its distance from the data mean, independent of scale. This is computed as
\[\begin{equation*} MD=\sqrt{(x-\bar{x})^{\prime }C^{-1}(x-\bar{x}) } \end{equation*}\]where \(x\) is a vector of \(p\) observations ,\(x=(x_{1},\cdots,x_{p})\), \(\bar{x}\) is the vector of the data.\(C^{-1}\) is the inverse data covariance matrix.
Here, we include output = “both” to return both the Mahalanobis distance and the absolute breakdown distances and normalize = TRUE so that we can compare relative magnitudes across our data.
state_vec %>%
mahalanobis_distance("both", normalize = TRUE) %>%
as_tibble%>%head()
We can use this information in a modified heatmap visualization to identify outlier values across our security log attributes and time blocks. The larger and brighter the dot the more significant the outlier is and deserves attention.
state_vec %>%
mahalanobis_distance("both", normalize = TRUE) %>%
as_tibble %>%
mutate(Block = 1:n()) %>%
gather(Variable, BD, -c(MD, Block)) %>%
ggplot(aes(factor(Block), Variable, color = MD, size = BD)) +
geom_point()
This process could be streamlined by using the hmat command to create the histogram matrix. This command can take raw data, a state vector, or calculated Mahalanobis distances as input. Additional ggplot2 aesthetics can be added using the + operator. By default, hmat displays the top 20 most anomalous blocks. The user can change the number of blocks kept using the top argument. The arguments needed to feed into tabulate_state_vector, mc_adjust, and/or mahalanobis_distance can be entered directly into hmat.
state_vec %>%
hmat(input = "SV", top = 15, normalize = TRUE) +
ggtitle("Histogram Matrix of Anomalous Blocks") +
ylab(NULL)
We can build onto this concept with the bd_row command to identify which security log attributes in the data are driving the Mahalanobis distance. bd_row measures the relative contribution of each variable, \(x_{i}\) by computing
\[\begin{equation*} BD_{i}=\left| \frac{x-x_{i}}{\sqrt{C_{ii}}}\right| \end{equation*}\]where \(C_{ii}\) is the variance of \(x_{i}\) . Furthermore, bd_row will look at a specified row and rank-order the columns by those that are driving the Mahalanobis distance. For example, the plot above identified block 17 as having the largest Mahalanobis distance suggesting some abnormal activity may be occuring during that time block. We can drill down into that block and look at the top 10 security log attributes that are driving the Mahalanobis distance as these may be areas that require further investigation.
state_vec %>%
mahalanobis_distance("bd", normalize = TRUE) %>%
bd_row(17, 10)
India_BD Src_Port_80_BD Dst_Port_593_BD Dst_IP_151.194.233.198_BD
2.974807 2.201276 2.149510 1.880467
Firewall_BD NSP_BD Dst_IP_219.142.109.8_BD Dst_Port_25_BD
1.338744 1.326790 1.249429 1.161628
Dst_Port_20000_BD Dst_IP_145.114.4.203_BD
1.153349 1.141101
Next, we can use factor analysis by first exploring the factor loadings (correlations between the columns of the state vector matrix and the suggested factors) and then comparing the factor scores against one another for anomaly detection. Factor analysis is another dimensionality reduction technique designed to identify underlying structure of the data. Factor analysis relates the correlations between variables through a set of factors to link together seemingly unrelated variables. The basic factor analysis model is
horns_curve(state_vec)
[1] 3.29370781 2.93581292 2.67338722 2.46330670 2.27500658 2.10980325 1.95286230 1.81448616 1.68359582
[10] 1.56335312 1.45192933 1.34692929 1.24824704 1.15598576 1.06676989 0.98367725 0.90571483 0.83050534
[19] 0.76006371 0.69436116 0.63320769 0.57378665 0.51803174 0.46352579 0.41617785 0.36992497 0.32529757
[28] 0.28436864 0.24657092 0.21143750 0.17964230 0.15056034 0.12286925 0.09756747 0.07499168 0.05531090
[37] 0.03812798 0.02201472
Next, the dimensionality is determined by finding the eigenvalues of the correlation matrix of the state vector matrix and retaining only those factors whose eigenvalues are greater than or equal to those produced by horns_curve. We use factor_analysis to reduce the state vector matrix into resultant factors. The factor_analysis function generates a list containing five outputs:
fa_loadings: numerical matrix with the original factor loadings fa_scores: numerical matrix with the row scores for each factor fa_loadings_rotated: numerical matrix with the varimax rotated factor loadings fa_scores_rotated: numerical matrix with the row scores for each varimax rotated factor num_factors: numeric vector identifying the number of factors
state_vec %>%
horns_curve() %>%
factor_analysis(state_vec, hc_points = .) %>%
str
List of 5
$ fa_loadings : num [1:38, 1:11] 0.2618 -0.3515 0.3082 0.0393 -0.3154 ...
$ fa_scores : num [1:50, 1:11] 0.437 -0.139 -1.27 -0.599 -1.158 ...
$ fa_loadings_rotated: num [1:38, 1:11] 0.07471 -0.08339 0.17689 0.15318 -0.00127 ...
$ fa_scores_rotated : num [1:50, 1:11] -0.0196 0.717 -0.8017 1.8066 -2.1938 ...
$ num_factors : int 11
For easy access to these results we can use the factor_analysis_results parsing function. The factor_analysis_results will parse the results either by their list name or by location. For instance to extract the rotated factor scores you can use factor_analysis_results(data, results = fa_scores_rotated) or factor_analysis_results(data, results = 4) as demonstrated below.
state_vec %>%
horns_curve() %>%
factor_analysis(state_vec, hc_points = .) %>%
factor_analysis_results(4) %>%
as_tibble
Furthermore, Kaiser created the following evaluations of the score produced by the IFS as shown below:
In the .90s: Marvelous In the .80s: Meritorious In the .70s: Middling In the .60s: Mediocre In the .50s: Miserable < .50: Unacceptable Thus, to assess the quality of our factor analysis results we apply kaisers_index to the rotated factor loadings and as the results show below our output value of 0.702 suggests that our results are “middling”
state_vec %>%
horns_curve() %>%
factor_analysis(data = state_vec, hc_points = .) %>%
factor_analysis_results(fa_loadings_rotated) %>%
kaisers_index()
[1] 0.6236816
We can visualize the factor analysis results to show the correlation between the columns of the reduced state vector to the rotated factor loadings. Strong negative correlations are depicted as red while strong positive correlations are shown as blue. This helps to identify which factors are correlated with each security log data attribute. Furthermore, this helps to identify two or more security log data attributes that appear to have relationships with their occurrences. For example, this shows that Russia is highly correlated with IP address 223.70.128. If there is an abnormally large amount of instances with Russian occurrances this would be the logical IP address to start investigating.
fa_loadings <- state_vec %>%
horns_curve() %>%
factor_analysis(state_vec, hc_points = .) %>%
factor_analysis_results(fa_loadings_rotated)
row.names(fa_loadings) <- colnames(state_vec)
gplots::heatmap.2(fa_loadings, dendrogram = 'both', trace = 'none',
density.info = 'none', breaks = seq(-1, 1, by = .25),
col = RColorBrewer::brewer.pal(8, 'RdBu'))
fa_loadings
[,1] [,2] [,3] [,4] [,5] [,6]
ASA 0.074712751 0.558077053 0.074874875 -0.400549765 0.056826637 0.31984234
Attempt -0.083389094 -0.656580287 -0.321105361 -0.004359775 0.157384674 0.08181114
China 0.176886521 0.627322463 -0.222614268 -0.205552406 -0.163573763 0.30494597
Dst_IP_145.114.4.203 0.153180366 0.052548529 0.147572458 0.178733550 -0.173047679 0.05047874
Dst_IP_151.194.233.198 -0.001267740 -0.060345548 0.100690023 0.019440198 -0.077886042 0.02440611
Dst_IP_219.142.109.8 -0.085784433 -0.007439337 -0.701899670 -0.175357059 0.066499908 0.08036235
Dst_IP_32.73.26.223 -0.343468926 -0.040080955 0.382754549 0.038377798 -0.554163186 -0.10126363
Dst_IP_56.137.121.203 0.260590776 0.052500320 0.070160092 -0.058073841 0.697933607 -0.05237257
Dst_Port_20000 -0.080365427 -0.043308735 -0.084023862 -0.197040679 -0.057336939 0.01435919
Dst_Port_25 0.094035915 -0.574495844 -0.009177268 -0.073181633 0.029117481 -0.08622714
Dst_Port_593 -0.085498478 0.006229547 0.432664440 0.259691245 -0.215606091 0.23544323
Dst_Port_80 -0.020711620 0.712977704 -0.036917609 -0.184515673 0.141526794 -0.14518163
Dst_Port_90 0.068497346 -0.091158976 -0.233515313 0.201003563 0.063460028 0.01472312
ePO 0.088543276 -0.209732207 0.001655189 -0.033860750 -0.099533364 -0.06864136
Failure -0.269993220 0.463835799 0.111828435 0.009805842 0.182112312 -0.35314599
Firewall -0.088185864 0.009644928 -0.184012147 -0.210862847 -0.185959263 0.17796526
IBM -0.163287441 -0.350024859 0.016951472 -0.086826629 0.120448722 0.41660803
India 0.315955598 -0.220262193 -0.029271116 -0.117594641 -0.106308136 0.18367690
Juniper -0.003871076 -0.083974853 0.156812654 0.782305605 0.091768754 0.04957238
Korea 0.080561688 -0.289770310 0.039884183 -0.170458792 -0.044418657 -0.20851591
McAfee 0.151210296 -0.113337483 -0.045196958 -0.091028575 -0.054802748 -0.79463409
Netherlands -0.094968730 -0.181137079 -0.063104352 0.064583630 -0.158717813 -0.19726357
NSP 0.087360214 0.066277854 -0.051956005 -0.070458867 0.030320227 -0.82364090
Russia -0.041775246 -0.030694511 0.020663747 0.327645342 0.715719963 0.03293353
Src_IP_174.110.206.174 -0.136982483 -0.005429893 0.184144532 0.191039891 0.096947255 0.33203570
Src_IP_223.70.128.61 -0.160127793 -0.505024829 0.276295775 -0.230708245 0.080781479 0.15015573
Src_IP_227.12.127.87 -0.082088288 0.612100352 -0.180913696 0.379107453 0.083639434 -0.04922409
Src_IP_28.9.24.154 0.404201561 -0.059367011 -0.241199569 0.339057674 -0.144080381 -0.24292090
Src_IP_89.130.69.91 -0.074130508 -0.016087452 -0.004403669 -0.673570993 -0.093752433 -0.13279409
Src_Port_113 0.331524934 0.018463167 -0.563126560 0.021054419 -0.216746385 -0.19850649
Src_Port_135 -0.840744905 -0.075308644 0.046694999 -0.073229801 0.007165358 0.10069220
Src_Port_21 0.815754580 0.019925566 0.155868505 0.013150589 0.157644833 -0.12299228
Src_Port_25 -0.191111809 0.152161071 -0.240693905 0.114262212 0.006467906 0.08009817
Src_Port_80 0.134093362 -0.118667670 0.683098039 -0.073857770 0.066841384 0.11715165
Success 0.356908992 0.253757502 0.237639293 -0.004954589 -0.350356846 0.25994093
TCP 0.078703056 -0.068023828 0.032758980 -0.051958769 0.126437433 -0.05851152
United Kingdom -0.021241440 -0.038130539 0.317888818 -0.094456454 -0.027707351 -0.05537870
US -0.481853431 0.109357083 -0.065570659 0.231568759 -0.252439615 -0.08392967
[,7] [,8] [,9] [,10] [,11]
ASA -0.18061863 -0.10253798 -0.3938212663 -0.11149210 0.1846028472
Attempt -0.18261129 0.14667618 0.0163902648 -0.29029340 0.2100764341
China 0.12950385 0.08407339 -0.1229911578 0.04961429 0.0074162671
Dst_IP_145.114.4.203 0.66749817 0.04361502 -0.0575320917 0.08155650 -0.0004105488
Dst_IP_151.194.233.198 0.03683968 -0.16414488 0.0039125497 -0.79910309 0.1283627736
Dst_IP_219.142.109.8 -0.13756030 0.04832027 -0.0127626286 0.03188806 0.1312455554
Dst_IP_32.73.26.223 -0.35260152 0.01313100 -0.0411478056 0.29634311 -0.1207802901
Dst_IP_56.137.121.203 -0.20740302 0.05705179 0.1019062864 0.37658196 -0.1332929153
Dst_Port_20000 0.22707540 0.08924923 0.7829877143 -0.13353153 0.0218864264
Dst_Port_25 0.49319150 -0.04883513 -0.1638027170 0.01502020 -0.0971508378
Dst_Port_593 0.08866456 0.04526219 -0.3527070757 -0.27034936 0.0592187696
Dst_Port_80 -0.06391780 -0.11268859 0.0284919786 0.23773337 -0.0684704508
Dst_Port_90 -0.67198043 0.03814368 -0.2597491565 0.09650420 0.0890276791
ePO 0.07981780 0.81455519 -0.1543543816 0.14857401 0.1087285490
Failure 0.18945499 -0.09741250 -0.0453724168 -0.21441566 -0.3884192516
Firewall 0.05677638 -0.05805084 0.2548993729 0.27267921 -0.6107663199
IBM 0.14367894 -0.33643625 0.2341287168 -0.17975428 0.3506772714
India -0.29695735 0.19852267 0.3243343566 -0.48790705 -0.0821028061
Juniper -0.06643703 -0.07533229 -0.0143861072 -0.10790415 -0.0331381270
Korea 0.13632327 -0.08300399 -0.4505355259 0.18053498 -0.0549633287
McAfee 0.03973007 0.46830745 -0.0806734597 0.08280643 0.1464681429
Netherlands 0.38103567 -0.07880588 -0.1652474126 0.27222072 0.5322621702
NSP -0.02902049 -0.22602038 0.0518334734 -0.04414098 0.0635176365
Russia -0.17424589 -0.11584191 -0.1695909439 -0.06493106 -0.0965311925
Src_IP_174.110.206.174 0.40988009 0.02295497 0.1204583117 -0.13297799 -0.4858467077
Src_IP_223.70.128.61 -0.11261457 0.22998699 -0.3530200344 0.12404340 0.0858801369
Src_IP_227.12.127.87 0.15783074 0.25247100 0.0499797451 -0.13642073 0.2801676619
Src_IP_28.9.24.154 -0.23403475 -0.33538012 -0.0294993708 0.26429271 0.3054715139
Src_IP_89.130.69.91 -0.14536868 -0.14029143 0.2392336819 -0.16345211 -0.2700846665
Src_Port_113 0.07147133 -0.46147006 -0.2284840877 0.06915702 0.0783692280
Src_Port_135 -0.10200909 -0.04848508 0.0143917664 0.01744506 -0.0091190736
Src_Port_21 -0.04886891 -0.05692298 -0.1280138504 -0.04923531 -0.1653469526
Src_Port_25 -0.13085995 0.68780284 0.3097890709 -0.07333963 -0.0693189632
Src_Port_80 0.25095334 -0.16810934 0.0008504266 0.03449281 0.1675798853
Success 0.01090157 -0.06281397 0.0270252533 0.52629949 0.1557402112
TCP 0.07668302 -0.04014361 -0.1202202947 0.09412848 -0.6925808955
United Kingdom -0.08296780 -0.11544002 0.6127044777 0.27594259 -0.0706482064
US -0.11947823 0.13154423 -0.0257923463 -0.28033227 -0.2743071027
We can also visualize the rotated factor score plots to see which time blocks appear to be outliers and deserve closer attention.
state_vec %>%
horns_curve() %>%
factor_analysis(state_vec, hc_points = .) %>%
factor_analysis_results(fa_scores_rotated) %>%
as_tibble() %>%
mutate(Block = 1:n()) %>%
gather(Factor, Score, -Block) %>%
mutate(Absolute_Score = abs(Score)) %>%
ggplot(aes(Factor, Absolute_Score, label = Block)) +
geom_text() +
geom_boxplot(outlier.shape = NA)
To perform a principal components analysis we use principal_components which will create a list containing:
pca_sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix). pca_loadings: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). pca_rotated: the value of the rotated data (the centered, and scaled if requested, data multiplied by the rotation matrix) is returned. pca_center: the centering used. pca_scale: a logical response indicating whether scaling was used.
principal_components(state_vec) %>% str
List of 5
$ pca_sdev : num [1:38] 1.94 1.86 1.66 1.59 1.5 ...
$ pca_loadings: num [1:38, 1:38] 0.1735 -0.5479 0.2626 0.0674 -0.1162 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:38] "ASA" "Attempt" "China" "Dst_IP_145.114.4.203" ...
.. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
$ pca_rotated : num [1:50, 1:38] -0.0217 -3.7951 -1.1869 -1.296 -1.8197 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
$ pca_center : Named num [1:38] 0.98 2.08 0.9 1.12 1.26 1.34 1.1 1.18 1.16 1.26 ...
..- attr(*, "names")= chr [1:38] "ASA" "Attempt" "China" "Dst_IP_145.114.4.203" ...
$ pca_scale : logi FALSE
For easy access to these results we can use the principal_components_result parsing function. The principal_components_result will parse the results either by their list name or by location. For instance to extract the computed component scores as outlined in Eq. 8 you can use principal_components_result(data, results = pca_rotated) or principal_components_result(data, results = 3) as demonstrated below.
state_vec %>%
principal_components() %>%
principal_components_result(pca_rotated) %>%
as_tibble