This is the companion R Markdown document to the following presentations that were delivered in Winter 2015:

nbtcon 2014: “From Threat Intelligence to Defense Cleverness: A Data Science Approach”
SANS CTI Summit 2015: “From Threat Intelligence to Defense Cleverness: A Data Science Approach”

This markdown file calculates the outputs and charts used in the presentations from the available test data. It is also published on RPubs.

It should provide enough usage examples for the tools implemented in TIQ-test. Please review our GitHub repository page, report bugs, and suggest features!

Adding the TIQ-TEST functions

## Some limitations from not being an R package: Setting the Working directory
tiqtest.dir = file.path("..", "tiq-test")
current.dir = setwd(tiqtest.dir)
source("tiq-test.R")

## Setting the root data path to where it should be in this repo
.tiq.data.setRootPath(file.path(current.dir, "data"))

## INFO [2015-02-01 12:32:25 PST] pid=1551 tiq.data.setRootPath: Setting path to '/Users/alexcp/src/tiq-test-Winter2015/data'
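
The setup above relies on the fact that setwd() returns the directory it switched away from, which is why current.dir can still be used to locate the data folder. Here is a minimal base R sketch of that pattern; work.dir and data.path are made-up names for illustration, and tempdir() is only a stand-in for the tiq-test checkout:

```r
# Minimal sketch: setwd() invisibly returns the directory it switched away from.
work.dir <- tempdir()             # stand-in for the tiq-test checkout (assumption)
previous.dir <- setwd(work.dir)   # switch, remembering where we came from

# Paths built from previous.dir keep pointing at the original location,
# no matter where the session is working now.
data.path <- file.path(previous.dir, "data")

# Restore the original working directory when done.
setwd(previous.dir)
```

Restoring the directory at the end is optional here, but it keeps relative paths in later chunks predictable.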

Accessing the data using TIQ-TEST

We have roughly 2 months of data available on this public dataset:

print(tiq.data.getAvailableDates("raw", "public_outbound"))

##  [1] "20141001" "20141002" "20141003" "20141004" "20141005" "20141006"
##  [7] "20141007" "20141008" "20141009" "20141010" "20141011" "20141012"
## [13] "20141013" "20141014" "20141015" "20141016" "20141017" "20141018"
## [19] "20141019" "20141020" "20141021" "20141022" "20141023" "20141024"
## [25] "20141025" "20141026" "20141027" "20141028" "20141029" "20141030"
## [31] "20141031" "20141101" "20141102" "20141103" "20141104" "20141105"
## [37] "20141106" "20141107" "20141108" "20141109" "20141110" "20141111"
## [43] "20141112" "20141113" "20141114" "20141115" "20141116" "20141117"
## [49] "20141118" "20141119" "20141120" "20141121" "20141122" "20141123"
## [55] "20141124" "20141125" "20141126" "20141127" "20141128" "20141129"
## [61] "20141130"

print(tiq.data.getAvailableDates("raw", "public_inbound"))

##  [1] "20141001" "20141002" "20141003" "20141004" "20141005" "20141006"
##  [7] "20141007" "20141008" "20141009" "20141010" "20141011" "20141012"
## [13] "20141013" "20141014" "20141015" "20141016" "20141017" "20141018"
## [19] "20141019" "20141020" "20141021" "20141022" "20141023" "20141024"
## [25] "20141025" "20141026" "20141027" "20141028" "20141029" "20141030"
## [31] "20141031" "20141101" "20141102" "20141103" "20141104" "20141105"
## [37] "20141106" "20141107" "20141108" "20141109" "20141110" "20141111"
## [43] "20141112" "20141113" "20141114" "20141115" "20141116" "20141117"
## [49] "20141118" "20141119" "20141120" "20141121" "20141122" "20141123"
## [55] "20141124" "20141125" "20141126" "20141127" "20141128" "20141129"
## [61] "20141130"

This time, we also have a private data feed covering the same time period, but the information in it cannot be shared publicly as part of this release. If you are reproducing this in your own environment, you will not be able to recreate some of the outputs below:

if (tiq.data.isDatasetAvailable("raw", "private1")) {
    print(tiq.data.getAvailableDates("raw", "private1"))
} else {
    print("Sorry, private1 dataset is not available.")
}

##  [1] "20141001" "20141002" "20141004" "20141005" "20141006" "20141007"
##  [7] "20141008" "20141009" "20141010" "20141011" "20141012" "20141013"
## [13] "20141014" "20141015" "20141016" "20141017" "20141018" "20141019"
## [19] "20141020" "20141021" "20141022" "20141023" "20141024" "20141025"
## [25] "20141026" "20141027" "20141028" "20141029" "20141030" "20141031"
## [31] "20141101" "20141102" "20141103" "20141104" "20141105" "20141106"
## [37] "20141107" "20141108" "20141109" "20141110" "20141111" "20141112"
## [43] "20141113" "20141114" "20141115" "20141116" "20141117" "20141118"
## [49] "20141119" "20141120" "20141121" "20141122" "20141123" "20141124"
## [55] "20141125" "20141126" "20141127" "20141128" "20141129" "20141130"
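
The private1 listing skips one day that the public datasets have. A quick way to spot such gaps is to compare the available dates with the full expected sequence. The sketch below uses only base R; find.missing.dates is a hypothetical helper, not a TIQ-test function, and the input is a toy vector rather than the real listing:

```r
# Hypothetical helper (not part of TIQ-test): list the days missing from a
# vector of "YYYYMMDD" strings between two boundary dates.
find.missing.dates <- function(available, start, end) {
  expected <- format(seq(as.Date(start, "%Y%m%d"),
                         as.Date(end, "%Y%m%d"), by = "day"), "%Y%m%d")
  setdiff(expected, available)
}

# Toy input that mirrors the kind of gap seen in the listing above.
available <- c("20141001", "20141002", "20141004", "20141005")
missing.dates <- find.missing.dates(available, "20141001", "20141005")
print(missing.dates)
```

With a real dataset, the available argument could be the vector returned by tiq.data.getAvailableDates.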

Data manipulation demonstration using TIQ-test

This is an example of “RAW” (not enriched) outbound data imported from combine output:

outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20141101")
outbound.ti[, list(entity, type, direction, source, date)]

##                          entity type direction     source       date
##     1:             1.168.15.140 IPv4  outbound alienvault 2014-11-01
##     2:                1.93.6.86 IPv4  outbound alienvault 2014-11-01
##     3:             100.42.211.4 IPv4  outbound alienvault 2014-11-01
##     4:           101.227.172.24 IPv4  outbound alienvault 2014-11-01
##     5:             101.36.81.55 IPv4  outbound alienvault 2014-11-01
##    ---                                                              
## 11388:          up.frigo2000.it FQDN  outbound       zeus 2014-11-01
## 11389:          update.odeen.eu FQDN  outbound       zeus 2014-11-01
## 11390: update.rifugiopontese.it FQDN  outbound       zeus 2014-11-01
## 11391:       vahendkarasis4.com FQDN  outbound       zeus 2014-11-01
## 11392:           welcahllyn.com FQDN  outbound       zeus 2014-11-01

We can use the same loadTI function to also gather the enriched datasets:

enrich.ti = tiq.data.loadTI("enriched", "public_outbound", "20141101")
enrich.ti = enrich.ti[, notes := NULL]
tail(enrich.ti)

##            entity type direction source       date asnumber
## 1:  94.102.63.153 IPv4  outbound   zeus 2014-11-01    29073
## 2:   94.103.36.55 IPv4  outbound   zeus 2014-11-01    47894
## 3:  95.163.121.12 IPv4  outbound   zeus 2014-11-01    12695
## 4: 98.131.185.136 IPv4  outbound   zeus 2014-11-01    32392
## 5: 98.131.185.136 IPv4  outbound   zeus 2014-11-01    32392
## 6:    99.181.5.83 IPv4  outbound   zeus 2014-11-01     7018
##                     asname country                       host
## 1:          Ecatel Network      NL                         NA
## 2: VeriTeknik Bilisim Ltd.      TR                         NA
## 3:   Digital Networks CJSC      RU                         NA
## 4:   Ecommerce Corporation      US                         NA
## 5:   Ecommerce Corporation      US projects.globaltronics.net
## 6:     AT&T Services, Inc.      US                         NA
##                                        rhost
## 1:                            exadomains.net
## 2:                 datacenter.veriteknik.com
## 3:                                        NA
## 4:                                        NA
## 5:                                        NA
## 6: adsl-99-181-5-83.dsl.irvnca.sbcglobal.net

This specific outbound dataset has the following sources included:

outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20141101")
unique(outbound.ti$source)

##  [1] "alienvault"        "feodo"             "malcode"          
##  [4] "malcode_zones"     "malwaredomainlist" "malwaredomains"   
##  [7] "malwaregroup"      "palevotracker"     "spyeye"           
## [10] "sslbl"             "zeus"

We can do the same for the inbound data we have to see the sources we have available:

inbound.ti = tiq.data.loadTI("raw", "public_inbound", "20141101")
unique(inbound.ti$source)

##  [1] "alienvault"        "autoshun"          "blocklistde"      
##  [4] "botscout"          "bruteforceblocker" "charleshaley"     
##  [7] "ciarmy"            "dragonresearch"    "dshield"          
## [10] "honeypot"          "openbl"            "packetmail"       
## [13] "virbl"
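
Beyond listing the sources, it is often useful to see how many indicators each one contributes. The following is a small base R sketch on a toy data frame; toy.ti and its contents are invented for illustration and only mimic the entity, type and source columns shown earlier:

```r
# Toy stand-in for a loaded feed; the real object has more columns and rows.
toy.ti <- data.frame(
  entity = c("192.0.2.1", "192.0.2.2", "example.com", "198.51.100.7"),
  type   = c("IPv4", "IPv4", "FQDN", "IPv4"),
  source = c("alienvault", "alienvault", "zeus", "zeus"),
  stringsAsFactors = FALSE
)

# Number of indicators contributed by each source, largest first.
source.counts <- sort(table(toy.ti$source), decreasing = TRUE)
print(source.counts)

# Cross-tabulate source against indicator type.
type.by.source <- table(toy.ti$source, toy.ti$type)
print(type.by.source)
```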

Novelty Test examples

Here are some results of running the Novelty test on the inbound data:

inbound.novelty = tiq.test.noveltyTest("public_inbound", "20141001", "20141130",
                                       select.sources=c("alienvault", "blocklistde",
                                                        "dshield", "charleshaley"),
                                       .progress=FALSE)
tiq.test.plotNoveltyTest(inbound.novelty, title="Novelty Test - Inbound Indicators")
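
To give an intuition for what a test like this measures, here is a rough base R sketch. It assumes that novelty means the share of indicators added to and removed from a feed from one day to the next; that is our reading for illustration purposes, and novelty.sketch is a hypothetical helper rather than the TIQ-test implementation:

```r
# Hypothetical helper, not the TIQ-test implementation: compare each day's
# indicator set with the previous day's and report additions and removals
# as a fraction of the current and previous set sizes, respectively.
novelty.sketch <- function(daily) {
  days <- names(daily)
  rows <- lapply(seq_along(days)[-1], function(i) {
    today <- daily[[i]]
    yesterday <- daily[[i - 1]]
    data.frame(
      date    = days[i],
      added   = length(setdiff(today, yesterday)) / length(today),
      removed = length(setdiff(yesterday, today)) / length(yesterday)
    )
  })
  do.call(rbind, rows)
}

# Toy daily snapshots of a single feed (made-up indicators).
daily <- list(
  "20141001" = c("a", "b", "c", "d"),
  "20141002" = c("b", "c", "d", "e"),
  "20141003" = c("b", "c", "d", "e")
)
novelty <- novelty.sketch(daily)
print(novelty)
```

A feed that never changes would show zeros in both columns, while a feed that is replaced wholesale every day would show ones.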

And results running on the outbound data:

outbound.novelty = tiq.test.noveltyTest("public_outbound", "20141001", "20141130", 
                                        select.sources=c("alienvault", "malwaregroup", 
                                                         "malcode", "zeus"),
                                        .progress=FALSE)
tiq.test.plotNoveltyTest(outbound.novelty, title="Novelty Test - Outbound Indicators")

We can analyze the public_outbound dataset as a single unit as well, in order to compare it with other repositories:

outbound.novelty = tiq.test.noveltyTest("public_outbound", "20141001", "20141130",
                                        split.tii=FALSE, .progress=FALSE)
tiq.test.plotNoveltyTest(outbound.novelty)

## Warning: Stacking not well defined when ymin != 0

The same can be done with the inbound indicators:

inbound.novelty = tiq.test.noveltyTest("public_inbound", "20141001", "20141130",
                                       split.tii=FALSE, .progress=FALSE)
tiq.test.plotNoveltyTest(inbound.novelty)

## Warning: Stacking not well defined when ymin != 0

And with private sources we may have available:

if (tiq.data.isDatasetAvailable("raw", "private1")) {
    private.novelty = tiq.test.noveltyTest("private1", "20141001", "20141130",
                                           split.tii=FALSE, .progress=FALSE)
    tiq.test.plotNoveltyTest(private.novelty)
} else {
    print("Sorry, private1 dataset is not available.")
}

## WARN [2015-02-01 12:35:25 PST] pid=1551 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test-Winter2015/data/raw/private1/20141003.csv.gz' is invalid. No data available on date '20141003'.

## Warning: Stacking not well defined when ymin != 0

Overlap Test examples

This is an example of applying the Overlap Test to our inbound dataset:

overlap = tiq.test.overlapTest("public_inbound", "20141101", "enriched", 
                               select.sources=NULL)
tiq.test.plotOverlapTest(overlap, title="Overlap Test - Inbound Data - 20141101")

Similarly, here is an example of applying the Overlap Test to the outbound dataset:

overlap = tiq.test.overlapTest("public_outbound", "20141101", "enriched", 
                               select.sources=NULL)
tiq.test.plotOverlapTest(overlap, title="Overlap Test - Outbound Data - 20141101")
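
The idea of an overlap comparison can be sketched in a few lines of base R. The version below assumes overlap is measured as the share of one source's indicators that also appear in another source; overlap.sketch is a hypothetical helper and the feeds are made up, so this is an illustration rather than the TIQ-test implementation:

```r
# Hypothetical helper, not the TIQ-test implementation: for every ordered
# pair of sources, the share of the row source's indicators that also
# appear in the column source.
overlap.sketch <- function(feeds) {
  sources <- names(feeds)
  m <- matrix(0, length(sources), length(sources),
              dimnames = list(sources, sources))
  for (a in sources) {
    for (b in sources) {
      shared <- length(intersect(feeds[[a]], feeds[[b]]))
      m[a, b] <- shared / length(unique(feeds[[a]]))
    }
  }
  m
}

# Toy feeds using documentation-only IP ranges.
feeds <- list(
  feedA = c("192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"),
  feedB = c("192.0.2.4", "198.51.100.1"),
  feedC = c("203.0.113.9")
)
overlap.matrix <- overlap.sketch(feeds)
print(overlap.matrix)
```

Note that the matrix is not symmetric: a small feed can be largely contained in a big one while covering only a fraction of it.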

We can use this function to compare our private dataset to each source in our public outbound indicator libraries. This gives some interesting insight into the data it may be using from public sources:

overlap = tiq.test.overlapTest(c("public_outbound", "private1"), "20141101", "enriched", 
                               split.ti=c(T,F), select.sources=NULL)
tiq.test.plotOverlapTest(overlap, title="Overlap Test - public_outbound VS private1 - 20141101")

Population Test Chart examples

With the population data, we can generate plots to compare the top quantities of reported IP addresses on a specific date by country:

outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                date = "20141111",
                                                select.sources=NULL, split.ti=F)
inbound.pop = tiq.test.extractPopulationFromTI("public_inbound", "country", 
                                               date = "20141111",
                                               select.sources=NULL, split.ti=F)

complete.pop = tiq.data.loadPopulation("mmgeo", "country")
tiq.test.plotPopulationBars(c(inbound.pop, outbound.pop, complete.pop), "country")
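
The comparison behind these bars boils down to proportions per country. Here is a base R sketch with invented numbers; feed.countries and reference are toy stand-ins, not the structures returned by the TIQ-test population functions:

```r
# Toy stand-ins: one country code per reported IP address (feed) and a
# reference population of address counts per country.
feed.countries <- c("CN", "CN", "US", "RU", "CN", "US", "NL", "RU")
reference <- c(US = 500, CN = 200, RU = 100, NL = 50, FR = 150)

# Share of the feed attributed to each country.
feed.share <- prop.table(table(feed.countries))

# Share of the reference population held by each country.
ref.share <- reference / sum(reference)

# Put both shares side by side for the countries present in the feed.
comparison <- data.frame(
  country   = names(feed.share),
  feed      = as.numeric(feed.share),
  reference = as.numeric(ref.share[names(feed.share)])
)
comparison <- comparison[order(comparison$feed, decreasing = TRUE), ]
print(comparison)
```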

We can use the same approach to compare our aggregated outbound indicators against the private dataset we have:

if (tiq.data.isDatasetAvailable("enriched", "private1")) {
    outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                    date = "20141110",
                                                    select.sources=NULL, split.ti=F)
    private.pop = tiq.test.extractPopulationFromTI("private1", "country", 
                                                   date = "20141110",
                                                   select.sources=NULL, split.ti=F)
    
    tiq.test.plotPopulationBars(c(private.pop, outbound.pop), "country",
                                title="Comparing Private1 and Public Feeds on 20141110")
} else {
    print("Sorry, private1 dataset is not available.")
}

Population Test Inference - Country data

We can use some inference tools to better understand whether the volume of maliciousness we are seeing makes sense in relation to the population we consider to be our reference population.

outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                date = "20141111",
                                                select.sources=NULL,
                                                split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmgeo", "country")
tests = tiq.test.populationInference(complete.pop$mmgeo, 
                                     outbound.pop$public_outbound, "country",
                                     exact = TRUE, top=10)

# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=T)]

##    country conf.int.start conf.int.end       p.value
## 1:      CN    0.065323470   0.08150539  4.762343e-97
## 2:      RU    0.037520304   0.04752410 3.878014e-141
## 3:      HK    0.036605616   0.04563014 2.614466e-265
## 4:      UA    0.025893666   0.03373351 1.034106e-168
## 5:      NL    0.013268828   0.02084241  7.764204e-30
## 6:      DE    0.009467889   0.01878853  3.370376e-11

# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=F)]

##    country conf.int.start conf.int.end      p.value
## 1:      US    -0.09584373 -0.075020192 7.156756e-56
## 2:      GB    -0.01567198 -0.009154111 4.385023e-11
## 3:      KR    -0.01118954 -0.004554179 1.392778e-05

# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]

##    country conf.int.start conf.int.end p.value
## 1:      FR    -0.00347656  0.002973104 0.82281
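
For readers who want to see the kind of test that can produce columns like these, here is a sketch using base R's binom.test. It assumes an exact binomial test of a country's share in the feed against its share in the reference population, with the confidence interval shifted so it describes the difference between the two shares. Whether TIQ-test does exactly this is an assumption on our part; country.test and all the counts are made up:

```r
# Hypothetical helper, not the TIQ-test implementation: test whether a
# country's share of the feed differs from its share of a reference
# population, using an exact binomial test from base R.
country.test <- function(feed.count, feed.total, ref.share) {
  bt <- binom.test(feed.count, feed.total, p = ref.share)
  data.frame(
    # Confidence interval for (feed share - reference share).
    conf.int.start = bt$conf.int[1] - ref.share,
    conf.int.end   = bt$conf.int[2] - ref.share,
    p.value        = bt$p.value
  )
}

# Made-up counts: 300 of 1000 feed addresses against a 20% reference share,
# and 195 of 1000 against the same reference share.
over <- country.test(300, 1000, 0.20)
same <- country.test(195, 1000, 0.20)

# Threshold matching the 0.05/10 used in the filters above; dividing by the
# number of countries tested is the usual Bonferroni-style correction.
alpha <- 0.05 / 10
print(rbind(over, same))
```

An interval entirely above zero with a p-value under the threshold flags a country as over-represented, while an interval that straddles zero gives no evidence of a difference.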

This tool also enables trend comparisons between the same TI groupings on different days, or between different groupings. A suggested usage is comparing the threat intelligence feeds you have against the population of confirmed attacks or firewall blocks you have in your environment.

outbound.pop2 = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                 date = "20141112",
                                                 select.sources=NULL,
                                                 split.ti=FALSE)
tests = tiq.test.populationInference(outbound.pop$public_outbound, 
                                     outbound.pop2$public_outbound, "country",
                                     exact = F, top=10)

# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=T)]

##    country conf.int.start conf.int.end      p.value
## 1:      US     0.03177125   0.06126435 5.434405e-10

# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=F)]

## Empty data.table (0 rows) of 4 cols: country,conf.int.start,conf.int.end,p.value

# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]

##    country conf.int.start  conf.int.end    p.value
## 1:      CA   -0.007720267  0.0004724928 0.08337404
## 2:      CN   -0.019216301  0.0032365011 0.16467689
## 3:      DE   -0.007485370  0.0055347429 0.79271233
## 4:      FR   -0.004909272  0.0041212106 0.90224369
## 5:      GB   -0.003879552  0.0053539704 0.78758393
## 6:      HK   -0.008242725  0.0042953955 0.55419789
## 7:      KR   -0.005543473  0.0036982022 0.72618057
## 8:      NL   -0.006448015  0.0040968963 0.68756942
## 9:      RU   -0.013835792 -0.0002109399 0.04289032

Aging Test examples

The aging test tries to identify how long a specific indicator has lived in a threat feed. As with other tests, such as population and novelty, you can measure this either in aggregate across all your subgroups or separately for each.

Here it is run against the whole dataset of outbound indicators, separated out into subgroups:

outbound.aging = tiq.test.agingTest("public_outbound", "20141001", "20141130")
tiq.test.plotAgingTest(outbound.aging, title="Aging Test - Outbound Data")
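
As a rough illustration of the measurement, the sketch below counts on how many daily snapshots each indicator shows up. It counts days present rather than consecutive days, which is a simplification; aging.sketch is a hypothetical helper with toy data, not the TIQ-test implementation:

```r
# Hypothetical helper, not the TIQ-test implementation: count on how many
# daily snapshots each indicator appears.
aging.sketch <- function(daily) {
  # Deduplicate within each day, then pool all days and tabulate.
  all.seen <- unlist(lapply(daily, unique), use.names = FALSE)
  table(all.seen)
}

# Toy daily snapshots of a single feed (made-up indicators).
daily <- list(
  "20141001" = c("a", "b", "c"),
  "20141002" = c("a", "b"),
  "20141003" = c("a", "d")
)
ages <- aging.sketch(daily)
print(ages)

# Distribution of ages: how many indicators were seen on 1, 2, 3... days.
age.distribution <- table(as.integer(ages))
print(age.distribution)
```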

Here it is run against the whole dataset of inbound indicators. It is interesting to observe how they have different distributions because of the different ways of collecting the data:

inbound.aging = tiq.test.agingTest("public_inbound", "20141001", "20141130")
tiq.test.plotAgingTest(inbound.aging, title="Aging Test - Inbound Data")

You can also look at it as a whole, in order to evaluate the aging of your entire TI repository in its enriched format:

outbound.aging = tiq.test.agingTest("public_outbound", "20141001", "20141130", type="enriched",
                                    split.ti=FALSE)
tiq.test.plotAgingTest(outbound.aging, title="Aging Test - Outbound Data")

This allows us to compare it against identically formatted data for the private dataset:

if (tiq.data.isDatasetAvailable("enriched", "private1")) {
    private.aging = tiq.test.agingTest("private1", "20141001", "20141130", type="enriched",
                                       split.ti=FALSE)
    tiq.test.plotAgingTest(private.aging, title="Aging Test - Private Outbound Data", density.limit=0.7)
} else {
    print("Sorry, private1 dataset is not available.")
}

## WARN [2015-02-01 12:40:13 PST] pid=1551 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test-Winter2015/data/enriched/private1/20141003.csv.gz' is invalid. No data available on date '20141003'.

Uniqueness Test examples

For the Uniqueness test examples, we calculate the absolute uniqueness of the data over different time periods (1, 15, 30 and 60 days) to verify how this uniqueness evolves over time. By running the tests, we see that there is not a lot of variation in the ratio of uniqueness on inbound data:

uniqueTest = rbind(
    tiq.test.uniquenessTest("public_inbound", "20141001","20141001", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_inbound", "20141001","20141015", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_inbound", "20141001","20141030", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_inbound", "20141001","20141129", "raw", split.tii = T)
)

uniqueTest[count == 1]

##    count     ratio days
## 1:     1 0.9759170    1
## 2:     1 0.9737684   15
## 3:     1 0.9727630   30
## 4:     1 0.9710713   60

tiq.test.plotUniquenessTest(uniqueTest, title="Uniqueness Test - Inbound Data")
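
The count and ratio columns can be read as "what share of indicators appears in exactly this many sources". The base R sketch below assumes that interpretation; uniqueness.sketch is a hypothetical helper working on a toy data frame, not the TIQ-test implementation:

```r
# Hypothetical helper, not the TIQ-test implementation: for each indicator,
# count the sources that list it, then report the share of indicators at
# each count.
uniqueness.sketch <- function(ti) {
  # Keep one row per (entity, source) pair so repeats do not inflate counts.
  pairs <- unique(ti[, c("entity", "source")])
  per.entity <- table(pairs$entity)
  counts <- table(as.integer(per.entity))
  data.frame(
    count = as.integer(names(counts)),
    ratio = as.numeric(counts) / length(per.entity)
  )
}

# Toy feed: "d" is listed by two sources, and "a" is repeated within one.
ti <- data.frame(
  entity = c("a", "b", "c", "d", "d", "a"),
  source = c("s1", "s1", "s2", "s1", "s2", "s1"),
  stringsAsFactors = FALSE
)
result <- uniqueness.sketch(ti)
print(result[result$count == 1, ])
```

A ratio close to 1 for count == 1, as in the output above, means that almost every indicator is reported by a single source.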

Nor is there much variation in the outbound data:

uniqueTest = rbind(
    tiq.test.uniquenessTest("public_outbound", "20141001","20141001", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_outbound", "20141001","20141015", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_outbound", "20141001","20141030", "raw", split.tii = T),
    tiq.test.uniquenessTest("public_outbound", "20141001","20141129", "raw", split.tii = T)
)

uniqueTest[count == 1]

##    count     ratio days
## 1:     1 0.9815816    1
## 2:     1 0.9678728   15
## 3:     1 0.9660537   30
## 4:     1 0.9663163   60

tiq.test.plotUniquenessTest(uniqueTest, title="Uniqueness Test - Outbound Data")

Also, adding the private data does not change the uniqueness ratios much further. Some work had been done previously on selecting the feeds for little overlap, and we can see that it paid off here.

if (tiq.data.isDatasetAvailable("enriched", "private1")) {
    uniqueTest = rbind(
        tiq.test.uniquenessTest(c("public_outbound", "private1"), "20141001","20141001",
                                                        "enriched", split.tii = c(T,F)),
        tiq.test.uniquenessTest(c("public_outbound", "private1"), "20141001","20141015",
                                                        "enriched", split.tii = c(T,F)),
        tiq.test.uniquenessTest(c("public_outbound", "private1"), "20141001","20141030",
                                                        "enriched", split.tii = c(T,F)),
        tiq.test.uniquenessTest(c("public_outbound", "private1"), "20141001","20141129",
                                                        "enriched", split.tii = c(T,F))
    )
    uniqueTest[count == 1]
    tiq.test.plotUniquenessTest(uniqueTest, title="Uniqueness Test (enriched) - Private Data vs. Outbound Data")
} else {
    print("Sorry, private1 dataset is not available.")
}

## WARN [2015-02-01 12:43:11 PST] pid=1551 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test-Winter2015/data/enriched/private1/20141003.csv.gz' is invalid. No data available on date '20141003'.
## WARN [2015-02-01 12:43:20 PST] pid=1551 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test-Winter2015/data/enriched/private1/20141003.csv.gz' is invalid. No data available on date '20141003'.
## WARN [2015-02-01 12:43:36 PST] pid=1551 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test-Winter2015/data/enriched/private1/20141003.csv.gz' is invalid. No data available on date '20141003'.

This finishes the analysis of this dataset. Feel free to suggest new tests and sources.

From Threat Intelligence to Defense Cleverness: A Data Science Approach

Alex Pinto

February 2nd, 2015