This is the companion R Markdown document to the following presentations that were delivered in Summer 2014:

The slide deck for these talks is located here

It should provide enough usage examples for the tools implemented in TIQ-test. Please review our GitHub repository page, report bugs, and suggest features!

Adding the TIQ-TEST functions

source("tiq-test.R")

Accessing the data using TIQ-TEST

We can use the tiq.data functions to load the Threat Intelligence datasets from the database for exploration in R. We default to data.table objects because they are faster and let you write tighter code (sorry, Hadleyverse fans).

We have roughly a month of data available on this public dataset:

.tiq.data.getAvailableDates("raw", "public_outbound")
##  [1] "20140615" "20140616" "20140617" "20140618" "20140619" "20140620"
##  [7] "20140621" "20140622" "20140623" "20140624" "20140625" "20140626"
## [13] "20140627" "20140628" "20140629" "20140630" "20140701" "20140702"
## [19] "20140703" "20140704" "20140705" "20140706" "20140707" "20140708"
## [25] "20140709" "20140710" "20140711" "20140712" "20140713" "20140714"
## [31] "20140715"
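Feeds do not always have a file for every day, so it can be handy to check a list of available dates for gaps before running tests over a date range. The following is a minimal base R sketch, not part of TIQ-test; the `available` vector is an illustrative subset with one day deliberately left out.

```r
# Date strings in the same YYYYMMDD form shown above
# (illustrative subset with one gap, not real output).
available <- c("20140619", "20140620", "20140622", "20140623")

# Parse the strings and build the full daily sequence between first and last date.
parsed <- as.Date(available, format = "%Y%m%d")
expected <- format(seq(min(parsed), max(parsed), by = "day"), "%Y%m%d")

# Any expected day that is not available is a gap in the feed.
missing.days <- setdiff(expected, available)
print(missing.days)
```

Here the only missing day is "20140621".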

This is an example of “RAW” (not enriched) outbound data imported from combine output:

outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
outbound.ti[, list(entity, type, direction, source, date)]
##                      entity type direction     source       date
##     1:         1.224.163.26 IPv4  outbound alienvault 2014-07-01
##     2:         1.242.99.155 IPv4  outbound alienvault 2014-07-01
##     3:           1.85.2.118 IPv4  outbound alienvault 2014-07-01
##     4:           1.93.1.162 IPv4  outbound alienvault 2014-07-01
##     5:         1.93.161.204 IPv4  outbound alienvault 2014-07-01
##    ---                                                          
## 16298:         winscoft.com FQDN  outbound       zeus 2014-07-01
## 16299:           wmzbase.ru FQDN  outbound       zeus 2014-07-01
## 16300:       zhabademon.net FQDN  outbound       zeus 2014-07-01
## 16301: zhangleetranding.com FQDN  outbound       zeus 2014-07-01
## 16302:         znatnydom.by FQDN  outbound       zeus 2014-07-01

This specific outbound dataset has the following sources included:

outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
unique(outbound.ti$source)
##  [1] "alienvault"        "botscout"          "malcode"          
##  [4] "malcode_zones"     "malwaredomainlist" "malwaredomains"   
##  [7] "malwaregroup"      "palevotracker"     "spyeye"           
## [10] "zeus"
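Beyond listing the sources, it is often useful to see how many indicators of each type every source contributes. The sketch below uses base R on a small made-up data frame with the same `entity`, `type` and `source` columns shown above; the rows are hypothetical and the approach is only one way to do it.

```r
# Hypothetical indicators using the same column names as the TI data above.
ti <- data.frame(
  entity = c("192.0.2.10", "192.0.2.11", "bad.example", "worse.example", "192.0.2.10"),
  type   = c("IPv4", "IPv4", "FQDN", "FQDN", "IPv4"),
  source = c("alienvault", "alienvault", "zeus", "zeus", "zeus"),
  stringsAsFactors = FALSE
)

# Cross-tabulate sources (rows) against indicator types (columns).
counts <- table(ti$source, ti$type)
print(counts)

# Total number of indicators per source.
per.source <- rowSums(counts)
print(per.source)
```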

We can do the same for the inbound data we have to see the sources we have available:

inbound.ti = tiq.data.loadTI("raw", "public_inbound", "20140701")
unique(inbound.ti$source)
##  [1] "alienvault"        "autoshun"          "blocklistde"      
##  [4] "bruteforceblocker" "charleshaley"      "ciarmy"           
##  [7] "dragonresearch"    "dshield"           "honeypot"         
## [10] "openbl"            "packetmail"        "virbl"

SIDE NOTE: please don’t add non-malicious domains to malware domain lists, ok? :)

outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
outbound.ti[entity %like% "google.com", list(entity, type, direction, source, date)] 
##               entity type direction         source       date
## 1: chrome.google.com FQDN  outbound malwaredomains 2014-07-01

We can use the same loadTI function to also gather the enriched datasets:

enrich.ti = tiq.data.loadTI("enriched", "public_outbound", "20140710")
enrich.ti = enrich.ti[, notes := NULL]
enrich.ti[c(2,22264, 22266)]
##            entity type direction     source       date asnumber
## 1:   1.224.163.26 IPv4  outbound alienvault 2014-07-10     9318
## 2: 95.181.178.177 IPv4  outbound       zeus 2014-07-10    57311
## 3: 98.131.185.136 IPv4  outbound       zeus 2014-07-10    32392
##                                    asname country
## 1:                    Hanaro Telecom Inc.      KR
## 2: FOP ILIUSHENKO VOLODYMYR OLEXANDROVUCH      GB
## 3:                  Ecommerce Corporation      US
##                          host                   rhost
## 1:                         NA                      NA
## 2:           newdomaininfo.ru host178-177.neohost.net
## 3: projects.globaltronics.net                      NA

Novelty Test

The novelty test gives a sense of the ratio of new indicators being added to old ones being retired as the data feeds progress day by day.

There is no intrinsic right or wrong here, but less frequent updates usually mean that a feed is either carefully curated or abandoned :). Curated is great, abandoned is very bad.
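To make the idea concrete, here is a minimal sketch of a day-over-day novelty calculation using set operations in base R. It is an illustration of the concept under our own assumptions, not necessarily how `tiq.test.noveltyTest` computes its numbers; the `novelty` helper and the indicator lists are hypothetical.

```r
# Compare two consecutive days of one feed (hypothetical helper).
novelty <- function(prev, curr) {
  prev <- unique(prev)
  curr <- unique(curr)
  added   <- setdiff(curr, prev)   # indicators that are new today
  dropped <- setdiff(prev, curr)   # indicators retired since yesterday
  list(added = added,
       dropped = dropped,
       added.ratio   = length(added) / length(curr),
       dropped.ratio = length(dropped) / length(prev))
}

# Made-up indicator lists for two days.
day1 <- c("192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4")
day2 <- c("192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6")

res <- novelty(day1, day2)
print(res$added.ratio)    # share of today's indicators that are new
print(res$dropped.ratio)  # share of yesterday's indicators that were retired
```

In this toy case 2 of the 5 indicators on day 2 are new (0.4) and 1 of the 4 from day 1 was retired (0.25).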

Here are some results of running the Novelty test on the inbound data:

inbound.novelty = tiq.test.noveltyTest("public_inbound", "20140615", "20140715", 
                                select.sources=c("alienvault", "blocklistde", 
                                                 "dshield", "charleshaley"))
## WARN [2014-08-12 16:04:59 PDT] pid=7637 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test/data/raw/public_inbound/20140621.csv.gz' is invalid. No data available on date '20140621'.
tiq.test.plotNoveltyTest(inbound.novelty)

And here are the results of running it on the outbound data:

outbound.novelty = tiq.test.noveltyTest("public_outbound", "20140615", "20140715", 
                                select.sources=c("alienvault", "malwaregroup", 
                                                 "malwaredomainlist", "malwaredomains"))
tiq.test.plotNoveltyTest(outbound.novelty)

Overlap Test

This is an example of applying the Overlap Test to our inbound dataset:

  overlap = tiq.test.overlapTest("public_inbound", "20140715", "enriched", 
                                 select.sources=NULL)
  overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Inbound Data - 20140715")
  print(overlap.plot)

Similarly, here is an example of applying the Overlap Test to the outbound dataset:

  overlap = tiq.test.overlapTest("public_outbound", "20140715", "enriched", 
                                 select.sources=NULL)
  overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Outbound Data - 20140715")
  print(overlap.plot)

What about that day when malwaredomainlist and malwaredomains moved together on the novelty test?

  overlap = tiq.test.overlapTest("public_outbound", "20140629", "enriched", 
                                 select.sources=c("alienvault", "malwaredomainlist",
                                                  "malwaredomains", "zeus"))
  overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Outbound Data Sources - 20140629")
  print(overlap.plot)
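For intuition about what an overlap measurement can look like, the sketch below builds a small pairwise matrix in base R: each cell is the share of one feed's indicators that also appear in another feed. This is an assumption about a reasonable way to quantify overlap, not a description of what `tiq.test.overlapTest` does internally; the feed names and contents are made up.

```r
# Made-up feeds, each a vector of indicators.
feeds <- list(
  feedA = c("192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"),
  feedB = c("192.0.2.3", "192.0.2.4", "198.51.100.1"),
  feedC = c("203.0.113.9")
)

# overlap[own, other] = share of the indicators in 'own' that are also in 'other'.
overlap <- sapply(feeds, function(other) {
  sapply(feeds, function(own) {
    length(intersect(own, other)) / length(unique(own))
  })
})

print(round(overlap, 2))
```

Note that the matrix is not symmetric: half of feedA is found in feedB, while two thirds of feedB is found in feedA.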

Population Test Plots

With the population data we can generate plots that compare, by country, the largest shares of reported IP addresses on a specific date:

  outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                  date = "20140711",
                                                  select.sources=NULL, split.ti=F)
  inbound.pop = tiq.test.extractPopulationFromTI("public_inbound", "country", 
                                                 date = "20140711",
                                                 select.sources=NULL, split.ti=F)

  complete.pop = tiq.data.loadPopulation("mmgeo", "country")
  tiq.test.plotPopulationBars(c(inbound.pop, outbound.pop, complete.pop), "country")

Or we can compare them by the AS that those IP addresses belong to. Of course, there are far more ASes than countries, so the distribution is much more granular.

  outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", 
                                                  c("asnumber", "asname"), 
                                                  date = "20140711",
                                                  select.sources=NULL, split.ti=F)
  inbound.pop = tiq.test.extractPopulationFromTI("public_inbound", 
                                                 c("asnumber", "asname"), 
                                                 date = "20140711",
                                                 select.sources=NULL, split.ti=F)

  complete.pop = tiq.data.loadPopulation("mmasn", c("asnumber", "asname"))
  tiq.test.plotPopulationBars(c(inbound.pop, outbound.pop, complete.pop), "asname")
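The quantity being compared in these bar plots is essentially a share per group. As a rough illustration, the base R sketch below turns a made-up vector of country codes (one per indicator) into shares and picks the top three; it does not reproduce the TIQ-test population objects.

```r
# Made-up country codes, one per indicator.
countries <- c("US", "US", "TH", "US", "UA", "TH", "DE", "US")

# Count indicators per country and convert the counts to shares.
counts <- table(countries)
shares <- sort(counts / sum(counts), decreasing = TRUE)

# Keep the three largest shares.
top <- shares[seq_len(3)]
print(top)
```

The same idea applies to AS names: swap the country vector for a vector of `asname` values.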

Population Test Inference - Country data

We can use some inference tools to better understand whether the volume of maliciousness we are seeing makes sense relative to the population we consider to be our reference.

outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                date = "20140711",
                                                select.sources=NULL,
                                                split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmgeo", "country")
tests = tiq.test.populationInference(complete.pop$mmgeo, 
                                     outbound.pop$public_outbound, "country",
                                     exact = TRUE, top=10)

# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=T)]
##    country conf.int.start conf.int.end    p.value
## 1:      TH       0.047044      0.05415  0.000e+00
## 2:      US       0.025335      0.04111  9.406e-17
## 3:      UA       0.031252      0.03730  0.000e+00
## 4:      RU       0.021363      0.02739 1.198e-105
## 5:      HK       0.014238      0.01868 2.412e-128
## 6:      NL       0.007818      0.01268  4.091e-23
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=F)]
##    country conf.int.start conf.int.end   p.value
## 1:      GB       -0.01926    -0.015040 8.988e-38
## 2:      CN       -0.01469    -0.005996 5.356e-06
## 3:      KR       -0.01411    -0.009713 3.809e-20
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
##    country conf.int.start conf.int.end p.value
## 1:      DE      -0.002366     0.003411  0.7553
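The columns in these results (a confidence interval and a p-value per country) resemble what a two-sample proportion test returns. As a sketch of that kind of test, the example below uses `prop.test` from base R on made-up counts. Whether TIQ-test uses this exact test is an assumption on our part; all counts are hypothetical.

```r
# Hypothetical counts for one country.
feed.count <- 120;  feed.total <- 1000     # 12% of the TI feed
ref.count  <- 5000; ref.total  <- 100000   # 5% of the reference population

# Two-sample test of equal proportions (feed first, reference second).
res <- prop.test(x = c(feed.count, ref.count), n = c(feed.total, ref.total))

# conf.int is the interval for (feed share - reference share).
print(res$conf.int)
print(res$p.value)

# Flag the country only if it clears a threshold corrected for 10 comparisons.
over.represented <- res$p.value < 0.05 / 10 && res$conf.int[1] > 0
print(over.represented)
```

An interval entirely above zero means the country shows up in the feed more than the reference population would suggest; entirely below zero means less.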

This tool also lets us compare trends between the same TI groupings on different days, or between different groupings. A suggested usage is comparing the threat intelligence feeds you have against the population of confirmed attacks or firewall blocks you have in your environment.

outbound.pop2 = tiq.test.extractPopulationFromTI("public_outbound", "country", 
                                                 date = "20140712",
                                                 select.sources=NULL,
                                                 split.ti=FALSE)
tests = tiq.test.populationInference(outbound.pop$public_outbound, 
                                     outbound.pop2$public_outbound, "country",
                                     exact = F, top=10)

# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=T)]
##    country conf.int.start conf.int.end   p.value
## 1:      TH       0.008892      0.01949 1.312e-07
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=F)]
## Empty data.table (0 rows) of 4 cols: country,conf.int.start,conf.int.end,p.value
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
##    country conf.int.start conf.int.end p.value
## 1:      CN      -0.008903     0.003230  0.3652
## 2:      DE      -0.005626     0.002421  0.4461
## 3:      GB      -0.003826     0.002055  0.5753
## 4:      HK      -0.004286     0.001887  0.4612
## 5:      KR      -0.004004     0.002129  0.5682
## 6:      NL      -0.004471     0.002308  0.5484
## 7:      RU      -0.005538     0.002877  0.5489
## 8:      UA      -0.005500     0.002947  0.5675
## 9:      US      -0.009315     0.012858  0.7613
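The `0.05/10` threshold used in these filters looks like a Bonferroni correction for the 10 groups being tested (`top=10`). Assuming that reading is right, the base R sketch below shows that comparing raw p-values against `alpha / n` selects the same groups as comparing Bonferroni-adjusted p-values against `alpha`. The p-values are copied from the output above.

```r
# p-values from the day-over-day comparison above.
p.values <- c(TH = 1.312e-07, CN = 0.3652, DE = 0.4461, GB = 0.5753, HK = 0.4612,
              KR = 0.5682, NL = 0.5484, RU = 0.5489, UA = 0.5675, US = 0.7613)
alpha <- 0.05

# Option 1: shrink the threshold by the number of tests.
by.threshold <- names(p.values)[p.values < alpha / length(p.values)]

# Option 2: inflate the p-values and keep the usual threshold.
by.adjust <- names(p.values)[p.adjust(p.values, method = "bonferroni") < alpha]

print(by.threshold)
print(by.adjust)
```

Both approaches single out TH as the only country whose share changed significantly between the two days.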

Population Test Inference - ASN data

We can do the same population-like tests for ASN data. Let’s investigate the prevalence of Google IP addresses in the outbound data:

outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", 
                                                c("asnumber", "asname"), 
                                                date="20140711",
                                                select.sources=NULL,
                                                split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmasn", c("asnumber", "asname"))
tests = tiq.test.populationInference(complete.pop$mmasn,
                                     outbound.pop$public_outbound, 
                                     c("asname", "asnumber"),
                                     exact = TRUE, top=10)

# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=T)]
##                        asname conf.int.start conf.int.end    p.value
## 1:                Google Inc.        0.10756      0.11758  0.000e+00
## 2:           Amazon.com, Inc.        0.04015      0.04673  0.000e+00
## 3:  Akamai International B.V.        0.03534      0.04151  0.000e+00
## 4: TOT Public Company Limited        0.03019      0.03588  0.000e+00
## 5:           GoDaddy.com, LLC        0.02052      0.02532  0.000e+00
## 6:                    OVH SAS        0.01397      0.01802 1.046e-302
## 7:              Unified Layer        0.01292      0.01682 7.411e-323
## 8:         Krypt Technologies        0.01049      0.01404 8.007e-265
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=F)]
##      asname conf.int.start conf.int.end   p.value
## 1: Chinanet       -0.01216    -0.006648 4.903e-10
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
##                        asname conf.int.start conf.int.end p.value
## 1: CNCGROUP China169 Backbone      -0.004651   -0.0004625 0.01762

This huge prevalence of Google AS IPs should be investigated further. Some of it could come from domains parked at 8.8.8.8 and 1.1.1.1, but that does not seem to be enough to explain it:

outbound.ti = tiq.data.loadTI("enriched", "public_outbound", "20140711")
outbound.ti[asname %like% "Google",  list(entity, type, source, asname, host)]
##               entity type         source      asname                 host
##    1:  74.125.228.43 IPv4        malcode Google Inc.                   NA
##    2:  74.125.228.75 IPv4        malcode Google Inc.                   NA
##    3: 173.194.115.16 IPv4  malcode_zones Google Inc.       googleapis.com
##    4: 173.194.115.17 IPv4  malcode_zones Google Inc.       googleapis.com
##    5: 173.194.115.18 IPv4  malcode_zones Google Inc.       googleapis.com
##   ---                                                                    
## 1964:        8.8.8.8 IPv4 malwaredomains Google Inc.        revlister.com
## 1965:        8.8.8.8 IPv4 malwaredomains Google Inc.        statalyze.net
## 1966:        8.8.8.8 IPv4 malwaredomains Google Inc.   statisticbench.net
## 1967:        8.8.8.8 IPv4 malwaredomains Google Inc.      webdestinct.net
## 1968:        8.8.8.8 IPv4         spyeye Google Inc. futuretelefonica.com
outbound.ti[asname %like% "Google" & entity != "8.8.8.8" & entity != "1.1.1.1",
            list(entity, type, source, asname, host)]
##               entity type         source      asname              host
##    1:  74.125.228.43 IPv4        malcode Google Inc.                NA
##    2:  74.125.228.75 IPv4        malcode Google Inc.                NA
##    3: 173.194.115.16 IPv4  malcode_zones Google Inc.    googleapis.com
##    4: 173.194.115.17 IPv4  malcode_zones Google Inc.    googleapis.com
##    5: 173.194.115.18 IPv4  malcode_zones Google Inc.    googleapis.com
##   ---                                                                 
## 1950:  74.125.70.101 IPv4 malwaredomains Google Inc. chrome.google.com
## 1951:  74.125.70.102 IPv4 malwaredomains Google Inc. chrome.google.com
## 1952:  74.125.70.113 IPv4 malwaredomains Google Inc. chrome.google.com
## 1953:  74.125.70.138 IPv4 malwaredomains Google Inc. chrome.google.com
## 1954:  74.125.70.139 IPv4 malwaredomains Google Inc. chrome.google.com

I guess it is fair to say that it would be a good idea to clean up these feeds. :)
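As a starting point for that kind of cleanup, here is a minimal base R sketch that drops indicators matching an allowlist of IP addresses or domain suffixes. The allowlists, the `host.in.domains` helper and the last row of the data are hypothetical; a real allowlist needs much more care than this.

```r
# A few rows shaped like the enriched data above (last row is made up).
ti <- data.frame(
  entity = c("8.8.8.8", "74.125.70.101", "203.0.113.7"),
  host   = c("statalyze.net", "chrome.google.com", "bad.example"),
  stringsAsFactors = FALSE
)

# Hypothetical allowlists.
ip.allowlist     <- c("8.8.8.8")
domain.allowlist <- c("google.com")

# TRUE when a host equals an allowlisted domain or is a subdomain of one.
host.in.domains <- function(host, domains) {
  hit <- rep(FALSE, length(host))
  for (d in domains) {
    hit <- hit | host == d | endsWith(host, paste0(".", d))
  }
  hit & !is.na(host)  # rows without a host never match
}

# Keep only rows that match neither allowlist.
keep  <- !(ti$entity %in% ip.allowlist) & !host.in.domains(ti$host, domain.allowlist)
clean <- ti[keep, , drop = FALSE]
print(clean)
```

Only the made-up 203.0.113.7 row survives: the public resolver address and the chrome.google.com host are both filtered out.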

That’s all for now, folks! Feel free to suggest new tests and experiments!