This is the companion R Markdown document to the following presentations that were delivered in Summer 2014:
The slide deck for these talks is located here
It should provide enough examples of how to use the tools implemented in TIQ-test. Please review our GitHub repository page, report bugs, and suggest features!
Adding the TIQ-TEST functions
source("tiq-test.R")
We can use the tiq.data functions to load the Threat Intelligence datasets from the database for exploration using R. We have defaulted to the use of data.table objects for this because they are faster and you can write tighter code around them (sorry Hadleyverse fans).
We have roughly a month of data available on this public dataset:
.tiq.data.getAvailableDates("raw", "public_outbound")
## [1] "20140615" "20140616" "20140617" "20140618" "20140619" "20140620"
## [7] "20140621" "20140622" "20140623" "20140624" "20140625" "20140626"
## [13] "20140627" "20140628" "20140629" "20140630" "20140701" "20140702"
## [19] "20140703" "20140704" "20140705" "20140706" "20140707" "20140708"
## [25] "20140709" "20140710" "20140711" "20140712" "20140713" "20140714"
## [31] "20140715"
This is an example of “RAW” (not enriched) outbound data imported from combine output:
outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
outbound.ti[, list(entity, type, direction, source, date)]
## entity type direction source date
## 1: 1.224.163.26 IPv4 outbound alienvault 2014-07-01
## 2: 1.242.99.155 IPv4 outbound alienvault 2014-07-01
## 3: 1.85.2.118 IPv4 outbound alienvault 2014-07-01
## 4: 1.93.1.162 IPv4 outbound alienvault 2014-07-01
## 5: 1.93.161.204 IPv4 outbound alienvault 2014-07-01
## ---
## 16298: winscoft.com FQDN outbound zeus 2014-07-01
## 16299: wmzbase.ru FQDN outbound zeus 2014-07-01
## 16300: zhabademon.net FQDN outbound zeus 2014-07-01
## 16301: zhangleetranding.com FQDN outbound zeus 2014-07-01
## 16302: znatnydom.by FQDN outbound zeus 2014-07-01
This specific outbound dataset has the following sources included:
outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
unique(outbound.ti$source)
## [1] "alienvault" "botscout" "malcode"
## [4] "malcode_zones" "malwaredomainlist" "malwaredomains"
## [7] "malwaregroup" "palevotracker" "spyeye"
## [10] "zeus"
We can do the same for the inbound data we have to see the sources we have available:
inbound.ti = tiq.data.loadTI("raw", "public_inbound", "20140701")
unique(inbound.ti$source)
## [1] "alienvault" "autoshun" "blocklistde"
## [4] "bruteforceblocker" "charleshaley" "ciarmy"
## [7] "dragonresearch" "dshield" "honeypot"
## [10] "openbl" "packetmail" "virbl"
SIDE NOTE: please don’t add non-malicious domains to malware domain lists, ok? :)
outbound.ti = tiq.data.loadTI("raw", "public_outbound", "20140701")
outbound.ti[entity %like% "google.com", list(entity, type, direction, source, date)]
## entity type direction source date
## 1: chrome.google.com FQDN outbound malwaredomains 2014-07-01
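The hit above comes from a substring match, which would also catch names that merely contain "google.com". As a minimal sketch of a stricter check, assuming we keep a small allowlist of well-known domains, we could match on exact names and subdomains only. The `allowlist` vector and the `is_allowlisted` helper are hypothetical and not part of tiq-test, and the sample uses base R so it runs without data.table.

```r
# Hypothetical allowlist and helper; neither is part of tiq-test.
allowlist <- c("google.com", "microsoft.com")

# TRUE when `fqdn` equals an allowlisted domain or is a subdomain of one.
is_allowlisted <- function(fqdn, allowlist) {
  vapply(fqdn, function(name) {
    any(name == allowlist | endsWith(name, paste0(".", allowlist)))
  }, logical(1), USE.NAMES = FALSE)
}

# Made-up sample of feed entities (only the first should be flagged).
entities <- c("chrome.google.com", "notgoogle.com",
              "google.com.example.net", "wmzbase.ru")

flagged <- entities[is_allowlisted(entities, allowlist)]
print(flagged)
```

A substring search for "google.com" would have matched the first three names here, so the stricter match keeps look-alike domains out of the "probably benign" bucket.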
We can use the same loadTI function to also gather the enriched datasets:
enrich.ti = tiq.data.loadTI("enriched", "public_outbound", "20140710")
enrich.ti = enrich.ti[, notes := NULL]
enrich.ti[c(2,22264, 22266)]
## entity type direction source date asnumber
## 1: 1.224.163.26 IPv4 outbound alienvault 2014-07-10 9318
## 2: 95.181.178.177 IPv4 outbound zeus 2014-07-10 57311
## 3: 98.131.185.136 IPv4 outbound zeus 2014-07-10 32392
## asname country
## 1: Hanaro Telecom Inc. KR
## 2: FOP ILIUSHENKO VOLODYMYR OLEXANDROVUCH GB
## 3: Ecommerce Corporation US
## host rhost
## 1: NA NA
## 2: newdomaininfo.ru host178-177.neohost.net
## 3: projects.globaltronics.net NA
The novelty test should be used to get a sense of the ratio of new indicators being added and old ones being retired as the data feeds progress day by day.
There is no intrinsic right or wrong, but less frequent updates usually mean that the feeds are carefully curated (or abandoned :) ). Curated is great, abandoned is very bad.
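To make the idea concrete, here is a minimal sketch of how added and retired ratios could be computed for one feed between two consecutive days, using plain set differences. The `novelty_ratios` helper, the sample indicators, and the choice of denominators are all assumptions for illustration; tiq.test.noveltyTest may normalize its numbers differently.

```r
# Hypothetical helper, not part of tiq-test: compares two daily
# snapshots of the same feed.
novelty_ratios <- function(previous, current) {
  previous <- unique(previous)
  current  <- unique(current)
  added    <- setdiff(current, previous)  # indicators that first appear today
  retired  <- setdiff(previous, current)  # indicators dropped since yesterday
  list(
    added.ratio   = length(added) / length(current),     # share of today's feed that is new
    retired.ratio = length(retired) / length(previous)   # share of yesterday's feed that is gone
  )
}

# Made-up snapshots using documentation-only addresses and names.
day1 <- c("192.0.2.1", "192.0.2.2", "example.org", "example.net")
day2 <- c("192.0.2.2", "example.org", "198.51.100.7")

ratios <- novelty_ratios(day1, day2)
print(ratios)
```

Ratios near zero on both sides for weeks on end would be the "curated or abandoned" case described above.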
Here are some results of running the Novelty test on the inbound data:
inbound.novelty = tiq.test.noveltyTest("public_inbound", "20140615", "20140715",
select.sources=c("alienvault", "blocklistde",
"dshield", "charleshaley"))
## WARN [2014-08-12 16:04:59 PDT] pid=7637 tiq.data.loadTI: path '/Users/alexcp/src/tiq-test/data/raw/public_inbound/20140621.csv.gz' is invalid. No data available on date '20140621'.
tiq.test.plotNoveltyTest(inbound.novelty)
And the results of running it on the outbound data:
outbound.novelty = tiq.test.noveltyTest("public_outbound", "20140615", "20140715",
select.sources=c("alienvault", "malwaregroup",
"malwaredomainlist", "malwaredomains"))
tiq.test.plotNoveltyTest(outbound.novelty)
This is an example of applying the Overlap Test to our inbound dataset:
overlap = tiq.test.overlapTest("public_inbound", "20140715", "enriched",
select.sources=NULL)
overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Inbound Data - 20140715")
print(overlap.plot)
Similarly, here is an example of applying the Overlap Test to the outbound dataset:
overlap = tiq.test.overlapTest("public_outbound", "20140715", "enriched",
select.sources=NULL)
overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Outbound Data - 20140715")
print(overlap.plot)
What about that day when malwaredomainlist and malwaredomains moved together on the novelty test?
overlap = tiq.test.overlapTest("public_outbound", "20140629", "enriched",
select.sources=c("alienvault", "malwaredomainlist",
"malwaredomains", "zeus"))
overlap.plot = tiq.test.plotOverlapTest(overlap, title="Overlap Test - Outbound Data Sources - 20140629")
print(overlap.plot)
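As a rough illustration of what an overlap comparison between sources can look like, the sketch below builds a pairwise matrix where each cell is the share of the row feed's indicators that also appear in the column feed. The `overlap_matrix` helper and the three feeds are hypothetical; this is one plausible definition of overlap, not necessarily the one tiq.test.overlapTest uses.

```r
# Hypothetical helper, not part of tiq-test.
# Cell [row, col] = share of the row feed's indicators also present in the column feed.
overlap_matrix <- function(feeds) {
  feeds <- lapply(feeds, unique)
  sapply(feeds, function(col_feed) {
    sapply(feeds, function(row_feed) {
      length(intersect(row_feed, col_feed)) / length(row_feed)
    })
  })
}

# Made-up feeds using documentation-only addresses.
feeds <- list(
  feedA = c("192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"),
  feedB = c("192.0.2.1", "192.0.2.2"),
  feedC = c("198.51.100.7")
)

overlap.m <- overlap_matrix(feeds)
print(overlap.m)
```

Note that the matrix is not symmetric: all of feedB is contained in feedA, but only half of feedA is contained in feedB.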
With the population data we can generate some plots to compare the top quantities of reported IP addresses on a specific date by country:
outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country",
date = "20140711",
select.sources=NULL, split.ti=FALSE)
inbound.pop = tiq.test.extractPopulationFromTI("public_inbound", "country",
date = "20140711",
select.sources=NULL, split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmgeo", "country")
tiq.test.plotPopulationBars(c(inbound.pop, outbound.pop, complete.pop), "country")
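For readers who want to see the underlying idea without the plotting helper, the following sketch tabulates a tiny made-up enriched sample by country and converts the counts into shares. The data frame and its values are invented for illustration; only the `entity` and `country` column names follow the enriched output shown earlier, and base R is used so the sample runs without data.table.

```r
# Made-up enriched sample; addresses are documentation-only ranges.
ti <- data.frame(
  entity  = c("192.0.2.1", "192.0.2.2", "192.0.2.3", "198.51.100.7", "203.0.113.9"),
  country = c("US", "US", "TH", "US", "KR"),
  stringsAsFactors = FALSE
)

counts <- table(ti$country)                            # indicators per country
shares <- sort(prop.table(counts), decreasing = TRUE)  # share of the feed per country
print(shares)
```

Working with shares rather than raw counts is what lets feeds of very different sizes be put next to a reference population on the same chart.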
Or we can compare them by the AS that those IP addresses are a part of. Of course, there are far more ASes than countries, so the distribution is much more granular.
outbound.pop = tiq.test.extractPopulationFromTI("public_outbound",
c("asnumber", "asname"),
date = "20140711",
select.sources=NULL, split.ti=FALSE)
inbound.pop = tiq.test.extractPopulationFromTI("public_inbound",
c("asnumber", "asname"),
date = "20140711",
select.sources=NULL, split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmasn", c("asnumber", "asname"))
tiq.test.plotPopulationBars(c(inbound.pop, outbound.pop, complete.pop), "asname")
We can use some inference tools to get a better understanding of whether the volume of maliciousness we are seeing makes sense in relation to the population we consider to be our reference population.
outbound.pop = tiq.test.extractPopulationFromTI("public_outbound", "country",
date = "20140711",
select.sources=NULL,
split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmgeo", "country")
tests = tiq.test.populationInference(complete.pop$mmgeo,
outbound.pop$public_outbound, "country",
exact = TRUE, top=10)
# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=TRUE)]
## country conf.int.start conf.int.end p.value
## 1: TH 0.047044 0.05415 0.000e+00
## 2: US 0.025335 0.04111 9.406e-17
## 3: UA 0.031252 0.03730 0.000e+00
## 4: RU 0.021363 0.02739 1.198e-105
## 5: HK 0.014238 0.01868 2.412e-128
## 6: NL 0.007818 0.01268 4.091e-23
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=FALSE)]
## country conf.int.start conf.int.end p.value
## 1: GB -0.01926 -0.015040 8.988e-38
## 2: CN -0.01469 -0.005996 5.356e-06
## 3: KR -0.01411 -0.009713 3.809e-20
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
## country conf.int.start conf.int.end p.value
## 1: DE -0.002366 0.003411 0.7553
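The filters above read like a significance threshold of 0.05 split across the 10 groups being compared (a Bonferroni-style correction), combined with the sign of the confidence interval for the difference in proportions. As a minimal sketch of that reasoning for a single country, the example below uses `prop.test` from base R's stats package on made-up counts. This is an assumption about the general technique, not a description of how tiq.test.populationInference is implemented, and the interval is left at `prop.test`'s default 95% level.

```r
# Made-up counts for one country; not taken from the datasets above.
ti.count  <- 500      # indicators in the TI feed geolocated to the country
ti.total  <- 10000    # all indicators in the TI feed
ref.count <- 30000    # addresses in the reference population for the country
ref.total <- 1000000  # all addresses in the reference population

# Two-sample test of equal proportions (TI share vs. reference share).
res <- prop.test(x = c(ti.count, ref.count), n = c(ti.total, ref.total))

# Bonferroni-style threshold for 10 simultaneous comparisons.
alpha <- 0.05 / 10

# Over-represented: significant difference and the whole interval is above zero.
over.represented <- res$p.value < alpha && res$conf.int[1] > 0

print(res$conf.int)      # interval for (TI share - reference share)
print(over.represented)
```

An interval entirely above zero means the country shows up in the feed more than its share of the reference population would suggest; entirely below zero means less.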
This tool also enables us to compare trends between the same TI groupings on different days, or between different groupings. A suggested usage is comparing the threat intelligence feeds you have against the population of confirmed attacks or firewall blocks you have in your environment.
outbound.pop2 = tiq.test.extractPopulationFromTI("public_outbound", "country",
date = "20140712",
select.sources=NULL,
split.ti=FALSE)
tests = tiq.test.populationInference(outbound.pop$public_outbound,
outbound.pop2$public_outbound, "country",
exact = FALSE, top=10)
# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=TRUE)]
## country conf.int.start conf.int.end p.value
## 1: TH 0.008892 0.01949 1.312e-07
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=FALSE)]
## Empty data.table (0 rows) of 4 cols: country,conf.int.start,conf.int.end,p.value
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
## country conf.int.start conf.int.end p.value
## 1: CN -0.008903 0.003230 0.3652
## 2: DE -0.005626 0.002421 0.4461
## 3: GB -0.003826 0.002055 0.5753
## 4: HK -0.004286 0.001887 0.4612
## 5: KR -0.004004 0.002129 0.5682
## 6: NL -0.004471 0.002308 0.5484
## 7: RU -0.005538 0.002877 0.5489
## 8: UA -0.005500 0.002947 0.5675
## 9: US -0.009315 0.012858 0.7613
We can do the same population-like tests for ASN data. Let’s investigate the prevalence of Google IP addresses on
outbound.pop = tiq.test.extractPopulationFromTI("public_outbound",
c("asnumber", "asname"),
date="20140711",
select.sources=NULL,
split.ti=FALSE)
complete.pop = tiq.data.loadPopulation("mmasn", c("asnumber", "asname"))
tests = tiq.test.populationInference(complete.pop$mmasn,
outbound.pop$public_outbound,
c("asname", "asnumber"),
exact = TRUE, top=10)
# Whose proportion is bigger than it should be?
tests[p.value < 0.05/10 & conf.int.end > 0][order(conf.int.end, decreasing=TRUE)]
## asname conf.int.start conf.int.end p.value
## 1: Google Inc. 0.10756 0.11758 0.000e+00
## 2: Amazon.com, Inc. 0.04015 0.04673 0.000e+00
## 3: Akamai International B.V. 0.03534 0.04151 0.000e+00
## 4: TOT Public Company Limited 0.03019 0.03588 0.000e+00
## 5: GoDaddy.com, LLC 0.02052 0.02532 0.000e+00
## 6: OVH SAS 0.01397 0.01802 1.046e-302
## 7: Unified Layer 0.01292 0.01682 7.411e-323
## 8: Krypt Technologies 0.01049 0.01404 8.007e-265
# Whose is smaller?
tests[p.value < 0.05/10 & conf.int.start < 0][order(conf.int.start, decreasing=FALSE)]
## asname conf.int.start conf.int.end p.value
## 1: Chinanet -0.01216 -0.006648 4.903e-10
# And whose is the same? ¯\_(ツ)_/¯
tests[p.value > 0.05/10]
## asname conf.int.start conf.int.end p.value
## 1: CNCGROUP China169 Backbone -0.004651 -0.0004625 0.01762
This huge prevalence of Google AS IPs should be investigated further. Some of it could be from parking at 8.8.8.8 and 1.1.1.1, but it seems to be too much:
outbound.ti = tiq.data.loadTI("enriched", "public_outbound", "20140711")
outbound.ti[asname %like% "Google", list(entity, type, source, asname, host)]
## entity type source asname host
## 1: 74.125.228.43 IPv4 malcode Google Inc. NA
## 2: 74.125.228.75 IPv4 malcode Google Inc. NA
## 3: 173.194.115.16 IPv4 malcode_zones Google Inc. googleapis.com
## 4: 173.194.115.17 IPv4 malcode_zones Google Inc. googleapis.com
## 5: 173.194.115.18 IPv4 malcode_zones Google Inc. googleapis.com
## ---
## 1964: 8.8.8.8 IPv4 malwaredomains Google Inc. revlister.com
## 1965: 8.8.8.8 IPv4 malwaredomains Google Inc. statalyze.net
## 1966: 8.8.8.8 IPv4 malwaredomains Google Inc. statisticbench.net
## 1967: 8.8.8.8 IPv4 malwaredomains Google Inc. webdestinct.net
## 1968: 8.8.8.8 IPv4 spyeye Google Inc. futuretelefonica.com
outbound.ti[asname %like% "Google" & entity != "8.8.8.8" & entity != "1.1.1.1",
list(entity, type, source, asname, host)]
## entity type source asname host
## 1: 74.125.228.43 IPv4 malcode Google Inc. NA
## 2: 74.125.228.75 IPv4 malcode Google Inc. NA
## 3: 173.194.115.16 IPv4 malcode_zones Google Inc. googleapis.com
## 4: 173.194.115.17 IPv4 malcode_zones Google Inc. googleapis.com
## 5: 173.194.115.18 IPv4 malcode_zones Google Inc. googleapis.com
## ---
## 1950: 74.125.70.101 IPv4 malwaredomains Google Inc. chrome.google.com
## 1951: 74.125.70.102 IPv4 malwaredomains Google Inc. chrome.google.com
## 1952: 74.125.70.113 IPv4 malwaredomains Google Inc. chrome.google.com
## 1953: 74.125.70.138 IPv4 malwaredomains Google Inc. chrome.google.com
## 1954: 74.125.70.139 IPv4 malwaredomains Google Inc. chrome.google.com
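One quick way to spot parking addresses like 8.8.8.8 is to count how many distinct hostnames point at each IP. The sketch below does that in base R on a handful of entity/host pairs copied from the output above. The `min.hosts` threshold is a hypothetical cut-off chosen only so this tiny sample produces a result; a real feed would need a much higher value.

```r
# A few entity/host pairs copied from the enriched output above.
ti <- data.frame(
  entity = c("8.8.8.8", "8.8.8.8", "8.8.8.8", "74.125.70.101", "173.194.115.16"),
  host   = c("revlister.com", "statalyze.net", "webdestinct.net",
             "chrome.google.com", "googleapis.com"),
  stringsAsFactors = FALSE
)

# Count distinct hostnames per IP address, most shared first.
pairs        <- unique(ti[, c("entity", "host")])
hosts.per.ip <- sort(table(pairs$entity), decreasing = TRUE)
print(hosts.per.ip)

# Hypothetical threshold: IPs shared by at least this many hostnames
# are treated as likely parking addresses.
min.hosts <- 3
parking   <- names(hosts.per.ip)[hosts.per.ip >= min.hosts]
print(parking)
```

Addresses that survive this filter are the ones worth a manual look, since many unrelated domains resolving to one IP rarely says anything about that IP's owner.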
I guess it is fair to say that it would be a good idea to clean up these feeds. :)
That’s all for now, folks! Feel free to suggest new tests and experiments!