In the light of the recent hack/leak of Hacking Team (HT), I have decided to take a closer look at their customers, in particular the email addresses with which HT has communicated.
Here are the links to the file that I am going to analyse.
https://wikileaks.org/hackingteam/emails/emailid/144932
http://cryptome.org/2015/07/ht-email-addresses.htm
Even though I could easily copy & paste the emails into Excel (and already make some adjustments there), I have decided to do it the harder way, namely by scraping the website. This has the “disadvantage” that cleansing will be a manual process (well, not so much :) ).
First, some libraries.
library(rvest)
library(ggplot2)
library(stringi)
library(magrittr)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Then select the 6th p tag (element) of the body, which contains all the data.
# html() is deprecated in newer rvest versions; read_html() is its replacement
emailList <- read_html("http://cryptome.org/2015/07/ht-email-addresses.htm") %>%
html_nodes("p") %>%
extract2(6) %>%
html_text(trim = TRUE)
However, read.csv expects a file by default, so I store the text in a temporary file using the writeLines function and then read that file using read.csv. I also skip the first 3 lines and rename the column to V1.
tf <- tempfile()
writeLines(emailList, tf)
df_emails <- read.csv(tf, skip = 3, col.names = c("V1"))
head(df_emails)
## V1
## 1 01nurlan@gmail.com
## 2 139011@gmail.com
## 3 6442221@gmail.com
## 4 7715@rijnmond.politie.nl
## 5 aaron.garza@fcc.gov
## 6 aaron.j.robinson@us.army.mil
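As a side note, the temporary file is probably not strictly necessary: read.csv also accepts a text argument (passed on to read.table), which reads from a character string instead of a file. A minimal sketch, using a made-up string in place of the scraped page (the junk lines and addresses below are invented for illustration):

```r
# Toy stand-in for the scraped text: three junk lines, then one line that
# read.csv consumes as the header, then the (made-up) addresses.
email_text <- paste(
  "junk line 1", "junk line 2", "junk line 3",
  "column header",
  "alice@example.org", "bob@example.com",
  sep = "\n"
)

# Same arguments as in the post, but reading from a string via text =
df_toy <- read.csv(text = email_text, skip = 3, col.names = "V1",
                   stringsAsFactors = FALSE)
print(df_toy)
```

Whether this behaves identically on the real page depends on what the first lines of the scraped text look like, so it is worth comparing the result with the temporary-file version.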
Now, after the initial scraping of the website and loading of the emails into a single column, it is necessary to split them further into 2 columns (the part before the “@” and the domain) for later use.
splitted_email <- data.frame(stri_split_fixed(tolower(df_emails$V1), "@", 2, omit_empty = NA, simplify = TRUE))
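To make the effect of the n = 2 limit more concrete, here is a rough base-R equivalent on two made-up addresses. The helper name split_at_first_at is hypothetical, the sketch assumes every string contains at least one “@”, and it is not how stringi implements the split. Everything before the first “@” becomes the first column and everything after it the second, so a stray trailing “@” ends up in the domain:

```r
# Hypothetical helper: split each address at its first "@" only
split_at_first_at <- function(x) {
  x <- tolower(x)
  data.frame(
    X1 = sub("@.*$", "", x),      # text before the first "@"
    X2 = sub("^[^@]*@", "", x),   # text after the first "@"
    stringsAsFactors = FALSE
  )
}

toy <- split_at_first_at(c("Alice@Example.org", "bob@example.com@"))
print(toy)
```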
Sadly, there are some abnormalities, e.g. the pairs “#” / “kl”, “#klaus.mochalski” / “adytonsystems.com” or “jimmi.lapotulo” / “gmail.com@”, which is rather a big problem. I will therefore need to identify all “#” rows and either correct or delete them before continuing. Let’s do that!
BTW: there are 11 rows that need to be edited (x = 1 means the first column contains one “#”), and they are these:
plyr::count(stri_count_fixed(splitted_email$X1, "#"))
## x freq
## 1 0 2226
## 2 1 11
splitted_email[grepl("#", splitted_email$X1), ]
## X1 X2
## 1123 # kl
## 1124 #klaus.mochalski adytonsystems.com
## 1562 #papp mail.datanet.hu
## 1687 #reza cybersecurity.my
## 2015 #tomas.copete gencat.cat
## 2089 #vladimir.remenar fpz.hr
## 2188 #diego.cazzin eurasiastrategy.eu
## 2224 #carlasuc_igor mail.ru
## 2234 #7vko mail.ru
## 2235 #rk_mvd bk.ru
## 2237 #jody.revets usse.bl
To make cleansing faster, I will delete row 1123 directly, correct row 981 by hand and, for the others, apply the stri_replace_first_fixed function.
# delete row 1123 ("#" / "kl"), which cannot be repaired
splitted_email <- splitted_email[-1123, ]
# row 981: set the domain by hand
splitted_email[981, 2] <- "gmail.com"
# delete/replace hashtag if the string begins with that
splitted_email$X1 <- stri_replace_first_fixed(splitted_email$X1, "#", "")
splitted_email$X1 <- stri_trim_both(splitted_email$X1)
splitted_email$X2 <- stri_trim_both(splitted_email$X2)
Let me check that it is correct. Great, it is.
splitted_email[grepl("#", splitted_email$X1), ]
## [1] X1 X2
## <0 rows> (or 0-length row.names)
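The “#” check only covers one kind of abnormality. A somewhat broader sanity check could look like the sketch below. The two patterns are rough assumptions about what a plausible local part and domain look like (they are far from a full email validator), and the sample rows are modelled on the problem cases above:

```r
toy <- data.frame(
  X1 = c("01nurlan", "#papp", "jimmi.lapotulo", "#"),
  X2 = c("gmail.com", "mail.datanet.hu", "gmail.com@", "kl"),
  stringsAsFactors = FALSE
)

# Assumed rules: no "@", "#" or whitespace in the local part; the domain
# consists of lower-case letters, digits, dots and hyphens and ends in a
# TLD-like label of at least two letters.
ok_local  <- grepl("^[^@#[:space:]]+$", toy$X1)
ok_domain <- grepl("^[a-z0-9.-]+\\.[a-z]{2,}$", toy$X2)

# Rows failing either rule are worth a manual look
suspicious <- toy[!(ok_local & ok_domain), ]
print(suspicious)
```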
Are there any duplicates? Oh yes, there are! (Who would have thought so after WikiLeaks released that file on the web? I would have expected them to have taken care of it already.)
splitted_email$X3 <- paste(tolower(splitted_email$X1), tolower(splitted_email$X2), sep = "@")
plyr::count(splitted_email[duplicated(tolower(splitted_email$X3)), ])
## X1 X2 X3 freq
## 1 bruno.ae modcom.com.my bruno.ae@modcom.com.my 1
## 2 bsakhr yahoo.fr bsakhr@yahoo.fr 1
## 3 jalee ncis.navy.mil jalee@ncis.navy.mil 1
## 4 jon.abolins gmail.com jon.abolins@gmail.com 2
## 5 marco.aimetti aermacchi.it marco.aimetti@aermacchi.it 1
## 6 michelestick gmail.com michelestick@gmail.com 1
## 7 milosstr centrum.cz milosstr@centrum.cz 1
## 8 pasha5163 hotmail.com pasha5163@hotmail.com 1
## 9 rnewman sheriffleefl.org rnewman@sheriffleefl.org 1
## 10 roger.flury fedpol.admin.ch roger.flury@fedpol.admin.ch 1
## 11 tony.tortora ice.dhs.gov tony.tortora@ice.dhs.gov 1
splitted_email <- splitted_email[!duplicated(tolower(splitted_email$X3)), ]
So now I have 2224 unique rows to work with.
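One detail that is easy to miss: duplicated() flags only the second and later occurrences of a value, so the freq column above counts extra copies, not total occurrences (a freq of 2 means the address appeared three times). A small sketch with made-up addresses:

```r
x <- c("a@example.org", "b@example.org", "A@example.org", "a@example.org")

# Compare case-insensitively; only repeats are flagged, never the first hit
dup <- duplicated(tolower(x))

extra_copies <- table(tolower(x)[dup])  # how many extra copies per address
unique_x <- x[!dup]                     # first occurrence of each address is kept

print(extra_copies)
print(unique_x)
```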
Now I can really start analysing my data set. An example:
head(splitted_email)
## X1 X2 X3
## 1 01nurlan gmail.com 01nurlan@gmail.com
## 2 139011 gmail.com 139011@gmail.com
## 3 6442221 gmail.com 6442221@gmail.com
## 4 7715 rijnmond.politie.nl 7715@rijnmond.politie.nl
## 5 aaron.garza fcc.gov aaron.garza@fcc.gov
## 6 aaron.j.robinson us.army.mil aaron.j.robinson@us.army.mil
This lists the domains (or rather, the most widely used email providers) used by the “customers”, ordered by frequency.
splitted_email_freq <- as.data.frame(table(splitted_email$X2))
splitted_email_freq <- plyr::arrange(splitted_email_freq, desc(splitted_email_freq$Freq))
head(splitted_email_freq)
## Var1 Freq
## 1 gmail.com 422
## 2 yahoo.com 156
## 3 hotmail.com 119
## 4 dhs.gov 27
## 5 rmp.gov.my 21
## 6 dc.gov 19
A plot?
splitted_email_freq_plot <- ggplot(data=splitted_email_freq[1:20,], aes(x=reorder(Var1, Freq), y=Freq))
splitted_email_freq_plot <- splitted_email_freq_plot + geom_bar(stat = "identity") + coord_flip(ylim= c(0,430)) + theme(
axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12))
splitted_email_freq_plot <- splitted_email_freq_plot + xlab("Domains") + ylab("Number of users who used those email providers") + ggtitle("Which providers can be found in the address book of Hacking Team ?")
splitted_email_freq_plot <- splitted_email_freq_plot + scale_y_continuous(breaks = seq(0, 430, 30))
splitted_email_freq_plot
As one can see, there are many gmail, outlook, yahoo etc. addresses. I don’t want to display them, however.
blackListDomains <- c("outlook|gmail|yahoo|live|hotmail|googlemail")
splitted_email_freq_bld <- splitted_email[!stri_detect_regex(splitted_email$X2, blackListDomains), ]
splitted_email_freq_bld <- as.data.frame(table(splitted_email_freq_bld$X2))
splitted_email_freq_bld <- plyr::arrange(splitted_email_freq_bld, desc(splitted_email_freq_bld$Freq))
head(splitted_email_freq_bld, 15)
## Var1 Freq
## 1 dhs.gov 27
## 2 rmp.gov.my 21
## 3 dc.gov 19
## 4 fairfaxcounty.gov 18
## 5 ic.fbi.gov 14
## 6 kpk.go.id 14
## 7 mindef.nl 11
## 8 navy.mil 11
## 9 state.gov 11
## 10 us.army.mil 11
## 11 cybercrimeunit.gr 9
## 12 cybersecurity.my 9
## 13 ice.dhs.gov 9
## 14 politiet.no 9
## 15 usss.dhs.gov 9
Ok, so the first is the Department of Homeland Security, then the Royal Malaysia Police (rmp.gov.my) and so it goes. To me, it was surprising to see fairfaxcounty.gov, for which HT had 18 email addresses in their contact list. There must be something going on…
Also in the TOP 15 there are Indonesia, the Netherlands, Norway and Greece (a country that has no money at all, yet money is always found to buy 1 million euro worth of stuff from HT).
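A side note on the blacklist: the pattern is not anchored, so it matches those words anywhere in the domain name, which could also drop unrelated domains. If the intention is to exclude only domains that start with one of the provider names (an assumption about the intent), a stricter pattern could be built as in this sketch, with made-up domains:

```r
providers <- c("outlook", "gmail", "yahoo", "live", "hotmail", "googlemail")

# "^" anchors at the start of the domain, "\\." requires a literal dot
# right after the provider name
strict_pattern <- paste0("^(", paste(providers, collapse = "|"), ")\\.")

domains <- c("gmail.com", "yahoo.co.uk", "deliveroo.example", "live.com", "dhs.gov")

loose  <- grepl("outlook|gmail|yahoo|live|hotmail|googlemail", domains)
strict <- grepl(strict_pattern, domains)

print(data.frame(domains, loose, strict))
```

In this toy data the loose pattern also flags deliveroo.example, because it contains “live”, while the strict one does not.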
I guess that, without any selection/removal, .gov, .it and .com will be among the TOP 5. But what are the other frequently used TLDs? For this plot I will use the splitted_email_freq_bld dataset from above and delete all generic domains (gTLD: com, org, net, edu, gov, mil).
domain_list <- data.frame(stri_split_fixed(tolower(splitted_email_freq_bld$Var1), ".", -1, omit_empty = NA, simplify = TRUE))
domain_list[domain_list == ""] <- NA
# sapply(domain_list, class) ## all are factors :(
domain_list$X1 <- as.character(domain_list$X1)
domain_list$X2 <- as.character(domain_list$X2)
domain_list$X3 <- as.character(domain_list$X3)
domain_list$X4 <- as.character(domain_list$X4)
domain_list$X5 <- as.character(domain_list$X5)
# http://stackoverflow.com/questions/29634425/merging-two-columns-into-one-in-r
domain_list$new_NEW <- domain_list[-1][cbind(1:nrow(domain_list), max.col(!is.na(domain_list[-1]), ties.method = "last"))]
# freq again
domain_list <- as.data.frame(table(domain_list$new_NEW))
domain_list <- plyr::arrange(domain_list, desc(domain_list$Freq))
domain_list <- domain_list[!stri_detect_regex(domain_list$Var1, "com|org|net|edu|gov|mil"), ]
head(domain_list, 10)
## Var1 Freq
## 2 it 97
## 6 de 25
## 7 uk 20
## 8 cz 15
## 9 us 15
## 10 nl 14
## 11 ch 13
## 12 my 13
## 13 ae 12
## 14 ca 12
Nothing positive, at least for the Czech Republic, Switzerland, Germany and Singapore (well, there I at least understand why).
domain_list_plot <- ggplot(data=domain_list[1:20,], aes(x=reorder(Var1, Freq), y=Freq))
domain_list_plot <- domain_list_plot + geom_bar(stat = "identity") + coord_flip(ylim= c(0,100)) + theme(
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 14))
domain_list_plot <- domain_list_plot + xlab("Top-Level-Domains") + ylab("Number of unique domains") + ggtitle("Which domains can be found in the address book of Hacking Team ?")
domain_list_plot <- domain_list_plot + scale_y_continuous(breaks = seq(0, 100, 5))
domain_list_plot
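Going back to the TLD extraction: the split plus max.col approach works, but the last label of a domain can also be pulled out with a single regular expression. This is a sketch under the assumption that every domain contains at least one dot, using a few domains from the tables above. Matching the generic TLDs exactly with %in% also avoids accidentally dropping a TLD that merely contains one of those strings:

```r
domains <- c("dhs.gov", "rmp.gov.my", "px.mvcr.cz", "fpz.hr", "us.army.mil")

# Remove everything up to and including the last dot
tld <- sub("^.*\\.", "", domains)

# Exact comparison against the generic TLDs listed in the post
generic <- c("com", "org", "net", "edu", "gov", "mil")
cc_only <- tld[!tld %in% generic]

print(table(cc_only))
```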
# keep only addresses whose domain ends in ".cz" (escaped dot, anchored at the end)
splitted_email_czech <- splitted_email[grepl("\\.cz$", splitted_email$X2), ]
head(splitted_email_czech, 10)
## X1 X2 X3
## 45 afg.prague centrum.cz afg.prague@centrum.cz
## 251 benes stech.cz benes@stech.cz
## 542 dobesvl atlas.cz dobesvl@atlas.cz
## 825 havlicek ppcr.cz havlicek@ppcr.cz
## 854 horinek.jan seznam.cz horinek.jan@seznam.cz
## 921 jakub.kriz ppcr.cz jakub.kriz@ppcr.cz
## 933 janmiler seznam.cz janmiler@seznam.cz
## 983 jindrich.hora seznam.cz jindrich.hora@seznam.cz
## 984 jiri.jenis px.mvcr.cz jiri.jenis@px.mvcr.cz
## 985 jirij px.mvcr.cz jirij@px.mvcr.cz
Ok, so who do we have here?
Uninteresting (because they make sense, except for CTU.cz)
http://www.BIS.cz: The Security Information Service (BIS) is an intelligence institution active within the Czech Republic.
http://www.CTU.cz: The Czech Telecommunication Office ensures that electronic communications, and the related international activities within a number of governmental and non-governmental agencies and organisations, are carried out. <- The office has now been confronted with this, link (at the time of publishing, this remained without an answer, which is very typical for the Czech Republic)
http://www.MVCR.cz: Ministry of the Interior
http://www.UZSI.cz: The Office for Foreign Relations and Information is an intelligence service … providing the Czech legislative and executive authorities with timely, impartial and quality foreign intelligence vital for the security and protection of foreign policy interests and economic policy interests of the Czech Republic.
http://polac.cz - The Police Academy of the Czech Republic in Prague
http://www.Atlas.cz & http://www.Centrum.cz & https://www.Seznam.cz & http://Volny.cz are local email providers.
Very MUCH Interesting
http://PPCR.cz - Information on WHOIS http://www.nic.cz/whois/?d=ppcr.cz
http://npdc.cz: Martin Vanicek ? http://www.nic.cz/whois/?d=npdc.cz
http://okfk.cz: Information on WHOIS http://www.nic.cz/whois/?d=okfk.cz
http://srblad.cz: Ladislav Srb ? http://www.nic.cz/whois/?d=srblad.cz
http://stech.cz - cover up for something :)
Also, a very nice article has recently been published about Tomas Hlavsa, a guy at Bull & ATOS Services who helped the Czech government “acquire” those systems from HT (at the end of the day, they did not work in at least one case).
splitted_email_swi_sg <- rbind(splitted_email[stri_endswith_fixed(splitted_email$X2, ".ch"), ], splitted_email[stri_endswith_fixed(splitted_email$X2, ".sg"), ])
head(table(splitted_email_swi_sg$X2))
##
## aduno.ch ar.admin.ch armasuisse.ch baumarep.ch cnb.gov.sg
## 1 1 1 1 3
## corner.ch
## 4
I won’t go into full detail here; this one is just for fun.
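The two stri_endswith_fixed calls combined with rbind could also be written as one anchored regular expression. A sketch on a few domains (the first three appear in the output above, the last one is made up as a non-match); note that, unlike rbind, this keeps the rows in their original order:

```r
domains <- c("aduno.ch", "cnb.gov.sg", "corner.ch", "chips.example")

# "\\." matches a literal dot, "$" anchors the match at the end of the string
is_ch_or_sg <- grepl("\\.(ch|sg)$", domains)

print(table(domains[is_ch_or_sg]))
```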
The goal was to take a close look at the email addresses which have been (or are being) used for communication with Hacking Team. Specifically, I was interested in the Czech ones, as there is a continuing perception that the Czechs don’t spy on people (both inside and outside of the country). With the release of the email addresses from the Hacking Team address book, this has been proven to be totally wrong.