Analysis of Hacking Team Email List

Goal
Data Cleansing
Analysis
Summary
About the author

Goal

In light of the recent hack and leak of Hacking Team (HT), I decided to take a closer look at their customers, in particular at the email addresses with which HT had communicated.

Here are links to the file that I am going to analyse.

https://wikileaks.org/hackingteam/emails/emailid/144932

http://cryptome.org/2015/07/ht-email-addresses.htm

Even though I could easily copy and paste the emails into Excel (and already make some adjustments there), I decided to do it the harder way, namely by scraping the website. This has the “disadvantage” that the cleansing will be a manual process (well, not so much :) ).

First some libraries.

library(rvest)
library(ggplot2)
library(stringi)
library(magrittr)
library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Then select the 6th p tag (element) of the body, which contains all the data.

emailList <- read_html("http://cryptome.org/2015/07/ht-email-addresses.htm") %>%
  html_nodes("p") %>%
  extract2(6)  %>%
  html_text(trim = TRUE)

However, read.csv by default reads from a file or connection rather than from a character string, so I store the text in a temporary file using the writeLines function and then read that file with read.csv. I also skip the first 3 lines and name the column V1.

tf <- tempfile()
writeLines(emailList, tf)
df_emails <- read.csv(tf, skip = 3, col.names = c("V1"))

head(df_emails)

##                             V1
## 1           01nurlan@gmail.com
## 2             139011@gmail.com
## 3            6442221@gmail.com
## 4     7715@rijnmond.politie.nl
## 5          aaron.garza@fcc.gov
## 6 aaron.j.robinson@us.army.mil
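As an aside, the temporary file is not strictly required: read.csv also accepts a text argument and can parse a character string directly. A minimal sketch, using a made-up string (sample_text and df_sample are hypothetical names, not the scraped data):

```r
# Hypothetical stand-in for the scraped text: one address per line.
sample_text <- "alice@example.org\nbob@example.com\ncarol@example.net"

# read.csv() parses the string directly via `text =`, so no temp file
# is needed. header = FALSE keeps the first line as data instead of
# turning it into a column name.
df_sample <- read.csv(text = sample_text, header = FALSE,
                      col.names = "V1", stringsAsFactors = FALSE)

print(df_sample)
```

Note that read.csv defaults to header = TRUE, in which case the first line after the skipped ones is treated as a header row rather than as data.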

Now that the website has been scraped and the emails loaded into a single column, it is necessary to split them into 2 columns (the part before the “@” and the domain after it) for later use.

splitted_email <- data.frame(stri_split_fixed(tolower(df_emails$V1), "@", 2, omit_empty = NA, simplify = TRUE))
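To illustrate what the split does, here is a small sketch on made-up addresses (sample_emails and parts are hypothetical names). It assumes the stringi package is installed, as in the rest of this post:

```r
library(stringi)

# Hypothetical sample addresses, including one with two "@" signs.
sample_emails <- c("Alice@Example.org", "bob@example.com", "odd@name@example.net")

# n = 2 splits at the first "@" only, so anything after it stays in
# the second column; simplify = TRUE returns a character matrix.
parts <- stri_split_fixed(tolower(sample_emails), "@", n = 2, simplify = TRUE)
colnames(parts) <- c("local", "domain")

print(parts)
```

Because at most two pieces are produced, a malformed address never creates a third column; the damage shows up inside the existing columns instead, which is why the cleansing step below is needed.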

Data Cleansing

Getting rid of “#”

Sadly, there are some abnormalities, e.g. (“# kl”, “”), (“#klaus.mochalski”, “adytonsystems.com”) or (“jimmi.lapotulo”, “gmail.com@”), which is rather a big problem. I will therefore need to identify all “#” rows and either correct or delete them before continuing. Let’s do that!

By the way, there are 11 rows that need editing (x = 1 means the row contains one “#”), and they are these:

plyr::count(stri_count_fixed(splitted_email$X1, "#"))

##   x freq
## 1 0 2226
## 2 1   11

splitted_email[grepl("#", splitted_email$X1), ]

##                     X1                 X2
## 1123              # kl                   
## 1124  #klaus.mochalski  adytonsystems.com
## 1562             #papp    mail.datanet.hu
## 1687             #reza   cybersecurity.my
## 2015     #tomas.copete         gencat.cat
## 2089 #vladimir.remenar             fpz.hr
## 2188     #diego.cazzin eurasiastrategy.eu
## 2224    #carlasuc_igor            mail.ru
## 2234             #7vko            mail.ru
## 2235           #rk_mvd              bk.ru
## 2237      #jody.revets            usse.bl

To make the cleansing faster, I will delete row 1123 directly, correct row 981 by hand, and apply the stri_replace_first_fixed function to the others.

# delete row 1123, which holds no usable address
splitted_email <- splitted_email[-1123, ]
# row 981: set the domain to gmail.com
splitted_email[981, 2] <- "gmail.com"

# remove the first "#" found in the part before the "@"
splitted_email$X1 <- stri_replace_first_fixed(splitted_email$X1, "#", "")

splitted_email$X1 <- stri_trim_both(splitted_email$X1)
splitted_email$X2 <- stri_trim_both(splitted_email$X2)

Let me check that it is correct. Great, it is.

splitted_email[grepl("#", splitted_email$X1), ]

## [1] X1 X2
## <0 rows> (or 0-length row.names)
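The rows above were found by searching for “#” specifically. A more general check could flag any row whose two parts do not look like an email address. The sketch below uses deliberately simple patterns (my assumption, not a full address validator) on made-up rows that mimic the problems seen here; sample_split, ok_local, ok_domain and suspicious are hypothetical names:

```r
# Hypothetical rows mimicking the problems seen above.
sample_split <- data.frame(
  X1 = c("aaron.garza", "#papp", "# kl", "jimmi.lapotulo"),
  X2 = c("fcc.gov", "mail.datanet.hu", "", "gmail.com@"),
  stringsAsFactors = FALSE
)

# Simplified patterns (an assumption, not a complete standard check):
# the local part may contain lowercase letters, digits and . _ % + -
# the domain must be dot-separated labels ending in a 2+ letter TLD.
ok_local  <- grepl("^[a-z0-9._%+-]+$", sample_split$X1)
ok_domain <- grepl("^[a-z0-9-]+(\\.[a-z0-9-]+)*\\.[a-z]{2,}$", sample_split$X2)

# Rows failing either check need manual attention.
suspicious <- sample_split[!(ok_local & ok_domain), ]
print(suspicious)
```

Such a check would also catch a stray “@” or an empty domain, not only a leading “#”.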

Any duplicates?

Oh yes, there are! (Who would have thought so, after WikiLeaks had released that file on the web? I would have expected them to have taken care of it already.)

splitted_email$X3 <- paste(tolower(splitted_email$X1), tolower(splitted_email$X2), sep = "@")

plyr::count(splitted_email[duplicated(tolower(splitted_email$X3)), ])

##               X1               X2                          X3 freq
## 1       bruno.ae    modcom.com.my      bruno.ae@modcom.com.my    1
## 2         bsakhr         yahoo.fr             bsakhr@yahoo.fr    1
## 3          jalee    ncis.navy.mil         jalee@ncis.navy.mil    1
## 4    jon.abolins        gmail.com       jon.abolins@gmail.com    2
## 5  marco.aimetti     aermacchi.it  marco.aimetti@aermacchi.it    1
## 6   michelestick        gmail.com      michelestick@gmail.com    1
## 7       milosstr       centrum.cz         milosstr@centrum.cz    1
## 8      pasha5163      hotmail.com       pasha5163@hotmail.com    1
## 9        rnewman sheriffleefl.org    rnewman@sheriffleefl.org    1
## 10   roger.flury  fedpol.admin.ch roger.flury@fedpol.admin.ch    1
## 11  tony.tortora      ice.dhs.gov    tony.tortora@ice.dhs.gov    1

splitted_email <- splitted_email[!duplicated(tolower(splitted_email$X3)), ]

So now I have 2224 unique rows to work with.
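To read the freq column in the duplicates table correctly, it helps to recall that duplicated() flags every occurrence after the first one. A minimal sketch on made-up addresses (addr, dup_flags and unique_addr are hypothetical names):

```r
# Hypothetical addresses: "a@x.org" appears three times (once in upper
# case), "b@y.org" twice and "c@z.org" once.
addr <- c("a@x.org", "b@y.org", "a@x.org", "c@z.org", "A@X.org", "b@y.org")

# duplicated() marks the second and later occurrences only, so an
# address seen three times is flagged twice.
dup_flags <- duplicated(tolower(addr))
print(table(tolower(addr)[dup_flags]))

# Dropping the flagged entries leaves exactly one copy per address.
unique_addr <- addr[!dup_flags]
print(unique_addr)
```

In the table above, a freq of 2 therefore means the address occurred three times in the list.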

Analysis

Now I can really start analysing my data set. An example:

head(splitted_email)

##                 X1                  X2                           X3
## 1         01nurlan           gmail.com           01nurlan@gmail.com
## 2           139011           gmail.com             139011@gmail.com
## 3          6442221           gmail.com            6442221@gmail.com
## 4             7715 rijnmond.politie.nl     7715@rijnmond.politie.nl
## 5      aaron.garza             fcc.gov          aaron.garza@fcc.gov
## 6 aaron.j.robinson         us.army.mil aaron.j.robinson@us.army.mil

Frequency of domain/country occurrence

This prints the most frequent domains (mostly the widely used email providers) that were used by the “customers”.

splitted_email_freq <- as.data.frame(table(splitted_email$X2))
splitted_email_freq <- plyr::arrange(splitted_email_freq, desc(splitted_email_freq$Freq))

head(splitted_email_freq)

##          Var1 Freq
## 1   gmail.com  422
## 2   yahoo.com  156
## 3 hotmail.com  119
## 4     dhs.gov   27
## 5  rmp.gov.my   21
## 6      dc.gov   19

A plot?

splitted_email_freq_plot <- ggplot(data=splitted_email_freq[1:20,], aes(x=reorder(Var1, Freq), y=Freq)) 
splitted_email_freq_plot <- splitted_email_freq_plot + geom_bar(stat = "identity") + coord_flip(ylim= c(0,430)) + theme(
                      axis.text.y = element_text(size = 12), 
                      axis.text.x = element_text(size = 12)) 
splitted_email_freq_plot <- splitted_email_freq_plot + xlab("Domains") + ylab("Number of users who used those email providers") + ggtitle("Which providers can be found in the address book of Hacking Team ?")
splitted_email_freq_plot <- splitted_email_freq_plot + scale_y_continuous(breaks = seq(0, 430, 30))
splitted_email_freq_plot

As one can see, there are many gmail, outlook, yahoo etc. addresses. I don’t want to display them, however.

blackListDomains <- c("outlook|gmail|yahoo|live|hotmail|googlemail")

splitted_email_freq_bld <- splitted_email[!stri_detect_regex(splitted_email$X2, blackListDomains), ]

splitted_email_freq_bld <- as.data.frame(table(splitted_email_freq_bld$X2))
splitted_email_freq_bld <- plyr::arrange(splitted_email_freq_bld, desc(splitted_email_freq_bld$Freq))

head(splitted_email_freq_bld, 15)

##                 Var1 Freq
## 1            dhs.gov   27
## 2         rmp.gov.my   21
## 3             dc.gov   19
## 4  fairfaxcounty.gov   18
## 5         ic.fbi.gov   14
## 6          kpk.go.id   14
## 7          mindef.nl   11
## 8           navy.mil   11
## 9          state.gov   11
## 10       us.army.mil   11
## 11 cybercrimeunit.gr    9
## 12  cybersecurity.my    9
## 13       ice.dhs.gov    9
## 14       politiet.no    9
## 15      usss.dhs.gov    9

OK, so the first is the Department of Homeland Security, then a Malaysian government domain (rmp.gov.my), and so it goes. To me, it was surprising to see fairfaxcounty.gov, for which HT had 18 email addresses in their contact list. There must be something going on…

Also in the top 15 there are several domains from Indonesia, the Netherlands, Norway and Greece (a country that is short of money, yet money is always found to buy 1 million euro worth of stuff from HT).
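One caveat about the blacklist used above: the pattern is not anchored, so it matches the provider names anywhere inside a domain. The sketch below compares it with an anchored variant; the anchoring is my assumption about the intent, and delivery.example is a made-up domain (domains, loose and strict are hypothetical names):

```r
# Sample domains; "delivery.example" is made up to show a false positive.
domains <- c("gmail.com", "yahoo.fr", "live.com", "delivery.example", "dhs.gov")

# Unanchored alternation, as in blackListDomains: matches anywhere.
loose <- grepl("outlook|gmail|yahoo|live|hotmail|googlemail", domains)

# Anchored variant: the provider name must be the first label of the
# domain and be followed by a dot.
strict <- grepl("^(outlook|gmail|yahoo|live|hotmail|googlemail)\\.", domains)

print(data.frame(domains, loose, strict))
```

The unanchored pattern also removes delivery.example because it contains “live”. Whether any real domain in the list is affected would need to be checked against the data.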

Top-level domains with which HT had contact

I guess that, without any selection or removal, .gov, .it and .com will be among the top 5. But what are the other frequently used TLDs? For this plot I will use the splitted_email_freq_bld dataset from above and delete all generic domains (gTLD: com, org, net, edu, gov, mil).

domain_list <- data.frame(stri_split_fixed(tolower(splitted_email_freq_bld$Var1), ".", -1, omit_empty = NA, simplify = TRUE))
domain_list[domain_list == ""] <- NA

# sapply(domain_list, class) ## all are factors :(

domain_list$X1 <- as.character(domain_list$X1)
domain_list$X2 <- as.character(domain_list$X2)
domain_list$X3 <- as.character(domain_list$X3)
domain_list$X4 <- as.character(domain_list$X4)
domain_list$X5 <- as.character(domain_list$X5)

# http://stackoverflow.com/questions/29634425/merging-two-columns-into-one-in-r
domain_list$new_NEW <- domain_list[-1][cbind(1:nrow(domain_list), max.col(!is.na(domain_list[-1]), ties.method = "last"))]

# freq again
domain_list <- as.data.frame(table(domain_list$new_NEW))
domain_list <- plyr::arrange(domain_list, desc(domain_list$Freq))
domain_list <- domain_list[!stri_detect_regex(domain_list$Var1, "com|org|net|edu|gov|mil"), ]

head(domain_list, 10)

##    Var1 Freq
## 2    it   97
## 6    de   25
## 7    uk   20
## 8    cz   15
## 9    us   15
## 10   nl   14
## 11   ch   13
## 12   my   13
## 13   ae   12
## 14   ca   12

Nothing positive, at least for the Czech Republic, Switzerland, Germany and Singapore (well, there I at least understand why).

domain_list_plot <- ggplot(data=domain_list[1:20,], aes(x=reorder(Var1, Freq), y=Freq)) 
domain_list_plot <- domain_list_plot + geom_bar(stat = "identity") + coord_flip(ylim= c(0,100)) + theme(
                      axis.text.y = element_text(size = 14), 
                      axis.text.x = element_text(size = 14)) 
domain_list_plot <- domain_list_plot + xlab("Top-Level-Domains") + ylab("Number of unique domains") + ggtitle("Which top-level domains can be found in the address book of Hacking Team?")
domain_list_plot <- domain_list_plot + scale_y_continuous(breaks = seq(0, 100, 5))
domain_list_plot
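The TLD extraction and the gTLD filter can also be sketched more compactly in base R. This is an alternative illustration, not the code used for the numbers above; example.community is a made-up domain and tld, generic and cc_tld are hypothetical names:

```r
# Sample domains; "example.community" is made up for illustration.
domains <- c("dhs.gov", "rmp.gov.my", "mindef.nl", "px.mvcr.cz", "example.community")

# Everything after the last dot is the top-level domain.
tld <- sub("^.*\\.", "", domains)

# Exact matching with %in% avoids dropping TLDs that merely contain
# one of the generic names ("community" contains "com").
generic <- c("com", "org", "net", "edu", "gov", "mil")
cc_tld <- tld[!tld %in% generic]

print(sort(table(cc_tld), decreasing = TRUE))
```

With a substring pattern such as "com|org|net|edu|gov|mil", a TLD like community would be removed as well, which exact matching avoids.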

Country: Czech Republic

splitted_email_czech <- splitted_email[stri_endswith_fixed(splitted_email$X2, ".cz"), ]
head(splitted_email_czech, 10)

##                X1         X2                      X3
## 45     afg.prague centrum.cz   afg.prague@centrum.cz
## 251         benes   stech.cz          benes@stech.cz
## 542       dobesvl   atlas.cz        dobesvl@atlas.cz
## 825      havlicek    ppcr.cz        havlicek@ppcr.cz
## 854   horinek.jan  seznam.cz   horinek.jan@seznam.cz
## 921    jakub.kriz    ppcr.cz      jakub.kriz@ppcr.cz
## 933      janmiler  seznam.cz      janmiler@seznam.cz
## 983 jindrich.hora  seznam.cz jindrich.hora@seznam.cz
## 984    jiri.jenis px.mvcr.cz   jiri.jenis@px.mvcr.cz
## 985         jirij px.mvcr.cz        jirij@px.mvcr.cz

OK, so who do we have here?

Uninteresting (because they make sense, except for CTU.cz)

http://www.BIS.cz: The Security Information Service (BIS) is an intelligence institution active within the Czech Republic.

http://www.CTU.cz: The Czech Telecommunication Office oversees electronic communications and carries out related international activities within a number of governmental and non-governmental agencies and organisations. <- The office has now been confronted with this (link); at the time of publishing, the request remained unanswered, which is very much typical for the Czech Republic.

http://www.MVCR.cz: Ministry of the Interior

http://www.UZSI.cz: The Office for Foreign Relations and Information is an intelligence service … providing the Czech legislative and executive authorities with timely, impartial and quality foreign intelligence vital for the security and protection of foreign policy interests and economic policy interests of the Czech Republic.

http://polac.cz - The Police Academy of the Czech Republic in Prague

http://www.Atlas.cz & http://www.Centrum.cz & https://www.Seznam.cz & http://Volny.cz are local email providers.

Very MUCH Interesting

http://PPCR.cz - Information on WHOIS http://www.nic.cz/whois/?d=ppcr.cz

http://npdc.cz: Martin Vanicek ? http://www.nic.cz/whois/?d=npdc.cz

http://okfk.cz: Information on WHOIS http://www.nic.cz/whois/?d=okfk.cz

http://srblad.cz: Ladislav Srb ? http://www.nic.cz/whois/?d=srblad.cz

http://stech.cz - cover up for something :)

Also, a very nice article was recently published about Tomas Hlavsa, a guy at Bull & ATOS Services, which helped the Czech government “acquire” those systems from HT (at the end of the day, they did not work in at least one case).

Country: Singapore and Switzerland

splitted_email_swi_sg <- rbind(splitted_email[stri_endswith_fixed(splitted_email$X2, ".ch"), ], splitted_email[stri_endswith_fixed(splitted_email$X2, ".sg"), ])

head(table(splitted_email_swi_sg$X2))

## 
##      aduno.ch   ar.admin.ch armasuisse.ch   baumarep.ch    cnb.gov.sg 
##             1             1             1             1             3 
##     corner.ch 
##             4

I won’t go into full detail here; this one is just for fun.
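A short note on filtering by country suffix: a literal suffix match is safer than a regular expression, because in a regex an unescaped dot matches any character and the pattern can match anywhere in the string. The sketch below uses base R's endsWith (available since R 3.3.0) in place of stri_endswith_fixed; the two .example domains are made up and the variable names are hypothetical:

```r
# Sample domains; the two ".example" entries are made up.
domains <- c("seznam.cz", "ppcr.cz", "aczel.example", "czechmail.example")

# As a regex, ".cz" means "any character followed by cz", anywhere.
loose <- grepl(".cz", domains)

# A literal suffix test only accepts domains that really end in ".cz".
strict <- endsWith(domains, ".cz")

print(data.frame(domains, loose, strict))
```

Here aczel.example passes the regex test because “a” followed by “cz” satisfies the pattern, while the suffix test rejects it.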

Summary

The goal was to take a close look at the email addresses which have been, or are being, used for communication with Hacking Team. Specifically, I was interested in the Czech ones, as there is a continuing perception that the Czechs don’t spy on people (both inside and outside the country). With the release of the email addresses from the Hacking Team address book, this has been proven to be totally wrong.

About the author

I am a student from Germany who is interested in Asia (and its small nations).