Introduction

This project is part of a course in data science and business analytics. We chose to investigate Agora, a darkweb marketplace that was very popular before it closed in 2015. We hope our analysis will contribute intelligence about this specific platform and help in understanding how platforms that support illegal trade are organized.

The Dataset

For our project, we used a dataset we found on Kaggle called Dark net marketplace drug data.
It was built from an HTML rip made by a Reddit user called “usheep”, who was blackmailing the vendors on the website by threatening to expose them to the police if they didn’t meet his demands. He eventually posted the HTML rip; we don’t know what happened afterwards, as there is no further information about him or his demands.

The dataset was ripped from a dark/deep web marketplace called Agora and covers the years 2014 and 2015. It contains offers for drugs, weapons, books, services and other items. About 100,000 listings are included. The marketplace was shut down a few months after this data release; we don’t know whether the shutdown is related to the Reddit user.

It is organized by:

Vendor: The seller
Category: Where in the marketplace the item falls under
Item: The title of the listing
Description: The description of the listing
Price: Cost of the item (averaged across any duplicate listings between 2014 and 2015)
Origin: Where the item is listed to have shipped from
Destination: Where the item is listed to be shipped to (blank means no information was provided, but most likely worldwide; the dataset provider did not enter worldwide for blanks so as not to make assumptions)
Rating: The rating of the seller (a rating of [0 deals] or anything else with “deals” in it means there is no concrete rating, as the number of deals is too small for a rating to be displayed)
Remarks: The only remark options are blank or “Average price may be skewed outliar > .5 BTC found”, which is self-explanatory.

Motivation

We both took classes about criminality, web criminality and cybersecurity before this one, so we thought it would be interesting to link those classes together and work on a criminality-related project.
Furthermore, we often read or hear about the dark/deep web and the various things one can find there. Finding this dataset gave us the opportunity to explore the reality of a marketplace selling mostly illegal items.

Literature

While the darkweb is a set of unreferenced websites that can provide anonymity and freedom of speech, it is also heavily used by criminals to meet, make deals, exchange information and offer illegal products and services.

Because transactions are anonymous, it is well documented that these places share common features in the governance modes they develop, such as reputation systems, in order to ensure a certain level of trust among users. We can see such a system in the ratings present in the Agora dataset.

Although identifying the individuals behind the transactions is very difficult, analyzing the dynamics present in the marketplaces has a lot to offer in terms of information and intelligence about criminal activity in general, and could be used as an indicator of international trafficking and flows.

Research Questions

Our project aims to give insights into the composition of Agora as a marketplace: which products are most available through the platform, where they come from, and to which destinations they are shipped. It will also investigate how criminal activity is distributed: are the majority of offers the work of a few prolific vendors? Does this organization depend on the categories and types of products offered? Are the most prolific vendors specialized actors of a sub-market, or do they diversify their scope of activity?

Thus, the analysis will be organized around these 3 dimensions:

  1. Investigating the type of products available in Agora

  2. Investigating the geographical distribution of proposed products in Agora

  3. Investigating the distribution of activity amongst Agora’s Vendors
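The third dimension can be made concrete with a simple concentration measure: the share of all offers posted by the k most active vendors. The following is a minimal sketch on made-up vendor names; `top_vendor_share` is a hypothetical helper and not part of our analysis code, and it assumes the vendor vector contains no missing values.

```r
# Share of all offers posted by the k most active vendors.
# `vendor` is a character vector with one entry per offer (assumed to have no NA).
top_vendor_share <- function(vendor, k) {
  offers_per_vendor <- sort(table(vendor), decreasing = TRUE)
  k <- min(k, length(offers_per_vendor))  # never ask for more vendors than exist
  sum(offers_per_vendor[seq_len(k)]) / length(vendor)
}

# Made-up example: 8 offers posted by 4 vendors
vendors <- c("a", "a", "a", "a", "b", "b", "c", "d")
top_vendor_share(vendors, 1)
top_vendor_share(vendors, 2)
```

A value close to 1 for a small k would indicate that a few prolific vendors account for most of the offers.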

Data Cleaning

To pursue our research questions, a substantial amount of work was required to clean the dataset. In particular, the declared product categories had to be decomposed into levels of increasing precision (branches) that describe the content of the offer. From coarsest to finest, we named them “Category”, “Type1” and “Type2”.

What took most of our time was recoding the declared destinations and origins of offers, because they were free-text fields filled in by the vendors. They contained many spelling mistakes and non-standardized ways of declaring locations. Many indicated areas of different scopes (cities, countries, continents, etc.) in varying forms (country code, full English name, French name, etc.). To perform this cleaning step, we mainly used regular expressions and the “unnest_tokens” function from the tidytext text-mining package. An automatic way of making these corrections would have been nice, but the data was so messy that we had to proceed iteratively and by exploration. It was also a good opportunity to get familiar with the dataset. After this first cleaning step, we checked for matches against the fixed, standardized lists of countries and regions provided by the “countrycode” and “worldmap” packages.
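To give an idea of the kind of recoding involved, here is a minimal sketch in base R. The lookup table and the helper `standardize_place` are hypothetical and far smaller than what the real data requires; they only illustrate mapping free-text variants to one canonical label.

```r
# Hypothetical lookup: raw spelling -> canonical label (illustrative only)
place_lookup <- c(
  "united states"  = "usa",
  "unitedstates"   = "usa",
  "us"             = "usa",
  "united kingdom" = "uk",
  "world wide"     = "worldwide"
)

# Lower-case and trim the input, then replace any value found in the lookup.
# Values that are not in the lookup are returned unchanged.
standardize_place <- function(x, lookup = place_lookup) {
  x <- trimws(tolower(x))
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

standardize_place(c("United States", " UK ", "World Wide", "germany"))
```

Values left unchanged by such a lookup are the ones that still need a manual look, which is why the process ended up being iterative.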

We decided to set aside the more detailed variables “Item” and “Item description”, because they were entered by the vendors in a totally non-standardized manner. They contain very messy text data and would have required a lot of preprocessing and cleaning to become useful for our analysis.

We also chose not to focus on prices and ratings. Analyzing prices across offers would have required mining the item description variables to find the quantities contained in the deals and obtain commensurable data. Such a process would have taken more time than we could allow for the data cleaning phase. As for ratings, all were very high, so we figured they wouldn’t be the most interesting dimension to analyze in a first exploration of this dataset.

Data cleaning for categories: Pruning and regrouping

Exploratory Observations

Our first thoughts were:
- 109 categories is too many
- We can remove categories that occur only once
- We can regroup the less frequent categories
- There are many categories we can regroup

Regrouping categories at different levels

In this part, we tried to regroup the 109 categories at different levels so that only a few remain, which makes them easier to use in the analysis.

## using regex for the first group
library(dplyr)     # %>%, mutate, count, filter
library(readr)     # write_csv
library(tidytext)  # unnest_tokens, stop_words
library(knitr)     # kable

# top-level groups that are split off from the rest of the category path
group_pattern <- '^(Drugs|Services|Data|Info|Forgeries|Electronics|Weapons|Counterfeits|Tobacco|Chemicals|Drug paraphernalia)(/.*)'

dark_market_data_cat <- dark_market_data %>%
  mutate(Group = gsub(group_pattern, '\\1', Category)) %>%  # leading group
  mutate(Spec1 = gsub(group_pattern, '\\2', Category))      # rest of the path, with its leading '/'

dark_market_data_cat %>% count(Spec1, sort=TRUE)
#We still have 106 different categories

#use regex to split the following levels (spec1 and spec2)
#Spec3 starts as a copy of Spec1, then is split into its first level (Spec3) and the rest (Spec4)
dark_market_data_cat <- dark_market_data_cat %>%
  mutate(Spec2 = gsub('(/[^/]*)(.*)', '\\1', Spec1)) %>%
  mutate(Spec3 = Spec1) %>%
  mutate(Spec4 = gsub('(/[^/]*)(.*)', '\\2', Spec3)) %>%
  mutate(Spec3 = gsub('(/[^/]*)(.*)', '\\1', Spec3))

#checking results
dark_market_data_cat %>% count(Spec1, sort = TRUE)
dark_market_data_cat %>% count(Spec2, sort = TRUE)
dark_market_data_cat %>% count(Spec3, sort = TRUE)
dark_market_data_cat %>% count(Spec4,sort=TRUE)

#dropping everything that is useless
dark_market_data_cat <- dark_market_data_cat %>%  mutate(Spec1=gsub('/','',Spec1),Spec2=gsub('/','',Spec2),Spec3=gsub('/','',Spec3),Spec4=gsub('/','',Spec4)) %>% mutate(Spec1=Spec3,Spec2=Spec4)

drop.cols <- c('Spec3','Spec4')
dark_market_data_cat <- dark_market_data_cat %>% select(-one_of(drop.cols))

dark_market_data_cat %>% transmute(Group,Spec1,Spec2)

dark_market_data <- dark_market_data_cat
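As a cross-check of the regex approach, the same hierarchy can be obtained by splitting the category path on “/”. This is a minimal sketch in base R, assuming category strings of the form Group/Type1/Type2; the function name `split_category` and the example strings are illustrative. Unlike the code above, it splits every path, so a value such as “Information/eBooks” would not be kept as a single group.

```r
# Split a category path such as "Drugs/Cannabis/Weed" into up to three levels.
# Missing levels are returned as NA.
split_category <- function(category) {
  parts <- strsplit(category, "/", fixed = TRUE)
  data.frame(
    Category = category,
    Group = vapply(parts, function(p) p[1], character(1)),
    Type1 = vapply(parts, function(p) p[2], character(1)),
    Type2 = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
  )
}

# Illustrative inputs (assumed format, not copied from the raw data)
levels_df <- split_category(c("Drugs/Cannabis/Weed", "Counterfeits/Watches", "Other"))
levels_df
```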

### Some more Categories adjustments

#display per value freq:
dark_market_data %>% count(Category,sort=TRUE) %>% tail()
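#remove values with only one occurrence
#ungroup() afterwards so that later counts and summaries are not computed per Category
dark_market_data <- dark_market_data %>% group_by(Category) %>% filter(n() > 1) %>% ungroup()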
#check
dark_market_data %>% count(Category,sort=TRUE) %>% tail()

Data cleaning for origins and destinations

Many origin and destination values listed a few countries and then added an exception, for example “no australia”. About 2848 values contained such exceptions, so we tried to filter them out.

We could see that there are more destinations than origins, so we knew the cleaning workload would probably be heavier.

#getting all the values with exceptions
pattern <- '(exc|exept)'
dark_market_data <- dark_market_data %>% mutate(excepts_dest = grepl(pattern,tolower(Destination)))
dark_market_data <- dark_market_data %>% mutate(excepts_orig = grepl(pattern,tolower(Origin)))

#find remaining "no .... country"
#(the preview lines lower-case the text too, so they show exactly the rows that get flagged)
dark_market_data %>% transmute(no=grepl('no .*',tolower(Destination)),tolower(Destination)) %>% filter(no==TRUE)
dark_market_data <- dark_market_data %>% mutate(no_dest = grepl('no .*',tolower(Destination)))
dark_market_data %>% transmute(no=grepl('no .*',tolower(Origin)),tolower(Origin)) %>% filter(no==TRUE)
dark_market_data <- dark_market_data %>% mutate(no_orig = grepl('no .*',tolower(Origin)))


## Then filter every TRUE on both cols with excepts

dark_market_data_old <- dark_market_data
dark_market_data_old %>% filter(excepts_orig == TRUE | excepts_dest == TRUE | no_orig == TRUE | no_dest == TRUE) %>% transmute(Destination,Origin,ID,excepts_dest,excepts_orig)
dark_market_data <- dark_market_data %>% filter(excepts_orig == FALSE & excepts_dest == FALSE & no_orig == FALSE & no_dest == FALSE)

#SAVE what has been set aside in case we want to further investigate or clean it
dark_market_data_filtered_out <- dark_market_data_old %>% filter(excepts_orig == TRUE | excepts_dest == TRUE | no_orig == TRUE | no_dest == TRUE)

## *** SAVING WHAT WAS SET ASIDE ***
# the output file is passed by position, which works whether readr names this argument `path` or `file`
write_csv(dark_market_data_filtered_out, "../dark_market_filtered_out.csv")


# number of observations filtered out:
nrow(dark_market_data_old)-nrow(dark_market_data)

### SOME FIRST INSIGHTS
dark_market_data %>% summarise(n_obs=length(unique(ID)),n_cat=length(unique(Category)),n_vendors=length(unique(Vendor)),destinations=length(unique(Destination)),origins = length(unique(Origin))) %>% kable()
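One caveat about the `'no .*'` pattern used above: it matches the letters “no” followed by a space anywhere in the string, including at the end of a longer word. The sketch below, on made-up destination strings (not taken from the dataset), compares it with a word-boundary variant. The `\\bno\\b` pattern is only a suggestion, not what was used to produce the results in this report.

```r
# Made-up destination strings for illustration
dest <- c("worldwide no australia", "torino italy", "usa except hawaii", "europe")

# Pattern used in the cleaning code: "no" followed by a space, anywhere in the string
naive <- grepl("no .*", dest)

# Suggested variant: "no" as a standalone word only
bounded <- grepl("\\bno\\b", dest, perl = TRUE)

data.frame(dest, naive, bounded)
```

In this toy example, “torino italy” is flagged by the first pattern but not by the second, so the number of rows filtered out may be slightly overestimated.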

Data preparation for Places Processing with gsub and unnest_tokens

Duplicates within the same line must be spotted before unnesting words, because unnest_tokens separates words on whitespace. A record such as “switzerland switzerland” would otherwise be counted twice at the end of the process, so we need to make sure such repeats are merged into one word. The same applies to multi-word names (e.g. “new zealand”), which would lose their meaning once split.
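A generic alternative to listing each repeated pair by hand is a backreference regex that collapses any word repeated consecutively. This is a sketch, not the method used for this report; `collapse_repeats` is a hypothetical helper. It does not handle multi-word names, which still need an explicit list.

```r
# Collapse consecutive repetitions of the same word into a single occurrence.
# \\1 refers back to the first captured word; \\b keeps the match on whole words.
collapse_repeats <- function(x) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", x, perl = TRUE)
}

collapsed <- collapse_repeats(c("switzerland switzerland", "uk uk uk", "new zealand"))
collapsed
```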

dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()

### list of two-word countries / duplicates to join, which would lose their meaning after unnest

duplicates <- list(
  c('middle east','middleeast'), c('hong kong','hongkong'),
  c('netherlands netherlands','netherlands'), c('germany germany','germany'),
  c('uk uk','uk'), c('canada canada','canada'),
  c('eu schengen','eu'), c('united states','unitedstates'),
  c('s. america','southamerica'), c('usa usa','usa'),
  c('switzerland switzerland','switzerland'), c('world wide','worldwide'),
  c('worldwide international','worldwide'), c('international worldwide','worldwide'),
  c('new zealand','newzealand'), c('e.u. countries','eu'),
  c('everywhere worldwide','worldwide'), c('all','worldwide'),
  c('every where','worldwide'), c('worldwide any destination','worldwide'),
  c('* w o r l d w i d e *','worldwide'), c('planet earth','worldwide'),
  c('rest of the world','worldwide'), c('united kingdom','uk')
)



## convert all destinations and origins to lower

dark_market_data <- dark_market_data %>% mutate(Origin = tolower(Origin),Destination = tolower(Destination))

dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()

## apply conversion to list of duplicates
## fixed = TRUE matches each entry literally, so '.' and '*' are not read as regex operators
for (i in duplicates) {
  dark_market_data$Origin <- gsub(i[1], i[2], dark_market_data$Origin, fixed = TRUE)
  dark_market_data$Destination <- gsub(i[1], i[2], dark_market_data$Destination, fixed = TRUE)
}


dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()

### unnest the words contained in destination and origin, add them to new dataset
df_origin <- dark_market_data %>% unnest_tokens(origin,Origin,drop=FALSE)
df_destination <- dark_market_data %>% unnest_tokens(destination,Destination,drop=FALSE)

df_origin %>% tail() %>% kable()
df_destination %>% tail() %>% kable()

### anti_join with a custom data frame of stop_words

stop_words %>% kable()

custom_stop_words <- stop_words %>% filter(word != 'us' & word != 'state' & word != 'states')
custom_stop_words %>% kable()

stop_words_origin <- custom_stop_words %>% mutate(origin=word)
stop_words_destination <- custom_stop_words %>% mutate(destination=word)

df_origin <- df_origin %>% anti_join(stop_words_origin,by='origin')
df_destination <- df_destination %>% anti_join(stop_words_destination,by='destination')


### check and comparison of what has been done for now
df_origin %>% count(Origin,sort=TRUE) %>% kable()
df_origin %>% count(origin, sort = TRUE) %>% kable()
df_destination %>% count(Destination,sort=TRUE) %>% kable()
df_destination %>% count(destination, sort = TRUE) %>% kable()

Origin and Destination columns

We then had to normalize the data in the origin and destination columns in order to produce graphs and maps in the analysis part of the report. We don’t include that source code in this report because it is quite long and not that interesting: it mostly removes typos and words that aren’t countries, and finally formats everything the same way for later use.
The whole process can be found in the source code linked with this report.

Saving dataframes

Now that the cleaning part is done, we saved our results in different dataframes, described just below.
This part took us a lot of time because users were not very careful when entering information on Agora, but the data is really practical to use now that it is clean.

Cleaned Datasets Description

After applying the necessary modifications, we were ready to pursue our project’s goals and begin the exploratory analysis of the cleaned dataset. The cleaned data is composed of a main dataset, where each observation corresponds to a unique offer posted on the Agora marketplace, and 2 other datasets composed of the origins and destinations mentioned in the offers, where each record corresponds to a unique declared location (origin/destination) for a specific offer.

The main dataset, which we’ll call the “dark market dataset”, is composed of 106337 observations. The dataset of declared origins is composed of 99988 mentions of places and the dataset of declared destinations is composed of 61707 mentions.

Variables and values:

The table below shows the list of remaining variables and the number of distinct values they’re composed of.

Dataset variables and number of distinct values
Variable n_distinct
ID 106337
Vendor 3166
Category 15
Item 104348
Item Description 67997
Price 99320
Rating 478
Type1 46
Type2 62
Offer_ID 106337
dest_list 130
orig_list 153

As we can see, the numbers of distinct items and Offer_IDs are very similar. This is partly because the dataset provider mentioned having removed duplicates, but it could also mean that each offer describes its item differently or concerns an item of a different nature every time. We can also see that the descriptions of those items vary a lot: there are approximately 1.5 distinct items per distinct description (104348 items for 67997 descriptions).

Places

The following tables show the different places declared as origins and destinations and their respective counts:

Distinct values for declared destinations of products
clean_destination n
africa 24
asia 101
australia 4743
austria 10
belgium 41
brazil 3
canada 1316
china 7
denmark 34
europe 6486
finland 63
france 231
germany 1122
Grenada 2
Hong Kong SAR China 26
hungary 11
india 9
internet 15
iraq 2
ireland 110
israel 19
italy 4
Japan 7
luxembourg 4
mexico 9
mississippi 2
netherlands 76
New Zealand 230
norway 198
oceania 61
other 428
Philippines 21
poland 8
scandinavia 244
Singapore 7
spain 10
sweden 379
switzerland 313
thailand 19
uk 3607
usa 18363
worldwide 23342

Distinct values for declared origins of products
clean_origin n
africa 22
argentina 85
asia 110
australia 8860
austria 252
belarus 9
belgium 1171
belize 2
bolivia 24
brazil 20
cambodia 2
canada 5518
Cayman Islands 2
china 4186
croatia 4
czech republic 5
czech republicrepublic 273
denmark 383
Dominican Republic 9
estonia 4
europe 5082
fiji 11
finland 67
france 844
germany 7685
guatemala 5
Hong Kong SAR China 417
hungary 8
india 1122
internet 4010
ireland 269
israel 19
italy 279
japan 4
Japan 7
latvia 9
lithuania 6
luxembourg 2
mexico 105
morocco 3
netherlands 6701
New Zealand 180
North America 15
northamerica 2
norway 336
oceania 61
other 137
pakistan 55
panama 8
peru 7
Philippines 278
poland 111
romania 2
scandinavia 55
serbia 2
seychelles 4
singapore 14
Singapore 7
slovakia 9
South Africa 177
spain 442
St. Vincent & Grenadines 2
swaziland 2
sweden 1059
switzerland 451
thailand 66
Thailand 3
uk 11346
ukraine 127
usa 34962
worldwide 2472

Categories and nested categories

Here are the different categories present in our cleaned dataset.

Categories and nested types
Category | Type1_list | Type2_list
Chemicals | Chemicals | Chemicals
Counterfeits | Accessories, Clothing, Electronics, Money, Watches |
Data | Accounts, Pirated, Software |
Drug paraphernalia | Containers, Grinders, Injecting equipment, Paper, Pipes, Scales, Stashes | Filters, Needles, Other, Syringes
Drugs | Barbiturates, Benzos, Cannabis, Dissociatives, Ecstasy, Opioids, Other, Prescription, Psychedelics, RCs, Steroids, Stimulants, Weight loss | 2C, 5-MeO, Buprenorphine, Cocaine, Codeine, Concentrates, Dihydrocodeine, DMT, Edibles, Fentanyl, GBL, GHB, Hash, Heroin, Hydrocodone, Ketamine, LSD, MDA, MDMA, Mephedrone, Mescaline, Meth, Morphine, Mushrooms, MXE, NB, Opium, Other, Others, Oxycodone, PCP, Pills, Prescription, Salvia, Seeds, Shaketrim, Speed, Spores, Synthetics, Weed
Electronics | Electronics | Electronics
Forgeries | Forgeries, Other, Physical documents, Scans | Forgeries, Photos
Info | eBooks | AliensUFOs, Anonymity, Doomsday, Drugs, Economy, IT, Making money, Other, Philosophy, Politics, Psychology, RelationshipsSex, Science
Information/eBooks | InformationeBooks | Information
Information/Guides | InformationGuides | Information
Jewelry | Jewelry | Jewelry
Other | Other | Other
Services | Advertising, Hacking, Money, Other, Travel |
Tobacco | Paraphernalia, Smoked |
Weapons | Ammunition, Fireworks, Lethal firearms, Melee, Non-lethal firearms |

Types 1 and nested types (2)
Category | Type1 | Type2_list
Chemicals | Chemicals | Chemicals
Counterfeits | Accessories |
Counterfeits | Clothing |
Counterfeits | Electronics |
Counterfeits | Money |
Counterfeits | Watches |
Data | Accounts |
Data | Pirated |
Data | Software |
Drug paraphernalia | Containers |
Drug paraphernalia | Grinders |
Drug paraphernalia | Injecting equipment | Filters, Needles, Other, Syringes
Drug paraphernalia | Paper |
Drug paraphernalia | Pipes |
Drug paraphernalia | Scales |
Drug paraphernalia | Stashes |
Drugs | Barbiturates |
Drugs | Benzos |
Drugs | Cannabis | Concentrates, Edibles, Hash, Seeds, Shaketrim, Synthetics, Weed
Drugs | Dissociatives | GBL, GHB, Ketamine, MXE, Other, PCP
Drugs | Ecstasy | MDA, MDMA, Other, Pills
Drugs | Opioids | Buprenorphine, Codeine, Dihydrocodeine, Fentanyl, Heroin, Hydrocodone, Morphine, Opium, Other, Oxycodone
Drugs | Other |
Drugs | Prescription |
Drugs | Psychedelics | 2C, 5-MeO, DMT, LSD, Mescaline, Mushrooms, NB, Other, Others, Salvia, Spores
Drugs | RCs |
Drugs | Steroids |
Drugs | Stimulants | Cocaine, Mephedrone, Meth, Prescription, Speed
Drugs | Weight loss |
Electronics | Electronics | Electronics
Forgeries | Forgeries | Forgeries
Forgeries | Other |
Forgeries | Physical documents |
Forgeries | Scans | Photos
Info | eBooks | AliensUFOs, Anonymity, Doomsday, Drugs, Economy, IT, Making money, Other, Philosophy, Politics, Psychology, RelationshipsSex, Science
Information/eBooks | InformationeBooks | Information
Information/Guides | InformationGuides | Information
Jewelry | Jewelry | Jewelry
Other | Other | Other
Services | Advertising |
Services | Hacking |
Services | Money |
Services | Other |
Services | Travel |
Tobacco | Paraphernalia |
Tobacco | Smoked |
Weapons | Ammunition |
Weapons | Fireworks |
Weapons | Lethal firearms |
Weapons | Melee |
Weapons | Non-lethal firearms |

And here is the number of distinct values for each group:

Counts of distinct values per Category
Category Type1 Type2 Vendor dest_list orig_list Item Offer_ID
Drugs 13 41 2923 121 143 88573 89697
Services 5 1 322 10 30 2557 2642
Counterfeits 5 1 102 8 25 2316 2367
Info 1 13 88 6 11 2023 2169
Data 3 1 145 9 21 1910 2118
Other 1 1 347 15 23 1402 1425
Forgeries 4 3 108 6 20 1018 1051
Information/Guides 1 1 76 5 10 908 927
Information/eBooks 1 1 67 5 10 895 918
Drug paraphernalia 7 5 80 9 13 838 840
Weapons 5 1 81 16 24 655 656
Electronics 1 1 123 12 16 594 599
Tobacco 2 1 40 6 13 383 420
Jewelry 1 1 24 3 12 415 418
Chemicals 1 1 18 6 10 90 90

Analysis: number of distinct values per categories of products

  • The categories with the highest numbers of offers are Drugs (by far), followed by Services, Counterfeits, Info and Data in far more moderate proportions.

  • It would be interesting to normalize the numbers of distinct vendors, destinations and origins by the number of offers in each category (e.g. express them as percentages) to see whether trends differ between categories.

  • For all categories, there are more distinct declared origins than distinct declared destinations. It would be interesting to investigate why.
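The normalization suggested above could look like the following sketch, which uses a few rows copied from the table of counts per category. The column name `vendors_per_100_offers` is our own, and the same ratio could be computed for dest_list and orig_list.

```r
# A few rows copied from the "Counts of distinct values per Category" table
counts <- data.frame(
  Category = c("Drugs", "Services", "Weapons", "Chemicals"),
  Vendor   = c(2923, 322, 81, 18),
  Offer_ID = c(89697, 2642, 656, 90)
)

# Distinct vendors per 100 offers in each category
counts$vendors_per_100_offers <- round(100 * counts$Vendor / counts$Offer_ID, 2)
counts
```

On these rows, Drugs has far fewer distinct vendors per offer than the smaller categories, which hints that vendors post more offers each in that category.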

Counts per Type 1
Category Type1 Type2 Vendor dest_list orig_list Item Offer_ID
Drugs Cannabis 7 1365 69 79 30052 30280
Drugs Ecstasy 4 968 53 55 13672 13867
Drugs Stimulants 5 1223 65 69 12013 12196
Drugs Psychedelics 11 634 31 42 8010 8084
Drugs Opioids 11 721 37 51 6609 6675
Drugs Prescription 1 570 32 47 5489 5556
Drugs Benzos 1 491 32 46 5322 5384
Drugs Steroids 1 139 18 28 2716 2761
Info eBooks 13 88 6 11 2023 2169
Drugs RCs 1 138 17 23 2064 2092
Drugs Dissociatives 6 262 18 28 1601 1659
Services Money 1 224 9 24 1445 1481
Other Other 1 347 15 23 1402 1425
Counterfeits Watches 1 18 5 10 1264 1309
Data Accounts 1 95 8 19 1032 1233
Information/Guides InformationGuides 1 76 5 10 908 927
Information/eBooks InformationeBooks 1 67 5 10 895 918
Drugs Other 1 230 17 34 861 864
Forgeries Physical documents 1 77 6 15 607 616
Electronics Electronics 1 123 12 16 594 599
Data Pirated 1 34 4 6 526 529
Services Other 1 126 8 16 477 487
Services Hacking 1 63 5 10 428 453
Jewelry Jewelry 1 24 3 12 415 418
Tobacco Smoked 1 32 6 11 356 393
Counterfeits Money 1 62 7 17 384 385
Counterfeits Clothing 1 14 2 7 359 364
Data Software 1 68 7 9 353 356
Weapons Lethal firearms 1 51 13 16 343 344
Forgeries Scans 1 48 4 10 319 327
Counterfeits Accessories 1 25 5 14 250 250
Drugs Weight loss 1 67 11 21 246 249
Drug paraphernalia Pipes 1 42 8 10 195 195
Drug paraphernalia Containers 1 18 5 7 185 186
Drug paraphernalia Stashes 1 10 5 5 149 149
Weapons Ammunition 1 23 8 11 138 138
Services Advertising 1 15 3 5 131 131
Drug paraphernalia Grinders 1 4 6 7 106 106
Weapons Melee 1 16 6 12 103 103
Forgeries Other 1 25 3 8 100 100
Drug paraphernalia Injecting equipment 4 25 5 6 96 96
Chemicals Chemicals 1 18 6 10 90 90
Services Travel 1 6 2 4 90 90
Drug paraphernalia Paper 1 3 3 4 61 61
Counterfeits Electronics 1 16 3 11 59 59
Weapons Non-lethal firearms 1 11 6 9 57 57
Drug paraphernalia Scales 1 9 5 4 46 47
Drugs Barbiturates 1 13 4 7 30 30
Tobacco Paraphernalia 1 10 3 5 27 27
Weapons Fireworks 1 9 3 6 14 14
Forgeries Forgeries 1 4 1 4 8 8

Counts per Type 2
Category Type1 Type2 Vendor dest_list orig_list Item Offer_ID
Drugs Cannabis Weed 1124 58 69 20584 20747
Drugs Ecstasy Pills 448 34 36 6759 6798
Drugs Ecstasy MDMA 753 47 45 5651 5782
Drugs Stimulants Cocaine 696 45 51 5502 5603
Drugs Prescription NA 570 32 47 5489 5556
Drugs Benzos NA 491 32 46 5322 5384
Drugs Cannabis Concentrates 331 22 26 4221 4247
Drugs Psychedelics LSD 327 25 33 3539 3564
Drugs Cannabis Hash 383 37 38 2948 2969
Drugs Steroids NA 139 18 28 2716 2761
Drugs Stimulants Meth 320 23 31 2392 2427
Drugs Stimulants Speed 310 33 30 2164 2178
Drugs RCs NA 138 17 23 2064 2092
Drugs Stimulants Prescription 335 22 30 1929 1955
Drugs Opioids Heroin 240 19 19 1688 1693
Services Money NA 224 9 24 1445 1481
Other Other Other 347 15 23 1402 1425
Drugs Opioids Oxycodone 241 18 27 1334 1343
Counterfeits Watches NA 18 5 10 1264 1309
Data Accounts NA 95 8 19 1032 1233
Drugs Opioids NA 235 20 27 1204 1207
Drugs Psychedelics Mushrooms 183 12 18 1122 1127
Drugs Cannabis Edibles 134 13 15 1097 1101
Drugs Ecstasy Other 114 13 17 957 966
Drugs Psychedelics NB 81 12 20 958 959
Information/Guides InformationGuides Information 76 5 10 908 927
Information/eBooks InformationeBooks Information 67 5 10 895 918
Drugs Psychedelics 2C 119 9 18 905 917
Drugs Dissociatives Ketamine 149 15 24 873 906
Drugs Other NA 230 17 34 861 864
Drugs Opioids Fentanyl 99 11 13 833 848
Drugs Psychedelics DMT 126 16 18 694 723
Info eBooks Other 45 4 7 673 691
Drugs Cannabis Synthetics 55 10 19 635 637
Drugs Opioids Other 165 16 23 628 631
Forgeries Physical documents NA 77 6 15 607 616
Electronics Electronics Electronics 123 12 16 594 599
Data Pirated NA 34 4 6 526 529
Services Other NA 126 8 16 477 487
Drugs Cannabis Seeds 60 10 11 458 458
Services Hacking NA 63 5 10 428 453
Jewelry Jewelry Jewelry 24 3 12 415 418
Drugs Dissociatives MXE 77 9 15 380 404
Tobacco Smoked NA 32 6 11 356 393
Counterfeits Money NA 62 7 17 384 385
Counterfeits Clothing NA 14 2 7 359 364
Data Software NA 68 7 9 353 356
Weapons Lethal firearms NA 51 13 16 343 344
Forgeries Scans Photos 48 4 10 319 327
Drugs Ecstasy MDA 62 10 13 310 321
Info eBooks Making money 46 4 6 307 313
Info eBooks Drugs 26 3 6 278 289
Drugs Opioids Buprenorphine 72 12 16 278 282
Drugs Psychedelics Other 52 9 15 272 272
Counterfeits Accessories NA 25 5 14 250 250
Drugs Weight loss NA 67 11 21 246 249
Drugs Opioids Morphine 68 11 19 246 248
Drugs Psychedelics 5-MeO 35 8 11 212 213
Drugs Dissociatives GHB 42 9 12 205 206
Info eBooks Anonymity 23 3 4 199 204
Drug paraphernalia Pipes NA 42 8 10 195 195
Drugs Opioids Hydrocodone 61 7 6 190 190
Drug paraphernalia Containers NA 18 5 7 185 186
Info eBooks Science 12 3 4 155 163
Drug paraphernalia Stashes NA 10 5 5 149 149
Info eBooks RelationshipsSex 16 3 4 141 145
Info eBooks IT 23 3 3 142 144
Weapons Ammunition NA 23 8 11 138 138
Services Advertising NA 15 3 5 131 131
Drugs Cannabis Shaketrim 36 6 8 121 121
Drug paraphernalia Grinders NA 4 6 7 106 106
Drugs Psychedelics Others 26 7 11 106 106
Weapons Melee NA 16 6 12 103 103
Forgeries Other NA 25 3 8 100 100
Drugs Opioids Codeine 37 7 14 92 92
Chemicals Chemicals Chemicals 18 6 10 90 90
Services Travel NA 6 2 4 90 90
Drugs Opioids Opium 34 11 14 85 87
Drugs Psychedelics Mescaline 28 7 12 86 86
Drugs Psychedelics Spores 14 5 6 79 80
Drugs Dissociatives GBL 20 6 11 76 76
Info eBooks Economy 13 4 3 75 76
Drugs Dissociatives Other 11 5 6 63 63
Drug paraphernalia Paper NA 3 3 4 61 61
Counterfeits Electronics NA 16 3 11 59 59
Weapons Non-lethal firearms NA 11 6 9 57 57
Drugs Opioids Dihydrocodeine 8 3 4 54 54
Drug paraphernalia Scales NA 9 5 4 46 47
Drug paraphernalia Injecting equipment Syringes 15 5 3 45 45
Info eBooks Doomsday 9 2 2 42 43
Info eBooks Psychology 11 3 2 40 40
Drugs Psychedelics Salvia 15 3 8 37 37
Drugs Stimulants Mephedrone 11 8 6 33 33
Drug paraphernalia Injecting equipment Other 7 3 5 30 30
Drugs Barbiturates NA 13 4 7 30 30
Tobacco Paraphernalia NA 10 3 5 27 27
Info eBooks Politics 7 3 2 26 26
Info eBooks Philosophy 5 3 3 25 25
Drug paraphernalia Injecting equipment Needles 7 4 3 15 15
Weapons Fireworks NA 9 3 6 14 14
Info eBooks AliensUFOs 5 3 3 10 10
Forgeries Forgeries Forgeries 4 1 4 8 8
Drug paraphernalia Injecting equipment Filters 4 2 2 6 6
Drugs Dissociatives PCP 2 2 2 4 4

Analysis: Number of distinct values per categories and types of products

  • The products with the highest numbers of offers are almost exclusively drugs, with weed, ecstasy and cocaine in the top 4, which are the most commonly used drugs according to the literature. One explanation could be that demand for these products is high, which gives a good incentive to offer them on the web. Interestingly, prescription drugs are also widely available, but this could be because the label covers many different products (further analysis of the item description variable could clarify this).

  • Outside of drugs, the most offered products and services are “money” (from the category alone it is hard to know what this covers, and it deserves further investigation), counterfeit watches, stolen data from hacked accounts, and guides and ebooks. The latter interestingly highlight the fact that darknet markets are not only places of illegal trade, but also places to access information, learn skills and share tips for committing crimes. Forgeries, fake official papers and counterfeit goods also account for a substantial share of the offers.

Research Question 1: Available types of products and composition of the marketplace

Frequencies of Categories
Category n freq
Drugs 89697 0.8435164
Services 2642 0.0248455
Counterfeits 2367 0.0222594
Info 2169 0.0203974
Data 2118 0.0199178
Other 1425 0.0134008
Forgeries 1051 0.0098837
Information/Guides 927 0.0087176
Information/eBooks 918 0.0086329
Drug paraphernalia 840 0.0078994
Weapons 656 0.0061691
Electronics 599 0.0056330
Tobacco 420 0.0039497
Jewelry 418 0.0039309
Chemicals 90 0.0008464

Analysis: Distribution of offers among categories of products (frequencies and percentages)

  • As mentioned earlier, the Drugs category contains the majority of offers (about 84%). Among the other categories, offers are more evenly distributed.
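The freq column in the table above is simply each category's count divided by the total number of offers, for instance 89697 / 106337 ≈ 0.8435 for Drugs. Here is a minimal base R sketch on a toy vector (the values are made up); with dplyr, `count(Category, sort = TRUE)` followed by `mutate(freq = n / sum(n))` should give the same kind of table.

```r
# Toy data: one row per offer (made-up values)
offers <- data.frame(Category = c("Drugs", "Drugs", "Drugs", "Services", "Weapons"))

# Count offers per category, then divide by the total to get relative frequencies
freq_table <- as.data.frame(table(Category = offers$Category), responseName = "n")
freq_table$freq <- freq_table$n / sum(freq_table$n)

# Most frequent category first
freq_table <- freq_table[order(-freq_table$n), ]
freq_table
```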

Distribution of offers by types (1)

General distribution of Types (1) among overall offers

Frequencies of Types (1) among overall offers
Type1 n freq
Cannabis 30280 0.2847551
Ecstasy 13867 0.1304062
Stimulants 12196 0.1146920
Psychedelics 8084 0.0760225
Opioids 6675 0.0627721
Prescription 5556 0.0522490
Benzos 5384 0.0506315
Other 2876 0.0270461
Steroids 2761 0.0259646
eBooks 2169 0.0203974
RCs 2092 0.0196733
Money 1866 0.0175480
Dissociatives 1659 0.0156013
Watches 1309 0.0123099
Accounts 1233 0.0115952
InformationGuides 927 0.0087176
InformationeBooks 918 0.0086329
Electronics 658 0.0061879
Physical documents 616 0.0057929
Pirated 529 0.0049748
Hacking 453 0.0042600
Jewelry 418 0.0039309
Smoked 393 0.0036958
Clothing 364 0.0034231
Software 356 0.0033478
Lethal firearms 344 0.0032350
Scans 327 0.0030751
Accessories 250 0.0023510
Weight loss 249 0.0023416
Pipes 195 0.0018338
Containers 186 0.0017492
Stashes 149 0.0014012
Ammunition 138 0.0012978
Advertising 131 0.0012319
Grinders 106 0.0009968
Melee 103 0.0009686
Injecting equipment 96 0.0009028
Chemicals 90 0.0008464
Travel 90 0.0008464
Paper 61 0.0005736
Non-lethal firearms 57 0.0005360
Scales 47 0.0004420
Barbiturates 30 0.0002821
Paraphernalia 27 0.0002539
Fireworks 14 0.0001317
Forgeries 8 0.0000752

Proposals for drugs

Frequencies of Types (1) within the Drugs category
Category Type1 n freq
Drugs Cannabis 30280 0.3375810
Drugs Ecstasy 13867 0.1545983
Drugs Stimulants 12196 0.1359689
Drugs Psychedelics 8084 0.0901256
Drugs Opioids 6675 0.0744172
Drugs Prescription 5556 0.0619419
Drugs Benzos 5384 0.0600243
Drugs Steroids 2761 0.0307814
Drugs RCs 2092 0.0233230
Drugs Dissociatives 1659 0.0184956
Drugs Other 864 0.0096324
Drugs Weight loss 249 0.0027760
Drugs Barbiturates 30 0.0003345

Proposals for other products

Frequencies of Types (2) in each Type (1) (except Drugs)
Type1 Type2 n freq
Accessories NA 250 1.0000000
Accounts NA 1233 1.0000000
Advertising NA 131 1.0000000
Ammunition NA 138 1.0000000
Chemicals Chemicals 90 1.0000000
Clothing NA 364 1.0000000
Containers NA 186 1.0000000
eBooks Other 691 0.3185800
eBooks Making money 313 0.1443061
eBooks Drugs 289 0.1332411
eBooks Anonymity 204 0.0940526
eBooks Science 163 0.0751498
eBooks RelationshipsSex 145 0.0668511
eBooks IT 144 0.0663900
eBooks Economy 76 0.0350392
eBooks Doomsday 43 0.0198248
eBooks Psychology 40 0.0184417
eBooks Politics 26 0.0119871
eBooks Philosophy 25 0.0115260
eBooks AliensUFOs 10 0.0046104
Electronics Electronics 599 0.9103343
Electronics NA 59 0.0896657
Fireworks NA 14 1.0000000
Forgeries Forgeries 8 1.0000000
Grinders NA 106 1.0000000
Hacking NA 453 1.0000000
InformationeBooks Information 918 1.0000000
InformationGuides Information 927 1.0000000
Injecting equipment Syringes 45 0.4687500
Injecting equipment Other 30 0.3125000
Injecting equipment Needles 15 0.1562500
Injecting equipment Filters 6 0.0625000
Jewelry Jewelry 418 1.0000000
Lethal firearms NA 344 1.0000000
Melee NA 103 1.0000000
Money NA 1866 1.0000000
Non-lethal firearms NA 57 1.0000000
Other Other 1425 0.7082505
Other NA 587 0.2917495
Paper NA 61 1.0000000
Paraphernalia NA 27 1.0000000
Physical documents NA 616 1.0000000
Pipes NA 195 1.0000000
Pirated NA 529 1.0000000
Scales NA 47 1.0000000
Scans Photos 327 1.0000000
Smoked NA 393 1.0000000
Software NA 356 1.0000000
Stashes NA 149 1.0000000
Travel NA 90 1.0000000
Watches NA 1309 1.0000000

Distribution of offers for more detailed Types (2)

Prices for Categories

Research Question 2: Distribution of declared destinations and origins

##### FIRST VISUALIZATION OF THE DESTINATIONS and Origins


destinations <- df_destinations %>% count(clean_destination , sort = TRUE) %>%  filter(n>1) 
destinations %>% kable()
clean_destination n
worldwide 23342
usa 18363
europe 6486
australia 4743
uk 3607
canada 1316
germany 1122
other 428
sweden 379
switzerland 313
scandinavia 244
france 231
New Zealand 230
norway 198
ireland 110
asia 101
netherlands 76
finland 63
oceania 61
belgium 41
denmark 34
Hong Kong SAR China 26
africa 24
Philippines 21
israel 19
thailand 19
internet 15
hungary 11
austria 10
spain 10
india 9
mexico 9
poland 8
china 7
Japan 7
Singapore 7
italy 4
luxembourg 4
brazil 3
Grenada 2
iraq 2
mississippi 2
origins <- df_origins %>% count(clean_origin , sort = TRUE) %>%  filter(n>1) 
origins %>% kable()
clean_origin n
usa 34962
uk 11346
australia 8860
germany 7685
netherlands 6701
canada 5518
europe 5082
china 4186
internet 4010
worldwide 2472
belgium 1171
india 1122
sweden 1059
france 844
switzerland 451
spain 442
Hong Kong SAR China 417
denmark 383
norway 336
italy 279
Philippines 278
czech republicrepublic 273
ireland 269
austria 252
New Zealand 180
South Africa 177
other 137
ukraine 127
poland 111
asia 110
mexico 105
argentina 85
finland 67
thailand 66
oceania 61
pakistan 55
scandinavia 55
bolivia 24
africa 22
brazil 20
israel 19
North America 15
singapore 14
fiji 11
belarus 9
Dominican Republic 9
latvia 9
slovakia 9
hungary 8
panama 8
Japan 7
peru 7
Singapore 7
lithuania 6
czech republic 5
guatemala 5
croatia 4
estonia 4
japan 4
seychelles 4
morocco 3
Thailand 3
belize 2
cambodia 2
Cayman Islands 2
luxembourg 2
northamerica 2
romania 2
serbia 2
St. Vincent & Grenadines 2
swaziland 2
##frequencies
# share of each destination: the count column n divided by the total count
# (n() would return the number of rows, giving the same value on every line)
destinations <- destinations %>% mutate(freq=n/sum(n)) %>% arrange(-freq) %>% mutate(place=clean_destination)
kable(destinations)
clean_destination n freq place
worldwide 23342 0.3782715 worldwide
usa 18363 0.2975837 usa
europe 6486 0.1051096 europe
australia 4743 0.0768632 australia
uk 3607 0.0584537 uk
canada 1316 0.0213266 canada
germany 1122 0.0181827 germany
other 428 0.0069360 other
sweden 379 0.0061419 sweden
switzerland 313 0.0050724 switzerland
scandinavia 244 0.0039542 scandinavia
france 231 0.0037435 france
New Zealand 230 0.0037273 New Zealand
norway 198 0.0032087 norway
ireland 110 0.0017826 ireland
asia 101 0.0016368 asia
netherlands 76 0.0012316 netherlands
finland 63 0.0010210 finland
oceania 61 0.0009885 oceania
belgium 41 0.0006644 belgium
denmark 34 0.0005510 denmark
Hong Kong SAR China 26 0.0004213 Hong Kong SAR China
africa 24 0.0003889 africa
Philippines 21 0.0003403 Philippines
israel 19 0.0003079 israel
thailand 19 0.0003079 thailand
internet 15 0.0002431 internet
hungary 11 0.0001783 hungary
austria 10 0.0001621 austria
spain 10 0.0001621 spain
india 9 0.0001459 india
mexico 9 0.0001459 mexico
poland 8 0.0001296 poland
china 7 0.0001134 china
Japan 7 0.0001134 Japan
Singapore 7 0.0001134 Singapore
italy 4 0.0000648 italy
luxembourg 4 0.0000648 luxembourg
brazil 3 0.0000486 brazil
Grenada 2 0.0000324 Grenada
iraq 2 0.0000324 iraq
mississippi 2 0.0000324 mississippi
counts <- df_destinations %>% group_by(clean_destination) %>% count()
sum(counts$n)
## [1] 61707
mean(counts$n)
## [1] 1469.214
max(counts$n)
## [1] 23342
min(counts$n)
## [1] 2
df_destinations %>% group_by(clean_destination) %>% summarise(n=n(),freq=n()/sum(counts$n), total = sum(counts$n)) %>% arrange(clean_destination) 
#origins <- origins %>% mutate(freq=n/sum(n)) %>% arrange(-freq) %>% mutate(place=clean_origin)
#kable(origins)

#origins
#destinations
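The origin table lists some places under several spellings (`Japan` and `japan`, `Singapore` and `singapore`, `Thailand` and `thailand`, `czech republicrepublic` and `czech republic`, `North America` and `northamerica`), which splits their counts across rows. One possible way to merge them before counting is sketched below. The helper `normalise_place` is hypothetical and is not part of the cleaning code used for this report.

```r
# Hypothetical helper: harmonise place labels before counting
normalise_place <- function(x) {
  x <- tolower(trimws(x))                                      # "Japan" -> "japan"
  x <- gsub("republicrepublic", "republic", x, fixed = TRUE)   # repair the doubled suffix
  x <- gsub("^northamerica$", "north america", x)              # unify the two spellings
  x
}

# Labels copied from the origin table above
raw <- c("Japan", "japan", "czech republicrepublic", "czech republic",
         "North America", "northamerica")

table(normalise_place(raw))
```

Applied to `clean_origin` and `clean_destination` before `count()`, such a step would make each place appear on a single row.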

Analysis

  • The NAs are far from being in similar proportions: vendors tend to state the origin more often and the destination less often. This could mean that information about the destination is more often communicated later, whereas the origin is more often stated, even though it may expose the vendor to a risk of being located (the areas given are very broad, however, so the information is not very discriminating). Hypothesis: stating the origin may serve as a selling point and as a sign of product quality.

  • Some places appear more often as an origin than as a destination, notably the USA, the UK, Germany, the Netherlands, China, Canada and Australia, and many others. The NAs should however be filtered out to check whether these figures hold.

  • Proportions as a destination are higher for the Nordic countries, New Zealand and Israel. The same remark applies as above: should the NAs be filtered out?

  • Unsurprisingly, “worldwide” and “europe” are cited more often as destinations, since such a label carries little meaning as an indication of origin.

Analysis

  • Real patterns can now be observed, with the Netherlands first on the destination/origin balance, which seems more consistent with reality and with the literature. It is followed by Germany, the UK, the USA, China, Canada, Belgium and others.

  • It is interesting that the Nordic countries, New Zealand and Switzerland remain among the places cited more often as a destination than as a country of origin.

  • In line with the first intuition, broader areas are mainly given as destinations.

  • The differences should be tested with a chi-squared test to see whether they remain significant, and/or a way should be found to normalise the figures.

  • An analysis by sector or product category should be carried out.

  • The analyses could be split between regions and countries, since they do not follow the same logic. Caution is needed with regions, because the cleaning was not optimal (many losses for South and North America, etc.).
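The chi-squared test suggested above could look like the following sketch. It uses only the counts of six countries copied from the origin and destination tables, so it illustrates the mechanics rather than giving a result for the full data. It also treats offers as independent observations, which is an assumption: a single vendor may list many offers with the same origin.

```r
# Counts copied from the origin and destination tables above
places        <- c("usa", "uk", "australia", "germany", "netherlands", "canada")
origin_n      <- c(34962, 11346, 8860, 7685, 6701, 5518)
destination_n <- c(18363,  3607, 4743, 1122,   76, 1316)

# 2 x 6 contingency table: rows = role of the place, columns = country
tab <- rbind(origin = origin_n, destination = destination_n)
colnames(tab) <- places

test <- chisq.test(tab)  # test of independence between role and country
test$p.value

# Standardised residuals indicate which cells deviate most from independence
round(test$stdres, 1)
```

A positive residual in the `origin` row means the country is cited as an origin more often than independence would predict.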

Destinations

Origins

Analysis

Analysis World Maps

We wanted to provide world maps showing which countries export or import the most items in our dataset compared to the rest of the world. To do that, we cleaned the destinations and origins in the original dataset so that they match the country names used by the maps and mapdata packages. We removed entries with continent names and “worldwide” values, which made up most of the data, in order to focus on specific countries and look for trends.

The first map shows that the USA is the country to which most items can be shipped, with more than 15’000 appearances as a destination. Australia and the UK come second and third (4743 and 3607 appearances respectively), followed by European countries, Canada and other countries such as Mexico, Brazil, China and India. Canada and Germany are the only two countries in the “blue” range, with more than 1000 appearances each. This is consistent with the idea that drugs are mostly shipped to western countries, where they are hard to produce but easy to buy anonymously on the internet.

The second map shows that the USA is also the number one provider of items. This could be due to the recent legalisation of cannabis products in some states, which may have created a market on the darkweb for selling those products in states where such legislation is not yet in place. Many products also come from Australia, Poland, the UK, China and Canada.

The last map represents the import-export balance of every country; we can see that almost all of them import more items than they export. The value behind this map was computed as (origins of a country / total origins) - (destinations of a country / total destinations).
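The computation behind the last map can be sketched as follows. The counts are copied from the tables above for four countries only, so the totals are those of this small subset and the resulting values are illustrative, not those shown on the map. The object names (`origins_n`, `destinations_n`, `balance`) are hypothetical.

```r
library(dplyr)

# Counts copied from the tables above (subset of four countries only)
origins_n <- tibble(place  = c("usa", "uk", "netherlands", "sweden"),
                    n_orig = c(34962, 11346, 6701, 1059))
destinations_n <- tibble(place  = c("usa", "uk", "netherlands", "sweden"),
                         n_dest = c(18363, 3607, 76, 379))

balance <- full_join(origins_n, destinations_n, by = "place") %>%
  # a place missing on one side counts as zero on that side
  mutate(across(c(n_orig, n_dest), ~ coalesce(.x, 0))) %>%
  # share among origins minus share among destinations
  mutate(balance = n_orig / sum(n_orig) - n_dest / sum(n_dest)) %>%
  arrange(desc(balance))

balance
```

With this definition a positive value means the place is cited relatively more often as an origin, a negative value relatively more often as a destination, and the values sum to zero across places.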

European countries are largely absent from both the Destinations and Origins maps because sellers on Agora usually did not take the time to list every country they could ship to; most of the time the entry was simply “europe” (about 5’000 appearances in origins and 6’400 in destinations). Adding these entries to what is shown on the maps would make Europe a strong contender in the drug market, even though this is not visible on the maps.

sum_dest <- df_destinations %>% group_by(clean_destination) %>% summarise(n=n()) %>% arrange(-n)  %>% kable(format='html',caption='Destinations of products sorted by decreasing count',col.names=c('Destination','Count')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sum_orig <- df_origins %>% group_by(clean_origin) %>% summarise(n=n()) %>% arrange(-n)  %>% kable(format='html',caption='Origins of products sorted by decreasing count',col.names=c('Origin','Count')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

destinations <- transmute(df_destinations,Place=clean_destination)
origins <- transmute(df_origins,Place=clean_origin)

#destinations
#origins
#places <- merge(destinations,origins)
#places %>% head() %>% kable()

Research Question 3: Distribution of activity among Vendors

Cumulated distributions of offers

[Figures: empirical cumulative distribution of offers among vendors, one panel per category: Chemicals, Counterfeits, Data, Drug paraphernalia, Drugs, Electronics, Forgeries, Info, Information/eBooks, Information/Guides, Jewelry, Other, Services, Tobacco, Weapons]

The empirical cumulative distribution functions show strong inequalities in the distribution of activity: a few vendors account for a high proportion of the offers. The trends also differ by category: activity for drugs, forgeries and info seems to be more evenly distributed than in other categories.
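As an illustration of how such a curve is obtained, the sketch below builds the empirical cumulative distribution of offers per vendor for one category. The data frame `offers` is made-up toy data; only the column names `Vendor` and `Category` come from the dataset description.

```r
# Hypothetical toy data: one row per offer, all in one category
offers <- data.frame(
  Vendor   = c("a", "a", "a", "a", "b", "b", "c", "d"),
  Category = "Drugs"
)

per_vendor <- as.vector(table(offers$Vendor))  # offers per vendor: 4, 2, 1, 1
F_n <- ecdf(per_vendor)                        # empirical CDF of vendor activity

F_n(1)  # proportion of vendors with at most one offer

# Concentration: share of all offers listed by the most active vendor
max(per_vendor) / sum(per_vendor)
```

Calling `plot(F_n)` draws the step curve; a curve that rises steeply at low counts and has a long tail to the right indicates that most vendors post few offers while a few post many.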

Variance of proposal numbers in proportion of the total offers

The following boxplots show that vendors’ activity varies more in some categories than in others: this is particularly the case for counterfeits, drugs and info.

Drugs

Other proposals

Most prolific Vendors profiles

Most prolific vendors profiles
Vendor n perc nb_cat nb_type1 nb_type2 nb_dest nb_orig mostcat mosttype1
optiman 881 0.0082850 13 22 15 12 8 Drug paraphernalia Containers
sexyhomer 860 0.0080875 4 6 2 3 2 Counterfeits Watches
mssource 823 0.0077395 2 10 15 2 2 Drugs RCs
profesorhouse 804 0.0075609 11 12 12 2 3 Services Other
RXChemist 729 0.0068556 4 7 5 4 6 Drugs Prescription
rc4me 648 0.0060938 1 8 9 3 3 Drugs RCs
fake 608 0.0057177 13 23 13 2 7 Information/Guides InformationGuides
medibuds 604 0.0056801 2 2 4 3 1 Drugs Cannabis
Gotmilk 479 0.0045045 2 7 9 6 12 Drugs Prescription
Bigdeal100 451 0.0042412 5 6 4 1 3 Jewelry Jewelry
captainkirk 447 0.0042036 5 6 13 2 1 Information/eBooks InformationeBooks
TheDigital 435 0.0040908 8 12 12 1 3 Services Other
OnePiece 430 0.0040437 5 6 5 2 2 Info eBooks
HollandDutch 416 0.0039121 2 3 4 2 1 Drugs Ecstasy
Optumis 407 0.0038275 9 15 21 2 2 Info eBooks
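A profile table of this kind can be built with a grouped summary. The sketch below is an assumption about how it could be done, not the code used for the report: the toy data are made up, and only some of the columns of the table above are reproduced.

```r
library(dplyr)

# Hypothetical toy data; column names follow the dataset and the table header
offers <- tibble(
  Vendor   = c("v1", "v1", "v1", "v2", "v2"),
  Category = c("Drugs", "Drugs", "Services", "Drugs", "Drugs"),
  Type1    = c("Cannabis", "RCs", "Other", "Ecstasy", "Ecstasy")
)

profiles <- offers %>%
  group_by(Vendor) %>%
  summarise(
    n        = n(),                                  # number of offers
    nb_cat   = n_distinct(Category),                 # number of categories covered
    nb_type1 = n_distinct(Type1),                    # number of Types (1) covered
    mostcat  = names(which.max(table(Category)))     # most frequent category
  ) %>%
  mutate(perc = n / sum(n)) %>%                      # share of all offers
  arrange(desc(n))

profiles
```

`which.max` returns the first maximum, so when two categories are tied for a vendor the alphabetically first one is reported.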

```