For our project, we used a dataset we found on Kaggle called “Dark net marketplace drug data”.
It was built from an HTML rip posted by a Reddit user called “usheep”, who was blackmailing the vendors on the website, threatening to expose them to the police if they did not meet his demands. We do not know what happened next: he posted the HTML rip and there is no further information about him or his demands.
The dataset was ripped from a dark/deep web marketplace called Agora and covers the years 2014 and 2015. It contains listings for drugs, weapons, books, services and other items, about 100,000 listings in total. The marketplace was shut down a few months after this data was released; we do not know whether that is related to the Reddit user or not.
It is organized by the following fields (a loading sketch follows this list):
- Vendor: the seller
- Category: where in the marketplace the item falls under
- Item: the title of the listing
- Description: the description of the listing
- Price: cost of the item (averaged across any duplicate listings between 2014 and 2015)
- Origin: where the item is listed to have shipped from
- Destination: where the item is listed to be shipped to (blank means no information was provided, most likely worldwide; the dataset author did not enter “worldwide” for blanks so as not to make assumptions)
- Rating: the rating of the seller (a rating of [0 deals], or anything else containing “deals”, means there is no concrete rating because the number of deals is too small for one to be displayed)
- Remarks: either blank or “Average price may be skewed outliar > .5 BTC found” (sic), which is self-explanatory
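Before any cleaning, the raw CSV can be loaded and inspected roughly as in the sketch below. The file name and the way the ID column is created are assumptions (they are not described in the Kaggle listing); the packages attached here are the ones the chunks in the rest of this report rely on.
library(tidyverse)   # dplyr, readr, tidyr, ...
library(tidytext)    # unnest_tokens(), used later on Origin/Destination
library(knitr)       # kable(), used for the tables in this report
# file name and location are assumptions; adjust to wherever the Kaggle CSV is stored
dark_market_data <- read_csv("../Agora.csv")
# ID column used later to track listings; added here as an assumption, in case the raw file does not already provide one
dark_market_data <- dark_market_data %>% mutate(ID = row_number())
glimpse(dark_market_data)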
We had both taken classes about criminality, web criminality and cybersecurity before this one, so we thought it would be interesting to make links between those classes and work on a criminality-related project.
Furthermore, we often read or hear about the dark/deep web and the various things that can be found there, and this dataset gave us the opportunity to explore the reality of a marketplace selling mostly illegal items.
Our first thoughts were:
- 109 categories is too many (a quick check is sketched below)
- we can remove categories with a frequency of 1
- we can regroup the less frequent categories
- there are many categories we can regroup
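A quick check of how many distinct categories there are, and how skewed their frequencies are, can be sketched as follows (output omitted):
n_distinct(dark_market_data$Category)              # 109 distinct categories in our data
dark_market_data %>% count(Category, sort = TRUE)  # frequency of each category, most common first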
In this part we tried to regroup the 109 categories at different levels, in order to end up with only a few of them that we can use more easily in the analysis.
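The Category column is a hierarchical path separated by “/”, which is what the regexes below split on. As a toy illustration (the example path is an assumption, not necessarily a category present in the data):
# illustration only: a category path splits into a top-level group and more specific sub-levels
strsplit("Drugs/Cannabis/Weed", "/")[[1]]
# [1] "Drugs"    "Cannabis" "Weed"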
## use a regex to split off the top-level group from the rest of the category path
dark_market_data_cat <- dark_market_data %>%
  mutate(Group = gsub('^(Drugs|Services|Data|Info|Forgeries|Electronics|Weapons|Counterfeits|Tobacco|Chemicals|Drug paraphernalia){1}(/.*)', '\\1', Category)) %>%
  mutate(Spec1 = gsub('^(Drugs|Services|Data|Info|Forgeries|Electronics|Weapons|Counterfeits|Tobacco|Chemicals|Drug paraphernalia){1}(/.*)', '\\2', Category))
dark_market_data_cat %>% count(Spec1, sort=TRUE)
#We still have 106 different categories
#use regexes to split the following levels (Spec1 into Spec2, Spec3 and Spec4)
dark_market_data_cat <- dark_market_data_cat %>%
  mutate(Spec2 = gsub('(/[^/]*)(.*)', '\\1', Spec1)) %>%
  mutate(Spec3 = gsub('(^/[^/]*/$)(/.*)', '\\2', Spec1)) %>%
  mutate(Spec4 = gsub('(/[^/]*)(.*)', '\\2', Spec3)) %>%
  mutate(Spec3 = gsub('(/[^/]*)(.*)', '\\1', Spec3))
#checking results
dark_market_data_cat %>% count(Spec1, sort = TRUE)
dark_market_data_cat %>% count(Spec2, sort = TRUE)
dark_market_data_cat %>% count(Spec3, sort = TRUE)
dark_market_data_cat %>% count(Spec4,sort=TRUE)
#strip the slashes and keep Spec3/Spec4 as the final Spec1/Spec2 levels
dark_market_data_cat <- dark_market_data_cat %>%
  mutate(Spec1 = gsub('/', '', Spec1),
         Spec2 = gsub('/', '', Spec2),
         Spec3 = gsub('/', '', Spec3),
         Spec4 = gsub('/', '', Spec4)) %>%
  mutate(Spec1 = Spec3, Spec2 = Spec4)
drop.cols <- c('Spec3','Spec4')
dark_market_data_cat <- dark_market_data_cat %>% select(-one_of(drop.cols))
dark_market_data_cat %>% transmute(Group,Spec1,Spec2)
dark_market_data <- dark_market_data_cat
### Some more category adjustments
#display the frequency of each category value:
dark_market_data %>% count(Category,sort=TRUE) %>% tail()
#remove categories with only one occurrence (ungroup afterwards so later summaries are not grouped by Category)
dark_market_data <- dark_market_data %>% group_by(Category) %>% filter(n() > 1) %>% ungroup()
#check
dark_market_data %>% count(Category,sort=TRUE) %>% tail()
We had a lot of values for origins and destinations where vendors selected a few countries and then added an exception such as “no australia”. There were about 2,848 values with such exceptions, so we tried to filter them out.
We could also see that there are more distinct destinations than origins, so we knew the cleaning workload would probably be heavier on that side.
#getting all the values with exceptions
pattern <- '(exc|exept)'  # catches "except"/"excluding" as well as the common misspelling "exept"
dark_market_data <- dark_market_data %>% mutate(excepts_dest = grepl(pattern,tolower(Destination)))
dark_market_data <- dark_market_data %>% mutate(excepts_orig = grepl(pattern,tolower(Origin)))
#find the remaining "no <country>" exceptions
dark_market_data %>% transmute(no=grepl('no .*',tolower(Destination)),tolower(Destination)) %>% filter(no==TRUE)
dark_market_data <- dark_market_data %>% mutate(no_dest = grepl('no .*',tolower(Destination)))
dark_market_data %>% transmute(no=grepl('no .*',tolower(Origin)),tolower(Origin)) %>% filter(no==TRUE)
dark_market_data <- dark_market_data %>% mutate(no_orig = grepl('no .*',tolower(Origin)))
## Then filter out every row flagged TRUE in any of the exception columns
dark_market_data_old <- dark_market_data
dark_market_data_old %>% filter(excepts_orig == TRUE | excepts_dest == TRUE | no_orig == TRUE | no_dest == TRUE) %>% transmute(Destination,Origin,ID,excepts_dest,excepts_orig)
dark_market_data <- dark_market_data %>% filter(excepts_orig == FALSE & excepts_dest == FALSE & no_orig == FALSE & no_dest == FALSE)
#SAVE what has been set aside in case we want to investigate or clean it further
dark_market_data_filtered_out <- dark_market_data_old %>% filter(excepts_orig == TRUE | excepts_dest == TRUE | no_orig == TRUE | no_dest == TRUE)
## *** SAVING THE FILTERED-OUT ROWS ***
write_csv(dark_market_data_filtered_out, path = "../dark_market_filtered_out.csv")
# number of observations filtered out:
nrow(dark_market_data_old)-nrow(dark_market_data)
### SOME FIRST INSIGHTS
dark_market_data %>%
  summarise(n_obs = length(unique(ID)),
            n_cat = length(unique(Category)),
            n_vendors = length(unique(Vendor)),
            destinations = length(unique(Destination)),
            origins = length(unique(Origin))) %>%
  kable()
It is necessary to spot duplicated words within the same field before unnesting, because unnest_tokens splits on whitespace: a value like “switzerland switzerland” would otherwise be counted twice at the end of the process, so we need to make sure such duplicates are collapsed into a single word first.
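As a toy illustration of the problem (a made-up one-row example, not part of the cleaning pipeline):
# a duplicated word within one field is counted twice after unnest_tokens
tibble(ID = 1, Origin = "switzerland switzerland") %>%
  unnest_tokens(origin, Origin, drop = FALSE) %>%
  count(origin)
# origin = "switzerland", n = 2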
dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()
### list of two-word countries and duplicates to merge, which would lose their meaning after unnesting
duplicates <- list(
  c('middle east','middleeast'), c('hong kong','hongkong'),
  c('netherlands netherlands','netherlands'), c('germany germany','germany'),
  c('uk uk','uk'), c('canada canada','canada'),
  c('eu schengen','eu'), c('united states','unitedstates'),
  c('s. america','southamerica'), c('usa usa','usa'),
  c('switzerland switzerland','switzerland'), c('world wide','worldwide'),
  c('worldwide international','worldwide'), c('international worldwide','worldwide'),
  c('new zealand','newzealand'), c('e.u. countries','eu'),
  c('everywhere worldwide','worldwide'), c('all','worldwide'),
  c('every where','worldwide'), c('worldwide any destination','worldwide'),
  c('* w o r l d w i d e *','worldwide'), c('planet earth','worldwide'),
  c('rest of the world','worldwide'), c('united kingdom','uk')
)
## convert all destinations and origins to lower
dark_market_data <- dark_market_data %>% mutate(Origin = tolower(Origin),Destination = tolower(Destination))
dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()
## apply the conversions in the duplicates list to Origin and Destination
for (i in duplicates) {
  dark_market_data$Origin <- gsub(i[1], i[2], dark_market_data$Origin)
}
for (i in duplicates) {
  dark_market_data$Destination <- gsub(i[1], i[2], dark_market_data$Destination)
}
dark_market_data %>% count(Origin,sort=TRUE) %>% kable()
dark_market_data %>% count(Destination,sort=TRUE) %>% kable()
### unnest the words contained in Destination and Origin, adding them to new data frames
df_origin <- dark_market_data %>% unnest_tokens(origin,Origin,drop=FALSE)
df_destination <- dark_market_data %>% unnest_tokens(destination,Destination,drop=FALSE)
df_origin %>% tail() %>% kable()
df_destination %>% tail() %>% kable()
### anti_join with a customized stop_words data frame
stop_words %>% kable()
custom_stop_words <- stop_words %>% filter(word != 'us' & word != 'state' & word != 'states')  # keep 'us', 'state' and 'states', since here they can refer to countries
custom_stop_words %>% kable()
stop_words_origin <- custom_stop_words %>% mutate(origin=word)
stop_words_destination <- custom_stop_words %>% mutate(destination=word)
df_origin <- df_origin %>% anti_join(stop_words_origin,by='origin')
df_destination <- df_destination %>% anti_join(stop_words_destination,by='destination')
### check and compare what has been done so far
df_origin %>% count(Origin,sort=TRUE) %>% kable()
df_origin %>% count(origin, sort = TRUE) %>% kable()
df_destination %>% count(Destination,sort=TRUE) %>% kable()
df_destination %>% count(destination, sort = TRUE) %>% kable()
We then had to normalize the data in the Origin and Destination columns in order to produce the graphs and maps in the analysis part of the report. We will not include that source code in this report because it is quite long and not that interesting: it mostly consists of removing typos and words that are not countries, and finally formatting everything the same way for later use.
The whole process can be found in the source code linked with this report.
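To give an idea of the kind of fixes involved, here is a minimal, purely illustrative sketch; the typo mapping and the list of non-country words are invented examples, and the real, much longer version lives in the linked source code.
# illustrative only: fix a few hypothetical typos, then drop words that are not countries
typo_fixes    <- c('swizerland' = 'switzerland', 'germny' = 'germany')  # invented examples
not_countries <- c('ship', 'only', 'discreet')                          # invented examples
df_origin <- df_origin %>%
  mutate(origin = recode(origin, !!!typo_fixes)) %>%
  filter(!origin %in% not_countries)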
Now that the cleaning part is done, we wanted to save our results in different data frames; you can see them just below.
This part took us a lot of time because users were not very careful when entering information on Agora, but the data is now much more practical to work with.