In this report I identify clusters of addresses on the Thorchain network based on their swapping activity.
The analysis is based on the swapping activity of 3134 addresses on the Thorchain network over a 19-day period, comprising a total of 100000 swaps.
The analysis is presented with all the cool figures and results first and a minimal methods section at the end, along with code for reproducibility. All data used was exported from the thorchain.swaps table of the velocity sql-interface.
Even before applying formal clustering methods, the outlines of the clusters are clearly evident, as can be seen in the figure below, which shows swapping frequency and avg. swap value on the axes, with total swapped volume and arbitrage size encoded as the size and color of the points. The figure clearly shows a cloud of high volume addresses (large size), and within this cloud we find most of the addresses with positive arbitrage. The spread of values for daily arbitrage is quite high, ranging from -190 $ to 390 $ with a mean value of 0.55 $.
To the left are the one-time-only senders, the cluster we will call HALK (hodlers and lost keys). This group comprises a very large span of total sent value, including the highest in the study period: 0.5 million $ worth of BTC.BTC swapped for ETH.ETH.
Going to the right, the swapping frequency increases to a point where only bots will be found in the extreme. The maximum swapping frequency is 9360 swaps per day, and the mean for the network is 13.2.
Clusters are assigned by a combination of EAR (eyeballing and reasoning) and the K-means algorithm. HALK is the first and most easily observed cluster: simply all addresses with nSwaps = 1. While clearly subjective, the reason for assigning HALK in this way is that it is a qualitatively meaningful cluster, and I would rather have the K-means algorithm optimize for the remaining points. Using K-means we obtain the clusters HV: High volume, LVHF: Low volume, high frequency, and LVLF: Low volume, low frequency. The clusters can be seen in the figure below. Adding another cluster to the K-means algorithm was not enough to pick out the arbitragers (probably due to the low number of arbitraging addresses). Due to their relevance for the network, arbitragers were selected as a cluster with the criterion avg. daily arbitrage > 50 $. A similar result could likely have been obtained using other input data transformations and distance metrics. The input variables used for K-means clustering were size of avg. swap, swapping frequency, total arbitrage, and total swapped volume. Clusters identified are:
HALK (nSwaps==1), those who have only swapped once (in 19 days)
HV, high volume
LVHF, low volume, high frequency
LVLF, low volume, low frequency
ARB (avg. daily arbitrage > 50 $), high volume, high daily arbitrage
The below figure shows all clusters identified in this way:
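The assignment rules above can be sketched on a toy table. This is a minimal illustration, assuming synthetic values and only two K-means features (the report uses four); the column name dailyArb and the helper rangeScale are made up for this example.

```r
# Toy per-address summary (synthetic values, for illustration only)
set.seed(1)
toy <- data.frame(
  n           = c(1, 1, 40, 55, 3, 4, 200, 250, 6, 5),
  sizeSwapAvg = c(500, 20, 9000, 12000, 50, 80, 30, 25, 60, 15000),
  dailyArb    = c(0, -1, 5, 120, 0, -2, 1, 0, 0, 3)
)

# Step 1: rule-based HALK cluster (exactly one swap in the period)
toy$cluster <- NA_character_
toy$cluster[toy$n == 1] <- "HALK"

# Step 2: K-means on the remaining addresses, after range scaling to [0, 1]
# and a log transform (the small offset avoids log(0))
rest <- toy$n > 1
rangeScale <- function(x) (x - min(x)) / (max(x) - min(x))
feat <- sapply(toy[rest, c("n", "sizeSwapAvg")], rangeScale)
feat <- log(feat + 1e-4)
toy$cluster[rest] <- as.character(kmeans(feat, centers = 3, nstart = 10)$cluster)

# Step 3: the arbitrage threshold overrides whatever label K-means assigned
toy$cluster[toy$dailyArb > 50] <- "ARB"

table(toy$cluster)
```

The K-means labels come out as arbitrary numbers, so mapping them to names such as HV or LVHF still requires looking at the cluster centers.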
The figure below shows the cumulative contribution to the network's total swapped volume, sorted in order of highest contributing cluster and address. We see that the arbitragers account for almost 50% of the total swapped value despite making up less than 1% of all addresses (please note the logarithmic x-scale). Adding the high volume cluster, we account for around 90% of the total volume of the network with just around 10% of the addresses. The numbers on the figure represent the fraction of swapped volume by cluster.
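The ordering behind such a cumulative curve can be sketched as follows. This is a toy example with made-up volumes, not the study data; it sorts addresses by their cluster's total volume first and by their own volume second, then accumulates.

```r
# Toy per-address volumes (synthetic values, for illustration only)
toy <- data.frame(
  cluster = c("ARB", "HV", "HV", "LVLF", "HALK", "HALK"),
  volUsd  = c(5000, 2000, 1500, 300, 700, 500)
)

# Total volume per cluster, used to order clusters from largest to smallest
volTot <- tapply(toy$volUsd, toy$cluster, sum)
toy$volTot <- as.numeric(volTot[toy$cluster])

# Sort by cluster total first, then by each address's own volume
toySort <- toy[order(-toy$volTot, -toy$volUsd), ]

# Cumulative fraction of addresses (x-axis) and of volume (y-axis)
toySort$addrFrac <- seq_len(nrow(toySort)) / nrow(toySort)
toySort$volFrac  <- cumsum(toySort$volUsd) / sum(toySort$volUsd)
toySort
```

In this toy table the single ARB address alone carries half of the volume, which is the kind of concentration the figure is meant to expose.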
The below figure shows boxplots for various attributes of the clusters.
For average swap size, we see HV and ARB dominating with cluster means of 19000 $ and 9900 $ respectively, compared to the network mean of 5700 $.
For frequency we see bots in the LVHF group dominating with a mean of 1100/day compared to the network mean of 92/day. Arbitragers are also significantly faster than the network average (mean = 144/day, max = 600/day), indicating that a large part of the volume is probably swapped by bots.
Daily volume is completely dominated by the arbitragers with a mean of 270000 $/day compared to the network mean of 2600 $/day. The HV group is also high with a mean of 9300 $/day.
In terms of the number of addresses, the HALK cluster is by far the largest with 0.63 of all addresses, followed by LVLF (casual humans) with 0.16, HV with 0.12, LVHF (bots) with 0.081, and finally arbitragers with 0.004.
On the next figure we see boxplots for the arbitrage of the clusters. ARB completely dominates with a mean daily arbitrage of 170 $/day, averaged over the whole study period. The average for the whole dataset is 0.55 $/day. The observant reader will note that I have added a new cluster "ARBneg", which is defined in the same way as the ARB cluster but with the opposite sign, so avg. daily arbitrage < -50 $/day. This is to check for confirmation bias in the assignment of the ARB cluster. The high daily arbitrage of ARB could in principle be due to the high volume of swaps in the cluster, and there could be a cluster with an equally large negative arbitrage at the other end of the spectrum. The daily arbitrage in cluster ARBneg is -90 $/day. This is much lower than the ARB cluster (170 $/day) in absolute terms, which we can also see from eyeballing the figure. Comparing the volumes of ARB and ARBneg we get 3800000 $ and 33000 $ respectively, so ARBneg swaps less than 1% of the ARB volume. This asymmetry lends significant support to our thesis that the ARB cluster is in fact a meaningful entity.
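The asymmetry can be checked with a couple of lines of arithmetic. The numbers below are the rounded values quoted in the text; the variable names are only for this sketch.

```r
# Rounded values quoted in the text
volArb      <- 3800000  # total swapped volume of ARB [$]
volArbNeg   <- 33000    # total swapped volume of ARBneg [$]
arbDaily    <- 170      # mean daily arbitrage of ARB [$/day]
arbNegDaily <- -90      # mean daily arbitrage of ARBneg [$/day]

# ARBneg volume as a fraction of ARB volume
volRatio <- volArbNeg / volArb

# Magnitude of the ARBneg daily arbitrage relative to ARB
arbRatio <- abs(arbNegDaily) / arbDaily

round(c(volRatio = volRatio, arbRatio = arbRatio), 4)
```

Both ratios are below 1, and the volume ratio is below 0.01, which is the sense in which the negative tail is much smaller than the positive one.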
One obvious thing worth investigating is the distinction between the high volume cluster and the arbitragers. In our case the line was drawn arbitrarily at 50 $ per day. Since arbitragers make up so much of the swapping activity with a very small number of addresses, lowering the threshold value would quite likely have a dramatic impact on the results. Furthermore, it would also be interesting to see how much would be left of the current high volume cluster. Lowering the threshold might send all the addresses with the highest volume / highest average swapping value over to the arbitrager cluster. If so, there would be little left of the current high volume cluster, and it could become indistinguishable from the bots that don't move a lot and the human smallholders.
Another obvious and interesting line of investigation is the types of assets most commonly swapped by the various clusters, which could possibly also be a clustering variable in itself.
An obvious improvement to the study would be the inclusion of more days. This would quite likely reduce the HALK cluster significantly, which currently makes up 63% of all addresses. Since a user in the HALK cluster could use their wallet tomorrow, this cluster is not exactly set in stone, especially for a short data series. However, since HALK only accounts for 6% of the total swapped volume, the conclusions regarding significant contributors of volume are not going to change.
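One way to probe this sensitivity is to sweep the threshold and record how many addresses, and how much volume, end up in the arbitrager cluster. The sketch below uses a made-up table and made-up thresholds; it is not a result from the study.

```r
# Toy per-address table (synthetic values, for illustration only)
toy <- data.frame(
  dailyArb = c(-90, -5, 0, 2, 12, 30, 55, 170, 390),
  volUsd   = c(3e4, 1e3, 5e2, 2e3, 4e4, 9e4, 5e5, 2e6, 1.3e6)
)

# Candidate thresholds for avg. daily arbitrage [$/day]
thresholds <- c(10, 25, 50, 100)

# For each threshold: number of ARB addresses and their share of total volume
sens <- data.frame(
  threshold = thresholds,
  nArb      = sapply(thresholds, function(th) sum(toy$dailyArb > th)),
  volShare  = sapply(thresholds, function(th)
    sum(toy$volUsd[toy$dailyArb > th]) / sum(toy$volUsd))
)
sens
```

A flat volShare across thresholds would suggest the cluster is robust to the exact cut, while a steep change would suggest the 50 $ line matters a lot.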
The methods used were quite straightforward, as documented above and in the last section with the code used for the analysis. It should be noted that the data set initially contained two qualitatively different kinds of swaps, namely native RUNE tokens to/from other tokens and other tokens to/from other tokens. Since all swapping is intermediated by the RUNE token through liquidity pools, swaps between non-native tokens (e.g. ETH-BNB) actually consist of two swaps: ETH:RUNE and RUNE:BNB. Such linked swaps share a common transaction ID that allows the two swaps to be merged, which was done for this study to allow for calculation of arbitrage etc. Arbitrage was calculated simply as the dollar value of the received asset minus the dollar value of the sent asset. I only analyzed from the perspective of the sending addresses, since these initiate the swaps and are assumed to be in control of the receiving addresses as well. Volumes of swaps were calculated as the dollar value of the asset sent. This value, as well as the value of the received asset, is listed in the thorchain.swaps query output from the velocity sql-interface.
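The merging of linked swaps and the arbitrage calculation can be sketched on three toy rows. The column names follow the thorchain.swaps export used in the code below; the values and the helper mergeLegs are made up for this illustration and only handle the one-leg and two-leg cases.

```r
# Toy swaps: tx1 is a two-leg ETH -> RUNE -> BNB swap, tx2 is a single leg
swaps <- data.frame(
  TX_ID           = c("tx1", "tx1", "tx2"),
  FROM_ASSET      = c("ETH.ETH", "THOR.RUNE", "BTC.BTC"),
  TO_ASSET        = c("THOR.RUNE", "BNB.BNB", "THOR.RUNE"),
  FROM_AMOUNT_USD = c(1000, 998, 500),
  TO_AMOUNT_USD   = c(998, 995, 499),
  stringsAsFactors = FALSE
)

# Collapse the two legs of one transaction into a single row:
# "from" side taken from the non-RUNE leg, "to" side from the RUNE leg
mergeLegs <- function(d) {
  if (nrow(d) == 1) return(d)
  first  <- d[d$FROM_ASSET != "THOR.RUNE", ]
  second <- d[d$FROM_ASSET == "THOR.RUNE", ]
  data.frame(
    TX_ID           = first$TX_ID,
    FROM_ASSET      = first$FROM_ASSET,
    TO_ASSET        = second$TO_ASSET,
    FROM_AMOUNT_USD = first$FROM_AMOUNT_USD,
    TO_AMOUNT_USD   = second$TO_AMOUNT_USD,
    stringsAsFactors = FALSE
  )
}
merged <- do.call(rbind, lapply(split(swaps, swaps$TX_ID), mergeLegs))

# Arbitrage from the sender's perspective: received value minus sent value
merged$WinUsdFrom <- merged$TO_AMOUNT_USD - merged$FROM_AMOUNT_USD
merged
```

With this sign convention a positive value means the sender ended up with more dollar value than was sent.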
library(dplyr)
library(ggplot2)
library(lubridate)
library(ggrepel)
library(ggfortify)
set.seed(123)
setwd("C:/Users/Fudzter/Etc....")
datS = read.table("swaps1e5.csv", row.names = NULL, sep = ",", header = T)
datS$TO_ASSET = gsub("-.*","",datS$TO_ASSET)
datS$FROM_ASSET = gsub("-.*","",datS$FROM_ASSET)
#FORMAT DATA
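# Convert to POSIXct: strptime() returns POSIXlt, which is awkward to store in data frames and group with dplyr
datS$BLOCK_TIMESTAMP = as.POSIXct(strptime(datS$BLOCK_TIMESTAMP,tz = "UTC", "%Y-%m-%dT%H:%M:%OSZ"))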
dtAll = as.numeric(max(datS$BLOCK_TIMESTAMP)-min(datS$BLOCK_TIMESTAMP),units = "hours")
# AGGREGATE SWAPS TABLE
# GROUP SWAPS WITH SAME TX_ID (basically on-chain vs off-chain)
datS =
by(datS, datS$TX_ID, FUN = function(datx){
out = datx[1,]
out[,] = NA
if(nrow(datx)==1){ # FOR ON-CHAIN SWAPPING
out = datx
}
if(nrow(datx)==2 & sum(datx$FROM_ASSET %in% "THOR.RUNE")>0){ # FOR CROSS CHAIN SWAPPING
iFrom = datx$FROM_ASSET != "THOR.RUNE" # Index of row where asset is sent from
out[c("FROM_ASSET", "FROM_AMOUNT_USD", "FROM_ADDRESS")] = datx[iFrom, c("FROM_ASSET", "FROM_AMOUNT_USD", "FROM_ADDRESS")]
out[c("TO_ASSET", "TO_AMOUNT_USD", "TO_ADDRESS")] = datx[!iFrom, c("TO_ASSET", "TO_AMOUNT_USD", "TO_ADDRESS")]
out[,datx[1,]==datx[2,]] = datx[1,datx[1,]==datx[2,]] # Sets out value for all variables with matching values
}
out$txCount = nrow(datx)
out
}) %>% plyr::ldply(data.frame)
dtAll = as.numeric(max(datS$BLOCK_TIMESTAMP, na.rm = T)-min(datS$BLOCK_TIMESTAMP, na.rm = T),units = "hours")
# How many of all transactions are not included (because nRow>2), remove if small amount
print(paste("Transactions not included:", sum(is.na(datS$FROM_AMOUNT_USD)), "/", length(unique(datS$TX_ID))))
datS = datS[!is.na(datS$FROM_AMOUNT_USD),]
# WHAT WAS THE PROFIT FOR THE SWAPPER (TO-FROM)
datS$WinUsdFrom = datS$TO_AMOUNT_USD-datS$FROM_AMOUNT_USD
# What is the number of addresses that are sometimes senders, sometimes receivers
sum(datS$FROM_ADDRESS %in% datS$TO_ADDRESS)
# Only looking at FROM_ADDRESSES AT FIRST SINCE THEY ARE INITIATORS, FROM INFO WILL BE SUMMARIZED
datSSum = datS %>% group_by(FROM_ADDRESS) %>% summarise(n = length(FROM_ADDRESS),
tMax = max(BLOCK_TIMESTAMP),
tMin = min(BLOCK_TIMESTAMP),
dt = as.numeric(tMax-tMin, units = "hours"),
assetMainFr = (data.frame(x = FROM_ASSET, wt = FROM_AMOUNT_USD) %>% dplyr::count(x, wt = wt, sort = T))[1,1], # main sent asset, weighted by USD value
assetMainFrFrac = sum( FROM_AMOUNT_USD[FROM_ASSET == assetMainFr] )/sum(FROM_AMOUNT_USD),
assetMainTo = (data.frame(x = TO_ASSET, wt = TO_AMOUNT_USD) %>% dplyr::count(x, wt = wt, sort = T))[1,1], # main received asset, weighted by USD value
assetMainToFrac = sum( TO_AMOUNT_USD[TO_ASSET == assetMainTo] )/sum(TO_AMOUNT_USD),
sizeSwapAvg = mean(FROM_AMOUNT_USD),
winUsdAvg = mean(WinUsdFrom),
winUsdPos = sum((WinUsdFrom>0))/length(WinUsdFrom),
txCount = mean(txCount))
#Set dt for all single time transacters
datSSum[datSSum$n==1,"dt"] = dtAll
datSSum$freqFrom = datSSum$n/datSSum$dt
datSSum$volUsd = datSSum$n*datSSum$sizeSwapAvg
datSSum$winUsdVol = datSSum$n*datSSum$winUsdAvg
#save(datSSum, file = "datSSum1e4.RData")
topDogs = datSSum[order(-datSSum$n*datSSum$sizeSwapAvg)[1:10],]
iTopDogs = match(topDogs$FROM_ADDRESS, datSSum$FROM_ADDRESS)
#Overview Plot
ggOverview =
ggplot() + #seq(0,0.52,0.04)
geom_point(data = datSSum, mapping = aes(x=n / (dt/24),
y = sizeSwapAvg,
size = n*sizeSwapAvg,
color = ordered( cut(winUsdVol/(dtAll/24),5, dig.lab = 3)),
shape = (cut(txCount,c(1,1.1,1.9,2), include.lowest = T))),#ordered(cut(winUsdPos,seq(0,1,0.2), include.lowest = T))),
alpha = 0.7) +
#geom_abline(slope = 2, intercept = 0)+
#geom_abline(slope = 0.5, intercept = 0)+
#geom_hline(yintercept = 60)+
#geom_vline(xintercept = 60)+
scale_x_continuous(trans = "log10") +
#scale_colour_hue(h = c(240, 360))+
scale_y_continuous(trans = "log10")+
scale_size_area(max_size = 20, breaks = c(100, 1e6, 1e7))+
scale_shape(labels = c("Non-native to/from RUNE","Mixed","non-native to/from non-native") )+
#scale_size_continuous(trans = "log10")+
labs(x = "Sending frequency [/day]",
y = "Avg sent value [ $ / transaction ]",
size = "Total sent volume [$]:",
title = paste("Summarized swapping activity in a", round(dtAll/24) , "days period" ),
color = "Daily arbitrage [$]:",
shape = "Swapped assets:")+
# guides(size = guide_legend(override.aes = list(size = c(5,10,20))))+
theme(legend.position = "bottom", legend.box="vertical", legend.margin=margin())
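# NOTE: there is no "+" after theme() above, so this label layer is evaluated on its own and is not added to ggOverview
ggrepel::geom_text_repel(data = topDogs, mapping = aes(x=n / (dt/24),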
y = sizeSwapAvg,
label = paste(assetMainTo, "-", assetMainFr),
color = ordered( cut(winUsdVol/(dtAll/24),5, dig.lab = 3))[iTopDogs],#ordered(cut(winUsdPos,seq(0,1,0.2), include.lowest = T)),
size = n*sizeSwapAvg/10),
force = 1, force_pull = 0.01)
ggOverview
# Define clusters
datSSum$cluster = as.character(NA)
datSSum[datSSum$n==1,"cluster"] = "HALK"
colnames(datSSum)
datNum = subset(datSSum, freqFrom != Inf & n>1) %>% select(c(sizeSwapAvg, freqFrom, winUsdVol, volUsd, FROM_ADDRESS))
#datNum$winUsdVol = datNum$winUsdVol*datNum$winUsdVol^2
addrNum = datNum$FROM_ADDRESS
datNum = select(datNum, -FROM_ADDRESS)
summary(datNum)
scaler = caret::preProcess(datNum, method = "range")
datNumSc = predict(scaler, datNum)
summary(datNumSc)
datNumSc=(log(datNumSc+0.0001))
sv = prcomp(datNumSc)
#autoplot(sv, loadings = T, loadings.label = T)
cluster = factor(kmeans(datNumSc,3)$cluster)
datSSum[match(addrNum,datSSum$FROM_ADDRESS), "cluster"] = as.character(cluster)
datSSum[datSSum$winUsdVol/(dtAll/24)>50,]$cluster = "ARB"
ggCluster =
ggplot() + #seq(0,0.52,0.04)
geom_point(data = datSSum, mapping = aes(x=n / (dt/24),
y = sizeSwapAvg,
size = n*sizeSwapAvg,
color = cluster,
shape = (cut(txCount,c(1,1.1,1.9,2), include.lowest = T))),#ordered(cut(winUsdPos,seq(0,1,0.2), include.lowest = T))),
alpha = 0.9) +
#geom_abline(slope = 2, intercept = 0)+
#geom_abline(slope = 0.5, intercept = 0)+
#geom_hline(yintercept = 60)+
#geom_vline(xintercept = 60)+
scale_x_continuous(trans = "log10") +
scale_colour_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "HALK: Hodl and lost keys"))+
scale_y_continuous(trans = "log10")+
scale_size_area(max_size = 20, breaks = c(100, 1e6, 1e7))+
scale_shape(labels = c("Non-native to/from RUNE","Mixed", "non-native to/from non-native") )+
#scale_size_continuous(trans = "log10")+
labs(x = "Sending frequency [/day]",
y = "Avg sent value [ $ / transaction ]",
size = "Total sent volume [$]:",
title = paste("Summarized swapping activity in a", round(dtAll/24) , "days period" ),
color = "Cluster:",
shape = "Swapped assets:")+
# guides(size = guide_legend(override.aes = list(size = c(5,10,20))))+
theme(legend.position = "bottom", legend.box="vertical", legend.margin=margin())
ggCluster
# CUMSUM
volSum = datSSum %>% group_by(cluster) %>% summarise(volTot = sum(volUsd)) %>% as.data.frame()
volSum$nFrac = (datSSum %>% group_by(cluster) %>% summarise(nFrac = length(cluster)/nrow(datSSum)) %>% as.data.frame())$nFrac
volSum = dplyr::arrange(volSum, desc(volTot))
datSSum$volTot = volSum[match(datSSum$cluster, volSum$cluster),"volTot"]
datSSumSort = dplyr::arrange(datSSum, desc(volTot), desc(volUsd))
datSSumSort$cumsum = cumsum(datSSumSort$volUsd)/(dtAll/24)/sum(datSSumSort$volUsd/(dtAll/24))
ggCumsum = ggplot(datSSumSort, aes(x = (1:length(n))/length(n), y = cumsum, color = cluster)) + geom_point(size = 2)+
labs(x = "Cumulative fraction of nodes", y = "Cumulative fraction of total volume")+
scale_x_continuous(trans = "log10")+scale_colour_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "HALK: Hodl and lost keys"))+
geom_text(data = volSum, aes(label = signif(volTot/sum(volTot),2),
y = cumsum(volTot)/sum(volTot), x = nFrac),vjust = 1, size = 6) # size (not lwd) sets the text size
ggCumsum
datS$FROM_ASSET[which.max(datS$FROM_AMOUNT_USD)]
# HISTOGRAMS
datSSum$clusterFac = as.factor(datSSum$cluster)
library("scales")
reverselog_trans <- function(base = exp(1)) {
trans <- function(x) -log(x, base)
inv <- function(x) base^(-x)
trans_new(paste0("reverselog-", format(base)), trans, inv,
log_breaks(base = base),
domain = c(1e-100, Inf))
}
# CREATE logHist
datSSum$cluster2 = datSSum$cluster
datSSum[datSSum$winUsdVol/(dtAll/24)< -50,"cluster2"] = "ARBneg"
ggArbBox =
ggplot(datSSum, aes(y = winUsdVol/(dtAll/24), color = cluster2))+geom_boxplot()+
scale_color_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "ARBneg: Negative arbitragers", "HALK: Hodl and lost keys"))
ggArbBox
ggSizeBox =
ggplot(datSSum, aes(y = sizeSwapAvg, fill = cluster, group = cluster))+geom_boxplot(alpha=1)+ # bins is not a boxplot argument
scale_y_continuous(trans = "log10")+
theme(axis.title.x = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())+
labs(title = "Average swap size",y = "Size of average swap [$]")+
scale_fill_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "HALK: Hodl and lost keys"))
ggFreqBox =
ggplot(datSSum, aes(y = freqFrom*24, fill = cluster, group = cluster))+geom_boxplot(alpha=1)+ # bins is not a boxplot argument
scale_y_continuous(trans = "log10")+
theme(axis.title.x = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())+
labs(title = "Swapping frequency",y = "Average swapping frequency [/day]")+
scale_fill_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "HALK: Hodl and lost keys"))
ggVolBox =
ggplot(datSSum, aes(y = volUsd/(dtAll/24), fill = cluster, group = cluster))+geom_boxplot(alpha=1)+ # bins is not a boxplot argument
scale_y_continuous(trans = "log10")+
theme(axis.title.x = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())+
labs(title = "Daily volume",y = "Daily swapped volume [$]")+
scale_fill_discrete(labels=c("LVHF: Low vol, high freq", "HV: High volume", "LVLF: Low vol, low freq.",
"ARB: Arbitragers", "HALK: Hodl and lost keys"))
ggPie =
ggplot(datSSum, aes(x=factor(1), fill=cluster))+
geom_bar(stat = "count",width = 1)+
coord_polar("y")+
theme_void()+
labs(title = "Fraction of all addresses")
ggBox =
ggpubr::ggarrange(plotlist=list(ggSizeBox,ggFreqBox,ggVolBox, ggPie), common.legend = T,
legend = "bottom") %>%
ggpubr::annotate_figure(top=ggpubr::text_grob("Cluster comparison", size = 20))
ggBox