Introduction

One of the core, inherent features of the Internet is to facilitate collaborative interactions and distributed problem solving, so it’s no wonder that in recent decades we’ve seen the advent of applications and protocols that try to handle tasks in exactly this manner. A number of such systems, both community-driven and commercial, have become quite well known - examples include Skype (p2p messaging), BitTorrent (file sharing, content distribution), PGP (data encryption), TorChat (anonymous communication), ZeroNet (decentralized web). It’s worth to note, that all mentioned systems either came with practical implementation ready, or one became available shorty after the idea had been conceived. While this holds true for a lot of systems and ideas, there are some notable exceptions and digital money is one of them.

David Chaum pioneered the idea of digital cash back in 1982 in his article “Blind signatures for untraceable payments”, but it took more that 25 years for practical implementation of truly decentralized currency to arrive. Significant drawback of early ideas was that they still relied on central authority - a bank, for currency emission and ownership records maintenance. Next major milestone was reached when b-money (approach, proposed by Wei Dai in 1998) and BitGold (Nick Szabo, 2005) came up with the idea to interpret the solution of a cryptographic puzzle as something valuable, thus comparing it to a piece of precious metal. This way everyone could become a “digital gold miner”, effectively eliminating bank as the source of money emission. Still, a central instance of some kind to maintain ownership records was required.

In order to completely eliminate the need of central entity, the ledger that contains ownership records must also be distributed. This however, leads us to the fundamental and inherent risk of digital currencies in general and distributed digital currency in particular - double spending. Since making a digital copy is easy, someone could issue two transactions in parallel, thus transferring the same coin to both recipients. Such misbehavior is usually detected and prevented by a bank in centralized systems, but it’s far from trivial to achieve the same in distributed environment. In essence, we’re dealing with what is called the Byzantine Generals Problem, or the problem of consistent state. In 1998 Nick Szabo suggested to employ Quorum systems to handle the problem. They introduce the concept of voting: as long as the majority of any chosen subset of peers (quorum) is honest, the correct value can be obtained by election. This approach however could be circumvented if a malicious entity sets up many peers which inject faulty information and subvert the election (the so-called Sybil attack), it’s also vulnerable to to underlying technical issues of distributed systems, such as propagation delays. These remaining issues were resolved by anonymous Satoshi Nakamoto: a person or a group of persons, that presented concepts of the Bitcoin system.


Bitcoin

Nowadays E-Commerce has come to rely completely on financial institutions as a trusted third-party to process electronic payments. This system works fairly well for most cases, but still suffers from the inherent weaknesses of the trust-based model. Transactions irreversibility can’t be achieved since banks can not avoid disputes mediation, which in turn increase cost of transaction processing. This also implies loss of ability to make non-reversible transactions for non-reversible services and increased need of trust: merchants and customers must be wary of each other, thus putting more effort/resources into gathering more information than it’s actually necessary.

These issues, among others, has driven development of Bitcoin - a peer-to-peer electronic cash system. It was introduced in 2008 as a research paper, combining decades of research (including among others [Merkle 1987, Haber and Stornetta 1991, Massias et al 1999, Back 2002]) and adding clever solution to fundamental problem of consistency. The actual implementation went live in 2009 and since then enjoys stable growth of interest and acceptance.

Stemming from Wei Dai ideas, Bitcoin has cryptography deeply embedded into it’s core. It employs digital signatures as a way of establishing ownership of an electronic coin, which in turn is defined as a chain of digital signatures (meaning, it carries it’s ownership history). Each new transaction is made by signing a coin (combining private key of the current owner, public key of the next one, and a hash of the previous transaction in the chain of ownership, thus appending itself to the history) and broadcasting a message containing input/output addresses and amount. Every node in the network receives this message, collects it into a block along with other transactions and works on finding a difficult proof-of-work for that block. Once done, the block is broadcasted to other nodes: if all transactions inside are valid and not spent, it gets accepted and added to the distributed public ledger called the blockchain and the next block is created upon it’s hash. Proof-of-work effectively introduces one-cpu-one-vote limit and binding it to the additions to the chain is Bitcoin’s solution to the problem of double spending. Any discrepancies (i.e. blockchain forks) are resolved by consensus: if the majority of computing power in the network is controlled by honest nodes - the longest chain grows the fastest, thus rendering Sybil attack impossible. The nodes, also called miners are essentially maintainers of Bitcoin network: in addition to blockchain processing they also serve as a source of new coins, which they get as a reward for successful block generation (this process is also called mining).

Bitcoin network internals are hidden from the regular user by a software called wallet. These can be either locally installed applications, web applications or even paper folds with all necessary information put into QR code (you still need software to make payments though). The wallet does not contain any actual coins, instead digital keys are stored inside and since there’s no limit on how many of those can be stored, virtually unlimited number of bitcoin addresses can be used in one wallet. User must take caution not to lose the keys though, since all associated coins will be lost as well, both for him and for bitcoin economy in general.


Our Work

Since Bitcoin technology has gotten more popular in recent years, the purpose of our research is to look at the state of the network as it is now (Dec, 2015). The main goal of this project is to detect and analyze Bitcoin Network users’ community based on their activity, wallet features and connections, with taking into account known groups, Bitcoin events and general trends, in order to determine specific behavior patterns, which can be used to better understand the principles of Bitcoin Network development.

To do this we employ 2 different approaches: traditional statistical analysis and graph theory framework. We use webbtc database dump as a primary data source.


Bitcoin Network Graph

Bitcoin network is presented as a directed, weighted graph, where we treat building blocks as follows:

  1. graph vertex – separate bitcoin address. Due to the sheer size of network it is impossible to visualize it in terms of vertices, instead, clusters will be used.
  2. graph edge – fact of the coin transfer from source address to destination address. Each edge is characterized by weight attribute, which represents the number of inputs within each transaction between addresses.

Basic Characteristics

As of Dec 2015, Bitcoin network includes more than 108,000,000 addresses, which are connected by more than 725,000,000 links. Addresses and their connections form more than 7 million strongly connected components with one giant component that encompasses approximately 93% of vertices, thus indicating that majority of network is connected. Other basic characteristics can be split into 3 groups, namely:

  1. Relations – these metrics reflect the nature of the relationships between addresses.

    • reciprocity = 0.0145. Measure of the likelihood of vertices to be mutually linked. For Bitcoin graph we can see that in 1.5% cases the process of coins transfer is mutual. This situation can be related, for example, to transactions in gambling subnetworks.
  2. Communication – these metrics reflect the features of the connections of both separate graph entities and whole graph.

    • density = 6.180e-08. Cardinality of complete subgraph vertices set does not exceed 0.025% of total vertices number - any complete subgraph of the Bitcoin network doesn’t include more than 26939 vertices.
    • modularity = 0.0025. This metric reflects the level of community distribution. The higher modularity the more significant connections exist between graph vertexes. As modularity value is close to zero, we can assume that there are groups in Bitcoin society, which are characterized by connections stability, but they don’t have any advantage in numbers.
  3. Segmentation – reflects segmented graph features.

    • transitivity = 0.0012. Network closure metric. This parameter shows whether graph can be represented in view of set of local communities. In this case we can see that analyzed network is decentralized and the fraction of closed groups is extremely small.

So, according to the obtained characteristics, we can claim that Bitcoin network is decentralized, without any significant dependencies between separate participants, but in some cases it is similar to the graph, which describe the social community and these exclusions have a special interest.


Centrality analysis

We employ degree centrality and pagerank metrics to identify central nodes in Bitcoin network. The first metric indicates the number of connections that a particular vertex has while the second roughly estimates importance of the vertex by counting the number and quality of it’s connections. The following table gives us top 10 addresses by degree centrality:

The first entry in the table is the so-called vanity address (address, that contains selected combination of letters or a word) that belongs to Bitcoin-centric gambling network SatoshiDICE, it has public name SatoshiDICE 48%. We can see five more addresses with the same pattern in the table, in fact there are 180 of them in the network. Others belong to Eobot cloud mining pool, FreeBit site and private owners.

Extreme difference between indegree and outdegree of vertex is another interesting pattern that can be seen in the table. Using rudimentary cluster analysis we can see that such combination is rare in Bitcoin network.

This figure was created by binning in/out-degree distributions into an equal number of ranges and comparing them against each other. The graph clearly reveals top 2 addresses from the table in the upper right corner as well as 33 outliers. Upon closer look, addresses with extreme differences don’t reveal any specific patterns, expect for 2 addresses in ‘small indegree/large outdegree’ group which, back in 2011, were involved in multiple transfers of the same coin to one another.

Top 5 addresses by page rank are presented in the next table.

SatoshiDICE 48% is on top here as well, so it’s safe to consider this address to be most popular in the Bitcoin network at the moment. Another interesting address is 1FsVcdeHbpvUVT3gjeuVR2ZSDnpcsJMsLL - timetobit.com, resource widely accused of scam.

Usually, there is strong correlation between vertex indegree value and pagerank: either a node with high pagerank has large indegree, or it has inbound connections from several important nodes. Bitcoin network is no exception to this rule as we can see on the figure below, the former nodes tend to stretch to the right (we can clearly spot SatoshiDICE 48% as the rightmost outlier) while the latter occupy the leftmost quadrant of the figure.


Network Clustering

Clustering is widely employed to detect network communities, to partition large graphs and to allow their visualization. Among the most used algorithms are:

  • label propagation - simple approach based on vertex labeling: at the beginning each node has it’s own label but the iterative is reassigned a label that is most common among it’s neighbors. Runtime complexity is effectively linear and is equal to |V| + |E| (where |V|=number of vertices, |E|=number of edges).
  • fast greedy algorithm - bottom-up hierarchical approach that tries to optimize modularity function in greedy fashion. It has runtime complexity of |V||E|log|V|.
  • walktrap algorithm - approach based on the idea that if you perform random walks on the graph, then they are more likely to stay within the same community, because there are only few edges leading outside. Runtime complexity is equal to |E||V|^2.

Due to the sheer size of Bitcoin network and it’s structure, mentioned algorithms are not very suitable due to either extremely long execution time (fastgreedy, walktrap) or poor results (label propagation computes fairly fast, but produces highly skewed communities distribution). In addition, mentioned algorithms were designed for undirected graphs to begin with.

We use BGLL algorithm (sometimes called Louvain Method) to cluster Bitcoin network. It’s a greedy optimization method that attempts to optimize the modularity of a partition of the network, can be extended to work with directed, weighted types, and is known for it’s capability of handling large graphs. The process in performed in two steps: at first modularity is optimized locally in search of small communities, then nodes, belonging to the same community, are aggregated and new network is build from community aggregates. This process is repeated iteratively until no further modularity-increasing reassignments of communities are possible and produces hierarchical result that allows to see structure for each iteration. The algorithm suffers from problems, typical for all modularity optimization methods: most commonly occurring are the “resolution limit” (algorithms tend to overshoot, thus merging otherwise clearly identifiable smaller communities into one) and the “degeneracy” problem (since there’s exponentially large number of potential community assignments that maximize modularity, it’s usually hard to either find global maximum or to determine that it’s more important than any other local maxima of the same modularity). By design BGLL helps to mitigate the “resolution limit” to some extent by providing hierarchical structure as output, thus allowing us to analyze and pick appropriate clustering level. The “degeneracy” problem is indeed present, but comparison of results from several runs showed them to be acceptable for rough approximation.

Reference implementation from authors was used and managed to complete task in approximately 6 hours on PC with Intel Core i5-4460 CPU and 32G of memory (utilizing single thread, and peaking at 18G of RAM) attaining maximum modularity of 0.085.

There are 2788 resulting communities displayed above, with average size of 38866 elements. The size distribution is highly skewed to the right though with median value sitting at 280 and standard distribution of 558903. The smallest communities are effectively the vertices themselves while the size of the largest 3 is of 10e6 magnitude. Complement to higher modularity, clustered graph also produces higher values for other basic characteristics: density=0.0038, reciprocity=0.78 and transitivity=0.26.


Network development

In order to investigate network development, we make seven slices of graph data based on transaction times - effectively one slice for each year, beginning with 2010 (2 slices for 2015, January and December). We calculate and compare basic characteristics, described in previous sections, perform slice clustering and visualization. The results are presented below.

Let’s visualize this data.

This figure displays parameter changes over time, it also reveals correlations that are present in data. Correlation matrix, plotted below, provides clearer patterns.

Figures above lead us to the following conclusions:

  • Number of closed communities is increasing with growth of Bitcoin graph, but in most cases connections between addresses are unidirectional.
  • In 2013, when Bitcoin started to grow in popularity, structure of the graph started to change.
  • Decrease in density can be explained by almost linear dependency between address and connections growth.
  • Rise in modularity in 2011 is connected to the beginning of the community formation period.

Due to large size, it’s practically impossible to plot Bitcoin network in terms of vertices, instead we visualize graph using community aggregates as vertices and connections between communities as edges. The following animation presents network evolution over time.


Bitcoin Components

Transactions

A transaction is a transfer of bitcoins from one/multiple address(es) to another. There are few main patterns how transactions between addresses are organized:

  1. 1 -> 1 transaction - Address A has unspent transaction T1 (unspent transaction is simply an output of a transaction which isn’t yet an input of another transaction; T1 also could be treated as set of unspent transactions) on X BTC and sends X BTC to Address B via transaction T2. In this case bitcoins from T1 will flow to T2 and T2 becomes unspent transaction of Address B.

  1. 1 -> 2 transaction - Address A has unspent transaction T1 on X BTC and sends Y BTC (X > Y) to Address B via transaction T2. Change is sent back to Address A (or new address - Address A’ of the same wallet) via transaction T2 (now unspent transaction of Address A or A’ and B). It is also possible that Address A simply sends bitcoins to two different addresses.

  1. 1 -> multiple - organised in the same way as previous one, but with more than 2 output addresses. One of these addresses could simply be address for change. Such transactions are an example of dividing block reward between pools’ participants (in detailed in Miners chapter).

  1. >2 inputs - each of three previous patterns are combined here but for the situation when there are multiple addresses from which bitcoins are sent. Such method is used when there are not enough unspent transactions of one address to pay for smth. And wallet owner combines unspent transactions of multiple of his or her addresses.

Histograms below show that 81.15% transactions have 2 output addresses (it is obvious that for the most of these transactions one of the output addresses is for change) and 66.42% transactions have 1 income address. The most frequent is pattern 1 -> 2, what is explained by principle how bitcoins flow through transactions in discrete way (all bitcoins from income transactions should be spend and users are forced to send change back. These causes frequent 2-output transactions).

The most interesting for the further analysis are transactions, which characterized by more than 1 address from the payer side, because these cases could be used as an starting point for wallets deanonymization (Anonymity in Bitcoin chapter).


Miners

For the begining of 2010 there were a little bit more than 1.5 millions BTC in circulation, where till December 1, 2015 this number increased 10 times (14.9 millions BTC) and continues increasing.

Where do bitcoins come from? With paper money, a government decides when to print and distribute money. Bitcoin doesn’t have a central government.

With Bitcoin, miners use special software to solve math problems and are issued a certain number of bitcoins in exchange. This provides a smart way to issue the currency and also creates an incentive for more people to mine. [bitcoinmining.com]

There were 248453 mining addresses detected for the moment of December 1, 2015. Address is treated as mining address if it was detected at least in one coinbase transaction (coinbase (generation) transaction - one transaction per block which transfers amends to miner).

What is interesting, almost 30% of all bitcoins in circulation were mined by top 50 miners’ addresses.

When generation difficulty started increasing, it took too long to mine block for simple miner with lower-performance device. To provide a more smooth incentive to these miners, several pooled miners have been created. Pooled mining is a mining approach where multiple generating clients contribute to the generation of a block, and then split the block reward according the contributed processing power [bitcoin wiki]. Now almost all mining activities are concentrated in few mining pools: AntPool (30.3%), F2Pool (25%), BTCC Pool (15.8%), BitFury (10.7%), KnCMiner (5.1%), BW.COM (5.1%), Slush (3.2%).

Top 10 miners’ addresses with their main characteristics and mining pools to which they belong are below.

Last activity of 4 out of 10 addresses was detected earlier than September 2014, what makes it obvious that these addresses are not in use.

The biggest known mining pools also could be visualized in graph plot, using cluster graph of December of 2015 built in Network Clustering chapter (clusters which contain pools’ addresses are marked with red).

Some mining pools have stronger relations with each other than another, as they are in the same graph’s clusters. Addresses in graph are clustered based on number of connections between them, so being in the same cluster claims existence of stronger relations between addresses from this cluster than between addresses from different clusters. So BitFurry-Slush-GHashIO-50BTC.com and F2Pool-AntPool-BTCCPool have strong connections between themself. This could be caused by few reasons:

In different moments of bitcoins evolution there were different number of addresses, which wanted to do mining. Increasing of exchange rate caused sudden increasing of number of mining addresses appeared in the system in that time. It could be clearly seen in the histogram below via peaks associated with some real events [History of Bitcoin]. And almost all of these events are related to increasing of exchange rate of bitcoins.


Addresses

Dynamic Characteristics

In this chapter we have analysed dynamic of Bitcoin community addresses’ activity. In this context changes in such characteristics were analysed:

  • number of addresses with zero and non-zero balances for different moments of their last activity;
  • average number of transactions per address per month;
  • number of active addresses per month;
  • number of transactions per month.

Tracing balance of each address at the moment of their last activity gives an ability to detect bitcoins which are not in use (lost or settled on some addresses). Below is a plot, which shows histogram of addresses’ last activity. Addresses which don’t have zero balance are filled with red. 2 millions bitcoins are not in use since July 2012.

There is much higher proportion of addresses with not zero balance (whose activity were long ago) for the time of bitcoins creation, than later. It is possible that these addresses are “abandoned”, because this time bitcoins was like plaything for cryptography fans. We also could observe huge decreasing of proportion of addresses with not zero balance at the time when first Bitcoin currency exchange is born.

Median number of transactions per address per month line shows significant increment in November 2010 (Market cap exceeds $1 million USD, at that time different propositions how to divide bitcoins were proposed, what possibly could caused so high increment), May 2012 (SatoshiDICE at that time was responsible for more than half the Bitcoin blockchain transaction volume after one month of existing).

Comparison of number of active addresses with number of transactions per month is below. Time, when number of active addresses per month exceeded number of transactions is characterized by the time when Bitcoin exchange rate increased from 100$ to 1000$ per BTC in just 2 months.

The Richest Addresses

Addresses with a lot of bitcoins are very interesting for analysis. 811 thousands of bitcoins are owned by top 10 addresses with the biggest balance. This means that almost 6% of all bitcoins are in 6x10^-5 % addresses.

  • all addresses have zero mining income;
  • most of the addresses participated in small number of transactions and almost all of these transactions are income to mentioned addresses. So, these addresses are used as accumulators of bitcoins;
  • 5 out of 10 addresses started activity in November 2013 and are still active.

The richest addresses are also visualized in graph plot (clusters which contain these addresses are marked with blue).

What is interesting, the richest address is not very connected with large set of addresses (more details in Anonymity in Bitcoin chapter).


Anonymity in Bitcoin

Bitcoin is often described as an anonymous currency because it is possible to send and receive bitcoins without giving any personally identifying information. However, achieving reasonable anonymity with Bitcoin can be quite complicated and perfect anonymity may be impossible. Actually, Bitcoin is pseudonymous. Sending and receiving bitcoins is like writing under a pseudonym. If an author’s pseudonym is ever linked to their identity, everything they ever wrote under that pseudonym will now be linked to them. In Bitcoin, your pseudonym is the address to which you receive Bitcoin. Every transaction involving that address is stored forever in the blockchain. If your address is ever linked to your identity, every transaction will be linked to you. [Bitcoin simplified]

Multi-Input transaction - transaction of the pattern >2 inputs described above. A multi-input transaction occurs when you receive payments to your wallet to different addresses, but then you send a payment out of your wallet which pulls bitcoins from multiple addresses. The outgoing transaction will include multiple addresses as inputs, proving that they are in the same wallet and belong to the same entity.

Returning to the schematic visualization of >2 inputs, Address A1, Address A2, …, Address An could be linked to the same wallet. Moreover, recursively iterating through all transactions, which contain at least one of the already linked addresses, we could increase this list. In the example below, all 4 addresses could be linked to the same wallet via Transactions 1-3.

In the system there are 108.4 millions addresses on the beginning of December 2015. Using described approach, these addresses could be grouped to 44.2 millions wallets.

Top wallet based on number of addresses contains almost 10 millions addresses. Using graph we could show how top wallet’s addresses are distributed in its structure. Top 10 clusters which contain most of wallet’s addresses are marked with red numbers (in descending order of number of addresses of this wallet).

Top wallet’s addresses are almost 10% of all addresses and they could be treated as separate community, which is very active in Bitcoin network and is related to a lot of structures (because of high distribution in network among clusters).

Approach described above about detecting addresses of the same wallet is a basis of the Bitcoin deanonimization. For example, linked addresses gave an ability to detect some real bitcoin services. Using identified addresses of 100 most popular addresses we have found out wallets of these addresses and marked dominant clusters for each of these wallets in the graph.

The other example is connected with the richest address in network. Dispite the fact, that it wasn’t linked to any other address, we noticed, that there are a lot of income transactions from RipDice gambling service’s addresses to this address. As visualized below, the most of the biggest transfers of bitcoins to the richest address are from this service (red arrows in the plot). On this graph each dot represents address, arrow - transfers of bitcoins between addresses (wider arrow - more bitcoins). Yellow dot represents the richest address. Addresses with label were identified with blockchain.info, where another addresses of RipDice service (another red dots) were linked to them.

It is obvious that the richest address is owned by RipDice gambling service and is used by this service as accumulator of bitcoins.

Next example is related to reciprocity within gambling service. As was discussed in Basic Characteristics chapter, this metric is higher for subnetwork of gambling services’ addresses and their clients than for subnetwork of only gambling services’ addresses. It could be explained by a lot of mutual connections between “organizers and players”. As is shown in the graph below, there are more mutual connections (brown edges) between addresses of SatoshiDice (orange bigger dots) and players (brown dots), than within addresses of SatoshiDice (orange lines - one-way connections between addresses). Reciprocity of subnetwork of only SatoshiDice addresses - 0.08, where reciprocity of subnetwork of SatoshiDice and its clients - 0.5.

It’s not difficult to see that all these examples just a high-level demonstration of how bitcoin addresses could be linked with real organization or person, based only on minimal information about her activity in the Bitcoin Network, and, as a result, get more complete picture about bitcoins flow and state of the system.


Structures in Bitcoin Network

Addresses Clustering

To describe structure of Bitcoin community in more complete way, we performed segmentation of its main components - addresses, using next set of features:

  • Balance - balance of address for the beginning of December 2015;
  • IncomeValueNotMined - full income for the all time, excluding mining transactions;
  • MinersIncome - income of address through coinbase transactions;
  • FirstActivity - number of days ago to first activity by this address from 12-01-2015 divided by number of days bitcoins existed (2523). Value in range [0,1];
  • ActivePart - number of days between first and last activities by address divided by number of days bitcoins existed (2523). Value in range [0,1];
  • OutPerIn - Number of outcome transactions divided by number of income transactions;
  • NTransactions - number of income and outcome transactions related to address.

For clustering we have used k-means with 13 clusters. Such number of clusters was determined as the best one by Elbow Method.

Distribution of addresses among received clusters is below. There are 3 pretty small clusters in comparison with another.

Interesting fact that 94% of wallets (Anonymity in Bitcoin) have addresses from the same clusters. It indicates that features of addresses of the same user are very similar.

Received clusters have such characteristics:

Mean / Average - mean / average value of feature for addresses from cluster;
Prop. - proportion of feature value for all addresses from cluster in general sum;
Active Range(95%) - 5%-95% quantile range for active period for addresses from cluster.


Cluster1, Cluster2, Cluster3 - almost identical clusters which represent the most standard behavior. There are 45% of all addresses in these clusters. Addresses took action in 2-4 transactions and have out per in ratio 1. The main difference between clusters - time of addresses existence: [2013-09; 2014-03], [2014-04; 2014-09], [2015-03; 2015-07].

Cluster4 - the biggest cluster which contains 20% of all addresses and 17% of all bitcoins. Addresses of this clusters were created in second part of 2015 and mostly are still active.

Cluster5 - small cluster with 2% of addresses. Active period: beginning of 2013-now. Addresses are pretty active and are not drooped after one time use.

Cluster6 - the smallest cluster with only 101 addresses, but the most interesting cluster. Addresses contain 15% of all BTC and have mined 13% of all BTC (despite the fact, that there are only 5 miners in this cluster). Also, addresses of this cluster are very active, they characterized by a lot of transactions, they mostly appeared in system at the beginning of 2013 and stayed active till the time of analysis. This cluster is combined with the top miners and BTC owners.

Cluster7, Cluster8, Cluster10 - very similar clusters, which have 30% of addresses. They characterized y different time period of activity ([2014-10; 2015-03], [2013-01; 2013-08], [2012-03; 2012-12]) and not very active addresses ([2; 6] transactions per address).

Cluster9 - this cluster contains very small proportion of addresses, but the highest proportion of BTC are concentrated there - 23%. This cluster is one with the most active addresses ([2; 336] transactions per address). Also, this cluster is with the oldest and most active (based on active time range) addresses ([2010-11; 2015-11]).

Cluster11 - the second smallest cluster with only 313 addresses. This cluster is characterized by zero balance of its addresses, absence of miners, mostly appearance in the system starting from the end of 2014 and staying active very till the time of analysis. What is interesting, addresses are characterized by huge number of transactions per address with incredibly high out per in ratio (middle 50% range: [517; 1981]). Some of the addresses are example of Escrow Service (safer payment by securely holding a buyer’s coins in escrow).

Cluster12 - possible constantly active addresses with active period 2013-now. There are 1% of all addresses in this cluster. They are pretty active ([2; 236] transactions per address). Some of they are miners (9% of all), who also perform another activities.

Cluster13 - cluster with 59% of all miners who mined 53% of all BTC. Addresses were active in 2011 (mostly in second part). They characterized by small activity (1-2 transactions per address). Addresses own 13% of BTC.

We could conclude, that clusters 4, 5, 6, 9, 11, 12 are clusters which partly contain still active addresses.


Patterns detection in sequence of transactions

Another interesting fact of Bitcoin graph analysis is the existence of certain patterns in sequence of transactions within one address. Such patterns could be used to simulate Bitcoin graph development, as well as to identify cases of non-standard behavior.

In order to analyze sequence of transactions, transactions were categorized in 300 groups based on:

  • value - sum of bitcoins which was received or send in specific transaction by address;
  • income_tx - logical value which indicates whether address receive bitcoins (TRUE) or send them (FALSE) in this transaction;
  • OutcomeAddresses - number of unique outcome addresses from transaction;
  • IncomeAddresses - number of unique income addresses to transaction;
  • tx_value - full number of bitcoins which are send through transaction.

Only transactions of 2015 year (till December, 1) are taken into account. Summary of categories is below.

In order to analyze sequence of transactions, Associative Rules technique was used. Associative rules were mined only for sequence of 2 and 3 transactions with no skips of transactions. This increases interpretability and gives an ability to deal with a lot of data.

Rules were chosen based on support among addresses (>= 0.01), lift (>1) and confidence (>= 0.5). As a result, there are 9 interesting rules:

Every transaction type is encoded with 5 digits. Each digit responds for index of one of the 5 category variables
displayed and described above (Table 7).

First pattern, which is observed in the first 7 rules:

  • outcome -> income -> outcome transaction to address;
  • income -> outcome -> income transaction to address.
    So, there is sequential change in direction of transactions respectively to address.

Second interesting pattern related to all selected rules: Rules with 3 transactions have identical type of first and third transactions that indicate high probability that behavior of address will be reduplicated.

Third pattern: 6/9 rules related to transactions which operates with big sums of bitcoins.

In general, there are 2 types of rules:

  • rules which show movement of bitcoins (1-7);
  • rules which show accumulation of small portions of bitcoins on addresses (8-9).

Analysis of some rules is below.

Rule5

In the plot Address X stands for address for which rule is analysed, green indicates income transaction and orange - outcome.

Pattern: this rule indicates that users which operates with a lot of addresses and big sums of BTC don’t accumulate a lot of BTC on one address. Also, this rule could be example of movement of BTC to stay anonymized.

Rule 8-9

Both rules are almost the same. Small difference is only in tx_value.

Pattern: these rules clearly show how big sums of BTC are sent from one address to a lot of another in small portions. This could be an example how addresses systematically receive their bonuses as pool members.

Rule 7

Pattern: very similar rule to the previous ones, but here we could see that addresses do not accumulate BTC and buy smth for them or send them to another addresses.
The discovered patterns represent the most expected ways of coins transfers, so, we can use them to reconstruct the process of Bitcoin network evaluation in terms of coins distribution. This result can be considered as a basis of network development simulation.


Conclusions

  1. The way to Bitcoin wallets deanonimization is shown: in the case when we have minimal information about user activity, we can make suggestions about his active assets in Bitcoin Network.
  2. Analysis has shown that Bitcoin network development was accompanied by appearance of communities like gambling networks and miner pools. One of the most representative examples would be SatoshiDICE network.
  3. Statistical analysis allowed to get some understanding of user segments, based on their behavior. Along clusters that represent fairly standard behavior, exist some with pretty interesting specifics:

    • The smallest cluster (with only 101 addresses) holds 15% of all currently available BTC and have mined about 13% of all coins (despite the fact that there are only 5 miners in this cluster). Also, addresses that belong to this cluster are very active with lots of transactions (they mostly appeared at the beginning of 2013 and stayed active since). This cluster hold both top miners and richest BTC owners.
    • Second smallest cluster contains 313 addresses. It’s characterized by zero balance of its addresses and huge number of transactions per each address with incredibly high out per in ratio. Some of the addresses here are example of Escrow service (provides safer payments by securely holding buyer’s coins in escrow).
    • Cluster that holds 2.5% of all addresses: it contains 59% of all miners that generated 53% of all BTC. These addresses were mostly active in second part of 2011, they own 13% of BTC that don’t participate in the Bitcoin economy as of now.
  4. The most significant patterns of coins flow, which were detected by sequence analysis, can be used as a basis for abnormal cases detection. For example, those characterized by significant difference from discovered patterns, may serve as fraud markers.


Next steps

Analysis of the Bitcoin network was performed mostly without taking into account its dynamic nature, but the timings of graph development may highlight the most significant changes in network structure and give clearer representation of general tendencies of the network evolution. These results, along with specific patterns of the coins flow, can be used for prediction and simulation of Bitcoin Network Development.