Introduction

The main goal of this paper is to apply clustering methods to the first 200 cryptocurrencies ranked by market cap.

Clustering is a machine learning technique that groups similar objects or data points into distinct subsets, or clusters, based on their similarities and dissimilarities. The goal is to divide a set of data points into groups such that points within each cluster are as similar as possible to each other and as dissimilar as possible to points in other clusters. Clustering can be used for various applications such as customer segmentation, image segmentation, anomaly detection, and more. There are many clustering algorithms, such as K-means, hierarchical clustering, and DBSCAN, each with its own strengths and weaknesses depending on the specific problem domain and data characteristics.

In this paper I use K-means, PAM, CLARA and hierarchical clustering algorithms.

Data Acquisition

General description of Data Acquisition

All data used in this project was obtained from the free API of the https://www.coingecko.com/ website, using the httr package, which allows users to make HTTP requests from R. Here is the link to the API documentation: https://www.coingecko.com/en/api/documentation

Unfortunately, the free version of the API allows only 10-30 calls/minute, and I needed 200 calls. In the first step I obtained the list of the first 200 cryptocurrencies ranked by market cap. In the second step I created a function that was run several times using a VPN, each time acquiring data for a different subset of cryptocurrencies based on the list created in the first step. The data from each run was stored and then merged into a single table containing data for all 200 cryptocurrencies. In order to obtain the data I am interested in, and in the format I am interested in, several functions were written, which are described below.

I downloaded daily data for Q4 2022 (01.10.2022 - 31.12.2022) and, using the make_a_summary function shown below, computed the average values for each cryptocurrency.

Functions to retrieve data

A function that retrieves the list of ids of the first 200 cryptocurrencies ranked by market cap.

library(httr)       # HTTP requests
library(tidyverse)  # loads dplyr, purrr, ggplot2 and friends

get_top_cryptos <- function(n) {
  # query the /coins/markets endpoint for the top n coins by market cap
  url <- paste0("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&order=market_cap_desc&per_page=", 
                n, 
                "&page=1&sparkline=false")
  response <- GET(url)
  data <- content(response)
  # each list element describes one coin; its first field is the coin id
  data <- sapply(data, `[[`, 1)
  return(data)
}
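
The function is used once to create the list of ids the later steps rely on; the name top_n_id below matches the global variable referenced in build_final_df further down:

top_n_id <- get_top_cryptos(200)
head(top_n_id)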

A function that retrieves data for a specific cryptocurrency over a certain period.

id - id of the cryptocurrency
start_date - UNIX timestamp (in seconds) from which we want to download data
finish_date - UNIX timestamp (in seconds) to which we want to download data

get_data_one_currency <- function(id, start_date, finish_date) {
  # the /market_chart/range endpoint returns prices, market caps and
  # total volumes for the coin between two UNIX timestamps
  url <- paste0("https://api.coingecko.com/api/v3/coins/", 
                id, 
                "/market_chart/range?vs_currency=usd&from=", 
                start_date, 
                "&to=", 
                finish_date)
  response <- GET(url)
  data_one_currency <- content(response)
  return(data_one_currency)
}
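
As a small usage sketch (the "bitcoin" id and the variable names are illustrative), the Q4 2022 boundaries can be converted to the UNIX timestamps the endpoint expects like this:

# Q4 2022 boundaries as UNIX timestamps (seconds since 1970-01-01 UTC)
start_date  <- as.numeric(as.POSIXct("2022-10-01", tz = "UTC"))
finish_date <- as.numeric(as.POSIXct("2022-12-31 23:59:59", tz = "UTC"))

btc_raw <- get_data_one_currency("bitcoin", start_date, finish_date)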

A function that converts the data for one currency into the proper data frame format.

change_to_proper_df <- function(data_one_currency) {
  
  # the API response is a list of three elements: prices, market_caps and
  # total_volumes, each a list of (timestamp, value) pairs
  df <- data.frame()
  for (i in 1:3){
    small_df <- data_one_currency[[i]]
    small_df <- do.call(rbind, small_df)
    small_df <- as.data.frame(small_df)
    if (i == 1){
      df <- small_df
    } else {
      # the timestamp column repeats, so keep only the value column
      df <- cbind(df, small_df[, 2, drop=FALSE])
    }
  }
  colnames(df) <- c("date", "price", "market_cap", "total_volume")
  
  # timestamps come in milliseconds, hence the division by 1000
  df[] <- lapply(df, as.numeric)
  df$date <- as.POSIXct(df$date/1000, origin = "1970-01-01", tz = "UTC")
  return(df)
}

A function that makes a summary of one cryptocurrency.

make_a_summary <- function(df_to_summary, id) {
  # collapse the daily observations into one row of period averages plus
  # the percentage price change between the first and last day
  summary <- df_to_summary %>%
    summarize(mean_price = mean(price),
              mean_market_cap = mean(market_cap),
              mean_total_volume = mean(total_volume),
              price_pct_change = round((last(price) - first(price)) / first(price) * 100, 2)) %>%
    mutate(name = id, .before = "mean_price")
  return(summary)
}
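
For a single currency the helpers chain together as follows (a short illustration, reusing btc_raw from the sketch above):

btc_df <- change_to_proper_df(btc_raw)  # nested list -> daily data frame
make_a_summary(btc_df, "bitcoin")       # one summary row for bitcoin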

A function that builds the final data frame.

build_final_df <- function(i_start, i_end, start_date, finish_date) {
  
  # accumulator initialized here so each run starts from a clean slate
  summary_all <- data.frame()
  for (i in i_start:i_end){
    # pause between calls to respect the free API rate limit
    Sys.sleep(4)
    df1 <- get_data_one_currency(top_n_id[i], start_date, finish_date)
    df1 <- change_to_proper_df(df1)
    df_summary <- make_a_summary(df1, top_n_id[i])
    summary_all <- rbind(summary_all, df_summary)
  }
  return(summary_all)
}

Summary of the Data Acquisition

As I mentioned at the beginning of this section, the build_final_df function was run several times in order to obtain data for successive parts of the list generated by get_top_cryptos. The results were saved, then merged with rbind and written to a CSV file, which is used in the next section.
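
Putting everything together, the whole acquisition looked roughly like the sketch below; the batch boundaries and the output file name are illustrative, since in practice the VPN was switched manually between runs:

top_n_id <- get_top_cryptos(200)

start_date  <- as.numeric(as.POSIXct("2022-10-01", tz = "UTC"))
finish_date <- as.numeric(as.POSIXct("2022-12-31 23:59:59", tz = "UTC"))

# download the 200 coins in four batches of 50
batch_bounds <- list(c(1, 50), c(51, 100), c(101, 150), c(151, 200))
batches <- lapply(batch_bounds, function(b) {
  build_final_df(b[1], b[2], start_date, finish_date)
})

# merge the batches and save the table used in the next section
crypto_data <- do.call(rbind, batches)
write.csv(crypto_data, "./data/crypto_data.csv", row.names = FALSE)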

Dataset exploration

Preprocessing

library(DT)
df <- read.csv("./data/crypto_data.csv")

datatable(df, options = list(
  searching = FALSE,
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))
df$mean_price <- round(df$mean_price, 2)

# change mean_market_cap and mean_total_volume to be displayed in hundreds of thousands
df$mean_market_cap <- round(df$mean_market_cap/100000, 2)
df$mean_total_volume <- round(df$mean_total_volume/100000, 2)

colnames(df)[3] <- "mean_market_cap[100k]"
colnames(df)[4] <- "mean_total_volume[100k]"

datatable(df, options = list(
  searching = FALSE,
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))

There is one observation with a market cap equal to 0, which needs to be deleted from the dataset.

df <- df[df$`mean_market_cap[100k]` != 0, ]

Exploratory Data Analysis

My dataframe consists of 5 columns:

  • name - name of the cryptocurrency,
  • mean_price - mean price of the cryptocurrency,
  • mean_market_cap[100k] - mean market cap in hundreds of thousands,
  • mean_total_volume[100k] - mean total volume in hundreds of thousands,
  • price_pct_change - percentage change in price between the first and last observations.

Average values are calculated from daily observations over the period 01.10.2022 - 31.12.2022.

summary(df)
##      name             mean_price        mean_market_cap[100k]
##  Length:198         Min.   :    0.000   Min.   :    164      
##  Class :character   1st Qu.:    0.172   1st Qu.:   2027      
##  Mode  :character   Median :    1.000   Median :   3621      
##                     Mean   :  376.985   Mean   :  45002      
##                     3rd Qu.:    6.760   3rd Qu.:   9280      
##                     Max.   :18105.870   Max.   :3473562      
##  mean_total_volume[100k] price_pct_change
##  Min.   :     0.0        Min.   :-82.75  
##  1st Qu.:    56.6        1st Qu.:-42.87  
##  Median :   197.8        Median :-29.70  
##  Mean   :  5221.7        Mean   :-24.62  
##  3rd Qu.:   639.7        3rd Qu.: -8.91  
##  Max.   :372776.1        Max.   : 77.48
par(mfrow=c(2,4))

hist(df$mean_price, main="Hist of mean_price")
hist(df$`mean_market_cap[100k]`, main="Hist of mean_market_cap[100k]")
hist(df$`mean_total_volume[100k]`, main="Hist of mean_total_volume[100k]")
hist(df$price_pct_change, main="Hist of price percentage change")

boxplot(df$mean_price, main="Boxplot of mean_price", xlab="mean_price")
boxplot(df$`mean_market_cap[100k]`, main="Boxplot of mean_market_cap", xlab="mean_market_cap[100k]")
boxplot(df$`mean_total_volume[100k]`, main="Boxplot of mean_total_volume", xlab="mean_total_volume[100k]")
boxplot(df$price_pct_change, main="Boxplot of percentage change", xlab="percentage change")

Outliers Handling

In this project I used two methods of outlier removal: the Z-Score method and the IQR method. Both methods are described below. Further analysis is performed on two different data frames, depending on the outlier removal method.

Creating a dataset without the name column (the names are moved to the row names).

df1 <- df
row.names(df1) <- df1[,1]
df1 <- df1[,-1]

Z-score Method

The z-score method is a statistical technique used to standardize and normalize data by calculating how many standard deviations a given data point is from the mean of a dataset. The z-score formula is:

z = (x - μ) / σ

Where:

  • z is the z-score,
  • x is the raw data value,
  • μ is the mean of the dataset,
  • σ is the standard deviation of the dataset.

A z-score of 0 means that the data point is exactly at the mean, a z-score of 1 means that the data point is one standard deviation above the mean, and a z-score of -1 means that the data point is one standard deviation below the mean.
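
As a quick sanity check, the scale() function used in the next code block implements exactly this formula, column by column; for instance, for the mean_price column:

# manual z-scores for one column agree with scale()
z_manual <- (df1$mean_price - mean(df1$mean_price)) / sd(df1$mean_price)
all.equal(as.numeric(scale(df1$mean_price)), z_manual)  # TRUE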

In the code below I use absolute z-scores with 3 as the cutoff value, which means that the cleaned dataset keeps only observations that lie within three standard deviations of the mean in every column.

library(dplyr)
z_scores <- abs(scale(df1))
z_scores <- as.data.frame(z_scores)
colnames(z_scores) <- c("mean_price_Z", "mean_market_cap_Z", "mean_total_volume_Z", "price_pct_change_Z")

df_Z <- cbind(df1, z_scores)

rm(z_scores)

# keep only rows whose absolute z-score is at most 3 in every column
df_Z <- df_Z %>%
  filter(mean_price_Z <= 3,
         mean_market_cap_Z <= 3,
         mean_total_volume_Z <= 3,
         price_pct_change_Z <= 3)

# drop the four auxiliary z-score columns (columns 5 to 8)
df_Z <- df_Z[, -c(5:8)]

count(df_Z)
##     n
## 1 188

We can observe that this method deleted only 10 outliers (from 198 down to 188 observations).

Interquartile Range (IQR)

IQR = Q3 − Q1

Q1 − 1.5 * IQR: lower outlier fence

Q3 + 1.5 * IQR: upper outlier fence

The IQR method identifies and removes outliers based on the distribution of the data in each column. It involves calculating the interquartile range (IQR) of each column and removing any values that lie more than 1.5 IQRs below the first quartile or above the third quartile. For example, with Q1 = 2 and Q3 = 6 the IQR is 4, so values below 2 − 6 = −4 or above 6 + 6 = 12 would be flagged as outliers.

Since each column in the dataset may have a different distribution, applying the IQR method to just one column may not be sufficient to remove all the outliers. By applying the method to each column individually, I can identify and remove outliers based on the distribution of each column, resulting in a cleaner and more accurate dataset.

# per-column quartiles and fences
q1 <- apply(df1, 2, quantile, probs = 0.25)
q3 <- apply(df1, 2, quantile, probs = 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

df_IQR <- df1

# mark values outside the fences as NA, column by column...
for (i in 1:ncol(df_IQR)) {
  df_IQR[df_IQR[, i] < lower[i] | df_IQR[, i] > upper[i], i] <- NA
}

# ...then drop every row containing at least one NA
df_IQR <- na.omit(df_IQR)
count(df_IQR)
##     n
## 1 133

We can observe that this method deleted 65 outliers (from 198 down to 133 observations).

Dataset comparison

According to the histograms and boxplots below, the data frame resulting from the IQR method is closer to a normal distribution than the one resulting from the Z-Score method. The IQR method also deleted more observations (65 vs 10).

At this stage we can expect that the clustering results for this dataset will probably be better, but they will cover a smaller portion of the initial dataset.

Z-Score

IQR
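
The histograms and boxplots under the two headings above were produced with code along the following lines (a sketch shown for the IQR data frame; the Z-Score version is identical with df_Z in place of df_IQR):

par(mfrow=c(2,4))

hist(df_IQR$mean_price, main="Hist of mean_price")
hist(df_IQR$`mean_market_cap[100k]`, main="Hist of mean_market_cap[100k]")
hist(df_IQR$`mean_total_volume[100k]`, main="Hist of mean_total_volume[100k]")
hist(df_IQR$price_pct_change, main="Hist of price percentage change")

boxplot(df_IQR$mean_price, main="Boxplot of mean_price")
boxplot(df_IQR$`mean_market_cap[100k]`, main="Boxplot of mean_market_cap")
boxplot(df_IQR$`mean_total_volume[100k]`, main="Boxplot of mean_total_volume")
boxplot(df_IQR$price_pct_change, main="Boxplot of percentage change")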

Clustering

The clustering part is carried out on the two samples without outliers: the one where the Z-Score method was applied and the one where the IQR method was applied.

library(factoextra)
library(gridExtra)

Silhouette index

According to the plots below, 2 clusters are the most suitable choice for all of the algorithms on both data frames.

Z-Score df

sil_g1 <- fviz_nbclust(df_Z, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-means")
sil_g2 <- fviz_nbclust(df_Z, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
sil_g3 <- fviz_nbclust(df_Z, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA")
grid.arrange(sil_g1, sil_g2, sil_g3, ncol=2)

IQR df

sil_g1 <- fviz_nbclust(df_IQR, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-means")
sil_g2 <- fviz_nbclust(df_IQR, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
sil_g3 <- fviz_nbclust(df_IQR, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA")
grid.arrange(sil_g1, sil_g2, sil_g3, ncol=2)

Total within-cluster sum of squares

Z-Score df

wss_1 <- fviz_nbclust(df_Z, FUNcluster = kmeans, method = "wss") + ggtitle("K-means")
wss_2 <- fviz_nbclust(df_Z, FUNcluster = cluster::pam, method = "wss") + ggtitle("PAM")
wss_3 <- fviz_nbclust(df_Z, FUNcluster = cluster::clara, method = "wss") + ggtitle("CLARA")

grid.arrange(wss_1, wss_2, wss_3, ncol=2)

IQR df

wss_1 <- fviz_nbclust(df_IQR, FUNcluster = kmeans, method = "wss") + ggtitle("K-means")
wss_2 <- fviz_nbclust(df_IQR, FUNcluster = cluster::pam, method = "wss") + ggtitle("PAM")
wss_3 <- fviz_nbclust(df_IQR, FUNcluster = cluster::clara, method = "wss") + ggtitle("CLARA")

grid.arrange(wss_1, wss_2, wss_3, ncol=2)

K-means

Z-Score df

# eclust runs k-means by default (FUNcluster = "kmeans")
k1 <-  eclust(df_Z, k=2, hc_metric = "euclidean", graph=FALSE)
k2 <-  eclust(df_Z, k=3, hc_metric = "euclidean", graph=FALSE)
k3 <-  eclust(df_Z, k=4, hc_metric = "euclidean", graph=FALSE)

g1_1 <- fviz_cluster(k1, geom = "point", df_Z) + ggtitle("k = 2")
g2_2 <- fviz_cluster(k2, geom = "point", df_Z) + ggtitle("k = 3")
g2_3 <- fviz_cluster(k3, geom = "point", df_Z) + ggtitle("k = 4")

grid.arrange(g1_1, g2_2, g2_3, nrow=2)

IQR df

k1 <-  eclust(df_IQR, k=2, hc_metric = "euclidean", graph=FALSE)
k2 <-  eclust(df_IQR, k=3, hc_metric = "euclidean", graph=FALSE)
k3 <-  eclust(df_IQR, k=4, hc_metric = "euclidean", graph=FALSE)

g1_1 <- fviz_cluster(k1, geom = "point", df_IQR) + ggtitle("k = 2")
g2_2 <- fviz_cluster(k2, geom = "point", df_IQR) + ggtitle("k = 3")
g2_3 <- fviz_cluster(k3, geom = "point", df_IQR) + ggtitle("k = 4")

grid.arrange(g1_1, g2_2, g2_3, nrow=2)

PAM

Z-Score df

pam1 <- eclust(df_Z, "pam", k=2, graph=FALSE)
pam2 <- eclust(df_Z, "pam", k=3, graph=FALSE)
pam3 <- eclust(df_Z, "pam", k=4, graph=FALSE)

g2_1 <- fviz_cluster(pam1, geom = "point") + ggtitle("k = 2")
g2_2 <- fviz_cluster(pam2, geom = "point") + ggtitle("k = 3")
g2_3 <- fviz_cluster(pam3, geom = "point") + ggtitle("k = 4")

grid.arrange(g2_1, g2_2, g2_3, nrow=2)

IQR df

pam1 <- eclust(df_IQR, "pam", k=2, graph=FALSE)
pam2 <- eclust(df_IQR, "pam", k=3, graph=FALSE)
pam3 <- eclust(df_IQR, "pam", k=4, graph=FALSE)

g2_1 <- fviz_cluster(pam1, geom = "point") + ggtitle("k = 2")
g2_2 <- fviz_cluster(pam2, geom = "point") + ggtitle("k = 3")
g2_3 <- fviz_cluster(pam3, geom = "point") + ggtitle("k = 4")

grid.arrange(g2_1, g2_2, g2_3, nrow=2)

CLARA

Z-Score df

clara1 <- eclust(df_Z, "clara", k=2, graph=FALSE) 
clara2 <- eclust(df_Z, "clara", k=3, graph=FALSE)
clara3 <- eclust(df_Z, "clara", k=4, graph=FALSE)

g3_1 <- fviz_cluster(clara1, geom = "point") + ggtitle("k = 2")
g3_2 <- fviz_cluster(clara2, geom = "point") + ggtitle("k = 3")
g3_3 <- fviz_cluster(clara3, geom = "point") + ggtitle("k = 4")

grid.arrange(g3_1, g3_2, g3_3, nrow=2)

IQR df

clara1 <- eclust(df_IQR, "clara", k=2, graph=FALSE) 
clara2 <- eclust(df_IQR, "clara", k=3, graph=FALSE)
clara3 <- eclust(df_IQR, "clara", k=4, graph=FALSE)

g3_1 <- fviz_cluster(clara1, geom = "point") + ggtitle("k = 2")
g3_2 <- fviz_cluster(clara2, geom = "point") + ggtitle("k = 3")
g3_3 <- fviz_cluster(clara3, geom = "point") + ggtitle("k = 4")

grid.arrange(g3_1, g3_2, g3_3, nrow=2)

Post diagnostics

After carrying out several clusterings, based on 3 different algorithms, 2 different datasets and 3 different numbers of clusters, we can see that, in general, the clusterings performed on the IQR data frame gave more reasonable results. I would like to focus on those with 2 clusters, as the Silhouette index for them had the highest value.

However, in the post diagnostics I would like to examine the following results (all performed on the IQR df):

  • K-means 2 clusters,
  • K-means 3 clusters,
  • PAM 2 clusters,
  • CLARA 2 clusters.
k2 <-  eclust(df_IQR, k=2, hc_metric = "euclidean", graph=FALSE)
k3 <-  eclust(df_IQR, k=3, hc_metric = "euclidean", graph=FALSE)
pam2 <- eclust(df_IQR, "pam", k=2, graph=FALSE)
clara2 <- eclust(df_IQR, "clara", k=2, graph=FALSE) 

k2s <- fviz_silhouette(k2) + ggtitle("K-Means with 2 clusters")
k3s <- fviz_silhouette(k3) + ggtitle("K-Means with 3 clusters")
pam2s <- fviz_silhouette(pam2) + ggtitle("PAM with 2 clusters")
clara2s <- fviz_silhouette(clara2) + ggtitle("CLARA with 2 clusters")

Average silhouette widths per cluster

k2$silinfo$clus.avg.widths
## [1] 0.5181753 0.7667209
k3$silinfo$clus.avg.widths
## [1] 0.5031923 0.4642449 0.7022540
pam2$silinfo$clus.avg.widths
## [1] 0.4302030 0.7750019
clara2$silinfo$clus.avg.widths
## [1] 0.4302030 0.7750019
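
For a single overall number per clustering, the average silhouette width across all observations is also available in silinfo (assuming the four objects created above):

# overall average silhouette width of each clustering
sapply(list(kmeans_2 = k2, kmeans_3 = k3, pam_2 = pam2, clara_2 = clara2),
       function(cl) cl$silinfo$avg.width)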

Silhouette index among clusters

According to the values above and the graphs below, the average silhouette width of each cluster for K-means with both 2 and 3 clusters is clearly above 0 (roughly 0.46-0.77), which suggests that the clustering is done correctly.

On the other hand, the silhouette plots for both PAM and CLARA contain observations with silhouette values below 0, which means that we should not consider these as “good” clusters.

grid.arrange(k2s, k3s, pam2s, clara2s)

Hierarchical Clustering

Hierarchical clustering is a machine learning technique used to group similar objects or data points into a hierarchy of nested clusters. In this method, the data points are successively merged into larger and larger clusters, until all the data points are contained in a single cluster at the top of the hierarchy. The result is a tree-like structure called a dendrogram that shows the relationships between the clusters at different levels of the hierarchy.

For the sake of consistency, hierarchical clustering is also performed on both data frames, but the results for the IQR data frame are the ones that should be considered.

Z-Score df

# complete-linkage hierarchical clustering on Euclidean distances
d <- dist(df_Z, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k=3, border="red")

IQR df

d <- dist(df_IQR, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k=3, border="red")
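
To turn a dendrogram into concrete cluster assignments, for example the three clusters marked in red above, cutree can be used (a short sketch):

# cut the IQR dendrogram into the 3 clusters marked above
groups <- cutree(hc1, k = 3)
table(groups)                      # cluster sizes
head(names(groups)[groups == 1])   # example members of cluster 1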

Summary

To summarize, in this paper the K-means, PAM, CLARA and hierarchical clustering algorithms were used to cluster the first 200 cryptocurrencies ranked by market cap. The best results were obtained for the data frame where outliers were deleted according to the IQR method, with the K-Means algorithm using 2 and 3 clusters. Hierarchical clustering was also carried out at the end of the analysis.