The main goal of this paper is to apply clustering methods to the top 200 cryptocurrencies ranked by market capitalization.
Clustering is a machine learning technique that groups similar objects or data points into distinct subsets, called clusters. The goal is to divide a set of data points into groups such that points within each cluster are as similar as possible to each other and as dissimilar as possible to points in other clusters. Clustering is used in applications such as customer segmentation, image segmentation and anomaly detection. There are many clustering algorithms, such as K-means, hierarchical clustering and DBSCAN, each with its own strengths and weaknesses depending on the problem domain and the characteristics of the data.
In this paper I use K-means, PAM, CLARA and hierarchical clustering algorithms.
All data used in this project was obtained from the free API of the https://www.coingecko.com/ website, using the httr package, which allows users to make HTTP requests. The API documentation is available at: https://www.coingecko.com/en/api/documentation
Unfortunately, the free version of the API allows only 10-30 calls per minute, and I needed 200 calls. In the first step I obtained the list of the top 200 cryptocurrencies ranked by market cap. In the second step I created a function, which was run several times using a VPN, each time acquiring data for a different subset of cryptocurrencies, based on the list created in the first step. The data from each run was stored and then merged into a single table containing data for all 200 cryptocurrencies. In order to obtain the data I am interested in, in the format I need, several functions were written, which are described below.
I downloaded the daily data for Q4 2022 (01.10.2022 - 31.12.2022) and, using the make_a_summary function shown below, computed the average values for each cryptocurrency.
library(httr)
library(tidyverse)
library(purrr)
library(ggplot2)
get_top_cryptos <- function(n) {
  url <- paste0("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&order=market_cap_desc&per_page=",
                n,
                "&page=1&sparkline=false")
  response <- GET(url)
  data <- content(response)
  # keep only the first field (the coin id) of every market record
  data <- sapply(data, `[[`, 1)
  return(data)
}
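For example, the top_n_id vector used by the later functions can be created with a single call (a minimal usage sketch; the variable name matches the one expected by build_final_df below):
# Hypothetical usage: build the vector of coin ids consumed later by build_final_df()
top_n_id <- get_top_cryptos(200)
head(top_n_id)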
The get_data_one_currency function defined below takes three arguments:
id - id of the cryptocurrency
start_date - date from which we want to download data
finish_date - date to which we want to download data
get_data_one_currency <- function(id, start_date, finish_date) {
  url <- paste0("https://api.coingecko.com/api/v3/coins/",
                id,
                "/market_chart/range?vs_currency=usd&from=",
                start_date,
                "&to=",
                finish_date)
  response <- GET(url)
  data_one_currency <- content(response)
  return(data_one_currency)
}
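A minimal usage sketch for this function (assuming, as the endpoint suggests, that from/to are passed as UNIX timestamps in seconds; the coin id and dates are illustrative):
# Hypothetical example: Q4 2022 expressed as UNIX timestamps
start_ts <- as.numeric(as.POSIXct("2022-10-01", tz = "UTC"))
finish_ts <- as.numeric(as.POSIXct("2022-12-31 23:59:59", tz = "UTC"))
btc_raw <- get_data_one_currency("bitcoin", start_ts, finish_ts)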
change_to_proper_df <- function(data_one_currency) {
  df <- data.frame()
  # the response contains three elements (prices, market_caps, total_volumes),
  # each a list of (timestamp, value) pairs
  for (i in 1:3) {
    small_df <- data_one_currency[[i]]
    small_df <- do.call(rbind, small_df)
    small_df <- as.data.frame(small_df)
    if (i == 1) {
      df <- small_df  # keep both the timestamp and the first value column
    } else {
      df <- cbind(df, small_df[, 2, drop = FALSE])  # append the value column only
    }
  }
  colnames(df) <- c("date", "price", "market_cap", "total_volume")
  df[] <- lapply(df, as.numeric)
  # timestamps are in milliseconds since the UNIX epoch
  df$date <- as.POSIXct(df$date / 1000, origin = "1970-01-01", tz = "UTC")
  return(df)
}
make_a_summary <- function(df_to_summary, id) {
  summary <- df_to_summary %>%
    summarize(mean_price = mean(price),
              mean_market_cap = mean(market_cap),
              mean_total_volume = mean(total_volume),
              price_pct_change = round((last(price) - first(price)) / first(price) * 100, 2)) %>%
    mutate(name = id, .before = "mean_price")
  return(summary)
}
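Putting the helpers together for a single coin might look like this (a sketch continuing the hypothetical btc_raw call above):
# Hypothetical end-to-end example for one currency
btc_df <- change_to_proper_df(btc_raw)  # columns: date, price, market_cap, total_volume
make_a_summary(btc_df, "bitcoin")       # one-row summary, later stacked for all coins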
build_final_df <- function(i_start, i_end, start_date, finish_date) {
  summary_all <- data.frame()  # accumulator for the summaries of this batch
  for (i in i_start:i_end) {
    Sys.sleep(4)  # stay under the free API rate limit
    df1 <- get_data_one_currency(top_n_id[i], start_date, finish_date)
    df1 <- change_to_proper_df(df1)
    df_summary <- make_a_summary(df1, top_n_id[i])
    summary_all <- rbind(summary_all, df_summary)
  }
  return(summary_all)
}
As mentioned at the beginning of this section, the build_final_df function was run several times in order to obtain data for successive parts of the list generated by get_top_cryptos. The results were saved, merged with the rbind function, and written to a CSV file, which is used in the next section.
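Reconstructed as a sketch, the batched workflow might look as follows (the batch boundaries, timestamps and file name are my assumptions, not the exact values used):
# Hypothetical reconstruction of the batched download (VPN switched between batches)
top_n_id <- get_top_cryptos(200)
q4_from <- as.numeric(as.POSIXct("2022-10-01", tz = "UTC"))
q4_to <- as.numeric(as.POSIXct("2022-12-31 23:59:59", tz = "UTC"))
batch1 <- build_final_df(1, 50, q4_from, q4_to)
batch2 <- build_final_df(51, 100, q4_from, q4_to)  # ...and so on up to 200
crypto_data <- rbind(batch1, batch2)
write.csv(crypto_data, "./data/crypto_data.csv", row.names = FALSE)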
library(DT)
df <- read.csv("./data/crypto_data.csv")
datatable(df, options = list(
searching = FALSE,
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
df$mean_price <- round(df$mean_price, 2)
# change mean_market_cap and mean_total_volume to be displayed in hundreds of thousands
df$mean_market_cap <- round(df$mean_market_cap/100000, 2)
df$mean_total_volume <- round(df$mean_total_volume/100000, 2)
colnames(df)[3] <- "mean_market_cap[100k]"
colnames(df)[4] <- "mean_total_volume[100k]"
datatable(df, options = list(
searching = FALSE,
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
There is one observation with a market cap equal to 0, which needs to be deleted from the dataset.
# the column was renamed above, so it is referenced with backticks
df <- df[df$`mean_market_cap[100k]` != 0, ]
My dataframe consists of 5 columns: name, mean_price, mean_market_cap[100k], mean_total_volume[100k] and price_pct_change.
Average values are calculated from daily observations over the period 01.10.2022 - 31.12.2022.
summary(df)
## name mean_price mean_market_cap[100k]
## Length:198 Min. : 0.000 Min. : 164
## Class :character 1st Qu.: 0.172 1st Qu.: 2027
## Mode :character Median : 1.000 Median : 3621
## Mean : 376.985 Mean : 45002
## 3rd Qu.: 6.760 3rd Qu.: 9280
## Max. :18105.870 Max. :3473562
## mean_total_volume[100k] price_pct_change
## Min. : 0.0 Min. :-82.75
## 1st Qu.: 56.6 1st Qu.:-42.87
## Median : 197.8 Median :-29.70
## Mean : 5221.7 Mean :-24.62
## 3rd Qu.: 639.7 3rd Qu.: -8.91
## Max. :372776.1 Max. : 77.48
par(mfrow = c(2, 4))
hist(df$mean_price, main = "Hist of mean_price")
hist(df$`mean_market_cap[100k]`, main = "Hist of mean_market_cap[100k]")
hist(df$`mean_total_volume[100k]`, main = "Hist of mean_total_volume[100k]")
hist(df$price_pct_change, main = "Hist of price percentage change")
boxplot(df$mean_price, main = "Boxplot of mean_price", xlab = "mean_price")
boxplot(df$`mean_market_cap[100k]`, main = "Boxplot of mean_market_cap", xlab = "mean_market_cap[100k]")
boxplot(df$`mean_total_volume[100k]`, main = "Boxplot of mean_total_volume", xlab = "mean_total_volume[100k]")
boxplot(df$price_pct_change, main = "Boxplot of percentage change", xlab = "percentage change")
In this project I used two methods of outlier removal: the Z-score method and the IQR method. Both methods are described below. Further analysis is performed on two different dataframes, depending on the outlier removal method.
Creating a dataset without the name column (the names are moved to row names).
df1 <- df
row.names(df1) <- df1[,1]
df1 <- df1[,-1]
The z-score method is a statistical technique used to standardize data by calculating how many standard deviations a given data point lies from the mean of the dataset. The z-score formula is:
z = (x - μ) / σ
Where:
x - the value of the data point,
μ - the mean of the dataset,
σ - the standard deviation of the dataset.
A z-score of 0 means that the data point is exactly at the mean, a z-score of 1 means that it is one standard deviation above the mean, and a z-score of -1 means that it is one standard deviation below the mean.
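As a quick sanity check, base R's scale() computes exactly this standardization (a minimal illustration on made-up numbers):
# Minimal illustration (hypothetical data): scale() returns (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8, 100)
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_manual, as.numeric(scale(x)))  # TRUE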
In the code below I use absolute values and 3 as the limit value, which means that the cleaned dataset keeps only observations that lie within three standard deviations of the mean on every variable.
library(dplyr)
# compute absolute z-scores for every column
z_scores <- abs(scale(df1))
z_scores <- as.data.frame(z_scores)
colnames(z_scores) <- c("mean_price_Z", "mean_market_cap_Z", "mean_total_volume_Z", "price_pct_change_Z")
df_Z <- cbind(df1, z_scores)
rm(z_scores)
# keep only observations within three standard deviations on every variable
df_Z <- df_Z %>%
  filter(mean_price_Z <= 3,
         mean_market_cap_Z <= 3,
         mean_total_volume_Z <= 3,
         price_pct_change_Z <= 3)
# drop the four auxiliary z-score columns (columns 5 to 8)
df_Z <- df_Z[, -c(5:8)]
count(df_Z)
## n
## 1 188
We can observe that this method removed only 10 outliers.
IQR = Q3 - Q1
Q1 - 1.5 * IQR: lower outlier fence
Q3 + 1.5 * IQR: upper outlier fence
The IQR method identifies and removes outliers based on the distribution of the data in each column. It involves calculating the interquartile range (IQR) of each column and removing any values that fall more than 1.5 IQRs below the first quartile or above the third quartile.
Since each column in the dataset may have a different distribution, applying the IQR method to just one column may not be sufficient to remove all the outliers in the dataset. By applying the method to each column individually, I can identify and remove outliers based on the distribution of each column, resulting in a cleaner and more accurate dataset.
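As a self-contained illustration of the fences on made-up data:
# Hypothetical example: the value 100 falls outside the upper fence
x <- c(1, 2, 3, 4, 5, 100)
q <- quantile(x, probs = c(0.25, 0.75))
fence_low <- q[1] - 1.5 * (q[2] - q[1])
fence_high <- q[2] + 1.5 * (q[2] - q[1])
x[x < fence_low | x > fence_high]  # 100 is flagged as an outlier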
# column-wise quartiles and outlier fences
q1 <- apply(df1, 2, quantile, probs = 0.25)
q3 <- apply(df1, 2, quantile, probs = 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
df_IQR <- df1
# mark values outside the fences as NA, then drop incomplete rows
for (i in 1:ncol(df_IQR)) {
  df_IQR[df_IQR[, i] < lower[i] | df_IQR[, i] > upper[i], i] <- NA
}
df_IQR <- na.omit(df_IQR)
count(df_IQR)
## n
## 1 133
We can observe that this method removed 65 outliers (198 observations before, 133 after).
According to the histograms and boxplots below, the dataframe resulting from the IQR method is closer to a normal distribution than the one resulting from the Z-score method. It also removed more observations (65 vs 10).
At this stage, we can expect that the clustering results for this dataset will probably be better, but they will cover a smaller portion of the initial dataset.
The clustering part is carried out on the two samples without outliers: the one where the Z-score method was applied and the one where the IQR method was applied.
library(factoextra)
library(gridExtra)
According to the plots below, 2 clusters are the most suitable choice for all of the algorithms on both dataframes.
sil_g1 <- fviz_nbclust(df_Z, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-means")
sil_g2 <- fviz_nbclust(df_Z, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
sil_g3 <- fviz_nbclust(df_Z, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA")
grid.arrange(sil_g1, sil_g2, sil_g3, ncol=2)
sil_g1 <- fviz_nbclust(df_IQR, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-means")
sil_g2 <- fviz_nbclust(df_IQR, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
sil_g3 <- fviz_nbclust(df_IQR, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA")
grid.arrange(sil_g1, sil_g2, sil_g3, ncol=2)
wss_1 <- fviz_nbclust(df_Z, FUNcluster = kmeans, method = "wss") + ggtitle("K-means")
wss_2 <- fviz_nbclust(df_Z, FUNcluster = cluster::pam, method = "wss") + ggtitle("PAM")
wss_3 <- fviz_nbclust(df_Z, FUNcluster = cluster::clara, method = "wss") + ggtitle("CLARA")
grid.arrange(wss_1, wss_2, wss_3, ncol=2)
wss_1 <- fviz_nbclust(df_IQR, FUNcluster = kmeans, method = "wss") + ggtitle("K-means")
wss_2 <- fviz_nbclust(df_IQR, FUNcluster = cluster::pam, method = "wss") + ggtitle("PAM")
wss_3 <- fviz_nbclust(df_IQR, FUNcluster = cluster::clara, method = "wss") + ggtitle("CLARA")
grid.arrange(wss_1, wss_2, wss_3, ncol=2)
k1 <- eclust(df_Z, k=2, hc_metric = "euclidean", graph=FALSE)
k2 <- eclust(df_Z, k=3, hc_metric = "euclidean", graph=FALSE)
k3 <- eclust(df_Z, k=4, hc_metric = "euclidean", graph=FALSE)
g1_1 <- fviz_cluster(k1, geom = "point", df_Z) + ggtitle("k = 2")
g2_2 <- fviz_cluster(k2, geom = "point", df_Z) + ggtitle("k = 3")
g2_3 <- fviz_cluster(k3, geom = "point", df_Z) + ggtitle("k = 4")
grid.arrange(g1_1, g2_2, g2_3, nrow=2)
k1 <- eclust(df_IQR, k=2, hc_metric = "euclidean", graph=FALSE)
k2 <- eclust(df_IQR, k=3, hc_metric = "euclidean", graph=FALSE)
k3 <- eclust(df_IQR, k=4, hc_metric = "euclidean", graph=FALSE)
g1_1 <- fviz_cluster(k1, geom = "point", df_IQR) + ggtitle("k = 2")
g2_2 <- fviz_cluster(k2, geom = "point", df_IQR) + ggtitle("k = 3")
g2_3 <- fviz_cluster(k3, geom = "point", df_IQR) + ggtitle("k = 4")
grid.arrange(g1_1, g2_2, g2_3, nrow=2)
pam1 <- eclust(df_Z, "pam", k=2, graph=FALSE)
pam2 <- eclust(df_Z, "pam", k=3, graph=FALSE)
pam3 <- eclust(df_Z, "pam", k=4, graph=FALSE)
g2_1 <- fviz_cluster(pam1, geom = "point") + ggtitle("k = 2")
g2_2 <- fviz_cluster(pam2, geom = "point") + ggtitle("k = 3")
g2_3 <- fviz_cluster(pam3, geom = "point") + ggtitle("k = 4")
grid.arrange(g2_1, g2_2, g2_3, nrow=2)
pam1 <- eclust(df_IQR, "pam", k=2, graph=FALSE)
pam2 <- eclust(df_IQR, "pam", k=3, graph=FALSE)
pam3 <- eclust(df_IQR, "pam", k=4, graph=FALSE)
g2_1 <- fviz_cluster(pam1, geom = "point") + ggtitle("k = 2")
g2_2 <- fviz_cluster(pam2, geom = "point") + ggtitle("k = 3")
g2_3 <- fviz_cluster(pam3, geom = "point") + ggtitle("k = 4")
grid.arrange(g2_1, g2_2, g2_3, nrow=2)
clara1 <- eclust(df_Z, "clara", k=2, graph=FALSE)
clara2 <- eclust(df_Z, "clara", k=3, graph=FALSE)
clara3 <- eclust(df_Z, "clara", k=4, graph=FALSE)
g3_1 <- fviz_cluster(clara1, geom = "point") + ggtitle("k = 2")
g3_2 <- fviz_cluster(clara2, geom = "point") + ggtitle("k = 3")
g3_3 <- fviz_cluster(clara3, geom = "point") + ggtitle("k = 4")
grid.arrange(g3_1, g3_2, g3_3, nrow=2)
clara1 <- eclust(df_IQR, "clara", k=2, graph=FALSE)
clara2 <- eclust(df_IQR, "clara", k=3, graph=FALSE)
clara3 <- eclust(df_IQR, "clara", k=4, graph=FALSE)
g3_1 <- fviz_cluster(clara1, geom = "point") + ggtitle("k = 2")
g3_2 <- fviz_cluster(clara2, geom = "point") + ggtitle("k = 3")
g3_3 <- fviz_cluster(clara3, geom = "point") + ggtitle("k = 4")
grid.arrange(g3_1, g3_2, g3_3, nrow=2)
After carrying out several clusterings, based on 3 different algorithms, 2 different datasets and 3 different numbers of clusters, we can see that, in general, the clusterings performed on the IQR dataframe gave more reasonable results. I would like to focus on those with 2 clusters, as the Silhouette index for them had the highest value.
However, in the post-diagnostics I would like to examine the following results (all performed on the IQR dataframe):
k2 <- eclust(df_IQR, k=2, hc_metric = "euclidean", graph=FALSE)
k3 <- eclust(df_IQR, k=3, hc_metric = "euclidean", graph=FALSE)
pam2 <- eclust(df_IQR, "pam", k=2, graph=FALSE)
clara2 <- eclust(df_IQR, "clara", k=2, graph=FALSE)
k2s <- fviz_silhouette(k2) + ggtitle("K-Means with 2 clusters")
k3s <- fviz_silhouette(k3) + ggtitle("K-Means with 3 clusters")
pam2s <- fviz_silhouette(pam2) + ggtitle("PAM with 2 clusters")
clara2s <- fviz_silhouette(clara2) + ggtitle("CLARA with 2 clusters")
k2$silinfo$clus.avg.widths
## [1] 0.5181753 0.7667209
k3$silinfo$clus.avg.widths
## [1] 0.5031923 0.4642449 0.7022540
pam2$silinfo$clus.avg.widths
## [1] 0.4302030 0.7750019
clara2$silinfo$clus.avg.widths
## [1] 0.4302030 0.7750019
According to the silhouette values above and the plots below, the average silhouette widths for K-means with both 2 and 3 clusters are clearly positive (roughly 0.46-0.77), which suggests that the clustering is done correctly.
For PAM and CLARA, on the other hand, the first cluster has a noticeably lower average silhouette width (about 0.43), so these solutions appear weaker and should not be preferred over K-means here.
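For completeness, the overall average silhouette width of each solution can also be compared directly (a short supplementary check; silinfo$avg.width is the mean silhouette over all observations, output not shown):
# overall average silhouette width per solution
c(kmeans_k2 = k2$silinfo$avg.width,
  kmeans_k3 = k3$silinfo$avg.width,
  pam_k2 = pam2$silinfo$avg.width,
  clara_k2 = clara2$silinfo$avg.width)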
grid.arrange(k2s, k3s, pam2s, clara2s)
Hierarchical clustering is a machine learning technique used to group similar objects or data points into a hierarchy of nested clusters. In this method, the data points are successively merged into larger and larger clusters, until all the data points are contained in a single cluster at the top of the hierarchy. The result is a tree-like structure called a dendrogram that shows the relationships between the clusters at different levels of the hierarchy.
Hierarchical clustering is also performed on both dataframes for the sake of consistency, but the results on the IQR dataframe should be the ones considered.
d <- dist(df_Z, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k=3, border="red")
d <- dist(df_IQR, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k=3, border="red")
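To turn the dendrogram into concrete group assignments, cutree() can be used (a short supplementary sketch matching the k = 3 boxes drawn above):
# extract cluster labels at the same cut as rect.hclust(k = 3)
groups <- cutree(hc1, k = 3)
table(groups)  # cluster sizes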
Summarizing, in this paper the K-means, PAM, CLARA and hierarchical clustering algorithms were used to cluster the top 200 cryptocurrencies ranked by market capitalization. The best results were obtained on the dataframe where outliers were removed with the IQR method, using the K-means algorithm with 2 and 3 clusters. Hierarchical clustering was also carried out at the end of the analysis.