If any issues, questions or suggestions feel free to reach me out via e-mail or Linkedin. You can also visit my Github.

if(!require(pacman)) install.packages('pacman')
pacman::p_load(PolishStock, dplyr, factoextra)

We use PolishStock package to download data for WIG20 components from the website stooq.pl. Downloaded datasets consists of OLHC prices and volume. We are interested only in close prices so we use df_prices function (also from PolishStock package) to create data frame of close prices. There are 345 observations because Allegro stocks are being publicly traded since October 2020. Below we provide list of tickers which is valid as of February 2022.

tickers <- c('ACP', 'ALE', 'CCC', 'CDR', 'CPS', 'DNP', 'JSW', 'KGH', 'LPP', 'LTS'
             ,'MRC', 'OPL', 'PEO', 'PGE', 'PGN', 'PKN', 'PKO', 'PZU', 'SPL', 'TPE')

# for (ticker in tickers) {
#   stooq_download(ticker
#                  ,start_date = 20200101
#                  ,end_date = 20222502
#                  ,interval = 'd'
#                  ,destination = paste0(getwd(), '/datasets/'))
# }

close_prices <- df_prices(list.files('datasets/')
                          ,source = 'datasets/')

returns <- sapply(select(close_prices, -Date)
                  ,FUN = function(x) diff(log(x), lag = 1)) %>%
  t()

Function kmeans has 4 available algorithms to choose and default is Hartigan-Wong algorithm which is:
1. For each data point \(x_1, \dots x_n\) assign by random index \(1, \dots, k\) (where \(k\) is number of clusters).
2. For each cluster \(C_1, \dots, C_k\) calculate mean \(\mu_1, \dots, \mu_k\).
3. For each data point assign the closest \(\mu\).
4. Iterate steps 2 and 3 until either \(\mu\text{'s}\) do change their locations or other stopping criterion is met (e.g. max number of iterations).

This algorith guarantees obtaining only local minimum which depends on random selection of \(\mu\text{'s}\) in the first iteration. Therefore it’s advisable to run the algorithm multiple times (parameter nstart in kmeans function) and then choose the resulting partition by minimizing clusters distortion, i.e. \(\min \sum_{i=1}^k \sum_{x_j \in C_i} d(x_j, \mu_i)\) where \(d\) is an euclidean distance.

How much clusters we should choose? K-means clustering is unsupervised learning algorithm and usually is a starting point for the further investigations and research. Therefore we can play with number of clusters by decreasing it when some clusters are too small and increasing it when some clusters are too big or too dispersed. There are also several rules of thumb and below we use one of them called elbow method. We start with \(k=1\) and by increasing it we could observe that at the beginning clusters distortion drops down rapidly and from certain \(k\) it becomes stabilized.

k_max <- 8
elbow <- data.frame(clusters = 1:k_max
                    ,sse = rep(0, k_max))

set.seed(2137)
for (k in 1:k_max) {
  km <- kmeans(x = returns
               ,centers = k
               ,nstart = 250)
  elbow$sse[k] <- km$tot.withinss
}

plot(x = elbow$clusters
     ,y = elbow$sse
     ,type = 'b'
     ,xlab = 'Number of clusters'
     ,ylab = 'Clusters distortion')

We choose \(k=4\). Although our dataset is in over \(300\) dimensions space we can vizualize it using fviz_cluster function from factoextra package. This function applies principal component which seeks for 2-dimensional surface being the closest to our dataset (in the sense of euclidean distance).

km <- kmeans(x = returns
             ,centers = 4
             ,nstart = 250)
 
fviz_cluster(km, data = returns
             ,main = ''
             ,xlab = ''
             ,ylab = ''
             ,show.clust.cent = FALSE
             ,repel = TRUE
             ,labelsize = 10
             ,ggtheme = theme_bw())

In the above analysis we used daily log-returns as data attributes. This means we have grouped WIG20 stocks in the clusters where stock prices follow the same direction with similar strength during each trading day. Of course one also could use other attributes, e.g. daily volume changes, BS and P&L data, fundamental or technical analysis indicators. Instead of euclidean distance one could also use different measures such as correlation of log-returns. Instead of means as clusters centroids one could use medians or other quantiles. In either case there would be obtained potentially different results. Feel free to play with that!

LS0tDQp0aXRsZTogIkstbWVhbnMgY2x1c3RlcmluZyBmb3IgV0lHMjAgc3RvY2tzIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KSWYgYW55IGlzc3VlcywgcXVlc3Rpb25zIG9yIHN1Z2dlc3Rpb25zIGZlZWwgZnJlZSB0byByZWFjaCBtZSBvdXQgdmlhIGUtbWFpbCA8d2llY3p5bnNraXBhd2VsQGdtYWlsLmNvbT4gb3IgW0xpbmtlZGluXShodHRwczovL3d3dy5saW5rZWRpbi5jb20vaW4vcGF3ZWwtd2llY3p5bnNraS8pLiBZb3UgY2FuIGFsc28gdmlzaXQgbXkgW0dpdGh1Yl0oaHR0cHM6Ly9naXRodWIuY29tL3Bhd2VsLXdpZWN6eW5za2kpLg0KDQpgYGB7ciBsaWJyYXJpZXMsIHdhcm5pbmc9RkFMU0UsIG1lc3NhZ2U9RkFMU0V9DQppZighcmVxdWlyZShwYWNtYW4pKSBpbnN0YWxsLnBhY2thZ2VzKCdwYWNtYW4nKQ0KcGFjbWFuOjpwX2xvYWQoUG9saXNoU3RvY2ssIGRwbHlyLCBmYWN0b2V4dHJhKQ0KYGBgDQoNCldlIHVzZSBbUG9saXNoU3RvY2sgcGFja2FnZV0oaHR0cHM6Ly9naXRodWIuY29tL3Bhd2VsLXdpZWN6eW5za2kvUG9saXNoU3RvY2spIHRvIGRvd25sb2FkIGRhdGEgZm9yIFdJRzIwIGNvbXBvbmVudHMgZnJvbSB0aGUgd2Vic2l0ZSBbc3Rvb3EucGxdKGh0dHBzOi8vc3Rvb3EucGwvKS4gRG93bmxvYWRlZCBkYXRhc2V0cyBjb25zaXN0cyBvZiBPTEhDIHByaWNlcyBhbmQgdm9sdW1lLiBXZSBhcmUgaW50ZXJlc3RlZCBvbmx5IGluIGNsb3NlIHByaWNlcyBzbyB3ZSB1c2UgKmRmX3ByaWNlcyogZnVuY3Rpb24gKGFsc28gZnJvbSBQb2xpc2hTdG9jayBwYWNrYWdlKSB0byBjcmVhdGUgZGF0YSBmcmFtZSBvZiBjbG9zZSBwcmljZXMuIFRoZXJlIGFyZSAzNDUgb2JzZXJ2YXRpb25zIGJlY2F1c2UgQWxsZWdybyBzdG9ja3MgYXJlIGJlaW5nIHB1YmxpY2x5IHRyYWRlZCBzaW5jZSBPY3RvYmVyIDIwMjAuIEJlbG93IHdlIHByb3ZpZGUgbGlzdCBvZiB0aWNrZXJzIHdoaWNoIGlzIHZhbGlkIGFzIG9mIEZlYnJ1YXJ5IDIwMjIuIA0KYGBge3IgZ2V0X2RhdGF9DQp0aWNrZXJzIDwtIGMoJ0FDUCcsICdBTEUnLCAnQ0NDJywgJ0NEUicsICdDUFMnLCAnRE5QJywgJ0pTVycsICdLR0gnLCAnTFBQJywgJ0xUUycNCiAgICAgICAgICAgICAsJ01SQycsICdPUEwnLCAnUEVPJywgJ1BHRScsICdQR04nLCAnUEtOJywgJ1BLTycsICdQWlUnLCAnU1BMJywgJ1RQRScpDQoNCmZvciAodGlja2VyIGluIHRpY2tlcnMpIHsNCiAgc3Rvb3FfZG93bmxvYWQodGlja2VyDQogICAgICAgICAgICAgICAgICxzdGFydF9kYXRlID0gMjAyMDAxMDENCiAgICAgICAgICAgICAgICAgLGVuZF9kYXRlID0gMjAyMjI1MDINCiAgICAgICAgICAgICAgICAgLGludGVydmFsID0gJ2QnDQogICAgICAgICAgICAgICAgICxkZXN0aW5hdGlvbiA9IHBhc3RlMChnZXR3ZCgpLCAnL2RhdGFzZXRzLycpKQ0KfQ0KDQpjbG9zZV9wcmljZXMgPC0gZGZfcHJpY2VzKGxpc3QuZmlsZXMoJ2RhdGFzZXRzLycpDQogICAgICAgICAgICAgICAgICAgICAgICAgICxzb3VyY2UgPSAnZGF0YXNldHMvJykNCg0KcmV0dXJucyA8LSBzYXBwbHkoc2VsZWN0KGNsb3NlX3ByaWNlcywgLURhdGUpDQogICAgICAgICAgICAgICAgICAsRlVOID0gZnVuY3Rpb24oeCkgZGlmZihsb2coeCksIGxhZyA9IDEpKSAlPiUNCiAgdCgpDQoNCmBgYA0KDQpGdW5jdGlvbiAqa21lYW5zKiBoYXMgNCBhdmFpbGFibGUgYWxnb3JpdGhtcyB0byBjaG9vc2UgYW5kIGRlZmF1bHQgaXMgKipIYXJ0aWdhbi1Xb25nIGFsZ29yaXRobSoqIHdoaWNoIGlzOiBcDQoxLiBGb3IgZWFjaCBkYXRhIHBvaW50ICR4XzEsIFxkb3RzIHhfbiQgYXNzaWduIGJ5IHJhbmRvbSBpbmRleCAkMSwgXGRvdHMsIGskICh3aGVyZSAkayQgaXMgbnVtYmVyIG9mIGNsdXN0ZXJzKS4gXA0KMi4gRm9yIGVhY2ggY2x1c3RlciAkQ18xLCBcZG90cywgQ19rJCBjYWxjdWxhdGUgbWVhbiAkXG11XzEsIFxkb3RzLCBcbXVfayQuIFwNCjMuIEZvciBlYWNoIGRhdGEgcG9pbnQgYXNzaWduIHRoZSBjbG9zZXN0ICRcbXUkLiBcDQo0LiBJdGVyYXRlIHN0ZXBzIDIgYW5kIDMgdW50aWwgZWl0aGVyICRcbXVcdGV4dHsnc30kIGRvIGNoYW5nZSB0aGVpciBsb2NhdGlvbnMgb3Igb3RoZXIgc3RvcHBpbmcgY3JpdGVyaW9uIGlzIG1ldCAoZS5nLiBtYXggbnVtYmVyIG9mIGl0ZXJhdGlvbnMpLiBcDQoNClRoaXMgYWxnb3JpdGggZ3VhcmFudGVlcyBvYnRhaW5pbmcgb25seSBsb2NhbCBtaW5pbXVtIHdoaWNoIGRlcGVuZHMgb24gcmFuZG9tIHNlbGVjdGlvbiBvZiAkXG11XHRleHR7J3N9JCBpbiB0aGUgZmlyc3QgaXRlcmF0aW9uLiBUaGVyZWZvcmUgaXQncyBhZHZpc2FibGUgdG8gcnVuIHRoZSBhbGdvcml0aG0gbXVsdGlwbGUgdGltZXMgKHBhcmFtZXRlciAqbnN0YXJ0KiBpbiAqa21lYW5zKiBmdW5jdGlvbikgYW5kIHRoZW4gY2hvb3NlIHRoZSByZXN1bHRpbmcgcGFydGl0aW9uIGJ5IG1pbmltaXppbmcgKipjbHVzdGVycyBkaXN0b3J0aW9uKiosIGkuZS4gJFxtaW4gXHN1bV97aT0xfV5rIFxzdW1fe3hfaiBcaW4gQ19pfSBkKHhfaiwgXG11X2kpJCB3aGVyZSAkZCQgaXMgYW4gZXVjbGlkZWFuIGRpc3RhbmNlLg0KDQpIb3cgbXVjaCBjbHVzdGVycyB3ZSBzaG91bGQgY2hvb3NlPyBLLW1lYW5zIGNsdXN0ZXJpbmcgaXMgdW5zdXBlcnZpc2VkIGxlYXJuaW5nIGFsZ29yaXRobSBhbmQgdXN1YWxseSBpcyBhIHN0YXJ0aW5nIHBvaW50IGZvciB0aGUgZnVydGhlciBpbnZlc3RpZ2F0aW9ucyBhbmQgcmVzZWFyY2guIFRoZXJlZm9yZSB3ZSBjYW4gcGxheSB3aXRoIG51bWJlciBvZiBjbHVzdGVycyBieSBkZWNyZWFzaW5nIGl0IHdoZW4gc29tZSBjbHVzdGVycyBhcmUgdG9vIHNtYWxsIGFuZCBpbmNyZWFzaW5nIGl0IHdoZW4gc29tZSBjbHVzdGVycyBhcmUgdG9vIGJpZyBvciB0b28gZGlzcGVyc2VkLiBUaGVyZSBhcmUgYWxzbyBzZXZlcmFsIHJ1bGVzIG9mIHRodW1iIGFuZCBiZWxvdyB3ZSB1c2Ugb25lIG9mIHRoZW0gY2FsbGVkICoqZWxib3cgbWV0aG9kKiouIFdlIHN0YXJ0IHdpdGggJGs9MSQgYW5kIGJ5IGluY3JlYXNpbmcgaXQgd2UgY291bGQgb2JzZXJ2ZSB0aGF0IGF0IHRoZSBiZWdpbm5pbmcgY2x1c3RlcnMgZGlzdG9ydGlvbiBkcm9wcyBkb3duIHJhcGlkbHkgYW5kIGZyb20gY2VydGFpbiAkayQgaXQgYmVjb21lcyBzdGFiaWxpemVkLiANCg0KYGBge3Igb3B0aW1hbF9ufQ0Ka19tYXggPC0gOA0KZWxib3cgPC0gZGF0YS5mcmFtZShjbHVzdGVycyA9IDE6a19tYXgNCiAgICAgICAgICAgICAgICAgICAgLHNzZSA9IHJlcCgwLCBrX21heCkpDQoNCnNldC5zZWVkKDIxMzcpDQpmb3IgKGsgaW4gMTprX21heCkgew0KICBrbSA8LSBrbWVhbnMoeCA9IHJldHVybnMNCiAgICAgICAgICAgICAgICxjZW50ZXJzID0gaw0KICAgICAgICAgICAgICAgLG5zdGFydCA9IDI1MCkNCiAgZWxib3ckc3NlW2tdIDwtIGttJHRvdC53aXRoaW5zcw0KfQ0KDQpwbG90KHggPSBlbGJvdyRjbHVzdGVycw0KICAgICAseSA9IGVsYm93JHNzZQ0KICAgICAsdHlwZSA9ICdiJw0KICAgICAseGxhYiA9ICdOdW1iZXIgb2YgY2x1c3RlcnMnDQogICAgICx5bGFiID0gJ0NsdXN0ZXJzIGRpc3RvcnRpb24nKQ0KDQpgYGANCg0KV2UgY2hvb3NlICRrPTQkLiBBbHRob3VnaCBvdXIgZGF0YXNldCBpcyBpbiBvdmVyICQzMDAkIGRpbWVuc2lvbnMgc3BhY2Ugd2UgY2FuIHZpenVhbGl6ZSBpdCB1c2luZyAqZnZpel9jbHVzdGVyKiBmdW5jdGlvbiBmcm9tIGZhY3RvZXh0cmEgcGFja2FnZS4gVGhpcyBmdW5jdGlvbiBhcHBsaWVzIHByaW5jaXBhbCBjb21wb25lbnQgd2hpY2ggc2Vla3MgZm9yIDItZGltZW5zaW9uYWwgc3VyZmFjZSBiZWluZyB0aGUgY2xvc2VzdCB0byBvdXIgZGF0YXNldCAoaW4gdGhlIHNlbnNlIG9mIGV1Y2xpZGVhbiBkaXN0YW5jZSkuDQpgYGB7ciBmaW5hbF9jbHVzdGVyaW5nfQ0Ka20gPC0ga21lYW5zKHggPSByZXR1cm5zDQogICAgICAgICAgICAgLGNlbnRlcnMgPSA0DQogICAgICAgICAgICAgLG5zdGFydCA9IDI1MCkNCiANCmZ2aXpfY2x1c3RlcihrbSwgZGF0YSA9IHJldHVybnMNCiAgICAgICAgICAgICAsbWFpbiA9ICcnDQogICAgICAgICAgICAgLHhsYWIgPSAnJw0KICAgICAgICAgICAgICx5bGFiID0gJycNCiAgICAgICAgICAgICAsc2hvdy5jbHVzdC5jZW50ID0gRkFMU0UNCiAgICAgICAgICAgICAscmVwZWwgPSBUUlVFDQogICAgICAgICAgICAgLGxhYmVsc2l6ZSA9IDEwDQogICAgICAgICAgICAgLGdndGhlbWUgPSB0aGVtZV9idygpKQ0KYGBgDQoNCkluIHRoZSBhYm92ZSBhbmFseXNpcyB3ZSB1c2VkIGRhaWx5IGxvZy1yZXR1cm5zIGFzIGRhdGEgYXR0cmlidXRlcy4gVGhpcyBtZWFucyB3ZSBoYXZlIGdyb3VwZWQgV0lHMjAgc3RvY2tzIGluIHRoZSBjbHVzdGVycyB3aGVyZSBzdG9jayBwcmljZXMgZm9sbG93IHRoZSBzYW1lIGRpcmVjdGlvbiB3aXRoIHNpbWlsYXIgc3RyZW5ndGggZHVyaW5nIGVhY2ggdHJhZGluZyBkYXkuIE9mIGNvdXJzZSBvbmUgYWxzbyBjb3VsZCB1c2Ugb3RoZXIgYXR0cmlidXRlcywgZS5nLiBkYWlseSB2b2x1bWUgY2hhbmdlcywgQlMgYW5kIFAmTCBkYXRhLCBmdW5kYW1lbnRhbCBvciB0ZWNobmljYWwgYW5hbHlzaXMgaW5kaWNhdG9ycy4gSW5zdGVhZCBvZiBldWNsaWRlYW4gZGlzdGFuY2Ugb25lIGNvdWxkIGFsc28gdXNlIGRpZmZlcmVudCBtZWFzdXJlcyBzdWNoIGFzIGNvcnJlbGF0aW9uIG9mIGxvZy1yZXR1cm5zLiBJbnN0ZWFkIG9mIG1lYW5zIGFzIGNsdXN0ZXJzIGNlbnRyb2lkcyBvbmUgY291bGQgdXNlIG1lZGlhbnMgb3Igb3RoZXIgcXVhbnRpbGVzLiBJbiBlaXRoZXIgY2FzZSB0aGVyZSB3b3VsZCBiZSBvYnRhaW5lZCBwb3RlbnRpYWxseSBkaWZmZXJlbnQgcmVzdWx0cy4gRmVlbCBmcmVlIHRvIHBsYXkgd2l0aCB0aGF0IQ==