Introduction

Stock market development is considered one of the most important factors for economic development and growth in both developing and developed countries. Well-developed stock markets attract more investment by financing productive projects that lead to economic growth, mobilize domestic savings, allocate capital efficiently, reduce risk through diversification, and facilitate the exchange of goods and services (Mishkin 2001; Caporale et al. 2004).

Despite the widespread need for financial services, the range and depth of stock markets vary significantly across countries. This machine learning project aims to cluster national stock markets by their level of development. The results may help international traders and investors adjust their investment strategies and asset allocation decisions according to the cluster a stock market belongs to.

Literature Review

Drawing on the World Bank’s Financial Sector Development Indicators (FSDI) and the study by Beck, Demirguc-Kunt, and Levine (2000), we selected six commonly used indicators of stock market development:

  1. Size: the ratio of stock market capitalization in percent of GDP

Market capitalization in percent of GDP is the most commonly used measure of stock market development. It is also one indicator of financial deepening. Financial deepening is defined as increases in the ratio of a country’s financial assets to its GDP. Financial asset accumulation simultaneously provides credit to finance real asset accumulation for the development process.

  2. Activity: the ratio of stock market total value traded in percent of GDP

World Bank (2013) defines the total value of stocks traded (in percent of GDP) as the total value of shares traded during the period. It adds that this indicator complements the market capitalization ratio by showing whether market size is matched by trading. In the literature it is also used to measure market depth, in terms of liquidity or the ease of buying and selling shares. Market depth refers to a market’s ability to absorb relatively large market orders without significantly affecting the price of the security.

  3. Efficiency: the stock market turnover ratio

The turnover ratio (another measure of market depth from the literature) is defined as the total value of shares traded during the period divided by the average market capitalization for the period.

  4. Stability: price volatility and stock return

A certain degree of price volatility in the stock market is desirable, since it may reflect the effect of new information flowing into an efficient stock market. Excessive volatility, however, is likely to result in an inefficient allocation of resources and, because of the higher uncertainty, upward pressure on interest rates. Both hamper the volume and the productivity of investment and therefore reduce growth (Federer 1993; DeLong et al. 1989). Because emerging economies are highly vulnerable to external shocks, they exhibit larger equity market fluctuations than developed markets.

  5. Access: the number of listed firms

A lower concentration of top firms is preferred in a well-developed market. This means that not only large firms but also small companies can raise funds and compete fairly in the market. Therefore, the access dimension of stock market development is measured by the number of listed domestic companies.

More listed domestic companies also suggest an active local equity market.

  6. Macroeconomics: GDP per capita (inflation adjusted)

Income level is an essential variable that affects almost every other variable in an economy. It is also a significant determinant of the development level of financial markets, including stock markets. High income growth promotes development in the stock market, which in turn promotes further economic growth.
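
As a rough illustration of how the size, activity, and efficiency indicators relate to one another, the sketch below computes them for a single hypothetical market. All figures and variable names are invented for illustration and are not taken from the World Bank data; averaging the start-of-period and end-of-period capitalization is one simple assumption for "average market capitalization".

```r
# Hypothetical figures for one market, in billions of US$ (illustrative only)
gdp              <- 500   # gross domestic product
market_cap_start <- 180   # market capitalization at the start of the period
market_cap_end   <- 220   # market capitalization at the end of the period
value_traded     <- 100   # total value of shares traded during the period

# Size: market capitalization as a percentage of GDP
size_pct <- 100 * market_cap_end / gdp

# Activity: total value traded as a percentage of GDP
activity_pct <- 100 * value_traded / gdp

# Efficiency: value traded divided by average market capitalization (assumed
# here to be the mean of the start and end values), expressed in percent
avg_market_cap <- (market_cap_start + market_cap_end) / 2
turnover_pct   <- 100 * value_traded / avg_market_cap

round(c(size = size_pct, activity = activity_pct, turnover = turnover_pct), 1)
```

A market can therefore be large relative to GDP yet have a low turnover ratio if few of its shares change hands, which is why the three indicators are used together.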

Data and methods

Data

The ratio of stock market capitalization to GDP, the ratio of stock market total value traded to GDP, the stock market turnover ratio, price volatility, stock return, and the number of listed firms for 214 countries in 2020 are taken from the World Development Indicators (WDI) database. GDP per capita for 2020 (constant 2015 US$) is taken from the World Bank’s databank.

Methods

In this project, we compare the performance of two clustering algorithms from the partitioning family, K-Means and K-Medoids.

  1. K-Means Clustering

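The K-Means algorithm is a centroid-based clustering technique. It partitions the dataset into k clusters by assigning each observation to the cluster whose centroid is nearest, which minimizes the within-cluster sum of squares; the resulting clusters are not guaranteed to be of equal size. Each cluster in the K-Means algorithm is represented by a centroid point.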

The steps of the algorithm are as follows:

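  • Initialization: choose k initial centroids.
  • Assignment: assign each observation to its nearest centroid.
  • Update centroid: recompute each centroid as the mean of the observations assigned to it.
  • Repeat steps 2 and 3 until convergence, that is, until the assignments no longer change.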
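
The steps above can be sketched in a few lines of base R. This is a minimal illustration of Lloyd's algorithm on toy data, not the implementation used later in this project: `simple_kmeans` is a hypothetical helper name, and `stats::kmeans` uses the Hartigan-Wong algorithm by default.

```r
# Minimal sketch of Lloyd's algorithm for K-means (illustrative only).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1, initialization: pick k distinct observations as starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2, assignment: squared Euclidean distance to every centroid,
    # then pick the nearest centroid for each observation
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")

    # Convergence: stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Step 3, update: move each centroid to the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centroids are drawn at random, different seeds can lead to different solutions on less clearly separated data, which is why a seed is set before every clustering call in this report.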

The weaknesses of K-Means clustering:

  • Centroids are not interpretable, since they are not actual data points but the mean of the points in the cluster.
  • Sensitive to outliers, because a mean is easily influenced by extreme values.

  2. K-Medoids Clustering

K-medoids, or Partitioning Around Medoids (PAM), is a variant of K-means that is more robust to noise and outliers. Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, the one with the minimum sum of distances to the other points.
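
To make the difference between a centroid and a medoid concrete, the following sketch compares the two on a tiny one-dimensional cluster containing a single extreme value. The numbers are made up for illustration.

```r
# One-dimensional toy cluster with a single extreme value (illustrative data)
values <- c(1, 2, 3, 4, 100)

# Centroid as used by K-means: the arithmetic mean, pulled toward the outlier
centroid <- mean(values)

# Medoid: the observation with the smallest total distance to all others
dist_matrix <- as.matrix(dist(values))
medoid <- values[which.min(rowSums(dist_matrix))]

c(centroid = centroid, medoid = medoid)
```

The mean lands at 22, a value far from every typical observation, while the medoid is 3, an actual data point in the middle of the bulk of the cluster.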

Data preprocessing

# read and preprocess data
library(tidyverse)
library(readr)

# EDA
library(knitr)
library(kableExtra) # table
library(GGally) # ggcorr
library(FactoMineR) #pca

# Clustering
library(factoextra) # k-means
library(cluster) # pam, k-medoids
library(clValid) # cluster validation

Read the data

financial_database <- read_delim("stock_market.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

gdp_capita <- read_delim("gdp_capita.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
glimpse(financial_database)
## Rows: 13,054
## Columns: 122
## $ iso3    <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
## $ iso2    <chr> "AW", "AW", "AW", "AW", "AW", "AW", "AW", "AW", "AW", "AW", "A…
## $ imfn    <dbl> 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 31…
## $ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba",…
## $ region  <chr> "Latin America & Caribbean", "Latin America & Caribbean", "Lat…
## $ income  <chr> "High income", "High income", "High income", "High income", "H…
## $ year    <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19…
## $ ai01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai04    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai05    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai08    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai09    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai11    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai12    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai13    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai17    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai18    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai19    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai20    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai21    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai22    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai23    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai24    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai25    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai26    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai27    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai28    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai29    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai30    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai31    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai32    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai33    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai34    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai35    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ai36    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ am01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ am02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ am03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di04    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di05    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di08    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di09    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di11    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di12    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di13    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ di14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm04    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm05    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm08    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm09    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm11    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm12    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm13    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm15    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ dm16    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei04    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei05    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei08    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei09    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ei10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ em01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si03    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si04    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si05    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ si07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sm01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi06    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi07    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi08    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi09    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi11    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi12    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi13    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi15    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi16    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi16a   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi17    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi18    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ oi19    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oi20a   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ om01    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ om02    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...114  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...115  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...116  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...117  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...118  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...119  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...120  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...121  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ...122  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Data Wrangling

The financial database from WDI comprises 13,054 rows and 122 columns, most of which are indicators of international development statistics. Since we only need the indicators related to stock market development in 2020, we subset the dataset accordingly.

# subset the rows from year 2020 and the indicators related to stock market development
stock <- financial_database %>%
  filter(year == 2020) %>%
  select(country, region, income, year, am01, dm01, dm02, em01, sm01, om01, om02) %>%
  rename("value/total" = "am01",
         "marketcap_gdp" = "dm01",
         "tradedvalue_gdp" = "dm02",
         "turnover_ratio" = "em01",
         "price_volatility" = "sm01",
         "n_listed_per_1m_people" = "om01",
         "market_return" = "om02")
stock

Data Dictionary

  • value/total: Value of shares traded excluding top 10 traded companies/total value traded (%)
  • marketcap_gdp: Stock market capitalization to GDP (%)
  • tradedvalue_gdp: Stock market total value traded to GDP (%)
  • turnover_ratio: Total values of shares traded to average market capitalization ratio
  • price_volatility: Stock price volatility (average 360 days)
  • n_listed_per_1m_people: Number of listed companies per 1 million people
  • market_return : Stock market return (%)

A quick glance at the data shows that many countries have missing values for the indicators. This is to be expected, since many countries don’t have a structured stock market or a reliable data source. Therefore, as a first step, we keep only the countries with fewer than 3 missing indicators.

stock <- stock[rowSums(is.na(stock)) <= 2, ] 

Then, we check for any remaining missing values.

colSums(is.na(stock))
##                country                 region                 income 
##                      0                      0                      0 
##                   year            value/total          marketcap_gdp 
##                      0                     52                      0 
##        tradedvalue_gdp         turnover_ratio       price_volatility 
##                      0                      0                      0 
## n_listed_per_1m_people          market_return 
##                      0                      0

None of the 52 countries remaining after this filter has data on the value of shares traded excluding the top 10 companies, so we drop that variable too.

stock <- stock %>%
  select(-c("value/total"))

Now that we have cleaned the financial database, we are going to also clean the gdp_capita tibble.

glimpse(gdp_capita)
## Rows: 266
## Columns: 65
## $ `Country Name`   <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan"…
## $ `Country Code`   <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB…
## $ `Indicator Name` <chr> "GDP per capita (constant 2015 US$)", "GDP per capita…
## $ `Indicator Code` <chr> "NY.GDP.PCAP.KD", "NY.GDP.PCAP.KD", "NY.GDP.PCAP.KD",…
## $ `1960`           <dbl> NA, NA, NA, 1084.7147, NA, NA, NA, NA, NA, 7362.5341,…
## $ `1961`           <dbl> NA, NA, NA, 1082.4173, NA, NA, NA, NA, NA, 7637.0667,…
## $ `1962`           <dbl> NA, NA, NA, 1099.6853, NA, NA, NA, NA, NA, 7451.8034,…
## $ `1963`           <dbl> NA, NA, NA, 1154.9990, NA, NA, NA, NA, NA, 6945.9571,…
## $ `1964`           <dbl> NA, NA, NA, 1191.7019, NA, NA, NA, NA, NA, 7532.0045,…
## $ `1965`           <dbl> NA, NA, NA, 1212.8017, NA, NA, NA, NA, NA, 8202.1125,…
## $ `1966`           <dbl> NA, NA, NA, 1165.3788, NA, NA, NA, NA, NA, 8026.8762,…
## $ `1967`           <dbl> NA, NA, NA, 1030.9713, NA, NA, NA, NA, NA, 8161.6021,…
## $ `1968`           <dbl> NA, NA, NA, 1022.9248, NA, NA, NA, NA, NA, 8429.8689,…
## $ `1969`           <dbl> NA, NA, NA, 1154.6329, NA, NA, NA, NA, NA, 9108.4969,…
## $ `1970`           <dbl> NA, NA, NA, 1329.9716, NA, NA, 35391.0736, NA, NA, 92…
## $ `1971`           <dbl> NA, NA, NA, 1439.1367, NA, NA, 35159.4666, NA, NA, 96…
## $ `1972`           <dbl> NA, NA, NA, 1448.9954, NA, NA, 36166.4134, NA, NA, 96…
## $ `1973`           <dbl> NA, NA, NA, 1473.2077, NA, NA, 37123.2623, NA, NA, 97…
## $ `1974`           <dbl> NA, NA, NA, 1583.3625, NA, NA, 37504.7417, NA, NA, 10…
## $ `1975`           <dbl> NA, NA, NA, 1510.1818, NA, NA, 36246.6833, 3843.6842,…
## $ `1976`           <dbl> NA, NA, NA, 1598.1492, NA, NA, 36175.3210, 4314.2287,…
## $ `1977`           <dbl> NA, NA, NA, 1629.3541, NA, NA, 36081.6577, 4526.1867,…
## $ `1978`           <dbl> NA, NA, NA, 1550.8158, NA, NA, 35551.7356, 4354.4131,…
## $ `1979`           <dbl> NA, NA, NA, 1587.8463, NA, NA, 34462.4928, 4702.4260,…
## $ `1980`           <dbl> NA, 1368.9478, NA, 1574.8244, 3550.0830, 1740.6022, 3…
## $ `1981`           <dbl> NA, 1381.2169, NA, 1425.6651, 3276.3617, 1804.1107, 3…
## $ `1982`           <dbl> NA, 1344.9731, NA, 1340.3975, 3162.0041, 1818.4685, 3…
## $ `1983`           <dbl> NA, 1309.3280, NA, 1218.3694, 3179.3482, 1799.9783, 3…
## $ `1984`           <dbl> NA, 1314.8797, NA, 1191.2855, 3252.1079, 1740.4441, 3…
## $ `1985`           <dbl> NA, 1275.8135, NA, 1223.2448, 3248.6081, 1735.3864, 2…
## $ `1986`           <dbl> 17231.3798, 1266.2303, NA, 1206.8273, 3226.8172, 1798…
## $ `1987`           <dbl> 20262.9449, 1276.6446, NA, 1191.3631, 3242.5791, 1748…
## $ `1988`           <dbl> 24343.2552, 1292.2509, NA, 1216.0062, 3323.5343, 1691…
## $ `1989`           <dbl> 27313.4954, 1287.0947, NA, 1210.6611, 3212.6627, 1808…
## $ `1990`           <dbl> 27884.2532, 1252.1151, NA, 1256.1422, 2998.7808, 1606…
## $ `1991`           <dbl> 28953.5247, 1214.8996, NA, 1236.8728, 2929.4829, 1163…
## $ `1992`           <dbl> 29031.7438, 1151.7327, NA, 1237.0230, 2669.4347, 1086…
## $ `1993`           <dbl> 29324.7921, 1105.2855, NA, 1190.9514, 1964.3897, 1197…
## $ `1994`           <dbl> 29989.0181, 1096.5961, NA, 1156.4541, 1927.5005, 1305…
## $ `1995`           <dbl> 29367.3097, 1115.1215, NA, 1148.4120, 2146.4557, 1488…
## $ `1996`           <dbl> 28684.5650, 1147.1060, NA, 1170.2763, 2360.0878, 1633…
## $ `1997`           <dbl> 29901.3593, 1154.8608, NA, 1181.6905, 2451.6087, 1464…
## $ `1998`           <dbl> 29857.5597, 1145.3600, NA, 1191.4003, 2485.0668, 1603…
## $ `1999`           <dbl> 29640.0452, 1145.4260, NA, 1176.6693, 2458.0965, 1821…
## $ `2000`           <dbl> 31245.7240, 1151.3585, NA, 1186.1470, 2451.5098, 1961…
## $ `2001`           <dbl> 29656.1040, 1160.9641, NA, 1217.8176, 2471.6646, 2143…
## $ `2002`           <dbl> 28051.1192, 1176.8931, 332.7669, 1304.3407, 2717.4410…
## $ `2003`           <dbl> 28008.5036, 1180.5759, 345.6396, 1342.7734, 2705.7061…
## $ `2004`           <dbl> 29695.4930, 1215.1269, 335.7017, 1409.6818, 2900.1670…
## $ `2005`           <dbl> 29670.2877, 1261.3358, 359.8990, 1452.7688, 3220.0781…
## $ `2006`           <dbl> 29743.5296, 1314.4977, 368.0086, 1488.0236, 3464.2347…
## $ `2007`           <dbl> 30160.5278, 1374.2537, 408.5735, 1528.6748, 3806.8492…
## $ `2008`           <dbl> 30092.7394, 1405.0679, 415.0870, 1578.7588, 4077.7768…
## $ `2009`           <dbl> 26903.1714, 1388.1262, 491.9421, 1631.7322, 3963.2484…
## $ `2010`           <dbl> 25857.4934, 1421.4840, 547.3549, 1693.8635, 3988.6236…
## $ `2011`           <dbl> 26647.6435, 1444.5014, 532.6800, 1729.4571, 3979.8151…
## $ `2012`           <dbl> 26150.6369, 1449.6106, 580.4873, 1771.9097, 4167.1258…
## $ `2013`           <dbl> 27090.0262, 1479.0167, 591.9471, 1828.2351, 4220.9648…
## $ `2014`           <dbl> 27011.3203, 1503.5066, 588.0089, 1884.5173, 4272.4555…
## $ `2015`           <dbl> 28396.9084, 1507.8003, 578.4664, 1886.2482, 4166.9798…
## $ `2016`           <dbl> 28847.8141, 1501.6713, 575.3344, 1838.2711, 3924.6205…
## $ `2017`           <dbl> 29286.2493, 1507.8214, 575.7071, 1830.3818, 3790.7916…
## $ `2018`           <dbl> NA, 1507.8611, 568.8279, 1834.3666, 3595.1067, 4433.7…
## $ `2019`           <dbl> NA, 1499.2563, 577.5631, 1843.5585, 3458.6505, 4549.4…
## $ `2020`           <dbl> NA, 1418.3805, 553.4891, 1778.9891, 3213.7842, 4424.3…

The data consists of 266 rows (countries as well as regional aggregates such as "Africa Eastern and Southern") with GDP per capita since 1960 in constant 2015 US$. We keep only the country name and the 2020 values, and then omit rows with NA.

gdp_capita <- gdp_capita %>%
  select(c("Country Name", "2020")) %>%
  na.omit() %>%
  rename("gdp_per_capita" = "2020")

Afterwards, the stock tibble is left-joined with the gdp_capita tibble.

stock <- stock %>%
  left_join(gdp_capita, by = c("country" = "Country Name"))
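
Because the join key is a free-text country name, spelling differences between the two sources would silently produce NA values. The base R sketch below shows one way to check for such mismatches. The toy tables `stock_toy` and `gdp_toy`, and every value in them, are invented for illustration and are not taken from the actual data.

```r
# Toy stand-ins for the two tables (names and values are illustrative only)
stock_toy <- data.frame(country = c("Indonesia", "Egypt, Arab Rep.", "Korea"),
                        marketcap_gdp = c(47, 12, 130))
gdp_toy <- data.frame(`Country Name` = c("Indonesia", "Egypt, Arab Rep.", "Korea, Rep."),
                      gdp_per_capita = c(3800, 3900, 31000),
                      check.names = FALSE)

# Countries in the stock table with no exact name match in the GDP table
unmatched <- setdiff(stock_toy$country, gdp_toy$`Country Name`)
unmatched

# Base R equivalent of a left join; unmatched rows receive NA
joined <- merge(stock_toy, gdp_toy,
                by.x = "country", by.y = "Country Name", all.x = TRUE)
joined[is.na(joined$gdp_per_capita), ]
```

Joining on a standardized code such as the ISO3 code, which both source files appear to contain, would be a more robust alternative, at the cost of carrying an extra column through the earlier steps.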

Then, the tibble is changed into a dataframe to prepare for our subsequent data wrangling.

stock_df <- as.data.frame(stock)

Set the country name as row index.

rownames(stock_df) <- stock_df$country

Remove the non-numeric columns from the dataframe. This is important because k-means and k-medoids clustering don’t accept non-numeric variables.

stock_clean <- stock_df %>%
  select(-c("country", "region", "income", "year"))

Exploratory Data Analysis

stock_clean

Using kable() from the knitr library, we display the summary of the dataset:

summary(stock_clean) %>% 
  kable("html", table.attr = "style = \"color: black;\"") %>% kable_styling()

Statistic  marketcap_gdp  tradedvalue_gdp  turnover_ratio  price_volatility  n_listed_per_1m_people  market_return  gdp_per_capita
Min.                3.09           0.0297            0.09              6.15                  0.3683        -29.040           976.2
1st Qu.            24.64           0.6675            3.21             16.02                  2.4250        -16.453          4076.4
Median             44.97           8.1900           18.39             19.20                  6.8350         -8.214         10790.3
Mean              100.01          51.2927           41.33             19.69                 24.0277         -8.213         18528.7
3rd Qu.            77.33          51.1175           42.06             23.51                 27.6500         -1.820         24632.8
Max.             1768.80         885.6000          365.77             34.58                314.5000         14.731        101206.8

  • marketcap_gdp: The median market capitalization to GDP ratio is 45%, while the mean is 100%. The variable is therefore right-skewed, with extreme values in the right tail (the maximum is 1,769%).

  • tradedvalue_gdp: In line with the literature, the traded value of shares to GDP ratio is correlated with the market capitalization to GDP ratio. This variable is also right-skewed, with a median of 8.2% and a mean of 51.3%.

  • turnover_ratio: Also right-skewed, with a median stock turnover ratio of 18.4 and a mean of 41.3.

  • price_volatility: The average 360-day stock price volatility is fairly uniform across countries, with a median of 19.2 and a mean of 19.7. The minimum price volatility is 6.2 and the maximum is 34.6.

  • n_listed_per_1m_people: The number of listed companies per 1 million people is also right-skewed. The median is about 7 companies per 1 million people, but the mean is about 24.

  • market_return: Reflecting the negative impact of the Covid-19 pandemic on stock markets worldwide, the median and the mean market return are both about -8.2%.

Next, we compute the pairwise Pearson correlations between the variables.

ggcorr(stock_clean, label = T, label_size = 3, hjust = 1)

As can be seen from the graph above, the traded value of shares to GDP ratio and the number of listed companies are highly correlated with the market capitalization to GDP ratio. This is in line with our literature review. K-means clustering does not require uncorrelated variables, although correlated variables effectively receive more weight in the Euclidean distance. We therefore leave the dataset as it is.

Next, we are going to perform PCA on stock_clean for outlier detection.

stock_pca <- PCA(X = stock_clean, 
                scale.unit = TRUE, # scale the data to unit variance
                graph = FALSE, # no plot
                ncp = 8) # number of components to keep
plot.PCA(x = stock_pca, # pca object
         choix = "ind", # create observation plot
         select = "contrib10") # label the 10 observations that contribute most

Since the K-Means algorithm is sensitive to outliers, we remove the most extreme one, Hong Kong SAR, China.

# remove outliers
row.names.remove <- c("Hong Kong SAR, China")

stock_clean_kmeans <- stock_clean[!(row.names(stock_clean) %in% row.names.remove), ]

Cluster Modelling

K-Means Clustering

  1. Choosing the optimal K

Before fitting K-Means, we need to determine the optimal K using the Elbow method. It is a visual method: the total within-cluster sum of squares (WSS) is plotted against the number of clusters, and the point where the curve bends most sharply (the elbow) indicates the best number of clusters, since adding further clusters beyond it yields only small reductions in WSS.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

fviz_nbclust(x = stock_clean_kmeans, # data to be clustered
             FUNcluster = kmeans, # clustering function used
             method = "wss")

The optimum number of clusters is 3, which is consistent with our domain knowledge that stock markets are usually classified into more than 2 development groups (developed, emerging, and frontier). However, after careful consideration of the business need, the final number of clusters used is 4.
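
For readers who want to see what an elbow plot is built from, the sketch below computes the total within-cluster sum of squares for a range of k using only base R. It runs on synthetic data with three obvious groups, not on the stock data, and the choice of `nstart = 25` is an assumption made to reduce the effect of random starts.

```r
# Toy data with three well-separated groups (illustrative, not the stock data)
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 8),  ncol = 2),
             matrix(rnorm(60, mean = 16), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where adding another cluster stops reducing WSS substantially
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve drops steeply up to k = 3 and flattens afterwards. On real data the bend is usually less pronounced, which is why domain knowledge also enters the choice of k.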

  2. Computing K-means clustering

# k-means with 4 clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

stock_km <- kmeans(stock_clean_kmeans, centers = 4)

Let’s check the size of each cluster. The cluster sizes are uneven: cluster 1 contains only 2 countries, while cluster 4 contains 24.

stock_km$size
## [1]  2  9 16 24

  3. Cluster Goodness of Fit

One measure of the goodness of fit of K-means clustering is the intracluster and intercluster distance. A good clustering has a low intracluster distance and a high intercluster distance. In a similar vein, we compare the between-cluster sum of squares (the squared distances of the centroids to the global sample mean, weighted by cluster size) with the total sum of squares (the squared distances of the observations to the global sample mean). The closer this ratio is to 1, the better the clustering model represents the actual observations.

stock_km$betweenss/stock_km$totss
## [1] 0.9390834

The ratio of the between-cluster sum of squares to the total sum of squares is 0.94. The clustering model therefore represents the individual observations well.
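
To show where the two components of this ratio come from, the sketch below recomputes them by hand for a small K-means fit and compares the result with the values reported by `kmeans()`. It uses synthetic data rather than the stock data.

```r
# Toy data and a K-means fit (illustrative, not the stock data)
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)

global_mean <- colMeans(toy)

# Total sum of squares: squared distance of every observation to the global mean
totss <- sum(sweep(toy, 2, global_mean)^2)

# Between-cluster sum of squares: squared distance of each centroid to the
# global mean, weighted by the number of observations in that cluster
betweenss <- sum(fit$size * rowSums(sweep(fit$centers, 2, global_mean)^2))

c(manual = betweenss / totss, kmeans = fit$betweenss / fit$totss)
```

The ratio always increases as k grows and reaches 1 when every observation is its own cluster, so it is best used to compare solutions with the same k or alongside the elbow plot.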

  4. Summary
  • Summary Statistics

# cluster label of each country, with the country name as a join key
a <- data.frame(stock_km$cluster)

a$country <- row.names(a)

# mean of every indicator per cluster
stock_df %>%
  left_join(a, by = c("country")) %>%
  rename("Cluster" = "stock_km.cluster") %>%
  select(-c("country", "region", "income", "year")) %>%
  group_by(Cluster) %>%
  summarise_all(mean) %>%
  na.omit() # drops rows containing NA, such as the removed outlier, which has no cluster

  • Cluster Visualization

fviz_cluster(stock_km, data = stock_clean_kmeans, labelsize = 8)

Cluster 1 (anomaly):

Of all clusters, cluster 1 has the fewest observations. Switzerland and Luxembourg are both high income countries with an above average number of listed companies per 1 million people and market capitalization to GDP ratio. However, they sit at opposite ends of the spectrum when it comes to traded value to GDP ratio, turnover ratio and price volatility. While Switzerland boasts a high traded value and stock turnover ratio, indicating a bustling and active equity market, Luxembourg has a very low traded value to GDP ratio of 0.07 and a stock turnover ratio of a mere 0.09.

Overall, this cluster exhibits the highest market capitalization to GDP ratio, traded value of shares to GDP ratio, price volatility, and GDP per capita. It has the second lowest stock turnover and the second highest number of listed companies per 1 million people.

Cluster 2 (most developed stock markets):

This is the cluster of developed stock markets, consisting of markets from high income countries in North America, Western Europe, Australia and New Zealand, and a minority from the Middle East and Eastern Asia. This cluster is characterized by the second highest market capitalization to GDP ratio (after cluster 1), traded value of shares to GDP ratio, stock turnover, price volatility and GDP per capita. It has the highest number of listed companies and the highest stock return of all clusters.

Cluster 3 (moderately developed stock markets):

This cluster contains the moderately developed stock markets: markets from high and middle income countries in Europe, Latin America, and Asia Pacific. It has the second lowest market capitalization to GDP ratio, traded value of shares to GDP ratio, price volatility, number of listed companies per 1 million people, market return and GDP per capita. However, it has the highest stock turnover ratio.

Cluster 4 (least developed stock markets):

Cluster 4 contains the least developed stock markets: markets from emerging and frontier countries in Asia Pacific, North and Sub-Saharan Africa, Latin America, the Middle East, and Eastern Europe. This cluster is characterized by the lowest market capitalization to GDP ratio, traded value of shares to GDP ratio, stock turnover ratio, price volatility, number of listed companies per 1 million people, market return and GDP per capita. Cluster 4 has the highest number of observations of all clusters.

Let’s analyze the clusters further. We would like to know whether there is a correlation between a country’s income level and its level of stock market development. First, we create an object called stock_final by left joining the original dataframe, stock_df, with the dataframe that holds the cluster labels. Then, we create a new column that label-encodes the income column: high income, upper middle income and lower middle income are coded as 3, 2, and 1 respectively.

stock_final <- stock_df %>%
  left_join(a, by = c("country")) %>%
  rename("Cluster" = "stock_km.cluster") %>%
  select(-c("year"))

# create a new column using conditional mutating
stock_final <- stock_final %>%
  mutate(income_code = case_when(
    income == "High income" ~ 3,
    income == "Upper middle income" ~ 2,
    income == "Lower middle income" ~ 1
  ))

Let’s group the dataframe using the cluster and income first.

table(stock_final$Cluster, stock_final$income)
##    
##     High income Lower middle income Upper middle income
##   1           2                   0                   0
##   2           9                   0                   0
##   3           9                   0                   7
##   4           0                  12                  12
stock_final %>%
  group_by(Cluster, income) %>%
  select_if(is.numeric) %>%
  summarise_all(mean) %>%
  na.omit()

At first glance, clusters 1 and 2, the most developed stock markets, contain only high income countries. Cluster 3, the moderately developed stock markets, contains both high and upper middle income countries, and cluster 4, the least developed stock markets, contains both upper middle and lower middle income countries.

Then, we compute the correlation between stock market development, proxied by Cluster, and income level, proxied by income_code.

# finding the correlation between cluster of stock market and income level
cor(stock_final$Cluster, stock_final$income_code, use = "complete.obs")
## [1] -0.7755141

As expected, there is a strong correlation between the income level and the level of stock market development. The coefficient is negative only because of the cluster numbering: higher cluster numbers denote less developed markets, so higher income goes with more developed markets. This is in line with Bayraktar’s (2014) study: the income level of a country or the development level of other financial indicators can be important factors determining the market capitalization of countries. Initially higher-income countries are expected to have larger market capitalization, while initially lower-income countries can have a limited capacity to accomplish such a high standard.
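Because the cluster numbers are arbitrary labels, the sign of the coefficient depends on how they happen to be ordered. The sketch below rebuilds one row per country from the cross-tabulation printed above, which should reproduce the reported coefficient, and then shows that reversing the labels flips the sign without changing the strength. The names `counts` and `rebuilt` are hypothetical, and the Spearman rank correlation is swapped in here only as an alternative for ordinal codes; it is not part of the original analysis.

```r
# Counts taken from table(stock_final$Cluster, stock_final$income) above;
# income_code follows the 3/2/1 label encoding used in this section
counts <- data.frame(
  Cluster     = c(1, 2, 3, 3, 4, 4),
  income_code = c(3, 3, 3, 2, 2, 1),
  n           = c(2, 9, 9, 7, 12, 12)
)

# Expand the counts into one row per country
rebuilt <- counts[rep(seq_len(nrow(counts)), counts$n), c("Cluster", "income_code")]

pearson  <- cor(rebuilt$Cluster, rebuilt$income_code)
spearman <- cor(rebuilt$Cluster, rebuilt$income_code, method = "spearman")

# Reversing the cluster labels (4 becomes 1, 3 becomes 2, ...) flips the sign
pearson_reversed <- cor(5 - rebuilt$Cluster, rebuilt$income_code)

round(c(pearson = pearson, spearman = spearman, reversed = pearson_reversed), 4)
```

Note that cluster 1 is an anomaly cluster rather than a step on the development scale, so any single coefficient computed on the raw labels is only a rough summary.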

But as seen from the data above, the upper middle income countries in cluster 4 have a higher market capitalization to GDP ratio and a higher traded stock value to GDP ratio than the high income countries in cluster 3. This indicates a more liquid market. The turnover ratio and price volatility show the reverse pattern: the high income countries in cluster 3 are higher on both, which suggests that relatively fewer shares are available, so a sudden increase in demand could have a considerable effect on the stock price.

K-Medoids

Unlike K-Means, whose objective is to minimize the sum of squared distances between the points and their respective cluster centroids, K-Medoids minimizes the absolute error criterion (“TD”, Total Deviation). TD is the sum of the dissimilarities of each point to the medoid of its cluster. This makes K-Medoids a robust alternative to K-Means when outliers are present (and we are unwilling to remove them).
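To make the criterion concrete, the following sketch computes TD by hand for a `pam()` fit from the `cluster` package. It runs on hypothetical toy data with one deliberate outlier, not on the stock market dataset, and assumes Euclidean distance as the dissimilarity.

```r
library(cluster)  # provides pam()

set.seed(1)
# Toy data (hypothetical): two 2-D groups plus one extreme outlier
toy <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 5), ncol = 2),
  c(30, 30)
)
fit <- pam(toy, k = 2)

# Distance from every observation to every medoid (one column per cluster)
dm <- as.matrix(dist(toy))[, fit$id.med]

# TD: each observation contributes its distance to the medoid of its own cluster
td <- sum(dm[cbind(seq_len(nrow(toy)), fit$clustering)])
td
```

Because the distances enter TD unsquared and the centers must be actual observations, a single extreme point pulls the solution less than it would pull a K-Means centroid.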

Using the optimal cluster number from the elbow method earlier and the dataframe with outliers, we fit a K-Medoids model.

stock_kmed <- pam(stock_clean, k = 4)

In contrast to the k-means algorithm, k-medoids chooses datapoints as centers (medoids or exemplars).

stock_kmed$id.med
## [1] 14 25 40 31
stock_df[c(14, 25, 40, 31), 1]
## [1] "Germany"    "Jordan"     "Panama"     "Luxembourg"

The medoids of clusters 1, 2, 3 and 4 are Germany, Jordan, Panama and Luxembourg respectively. Each observation is assigned to the cluster whose medoid is closest to it.

b <- data.frame(stock_kmed$cluster)

b$country <- row.names(b)

stock_df %>%
  left_join(b, by = c("country")) %>%
  rename("Cluster" = "stock_kmed.cluster") %>%
  select(-c("country", "region", "income", "year")) %>%
  group_by(Cluster) %>%
  summarise_all(mean) %>%
  na.omit()

Cluster 1

This cluster exhibits the highest market capitalization to GDP ratio, traded value of shares to GDP ratio, number of listed companies per 1 million people and market return. It has the second highest turnover ratio, price volatility and GDP per capita.

Cluster 2

This cluster has the lowest market capitalization to GDP ratio, traded value of shares to GDP ratio, turnover ratio, price volatility, number of listed companies per 1 million people, market return and GDP per capita.

Cluster 3

This cluster has the second lowest market capitalization to GDP ratio, traded value of shares to GDP ratio, price volatility, number of listed companies per 1 million people, market return and GDP per capita. It also has the highest turnover ratio.

Cluster 4

This cluster exhibits the second highest market capitalization to GDP ratio, traded value of shares to GDP ratio, number of listed companies per 1 million people and market return. It has the highest GDP per capita and price volatility, and the second lowest turnover ratio.

fviz_cluster(stock_kmed, data = stock_clean, labelsize = 8)

Cluster Validation

Choosing between K-Means and K-Medoids can be hard, as there are many indices for assessing the goodness of fit of a clustering result. The clValid package in R makes the task easier: it compares multiple clustering algorithms in a single function call, which helps identify the best clustering approach and the optimal number of clusters.

In clValid, there are 3 different types of clustering validation measures:

  • Clustering internal validation, which uses intrinsic information in the data to assess the quality of the clustering.
  • Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.
  • Clustering biological validation, which evaluates the ability of a clustering algorithm to produce biologically meaningful clusters.

We focus on internal validation. Note also that the best value of k depends on our business question, regardless of the validation result.

# Compute clValid
clmethods <- c("kmeans","pam")
validation <- clValid(stock_clean, nClust = 2:7,
              clMethods = clmethods, validation = "internal")
# Summary
summary(validation)
## 
## Clustering Methods:
##  kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 
## 
## Validation Measures:
##                            2       3       4       5       6       7
##                                                                     
## kmeans Connectivity   5.7155  8.4274 13.9004 18.0373 20.0373 25.1671
##        Dunn           0.0822  0.1995  0.1327  0.1834  0.2059  0.0711
##        Silhouette     0.7203  0.7258  0.6432  0.6625  0.6412  0.6189
## pam    Connectivity   5.9714  9.5008 13.5587 17.4516 21.5885 23.6806
##        Dunn           0.0211  0.0036  0.0092  0.0604  0.0835  0.1196
##        Silhouette     0.6931  0.5273  0.5898  0.6203  0.6443  0.6471
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 5.7155 kmeans 2       
## Dunn         0.2059 kmeans 6       
## Silhouette   0.7258 kmeans 3

Interpretation for k = 4

  • Connectivity indicates the degree of connectedness of the clusters, that is, how often an observation shares a cluster with its nearest neighbours. It takes a value between 0 and infinity and should be minimized. The K-Medoids (PAM) model has a slightly lower Connectivity than K-Means (13.5587 < 13.9004), meaning that its clusters keep neighbouring observations together slightly more often.

  • Silhouette width combines measures of compactness and separation of the clusters. Its values range from -1 (poorly clustered observations) to 1 (well clustered observations). K-Means has the higher Silhouette coefficient (0.6432) compared to PAM (0.5898), so on compactness and separation combined, K-Means is the better model of the two.

  • The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It takes a value between 0 and infinity and should be maximized. The K-Means model has the higher Dunn index (0.1327) compared to PAM (0.0092).
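The three measures can also be computed by hand, which makes their definitions easier to follow. The sketch below does so on hypothetical toy data with two clearly separated groups; the object names and the neighbourhood size `n_neighbours` are assumptions chosen for illustration, and clValid’s own defaults may differ.

```r
library(cluster)  # ships with R; provides silhouette()

set.seed(42)
# Toy data (hypothetical): two well-separated 2-D groups of 10 points each
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
cl <- kmeans(toy, centers = 2, nstart = 10)$cluster
d  <- dist(toy)
dm <- as.matrix(d)

# Silhouette: average of the per-observation widths s(i)
avg_sil <- mean(silhouette(cl, d)[, "sil_width"])

# Dunn: smallest between-cluster distance over largest within-cluster distance
same <- outer(cl, cl, "==")
dunn <- min(dm[!same]) / max(dm[same])

# Connectivity: penalty 1/j whenever the j-th nearest neighbour of an
# observation sits in a different cluster
n_neighbours <- 5
connectivity <- sum(sapply(seq_len(nrow(dm)), function(i) {
  nn <- setdiff(order(dm[i, ]), i)[seq_len(n_neighbours)]
  sum((cl[nn] != cl[i]) / seq_len(n_neighbours))
}))

round(c(silhouette = avg_sil, dunn = dunn, connectivity = connectivity), 3)
```

On data this cleanly separated, the silhouette is close to 1, the Dunn index is above 1 and the connectivity is 0. The much lower Dunn values in the table above reflect clusters whose closest members lie far nearer to each other than the clusters are wide.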

Overall, at k = 4 the K-Means model comes out ahead on two of the three measures (Silhouette and Dunn), with better compactness and separation of clusters, even on the data with outliers.

Conclusion

We can summarize that:

  1. During exploratory data analysis, PCA shows that there is an outlier in the dataset, ‘Hongkong SAR, China’. This particular stock market has an extremely high stock market capitalization to GDP ratio, traded stock value to GDP ratio, and number of listed companies. Since the K-Means model is sensitive to outliers, we took the observation out. We then explored whether K-Medoids is the better alternative, because it is less sensitive to outliers.

  2. We set the number of clusters (k) to 4 since it suits our business question, which is to cluster the global stock markets according to their level of development (highly developed, moderately developed, and least developed). Three clusters correspond to these development levels, and the fourth captures the anomalies, Switzerland and Luxembourg.

  3. Between the K-Means and K-Medoids models, K-Means is the better algorithm when k is 4. During internal validation, K-Medoids (PAM) has the slightly better Connectivity score, but K-Means wins on the Silhouette coefficient and the Dunn index, which take both cluster compactness and separation into account.

  4. Using the stock market development indicators (market capitalization to GDP ratio, traded stock value to GDP ratio, stock turnover ratio, stock price volatility, number of listed companies per 1 million people, market return, and GDP per capita), our K-Means model divides the stock markets into 4 clusters, where:

  • cluster 1 contains the anomalies, Switzerland and Luxembourg: high income countries with a high market capitalization to GDP ratio and a low turnover ratio at the cluster level, but sharply different trading activity between the two markets.
  • cluster 2 contains the most developed stock markets, from high income countries in North America, Western Europe, Australia and New Zealand, and a minority from the Middle East and Eastern Asia.
  • cluster 3 contains the moderately developed stock markets, from high and middle income countries in Europe, Latin America, and Asia Pacific.
  • cluster 4 contains the least developed stock markets, from emerging and frontier countries in Asia Pacific, North and Sub-Saharan Africa, Latin America, the Middle East, and Eastern Europe.
  5. While stock market development is strongly correlated with income level, not all high income countries are in the most developed stock markets cluster. This is because, while these countries have a high GDP per capita, their market depth, proxied by market capitalization to GDP, is relatively lower. This means that these countries have relatively lower market liquidity.

  6. The clusters may help investors with their international investment strategy. In addition, policymakers may benefit from the information when deciding on the best policies for stock market development.

It should be noted that the outcomes of this project need to be interpreted carefully, since they complement but do not substitute for a detailed analysis of a country’s market capitalization. The design of financial reforms must be country specific and can be constructed only after a comprehensive analysis of the country’s capacity, financial performance, and political structure.