Introduction

In a world where data increasingly dictates the trajectory of finance and business, assessing and classifying organizations by their trading metrics creates unmatched opportunities. Clustering Organizations Based on Trading Metrics takes a practical approach to applying clustering algorithms to trading data, with the goal of grouping organizations that share similar behaviors, performance, or features.

This research therefore examines the use of clustering algorithms to find patterns in trading metrics derived from price and volume characteristics. Segmenting companies that exhibit similar trading behavior can give investors, analysts, and strategists valuable insights for defining peer groups, benchmarking, and even predicting trends.

Data Set

The dataset is a preprocessed collection of trading metrics for 50 companies listed on NASDAQ, prepared for cluster analysis. The historical trading data was extracted with Python from the online Yahoo Finance repository. It contains the main trading metrics (opening price, closing price, high and low prices, and number of traded shares) together with basic identification fields such as the ticker symbol, which links each record to its company. Data was gathered from 4 January 2022 to 29 November 2024.
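The original extraction was done in Python and is not reproduced here; as an illustration, an equivalent pull can be sketched in R with the quantmod package (the ticker list below is a hypothetical subset of the 50 companies, and the dates assume the day/month reading of the stated range):

library(quantmod)

# Hypothetical subset of the 50 NASDAQ tickers in the dataset
tickers <- c("AAPL", "MSFT", "AMZN")

# Download OHLCV data from Yahoo Finance and stack it into one data frame
raw <- lapply(tickers, function(tk) {
  x <- getSymbols(tk, src = "yahoo",
                  from = "2022-01-04", to = "2024-11-29",
                  auto.assign = FALSE)
  data.frame(Ticker = tk, Date = index(x),
             Open = as.numeric(Op(x)), High = as.numeric(Hi(x)),
             Low = as.numeric(Lo(x)), Close = as.numeric(Cl(x)),
             Volume = as.numeric(Vo(x)))
})
data <- do.call(rbind, raw)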

Data preparation

The data was preprocessed by aggregating the stock metrics for each company in preparation for the clustering analysis of NASDAQ companies. The data was grouped by the “Ticker” column, and key summary statistics were calculated: the mean closing price, mean high price, mean low price, mean trading volume, and price volatility, computed as the standard deviation of the high-low range. These features provide a compact representation of each company’s stock behavior over the analysis period, enabling effective clustering.

library(dplyr)

# Load the preprocessed NASDAQ trading data
data <- read.csv("nasdaq_data_clustering.csv")

# Aggregate per-ticker summary features for clustering
summary_data <- data %>%
  group_by(Ticker) %>%
  summarise(
    avg_close = mean(Close, na.rm = TRUE),
    avg_high = mean(High, na.rm = TRUE),
    avg_low = mean(Low, na.rm = TRUE),
    avg_volume = mean(Volume, na.rm = TRUE),
    volatility = sd(High - Low, na.rm = TRUE)  # spread of the daily high-low range
  )

print(head(summary_data))
## # A tibble: 6 × 6
##   Ticker avg_close avg_high avg_low avg_volume volatility
##   <chr>      <dbl>    <dbl>   <dbl>      <dbl>      <dbl>
## 1 AAPL        175.     177.    173.  68650377.       1.67
## 2 ADBE        461.     468.    455.   3273070.       5.29
## 3 ADP         231.     233.    229.   1732816.       1.91
## 4 AMAT        144.     146.    142.   6708705.       2.36
## 5 AMGN        257.     259.    254.   2672433.       2.47
## 6 AMZN        142.     144.    140.  59244184.       1.61

This project requires scaling the data, because features such as stock prices, trading volumes, and volatility come in very different scales and magnitudes: prices are in the hundreds, volumes in the millions, and volatility values are much smaller. Since k-means relies on Euclidean distances, features with larger values would dominate the distance computation, producing biased or unreliable results.

To resolve this, the data was standardized: each feature value x was rescaled to (x - mean) / sd, so that every feature has a mean of zero and a standard deviation of one. This way, all features contribute equally to the analysis, regardless of their original magnitude.

# Standardize the five numeric feature columns (Ticker excluded)
scaled_data <- as.data.frame(scale(summary_data[, 2:6]))

print(head(scaled_data))
##     avg_close   avg_high     avg_low avg_volume volatility
## 1 -0.25494229 -0.2561411 -0.25398323  0.6746946 -0.3890250
## 2  0.41250195  0.4140506  0.41145774 -0.2776870  0.3969999
## 3 -0.12496954 -0.1270110 -0.12299835 -0.3001246 -0.3380186
## 4 -0.32707377 -0.3262992 -0.32799516 -0.2276385 -0.2389286
## 5 -0.06438935 -0.0660462 -0.06278537 -0.2864368 -0.2154432
## 6 -0.33208905 -0.3320819 -0.33221923  0.5376702 -0.4019523

Clustering

Silhouette analysis was performed by running k-means for a range of k values and computing the average silhouette width for each. The average widths were plotted against k to identify the value with the highest score, which represents the most appropriate number of clusters. This process helps automate the selection of k and leads to better clustering results.

library(cluster)
library(factoextra)

# Run k-means for k = 2..max_clusters and record the average silhouette width
silhouette_analysis <- function(scaled_data, max_clusters = 10) {
  avg_silhouette <- numeric(max_clusters)
  dist_matrix <- dist(scaled_data)  # compute distances once, not per iteration
  
  for (k in 2:max_clusters) {
    kmeans_result <- kmeans(scaled_data, centers = k, nstart = 25)
    silhouette_score <- silhouette(kmeans_result$cluster, dist_matrix)
    avg_silhouette[k] <- mean(silhouette_score[, 3])
  }
  
  plot(2:max_clusters, avg_silhouette[2:max_clusters],
       type = "b", pch = 19, frame = FALSE,
       xlab = "Number of clusters k",
       ylab = "Average silhouette width",
       main = "Silhouette Analysis to Estimate Optimal Clusters")
  
  # Restrict the search to k >= 2 (index 1 is an unused placeholder)
  optimal_k <- which.max(avg_silhouette[2:max_clusters]) + 1
  cat("Optimal number of clusters based on silhouette analysis:", optimal_k, "\n")
  return(optimal_k)
}


optimal_k <- silhouette_analysis(scaled_data, max_clusters = 10)

## Optimal number of clusters based on silhouette analysis: 2

Although the calculated optimum is 2 clusters, I will use 3 here, since the average silhouette width for k = 3 is only slightly smaller and a three-group segmentation is more informative.

set.seed(123)  # fix the random initialization for reproducibility
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# Attach the cluster labels to the per-ticker summary table
summary_data$cluster <- as.factor(kmeans_result$cluster)


print(head(summary_data))
## # A tibble: 6 × 7
##   Ticker avg_close avg_high avg_low avg_volume volatility cluster
##   <chr>      <dbl>    <dbl>   <dbl>      <dbl>      <dbl> <fct>  
## 1 AAPL        175.     177.    173.  68650377.       1.67 3      
## 2 ADBE        461.     468.    455.   3273070.       5.29 3      
## 3 ADP         231.     233.    229.   1732816.       1.91 3      
## 4 AMAT        144.     146.    142.   6708705.       2.36 3      
## 5 AMGN        257.     259.    254.   2672433.       2.47 3      
## 6 AMZN        142.     144.    140.  59244184.       1.61 3

To capture the key financial attributes of the companies, the following feature pairs were combined and plotted:

  1. avg_close vs. avg_volume: Reflects market size and trading activity, critical for understanding liquidity and investor interest.

  2. avg_close vs. volatility: Highlights the relationship between price levels and risk, essential for analyzing stability and return profiles.

  3. avg_volume vs. volatility: Links trading activity to market fluctuations, helping to assess liquidity and price stability.

library(ggplot2)

# Scatter plot of two features, colored by cluster and labeled by ticker
scatter_plot_clusters <- function(data, x_feature, y_feature) {
  ggplot(data, aes(x = .data[[x_feature]], y = .data[[y_feature]], color = cluster, label = Ticker)) +
    geom_point(size = 3) +
    geom_text(vjust = -0.5, hjust = 0.5, size = 2.5) +
    theme_minimal() +
    labs(
      title = paste(x_feature, "vs", y_feature),
      x = x_feature,
      y = y_feature
    ) +
    scale_color_discrete(name = "Cluster")
}

plot1 <- scatter_plot_clusters(summary_data, "avg_close", "avg_volume")
plot2 <- scatter_plot_clusters(summary_data, "avg_close", "volatility")
plot3 <- scatter_plot_clusters(summary_data, "avg_volume", "volatility")


library(patchwork)  # for arranging the plots in a grid
combined_plots <- plot1 + plot2 + plot3 + plot_layout(ncol = 2)


print(combined_plots)

From the output obtained, we can observe that the clusters are skewed mainly by two companies:

NVDA: NVIDIA exhibits extraordinarily high average trading volumes, reflecting its status as a heavily traded stock, likely due to its dominance in the semiconductor and AI markets. Its high liquidity and frequent trading activity make it stand out from other companies in the dataset.

BKNG: Booking Holdings is an outlier due to its exceptionally high average closing price and significant volatility. This is characteristic of a high-priced stock in a niche market segment, where fewer shares are traded, but the stock price exhibits large fluctuations, likely due to its business model and sensitivity to global economic conditions.

For this reason, I decided to omit NVDA and BKNG from further analysis: their extreme values exert a disproportionate influence on the clustering results, leading to distorted group assignments. By removing these outliers, the clustering process can focus on identifying meaningful patterns among the majority of companies without interference from extreme data points.

K-means

filtered_data <- summary_data %>%
  filter(!Ticker %in% c("NVDA", "BKNG")) %>%   # drop the two outliers
  filter(if_all(everything(), ~ !is.na(.)))    # keep only complete rows
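The re-clustering code itself is not shown in the original output; the sketch below reconstructs that step under the assumption that the filtered features are rescaled and the silhouette_analysis helper defined earlier is reused (scaled_filtered and kmeans_filtered are names introduced here for illustration):

# Rescale the five numeric features of the filtered data
scaled_filtered <- as.data.frame(scale(filtered_data[, 2:6]))

# Re-run the silhouette analysis on the filtered data (now suggesting k = 3)
optimal_k <- silhouette_analysis(scaled_filtered, max_clusters = 10)

# Fit k-means with 3 clusters and attach the labels
set.seed(123)
kmeans_filtered <- kmeans(scaled_filtered, centers = 3, nstart = 25)
filtered_data$cluster <- as.factor(kmeans_filtered$cluster)
print(head(filtered_data))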

## Optimal number of clusters based on silhouette analysis: 3
## # A tibble: 6 × 7
##   Ticker avg_close avg_high avg_low avg_volume volatility cluster
##   <chr>      <dbl>    <dbl>   <dbl>      <dbl>      <dbl> <fct>  
## 1 AAPL        175.     177.    173.  68650377.       1.67 1      
## 2 ADBE        461.     468.    455.   3273070.       5.29 2      
## 3 ADP         231.     233.    229.   1732816.       1.91 3      
## 4 AMAT        144.     146.    142.   6708705.       2.36 3      
## 5 AMGN        257.     259.    254.   2672433.       2.47 3      
## 6 AMZN        142.     144.    140.  59244184.       1.61 1

The k-means clustering analysis groups stocks into three clusters based on avg_close, avg_volume, and volatility. Here’s a summary of the results:

Cluster 1 (Red): Stocks with very high trading volumes and moderate prices, such as TSLA and AMZN. These stocks are highly volatile and actively traded, making them suitable for short-term trading or momentum strategies.

Cluster 2 (Green): Stocks with high closing prices and significant volatility, like NFLX or COST. These are likely large-cap or growth stocks, appealing to investors willing to take on higher risk for potentially greater rewards.

Cluster 3 (Blue): The majority of stocks, such as META, MSFT, or GOOG, with lower trading volumes, lower closing prices, and minimal volatility. These are stable and less risky, ideal for conservative investors focusing on predictable returns.

4 clusters

The k-means analysis indicated that using 4 clusters resulted in a similar silhouette width, suggesting it could be an appropriate choice for segmenting the data. Based on this, we will now proceed with clustering using 4 clusters to evaluate the resulting group structure and assess its relevance to the dataset.
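The corresponding code is not shown in the original; a minimal sketch, reusing scaled_filtered from the previous step (kmeans_4 is a name introduced here for illustration):

# Fit k-means with 4 clusters on the same scaled, filtered features
set.seed(123)
kmeans_4 <- kmeans(scaled_filtered, centers = 4, nstart = 25)
filtered_data$cluster <- as.factor(kmeans_4$cluster)
print(head(filtered_data))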

## # A tibble: 6 × 7
##   Ticker avg_close avg_high avg_low avg_volume volatility cluster
##   <chr>      <dbl>    <dbl>   <dbl>      <dbl>      <dbl> <fct>  
## 1 AAPL        175.     177.    173.  68650377.       1.67 1      
## 2 ADBE        461.     468.    455.   3273070.       5.29 4      
## 3 ADP         231.     233.    229.   1732816.       1.91 3      
## 4 AMAT        144.     146.    142.   6708705.       2.36 3      
## 5 AMGN        257.     259.    254.   2672433.       2.47 3      
## 6 AMZN        142.     144.    140.  59244184.       1.61 1

The new results are as follows:

Cluster 1 (Blue): Low trading volumes, lower closing prices, and minimal volatility. Stable, low-risk stocks ideal for conservative investors.

Cluster 2 (Red): Very high trading volumes and moderate to high prices (e.g., TSLA, AMZN). Highly volatile, appealing to short-term traders.

Cluster 3 (Purple): Mid-range prices, moderate volumes, and low volatility (e.g., META, MSFT). Balanced stocks for moderate risk-tolerant investors.

Cluster 4 (Green): Highest prices and significant volatility (e.g., REGN). High-risk, high-reward stocks for aggressive investors.

The four-cluster solution adds granularity, especially by isolating high-price, high-volatility stocks (Cluster 4) from Cluster 2. Cluster 1 and Cluster 3 remain consistent, representing low-risk and balanced options, respectively. This refined segmentation helps investors better align their strategies with specific risk and return profiles, offering clearer diversification opportunities.

PAM

Now I will perform Partitioning Around Medoids (PAM). This method will help refine the identification of the optimal number of clusters and provide additional insight into the cluster structures. PAM is particularly useful because it is less sensitive to outliers than k-means and may therefore produce more robust clusters. This analysis will further validate the clustering results obtained earlier.

Here I restrict the analysis to 3 clusters, since the silhouette width for 4 clusters was insufficient.
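The PAM code is not shown in the original; a minimal sketch, assuming the cluster package and the scaled_filtered features from above (pam_result is a name introduced here for illustration):

# Re-run the silhouette analysis, then partition around 3 medoids
optimal_k <- silhouette_analysis(scaled_filtered, max_clusters = 10)

pam_result <- pam(scaled_filtered, k = 3)  # PAM is deterministic for a given dataset
filtered_data$cluster <- as.factor(pam_result$clustering)
print(head(filtered_data))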

## Optimal number of clusters based on silhouette analysis: 3
## # A tibble: 6 × 7
##   Ticker avg_close avg_high avg_low avg_volume volatility cluster
##   <chr>      <dbl>    <dbl>   <dbl>      <dbl>      <dbl> <fct>  
## 1 AAPL        175.     177.    173.  68650377.       1.67 1      
## 2 ADBE        461.     468.    455.   3273070.       5.29 2      
## 3 ADP         231.     233.    229.   1732816.       1.91 3      
## 4 AMAT        144.     146.    142.   6708705.       2.36 3      
## 5 AMGN        257.     259.    254.   2672433.       2.47 3      
## 6 AMZN        142.     144.    140.  59244184.       1.61 1

# Reuse the scatter_plot_clusters helper defined earlier


plot1 <- scatter_plot_clusters(filtered_data, "avg_close", "avg_volume")
plot2 <- scatter_plot_clusters(filtered_data, "avg_close", "volatility")
plot3 <- scatter_plot_clusters(filtered_data, "avg_volume", "volatility")


library(patchwork)  # Ensure the package is loaded for layout
combined_plots <- plot1 + plot2 + plot3 + plot_layout(ncol = 2)


print(combined_plots)

The PAM analysis yields results similar to the three-cluster k-means solution (see the results above), reinforcing the three distinct groups of stocks. However, PAM’s robustness to outliers supports more precise grouping, particularly in separating the high-risk stocks of Cluster 2 (green) from the lower-risk, stable stocks of Cluster 3 (blue). The analysis confirms that Cluster 1 remains a high-liquidity, high-volatility category suitable for short-term trading. PAM thus validates the structure of these clusters, offering a reliable foundation for investment strategies tailored to varying risk and return preferences.
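As a quick check of the similarity claim, the two assignments can be cross-tabulated; a sketch assuming the kmeans_filtered and pam_result objects from the earlier sketches (cluster numbering is arbitrary, so agreement appears as one dominant cell per row):

# Cross-tabulate k-means (k = 3) and PAM cluster assignments
table(kmeans = kmeans_filtered$cluster, pam = pam_result$clustering)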

Summary

Clustering techniques such as k-means and PAM were used to divide stocks into segments based on key metrics: the average closing price, trading volume, and volatility. The analysis shows that some stock classes are highly volatile and therefore riskier to own, while others are less risky to buy because their volatility is lower. Investors can thus refine their strategies by aligning portfolio composition with the level of risk they can accommodate in pursuit of their goals and objectives. Knowing the stock groupings, traders can target specific clusters for momentum trading, growth investing, or conservative diversification. The application of PAM makes the analysis more robust, making the method valuable for data-driven decision making in financial markets.