1. The Business Problem

Investors often categorize stocks by sector (Tech, Healthcare, Energy). However, a sector label does not tell us how risky a stock is.

In this project, I use unsupervised machine learning (K-Means) to automatically cluster stocks based on their behaviour:

  1. Returns (how much money they make)
  2. Volatility (how risky they are)

This allows us to find “Hidden Cousins” - stocks that act the same way, even if they are in different industries.

2. Setting Up the Environment

First, I will load the necessary libraries for financial analysis and machine learning.

# Load Libraries
library(quantmod) # For downloading stock data
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(cluster) # Clustering utilities (kmeans() itself comes from base R's stats package)
# Suppress warnings and messages in later chunks to keep the report clean
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

3. Data Acquisition

I will download 1 year of daily price data for a basket of 50 S&P 500 stocks. Then, I will calculate the daily returns (the log change in adjusted price from the previous trading day) for each one.

library(rvest) # Load the scraping tool

# 1. Go to Wikipedia's S&P 500 page
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# 2. Scrape the table
sp500_table <- url %>%
  read_html() %>%
  html_element("table") %>%
  html_table()

# 3. Extract the 'Symbol' column
all_tickers <- sp500_table$Symbol

# 4. Clean up tickers (Replace dots like BRK.B with dashes BRK-B for Yahoo)
all_tickers <- gsub("\\.", "-", all_tickers)

# 5. Select the first 50 tickers in the table (to save time)
# If you really want all 500, change this to: tickers <- all_tickers
tickers <- head(all_tickers, 50)

# --- Resume Standard Download ---
stock_env <- new.env()
getSymbols(tickers, src = "yahoo", from = Sys.Date() - 365, env = stock_env)
##  [1] "MMM"   "AOS"   "ABT"   "ABBV"  "ACN"   "ADBE"  "AMD"   "AES"   "AFL"  
## [10] "A"     "APD"   "ABNB"  "AKAM"  "ALB"   "ARE"   "ALGN"  "ALLE"  "LNT"  
## [19] "ALL"   "GOOGL" "GOOG"  "MO"    "AMZN"  "AMCR"  "AEE"   "AEP"   "AXP"  
## [28] "AIG"   "AMT"   "AWK"   "AMP"   "AME"   "AMGN"  "APH"   "ADI"   "AON"  
## [37] "APA"   "APO"   "AAPL"  "AMAT"  "APP"   "APTV"  "ACGL"  "ADM"   "ARES" 
## [46] "ANET"  "AJG"   "AIZ"   "T"     "ATO"
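# Extract adjusted closes in the same order as `tickers`.
# (Iterating over the environment directly gives no guaranteed order,
# so the column names could end up on the wrong series.)
price_list <- lapply(mget(tickers, envir = stock_env), Ad)
prices <- do.call(merge, price_list)
colnames(prices) <- tickers
returns <- na.omit(ROC(prices)) # Daily log returns (ROC's default type)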
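
As a rough illustration of what the return calculation does, here is a minimal sketch on made-up prices. The `toy_prices` vector is hypothetical and not part of the downloaded data. It compares the simple percentage return with the log return, which is what `ROC()` computes by default.

```r
# Hypothetical closing prices for a single stock (illustrative only)
toy_prices <- c(100, 102, 101, 105)

# Simple (percentage) return: (P_t - P_{t-1}) / P_{t-1}
simple_ret <- diff(toy_prices) / head(toy_prices, -1)

# Log return: log(P_t / P_{t-1}), the default behaviour of ROC()
log_ret <- diff(log(toy_prices))

# Side-by-side comparison; the two are close for small daily moves
round(cbind(simple_ret, log_ret), 4)
```

For small daily moves the two measures are nearly identical, so the choice mainly matters for large price swings.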

4. Feature Engineering

The raw data shows us daily movements. To cluster the stocks, I need to summarize their behaviour into annual metrics.

For each stock, I will calculate:

  1. Annualized Return: The average daily return multiplied by 252 (the approximate number of trading days in a year).
  2. Annualized Volatility: The standard deviation of the daily returns, scaled to a year by multiplying by the square root of 252.

# Calculate Average Return (Reward) and Standard Deviation (Risk) for each column. 
# 'apply' allows us to run a function on every column at once.

metrics <- data.frame(
  Ticker = colnames(returns),
  Returns = apply(returns, 2, mean) * 252,        # Annualize the mean
  Volatility = apply(returns, 2, sd) * sqrt(252)  # Annualize the risk
)
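
To make the annualization concrete, here is a small worked sketch on a hypothetical vector of four daily returns (`toy_returns` is invented for illustration). The mean is scaled by 252, while the standard deviation is scaled by the square root of 252. That square-root rule rests on the assumption that daily returns are roughly independent, so that variance (not standard deviation) grows linearly with time.

```r
# Hypothetical daily returns (illustrative only, not from the report's data)
toy_returns <- c(0.01, -0.01, 0.02, 0.00)

# Mean daily return is 0.005, so the annualized figure is 0.005 * 252 = 1.26
ann_return <- mean(toy_returns) * 252

# Standard deviation scales with the square root of the number of periods
ann_vol <- sd(toy_returns) * sqrt(252)

c(Returns = ann_return, Volatility = ann_vol)
```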

5. K-Means Clustering

Now I will feed this data into the K-Means algorithm. I am choosing k=4 to identify four distinct “Risk Regimes”:

  1. Defensive: Low Risk / Steady Return (Safe Havens)
  2. Market Baseline: Average Risk / Average Return (The Standard)
  3. Aggressive Growth: High Risk / High Return (Tech Leaders)
  4. Speculative: Extreme Risk / Unpredictable Return (Moonshots)

Note: I am scaling the data first so that “Volatility” and “Returns” are treated equally by the algorithm.
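
As a quick sanity check on what scaling does, the sketch below standardizes a hypothetical two-column table (`toy` is made up for illustration). After `scale()`, each column has mean 0 and standard deviation 1, so neither feature dominates the Euclidean distances that K-Means relies on.

```r
# Hypothetical metrics for four stocks (illustrative only)
toy <- data.frame(
  Returns    = c(0.05, 0.10, 0.40, 0.45),
  Volatility = c(0.15, 0.60, 0.20, 0.65)
)

# Standardize: subtract each column's mean, divide by its standard deviation
scaled_toy <- scale(toy)

round(colMeans(scaled_toy), 10)  # both columns are centred on 0
apply(scaled_toy, 2, sd)         # both columns have standard deviation 1
```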

# 1. Scale the data (standardize it), dropping the "Ticker" text column first
scaled_data <- scale(metrics[, -1])

# 2. Set Seed (So we get the exact same result every time)
set.seed(123)

# 3. Run K-Means (ask for 4 groups)
k_result <- kmeans(scaled_data, centers = 4, nstart = 25)

# 4. Add the "Cluster ID" back to our original list
metrics$Cluster <- as.factor(k_result$cluster)

# 5. Show the result sorted by Cluster
metrics[order(metrics$Cluster), ]
##       Ticker      Returns Volatility Cluster
## AOS      AOS  0.261336929  0.8231600       1
## ALGN    ALGN  0.781705181  0.6291736       1
## LNT      LNT  0.242861044  0.5340836       1
## AMP      AMP  0.260578614  0.5540901       1
## APA      APA  0.737131569  0.5927432       1
## MMM      MMM -0.006003376  0.2889951       2
## ABBV    ABBV  0.001101602  0.2388588       2
## ACN      ACN  0.139531909  0.1688556       2
## ADBE    ADBE  0.103127487  0.1641335       2
## AES      AES  0.051341230  0.2331123       2
## AFL      AFL  0.031465199  0.2402862       2
## APD      APD  0.172717555  0.1660235       2
## ABNB    ABNB  0.231491175  0.2125030       2
## AKAM    AKAM  0.223828038  0.1865103       2
## ARE      ARE -0.051258314  0.2689477       2
## ALL      ALL -0.144574625  0.2401300       2
## GOOG    GOOG  0.194770541  0.2619450       2
## MO        MO  0.045772239  0.2592380       2
## AMCR    AMCR  0.061425173  0.2082609       2
## AEE      AEE  0.135414804  0.3143677       2
## AEP      AEP  0.055967983  0.2568779       2
## AIG      AIG -0.066292469  0.2208790       2
## AMT      AMT  0.202930654  0.2811342       2
## AWK      AWK  0.118694432  0.3174021       2
## AME      AME  0.113452387  0.2518348       2
## APH      APH  0.122042028  0.2272452       2
## APO      APO  0.261475164  0.2493968       2
## AAPL    AAPL  0.205065147  0.2276486       2
## ARES    ARES  0.127192549  0.2626258       2
## AJG      AJG  0.039758533  0.3029210       2
## T          T  0.007982318  0.3467367       2
## ABT      ABT -0.003202345  0.3695576       3
## A          A -0.283890180  0.5914822       3
## ALB      ALB -0.097887855  0.3190891       3
## GOOGL  GOOGL -0.492058257  0.4157049       3
## AXP      AXP -0.373717585  0.2983477       3
## AMGN    AMGN -0.248031880  0.4241436       3
## AON      AON -0.191583022  0.2837455       3
## AMAT    AMAT -0.407412511  0.3207242       3
## APTV    APTV -0.208164687  0.2576847       3
## ANET    ANET -0.008816242  0.4422597       3
## ATO      ATO -0.205320611  0.4211883       3
## AMD      AMD  0.224496867  0.3717862       4
## ALLE    ALLE  0.373342202  0.4814943       4
## AMZN    AMZN  0.528801448  0.3173122       4
## ADI      ADI  0.606457023  0.4661358       4
## APP      APP  0.523296922  0.3133472       4
## ACGL    ACGL  0.736596490  0.3716450       4
## ADM      ADM  0.436783601  0.3885792       4
## AIZ      AIZ  0.339201778  0.2919665       4
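
The choice of k=4 here is a judgment call. One common way to check it is the "elbow" heuristic: run K-Means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses simulated data (`sim` is a stand-in, not the report's `scaled_data`), so it only illustrates the mechanics.

```r
set.seed(42)

# Simulated stand-in for scaled (Returns, Volatility) pairs:
# three well-separated groups of 20 points each (not the report's data)
sim <- scale(rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = -3, sd = 0.3), ncol = 2)
))

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(sim, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding more clusters stops helping much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Swapping `sim` for `scaled_data` would apply the same check to the stocks in this report.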

6. Visualization

Finally, I will plot the clusters:

  • X-axis: Volatility (Risk)
  • Y-axis: Returns (Reward)
  • Colour: Cluster group

library(ggplot2)

ggplot(metrics, aes(x = Volatility, y = Returns, color = Cluster)) +
  geom_point(size = 4, alpha = 0.8) +    # Added transparency (alpha) to look nicer
  geom_text(aes(label = Ticker), vjust = -1, check_overlap = TRUE, show.legend = FALSE) + 
  theme_minimal() +
  labs(title = "Risk vs. Reward Clustering", 
       subtitle = "K-Means discovered these groups automatically",
       x = "Annualized Risk (Volatility)",
       y = "Annualized Reward (Returns)")

7. Conclusions & Recommendations

Based on the K-Means clustering analysis with k=4, we can categorize the market into four distinct “Risk Regimes”:

  1. The “Defensive” Cluster: Low-volatility stocks that act as safe havens (KO and JNJ are typical examples, though neither is in this 50-stock sample).
    • Strategy: Use for capital preservation.
  2. The “Market Baseline” Cluster: Stocks that represent average market behavior, comparable to an index fund such as SPY.
    • Strategy: Core portfolio holdings.
  3. The “Aggressive Growth” Cluster: High-volatility stocks with high returns (NVDA and TSLA are typical examples, though neither is in this sample).
    • Strategy: Growth drivers; position size carefully.
  4. The “Speculative” Cluster: Extreme-volatility outliers (like AOS).
    • Strategy: High-risk “venture style” bets.

Next Steps

To improve this model further, I would:

  • Add more years of data to see if these clusters hold up during a recession.
  • Include “Beta” (sensitivity to market movements) as a third dimension for clustering.
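
As a starting point for the Beta idea, here is a minimal sketch of how beta could be estimated from daily returns. Both series are simulated: `market` stands in for an index return series and `stock` is generated with a true beta of 1.5, so none of this uses the report's data.

```r
set.seed(1)

# Simulated daily returns; 'market' is a stand-in for an index series
market <- rnorm(250, mean = 0.0004, sd = 0.01)
stock  <- 1.5 * market + rnorm(250, mean = 0, sd = 0.005)

# Beta = Cov(stock, market) / Var(market)
beta <- cov(stock, market) / var(market)

# Equivalent view: the slope of a regression of stock on market returns
beta_lm <- unname(coef(lm(stock ~ market))[2])

c(beta = beta, beta_lm = beta_lm)
```

In the real pipeline, one option would be to compute this for every column of `returns` against an index return series and add the result to `metrics` before scaling.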