Investors often categorize stocks by sector (Tech, Healthcare, Energy). However, a sector label alone does not tell us how risky a stock is.
In this project, I use unsupervised machine learning (K-Means) to cluster stocks automatically based on two aspects of their behaviour:
1. Returns (how much money they make)
2. Volatility (how risky they are)
This allows us to find “Hidden Cousins” - stocks that act the same way, even if they are in different industries.
First, I will load the necessary libraries for financial analysis and machine learning.
# Load Libraries
library(quantmod) # For downloading stock data
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(cluster) # Clustering utilities; note that kmeans() itself ships with base R (stats package)
# Suppress warning to keep the report clean
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
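I will download one year of daily price data for a basket of S&P 500 stocks (the first 50 tickers in Wikipedia's constituent table). Then I will calculate the daily return for each one. Note that ROC() computes log returns by default, which are very close to the percentage change from the previous day when daily moves are small.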
library(rvest) # Load the scraping tool
# 1. Go to Wikipedia's S&P 500 page
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
# 2. Scrape the table
sp500_table <- url %>%
  read_html() %>%
  html_element("table") %>%
  html_table()
# 3. Extract the 'Symbol' column
all_tickers <- sp500_table$Symbol
# 4. Clean up tickers (Replace dots like BRK.B with dashes BRK-B for Yahoo)
all_tickers <- gsub("\\.", "-", all_tickers)
# 5. Keep the first 50 tickers as listed in the table (to save time).
# These are not the 50 largest companies, just the first 50 rows.
# If you really want all 500, change this to: tickers <- all_tickers
tickers <- head(all_tickers, 50)
# --- Resume Standard Download ---
stock_env <- new.env()
getSymbols(tickers, src = "yahoo", from = Sys.Date() - 365, env = stock_env)
## [1] "MMM" "AOS" "ABT" "ABBV" "ACN" "ADBE" "AMD" "AES" "AFL"
## [10] "A" "APD" "ABNB" "AKAM" "ALB" "ARE" "ALGN" "ALLE" "LNT"
## [19] "ALL" "GOOGL" "GOOG" "MO" "AMZN" "AMCR" "AEE" "AEP" "AXP"
## [28] "AIG" "AMT" "AWK" "AMP" "AME" "AMGN" "APH" "ADI" "AON"
## [37] "APA" "APO" "AAPL" "AMAT" "APP" "APTV" "ACGL" "ADM" "ARES"
## [46] "ANET" "AJG" "AIZ" "T" "ATO"
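# Pull adjusted prices in the same order as 'tickers'. Iterating over the
# environment directly (lapply(stock_env, Ad)) gives no guaranteed order, so
# assigning colnames(prices) <- tickers afterwards can mislabel the columns.
price_list <- lapply(tickers, function(tk) Ad(get(tk, envir = stock_env)))
prices <- do.call(merge, price_list)
colnames(prices) <- tickers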
returns <- na.omit(ROC(prices))
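To make the difference between log returns and simple percentage returns concrete, here is a minimal base-R sketch. The prices in p are hypothetical values chosen for illustration, not data from this project.

```r
# Hypothetical closing prices for five consecutive days (illustrative only)
p <- c(100, 102, 101, 105, 104)

# Simple (discrete) returns: percentage change from the previous day
simple_ret <- diff(p) / head(p, -1)

# Log (continuous) returns: difference of log prices
log_ret <- diff(log(p))

# Side by side: the two are close when daily moves are small
round(cbind(simple_ret, log_ret), 4)

# Log returns add up across days, while simple returns compound
all.equal(sum(log_ret), log(tail(p, 1) / p[1]))
all.equal(prod(1 + simple_ret) - 1, tail(p, 1) / p[1] - 1)
```

If simple percentage changes are preferred, ROC() accepts type = "discrete". The results would then differ slightly from the ones printed in this report.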
The raw data shows us daily movements. To cluster the stocks, I need to summarize their behaviour into annual metrics.
For each stock, I will calculate:
1. Annualized Return: the average daily return multiplied by 252 (trading days in a year).
2. Annualized Volatility: the standard deviation (wildness) of the daily returns, scaled to a year by multiplying by the square root of 252.
# Calculate Average Return (Reward) and Standard Deviation (Risk) for each column.
# 'apply' allows us to run a function on every column at once.
metrics <- data.frame(
  Ticker = colnames(returns),
  Returns = apply(returns, 2, mean) * 252,        # Annualize the mean
  Volatility = apply(returns, 2, sd) * sqrt(252)  # Annualize the risk
)
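The two scaling factors differ because, if daily returns are assumed to be independent, their means and variances both add up over 252 days, so the standard deviation grows with the square root of time. A small sketch of that reasoning on simulated data (the series daily is made up for illustration, not market data):

```r
set.seed(1)  # reproducible simulated data, not real market returns

# 252 simulated daily returns with an assumed mean and standard deviation
daily <- rnorm(252, mean = 0.0005, sd = 0.01)

# Mean scales linearly with the number of trading days
ann_return <- mean(daily) * 252

# Variance scales linearly, so standard deviation scales with sqrt(252)
ann_vol <- sd(daily) * sqrt(252)

# Equivalent check: annualized variance equals daily variance times 252
all.equal(ann_vol^2, var(daily) * 252)

c(ann_return = ann_return, ann_vol = ann_vol)
```

The independence assumption is a simplification. Real returns can be autocorrelated, in which case the square-root rule is only an approximation.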
Now I will feed this data into the K-Means algorithm. I am choosing k=4 to identify four distinct “Risk Regimes”.
Note: I am scaling the data first so that “Volatility” and “Returns” are treated equally by the algorithm.
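As a quick illustration of what scaling does, the sketch below standardizes a small made-up data frame (toy is hypothetical, not the metrics table from this report). Before scaling, the column with the larger spread dominates the Euclidean distances that K-Means relies on. After scaling, both columns have mean 0 and standard deviation 1.

```r
# Hypothetical metrics for four stocks (illustrative values only)
toy <- data.frame(
  Returns    = c(0.05, 0.10, 0.60, 0.70),
  Volatility = c(0.20, 0.22, 0.21, 0.23)
)

# Raw distances are driven almost entirely by Returns
dist(toy)

# Standardize each column: subtract its mean, divide by its standard deviation
scaled_toy <- scale(toy)

# Each column now has mean 0 and standard deviation 1
round(colMeans(scaled_toy), 10)
apply(scaled_toy, 2, sd)

# Distances now weigh both columns comparably
dist(scaled_toy)
```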
# 1. Scale the data (Standardize it)
# Remove the "Ticker" text column before doing the math
scaled_data <- scale(metrics[, -1])
# 2. Set Seed (So we get the exact same result every time)
set.seed(123)
# 3. Run K-Means (Ask for 4 Groups)
k_result <- kmeans(scaled_data, centers = 4, nstart = 25)
# 4. Add the "Cluster ID" back to our original list
metrics$Cluster <- as.factor(k_result$cluster)
# 5. Show the result sorted by Cluster
metrics[order(metrics$Cluster), ]
## Ticker Returns Volatility Cluster
## AOS AOS 0.261336929 0.8231600 1
## ALGN ALGN 0.781705181 0.6291736 1
## LNT LNT 0.242861044 0.5340836 1
## AMP AMP 0.260578614 0.5540901 1
## APA APA 0.737131569 0.5927432 1
## MMM MMM -0.006003376 0.2889951 2
## ABBV ABBV 0.001101602 0.2388588 2
## ACN ACN 0.139531909 0.1688556 2
## ADBE ADBE 0.103127487 0.1641335 2
## AES AES 0.051341230 0.2331123 2
## AFL AFL 0.031465199 0.2402862 2
## APD APD 0.172717555 0.1660235 2
## ABNB ABNB 0.231491175 0.2125030 2
## AKAM AKAM 0.223828038 0.1865103 2
## ARE ARE -0.051258314 0.2689477 2
## ALL ALL -0.144574625 0.2401300 2
## GOOG GOOG 0.194770541 0.2619450 2
## MO MO 0.045772239 0.2592380 2
## AMCR AMCR 0.061425173 0.2082609 2
## AEE AEE 0.135414804 0.3143677 2
## AEP AEP 0.055967983 0.2568779 2
## AIG AIG -0.066292469 0.2208790 2
## AMT AMT 0.202930654 0.2811342 2
## AWK AWK 0.118694432 0.3174021 2
## AME AME 0.113452387 0.2518348 2
## APH APH 0.122042028 0.2272452 2
## APO APO 0.261475164 0.2493968 2
## AAPL AAPL 0.205065147 0.2276486 2
## ARES ARES 0.127192549 0.2626258 2
## AJG AJG 0.039758533 0.3029210 2
## T T 0.007982318 0.3467367 2
## ABT ABT -0.003202345 0.3695576 3
## A A -0.283890180 0.5914822 3
## ALB ALB -0.097887855 0.3190891 3
## GOOGL GOOGL -0.492058257 0.4157049 3
## AXP AXP -0.373717585 0.2983477 3
## AMGN AMGN -0.248031880 0.4241436 3
## AON AON -0.191583022 0.2837455 3
## AMAT AMAT -0.407412511 0.3207242 3
## APTV APTV -0.208164687 0.2576847 3
## ANET ANET -0.008816242 0.4422597 3
## ATO ATO -0.205320611 0.4211883 3
## AMD AMD 0.224496867 0.3717862 4
## ALLE ALLE 0.373342202 0.4814943 4
## AMZN AMZN 0.528801448 0.3173122 4
## ADI ADI 0.606457023 0.4661358 4
## APP APP 0.523296922 0.3133472 4
## ACGL ACGL 0.736596490 0.3716450 4
## ADM ADM 0.436783601 0.3885792 4
## AIZ AIZ 0.339201778 0.2919665 4
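The choice of k=4 is a judgment call. One common way to sanity-check it is the average silhouette width, which the cluster package provides through silhouette(). The sketch below is not part of the original analysis: it runs on simulated data (toy_scaled is a stand-in for scaled_data), and the range 2 to 6 for k is an arbitrary assumption.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Simulated stand-in for scaled_data: 50 rows, 2 columns (not real stock metrics)
toy_scaled <- scale(matrix(rnorm(100), ncol = 2,
                           dimnames = list(NULL, c("Returns", "Volatility"))))

# Pairwise Euclidean distances, computed once and reused for every k
d <- dist(toy_scaled)

# Average silhouette width for each candidate k (higher means tighter, better-separated clusters)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

avg_sil
```

Replacing toy_scaled with scaled_data would apply the same check to the real metrics. A k with a clearly higher average silhouette width than 4 would be a reason to revisit the choice.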
Finally, I will plot the clusters:
- X-Axis: Volatility (Risk)
- Y-Axis: Returns (Reward)
- Colour: Cluster Group
library(ggplot2)
ggplot(metrics, aes(x = Volatility, y = Returns, color = Cluster)) +
  geom_point(size = 4, alpha = 0.8) + # Added transparency (alpha) to look nicer
  geom_text(aes(label = Ticker), vjust = -1, check_overlap = TRUE, show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Risk vs. Reward Clustering",
       subtitle = "K-Means discovered these groups automatically",
       x = "Annualized Risk (Volatility)",
       y = "Annualized Reward (Returns)")
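## 7. Conclusions & Recommendations
Based on the K-Means clustering analysis with k=4, we can categorize these 50 stocks into four distinct “Risk Regimes”. In the table above:
* Cluster 1 (5 stocks): high volatility (about 0.53 to 0.82) with positive returns.
* Cluster 2 (26 stocks): low volatility (about 0.16 to 0.35) with modest returns (about -0.14 to 0.26).
* Cluster 3 (11 stocks): negative returns throughout (about -0.49 to 0.00) with moderate to high volatility.
* Cluster 4 (8 stocks): strong returns (about 0.22 to 0.74) with moderate volatility (about 0.29 to 0.48).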
To improve this model further, I would:
* Add more years of data to see if these clusters hold up during a recession.
* Include “Beta” (market correlation) as a third dimension for clustering.
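As a rough idea of how Beta could be computed, the sketch below uses the standard definition: the covariance between a stock's returns and the market's returns, divided by the variance of the market's returns. Both series are simulated (market and stock are hypothetical, with a built-in beta of 1.5). In practice the market series would need to come from an index, which this report does not download.

```r
set.seed(42)  # reproducible simulated data, not real market returns

# Simulated daily market returns for one trading year
market <- rnorm(252, mean = 0.0004, sd = 0.01)

# Simulated stock that moves 1.5x the market plus its own noise
stock <- 1.5 * market + rnorm(252, mean = 0, sd = 0.005)

# Beta from the definition: cov(stock, market) / var(market)
beta_cov <- cov(stock, market) / var(market)

# Same quantity as the slope of a linear regression of stock on market
beta_lm <- unname(coef(lm(stock ~ market))["market"])

all.equal(beta_cov, beta_lm)
beta_cov
```

A Beta column computed this way for each ticker could be added to metrics and scaled together with Returns and Volatility before running kmeans().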