Set up

library(tidyverse)
library(tidyquant) # for financial analysis
library(broom) # for tidy model results
library(umap)  # for dimension reduction
library(plotly) # for interactive visualization

Data

# Get info on companies listed in S&P500
sp500_index_tbl <- tq_index("SP500")

# Get individual stocks from S&P500
sp500_symbols <- sp500_index_tbl %>% distinct(symbol) %>% pull() 

# Get stock prices of the companies
sp500_prices_tbl <- tq_get(sp500_symbols, from = "2020-04-01")

write.csv(sp500_index_tbl, "~/Desktop/PSU_DAT3100/09_module11/Apply9/sp500_index_tbl.csv")
write.csv(sp500_prices_tbl, "~/Desktop/PSU_DAT3100/09_module11/Apply9/sp500_prices_tbl.csv")

Import data

sp500_index_tbl <- read_csv("~/Desktop/PSU_DAT3100/09_module11/Apply9/sp500_index_tbl.csv")
sp500_prices_tbl <- read_csv("~/Desktop/PSU_DAT3100/09_module11/Apply9/sp500_prices_tbl.csv")
sp500_index_tbl %>% glimpse()

## Rows: 504
## Columns: 9
## $ ...1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ symbol         <chr> "NVDA", "AAPL", "MSFT", "AMZN", "META", "GOOGL", "BRK-B…
## $ company        <chr> "NVIDIA CORP", "APPLE INC", "MICROSOFT CORP", "AMAZON.C…
## $ identifier     <chr> "67066G104", "037833100", "594918104", "023135106", "30…
## $ sedol          <chr> "2379504", "2046251", "2588173", "2000019", "B7TL820", …
## $ weight         <dbl> 0.071401680, 0.067499881, 0.062256061, 0.038562107, 0.0…
## $ sector         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ shares_held    <dbl> 303721856, 187716582, 91770926, 115334832, 26974529, 72…
## $ local_currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD",…

sp500_prices_tbl %>% glimpse()

## Rows: 575,668
## Columns: 9
## $ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ symbol   <chr> "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA…
## $ date     <date> 2020-04-01, 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, …
## $ open     <dbl> 6.39125, 6.10600, 6.34900, 6.38100, 6.93250, 6.58525, 6.80000…
## $ high     <dbl> 6.53825, 6.40000, 6.39075, 6.74700, 6.95625, 6.69875, 6.82300…
## $ low      <dbl> 6.03200, 6.05775, 5.95975, 6.32325, 6.43250, 6.51500, 6.51050…
## $ close    <dbl> 6.07675, 6.38675, 6.09775, 6.71000, 6.47575, 6.67375, 6.57375…
## $ volume   <dbl> 656912000, 675764000, 663212000, 727884000, 784520000, 542444…
## $ adjusted <dbl> 6.055418, 6.364329, 6.076344, 6.686444, 6.453017, 6.650322, 6…

Question

Which stock prices behave similarly?

Our main objective is to identify stocks that exhibit similar price behaviors over time. By doing so, we aim to gain insights into the relationships between different companies, uncovering potential competitors and sector affiliations.

Why It Matters Understanding which companies are related is crucial for various reasons:

Identifying Relationships: Clustering similar stocks reveals underlying connections and dependencies in the market.
Competitor Analysis: Discovering competitors and sector affiliations aids in strategic planning and market positioning.
Market Dynamics: Analyzing stock price similarities provides valuable insights into market dynamics, benefiting diverse fields such as finance, sales, and marketing.

Assignment Details Your task is to analyze the historical price data of various stocks and determine which stocks behave similarly. We will employ clustering techniques to accomplish this task effectively.

1 Convert data to standardized form

To compare data effectively, it must be standardized or normalized. Why? Because comparing values (like stock prices) of vastly different magnitudes is impractical. So, we’ll standardize by converting from adjusted stock price (in dollars) to daily returns (as percent change from the previous day). Here’s the formula:

\[ return_{daily} = \frac{price_{i}-price_{i-1}}{price_{i-1}} \]

sp500_prices_tbl %>% glimpse()

## Rows: 575,668
## Columns: 9
## $ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ symbol   <chr> "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA…
## $ date     <date> 2020-04-01, 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, …
## $ open     <dbl> 6.39125, 6.10600, 6.34900, 6.38100, 6.93250, 6.58525, 6.80000…
## $ high     <dbl> 6.53825, 6.40000, 6.39075, 6.74700, 6.95625, 6.69875, 6.82300…
## $ low      <dbl> 6.03200, 6.05775, 5.95975, 6.32325, 6.43250, 6.51500, 6.51050…
## $ close    <dbl> 6.07675, 6.38675, 6.09775, 6.71000, 6.47575, 6.67375, 6.57375…
## $ volume   <dbl> 656912000, 675764000, 663212000, 727884000, 784520000, 542444…
## $ adjusted <dbl> 6.055418, 6.364329, 6.076344, 6.686444, 6.453017, 6.650322, 6…

# Apply your data transformation skills!
sp_500_daily_returns_tbl <- sp500_prices_tbl %>%
    
    select(symbol, date, adjusted) %>%
    
    filter(date >= ymd("2018-01-01")) %>%
    
    group_by(symbol) %>%
    mutate(lag_1 = lag(adjusted)) %>%
    ungroup() %>%
    
    filter(!is.na(lag_1)) %>%
    
    mutate(pct_return = (adjusted - lag_1) / lag_1) %>%
    select(symbol, date, pct_return)

sp_500_daily_returns_tbl

## # A tibble: 575,165 × 3
##    symbol date       pct_return
##    <chr>  <date>          <dbl>
##  1 NVDA   2020-04-02     0.0510
##  2 NVDA   2020-04-03    -0.0452
##  3 NVDA   2020-04-06     0.100 
##  4 NVDA   2020-04-07    -0.0349
##  5 NVDA   2020-04-08     0.0306
##  6 NVDA   2020-04-09    -0.0150
##  7 NVDA   2020-04-13     0.0262
##  8 NVDA   2020-04-14     0.0523
##  9 NVDA   2020-04-15    -0.0110
## 10 NVDA   2020-04-16     0.0494
## # ℹ 575,155 more rows

2 Spread to object-characteristics format

We’ll convert the daily returns (percentage change from one day to the next) to object-characteristics format, also known as the user-item format. Users are identified by the symbol (company), and items are represented by the pct_return at each date.

stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
    spread(key = date, value = pct_return, fill = 0)

stock_date_matrix_tbl

## # A tibble: 503 × 1,159
##    symbol `2020-04-02` `2020-04-03` `2020-04-06` `2020-04-07` `2020-04-08`
##    <chr>         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1 A           0.0489     -0.0259         0.0560     -0.00444       0.0359
##  2 AAPL        0.0167     -0.0144         0.0872     -0.0116        0.0256
##  3 ABBV        0.0233     -0.0234         0.0322     -0.00449       0.0420
##  4 ABNB        0           0              0           0             0     
##  5 ABT         0.0375      0.000126       0.0413     -0.00967       0.0369
##  6 ACGL        0.0115     -0.0650         0.0983      0.0314        0.0291
##  7 ACN         0.0103     -0.0264         0.0914     -0.0116        0.0464
##  8 ADBE        0.00913    -0.0341         0.0869     -0.0320        0.0267
##  9 ADI         0.0429     -0.0130         0.107       0.00470       0.0535
## 10 ADM         0.0136      0.00932        0.0326      0.00559       0.0139
## # ℹ 493 more rows
## # ℹ 1,153 more variables: `2020-04-09` <dbl>, `2020-04-13` <dbl>,
## #   `2020-04-14` <dbl>, `2020-04-15` <dbl>, `2020-04-16` <dbl>,
## #   `2020-04-17` <dbl>, `2020-04-20` <dbl>, `2020-04-21` <dbl>,
## #   `2020-04-22` <dbl>, `2020-04-23` <dbl>, `2020-04-24` <dbl>,
## #   `2020-04-27` <dbl>, `2020-04-28` <dbl>, `2020-04-29` <dbl>,
## #   `2020-04-30` <dbl>, `2020-05-01` <dbl>, `2020-05-04` <dbl>, …

3 Perform k-means clustering

stock_date_matrix_cluster <- kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = 3, nstart = 20)

# Summary of the clustering results
summary(stock_date_matrix_cluster)

##              Length Class  Mode   
## cluster       503   -none- numeric
## centers      3474   -none- numeric
## totss           1   -none- numeric
## withinss        3   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            3   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

# Create a tibble with cluster assignments
cluster_assignments <- as_tibble(stock_date_matrix_cluster$cluster) %>%
    rename(.cluster = value) %>%
    mutate(symbol = stock_date_matrix_tbl$symbol)

# Combine the cluster assignments with the original daily returns data
augmented_data <- sp_500_daily_returns_tbl %>%
    left_join(cluster_assignments, by = "symbol")

# Ensure the augmented data contains the necessary columns
glimpse(augmented_data)

## Rows: 575,165
## Columns: 4
## $ symbol     <chr> "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NV…
## $ date       <date> 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, 2020-04-08…
## $ pct_return <dbl> 0.051014042, -0.045249838, 0.100405736, -0.034910493, 0.030…
## $ .cluster   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…

ggplot(augmented_data, aes(x = date, y = pct_return, color = as.factor(.cluster))) +
    geom_point() +
    labs(title = "Daily Percentage Returns by Cluster",
         x = "Date",
         y = "Daily Percentage Return",
         color = "Cluster") +
    theme_minimal()

4 Select optimal number of clusters

kclusts <- tibble(k = 1:9) %>%
    mutate(kclust = map(.x = k, .f = ~ kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = .x, nstart = 20)),
           glanced = map(.x = kclust, .f = glance))

kclusts %>%
    unnest(glanced) %>%
    ggplot(aes(k, tot.withinss)) +
    geom_point() +
    geom_line()

final_cluster <- kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = 5, nstart = 20)
cluster_assignments <- tibble(symbol = stock_date_matrix_tbl$symbol, .cluster = final_cluster$cluster)

augmented_data2 <- sp_500_daily_returns_tbl %>%
    left_join(cluster_assignments, by = "symbol")

# Check the structure of the joined data
glimpse(augmented_data2)

## Rows: 575,165
## Columns: 4
## $ symbol     <chr> "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NV…
## $ date       <date> 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, 2020-04-08…
## $ pct_return <dbl> 0.051014042, -0.045249838, 0.100405736, -0.034910493, 0.030…
## $ .cluster   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…

# Plot the results
ggplot(augmented_data2, aes(x = date, y = pct_return, color = as.factor(.cluster))) +
    geom_point() +
    labs(title = "Daily Percentage Returns by Cluster",
         x = "Date",
         y = "Daily Percentage Return",
         color = "Cluster") +
    theme_minimal()

5 Reduce dimension using UMAP

umap_results <- stock_date_matrix_tbl %>%
    select(-symbol) %>%
    umap()

umap_results_tbl <- umap_results$layout %>%
    as.tibble() %>%
    bind_cols(stock_date_matrix_tbl %>% select(symbol))

umap_results_tbl

## # A tibble: 503 × 3
##           V1     V2 symbol
##        <dbl>  <dbl> <chr> 
##  1 -2.71     -0.968 A     
##  2 -1.96     -2.33  AAPL  
##  3 -1.74      1.31  ABBV  
##  4 -0.000340 -2.47  ABNB  
##  5 -2.18      0.108 ABT   
##  6  2.32     -0.210 ACGL  
##  7 -1.30     -1.51  ACN   
##  8 -1.86     -2.54  ADBE  
##  9 -0.495    -3.31  ADI   
## 10  2.03      0.335 ADM   
## # ℹ 493 more rows

umap_results_tbl %>%
    ggplot(aes(V1, V2)) +
    geom_point()

Visualize clusters by adding k-means results

glimpse(umap_results_tbl)

## Rows: 503
## Columns: 3
## $ V1     <dbl> -2.7096745292, -1.9585108849, -1.7352117114, -0.0003399784, -2.…
## $ V2     <dbl> -0.9682855, -2.3278545, 1.3050561, -2.4666494, 0.1078963, -0.20…
## $ symbol <chr> "A", "AAPL", "ABBV", "ABNB", "ABT", "ACGL", "ACN", "ADBE", "ADI…

glimpse(sp_500_daily_returns_tbl)

## Rows: 575,165
## Columns: 3
## $ symbol     <chr> "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NVDA", "NV…
## $ date       <date> 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, 2020-04-08…
## $ pct_return <dbl> 0.051014042, -0.045249838, 0.100405736, -0.034910493, 0.030…

pivoted_employment_data <- sp_500_daily_returns_tbl %>%
    select(-symbol) %>%
    pivot_wider(names_from = date, values_from = pct_return) %>%
    janitor::clean_names()  

pivoted_employment_data <- pivoted_employment_data %>%
    mutate(symbol = sp_500_daily_returns_tbl$symbol[1])  

kmeans_umap_tbl <- final_cluster %>%
    augment(stock_date_matrix_tbl) %>% 
    select(symbol, .cluster) %>% 
    left_join(umap_results_tbl, by = "symbol") %>%  
    left_join(pivoted_employment_data, by = "symbol")  


kmeans_umap_tbl

## # A tibble: 503 × 1,162
##    symbol .cluster        V1     V2 x2020_04_02 x2020_04_03 x2020_04_06
##    <chr>  <fct>        <dbl>  <dbl> <list>      <list>      <list>     
##  1 A      3        -2.71     -0.968 <NULL>      <NULL>      <NULL>     
##  2 AAPL   2        -1.96     -2.33  <NULL>      <NULL>      <NULL>     
##  3 ABBV   3        -1.74      1.31  <NULL>      <NULL>      <NULL>     
##  4 ABNB   2        -0.000340 -2.47  <NULL>      <NULL>      <NULL>     
##  5 ABT    3        -2.18      0.108 <NULL>      <NULL>      <NULL>     
##  6 ACGL   5         2.32     -0.210 <NULL>      <NULL>      <NULL>     
##  7 ACN    5        -1.30     -1.51  <NULL>      <NULL>      <NULL>     
##  8 ADBE   2        -1.86     -2.54  <NULL>      <NULL>      <NULL>     
##  9 ADI    2        -0.495    -3.31  <NULL>      <NULL>      <NULL>     
## 10 ADM    5         2.03      0.335 <NULL>      <NULL>      <NULL>     
## # ℹ 493 more rows
## # ℹ 1,155 more variables: x2020_04_07 <list>, x2020_04_08 <list>,
## #   x2020_04_09 <list>, x2020_04_13 <list>, x2020_04_14 <list>,
## #   x2020_04_15 <list>, x2020_04_16 <list>, x2020_04_17 <list>,
## #   x2020_04_20 <list>, x2020_04_21 <list>, x2020_04_22 <list>,
## #   x2020_04_23 <list>, x2020_04_24 <list>, x2020_04_27 <list>,
## #   x2020_04_28 <list>, x2020_04_29 <list>, x2020_04_30 <list>, …

g <- kmeans_umap_tbl %>%
    
    # Create text label
    mutate(text_label = str_glue("Company: {symbol}
                                 Cluster:     {.cluster}
                                 V1:       {V1 %>% scales::percent(1)}
                                 V2:       {V2 %>% scales::percent(1)}")) %>%
    
    # Plot
    ggplot(aes(V1, V2, color = .cluster, text = text_label)) +
    geom_point()

g %>% ggplotly(tooltip = "text")

Apply 9: Cluster Model for S&P500

Erin McEvoy

2024-11-07