The dataset documents the reasons for CEO departure in S&P 1500 firms from 2000 through 2018. Goal is to predict CEO departure (ceo_dismissal) by using the departures dataset.

Import Data

sp500_index_tbl <- read_csv("../00_data/sp500_index_tbl.csv")

## New names:
## Rows: 504 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): symbol, company, identifier, sedol, sector, local_currency dbl (3): ...1,
## weight, shares_held
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

sp500_prices_tbl <- read_csv("../00_data/sp500_prices_tbl.csv")

## New names:
## Rows: 630727 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): symbol dbl (7): ...1, open, high, low, close, volume, adjusted date (1):
## date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

sp500_index_tbl %>% glimpse()

## Rows: 504
## Columns: 9
## $ ...1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ symbol         <chr> "AAPL", "MSFT", "NVDA", "AMZN", "META", "BRK-B", "GOOGL…
## $ company        <chr> "APPLE INC", "MICROSOFT CORP", "NVIDIA CORP", "AMAZON.C…
## $ identifier     <chr> "037833100", "594918104", "67066G104", "023135106", "30…
## $ sedol          <chr> "2046251", "2588173", "2379504", "2000019", "B7TL820", …
## $ weight         <dbl> 0.066163918, 0.062480435, 0.059871596, 0.036908312, 0.0…
## $ sector         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ shares_held    <dbl> 183581933, 90849299, 299287755, 115266468, 26762319, 22…
## $ local_currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD",…

#Data

# Get info on companies listed in S&P500
sp500_index_tbl <- tq_index("SP500")

## Getting holdings for SP500

# Get individual stocks from S&P500
sp500_symbols <- sp500_index_tbl %>% distinct(symbol) %>% pull() 

# Get stock prices of the companies
sp500_prices_tbl <- tq_get(sp500_symbols, from = "2020-04-01")

## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `data.. = purrr::map(...)`.
## Caused by warning:
## ! x = '-', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "-", env = <environment>, verbose = FALSE, : Unable to import "-".
## HTTP error 404.
##  Removing -.

write.csv(sp500_index_tbl, "../00_data/sp500_index_tbl.csv")
write.csv(sp500_prices_tbl, "../00_data/sp500_prices_tbl.csv")

sp500_prices_tbl %>% glimpse()

## Rows: 630,727
## Columns: 8
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2020-04-01, 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, …
## $ open     <dbl> 61.6250, 60.0850, 60.7000, 62.7250, 67.7000, 65.6850, 67.1750…
## $ high     <dbl> 62.1800, 61.2875, 61.4250, 65.7775, 67.9250, 66.8425, 67.5175…
## $ low      <dbl> 59.7825, 59.2250, 59.7425, 62.3450, 64.7500, 65.3075, 66.1750…
## $ close    <dbl> 60.2275, 61.2325, 60.3525, 65.6175, 64.8575, 66.5175, 66.9975…
## $ volume   <dbl> 176218400, 165934000, 129880000, 201820400, 202887200, 168895…
## $ adjusted <dbl> 58.46381, 59.43937, 58.58515, 63.69596, 62.95822, 64.56960, 6…

Question

Which stock prices behave similarly?

Our main objective is to identify stocks that exhibit similar price behaviors over time. By doing so, we aim to gain insights into the relationships between different companies, uncovering potential competitors and sector affiliations.

Why It Matters Understanding which companies are related is crucial for various reasons:

Identifying Relationships: Clustering similar stocks reveals underlying connections and dependencies in the market.
Competitor Analysis: Discovering competitors and sector affiliations aids in strategic planning and market positioning.
Market Dynamics: Analyzing stock price similarities provides valuable insights into market dynamics, benefiting diverse fields such as finance, sales, and marketing. Assignment Details Your task is to analyze the historical price data of various stocks and determine which stocks behave similarly. We will employ clustering techniques to accomplish this task effectively.

1 Convert data to standardized form

To accurately compare stock data, we need to standardize or normalize it. Why? Because stock prices can vary widely in scale, and comparing raw prices directly wouldn’t be meaningful. Instead, we convert adjusted stock prices into daily returns, which reflect the percentage change from one day to the next. This allows for fair comparison across all stocks. The formula we’ll use is:

                      Return(daily) = price(i) - price(i-1) / price(i-1)

sp500_prices_tbl %>% glimpse()

## Rows: 630,727
## Columns: 8
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2020-04-01, 2020-04-02, 2020-04-03, 2020-04-06, 2020-04-07, …
## $ open     <dbl> 61.6250, 60.0850, 60.7000, 62.7250, 67.7000, 65.6850, 67.1750…
## $ high     <dbl> 62.1800, 61.2875, 61.4250, 65.7775, 67.9250, 66.8425, 67.5175…
## $ low      <dbl> 59.7825, 59.2250, 59.7425, 62.3450, 64.7500, 65.3075, 66.1750…
## $ close    <dbl> 60.2275, 61.2325, 60.3525, 65.6175, 64.8575, 66.5175, 66.9975…
## $ volume   <dbl> 176218400, 165934000, 129880000, 201820400, 202887200, 168895…
## $ adjusted <dbl> 58.46381, 59.43937, 58.58515, 63.69596, 62.95822, 64.56960, 6…

sp_500_daily_returns_tbl <- sp500_prices_tbl %>%
    
    select(symbol, date, adjusted) %>%
    
    filter(date >= ymd("2018-01-01")) %>%
    
    group_by(symbol) %>%
    mutate(lag_1 = lag(adjusted)) %>%
    ungroup() %>%
    
    filter(!is.na(lag_1)) %>%
    
    mutate(diff = adjusted - lag_1) %>%
    mutate(pct_return = diff / lag_1) %>%
    
    select(symbol, date, pct_return)

sp_500_daily_returns_tbl

## # A tibble: 630,224 × 3
##    symbol date       pct_return
##    <chr>  <date>          <dbl>
##  1 AAPL   2020-04-02    0.0167 
##  2 AAPL   2020-04-03   -0.0144 
##  3 AAPL   2020-04-06    0.0872 
##  4 AAPL   2020-04-07   -0.0116 
##  5 AAPL   2020-04-08    0.0256 
##  6 AAPL   2020-04-09    0.00722
##  7 AAPL   2020-04-13    0.0196 
##  8 AAPL   2020-04-14    0.0505 
##  9 AAPL   2020-04-15   -0.00913
## 10 AAPL   2020-04-16    0.00795
## # ℹ 630,214 more rows

2 Spread to object-characteristics format

We’ll convert the daily returns (percentage change from one day to the next) to object-characteristics format, also known as the user-item format. Users are identified by the symbol (company), and items are represented by the pct_return at each date.

stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
    spread(key = date, value = pct_return, fill = 0)

stock_date_matrix_tbl

## # A tibble: 503 × 1,267
##    symbol `2020-04-02` `2020-04-03` `2020-04-06` `2020-04-07` `2020-04-08`
##    <chr>         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1 A           0.0489     -0.0259         0.0559     -0.00444       0.0359
##  2 AAPL        0.0167     -0.0144         0.0872     -0.0116        0.0256
##  3 ABBV        0.0233     -0.0234         0.0322     -0.00449       0.0420
##  4 ABNB        0           0              0           0             0     
##  5 ABT         0.0375      0.000126       0.0413     -0.00967       0.0369
##  6 ACGL        0.0115     -0.0650         0.0983      0.0314        0.0291
##  7 ACN         0.0103     -0.0264         0.0914     -0.0116        0.0464
##  8 ADBE        0.00913    -0.0341         0.0869     -0.0320        0.0267
##  9 ADI         0.0429     -0.0130         0.107       0.00470       0.0535
## 10 ADM         0.0136      0.00932        0.0326      0.00559       0.0139
## # ℹ 493 more rows
## # ℹ 1,261 more variables: `2020-04-09` <dbl>, `2020-04-13` <dbl>,
## #   `2020-04-14` <dbl>, `2020-04-15` <dbl>, `2020-04-16` <dbl>,
## #   `2020-04-17` <dbl>, `2020-04-20` <dbl>, `2020-04-21` <dbl>,
## #   `2020-04-22` <dbl>, `2020-04-23` <dbl>, `2020-04-24` <dbl>,
## #   `2020-04-27` <dbl>, `2020-04-28` <dbl>, `2020-04-29` <dbl>,
## #   `2020-04-30` <dbl>, `2020-05-01` <dbl>, `2020-05-04` <dbl>, …

3 Perform k-means clustering

stock_cluster <- kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = 4, nstart = 20)

summary(stock_cluster)

##              Length Class  Mode   
## cluster       503   -none- numeric
## centers      5064   -none- numeric
## totss           1   -none- numeric
## withinss        4   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            4   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

tidy(stock_cluster)

## # A tibble: 4 × 1,269
##   `2020-04-02` `2020-04-03` `2020-04-06` `2020-04-07` `2020-04-08` `2020-04-09`
##          <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1      0.0113       -0.0181       0.0913   -0.00218         0.0350      0.00763
## 2      0.0190       -0.0138       0.0633   -0.0000170       0.0378      0.0232 
## 3      0.0867        0.0170       0.0691    0.0345          0.0674      0.0143 
## 4     -0.00790      -0.0235       0.105     0.0272          0.0572      0.0383 
## # ℹ 1,263 more variables: `2020-04-13` <dbl>, `2020-04-14` <dbl>,
## #   `2020-04-15` <dbl>, `2020-04-16` <dbl>, `2020-04-17` <dbl>,
## #   `2020-04-20` <dbl>, `2020-04-21` <dbl>, `2020-04-22` <dbl>,
## #   `2020-04-23` <dbl>, `2020-04-24` <dbl>, `2020-04-27` <dbl>,
## #   `2020-04-28` <dbl>, `2020-04-29` <dbl>, `2020-04-30` <dbl>,
## #   `2020-05-01` <dbl>, `2020-05-04` <dbl>, `2020-05-05` <dbl>,
## #   `2020-05-06` <dbl>, `2020-05-07` <dbl>, `2020-05-08` <dbl>, …

glance(stock_cluster)

## # A tibble: 1 × 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1  206.         172.      34.3     4

augment(stock_cluster, stock_date_matrix_tbl) %>%
    
    ggplot(aes(`2020-04-02`, `2020-04-03`, color = .cluster)) + 
    geom_point()

# 4 Select optimal number of clusters

kclusts <- tibble(k = 1:9) %>%
    mutate(kclust = map(.x = k, .f = ~ kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = .x, nstart = 20)), glanced = map(.x = kclust, .f = glance)) 

kclusts %>%
    unnest(glanced) %>%
    ggplot(aes(k, tot.withinss)) + 
    geom_point() + 
    geom_line()

final_cluster <- kmeans(stock_date_matrix_tbl %>% select(-symbol), centers = 3, nstart = 20)

5 Reduce dimension using UMAP

umap_results <- stock_date_matrix_tbl %>%
    select(-symbol) %>%
    umap()

umap_results_tbl <- umap_results$layout %>%
    as.tibble() %>%
    bind_cols(stock_date_matrix_tbl %>% select(symbol))

## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tibble package.
##   Please report the issue at <https://github.com/tidyverse/tibble/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

umap_results_tbl

## # A tibble: 503 × 3
##        V1     V2 symbol
##     <dbl>  <dbl> <chr> 
##  1 -1.90  -2.41  A     
##  2 -2.63  -0.996 AAPL  
##  3  0.485 -2.10  ABBV  
##  4 -2.60  -0.300 ABNB  
##  5 -0.843 -2.10  ABT   
##  6  0.981  1.62  ACGL  
##  7 -1.88  -0.653 ACN   
##  8 -3.03  -1.11  ADBE  
##  9 -3.38   0.196 ADI   
## 10  0.735  0.786 ADM   
## # ℹ 493 more rows

umap_results_tbl %>%
    ggplot(aes(V1, V2)) +
    geom_point()

# 6 Visualize clusters by adding k-means results

kmeans_umap_tbl <- final_cluster %>%
    augment(stock_date_matrix_tbl) %>%
    select(symbol, .cluster) %>%
    
    #  umap results
    left_join(umap_results_tbl) %>%
    
    #  stock information
    left_join(sp500_index_tbl %>% select(symbol, company, sector), by = "symbol")

## Joining with `by = join_by(symbol)`

kmeans_umap_tbl

## # A tibble: 503 × 6
##    symbol .cluster     V1     V2 company                   sector
##    <chr>  <fct>     <dbl>  <dbl> <chr>                     <chr> 
##  1 A      3        -1.90  -2.41  AGILENT TECHNOLOGIES INC  -     
##  2 AAPL   1        -2.63  -0.996 APPLE INC                 -     
##  3 ABBV   3         0.485 -2.10  ABBVIE INC                -     
##  4 ABNB   1        -2.60  -0.300 AIRBNB INC CLASS A        -     
##  5 ABT    3        -0.843 -2.10  ABBOTT LABORATORIES       -     
##  6 ACGL   3         0.981  1.62  ARCH CAPITAL GROUP LTD    -     
##  7 ACN    3        -1.88  -0.653 ACCENTURE PLC CL A        -     
##  8 ADBE   1        -3.03  -1.11  ADOBE INC                 -     
##  9 ADI    1        -3.38   0.196 ANALOG DEVICES INC        -     
## 10 ADM    3         0.735  0.786 ARCHER DANIELS MIDLAND CO -     
## # ℹ 493 more rows

g <- kmeans_umap_tbl %>%
    
    # Create text label
    mutate(text_label = str_glue("Stock:   {symbol}
                                 Cluster:  {.cluster}
                                 Company:  {company}
                                 Sector:   {sector}")) %>%
    
    # Plot
    ggplot(aes(V1, V2, color = .cluster, text = text_label)) +
    geom_point() + 
    xlim(-4, 6) + 
    ylim(-6, 4)

g %>% ggplotly(tooltip = "text")

Apply it to your data 9

Nils Skogestig

2025-04-10