Your organization wants to know which companies are similar to each other to help in identifying potential customers of a SAAS software solution (e.g. Salesforce CRM or equivalent) in various segments of the market. The Sales Department is very interested in this analysis, which will help them more easily penetrate various market segments.
You will be using stock prices in this analysis. You come up with a method to classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors and have similar attributes).
You can analyze the stock prices using what you’ve learned in the unsupervised learning tools including K-Means and UMAP. You will use a combination of kmeans() to find groups and umap() to visualize similarity of daily stock returns.
Apply your knowledge on K-Means and UMAP along with dplyr, ggplot2, and purrr to create a visualization that identifies subgroups in the S&P 500 Index. You will specifically apply:
kmeans() and umap()purrrdplyr, tidyr, and tibbleggplot2 (bonus plotly)Load the following libraries. If you have never used plotly for interactive plotting, you will need to install with install.packages("plotly").
# install.packages("plotly")
library(tidyverse)
library(tidyquant)
library(broom)
library(umap)
library(plotly) # NEW PACKAGE We will be using stock prices in this analysis. The tidyquant R package contains an API to retrieve stock prices. The following code is shown so you can see how I obtained the stock prices for every stock in the S&P 500 index. The files are saved in the week_6_data directory.
# # NOT RUN - WILL TAKE SEVERAL MINUTES TO DOWNLOAD ALL THE STOCK PRICES
# # JUST SHOWN FOR FUN SO YOU KNOW HOW I GOT THE DATA
#
# # GET ALL STOCKS IN A STOCK INDEX (E.G. SP500)
# sp_500_index_tbl <- tq_index("SP500")
# sp_500_index_tbl
#
# # PULL IN STOCK PRICES FOR EACH STOCK IN THE INDEX
# sp_500_prices_tbl <- sp_500_index %>%
# select(symbol) %>%
# tq_get(get = "stock.prices")
#
# # SAVING THE DATA
# fs::dir_create("week_6_data")
# sp_500_prices_tbl %>% write_rds(path = "week_6_data/sp_500_prices_tbl.rds")
# sp_500_index_tbl %>% write_rds("week_6_data/sp_500_index_tbl.rds")We can read in the stock prices. The data is 1.2M observations. The most important columns for our analysis are:
symbol: The stock ticker symbol that corresponds to a company’s stock pricedate: The timestamp relating the symbol to the share price at that point in timeadjusted: The stock price, adjusted for any splits and dividends (we use this when analyzing stock data over long periods of time)# STOCK PRICES
sp_500_prices_tbl <- read_rds("week_6_data/sp_500_prices_tbl.rds")
sp_500_prices_tblThe second data frame contains information about the stocks the most important of which are:
company: The company namesector: The sector that the company belongs to# SECTOR INFORMATION
sp_500_index_tbl <- read_rds("week_6_data/sp_500_index_tbl.rds")
sp_500_index_tblWhich stock prices behave similarly?
Answering this question helps us understand which companies are related, and we can use clustering to help us answer it!
Even if you’re not interested in finance, this is still a great analysis because it will tell you which companies are competitors and which are likely in the same space (often called sectors) and can be categorized together. Bottom line - This analysis can help you better understand the dynamics of the market and competition, which is useful for all types of analyses from finance to sales to marketing.
Let’s get started.
What you first need to do is get the data in a format that can be converted to a “user-item” style matrix. The challenge here is to connect the dots between what we have and what we need to do to format it properly.
We know that in order to compare the data, it needs to be standardized or normalized. Why? Because we cannot compare values (stock prices) that are of completely different magnitudes. In order to standardize, we will convert from adjusted stock price (dollar value) to daily returns (percent change from previous day). Here is the formula.
\[ return_{daily} = \frac{price_{i}-price_{i-1}}{price_{i-1}} \]
First, what do we have? We have stock prices for every stock in the SP 500 Index, which is the daily stock prices for over 500 stocks. The data set is over 1.2M observations.
sp_500_prices_tbl %>% glimpse()## Rows: 1,225,765
## Columns: 8
## $ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT~
## $ date <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009-01-08, ~
## $ open <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.52, 19.53~
## $ high <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.99, 19.68~
## $ low <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.52, 19.01~
## $ close <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.82, 19.09~
## $ volume <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 49815300, 5~
## $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15.23408, 1~
Your first task is to convert to a tibble named sp_500_daily_returns_tbl by performing the following operations:
symbol, date and adjusted columnsNA values from the lagging operationpct_return.symbol, date, and pct_return columnssp_500_daily_returns_tbl# Apply your data transformation skills!
# Output: sp_500_daily_returns_tbl
sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
# Select the `symbol`, `date` and `adjusted` columns
select(symbol, date, adjusted) %>%
# Filter to dates beginning in the year 2018 and beyond.
filter(date >= ymd('2018-01-01')) %>%
# Compute a Lag of 1 day on the adjusted stock price. Be sure to group by symbol first, otherwise we will have lags computed using values from the previous stock in the data frame.
group_by(symbol) %>%
mutate(lag_1day = lag(x = adjusted, n = 1)) %>%
ungroup() %>%
# Remove an `NA` values from the lagging operation
drop_na(lag_1day) %>%
# Compute the difference between adjusted and the lag
# Compute the percentage difference by dividing the difference by that lag. Name this column `pct_return`.
mutate(
diff = adjusted - lag_1day,
pct_return = diff / lag_1day) %>%
# Return only the `symbol`, `date`, and `pct_return` columns
select(symbol, date, pct_return)
# output
sp_500_daily_returns_tblThe next step is to convert to a user-item format with the symbol in the first column and every other column the value of the daily returns (pct_return) for every stock at each date.
We’re going to import the correct results first (just in case you were not able to complete the last step).
sp_500_daily_returns_tbl <- read_rds("week_6_data/sp_500_daily_returns_tbl.rds")
sp_500_daily_returns_tblNow that we have the daily returns (percentage change from one day to the next), we can convert to a user-item format. The user in this case is the symbol (company), and the item in this case is the pct_return at each date.
date column to get the values as percentage returns. Make sure to fill an NA values with zeros.stock_date_matrix_tbl# Convert to User-Item Format
# Output: stock_date_matrix_tbl
stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
# Spread the `date` column to get the values as percentage returns.
# Make sure to fill an `NA` values with zeros.
spread(key = date, value = pct_return, fill = 0)
# output:
stock_date_matrix_tbl Next, we’ll perform K-Means clustering.
We’re going to import the correct results first (just in case you were not able to complete the last step).
stock_date_matrix_tbl <- read_rds("week_6_data/stock_date_matrix_tbl.rds")Beginning with the stock_date_matrix_tbl, perform the following operations:
symbolkmeans() with centers = 4 and nstart = 20kmeans_obj# Create kmeans_obj for 4 centers
kmeans_obj <- stock_date_matrix_tbl %>%
# Drop the non-numeric column, `symbol`
select(-symbol) %>%
# Perform `kmeans()` with `centers = 4` and `nstart = 20`
kmeans(centers = 4, nstart = 20)Use glance() to get the tot.withinss.
# Apply glance() to get the tot.withinss
# Use `glance()` to get the `tot.withinss`
glance(kmeans_obj) %>%
pull(tot.withinss)## [1] 29.20555
Now that we are familiar with the process for calculating kmeans(), let’s use purrr to iterate over many values of “k” using the centers argument.
We’ll use this custom function called kmeans_mapper():
kmeans_mapper <- function(center = 3) {
stock_date_matrix_tbl %>%
select(-symbol) %>%
kmeans(centers = center, nstart = 20)
}Apply the kmeans_mapper() and glance() functions iteratively using purrr.
centers that go from 1 to 30k_means with the kmeans_mapper() output. Use mutate() to add the column and map() to map centers to the kmeans_mapper() function.glance with the glance() output. Use mutate() and map() again to iterate over the column of k_means.k_means_mapped_tbl# Use purrr to map
# Output: k_means_mapped_tbl
k_means_mapped_tbl <-
# Create a tibble containing column called `centers` that go from 1 to 30
tibble(centers = 1:30) %>%
# Add a column named `k_means` with the `kmeans_mapper()` output.
mutate(k_means = centers %>% map(kmeans_mapper)) %>%
# Use `mutate()` to add the column and `map()` to map centers to
# the `kmeans_mapper()` function.
mutate(glance = k_means %>% map(glance))
# output:
k_means_mapped_tbl Next, let’s visualize the “tot.withinss” from the glance output as a Scree Plot.
k_means_mapped_tblglance columncenters column (x-axis) versus the tot.withinss column (y-axis) using geom_point() and geom_line()# Visualize Scree Plot
# Begin with the `k_means_mapped_tbl`
k_means_mapped_tbl %>%
# Unnest the `glance` column
unnest(glance) %>%
# Plot the `centers` column (x-axis) versus the `tot.withinss` column (y-axis)
ggplot(aes(x = centers, y = tot.withinss)) +
# using `geom_point()` and `geom_line()`
geom_line(size = 0.8) +
geom_point(size = 2.5) +
geom_vline(xintercept = 10, linetype = 'dashed', size = 0.8, color = 'gray70') +
theme_minimal()+
labs(title = 'Scree Plot')We can see that the Scree Plot becomes linear (constant rate of change) between 5 and 10 centers for K.
Next, let’s plot the UMAP 2D visualization to help us investigate cluster assignments.
We’re going to import the correct results first (just in case you were not able to complete the last step).
k_means_mapped_tbl <- read_rds("week_6_data/k_means_mapped_tbl.rds")First, let’s apply the umap() function to the stock_date_matrix_tbl, which contains our user-item matrix in tibble format.
stock_date_matrix_tblsymbol columnumap() function storing the output as umap_results# Apply UMAP
# Store results as: umap_results
umap_results <- stock_date_matrix_tbl %>%
# De-select the `symbol` column
select(-symbol) %>%
# umap()
umap()Next, we want to combine the layout from the umap_results with the symbol column from the stock_date_matrix_tbl.
umap_results$layoutmatrix data type to a tibble with as_tibble()symbol column from the stock_date_matrix_tbl.umap_results_tbl.# Convert umap results to tibble with symbols
# Output: umap_results_tbl
umap_results_tbl <- umap_results$layout %>%
as_tibble() %>%
bind_cols(stock_date_matrix_tbl %>% select(symbol))
umap_results_tblFinally, let’s make a quick visualization of the umap_results_tbl.
umap_results_tbl into ggplot() mapping the V1 and V2 columns to x-axis and y-axisgeom_point() geometry with an alpha = 0.5theme_tq() and add a title “UMAP Projection”# Visualize UMAP results
umap_results_tbl %>%
# `ggplot()` mapping the `V1` and `V2` columns to x-axis and y-axis
ggplot(aes(x = V1, y = V2)) +
# Add a `geom_point()` geometry with an `alpha = 0.5``
geom_point(size = 1, alpha = 0.5) +
theme_tq()+
labs(title = 'UMAP Projection')We can now see that we have some clusters. However, we still need to combine the K-Means clusters and the UMAP 2D representation.
Next, we combine the K-Means clusters and the UMAP 2D representation
We’re going to import the correct results first (just in case you were not able to complete the last step).
k_means_mapped_tbl <- read_rds("week_6_data/k_means_mapped_tbl.rds")
umap_results_tbl <- read_rds("week_6_data/umap_results_tbl.rds")First, pull out the K-Means for 10 Centers. Use this since beyond this value the Scree Plot flattens.
k_means_mapped_tblcenters == 10k_means columnk_means_obj# Get the k_means_obj from the 10th center
# Store as k_means_obj
k_means_obj <- k_means_mapped_tbl %>%
# Filter to `centers == 10`
filter(centers == 10) %>%
# Pull the `k_means` column
pull(k_means) %>%
# Pluck the first element
pluck(1)Next, we’ll combine the clusters from the k_means_obj with the umap_results_tbl.
k_means_objk_means_obj with the stock_date_matrix_tbl to get the clusters added to the end of the tibblesymbol and .cluster columnsumap_results_tbl by the symbol columnsp_500_index_tbl %>% select(symbol, company, sector) by the symbol column.umap_kmeans_results_tbl# Use your dplyr & broom skills to combine the k_means_obj with the umap_results_tbl
# Output: umap_kmeans_results_tbl
umap_kmeans_results_tbl <- k_means_obj %>%
# Augment the `k_means_obj` with the `stock_date_matrix_tbl`
# to get the clusters added to the end of the tibble
augment(data = stock_date_matrix_tbl) %>%
# Select just the `symbol` and `.cluster` columns
select(symbol, .cluster) %>%
# Left join the result with the `umap_results_tbl` by the `symbol` column
left_join(y = umap_results_tbl, by = 'symbol') %>%
# Left join the result with the result of
# `sp_500_index_tbl %>% select(symbol, company, sector)` by the `symbol` column.
left_join(y = sp_500_index_tbl %>% select(symbol, company, sector), by = 'symbol')
umap_kmeans_results_tblPlot the K-Means and UMAP results.
umap_kmeans_results_tblggplot() mapping V1, V2 and color = .clustergeom_point() geometry with alpha = 0.5theme_tq() and scale_color_tq()Note - If you’ve used centers greater than 12, you will need to use a hack to enable scale_color_tq() to work. Just replace with: scale_color_manual(values = palette_light() %>% rep(3))
# Visualize the combined K-Means and UMAP results
umap_kmeans_results_tbl %>%
# Use `ggplot()` mapping `V1`, `V2` and `color = .cluster`
ggplot(aes(x = V1, y = V2, color = .cluster)) +
# Add the `geom_point()` geometry with `alpha = 0.5`
geom_point(size = 2, alpha = 0.5) +
# Apply `theme_tq()` and `scale_color_tq()`
theme_tq() +
scale_color_tq()If you’ve made it this far, you’re doing GREAT!!!
This is an interactive demo that is an extension of what we’ve learned so far. You are not required to produce any code in this section. However, it presents an interesting case to see how we can explore the clusters using the plotly library with the ggplotly() function.
These two functions combine to produce the interactive plot:
get_kmeans(): Returns a data frame of UMAP and K-Means result for a value of kplot_cluster: Returns an interactive plotly plot enabling exploration of the cluster and UMAP results. The only additional code you have not seen so far is the ggplotly() function. This is a topic for Week 7: Communication.get_kmeans <- function(k = 3) {
k_means_obj <- k_means_mapped_tbl %>%
filter(centers == k) %>%
pull(k_means) %>%
pluck(1)
umap_kmeans_results_tbl <- k_means_obj %>%
augment(stock_date_matrix_tbl) %>%
select(symbol, .cluster) %>%
left_join(umap_results_tbl, by = "symbol") %>%
left_join(sp_500_index_tbl %>% select(symbol, company, sector),
by = "symbol")
return(umap_kmeans_results_tbl)
}
plot_cluster <- function(k = 3) {
g <- get_kmeans(k) %>%
mutate(label_text = str_glue("Stock: {symbol}
Company: {company}
Sector: {sector}")) %>%
ggplot(aes(V1, V2, color = .cluster, text = label_text)) +
geom_point(alpha = 0.5) +
theme_tq() +
scale_color_tq()
# scale_color_manual(values = palette_light() %>% rep(3))
g %>%
ggplotly(tooltip = "text")
}We can plot the clusters interactively.
plot_cluster(k = 10)sessionInfo()## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.0 umap_0.2.7.0
## [3] broom_0.7.12 tidyquant_1.0.3
## [5] quantmod_0.4.18 TTR_0.24.3
## [7] PerformanceAnalytics_2.0.4 xts_0.12.1
## [9] zoo_1.8-9 lubridate_1.8.0
## [11] forcats_0.5.1 stringr_1.4.0
## [13] dplyr_1.0.8 purrr_0.3.4
## [15] readr_2.1.2 tidyr_1.2.0
## [17] tibble_3.1.6 ggplot2_3.3.5
## [19] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0 viridisLite_0.4.0 jsonlite_1.8.0
## [5] modelr_0.1.8 bslib_0.3.1 assertthat_0.2.1 askpass_1.1
## [9] highr_0.9 cellranger_1.1.0 yaml_2.3.5 pillar_1.7.0
## [13] backports_1.4.1 lattice_0.20-45 glue_1.6.2 quadprog_1.5-8
## [17] reticulate_1.24 digest_0.6.29 rvest_1.0.2 colorspace_2.0-3
## [21] htmltools_0.5.2 Matrix_1.4-0 pkgconfig_2.0.3 haven_2.4.3
## [25] scales_1.1.1 RSpectra_0.16-0 tzdb_0.2.0 openssl_2.0.0
## [29] farver_2.1.0 generics_0.1.2 ellipsis_0.3.2 withr_2.5.0
## [33] lazyeval_0.2.2 cli_3.2.0 magrittr_2.0.2 crayon_1.5.0
## [37] readxl_1.3.1 evaluate_0.15 fs_1.5.2 fansi_1.0.2
## [41] xml2_1.3.3 tools_4.1.3 data.table_1.14.2 hms_1.1.1
## [45] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.1.3
## [49] jquerylib_0.1.4 rlang_1.0.2 grid_4.1.3 rstudioapi_0.13
## [53] htmlwidgets_1.5.4 crosstalk_1.2.0 labeling_0.4.2 rmarkdown_2.13
## [57] gtable_0.3.0 DBI_1.1.2 curl_4.3.2 R6_2.5.1
## [61] knitr_1.37 fastmap_1.1.0 utf8_1.2.2 Quandl_2.11.0
## [65] stringi_1.7.6 Rcpp_1.0.8.2 vctrs_0.3.8 png_0.1-7
## [69] dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.30