The S&P 500 is widely regarded by investment professionals as the best index representing the performance of large-cap U.S. equities. Approximately $3.4 trillion is indexed to the S&P 500, and an additional $6.5 trillion is benchmarked to this index. The index includes 500 leading companies and captures approximately 80% of the U.S. total market capitalization.
One feature that distinguishes the S&P 500 is that its constituents are weighted by their free-float market capitalization. This contrasts with the Dow Jones Industrial Average, whose components are weighted by their market prices, so that a stock trading at $100/share carries 5x the weight of a stock trading at $20/share, regardless of the aggregate value of each company's shares outstanding.
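To make the distinction concrete, here is a minimal sketch with two hypothetical stocks (the prices and share counts are invented for illustration):
# Hypothetical two-stock example contrasting the weighting schemes
prices <- c(stock_A = 100, stock_B = 20)     # price per share
shares <- c(stock_A = 1e9, stock_B = 10e9)   # free-float shares outstanding

# Price weighting (Dow-style): weights depend only on share price
price_wgts <- prices / sum(prices)

# Free-float capitalization weighting (S&P-style): weights depend on aggregate value
cap_wgts <- (prices * shares) / sum(prices * shares)

round(100 * price_wgts, 1)  # stock_A carries 5x the weight of stock_B
round(100 * cap_wgts, 1)    # stock_B, with the larger float, dominates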
Although the current composition of the index is widely disseminated, it can be difficult to acquire data on the historical constituents and their weights, especially for non-professionals without access to a Bloomberg terminal or other paid resources. Bespoke has published charts that display historical trends in sector weightings going back to 1990, although the underlying data do not appear to be available for download.
An inquiry on Quora seeking public data sets of historical S&P components led to some paid resources provided by Siblis Research. The ETF Database provides a visual history of the S&P 500 with some data about the market capitalization of the top ten S&P components since 1980. However, further work would be required to find the denominator for the total S&P capitalization in order to compute the weightings. Also, this source covers only the ten largest companies (2% of the 500) and provides only annual data.
Pol Alvarez, a user on GitHub, created a list of the historical S&P 500 constituents, monthly from January 2008 through February 2019. This JSON-formatted file, however, does not include weights for any of the periods presented.
I concluded that it might be useful to present a simple approach to obtaining the historical S&P constituents and their respective weights in the index.
The iShares Core S&P 500 ETF (IVV) was created in May 2000 and currently has net assets of approximately $170 billion. It tracks the S&P 500 by owning the index constituents in proportion to their index weights. Conveniently, its website provides historical holdings on a monthly basis as far back as 2006.
There is a slight difference (on the order of 1-2 bps) in the weights reported by this ETF and the true S&P index weights, presumably due to a slight cash position in the ETF holdings. However, for my purposes, I could safely disregard this discrepancy.
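If that level of precision mattered, the reported equity weights could simply be rescaled so they sum to exactly 100%; a minimal sketch with made-up numbers:
# Hypothetical ETF-reported weights summing to slightly less than 100%
# because a small cash position is excluded
reported_wgts <- c(4.25, 3.10, 92.63)

# Rescale so the equity weights sum to exactly 100
renormalized_wgts <- reported_wgts / sum(reported_wgts) * 100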
I programmatically downloaded these holdings lists as CSV files.
library(RSelenium)

base_url <- 'https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf'

# Launch a Selenium-driven Chrome session and open the IVV product page
rD <- rsDriver(browser = c("chrome"), chromever = "74.0.3729.6")
remDr <- rD$client
remDr$navigate(base_url)

# Switch the holdings table to the "All" tab
all_tab <- remDr$findElement(using = 'css selector', '#holdingsTabs li a[href*="All"]')
all_tab$clickElement()

# Grab the month-end date dropdown as a select element
option_list <- remDr$findElement(using = 'css selector',
                                 '#holdingsTabs #tabsAll .date-dropdown')$selectTag()

# Select each available month-end date and click the spreadsheet export link
get_top_holdings <- function(x) {
  option_list$elements[[x]]$clickElement()
  remDr$findElement(using = 'css selector', '.holdings .icon-xls-export')$clickElement()
}

lapply(seq_along(option_list$value), get_top_holdings)
It takes a little while for the entire corpus of holdings lists to download, and I found that running my R code as a single script cut the process off before the downloads were complete, so the browser should only be closed once all of the files have arrived.
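One possible workaround is to poll the download folder until the expected number of files has appeared before shutting anything down. A rough sketch along these lines (the ten-minute timeout is an arbitrary choice):
# Wait until every month-end holdings file has landed in the download folder
expected_files <- length(option_list$value)
for (i in 1:600) {
  n_downloaded <- length(list.files("C:/Users/mbadr/Downloads/",
                                    pattern = "IVV_holdings.*"))
  if (n_downloaded >= expected_files) break
  Sys.sleep(1)  # check once per second, up to ~10 minutes
}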
# Close the browser and stop the Selenium driver
remDr$close()
rD$server$stop()
# Some libraries required to compile the downloads
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
get_historical_wgts <- function(fn) {
  # The top of each iShares file is a metadata block (fund name, as-of date, etc.)
  holdings_metadata <- read_csv(fn, col_names = FALSE, col_types = cols(),
                                n_max = 8, skip = 1)
  # The holdings table itself starts below the metadata block
  holdings <- read_csv(fn, col_types = cols(), skip = 10)
  # Keep only equity positions and stamp each row with the file's as-of date,
  # which sits in row 2, column 2 of the metadata block
  equity_holdings <- holdings %>%
    filter(`Asset Class` == "Equity") %>%
    mutate(As_Of_Date = mdy(holdings_metadata[[2, 2]]))
  return(equity_holdings)
}
# Confirm the download path below and insert your own directory username
download_path <- str_interp("C:/Users/${user_name}/Downloads/",
                            list(user_name = "mbadr"))
filenames_by_month <- file.path(download_path,
                                list.files(path = download_path,
                                           pattern = 'IVV_holdings.*'))
# Import and bind the files into a large tibble data frame
master_holdings <- tibble()
for (i in filenames_by_month) {
  new_holdings_tbl <- get_historical_wgts(i)
  master_holdings <- rbind(master_holdings, new_holdings_tbl)
}
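Before saving the combined table, a quick sanity check is helpful: each month-end should have roughly 500 constituents, and the equity weights should sum to just under 100%, reflecting the ETF's small cash position noted earlier. A minimal sketch:
# Sanity check: constituents per month and total equity weight
master_holdings %>%
  group_by(As_Of_Date) %>%
  summarise(n_constituents = n(),
            total_weight = sum(`Weight (%)`, na.rm = TRUE)) %>%
  arrange(As_Of_Date)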
For convenience, I saved the resulting master holdings list.
saveRDS(master_holdings, paste0(download_path, "SPX_historical_wgts.rds"))
Then, it's easy to use dplyr and the tidyverse to analyze the resulting data set.
# Read in historical constituents and weights
master_holdings <- readRDS(paste0(download_path, "SPX_historical_wgts.rds"))
library(ggplot2)
library(scales)
target_company <- c('GE', 'GEC') # GE changed tickers during the period
# Pull the company name as a single string for use in the plot title
target_co_name <- master_holdings %>%
  filter(Ticker %in% target_company) %>%
  pull(Name) %>%
  unique() %>%
  first()
co_wgt <- master_holdings %>% filter(Ticker %in% target_company)
co_wgt %>% ggplot(aes(x = As_Of_Date, y = `Weight (%)`)) +
  geom_line() +
  labs(title = str_interp('Historical Weight of ${target_co_name} in S&P 500'),
       subtitle = "Percentage, as of month-end",
       x = NULL,
       y = NULL) +
  scale_y_continuous(limits = c(0, NA),
                     labels = function(x) sprintf("%.1f", x))
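With the full panel of weights in hand, it is also straightforward to build sector-weighting histories along the lines of the Bespoke charts mentioned earlier. A sketch, assuming the downloaded holdings files include a Sector column (recent iShares files do; older vintages may label it differently):
# Aggregate constituent weights into sector weights by month-end
sector_wgts <- master_holdings %>%
  group_by(As_Of_Date, Sector) %>%
  summarise(sector_weight = sum(`Weight (%)`, na.rm = TRUE)) %>%
  ungroup()

sector_wgts %>%
  ggplot(aes(x = As_Of_Date, y = sector_weight, color = Sector)) +
  geom_line() +
  labs(title = "S&P 500 Sector Weights Over Time",
       subtitle = "Percentage, as of month-end",
       x = NULL, y = NULL)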