I access the EDGAR API using the R package edgarWebR.
library(edgarWebR)
library(lubridate)
To get started, for a given ticker v, I can pull the last N filings whose type starts with a given prefix. In the command below, I pull the last 30 filings whose type starts with 10-. The 10-K report refers to the company's annual disclosure, whereas the 10-Q reports are the quarterly ones. In this vignette, I will focus on the annual ones.
v <- "BAC"
filings <- company_filings(v, type = "10-", count = 30)
table(filings$type)
10-K 10-Q
5 15
filings <- filings[grep("10-K",filings$type), ] # keep the 10-K ones
The filings object is a data frame containing a number of variables, such as the type of the report, the filing date, the acceptance date, the size of the filing, and the URL link to the filing.
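To glance at these fields, one can inspect the relevant columns directly. The column names below (type, filing_date, accepted_date, size, href) are my assumption about edgarWebR's output; verify with names(filings) if in doubt.
# glance at the main columns of the filings data frame (column names assumed; check names(filings))
str(filings[, c("type", "filing_date", "accepted_date", "size", "href")])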
filings$filing_date <- date(filings$filing_date)
filings <- filings[order(filings$filing_date),]
filings$tic <- v
filings
The fiscal year end for BAC is late December. Note that each report was filed within three months after the end of the fiscal year.
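As a quick check, here is a small sketch that computes the number of days between the assumed December 31 fiscal year end and each filing date; it assumes that the 10-K for fiscal year T is filed in calendar year T + 1.
# days between the assumed Dec 31 fiscal year end and the filing date
fye <- as.Date(paste0(year(filings$filing_date) - 1, "-12-31"))
as.numeric(filings$filing_date - fye)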
The filings data frame contains the link to each report. In order to parse each one, I run a loop over the rows of the data. In the illustration below, I run it for the 2014 filing:
i <- 1
href.i <- filing_documents(filings$href[i])
href.i
href.i <- href.i[href.i$description == 'Complete submission text file',"href"]
href.i
[1] "https://www.sec.gov/Archives/edgar/data/70858/000007085814000012/0000070858-14-000012.txt"
The filing_documents function takes as input the URL of a filing in the EDGAR system and returns a data frame listing all relevant files for that year's report. In our case, we need the link to the html-text content, which we will be able to parse. We pull this link, the one whose description indicates “Complete submission text file”, and assign it to href.i.
To parse the filing, I run the following two commands:
parse.href <- parse_submission(href.i)
doc <- parse_filing(parse.href[grep("10-K",parse.href$TYPE),"TEXT"], include.raw = TRUE)
The first command parses the submission, whereas the second extracts the textual content from the parsed submission. Given the textual content, I search for the keyword “signature”, since the signature section is where the board of directors signs the document. I conjecture that this section appears at the end of the filing, with the board names separated line by line.
# find an index where the signature keyword shows up
signatures.index <- grep("signature", doc$text, ignore.case = TRUE)
signatures.index <- signatures.index:length(doc$text)
numerical expression has 3 elements: only the first used
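The warning above appears because grep() finds three lines containing “signature”, and the colon operator uses only the first match. An equivalent, explicit way to write this step (which avoids the warning) is the following sketch:
# keep only the first occurrence of "signature" before building the index range
signatures.index <- grep("signature", doc$text, ignore.case = TRUE)[1]
signatures.index <- signatures.index:length(doc$text)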
# extract the textual content where it does
sign.doc <- doc$text[signatures.index]
# split the document
direc.names <- strsplit(sign.doc,"/s/")
# find the location where the title director shows up
direc.names2 <- lapply(direc.names, function(x) grep("Director",x,ignore.case = T) )
direc.names2
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
integer(0)
[[4]]
integer(0)
[[5]]
integer(0)
[[6]]
[1] 2 5 6 7 8 9 10 11 12
[[7]]
integer(0)
[[8]]
[1] 2 3 4 5 6 7
[[9]]
integer(0)
[[10]]
integer(0)
[[11]]
[1] 1
[[12]]
integer(0)
[[13]]
[1] 1
[[14]]
integer(0)
[[15]]
integer(0)
[[16]]
integer(0)
[[17]]
integer(0)
[[18]]
integer(0)
[[19]]
integer(0)
[[20]]
integer(0)
[[21]]
integer(0)
Finally, I argue that the board corresponds to the element in which the largest number of directors shows up:
dir.index <- which.max(sapply(direc.names2, length))
direc.names3 <- direc.names[[dir.index]][direc.names2[[dir.index]]]
direc.names3 <- toupper(direc.names3)
direc.names3 <- strsplit(direc.names3,"DIRECTOR")
final <- sapply(direc.names3,function(x) x[1])
final
[1] " BRIAN T. MOYNIHAN CHIEF EXECUTIVE OFFICER, PRESIDENT AND "
[2] " SHARON L. ALLEN "
[3] " SUSAN S. BIES "
[4] " JACK O. BOVENDER, JR. "
[5] " FRANK P. BRAMBLE, SR. "
[6] " PIERRE DE WECK "
[7] " ARNOLD W. DONALD "
[8] " CHARLES K. GIFFORD "
[9] " CHARLES O. HOLLIDAY, JR. "
As a last step, I make some adjustments and stack the board names into a data.frame:
final.list <- strsplit(final," ")
final <- sapply(final.list, function(x) paste(x[!nchar(x) == 0],collapse = " " ) )
ds.dir <- data.frame(TIC = v, DATE = filings$filing_date[i], TYPE = filings$type[i], NAME = final )
ds.dir
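Up to this point, I have processed only the first filing. Below is a sketch of the loop mentioned earlier, which repeats the same steps for every 10-K in filings; it assumes that each year's signature block follows the same layout, which may not hold for other firms or periods.
# repeat the extraction for every 10-K in filings (assumes the same signature-block layout each year)
ds.dir.all <- data.frame()
for (i in 1:nrow(filings)) {
  href.i <- filing_documents(filings$href[i])
  href.i <- href.i[href.i$description == "Complete submission text file", "href"]
  parse.href <- parse_submission(href.i)
  doc <- parse_filing(parse.href[grep("10-K", parse.href$TYPE), "TEXT"], include.raw = TRUE)
  sign.doc <- doc$text[grep("signature", doc$text, ignore.case = TRUE)[1]:length(doc$text)]
  direc.names <- strsplit(sign.doc, "/s/")
  direc.names2 <- lapply(direc.names, function(x) grep("Director", x, ignore.case = TRUE))
  dir.index <- which.max(sapply(direc.names2, length))
  direc.names3 <- toupper(direc.names[[dir.index]][direc.names2[[dir.index]]])
  final <- sapply(strsplit(direc.names3, "DIRECTOR"), function(x) x[1])
  final <- sapply(strsplit(final, " "), function(x) paste(x[nchar(x) > 0], collapse = " "))
  ds.i <- data.frame(TIC = v, DATE = filings$filing_date[i], TYPE = filings$type[i], NAME = final)
  ds.dir.all <- rbind(ds.dir.all, ds.i)
}
ds.dir.all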
The WSJ provides an openly accessible web page with the profile of each public company; see, e.g., the page for Bank of America here. This approach is much faster, as parsing the WSJ page takes far less computation than pulling a given firm's submissions and then parsing them to locate the board names. I rely on two packages to implement it: rvest for web scraping and stringr for string manipulation.
library(rvest)
library(stringr)
To get started, I construct the URL using the firm's ticker as follows:
v <- "BAC"
theurl<- paste("https://quotes.wsj.com/",v,"/company-people",sep = "")
Then, I read the HTML content of the URL and extract all the table nodes from it:
file <- read_html(theurl)
tables <- html_nodes(file, "table")
length(tables)
[1] 11
For each identified node, I try to read it as a table:
tables.list <- list()
for (i in 1:length(tables)) {
  # try() keeps the loop going if a node cannot be parsed as a table
  tables.list[[i]] <- try(html_table(tables[i], fill = TRUE), silent = TRUE)
}
Upon closer inspection of the HTML source, together with the tables above, I conjecture that the board names are located in the third item of tables.list:
dir.table <- tables.list[[3]][[1]]
dir.table <- dir.table[,1]
dir.table
[1] "Brian T. Moynihan, 56 Chairman, President & Chief Executive Officer"
[2] "Maria T. Zuber, 59 Independent Director"
[3] "Pierre J. P. de Weck, 67 Independent Director"
[4] "Lionel L. Nowell, 63 Independent Director"
[5] "Arnold W. Donald, 63 Independent Director"
[6] "R. David Yost, 70 Independent Director"
[7] "Linda Parker Hudson, 67 Independent Director"
[8] "Jack O. Bovender, 72 Lead Independent Director"
[9] "Sharon L. Allen, 66 Independent Director"
[10] "Michael D. White, 65 Independent Director"
[11] "Thomas D. Woods, 63 Independent Director"
[12] "Monica Cecilia Lozano, 59 Independent Director"
[13] "Frank P. Bramble, 67 Independent Director"
[14] "Susan Schmidt Bies, 69 Independent Director"
[15] "Thomas J. May, 68 Independent Director"
The above contains the names, the ages, and the titles. I run the following commands to extract this content into a data frame:
# useful function to extract numerical values from a string
numextract <- function(string){
  str_extract(string, "\\-*\\d+\\.*\\d*")
}
# extract age
age <- as.numeric(numextract(dir.table))
# stack all names in a data frame
ds.dir <- data.frame()
for (i in 1:length(dir.table)) {
  # split each row at the age, which separates the name from the title
  str.i <- unlist(strsplit(dir.table[i], age[i]))
  str.i <- gsub(",", "", str.i)
  str.i <- strsplit(str.i, " ")
  # drop empty strings and collapse the remaining words
  str.i <- sapply(str.i, function(x) paste(x[nchar(x) > 0], collapse = " "))
  ds.i <- data.frame(TIC = v, NAME = str.i[1], TITLE = str.i[2], AGE = age[i])
  ds.dir <- rbind(ds.dir, ds.i)
}
ds.dir
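To apply the same scrape to other tickers, one could wrap the steps above in a helper function. The sketch below reuses the numextract helper defined earlier, and it assumes both that rvest and stringr are loaded and that the board always sits in the third table of the page, which may not hold for every company or if the WSJ layout changes.
# sketch: wrap the WSJ scrape into a reusable function (board assumed to be in the third table)
get_wsj_board <- function(v) {
  theurl <- paste("https://quotes.wsj.com/", v, "/company-people", sep = "")
  tables <- html_nodes(read_html(theurl), "table")
  dir.table <- html_table(tables[3], fill = TRUE)[[1]][, 1]
  age <- as.numeric(numextract(dir.table))
  ds.dir <- data.frame()
  for (i in 1:length(dir.table)) {
    str.i <- unlist(strsplit(dir.table[i], as.character(age[i])))
    str.i <- gsub(",", "", str.i)
    str.i <- sapply(strsplit(str.i, " "), function(x) paste(x[nchar(x) > 0], collapse = " "))
    ds.dir <- rbind(ds.dir, data.frame(TIC = v, NAME = str.i[1], TITLE = str.i[2], AGE = age[i]))
  }
  ds.dir
}
# usage: get_wsj_board("BAC")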
This article provides two approaches to retrieving the board of directors of a public company: parsing the firm's 10-K filings from the SEC EDGAR system and scraping the WSJ company-profile page.