In this lecture, we look at how we can use R to automate the process of extracting relevant fundamental data from the Singapore capital allocation database.
We begin by:
gc(); rm(list=ls()); graphics.off()
path_root <- "." # Sets the current working directory as the relative file path
path_output <- file.path(path_root, "output")
path_root <- ".."
path_data <- file.path(path_root, "data") # ./data
Download the xlsx files here: https://github.com/rnfermincota/academic/tree/main/research/traditional_assets/database/data/singapore
library(tidyxl) # Imports non-tabular data from Excel files.
library(tibble) # Stricter checking and better formatting than the traditional dataframe.
library(dplyr) # A manipulation tool for working with data frame like objects.
library(purrr) # A Complete functional programming toolkit for R.
library(furrr); plan(multisession) # Provides multicore parallel implementations of the purrr map() function.
library(unpivotr) # Tools for converting data from complex or irregular layouts to a columnar structure. https://nacnudus.github.io/unpivotr/
extract.sheets.f
FunctionA function for extracting all information contained in all sheets in
a specified Excel file. The excl
parameter can contain a
list of columns that should be excluded from being converted to
numeric type before the final dataframe is returned.
extract.sheets.f <- function(dsheet, excl=NULL){
out <- dsheet %>%
select(row, col, data_type, character, numeric, logical) %>% # 1
rectify %>% # 2
select(-c("row/col")) # 3
nm <- out %>% slice(1) %>% unlist(use.names=FALSE) # headings # 4
out <- set_names(out, nm) %>% slice(2:n()) # 5
if (!is.null(excl)){ # 6
out <- out %>% mutate(across(-excl, as.numeric))
}
return(out) # 7
}
Filter: use the select()
function to extract the
relevant columns from the Excel sheet.
Format: use the rectify()
function to display the
cells as though in a spreadsheet. See example:
https://github.com/nacnudus/unpivotr/blob/main/R/rectify.R
Remove: use the select()
function to remove the
row/col column.
Headings: Extract and save the first row (currently storing the
column headers) as nm
using the
slice()
function. The unlist(use.names=FALSE)
function is used to extract the column names as an atomic vector for
easier manipulation.
Set names: Set the column names of the out
dataframe
to our vector nm
and at the same time, remove the first row
that contains the initial column names.
IF: A conditional statement for excluding previously specified
columns from being converted to numeric
. The
across()
function allows us to transform several column at
once, instead of doing it manually, one by one.
Return: Returns the final dataframe.
We use the list.files()
function to extract the file
names within our “./data/xlsx” folder into a character
vector.
list_paths <- list.files(
path = file.path(path_data),
pattern = ".xlsx",
full.names = TRUE
)
We provide the following arguments:
path
: Directs the function to look within the
previously specified “./data/xlsx” directory.pattern
: Tells the function to pick out all Excel files
within the specified folder xlsx.full.names
: Prepends the specified directory path to
the file names.as.data.frame(list_paths) %>%
format.dt.f(.)
dat = enframe(list_paths, name = NULL, value = "path") %>%
mutate(sector=gsub(".xlsx", "", basename(path)))
First, we use the enframe()
function to convert the
list_paths
character vector into a one column dataframe
since name
parameter is set to NULL
(otherwise, an additional column of names is also created). The
value
parameter sets the name of the single column
returned.
Next, the mutate()
function is used to create a new
column named sector
. The sector
column is
created using the gsub()
function, which uses a regular
expression pattern to locate and substitute the filename
extension .xlsx for an empty string. The
basename()
function is used to remove the prepended file
path from each character string up to and including the last path
separator.
The table below shows a before and after effect of this function in its two columns respectively:
dat %>%
format.dt.f(.)
dat = dat %>%
mutate(dataset = future_map(path, xlsx_cells) %>%
set_names(sector))
Finally, we mutate the dataframe further by adding a new column
called dataset
which contains all data from all sheets
contained within the excel file specified in the corresponding
path
column. The xlsx_cells()
function is used
to import the data cell by cell in conjunction with the
future_map()
function which applies a particular function
to each element of a vector in parallel. The extracted sheets are stored
as tibbles
in the dataset
column.
In this section, we combine the extracts.sheets.f()
function with the dat
dataframe to extract all relevant
data from the earnings_debt sheets into a single dataframe.
First, we define a vector of column names as the variable
excl
for use in the extracts.sheets.f()
function.
excl <- c("country", "company_name", "industry_group")
Next,
singapore_industries <-
future_map(dat$dataset, function(inp){ # 1
extract.sheets.f(dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl)) %>%
slice(1) # 2
}) %>%
bind_rows(.) %>% # 3
arrange(industry_group) %>% # 4
arrange(desc(roic_cost_capital)) # 5
future_map()
function to iterate over the
dat
dataframe, extracting all sheets named
earnings_debt stored in the dataset
column
(remember that this column stores tibbles
). We use the
all_of()
function to make sure that every value in the
excl
vector is present.slice()
function to grab only the first row at
this time.bind_rows(.)
function to bind the individual rows into a
single dataframe which shows earnings_debt data for all
industries included in the “./data/xlsx” folder.industry group
in ascending order.roic_cost_capital
in descending order.A sample of the singapore_industries
dataframe can be
viewed
singapore_industries %>%
format.dt.f(.)
In this next chunk of code, we perform the same steps as above except
instead of only grabbing the first row from the
earnings_debt
sheet, we grabbed the remaining rows.
singapore_earnings_debt <-
future_map(dat$dataset, function(inp){
extract.sheets.f(dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl)) %>%
slice(2:n())
}) %>%
bind_rows(.) %>%
arrange(industry_group) %>%
arrange(industry_group, desc(roic_cost_capital))
format.dt.f(singapore_earnings_debt)
singapore_earnings_debt
dataframe that have a
dividend_yield
greater than \(0.05\).industry_group
and uses the slice()
function to return only the first
occurring result for each industry.singapore_earnings_debt %>%
dplyr::filter(dividend_yield>0.05) %>%
group_by(industry_group) %>%
slice(1) %>%
format.dt.f(.)
The code here is very similar to the Sheet
Extraction section, but instead we target the
cost_capital
sheets for extraction
excl <- c("company_name", "exchange_ticker", "industry_group", "country",
"reported_debt_rating", "current_debt_rating", "optimal_debt_rating",
"flag_bankruptcy", "flag_refinanced")
singapore_cost_capital <-
future_map(dat$dataset, function(inp){
if((inp$sheet %>% n_distinct) > 1){ # 1
extract.sheets.f(dplyr::filter(inp, sheet=="cost_capital"), all_of(excl)) %>%
mutate(
flag_bankruptcy=as.logical(flag_bankruptcy),
flag_refinanced=as.logical(flag_refinanced)
)
}
}) %>%
bind_rows(.) %>%
mutate(spread_optimal = current_debt_capital-optimal_debt_capital) %>% # 2
arrange(industry_group, spread_optimal)
There are two notable exceptions:
if
the corresponding tibble in
the dataset
column has more than one distinct sheet.spread_optimal
is created by
subtracting optimal_debt_capital
from
current_debt_capital
.format.dt.f(singapore_cost_capital)
In this section, we combine the data from our
singapore_earnings_debt
and
singapore_cost_capital
dataframes into one final dataframe
before filtering out any rows that either exceed or fall short of
certain metrics.
singapore_screener <- singapore_earnings_debt %>%
select( # 1
industry_group, company_name, dividend_yield, roe,
cost_equity, roe_excess_return=roe_cost_equity, roic,
cost_capital, roic_excess_return=roic_cost_capital) %>%
left_join( # 2
singapore_cost_capital %>%
select(company_name, current_debt_capital,
optimal_debt_capital, spread_optimal)) %>%
select(company_name, dividend_yield, roic_excess_return, roic, # 3
cost_capital, roe_excess_return, roe, cost_equity, spread_optimal,
current_debt_capital, optimal_debt_capital) %>%
dplyr::filter(dividend_yield > 0.01, roic_excess_return > 0.025, spread_optimal < 0) # 4
Use the select()
function to grab, and
simultaneously rename, the columns of interest in the
singapore_earnings_debt
dataframe
Perform a left_join()
with the
singapore_cost_capital
dataframe after selecting four
columns of interest. Since no column to join the two dataframes as
specified, the common column company_name
is used.
Once again, use the select()
function to keep only
the columns of interest
Use the filter()
function to keep only the rows that
pass ALL of the following conditions:
dividend_yield
> 0.01roic_excess_return
> 0.025spread_optimal
< 0format.dt.f(singapore_screener)
extract.sheets.f
to extract all sheets from an Excel
fileThis function combines and utilizes many of the tools we have already seen. We take things further by however, by extracting ALL sheets from a given Excel file and grouping them together in a database like format.
if (FALSE){
singapore_db <-
future_map(dat$dataset,function(inp){
excl = c("country", "company_name", "industry_group")
out = list(earnings_debt=extract.sheets.f(
dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl))) # 1
if ( (inp$sheet %>% n_distinct) > 1){
excl = c("company_name", "exchange_ticker", "industry_group", "country",
"reported_debt_rating", "current_debt_rating", "optimal_debt_rating",
"flag_bankruptcy", "flag_refinanced")
out = list_modify(
out,
cost_capital = extract.sheets.f(
dplyr::filter(inp, sheet == "cost_capital"), all_of(excl)) %>% # 2
mutate(
flag_bankruptcy=as.logical(flag_bankruptcy),
flag_refinanced=as.logical(flag_refinanced)
)
)
nm = inp %>% # 3
select(sheet) %>%
distinct(.) %>%
dplyr::filter(!sheet %in% c("earnings_debt", "cost_capital")) %>%
pull
out = list_modify( # 4
out,
optimal_mix = map(
nm,
~extract.sheets.f(dplyr::filter(inp, sheet == .x), NULL) %>%
mutate(across(c(where(is.character), -debt_rating), as.numeric))
) %>% set_names(nm)
)
}
out # 5
}
)
listviewer::jsonedit(singapore_db)
}
earnings_debt
sheet from the dataset
column.
This time however, store the extracted sheet in a list named
out
.cost_capital
sheet, and append it to
our master list of sheets out
.cost_capital
or earnings_debt
. This step uses
the pull()
function to extract a list of the remaining
sheets in the provided Excel file.extract.sheets()
function to extract the sheet
corresponding to that name, storing it in the optimal_mix
variable and appending all extracted sheets to our master list
out
.singapore_db
.We finish by saving our four dataframes as an RDA file. RDA files allow us to export and store our work in a format that is shareable and restorable in any future R session.
save(singapore_industries, singapore_earnings_debt, singapore_cost_capital,
singapore_screener, file=file.path(path_output, "singapore_fundamental_data.Rda"))
Finally, we remove all created objects and user-defined functions from the global environment.
rm(dat)
rm(singapore_industries, singapore_earnings_debt, singapore_cost_capital, singapore_screener)
rm(path_root, path_data, path_output, list_paths)
rm(excl, extract.sheets.f)
Singapore: Capital Allocation via Cost of Capital https://rpubs.com/rafael_nicolas/singapore_capital_allocation