1 Context

In this lecture, we look at how we can use R to automate the process of extracting relevant fundamental data from the Singapore capital allocation database.

2 Initialization

We begin by:

2.1 Clearing all pre-existing objects from the workspace

gc(); rm(list=ls()); graphics.off()

2.2 Initializing the file paths

path_root <- "." # Sets the current working directory as the relative file path
path_output <- file.path(path_root, "output") 

path_root <- ".."
path_data <- file.path(path_root, "data") # ./data

Download the xlsx files here: https://github.com/rnfermincota/academic/tree/main/research/traditional_assets/database/data/singapore

2.3 Loading the necessary libraries

library(tidyxl)   # Imports non-tabular data from Excel files.
library(tibble)   # Stricter checking and better formatting than the traditional dataframe.
library(dplyr)    # A manipulation tool for working with data frame like objects.
library(purrr)    # A Complete functional programming toolkit for R.
library(furrr); plan(multisession) # Provides multicore parallel implementations of the purrr map() function. 
library(unpivotr) # Tools for converting data from complex or irregular layouts to a columnar structure. https://nacnudus.github.io/unpivotr/

3 The extract.sheets.f Function

3.1 What?

A function for extracting all information contained in all sheets in a specified Excel file. The excl parameter can contain a list of columns that should be excluded from being converted to numeric type before the final dataframe is returned.

extract.sheets.f <- function(dsheet, excl=NULL){
  
  out <- dsheet %>% 
    select(row, col, data_type, character, numeric, logical) %>%      # 1
    rectify %>%                                                       # 2
    select(-c("row/col"))                                             # 3
  
  nm <- out %>% slice(1) %>% unlist(use.names=FALSE) # headings       # 4
  out <- set_names(out, nm) %>% slice(2:n())                          # 5
  
  if (!is.null(excl)){                                                # 6
    out <- out %>% mutate(across(-excl, as.numeric))
    }
  
  return(out)                                                         # 7
}

3.2 How?

  1. Filter: use the select() function to extract the relevant columns from the Excel sheet.

  2. Format: use the rectify() function to display the cells as though in a spreadsheet. See example: https://github.com/nacnudus/unpivotr/blob/main/R/rectify.R

  3. Remove: use the select() function to remove the row/col column.

  4. Headings: Extract and save the first row (currently storing the column headers) as nm using the slice()function. The unlist(use.names=FALSE) function is used to extract the column names as an atomic vector for easier manipulation.

  5. Set names: Set the column names of the out dataframe to our vector nm and at the same time, remove the first row that contains the initial column names.

  6. IF: A conditional statement for excluding previously specified columns from being converted to numeric. The across() function allows us to transform several column at once, instead of doing it manually, one by one.

  7. Return: Returns the final dataframe.

4 Extract File Names

We use the list.files() function to extract the file names within our “./data/xlsx” folder into a character vector.

list_paths <- list.files(
  path = file.path(path_data),
  pattern = ".xlsx",
  full.names = TRUE
  )

We provide the following arguments:

  1. path: Directs the function to look within the previously specified “./data/xlsx” directory.
  2. pattern: Tells the function to pick out all Excel files within the specified folder xlsx.
  3. full.names: Prepends the specified directory path to the file names.
as.data.frame(list_paths) %>% 
  format.dt.f(.)

5 Sheet Extraction Prep

5.1 Create list of sheet names

dat = enframe(list_paths, name = NULL, value = "path") %>%
      mutate(sector=gsub(".xlsx", "", basename(path)))

First, we use the enframe() function to convert the list_paths character vector into a one column dataframe since name parameter is set to NULL (otherwise, an additional column of names is also created). The value parameter sets the name of the single column returned.

Next, the mutate() function is used to create a new column named sector. The sector column is created using the gsub() function, which uses a regular expression pattern to locate and substitute the filename extension .xlsx for an empty string. The basename() function is used to remove the prepended file path from each character string up to and including the last path separator.

The table below shows a before and after effect of this function in its two columns respectively:

dat %>% 
  format.dt.f(.)

5.2 Extract sheets and store in column

dat = dat %>% 
      mutate(dataset = future_map(path, xlsx_cells) %>%
      set_names(sector))

Finally, we mutate the dataframe further by adding a new column called dataset which contains all data from all sheets contained within the excel file specified in the corresponding path column. The xlsx_cells() function is used to import the data cell by cell in conjunction with the future_map() function which applies a particular function to each element of a vector in parallel. The extracted sheets are stored as tibbles in the dataset column.

6 Sheet Extraction

6.1 What?

In this section, we combine the extracts.sheets.f() function with the dat dataframe to extract all relevant data from the earnings_debt sheets into a single dataframe.

6.2 How?

First, we define a vector of column names as the variable excl for use in the extracts.sheets.f() function.

excl <- c("country", "company_name", "industry_group")

Next,

singapore_industries <- 
    future_map(dat$dataset, function(inp){  # 1
               extract.sheets.f(dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl)) %>%
               slice(1)                     # 2
    }) %>%
    bind_rows(.) %>%                        # 3
    arrange(industry_group) %>%             # 4
    arrange(desc(roic_cost_capital))        # 5
  1. Use the future_map() function to iterate over the dat dataframe, extracting all sheets named earnings_debt stored in the dataset column (remember that this column stores tibbles). We use the all_of() function to make sure that every value in the excl vector is present.
  2. Use the slice() function to grab only the first row at this time.
  3. Since step (1) returns a list of 74 elements, we use the bind_rows(.) function to bind the individual rows into a single dataframe which shows earnings_debt data for all industries included in the “./data/xlsx” folder.
  4. Next, arrange the data alphabetically by industry group in ascending order.
  5. Finally, arrange the industries based on their roic_cost_capital in descending order.

A sample of the singapore_industries dataframe can be viewed

singapore_industries %>% 
  format.dt.f(.)

In this next chunk of code, we perform the same steps as above except instead of only grabbing the first row from the earnings_debt sheet, we grabbed the remaining rows.

singapore_earnings_debt <- 
    future_map(dat$dataset, function(inp){
               extract.sheets.f(dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl)) %>%
               slice(2:n())
  }) %>%
  bind_rows(.) %>%
  arrange(industry_group) %>%
  arrange(industry_group, desc(roic_cost_capital))
format.dt.f(singapore_earnings_debt)

7 Dividend Yield Filter

  1. Starts by selecting only rows from the singapore_earnings_debt dataframe that have a dividend_yield greater than \(0.05\).
  2. Groups the resulting filtered rows by industry_group and uses the slice() function to return only the first occurring result for each industry.
singapore_earnings_debt %>%
  dplyr::filter(dividend_yield>0.05) %>%
  group_by(industry_group) %>%
  slice(1) %>% 
  format.dt.f(.)

8 Capital Structure

The code here is very similar to the Sheet Extraction section, but instead we target the cost_capital sheets for extraction

excl <- c("company_name", "exchange_ticker", "industry_group", "country",
       "reported_debt_rating", "current_debt_rating", "optimal_debt_rating", 
       "flag_bankruptcy", "flag_refinanced")

singapore_cost_capital <- 
  future_map(dat$dataset, function(inp){
    if((inp$sheet %>% n_distinct) > 1){                                           # 1
        extract.sheets.f(dplyr::filter(inp, sheet=="cost_capital"), all_of(excl)) %>%
        mutate(
          flag_bankruptcy=as.logical(flag_bankruptcy),
          flag_refinanced=as.logical(flag_refinanced)
        )
    }
  }) %>%
  bind_rows(.) %>%
  mutate(spread_optimal = current_debt_capital-optimal_debt_capital) %>%           # 2
  arrange(industry_group, spread_optimal)

There are two notable exceptions:

  1. The sheet is only loaded if the corresponding tibble in the dataset column has more than one distinct sheet.
  2. An additional column spread_optimal is created by subtracting optimal_debt_capital from current_debt_capital.
format.dt.f(singapore_cost_capital)

9 Putting It All Together

In this section, we combine the data from our singapore_earnings_debt and singapore_cost_capital dataframes into one final dataframe before filtering out any rows that either exceed or fall short of certain metrics.

singapore_screener <- singapore_earnings_debt %>%
  select(                                                                              # 1
      industry_group, company_name, dividend_yield, roe, 
         cost_equity, roe_excess_return=roe_cost_equity, roic, 
         cost_capital, roic_excess_return=roic_cost_capital) %>%
  left_join(                                                                           # 2
    singapore_cost_capital %>% 
  select(company_name, current_debt_capital,                                            
         optimal_debt_capital, spread_optimal)) %>%
  select(company_name, dividend_yield, roic_excess_return, roic,                       # 3
         cost_capital, roe_excess_return, roe, cost_equity, spread_optimal, 
         current_debt_capital, optimal_debt_capital) %>%
  dplyr::filter(dividend_yield > 0.01, roic_excess_return > 0.025, spread_optimal < 0) # 4
  1. Use the select() function to grab, and simultaneously rename, the columns of interest in the singapore_earnings_debt dataframe

  2. Perform a left_join() with the singapore_cost_capital dataframe after selecting four columns of interest. Since no column to join the two dataframes as specified, the common column company_name is used.

  3. Once again, use the select() function to keep only the columns of interest

  4. Use the filter() function to keep only the rows that pass ALL of the following conditions:

    • dividend_yield > 0.01
    • roic_excess_return > 0.025
    • spread_optimal < 0
format.dt.f(singapore_screener)

10 Using the extract.sheets.f to extract all sheets from an Excel file

This function combines and utilizes many of the tools we have already seen. We take things further by however, by extracting ALL sheets from a given Excel file and grouping them together in a database like format.

if (FALSE){
  singapore_db <- 
    future_map(dat$dataset,function(inp){
      excl = c("country", "company_name", "industry_group")
      out = list(earnings_debt=extract.sheets.f(
                 dplyr::filter(inp, sheet=="earnings_debt"), all_of(excl)))       # 1
      
      if ( (inp$sheet %>% n_distinct) > 1){                                       
        excl = c("company_name", "exchange_ticker", "industry_group", "country",
               "reported_debt_rating", "current_debt_rating", "optimal_debt_rating", 
               "flag_bankruptcy", "flag_refinanced")
      
        out = list_modify(                                                        
          out,
          cost_capital = extract.sheets.f(
            dplyr::filter(inp, sheet == "cost_capital"), all_of(excl))  %>%      # 2
            mutate(
              flag_bankruptcy=as.logical(flag_bankruptcy),
              flag_refinanced=as.logical(flag_refinanced)
            )
        )

        nm = inp %>%                                                              # 3
          select(sheet) %>%
          distinct(.) %>%
          dplyr::filter(!sheet %in% c("earnings_debt", "cost_capital")) %>%
          pull

        out = list_modify(                                                        # 4
          out,
          optimal_mix = map(
            nm,
            ~extract.sheets.f(dplyr::filter(inp, sheet == .x), NULL) %>%
              mutate(across(c(where(is.character), -debt_rating), as.numeric))
          ) %>% set_names(nm)
        )
      }
      
      out                                                                         # 5
    }
  )
  listviewer::jsonedit(singapore_db)
}
  1. Just as we have done previously, extract the earnings_debt sheet from the dataset column. This time however, store the extracted sheet in a list named out.
  2. Next, extract the cost_capital sheet, and append it to our master list of sheets out.
  3. Create a vector of distinct sheet names that are not cost_capital or earnings_debt. This step uses the pull() function to extract a list of the remaining sheets in the provided Excel file.
  4. Then, for every sheet name that was collected in the previous step, use the extract.sheets() function to extract the sheet corresponding to that name, storing it in the optimal_mix variable and appending all extracted sheets to our master list out.
  5. Finally, Return the completed singapore_db.

11 Housekeeping

We finish by saving our four dataframes as an RDA file. RDA files allow us to export and store our work in a format that is shareable and restorable in any future R session.

save(singapore_industries, singapore_earnings_debt, singapore_cost_capital, 
     singapore_screener, file=file.path(path_output, "singapore_fundamental_data.Rda"))

Finally, we remove all created objects and user-defined functions from the global environment.

rm(dat)
rm(singapore_industries, singapore_earnings_debt, singapore_cost_capital, singapore_screener)
rm(path_root, path_data, path_output, list_paths)
rm(excl, extract.sheets.f)

12 References

Singapore: Capital Allocation via Cost of Capital https://rpubs.com/rafael_nicolas/singapore_capital_allocation