1 Objective

Our goal is to extract the security description table from each workbook and transform it into a list detailing each observation along with its variables and values. When finished, our output should be a nested list of security descriptions for COMP1 and COMP2, respectively, rendered with listviewer::jsonedit() at the end of this document.

We begin with two sheets: Summary within RNFC_COMP1.rds and Valuation Summary within RNFC_COMP2.rds.


1.1 COMP1

xlsx_obj1 <- file.path(path_data, "RNFC_COMP1") %>% readRDS

bag1 <- xlsx_obj1 %>% # bag of cells for COMP1
  filter(sheet == "Summary") %>% 
  select(row, col, data_type, numeric, date, 
         character, local_format_id, style_format)

bag1 %>% 
  rectify() %>% 
  format.dt.f(.)


1.2 COMP2

xlsx_obj2 <- file.path(path_data, "RNFC_COMP2") %>% readRDS

bag2 <- xlsx_obj2 %>% # bag of cells for COMP2
  filter(sheet == "Valuation Summary") %>% 
  select(row, col, data_type, numeric, date, 
         character, local_format_id, style_format)

bag2 %>% 
  rectify() %>% 
  format.dt.f(.)


2 Isolating the Data

To start off, we need to capture the data between rows \(16\) and \(28\) for COMP1 and rows \(18\) and \(31\) for COMP2. To do so, we can determine the corners of the table we wish to extract and then filter out the unneeded data using their coordinates. Finding the top left corner is simple since the tables on both sheets are titled Security Description. To grab the row and column of this cell, we can perform an inner join, joining a single-value data frame containing Security Description to the bag by its character variable. This returns the observation in the bag holding the row and column of our table's top left cell, along with additional information about that cell.


2.1 Top


2.1.1 COMP1

Below we can see that the table heading cells for COMP1 have a local_format_id of \(168\). The next cell in column \(3\) that shares the same format as the top left corner marks the heading of the following table, so we can filter for that cell and then count back two rows to ensure data from the next table is not included in our bag of cells.

# Top left corner COMP1
idx_tl1 <- bag1 %>% 
  regex_inner_join(
    "Security Description" %>% enframe(name = NULL), 
    by = c("character" = "value")
  ) %>% 
  select(-value)

idx_tl1 %>% 
  select(row, col, character, local_format_id, style_format)
# A tibble: 1 × 5
    row   col character             local_format_id style_format      
  <int> <int> <chr>                           <int> <chr>             
1    16     3 Security Description:             168 "Norm\u0a4e\u0a4e"


2.1.2 COMP2

With COMP2, the local_format_id does not help us locate these positions; however, the style_format does. Looking at this information, we can see that the headings have a distinct style id, so we can filter for the first row below the top left index that does not share that style_format. Luckily, these two filter conditions do not interfere with each other across the comp sheets, so they can be combined with an OR operator; if more spreadsheets are added and the combined condition becomes an issue, the condition could instead be selected per workbook via xlsx_toggle (see the sketch at the end of section 2.2.2).

idx_tl2 <- bag2 %>% 
  regex_inner_join(
    "Security Description" %>% enframe(name = NULL), 
    by = c("character" = "value")
  ) %>% 
  select(-value)

idx_tl2 %>% 
  select(row, col, character, local_format_id, style_format)
# A tibble: 1 × 5
    row   col character             local_format_id style_format                
  <int> <int> <chr>                           <int> <chr>                       
1    18     3 Security Description:              90 Normal_EVERTEC - Q4 Reporti…


2.2 Bottom

Next, we need to find the bottom of the tables. Because there is no irrelevant data to the right of our tables, we only need to find the last row. This is a little trickier than finding the top left cell since the two sheets do not have identical West headings. One method we can employ is looking at the style formatting to see whether the table can be isolated through its local_format_id or style_format. To make the style data easily readable, we can set the character values of the bag to the format id and then rectify it, as sketched below.
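
For example, a minimal way to view that formatting grid might look like the following (a sketch only; it assumes bag1 from above and simply overwrites each cell's displayed value with its local_format_id so that rectify() lays the ids out in their sheet positions):

bag1 %>% 
  mutate(
    character = as.character(local_format_id), # show the format id in every cell
    data_type = "character"                    # so rectify() displays the character column
  ) %>% 
  rectify() %>% 
  format.dt.f()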


2.2.1 COMP1

# Bottom Left Corner COMP1
idx_bl1 <- bag1 %>% 
  filter(
    col == 3 & row > idx_tl1$row & 
      (local_format_id == idx_tl1$local_format_id | style_format != idx_tl1$style_format)
  ) %>% 
  slice(1) %>% 
  select(row) %>% 
  mutate(row = row - 2)

idx_bl1
# A tibble: 1 × 1
    row
  <dbl>
1    28


2.2.2 COMP2

# Bottom Left Corner COMP2
idx_bl2 <- bag2 %>% 
  filter(
    col == 3 & row > idx_tl2$row & 
      (local_format_id == idx_tl2$local_format_id | style_format != idx_tl2$style_format)
  ) %>% 
  slice(1) %>% 
  select(row) %>% 
  mutate(row = row - 2)

idx_bl2
# A tibble: 1 × 1
    row
  <dbl>
1    31
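
If more workbooks are added and the shared OR condition above stops being safe, the bottom-row condition could instead be chosen explicitly per workbook. A minimal sketch, reusing the xlsx_toggle convention that appears later in this document (the bottom_row helper itself is hypothetical, not part of the pipeline):

# Hypothetical helper: pick the bottom-of-table rule based on xlsx_toggle
bottom_row <- function(cells, idx_tl, xlsx_toggle) {
  cells <- cells %>% filter(col == idx_tl$col, row > idx_tl$row)
  if (xlsx_toggle == 1) {
    # COMP1: next cell sharing the top left corner's local format
    cells <- cells %>% filter(local_format_id == idx_tl$local_format_id)
  } else {
    # COMP2: first cell whose style_format differs from the top left corner's
    cells <- cells %>% filter(style_format != idx_tl$style_format)
  }
  cells %>% slice(1) %>% select(row) %>% mutate(row = row - 2)
}

With this in place, idx_bl1 and idx_bl2 would become bottom_row(bag1, idx_tl1, 1) and bottom_row(bag2, idx_tl2, 2).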


Now that we know the rows where our security description table starts and ends in each sheet and the column it starts in, we can filter out all other data from our bag.

bag1 <- bag1 %>% 
  filter(
    row >= idx_tl1$row,
    row <= idx_bl1$row,
    col >= idx_tl1$col
  )

bag2 <- bag2 %>% 
  filter(
    row >= idx_tl2$row,
    row <= idx_bl2$row,
    col >= idx_tl2$col
  )


3 Extracting and Cleaning the Data

Our next goal is to perform some final cleanup so that the data types match the target output and the table can be interpreted by the behead and pack functions from the unpivotr package. In COMP2 we can remove row \(20\), since it contains extra headings that we don't want picked up by unpivotr::behead(). The date values in the bag also need to be converted to characters in both sheets to match the target. We then filter out the blanks so we are only left with values and their relevant North and West headings (behead() usually filters out blanks on its own, but the number of NAs here may be throwing it off). Lastly, we can deselect the style formatting columns since we no longer need them. The resulting cleaned tables, ready for the unpivotr functions, are shown below for COMP1 and COMP2:


3.1 Extracting


3.1.1 COMP1

bag1 <- bag1 %>%
  mutate(date = as.character(date)) %>%
  filter(data_type != "blank") %>%
  select(-local_format_id, -style_format)

bag1 %>% 
  rectify() %>% 
  format.dt.f()


3.1.2 COMP2

xlsx_toggle <- 2
if (xlsx_toggle == 2) { bag2 <- bag2 %>% filter(row != 20) }

bag2 <- bag2 %>%
  mutate(date = as.character(date)) %>%
  filter(data_type != "blank") %>%
  select(-local_format_id, -style_format)

bag2 %>% 
  rectify() %>% 
  format.dt.f()


3.2 Cleaning


Next we can begin stripping the headings, assigning each value to a North and a West heading. Each value belongs to a specific security id and to an element that describes an aspect of that security. In the tables above, the North headings indicate the security id, while the West headings represent the element of that security the value describes.


3.2.1 Strip


We can use behead() to strip these headings so that only the values remain. The information is not lost, however: the headings are assigned to new columns in the bag tibble.


t_security1

t_security1 <- bag1 %>% 
   behead(direction = "N", name = "security_id") %>%
   behead(direction = "W", name = "element")

t_security1 %>% format.dt.f()


t_security2

t_security2 <- bag2 %>% 
   behead(direction = "N", name = "security_id") %>%
   behead(direction = "W", name = "element")

t_security2 %>% format.dt.f()


3.2.2 Match


The security_id and element values will become the outer list divisions and the element names in the final output, so we want to clean them up now to match the target. We can do this with the clean_names() function.

clean_names <- function(old_names){
  new_names <- old_names %>%
    gsub("'", "", .) %>% # remove quotation marks
    gsub("\"", "", .) %>% # remove quotation marks
    gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
    # ".percent" a valid name. The "." will be replaced in the call to to_any_case
    # via the preprocess argument anyway.
    gsub("^[ ]+", "", .) %>% # remove leading spaces
    make.names(.) %>%
    # Handle dots, multiple underscores, case conversion, string transliteration
    snakecase::to_any_case(
      case = "snake" #, 
      # preprocess = "\\.",
      # replace_special_characters = c("Latin-ASCII")
    )
  new_names
}
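
As a quick illustration of what this produces, the following call on two made-up headings (hypothetical strings, not taken from the workbooks) should yield snake_case names:

clean_names(c("Security Description:", "Coupon Rate"))
# expected: "security_description" "coupon_rate"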


t_security1

t_security1 <- t_security1 %>% 
   mutate(
      element = element %>% clean_names(),
      security_id = security_id %>% clean_names()
    )

t_security1 %>% format.dt.f()


t_security2

t_security2 <- t_security2 %>% 
   mutate(
      element = element %>% clean_names(),
      security_id = security_id %>% clean_names()
    )

t_security2 %>% format.dt.f()


3.2.3 Combine


Next, we can use unpivotr::pack() to combine the different data type columns into a single column. This will allow us to extract them all into one list, matching the target output.


t_security1

t_security1 <- t_security1 %>% 
   pack() %>%
   select(security_id, element, value)

t_security1 %>% format.dt.f()


t_security2

t_security2 <- t_security2 %>%
   pack() %>%
   select(security_id, element, value)

t_security2 %>% format.dt.f()


4 Transforming the Data


Our next step is to convert our table to match the output format we are targeting. We currently have a table that includes security_id (which will divide the target list), element (the names of the values in the target list), and value (the values to include in that list). We can extract the value list by calling t_security1$value and then passing it through purrr::set_names() with the element column of t_security1 as the names argument; the resulting named list can then be split by security_id, making it a list of lists that matches the target output (a toy illustration of this split step follows). Finally, the outer list is renamed using each security's cleaned description element so that the list names match the target.
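
As a toy illustration of the set_names/split step (the values and names here are made up, not taken from the workbooks):

vals <- list("5.25%", "2030-06-01", "4.75%", "2028-12-15") %>%
  purrr::set_names(c("coupon", "maturity", "coupon", "maturity"))

split(vals, c("sec_1", "sec_1", "sec_2", "sec_2"))
# Returns a list of two lists: $sec_1 holds coupon and maturity for the first
# security, and $sec_2 holds the same fields for the second.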

# Extract values from table into list of lists (match previous output)
security1 <- t_security1$value %>%
  purrr::set_names(t_security1$element) %>%
  split(t_security1$security_id) %>% # list() %>%
  set_names(
    t_security1 %>% 
      filter(element == "description") %>% 
      pull() %>% 
      map_chr(~.x) %>%
      clean_names()
  )

security2 <- t_security2$value %>%
  purrr::set_names(t_security2$element) %>%
  split(t_security2$security_id) %>% # list() %>%
  set_names(
    t_security2 %>% 
      filter(element == "description") %>% 
      pull() %>% 
      map_chr(~.x) %>%
      clean_names()
  )

list(security1, security2) %>%
  set_names(str_replace(names(scenarios), "RNFC_", "")) %>%
  listviewer::jsonedit()
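
Individual fields can then be pulled back out of the nested list by security and element name (a hypothetical lookup; depending on how pack() stored the cell, the value may come back wrapped in a one-element list):

security1[[1]][["description"]] # description field of the first security in COMP1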