1 Objective

Our aim in this exercise is to extract the business description elements from each workbook and transform it into a named list. Our target results should replicate the following for COMP1 and COMP2, respectively:

path_root = file.path("..", "..")
path_code = file.path(path_root, "libs")
path_data = file.path(path_root, "data")
path_output=file.path(path_root, "output")
load(file.path(path_output, "target_output.rda"))

xlsx_obj1 <- file.path(path_data, "RNFC_COMP1") %>% readRDS
xlsx_obj2 <- file.path(path_data, "RNFC_COMP2") %>% readRDS

list(
  scenarios$RNFC_COMP1$business,
  scenarios$RNFC_COMP2$business
) %>% 
  set_names(str_replace(names(scenarios), "RNFC_", "")) %>%
  listviewer::jsonedit()

2 Isolating the Data

We begin with selecting a bag of cells located in our two sheets: Summary within RNFC_COMP1.rds and Valuation Summary within RNFC_COMP2.rds.

2.1 COMP1

xlsx_obj1 %>% 
  filter(sheet == "Summary") %>% 
  rectify() %>% 
  format.dt.f(.)

2.2 COMP2

xlsx_obj2 %>% 
  filter(sheet == "Valuation Summary") %>% 
  rectify() %>% 
  format.dt.f(.)

3 Extracting the Data

The relevant information lies between rows \(1\) and \(16\), and we will select the row, col, character, and data_type so that we can use the data with unpivotr functions. To make sure the headings are clean and match the target, we also need to trim any present white space from the end of character values.

# Set up and clean bag
bag=xlsx_obj1 %>% filter(sheet == "Summary") %>% 
  filter(!is.na(character), row<=16) %>% 
  select(row, col, character, data_type)

# This leaves us with the following bags of cells:
bag$character <- trimws(bag$character, which = c("right"))

Examining the cells selected for each scenario, we can see the data we need to pull is scattered amongst many small tables. To gather the data up so that we can extract it, we can use the unpivotr::partition function. Given the top left corners of each of the tables, the function will be able to partition and store each individual table for us to pull from afterwards. We know that the character values in the top lefts of each table end with :, so we can filter the bag to find the rows and columns for each of these table corners.

# Create Partitions
idx = bag %>% 
  filter(character %>% endsWith(":"))

idx
# A tibble: 13 × 4
     row   col character             data_type
   <int> <int> <chr>                 <chr>    
 1     5     3 Business Description: character
 2     5    14 Deal Team:            character
 3     5    16 Monitoring Team:      character
 4     8     3 Sponsors:             character
 5     8    10 RNFC Board Rights:    character
 6     9     3 Sponsor Main Contact: character
 7     9    10 Info Rights:          character
 8    10     3 Company Main Contact: character
 9    11     3 Mezz Co-Investors:    character
10    12     3 Company Year End:     character
11    13     3 Admin Agent:          character
12    14     3 Advisors:             character
13    16     3 Security Description: character

We can feed the locations of these corners into the unpivotr::partition function, providing it with the original bag and the idx table.

desc_partitions = partition(bag, idx)
desc_partitions %>% listviewer::jsonedit()

The output table of this function will give us exactly what we need, we just need to extract it. The character column will be the names we need for our list and the cells tibbles will be the corresponding list values.

First we will grab the names and store them in a vector. To do this, we will use clean_names() on the character values and then select and store the results.

clean_names <- function(old_names){
  new_names <- old_names %>%
    gsub("'", "", .) %>% # remove quotation marks
    gsub("\"", "", .) %>% # remove quotation marks
    gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
    # ".percent" a valid name. The "." will be replaced in the call to to_any_case
    # via the preprocess argument anyway.
    gsub("^[ ]+", "", .) %>%
    make.names(.) %>%
    # Handle dots, multiple underscores, case conversion, string transliteration
    snakecase::to_any_case(
      case = "snake" #, 
      # preprocess = "\\.",
      # replace_special_characters = c("Latin-ASCII")
    )
  new_names
}

# Generate Headings
headings <- desc_partitions %>%
  mutate(character = character %>% clean_names()) %>%
  pull(character)

headings
 [1] "business_description" "deal_team"            "monitoring_team"     
 [4] "sponsors"             "rnfc_board_rights"    "sponsor_main_contact"
 [7] "info_rights"          "company_main_contact" "mezz_co_investors"   
[10] "company_year_end"     "admin_agent"          "advisors"            
[13] "security_description"

Next we will grab the values for our output list. To do this we need to extract the values in each cells tibble; to figure out how to go about this we need to first examine the structure of the tibbles stored in cells. We will pull and examine the first tibble from cells to get an idea of what we are working with:

desc_partitions %>% 
  select(cells) %>% 
  slice(1) %>% 
  pull()
[[1]]
# A tibble: 2 × 4
    row   col character                                                  data_…¹
  <int> <int> <chr>                                                      <chr>  
1     5     3 Business Description:                                      charac…
2     6     3 COMP1 is a top N insurance broker in the U.S., focusing o… charac…
# … with abbreviated variable name ¹​data_type

As we can see, each tibble stores the row, col, character, and data_type. The values we want to extract are located in character, but we must exclude first character value in each tibble since it represents the title. Thus, we will select the character column of the tibble and then slice it from value \(2\) onwards. We will then convert it to a list and flatten it so it is ready to be stored in our output list. We can make this process into a function and then map it across all tibbles in the ‘cells’ column of desc_partitions. Since we only need the values in cells, we can deselect the other columns. We also need to keep in mind that the number of partitions exceeds \(12\) in both cases. Later on we will have to select only the desired ones.

# Generate List of Element Descriptions
extract_character <- function(cells) {
  # desc_partitions$cells[[1]] %>%
  cells %>%
    #select(character) %>%
    slice(2:n()) %>%
    pull(character)
    #as.list() %>%
    #unlist() %>%
    #unname()
}
business <- desc_partitions %>%
  mutate(cells = map(cells, extract_character)) %>%
  select(cells)
business$cells
[[1]]
[1] "COMP1 is a top N insurance broker in the U.S., focusing on providing brokerage and consulting services in the Commercial Property & Casualty, Employee Benefits and Specialty / Program segments to middle to upper-middle market accounts"

[[2]]
[1] "Nico"  "Baran"

[[3]]
[1] "Nico"  "Baran"

[[4]]
[1] "XYZ"

[[5]]
[1] "Board Observer Rights"

[[6]]
[1] "XYZ"

[[7]]
[1] "- Quarterly unaudited financials: 45 days after each fiscal quarter"

[[8]]
[1] "n/a"

[[9]]
[1] "XYZ"

[[10]]
[1] "31 December"

[[11]]
[1] "RNFC / XYZ"

[[12]]
[1] "XYZ"

[[13]]
[1] "Security 1"

4 Transforming the Data

We now have the list values for our target output, however we need to set the names and remove irrelevant list items. First we will clean up some discrepancies between our headings vector and the target output headings. For COMP1, we need to change mezz_co_investors: to co_investors and admin_agent to agent. For COMP2, we need to fix agent, advisors, and sponsor_main_contact headings. Once we set the headings correctly, we can run set_names to assign the headings to business$cells and save them to our business list. Because the two sheets are different, there are additional unneeded elements that were created by unpivotr::partition, so we will select the twelve desired ones from each sheet and save it for our final output.

xlsx_toggle = 1

if(xlsx_toggle==1){
  # Fix and Set Names
  headings[9] <- "co_investors"
  headings[11] <- "agent"
  business <- purrr::set_names(business$cells, headings)
  # Select Desired Elements
  business <- business[c(1,2,3,4,6,9,10,11,12)]
} else if(xlsx_toggle==2){
  # Fix and Set Names
  headings[12] <- "agent"
  headings[13] <- "advisors"
  headings[6] <- "sponsor_main_contact"
  business <- purrr::set_names(business$cells, headings)
  # Select Desired Elements
  business <- business[c(1,2,3,4,6,10,11,12,13)]
}

We are then left with our final output, which matches the target named list.