Our aim in this exercise is to extract the business description
elements from each workbook and transform it into a named list. Our
target results should replicate the following for COMP1
and
COMP2
, respectively:
path_root = file.path("..", "..")
path_code = file.path(path_root, "libs")
path_data = file.path(path_root, "data")
path_output=file.path(path_root, "output")
load(file.path(path_output, "target_output.rda"))
xlsx_obj1 <- file.path(path_data, "RNFC_COMP1") %>% readRDS
xlsx_obj2 <- file.path(path_data, "RNFC_COMP2") %>% readRDS
list(
scenarios$RNFC_COMP1$business,
scenarios$RNFC_COMP2$business
) %>%
set_names(str_replace(names(scenarios), "RNFC_", "")) %>%
listviewer::jsonedit()
We begin with selecting a bag of cells located in our two sheets:
Summary within RNFC_COMP1.rds
and
Valuation Summary within
RNFC_COMP2.rds
.
xlsx_obj1 %>%
filter(sheet == "Summary") %>%
rectify() %>%
format.dt.f(.)
xlsx_obj2 %>%
filter(sheet == "Valuation Summary") %>%
rectify() %>%
format.dt.f(.)
The relevant information lies between rows \(1\) and \(16\), and we will select the row,
col, character, and data_type so that we can
use the data with unpivotr
functions. To make sure the
headings are clean and match the target, we also need to trim any
present white space from the end of character values.
# Set up and clean bag
bag=xlsx_obj1 %>% filter(sheet == "Summary") %>%
filter(!is.na(character), row<=16) %>%
select(row, col, character, data_type)
# This leaves us with the following bags of cells:
bag$character <- trimws(bag$character, which = c("right"))
Examining the cells selected for each scenario, we can see the data
we need to pull is scattered amongst many small tables. To gather the
data up so that we can extract it, we can use the
unpivotr::partition function. Given the top left
corners of each of the tables, the function will be able to partition
and store each individual table for us to pull from afterwards. We know
that the character values in the top lefts of each table end with
:
, so we can filter the bag to find the rows and columns
for each of these table corners.
# Create Partitions
idx = bag %>%
filter(character %>% endsWith(":"))
idx
# A tibble: 13 × 4
row col character data_type
<int> <int> <chr> <chr>
1 5 3 Business Description: character
2 5 14 Deal Team: character
3 5 16 Monitoring Team: character
4 8 3 Sponsors: character
5 8 10 RNFC Board Rights: character
6 9 3 Sponsor Main Contact: character
7 9 10 Info Rights: character
8 10 3 Company Main Contact: character
9 11 3 Mezz Co-Investors: character
10 12 3 Company Year End: character
11 13 3 Admin Agent: character
12 14 3 Advisors: character
13 16 3 Security Description: character
We can feed the locations of these corners into the unpivotr::partition function, providing it with the original bag and the idx table.
desc_partitions = partition(bag, idx)
desc_partitions %>% listviewer::jsonedit()
The output table of this function will give us exactly what we need, we just need to extract it. The character column will be the names we need for our list and the cells tibbles will be the corresponding list values.
First we will grab the names and store them in a vector. To do this, we will use clean_names() on the character values and then select and store the results.
clean_names <- function(old_names){
new_names <- old_names %>%
gsub("'", "", .) %>% # remove quotation marks
gsub("\"", "", .) %>% # remove quotation marks
gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
# ".percent" a valid name. The "." will be replaced in the call to to_any_case
# via the preprocess argument anyway.
gsub("^[ ]+", "", .) %>%
make.names(.) %>%
# Handle dots, multiple underscores, case conversion, string transliteration
snakecase::to_any_case(
case = "snake" #,
# preprocess = "\\.",
# replace_special_characters = c("Latin-ASCII")
)
new_names
}
# Generate Headings
headings <- desc_partitions %>%
mutate(character = character %>% clean_names()) %>%
pull(character)
headings
[1] "business_description" "deal_team" "monitoring_team"
[4] "sponsors" "rnfc_board_rights" "sponsor_main_contact"
[7] "info_rights" "company_main_contact" "mezz_co_investors"
[10] "company_year_end" "admin_agent" "advisors"
[13] "security_description"
Next we will grab the values for our output list. To do this we need to extract the values in each cells tibble; to figure out how to go about this we need to first examine the structure of the tibbles stored in cells. We will pull and examine the first tibble from cells to get an idea of what we are working with:
desc_partitions %>%
select(cells) %>%
slice(1) %>%
pull()
[[1]]
# A tibble: 2 × 4
row col character data_…¹
<int> <int> <chr> <chr>
1 5 3 Business Description: charac…
2 6 3 COMP1 is a top N insurance broker in the U.S., focusing o… charac…
# … with abbreviated variable name ¹data_type
As we can see, each tibble stores the row, col, character, and data_type. The values we want to extract are located in character, but we must exclude first character value in each tibble since it represents the title. Thus, we will select the character column of the tibble and then slice it from value \(2\) onwards. We will then convert it to a list and flatten it so it is ready to be stored in our output list. We can make this process into a function and then map it across all tibbles in the ‘cells’ column of desc_partitions. Since we only need the values in cells, we can deselect the other columns. We also need to keep in mind that the number of partitions exceeds \(12\) in both cases. Later on we will have to select only the desired ones.
# Generate List of Element Descriptions
extract_character <- function(cells) {
# desc_partitions$cells[[1]] %>%
cells %>%
#select(character) %>%
slice(2:n()) %>%
pull(character)
#as.list() %>%
#unlist() %>%
#unname()
}
business <- desc_partitions %>%
mutate(cells = map(cells, extract_character)) %>%
select(cells)
business$cells
[[1]]
[1] "COMP1 is a top N insurance broker in the U.S., focusing on providing brokerage and consulting services in the Commercial Property & Casualty, Employee Benefits and Specialty / Program segments to middle to upper-middle market accounts"
[[2]]
[1] "Nico" "Baran"
[[3]]
[1] "Nico" "Baran"
[[4]]
[1] "XYZ"
[[5]]
[1] "Board Observer Rights"
[[6]]
[1] "XYZ"
[[7]]
[1] "- Quarterly unaudited financials: 45 days after each fiscal quarter"
[[8]]
[1] "n/a"
[[9]]
[1] "XYZ"
[[10]]
[1] "31 December"
[[11]]
[1] "RNFC / XYZ"
[[12]]
[1] "XYZ"
[[13]]
[1] "Security 1"
We now have the list values for our target output, however we need to
set the names and remove irrelevant list items. First we will clean up
some discrepancies between our headings
vector and the
target output headings. For COMP1
, we need to change
mezz_co_investors: to co_investors and
admin_agent to agent. For COMP2
, we need
to fix agent, advisors, and
sponsor_main_contact headings. Once we set the headings
correctly, we can run set_names to assign the headings to
business$cells
and save them to our business
list. Because the two sheets are different, there are additional
unneeded elements that were created by
unpivotr::partition, so we will select the twelve
desired ones from each sheet and save it for our final output.
xlsx_toggle = 1
if(xlsx_toggle==1){
# Fix and Set Names
headings[9] <- "co_investors"
headings[11] <- "agent"
business <- purrr::set_names(business$cells, headings)
# Select Desired Elements
business <- business[c(1,2,3,4,6,9,10,11,12)]
} else if(xlsx_toggle==2){
# Fix and Set Names
headings[12] <- "agent"
headings[13] <- "advisors"
headings[6] <- "sponsor_main_contact"
business <- purrr::set_names(business$cells, headings)
# Select Desired Elements
business <- business[c(1,2,3,4,6,10,11,12,13)]
}
We are then left with our final output, which matches the target named list.