Our goal is to extract the public comparables table from each workbook and transform it into a list of vectors that each contain all the observations corresponding to a heading. When finished it should look like the following:
Observing the summary_cells
below, we can see that there
is one table with comparable data we need to extract for
COMP1
and two tables in COMP2
.
To simplify it, we will focus on making a function to isolate and clean the data for the first table and then modify it so that it can be applied to grab all three tables we are interested in.
First, we will need to give our function an input that it can use to locate and isolate the correct table. All three of these tables have a corresponding table heading, so we can provide it with the data in the top left corner cell; we will use the ‘corner’ parameter to narrow our bag further later.
To begin the function, we will locate the general area of the
summary_cells
our data is located in so that we are not
searching the entire sheet. We will do this with an
inner_join
to find the cell where “Valuation Metric #2” is
located, since this demarks the start of the section on both
COMP1
and COMP2
.
xlsx_toggle=1
if (xlsx_toggle==1) {
xlsx_obj=file.path(path_data, "RNFC_COMP1") %>% readRDS
summary_cells=xlsx_obj %>%
filter(sheet == "Summary")
} else {
xlsx_obj <- file.path(path_data, "RNFC_COMP2") %>% readRDS
summary_cells=xlsx_obj %>% # bag of cells for COMP2
filter(sheet == "Valuation Summary")
}
bag=summary_cells %>%
select(row, col, character, numeric, date, data_type, local_format_id)
idx=bag %>%
regex_inner_join(
"Valuation Metric #2" %>% as_tibble,
by = c("character"="value")
) %>%
select(-value)
Next, we will find where the section ends by using the
local_format_id
to find the next cell that matches the
formatting of the “Valuation Metric #2” cell. We will set this cell to
idx_2 and it marks where the next section begins.
idx_2=bag %>%
filter(
data_type!="numeric" & row>idx$row &
(local_format_id==idx$local_format_id |
local_format_id==idx$local_format_id+1)
) %>%
slice(1)
Now we will change the upper index to where the subtable begins. We will find this using the corner parameter, filtering for the string in the bag of cells; finally, we will filter for the min row to return the first time we find the string, since this will be where the table begins. With the top left corner of the subtable located, we can filter the bag down.
idx=bag %>%
filter(row>=idx$row) %>%
regex_inner_join(
str_glue("^{corner}$") %>% as_tibble,
by = c("character"="value")
) %>%
select(row, col, character, local_format_id) %>%
filter(row==min(row))
bag=bag %>%
filter(
row>=idx$row, col>=idx$col
) %>%
select(row, col, character, numeric, date, data_type, local_format_id)
We will then fix up the bottom index by cutting it back to the bottom of the table. Since it now sits at the beginning of the next section, we know it has to come back up to the first data entry and also skip any extra rows at the bottom of our tables with source data or averages. For some tables the last row includes the string “Source:” and others end with “Mean”, so we can detect where those cells are located in the narrowed bag. Lastly, some tables have a row of \(NA\) above the source or mean row, so we can filter the heading column in the bag for \(NA\)s and take the tail value to find the gap between the last row and where the data ends, if it exists. This will be our final idx_2.
idx_2=bag %>%
filter(
str_detect(character, "Source:|Mean .")
) %>%
select(row, col) %>% slice(1)
bag=bag %>%
filter(
row>=idx$row,
row<idx_2$row,
col>=idx$col
) %>%
select(row, col, character, numeric, date, data_type)
idx_2=bag %>%
filter(
col == min(bag$col),
!is.na(character)
) %>%
select(row, col) %>% tail(1)
With these steps complete, we know our function will now have determined where the top left cell of the table corresponding to the given corner is as well as the cell below bottom left of the table. We can use this to filter for our final bag.
if (xlsx_toggle==1) {
bag=bag %>%
filter(
row>=idx$row,
row<=idx_2$row,
col>=idx$col
) %>%
select(row, col, character, numeric, date, data_type)
} else { # if (xlsx_toggle==2)
bag=summary_cells %>%
filter(
row>=idx$row-1,
row<=idx_2$row,
col>=idx$col
) %>%
select(row, col, character, numeric, date, data_type)
}
The next part of the function is where the isolated data will be
cleaned up and transformed into our output. We begin by stripping the
northern headings and assigning them to the data with
behead
. For COMP2
, there are two rows of
headings, and so we need to perform behead twice for any subsets that
are from COMP2
. As a final step, we can remove any values
that do not have a heading, as these are just empty cells to the right
of our table. We also need to remove the ticker column from
COMP2
by filtering it out.
if(xlsx_toggle==2){
bag=bag %>%
behead("N", name=outer_heading)
}
data=bag %>%
behead("N", name=heading) %>%
mutate(heading=heading %>% clean_names) %>%
filter(heading!="na")
if(xlsx_toggle==2){data=data %>% filter(heading!="ticker")}
Continuing on, we will convert any “–” character values to \(NA\), convert the date column to character
to prepare it for unpivotr::pack
, and then combine the
outer_heading
and heading for subtables in
COMP2
. We can do each of these steps with a combination of
mutate
and map
with various anonymous
functions. The map
function chosen will depend on what data
is needed by the anonymous function; for converting "-- "
to \(NA\) and dates to characters, we
can use a simple map_chr
to ensure the output is of the
character type.
data=data %>%
mutate(
character=map_chr(
character,
~ifelse(.=="-- ", NA_real_, .))
) %>%
mutate(date=map_chr(date, as.character))
For combining the outer_heading
, we will use a
pmap_chr
, providing it with the piped data and telling it
to combine the string in variable \(7\)
and \(8\) (outer_heading
and heading
) provided that the outer_heading
is not \(NA\). We will also put the new
string through clean_names
and then deselect the
outer_heading
.
if(xlsx_toggle==2){
data=data %>%
mutate(
heading=pmap_chr(.,
~ifelse(!is.na(..7), paste(..7, ..8), ..8) %>%
clean_names
)
) %>%
select(-outer_heading)
}
Finally, we will gather the separated data type columns into one
value column using pack
and then transform the output into
a list of vectors so that it conforms with the target output. After the
unpivotr::pack
function, we need to first split
data$value
by column, since each property is in a different
column and then set_names
of each list using the unique
values of heading in data. Lastly, we will convert our list of lists
into a list of vectors by iteratively unlisting and running each list
through as.vector
.
data=data %>%
pack()
public_comparables=data$value %>%
split(data$col) %>%
set_names(data$heading %>% unique) %>%
lapply(unlist) %>%
lapply(as.vector) %>%
list()
All together, the match.f
function will look like
this:
print(match.f)
function(corner, summary_cells, xlsx_toggle=1){
# CREATE BAG
bag=summary_cells %>%
select(row, col, character, numeric, date, data_type, local_format_id)
# Set idx (index) to Valuation Metric #2
idx=bag %>%
regex_inner_join(
"Valuation Metric #2" %>% as_tibble,
by = c("character"="value")
) %>%
select(-value)
# Set idx_2 to next Valuation Metric table or metric subtable
idx_2 = bag %>%
filter(
data_type!="numeric" & row>idx$row &
(local_format_id==idx$local_format_id |
local_format_id==idx$local_format_id+1)
) %>%
slice(1)
# Set idx to top left of sub table (corner value)
idx=bag %>%
filter(row>=idx$row) %>%
regex_inner_join(
str_glue("^{corner}$") %>% as_tibble,
by = c("character"="value")
) %>%
select(row, col, character, local_format_id) %>%
filter(row==min(row))
bag=bag %>%
filter(row>=idx$row, col>=idx$col) %>%
select(row, col, character, numeric, date, data_type, local_format_id)
# Cut bag back to Source/Mean row
idx_2 = bag %>%
filter(str_detect(character, "Source:|Mean .")) %>%
select(row, col) %>%
slice(1)
bag=bag %>%
filter(row>=idx$row, row<idx_2$row, col>=idx$col) %>%
select(row, col, character, numeric, date, data_type)
# Cut back any NA rows left (asset management/ wealth advisors)
idx_2 = bag %>%
filter(col == min(bag$col), !is.na(character)) %>%
select(row, col) %>%
tail(1)
# Final data set
if (xlsx_toggle==1) {
bag=bag %>%
filter(row>=idx$row, row<=idx_2$row, col>=idx$col) %>%
select(row, col, character, numeric, date, data_type)
} else { # if (xlsx_toggle==2)
bag=summary_cells %>%
filter(row>=idx$row-1, row<=idx_2$row, col>=idx$col) %>%
select(row, col, character, numeric, date, data_type)
}
if(xlsx_toggle==2){
# If bag 2, there are two lines of heading blocks
bag = bag %>% behead("N", name = outer_heading)
}
# behead data, remove ticker, filter out nas, change -- to NA, and convert
# date to character
data=bag %>%
behead("N", name = heading) %>%
mutate(heading = heading %>% clean_names) %>%
filter(heading!="na")
if(xlsx_toggle==2){data = data %>% filter(heading!="ticker")}
data = data %>%
mutate(
character = map_chr(
character,
~ifelse(.=="-- ", NA_real_, .)
)
) %>%
mutate(date = map_chr(date, as.character))
# If comp 2, fuse two lines of heading blocks
if(xlsx_toggle==2){
data=data %>%
mutate(
heading = pmap_chr(
.,
~ifelse(!is.na(..7), paste(..7, ..8), ..8) %>%
clean_names
)
) %>%
select(-outer_heading)
}
data = data %>% pack()
public_comparables=data$value %>%
split(data$col) %>%
set_names(data$heading %>% unique) %>%
lapply(unlist) %>%
lapply(as.vector) %>%
list()
if (xlsx_toggle==1){
# map(plyr::rename, c("company" = "business"))
nm=c("business", "exchange") %>% set_names(c("company", "ticker"))
} else { # if (xlsx_toggle==2)
# xlsx_toggle=2
# map(plyr::rename, c("company" = "business"))
nm=c(rep("business", 2), "ev", "ev_ebitda") %>%
set_names(c("insurance_brokers", "asset_management_wealth_advisors",
"tev", "tev_ltm_ebitda"))
}
public_comparables=public_comparables %>% map(plyr::rename, nm)
}
<bytecode: 0x7fed09268e08>
Now that we have a function created, we just need to set
public_comparables
equal to the output of
match.f
for each corner of the desired subtables. For
COMP1
, we can simply set the output of the function with
the parameter “Company” to public_comparables
. For
COMP2
, we can save the lists output by running “Insurance
Brokers” and “Asset Management / Wealth Advisors” through
match.f
and then set public_comparables
equal
to a list of the outputs.
The resulting lists from running this code appear for
COMP1
and COMP2
as below:
xlsx_toggle=1
xlsx_obj=file.path(path_data, "RNFC_COMP1") %>% readRDS
summary_cells=xlsx_obj %>% filter(sheet == "Summary")
corner="Company"
public_comparables=match.f(corner, summary_cells, xlsx_toggle)
public_comparables %>%
map_dfc(~.x) %>%
format.dt.f(.)
xlsx_toggle=2
xlsx_obj <- file.path(path_data, "RNFC_COMP2") %>% readRDS
summary_cells=xlsx_obj %>% filter(sheet == "Valuation Summary")
corner="Insurance Brokers"
public_comparables=match.f(corner, summary_cells, xlsx_toggle)
public_comparables %>%
map_dfc(~.x) %>%
format.dt.f(.)
corner="Asset Management / Wealth Advisors"
public_comparables=match.f(corner, summary_cells, xlsx_toggle)
public_comparables %>%
map_dfc(~.x) %>%
format.dt.f(.)
The last steps are to update the master valuation list by using
list_modify
and appending ma_comparables
to
the end of our existing master list.
# provide heading for top left corner of table, produces a list; fuse for 2 since
# there are two partitions
if (xlsx_toggle==1){
corner="Company"
public_comparables=match.f(corner, summary_cells, xlsx_toggle) %>%
set_names(c("insurance_brokers"))
}else if (xlsx_toggle==2){
corner="Insurance Brokers"
l1=match.f(corner, summary_cells, xlsx_toggle)
corner="Asset Management / Wealth Advisors"
l2=match.f(corner, summary_cells, xlsx_toggle)
public_comparables=flatten(list(l1, l2)) %>%
set_names(c("insurance_brokers", "asset_management_wealth_advisors"))
}
# public_comparables %>% bind_rows(.id="sector")
valuation=list_modify(
valuation,
public_comparables=public_comparables
)