Our goal is to extract the security description table from each
workbook and transform it into a list detailing each observation and its
variables and values. When finished, our output should look like this
for COMPS1
and COMPS2
, respectively:
We begin with two sheets: Summary withn
RNFC_COMP1.rds
and Valuation Summary
within RNFC_COMP2.rds
xlsx_obj1 <- file.path(path_data, "RNFC_COMP1") %>% readRDS
bag1=xlsx_obj1 %>% # bag of cells for COMP1
filter(sheet == "Summary") %>%
select(row, col, data_type, numeric, date,
character, local_format_id, style_format)
bag1 %>%
rectify() %>%
format.dt.f(.)
xlsx_obj2 <- file.path(path_data, "RNFC_COMP2") %>% readRDS
bag2=xlsx_obj2 %>% # bag of cells for COMP2
filter(sheet == "Valuation Summary") %>%
select(row, col, data_type, numeric, date,
character, local_format_id, style_format)
bag2 %>%
rectify() %>%
format.dt.f(.)
To start off, we need to capture the data between rows \(16\) and \(28\) for COMP1
and \(18\) and \(31\) for COMP2
. To do so, we
can determine the corners of the table we wish to extract and then
filter out the unneeded data using their coordinates. Finding the top
left corner is simple since the table on both sheets are titled
Security Description. To grab the row and column for this cell,
we can perform an inner join
, joining a single data frame
with the value Security Description by the character variable
of the bag. This will return the observation in the bag that includes
the row and column of the top left cell for our table, as well as
additional information on this cell.
Below we can see that the table heading cells for COMP1
have a local format id of \(168\). This
means we can filter for the next cell in column \(3\) that shares the same format as the top
left corner and then count back two rows to ensure data from the next
table is not included in our bag of cells.
# Top left corner COMP1
idx_tl1 = bag1 %>%
regex_inner_join(
"Security Description" %>% enframe(name = NULL),
by = c("character" = "value")
) %>%
select(-value)
idx_tl1 %>%
select(row, col, character, local_format_id, style_format)
# A tibble: 1 × 5
row col character local_format_id style_format
<int> <int> <chr> <int> <chr>
1 16 3 Security Description: 168 "Norm\u0a4e\u0a4e"
With COMP2
, the local_format_id
does not
help us in locating these positions, however the
style_format
does. Looking at this information, we can see
that the headings have different style ids, so we can filter
for the first row below the top left index that does not have the same
style_format
. Luckily, these filter conditions do not
interfere with each other for the two comp sheets so an OR
operator can be used; a select function based on the
xlsx_toggle
could be employed if more spreadsheets are
added and it becomes an issue.
idx_tl2 = bag2 %>%
regex_inner_join(
"Security Description" %>% enframe(name = NULL),
by = c("character" = "value")
) %>%
select(-value)
idx_tl2 %>%
select(row, col, character, local_format_id, style_format)
# A tibble: 1 × 5
row col character local_format_id style_format
<int> <int> <chr> <int> <chr>
1 18 3 Security Description: 90 Normal_EVERTEC - Q4 Reporti…
Next, we need to find the bottom of the tables. Because there is no
irrelevant data to the right of our tables, we just need to find the
value for the last row. This is a little trickier than finding the top
left cell since the two sheets do not have identical West
headings. One method we can employ is looking at the style formatting to
see if the table can be isolated through its
local_format_id
or style_format
. To make the
style data easily readable, we can set the character values of the bag
to the format id and then rectify it.
# Bottom Left Corner COMP1
idx_bl1 = bag1 %>%
filter(
col == 3 & row > idx_tl1$row &
(local_format_id == idx_tl1$local_format_id | style_format!=idx_tl1$style_format)
) %>%
slice(1) %>%
select(row) %>%
mutate (row = row-2)
idx_bl1
# A tibble: 1 × 1
row
<dbl>
1 28
# Bottom Left Corner COMP2
idx_bl2 = bag2 %>%
filter(
col == 3 & row > idx_tl2$row &
(local_format_id == idx_tl2$local_format_id | style_format!=idx_tl2$style_format)
) %>%
slice(1) %>%
select(row) %>%
mutate (row = row-2)
idx_bl2
# A tibble: 1 × 1
row
<dbl>
1 31
Now that we know the rows where our security description table starts and ends in each sheet and the column it starts in, we can filter out all other data from our bag.
bag1 = bag1 %>%
filter(
row >= idx_tl1$row,
row <= idx_bl1$row,
col >= idx_tl1$col
)
bag2 = bag2 %>%
filter(
row >= idx_tl2$row,
row <= idx_bl2$row,
col >= idx_tl2$col
)
Our next goal is to perform some final clean up so that data types
match the target output and the table can be interpreted by the
behead
and pack
functions from the
unpivotr package. In COMP2
we can remove
row \(20\) since it contains extra
headings that we don’t want to be picked up by
unpivotr::behead(). The date values in the bag also
need to be converted to characters in both sheets to match the target.
Finally, we need to filter out the blanks so we are only left with
values and their relevant Northern and Western
headings (usually behead
is good about filtering out
blanks but the number of NAs might be throwing it off, not
entirely sure). Lastly, we can deselect the style formatting since we no
longer need it. The resulting cleaned tables that we can begin to apply
unpivotr functions on can be seen below for
COMP1
and COMP2
:
bag1=bag1 %>%
mutate(date = as.character(date)) %>%
filter(data_type!= "blank") %>%
select(-local_format_id, -style_format)
bag1 %>%
rectify() %>%
format.dt.f()
xlsx_toggle=2
if(xlsx_toggle==2){bag2 = bag2 %>% filter(row!=20)}
bag2=bag2 %>%
mutate(date = as.character(date)) %>%
filter(data_type!= "blank") %>%
select(-local_format_id, -style_format)
bag2 %>%
rectify() %>%
format.dt.f()
Next we can begin stripping the headings, assigning each value to a North and West heading. Each value belongs to a specific security id and an element that describes an aspect of that security. In the tables above, we can see the North headings indicate the security id while the west headings represent the associated element of that security the value describes.
We can use behead
to strip these headings so only the
values remain, however the information is not lost, the headings are
assigned to a new column in the bag tibble.
t_security1
t_security1 <- bag1 %>%
behead(direction = "N", name = "security_id") %>%
behead(direction = "W", name = "element")
t_security1 %>% format.dt.f()
t_security2
t_security2 <- bag2 %>%
behead(direction = "N", name = "security_id") %>%
behead(direction = "W", name = "element")
t_security2 %>% format.dt.f()
The security_id and element values will become the list
dividers and list element names in the final output, so we want to clean
up the names to match the target now. We can do this with the
clean_names()
function.
clean_names <- function(old_names){
new_names <- old_names %>%
gsub("'", "", .) %>% # remove quotation marks
gsub("\"", "", .) %>% # remove quotation marks
gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
# ".percent" a valid name. The "." will be replaced in the call to to_any_case
# via the preprocess argument anyway.
gsub("^[ ]+", "", .) %>%
make.names(.) %>%
# Handle dots, multiple underscores, case conversion, string transliteration
snakecase::to_any_case(
case = "snake" #,
# preprocess = "\\.",
# replace_special_characters = c("Latin-ASCII")
)
new_names
}
t_security1
t_security1=t_security1 %>%
mutate(
element = element %>% clean_names(),
security_id = security_id %>% clean_names()
)
t_security1 %>% format.dt.f()
t_security2
t_security2=t_security2 %>%
mutate(
element = element %>% clean_names(),
security_id = security_id %>% clean_names()
)
t_security2 %>% format.dt.f()
Next, we can use unpivotr::pack() to combine the the different data type columns into just one column. This will allow us to extract them all into a single list to match the target output.
t_security1
t_security1=t_security1 %>%
pack() %>%
select(security_id, element, value)
t_security1 %>% format.dt.f()
t_security2
t_security2 = t_security2 %>%
pack() %>%
select(security_id, element, value)
t_security2 %>% format.dt.f()
Our next steps are to convert our table to match the output format we
are targeting. We currently have a table that includes the
security_id (the title of the target list), element
(the names for the values in the target list), and value (list
of values to be included in our list). We can extract each of the value
lists by calling t_security1$value
and then pass that
through purrr:set_names() with the element column of
t_security1
for the names argument; the resulting list can
then be split by the security_id, making it a list of lists
which matches the target output for COMP2
.
# Extract values from table into list of lists (match previous output)
security1=t_security1$value %>%
purrr::set_names(t_security1$element) %>%
split(t_security1$security_id) %>% # list() %>%
set_names(
t_security1 %>%
filter(element=="description") %>%
pull() %>%
map_chr(~.x) %>%
clean_names()
)
security2=t_security2$value %>%
purrr::set_names(t_security2$element) %>%
split(t_security2$security_id) %>% # list() %>%
set_names(
t_security2 %>%
filter(element=="description") %>%
pull() %>%
map_chr(~.x) %>%
clean_names()
)
list(security1, security2) %>%
set_names(str_replace(names(scenarios), "RNFC_", "")) %>%
listviewer::jsonedit()