.@JennyBryan I want to see @drob figure out how to turn this into a tidy data set. I'd pay money to watch it in realtime….

— Simply Statistics (@simplystats) April 6, 2016

Challenge accepted. Here’s the data Jenny sent, which can be downloaded here:

This is some early unfinished work. I’m focusing on the tables with BID and OFFER. The phone numbers are probably not generally interesting to an analyst, and the NYMEX data is relatively easy to extract.

Rather than solve it for this particular sheet, I’m trying to think about tools that would be useful in general. In particular, I’m treating the problem as about pulling “sub-tables” out into a tidy format. There are a few general functions I come up with and others that could be useful.

Pre-processing, melting

We read it in with the readxl package:

library(dplyr)
library(tidyr)
library(stringr)

raw_data <- readxl::read_excel("~/Downloads/frank_ermis__11178__Enron Pricing Report.xlsx")

raw_data
## Source: local data frame [234 x 99]
## 
##          Enron North America - West Gas    NA               NA    NA    NA
##    (dbl)                          (chr) (chr)            (chr) (chr) (chr)
## 1     NA                          37204    NA               NA    NA    NA
## 2     NA                             NA    NA               NA    NA    NA
## 3     NA        ENA - West Gas Contacts    NA               NA    NA    NA
## 4     NA                             NA    NA               NA    NA    NA
## 5     NA                             NA    NA   Houston Office    NA    NA
## 6     NA                             NA    NA   Barry Tycholiz    NA    NA
## 7     NA                             NA    NA         Kim Ward    NA    NA
## 8     NA                             NA    NA Stephanie Miller    NA    NA
## 9     NA                             NA    NA    Philip Polsky    NA    NA
## 10    NA                             NA    NA               NA    NA    NA
## ..   ...                            ...   ...              ...   ...   ...
## Variables not shown: NA (chr), NA (chr), NA (chr), NA (chr), NA (dbl), NA
##   (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA
##   (chr), NA (chr), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NYMEX Month (chr), NA (dbl), 37226 (dbl), 37257 (dbl), 37288
##   (dbl), 37316 (dbl), 37347 (dbl), 37377 (dbl), 37408 (dbl), 37438 (dbl),
##   37469 (dbl), 37500 (dbl), 37530 (dbl), 37561 (dbl), NA (dbl), NA (dbl),
##   NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
##   (dbl)

Nothing new there.

Viewing it in melted format allows more useful operations to be performed:

# Gather into row-column-value format, and remove all-NA rows/cols
# (which is most of them)
tidy_excel <- function(x) {
  x %>%
    setNames(seq_len(ncol(x))) %>%
    mutate(row = row_number()) %>%
    tidyr::gather(column, value, -row) %>%
    mutate(column = as.integer(column)) %>%
    group_by(row) %>%
    filter(!all(is.na(value))) %>%
    group_by(column) %>%
    filter(!all(is.na(value))) %>%
    ungroup() %>%
    arrange(column, row)
}

tbl <- tidy_excel(raw_data)
tbl
## Source: local data frame [2,139 x 3]
## 
##      row column                      value
##    (int)  (int)                      (chr)
## 1      1      2                      37204
## 2      2      2                         NA
## 3      3      2    ENA - West Gas Contacts
## 4      5      2                         NA
## 5      6      2                         NA
## 6      7      2                         NA
## 7      8      2                         NA
## 8      9      2                         NA
## 9     11      2 Forward Prices (US$/MMBtu)
## 10    13      2                         NA
## ..   ...    ...                        ...

Already it’s more useful because we can look at the structure:

library(ggplot2)

plot_excel <- function(x) {
  x %>%
  filter(!is.na(value)) %>%
  ggplot(aes(column, -row, fill = !is.na(as.numeric(value)))) +
  geom_tile(color = "black") +
  theme_void()
}

plot_excel(tbl)

The vast majority of the cells are NA. But they may end up in some of the tables so we have to be careful to keep them for now.

Notice that all the sub-tables of BID/OFFER etc have the same size- that may not be true in other tables, but it’s true here. That gives me an idea.

tbl_toplefts <- tbl %>%
  filter(str_detect(value, "(IF|AECO \\/|Border|City Gate)"),
         row < 56)

tbl_toplefts
## Source: local data frame [10 x 3]
## 
##      row column                           value
##    (int)  (int)                           (chr)
## 1     24      7          IF CIG Rocky Mountains
## 2     35      7                      AECO / NIT
## 3     46      7    NGI Socal (South Cal Border)
## 4     24     12              IF EL Paso Permian
## 5     35     12 IF NWPL Canadian Border (Sumas)
## 6     46     12    NGI Malin (North Cal Border)
## 7     13     17         IF NWPL Rocky Mountains
## 8     24     17             IF EL Paso San Juan
## 9     35     17                   IF PEPL TX-OK
## 10    46     17                  PG&E City Gate

We then have the chance to extract tables based on those positions. I write an approach that can be done with any vector of row/column starts, as well as either a fixed or at least known width and height. (If it turned out that the tables, say, alternate between width 4 and width 6, that would be easy to modify beforehand).

extract_sub_tbl_position <- function(x, row_start, col_start,
                                width, height) {
  positions <- data_frame(row_start, row_end = row_start + height - 1,
                          col_start, col_end = col_start + width - 1) %>%
    arrange(row_start, col_start) %>%
    group_by(table_id = row_number()) %>%
    do(tidyr::crossing(row = seq(.$row_start, .$row_end),
                       column = seq(.$col_start, .$col_end))) %>%
    ungroup()

  positions %>%
    inner_join(x)
}

sub_tbls <- extract_sub_tbl_position(tbl,
                                     tbl_toplefts$row, tbl_toplefts$column,
                                     width = 4, height = 10)

sub_tbls
## Source: local data frame [400 x 4]
## 
##    table_id   row column                   value
##       (int) (int)  (int)                   (chr)
## 1         1    13     17 IF NWPL Rocky Mountains
## 2         1    13     18                      NA
## 3         1    13     19                      NA
## 4         1    13     20                      NA
## 5         1    14     17             Fixed Price
## 6         1    14     18                      NA
## 7         1    14     19                   Basis
## 8         1    14     20                      NA
## 9         1    15     17                     BID
## 10        1    15     18                   OFFER
## ..      ...   ...    ...                     ...
plot_excel(sub_tbls)

In the process, we haven’t just pulled out these tables, we’ve kept track of which is which:

plot_excel(sub_tbls) +
  geom_tile(aes(fill = factor(table_id)), color = "black")

Turning those tables into one

What we really have here is multiple header rows (the red above). To work with this as tidy data, we’ve got to take those multiple header rows and put them in columns alongside it.

First figure out which are headers, and fill them across (you don’t want to fill NAs in non-header data):

library(zoo)

sub_tbls_filled <- sub_tbls %>%
  group_by(table_id, row) %>%
  mutate(is_header_row = is.na(as.numeric(value[1]))) %>%
  mutate(value = ifelse(is_header_row, na.locf(value), value))

sub_tbls_filled
## Source: local data frame [400 x 5]
## Groups: table_id, row [100]
## 
##    table_id   row column                   value is_header_row
##       (int) (int)  (int)                   (chr)         (lgl)
## 1         1    13     17 IF NWPL Rocky Mountains          TRUE
## 2         1    13     18 IF NWPL Rocky Mountains          TRUE
## 3         1    13     19 IF NWPL Rocky Mountains          TRUE
## 4         1    13     20 IF NWPL Rocky Mountains          TRUE
## 5         1    14     17             Fixed Price          TRUE
## 6         1    14     18             Fixed Price          TRUE
## 7         1    14     19                   Basis          TRUE
## 8         1    14     20                   Basis          TRUE
## 9         1    15     17                     BID          TRUE
## 10        1    15     18                   OFFER          TRUE
## ..      ...   ...    ...                     ...           ...

Then we need to know, for each column/table_id, what combination of columns it has. This “multiple header” pattern probably appears in other sheets and may be worth “verb”-ing in some way.

header_data <- sub_tbls_filled %>%
  group_by(table_id) %>%
  filter(is_header_row) %>%
  mutate(header_row = paste0("header", row - min(row) + 1)) %>%
  ungroup() %>%
  select(table_id, column, header_row, value) %>%
  spread(header_row, value)

header_data
## Source: local data frame [40 x 5]
## 
##    table_id column                 header1     header2 header3
##       (int)  (int)                   (chr)       (chr)   (chr)
## 1         1     17 IF NWPL Rocky Mountains Fixed Price     BID
## 2         1     18 IF NWPL Rocky Mountains Fixed Price   OFFER
## 3         1     19 IF NWPL Rocky Mountains       Basis     BID
## 4         1     20 IF NWPL Rocky Mountains       Basis   OFFER
## 5         2      7  IF CIG Rocky Mountains Fixed Price     BID
## 6         2      8  IF CIG Rocky Mountains Fixed Price   OFFER
## 7         2      9  IF CIG Rocky Mountains       Basis     BID
## 8         2     10  IF CIG Rocky Mountains       Basis   OFFER
## 9         3     12      IF EL Paso Permian Fixed Price     BID
## 10        3     13      IF EL Paso Permian Fixed Price   OFFER
## ..      ...    ...                     ...         ...     ...

Joining that with the non-header data gets us a tidy version, with a value column that we can actually convert to numeric. (We can also drop the table_id afterwards since it is redundant with header1).

td_tbls <- sub_tbls_filled %>%
  ungroup() %>%
  filter(!is_header_row) %>%
  select(-is_header_row) %>%
  inner_join(header_data) %>%
  select(-column, -table_id) %>%
  mutate(value = as.numeric(value))

td_tbls
## Source: local data frame [280 x 5]
## 
##      row value                 header1     header2 header3
##    (int) (dbl)                   (chr)       (chr)   (chr)
## 1     16 1.890 IF NWPL Rocky Mountains Fixed Price     BID
## 2     16 1.910 IF NWPL Rocky Mountains Fixed Price   OFFER
## 3     16    NA IF NWPL Rocky Mountains       Basis     BID
## 4     16    NA IF NWPL Rocky Mountains       Basis   OFFER
## 5     17 2.060 IF NWPL Rocky Mountains Fixed Price     BID
## 6     17 2.080 IF NWPL Rocky Mountains Fixed Price   OFFER
## 7     17    NA IF NWPL Rocky Mountains       Basis     BID
## 8     17    NA IF NWPL Rocky Mountains       Basis   OFFER
## 9     18 2.395 IF NWPL Rocky Mountains Fixed Price     BID
## 10    18 2.415 IF NWPL Rocky Mountains Fixed Price   OFFER
## ..   ...   ...                     ...         ...     ...

Adding row information

The next step is that we need to get the per-row information. These come from the second (with values like “Cash”, “ROM”, and “from” dates) and fourth columns (entirely “to” dates) of the full table (and we only care about the ones in the same rows as td_tbls):

tbl_with_row <- tbl %>%
  filter(column %in% c(2, 4)) %>%
  mutate(column = c("row_info", "to_date")[column / 2]) %>%
  spread(column, value) %>%
  inner_join(td_tbls, by = "row")

tbl_with_row
## Source: local data frame [280 x 7]
## 
##      row row_info to_date value                 header1     header2
##    (int)    (chr)   (chr) (dbl)                   (chr)       (chr)
## 1     16     Cash      NA 1.890 IF NWPL Rocky Mountains Fixed Price
## 2     16     Cash      NA 1.910 IF NWPL Rocky Mountains Fixed Price
## 3     16     Cash      NA    NA IF NWPL Rocky Mountains       Basis
## 4     16     Cash      NA    NA IF NWPL Rocky Mountains       Basis
## 5     17      ROM      NA 2.060 IF NWPL Rocky Mountains Fixed Price
## 6     17      ROM      NA 2.080 IF NWPL Rocky Mountains Fixed Price
## 7     17      ROM      NA    NA IF NWPL Rocky Mountains       Basis
## 8     17      ROM      NA    NA IF NWPL Rocky Mountains       Basis
## 9     18    37226      NA 2.395 IF NWPL Rocky Mountains Fixed Price
## 10    18    37226      NA 2.415 IF NWPL Rocky Mountains Fixed Price
## ..   ...      ...     ...   ...                     ...         ...
## Variables not shown: header3 (chr)

Last step is to clean those values like 37226 into dates. Those came straight out of the readxl step, (see here for the explanation of excel dates).

convert_excel_date <- function(x) {
  result <- as.Date("1900-01-01") + as.numeric(x) - 2
  ifelse(is.na(result), x, as.character(result))
}

final_tbl <- tbl_with_row %>%
  mutate(row_info = convert_excel_date(row_info),
         to_date = convert_excel_date(to_date))

Let’s look at the entire table:

knitr::kable(final_tbl, digits = 2)
row row_info to_date value header1 header2 header3
16 Cash NA 1.89 IF NWPL Rocky Mountains Fixed Price BID
16 Cash NA 1.91 IF NWPL Rocky Mountains Fixed Price OFFER
16 Cash NA NA IF NWPL Rocky Mountains Basis BID
16 Cash NA NA IF NWPL Rocky Mountains Basis OFFER
17 ROM NA 2.06 IF NWPL Rocky Mountains Fixed Price BID
17 ROM NA 2.08 IF NWPL Rocky Mountains Fixed Price OFFER
17 ROM NA NA IF NWPL Rocky Mountains Basis BID
17 ROM NA NA IF NWPL Rocky Mountains Basis OFFER
18 2001-12-01 NA 2.40 IF NWPL Rocky Mountains Fixed Price BID
18 2001-12-01 NA 2.42 IF NWPL Rocky Mountains Fixed Price OFFER
18 2001-12-01 NA -0.56 IF NWPL Rocky Mountains Basis BID
18 2001-12-01 NA -0.55 IF NWPL Rocky Mountains Basis OFFER
19 2001-12-01 2002-03-01 2.59 IF NWPL Rocky Mountains Fixed Price BID
19 2001-12-01 2002-03-01 2.61 IF NWPL Rocky Mountains Fixed Price OFFER
19 2001-12-01 2002-03-01 -0.49 IF NWPL Rocky Mountains Basis BID
19 2001-12-01 2002-03-01 -0.47 IF NWPL Rocky Mountains Basis OFFER
20 2002-04-01 2002-10-01 2.58 IF NWPL Rocky Mountains Fixed Price BID
20 2002-04-01 2002-10-01 2.60 IF NWPL Rocky Mountains Fixed Price OFFER
20 2002-04-01 2002-10-01 -0.58 IF NWPL Rocky Mountains Basis BID
20 2002-04-01 2002-10-01 -0.56 IF NWPL Rocky Mountains Basis OFFER
21 2002-11-01 2003-03-01 3.36 IF NWPL Rocky Mountains Fixed Price BID
21 2002-11-01 2003-03-01 3.38 IF NWPL Rocky Mountains Fixed Price OFFER
21 2002-11-01 2003-03-01 -0.30 IF NWPL Rocky Mountains Basis BID
21 2002-11-01 2003-03-01 -0.28 IF NWPL Rocky Mountains Basis OFFER
22 One Year Strip* NA 2.63 IF NWPL Rocky Mountains Fixed Price BID
22 One Year Strip* NA 2.65 IF NWPL Rocky Mountains Fixed Price OFFER
22 One Year Strip* NA -0.53 IF NWPL Rocky Mountains Basis BID
22 One Year Strip* NA -0.51 IF NWPL Rocky Mountains Basis OFFER
27 Cash NA 1.94 IF CIG Rocky Mountains Fixed Price BID
27 Cash NA 1.96 IF CIG Rocky Mountains Fixed Price OFFER
27 Cash NA NA IF CIG Rocky Mountains Basis BID
27 Cash NA NA IF CIG Rocky Mountains Basis OFFER
27 Cash NA 2.38 IF EL Paso Permian Fixed Price BID
27 Cash NA 2.40 IF EL Paso Permian Fixed Price OFFER
27 Cash NA NA IF EL Paso Permian Basis BID
27 Cash NA NA IF EL Paso Permian Basis OFFER
27 Cash NA 2.45 IF EL Paso San Juan Fixed Price BID
27 Cash NA 2.47 IF EL Paso San Juan Fixed Price OFFER
27 Cash NA NA IF EL Paso San Juan Basis BID
27 Cash NA NA IF EL Paso San Juan Basis OFFER
28 ROM NA 1.96 IF CIG Rocky Mountains Fixed Price BID
28 ROM NA 1.98 IF CIG Rocky Mountains Fixed Price OFFER
28 ROM NA NA IF CIG Rocky Mountains Basis BID
28 ROM NA NA IF CIG Rocky Mountains Basis OFFER
28 ROM NA 2.42 IF EL Paso Permian Fixed Price BID
28 ROM NA 2.44 IF EL Paso Permian Fixed Price OFFER
28 ROM NA NA IF EL Paso Permian Basis BID
28 ROM NA NA IF EL Paso Permian Basis OFFER
28 ROM NA 2.35 IF EL Paso San Juan Fixed Price BID
28 ROM NA 2.37 IF EL Paso San Juan Fixed Price OFFER
28 ROM NA NA IF EL Paso San Juan Basis BID
28 ROM NA NA IF EL Paso San Juan Basis OFFER
29 2001-12-01 NA 2.35 IF CIG Rocky Mountains Fixed Price BID
29 2001-12-01 NA 2.37 IF CIG Rocky Mountains Fixed Price OFFER
29 2001-12-01 NA -0.62 IF CIG Rocky Mountains Basis BID
29 2001-12-01 NA -0.60 IF CIG Rocky Mountains Basis OFFER
29 2001-12-01 NA 2.70 IF EL Paso Permian Fixed Price BID
29 2001-12-01 NA 2.72 IF EL Paso Permian Fixed Price OFFER
29 2001-12-01 NA -0.26 IF EL Paso Permian Basis BID
29 2001-12-01 NA -0.24 IF EL Paso Permian Basis OFFER
29 2001-12-01 NA 2.56 IF EL Paso San Juan Fixed Price BID
29 2001-12-01 NA 2.58 IF EL Paso San Juan Fixed Price OFFER
29 2001-12-01 NA -0.40 IF EL Paso San Juan Basis BID
29 2001-12-01 NA -0.38 IF EL Paso San Juan Basis OFFER
30 2001-12-01 2002-03-01 2.55 IF CIG Rocky Mountains Fixed Price BID
30 2001-12-01 2002-03-01 2.57 IF CIG Rocky Mountains Fixed Price OFFER
30 2001-12-01 2002-03-01 -0.54 IF CIG Rocky Mountains Basis BID
30 2001-12-01 2002-03-01 -0.52 IF CIG Rocky Mountains Basis OFFER
30 2001-12-01 2002-03-01 2.85 IF EL Paso Permian Fixed Price BID
30 2001-12-01 2002-03-01 2.87 IF EL Paso Permian Fixed Price OFFER
30 2001-12-01 2002-03-01 -0.23 IF EL Paso Permian Basis BID
30 2001-12-01 2002-03-01 -0.21 IF EL Paso Permian Basis OFFER
30 2001-12-01 2002-03-01 2.74 IF EL Paso San Juan Fixed Price BID
30 2001-12-01 2002-03-01 2.76 IF EL Paso San Juan Fixed Price OFFER
30 2001-12-01 2002-03-01 -0.34 IF EL Paso San Juan Basis BID
30 2001-12-01 2002-03-01 -0.32 IF EL Paso San Juan Basis OFFER
31 2002-04-01 2002-10-01 2.47 IF CIG Rocky Mountains Fixed Price BID
31 2002-04-01 2002-10-01 2.49 IF CIG Rocky Mountains Fixed Price OFFER
31 2002-04-01 2002-10-01 -0.70 IF CIG Rocky Mountains Basis BID
31 2002-04-01 2002-10-01 -0.68 IF CIG Rocky Mountains Basis OFFER
31 2002-04-01 2002-10-01 3.01 IF EL Paso Permian Fixed Price BID
31 2002-04-01 2002-10-01 3.03 IF EL Paso Permian Fixed Price OFFER
31 2002-04-01 2002-10-01 -0.16 IF EL Paso Permian Basis BID
31 2002-04-01 2002-10-01 -0.14 IF EL Paso Permian Basis OFFER
31 2002-04-01 2002-10-01 2.80 IF EL Paso San Juan Fixed Price BID
31 2002-04-01 2002-10-01 2.82 IF EL Paso San Juan Fixed Price OFFER
31 2002-04-01 2002-10-01 -0.36 IF EL Paso San Juan Basis BID
31 2002-04-01 2002-10-01 -0.34 IF EL Paso San Juan Basis OFFER
32 2002-11-01 2003-03-01 3.31 IF CIG Rocky Mountains Fixed Price BID
32 2002-11-01 2003-03-01 3.33 IF CIG Rocky Mountains Fixed Price OFFER
32 2002-11-01 2003-03-01 -0.34 IF CIG Rocky Mountains Basis BID
32 2002-11-01 2003-03-01 -0.32 IF CIG Rocky Mountains Basis OFFER
32 2002-11-01 2003-03-01 3.50 IF EL Paso Permian Fixed Price BID
32 2002-11-01 2003-03-01 3.52 IF EL Paso Permian Fixed Price OFFER
32 2002-11-01 2003-03-01 -0.15 IF EL Paso Permian Basis BID
32 2002-11-01 2003-03-01 -0.13 IF EL Paso Permian Basis OFFER
32 2002-11-01 2003-03-01 3.42 IF EL Paso San Juan Fixed Price BID
32 2002-11-01 2003-03-01 3.44 IF EL Paso San Juan Fixed Price OFFER
32 2002-11-01 2003-03-01 -0.23 IF EL Paso San Juan Basis BID
32 2002-11-01 2003-03-01 -0.21 IF EL Paso San Juan Basis OFFER
33 One Year Strip* NA 2.55 IF CIG Rocky Mountains Fixed Price BID
33 One Year Strip* NA 2.57 IF CIG Rocky Mountains Fixed Price OFFER
33 One Year Strip* NA -0.61 IF CIG Rocky Mountains Basis BID
33 One Year Strip* NA -0.59 IF CIG Rocky Mountains Basis OFFER
33 One Year Strip* NA 2.98 IF EL Paso Permian Fixed Price BID
33 One Year Strip* NA 3.00 IF EL Paso Permian Fixed Price OFFER
33 One Year Strip* NA -0.18 IF EL Paso Permian Basis BID
33 One Year Strip* NA -0.16 IF EL Paso Permian Basis OFFER
33 One Year Strip* NA 2.82 IF EL Paso San Juan Fixed Price BID
33 One Year Strip* NA 2.84 IF EL Paso San Juan Fixed Price OFFER
33 One Year Strip* NA -0.35 IF EL Paso San Juan Basis BID
33 One Year Strip* NA -0.33 IF EL Paso San Juan Basis OFFER
38 Cash NA 2.38 AECO / NIT Fixed Price BID
38 Cash NA 2.40 AECO / NIT Fixed Price OFFER
38 Cash NA NA AECO / NIT Basis BID
38 Cash NA NA AECO / NIT Basis OFFER
38 Cash NA 2.48 IF NWPL Canadian Border (Sumas) Fixed Price BID
38 Cash NA 2.50 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
38 Cash NA NA IF NWPL Canadian Border (Sumas) Basis BID
38 Cash NA NA IF NWPL Canadian Border (Sumas) Basis OFFER
38 Cash NA 2.53 IF PEPL TX-OK Fixed Price BID
38 Cash NA 2.55 IF PEPL TX-OK Fixed Price OFFER
38 Cash NA NA IF PEPL TX-OK Basis BID
38 Cash NA NA IF PEPL TX-OK Basis OFFER
39 ROM NA 2.40 AECO / NIT Fixed Price BID
39 ROM NA 2.42 AECO / NIT Fixed Price OFFER
39 ROM NA NA AECO / NIT Basis BID
39 ROM NA NA AECO / NIT Basis OFFER
39 ROM NA 2.46 IF NWPL Canadian Border (Sumas) Fixed Price BID
39 ROM NA 2.48 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
39 ROM NA NA IF NWPL Canadian Border (Sumas) Basis BID
39 ROM NA NA IF NWPL Canadian Border (Sumas) Basis OFFER
39 ROM NA 2.53 IF PEPL TX-OK Fixed Price BID
39 ROM NA 2.55 IF PEPL TX-OK Fixed Price OFFER
39 ROM NA NA IF PEPL TX-OK Basis BID
39 ROM NA NA IF PEPL TX-OK Basis OFFER
40 2001-12-01 NA 2.55 AECO / NIT Fixed Price BID
40 2001-12-01 NA 2.57 AECO / NIT Fixed Price OFFER
40 2001-12-01 NA -0.41 AECO / NIT Basis BID
40 2001-12-01 NA -0.39 AECO / NIT Basis OFFER
40 2001-12-01 NA 2.80 IF NWPL Canadian Border (Sumas) Fixed Price BID
40 2001-12-01 NA 2.82 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
40 2001-12-01 NA -0.16 IF NWPL Canadian Border (Sumas) Basis BID
40 2001-12-01 NA -0.14 IF NWPL Canadian Border (Sumas) Basis OFFER
40 2001-12-01 NA 2.83 IF PEPL TX-OK Fixed Price BID
40 2001-12-01 NA 2.85 IF PEPL TX-OK Fixed Price OFFER
40 2001-12-01 NA -0.13 IF PEPL TX-OK Basis BID
40 2001-12-01 NA -0.11 IF PEPL TX-OK Basis OFFER
41 2001-12-01 2002-03-01 2.62 AECO / NIT Fixed Price BID
41 2001-12-01 2002-03-01 2.64 AECO / NIT Fixed Price OFFER
41 2001-12-01 2002-03-01 -0.47 AECO / NIT Basis BID
41 2001-12-01 2002-03-01 -0.45 AECO / NIT Basis OFFER
41 2001-12-01 2002-03-01 2.89 IF NWPL Canadian Border (Sumas) Fixed Price BID
41 2001-12-01 2002-03-01 2.91 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
41 2001-12-01 2002-03-01 -0.20 IF NWPL Canadian Border (Sumas) Basis BID
41 2001-12-01 2002-03-01 -0.18 IF NWPL Canadian Border (Sumas) Basis OFFER
41 2001-12-01 2002-03-01 2.96 IF PEPL TX-OK Fixed Price BID
41 2001-12-01 2002-03-01 2.98 IF PEPL TX-OK Fixed Price OFFER
41 2001-12-01 2002-03-01 -0.13 IF PEPL TX-OK Basis BID
41 2001-12-01 2002-03-01 -0.11 IF PEPL TX-OK Basis OFFER
42 2002-04-01 2002-10-01 2.66 AECO / NIT Fixed Price BID
42 2002-04-01 2002-10-01 2.68 AECO / NIT Fixed Price OFFER
42 2002-04-01 2002-10-01 -0.50 AECO / NIT Basis BID
42 2002-04-01 2002-10-01 -0.48 AECO / NIT Basis OFFER
42 2002-04-01 2002-10-01 2.80 IF NWPL Canadian Border (Sumas) Fixed Price BID
42 2002-04-01 2002-10-01 2.82 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
42 2002-04-01 2002-10-01 -0.37 IF NWPL Canadian Border (Sumas) Basis BID
42 2002-04-01 2002-10-01 -0.35 IF NWPL Canadian Border (Sumas) Basis OFFER
42 2002-04-01 2002-10-01 3.05 IF PEPL TX-OK Fixed Price BID
42 2002-04-01 2002-10-01 3.07 IF PEPL TX-OK Fixed Price OFFER
42 2002-04-01 2002-10-01 -0.12 IF PEPL TX-OK Basis BID
42 2002-04-01 2002-10-01 -0.10 IF PEPL TX-OK Basis OFFER
43 2002-11-01 2003-03-01 3.22 AECO / NIT Fixed Price BID
43 2002-11-01 2003-03-01 3.24 AECO / NIT Fixed Price OFFER
43 2002-11-01 2003-03-01 -0.44 AECO / NIT Basis BID
43 2002-11-01 2003-03-01 -0.42 AECO / NIT Basis OFFER
43 2002-11-01 2003-03-01 3.71 IF NWPL Canadian Border (Sumas) Fixed Price BID
43 2002-11-01 2003-03-01 3.73 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
43 2002-11-01 2003-03-01 0.06 IF NWPL Canadian Border (Sumas) Basis BID
43 2002-11-01 2003-03-01 0.08 IF NWPL Canadian Border (Sumas) Basis OFFER
43 2002-11-01 2003-03-01 3.53 IF PEPL TX-OK Fixed Price BID
43 2002-11-01 2003-03-01 3.55 IF PEPL TX-OK Fixed Price OFFER
43 2002-11-01 2003-03-01 -0.12 IF PEPL TX-OK Basis BID
43 2002-11-01 2003-03-01 -0.10 IF PEPL TX-OK Basis OFFER
44 One Year Strip* NA 2.68 AECO / NIT Fixed Price BID
44 One Year Strip* NA 2.70 AECO / NIT Fixed Price OFFER
44 One Year Strip* NA -0.49 AECO / NIT Basis BID
44 One Year Strip* NA -0.47 AECO / NIT Basis OFFER
44 One Year Strip* NA 2.88 IF NWPL Canadian Border (Sumas) Fixed Price BID
44 One Year Strip* NA 2.90 IF NWPL Canadian Border (Sumas) Fixed Price OFFER
44 One Year Strip* NA -0.28 IF NWPL Canadian Border (Sumas) Basis BID
44 One Year Strip* NA -0.26 IF NWPL Canadian Border (Sumas) Basis OFFER
44 One Year Strip* NA 3.04 IF PEPL TX-OK Fixed Price BID
44 One Year Strip* NA 3.06 IF PEPL TX-OK Fixed Price OFFER
44 One Year Strip* NA -0.12 IF PEPL TX-OK Basis BID
44 One Year Strip* NA -0.10 IF PEPL TX-OK Basis OFFER
49 Cash NA 2.58 NGI Socal (South Cal Border) Fixed Price BID
49 Cash NA 2.60 NGI Socal (South Cal Border) Fixed Price OFFER
49 Cash NA NA NGI Socal (South Cal Border) Basis BID
49 Cash NA NA NGI Socal (South Cal Border) Basis OFFER
49 Cash NA 2.55 NGI Malin (North Cal Border) Fixed Price BID
49 Cash NA 2.57 NGI Malin (North Cal Border) Fixed Price OFFER
49 Cash NA NA NGI Malin (North Cal Border) Basis BID
49 Cash NA NA NGI Malin (North Cal Border) Basis OFFER
49 Cash NA 2.57 PG&E City Gate Fixed Price BID
49 Cash NA 2.59 PG&E City Gate Fixed Price OFFER
49 Cash NA NA PG&E City Gate Basis BID
49 Cash NA NA PG&E City Gate Basis OFFER
50 ROM NA 2.50 NGI Socal (South Cal Border) Fixed Price BID
50 ROM NA 2.52 NGI Socal (South Cal Border) Fixed Price OFFER
50 ROM NA NA NGI Socal (South Cal Border) Basis BID
50 ROM NA NA NGI Socal (South Cal Border) Basis OFFER
50 ROM NA 2.48 NGI Malin (North Cal Border) Fixed Price BID
50 ROM NA 2.50 NGI Malin (North Cal Border) Fixed Price OFFER
50 ROM NA NA NGI Malin (North Cal Border) Basis BID
50 ROM NA NA NGI Malin (North Cal Border) Basis OFFER
50 ROM NA 2.52 PG&E City Gate Fixed Price BID
50 ROM NA 2.54 PG&E City Gate Fixed Price OFFER
50 ROM NA NA PG&E City Gate Basis BID
50 ROM NA NA PG&E City Gate Basis OFFER
51 2001-12-01 NA 2.79 NGI Socal (South Cal Border) Fixed Price BID
51 2001-12-01 NA 2.81 NGI Socal (South Cal Border) Fixed Price OFFER
51 2001-12-01 NA -0.17 NGI Socal (South Cal Border) Basis BID
51 2001-12-01 NA -0.15 NGI Socal (South Cal Border) Basis OFFER
51 2001-12-01 NA 2.79 NGI Malin (North Cal Border) Fixed Price BID
51 2001-12-01 NA 2.81 NGI Malin (North Cal Border) Fixed Price OFFER
51 2001-12-01 NA -0.18 NGI Malin (North Cal Border) Basis BID
51 2001-12-01 NA -0.16 NGI Malin (North Cal Border) Basis OFFER
51 2001-12-01 NA 2.88 PG&E City Gate Fixed Price BID
51 2001-12-01 NA 2.90 PG&E City Gate Fixed Price OFFER
51 2001-12-01 NA -0.08 PG&E City Gate Basis BID
51 2001-12-01 NA -0.06 PG&E City Gate Basis OFFER
52 2001-12-01 2002-03-01 2.95 NGI Socal (South Cal Border) Fixed Price BID
52 2001-12-01 2002-03-01 2.97 NGI Socal (South Cal Border) Fixed Price OFFER
52 2001-12-01 2002-03-01 -0.14 NGI Socal (South Cal Border) Basis BID
52 2001-12-01 2002-03-01 -0.12 NGI Socal (South Cal Border) Basis OFFER
52 2001-12-01 2002-03-01 2.92 NGI Malin (North Cal Border) Fixed Price BID
52 2001-12-01 2002-03-01 2.94 NGI Malin (North Cal Border) Fixed Price OFFER
52 2001-12-01 2002-03-01 -0.17 NGI Malin (North Cal Border) Basis BID
52 2001-12-01 2002-03-01 -0.15 NGI Malin (North Cal Border) Basis OFFER
52 2001-12-01 2002-03-01 3.02 PG&E City Gate Fixed Price BID
52 2001-12-01 2002-03-01 3.04 PG&E City Gate Fixed Price OFFER
52 2001-12-01 2002-03-01 -0.07 PG&E City Gate Basis BID
52 2001-12-01 2002-03-01 -0.05 PG&E City Gate Basis OFFER
53 2002-04-01 2002-10-01 3.22 NGI Socal (South Cal Border) Fixed Price BID
53 2002-04-01 2002-10-01 3.24 NGI Socal (South Cal Border) Fixed Price OFFER
53 2002-04-01 2002-10-01 0.06 NGI Socal (South Cal Border) Basis BID
53 2002-04-01 2002-10-01 0.08 NGI Socal (South Cal Border) Basis OFFER
53 2002-04-01 2002-10-01 3.04 NGI Malin (North Cal Border) Fixed Price BID
53 2002-04-01 2002-10-01 3.06 NGI Malin (North Cal Border) Fixed Price OFFER
53 2002-04-01 2002-10-01 -0.12 NGI Malin (North Cal Border) Basis BID
53 2002-04-01 2002-10-01 -0.10 NGI Malin (North Cal Border) Basis OFFER
53 2002-04-01 2002-10-01 3.26 PG&E City Gate Fixed Price BID
53 2002-04-01 2002-10-01 3.28 PG&E City Gate Fixed Price OFFER
53 2002-04-01 2002-10-01 0.10 PG&E City Gate Basis BID
53 2002-04-01 2002-10-01 0.12 PG&E City Gate Basis OFFER
54 2002-11-01 2003-03-01 3.74 NGI Socal (South Cal Border) Fixed Price BID
54 2002-11-01 2003-03-01 3.76 NGI Socal (South Cal Border) Fixed Price OFFER
54 2002-11-01 2003-03-01 0.09 NGI Socal (South Cal Border) Basis BID
54 2002-11-01 2003-03-01 0.11 NGI Socal (South Cal Border) Basis OFFER
54 2002-11-01 2003-03-01 3.72 NGI Malin (North Cal Border) Fixed Price BID
54 2002-11-01 2003-03-01 3.74 NGI Malin (North Cal Border) Fixed Price OFFER
54 2002-11-01 2003-03-01 0.07 NGI Malin (North Cal Border) Basis BID
54 2002-11-01 2003-03-01 0.09 NGI Malin (North Cal Border) Basis OFFER
54 2002-11-01 2003-03-01 3.96 PG&E City Gate Fixed Price BID
54 2002-11-01 2003-03-01 3.98 PG&E City Gate Fixed Price OFFER
54 2002-11-01 2003-03-01 0.31 PG&E City Gate Basis BID
54 2002-11-01 2003-03-01 0.33 PG&E City Gate Basis OFFER
55 One Year Strip* NA 3.16 NGI Socal (South Cal Border) Fixed Price BID
55 One Year Strip* NA 3.18 NGI Socal (South Cal Border) Fixed Price OFFER
55 One Year Strip* NA -0.01 NGI Socal (South Cal Border) Basis BID
55 One Year Strip* NA 0.02 NGI Socal (South Cal Border) Basis OFFER
55 One Year Strip* NA 3.04 NGI Malin (North Cal Border) Fixed Price BID
55 One Year Strip* NA 3.06 NGI Malin (North Cal Border) Fixed Price OFFER
55 One Year Strip* NA -0.13 NGI Malin (North Cal Border) Basis BID
55 One Year Strip* NA -0.11 NGI Malin (North Cal Border) Basis OFFER
55 One Year Strip* NA 3.22 PG&E City Gate Fixed Price BID
55 One Year Strip* NA 3.24 PG&E City Gate Fixed Price OFFER
55 One Year Strip* NA 0.05 PG&E City Gate Basis BID
55 One Year Strip* NA 0.07 PG&E City Gate Basis OFFER

Now we can start analyzing it using tidy tools. This data isn’t particularly interesting to me so I’m not sure what we’d graph, but here’s a start (ignoring all the per-row data because I don’t, like, know what ROM is):

g <- ggplot(tbl_with_row, aes(header1, value)) +
  geom_boxplot() +
  facet_wrap(~ header2, scales = "free_y")
g

# polished:
g +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Region") +
  ylab("Forward Price (US$/MMBtu)") +
  scale_y_continuous(labels = scales::dollar_format())

Bonus: finding tbl areas by column names

Suppose the headings of these sub-tables weren’t as easy to specify: maybe we’re working across lots of sheets. But maybe the “sub-columns”, BID and OFFER, are consistent everywhere. Finding those would get you most of the way to specifying the locations of the tables.

I’d define a sub-column as a particular column name (pre-specified) followed by a sequence of either NAs or numbers- that works for this but may not for others. Here’s an early approach.

# grab out sub-columns with particular headers
sub_col_by_header <- function(tbl, col_names, reg = "[\\d\\.\\n]+") {
  processed <- tbl %>%
    mutate(match_pred = is.na(value) | str_detect(value, reg)) %>%
    mutate(colgroup = cumsum(!match_pred))

  header_cells <- processed %>%
    filter(is_name = value %in% col_names) %>%
    select(subcol_start_row = row, column, colname = value, colgroup)
  
  header_cells %>%
    inner_join(processed) %>%
    mutate(na = is.na(value)) %>%
    group_by(colgroup) %>%
    filter(!na | ((cumsum(!na) != sum(!na))), row != subcol_start_row) %>%
    ungroup() %>%
    select(-match_pred, -na)
}

sub_cols <- sub_col_by_header(tbl, c("BID", "OFFER"))
sub_cols
## Source: local data frame [280 x 6]
## 
##    subcol_start_row column colname colgroup   row              value
##               (int)  (int)   (chr)    (int) (int)              (chr)
## 1                26      7     BID       52    27               1.94
## 2                26      7     BID       52    28               1.96
## 3                26      7     BID       52    29 2.3450000000000002
## 4                26      7     BID       52    30 2.5477500000000002
## 5                26      7     BID       52    31 2.4712857142857145
## 6                26      7     BID       52    32 3.3110000000000008
## 7                26      7     BID       52    33 2.5507499999999999
## 8                37      7     BID       55    38 2.3763773146878147
## 9                37      7     BID       55    39 2.3976345963681185
## 10               37      7     BID       55    40 2.5518909635361497
## ..              ...    ...     ...      ...   ...                ...

Much like above, we can represent this graphically, so we can (for example) check we grabbed everything:

plot_sub_cols <- function(cols, chars = 4) {
  cols %>%
    ggplot(aes(column, -row, fill = colname)) +
    geom_tile() +
    geom_text(aes(label = stringr::str_sub(value, 1, chars))) +
    theme_void()
}

plot_sub_cols(sub_cols)

One can imagine ways of taking these shapes and finding the “header” areas above the upper left of each of them.