.@JennyBryan I want to see @drob figure out how to turn this into a tidy data set. I'd pay money to watch it in realtime….
— Simply Statistics (@simplystats) April 6, 2016
Challenge accepted. Here’s the data Jenny sent, which can be downloaded here:
This is some early unfinished work. I’m focusing on the tables with BID and OFFER. The phone numbers are probably not generally interesting to an analyst, and the NYMEX data is relatively easy to extract.
Rather than solve it for this particular sheet, I’m trying to think about tools that would be useful in general. In particular, I’m treating the problem as about pulling “sub-tables” out into a tidy format. There are a few general functions I come up with and others that could be useful.
We read it in with the readxl package:
library(dplyr)
library(tidyr)
library(stringr)
raw_data <- readxl::read_excel("~/Downloads/frank_ermis__11178__Enron Pricing Report.xlsx")
raw_data
## Source: local data frame [234 x 99]
##
## Enron North America - West Gas NA NA NA NA
## (dbl) (chr) (chr) (chr) (chr) (chr)
## 1 NA 37204 NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA ENA - West Gas Contacts NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA Houston Office NA NA
## 6 NA NA NA Barry Tycholiz NA NA
## 7 NA NA NA Kim Ward NA NA
## 8 NA NA NA Stephanie Miller NA NA
## 9 NA NA NA Philip Polsky NA NA
## 10 NA NA NA NA NA NA
## .. ... ... ... ... ... ...
## Variables not shown: NA (chr), NA (chr), NA (chr), NA (chr), NA (dbl), NA
## (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA (chr), NA
## (chr), NA (chr), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NYMEX Month (chr), NA (dbl), 37226 (dbl), 37257 (dbl), 37288
## (dbl), 37316 (dbl), 37347 (dbl), 37377 (dbl), 37408 (dbl), 37438 (dbl),
## 37469 (dbl), 37500 (dbl), 37530 (dbl), 37561 (dbl), NA (dbl), NA (dbl),
## NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA (dbl), NA
## (dbl)
Nothing new there.
Viewing it in melted format allows more useful operations to be performed:
# Gather into row-column-value format, and remove all-NA rows/cols
# (which is most of them)
tidy_excel <- function(x) {
x %>%
setNames(seq_len(ncol(x))) %>%
mutate(row = row_number()) %>%
tidyr::gather(column, value, -row) %>%
mutate(column = as.integer(column)) %>%
group_by(row) %>%
filter(!all(is.na(value))) %>%
group_by(column) %>%
filter(!all(is.na(value))) %>%
ungroup() %>%
arrange(column, row)
}
tbl <- tidy_excel(raw_data)
tbl
## Source: local data frame [2,139 x 3]
##
## row column value
## (int) (int) (chr)
## 1 1 2 37204
## 2 2 2 NA
## 3 3 2 ENA - West Gas Contacts
## 4 5 2 NA
## 5 6 2 NA
## 6 7 2 NA
## 7 8 2 NA
## 8 9 2 NA
## 9 11 2 Forward Prices (US$/MMBtu)
## 10 13 2 NA
## .. ... ... ...
Already it’s more useful because we can look at the structure:
library(ggplot2)
plot_excel <- function(x) {
x %>%
filter(!is.na(value)) %>%
ggplot(aes(column, -row, fill = !is.na(as.numeric(value)))) +
geom_tile(color = "black") +
theme_void()
}
plot_excel(tbl)
The vast majority of the cells are NA. But they may end up in some of the tables so we have to be careful to keep them for now.
Notice that all the sub-tables of BID/OFFER etc have the same size- that may not be true in other tables, but it’s true here. That gives me an idea.
tbl_toplefts <- tbl %>%
filter(str_detect(value, "(IF|AECO \\/|Border|City Gate)"),
row < 56)
tbl_toplefts
## Source: local data frame [10 x 3]
##
## row column value
## (int) (int) (chr)
## 1 24 7 IF CIG Rocky Mountains
## 2 35 7 AECO / NIT
## 3 46 7 NGI Socal (South Cal Border)
## 4 24 12 IF EL Paso Permian
## 5 35 12 IF NWPL Canadian Border (Sumas)
## 6 46 12 NGI Malin (North Cal Border)
## 7 13 17 IF NWPL Rocky Mountains
## 8 24 17 IF EL Paso San Juan
## 9 35 17 IF PEPL TX-OK
## 10 46 17 PG&E City Gate
We then have the chance to extract tables based on those positions. I write an approach that can be done with any vector of row/column starts, as well as either a fixed or at least known width and height. (If it turned out that the tables, say, alternate between width 4 and width 6, that would be easy to modify beforehand).
extract_sub_tbl_position <- function(x, row_start, col_start,
width, height) {
positions <- data_frame(row_start, row_end = row_start + height - 1,
col_start, col_end = col_start + width - 1) %>%
arrange(row_start, col_start) %>%
group_by(table_id = row_number()) %>%
do(tidyr::crossing(row = seq(.$row_start, .$row_end),
column = seq(.$col_start, .$col_end))) %>%
ungroup()
positions %>%
inner_join(x)
}
sub_tbls <- extract_sub_tbl_position(tbl,
tbl_toplefts$row, tbl_toplefts$column,
width = 4, height = 10)
sub_tbls
## Source: local data frame [400 x 4]
##
## table_id row column value
## (int) (int) (int) (chr)
## 1 1 13 17 IF NWPL Rocky Mountains
## 2 1 13 18 NA
## 3 1 13 19 NA
## 4 1 13 20 NA
## 5 1 14 17 Fixed Price
## 6 1 14 18 NA
## 7 1 14 19 Basis
## 8 1 14 20 NA
## 9 1 15 17 BID
## 10 1 15 18 OFFER
## .. ... ... ... ...
plot_excel(sub_tbls)
In the process, we haven’t just pulled out these tables, we’ve kept track of which is which:
plot_excel(sub_tbls) +
geom_tile(aes(fill = factor(table_id)), color = "black")
What we really have here is multiple header rows (the red above). To work with this as tidy data, we’ve got to take those multiple header rows and put them in columns alongside it.
First figure out which are headers, and fill them across (you don’t want to fill NAs in non-header data):
library(zoo)
sub_tbls_filled <- sub_tbls %>%
group_by(table_id, row) %>%
mutate(is_header_row = is.na(as.numeric(value[1]))) %>%
mutate(value = ifelse(is_header_row, na.locf(value), value))
sub_tbls_filled
## Source: local data frame [400 x 5]
## Groups: table_id, row [100]
##
## table_id row column value is_header_row
## (int) (int) (int) (chr) (lgl)
## 1 1 13 17 IF NWPL Rocky Mountains TRUE
## 2 1 13 18 IF NWPL Rocky Mountains TRUE
## 3 1 13 19 IF NWPL Rocky Mountains TRUE
## 4 1 13 20 IF NWPL Rocky Mountains TRUE
## 5 1 14 17 Fixed Price TRUE
## 6 1 14 18 Fixed Price TRUE
## 7 1 14 19 Basis TRUE
## 8 1 14 20 Basis TRUE
## 9 1 15 17 BID TRUE
## 10 1 15 18 OFFER TRUE
## .. ... ... ... ... ...
Then we need to know, for each column/table_id, what combination of columns it has. This “multiple header” pattern probably appears in other sheets and may be worth “verb”-ing in some way.
header_data <- sub_tbls_filled %>%
group_by(table_id) %>%
filter(is_header_row) %>%
mutate(header_row = paste0("header", row - min(row) + 1)) %>%
ungroup() %>%
select(table_id, column, header_row, value) %>%
spread(header_row, value)
header_data
## Source: local data frame [40 x 5]
##
## table_id column header1 header2 header3
## (int) (int) (chr) (chr) (chr)
## 1 1 17 IF NWPL Rocky Mountains Fixed Price BID
## 2 1 18 IF NWPL Rocky Mountains Fixed Price OFFER
## 3 1 19 IF NWPL Rocky Mountains Basis BID
## 4 1 20 IF NWPL Rocky Mountains Basis OFFER
## 5 2 7 IF CIG Rocky Mountains Fixed Price BID
## 6 2 8 IF CIG Rocky Mountains Fixed Price OFFER
## 7 2 9 IF CIG Rocky Mountains Basis BID
## 8 2 10 IF CIG Rocky Mountains Basis OFFER
## 9 3 12 IF EL Paso Permian Fixed Price BID
## 10 3 13 IF EL Paso Permian Fixed Price OFFER
## .. ... ... ... ... ...
Joining that with the non-header data gets us a tidy version, with a value column that we can actually convert to numeric. (We can also drop the table_id
afterwards since it is redundant with header1
).
td_tbls <- sub_tbls_filled %>%
ungroup() %>%
filter(!is_header_row) %>%
select(-is_header_row) %>%
inner_join(header_data) %>%
select(-column, -table_id) %>%
mutate(value = as.numeric(value))
td_tbls
## Source: local data frame [280 x 5]
##
## row value header1 header2 header3
## (int) (dbl) (chr) (chr) (chr)
## 1 16 1.890 IF NWPL Rocky Mountains Fixed Price BID
## 2 16 1.910 IF NWPL Rocky Mountains Fixed Price OFFER
## 3 16 NA IF NWPL Rocky Mountains Basis BID
## 4 16 NA IF NWPL Rocky Mountains Basis OFFER
## 5 17 2.060 IF NWPL Rocky Mountains Fixed Price BID
## 6 17 2.080 IF NWPL Rocky Mountains Fixed Price OFFER
## 7 17 NA IF NWPL Rocky Mountains Basis BID
## 8 17 NA IF NWPL Rocky Mountains Basis OFFER
## 9 18 2.395 IF NWPL Rocky Mountains Fixed Price BID
## 10 18 2.415 IF NWPL Rocky Mountains Fixed Price OFFER
## .. ... ... ... ... ...
The next step is that we need to get the per-row information. These come from the second (with values like “Cash”, “ROM”, and “from” dates) and fourth columns (entirely “to” dates) of the full table (and we only care about the ones in the same rows as td_tbls
):
tbl_with_row <- tbl %>%
filter(column %in% c(2, 4)) %>%
mutate(column = c("row_info", "to_date")[column / 2]) %>%
spread(column, value) %>%
inner_join(td_tbls, by = "row")
tbl_with_row
## Source: local data frame [280 x 7]
##
## row row_info to_date value header1 header2
## (int) (chr) (chr) (dbl) (chr) (chr)
## 1 16 Cash NA 1.890 IF NWPL Rocky Mountains Fixed Price
## 2 16 Cash NA 1.910 IF NWPL Rocky Mountains Fixed Price
## 3 16 Cash NA NA IF NWPL Rocky Mountains Basis
## 4 16 Cash NA NA IF NWPL Rocky Mountains Basis
## 5 17 ROM NA 2.060 IF NWPL Rocky Mountains Fixed Price
## 6 17 ROM NA 2.080 IF NWPL Rocky Mountains Fixed Price
## 7 17 ROM NA NA IF NWPL Rocky Mountains Basis
## 8 17 ROM NA NA IF NWPL Rocky Mountains Basis
## 9 18 37226 NA 2.395 IF NWPL Rocky Mountains Fixed Price
## 10 18 37226 NA 2.415 IF NWPL Rocky Mountains Fixed Price
## .. ... ... ... ... ... ...
## Variables not shown: header3 (chr)
Last step is to clean those values like 37226 into dates. Those came straight out of the readxl
step, (see here for the explanation of excel dates).
convert_excel_date <- function(x) {
result <- as.Date("1900-01-01") + as.numeric(x) - 2
ifelse(is.na(result), x, as.character(result))
}
final_tbl <- tbl_with_row %>%
mutate(row_info = convert_excel_date(row_info),
to_date = convert_excel_date(to_date))
Let’s look at the entire table:
knitr::kable(final_tbl, digits = 2)
row | row_info | to_date | value | header1 | header2 | header3 |
---|---|---|---|---|---|---|
16 | Cash | NA | 1.89 | IF NWPL Rocky Mountains | Fixed Price | BID |
16 | Cash | NA | 1.91 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
16 | Cash | NA | NA | IF NWPL Rocky Mountains | Basis | BID |
16 | Cash | NA | NA | IF NWPL Rocky Mountains | Basis | OFFER |
17 | ROM | NA | 2.06 | IF NWPL Rocky Mountains | Fixed Price | BID |
17 | ROM | NA | 2.08 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
17 | ROM | NA | NA | IF NWPL Rocky Mountains | Basis | BID |
17 | ROM | NA | NA | IF NWPL Rocky Mountains | Basis | OFFER |
18 | 2001-12-01 | NA | 2.40 | IF NWPL Rocky Mountains | Fixed Price | BID |
18 | 2001-12-01 | NA | 2.42 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
18 | 2001-12-01 | NA | -0.56 | IF NWPL Rocky Mountains | Basis | BID |
18 | 2001-12-01 | NA | -0.55 | IF NWPL Rocky Mountains | Basis | OFFER |
19 | 2001-12-01 | 2002-03-01 | 2.59 | IF NWPL Rocky Mountains | Fixed Price | BID |
19 | 2001-12-01 | 2002-03-01 | 2.61 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
19 | 2001-12-01 | 2002-03-01 | -0.49 | IF NWPL Rocky Mountains | Basis | BID |
19 | 2001-12-01 | 2002-03-01 | -0.47 | IF NWPL Rocky Mountains | Basis | OFFER |
20 | 2002-04-01 | 2002-10-01 | 2.58 | IF NWPL Rocky Mountains | Fixed Price | BID |
20 | 2002-04-01 | 2002-10-01 | 2.60 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
20 | 2002-04-01 | 2002-10-01 | -0.58 | IF NWPL Rocky Mountains | Basis | BID |
20 | 2002-04-01 | 2002-10-01 | -0.56 | IF NWPL Rocky Mountains | Basis | OFFER |
21 | 2002-11-01 | 2003-03-01 | 3.36 | IF NWPL Rocky Mountains | Fixed Price | BID |
21 | 2002-11-01 | 2003-03-01 | 3.38 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
21 | 2002-11-01 | 2003-03-01 | -0.30 | IF NWPL Rocky Mountains | Basis | BID |
21 | 2002-11-01 | 2003-03-01 | -0.28 | IF NWPL Rocky Mountains | Basis | OFFER |
22 | One Year Strip* | NA | 2.63 | IF NWPL Rocky Mountains | Fixed Price | BID |
22 | One Year Strip* | NA | 2.65 | IF NWPL Rocky Mountains | Fixed Price | OFFER |
22 | One Year Strip* | NA | -0.53 | IF NWPL Rocky Mountains | Basis | BID |
22 | One Year Strip* | NA | -0.51 | IF NWPL Rocky Mountains | Basis | OFFER |
27 | Cash | NA | 1.94 | IF CIG Rocky Mountains | Fixed Price | BID |
27 | Cash | NA | 1.96 | IF CIG Rocky Mountains | Fixed Price | OFFER |
27 | Cash | NA | NA | IF CIG Rocky Mountains | Basis | BID |
27 | Cash | NA | NA | IF CIG Rocky Mountains | Basis | OFFER |
27 | Cash | NA | 2.38 | IF EL Paso Permian | Fixed Price | BID |
27 | Cash | NA | 2.40 | IF EL Paso Permian | Fixed Price | OFFER |
27 | Cash | NA | NA | IF EL Paso Permian | Basis | BID |
27 | Cash | NA | NA | IF EL Paso Permian | Basis | OFFER |
27 | Cash | NA | 2.45 | IF EL Paso San Juan | Fixed Price | BID |
27 | Cash | NA | 2.47 | IF EL Paso San Juan | Fixed Price | OFFER |
27 | Cash | NA | NA | IF EL Paso San Juan | Basis | BID |
27 | Cash | NA | NA | IF EL Paso San Juan | Basis | OFFER |
28 | ROM | NA | 1.96 | IF CIG Rocky Mountains | Fixed Price | BID |
28 | ROM | NA | 1.98 | IF CIG Rocky Mountains | Fixed Price | OFFER |
28 | ROM | NA | NA | IF CIG Rocky Mountains | Basis | BID |
28 | ROM | NA | NA | IF CIG Rocky Mountains | Basis | OFFER |
28 | ROM | NA | 2.42 | IF EL Paso Permian | Fixed Price | BID |
28 | ROM | NA | 2.44 | IF EL Paso Permian | Fixed Price | OFFER |
28 | ROM | NA | NA | IF EL Paso Permian | Basis | BID |
28 | ROM | NA | NA | IF EL Paso Permian | Basis | OFFER |
28 | ROM | NA | 2.35 | IF EL Paso San Juan | Fixed Price | BID |
28 | ROM | NA | 2.37 | IF EL Paso San Juan | Fixed Price | OFFER |
28 | ROM | NA | NA | IF EL Paso San Juan | Basis | BID |
28 | ROM | NA | NA | IF EL Paso San Juan | Basis | OFFER |
29 | 2001-12-01 | NA | 2.35 | IF CIG Rocky Mountains | Fixed Price | BID |
29 | 2001-12-01 | NA | 2.37 | IF CIG Rocky Mountains | Fixed Price | OFFER |
29 | 2001-12-01 | NA | -0.62 | IF CIG Rocky Mountains | Basis | BID |
29 | 2001-12-01 | NA | -0.60 | IF CIG Rocky Mountains | Basis | OFFER |
29 | 2001-12-01 | NA | 2.70 | IF EL Paso Permian | Fixed Price | BID |
29 | 2001-12-01 | NA | 2.72 | IF EL Paso Permian | Fixed Price | OFFER |
29 | 2001-12-01 | NA | -0.26 | IF EL Paso Permian | Basis | BID |
29 | 2001-12-01 | NA | -0.24 | IF EL Paso Permian | Basis | OFFER |
29 | 2001-12-01 | NA | 2.56 | IF EL Paso San Juan | Fixed Price | BID |
29 | 2001-12-01 | NA | 2.58 | IF EL Paso San Juan | Fixed Price | OFFER |
29 | 2001-12-01 | NA | -0.40 | IF EL Paso San Juan | Basis | BID |
29 | 2001-12-01 | NA | -0.38 | IF EL Paso San Juan | Basis | OFFER |
30 | 2001-12-01 | 2002-03-01 | 2.55 | IF CIG Rocky Mountains | Fixed Price | BID |
30 | 2001-12-01 | 2002-03-01 | 2.57 | IF CIG Rocky Mountains | Fixed Price | OFFER |
30 | 2001-12-01 | 2002-03-01 | -0.54 | IF CIG Rocky Mountains | Basis | BID |
30 | 2001-12-01 | 2002-03-01 | -0.52 | IF CIG Rocky Mountains | Basis | OFFER |
30 | 2001-12-01 | 2002-03-01 | 2.85 | IF EL Paso Permian | Fixed Price | BID |
30 | 2001-12-01 | 2002-03-01 | 2.87 | IF EL Paso Permian | Fixed Price | OFFER |
30 | 2001-12-01 | 2002-03-01 | -0.23 | IF EL Paso Permian | Basis | BID |
30 | 2001-12-01 | 2002-03-01 | -0.21 | IF EL Paso Permian | Basis | OFFER |
30 | 2001-12-01 | 2002-03-01 | 2.74 | IF EL Paso San Juan | Fixed Price | BID |
30 | 2001-12-01 | 2002-03-01 | 2.76 | IF EL Paso San Juan | Fixed Price | OFFER |
30 | 2001-12-01 | 2002-03-01 | -0.34 | IF EL Paso San Juan | Basis | BID |
30 | 2001-12-01 | 2002-03-01 | -0.32 | IF EL Paso San Juan | Basis | OFFER |
31 | 2002-04-01 | 2002-10-01 | 2.47 | IF CIG Rocky Mountains | Fixed Price | BID |
31 | 2002-04-01 | 2002-10-01 | 2.49 | IF CIG Rocky Mountains | Fixed Price | OFFER |
31 | 2002-04-01 | 2002-10-01 | -0.70 | IF CIG Rocky Mountains | Basis | BID |
31 | 2002-04-01 | 2002-10-01 | -0.68 | IF CIG Rocky Mountains | Basis | OFFER |
31 | 2002-04-01 | 2002-10-01 | 3.01 | IF EL Paso Permian | Fixed Price | BID |
31 | 2002-04-01 | 2002-10-01 | 3.03 | IF EL Paso Permian | Fixed Price | OFFER |
31 | 2002-04-01 | 2002-10-01 | -0.16 | IF EL Paso Permian | Basis | BID |
31 | 2002-04-01 | 2002-10-01 | -0.14 | IF EL Paso Permian | Basis | OFFER |
31 | 2002-04-01 | 2002-10-01 | 2.80 | IF EL Paso San Juan | Fixed Price | BID |
31 | 2002-04-01 | 2002-10-01 | 2.82 | IF EL Paso San Juan | Fixed Price | OFFER |
31 | 2002-04-01 | 2002-10-01 | -0.36 | IF EL Paso San Juan | Basis | BID |
31 | 2002-04-01 | 2002-10-01 | -0.34 | IF EL Paso San Juan | Basis | OFFER |
32 | 2002-11-01 | 2003-03-01 | 3.31 | IF CIG Rocky Mountains | Fixed Price | BID |
32 | 2002-11-01 | 2003-03-01 | 3.33 | IF CIG Rocky Mountains | Fixed Price | OFFER |
32 | 2002-11-01 | 2003-03-01 | -0.34 | IF CIG Rocky Mountains | Basis | BID |
32 | 2002-11-01 | 2003-03-01 | -0.32 | IF CIG Rocky Mountains | Basis | OFFER |
32 | 2002-11-01 | 2003-03-01 | 3.50 | IF EL Paso Permian | Fixed Price | BID |
32 | 2002-11-01 | 2003-03-01 | 3.52 | IF EL Paso Permian | Fixed Price | OFFER |
32 | 2002-11-01 | 2003-03-01 | -0.15 | IF EL Paso Permian | Basis | BID |
32 | 2002-11-01 | 2003-03-01 | -0.13 | IF EL Paso Permian | Basis | OFFER |
32 | 2002-11-01 | 2003-03-01 | 3.42 | IF EL Paso San Juan | Fixed Price | BID |
32 | 2002-11-01 | 2003-03-01 | 3.44 | IF EL Paso San Juan | Fixed Price | OFFER |
32 | 2002-11-01 | 2003-03-01 | -0.23 | IF EL Paso San Juan | Basis | BID |
32 | 2002-11-01 | 2003-03-01 | -0.21 | IF EL Paso San Juan | Basis | OFFER |
33 | One Year Strip* | NA | 2.55 | IF CIG Rocky Mountains | Fixed Price | BID |
33 | One Year Strip* | NA | 2.57 | IF CIG Rocky Mountains | Fixed Price | OFFER |
33 | One Year Strip* | NA | -0.61 | IF CIG Rocky Mountains | Basis | BID |
33 | One Year Strip* | NA | -0.59 | IF CIG Rocky Mountains | Basis | OFFER |
33 | One Year Strip* | NA | 2.98 | IF EL Paso Permian | Fixed Price | BID |
33 | One Year Strip* | NA | 3.00 | IF EL Paso Permian | Fixed Price | OFFER |
33 | One Year Strip* | NA | -0.18 | IF EL Paso Permian | Basis | BID |
33 | One Year Strip* | NA | -0.16 | IF EL Paso Permian | Basis | OFFER |
33 | One Year Strip* | NA | 2.82 | IF EL Paso San Juan | Fixed Price | BID |
33 | One Year Strip* | NA | 2.84 | IF EL Paso San Juan | Fixed Price | OFFER |
33 | One Year Strip* | NA | -0.35 | IF EL Paso San Juan | Basis | BID |
33 | One Year Strip* | NA | -0.33 | IF EL Paso San Juan | Basis | OFFER |
38 | Cash | NA | 2.38 | AECO / NIT | Fixed Price | BID |
38 | Cash | NA | 2.40 | AECO / NIT | Fixed Price | OFFER |
38 | Cash | NA | NA | AECO / NIT | Basis | BID |
38 | Cash | NA | NA | AECO / NIT | Basis | OFFER |
38 | Cash | NA | 2.48 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
38 | Cash | NA | 2.50 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
38 | Cash | NA | NA | IF NWPL Canadian Border (Sumas) | Basis | BID |
38 | Cash | NA | NA | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
38 | Cash | NA | 2.53 | IF PEPL TX-OK | Fixed Price | BID |
38 | Cash | NA | 2.55 | IF PEPL TX-OK | Fixed Price | OFFER |
38 | Cash | NA | NA | IF PEPL TX-OK | Basis | BID |
38 | Cash | NA | NA | IF PEPL TX-OK | Basis | OFFER |
39 | ROM | NA | 2.40 | AECO / NIT | Fixed Price | BID |
39 | ROM | NA | 2.42 | AECO / NIT | Fixed Price | OFFER |
39 | ROM | NA | NA | AECO / NIT | Basis | BID |
39 | ROM | NA | NA | AECO / NIT | Basis | OFFER |
39 | ROM | NA | 2.46 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
39 | ROM | NA | 2.48 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
39 | ROM | NA | NA | IF NWPL Canadian Border (Sumas) | Basis | BID |
39 | ROM | NA | NA | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
39 | ROM | NA | 2.53 | IF PEPL TX-OK | Fixed Price | BID |
39 | ROM | NA | 2.55 | IF PEPL TX-OK | Fixed Price | OFFER |
39 | ROM | NA | NA | IF PEPL TX-OK | Basis | BID |
39 | ROM | NA | NA | IF PEPL TX-OK | Basis | OFFER |
40 | 2001-12-01 | NA | 2.55 | AECO / NIT | Fixed Price | BID |
40 | 2001-12-01 | NA | 2.57 | AECO / NIT | Fixed Price | OFFER |
40 | 2001-12-01 | NA | -0.41 | AECO / NIT | Basis | BID |
40 | 2001-12-01 | NA | -0.39 | AECO / NIT | Basis | OFFER |
40 | 2001-12-01 | NA | 2.80 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
40 | 2001-12-01 | NA | 2.82 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
40 | 2001-12-01 | NA | -0.16 | IF NWPL Canadian Border (Sumas) | Basis | BID |
40 | 2001-12-01 | NA | -0.14 | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
40 | 2001-12-01 | NA | 2.83 | IF PEPL TX-OK | Fixed Price | BID |
40 | 2001-12-01 | NA | 2.85 | IF PEPL TX-OK | Fixed Price | OFFER |
40 | 2001-12-01 | NA | -0.13 | IF PEPL TX-OK | Basis | BID |
40 | 2001-12-01 | NA | -0.11 | IF PEPL TX-OK | Basis | OFFER |
41 | 2001-12-01 | 2002-03-01 | 2.62 | AECO / NIT | Fixed Price | BID |
41 | 2001-12-01 | 2002-03-01 | 2.64 | AECO / NIT | Fixed Price | OFFER |
41 | 2001-12-01 | 2002-03-01 | -0.47 | AECO / NIT | Basis | BID |
41 | 2001-12-01 | 2002-03-01 | -0.45 | AECO / NIT | Basis | OFFER |
41 | 2001-12-01 | 2002-03-01 | 2.89 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
41 | 2001-12-01 | 2002-03-01 | 2.91 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
41 | 2001-12-01 | 2002-03-01 | -0.20 | IF NWPL Canadian Border (Sumas) | Basis | BID |
41 | 2001-12-01 | 2002-03-01 | -0.18 | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
41 | 2001-12-01 | 2002-03-01 | 2.96 | IF PEPL TX-OK | Fixed Price | BID |
41 | 2001-12-01 | 2002-03-01 | 2.98 | IF PEPL TX-OK | Fixed Price | OFFER |
41 | 2001-12-01 | 2002-03-01 | -0.13 | IF PEPL TX-OK | Basis | BID |
41 | 2001-12-01 | 2002-03-01 | -0.11 | IF PEPL TX-OK | Basis | OFFER |
42 | 2002-04-01 | 2002-10-01 | 2.66 | AECO / NIT | Fixed Price | BID |
42 | 2002-04-01 | 2002-10-01 | 2.68 | AECO / NIT | Fixed Price | OFFER |
42 | 2002-04-01 | 2002-10-01 | -0.50 | AECO / NIT | Basis | BID |
42 | 2002-04-01 | 2002-10-01 | -0.48 | AECO / NIT | Basis | OFFER |
42 | 2002-04-01 | 2002-10-01 | 2.80 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
42 | 2002-04-01 | 2002-10-01 | 2.82 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
42 | 2002-04-01 | 2002-10-01 | -0.37 | IF NWPL Canadian Border (Sumas) | Basis | BID |
42 | 2002-04-01 | 2002-10-01 | -0.35 | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
42 | 2002-04-01 | 2002-10-01 | 3.05 | IF PEPL TX-OK | Fixed Price | BID |
42 | 2002-04-01 | 2002-10-01 | 3.07 | IF PEPL TX-OK | Fixed Price | OFFER |
42 | 2002-04-01 | 2002-10-01 | -0.12 | IF PEPL TX-OK | Basis | BID |
42 | 2002-04-01 | 2002-10-01 | -0.10 | IF PEPL TX-OK | Basis | OFFER |
43 | 2002-11-01 | 2003-03-01 | 3.22 | AECO / NIT | Fixed Price | BID |
43 | 2002-11-01 | 2003-03-01 | 3.24 | AECO / NIT | Fixed Price | OFFER |
43 | 2002-11-01 | 2003-03-01 | -0.44 | AECO / NIT | Basis | BID |
43 | 2002-11-01 | 2003-03-01 | -0.42 | AECO / NIT | Basis | OFFER |
43 | 2002-11-01 | 2003-03-01 | 3.71 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
43 | 2002-11-01 | 2003-03-01 | 3.73 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
43 | 2002-11-01 | 2003-03-01 | 0.06 | IF NWPL Canadian Border (Sumas) | Basis | BID |
43 | 2002-11-01 | 2003-03-01 | 0.08 | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
43 | 2002-11-01 | 2003-03-01 | 3.53 | IF PEPL TX-OK | Fixed Price | BID |
43 | 2002-11-01 | 2003-03-01 | 3.55 | IF PEPL TX-OK | Fixed Price | OFFER |
43 | 2002-11-01 | 2003-03-01 | -0.12 | IF PEPL TX-OK | Basis | BID |
43 | 2002-11-01 | 2003-03-01 | -0.10 | IF PEPL TX-OK | Basis | OFFER |
44 | One Year Strip* | NA | 2.68 | AECO / NIT | Fixed Price | BID |
44 | One Year Strip* | NA | 2.70 | AECO / NIT | Fixed Price | OFFER |
44 | One Year Strip* | NA | -0.49 | AECO / NIT | Basis | BID |
44 | One Year Strip* | NA | -0.47 | AECO / NIT | Basis | OFFER |
44 | One Year Strip* | NA | 2.88 | IF NWPL Canadian Border (Sumas) | Fixed Price | BID |
44 | One Year Strip* | NA | 2.90 | IF NWPL Canadian Border (Sumas) | Fixed Price | OFFER |
44 | One Year Strip* | NA | -0.28 | IF NWPL Canadian Border (Sumas) | Basis | BID |
44 | One Year Strip* | NA | -0.26 | IF NWPL Canadian Border (Sumas) | Basis | OFFER |
44 | One Year Strip* | NA | 3.04 | IF PEPL TX-OK | Fixed Price | BID |
44 | One Year Strip* | NA | 3.06 | IF PEPL TX-OK | Fixed Price | OFFER |
44 | One Year Strip* | NA | -0.12 | IF PEPL TX-OK | Basis | BID |
44 | One Year Strip* | NA | -0.10 | IF PEPL TX-OK | Basis | OFFER |
49 | Cash | NA | 2.58 | NGI Socal (South Cal Border) | Fixed Price | BID |
49 | Cash | NA | 2.60 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
49 | Cash | NA | NA | NGI Socal (South Cal Border) | Basis | BID |
49 | Cash | NA | NA | NGI Socal (South Cal Border) | Basis | OFFER |
49 | Cash | NA | 2.55 | NGI Malin (North Cal Border) | Fixed Price | BID |
49 | Cash | NA | 2.57 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
49 | Cash | NA | NA | NGI Malin (North Cal Border) | Basis | BID |
49 | Cash | NA | NA | NGI Malin (North Cal Border) | Basis | OFFER |
49 | Cash | NA | 2.57 | PG&E City Gate | Fixed Price | BID |
49 | Cash | NA | 2.59 | PG&E City Gate | Fixed Price | OFFER |
49 | Cash | NA | NA | PG&E City Gate | Basis | BID |
49 | Cash | NA | NA | PG&E City Gate | Basis | OFFER |
50 | ROM | NA | 2.50 | NGI Socal (South Cal Border) | Fixed Price | BID |
50 | ROM | NA | 2.52 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
50 | ROM | NA | NA | NGI Socal (South Cal Border) | Basis | BID |
50 | ROM | NA | NA | NGI Socal (South Cal Border) | Basis | OFFER |
50 | ROM | NA | 2.48 | NGI Malin (North Cal Border) | Fixed Price | BID |
50 | ROM | NA | 2.50 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
50 | ROM | NA | NA | NGI Malin (North Cal Border) | Basis | BID |
50 | ROM | NA | NA | NGI Malin (North Cal Border) | Basis | OFFER |
50 | ROM | NA | 2.52 | PG&E City Gate | Fixed Price | BID |
50 | ROM | NA | 2.54 | PG&E City Gate | Fixed Price | OFFER |
50 | ROM | NA | NA | PG&E City Gate | Basis | BID |
50 | ROM | NA | NA | PG&E City Gate | Basis | OFFER |
51 | 2001-12-01 | NA | 2.79 | NGI Socal (South Cal Border) | Fixed Price | BID |
51 | 2001-12-01 | NA | 2.81 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
51 | 2001-12-01 | NA | -0.17 | NGI Socal (South Cal Border) | Basis | BID |
51 | 2001-12-01 | NA | -0.15 | NGI Socal (South Cal Border) | Basis | OFFER |
51 | 2001-12-01 | NA | 2.79 | NGI Malin (North Cal Border) | Fixed Price | BID |
51 | 2001-12-01 | NA | 2.81 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
51 | 2001-12-01 | NA | -0.18 | NGI Malin (North Cal Border) | Basis | BID |
51 | 2001-12-01 | NA | -0.16 | NGI Malin (North Cal Border) | Basis | OFFER |
51 | 2001-12-01 | NA | 2.88 | PG&E City Gate | Fixed Price | BID |
51 | 2001-12-01 | NA | 2.90 | PG&E City Gate | Fixed Price | OFFER |
51 | 2001-12-01 | NA | -0.08 | PG&E City Gate | Basis | BID |
51 | 2001-12-01 | NA | -0.06 | PG&E City Gate | Basis | OFFER |
52 | 2001-12-01 | 2002-03-01 | 2.95 | NGI Socal (South Cal Border) | Fixed Price | BID |
52 | 2001-12-01 | 2002-03-01 | 2.97 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
52 | 2001-12-01 | 2002-03-01 | -0.14 | NGI Socal (South Cal Border) | Basis | BID |
52 | 2001-12-01 | 2002-03-01 | -0.12 | NGI Socal (South Cal Border) | Basis | OFFER |
52 | 2001-12-01 | 2002-03-01 | 2.92 | NGI Malin (North Cal Border) | Fixed Price | BID |
52 | 2001-12-01 | 2002-03-01 | 2.94 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
52 | 2001-12-01 | 2002-03-01 | -0.17 | NGI Malin (North Cal Border) | Basis | BID |
52 | 2001-12-01 | 2002-03-01 | -0.15 | NGI Malin (North Cal Border) | Basis | OFFER |
52 | 2001-12-01 | 2002-03-01 | 3.02 | PG&E City Gate | Fixed Price | BID |
52 | 2001-12-01 | 2002-03-01 | 3.04 | PG&E City Gate | Fixed Price | OFFER |
52 | 2001-12-01 | 2002-03-01 | -0.07 | PG&E City Gate | Basis | BID |
52 | 2001-12-01 | 2002-03-01 | -0.05 | PG&E City Gate | Basis | OFFER |
53 | 2002-04-01 | 2002-10-01 | 3.22 | NGI Socal (South Cal Border) | Fixed Price | BID |
53 | 2002-04-01 | 2002-10-01 | 3.24 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
53 | 2002-04-01 | 2002-10-01 | 0.06 | NGI Socal (South Cal Border) | Basis | BID |
53 | 2002-04-01 | 2002-10-01 | 0.08 | NGI Socal (South Cal Border) | Basis | OFFER |
53 | 2002-04-01 | 2002-10-01 | 3.04 | NGI Malin (North Cal Border) | Fixed Price | BID |
53 | 2002-04-01 | 2002-10-01 | 3.06 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
53 | 2002-04-01 | 2002-10-01 | -0.12 | NGI Malin (North Cal Border) | Basis | BID |
53 | 2002-04-01 | 2002-10-01 | -0.10 | NGI Malin (North Cal Border) | Basis | OFFER |
53 | 2002-04-01 | 2002-10-01 | 3.26 | PG&E City Gate | Fixed Price | BID |
53 | 2002-04-01 | 2002-10-01 | 3.28 | PG&E City Gate | Fixed Price | OFFER |
53 | 2002-04-01 | 2002-10-01 | 0.10 | PG&E City Gate | Basis | BID |
53 | 2002-04-01 | 2002-10-01 | 0.12 | PG&E City Gate | Basis | OFFER |
54 | 2002-11-01 | 2003-03-01 | 3.74 | NGI Socal (South Cal Border) | Fixed Price | BID |
54 | 2002-11-01 | 2003-03-01 | 3.76 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
54 | 2002-11-01 | 2003-03-01 | 0.09 | NGI Socal (South Cal Border) | Basis | BID |
54 | 2002-11-01 | 2003-03-01 | 0.11 | NGI Socal (South Cal Border) | Basis | OFFER |
54 | 2002-11-01 | 2003-03-01 | 3.72 | NGI Malin (North Cal Border) | Fixed Price | BID |
54 | 2002-11-01 | 2003-03-01 | 3.74 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
54 | 2002-11-01 | 2003-03-01 | 0.07 | NGI Malin (North Cal Border) | Basis | BID |
54 | 2002-11-01 | 2003-03-01 | 0.09 | NGI Malin (North Cal Border) | Basis | OFFER |
54 | 2002-11-01 | 2003-03-01 | 3.96 | PG&E City Gate | Fixed Price | BID |
54 | 2002-11-01 | 2003-03-01 | 3.98 | PG&E City Gate | Fixed Price | OFFER |
54 | 2002-11-01 | 2003-03-01 | 0.31 | PG&E City Gate | Basis | BID |
54 | 2002-11-01 | 2003-03-01 | 0.33 | PG&E City Gate | Basis | OFFER |
55 | One Year Strip* | NA | 3.16 | NGI Socal (South Cal Border) | Fixed Price | BID |
55 | One Year Strip* | NA | 3.18 | NGI Socal (South Cal Border) | Fixed Price | OFFER |
55 | One Year Strip* | NA | -0.01 | NGI Socal (South Cal Border) | Basis | BID |
55 | One Year Strip* | NA | 0.02 | NGI Socal (South Cal Border) | Basis | OFFER |
55 | One Year Strip* | NA | 3.04 | NGI Malin (North Cal Border) | Fixed Price | BID |
55 | One Year Strip* | NA | 3.06 | NGI Malin (North Cal Border) | Fixed Price | OFFER |
55 | One Year Strip* | NA | -0.13 | NGI Malin (North Cal Border) | Basis | BID |
55 | One Year Strip* | NA | -0.11 | NGI Malin (North Cal Border) | Basis | OFFER |
55 | One Year Strip* | NA | 3.22 | PG&E City Gate | Fixed Price | BID |
55 | One Year Strip* | NA | 3.24 | PG&E City Gate | Fixed Price | OFFER |
55 | One Year Strip* | NA | 0.05 | PG&E City Gate | Basis | BID |
55 | One Year Strip* | NA | 0.07 | PG&E City Gate | Basis | OFFER |
Now we can start analyzing it using tidy tools. This data isn’t particularly interesting to me so I’m not sure what we’d graph, but here’s a start (ignoring all the per-row data because I don’t, like, know what ROM is):
g <- ggplot(tbl_with_row, aes(header1, value)) +
geom_boxplot() +
facet_wrap(~ header2, scales = "free_y")
g
# polished:
g +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Region") +
ylab("Forward Price (US$/MMBtu)") +
scale_y_continuous(labels = scales::dollar_format())
Suppose the headings of these sub-tables weren’t as easy to specify: maybe we’re working across lots of sheets. But maybe the “sub-columns”, BID and OFFER, are consistent everywhere. Finding those would get you most of the way to specifying the locations of the tables.
I’d define a sub-column as a particular column name (pre-specified) followed by a sequence of either NAs or numbers- that works for this but may not for others. Here’s an early approach.
# grab out sub-columns with particular headers
sub_col_by_header <- function(tbl, col_names, reg = "[\\d\\.\\n]+") {
processed <- tbl %>%
mutate(match_pred = is.na(value) | str_detect(value, reg)) %>%
mutate(colgroup = cumsum(!match_pred))
header_cells <- processed %>%
filter(is_name = value %in% col_names) %>%
select(subcol_start_row = row, column, colname = value, colgroup)
header_cells %>%
inner_join(processed) %>%
mutate(na = is.na(value)) %>%
group_by(colgroup) %>%
filter(!na | ((cumsum(!na) != sum(!na))), row != subcol_start_row) %>%
ungroup() %>%
select(-match_pred, -na)
}
sub_cols <- sub_col_by_header(tbl, c("BID", "OFFER"))
sub_cols
## Source: local data frame [280 x 6]
##
## subcol_start_row column colname colgroup row value
## (int) (int) (chr) (int) (int) (chr)
## 1 26 7 BID 52 27 1.94
## 2 26 7 BID 52 28 1.96
## 3 26 7 BID 52 29 2.3450000000000002
## 4 26 7 BID 52 30 2.5477500000000002
## 5 26 7 BID 52 31 2.4712857142857145
## 6 26 7 BID 52 32 3.3110000000000008
## 7 26 7 BID 52 33 2.5507499999999999
## 8 37 7 BID 55 38 2.3763773146878147
## 9 37 7 BID 55 39 2.3976345963681185
## 10 37 7 BID 55 40 2.5518909635361497
## .. ... ... ... ... ... ...
Much like above, we can represent this graphically, so we can (for example) check we grabbed everything:
plot_sub_cols <- function(cols, chars = 4) {
cols %>%
ggplot(aes(column, -row, fill = colname)) +
geom_tile() +
geom_text(aes(label = stringr::str_sub(value, 1, chars))) +
theme_void()
}
plot_sub_cols(sub_cols)
One can imagine ways of taking these shapes and finding the “header” areas above the upper left of each of them.