Introduction

This document is published in the public domain on my RPubs account.

This is a demonstration of using the R packages pdftools and stringr to scrape data from PDF files. I produced this example after reading a Revolutions blog post describing an interesting application in which the locations of medical dispensaries were scraped from a PDF on the internet and used to plot points on a map.

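I have seen good examples of where we could use PDF scraping tools to extract data from PDF files for use in scripts. In the past, I have had to do this manually (from contract documentation) in order to generate the KKS code look-up tables which we use in PFA analysis. Ideally, I would automate this manual process so that I:

  * Do not make mistakes.
  * Perform the data extraction more quickly.
  * Can re-use analyses on similar PDF documents.
  * Can record explicitly what I did in an analysis (i.e. enhance reproducibility).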

In this example, I will demonstrate how to extract KKS codes which are tabulated in a PDF document. Siemens has published a document, publicly available on the web, which I use regularly as a reference for the KKS code system. It can be downloaded from the URL used in the code below. My objective is to write an R script that extracts the table on page 9 and creates a look-up table. The process is:

  1. Download the document directly from the web.
  2. Read the entire contents.
  3. Select page 9 which contains the table of interest.
  4. Parse the text and extract the lines containing useful information.
  5. Process the text into a look-up table.
  6. Demonstrate usage.

Download and Read File

The file is downloaded and then read using the pdftools package. pdf_text() returns a character vector with one element per page, so to access page 9 I simply index the ninth element.

library(pdftools)
my_url <- paste0("http://diskuse.elektrika.cz/index.php?action=dlattach;", 
                 "topic=18673.0;attach=10786")
my_file <- "KKS-Indentification-System.pdf"
download.file(my_url, my_file, mode = "wb")  # binary mode keeps the PDF intact on Windows
trying URL 'http://diskuse.elektrika.cz/index.php?action=dlattach;topic=18673.0;attach=10786'
Content type 'application/octet-stream' length 280693 bytes (274 KB)
==================================================
downloaded 274 KB
my_page <- pdf_text(my_file)[9]
print(my_page)
[1] "KKS Identification\n    Power Plants\n                   System for\n                                                    Function Key\n     Function Key, Main Groups\nA    Grid and distribution systems\nB    Power transmission and auxiliary power supply\nC    Instrumentation and control equipment\nE    Fuel supply and residues disposal\nG    Water supply and disposal\nH    Heat generation\nL    Steam, water, gas cycles\nM    Main machine sets\nN    Process energy/fluid supply for external users\n     (e.g. district heating)\nP    Cooling water systems\nQ    Auxiliary systems\nR    Gas generation and treatment\nS    Ancillary systems\nU    Structures\nW    Renewable energy plants\nX    Heavy machinery (not main machine sets)\n     (e.g. emergency diesel and generator sets)\n                              2\n"

Parse and Process Text

The above code has given access to the raw text on the page. Note that \n is a special character that denotes a newline. It is now necessary to extract the useful text from the table and arrange it appropriately into a look-up table. I use some text processing tools from the stringr package to extract the relevant text. NB: This could be done with base R but I prefer the syntax available from the stringr package.

library(stringr)
my_table <- my_page %>% 
    str_split("\n") %>% 
    .[[1]] %>% 
    str_subset("^[A-Z]\\s")
print(my_table)
 [1] "A    Grid and distribution systems"                 
 [2] "B    Power transmission and auxiliary power supply" 
 [3] "C    Instrumentation and control equipment"         
 [4] "E    Fuel supply and residues disposal"             
 [5] "G    Water supply and disposal"                     
 [6] "H    Heat generation"                               
 [7] "L    Steam, water, gas cycles"                      
 [8] "M    Main machine sets"                             
 [9] "N    Process energy/fluid supply for external users"
[10] "P    Cooling water systems"                         
[11] "Q    Auxiliary systems"                             
[12] "R    Gas generation and treatment"                  
[13] "S    Ancillary systems"                             
[14] "U    Structures"                                    
[15] "W    Renewable energy plants"                       
[16] "X    Heavy machinery (not main machine sets)"       

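The processing steps are:

  * Split the text by new lines.
  * Extract the lines which begin with a single capital letter, A-Z, followed by a whitespace character, for which I use a regular expression.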
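
As a rough cross-check of the split-and-filter steps, here is a minimal sketch in base R (no stringr), applied to a shortened, hand-typed sample of the page text. The names sample_page, sample_lines and sample_table are assumptions made for illustration and are not part of the original script.

```r
# A shortened stand-in for the text returned by pdf_text(); illustrative only.
sample_page <- paste0(
  "     Function Key, Main Groups\n",
  "A    Grid and distribution systems\n",
  "N    Process energy/fluid supply for external users\n",
  "     (e.g. district heating)\n",
  "P    Cooling water systems\n",
  "                              2\n"
)

# Step 1: split the single string into one element per line.
sample_lines <- strsplit(sample_page, "\n", fixed = TRUE)[[1]]

# Step 2: keep lines that start with one capital letter followed by whitespace.
sample_table <- sample_lines[grepl("^[A-Z]\\s", sample_lines)]

print(sample_table)
```

Note that the indented continuation line "(e.g. district heating)" does not match the pattern and is dropped, which is also what happens to it in the stringr output above.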

Make Lookup Table

So far, the relevant lines on the page have been read into a character vector in which each element holds a KKS code and its description as a single item of text. However, our task is to be able to ‘look up’ a KKS code and return a description. We achieve this by splitting each element of the character vector into a ‘key’ and ‘value’ pair. The key will be the KKS code and the value will be the description.

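library(dplyr)
my_lookup <- my_table %>%
    str_split_fixed("\\s{2,}", n = 2) %>% 
    as.data.frame(stringsAsFactors = FALSE) %>%  # matrix columns become V1, V2
    as_tibble() %>%                              # replaces the deprecated as_data_frame()
    rename(Code = V1, Description = V2)
print(my_lookup)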

I have achieved this simply by splitting each line at the first run of two or more whitespace characters, which separates the code from its description.
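
To make the "first wide gap only" behaviour concrete, here is a small base R sketch of the same idea. It is not the stringr call used above; the sample_table vector is hand-typed, and its third entry is a hypothetical line with a second double space, added only to show that the description is not split again.

```r
# Hand-typed sample lines; the third one is hypothetical.
sample_table <- c(
  "A    Grid and distribution systems",
  "G    Water supply and disposal",
  "X    Heavy machinery  (spare)"
)

# Locate the first run of two or more whitespace characters in each line.
gap <- regexpr("\\s{2,}", sample_table)
gap_end <- gap + attr(gap, "match.length")

# Everything before the gap is the key, everything after it is the value.
sample_lookup <- data.frame(
  Code = substr(sample_table, 1, gap - 1),
  Description = substr(sample_table, gap_end, nchar(sample_table)),
  stringsAsFactors = FALSE
)

print(sample_lookup)
```

Single spaces inside a description are left alone because the pattern requires at least two whitespace characters.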

Demonstration

Now we can look up codes programmatically, for example:

  * Look up the description of KKS code A.
  * Look up the descriptions of KKS codes E, F and G.

my_lookup %>% right_join(tibble(Code = 'A'))
Joining, by = "Code"
my_lookup %>% right_join(tibble(Code = c('E', 'F', 'G')))
Joining, by = "Code"

If a code does not exist in the table (F in this example), an NA is gracefully returned as its description.
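
The same NA behaviour can be seen without a join. The sketch below is an alternative, not the method used above: it stores a hand-typed subset of the table as a named character vector (the names kks, wanted and found are illustrative) and looks codes up by name.

```r
# Hand-typed subset of the look-up table; illustrative only.
kks <- c(
  A = "Grid and distribution systems",
  E = "Fuel supply and residues disposal",
  G = "Water supply and disposal"
)

# Codes to look up; "F" is deliberately absent from the table.
wanted <- c("E", "F", "G")

# Indexing by name returns NA for any name that is not present.
found <- unname(kks[wanted])

print(data.frame(Code = wanted, Description = found, stringsAsFactors = FALSE))
```

A named vector is convenient for quick one-off look-ups, while the join keeps everything in a table and scales more naturally to several columns.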

Conclusions

It would be trivial to continue reading pages in this fashion. There are very few lines of code required to extract, parse and process the data. I will explore more extensive usage of PDF scraping as opportunities arise.
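
As a sketch of what reading further pages might look like, the steps can be wrapped in a function and applied to each page in turn. This assumes that other pages share the layout of page 9, which I have not checked; parse_kks_page is a hypothetical helper, and the two short strings in pages are made-up stand-ins for elements of the vector returned by pdf_text().

```r
# Hypothetical helper: turn one page of text into a Code/Description table.
parse_kks_page <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  # Keep rows that start with one capital letter followed by whitespace.
  rows <- lines[grepl("^[A-Z]\\s", lines)]
  # Split each row at the first run of two or more whitespace characters.
  gap <- regexpr("\\s{2,}", rows)
  data.frame(
    Code = substr(rows, 1, gap - 1),
    Description = substr(rows, gap + attr(gap, "match.length"), nchar(rows)),
    stringsAsFactors = FALSE
  )
}

# Two made-up pages standing in for elements of pdf_text(my_file).
pages <- c(
  "Heading\nA    Grid and distribution systems\n   1\n",
  "Heading\nB    Power transmission and auxiliary power supply\n   2\n"
)

# Parse every page and stack the results into one table.
all_codes <- do.call(rbind, lapply(pages, parse_kks_page))
print(all_codes)
```

Pages with a different layout (for example multi-letter codes or wrapped descriptions) would need their own pattern.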
