Project Overview

For this project, I’m attempting to classify documents from the Library of Congress’s Chronicling America initiative according to whether or not they contain specific financial data. I have several goals for this project:

  1. Classify pages (and thus newspapers) according to whether or not they contain specific financial content
  2. Familiarize myself with this data set in order to facilitate future work; there is a lot of interesting work that could be done here
  3. Improve my text-mining skills

All of the work associated with this project can be found in my GitHub repository, including some sample data. The output itself can be found on RPubs. Given the size and location of the data set that I used, I have decided to provide tools that allow any user to collect the data from the source themselves. The Library of Congress seems committed to maintaining the availability of this data, so I don’t have any concerns here regarding reproducibility.

About the Data

Chronicling America currently has about 13.3 million pages archived and seems to be continually expanding its data set. The data is available in a few formats:

  • Images, which can be downloaded as hi-res .jp2 files.
  • Text. This was created using OCR and thus is far from perfect, especially for older or messier newspapers.
  • XML. The XML appears to be an enriched version of the OCR output, containing word coordinates within the page and OCR confidence scores. This is probably beyond the scope of this project, but it’s something I hope to work with in the future to automate the extraction of data.

I estimate that it takes me about 10 seconds to manually load and visually scan a page for content of interest, so scanning all 13.3 million pages by hand would take roughly 1,500 days (working 24/7). I also roughly estimate that only about 0.5% of pages contain content that is relevant in the context of this project.

Here are a few examples of the kind of data that we’re looking for, showing both the images and the corresponding OCR output.

Daily price quotes for various futures markets:

Example 1

Image

Daily Futures Quotes - Circa Dec 29, 1883

OCR:

COMMISSION MERCHANTS,
Boom 4, Mann helper Building, Southeast comer
Third and Minnesota street?. Direct wire to
Chicago and Milwaukee Boards of Trad*.
(Operator is our office.)
Bt. Paul, Friday, Dec. 23, 1833.
Following is to-day's rango of prices on the
Milwaukee and Chicago boards:
???""">" ! I I I ' fit
1 w a I- s S s
I a ? ? f 111
: ? : . : c 3 :
Milwaukee, j
Wheat- "
January.... 95 i 94% S5Ji Si% 94%' 92%
February... 96% 95& Oo^'i 95% S5& 98%
May 1C3% 103% 103% 108% ....
Chicago,
Wheat-
January.... S6>i 95%' 96% Sssi 36 92%
February.. 97% 98% 91% S6s£ 96% 93 %
March ;
May 1G4% 1C8& 10*34 1«3% ICS^ »5f

Example 2

Image

Hourly price quotes for various futures markets:

Hourly Futures Quotes - Circa Dec 29, 1883

OCR:

The following quotations, glrti tt the r>n^« of
the 2&rkets durissthe day, wars nesi - by XL.
Doran, Co2tj»ion Merchant:
. WHEAT.
Jan. May. Jaa. Kay.
9-10 A.M. 84% 10SH SSK- 103Ji
9^5 - t4si lOS-^. 9JJi 108H
10:00 " 95 103 V toft lUSK
10a5 " 85 IC3S B*£ 104 V
10*0 M 95 X 108% 96 IMU
LQAi " 85# 113 % 91)? 104 %
11*0 ??? 95 10J% H¥ 10: V
11:15 " . 95 l«»i 9»« 104
11*) " 95 10»»i »5% 10SK
HHa " .84% 108* 95% 199%
13.-0O " 95 108% 86 104
12:15 " »5 ~?????? 103% 96^ 104^
11:10 M t*% 1C3% 96% K-4 j
1345 " 84« 103%- C-6 10S5<
IsOQ " 84% ' l'J3% 86 ltSXj

As can be seen from the above, the quality of the OCR leaves something to be desired. My hope is that, even where the OCR output is incorrect or totally illegible to humans, there are still enough similarities between financial pages for them to be grouped together, and for them to remain distinct from non-financial pages.

Collect the Data Dictionary

Chronicling America provides a well-organized data dictionary and bulk-downloading facility. The first thing we’ll do is get a list of files to download.
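As a rough illustration of this step, here is a minimal sketch. It assumes the bulk OCR listing is exposed as JSON at http://chroniclingamerica.loc.gov/ocr.json with url and size fields; check the site’s documentation for the current format.

```r
# Sketch only: the /ocr.json endpoint and its field names are assumptions.
library(jsonlite)

get_ocr_list <- function(base = "http://chroniclingamerica.loc.gov") {
  listing <- fromJSON(paste0(base, "/ocr.json"))
  data.frame(ocr.url  = listing$ocr$url,
             ocr.size = listing$ocr$size,
             stringsAsFactors = FALSE)
}

ocr.files <- get_ocr_list()
head(ocr.files)
```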

ocr.url ocr.size
http://chroniclingamerica.loc.gov/data/word_coordinates/ocr/batch_whi_lucretia_ver01.tar.bz2 1314552281
http://chroniclingamerica.loc.gov/data/word_coordinates/ocr/batch_whi_inez_ver01.tar.bz2 1524682389
http://chroniclingamerica.loc.gov/data/word_coordinates/ocr/batch_wvu_jim_ver01.tar.bz2 1024492998
http://chroniclingamerica.loc.gov/data/word_coordinates/ocr/batch_wvu_iconia_ver01.tar.bz2 975727516
http://chroniclingamerica.loc.gov/data/word_coordinates/ocr/batch_wvu_horta_ver01.tar.bz2 990614566

Download the Data

The list above looks usable, so now we can proceed to downloading some data. Note that some of these files are pretty large (around 1 GB compressed), so I’ve included a “test” mode in my download function that grabs only the five smallest files. Data is downloaded to the current working directory unless otherwise specified.
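A minimal sketch of what such a download function might look like, reusing the (assumed) ocr.files data frame from the previous sketch:

```r
# "test" mode grabs only the five smallest archives; files land in the
# destination directory (the current working directory by default).
download_ocr <- function(files, dest = getwd(), test = TRUE) {
  if (test) files <- files[order(files$ocr.size), ][1:5, ]
  for (u in files$ocr.url) {
    destfile <- file.path(dest, basename(u))
    if (!file.exists(destfile)) download.file(u, destfile, mode = "wb")
  }
  invisible(files)
}

download_ocr(ocr.files, test = TRUE)
```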

Uncompressing

Once the data is downloaded, it needs to be manually decompressed from the .bz2 format. While R seems to have good support for bz2, I couldn’t get it to work seamlessly with .tar.bz2 archives that contain a large directory tree.

Once the archives are decompressed, we can untar them as below. Given some of the file sizes this can take a while to run, so I’ve included a sample archive on GitHub. Note that these files contain both .txt and .xml, but no images.
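A sketch of the untar step; the file pattern and the “test” output directory are assumptions based on the tree shown in the next section.

```r
# Unpack every .tar archive left over after the manual bunzip2 step.
tars <- list.files(pattern = "\\.tar$", full.names = TRUE)
for (tf in tars) untar(tf, exdir = "test")
```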

Inspect

Let’s take a look at the file structure contained within the compressed files.
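The schematic below was presumably produced with something like the data.tree package. Here is one way such a view can be generated from the extracted directories; a sketch, not necessarily the original code.

```r
library(data.tree)

# Build a tree from the directory paths and print its level structure.
dirs <- list.dirs("test", recursive = TRUE)
tree <- as.Node(data.frame(pathString = dirs, stringsAsFactors = FALSE))
print(tree)
```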

##                      levelName
## 1 test                        
## 2  °--paper_name              
## 3      °--year                
## 4          °--month           
## 5              °--day_of_month
## 6                  °--edition 
## 7                      °--page

We see that, basically, every paper, year, month, day, edition and page gets its own folder. This means that we have to crawl these paths to get to all the actual files. I’m tempted to sidestep this file structure by saving all the files in a single directory and encoding the file names with the same information the directory tree provides… but we won’t do that right now.

Get the Files

We’ll get a list of all the files within the tree structure.
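A recursive list.files() call is enough here; the ocr.names object matches the output shown below.

```r
# Crawl the extracted tree and collect every OCR file it contains.
ocr.names <- list.files("test", recursive = TRUE, full.names = TRUE)
head(ocr.names)
```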

ocr.names
test/cat/sn83025287/1883/01/01/ed-1/seq-1/ocr.txt
test/cat/sn83025287/1883/01/01/ed-1/seq-1/ocr.xml
test/cat/sn83025287/1883/01/01/ed-1/seq-2/ocr.txt
test/cat/sn83025287/1883/01/01/ed-1/seq-2/ocr.xml
test/cat/sn83025287/1883/01/01/ed-1/seq-3/ocr.txt
test/cat/sn83025287/1883/01/01/ed-1/seq-3/ocr.xml

Transform

Now we’ll transform the data. Essentially, I’m going to chop up the file paths and put that information into a data frame to be used as metadata in my corpus later on. We’ll concatenate the year, month and day into a proper date, add a column containing the day of the week (potentially useful in finance) and a column flagging whether the document contains financial data or not.

This is also a logical point at which to separate the .txt and .xml files.
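A sketch of this transformation, using illustrative object names (ocr.names from above; the resulting ocr.meta frame mirrors the table below):

```r
# Keep the .txt files for the corpus; set the .xml files aside for now.
txt.files <- ocr.names[grepl("\\.txt$", ocr.names)]
xml.files <- ocr.names[grepl("\\.xml$", ocr.names)]

# Split each path into its components and rebuild them as metadata columns.
parts <- do.call(rbind, strsplit(txt.files, "/"))
ocr.meta <- data.frame(
  doc_id     = txt.files,
  rootfolder = parts[, 1],
  cat        = parts[, 2],
  id         = parts[, 3],
  YYYY       = parts[, 4],
  mm         = parts[, 5],
  dd         = parts[, 6],
  edition    = gsub("ed-",  "", parts[, 7]),
  pageNum    = gsub("seq-", "", parts[, 8]),
  fname      = parts[, 9],
  stringsAsFactors = FALSE
)
ocr.meta$date    <- as.Date(paste(ocr.meta$YYYY, ocr.meta$mm, ocr.meta$dd, sep = "-"))
ocr.meta$wDay    <- weekdays(ocr.meta$date, abbreviate = TRUE)
ocr.meta$hasData <- 0
```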

| doc_id | rootfolder | cat | id | YYYY | mm | dd | edition | pageNum | fname | date | wDay | hasData | text |
|--------|------------|-----|----|------|----|----|---------|---------|-------|------|------|---------|------|
| test/cat/sn83025287/1883/01/01/ed-1/seq-1/ocr.txt | test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 1 | ocr.txt | 1883-01-01 | Mon | 0 | |
| test/cat/sn83025287/1883/01/01/ed-1/seq-2/ocr.txt | test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 2 | ocr.txt | 1883-01-01 | Mon | 0 | |
| test/cat/sn83025287/1883/01/01/ed-1/seq-3/ocr.txt | test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 3 | ocr.txt | 1883-01-01 | Mon | 0 | |
| test/cat/sn83025287/1883/01/01/ed-1/seq-4/ocr.txt | test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 4 | ocr.txt | 1883-01-01 | Mon | 0 | |
| test/cat/sn83025287/1883/01/01/ed-1/seq-5/ocr.txt | test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 5 | ocr.txt | 1883-01-01 | Mon | 0 | |

Build a Corpus

Next we’ll take the dataframe that we’ve just created and add the actual text content from the files. Once again, this can take a while.
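A sketch of the corpus build, assuming the ocr.meta frame from the previous sketch and tm >= 0.7, whose DataframeSource() expects doc_id and text as the first two columns:

```r
library(tm)

# Read each page's OCR text into a single string.
ocr.meta$text <- vapply(ocr.meta$doc_id,
                        function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                        character(1))

# doc_id and text must come first; the remaining columns become document metadata.
cols <- c("doc_id", "text", setdiff(names(ocr.meta), c("doc_id", "text")))
ocr.corpus <- VCorpus(DataframeSource(ocr.meta[, cols]))
```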

| rootfolder | cat | id | YYYY | mm | dd | edition | pageNum | fname | date | wDay | hasData | doc_id |
|------------|-----|----|------|----|----|---------|---------|-------|------|------|---------|--------|
| test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 1 | ocr.txt | 1883-01-01 | Mon | 0 | test/cat/sn83025287/1883/01/01/ed-1/seq-1/ocr.txt |
| test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 2 | ocr.txt | 1883-01-01 | Mon | 0 | test/cat/sn83025287/1883/01/01/ed-1/seq-2/ocr.txt |
| test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 3 | ocr.txt | 1883-01-01 | Mon | 0 | test/cat/sn83025287/1883/01/01/ed-1/seq-3/ocr.txt |
| test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 4 | ocr.txt | 1883-01-01 | Mon | 0 | test/cat/sn83025287/1883/01/01/ed-1/seq-4/ocr.txt |
| test | cat | sn83025287 | 1883 | 01 | 01 | 1 | 5 | ocr.txt | 1883-01-01 | Mon | 0 | test/cat/sn83025287/1883/01/01/ed-1/seq-5/ocr.txt |

Bag O’ Words… (or something like it)

Now we’re going to employ a “bag of words”-style method to pare down the data for manual classification.

One of the problems that I face is that I need a sample of classified documents in order to train the SVM process that I’m planning on applying here. My idea is to short-list the docs using a bag-of-words approach and then manually classify.

Cursory inspection tells me that certain words are more common within target pages than non-target pages. What I do here is remove everything but those relevant words and work with that (significantly reduced) output. This speeds things up a lot.

Now I subset all the documents that contain an abundance of my “keep” words. All I did here was make a simple rule that picks out documents containing more than 75 occurrences of those words. In hindsight, I probably should have used a proportion rather than an absolute count, but I suspect that the number of words on a newspaper page is relatively constant for both business and practical reasons, so a hard limit probably isn’t all that dangerous.
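A sketch of the short-listing step. The keep list here is purely illustrative (the real list isn’t reproduced in this writeup), and object names carry over from the sketches above.

```r
# Count only the "keep" terms in each document via a dictionary-restricted DTM.
keep.words <- c("wheat", "corn", "oats", "pork", "lard",
                "january", "february", "march", "may", "opening", "closing")

dtm.keep <- DocumentTermMatrix(
  ocr.corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, dictionary = keep.words)
)

# Score each document and keep those above the 75-occurrence threshold.
keep.count <- slam::row_sums(dtm.keep)
shortlist  <- ocr.meta[keep.count > 75, "doc_id", drop = FALSE]
```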

doc_id
55 test/cat/sn83025287/1883/01/06/ed-1/seq-7/ocr.txt
79 test/cat/sn83025287/1883/01/09/ed-1/seq-7/ocr.txt
87 test/cat/sn83025287/1883/01/10/ed-1/seq-7/ocr.txt
94 test/cat/sn83025287/1883/01/11/ed-1/seq-6/ocr.txt
110 test/cat/sn83025287/1883/01/13/ed-1/seq-6/ocr.txt

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see from the histogram that most documents contain fewer than 25 occurrences of the “keep” words, but that there is a bit of a jump around the 75-count mark. To be clear, my intention with this “bag of words” approach is simply to identify documents that have a higher probability of containing financial data, for manual classification, so I’m not overly concerned about misclassifications at this stage.

Get Images

So we’ve identified a handful of files that contain relevant words. Next we’re going to download images (.pdf files) so that we can manually inspect and classify them. Once again, the .pdfs are about 5 MB each, so this takes a while.

We basically parse the file names to reconstruct the image URL at Chronicling America.
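A sketch of the URL reconstruction. The /lccn/{id}/{date}/ed-{n}/seq-{n}.pdf pattern reflects my reading of the site’s layout and should be verified against a known page before use.

```r
# Download the page image (PDF) for one short-listed document.
get_page_pdf <- function(id, date, edition, pageNum,
                         base = "http://chroniclingamerica.loc.gov") {
  url      <- sprintf("%s/lccn/%s/%s/ed-%s/seq-%s.pdf",
                      base, id, format(as.Date(date)), edition, pageNum)
  destfile <- sprintf("%s_%s_seq-%s.pdf", id, date, pageNum)
  download.file(url, destfile, mode = "wb")
  invisible(destfile)
}
```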

Add Classification Info

So I’ve manually classified about 220 documents and that information can now be added to the corpus. We’ll use the “hasData” meta-tag that we previously created when transforming the data.

For the manual classification, rather than classifying every document as either “TRUE” or “FALSE”, I’ve created a list that contains only the documents for which my condition is “TRUE”. On a date-by-date basis this is valid, as newspapers (particularly of this vintage) generally confine futures quotes to a single page per issue.

We go through my manually classified items: a simple text file containing the paper id, date and page number. We identify the index of each id + date + pageNum combination in the corpus and then flip that flag from FALSE to TRUE.
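A sketch of that tagging step, assuming a hypothetical classified.txt with id, date and pageNum columns:

```r
classified <- read.table("classified.txt", header = TRUE, colClasses = "character")

# Build id + date + pageNum keys and flip hasData wherever they match.
key.meta <- paste(ocr.meta$id, ocr.meta$date, ocr.meta$pageNum)
key.true <- paste(classified$id, classified$date, classified$pageNum)

ocr.meta$hasData <- key.meta %in% key.true
```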

hasData
FALSE
FALSE
FALSE
FALSE
FALSE

Build a Model

Now that we have some classified data, we can build a model. Recall that previously we used a stripped-down version of the corpus; here, we revert to the original, full corpus.

We’ll start by cleaning the original corpus. Then we’ll build a container for the SVM.
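A sketch of that setup with RTextTools; the cleaning options and the train/test split here are illustrative, not necessarily the ones actually used.

```r
library(RTextTools)

# Document-term matrix over the full corpus text.
dtm <- create_matrix(ocr.meta$text, language = "english",
                     removeNumbers = FALSE, stemWords = FALSE,
                     removeSparseTerms = 0.998)

# Container pairing the matrix with the hasData labels.
n         <- nrow(ocr.meta)
train.idx <- 1:round(0.7 * n)
test.idx  <- (round(0.7 * n) + 1):n

container <- create_container(dtm, as.factor(ocr.meta$hasData),
                              trainSize = train.idx,
                              testSize  = test.idx,
                              virgin    = FALSE)
```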

Train the Model

Next we’ll train the actual model on some data. I’ve chosen an SVM as it appears to be a reasonable approach for high-dimensional problems such as this one.
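Continuing the sketch above (object names carry over):

```r
# Fit the SVM on the training slice and score the held-out slice.
svm.model   <- train_model(container, "SVM")
svm.results <- classify_model(container, svm.model)   # SVM_LABEL and SVM_PROB per document
```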

Inspect the Model

Now we examine the results of the model. Below is a table of the predicted values for every observation whose real label is TRUE. I think this is a good way to show the data, since real TRUE observations are few relative to the whole sample.

We also look at the accuracy using the “recall_accuracy()” function. In this case, I think the function provides a misleading result: since I expect only about 0.5% of documents to contain the information I’m interested in, and all of my documents were pre-coded as FALSE, we’d expect a high score here regardless of how well the model performs. I’m much more concerned with the proportion of TRUE documents that the model was able to identify as such.
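A sketch of both views of performance, reusing names from the earlier sketches (not necessarily how the numbers below were produced):

```r
pred <- as.character(svm.results$SVM_LABEL) == "TRUE"
real <- ocr.meta$hasData[test.idx]

recall_accuracy(real, pred)   # overall accuracy, flattered by the many FALSE pages
mean(pred[real])              # share of the real TRUE pages the model actually found
```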

|     | SVM_LABEL | SVM_PROB  | real |
|-----|-----------|-----------|------|
| 11  | FALSE     | 0.5709651 | TRUE |
| 27  | TRUE      | 0.9740367 | TRUE |
| 42  | TRUE      | 0.5881983 | TRUE |
| 80  | TRUE      | 0.9266203 | TRUE |
| 146 | TRUE      | 0.7238994 | TRUE |
| 148 | TRUE      | 0.7015101 | TRUE |
| 151 | TRUE      | 0.8944307 | TRUE |
| 157 | TRUE      | 0.9719772 | TRUE |
| 158 | TRUE      | 0.9558798 | TRUE |
| 175 | TRUE      | 0.9724161 | TRUE |
| 209 | TRUE      | 0.7680566 | TRUE |
| 233 | TRUE      | 0.9155681 | TRUE |
| 238 | TRUE      | 0.9041032 | TRUE |
| 248 | FALSE     | 0.5607846 | TRUE |
| 250 | TRUE      | 0.9746006 | TRUE |
| 286 | TRUE      | 0.9820918 | TRUE |
| 306 | TRUE      | 0.8618724 | TRUE |
| 323 | TRUE      | 0.8272815 | TRUE |
| 365 | FALSE     | 0.5568750 | TRUE |
| 371 | TRUE      | 0.9228798 | TRUE |
## [1] 0.982
## [1] 0.88

So the model appears to be able to do a reasonable job of classifying which documents meet my criteria and which do not. I find this impressive given that the training set is relatively small and that the documents are so messy.

Out-of-Sample Attempt

Finally, we’ll apply the model that we’ve just trained to some out-of-sample data and take a quick look at the results.

We’ll start by loading up some uncategorized data and formatting it.

We’ll then create a DTM. Note that a slightly different method is required here, as the model appears to be picky about its inputs: I had to give it some information about the matrix originally used to train the model.
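A sketch of that step: create_matrix() has an originalMatrix argument for exactly this purpose, and the `## Tracing function` output below is consistent with the commonly used trace() patch to create_matrix() that this approach tends to require. Here new.meta is a hypothetical frame holding the formatted out-of-sample text.

```r
# Build the new DTM against the vocabulary of the training matrix (dtm).
dtm.new <- create_matrix(new.meta$text, language = "english",
                         removeNumbers = FALSE, stemWords = FALSE,
                         originalMatrix = dtm)
```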

## Tracing function "create_matrix" in package "RTextTools"
## [1] "create_matrix"

We then create the container and run the model.
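Sketched as follows, with virgin = TRUE since the new pages have no known labels:

```r
container.new <- create_container(dtm.new,
                                  labels   = rep(0, nrow(new.meta)),
                                  testSize = 1:nrow(new.meta),
                                  virgin   = TRUE)

out.results <- classify_model(container.new, svm.model)
head(out.results, 10)
subset(out.results, SVM_LABEL == TRUE)   # any pages flagged as containing quotes
```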

SVM_LABEL SVM_PROB
FALSE 0.9997626
FALSE 0.9999277
FALSE 0.9997707
FALSE 0.9997551
FALSE 0.9997543
FALSE 0.9997582
FALSE 0.9997544
FALSE 0.9997756
FALSE 0.9998396
FALSE 0.9997760
## [1] SVM_LABEL SVM_PROB 
## <0 rows> (or 0-length row.names)

We see that, according to the model, none of the 50 out-of-sample pages contain any data of interest. A cursory investigation of the associated image files suggests that this is reasonably, although not perfectly, accurate. Either way, my suspicion is that the model needs a larger training set, and I intend to continue working on that.

Summary

In this project, I was able to

  1. Acquire and unpack the desired data
  2. Manipulate and transform the data
  3. Significantly reduce the manual effort using a bag-of-words approach, including automated download of image files for manual inspection
  4. Manually classify, then automatically tag the data
  5. Build an SVM model, which was surprisingly accurate given the small training set and the messiness of the inputs
  6. Run the model on totally “out of sample” data.

Future Work

I intend to continue to explore this model and the data set in general. Once I have located a significant number of pages which contain the content that I am interested in, I plan to take a crack at automatic extraction of that data.