MARC for Programming Historians

What is MARC?

A MARC record is a MAchine Readable Cataloging record, and it uses a standard method of identifying information in a library catalog through tags and indicators. Historians who wish to perform an analysis of books published by a specific publisher, books published in a particular city, books published on a given topic, or other analyses using bibliographic data can all benefit from pulling MARC data into R for further analysis. This tutorial will show you how to do it, step-by-step.

Where does MARC come from?

Many library catalogs will provide an option for downloading MARC records, although you may have to manually select the records for download, unless you find a provider who will allow you to download all of the results from a specific search. Below are a few places to get started looking for MARC data.

George Mason University Libraries Catalog http://magik.gmu.edu/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First

Harvard Library Bibliographic Data Set http://openmetadata.lib.harvard.edu/bibdata

Library of Congress Online Catalog http://catalog.loc.gov (Information on downloading options available at http://catalog.loc.gov/vwebv/ui/en_US/htdocs/help/faqs.html#14 .)

MARC Records from Scriblio 2007 https://archive.org/details/marc_records_scriblio_net

How do I read MARC?

A number of guides and tutorials are available for MARC records. A good overview, available through the Library of Congress is What Is a MARC Record, And Why Is It Important? (see http://www.loc.gov/marc/uma/pt1-7.html#pt4 ).

While you are getting used to MARC, it can be helpful to view some records through the library catalog where you intend to get your data; this should allow easy switching between a human-readable catalog record and the machine-readable catalog record. Depending on your historical question, you will want to get acquainted with specific MARC tags and their associated indicators and subfields.

For MARC records, tags are three digit numbers that represent the field for the data. Some examples are:

020 International Standard Book Number – (ISBN);
100 Main entry – Personal name – (primary author);
260 Publication, distribution, etc. (Imprint);
600 Subject added entry – Personal name; and
650 Subject added entry – Topical term.

These tags are grouped into the following categories:

0XX code and number fields
1XX main entry fields
2XX title and edition fields
3XX physical description fields
4XX series fields
5XX notes fields
6XX subject fields
7XX added entry and linking fields
8XX series added entry and holdings data fields
9XX local data fields

[For more detailed information, see: Taylor, Arlene G. and Joudrey, Daniel N. The Organization of Information. 3rd edition, Westport, Conn: Libraries Unlimited. 2009, page 135.]

Up to two indicators—comprised of a single digit number—may be defined for any given field. For example, field 600, which catalogs subject headings comprised of personal names, contains the following indicators:

Indicator 1: Type of personal name entry element

  0 -- Forename  
  1 -- Surname (this is the most common form)  
  3 -- Family name

Indicator 2: Subject heading system/thesaurus (identifies the specific list or file which was used)

 0 -- Library of Congress Subject Headings  
 1 -- LC subject headings for children's literature  
 2 -- Medical Subject Headings  
 3 -- National Agricultural Library subject authority file  
 4 -- Source not specified  
 5 -- Canadian Subject Headings  
 6 -- Répertoire de vedettes-matière  
 7 -- Source specified in subfield $2

The next part of the MARC record is the delimited subfield which contains the information. Most frequently used subfields for the 260 field, for example, include:

    $a -- Place of publication, distribution, etc.
    $b -- Name of publisher, distributor, etc.
    $c -- Date of publication, distribution, etc.

As you find your way around in MARC records, this list of common MARC fields may be helpful, http://www.loc.gov/marc/umb/um07to10.html . If you need help understanding something you find in a record that is not part of the list of common fields, more extensive documentation is available at: http://www.loc.gov/marc/marcdocz.html . For a good overview, see Taylor and Joudrey, referenced above.

Converting MARC

For many programming techniques used by historians, it is helpful to convert MARC from the native format into a .csv file, so that the comma delimited fields can be manipulated in the programming environment. For this tutorial, the method of converting is to use a script written in python.

Getting, Installing, & Running marc2csv.py

marc2csv.py is a script written in python which converts downloaded MARC record data into .csv format, which then can be pulled into R for analysis.

You can get the script at: https://github.com/jindrichmynarz/MARC2CSV and should view the README.md file for the necessary python libraries to install as well. Use the command line to install the python libraries, and to clone the files from github.

$ sudo easy_install chardet

$ sudo easy_install pymarc

$ git clone https://github.com/jindrichmynarz/MARC2CSV.git

What directory should you clone the files to? Make sure that the directory is easy to access from the directory where you will hold your MARC data files that need converting. For example, I placed MARC2CSV in a folder for my programming work, and placed a data folder as a subdirectory within MARC2CSV to hold my MARC data files and the converted .csv files. If you would like to download sample data to follow along, you can clone the repository from GitHub at: https://github.com/amo9055/MARCdata.git .

If your data file is in a directory in MARC2CSV, your code to convert the file will look like this:

$ ../marc2csv.py records

Once you have your MARC records converted into a .csv format, and you have identified which fields you are interested in for your analysis, the real fun can begin!

Playing with Data

But first: the .csv file that comes out of the conversion may not have a header row to identify the columns of data. Before you pull your file into R, it is best to add that information in. In my text editor I added a new row at the top and typed:

Item, Field, x1, x2, x3, x4, x5, Value

Depending on where you get your MARC data, it may or may not have the same structure. The sample data came from Library of Congress, and is the result of a subject search for “ventilation”. I did not know exactly which sub-fields or delimiters were assigned to x1-x5 until I got familiar with the data in an easier-to-read format (after I pulled it into R). So, at this point, I just kept the names generic.

Load the .csv file into R so that you can access your data.

marcdata <- read.csv(file = "records.csv", sep = ",", stringsAsFactors = FALSE)

Load any libraries you need for data analysis.

library(mullenMisc)
library(ggplot2)
library(stringr)
library(dplyr)

In this example, I am interested in plotting the publication years associated with records that have the subject heading “Ventilation”. Since the records I downloaded from LOC all have “Ventilation” in the subject heading, I just need to limit my MARC data set to find publication years, which is found in MARC field 260$c.

limited_marcdata <- marcdata %>%
  select(Item,
         Field,
         x4,
         Value) %>%
  filter(Field %in% c("260")) %>%
  filter(x4 == "c")

tbl_df(limited_marcdata)

## Source: local data frame [1,387 x 4]
## 
##        Item Field x4   Value
## 1   8746022   260  c [c1929-
## 2   7764638   260  c   n.d.]
## 3  12089690   260  c    18 ]
## 4  10213014   260  c  [n.d.]
## 5   6810562   260  c   1743.
## 6   9649481   260  c   1756.
## 7   6342357   260  c   1758.
## 8   7344718   260  c   1794.
## 9   8218438   260  c      18
## 10  3464329   260  c  1810?]
## ..      ...   ... ..     ...

This data is somewhat problematic if we want to plot the publication year, so we need to clean that up.

limited_marcdata <- limited_marcdata %>%
  mutate(year = str_extract(Value, "\\d{4}"))

tbl_df(limited_marcdata)

## Source: local data frame [1,387 x 5]
## 
##        Item Field x4   Value year
## 1   8746022   260  c [c1929- 1929
## 2   7764638   260  c   n.d.]   NA
## 3  12089690   260  c    18 ]   NA
## 4  10213014   260  c  [n.d.]   NA
## 5   6810562   260  c   1743. 1743
## 6   9649481   260  c   1756. 1756
## 7   6342357   260  c   1758. 1758
## 8   7344718   260  c   1794. 1794
## 9   8218438   260  c      18   NA
## 10  3464329   260  c  1810?] 1810
## ..      ...   ... ..     ...  ...

Next, let’s plot the data.

ggplot(data = limited_marcdata,
       aes(x = year))+
  geom_histogram()+ 
  theme(axis.text.x=element_text(angle=90, hjust=1))+
  ggtitle("Number of Books with Subject Heading 'Ventilation' in LOC by Publication Year")

## Warning: position_stack requires constant width: output may be incorrect

We can fine-tune the code to get a better looking plot, but already we can see some trends in the publication dates of books about ventilation.

The combination of these tools allows us to analyze bibliographic data in new and interesting ways.

This tutorial created as part of the course “Programming for Historians,” taught by Lincoln Mullen at George Maosn University, Fall, 2014. Questions, comments, or feedback? Please contact aoconnor@gmu.edu.