Data Conglomeration: SEER parts into a whole

Intro: The Surveillance, Epidemiology, and End Results (SEER) Program provides a lot of potentially useful cancer data. Unfortunately, the data formats are a bit limited at first.

Purpose: To obtain SEER data and combine the 40 provided files into just a few useful files, using R.

Requesting/Downloading the data

To begin doing anything, you’ll have to obtain data. Go to this page to request the data and sign their agreement form.

After returning the agreement form, you’ll be linked to this site. Pick ZIP or exe, whatever you prefer, as long as you download the ASCII text version of the data. You’ll have to give them your username/pw to download the data, of course.

Unpack the files, they should be something like 3GB total. You’ll probably notice that there are dozens of files, spread across several folders. This…is inconvenient. Fortunately, there’s an R package for that!

Installing SEERaBomb

Thanks to the hard work of researchers at the Cleveland Clinic Foundation, there’s an R package that will put those SEER files together. Moreover, you’ll have the data in native R format and as a database!

If you’re interested in learning more about it, check out the SEERaBomb page at your favorite CRAN mirror.

Installing the R package dependencies for SEERaBomb also requires installing a library outside of R, the Open GL Utility, “GLU”. Here are some links to install GLU:

  • For Ubuntu users: run sudo apt-get install libglu1-mesa-dev; for RedHat users, run yum install Mesa-devel.

  • If you’re using a Mac, go here for instructions.

  • For Windows (or more detailed instructions for other operating systems), see here for how to install GLU.

If you haven’t installed it yet, use this command to install other dependencies

install.packages(c("LaF",
            "RSQLite",
            "dplyr",
            "XLConnect",
            "Rcpp",
            "rgl",
            "reshape2",
            "mgcv",
            "DBI",
            "bbmle"))

install.packages("SEERaBomb")

#load the library
library(SEERaBomb)

SEERaBomb Prep Work

The command getFields parses out the SAS files to generate a dataframe (without data yet) that contains all of the SEER fields. Use it as follows:

df <- getFields(seerHome = "SEER_1973_2012_TEXTDATA/")

This next command will create a dataframe that only extracts those variables that we want. It was intended to allow the user to select variables of interest, allowing for faster and more streamlined work. I used this package with the intention of exploring the data, so I want all the variables!

df <- pickFields(sas = df, picks = df$names)

All of them!

If you want the program to run faster, probably select which variables you’d like. You can do this by substituting a subset of df$names in the picks option of the previous command.

Congratulations! You’ve successfully prepared to put the files together.

Assemble the files!

The following command will assemble the 40 separate files into four, of which the rest of this tutorial will focus on one. Personally, I ran this command and grabbed a cup of coffee.

mkSEER(df, seerHome = "SEER_1973_2012_TEXTDATA/")

Please note that none of these commands are run by this document. Its purpose, rather than executing code, is to demonstrate the code necessary to get started with SEER in R.

Citations:

[1] T. Radivoyevitch. SEERaBomb: SEER and Atomic Bomb Survivor Data Analysis Tools. R package version 2015.2. 2015. <URL: https://CRAN.R-project.org/package=SEERaBomb>.