Before deciding where to store the data, I wrote a script that loads it into MongoDB. This step mattered because we wanted to confirm that the data could be stored in a suitable database infrastructure.
library(RODBC)
library(mongolite)
library(knitr)
library(kableExtra)
library(stringr)
library(dplyr)
library(tidyr)
library(scales)
library(ggplot2)
library(plotly)
library(maps)
library(mapdata)
library(ggrepel) #not using this at the moment, but it does give the option to add labels. While not useful for the
First we connect to an existing MongoDB database. We assume that the MongoDB server is already running on the machine and is ready to accept connections. Before loading the data again, we drop every collection so that the stored data is clean and contains no duplicates.
mbreast <- mongo("breast")
mdigothr <- mongo("digothr")
mmalegen <- mongo("malegen")
mfemgen <- mongo("femgent")
mother <- mongo("other")
mrespir <- mongo("respir")
mcolrect <- mongo("colrect")
mlymyleuk <- mongo("lymyleuk")
murinary <- mongo("urinary")
mbreast$drop()
mdigothr$drop()
mmalegen$drop()
mfemgen$drop()
mother$drop()
mrespir$drop()
mcolrect$drop()
mlymyleuk$drop()
murinary$drop()
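The connect-and-drop step above repeats the same two calls for every collection, so it could also be written as a loop. The following is only a sketch: `connect_and_reset()` and its `connect` argument are hypothetical names introduced here, not part of the original script, and the collection names are copied as they appear in the `mongo()` calls above. The `connect` argument defaults to `mongolite::mongo`, so a running MongoDB server is assumed unless a stub is passed in.

```r
# Collection names exactly as they appear in the mongo() calls above.
collections <- c("breast", "digothr", "malegen", "femgent", "other",
                 "respir", "colrect", "lymyleuk", "urinary")

# Hypothetical helper (not in the original script): open one handle per
# collection and empty any collection that already holds documents.
# `connect` defaults to mongolite::mongo, which connects to
# mongodb://localhost unless told otherwise.
connect_and_reset <- function(names, connect = mongolite::mongo) {
  handles <- lapply(names, function(name) {
    con <- connect(collection = name)
    # Drop only when there is something to remove.
    if (con$count() > 0) con$drop()
    con
  })
  # Return a named list so handles can be looked up by collection name.
  setNames(handles, names)
}

# Usage (requires a running MongoDB server):
# handles <- connect_and_reset(collections)
# handles$breast$count()
```

Keeping the handles in a named list means later steps can iterate over them instead of referring to nine separate variables.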
The load itself is straightforward, and the same process repeats for all of the parsed SEER cancer data: for each CSV file, we read it and insert its contents into the appropriate collection. Only the breast cancer load is shown here; the output covers all nine collections.
# Read each CSV file and insert its rows into the matching MongoDB collection
dfbreast <- read.csv("c:/SEER/output/breast.csv")
mbreast$insert(dfbreast)
## List of 5
## $ nInserted : num 1631572
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading BREAST CANCER data."
## List of 5
## $ nInserted : num 790188
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading OTHER DIGESTIVE CANCER data."
## List of 5
## $ nInserted : num 1334496
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading MALE GENITAL CANCER data."
## List of 5
## $ nInserted : num 618134
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading FEMALE GENITAL CANCER data."
## List of 5
## $ nInserted : num 1840653
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading OTHER SITES CANCER data."
## List of 5
## $ nInserted : num 1312846
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading RESPIRATORY CANCER data."
## List of 5
## $ nInserted : num 1022999
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading COLON AND RECTAL CANCER data."
## List of 5
## $ nInserted : num 808182
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading LYMPHOMA AND LEUKEMIA CANCER data."
## List of 5
## $ nInserted : num 691744
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
## [1] "Finished loading URINARY CANCER data."
## [1] "Finished MongoDB load of CSV files!"
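Since each load follows the same read-then-insert pattern, the per-file step could be wrapped in a small function that also checks the result. This is a sketch under stated assumptions, not the original script: `load_csv()` is a hypothetical helper, each CSV file is assumed to be named after its collection (confirmed by the source only for `breast.csv`), and the collection is assumed to be empty before the insert so that its document count can be compared with the number of rows read.

```r
# Hypothetical helper (not in the original script): read one CSV file,
# insert it into the collection of the same name, and confirm the count.
# Assumes the collection was dropped beforehand, so count() == nrow(df).
load_csv <- function(name, dir = "c:/SEER/output",
                     connect = mongolite::mongo) {
  # Assumption: the file is named <collection>.csv
  path <- file.path(dir, paste0(name, ".csv"))
  df <- read.csv(path)

  con <- connect(collection = name)
  con$insert(df)

  # Stop with an error if the collection does not hold every row read.
  stopifnot(con$count() == nrow(df))
  invisible(nrow(df))
}

# Usage (requires a running MongoDB server and the SEER output files):
# n <- load_csv("breast")
# print(paste("Inserted", n, "rows into breast"))
```

Comparing `count()` with `nrow()` gives the same assurance as reading `nInserted` in the output above, but fails loudly instead of relying on visual inspection.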