In order to migrate our relational database, in this case the “flights” database in MySQL, we first need to establish a connection. For the purpose of this assignment, I’ve created a new account and password.
con <- dbConnect(MySQL(),
user = "school",
password = "a123456",
host = "localhost",
dbname = "flights")
tables <- dbListTables(con)
tables
## [1] "airlines" "airports" "flights" "planes" "weather"
We see that the “flights” database has five tables associated with it. We’ll need to migrate all of them. To me, the simplest way to achieve this is with the dbReadTable function, but one can also use SQL language to fetch each table utilizing dbGetQuery.
airlines <- dbReadTable(con, tables[1])
airlines
## carrier name
## 1 9E Endeavor Air Inc.\r
## 2 AA American Airlines Inc.\r
## 3 AS Alaska Airlines Inc.\r
## 4 B6 JetBlue Airways\r
## 5 DL Delta Air Lines Inc.\r
## 6 EV ExpressJet Airlines Inc.\r
## 7 F9 Frontier Airlines Inc.\r
## 8 FL AirTran Airways Corporation\r
## 9 HA Hawaiian Airlines Inc.\r
## 10 MQ Envoy Air\r
## 11 OO SkyWest Airlines Inc.\r
## 12 UA United Air Lines Inc.\r
## 13 US US Airways Inc.\r
## 14 VX Virgin America\r
## 15 WN Southwest Airlines Co.\r
## 16 YV Mesa Airlines Inc.\r
Simple. There are return characters for each airline line, but that can easily be cleaned up. None of the other tables have this quirk.
airlines$name <- str_replace(airlines$name, "\\r", "")
airports <- dbReadTable(con, tables[2])
flights <- dbReadTable(con, tables[3])
planes <- dbReadTable(con, tables[4])
weather <- dbReadTable(con, tables[5])
Now, we need to import the data into a NoSQL database. For the purpose of this assignment, I’ve chosen MongoDB. I was unable to find a way to create a new database in MongoDB using RMongo, so for the purpose of this assignment we will assume that a user has already opened their MongoDB connection using "C:\Program Files\MongoDB\Server\3.6\bin\mongod.exe" for the default, and then using Compass to connect, or working straight from the command line, has created a new database flights.
mongo <- mongoDbConnect("flights")
Using RMongo, the only way we can insert new information into MongoDB is via dbInsertDocument, however this necessitates converting the data frames into a JSON document. Fortunately, the jsonlite package has a function to do this called toJSON. However, toJSON will add brackets ([]) to the string, which we must remove before writing. This can be repetitive, so it’s easiest to do this by writing a function.
Unfortunately, simply calling dbInsertDocument on a dataframe converted into a JSON format via toJSON only seems to write the first entry before stopping. From this StackOverflow question, it seems right now the only solution is to iterate over the list.
# Function writes an entry to a MongoDB database
to.mongo <- function(conn, name, df){
df.json <- toJSON(df)
df.json <- str_replace(df.json, "\\[", "")
df.json <- str_replace(df.json, "\\]", "")
dbInsertDocument(conn, name, df.json)
}
# Function simply iterates over each row of our dataframes
mongo.mover <- function(conn, name, df){
for(i in 1:nrow(df)) {
to.mongo(conn, name, df[i, ])
}
}
mongo.mover(mongo, "airlines", airlines)
mongo.mover(mongo, "airports", airports)
mongo.mover(mongo, "flights", flights)
mongo.mover(mongo, "planes", planes)
mongo.mover(mongo, "weather", weather)
We can now do some simple tests to see if we’ve imported everything correctly.
mon.airlines <- dbGetQuery(mongo, "airlines", '{"carrier":"HA"}')
kable(mon.airlines)
| carrier | name | X_id |
|---|---|---|
| HA | Hawaiian Airlines Inc. | 5ae33f2fc968195aee286fec |
mon.airport <- dbGetQuery(mongo, "airports", '{"faa": "HNL"}')
kable(mon.airport)
| dst | tz | faa | name | alt | lon | X_id | lat |
|---|---|---|---|---|---|---|---|
| N | -10 | HNL | Honolulu Intl | 13 | -157.9224 | 5ae33f30c968195aee28722b | 21.3187 |
So what are the advantages/disadvantages of using a NoSQL database like MongoDB? One of the main advantages to NoSQL is that it is dynamic. This blog notes that NoSQL is flexible. You can:
They also note that NoSQL scales differently. “NoSQL databases are horizontally scalable…meaning that you handle more traffic by sharding, or adding more servers…it’s like adding more floors to the same building versus adding more buildings to the neighborhood”.
It seems like SQL and NoSQL play their own roles. If you know your data will be structured a certain way, or expect it to be structured, and have the time to put in the effort to pre-planning, then SQL should be great. If, however, you expect to utilize different types of data, or multiple unstructured sources of data, and are in a small team where you don’t have the time to standardize everything, then NoSQL would suit you best.
That said, NoSQL is newer and integration isn’t perfect cross platform, as I found out in this assignment. RMongo is not a great package for connecting R to MongoDB, and migrating a HUGE dataframe to MongoDB row-by-row is very time consuming.