The following project involves migrating the MySQL database used for our Data Science Skills project to MongoDB. The original database consisted of 4 tables: company, job_position, job_post_specific, and description. The column list and relationships are detailed in the diagram below.

As conveyed by the diagram, the database consists of companies which in turn publish job positions. Some of those positions are listed in different geographic areas and may have slight variations. Those sub listings are captured in the job_post_specific_table. Finally, the description table was used to separate the HTML link and the description text from each unique sub-listing.
For the migration to MongoDB, we will first import the company table as a separate collection. The default object ID will be replaced by the primary key used in the MySQL table in order to maintain the relationship with the job_position data through a document reference.
There is a 1:1 between each listing (job_post_specific table ) and its link and description (description table), so we’ll merge both tables into the same document.
Although there may be several job listings (job_post_specific) derived from one position (job_position) the ratio isn’t very high and the bulk of the data is contained in the description. For this reason, we will merge the job position data into the job listings even if that means repeating some of the data. As we did with the description table, we will replace the default object ID with the primary key of the job_post_specific table.
Below are sample documents from our final database.
# Company Collection Sample
{
"_id": "c_10",
"company": "3M",
"company_revenue": "More than $10B (USD)",
"company_employees": "10,000+",
"company_industry": "Industrial Manufacturing",
"no_of_stars": 4.099999905,
"no_of_reviews": 2837
}
# Listing Collection Sample
{
"_id": "fec647775a21ecc6",
"job_title": "Data Scientist I",
"company": "Federal Reserve Bank of Dallas",
"company_id": "c_782",
"location": "TX",
"description": "[<ul><li>Assist in consultations with business partn...",
"queried_salary": "<80000",
"skill": [
"Statistical Software",
"Time Management",
"R",
"Microsoft Office",
"Excel",
"Tableau",
"Data Science"
],
"job_type": "data_scientist",
"date_since_posted": 30,
"link": "https://www.indeed.com/rc/clk?jk=fec647775a21ecc6&fccid=2c6850e24c8a2811&vjs=3"
}
In this particular case, I don’t think either storage model is ideal for this dataset and choosing a platform would depend on how the data will be used. In the data set, we have two distinct entities (companies and listings) which are composed of very different data but are still closely related. Additionally, some large companies publish the same position is different locations or with slight variations. The relational model seems to be better suited at keeping these entities separate while preserving their relationships.
On the other hand, most of the data is contained within the job listing which can be easily stored on MongoDB’s document model. MongoDB’s schemaless design is more flexible and would make it easier to expand and adapt the data set for future applications. MongoDB also has built-in Geolocation capabilities which could be useful in this application.
If I were to start a similar project from scratch I would still opt for MongoDB, since the relational model isn’t an absolute necessity for this data and the schemaless design would give me more flexibility.
The code for the migration is detailed below.
First, we load the required libraries.
library(mongolite)
library(tidyverse)
library(RMySQL)
library(jsonlite)
We establish a connnection to the MySQL Server.
mydb <-
DBI::dbConnect(
RMySQL::MySQL(),
dbname = 'project3',
host = "35.232.240.247",
user = "root",
password = rstudioapi::askForPassword("Database password"))
We list the tables to verify that the connection is working.
dbListTables(mydb)
[1] "company" "description" "job_position" "job_post_specific"
We load the company
table onto a data frame and we rename the primary key as _id
.
company <- dbGetQuery(mydb,"SELECT * FROM company") %>%
rename("_id" = "company_id")
We join the job_post_specific
, job_position
and description
tables, keeping the primary key from the job_post_specific
table. We will include the company name along with the listings for easy referencing.
listing <-
dbGetQuery(mydb,
"SELECT
jk_id,
job_title,
company,
company_id,
location,
description,
queried_salary,
skill,
job_type,
date_since_posted,
link
FROM
company
NATURAL JOIN
job_post_specific
NATURAL JOIN
job_position
NATURAL JOIN
description;")
The skills list is stored as a string in the MySQL database. The transformation below will convert the string into a list which in turn will be converted into an array once the data is imported into MongoDB. We will also rename the primary key as _id
.
listing <- listing %>%
mutate(skill = str_replace_all(skill,"', '", "\\|")) %>%
mutate(skill = str_replace_all(skill,"(\\[')|('\\])", "")) %>%
mutate(skill = as.list(strsplit(skill, "\\|"))) %>%
rename('_id' = `jk_id`)
Now that we have our data ready we can create two connections to MongoDB, one for each collection.
mgcompany <- mongo(collection = "company",db = "DS-Skills", url = "mongodb://localhost",
verbose = FALSE, options = ssl_options())
mglistings <- mongo(collection = "listings",db = "DS-Skills", url = "mongodb://localhost",
verbose = FALSE, options = ssl_options())
Once the connections are established we can upload the data to the new database.
mgcompany$insert(company)
mglistings$insert(listing)
LS0tDQp0aXRsZTogIk15U1FMIHRvIE1vbmdvREIgTWlncmF0aW9uIg0Kc3VidGl0bGU6ICJEYXRhIFNjaWVuY2UgU2tpbGxzIERhdGEgU2V0Ig0KYXV0aG9yOiAiRmVybmFuZG8gRmlndWVyZXMiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQpUaGUgZm9sbG93aW5nIHByb2plY3QgaW52b2x2ZXMgbWlncmF0aW5nIHRoZSBNeVNRTCBkYXRhYmFzZSB1c2VkIGZvciBvdXIgRGF0YSBTY2llbmNlIFNraWxscyBwcm9qZWN0IHRvIE1vbmdvREIuIFRoZSBvcmlnaW5hbCBkYXRhYmFzZSBjb25zaXN0ZWQgb2YgNCB0YWJsZXM6IGNvbXBhbnksIGpvYl9wb3NpdGlvbiwgam9iX3Bvc3Rfc3BlY2lmaWMsIGFuZCBkZXNjcmlwdGlvbi4gVGhlIGNvbHVtbiBsaXN0IGFuZCByZWxhdGlvbnNoaXBzIGFyZSBkZXRhaWxlZCBpbiB0aGUgZGlhZ3JhbSBiZWxvdy4NCg0KIVtdKG15ZGJfZGlhZy5wbmcpDQoNCkFzIGNvbnZleWVkIGJ5IHRoZSBkaWFncmFtLCB0aGUgZGF0YWJhc2UgY29uc2lzdHMgb2YgY29tcGFuaWVzIHdoaWNoIGluIHR1cm4gcHVibGlzaCBqb2IgcG9zaXRpb25zLiBTb21lIG9mIHRob3NlIHBvc2l0aW9ucyBhcmUgbGlzdGVkIGluIGRpZmZlcmVudCBnZW9ncmFwaGljIGFyZWFzIGFuZCBtYXkgaGF2ZSBzbGlnaHQgdmFyaWF0aW9ucy4gVGhvc2Ugc3ViIGxpc3RpbmdzIGFyZSBjYXB0dXJlZCBpbiB0aGUgam9iX3Bvc3Rfc3BlY2lmaWNfdGFibGUuIEZpbmFsbHksIHRoZSBkZXNjcmlwdGlvbiB0YWJsZSB3YXMgdXNlZCB0byBzZXBhcmF0ZSB0aGUgSFRNTCBsaW5rIGFuZCB0aGUgZGVzY3JpcHRpb24gdGV4dCBmcm9tIGVhY2ggdW5pcXVlIHN1Yi1saXN0aW5nLg0KDQpGb3IgdGhlIG1pZ3JhdGlvbiB0byBNb25nb0RCLCB3ZSB3aWxsIGZpcnN0IGltcG9ydCB0aGUgY29tcGFueSB0YWJsZSBhcyBhIHNlcGFyYXRlIGNvbGxlY3Rpb24uIFRoZSBkZWZhdWx0IG9iamVjdCBJRCB3aWxsIGJlIHJlcGxhY2VkIGJ5IHRoZSBwcmltYXJ5IGtleSB1c2VkIGluIHRoZSBNeVNRTCB0YWJsZSBpbiBvcmRlciB0byBtYWludGFpbiB0aGUgcmVsYXRpb25zaGlwIHdpdGggdGhlIGpvYl9wb3NpdGlvbiBkYXRhIHRocm91Z2ggYSBkb2N1bWVudCByZWZlcmVuY2UuDQoNClRoZXJlIGlzIGEgMToxIGJldHdlZW4gZWFjaCBsaXN0aW5nIChqb2JfcG9zdF9zcGVjaWZpYyB0YWJsZSApIGFuZCBpdHMgbGluayBhbmQgZGVzY3JpcHRpb24gKGRlc2NyaXB0aW9uIHRhYmxlKSwgc28gd2UnbGwgbWVyZ2UgYm90aCB0YWJsZXMgaW50byB0aGUgc2FtZSBkb2N1bWVudC4NCg0KQWx0aG91Z2ggdGhlcmUgbWF5IGJlIHNldmVyYWwgam9iIGxpc3RpbmdzIChqb2JfcG9zdF9zcGVjaWZpYykgZGVyaXZlZCBmcm9tIG9uZSBwb3NpdGlvbiAoam9iX3Bvc2l0aW9uKSB0aGUgcmF0aW8gaXNuJ3QgdmVyeSBoaWdoIGFuZCB0aGUgYnVsayBvZiB0aGUgZGF0YSBpcyBjb250YWluZWQgaW4gdGhlIGRlc2NyaXB0aW9uLiBGb3IgdGhpcyByZWFzb24sIHdlIHdpbGwgbWVyZ2UgdGhlIGpvYiBwb3NpdGlvbiBkYXRhIGludG8gdGhlIGpvYiBsaXN0aW5ncyBldmVuIGlmIHRoYXQgbWVhbnMgcmVwZWF0aW5nIHNvbWUgb2YgdGhlIGRhdGEuIEFzIHdlIGRpZCB3aXRoIHRoZSBkZXNjcmlwdGlvbiB0YWJsZSwgd2Ugd2lsbCByZXBsYWNlIHRoZSBkZWZhdWx0IG9iamVjdCBJRCB3aXRoIHRoZSBwcmltYXJ5IGtleSBvZiB0aGUgam9iX3Bvc3Rfc3BlY2lmaWMgdGFibGUuDQoNCkJlbG93IGFyZSBzYW1wbGUgZG9jdW1lbnRzIGZyb20gb3VyIGZpbmFsIGRhdGFiYXNlLg0KDQpgYGB7cn0NCiMgQ29tcGFueSBDb2xsZWN0aW9uIFNhbXBsZQ0KDQp7DQogICAgIl9pZCI6ICJjXzEwIiwNCiAgICAiY29tcGFueSI6ICIzTSIsDQogICAgImNvbXBhbnlfcmV2ZW51ZSI6ICJNb3JlIHRoYW4gJDEwQiAoVVNEKSIsDQogICAgImNvbXBhbnlfZW1wbG95ZWVzIjogIjEwLDAwMCsiLA0KICAgICJjb21wYW55X2luZHVzdHJ5IjogIkluZHVzdHJpYWwgTWFudWZhY3R1cmluZyIsDQogICAgIm5vX29mX3N0YXJzIjogNC4wOTk5OTk5MDUsDQogICAgIm5vX29mX3Jldmlld3MiOiAyODM3DQp9DQoNCg0KDQojIExpc3RpbmcgQ29sbGVjdGlvbiBTYW1wbGUNCg0Kew0KICAgICJfaWQiOiAiZmVjNjQ3Nzc1YTIxZWNjNiIsDQogICAgImpvYl90aXRsZSI6ICJEYXRhIFNjaWVudGlzdCBJIiwNCiAgICAiY29tcGFueSI6ICJGZWRlcmFsIFJlc2VydmUgQmFuayBvZiBEYWxsYXMiLA0KICAgICJjb21wYW55X2lkIjogImNfNzgyIiwNCiAgICAibG9jYXRpb24iOiAiVFgiLA0KICAgICJkZXNjcmlwdGlvbiI6ICJbPHVsPjxsaT5Bc3Npc3QgaW4gY29uc3VsdGF0aW9ucyB3aXRoIGJ1c2luZXNzIHBhcnRuLi4uIiwNCiAgICAicXVlcmllZF9zYWxhcnkiOiAiPDgwMDAwIiwNCiAgICAic2tpbGwiOiBbDQogICAgICAgICJTdGF0aXN0aWNhbCBTb2Z0d2FyZSIsDQogICAgICAgICJUaW1lIE1hbmFnZW1lbnQiLA0KICAgICAgICAiUiIsDQogICAgICAgICJNaWNyb3NvZnQgT2ZmaWNlIiwNCiAgICAgICAgIkV4Y2VsIiwNCiAgICAgICAgIlRhYmxlYXUiLA0KICAgICAgICAiRGF0YSBTY2llbmNlIg0KICAgIF0sDQogICAgImpvYl90eXBlIjogImRhdGFfc2NpZW50aXN0IiwNCiAgICAiZGF0ZV9zaW5jZV9wb3N0ZWQiOiAzMCwNCiAgICAibGluayI6ICJodHRwczovL3d3dy5pbmRlZWQuY29tL3JjL2Nsaz9qaz1mZWM2NDc3NzVhMjFlY2M2JmZjY2lkPTJjNjg1MGUyNGM4YTI4MTEmdmpzPTMiDQp9DQpgYGANCg0KSW4gdGhpcyBwYXJ0aWN1bGFyIGNhc2UsIEkgZG9uJ3QgdGhpbmsgZWl0aGVyIHN0b3JhZ2UgbW9kZWwgaXMgaWRlYWwgZm9yIHRoaXMgZGF0YXNldCBhbmQgY2hvb3NpbmcgYSBwbGF0Zm9ybSB3b3VsZCBkZXBlbmQgb24gaG93IHRoZSBkYXRhIHdpbGwgYmUgdXNlZC4gSW4gdGhlIGRhdGEgc2V0LCB3ZSBoYXZlIHR3byBkaXN0aW5jdCBlbnRpdGllcyAoY29tcGFuaWVzIGFuZCBsaXN0aW5ncykgd2hpY2ggYXJlIGNvbXBvc2VkIG9mIHZlcnkgZGlmZmVyZW50IGRhdGEgYnV0IGFyZSBzdGlsbCBjbG9zZWx5IHJlbGF0ZWQuIEFkZGl0aW9uYWxseSwgc29tZSBsYXJnZSBjb21wYW5pZXMgcHVibGlzaCB0aGUgc2FtZSBwb3NpdGlvbiBpcyBkaWZmZXJlbnQgbG9jYXRpb25zIG9yIHdpdGggc2xpZ2h0IHZhcmlhdGlvbnMuIFRoZSByZWxhdGlvbmFsIG1vZGVsIHNlZW1zIHRvIGJlIGJldHRlciBzdWl0ZWQgYXQga2VlcGluZyB0aGVzZSBlbnRpdGllcyBzZXBhcmF0ZSB3aGlsZSBwcmVzZXJ2aW5nIHRoZWlyIHJlbGF0aW9uc2hpcHMuIA0KDQpPbiB0aGUgb3RoZXIgaGFuZCwgbW9zdCBvZiB0aGUgZGF0YSBpcyBjb250YWluZWQgd2l0aGluIHRoZSBqb2IgbGlzdGluZyB3aGljaCBjYW4gYmUgZWFzaWx5IHN0b3JlZCBvbiBNb25nb0RCJ3MgZG9jdW1lbnQgbW9kZWwuIE1vbmdvREIncyBzY2hlbWFsZXNzIGRlc2lnbiBpcyBtb3JlIGZsZXhpYmxlIGFuZCB3b3VsZCBtYWtlIGl0IGVhc2llciB0byBleHBhbmQgYW5kIGFkYXB0IHRoZSBkYXRhIHNldCBmb3IgZnV0dXJlIGFwcGxpY2F0aW9ucy4gTW9uZ29EQiBhbHNvIGhhcyBidWlsdC1pbiBHZW9sb2NhdGlvbiBjYXBhYmlsaXRpZXMgd2hpY2ggY291bGQgYmUgdXNlZnVsIGluIHRoaXMgYXBwbGljYXRpb24uDQoNCklmIEkgd2VyZSB0byBzdGFydCBhIHNpbWlsYXIgcHJvamVjdCBmcm9tIHNjcmF0Y2ggSSB3b3VsZCBzdGlsbCBvcHQgZm9yIE1vbmdvREIsIHNpbmNlIHRoZSByZWxhdGlvbmFsIG1vZGVsIGlzbid0IGFuIGFic29sdXRlIG5lY2Vzc2l0eSBmb3IgdGhpcyBkYXRhIGFuZCB0aGUgc2NoZW1hbGVzcyBkZXNpZ24gd291bGQgZ2l2ZSBtZSBtb3JlIGZsZXhpYmlsaXR5Lg0KDQoqKlRoZSBjb2RlIGZvciB0aGUgbWlncmF0aW9uIGlzIGRldGFpbGVkIGJlbG93LioqDQoNCkZpcnN0LCB3ZSBsb2FkIHRoZSByZXF1aXJlZCBsaWJyYXJpZXMuDQpgYGB7ciBsaWJyYXJpZXMsIG1lc3NhZ2U9RkFMU0UsIHdhcm5pbmc9RkFMU0V9DQpsaWJyYXJ5KG1vbmdvbGl0ZSkNCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShSTXlTUUwpDQpsaWJyYXJ5KGpzb25saXRlKQ0KYGBgDQoNCldlIGVzdGFibGlzaCBhIGNvbm5uZWN0aW9uIHRvIHRoZSBNeVNRTCBTZXJ2ZXIuDQpgYGB7ciBNeVNRTCBDb25uZWN0aW9ufQ0KbXlkYiA8LSANCiAgREJJOjpkYkNvbm5lY3QoDQogICAgUk15U1FMOjpNeVNRTCgpLCANCiAgICBkYm5hbWUgPSAncHJvamVjdDMnLCANCiAgICBob3N0ID0gIjM1LjIzMi4yNDAuMjQ3IiwgDQogICAgdXNlciA9ICJyb290IiwgDQogICAgcGFzc3dvcmQgPSByc3R1ZGlvYXBpOjphc2tGb3JQYXNzd29yZCgiRGF0YWJhc2UgcGFzc3dvcmQiKSkNCmBgYA0KDQpXZSBsaXN0IHRoZSB0YWJsZXMgdG8gdmVyaWZ5IHRoYXQgdGhlIGNvbm5lY3Rpb24gaXMgd29ya2luZy4NCmBgYHtyIE15U1FMIFRhYmxlIExpc3R9DQpkYkxpc3RUYWJsZXMobXlkYikNCmBgYA0KDQpXZSBsb2FkIHRoZSBgY29tcGFueWAgdGFibGUgb250byBhIGRhdGEgZnJhbWUgYW5kIHdlIHJlbmFtZSB0aGUgcHJpbWFyeSBrZXkgYXMgYF9pZGAuDQpgYGB7ciBNeVNRTCBDb21wYW5pZXN9DQpjb21wYW55IDwtIGRiR2V0UXVlcnkobXlkYiwiU0VMRUNUICogRlJPTSBjb21wYW55IikgJT4lIA0KICByZW5hbWUoIl9pZCIgPSAiY29tcGFueV9pZCIpDQpgYGANCg0KV2Ugam9pbiB0aGUgYGpvYl9wb3N0X3NwZWNpZmljYCwgYGpvYl9wb3NpdGlvbmAgYW5kIGBkZXNjcmlwdGlvbmAgdGFibGVzLCBrZWVwaW5nIHRoZSBwcmltYXJ5IGtleSBmcm9tIHRoZSBgam9iX3Bvc3Rfc3BlY2lmaWNgIHRhYmxlLiBXZSB3aWxsIGluY2x1ZGUgdGhlIGNvbXBhbnkgbmFtZSBhbG9uZyB3aXRoIHRoZSBsaXN0aW5ncyBmb3IgZWFzeSByZWZlcmVuY2luZy4NCmBgYHtyIE15U1FMIExpc3RpbmdzfQ0KbGlzdGluZyA8LSANCiAgZGJHZXRRdWVyeShteWRiLA0KICAgICJTRUxFQ1QgDQogICAgICBqa19pZCwNCiAgICAgIGpvYl90aXRsZSwNCiAgICAgIGNvbXBhbnksDQogICAgICBjb21wYW55X2lkLA0KICAgICAgbG9jYXRpb24sDQogICAgICBkZXNjcmlwdGlvbiwNCiAgICAgIHF1ZXJpZWRfc2FsYXJ5LA0KICAgICAgc2tpbGwsDQogICAgICBqb2JfdHlwZSwNCiAgICAgIGRhdGVfc2luY2VfcG9zdGVkLA0KICAgICAgbGluaw0KICAgIEZST00NCiAgICAgIGNvbXBhbnkNCiAgICBOQVRVUkFMIEpPSU4NCiAgICAgIGpvYl9wb3N0X3NwZWNpZmljDQogICAgTkFUVVJBTCBKT0lODQogICAgICBqb2JfcG9zaXRpb24NCiAgICBOQVRVUkFMIEpPSU4NCiAgICAgIGRlc2NyaXB0aW9uOyIpDQpgYGANCg0KVGhlIHNraWxscyBsaXN0IGlzIHN0b3JlZCBhcyBhIHN0cmluZyBpbiB0aGUgTXlTUUwgZGF0YWJhc2UuIFRoZSB0cmFuc2Zvcm1hdGlvbiBiZWxvdyB3aWxsIGNvbnZlcnQgdGhlIHN0cmluZyBpbnRvIGEgbGlzdCB3aGljaCBpbiB0dXJuIHdpbGwgYmUgY29udmVydGVkIGludG8gYW4gYXJyYXkgb25jZSB0aGUgZGF0YSBpcyBpbXBvcnRlZCBpbnRvIE1vbmdvREIuIFdlIHdpbGwgYWxzbyByZW5hbWUgdGhlIHByaW1hcnkga2V5IGFzIGBfaWRgLg0KYGBge3IgU2tpbGxzIHRyYW5zZm9ybWF0aW9ufQ0KbGlzdGluZyA8LSBsaXN0aW5nICU+JSANCiAgbXV0YXRlKHNraWxsID0gc3RyX3JlcGxhY2VfYWxsKHNraWxsLCInLCAnIiwgIlxcfCIpKSAlPiUgDQogIG11dGF0ZShza2lsbCA9IHN0cl9yZXBsYWNlX2FsbChza2lsbCwiKFxcWycpfCgnXFxdKSIsICIiKSkgJT4lIA0KICBtdXRhdGUoc2tpbGwgPSBhcy5saXN0KHN0cnNwbGl0KHNraWxsLCAiXFx8IikpKSAlPiUgDQogIHJlbmFtZSgnX2lkJyA9IGBqa19pZGApDQpgYGANCg0KTm93IHRoYXQgd2UgaGF2ZSBvdXIgZGF0YSByZWFkeSB3ZSBjYW4gY3JlYXRlIHR3byBjb25uZWN0aW9ucyB0byBNb25nb0RCLCBvbmUgZm9yIGVhY2ggY29sbGVjdGlvbi4NCmBgYHtyIG1vbmdvZGItY29ubmVjdGlvbnN9DQptZ2NvbXBhbnkgPC0gbW9uZ28oY29sbGVjdGlvbiA9ICJjb21wYW55IixkYiA9ICJEUy1Ta2lsbHMiLCB1cmwgPSAibW9uZ29kYjovL2xvY2FsaG9zdCIsDQogIHZlcmJvc2UgPSBGQUxTRSwgb3B0aW9ucyA9IHNzbF9vcHRpb25zKCkpDQptZ2xpc3RpbmdzIDwtIG1vbmdvKGNvbGxlY3Rpb24gPSAibGlzdGluZ3MiLGRiID0gIkRTLVNraWxscyIsIHVybCA9ICJtb25nb2RiOi8vbG9jYWxob3N0IiwNCiAgdmVyYm9zZSA9IEZBTFNFLCBvcHRpb25zID0gc3NsX29wdGlvbnMoKSkNCmBgYA0KDQpPbmNlIHRoZSBjb25uZWN0aW9ucyBhcmUgZXN0YWJsaXNoZWQgd2UgY2FuIHVwbG9hZCB0aGUgZGF0YSB0byB0aGUgbmV3IGRhdGFiYXNlLg0KYGBge3IgVXBsb2FkIHRvIE1vbmdvREJ9DQptZ2NvbXBhbnkkaW5zZXJ0KGNvbXBhbnkpDQptZ2xpc3RpbmdzJGluc2VydChsaXN0aW5nKQ0KYGBgDQo=