CUNY SPS - Master of Science in Data Science

Assignment 7: NoSQL Migration

Assignment Description: In this assignment, we will take the information from a ralational database and migrate it to a NoSQL database of our choosing. I will use the “data skills” database created for project 3 and migrate it to a MongoDB database.

Load Database

I will first connect to the MySQL database, which resides in AWS and extract the tables I will use for the MongoDB database:

library(DBI)
skills_db <- dbConnect(RMySQL::MySQL(), 
                dbname = "data607project3", 
                host = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com", 
                port = 3306, 
                user = "admin",
                password = "Data__607")

software <- dbReadTable(skills_db, "softwareskills")
computing <- dbReadTable(skills_db, "computingskills")
education <- dbReadTable(skills_db, "educationlevels")

Create Database and Collections in MongoDB

The environment in MongoDB is set up by creating the database of where the data will be inserted and the names of the collections for each table:

library(mongolite)
mongo_software <- mongo(collection = "software", db = "skills")
mongo_computing <- mongo(collection = "computing", db = "skills")
mongo_education <- mongo(collection = "education", db = "skills")

Clear Collections and Insert Tables

The tables we extracted from the MySQL database will be inserted in our newly created database and collections in MongoDB, but first we make sure we clear any data that might be in the collections:

mongo_software$drop()
mongo_software$insert(software)

## List of 5
##  $ nInserted  : num 46
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

mongo_computing$drop()
mongo_computing$insert(computing)

## List of 5
##  $ nInserted  : num 15
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

mongo_education$drop()
mongo_education$insert(education)

## List of 5
##  $ nInserted  : num 7
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Check for Migrated Data in MongoDB

We have to make sure that our data was successfully migrated to our new MongoDB database:

query_software <- mongo_software$find('{}', fields = '{"_id":0}')
query_software

##    id          Keyword LinkedIn Indeed Monster SimplyHired AngelList
## 1   1           Python    10534   7067    5353        5564       962
## 2   2              SQL     7359   5504    8970        4475       559
## 3   3                R     6454   4617    3421        3709       332
## 4   4             Java     4458   3366    4359        2549       527
## 5   5           Hadoop     4388   2982    2858        2240       194
## 6   6            Spark     3686   2978    2453        2262       259
## 7   7            Excel     2727   2287    1674        1862       627
## 8   8          Tableau     2636   2183    1826        1732        76
## 9   9              AWS     2621   2346    1682        1793       571
## 10 10              SAS     2589   1744    1153        1417        37
## 11 11            Scala     1980   1506    1206        1175       182
## 12 12              C++     1977   1567    1094        1160       459
## 13 13             Hive     1859   1534      40        1161        91
## 14 14       Javascript     1564   1192    1436         992       635
## 15 15            NoSQL     1417   1103    1562         942       221
## 16 16            Azure     1332   1205     752         800        90
## 17 17       TensorFlow     1130    926     643         686       149
## 18 18                C        0      0       0           0         0
## 19 19       PowerPoint     1061    832     706         712        16
## 20 20           Matlab      986    828     625         691        68
## 21 21           Docker      951    792     583         603       228
## 22 22              Git      931    813     814         669       232
## 23 23               C#      839    645     610         497       459
## 24 24            MySQL      780    659     501         524       226
## 25 25             Ruby      736    535     302         411       262
## 26 26 Microsoft Office      711    550     661         427        37
## 27 27             SPSS      660    470     416         428        18
## 28 28          MongoDB      653    584     401         423       159
## 29 29              Pig      639    508     477         422        17
## 30 30           Pandas      630    525     333         450        86
## 31 31            Hbase      629    500     495         419        33
## 32 32        Cassandra      610    490     461         408        90
## 33 33            Numpy      585    433     311         365        88
## 34 34             Perl      521    473     386         332        86
## 35 35         Power BI      494    400     275         347        16
## 36 36             Node      476    360      95         317       110
## 37 37       PostgreSQL      438    376     470         324       194
## 38 38               D3      436    176     173         141        69
## 39 39            Keras      403    302     211         245        50
## 40 40              PHP      382    320     133         212       113
## 41 41            Redis      274    194     131         158       136
## 42 42          Alteryx      270    262     201         170         0
## 43 43          Jupyter      270    198     139         173        22
## 44 44            Stata      256    172     124         155        14
## 45 45            Caffe      243    185     155         149        35
## 46 46          PyTorch      241    178     161         165        32

mongo_software$count('{}') == nrow(software)

## [1] TRUE

query_computing <- mongo_computing$find('{}', fields = '{"_id":0}')
query_computing

##    id              Keyword LinkedIn Indeed SimplyHired Monster
## 1   1     machine learning     5701   3439        2561    2340
## 2   2             analysis     5168   3500        2668    3306
## 3   3           statistics     4893   2992        2308    2399
## 4   4     computer science     4517   2739        2093    1900
## 5   5        communication     3404   2344        1791    2053
## 6   6          mathematics     2605   1961        1497    1815
## 7   7        visualization     1879   1413        1153    1207
## 8   8         AI composite        0      0           0       0
## 9   9        deep learning     1310    979         675     606
## 10 10        NLP composite        0      0           0       0
## 11 11 software development      732    627         481     784
## 12 12      neural networks      671    485         421     305
## 13 13     data engineering      514      0           0     200
## 14 14   project management      476    397         330     348
## 15 15 software engineering      413    295         250     512

mongo_computing$count('{}') == nrow(computing)

## [1] TRUE

query_education <- mongo_education$find('{}', fields = '{"_id":0}')
query_education

##   id   Keyword LinkedIn Monster Indeed SimplyHired AngelList
## 1  1        MS     2917    1985   1821        2532       288
## 2  2       PhD     3242    1629   1468        2221       230
## 3  3   masters     1568    2033    887        2704       165
## 4  4 bachelors      677    2631    578        3326        97
## 5  5       MBA     1186     634    675         788        63
## 6  6  bootcamp       31      74     14         129        18
## 7  7    Kaggle       74      44  12086          49         5

mongo_education$count('{}') == nrow(education)

## [1] TRUE

Advantages and Disadvantages Between SQL and NoSQL?

First a quick overview of how each database is structured:

SQL stands for Structured Query Language, and this is the language in which relational databases are written and read within the environment. Each database consists of tables that can be thought of as spreadsheets, where data is stored in columns representing the fileds and rows representing observations. Typically the talbes will have a primary key that will identify a specific observation and can be used to link or connect to data in other tables where this primary key would be called a foreign key.

On the other hand, NoSQL databases are the opposite of relational databases and do not require a schema. Instead of using tables like in a relational database, these databases are made up of documents, which are representations of an object in programming. This makes working with NoSQL databases a lot more flexible since they are not bound by columns and rows. Additionally, data that would otherwise be linked through different tables in SQL, is rather added as a property to an existing document.

The database we decided to use will depend on what we’re trying to build and what we want our data to look like, then we can pick the one that best meets our needs.

Arguments for using SQL include:

Structured Data

There may be projects that need our data to be structured in a specific way in order to keep it organized in different tables that link to one another in some way.

ACID Compliance

Most SQL databases are ACID compliant, which means the integrity of your data is guaranteed.

Joins

This is a function that is very useful in SQL when developing complex applications. Retrieving multiple sets of data in different tables can be achieved through “joining”, which there is no way of doing in a NoSQL database at the moment.

Arguments for using NoSQL include:

Flexible

NoSQL databases allow for the user to structure data that a table could not easily support or in a way that would make the most sense for the application they are working on.

Fast

Even though these type of databases are not ACID compliant, they are significantly faster. NoSQL databases are much faster at querying, but it may cost you the speed you gain if your application requires to query several tables for needed data, which could be easily accomplished through “joining” in SQL.

Ease of Use

Since we have everything represented as an object, it can be very easy to reason with these type of databases and there is a very little learning curve in using.

CUNY SPS - Master of Science in Data Science - DATA607

Mario Pena

November 17, 2019