Assignment Description: In this assignment, we will take the information from a ralational database and migrate it to a NoSQL database of our choosing. I will use the “data skills” database created for project 3 and migrate it to a MongoDB database.
I will first connect to the MySQL database, which resides in AWS and extract the tables I will use for the MongoDB database:
library(DBI)
skills_db <- dbConnect(RMySQL::MySQL(),
dbname = "data607project3",
host = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com",
port = 3306,
user = "admin",
password = "Data__607")
software <- dbReadTable(skills_db, "softwareskills")
computing <- dbReadTable(skills_db, "computingskills")
education <- dbReadTable(skills_db, "educationlevels")
The environment in MongoDB is set up by creating the database of where the data will be inserted and the names of the collections for each table:
library(mongolite)
mongo_software <- mongo(collection = "software", db = "skills")
mongo_computing <- mongo(collection = "computing", db = "skills")
mongo_education <- mongo(collection = "education", db = "skills")
The tables we extracted from the MySQL database will be inserted in our newly created database and collections in MongoDB, but first we make sure we clear any data that might be in the collections:
mongo_software$drop()
mongo_software$insert(software)
## List of 5
## $ nInserted : num 46
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_computing$drop()
mongo_computing$insert(computing)
## List of 5
## $ nInserted : num 15
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_education$drop()
mongo_education$insert(education)
## List of 5
## $ nInserted : num 7
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
We have to make sure that our data was successfully migrated to our new MongoDB database:
query_software <- mongo_software$find('{}', fields = '{"_id":0}')
query_software
## id Keyword LinkedIn Indeed Monster SimplyHired AngelList
## 1 1 Python 10534 7067 5353 5564 962
## 2 2 SQL 7359 5504 8970 4475 559
## 3 3 R 6454 4617 3421 3709 332
## 4 4 Java 4458 3366 4359 2549 527
## 5 5 Hadoop 4388 2982 2858 2240 194
## 6 6 Spark 3686 2978 2453 2262 259
## 7 7 Excel 2727 2287 1674 1862 627
## 8 8 Tableau 2636 2183 1826 1732 76
## 9 9 AWS 2621 2346 1682 1793 571
## 10 10 SAS 2589 1744 1153 1417 37
## 11 11 Scala 1980 1506 1206 1175 182
## 12 12 C++ 1977 1567 1094 1160 459
## 13 13 Hive 1859 1534 40 1161 91
## 14 14 Javascript 1564 1192 1436 992 635
## 15 15 NoSQL 1417 1103 1562 942 221
## 16 16 Azure 1332 1205 752 800 90
## 17 17 TensorFlow 1130 926 643 686 149
## 18 18 C 0 0 0 0 0
## 19 19 PowerPoint 1061 832 706 712 16
## 20 20 Matlab 986 828 625 691 68
## 21 21 Docker 951 792 583 603 228
## 22 22 Git 931 813 814 669 232
## 23 23 C# 839 645 610 497 459
## 24 24 MySQL 780 659 501 524 226
## 25 25 Ruby 736 535 302 411 262
## 26 26 Microsoft Office 711 550 661 427 37
## 27 27 SPSS 660 470 416 428 18
## 28 28 MongoDB 653 584 401 423 159
## 29 29 Pig 639 508 477 422 17
## 30 30 Pandas 630 525 333 450 86
## 31 31 Hbase 629 500 495 419 33
## 32 32 Cassandra 610 490 461 408 90
## 33 33 Numpy 585 433 311 365 88
## 34 34 Perl 521 473 386 332 86
## 35 35 Power BI 494 400 275 347 16
## 36 36 Node 476 360 95 317 110
## 37 37 PostgreSQL 438 376 470 324 194
## 38 38 D3 436 176 173 141 69
## 39 39 Keras 403 302 211 245 50
## 40 40 PHP 382 320 133 212 113
## 41 41 Redis 274 194 131 158 136
## 42 42 Alteryx 270 262 201 170 0
## 43 43 Jupyter 270 198 139 173 22
## 44 44 Stata 256 172 124 155 14
## 45 45 Caffe 243 185 155 149 35
## 46 46 PyTorch 241 178 161 165 32
mongo_software$count('{}') == nrow(software)
## [1] TRUE
query_computing <- mongo_computing$find('{}', fields = '{"_id":0}')
query_computing
## id Keyword LinkedIn Indeed SimplyHired Monster
## 1 1 machine learning 5701 3439 2561 2340
## 2 2 analysis 5168 3500 2668 3306
## 3 3 statistics 4893 2992 2308 2399
## 4 4 computer science 4517 2739 2093 1900
## 5 5 communication 3404 2344 1791 2053
## 6 6 mathematics 2605 1961 1497 1815
## 7 7 visualization 1879 1413 1153 1207
## 8 8 AI composite 0 0 0 0
## 9 9 deep learning 1310 979 675 606
## 10 10 NLP composite 0 0 0 0
## 11 11 software development 732 627 481 784
## 12 12 neural networks 671 485 421 305
## 13 13 data engineering 514 0 0 200
## 14 14 project management 476 397 330 348
## 15 15 software engineering 413 295 250 512
mongo_computing$count('{}') == nrow(computing)
## [1] TRUE
query_education <- mongo_education$find('{}', fields = '{"_id":0}')
query_education
## id Keyword LinkedIn Monster Indeed SimplyHired AngelList
## 1 1 MS 2917 1985 1821 2532 288
## 2 2 PhD 3242 1629 1468 2221 230
## 3 3 masters 1568 2033 887 2704 165
## 4 4 bachelors 677 2631 578 3326 97
## 5 5 MBA 1186 634 675 788 63
## 6 6 bootcamp 31 74 14 129 18
## 7 7 Kaggle 74 44 12086 49 5
mongo_education$count('{}') == nrow(education)
## [1] TRUE
First a quick overview of how each database is structured:
SQL stands for Structured Query Language, and this is the language in which relational databases are written and read within the environment. Each database consists of tables that can be thought of as spreadsheets, where data is stored in columns representing the fileds and rows representing observations. Typically the talbes will have a primary key that will identify a specific observation and can be used to link or connect to data in other tables where this primary key would be called a foreign key.
On the other hand, NoSQL databases are the opposite of relational databases and do not require a schema. Instead of using tables like in a relational database, these databases are made up of documents, which are representations of an object in programming. This makes working with NoSQL databases a lot more flexible since they are not bound by columns and rows. Additionally, data that would otherwise be linked through different tables in SQL, is rather added as a property to an existing document.
The database we decided to use will depend on what we’re trying to build and what we want our data to look like, then we can pick the one that best meets our needs.
Arguments for using SQL include:
Structured Data
There may be projects that need our data to be structured in a specific way in order to keep it organized in different tables that link to one another in some way.
ACID Compliance
Most SQL databases are ACID compliant, which means the integrity of your data is guaranteed.
Joins
This is a function that is very useful in SQL when developing complex applications. Retrieving multiple sets of data in different tables can be achieved through “joining”, which there is no way of doing in a NoSQL database at the moment.
Arguments for using NoSQL include:
Flexible
NoSQL databases allow for the user to structure data that a table could not easily support or in a way that would make the most sense for the application they are working on.
Fast
Even though these type of databases are not ACID compliant, they are significantly faster. NoSQL databases are much faster at querying, but it may cost you the speed you gain if your application requires to query several tables for needed data, which could be easily accomplished through “joining” in SQL.
Ease of Use
Since we have everything represented as an object, it can be very easy to reason with these type of databases and there is a very little learning curve in using.