R Markdown

Reference: https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411843/R+-+Read+Write+files+from+MongoDB

Reference: https://stackoverflow.com/questions/12402184/rmongo-and-query-ing

Reference: https://cran.r-project.org/src/contrib/Archive/RMongo/

Reference: https://datascienceplus.com/using-mongodb-with-r/

Reference: https://github.com/tc/RMongo/blob/master/man/dbRemoveQuery-methods.Rd

Reference: https://www.bmc.com/blogs/sql-vs-nosql/

Reference: https://bitnine.net/blog-computing/sql-vs-nosql-comparative-advantages-and-disadvantages/

Reference: https://medium.com/@zhenwu93/relational-vs-non-relational-databases-8336870da8bc

library(DBI)
## Warning: package 'DBI' was built under R version 3.5.3

Note: I used the data from Project 3 to convert to NoSQL database. The relational data resides on AWS.

cn <- dbConnect(drv      = RMySQL::MySQL(), 
                username = "admin", 
                password = "Data__607", 
                host     = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com", 
                port     = 3306, 
                dbname   = "data607project3")


data <- dbGetQuery(cn, "SELECT * FROM softwareskills")

data2C <- dbGetQuery(cn, "SELECT * FROM computingskills")

data3E<- dbGetQuery(cn, "SELECT * FROM educationlevels")

Load the relational data into NoSQL

library(RMongo)
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 3.5.3
## 
## Attaching package: 'RMongo'
## The following objects are masked from 'package:DBI':
## 
##     dbDisconnect, dbGetQuery
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.5.3
library(stringr)

#Connection to  mongodb 
mongodb <- mongoDbConnect("skills", host = "localhost", port = 27017)

#Remove any existing collections
dbRemoveQuery(mongodb, 'softwareskills', '{ }')
## [1] "ok"
dbRemoveQuery(mongodb, 'computingskills', '{ }')
## [1] "ok"
dbRemoveQuery(mongodb, 'educationlevels', '{ }')
## [1] "ok"
#Function to insert data into mongodb
mongodb_insert <- function(connection, db, df)
  {
  
  #make json mongodb document
  mongodata <- toJSON(df)
  mongodata <- sub("[","",mongodata, fixed = TRUE)
  mongodata <- sub("]","",mongodata, fixed = TRUE)
  mongodata <- as.character(mongodata)
  dbInsertDocument(connection, db, mongodata)
}

for(i in 1:nrow(data)) 
  {
  mongodb_insert(mongodb,"softwareskills",data[i,])
}

for(i in 1:nrow(data2C)) 
  {
  mongodb_insert(mongodb,"computingskills",data2C[i,])
}

for(i in 1:nrow(data3E)) 
  {
  mongodb_insert(mongodb,"educationlevels",data3E[i,])
}

Read the data from NoSQL

library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
#library(mongolite)

mongo <- mongoDbConnect("skills", "localhost", 27017)

kable(dbGetQuery(mongo, "softwareskills","{}")) #Get all the records 
LinkedIn Keyword SimplyHired Monster X_id id Indeed AngelList
10534 Python 5564 5353 5dbef0abe60dcb4d901833f9 1 7067 962
7359 SQL 4475 8970 5dbef0abe60dcb4d901833fa 2 5504 559
6454 R 3709 3421 5dbef0abe60dcb4d901833fb 3 4617 332
4458 Java 2549 4359 5dbef0abe60dcb4d901833fc 4 3366 527
4388 Hadoop 2240 2858 5dbef0abe60dcb4d901833fd 5 2982 194
3686 Spark 2262 2453 5dbef0abe60dcb4d901833fe 6 2978 259
2727 Excel 1862 1674 5dbef0abe60dcb4d901833ff 7 2287 627
2636 Tableau 1732 1826 5dbef0abe60dcb4d90183400 8 2183 76
2621 AWS 1793 1682 5dbef0abe60dcb4d90183401 9 2346 571
2589 SAS 1417 1153 5dbef0abe60dcb4d90183402 10 1744 37
1980 Scala 1175 1206 5dbef0abe60dcb4d90183403 11 1506 182
1977 C++ 1160 1094 5dbef0abe60dcb4d90183404 12 1567 459
1859 Hive 1161 40 5dbef0abe60dcb4d90183405 13 1534 91
1564 Javascript 992 1436 5dbef0abe60dcb4d90183406 14 1192 635
1417 NoSQL 942 1562 5dbef0abe60dcb4d90183407 15 1103 221
1332 Azure 800 752 5dbef0abe60dcb4d90183408 16 1205 90
1130 TensorFlow 686 643 5dbef0abe60dcb4d90183409 17 926 149
0 C 0 0 5dbef0abe60dcb4d9018340a 18 0 0
1061 PowerPoint 712 706 5dbef0abe60dcb4d9018340b 19 832 16
986 Matlab 691 625 5dbef0abe60dcb4d9018340c 20 828 68
951 Docker 603 583 5dbef0abe60dcb4d9018340d 21 792 228
931 Git 669 814 5dbef0abe60dcb4d9018340e 22 813 232
839 C# 497 610 5dbef0abe60dcb4d9018340f 23 645 459
780 MySQL 524 501 5dbef0abe60dcb4d90183410 24 659 226
736 Ruby 411 302 5dbef0abe60dcb4d90183411 25 535 262
711 Microsoft Office 427 661 5dbef0abe60dcb4d90183412 26 550 37
660 SPSS 428 416 5dbef0abe60dcb4d90183413 27 470 18
653 MongoDB 423 401 5dbef0abe60dcb4d90183414 28 584 159
639 Pig 422 477 5dbef0abe60dcb4d90183415 29 508 17
630 Pandas 450 333 5dbef0abe60dcb4d90183416 30 525 86
629 Hbase 419 495 5dbef0abe60dcb4d90183417 31 500 33
610 Cassandra 408 461 5dbef0abe60dcb4d90183418 32 490 90
585 Numpy 365 311 5dbef0abe60dcb4d90183419 33 433 88
521 Perl 332 386 5dbef0abe60dcb4d9018341a 34 473 86
494 Power BI 347 275 5dbef0abe60dcb4d9018341b 35 400 16
476 Node 317 95 5dbef0abe60dcb4d9018341c 36 360 110
438 PostgreSQL 324 470 5dbef0abe60dcb4d9018341d 37 376 194
436 D3 141 173 5dbef0abe60dcb4d9018341e 38 176 69
403 Keras 245 211 5dbef0abe60dcb4d9018341f 39 302 50
382 PHP 212 133 5dbef0abe60dcb4d90183420 40 320 113
274 Redis 158 131 5dbef0abe60dcb4d90183421 41 194 136
270 Alteryx 170 201 5dbef0abe60dcb4d90183422 42 262 0
270 Jupyter 173 139 5dbef0abe60dcb4d90183423 43 198 22
256 Stata 155 124 5dbef0abe60dcb4d90183424 44 172 14
243 Caffe 149 155 5dbef0abe60dcb4d90183425 45 185 35
241 PyTorch 165 161 5dbef0abe60dcb4d90183426 46 178 32

Read the data from NoSQL

library(kableExtra)
#library(mongolite)

mongo <- mongoDbConnect("skills", "localhost", 27017)

kable(dbGetQuery(mongo, "computingskills","{}")) #Get all the records 
LinkedIn Keyword SimplyHired Monster X_id id Indeed
5701 machine learning 2561 2340 5dbef0abe60dcb4d90183427 1 3439
5168 analysis 2668 3306 5dbef0abe60dcb4d90183428 2 3500
4893 statistics 2308 2399 5dbef0abe60dcb4d90183429 3 2992
4517 computer science 2093 1900 5dbef0abe60dcb4d9018342a 4 2739
3404 communication 1791 2053 5dbef0abe60dcb4d9018342b 5 2344
2605 mathematics 1497 1815 5dbef0abe60dcb4d9018342c 6 1961
1879 visualization 1153 1207 5dbef0abe60dcb4d9018342d 7 1413
0 AI composite 0 0 5dbef0abe60dcb4d9018342e 8 0
1310 deep learning 675 606 5dbef0abe60dcb4d9018342f 9 979
0 NLP composite 0 0 5dbef0abe60dcb4d90183430 10 0
732 software development 481 784 5dbef0abe60dcb4d90183431 11 627
671 neural networks 421 305 5dbef0abe60dcb4d90183432 12 485
514 data engineering 0 200 5dbef0abe60dcb4d90183433 13 0
476 project management 330 348 5dbef0abe60dcb4d90183434 14 397
413 software engineering 250 512 5dbef0abe60dcb4d90183435 15 295

Read the data from NoSQL

library(kableExtra)
#library(mongolite)

mongo <- mongoDbConnect("skills", "localhost", 27017)

kable(dbGetQuery(mongo, "educationlevels","{}")) #Get all the records 
LinkedIn Keyword SimplyHired Monster X_id id Indeed AngelList
2917 MS 2532 1985 5dbef0abe60dcb4d90183436 1 1821 288
3242 PhD 2221 1629 5dbef0abe60dcb4d90183437 2 1468 230
1568 masters 2704 2033 5dbef0abe60dcb4d90183438 3 887 165
677 bachelors 3326 2631 5dbef0abe60dcb4d90183439 4 578 97
1186 MBA 788 634 5dbef0abe60dcb4d9018343a 5 675 63
31 bootcamp 129 74 5dbef0abe60dcb4d9018343b 6 14 18
74 Kaggle 49 44 5dbef0abe60dcb4d9018343c 7 12086 5

What are some of the advantages and disadvantages between SQL and NoSQL?

What is SQL?

SQL (Structured Query Language) is a programming language that is used to manage data in relational databases. Relational databases use relations (typically called tables) to store data and then match that data by using common characteristics within the dataset. Some common relational database management systems that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, and Ingres. Cobb’s breakthrough paper describes a database where objects could be constructed and queried using something he called SQL, structured query language. He used SQL to create both data (in objects called tables) and the schema for that data, which describes fields in columns. A single record in a SQL database is called a row.

What is NoSQL?

A NoSQL database, on the other hand, is self-describing, so does not require a schema. Nor does it enforce relations between tables in all cases. All its documents are JSON documents, which are complete entities that one can readily read and understand. NoSQL refers to high-performance, non-relational databases that utilize a wide variety of data models. These databases are highly recognized for their ease-of-use, scalable performance, strong resilience, and wide availability. NoSQL database examples include MongoDB, MarkLogic, Couchbase, CloudDB, and Amazon’s Dynamo DB.

Major Differences

There are many differences between SQL and NoSQL, all of which are important to understand when making a decision about what might be the best data management system for your organization. These include differences in:

Language

Scalability

Community

Structure

The Language

One of the major differences between SQL relational and NoSQL non-relational databases is the language. As mentioned, SQL databases use Structured Query Language for defining and manipulating data. This allows SQL to be extremely versatile and widely-used - however, it also makes it more restrictive. SQL requires that you use predefined schemas to determine the structure of your data before you even begin to work with it. Your data must also follow the same structure as well, which can entail significant upfront preparation along with careful execution.

A NoSQL database features a dynamic schema for unstructured data and the data can be stored in many different ways, whether it be graph-based, document-oriented, column-oriented, or organized as a KeyValue store. This extreme flexibility allows you to create documents without first having to carefully plan and define their structure, add fields as you go, and vary the syntax from database to database. It also allows you to give each document its own unique structure, providing you with more freedom overall.

The Scalability

Another big difference between SQL and NoSQL is their scalability. In most SQL databases, they are vertically scalable, which means that you can increase the load on a single server by increasing components like RAM, SSD, or CPU. In contrast, NoSQL databases are horizontally scalable, which means that they can handle increased traffic simply by adding more servers to the database. NoSQL databases have the ability to become larger and much more powerful, making them the preferred choice for large or constantly evolving data sets.

The Community

Due to SQL’s maturity, it has a much stronger and more developed community compared to NoSQL. There are thousands of chats and forums available where experts can share knowledge and discuss SQL best practices, continuously enhancing skills. Although NoSQL is growing rapidly, its community is not as well defined as SQL due to the fact that it is still relatively new.

The Structure

Finally, a last thing to consider when debating SQL versus NoSQL is their structures. SQL databases are table-based which makes them a better option for applications that require multi-row transactions. Samples of these may be accounting systems or even legacy systems that were originally built for a relational structure. NoSQL databases can be key-value pairs, wide-column stores, graph databases, or document-based.