library(RMariaDB)
library(mongolite)
library(dplyr)
library(ggplot2)
# Connecting to the database
user_name <- 'sie siong'
user_password <- "redy2rok"
database <- 'prj3'
host_name <- 'msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com'
mydb_sql <- dbConnect(RMariaDB::MariaDB(), user=user_name, password=user_password, dbname=database, host=host_name)
mydb_sql
## <MariaDBConnection>
## Host: msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com
## Server: 5.7.22-log
## Client: 5.5.1
# List of tables.
dbListTables(mydb_sql)
## [1] "agg_linkedin" "df"
## [3] "df_bak" "ds_general_skills_clean"
## [5] "dsmain" "footable"
## [7] "just_skills" "payscale_data"
## [9] "rawdata" "sample_linkedin_tall"
## [11] "sample_linkedin_wide" "skills_raw"
# Select one table to migrate.
sql_tbl <- dbGetQuery(mydb_sql, "select * from df")
head(sql_tbl, n=15)
## skill_id skills count title
## 1 1 Python 11 Data Scientist at Square
## 2 2 R 9 Data Scientist at Square
## 3 3 C++ 7 Data Scientist at Square
## 4 4 Data Structures 5 Data Scientist at Square
## 5 5 Statistics 2 Data Scientist at Square
## 6 6 Machine Learning 2 Data Scientist at Square
## 7 7 Research 1 Data Scientist at Square
## 8 8 Optimization 1 Data Scientist at Square
## 9 9 Data Analysis 1 Data Scientist at Square
## 10 10 Distributed Systems 1 Data Scientist at Square
## 11 17 SQL 2 Data Scientist at Square
## 12 18 python 1 Data Scientist at Square
## 13 19 Amazon Web Services AWS 1 Data Scientist at Square
## 14 20 MongoDB 1 Data Scientist at Square
## 15 29 Deep Learning 2 Data Scientist at Square
tail(sql_tbl, n=15)
## skill_id skills count title
## 3866 4801 Matlab 17 Data Scientist
## 3867 4802 LaTeX 9 Data Scientist
## 3868 4803 Python 10 Data Scientist
## 3869 4804 Data Analysis 12 Data Scientist
## 3870 4805 Machine Learning 8 Data Scientist
## 3871 4806 Research 6 Data Scientist
## 3872 4807 Financial Modeling 4 Data Scientist
## 3873 4808 Data Mining 11 Data Scientist
## 3874 4809 Statistics 8 Data Scientist
## 3875 4810 R 12 Data Scientist
## 3876 4811 Linux 3 Data Scientist
## 3877 4812 Microsoft Excel 8 Data Scientist
## 3878 4813 SQL 5 Data Scientist
## 3879 4814 Microsoft Office 6 Data Scientist
## 3880 4815 C++ 1 Data Scientist
# Database Connection and Creation
mydb_mongo <- mongo(collection = "sql_tbl", db = "DataScienceSkills", url = "mongodb://localhost", verbose = FALSE, options = ssl_options())
mydb_mongo$insert(sql_tbl)
## List of 5
## $ nInserted : num 3880
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
# Check if we have inserted the “slq-tbl” data.
mydb_mongo$count()
## [1] 3880
# Display 3 records.
mydb_mongo$iterate()$batch(3)
## [[1]]
## [[1]]$skill_id
## [1] 1
##
## [[1]]$skills
## [1] "Python"
##
## [[1]]$count
## [1] 11
##
## [[1]]$title
## [1] "Data Scientist at Square"
##
##
## [[2]]
## [[2]]$skill_id
## [1] 2
##
## [[2]]$skills
## [1] "R"
##
## [[2]]$count
## [1] 9
##
## [[2]]$title
## [1] "Data Scientist at Square"
##
##
## [[3]]
## [[3]]$skill_id
## [1] 3
##
## [[3]]$skills
## [1] "C++"
##
## [[3]]$count
## [1] 7
##
## [[3]]$title
## [1] "Data Scientist at Square"
# Count how many distinct skills.
length(mydb_mongo$distinct("skills"))
## [1] 929
# Aggregate the skills and count
df <- mydb_mongo$aggregate('[{"$group":{"_id":"$skills", "count": {"$sum":"$count"}}}]', options = '{"allowDiskUse":true}')
# Rename the column names.
names(df) <- c("skills", "count")
# Select only the skills with count greater than 200.
df <- filter(df, count > 200)
# Sort the count column.
df <- arrange(df, desc(count))
# Display the result.
head(df)
## skills count
## 1 Data Analysis 2196
## 2 R 1864
## 3 Python 1539
## 4 SQL 1237
## 5 Statistics 1036
## 6 Machine Learning 979
# Plot the result.
ggplot(df, aes(x=reorder(skills, count, fun=max), y=count))+
geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
theme(axis.text.x = element_text(angle = 65, vjust=0.6), axis.title.x = element_blank())
SQL Databses: The primary advantages are
It uses a single uniform language (DDL) for different roles (developer, user, DBA).
Unlike the NoSQL, SQL doesn’t have the issue of standardization. It follows the ISI and ANSI standards, which are approved across the globe.
It uses an advanced and non-structural querying language.
It sticks to ACID principles (atomicity, consistency, isolation, durability), thus guaranteeing stability, security, and predictability both of the entire database and every transaction in particular.
NoSQL Databases: Some of the biggest advantages are
They are non-relational (table-less), hence, very different from SQL databases. This means they are easier to manage and they provide a higher level of flexibility with newer data models.
They are mostly open source and low-cost. This makes them an appealing solution for smaller organizations with limited budgets. The top NoSQL databases on the market today such as MongoDB, MarkLogic, etc., allow for rapid processing of real-time Big Data applications in ways that are affordable.
They are easier scalability through support for map reduce. The elastic scalability of NoSQL often used as a major selling point.
The non-relational nature of a NoSQL database allows database architects to quickly create a database without needing to develop a detailed database model. This saves a lot of development time.
SQL Databases: The primary drawbacks are
It is very difficult to scale as much as a database grows larger.
It was designed to handle the number of transactions modern databases have to deal with every second. Giants like Amazon or Alibaba deal with staggering amounts of data that will choke an SQL in a matter of minutes.
NoSQL Databases: Some of the biggest disadvantages
Their community is not as well defined. The NoSQL community is relatively new and lacks the maturity of the MySQL user base. Even though NoSQL is growing rapidly, but for now MySQL is hard to beat for its vast network of highly experienced end users.
A major problem with NoSQL databases is the lack of reporting tools for analysis and performance testing. However, with MySQL, you can find a wide array of reporting tools to help you prove your application’s validity.
NoSQL databases need a standard query language like SQL. This is a major issue highlighted by researchers at Microsoft, who claim that NoSQL’s lack of standardization can cause a problem during migration.