Assignment 12

Load the R Packages

library(RMariaDB)
library(mongolite)
library(dplyr)
library(ggplot2)

Load the Data from MySQL

Database Connection

# Connecting to the database
user_name <- 'sie siong'
user_password <- "redy2rok"
database <- 'prj3'
host_name <- 'msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com'

mydb_sql <- dbConnect(RMariaDB::MariaDB(), user=user_name, password=user_password, dbname=database, host=host_name)
mydb_sql

## <MariaDBConnection>
##   Host:    msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com
##   Server:  5.7.22-log
##   Client:  5.5.1

View Database Tables

# List of tables.
dbListTables(mydb_sql)

##  [1] "agg_linkedin"            "df"                     
##  [3] "df_bak"                  "ds_general_skills_clean"
##  [5] "dsmain"                  "footable"               
##  [7] "just_skills"             "payscale_data"          
##  [9] "rawdata"                 "sample_linkedin_tall"   
## [11] "sample_linkedin_wide"    "skills_raw"

View Tables Data

# Select one table to migrate.
sql_tbl <- dbGetQuery(mydb_sql, "select * from df")

head(sql_tbl, n=15)

##    skill_id                  skills count                    title
## 1         1                  Python    11 Data Scientist at Square
## 2         2                       R     9 Data Scientist at Square
## 3         3                     C++     7 Data Scientist at Square
## 4         4         Data Structures     5 Data Scientist at Square
## 5         5              Statistics     2 Data Scientist at Square
## 6         6        Machine Learning     2 Data Scientist at Square
## 7         7                Research     1 Data Scientist at Square
## 8         8            Optimization     1 Data Scientist at Square
## 9         9           Data Analysis     1 Data Scientist at Square
## 10       10     Distributed Systems     1 Data Scientist at Square
## 11       17                     SQL     2 Data Scientist at Square
## 12       18                  python     1 Data Scientist at Square
## 13       19 Amazon Web Services AWS     1 Data Scientist at Square
## 14       20                 MongoDB     1 Data Scientist at Square
## 15       29           Deep Learning     2 Data Scientist at Square

tail(sql_tbl, n=15)

##      skill_id             skills count          title
## 3866     4801             Matlab    17 Data Scientist
## 3867     4802              LaTeX     9 Data Scientist
## 3868     4803             Python    10 Data Scientist
## 3869     4804      Data Analysis    12 Data Scientist
## 3870     4805   Machine Learning     8 Data Scientist
## 3871     4806           Research     6 Data Scientist
## 3872     4807 Financial Modeling     4 Data Scientist
## 3873     4808        Data Mining    11 Data Scientist
## 3874     4809         Statistics     8 Data Scientist
## 3875     4810                  R    12 Data Scientist
## 3876     4811              Linux     3 Data Scientist
## 3877     4812    Microsoft Excel     8 Data Scientist
## 3878     4813                SQL     5 Data Scientist
## 3879     4814   Microsoft Office     6 Data Scientist
## 3880     4815                C++     1 Data Scientist

Migrate Data to MongoDB

# Database Connection and Creation
mydb_mongo <- mongo(collection = "sql_tbl", db = "DataScienceSkills", url = "mongodb://localhost", verbose = FALSE, options = ssl_options())
mydb_mongo$insert(sql_tbl)

## List of 5
##  $ nInserted  : num 3880
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

# Check if we have inserted the “slq-tbl” data.
mydb_mongo$count()

## [1] 3880

Formatting the Data

# Display 3 records.
mydb_mongo$iterate()$batch(3)

## [[1]]
## [[1]]$skill_id
## [1] 1
## 
## [[1]]$skills
## [1] "Python"
## 
## [[1]]$count
## [1] 11
## 
## [[1]]$title
## [1] "Data Scientist at Square"
## 
## 
## [[2]]
## [[2]]$skill_id
## [1] 2
## 
## [[2]]$skills
## [1] "R"
## 
## [[2]]$count
## [1] 9
## 
## [[2]]$title
## [1] "Data Scientist at Square"
## 
## 
## [[3]]
## [[3]]$skill_id
## [1] 3
## 
## [[3]]$skills
## [1] "C++"
## 
## [[3]]$count
## [1] 7
## 
## [[3]]$title
## [1] "Data Scientist at Square"

# Count how many distinct skills.
length(mydb_mongo$distinct("skills"))

## [1] 929

# Aggregate the skills and count
df <- mydb_mongo$aggregate('[{"$group":{"_id":"$skills", "count": {"$sum":"$count"}}}]', options = '{"allowDiskUse":true}')

# Rename the column names.
names(df) <- c("skills", "count")

# Select only the skills with count greater than 200.
df <- filter(df, count > 200)

# Sort the count column.
df <- arrange(df, desc(count))

# Display the result.
head(df)

##             skills count
## 1    Data Analysis  2196
## 2                R  1864
## 3           Python  1539
## 4              SQL  1237
## 5       Statistics  1036
## 6 Machine Learning   979

Visualizing the Data

# Plot the result.
ggplot(df, aes(x=reorder(skills, count, fun=max), y=count))+
  geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
             theme(axis.text.x = element_text(angle = 65, vjust=0.6), axis.title.x = element_blank())

SQL vs NoSQL

Advantages

SQL Databses: The primary advantages are

It uses a single uniform language (DDL) for different roles (developer, user, DBA).
Unlike the NoSQL, SQL doesn’t have the issue of standardization. It follows the ISI and ANSI standards, which are approved across the globe.
It uses an advanced and non-structural querying language.
It sticks to ACID principles (atomicity, consistency, isolation, durability), thus guaranteeing stability, security, and predictability both of the entire database and every transaction in particular.

NoSQL Databases: Some of the biggest advantages are

They are non-relational (table-less), hence, very different from SQL databases. This means they are easier to manage and they provide a higher level of flexibility with newer data models.
They are mostly open source and low-cost. This makes them an appealing solution for smaller organizations with limited budgets. The top NoSQL databases on the market today such as MongoDB, MarkLogic, etc., allow for rapid processing of real-time Big Data applications in ways that are affordable.
They are easier scalability through support for map reduce. The elastic scalability of NoSQL often used as a major selling point.
The non-relational nature of a NoSQL database allows database architects to quickly create a database without needing to develop a detailed database model. This saves a lot of development time.

Disadvantages

SQL Databases: The primary drawbacks are

It is very difficult to scale as much as a database grows larger.
It was designed to handle the number of transactions modern databases have to deal with every second. Giants like Amazon or Alibaba deal with staggering amounts of data that will choke an SQL in a matter of minutes.

NoSQL Databases: Some of the biggest disadvantages

Their community is not as well defined. The NoSQL community is relatively new and lacks the maturity of the MySQL user base. Even though NoSQL is growing rapidly, but for now MySQL is hard to beat for its vast network of highly experienced end users.
A major problem with NoSQL databases is the lack of reporting tools for analysis and performance testing. However, with MySQL, you can find a wide array of reporting tools to help you prove your application’s validity.
NoSQL databases need a standard query language like SQL. This is a major issue highlighted by researchers at Microsoft, who claim that NoSQL’s lack of standardization can cause a problem during migration.

Assignment 12

Sie Siong Wong

11/12/2019

Load the R Packages

Load the Data from MySQL

Database Connection

View Database Tables

View Tables Data

Migrate Data to MongoDB

Formatting the Data

Visualizing the Data

SQL vs NoSQL

Advantages

Disadvantages