Assignment overview

For this assignment, we should take information from a relational database and migrate it to a NoSQL database of our own choosing, preferably in R .

SQL Database

For the relational database, I chose the Employees sample database was developed by Patrick Crews and Giuseppe Maxia and provides a combination of a large base of data spread over six separate tables and consisting of 4 million records in total. The structure is compatible with a wide range of storage engine types. Infomation about the Employees sample database is available on https://dev.mysql.com/doc/employee/en/employees-introduction.html

The following diagram provides an overview of the structure of the Employees sample database.

NoSQL database

Though I planned using Neo4j in this assignment, I ended up using MongoDB because of technical issues in running Neo4j with R in my laptop.

Load packages

library(RMySQL)

## Loading required package: DBI

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(mongolite)
library(rjson)

Establish SQL Connection

con <- dbConnect(MySQL(), user='test_user', password='data607', dbname='employees', host='localhost')

Basic use of SQL

dbListTables(con)

## [1] "current_dept_emp"     "departments"          "dept_emp"            
## [4] "dept_emp_latest_date" "dept_manager"         "employees"           
## [7] "salaries"             "titles"

dbListFields(con, "titles")

## [1] "emp_no"    "title"     "from_date" "to_date"

sql <- "SELECT * FROM titles LIMIT 5;"
dbGetQuery(con, sql)

##   emp_no           title  from_date    to_date
## 1  10001 Senior Engineer 1986-06-26 9999-01-01
## 2  10002           Staff 1996-08-03 9999-01-01
## 3  10003 Senior Engineer 1995-12-03 9999-01-01
## 4  10004        Engineer 1986-12-01 1995-12-01
## 5  10004 Senior Engineer 1995-12-01 9999-01-01

SQL Queries

Read some of the information from all tables in the database into a data frame in R. I combined different tables into a single data frame.

employee_list <- dbReadTable(con, 'employees')
str(employee_list)

## 'data.frame':    300024 obs. of  6 variables:
##  $ emp_no    : int  10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
##  $ birth_date: chr  "1953-09-02" "1964-06-02" "1959-12-03" "1954-05-01" ...
##  $ first_name: chr  "Georgi" "Bezalel" "Parto" "Chirstian" ...
##  $ last_name : chr  "Facello" "Simmel" "Bamford" "Koblick" ...
##  $ gender    : chr  "M" "F" "M" "M" ...
##  $ hire_date : chr  "1986-06-26" "1985-11-21" "1986-08-28" "1986-12-01" ...

sql <- "SELECT   e.emp_no AS empID, concat(e.first_name,' ', e.last_name) AS employee, e.gender, t.title, d.dept_name
        FROM employees e JOIN titles t ON e.emp_no = t.emp_no
        JOIN dept_emp de ON de.emp_no = e.emp_no
        JOIN departments d ON d.dept_no = de.dept_no
        ORDER BY e.last_name, e.first_name;"
df1 <- dbGetQuery(con, sql)
str(df1)

## 'data.frame':    489903 obs. of  5 variables:
##  $ empID    : int  258641 258005 258005 455773 455773 436560 266651 266651 487598 487598 ...
##  $ employee : chr  "Abdelkader Aamodt" "Adhemar Aamodt" "Adhemar Aamodt" "Aemilian Aamodt" ...
##  $ gender   : chr  "M" "F" "F" "M" ...
##  $ title    : chr  "Senior Engineer" "Engineer" "Assistant Engineer" "Senior Staff" ...
##  $ dept_name: chr  "Development" "Development" "Development" "Sales" ...

head(df1)

##    empID          employee gender              title   dept_name
## 1 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 2 258005    Adhemar Aamodt      F           Engineer Development
## 3 258005    Adhemar Aamodt      F Assistant Engineer Development
## 4 455773   Aemilian Aamodt      M       Senior Staff       Sales
## 5 455773   Aemilian Aamodt      M              Staff       Sales
## 6 436560      Alagu Aamodt      F           Engineer Development

sql <- "SELECT   d.dept_name, concat(e.first_name,' ', e.last_name) AS manager 
        FROM employees e JOIN dept_manager dm ON dm.emp_no = e.emp_no
        JOIN departments d ON d.dept_no = dm.dept_no
        ORDER BY e.last_name, e.first_name;"
df2 <- dbGetQuery(con, sql)
str(df2)

## 'data.frame':    24 obs. of  2 variables:
##  $ dept_name: chr  "Finance" "Customer Service" "Production" "Development" ...
##  $ manager  : chr  "Ebru Alpin" "Tonny Butterworth" "Rosine Cools" "Leon DasSarma" ...

head(df2)

##          dept_name           manager
## 1          Finance        Ebru Alpin
## 2 Customer Service Tonny Butterworth
## 3       Production      Rosine Cools
## 4      Development     Leon DasSarma
## 5       Production    Oscar Ghazalie
## 6 Customer Service  Marjo Giarratana

dbDisconnect(con)

## [1] TRUE

The following code adds the managers’ names from df2 to df1.

 df <- df1 %>% inner_join(df2, by = c("dept_name" = "dept_name"))
head(df)

##    empID          employee gender              title   dept_name
## 1 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 2 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 3 258005    Adhemar Aamodt      F           Engineer Development
## 4 258005    Adhemar Aamodt      F           Engineer Development
## 5 258005    Adhemar Aamodt      F Assistant Engineer Development
## 6 258005    Adhemar Aamodt      F Assistant Engineer Development
##             manager
## 1     Leon DasSarma
## 2 DeForest Hagimont
## 3     Leon DasSarma
## 4 DeForest Hagimont
## 5     Leon DasSarma
## 6 DeForest Hagimont

dim(df)

## [1] 1323864       6

Import data into MongoDB

This part of the work follows the instructions posted on https://www.r-bloggers.com/using-mongodb-with-r/

# create connection, database and collection
mongo_db = mongo(collection = "df", db = "employees")
str(mongo_db)

## Classes 'mongo', 'jeroen', 'environment' <environment: 0x7fae4c6a9f68>

# Insert the data
mongo_db$insert(df)

## List of 5
##  $ nInserted  : num 1323864
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

# check if we have inserted df
mongo_db$iterate()$one()

## $empID
## [1] 258641
## 
## $employee
## [1] "Abdelkader Aamodt"
## 
## $gender
## [1] "M"
## 
## $title
## [1] "Senior Engineer"
## 
## $dept_name
## [1] "Development"
## 
## $manager
## [1] "Leon DasSarma"

# check the number of employees inserted
length(mongo_db$distinct("empID"))

## [1] 300024

length(employee_list$emp_no)

## [1] 300024

Advantages and disadvantages

NoSQL is capable of storing, processing, and managing huge amounts of data.
NoSQL focus on better response time, scale while maintaining availability and performance.
SQL can be an easier syntax to work with rather than JSON for NoSQL
The NoSQL database demanded a lot more technical skills from me with the installation and use than SQL did when I first used in the beginner of the course.

Reference

Fisseha Berhane. “Using MongoDB with R”. R-bloggers. March 2, 2017. https://www.r-bloggers.com/using-mongodb-with-r/

NoSQL Migration

B. Sosnovski

11/17/2018