Assignment overview

For this assignment, we should take information from a relational database and migrate it to a NoSQL database of our own choosing, preferably in R .

SQL Database

For the relational database, I chose the Employees sample database was developed by Patrick Crews and Giuseppe Maxia and provides a combination of a large base of data spread over six separate tables and consisting of 4 million records in total. The structure is compatible with a wide range of storage engine types. Infomation about the Employees sample database is available on https://dev.mysql.com/doc/employee/en/employees-introduction.html

The following diagram provides an overview of the structure of the Employees sample database.

NoSQL database

Though I planned using Neo4j in this assignment, I ended up using MongoDB because of technical issues in running Neo4j with R in my laptop.

Load packages

library(RMySQL)
## Loading required package: DBI
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(mongolite)
library(rjson)

Establish SQL Connection

con <- dbConnect(MySQL(), user='test_user', password='data607', dbname='employees', host='localhost')

Basic use of SQL

dbListTables(con)
## [1] "current_dept_emp"     "departments"          "dept_emp"            
## [4] "dept_emp_latest_date" "dept_manager"         "employees"           
## [7] "salaries"             "titles"
dbListFields(con, "titles")
## [1] "emp_no"    "title"     "from_date" "to_date"
sql <- "SELECT * FROM titles LIMIT 5;"
dbGetQuery(con, sql)
##   emp_no           title  from_date    to_date
## 1  10001 Senior Engineer 1986-06-26 9999-01-01
## 2  10002           Staff 1996-08-03 9999-01-01
## 3  10003 Senior Engineer 1995-12-03 9999-01-01
## 4  10004        Engineer 1986-12-01 1995-12-01
## 5  10004 Senior Engineer 1995-12-01 9999-01-01

SQL Queries

Read some of the information from all tables in the database into a data frame in R. I combined different tables into a single data frame.

employee_list <- dbReadTable(con, 'employees')
str(employee_list)
## 'data.frame':    300024 obs. of  6 variables:
##  $ emp_no    : int  10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
##  $ birth_date: chr  "1953-09-02" "1964-06-02" "1959-12-03" "1954-05-01" ...
##  $ first_name: chr  "Georgi" "Bezalel" "Parto" "Chirstian" ...
##  $ last_name : chr  "Facello" "Simmel" "Bamford" "Koblick" ...
##  $ gender    : chr  "M" "F" "M" "M" ...
##  $ hire_date : chr  "1986-06-26" "1985-11-21" "1986-08-28" "1986-12-01" ...
sql <- "SELECT   e.emp_no AS empID, concat(e.first_name,' ', e.last_name) AS employee, e.gender, t.title, d.dept_name
        FROM employees e JOIN titles t ON e.emp_no = t.emp_no
        JOIN dept_emp de ON de.emp_no = e.emp_no
        JOIN departments d ON d.dept_no = de.dept_no
        ORDER BY e.last_name, e.first_name;"
df1 <- dbGetQuery(con, sql)
str(df1)
## 'data.frame':    489903 obs. of  5 variables:
##  $ empID    : int  258641 258005 258005 455773 455773 436560 266651 266651 487598 487598 ...
##  $ employee : chr  "Abdelkader Aamodt" "Adhemar Aamodt" "Adhemar Aamodt" "Aemilian Aamodt" ...
##  $ gender   : chr  "M" "F" "F" "M" ...
##  $ title    : chr  "Senior Engineer" "Engineer" "Assistant Engineer" "Senior Staff" ...
##  $ dept_name: chr  "Development" "Development" "Development" "Sales" ...
head(df1)
##    empID          employee gender              title   dept_name
## 1 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 2 258005    Adhemar Aamodt      F           Engineer Development
## 3 258005    Adhemar Aamodt      F Assistant Engineer Development
## 4 455773   Aemilian Aamodt      M       Senior Staff       Sales
## 5 455773   Aemilian Aamodt      M              Staff       Sales
## 6 436560      Alagu Aamodt      F           Engineer Development
sql <- "SELECT   d.dept_name, concat(e.first_name,' ', e.last_name) AS manager 
        FROM employees e JOIN dept_manager dm ON dm.emp_no = e.emp_no
        JOIN departments d ON d.dept_no = dm.dept_no
        ORDER BY e.last_name, e.first_name;"
df2 <- dbGetQuery(con, sql)
str(df2)
## 'data.frame':    24 obs. of  2 variables:
##  $ dept_name: chr  "Finance" "Customer Service" "Production" "Development" ...
##  $ manager  : chr  "Ebru Alpin" "Tonny Butterworth" "Rosine Cools" "Leon DasSarma" ...
head(df2)
##          dept_name           manager
## 1          Finance        Ebru Alpin
## 2 Customer Service Tonny Butterworth
## 3       Production      Rosine Cools
## 4      Development     Leon DasSarma
## 5       Production    Oscar Ghazalie
## 6 Customer Service  Marjo Giarratana
dbDisconnect(con)
## [1] TRUE

The following code adds the managers’ names from df2 to df1.

 df <- df1 %>% inner_join(df2, by = c("dept_name" = "dept_name"))
head(df)
##    empID          employee gender              title   dept_name
## 1 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 2 258641 Abdelkader Aamodt      M    Senior Engineer Development
## 3 258005    Adhemar Aamodt      F           Engineer Development
## 4 258005    Adhemar Aamodt      F           Engineer Development
## 5 258005    Adhemar Aamodt      F Assistant Engineer Development
## 6 258005    Adhemar Aamodt      F Assistant Engineer Development
##             manager
## 1     Leon DasSarma
## 2 DeForest Hagimont
## 3     Leon DasSarma
## 4 DeForest Hagimont
## 5     Leon DasSarma
## 6 DeForest Hagimont
dim(df)
## [1] 1323864       6

Import data into MongoDB

This part of the work follows the instructions posted on https://www.r-bloggers.com/using-mongodb-with-r/

# create connection, database and collection
mongo_db = mongo(collection = "df", db = "employees")
str(mongo_db)
## Classes 'mongo', 'jeroen', 'environment' <environment: 0x7fae4c6a9f68>
# Insert the data
mongo_db$insert(df)
## List of 5
##  $ nInserted  : num 1323864
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
# check if we have inserted df
mongo_db$iterate()$one()
## $empID
## [1] 258641
## 
## $employee
## [1] "Abdelkader Aamodt"
## 
## $gender
## [1] "M"
## 
## $title
## [1] "Senior Engineer"
## 
## $dept_name
## [1] "Development"
## 
## $manager
## [1] "Leon DasSarma"
# check the number of employees inserted
length(mongo_db$distinct("empID"))
## [1] 300024
length(employee_list$emp_no)
## [1] 300024

Advantages and disadvantages

Reference