Assignment overview
For this assignment, we should take information from a relational database and migrate it to a NoSQL database of our own choosing, preferably in R .
SQL Database
For the relational database, I chose the Employees sample database was developed by Patrick Crews and Giuseppe Maxia and provides a combination of a large base of data spread over six separate tables and consisting of 4 million records in total. The structure is compatible with a wide range of storage engine types. Infomation about the Employees sample database is available on https://dev.mysql.com/doc/employee/en/employees-introduction.html
The following diagram provides an overview of the structure of the Employees sample database.
NoSQL database
Though I planned using Neo4j in this assignment, I ended up using MongoDB because of technical issues in running Neo4j with R in my laptop.
Load packages
library(RMySQL)## Loading required package: DBI
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(mongolite)
library(rjson)Establish SQL Connection
con <- dbConnect(MySQL(), user='test_user', password='data607', dbname='employees', host='localhost')Basic use of SQL
dbListTables(con)## [1] "current_dept_emp" "departments" "dept_emp"
## [4] "dept_emp_latest_date" "dept_manager" "employees"
## [7] "salaries" "titles"
dbListFields(con, "titles")## [1] "emp_no" "title" "from_date" "to_date"
sql <- "SELECT * FROM titles LIMIT 5;"
dbGetQuery(con, sql)## emp_no title from_date to_date
## 1 10001 Senior Engineer 1986-06-26 9999-01-01
## 2 10002 Staff 1996-08-03 9999-01-01
## 3 10003 Senior Engineer 1995-12-03 9999-01-01
## 4 10004 Engineer 1986-12-01 1995-12-01
## 5 10004 Senior Engineer 1995-12-01 9999-01-01
SQL Queries
Read some of the information from all tables in the database into a data frame in R. I combined different tables into a single data frame.
employee_list <- dbReadTable(con, 'employees')
str(employee_list)## 'data.frame': 300024 obs. of 6 variables:
## $ emp_no : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ birth_date: chr "1953-09-02" "1964-06-02" "1959-12-03" "1954-05-01" ...
## $ first_name: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
## $ last_name : chr "Facello" "Simmel" "Bamford" "Koblick" ...
## $ gender : chr "M" "F" "M" "M" ...
## $ hire_date : chr "1986-06-26" "1985-11-21" "1986-08-28" "1986-12-01" ...
sql <- "SELECT e.emp_no AS empID, concat(e.first_name,' ', e.last_name) AS employee, e.gender, t.title, d.dept_name
FROM employees e JOIN titles t ON e.emp_no = t.emp_no
JOIN dept_emp de ON de.emp_no = e.emp_no
JOIN departments d ON d.dept_no = de.dept_no
ORDER BY e.last_name, e.first_name;"
df1 <- dbGetQuery(con, sql)
str(df1)## 'data.frame': 489903 obs. of 5 variables:
## $ empID : int 258641 258005 258005 455773 455773 436560 266651 266651 487598 487598 ...
## $ employee : chr "Abdelkader Aamodt" "Adhemar Aamodt" "Adhemar Aamodt" "Aemilian Aamodt" ...
## $ gender : chr "M" "F" "F" "M" ...
## $ title : chr "Senior Engineer" "Engineer" "Assistant Engineer" "Senior Staff" ...
## $ dept_name: chr "Development" "Development" "Development" "Sales" ...
head(df1)## empID employee gender title dept_name
## 1 258641 Abdelkader Aamodt M Senior Engineer Development
## 2 258005 Adhemar Aamodt F Engineer Development
## 3 258005 Adhemar Aamodt F Assistant Engineer Development
## 4 455773 Aemilian Aamodt M Senior Staff Sales
## 5 455773 Aemilian Aamodt M Staff Sales
## 6 436560 Alagu Aamodt F Engineer Development
sql <- "SELECT d.dept_name, concat(e.first_name,' ', e.last_name) AS manager
FROM employees e JOIN dept_manager dm ON dm.emp_no = e.emp_no
JOIN departments d ON d.dept_no = dm.dept_no
ORDER BY e.last_name, e.first_name;"
df2 <- dbGetQuery(con, sql)
str(df2)## 'data.frame': 24 obs. of 2 variables:
## $ dept_name: chr "Finance" "Customer Service" "Production" "Development" ...
## $ manager : chr "Ebru Alpin" "Tonny Butterworth" "Rosine Cools" "Leon DasSarma" ...
head(df2)## dept_name manager
## 1 Finance Ebru Alpin
## 2 Customer Service Tonny Butterworth
## 3 Production Rosine Cools
## 4 Development Leon DasSarma
## 5 Production Oscar Ghazalie
## 6 Customer Service Marjo Giarratana
dbDisconnect(con)## [1] TRUE
The following code adds the managers’ names from df2 to df1.
df <- df1 %>% inner_join(df2, by = c("dept_name" = "dept_name"))
head(df)## empID employee gender title dept_name
## 1 258641 Abdelkader Aamodt M Senior Engineer Development
## 2 258641 Abdelkader Aamodt M Senior Engineer Development
## 3 258005 Adhemar Aamodt F Engineer Development
## 4 258005 Adhemar Aamodt F Engineer Development
## 5 258005 Adhemar Aamodt F Assistant Engineer Development
## 6 258005 Adhemar Aamodt F Assistant Engineer Development
## manager
## 1 Leon DasSarma
## 2 DeForest Hagimont
## 3 Leon DasSarma
## 4 DeForest Hagimont
## 5 Leon DasSarma
## 6 DeForest Hagimont
dim(df)## [1] 1323864 6
Import data into MongoDB
This part of the work follows the instructions posted on https://www.r-bloggers.com/using-mongodb-with-r/
# create connection, database and collection
mongo_db = mongo(collection = "df", db = "employees")
str(mongo_db)## Classes 'mongo', 'jeroen', 'environment' <environment: 0x7fae4c6a9f68>
# Insert the data
mongo_db$insert(df)## List of 5
## $ nInserted : num 1323864
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
# check if we have inserted df
mongo_db$iterate()$one()## $empID
## [1] 258641
##
## $employee
## [1] "Abdelkader Aamodt"
##
## $gender
## [1] "M"
##
## $title
## [1] "Senior Engineer"
##
## $dept_name
## [1] "Development"
##
## $manager
## [1] "Leon DasSarma"
# check the number of employees inserted
length(mongo_db$distinct("empID"))## [1] 300024
length(employee_list$emp_no)## [1] 300024
Advantages and disadvantages
- NoSQL is capable of storing, processing, and managing huge amounts of data.
- NoSQL focus on better response time, scale while maintaining availability and performance.
- SQL can be an easier syntax to work with rather than JSON for NoSQL
- The NoSQL database demanded a lot more technical skills from me with the installation and use than SQL did when I first used in the beginner of the course.
Reference
- Fisseha Berhane. “Using MongoDB with R”. R-bloggers. March 2, 2017. https://www.r-bloggers.com/using-mongodb-with-r/