NoSQL Migration

Introduction

The data set I will use for the NoSQL migration will come from my group’s Project 3. We will migrate the data to a Neo4j graph database.

The data for project 3 was stored through RSQLite that did not require any external files. In order to migrate the data to Neo4j, I had to create csv files. For more information of how the data was acquired for project 3, please refer to my group’s project.

Creating csv files from our Project 3 SQL database

To create these csv files, I simply wrote the below code after running all the code blocks from our project 3 database. The below code is for display only so that we don’t inundate this markdown file with code from an old assignment.

write.csv(industry_df, file = ‘industry.csv’, row.names=FALSE)

write.csv(skills_df, file = ‘skills.csv’, row.names=FALSE)

write.csv(title_df, file = ‘title.csv’, row.names=FALSE)

write.csv(company_df, file = ‘company.csv’, row.names=FALSE)

write.csv(jobs_df, file = ‘jobs.csv’, row.names=FALSE)

Modeling the data

For this project, it is useful to get a general sense of how the data is organized. The below shows how the data is organized in the SQLite database.

SQLite Database Mapping

Now let’s take a look at the Neo4j graph modeling we will do

Neo4j Model

Using these models, we can get a general sense of how to structure our relationships in Neo4j.

Loading the csv files into Neo4j

The following steps will be taken for the data migration to a Neo4j graph database

Create the nodes for jobs
Create the nodes for job titles
Create the nodes for each company
Create the nodes for each industry
Create the nodes for each skill
Create the relationships between each node

Below is the code from the cypher file that was created to execute the above steps. Note that the cypher code below is there only to show the code. To execute the code, use Neo4j desktop.

Jobs.cypher

// Create job nodes for each job LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/jobs.csv” AS row CREATE (job:Job {JobId: row.job_id, IndustryId: row.industry_id, SkillsId: row.skills_id, TitleId: row.title_id, CompanyId: row.company_id});

// Create title nodes LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/title.csv” AS row CREATE (title:Title {TitleId: row.title_id}) SET title.TitleName = row.title;

// Create Company nodes LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/company.csv” AS row CREATE (company:Company {CompanyId: row.company_id}) SET company.CompanyName = row.company;

// Create Industry nodes LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/industry.csv” AS row CREATE (industry:Industry {IndustryId: row.industry_id}) SET industry.IndustryName = row.industry;

// Create Skill nodes LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/skills.csv” AS row CREATE (skill:Skill {SkillsId: row.skills_id}) SET skill.SkillName = row.skills;

// Create the relationships between all the nodes LOAD CSV WITH HEADERS FROM “https://chesterpoon8.github.io/jobs.csv” AS row MATCH (job:Job {JobId: row.job_id}) MATCH (skill:Skill {SkillsId: row.skills_id}) MATCH (industry:Industry {IndustryId: row.industry_id}) MATCH (company:Company {CompanyId: row.company_id}) MATCH (title:Title {TitleId: row.title_id})

CREATE (skill)-[:FOR_JOB]->(job) CREATE (job)-[:IN_INDUSTRY]->(industry) CREATE (job)-[:IN_COMPANY]->(company) CREATE (title)-[:HAS_TITLE]->(job)

//Show skills -> job relationship and return max of 25

MATCH p=()-[r:FOR_JOB]->() RETURN p LIMIT 25

FOR_JOB relationship graph

Pros & Cons: Relational Database vs. NoSQL database

The pros and cons for both are my less than informed opinion and completely subjective. Neo4j and NoSQL in general are very new concepts for me whereas SQL is something I’m very familiar with. It’s clear that I have a very strong bias. The below pros and cons are strictly of my limited experience with Neo4j specifically and of SQL.

Relational database Pros

Easier to work with very large datasets.
Can format data that is easier to work with in other programs and programming languages.
Better integration with R and Python.

Relational database Cons

Possible to create relationships that shouldn’t exist
Spatial visualization of the data requires a relatively significant data transformation process to be read into other programs/programming languages

Neo4j Pros

Aesthetically pleasing view of relationships in the data
Spatial visualization of the data can be useful in some circumstances

Neo4j Cons

Larger and more complex relationships in the data is difficult to visualize and does not help when viewing it in graph form
Unless I am unaware of better and more efficient ways to work with Neo4j, it appears that it is slow to query larger datasets