Assignment Objective

The object of this assignment is to develop understanding of data storage and retrieval mechanism for a NoSQL database. For this conversion, I have chosen the SQL database used for our Project 3 (Most Valuable Data Science Skills). Database structure is illustrated below. For the NoSQL database, I have chosen Neo4J.

Required Libraries

# SQL Database Access
library(RMySQL)

#Neo4J Database Access
library(RNeo4j)

Retrieve SQL Data

Establish Connection

# Connect to a Google Cloud SQL database used for project 3
con <- dbConnect(MySQL(), 
                 user='root', 
                 password='data607pw', 
                 host='35.185.104.222', 
                 dbname='datascienceskills')

Retrieve Data

# Store all tables within data frames
location <- dbReadTable(con, "Location")
category <- dbReadTable(con, "Categories")
skill_data <- dbReadTable(con, "skillsdata")
skill_name <- dbReadTable(con, "Skills")

Close Connection

dbDisconnect(con)

Store Data in Neo4J

Establish Database Connection

Connect to the default, local Neo4J database.

# Connect to the database
graph <- startGraph("http://localhost:7474/db/data/", 
                    username="neo4j", 
                    password="data607pw")

# Clear any existing data
clear(graph, input = FALSE)

Store Available Data: Skill Categories

Each skill is assigned to one of four categories - Programming Languages, Analysis Software, Big Data Tools and Databases. Store category names as nodes.

# Add constraint for ID migrated from SQL database
addConstraint(graph, "SkillCategory", "id")

# Neo4J query
query <- "CREATE (s:SkillCategory { id: {category_id}, name: {category_name}})"

# Start new transaction
trans <- newTransaction(graph)

# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(category)) {
  appendCypher(trans, 
               query, 
               category_id = category[i, ]$CategoryID,
               category_name = category[i, ]$Description) 
}

# Commit all changes
commit(trans)

Store Available Data: Location

Skills data was collected for multiple locations. Store each location as a node. Node name is City/State combination, but country is also preserved as node property.

# Add constraint for ID migrated from SQL database
addConstraint(graph, "Location", "id")

# Neo4J query
query <- "CREATE (l:Location { id: {loc_id}, description: {loc_desc}, country: {loc_country}})"

# Start new transaction
trans <- newTransaction(graph)

# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(location)) {
  appendCypher(trans, 
               query, 
               loc_id = location[i, ]$LocationID,
               loc_desc = location[i, ]$Description,
               loc_country = location[i, ]$Country) 
}

# Commit all changes
commit(trans)

Store Available Data: Skills

Add skills as nodes and connect them to corresponding categories.

# Add constraint for ID migrated from SQL database
addConstraint(graph, "Skill", "id")

# Neo4J query
query <- "MATCH (c:SkillCategory { id: {cat_id}})
          MERGE (s:Skill { id: {skill_id}, name: {skill_name}})
          CREATE (s)-[:CATEGORY]->(c)"

# Start new transaction
trans <- newTransaction(graph)

# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(skill_name)) {
  appendCypher(trans, 
               query, 
               cat_id = skill_name[i, ]$SkillCategory,
               skill_id = skill_name[i, ]$SkillID,
               skill_name = skill_name[i, ]$SkillDescription) 
}

# Commit all changes
commit(trans)

Store Available Data: Skills Data

Skills data is added as relationship between skills and corresponding locations. Collected data is available for 2016 and 2017 and each year is added as a separate relationship.

# Neo4J query
query <- "MATCH (s:Skill { id: {skill_id}})
          MATCH (l:Location { id: {loc_id}})
          CREATE (s)-[:DATA_FOR { name: {data_year}, amount: {data_amount} }]->(l)"

# Start new transaction
trans <- newTransaction(graph)

# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(skill_data)) {
  appendCypher(trans, 
               query, 
               skill_id = skill_data[i, ]$Skill,
               loc_id = skill_data[i, ]$Place,
               data_year = skill_data[i, ]$YearCollected,
               data_amount = skill_data[i, ]$Amount) 
}

# Commit all changes
commit(trans)

Retrieve Results

Full graph is too busy to show the nodes and relationships. Below example pulls any skills mentioned in over 20 job listings for New York.

results <- cypher(graph, "MATCH (l:Location {id: 8})<-[d:DATA_FOR]-(s:Skill) MATCH (c:SkillCategory)<--(s) 
                          WHERE d.amount > 20 
                          RETURN s.name, d.amount, c.name")
colnames(results) <- c("Skill", "Count", "Category")
Skill Count Category
matlab 107 program_languages
javascript 97 program_languages
java 637 program_languages
python 982 program_languages
powerpoint 69 analysis_software
r 1121 program_languages
perl 324 program_languages
pandas 25 analysis_software
pig 49 bigdata_tool
php 108 program_languages
spark 43 bigdata_tool
sql 49 databases
python 72 program_languages
r 67 program_languages
sas 42 analysis_software
java 49 program_languages
hive 28 bigdata_tool
matlab 22 program_languages
hadoop 40 bigdata_tool
sql 841 databases
spss 85 analysis_software
tableau 53 analysis_software
sas 262 analysis_software
ruby 63 program_languages
scala 112 program_languages
spark 138 bigdata_tool
spotfire 225 analysis_software
excel 980 analysis_software
d3 52 analysis_software
c++ 681 program_languages
hive 158 bigdata_tool
hbase 28 databases
hadoop 520 bigdata_tool

Conclusion

Overall the conversion process was smooth. With some additional experience, especially in Neo4J query language, it should become a fairly routine exercise. The more difficult part is to identify proper database storage method (SQL vs. NoSQL) and proper database structure. I feel that for this particular data set, the tabular structure of a SQL database is a better and easier solution.

Additionally, I have noticed that graph database structure is a lot more flexible and not necessarily intuitive (perhaps, this is just due to my lack of experience). I have set up skills, categories and locations as nodes with collected data as relationships. Perhaps, I could have broken down some data further. My first thought is to separate year the data was collected for into its own node. Then the number of listings a skill was mentioned in for a given year would describe the strength of the relationship between skill and year.