The object of this assignment is to develop understanding of data storage and retrieval mechanism for a NoSQL database. For this conversion, I have chosen the SQL database used for our Project 3 (Most Valuable Data Science Skills). Database structure is illustrated below. For the NoSQL database, I have chosen Neo4J.
# SQL Database Access
library(RMySQL)
#Neo4J Database Access
library(RNeo4j)
# Connect to a Google Cloud SQL database used for project 3
con <- dbConnect(MySQL(),
user='root',
password='data607pw',
host='35.185.104.222',
dbname='datascienceskills')
# Store all tables within data frames
location <- dbReadTable(con, "Location")
category <- dbReadTable(con, "Categories")
skill_data <- dbReadTable(con, "skillsdata")
skill_name <- dbReadTable(con, "Skills")
dbDisconnect(con)
Connect to the default, local Neo4J database.
# Connect to the database
graph <- startGraph("http://localhost:7474/db/data/",
username="neo4j",
password="data607pw")
# Clear any existing data
clear(graph, input = FALSE)
Each skill is assigned to one of four categories - Programming Languages, Analysis Software, Big Data Tools and Databases. Store category names as nodes.
# Add constraint for ID migrated from SQL database
addConstraint(graph, "SkillCategory", "id")
# Neo4J query
query <- "CREATE (s:SkillCategory { id: {category_id}, name: {category_name}})"
# Start new transaction
trans <- newTransaction(graph)
# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(category)) {
appendCypher(trans,
query,
category_id = category[i, ]$CategoryID,
category_name = category[i, ]$Description)
}
# Commit all changes
commit(trans)
Skills data was collected for multiple locations. Store each location as a node. Node name is City/State combination, but country is also preserved as node property.
# Add constraint for ID migrated from SQL database
addConstraint(graph, "Location", "id")
# Neo4J query
query <- "CREATE (l:Location { id: {loc_id}, description: {loc_desc}, country: {loc_country}})"
# Start new transaction
trans <- newTransaction(graph)
# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(location)) {
appendCypher(trans,
query,
loc_id = location[i, ]$LocationID,
loc_desc = location[i, ]$Description,
loc_country = location[i, ]$Country)
}
# Commit all changes
commit(trans)
Add skills as nodes and connect them to corresponding categories.
# Add constraint for ID migrated from SQL database
addConstraint(graph, "Skill", "id")
# Neo4J query
query <- "MATCH (c:SkillCategory { id: {cat_id}})
MERGE (s:Skill { id: {skill_id}, name: {skill_name}})
CREATE (s)-[:CATEGORY]->(c)"
# Start new transaction
trans <- newTransaction(graph)
# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(skill_name)) {
appendCypher(trans,
query,
cat_id = skill_name[i, ]$SkillCategory,
skill_id = skill_name[i, ]$SkillID,
skill_name = skill_name[i, ]$SkillDescription)
}
# Commit all changes
commit(trans)
Skills data is added as relationship between skills and corresponding locations. Collected data is available for 2016 and 2017 and each year is added as a separate relationship.
# Neo4J query
query <- "MATCH (s:Skill { id: {skill_id}})
MATCH (l:Location { id: {loc_id}})
CREATE (s)-[:DATA_FOR { name: {data_year}, amount: {data_amount} }]->(l)"
# Start new transaction
trans <- newTransaction(graph)
# Loop through the data frame and add corresponding query to transaction
for (i in 1:nrow(skill_data)) {
appendCypher(trans,
query,
skill_id = skill_data[i, ]$Skill,
loc_id = skill_data[i, ]$Place,
data_year = skill_data[i, ]$YearCollected,
data_amount = skill_data[i, ]$Amount)
}
# Commit all changes
commit(trans)
Full graph is too busy to show the nodes and relationships. Below example pulls any skills mentioned in over 20 job listings for New York.
results <- cypher(graph, "MATCH (l:Location {id: 8})<-[d:DATA_FOR]-(s:Skill) MATCH (c:SkillCategory)<--(s)
WHERE d.amount > 20
RETURN s.name, d.amount, c.name")
colnames(results) <- c("Skill", "Count", "Category")
| Skill | Count | Category |
|---|---|---|
| matlab | 107 | program_languages |
| javascript | 97 | program_languages |
| java | 637 | program_languages |
| python | 982 | program_languages |
| powerpoint | 69 | analysis_software |
| r | 1121 | program_languages |
| perl | 324 | program_languages |
| pandas | 25 | analysis_software |
| pig | 49 | bigdata_tool |
| php | 108 | program_languages |
| spark | 43 | bigdata_tool |
| sql | 49 | databases |
| python | 72 | program_languages |
| r | 67 | program_languages |
| sas | 42 | analysis_software |
| java | 49 | program_languages |
| hive | 28 | bigdata_tool |
| matlab | 22 | program_languages |
| hadoop | 40 | bigdata_tool |
| sql | 841 | databases |
| spss | 85 | analysis_software |
| tableau | 53 | analysis_software |
| sas | 262 | analysis_software |
| ruby | 63 | program_languages |
| scala | 112 | program_languages |
| spark | 138 | bigdata_tool |
| spotfire | 225 | analysis_software |
| excel | 980 | analysis_software |
| d3 | 52 | analysis_software |
| c++ | 681 | program_languages |
| hive | 158 | bigdata_tool |
| hbase | 28 | databases |
| hadoop | 520 | bigdata_tool |
Overall the conversion process was smooth. With some additional experience, especially in Neo4J query language, it should become a fairly routine exercise. The more difficult part is to identify proper database storage method (SQL vs. NoSQL) and proper database structure. I feel that for this particular data set, the tabular structure of a SQL database is a better and easier solution.
Additionally, I have noticed that graph database structure is a lot more flexible and not necessarily intuitive (perhaps, this is just due to my lack of experience). I have set up skills, categories and locations as nodes with collected data as relationships. Perhaps, I could have broken down some data further. My first thought is to separate year the data was collected for into its own node. Then the number of listings a skill was mentioned in for a given year would describe the strength of the relationship between skill and year.