Project 4 will focus on extracting data from a relational database and migrating the information to a NoSQL database. This project will use Neo4j as the NoSQL database.
Project requirements:
For the relational database, options include using the flights database, the tb database, the “data skills” database created for DATA 607 Project 3, or another database of your own choosing or creation.
For the NoSQL database, options include using MongoDB, Neo4j, or another NoSQL database of your choosing.
The migration process needs to be reproducible. R code is encouraged, but not required.
Briefly describe the advantages and disadvantages of storing the data in a relational database vs. a NoSQL database.
The code for this project requires the following R packages: RMySQL (for the MySQL connection) and RNeo4j (for the Neo4j migration).
Additionally, to run this migration, the default Neo4j graphdb will need to be installed and running on your local machine.
The source of the relational data used in this NoSQL migration will be the skill database located on the cloud MySQL database (https://www.db4free.net/) created by the team for DATA 607 Project 3.
Table and view data will be extracted from the cloud MySQL skill database using the function below:
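## ------------------------------------------
## Using RMySQL
## ------------------------------------------
library(RMySQL)

# proj_user, proj_pwd, proj_db, and proj_host are assumed to be defined earlier in the session
get_mySQL_data <- function(object_name) {
  # establish the connection to the skill DB on db4free.net
  skilldb <- dbConnect(MySQL(), user = proj_user, password = proj_pwd, dbname = proj_db, host = proj_host)
  # close the connection when the function exits, even if the query fails
  on.exit(dbDisconnect(skilldb))
  # object_name is pasted directly into the SQL text, so pass only trusted table or view names
  skilldb.data <- dbGetQuery(skilldb, paste0("select * from ", object_name))
  return(skilldb.data)
}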
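Because the function above pastes `object_name` straight into the query text, it may be worth checking the name first. The following is a minimal sketch, assuming table and view names contain only letters, digits, and underscores (as `tbl_data` and `vw_top10_skills_overall` do). `is_safe_object_name` and `build_select_all` are hypothetical helpers written for illustration; they are not part of RMySQL.

```r
# Hypothetical helper: accept only a single simple identifier
# (letters, digits, underscore; must not start with a digit)
is_safe_object_name <- function(object_name) {
  is.character(object_name) &&
    length(object_name) == 1 &&
    !is.na(object_name) &&
    grepl("^[A-Za-z_][A-Za-z0-9_]*$", object_name)
}

# Hypothetical helper: build the query text only for validated names
build_select_all <- function(object_name) {
  if (!is_safe_object_name(object_name)) {
    stop("Invalid table or view name")
  }
  paste0("select * from ", object_name)
}

query_text <- build_select_all("tbl_data")
print(query_text)
```

The resulting string could then be passed to `dbGetQuery` in place of the inline `paste0` call.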
The table named tbl_data will be the source of the data used in this project.
skill.data <- get_mySQL_data("tbl_data")
Column Name | Description |
---|---|
skill_type_id | Unique ID identifying a skill type |
skill_set_id | Unique ID identifying a skill set |
skill_id | Unique ID identifying a skill (UK = skill_id and source_id) |
source_id | Unique ID identifying a source (UK = skill_id and source_id) |
skill_type_name | Highest level of classification of a skill. 1 of 5 values defined in the model – business, communication, math, programming, visualization |
skill_set_name | Mid-level classification of a skill. 1 of 32 values defined in the model. Examples: General Programming, Object-Oriented Programming, Relational Databases, Creative Thinking |
skill_name | Skill name - the lowest level of skill classification. 1 of 122 skills defined in the skill db model. |
source_name | Source name - Kaggle, Google, RJMetrics, Indeed |
rating | Rating value associated with the source |
rating_scalar | Skill’s normalized rating |
weighted_rating_overall | Skill’s weighted ranking overall |
weighted_rating_by_skill_type | Skill’s weighted ranking within the skill type |
weighted_rating_by_skill_set | Skill’s weighted ranking within the skill set |
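The table above describes `skill_id` plus `source_id` as a unique key. A quick way to confirm that on the extracted data frame is sketched below in base R. The sample rows are a small illustrative subset shaped like `tbl_data`, and `has_unique_key` is a hypothetical helper name, not something defined in the project.

```r
# Small illustrative sample shaped like tbl_data
key_sample <- data.frame(
  skill_id   = c(1, 1, 1, 2, 3),
  source_id  = c(1, 2, 3, 1, 2),
  skill_name = c("analysis", "analysis", "analysis",
                 "business", "business intelligence"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when no key combination appears more than once
has_unique_key <- function(df, key_cols = c("skill_id", "source_id")) {
  !any(duplicated(df[, key_cols, drop = FALSE]))
}

print(has_unique_key(key_sample))
```

Running the same check on the full `skill.data` frame before migrating would catch duplicate skill and source pairs early.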
## 'data.frame': 163 obs. of 13 variables:
## $ skill_type_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ skill_set_id : int 5 5 5 5 5 5 5 5 5 13 ...
## $ skill_id : int 1 1 1 2 3 6 7 8 15 13 ...
## $ source_id : int 1 2 3 1 2 3 4 4 4 4 ...
## $ skill_type_name : chr "business" "business" "business" "business" ...
## $ skill_set_name : chr "Business" "Business" "Business" "Business" ...
## $ skill_name : chr "analysis" "analysis" "analysis" "business" ...
## $ source_name : chr "Kaggle" "RJMETRICS" "Google" "Kaggle" ...
## $ rating : num 2 55.7 10 14 23.8 ...
## $ rating_scalar : num 4 100 20 48 12 1 21 40 30 18 ...
## $ weighted_rating_overall : num 0.245 NA 1.227 1.718 NA ...
## $ weighted_rating_by_skill_type: num 0.231 NA 2.368 1.615 NA ...
## $ weighted_rating_by_skill_set : num 0.0979 NA 0.4895 0.6853 NA ...
skill_type_id | skill_set_id | skill_id | source_id | skill_type_name | skill_set_name | skill_name | source_name | rating | rating_scalar | weighted_rating_overall | weighted_rating_by_skill_type | weighted_rating_by_skill_set |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 1 | 1 | business | Business | analysis | Kaggle | 2.00 | 4 | 0.245399 | 0.230769 | 0.0979021 |
1 | 5 | 1 | 2 | business | Business | analysis | RJMETRICS | 55.69 | 100 | NA | NA | NA |
1 | 5 | 1 | 3 | business | Business | analysis | Google | 10.00 | 20 | 1.226990 | 2.368420 | 0.4895100 |
1 | 5 | 2 | 1 | business | Business | business | Kaggle | 14.00 | 48 | 1.717790 | 1.615380 | 0.6853150 |
1 | 5 | 3 | 2 | business | Business | business intelligence | RJMETRICS | 23.78 | 12 | NA | NA | NA |
1 | 5 | 6 | 3 | business | Business | entrepreneurship | Google | 2.00 | 1 | 0.245399 | 0.473684 | 0.0979021 |
1 | 5 | 7 | 4 | business | Business | forecasting | Indeed | 49.00 | 21 | 1.503070 | 1.240510 | 0.5996500 |
1 | 5 | 8 | 4 | business | Business | global Optimization | Indeed | 94.00 | 40 | 2.883440 | 2.379750 | 1.1503500 |
1 | 5 | 15 | 4 | business | Business | real analysis | Indeed | 70.00 | 30 | 2.147240 | 1.772150 | 0.8566430 |
1 | 13 | 13 | 4 | business | Product Development | Product Development | Indeed | 41.00 | 18 | 1.257670 | 1.037970 | 0.1433570 |
1 | 13 | 14 | 4 | business | Product Development | project management | Indeed | 25.00 | 11 | 0.766871 | 0.632911 | 0.0874126 |
1 | 18 | 19 | 4 | business | Surveys and Marketing | Surveys and Marketing | Indeed | 130.00 | 56 | 3.987730 | 3.291140 | 0.6818180 |
1 | 19 | 20 | 4 | business | Systems Administration | Systems Administration | Indeed | 109.00 | 47 | 3.343560 | 2.759490 | 1.1433600 |
1 | 26 | 10 | 3 | business | Information Security | information | Google | 9.00 | 18 | 1.104290 | 2.131580 | 0.1888110 |
1 | 26 | 12 | 3 | business | Information Security | policy | Google | 3.00 | 2 | 0.368098 | 0.710526 | 0.0629371 |
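The preview above shows `NA` weighted ratings on the RJMETRICS rows. Before migrating, it can help to know how many rows carry missing values, since those values will be passed as query parameters. The sketch below counts them per source in base R; the sample frame is a small illustrative subset, and `count_missing_by_source` is a hypothetical helper name.

```r
# Small illustrative sample with the same column names as tbl_data
rating_sample <- data.frame(
  source_name = c("Kaggle", "RJMETRICS", "Google", "Kaggle", "RJMETRICS"),
  weighted_rating_overall = c(0.245399, NA, 1.226990, 1.717790, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of rows per source with a missing overall weighted rating
count_missing_by_source <- function(df) {
  tapply(is.na(df$weighted_rating_overall), df$source_name, sum)
}

missing_counts <- count_missing_by_source(rating_sample)
print(missing_counts)
```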
The following section migrates the skill data found in the Project 3 MySQL database to Neo4j. The default Neo4j graphdb will need to be running locally. Additionally, this code requires the RNeo4j package in R.
The code below creates the following nodes in Neo4j based on the skill data sourced from MySQL:
- Source
- Skill
- Skillset
- Skilltype
In the MySQL database, a hierarchy exists between a skill, skill set, and skill type:
skill -> skill set -> skill type
where skill type is the top level classification.
In the Neo4j graph database below, relationships are created from skill -> skill set and from skill -> skill type, rather than chaining skill set -> skill type. This captures the weighted rating values as properties of the relationships between the Skill and Skillset nodes and between the Skill and Skilltype nodes.
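To make the mapping from rows to graph elements concrete, the sketch below derives the distinct target nodes and the two relationship sets from a few sample rows using base R only. It is an illustration of the model described above, not the migration code itself; the data frame and variable names are hypothetical.

```r
# Three illustrative rows with the same column names as tbl_data
hierarchy_sample <- data.frame(
  skill_name      = c("analysis", "forecasting", "project management"),
  skill_set_name  = c("Business", "Business", "Product Development"),
  skill_type_name = c("business", "business", "business"),
  weighted_rating_by_skill_set  = c(0.0979021, 0.5996500, 0.0874126),
  weighted_rating_by_skill_type = c(0.230769, 1.240510, 0.632911),
  stringsAsFactors = FALSE
)

# Distinct target nodes: one per skill set and one per skill type
skillset_nodes  <- unique(hierarchy_sample$skill_set_name)
skilltype_nodes <- unique(hierarchy_sample$skill_type_name)

# skill -> skill set, carrying the rating weighted within the skill set
skill_to_skillset <- data.frame(
  from   = hierarchy_sample$skill_name,
  to     = hierarchy_sample$skill_set_name,
  rating = hierarchy_sample$weighted_rating_by_skill_set,
  stringsAsFactors = FALSE
)

# skill -> skill type, carrying the rating weighted within the skill type
skill_to_skilltype <- data.frame(
  from   = hierarchy_sample$skill_name,
  to     = hierarchy_sample$skill_type_name,
  rating = hierarchy_sample$weighted_rating_by_skill_type,
  stringsAsFactors = FALSE
)

print(skill_to_skillset)
print(skill_to_skilltype)
```

Each source row contributes one relationship of each kind, while the skill set and skill type nodes are shared across rows.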
library(RNeo4j)

graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "data607")
# This statement will delete all contents from the connected Neo4j database
clear(graph, input = FALSE)
# Add uniqueness constraints; the labels must match those used in the query below
addConstraint(graph, "Source", "name")
addConstraint(graph, "Skillset", "name")
addConstraint(graph, "Skilltype", "name")
# Use CQL (Cypher Query Language) to create the following nodes: Source, Skill, Skillset, and Skilltype
# Store the scalar rating value as a Skill property
#
# Relationships
# 1.) A Skill is extracted from a Source
# 2.) A skill is part of a Skillset
# 3.) A skill rolls up to a skilltype
query = "
CREATE (skill:Skill {name: {SkillName} })
SET skill.ScalarRating = TOINT({RatingScalar})
MERGE (source:Source {name: {Source} })
CREATE (skill)-[e:EXTRACTED_FROM]->(source)
SET e.rating = TOFLOAT({Rating})
MERGE (skillset:Skillset {name: {SkillsetName} })
MERGE (skilltype:Skilltype {name: {SkilltypeName} })
CREATE (skill)-[i:IS_PART_OF]->(skillset)
SET i.rating = TOFLOAT({RatingWeightedBySkillset})
CREATE (skill)-[r:ROLLS_UP_TO]->(skilltype)
SET r.rating = TOFLOAT({RatingWeightedBySkilltype})
"
# Open a new transaction
tx = newTransaction(graph)
# Pass the contents of the skill.data dataframe to query above using RNeo4j appendCypher
for(i in 1:nrow(skill.data)) {
row = skill.data[i, ]
appendCypher(tx, query,
SkillName=row$skill_name,
Rating=row$rating,
RatingScalar=row$rating_scalar,
RatingWeightedOverall=row$weighted_rating_overall,
Source=row$source_name,
SkillsetName=row$skill_set_name,
SkilltypeName=row$skill_type_name,
RatingWeightedBySkilltype=row$weighted_rating_by_skill_type,
RatingWeightedBySkillset=row$weighted_rating_by_skill_set)
}
# Commit
commit(tx)
summary(graph)
## This To That
## 1 Skill EXTRACTED_FROM Source
## 2 Skill IS_PART_OF Skillset
## 3 Skill ROLLS_UP_TO Skilltype
The resulting graph database looks like this (not all nodes are shown):
Based on the skill ratings created in the MySQL skill database, we know that machine learning is the highest-rated skill.
top10.skills <- get_mySQL_data("vw_top10_skills_overall")
skill_name | SUM(rating_scalar) |
---|---|
machine learning | 212 |
Python | 208 |
big data | 153 |
analytics | 146 |
statistics | 141 |
analysis | 124 |
SQL | 124 |
R | 119 |
GIS | 100 |
MATLAB | 85 |
In Neo4j, issue the following CQL statement to select the nodes and relationships associated with the machine learning Skill node:
MATCH (n:Skill {name: 'machine learning'})-[r]->() RETURN r
The graph looks like:
An initial attempt at the Neo4j migration, shown below, relied more heavily on RNeo4j package functions. Although this code created the nodes and relationships, it proved significantly slower and less clear than the Cypher-based code above, so the Cypher-based migration was preferred.
# load the MySQL tables into separate dataframes for migrating into Neo4j
source <- get_mySQL_data("tbl_source")
skillset.list <- get_mySQL_data("tbl_skill_set")
skilltype.list <- get_mySQL_data("tbl_skill_type")
weighted_skills_by_source <- get_mySQL_data("tbl_data")
# source
for (i in 1:nrow(source)) {
createNode(graph, "Source", name=source$source_name[i], id=source$source_id[i])
}
# skillset
for (i in 1:nrow(skillset.list)) {
createNode(graph, "SkillSet", name=skillset.list$skill_set_name[i], id=skillset.list$skill_set_id[i])
}
# skilltype
for (i in 1:nrow(skilltype.list)) {
createNode(graph, "SkillType", name = skilltype.list$skill_type_name[i], id = skilltype.list$skill_type_id[i])
}
sourceNodeQuery <- "MATCH (src:Source {name:{name}}) RETURN src"
skillsetNodeQuery <- "MATCH (sk:SkillSet {name:{name}}) RETURN sk"
skilltypeNodeQuery <- "MATCH (st:SkillType {name:{name}}) RETURN st"
for (i in 1:nrow(weighted_skills_by_source)) {
skillNode <- createNode(graph, "Skill",
name= weighted_skills_by_source$skill_name[i],
id= weighted_skills_by_source$unique_id[i],
sourceID = weighted_skills_by_source$source_id[i],
skillID = weighted_skills_by_source$skill_id[i],
skillSetID = weighted_skills_by_source$skill_set_id[i],
skillTypeID = weighted_skills_by_source$skill_type_id[i]
)
# find the source Node Name to create a relationship to in the next step
sourceNode <- getSingleNode(graph, sourceNodeQuery, name=weighted_skills_by_source$source_name[i])
# create the relationship between the skill and the source
createRel(skillNode, "EXTRACTED FROM", sourceNode, rating = weighted_skills_by_source$rating[i])
# find the skill set node to create a relationship to in the next step
skillsetNode <- getSingleNode(graph, skillsetNodeQuery, name=weighted_skills_by_source$skill_set_name[i])
# create the relationship between the skill and the skill set
createRel(skillNode, "IS PART OF", skillsetNode, skillset_rating = weighted_skills_by_source$weighted_rating_by_skill_set[i])
# find the skill type node to create a relationship to in the next step
skilltypeNode <- getSingleNode(graph, skilltypeNodeQuery, name=weighted_skills_by_source$skill_type_name[i])
# create the relationship between the skill and the skill type
createRel(skillNode, "ROLLS UP TO", skilltypeNode, skilltype_rating = weighted_skills_by_source$weighted_rating_by_skill_type[i])
}
Several Top N views were created in the MySQL skill database as part of Project 3. The examples below pair each of these SQL-based Top N views with a comparable CQL query against the Neo4j graph database.
1. Top 10 Skills
# MySQL
get_mySQL_data("vw_top10_skills_overall")
## skill_name SUM(rating_scalar)
## 1 machine learning 212
## 2 Python 208
## 3 big data 153
## 4 analytics 146
## 5 statistics 141
## 6 analysis 124
## 7 SQL 124
## 8 R 119
## 9 GIS 100
## 10 MATLAB 85
# Neo4j
query = "MATCH (s:Skill)
RETURN s.name as skill_name, SUM(s.ScalarRating) AS rating ORDER BY rating DESC LIMIT 10 "
cypher(graph, query)
## skill_name rating
## 1 machine learning 212
## 2 Python 208
## 3 big data 153
## 4 analytics 146
## 5 statistics 141
## 6 analysis 124
## 7 SQL 124
## 8 R 119
## 9 GIS 100
## 10 MATLAB 85
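Rather than comparing the two printed results by eye, the data frames could be compared in code. The sketch below is one way to do that in base R, assuming each result has the name in its first column and the summed rating in its second. Only the first three rows of each result are reproduced here, and `same_result` is a hypothetical helper, not part of RMySQL or RNeo4j.

```r
# First three rows of each result, as printed above
mysql_top <- data.frame(
  skill_name = c("machine learning", "Python", "big data"),
  `SUM(rating_scalar)` = c(212, 208, 153),
  check.names = FALSE, stringsAsFactors = FALSE
)
neo4j_top <- data.frame(
  skill_name = c("machine learning", "Python", "big data"),
  rating = c(212, 208, 153),
  stringsAsFactors = FALSE
)

# Hypothetical helper: align column names and row order, then compare
same_result <- function(a, b) {
  normalize <- function(df) {
    names(df) <- c("name", "rating")
    # sort by rating, then name, so tied ratings do not affect the comparison
    df <- df[order(-df$rating, df$name), ]
    rownames(df) <- NULL
    df
  }
  isTRUE(all.equal(normalize(a), normalize(b)))
}

print(same_result(mysql_top, neo4j_top))
```

Sorting on the name as a tie-breaker matters here because two skills (analysis and SQL) share a rating of 124, and the two databases are not guaranteed to return ties in the same order.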
2. Top 10 Skill Sets Overall
# MySQL
get_mySQL_data("vw_top10_skill_sets_overall")
## skill_set_name SUM(rating_scalar)
## 1 Big and Distributed Data 582
## 2 Statistical Programming 482
## 3 Machine Learning 379
## 4 Classical Statistics 344
## 5 Visualization 320
## 6 Object-Oriented Programming 315
## 7 Structured Data 300
## 8 Business 276
## 9 Communication 249
## 10 Back-End Programming 222
# Neo4j
query = "MATCH (s:Skill)-[r:`IS_PART_OF`]->(ss:Skillset)
RETURN ss.name as skill_set_name, SUM(s.ScalarRating) AS rating ORDER BY rating DESC LIMIT 10"
cypher(graph, query)
## skill_set_name rating
## 1 Big and Distributed Data 582
## 2 Statistical Programming 482
## 3 Machine Learning 379
## 4 Classical Statistics 344
## 5 Visualization 320
## 6 Object-Oriented Programming 315
## 7 Structured Data 300
## 8 Business 276
## 9 Communication 249
## 10 Back-End Programming 222
3. Top 5 Skill Sets by Skill Type
# MySQL
get_mySQL_data("vw_top5_skill_sets_by_skill_type")
## row_names skill_type_name skill_set_name rating_scalar
## 1 1 business Business 276
## 2 2 business Surveys and Marketing 56
## 3 3 business Creative Thinking 52
## 4 4 business Systems Administration 47
## 5 5 business Product Development 29
## 6 6 communication Communication 249
## 7 7 communication Science 2
## 8 8 math Classical Statistics 344
## 9 9 math Bayesian/Monte-Carlo Statistics 177
## 10 10 math Math 142
## 11 11 math Temporal Statistics 72
## 12 12 math Simulation 64
## 13 13 programming Big and Distributed Data 582
## 14 14 programming Statistical Programming 482
## 15 15 programming Machine Learning 379
## 16 16 programming Object-Oriented Programming 315
## 17 17 programming Structured Data 300
## 18 18 visualization Visualization 320
## 19 19 visualization Graphical Models 52
## rank
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 1
## 7 2
## 8 1
## 9 2
## 10 3
## 11 4
## 12 5
## 13 1
## 14 2
## 15 3
## 16 4
## 17 5
## 18 1
## 19 2
# Neo4j
query = "MATCH (s:Skill)-[r:`IS_PART_OF`]->(ss:Skillset)
OPTIONAL MATCH (s:Skill)-[r2:ROLLS_UP_TO]->(st:Skilltype)
RETURN st.name as skill_type_name, ss.name as skill_set_name, SUM(s.ScalarRating) AS rating
ORDER BY st.name, rating desc"
cypher(graph, query)
## skill_type_name skill_set_name rating
## 1 business Business 276
## 2 business Surveys and Marketing 56
## 3 business Creative Thinking 52
## 4 business Systems Administration 47
## 5 business Product Development 29
## 6 business Information Security 21
## 7 business Industry-Specific Knowledge 16
## 8 communication Communication 249
## 9 communication Science 2
## 10 math Classical Statistics 344
## 11 math Bayesian/Monte-Carlo Statistics 177
## 12 math Math 142
## 13 math Temporal Statistics 72
## 14 math Simulation 64
## 15 math Spatial Statistics 48
## 16 math Optimization 48
## 17 math Algorithms 21
## 18 math Surveys and Marketing 7
## 19 programming Big and Distributed Data 582
## 20 programming Statistical Programming 482
## 21 programming Machine Learning 379
## 22 programming Object-Oriented Programming 315
## 23 programming Structured Data 300
## 24 programming Back-End Programming 222
## 25 programming Relational Databases 209
## 26 programming Unstructured Data 160
## 27 programming Systems Administration 111
## 28 programming Front-End Programming 109
## 29 programming Mathematical Programming 85
## 30 programming General Programming 52
## 31 programming Surveys and Marketing 51
## 32 programming Data Manipulation 37
## 33 programming Simulation 14
## 34 programming Mobile Devices 2
## 35 visualization Visualization 320
## 36 visualization Graphical Models 52
4. Top 10 Skills by Source
# MySQL
get_mySQL_data("vw_top10_skills_by_source")
## row_names source_name skill_name rating_scalar rank
## 1 1 Google big data 100 1
## 2 2 Google machine learning 32 2
## 3 3 Google Hadoop 25 3
## 4 4 Google analytics 22 4
## 5 5 Google analysis 20 5
## 6 6 Google information 18 6
## 7 7 Google data natives 15 7
## 8 8 Google ai law 12 8
## 9 9 Google Python 12 9
## 10 10 Google R 12 10
## 11 11 Indeed GIS 100 1
## 12 12 Indeed statistical graphics 79 2
## 13 13 Indeed XML 79 3
## 14 14 Indeed Bayesian Statistics 75 4
## 15 15 Indeed text mining 75 5
## 16 16 Indeed social networks 74 6
## 17 17 Indeed Monte-Carlo Statistics 73 7
## 18 18 Indeed time-series analysis 72 8
## 19 19 Indeed clustering 71 9
## 20 20 Indeed BUGS 69 10
## 21 21 Kaggle Python 100 1
## 22 22 Kaggle analytics 81 2
## 23 23 Kaggle machine learning 67 3
## 24 24 Kaggle business 48 4
## 25 25 Kaggle programming 48 5
## 26 26 Kaggle team 48 6
## 27 27 Kaggle statistics 44 7
## 28 28 Kaggle SQL 41 8
## 29 29 Kaggle communication 37 9
## 30 30 Kaggle interpersonal 30 10
## 31 31 RJMETRICS analysis 100 1
## 32 32 RJMETRICS R 79 2
## 33 33 RJMETRICS Python 74 3
## 34 34 RJMETRICS data mining 73 4
## 35 35 RJMETRICS machine learning 71 5
## 36 36 RJMETRICS statistics 61 6
## 37 37 RJMETRICS SQL 54 7
## 38 38 RJMETRICS analytics 43 8
## 39 39 RJMETRICS MATLAB 31 9
## 40 40 RJMETRICS Java 29 10
# Neo4j
query = "MATCH (s:Skill)-[r:EXTRACTED_FROM]->(src:Source)
RETURN src.name as source_name, s.name as skill_name, SUM(s.ScalarRating) AS rating
ORDER BY src.name, rating desc"
cypher(graph, query)
## source_name skill_name rating
## 1 Google big data 100
## 2 Google machine learning 32
## 3 Google Hadoop 25
## 4 Google analytics 22
## 5 Google analysis 20
## 6 Google information 18
## 7 Google data natives 15
## 8 Google R 12
## 9 Google ai law 12
## 10 Google Python 12
## 11 Google media 10
## 12 Google SQL 10
## 13 Google NOSQL 8
## 14 Google data engineering 5
## 15 Google data mining 2
## 16 Google government 2
## 17 Google policy 2
## 18 Google creativity 2
## 19 Google tableau 1
## 20 Google IOS 1
## 21 Google security and risk strategy 1
## 22 Google entrepreneurship 1
## 23 Google manufacturing 1
## 24 Google fintech 1
## 25 Google network 1
## 26 Google Java 1
## 27 Google MATLAB 1
## 28 Google mobile development 1
## 29 Google devops 1
## 30 Google pattern discovery 1
## 31 Google RSS 1
## 32 Google web development 1
## 33 Google storage 1
## 34 Google SAS 1
## 35 Google Ruby 1
## 36 Google galaxql 1
## 37 Google cloud 1
## 38 Google retail 1
## 39 Indeed GIS 100
## 40 Indeed XML 79
## 41 Indeed statistical graphics 79
## 42 Indeed text mining 75
## 43 Indeed Bayesian Statistics 75
## 44 Indeed social networks 74
## 45 Indeed Monte-Carlo Statistics 73
## 46 Indeed time-series analysis 72
## 47 Indeed clustering 71
## 48 Indeed BUGS 69
## 49 Indeed Pig 67
## 50 Indeed experimental design 65
## 51 Indeed continuous Simulation 64
## 52 Indeed JSON 63
## 53 Indeed SVM 62
## 54 Indeed DBA 61
## 55 Indeed ANOVA 59
## 56 Indeed mapping 58
## 57 Indeed Simulation 58
## 58 Indeed general linear model 57
## 59 Indeed Rails 57
## 60 Indeed Surveys and Marketing 56
## 61 Indeed Teradata 55
## 62 Indeed Objective C 55
## 63 Indeed PostgreSQL 53
## 64 Indeed Oracle 52
## 65 Indeed SPSS 52
## 66 Indeed Graphical Models 52
## 67 Indeed decision trees 50
## 68 Indeed MySQL 49
## 69 Indeed Spatial Statistics 47
## 70 Indeed Systems Administration 47
## 71 Indeed Map/Reduce 47
## 72 Indeed MATLAB 46
## 73 Indeed Stata 46
## 74 Indeed cloud tech 45
## 75 Indeed Hive 45
## 76 Indeed Ruby 44
## 77 Indeed machine learning 42
## 78 Indeed integer Optimization 41
## 79 Indeed Perl 41
## 80 Indeed global Optimization 40
## 81 Indeed discrete Simulation 40
## 82 Indeed Distributed Data 40
## 83 Indeed CSS 40
## 84 Indeed calculus 39
## 85 Indeed Unstructured Data 39
## 86 Indeed NOSQL 38
## 87 Indeed RegEx 37
## 88 Indeed JavaScript 36
## 89 Indeed statistics 36
## 90 Indeed C 34
## 91 Indeed visualization 34
## 92 Indeed Hadoop 33
## 93 Indeed math 33
## 94 Indeed Structured Data 32
## 95 Indeed Bayes networks 32
## 96 Indeed HTML 31
## 97 Indeed real analysis 30
## 98 Indeed linear algebra 30
## 99 Indeed Java 30
## 100 Indeed MCMC 29
## 101 Indeed neural nets 27
## 102 Indeed R 27
## 103 Indeed SAS 22
## 104 Indeed Python 22
## 105 Indeed forecasting 21
## 106 Indeed big data 19
## 107 Indeed SQL 19
## 108 Indeed Product Development 18
## 109 Indeed agent-based Simulation 14
## 110 Indeed project management 11
## 111 Indeed convex Optimization 7
## 112 Indeed multinomial modeling 7
## 113 Indeed design 6
## 114 Indeed nix 3
## 115 Indeed technical writing/publishing 2
## 116 Indeed web-based data-viz 1
## 117 Indeed geographic covariates 1
## 118 Kaggle Python 100
## 119 Kaggle analytics 81
## 120 Kaggle machine learning 67
## 121 Kaggle team 48
## 122 Kaggle programming 48
## 123 Kaggle business 48
## 124 Kaggle statistics 44
## 125 Kaggle SQL 41
## 126 Kaggle communication 37
## 127 Kaggle interpersonal 30
## 128 Kaggle modeling 30
## 129 Kaggle management 30
## 130 Kaggle big data 22
## 131 Kaggle Hadoop 19
## 132 Kaggle design 19
## 133 Kaggle Java 19
## 134 Kaggle creative 11
## 135 Kaggle research 11
## 136 Kaggle SAS 7
## 137 Kaggle predictive analytics 7
## 138 Kaggle math 7
## 139 Kaggle MATLAB 7
## 140 Kaggle leadership 7
## 141 Kaggle analysis 4
## 142 Kaggle R 1
## 143 Kaggle presentation 1
## 144 RJMETRICS analysis 100
## 145 RJMETRICS R 79
## 146 RJMETRICS Python 74
## 147 RJMETRICS data mining 73
## 148 RJMETRICS machine learning 71
## 149 RJMETRICS statistics 61
## 150 RJMETRICS SQL 54
## 151 RJMETRICS analytics 43
## 152 RJMETRICS MATLAB 31
## 153 RJMETRICS Java 29
## 154 RJMETRICS algorithm design 21
## 155 RJMETRICS modeling 21
## 156 RJMETRICS C++ 18
## 157 RJMETRICS big data 12
## 158 RJMETRICS business intelligence 12
## 159 RJMETRICS SAS 9
## 160 RJMETRICS Hadoop 4
## 161 RJMETRICS sofware engineering 1
## 162 RJMETRICS research 1
## 163 RJMETRICS programming 1
Note: Aggregation is relatively straightforward using CQL and Neo4j. However, SQL ranking functions, such as a rank number within a group, were harder to recreate. For examples #3 and #4 above, I could not find a way to rank within groups using CQL, quite possibly due to my limited experience with CQL as a query language, so the Neo4j results list every row rather than only the top N per group.
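One possible workaround is to add the rank in R after the Cypher query returns. The sketch below does this with base R's `ave` and `rank` on a subset of the example #3 output. `rank_within_group` is a hypothetical helper, and breaking ties by order of appearance (`ties.method = "first"`) is an assumption rather than the rule used by the MySQL views.

```r
# Subset of the Neo4j result for example #3 (business and communication rows)
neo4j_result <- data.frame(
  skill_type_name = c(rep("business", 7), rep("communication", 2)),
  skill_set_name = c("Business", "Surveys and Marketing", "Creative Thinking",
                     "Systems Administration", "Product Development",
                     "Information Security", "Industry-Specific Knowledge",
                     "Communication", "Science"),
  rating = c(276, 56, 52, 47, 29, 21, 16, 249, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank rows within each group (highest value first)
# and keep only the top_n rows per group
rank_within_group <- function(df, group_col, value_col, top_n) {
  df$rank <- ave(-df[[value_col]], df[[group_col]],
                 FUN = function(x) rank(x, ties.method = "first"))
  df <- df[df$rank <= top_n, ]
  df[order(df[[group_col]], df$rank), ]
}

top5 <- rank_within_group(neo4j_result, "skill_type_name", "rating", 5)
print(top5)
```

Applied to the full result of `cypher(graph, query)`, the same call would trim each skill type to its five highest-rated skill sets, in the spirit of the MySQL view.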
Because I was new to Neo4j at the beginning of this project, one immediate disadvantage was the learning curve of simply using a NoSQL database. Relational databases and SQL are much more familiar in terms of usage, language features, and constructs. However, a graph database such as Neo4j seems much better at describing and leveraging relationships between entities (nodes) for analysis. Visualizing the data as a graph is also a standard feature of Neo4j, unlike relational databases, which is valuable as well.