DATA 607: Project 4 - NoSQL Migration

Introduction

Project 4 will focus on extracting data from a relational database and migrating the information to a NoSQL database. This project will use Neo4j as the NoSQL database.

Project requirements:

For the relational database, options include using the flights database, the tb database, the “data skills” database created for DATA 607 Project 3, or another database of your own choosing or creation.
For the NoSQL database, options include using MongoDB, Neo4j, or another NoSQL database of your choosing.
The migration process needs to be reproducible. R code is encouraged, but not required.
Briefly describe the advantages and disadvantages of storing the data in a relational database vs. a NoSQL database.

The code for this project requires the following R packages:

RMySQL
RNeo4j
knitr

Additionally, to run this migration, the default Neo4j graphdb will need to be installed and running on your local machine.

Source Data - MySQL Skill Database

The relational data for this NoSQL migration comes from the skill database that the team created for DATA 607 Project 3, hosted on the cloud MySQL service db4free.net (https://www.db4free.net/).

skill ERD

Table and view data will be extracted from the cloud MySQL skill database using the function below:

## ------------------------------------------
## Using RMySQL
## ------------------------------------------

library(RMySQL)

get_mySQL_data <- function(object_name) {
    
    # establish the connection to the skill DB on db4free.net
    # (proj_user, proj_pwd, proj_db, and proj_host must be defined beforehand)
    skilldb <- dbConnect(MySQL(), user=proj_user, password=proj_pwd, dbname=proj_db, host=proj_host)
    
    # close the connection when the function exits, even if the query fails
    on.exit(dbDisconnect(skilldb))
    
    skilldb.data <- dbGetQuery(skilldb, paste0("select * from ", object_name))
    
    return (skilldb.data)
    
}
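
Because get_mySQL_data pastes object_name directly into the SQL text, it can be worth checking the name first. The sketch below is one possible guard, not part of the original project: build_select_all is a hypothetical helper, and the rule it enforces (letters, digits, and underscores only) is an assumption that happens to fit the table and view names used in this report.

```r
# Hypothetical helper (not part of the original project): accept only plain
# SQL identifiers before building the "select *" statement.
build_select_all <- function(object_name) {
  ok <- is.character(object_name) &&
    length(object_name) == 1 &&
    grepl("^[A-Za-z_][A-Za-z0-9_]*$", object_name)
  if (!ok) {
    stop("Invalid table or view name: ", deparse(object_name))
  }
  paste0("select * from ", object_name)
}

# A valid name passes through unchanged
build_select_all("tbl_data")
```

Inside get_mySQL_data, the result of build_select_all(object_name) could be passed to dbGetQuery in place of the paste0 call.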

The table named tbl_data will be the source of the data being used in this project.

skill.data <- get_mySQL_data("tbl_data")

Structure of tbl_data

Data Dictionary for the Table tbl_data

Column Name	Description
skill_type_id	Unique ID identifying a skill type
skill_set_id	Unique ID identifying a skill set
skill_id	Unique ID identifying a skill (UK = skill_id and source_id)
source_id	Unique ID identifying a source (UK = skill_id and source_id)
skill_type_name	Highest level of classification of a skill. 1 of 5 values defined in the model – business, communication, math, programming, visualization
skill_set_name	Mid-level classification of a skill. 1 of 32 values defined in the model. Examples: General Programming, Object-Oriented Programming, Relational Databases, Creative Thinking
skill_name	Skill name - the lowest level of skill classification. 1 of 122 skills defined in the skill db model.
source_name	Source name - Kaggle, Google, RJMetrics, Indeed
rating	Rating value associated with the source
rating_scalar	Skill’s normalized rating
weighted_rating_overall	Skill’s weighted ranking overall
weighted_rating_by_skill_type	Skill’s weighted ranking within the skill type
weighted_rating_by_skill_set	Skill’s weighted ranking within the skill set

## 'data.frame':    163 obs. of  13 variables:
##  $ skill_type_id                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ skill_set_id                 : int  5 5 5 5 5 5 5 5 5 13 ...
##  $ skill_id                     : int  1 1 1 2 3 6 7 8 15 13 ...
##  $ source_id                    : int  1 2 3 1 2 3 4 4 4 4 ...
##  $ skill_type_name              : chr  "business" "business" "business" "business" ...
##  $ skill_set_name               : chr  "Business" "Business" "Business" "Business" ...
##  $ skill_name                   : chr  "analysis" "analysis" "analysis" "business" ...
##  $ source_name                  : chr  "Kaggle" "RJMETRICS" "Google" "Kaggle" ...
##  $ rating                       : num  2 55.7 10 14 23.8 ...
##  $ rating_scalar                : num  4 100 20 48 12 1 21 40 30 18 ...
##  $ weighted_rating_overall      : num  0.245 NA 1.227 1.718 NA ...
##  $ weighted_rating_by_skill_type: num  0.231 NA 2.368 1.615 NA ...
##  $ weighted_rating_by_skill_set : num  0.0979 NA 0.4895 0.6853 NA ...

skill_type_id	skill_set_id	skill_id	source_id	skill_type_name	skill_set_name	skill_name	source_name	rating	rating_scalar	weighted_rating_overall	weighted_rating_by_skill_type	weighted_rating_by_skill_set
1	5	1	1	business	Business	analysis	Kaggle	2.00	4	0.245399	0.230769	0.0979021
1	5	1	2	business	Business	analysis	RJMETRICS	55.69	100	NA	NA	NA
1	5	1	3	business	Business	analysis	Google	10.00	20	1.226990	2.368420	0.4895100
1	5	2	1	business	Business	business	Kaggle	14.00	48	1.717790	1.615380	0.6853150
1	5	3	2	business	Business	business intelligence	RJMETRICS	23.78	12	NA	NA	NA
1	5	6	3	business	Business	entrepreneurship	Google	2.00	1	0.245399	0.473684	0.0979021
1	5	7	4	business	Business	forecasting	Indeed	49.00	21	1.503070	1.240510	0.5996500
1	5	8	4	business	Business	global Optimization	Indeed	94.00	40	2.883440	2.379750	1.1503500
1	5	15	4	business	Business	real analysis	Indeed	70.00	30	2.147240	1.772150	0.8566430
1	13	13	4	business	Product Development	Product Development	Indeed	41.00	18	1.257670	1.037970	0.1433570
1	13	14	4	business	Product Development	project management	Indeed	25.00	11	0.766871	0.632911	0.0874126
1	18	19	4	business	Surveys and Marketing	Surveys and Marketing	Indeed	130.00	56	3.987730	3.291140	0.6818180
1	19	20	4	business	Systems Administration	Systems Administration	Indeed	109.00	47	3.343560	2.759490	1.1433600
1	26	10	3	business	Information Security	information	Google	9.00	18	1.104290	2.131580	0.1888110
1	26	12	3	business	Information Security	policy	Google	3.00	2	0.368098	0.710526	0.0629371
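
The data dictionary describes skill_id and source_id as a composite unique key. As a minimal sketch of how that could be checked on a data frame such as skill.data, the example below uses a few key pairs copied from the sample rows above. has_unique_key is a hypothetical helper name, not something defined in the project.

```r
# A few (skill_id, source_id) pairs copied from the sample rows above.
sample_keys <- data.frame(
  skill_id  = c(1, 1, 1, 2, 3, 6),
  source_id = c(1, 2, 3, 1, 2, 3)
)

# Hypothetical helper: TRUE when no combination of the key columns repeats.
has_unique_key <- function(df, key_cols) {
  !any(duplicated(df[, key_cols, drop = FALSE]))
}

has_unique_key(sample_keys, c("skill_id", "source_id"))  # composite key
has_unique_key(sample_keys, "skill_id")                  # skill_id alone repeats
```

Running the same check as has_unique_key(skill.data, c("skill_id", "source_id")) before the migration would flag duplicate rows early.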

Migration to Neo4j

The following section migrates the skill data found in the Project 3 MySQL database to Neo4j. The default Neo4j graphdb will need to be running locally. Additionally, this code requires the RNeo4j package in R.

MySQL-Neo4j Migration Code - 1

The code below creates the following nodes in Neo4j based on the skill data sourced from MySQL:

Source
Skill
Skillset
Skilltype

In the MySQL database, a hierarchy exists between a skill, skill set, and skill type:

skill -> skill set -> skill type

where skill type is the top level classification.

In the Neo4j graph database below, a relationship is created from skill -> skill set and from skill -> skill type, rather than chaining skill set -> skill type. This captures the weighted rating values as properties of the relationships between the Skill and Skillset nodes and between the Skill and Skilltype nodes.
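
To make the mapping concrete, the sketch below splits three rows (copied from the tbl_data sample shown earlier) into the nodes and relationships that the migration produces. It is an illustration in plain R only and does not talk to Neo4j; the nodes and edges objects are hypothetical names used for this example.

```r
# Three rows copied from the tbl_data sample shown earlier.
rows <- data.frame(
  skill_name      = c("analysis", "analysis", "business"),
  source_name     = c("Kaggle", "Google", "Kaggle"),
  skill_set_name  = c("Business", "Business", "Business"),
  skill_type_name = c("business", "business", "business"),
  rating          = c(2, 10, 14),
  weighted_rating_by_skill_set  = c(0.0979021, 0.4895100, 0.6853150),
  weighted_rating_by_skill_type = c(0.230769, 2.368420, 1.615380),
  stringsAsFactors = FALSE
)

# One Skill node per row; the other labels are deduplicated by name,
# which mirrors the CREATE and MERGE clauses of the Cypher query.
nodes <- list(
  Skill     = rows$skill_name,
  Source    = unique(rows$source_name),
  Skillset  = unique(rows$skill_set_name),
  Skilltype = unique(rows$skill_type_name)
)

# Each row yields three relationships, each carrying its own rating.
edges <- rbind(
  data.frame(from = rows$skill_name, type = "EXTRACTED_FROM",
             to = rows$source_name, rating = rows$rating,
             stringsAsFactors = FALSE),
  data.frame(from = rows$skill_name, type = "IS_PART_OF",
             to = rows$skill_set_name,
             rating = rows$weighted_rating_by_skill_set,
             stringsAsFactors = FALSE),
  data.frame(from = rows$skill_name, type = "ROLLS_UP_TO",
             to = rows$skill_type_name,
             rating = rows$weighted_rating_by_skill_type,
             stringsAsFactors = FALSE)
)

lengths(nodes)
edges
```

Three source rows become three Skill nodes, two Source nodes, one Skillset node, one Skilltype node, and nine relationships.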

library(RNeo4j)

graph = startGraph("http://localhost:7474/db/data", username = "neo4j", password = "data607")

# This statement will delete all contents from the connected Neo4j database
clear(graph, input = FALSE)

# Add constraints

# Labels must match those used in the Cypher query below (Skillset, Skilltype)
addConstraint(graph, "Source", "name")
addConstraint(graph, "Skillset", "name")
addConstraint(graph, "Skilltype", "name")


# Use CQL (Cypher Query Language) to create the following nodes: Source, Skill, Skillset, and Skilltype
# Store the scalar rating value as a Skill property
# 

# Relationships
# 1.) A Skill is extracted from a Source
# 2.) A skill is part of a Skillset
# 3.) A skill rolls up to a skilltype

query = "
        CREATE (skill:Skill {name: {SkillName} })
           SET skill.ScalarRating = TOINT({RatingScalar})
        
        MERGE (source:Source {name: {Source} })
        CREATE (skill)-[e:EXTRACTED_FROM]->(source)
        
        SET e.rating = TOFLOAT({Rating})
        
        MERGE (skillset:Skillset {name: {SkillsetName} })
        MERGE (skilltype:Skilltype {name: {SkilltypeName} })
        
        CREATE (skill)-[i:IS_PART_OF]->(skillset)
        
        SET i.rating = TOFLOAT({RatingWeightedBySkillset})
        
        CREATE (skill)-[r:ROLLS_UP_TO]->(skilltype)
        SET r.rating = TOFLOAT({RatingWeightedBySkilltype})

"

# Open a new transaction 
tx = newTransaction(graph)

# Pass the contents of the skill.data dataframe to query above using RNeo4j appendCypher
for(i in seq_len(nrow(skill.data))) {
    row = skill.data[i, ]
    
    appendCypher(tx, query,
                 SkillName=row$skill_name,
                 Rating=row$rating,
                 RatingScalar=row$rating_scalar,
                 RatingWeightedOverall=row$weighted_rating_overall,
                 Source=row$source_name,
                 SkillsetName=row$skill_set_name,
                 SkilltypeName=row$skill_type_name,
                 RatingWeightedBySkilltype=row$weighted_rating_by_skill_type,
                 RatingWeightedBySkillset=row$weighted_rating_by_skill_set)
}

# Commit
commit(tx)
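
With 163 rows, a single transaction is sufficient here. For a larger table, one option is to commit in batches. The sketch below only computes the row indices for each batch; batch_indices is a hypothetical helper and the batch size of 50 is an arbitrary assumption. The commented lines indicate where the RNeo4j calls already used above would go.

```r
# Hypothetical helper: split row indices 1..n into consecutive batches.
batch_indices <- function(n, batch_size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / batch_size)))
}

# 163 rows (the size of tbl_data) in batches of 50
batches <- batch_indices(163, 50)
lengths(batches)

# Outline of a batched migration (requires a running Neo4j instance):
# for (b in batches) {
#     tx <- newTransaction(graph)
#     for (i in b) {
#         # same appendCypher(tx, query, ...) call as in the loop above
#     }
#     commit(tx)
# }
```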

summary(graph)

##    This             To      That
## 1 Skill EXTRACTED_FROM    Source
## 2 Skill     IS_PART_OF  Skillset
## 3 Skill    ROLLS_UP_TO Skilltype

The resulting graph database looks like this (not all nodes are shown):

Neo4j Graph DB

Based on the skill ratings created in the MySQL skill database, we know that machine learning is the highest-rated skill.

top10.skills <- get_mySQL_data("vw_top10_skills_overall")

skill_name	SUM(rating_scalar)
machine learning	212
Python	208
big data	153
analytics	146
statistics	141
analysis	124
SQL	124
R	119
GIS	100
MATLAB	85

In Neo4j, issue the following CQL statement to select the relationships that start at the machine learning Skill nodes:

MATCH (n:Skill {name:'machine learning'})-[r]->() RETURN r

The graph looks like:

Machine Learning Graph

MySQL-Neo4j Migration Code - 2

An initial attempt at the Neo4j migration, shown below, relied more heavily on RNeo4j package functions. Although the code created the nodes and relationships, it proved to be significantly slower and less clear than the Cypher code above, so migration code 1 was preferred over this approach.

# load the MySQL tables into separate dataframes for migrating into Neo4j

source         <- get_mySQL_data("tbl_source")
skillset.list  <- get_mySQL_data("tbl_skill_set")
skilltype.list <- get_mySQL_data("tbl_skill_type")
weighted_skills_by_source <- get_mySQL_data("tbl_data")

# source 
for (i in 1:nrow(source)) {
    
    createNode(graph, "Source",  name=source$source_name[i], id=source$source_id[i])
}


# skillset 
for (i in 1:nrow(skillset.list)) {
    
    createNode(graph, "SkillSet", name=skillset.list$skill_set_name[i], id=skillset.list$skill_set_id[i])
}

# skilltype
for (i in 1:nrow(skilltype.list)) {
    
    createNode(graph, "SkillType", name = skilltype.list$skill_type_name[i], id = skilltype.list$skill_type_id[i])
}

sourceNodeQuery <- "MATCH (src:Source {name:{name}}) RETURN src"

skillsetNodeQuery <- "MATCH (sk:SkillSet {name:{name}}) RETURN sk"

skilltypeNodeQuery <- "MATCH (st:SkillType {name:{name}}) RETURN st"

for (i in 1:nrow(weighted_skills_by_source)) {
    
    # NOTE: unique_id is not among the tbl_data columns listed earlier;
    # adjust or drop the id property if the column is absent
    skillNode <- createNode(graph, "Skill", 
                    name= weighted_skills_by_source$skill_name[i], 
                    id= weighted_skills_by_source$unique_id[i],
                    sourceID = weighted_skills_by_source$source_id[i],
                    skillID = weighted_skills_by_source$skill_id[i],
                    skillSetID = weighted_skills_by_source$skill_set_id[i],
                    skillTypeID = weighted_skills_by_source$skill_type_id[i]
                    )
    
    # find the Source node to create a relationship to in the next step
    sourceNode <- getSingleNode(graph, sourceNodeQuery, name=weighted_skills_by_source$source_name[i])

    # create the relationship between the skill and the source
    createRel(skillNode, "EXTRACTED_FROM", sourceNode, rating = weighted_skills_by_source$rating[i])
   
    
    # find the SkillSet node to create a relationship to in the next step
    skillsetNode <- getSingleNode(graph, skillsetNodeQuery, name=weighted_skills_by_source$skill_set_name[i])           

    # create the relationship between the skill and the skill set
    createRel(skillNode, "IS_PART_OF", skillsetNode, skillset_rating = weighted_skills_by_source$weighted_rating_by_skill_set[i])
   
    # find the SkillType node to create a relationship to in the next step
    skilltypeNode <- getSingleNode(graph, skilltypeNodeQuery, name=weighted_skills_by_source$skill_type_name[i])           

    # create the relationship between the skill and the skill type
    createRel(skillNode, "ROLLS_UP_TO", skilltypeNode, skilltype_rating = weighted_skills_by_source$weighted_rating_by_skill_type[i])
   
    
}
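
The loop above issues three getSingleNode lookups per row, or 489 lookups for 163 rows, although the data dictionary lists only 4 sources, 32 skill sets, and 5 skill types. Whether this accounts for the slowdown was not measured, but caching the lookups is one way to reduce the number of round trips. The sketch below shows the idea with a stand-in fetch function; make_cached_lookup and fake_fetch are hypothetical names introduced for illustration.

```r
# Hypothetical helper: wrap a lookup function so each distinct key is
# fetched only once; later requests are served from an in-memory cache.
make_cached_lookup <- function(fetch) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fetch(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Stand-in for a database round trip; it only counts how often it is called.
calls <- 0
fake_fetch <- function(name) {
  calls <<- calls + 1
  paste("node:", name)
}

lookup_source <- make_cached_lookup(fake_fetch)
for (src in c("Kaggle", "RJMETRICS", "Google", "Kaggle", "Kaggle")) {
  lookup_source(src)
}

# Number of times the underlying fetch actually ran
calls
```

In the migration loop, the fetch argument could be a function such as function(nm) getSingleNode(graph, sourceNodeQuery, name = nm), with one cached lookup each for sources, skill sets, and skill types.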

Comparison - Querying Neo4j vs. MySQL

Several Top N views were created in the MySQL skill database as part of Project 3. For each of these SQL-based Top N views, a comparable CQL query is run against the Neo4j graph database and the two results are shown together.

1. Top 10 Skills

# MySQL
get_mySQL_data("vw_top10_skills_overall")

##          skill_name SUM(rating_scalar)
## 1  machine learning                212
## 2            Python                208
## 3          big data                153
## 4         analytics                146
## 5        statistics                141
## 6          analysis                124
## 7               SQL                124
## 8                 R                119
## 9               GIS                100
## 10           MATLAB                 85

# Neo4j
query = "MATCH (s:Skill) 
         RETURN s.name as skill_name, SUM(s.ScalarRating) AS rating ORDER BY rating DESC LIMIT 10  "

cypher(graph, query)

##          skill_name rating
## 1  machine learning    212
## 2            Python    208
## 3          big data    153
## 4         analytics    146
## 5        statistics    141
## 6          analysis    124
## 7               SQL    124
## 8                 R    119
## 9               GIS    100
## 10           MATLAB     85
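
Both queries sum rating_scalar per skill across sources. As a cross-check that needs neither database, the same aggregation can be sketched in base R. The rows below are copied from the tbl_data sample shown earlier, so only the total for analysis is complete; the business skill may have additional rows that are not in the sample.

```r
# Rows for two skills, copied from the sample of tbl_data shown earlier.
sample_rows <- data.frame(
  skill_name    = c("analysis", "analysis", "analysis", "business"),
  source_name   = c("Kaggle", "RJMETRICS", "Google", "Kaggle"),
  rating_scalar = c(4, 100, 20, 48),
  stringsAsFactors = FALSE
)

# Sum rating_scalar per skill, then sort from highest to lowest.
totals <- aggregate(rating_scalar ~ skill_name, data = sample_rows, FUN = sum)
totals <- totals[order(-totals$rating_scalar), ]
totals
```

The total for analysis (4 + 100 + 20 = 124) agrees with the value reported for analysis in both result sets above. Applied to the full skill.data frame, head(totals, 10) would give the R counterpart of the top 10 list.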

2. Top 10 Skill Sets Overall

# MySQL
get_mySQL_data("vw_top10_skill_sets_overall")

##                 skill_set_name SUM(rating_scalar)
## 1     Big and Distributed Data                582
## 2      Statistical Programming                482
## 3             Machine Learning                379
## 4         Classical Statistics                344
## 5                Visualization                320
## 6  Object-Oriented Programming                315
## 7              Structured Data                300
## 8                     Business                276
## 9                Communication                249
## 10        Back-End Programming                222

# Neo4j
query = "MATCH (s:Skill)-[r:`IS_PART_OF`]->(ss:Skillset) 
         RETURN ss.name as skill_set_name, SUM(s.ScalarRating) AS rating ORDER BY rating DESC LIMIT 10"

cypher(graph, query)

##                 skill_set_name rating
## 1     Big and Distributed Data    582
## 2      Statistical Programming    482
## 3             Machine Learning    379
## 4         Classical Statistics    344
## 5                Visualization    320
## 6  Object-Oriented Programming    315
## 7              Structured Data    300
## 8                     Business    276
## 9                Communication    249
## 10        Back-End Programming    222

3. Top 5 Skill Sets by Skill Type

# MySQL
get_mySQL_data("vw_top5_skill_sets_by_skill_type")

##    row_names skill_type_name                  skill_set_name rating_scalar rank
## 1          1        business                        Business           276    1
## 2          2        business           Surveys and Marketing            56    2
## 3          3        business               Creative Thinking            52    3
## 4          4        business          Systems Administration            47    4
## 5          5        business             Product Development            29    5
## 6          6   communication                   Communication           249    1
## 7          7   communication                         Science             2    2
## 8          8            math            Classical Statistics           344    1
## 9          9            math Bayesian/Monte-Carlo Statistics           177    2
## 10        10            math                            Math           142    3
## 11        11            math             Temporal Statistics            72    4
## 12        12            math                      Simulation            64    5
## 13        13     programming        Big and Distributed Data           582    1
## 14        14     programming         Statistical Programming           482    2
## 15        15     programming                Machine Learning           379    3
## 16        16     programming     Object-Oriented Programming           315    4
## 17        17     programming                 Structured Data           300    5
## 18        18   visualization                   Visualization           320    1
## 19        19   visualization                Graphical Models            52    2

# Neo4j
query = "MATCH (s:Skill)-[r:`IS_PART_OF`]->(ss:Skillset)
         OPTIONAL MATCH (s:Skill)-[r2:ROLLS_UP_TO]->(st:Skilltype)
         RETURN st.name as skill_type_name, ss.name as skill_set_name, SUM(s.ScalarRating) AS rating 
         ORDER BY st.name, rating desc"


cypher(graph, query)

##    skill_type_name                  skill_set_name rating
## 1         business                        Business    276
## 2         business           Surveys and Marketing     56
## 3         business               Creative Thinking     52
## 4         business          Systems Administration     47
## 5         business             Product Development     29
## 6         business            Information Security     21
## 7         business     Industry-Specific Knowledge     16
## 8    communication                   Communication    249
## 9    communication                         Science      2
## 10            math            Classical Statistics    344
## 11            math Bayesian/Monte-Carlo Statistics    177
## 12            math                            Math    142
## 13            math             Temporal Statistics     72
## 14            math                      Simulation     64
## 15            math              Spatial Statistics     48
## 16            math                    Optimization     48
## 17            math                      Algorithms     21
## 18            math           Surveys and Marketing      7
## 19     programming        Big and Distributed Data    582
## 20     programming         Statistical Programming    482
## 21     programming                Machine Learning    379
## 22     programming     Object-Oriented Programming    315
## 23     programming                 Structured Data    300
## 24     programming            Back-End Programming    222
## 25     programming            Relational Databases    209
## 26     programming               Unstructured Data    160
## 27     programming          Systems Administration    111
## 28     programming           Front-End Programming    109
## 29     programming        Mathematical Programming     85
## 30     programming             General Programming     52
## 31     programming           Surveys and Marketing     51
## 32     programming               Data Manipulation     37
## 33     programming                      Simulation     14
## 34     programming                  Mobile Devices      2
## 35   visualization                   Visualization    320
## 36   visualization                Graphical Models     52

4. Top 10 Skills by Source

# MySQL
get_mySQL_data("vw_top10_skills_by_source")

##    row_names source_name             skill_name rating_scalar rank
## 1          1      Google               big data           100    1
## 2          2      Google       machine learning            32    2
## 3          3      Google                 Hadoop            25    3
## 4          4      Google              analytics            22    4
## 5          5      Google               analysis            20    5
## 6          6      Google            information            18    6
## 7          7      Google           data natives            15    7
## 8          8      Google                ai  law            12    8
## 9          9      Google                 Python            12    9
## 10        10      Google                      R            12   10
## 11        11      Indeed                    GIS           100    1
## 12        12      Indeed   statistical graphics            79    2
## 13        13      Indeed                    XML            79    3
## 14        14      Indeed    Bayesian Statistics            75    4
## 15        15      Indeed            text mining            75    5
## 16        16      Indeed        social networks            74    6
## 17        17      Indeed Monte-Carlo Statistics            73    7
## 18        18      Indeed   time-series analysis            72    8
## 19        19      Indeed             clustering            71    9
## 20        20      Indeed                   BUGS            69   10
## 21        21      Kaggle                 Python           100    1
## 22        22      Kaggle              analytics            81    2
## 23        23      Kaggle       machine learning            67    3
## 24        24      Kaggle               business            48    4
## 25        25      Kaggle            programming            48    5
## 26        26      Kaggle                   team            48    6
## 27        27      Kaggle             statistics            44    7
## 28        28      Kaggle                    SQL            41    8
## 29        29      Kaggle          communication            37    9
## 30        30      Kaggle          interpersonal            30   10
## 31        31   RJMETRICS               analysis           100    1
## 32        32   RJMETRICS                      R            79    2
## 33        33   RJMETRICS                 Python            74    3
## 34        34   RJMETRICS            data mining            73    4
## 35        35   RJMETRICS       machine learning            71    5
## 36        36   RJMETRICS             statistics            61    6
## 37        37   RJMETRICS                    SQL            54    7
## 38        38   RJMETRICS              analytics            43    8
## 39        39   RJMETRICS                 MATLAB            31    9
## 40        40   RJMETRICS                   Java            29   10

# Neo4j
query = "MATCH (s:Skill)-[r:EXTRACTED_FROM]->(src:Source)
         RETURN src.name as source_name, s.name as skill_name, SUM(s.ScalarRating) AS rating 
         ORDER BY src.name, rating desc"


cypher(graph, query)

##     source_name                   skill_name rating
## 1        Google                     big data    100
## 2        Google             machine learning     32
## 3        Google                       Hadoop     25
## 4        Google                    analytics     22
## 5        Google                     analysis     20
## 6        Google                  information     18
## 7        Google                 data natives     15
## 8        Google                            R     12
## 9        Google                      ai  law     12
## 10       Google                       Python     12
## 11       Google                        media     10
## 12       Google                          SQL     10
## 13       Google                        NOSQL      8
## 14       Google             data engineering      5
## 15       Google                  data mining      2
## 16       Google                   government      2
## 17       Google                       policy      2
## 18       Google                   creativity      2
## 19       Google                      tableau      1
## 20       Google                          IOS      1
## 21       Google   security and risk strategy      1
## 22       Google             entrepreneurship      1
## 23       Google                manufacturing      1
## 24       Google                      fintech      1
## 25       Google                      network      1
## 26       Google                         Java      1
## 27       Google                       MATLAB      1
## 28       Google           mobile development      1
## 29       Google                       devops      1
## 30       Google            pattern discovery      1
## 31       Google                          RSS      1
## 32       Google              web development      1
## 33       Google                      storage      1
## 34       Google                          SAS      1
## 35       Google                         Ruby      1
## 36       Google                      galaxql      1
## 37       Google                        cloud      1
## 38       Google                       retail      1
## 39       Indeed                          GIS    100
## 40       Indeed                          XML     79
## 41       Indeed         statistical graphics     79
## 42       Indeed                  text mining     75
## 43       Indeed          Bayesian Statistics     75
## 44       Indeed              social networks     74
## 45       Indeed       Monte-Carlo Statistics     73
## 46       Indeed         time-series analysis     72
## 47       Indeed                   clustering     71
## 48       Indeed                         BUGS     69
## 49       Indeed                          Pig     67
## 50       Indeed          experimental design     65
## 51       Indeed        continuous Simulation     64
## 52       Indeed                         JSON     63
## 53       Indeed                          SVM     62
## 54       Indeed                          DBA     61
## 55       Indeed                        ANOVA     59
## 56       Indeed                      mapping     58
## 57       Indeed                   Simulation     58
## 58       Indeed         general linear model     57
## 59       Indeed                        Rails     57
## 60       Indeed        Surveys and Marketing     56
## 61       Indeed                     Teradata     55
## 62       Indeed                  Objective C     55
## 63       Indeed                   PostgreSQL     53
## 64       Indeed                       Oracle     52
## 65       Indeed                         SPSS     52
## 66       Indeed             Graphical Models     52
## 67       Indeed               decision trees     50
## 68       Indeed                        MySQL     49
## 69       Indeed           Spatial Statistics     47
## 70       Indeed       Systems Administration     47
## 71       Indeed                   Map/Reduce     47
## 72       Indeed                       MATLAB     46
## 73       Indeed                        Stata     46
## 74       Indeed                   cloud tech     45
## 75       Indeed                         Hive     45
## 76       Indeed                         Ruby     44
## 77       Indeed             machine learning     42
## 78       Indeed         integer Optimization     41
## 79       Indeed                         Perl     41
## 80       Indeed          global Optimization     40
## 81       Indeed          discrete Simulation     40
## 82       Indeed             Distributed Data     40
## 83       Indeed                          CSS     40
## 84       Indeed                     calculus     39
## 85       Indeed            Unstructured Data     39
## 86       Indeed                        NOSQL     38
## 87       Indeed                        RegEx     37
## 88       Indeed                   JavaScript     36
## 89       Indeed                   statistics     36
## 90       Indeed                            C     34
## 91       Indeed                visualization     34
## 92       Indeed                       Hadoop     33
## 93       Indeed                         math     33
## 94       Indeed              Structured Data     32
## 95       Indeed               Bayes networks     32
## 96       Indeed                         HTML     31
## 97       Indeed                real analysis     30
## 98       Indeed               linear algebra     30
## 99       Indeed                         Java     30
## 100      Indeed                         MCMC     29
## 101      Indeed                  neural nets     27
## 102      Indeed                            R     27
## 103      Indeed                          SAS     22
## 104      Indeed                       Python     22
## 105      Indeed                  forecasting     21
## 106      Indeed                     big data     19
## 107      Indeed                          SQL     19
## 108      Indeed          Product Development     18
## 109      Indeed       agent-based Simulation     14
## 110      Indeed           project management     11
## 111      Indeed          convex Optimization      7
## 112      Indeed         multinomial modeling      7
## 113      Indeed                       design      6
## 114      Indeed                          nix      3
## 115      Indeed technical writing/publishing      2
## 116      Indeed           web-based data-viz      1
## 117      Indeed        geographic covariates      1
## 118      Kaggle                       Python    100
## 119      Kaggle                    analytics     81
## 120      Kaggle             machine learning     67
## 121      Kaggle                         team     48
## 122      Kaggle                  programming     48
## 123      Kaggle                     business     48
## 124      Kaggle                   statistics     44
## 125      Kaggle                          SQL     41
## 126      Kaggle                communication     37
## 127      Kaggle                interpersonal     30
## 128      Kaggle                     modeling     30
## 129      Kaggle                   management     30
## 130      Kaggle                     big data     22
## 131      Kaggle                       Hadoop     19
## 132      Kaggle                       design     19
## 133      Kaggle                         Java     19
## 134      Kaggle                     creative     11
## 135      Kaggle                     research     11
## 136      Kaggle                          SAS      7
## 137      Kaggle         predictive analytics      7
## 138      Kaggle                         math      7
## 139      Kaggle                       MATLAB      7
## 140      Kaggle                   leadership      7
## 141      Kaggle                     analysis      4
## 142      Kaggle                            R      1
## 143      Kaggle                 presentation      1
## 144   RJMETRICS                     analysis    100
## 145   RJMETRICS                            R     79
## 146   RJMETRICS                       Python     74
## 147   RJMETRICS                  data mining     73
## 148   RJMETRICS             machine learning     71
## 149   RJMETRICS                   statistics     61
## 150   RJMETRICS                          SQL     54
## 151   RJMETRICS                    analytics     43
## 152   RJMETRICS                       MATLAB     31
## 153   RJMETRICS                         Java     29
## 154   RJMETRICS             algorithm design     21
## 155   RJMETRICS                     modeling     21
## 156   RJMETRICS                          C++     18
## 157   RJMETRICS                     big data     12
## 158   RJMETRICS        business intelligence     12
## 159   RJMETRICS                          SAS      9
## 160   RJMETRICS                       Hadoop      4
## 161   RJMETRICS          sofware engineering      1
## 162   RJMETRICS                     research      1
## 163   RJMETRICS                  programming      1

Note - Aggregation seems relatively straightforward using CQL and Neo4j. However, I encountered challenges recreating SQL ranking functions, such as a rank number within a group. For examples #3 and #4 above, I could not find a way to rank using CQL, quite possibly because of my limited experience with CQL as a query language, so the Neo4j results list every row in each group rather than only the top N.
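
One workaround is to leave the ranking to R after the query returns. The sketch below applies a within-group rank to the first rows of the Neo4j result for example #3, copied from the output above. It is a post-processing step in base R and not a CQL solution; the res and top5 names are hypothetical.

```r
# First rows of the Neo4j result for example 3, copied from the output above.
res <- data.frame(
  skill_type_name = c(rep("business", 7), rep("communication", 2)),
  skill_set_name  = c("Business", "Surveys and Marketing", "Creative Thinking",
                      "Systems Administration", "Product Development",
                      "Information Security", "Industry-Specific Knowledge",
                      "Communication", "Science"),
  rating = c(276, 56, 52, 47, 29, 21, 16, 249, 2),
  stringsAsFactors = FALSE
)

# Rank within each skill type; the highest rating gets rank 1.
# Ties are broken by row order, which may differ from the MySQL view.
res$rank <- ave(-res$rating, res$skill_type_name,
                FUN = function(x) rank(x, ties.method = "first"))

# Keep the top 5 skill sets per skill type
top5 <- res[res$rank <= 5, ]
top5
```

For these rows, the filter keeps the five business skill sets and the two communication skill sets that also appear in the MySQL view, and drops Information Security and Industry-Specific Knowledge.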

Conclusion - Neo4j vs. MySQL

Because I was new to Neo4j at the beginning of this project, one immediate disadvantage was the learning curve associated with simply using a NoSQL database. Relational databases and SQL are much more familiar in terms of usage, language features, and constructs; ranking within groups, for example, was readily available in the MySQL views but not something I could reproduce in CQL. However, a graph database such as Neo4j seems to be much better at describing, using, and leveraging relationships between entities (nodes) for analysis. Visualizing data as a graph is also a standard feature of Neo4j, unlike relational databases, which is valuable as well.