Assignment 12 - Migrate data from MySQL to Neo4j

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

## Loading required package: DBI
## Warning: changing locked binding for 'length.path' in 'httr' whilst loading
## 'RNeo4j'
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

This project extends the “Movies” database that was created in Week 2.

Rpubs output from Week 2: http://rpubs.com/myampol/MY607-Week02-MichaelY-Movies

R code from Week 2: https://raw.githubusercontent.com/myampol/MY607/master/MY-DATA607-Week02-Movies.Rmd

SQL code from Week 2: https://raw.githubusercontent.com/myampol/MY607/master/MY-DATA607-Week02-Movies.sql

It assumes that the above database has already been created in a MySQL database which is on the same machine as the Neo4j database into which the data will be loaded, below.

Pull data from MySQL

We will not dump the data into files (e.g., .csv files) and then upload such files into Neo4j.

Rather, we will pull the data into R, and then push the data from R directly into Neo4j.

Query each MySQL table separately

Friend_id Friend_name
1 Andrew
2 Bernard
3 Charlie
4 Dilbert
5 Ernesto
Movie_id Movie_title
6 Aladdin (2019)
1 Avengers: Endgame
4 Captain Marvel
5 Spider-Man: Far from Home
2 The Lion King (2019)
3 Toy Story 4
Movie_id Friend_id Rating
1 1 4
1 2 1
1 3 3
1 4 2
1 5 1
2 1 2
2 2 1
2 3 2
2 4 5
2 5 3
3 1 3
3 2 2
3 3 2
3 4 1
3 5 NA
4 1 4
4 2 5
4 3 5
4 4 2
4 5 4
5 1 2
5 2 1
5 3 1
5 4 2
5 5 2
6 1 1
6 2 5
6 3 1
6 4 2
6 5 4

Get movies, friends, and ratings data from MySQL in a single query

Movie_title Friend_name Rating
Aladdin (2019) Andrew 1
Aladdin (2019) Bernard 5
Aladdin (2019) Charlie 1
Aladdin (2019) Dilbert 2
Aladdin (2019) Ernesto 4
Avengers: Endgame Andrew 4
Avengers: Endgame Bernard 1
Avengers: Endgame Charlie 3
Avengers: Endgame Dilbert 2
Avengers: Endgame Ernesto 1
Captain Marvel Andrew 4
Captain Marvel Bernard 5
Captain Marvel Charlie 5
Captain Marvel Dilbert 2
Captain Marvel Ernesto 4
Spider-Man: Far from Home Andrew 2
Spider-Man: Far from Home Bernard 1
Spider-Man: Far from Home Charlie 1
Spider-Man: Far from Home Dilbert 2
Spider-Man: Far from Home Ernesto 2
The Lion King (2019) Andrew 2
The Lion King (2019) Bernard 1
The Lion King (2019) Charlie 2
The Lion King (2019) Dilbert 5
The Lion King (2019) Ernesto 3
Toy Story 4 Andrew 3
Toy Story 4 Bernard 2
Toy Story 4 Charlie 2
Toy Story 4 Dilbert 1
Toy Story 4 Ernesto NA

Note that there are only 29 ratings, as one person (Ernesto) did not rate one film (Toy Story 4).


Connect to Neo4j

We will establish a connection to Neo4j. This requires that the database be up and running. An (empty) database needs to have already been created as the default database. (It should be empty, because the next step will delete all nodes, relationships, indexes, and constraints from the graph database…)

Load the ratings into Neo4j

The ratings are “relationships” connecting the Friend and Movie nodes

By creating 5 separate relationships, based upon rating, we can assign different colors to the ratings in the graph below.

## [1] "1 :  Andrew  rated  Avengers: Endgame  a  4"
## [1] "2 :  Bernard  rated  Avengers: Endgame  a  1"
## [1] "3 :  Charlie  rated  Avengers: Endgame  a  3"
## [1] "4 :  Dilbert  rated  Avengers: Endgame  a  2"
## [1] "5 :  Ernesto  rated  Avengers: Endgame  a  1"
## [1] "6 :  Andrew  rated  The Lion King (2019)  a  2"
## [1] "7 :  Bernard  rated  The Lion King (2019)  a  1"
## [1] "8 :  Charlie  rated  The Lion King (2019)  a  2"
## [1] "9 :  Dilbert  rated  The Lion King (2019)  a  5"
## [1] "10 :  Ernesto  rated  The Lion King (2019)  a  3"
## [1] "11 :  Andrew  rated  Toy Story 4  a  3"
## [1] "12 :  Bernard  rated  Toy Story 4  a  2"
## [1] "13 :  Charlie  rated  Toy Story 4  a  2"
## [1] "14 :  Dilbert  rated  Toy Story 4  a  1"
## [1] "15 : ***  Ernesto  DID NOT RATE  Toy Story 4"
## [1] "16 :  Andrew  rated  Captain Marvel  a  4"
## [1] "17 :  Bernard  rated  Captain Marvel  a  5"
## [1] "18 :  Charlie  rated  Captain Marvel  a  5"
## [1] "19 :  Dilbert  rated  Captain Marvel  a  2"
## [1] "20 :  Ernesto  rated  Captain Marvel  a  4"
## [1] "21 :  Andrew  rated  Spider-Man: Far from Home  a  2"
## [1] "22 :  Bernard  rated  Spider-Man: Far from Home  a  1"
## [1] "23 :  Charlie  rated  Spider-Man: Far from Home  a  1"
## [1] "24 :  Dilbert  rated  Spider-Man: Far from Home  a  2"
## [1] "25 :  Ernesto  rated  Spider-Man: Far from Home  a  2"
## [1] "26 :  Andrew  rated  Aladdin (2019)  a  1"
## [1] "27 :  Bernard  rated  Aladdin (2019)  a  5"
## [1] "28 :  Charlie  rated  Aladdin (2019)  a  1"
## [1] "29 :  Dilbert  rated  Aladdin (2019)  a  2"
## [1] "30 :  Ernesto  rated  Aladdin (2019)  a  4"

Display the set of relationships

##     This      To  That
## 1 Friend RATES 1 Movie
## 2 Friend RATES 2 Movie
## 3 Friend RATES 4 Movie
## 4 Friend RATES 3 Movie
## 5 Friend RATES 5 Movie

This shows that we have implemented 5 separate relationships, “Friend –> RATES {1|2|3|4|5} –> Movie”

This enables us to assign a different color to the different ratings arrows.

Below I have assigned red to the poor ratings (1 and 2), yellow to the neutral rating (3), and green to the best ratings (4 and 5). This indicates that everybody hated Spiderman: Far from Home, while everybody (except Dilbert) liked Captain Marvel. (However, Dilbert hated everything except for the Lion King, so perhaps his opinion can be discounted…)

Neo4j graph

In the lower right, note the absence of any arrow between Ernesto and Toy Story 4, as Ernesto did not rate that film.

The relationship highlighted in yellow indicates that Andrew gave a rating of “4” to Avengers:Endgame, which is also reflected in this tabular view"

Neo4j tabular listing

Pull the data back from Neo4j

f.name m.title r.rating
Andrew Aladdin (2019) 1
Andrew Spider-Man: Far from Home 2
Andrew Captain Marvel 4
Andrew Toy Story 4 3
Andrew The Lion King (2019) 2
Andrew Avengers: Endgame 4
Ernesto Aladdin (2019) 4
Ernesto Spider-Man: Far from Home 2
Ernesto Captain Marvel 4
Ernesto The Lion King (2019) 3
Ernesto Avengers: Endgame 1
Dilbert Aladdin (2019) 2
Dilbert Spider-Man: Far from Home 2
Dilbert Captain Marvel 2
Dilbert Toy Story 4 1
Dilbert The Lion King (2019) 5
Dilbert Avengers: Endgame 2
Charlie Aladdin (2019) 1
Charlie Spider-Man: Far from Home 1
Charlie Captain Marvel 5
Charlie Toy Story 4 2
Charlie The Lion King (2019) 2
Charlie Avengers: Endgame 3
Bernard Aladdin (2019) 5
Bernard Spider-Man: Far from Home 1
Bernard Captain Marvel 5
Bernard Toy Story 4 2
Bernard The Lion King (2019) 1
Bernard Avengers: Endgame 1

Plot the graph using igraph

The resulting graph is not easy to read. (More options are needed to make this graph look better.)


Advantages of SQL vs. Neo4j

You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

The advantage of storing the data in the NoSQL database is that we can traverse relationships to determine who liked which movie. In the relational database, it is less easy to visualize the relationships.

Here, however, there is a disadvantage in that it can become difficult to visualize a large number of relationships, especially if the number of movies and the number of reviewers were larger.

In the present dataset, with only a single exception, every friend rated every movie. In the case of a much larger dataset, it is more likely that the data would be sparse, i.e., there would be many (friend,movie) combinations which would be unrated because there would be too many movies for every person to see.