Yun Mai
April 30, 2017
Assignment:
For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the "data skills" database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
Overview:
I will migrate the movie_rating database generated in week_2 homework to Neo4j. Nodes will be created first followed by buiding relationships. At last the graph will be viewed with visualization tool.
devtools::install_github("nicolewhite/Rneo4j")
install.packages("RMySQL")
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 3.3.3
## Loading required package: DBI
library(RNeo4j)
library(knitr)
# connect to MySQL local database
movie_con <- dbConnect(MySQL(), user="root", password=mysql_pw, dbname = "movie_nosql", host = "localhost")
#Send query to MySQL and retrieve the movie data from MySQL.
query.movie <- dbSendQuery(movie_con, "SELECT * FROM movie")
movie.data <- dbFetch(query.movie, n=-1)
#Send query to MySQL and retrieve the friends data from MySQL. Friends are the people who rated the movies
query.friends <- dbSendQuery(movie_con, "SELECT * FROM friends")
friends.data <- dbFetch(query.friends, n=-1)
#Send query to MySQL and retrieve the rating data from MySQL.
query.rating <- dbSendQuery(movie_con, "SELECT * FROM rating")
rating.data <- dbFetch(query.rating, n=-1)
#send query to MySQL and retrieve the actor data from MySQL
query.actor <- dbSendQuery(movie_con, "SELECT * FROM actor")
## Warning in .local(conn, statement, ...): Unsigned INTEGER in col 3 imported
## as numeric
actor.data <- dbFetch(query.actor, n=-1)
#send query to MySQL and retrieve the director data from MySQL
query.director <- dbSendQuery(movie_con, "SELECT * FROM director")
director.data <- dbFetch(query.director, n=-1)
#send query to MySQL and retrieve the writer data from MySQL
query.writer <- dbSendQuery(movie_con, "SELECT * FROM writer")
writer.data <- dbFetch(query.writer, n=-1)
#send query to MySQL and retrieve the genre data from MySQL
query.genre <- dbSendQuery(movie_con, "SELECT * FROM genre")
genre.data <- dbFetch(query.genre, n=-1)
#send query to MySQL and retrieve the country data from MySQL
query.country <- dbSendQuery(movie_con, "SELECT * FROM country")
country.data <- dbFetch(query.country, n=-1)
#send query to MySQL and retrieve the country data from MySQL
query.omdb <- dbSendQuery(movie_con, "SELECT * FROM omdb")
omdb.data <- dbFetch(query.omdb, n=-1)
kable(head(rating.data, n=7), caption = "Rating - the main table")
row_names | MovieID | MovieName | FriendID | FriendName | FriendRating |
---|---|---|---|---|---|
1 | 1 | The Fate of the Furious | 1 | Ming | 2.0 |
7 | 1 | The Fate of the Furious | 2 | Hao | 1.0 |
13 | 1 | The Fate of the Furious | 3 | Alison | 4.5 |
19 | 1 | The Fate of the Furious | 4 | Eran | 3.0 |
25 | 1 | The Fate of the Furious | 5 | Orshi | 2.5 |
2 | 2 | Star Wars: The Last Jedi | 1 | Ming | 4.0 |
8 | 2 | Star Wars: The Last Jedi | 2 | Hao | 4.5 |
kable(head(movie.data, n=6),caption = "Movie Table")
row_names | MovieID | MovieName |
---|---|---|
1 | 1 | The Fate of the Furious |
2 | 2 | Star Wars: The Last Jedi |
3 | 3 | Guardians of the Galaxy Vol. |
4 | 4 | Thor: Ragnarok |
5 | 5 | Beauty and the Beast |
6 | 6 | Logan |
kable(head(friends.data, n=5), caption = "Audience Table")
row_names | FriendsID | FriendsName |
---|---|---|
1 | 1 | Ming |
2 | 2 | Hao |
3 | 3 | Alison |
4 | 4 | Eran |
5 | 5 | Orshi |
kable(head(actor.data, n=5),caption = "Actor Table")
row_names | actors | title | actorID |
---|---|---|---|
1 | Scott Eastwood | The Fate of the Furious | 1 |
2 | Charlize Theron | The Fate of the Furious | 2 |
3 | Dwayne Johnson | The Fate of the Furious | 3 |
4 | Vin Diesel | The Fate of the Furious | 4 |
5 | Tom Hardy | Star Wars: The Last Jedi | 5 |
kable(head(director.data, n=5),caption = "Director Table")
row_names | directors | title | directorID |
---|---|---|---|
1 | F. Gary Gray | The Fate of the Furious | 1 |
2 | Rian Johnson | Star Wars: The Last Jedi | 2 |
3 | James Gunn | Guardians of the Galaxy Vol. | 3 |
4 | Taika Waititi | Thor: Ragnarok | 4 |
5 | Bill Condon | Beauty and the Beast | 5 |
kable(head(writer.data, n=5),caption = "Writer Table")
row_names | writers | title | writerID |
---|---|---|---|
1 | Chris Morgan | The Fate of the Furious | 1 |
2 | Gary Scott Thompson | The Fate of the Furious | 2 |
3 | Rian Johnson | Star Wars: The Last Jedi | 3 |
4 | George Lucas | Star Wars: The Last Jedi | 4 |
5 | James Gunn | Guardians of the Galaxy Vol. | 5 |
kable(head(genre.data, n=3),caption = "Genre Table")
row_names | genres | title | genreID |
---|---|---|---|
1 | Action | The Fate of the Furious | 1 |
2 | Crime | The Fate of the Furious | 2 |
3 | Thriller | The Fate of the Furious | 3 |
kable(head(country.data, n=3),caption = "Country Table")
row_names | countries | title | countryID |
---|---|---|---|
1 | USA | The Fate of the Furious | 1 |
2 | France | The Fate of the Furious | 2 |
3 | Canada | The Fate of the Furious | 3 |
kable(head(omdb.data, n=3),caption = "Omdb Table")
row_names | Title | Year | imdbID | genre | actor | country | director | writer | MovieID |
---|---|---|---|---|---|---|---|---|---|
1 | The Fate of the Furious | 2017 | tt4630562 | Action,Crime,Thriller | Scott Eastwood,Charlize Theron,Dwayne Johnson,Vin Diesel | USA,France,Canada,UK,Samoa | F. Gary Gray | Chris Morgan,Gary Scott Thompson (based on characters created by) | 1 |
2 | Star Wars: The Last Jedi | 2017 | tt2527336 | Action, Adventure, Fantasy | Tom Hardy, Daisy Ridley, Adam Driver, Mark Hamill | USA | Rian Johnson | Rian Johnson (screenplay), George Lucas (characters) | 2 |
3 | Guardians of the Galaxy Vol. 2 | 2017 | tt3896198 | Action, Adventure, Sci-Fi | Chris Pratt, Karen Gillan, Vin Diesel, Sylvester Stallone | USA | James Gunn | James Gunn, Dan Abnett (based on the Marvel comic book by), Andy Lanning (based on the Marvel comic book by), Stan Lee (characters), Jack Kirby (characters), Gene Colan (characters), Arnold Drake (characters), Steve Englehart (characters), Steve Gan (characters), Jim Starlin (characters) | 3 |
Export data as .csv files
write.csv(movie.data, "movie.csv")
write.csv(friends.data, "friends.csv")
write.csv(rating.data, "rating.csv")
write.csv(actor.data, "actor.csv")
write.csv(director.data, "director.csv")
write.csv(writer.data, "writer.csv")
write.csv(genre.data, "genre.csv")
write.csv(country.data, "country.csv")
write.csv(omdb.data, "omdb.csv")
Disconnect from MySQL
dbDisconnect(movie_con)
## Warning: Closing open result sets
## [1] TRUE
Download .csv files
friends_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/friends.csv", header=TRUE, sep=",", stringsAsFactors = F)
movie_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/movie.csv", header=TRUE, sep=",", stringsAsFactors = F)
rating_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/rating.csv", header=TRUE, sep=",", stringsAsFactors = F)
actor_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/actor.csv", header=T, sep=",", stringsAsFactors = F)
director_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/director.csv", header=TRUE, sep=",", stringsAsFactors = F)
writer_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/writer.csv", header=TRUE, sep=",", stringsAsFactors = F)
genre_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/genre.csv", header=TRUE, sep=",", stringsAsFactors = F)
country_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/country.csv", header=TRUE, sep=",", stringsAsFactors = F)
omdb_d <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_homework13/omdb.csv", header=TRUE, sep=",", stringsAsFactors = F)
#connect to Neo4j
graph = startGraph("http://localhost:7474/db/data", username = "neo4j", password = neo4j.pw)
#delete all nodes, relationships, indexes, and constraints from the graph database
clear(graph, input = F)
# nodes to be created:
# Nodes:
# movie
# audience
# actor
# director
# writer
# genre
# country
# the same nodes will be merged.
query = "
MERGE (movie:Movie {name: {title1}, id:{movieid1}})
CREATE (audience:Audience {name:{audname},id1:{audid},id2:{movieid2}, rating:{friendrating}})
CREATE (audience)-[r:RATED]->(movie)
SET r.rating = {friendrating}
RETURN audience,movie
"
tx = newTransaction(graph)
for (i in 1:nrow(rating_d)){
appendCypher(tx,query,
title1 = rating_d$MovieName[i],
movieid1 = rating_d$MovieID[i],
audname = rating_d$FriendName[i],
audid = rating_d$FriendID[i],
movieid2 = rating_d$MovieID[i],
friendrating = rating_d$FriendRating[i])
}
commit(tx)
summary(graph)
## This To That
## 1 Audience RATED Movie
I used MATCH clauses in the above query to create relationships but error occured. It seemed that Neo4j did not allow to have multiple statements within a query. Then I removed MATCH clauses but only kept CREAT clause to create relationship.
Then I attempted to create relationships and nodes with different lables in separate queries using the same approach but did not work.
I tried to fix the porblem but without sucess. Sometime the error message is: the newTransaction(graph) has expired.
This following code chunk did not work. The error is: Error in appendCypher.transaction(tx, query) : Neo.ClientError.Statement.SyntaxError Invalid input ')': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or '}' (line 2, column 50 (offset: 60)) "CREATE (actor:Actor {name:{actname}, id:{actorid}), title:{movietitle})" ^".
So I put them as statements for further troubleshooting in the future.
# query = "
# MATCH (movie:Movie {id:{movieid1}})
# MATCH (audience:Audience {id1:{audid})
# CREATE (audience)-[r:RATED]->(movie)
# SET r.rating = {friendrating}
# RETURN audience,movie
#"
#tx = newTransaction(graph)
#for (i in 1:nrow(actor_d)){
# appendCypher(tx,query,
# actname = actor_d$actors[i],
# actorid = actor_d$actorID[i],
# movietitle = actor_d$title[i],
# title1 = actor_d$title[i])
#}
#commit(tx)
# query = "
# CREATE (actor:Actor {name:{actname}, id:{actorid}), #title:{movietitle})
# MATCH (movie:Movie {name: {title1}})
# MATCH (actor:Actor {name:{actname}})
# CREATE (actor)-[a:ACTED_IN]->(movie)
#"
#tx = newTransaction(graph)
#appendCypher(tx,query)
#commit(tx)
Since I could not create relationship and more nodes. I tried a different way to create graphes.
# nodes to be created:
# Nodes:
# movie
# audience
# actor
# director
# writer
# genre
# country
# the same nodes will be merged.
# Relationships
# movie -[:belongs_to]-> genre
# movie -[:produced_in]-> country
# audience -[:rated_in]-> movie
# actor -[:acted_in]-> movie
# director -[:directed]-> movie
# writer -[:writed_scripts]-> movie
#connect to Neo4j
graph = startGraph("http://localhost:7474/db/data", username = "neo4j", password = neo4j.pw)
#delete all nodes, relationships, indexes, and constraints from the graph database
clear(graph, input = F)
for (i in 1:nrow(rating_d)){
movie.node <- createNode(graph, "Movie", name=rating_d$MovieName[i], id=rating_d$MovieID[i])
}
for (i in 1:nrow(rating_d)){
audience.node <- createNode(graph, "Audience", name=rating_d$FriendName[i], id1=rating_d$FriendID[i],id2=rating_d$MovieID[i],moviename=movie_d$MovieName[i])
}
for (i in 1:nrow(actor_d)){
actor.node <- createNode(graph, "Actor", name=actor_d$actors[i], id=actor_d$actorID[i], movietitle = actor_d$title[i])
}
for (i in 1:nrow(director_d)){
director.node <- createNode(graph, "Director", name=director_d$directors[i], id=director_d$directorID[i], movietitle = director_d$title[i])
}
for (i in 1:nrow(writer_d)){
writer.node <- createNode(graph, "Writer", name=writer_d$writers[i], id=writer_d$writerID[i], movietitle = writer_d$title[i])
}
for (i in 1:nrow(genre_d)){
genre.node <- createNode(graph, "Genre", name=genre_d$genres[i], id=genre_d$genreID[i], movietitle = genre_d$title[i])
}
for (i in 1:nrow(country_d)){
country.node <- createNode(graph, "Country", name=country_d$countries[i], id=country_d$countryID[i], movietitle = country_d$title[i])
}
** 2.2.1. Relationship between audiences and movies**
movie_query = "
MATCH(a:Movie) WHERE a.id={id}
RETURN a
"
audience_query = "
MATCH(b:Audience) WHERE b.id2={id2} AND b.name = {name}
RETURN b
"
for (i in 1:nrow(rating_d)) {
movie_rated <- getSingleNode(graph, movie_query, id=rating_d$MovieID[i])
audience_rated <- getSingleNode(graph, audience_query, id2=rating_d$MovieID[i], name=rating_d$FriendName[i])
createRel(audience_rated, "RATED", movie_rated, rating = rating_d$FriendRating[i])
}
summary(graph)
## This To That
## 1 Audience RATED Movie
2.2.2 Relationship between actors and movies
movie_query = "
MATCH(a:Movie) WHERE a.name={name}
RETURN a
"
actor_query = "
MATCH(b:Actor) WHERE b.movietitle={movietitle} AND b.name={name}
RETURN b
"
for (i in 1:nrow(actor_d)){
movie_rated <- getSingleNode(graph, movie_query, name=rating_d$MovieName[i])
actor_related <- getSingleNode(graph, actor_query, name=actor_d$actors[i], id=actor_d$id[i], movietitle=actor_d$title[i])
createRel(actor_related, "ACTED_IN", movie_rated)
}
summary(graph)
## This To That
## 1 Actor ACTED_IN Movie
## 2 Audience RATED Movie
2.2.3 Relationship between directors and movies
movie_query = "
MATCH(a:Movie) WHERE a.name={name}
RETURN a
"
director_query = "
MATCH(b:Director) WHERE b.movietitle={movietitle} AND b.name={name}
RETURN b
"
for (i in 1:nrow(director_d)){
movie_rated <- getSingleNode(graph, movie_query, name=rating_d$MovieName[i])
director_related <- getSingleNode(graph, director_query, name=director_d$directors[i], movietitle=director_d$title[i])
createRel(director_related, "DIRECTED", movie_rated)
}
summary(graph)
## This To That
## 1 Director DIRECTED Movie
## 2 Actor ACTED_IN Movie
## 3 Audience RATED Movie
2.2.4 Relationship between writers and movies
movie_query = "
MATCH(a:Movie) WHERE a.name={name}
RETURN a
"
writer_query = "
MATCH(b:Writer) WHERE b.movietitle={movietitle} AND b.name={name}
RETURN b
"
for (i in 1:nrow(writer_d)){
movie_rated <- getSingleNode(graph, movie_query, name=rating_d$MovieName[i])
writer_related <- getSingleNode(graph, writer_query, name=writer_d$writers[i], movietitle=writer_d$title[i])
createRel(writer_related, "WROTE", movie_rated)
}
summary(graph)
## This To That
## 1 Writer WROTE Movie
## 2 Director DIRECTED Movie
## 3 Actor ACTED_IN Movie
## 4 Audience RATED Movie
2.2.5 Relationship between genres and movies
movie_query = "
MATCH(a:Movie) WHERE a.name={name}
RETURN a
"
genre_query = "
MATCH(b:Genre) WHERE b.movietitle={movietitle} AND b.name={name}
RETURN b
"
for (i in 1:nrow(genre_d)){
movie_rated <- getSingleNode(graph, movie_query, name=rating_d$MovieName[i])
genre_related <- getSingleNode(graph, genre_query, name=genre_d$genres[i], movietitle=genre_d$title[i])
createRel(movie_rated, "BELONGS_TO", genre_related)
}
summary(graph)
## This To That
## 1 Writer WROTE Movie
## 2 Director DIRECTED Movie
## 3 Actor ACTED_IN Movie
## 4 Audience RATED Movie
## 5 Movie BELONGS_TO Genre
2.2.6 Relationship between countries and movies
movie_query = "
MATCH(a:Movie) WHERE a.name={name}
RETURN a
"
country_query = "
MATCH(b:Country) WHERE b.movietitle={movietitle} AND b.name={name}
RETURN b
"
for (i in 1:nrow(country_d)){
movie_rated <- getSingleNode(graph, movie_query, name=rating_d$MovieName[i])
country_related <- getSingleNode(graph, country_query, name=country_d$countries[i], movietitle=country_d$title[i])
createRel(movie_rated, "PRODUCED_IN", country_related)
}
summary(graph)
## This To That
## 1 Writer WROTE Movie
## 2 Director DIRECTED Movie
## 3 Actor ACTED_IN Movie
## 4 Audience RATED Movie
## 5 Movie BELONGS_TO Genre
## 6 Movie PRODUCED_IN Country
tabular results could be returned by using Cypher. As introduced in Nicole White's GitHub RNeo4j repo "https://github.com/nicolewhite/RNeo4j#nodes"
query = "
MATCH (audience:Audience)-[r:RATED]->(movie:Movie)
WHERE audience.name = 'Ming'
RETURN audience.name, r.rating, movie.name
"
cypher(graph, query)
## audience.name r.rating movie.name
## 1 Ming 2.0 The Fate of the Furious
## 2 Ming 4.0 Star Wars: The Last Jedi
## 3 Ming 4.0 Guardians of the Galaxy Vol.
## 4 Ming 1.5 Thor: Ragnarok
## 5 Ming 4.0 Beauty and the Beast
## 6 Ming 5.0 Logan
When returning relative more complicated results:
query = "
MATCH (audience:Audience)-[r:RATED]->(movie:Movie)
WHERE audience.name = 'Ming'
RETURN audience.name, COLLECT(movie.movie) AS seen
"
cypherToList(graph, query)
## [[1]]
## [[1]]$audience.name
## [1] "Ming"
##
## [[1]]$seen
## list()
query = "
MATCH (audience:Audience)-[r:RATED]-(movie:Movie {name:'Beauty and the Beast'})
RETURN audience.name, movie.name,r.rating"
cypher(graph, query)
## audience.name movie.name r.rating
## 1 Orshi Beauty and the Beast 4
## 2 Eran Beauty and the Beast 5
## 3 Alison Beauty and the Beast 5
## 4 Hao Beauty and the Beast 4
## 5 Ming Beauty and the Beast 4
Parameters can be passed
query = "
MATCH (audience:Audience)-[r:RATED]-> (movie:Movie)
WHERE audience.name = {name1} AND movie.name = {name2}
RETURN audience.name, r.rating, movie.name
"
cypher(graph, query, name1="Hao", name2="The Fate of the Furious")
## audience.name r.rating movie.name
## 1 Hao 1 The Fate of the Furious
graph = startGraph("http://localhost:7474/db/data/", username="neo4j", password=neo4j.pw)
browse(graph)
My Rstudio somehow cannot show Neo4j browser. Cypher clauses will be put in Neo4j to build the database. One of the graphes is attached(atudience-rate-movie).