For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
For this assignment, we modified a table previously created for Assignment 2, where the table described peer ratings for a sample of 2019 films. The table combined two types of attributes: descriptive information and the ratings. To create a proper property graph model of the information stored in this table, we split the table in two: “movie_ratings” and “movies”, where “movie_ratings” describes “movies”.
The migration of this data from PostgreSQL to Neoj4 requires the following steps: 1. export the data to csv 2. import the data in graph structure
migration
# connect to postgresql db
con <- dbConnect(RPostgres::Postgres()
, host="amrepdev.ckxhi71v1dqf.us-east-1.rds.amazonaws.com"
, port="5432"
, dbname="postgres"
, user="nichung"
, password=Sys.getenv("pwd"))
dbExistsTable(con, "movies")
## [1] TRUE
# store tables in dataframes
movies <- dbGetQuery(con, "
SELECT * FROM movies
")
movie_ratings <- dbGetQuery(con, "
SELECT * FROM movie_ratings
")
# write dataframes to csvs
write.csv(movies, "data/movies.csv", row.names = FALSE)
write.csv(movie_ratings, "data/movie_ratings.csv", row.names = FALSE)
con <- neo4j_api$new(
url = "http://localhost:7474",
user = "neo4j",
password = "redy2rok"
)
con$ping() # ping server
## [1] 200
con$get_version() # check neo4j version
## [1] "3.5.12"
on_load_query_m <- 'CREATE (n:Movies)
SET n = row,
n.movieId = toInteger(row.movieId),
n.title = toString(row.title),
n.releasedate = toInteger(row.releasedate),
n.budget = toInteger(row.budget);'
on_load_query_r <- 'CREATE (n:Ratings)
SET n = row,
n.movieId = toInteger(row.movieId),
n.score1 = toInteger(row.score1),
n.score2 = toInteger(row.score2),
n.score3 = toInteger(row.score3),
n.score4 = toInteger(row.score4),
n.score5 = toInteger(row.score5);'
# send the csvs
load_csv(url = "https://raw.githubusercontent.com/nichung/cuny_msds/master/data_607/week_12/data/movies.csv",
con = con, header = TRUE, periodic_commit = 50,
as = "row", on_load = on_load_query_m)
## No data returned.
## # A tibble: 12 x 2
## type value
## <chr> <dbl>
## 1 contains_updates 1
## 2 nodes_created 6
## 3 nodes_deleted 0
## 4 properties_set 42
## 5 relationships_created 0
## 6 relationship_deleted 0
## 7 labels_added 6
## 8 labels_removed 0
## 9 indexes_added 0
## 10 indexes_removed 0
## 11 constraints_added 0
## 12 constraints_removed 0
load_csv(url = "https://raw.githubusercontent.com/nichung/cuny_msds/master/data_607/week_12/data/movie_ratings.csv",
con = con, header = TRUE, periodic_commit = 50,
as = "row", on_load = on_load_query_r)
## No data returned.
## # A tibble: 12 x 2
## type value
## <chr> <dbl>
## 1 contains_updates 1
## 2 nodes_created 6
## 3 nodes_deleted 0
## 4 properties_set 66
## 5 relationships_created 0
## 6 relationship_deleted 0
## 7 labels_added 6
## 8 labels_removed 0
## 9 indexes_added 0
## 10 indexes_removed 0
## 11 constraints_added 0
## 12 constraints_removed 0
# confirm migration
con$get_labels() # get node labels
## # A tibble: 2 x 1
## labels
## <chr>
## 1 Ratings
## 2 Movies
"MATCH (n:Movies) RETURN n LIMIT 25;" %>% # get node
call_neo4j(con)
## $n
## # A tibble: 24 x 3
## movieid title budget
## <chr> <chr> <int>
## 1 10121 Avengers: Endgame 356000000
## 2 10122 Once Upon a Time in Hollywood 96000000
## 3 10123 The Farewell 3000000
## 4 10124 Where'd You Go Bernadette 18000000
## 5 10125 Midsommar 10000000
## 6 10126 The Plagiarists NA
## 7 10121 Avengers: Endgame 356000000
## 8 10122 Once Upon a Time in Hollywood 96000000
## 9 10123 The Farewell 3000000
## 10 10124 Where'd You Go Bernadette 18000000
## # … with 14 more rows
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (n:Ratings) RETURN n LIMIT 25;" %>%
call_neo4j(con)
## $n
## # A tibble: 24 x 6
## score2 score3 score4 score5 movieid score1
## <int> <int> <int> <int> <chr> <int>
## 1 3 5 5 4 10121 4
## 2 4 3 4 4 10122 5
## 3 5 5 4 5 10123 5
## 4 3 3 4 2 10124 4
## 5 5 4 5 4 10125 5
## 6 4 3 4 4 10126 3
## 7 3 5 5 4 10121 4
## 8 4 3 4 4 10122 5
## 9 5 5 4 5 10123 5
## 10 3 3 4 2 10124 4
## # … with 14 more rows
##
## attr(,"class")
## [1] "neo" "neo" "list"
# create relationships
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10121' AND b.movieid = '10121' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10122' AND b.movieid = '10122' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10123' AND b.movieid = '10123' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10124' AND b.movieid = '10124' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10125' AND b.movieid = '10125' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10126' AND b.movieid = '10126' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
call_neo4j(con)
## $r
## # A tibble: 0 x 0
##
## attr(,"class")
## [1] "neo" "neo" "list"
For a simple dataset like the subject of this migration, a relational database works just fine: there are no complex relationships and the data itself is well structured. As a result, the queries are simple and require few joins. On Neo4j, however, it would be possible to add nested heirarchies and potentially even the movie files themselves. While the graph database is more flexible, it is glaringly over-powered for this application.