Assignment

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

approach

For this assignment, we modified a table previously created for Assignment 2, where the table described peer ratings for a sample of 2019 films. The table combined two types of attributes: descriptive information and the ratings. To create a proper property graph model of the information stored in this table, we split the table in two: “movie_ratings” and “movies”, where “movie_ratings” describes “movies”.

The migration of this data from PostgreSQL to Neoj4 requires the following steps: 1. export the data to csv 2. import the data in graph structure

migration

migration

migration

  1. export the data to csv
# connect to postgresql db
con <- dbConnect(RPostgres::Postgres()
     , host="amrepdev.ckxhi71v1dqf.us-east-1.rds.amazonaws.com"
     , port="5432"
     , dbname="postgres"
     , user="nichung"
     , password=Sys.getenv("pwd"))
dbExistsTable(con, "movies")
## [1] TRUE
# store tables in dataframes
movies <- dbGetQuery(con, "
  SELECT * FROM movies
")
movie_ratings <- dbGetQuery(con, "
  SELECT * FROM movie_ratings                            
")

# write dataframes to csvs
write.csv(movies, "data/movies.csv", row.names = FALSE)
write.csv(movie_ratings, "data/movie_ratings.csv", row.names = FALSE)
  1. import the data in graph structure
con <- neo4j_api$new(
  url = "http://localhost:7474",
  user = "neo4j", 
  password = "redy2rok"
  )

con$ping() # ping server
## [1] 200
con$get_version() # check neo4j version
## [1] "3.5.12"
on_load_query_m <- 'CREATE (n:Movies)
  SET n = row,
  n.movieId = toInteger(row.movieId),
  n.title = toString(row.title),
  n.releasedate = toInteger(row.releasedate),
  n.budget = toInteger(row.budget);'

on_load_query_r <- 'CREATE (n:Ratings)
  SET n = row,
  n.movieId = toInteger(row.movieId),
  n.score1 = toInteger(row.score1),
  n.score2 = toInteger(row.score2),
  n.score3 = toInteger(row.score3),
  n.score4 = toInteger(row.score4),
  n.score5 = toInteger(row.score5);'

# send the csvs 
load_csv(url = "https://raw.githubusercontent.com/nichung/cuny_msds/master/data_607/week_12/data/movies.csv", 
         con = con, header = TRUE, periodic_commit = 50, 
         as = "row", on_load = on_load_query_m)
## No data returned.
## # A tibble: 12 x 2
##    type                  value
##    <chr>                 <dbl>
##  1 contains_updates          1
##  2 nodes_created             6
##  3 nodes_deleted             0
##  4 properties_set           42
##  5 relationships_created     0
##  6 relationship_deleted      0
##  7 labels_added              6
##  8 labels_removed            0
##  9 indexes_added             0
## 10 indexes_removed           0
## 11 constraints_added         0
## 12 constraints_removed       0
load_csv(url = "https://raw.githubusercontent.com/nichung/cuny_msds/master/data_607/week_12/data/movie_ratings.csv", 
         con = con, header = TRUE, periodic_commit = 50, 
         as = "row", on_load = on_load_query_r)
## No data returned.
## # A tibble: 12 x 2
##    type                  value
##    <chr>                 <dbl>
##  1 contains_updates          1
##  2 nodes_created             6
##  3 nodes_deleted             0
##  4 properties_set           66
##  5 relationships_created     0
##  6 relationship_deleted      0
##  7 labels_added              6
##  8 labels_removed            0
##  9 indexes_added             0
## 10 indexes_removed           0
## 11 constraints_added         0
## 12 constraints_removed       0
# confirm migration
con$get_labels() # get node labels
## # A tibble: 2 x 1
##   labels 
##   <chr>  
## 1 Ratings
## 2 Movies
"MATCH (n:Movies) RETURN n LIMIT 25;" %>% # get node
  call_neo4j(con)
## $n
## # A tibble: 24 x 3
##    movieid title                            budget
##    <chr>   <chr>                             <int>
##  1 10121   Avengers: Endgame             356000000
##  2 10122   Once Upon a Time in Hollywood  96000000
##  3 10123   The Farewell                    3000000
##  4 10124   Where'd You Go Bernadette      18000000
##  5 10125   Midsommar                      10000000
##  6 10126   The Plagiarists                      NA
##  7 10121   Avengers: Endgame             356000000
##  8 10122   Once Upon a Time in Hollywood  96000000
##  9 10123   The Farewell                    3000000
## 10 10124   Where'd You Go Bernadette      18000000
## # … with 14 more rows
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (n:Ratings) RETURN n LIMIT 25;" %>%
  call_neo4j(con)
## $n
## # A tibble: 24 x 6
##    score2 score3 score4 score5 movieid score1
##     <int>  <int>  <int>  <int> <chr>    <int>
##  1      3      5      5      4 10121        4
##  2      4      3      4      4 10122        5
##  3      5      5      4      5 10123        5
##  4      3      3      4      2 10124        4
##  5      5      4      5      4 10125        5
##  6      4      3      4      4 10126        3
##  7      3      5      5      4 10121        4
##  8      4      3      4      4 10122        5
##  9      5      5      4      5 10123        5
## 10      3      3      4      2 10124        4
## # … with 14 more rows
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
# create relationships
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10121' AND b.movieid = '10121' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10122' AND b.movieid = '10122' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10123' AND b.movieid = '10123' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10124' AND b.movieid = '10124' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10125' AND b.movieid = '10125' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"
"MATCH (a:Ratings),(b:Movies) WHERE a.movieid = '10126' AND b.movieid = '10126' CREATE (a)-[r:DESCRIBES]->(b) RETURN r" %>%
  call_neo4j(con)
## $r
## # A tibble: 0 x 0
## 
## attr(,"class")
## [1] "neo"  "neo"  "list"

discussion

For a simple dataset like the subject of this migration, a relational database works just fine: there are no complex relationships and the data itself is well structured. As a result, the queries are simple and require few joins. On Neo4j, however, it would be possible to add nested heirarchies and potentially even the movie files themselves. While the graph database is more flexible, it is glaringly over-powered for this application.

references