In this exercise we will migrate a database from MySQL to MongoDB.
library(mongolite)
## Warning: package 'mongolite' was built under R version 3.5.3
library(dplyr)
library(tidyr)
library(rvest)
library(ggplot2)
library(knitr)
Make sure you have MySQL installed. Then please run the following sql script in order to create the movie_review database:
Be sure you use the correct username and password for your particular MySQL setup! We then loop through all of the tables in the db and store them each as a data frame in one list called df_list.
my_db <- src_mysql(dbname = 'movie_reviews', username = 'user', password = 'password')
my_tables <- db_list_tables(my_db$con)
print(my_tables)
## [1] "critic" "movie" "review"
df_list <- list()
for (i in my_tables){
df_list[[i]] <- tbl(my_db, i) %>% collect(n=Inf)
}
df_list %>% head()
## $critic
## # A tibble: 5 x 2
## critic_id critic_name
## <int> <chr>
## 1 1 Aaron
## 2 2 Beverly
## 3 3 Charlie
## 4 4 Dolores
## 5 5 Edward
##
## $movie
## # A tibble: 6 x 2
## movie_id title
## <int> <chr>
## 1 1 Acquaman
## 2 2 Black Panther
## 3 3 The Greatest Showman
## 4 4 Ready Player One
## 5 5 Mary Poppins Returns
## 6 6 Annihilation
##
## $review
## # A tibble: 30 x 3
## movie_id critic_id movie_rating
## <int> <int> <dbl>
## 1 1 1 3.1
## 2 1 2 3.9
## 3 1 3 3.9
## 4 1 4 3.1
## 5 1 5 3.5
## 6 2 1 4.8
## 7 2 2 4.6
## 8 2 3 4.1
## 9 2 4 5
## 10 2 5 5
## # ... with 20 more rows
We then use the mongolite package in R to easily insert the data frames as collections into a Mongo database.
Please be sure to set the variable mongo_url to your particular mongo database location. The default is “mongodb://localhost” but you may have a cloud version you prefer. For security purposes I saved my particular URL in a local file and access it with the readLines() function.
You can also choose your own value for db= within the mongo() function. The default is “test”.
mongo_url <- readLines("mongodb_server_url.txt")
for (i in my_tables){
con <- mongo(collection = i, db = "test", url = mongo_url)
con$insert(df_list[[i]])
con$count()
print(con$find() %>% head(10))
con$drop()
}
## critic_id critic_name
## 1 1 Aaron
## 2 2 Beverly
## 3 3 Charlie
## 4 4 Dolores
## 5 5 Edward
## movie_id title
## 1 1 Acquaman
## 2 2 Black Panther
## 3 3 The Greatest Showman
## 4 4 Ready Player One
## 5 5 Mary Poppins Returns
## 6 6 Annihilation
## movie_id critic_id movie_rating
## 1 1 1 3.1
## 2 1 2 3.9
## 3 1 3 3.9
## 4 1 4 3.1
## 5 1 5 3.5
## 6 2 1 4.8
## 7 2 2 4.6
## 8 2 3 4.1
## 9 2 4 5.0
## 10 2 5 5.0
We can see that the information in the collections is the same as the SQL tables.
The advantages of using a NoSQL database are that it is more flexible in its structure and also takes up less disk space. This may be desirable if working with a very large dataset. However, the flexibility can be a disadvantage as well. Data relationships are not strictly enforced, which could lead to duplication and disorganization. Relational databases are often preferred if a dataset is not overly immense and / or cost is not an issue.