For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

Best Chick Flicks 1995 - 2019

library(rvest)
library(XML)
library(RCurl)
library(tidyverse)
library(plyr)
library(DBI)
library(RMySQL)
library(RSQLite)
library(DT)
library(mongolite)

Scraping the data from the web

links <-paste0("https://www.imdb.com/list/ls058361070/?sort=list_order,asc&st_dt=&mode=detail&page=", 1:2)

Rank <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".text-primary") %>% 
                    html_text() %>%
                    parse_number() %>% 
                    na.omit() 
                  
                })
Rank <- unlist(Rank)

Title <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".lister-item-header a") %>% 
                    html_text() %>%
                    gsub(" The Wonderful World.+", NA, .) %>% 
                    na.omit()
                  
                })
Title <- unlist(Title)

Year <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".text-muted.unbold") %>% 
                    html_text() %>%
                    gsub("\\(\\d{4}.\\d{4}\\)", NA, .) %>% 
                    parse_number() %>% 
                    na.omit()
                  
                })
Year <- unlist(Year)


Genre <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".genre") %>% 
                    html_text() %>%
                    gsub("\n", "", .)
                  
                })
Genre <- unlist(Genre)

Rated <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".certificate") %>% 
                    html_text()
                  
                })

Rated <- unlist(Rated)

# had to look for the movies that did not have a grade to append NA
for(j in c(111, 150, 152, 158)){
  Rated <- append(Rated, NA, after = j)
}

Runtime <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".runtime") %>% 
                    html_text()
                  
                })

Runtime <- unlist(Runtime)

Star_Rating <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>% 
                    html_text()
                  
                })
Star_Rating <-  unlist(Star_Rating)

Director <- lapply(links,
                function(pg){
                  pg %>% read_html() %>% 
                    html_nodes(".text-muted a:nth-child(1)") %>% 
                    html_text()
                  
                })
Director <- unlist(Director)

imdb_movies <- data.frame(Rank, Title, Year, Genre, Rated, Runtime, Star_Rating, Director)

datatable(imdb_movies, rownames = F)

SQL

db_user <- 'newuser'
db_password <- 'pass'
db_name <- 'imdb_movies'
db_host <- '127.0.0.1' # for local access to SQL 
db_port <- 3306

drv <- dbDriver("MySQL")

# creating tbdb connection object with RMysql package
connect <- dbConnect(drv, user = db_user, password = db_password, dbname = db_name, host = db_host, port = db_port)

Loading table into SQL from R

#dbListTables(connect)
#dbListFields(connect, 'tblbest_chick_flicks')

 #Drop table if it already exists
if (dbExistsTable(connect, "tblbest_chick_flicks"))
    dbRemoveTable(connect, "tblbest_chick_flicks")
## [1] TRUE
# Write the data frame to the MySQL
dbWriteTable(connect, name = "tblbest_chick_flicks", value = imdb_movies, append = T,  row.names = FALSE)
## [1] TRUE

Write to CSV

#write.csv(imdb_movies, file = "bestchickflicks.csv", row.names = F)

MongoDB (NoSQL)

Load relational data into MongoDB

my_collection = mongo(collection = "bestchickflicks", db = "imdbmovies") # create connection, database and collection
my_collection$insert(imdb_movies)
## List of 5
##  $ nInserted  : num 191
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

A view of the first record from the MongoDB with R.

my_collection$iterate()$one() 
## $Rank
## [1] 1
## 
## $Title
## [1] "Mean Girls"
## 
## $Year
## [1] 2004
## 
## $Genre
## [1] "Comedy            "
## 
## $Rated
## [1] "PG-13"
## 
## $Runtime
## [1] "97 min"
## 
## $Star_Rating
## [1] "7"
## 
## $Director
## [1] "Mark Waters"

Over on the Mongo server/terminal, I checked to see if the database and collection was created using the db and show collections command. After that the db.bestchickflicks.find() command was used to view the data imported via R.

A little difficult to read.



Using db.bestchickflicks.find().pretty() puts the output in a structured format.



Export to a .json file

Relational Database VS. NoSQL

Scalability

Relational Databases require a single server to host your entire database. When the load increases on RDBMS database, we scale vertically by increasing hardware power and this involves purchasing expensive and bigger servers. With NoSQL databases, they are designed to expand horizontally by increasing the databases servers in the pool of resources to reduce the load.

Data structure

SQL databases are designed to handle structured data defined by relationships. NoSQL can handle unstructured data, that is, it does not limit storable data types. You can add new types as business needs change.. For instance, when adding data, if the next row requires an additional column, it can be added with no issue. SQL on the other hand will display an error message stating that the row has one colmun too much.

Data Models

No schema is needed when creating a database when it comes to NoSQL but SQL databases requires that you do.

Data Representation

SQL databases are table based databases with n rows and q columns.NoSQL databases are document based, key-value pairs, graph databases or wide-column stores which do not have standard schema definitions which it needs to adhere to.