library(rvest)
library(XML)
library(RCurl)
library(tidyverse)
library(plyr)
library(DBI)
library(RMySQL)
library(RSQLite)
library(DT)
library(mongolite)
links <-paste0("https://www.imdb.com/list/ls058361070/?sort=list_order,asc&st_dt=&mode=detail&page=", 1:2)
Rank <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".text-primary") %>%
html_text() %>%
parse_number() %>%
na.omit()
})
Rank <- unlist(Rank)
Title <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".lister-item-header a") %>%
html_text() %>%
gsub(" The Wonderful World.+", NA, .) %>%
na.omit()
})
Title <- unlist(Title)
Year <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".text-muted.unbold") %>%
html_text() %>%
gsub("\\(\\d{4}.\\d{4}\\)", NA, .) %>%
parse_number() %>%
na.omit()
})
Year <- unlist(Year)
Genre <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".genre") %>%
html_text() %>%
gsub("\n", "", .)
})
Genre <- unlist(Genre)
Rated <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".certificate") %>%
html_text()
})
Rated <- unlist(Rated)
# had to look for the movies that did not have a grade to append NA
for(j in c(111, 150, 152, 158)){
Rated <- append(Rated, NA, after = j)
}
Runtime <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".runtime") %>%
html_text()
})
Runtime <- unlist(Runtime)
Star_Rating <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>%
html_text()
})
Star_Rating <- unlist(Star_Rating)
Director <- lapply(links,
function(pg){
pg %>% read_html() %>%
html_nodes(".text-muted a:nth-child(1)") %>%
html_text()
})
Director <- unlist(Director)
imdb_movies <- data.frame(Rank, Title, Year, Genre, Rated, Runtime, Star_Rating, Director)
datatable(imdb_movies, rownames = F)
db_user <- 'newuser'
db_password <- 'pass'
db_name <- 'imdb_movies'
db_host <- '127.0.0.1' # for local access to SQL
db_port <- 3306
drv <- dbDriver("MySQL")
# creating tbdb connection object with RMysql package
connect <- dbConnect(drv, user = db_user, password = db_password, dbname = db_name, host = db_host, port = db_port)
#dbListTables(connect)
#dbListFields(connect, 'tblbest_chick_flicks')
#Drop table if it already exists
if (dbExistsTable(connect, "tblbest_chick_flicks"))
dbRemoveTable(connect, "tblbest_chick_flicks")
## [1] TRUE
# Write the data frame to the MySQL
dbWriteTable(connect, name = "tblbest_chick_flicks", value = imdb_movies, append = T, row.names = FALSE)
## [1] TRUE
#write.csv(imdb_movies, file = "bestchickflicks.csv", row.names = F)
my_collection = mongo(collection = "bestchickflicks", db = "imdbmovies") # create connection, database and collection
my_collection$insert(imdb_movies)
## List of 5
## $ nInserted : num 191
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
A view of the first record from the MongoDB with R.
my_collection$iterate()$one()
## $Rank
## [1] 1
##
## $Title
## [1] "Mean Girls"
##
## $Year
## [1] 2004
##
## $Genre
## [1] "Comedy "
##
## $Rated
## [1] "PG-13"
##
## $Runtime
## [1] "97 min"
##
## $Star_Rating
## [1] "7"
##
## $Director
## [1] "Mark Waters"
Over on the Mongo server/terminal, I checked to see if the database and collection was created using the db and show collections command. After that the db.bestchickflicks.find() command was used to view the data imported via R.
A little difficult to read.
Using db.bestchickflicks.find().pretty() puts the output in a structured format.
Scalability
Relational Databases require a single server to host your entire database. When the load increases on RDBMS database, we scale vertically by increasing hardware power and this involves purchasing expensive and bigger servers. With NoSQL databases, they are designed to expand horizontally by increasing the databases servers in the pool of resources to reduce the load.
Data structure
SQL databases are designed to handle structured data defined by relationships. NoSQL can handle unstructured data, that is, it does not limit storable data types. You can add new types as business needs change.. For instance, when adding data, if the next row requires an additional column, it can be added with no issue. SQL on the other hand will display an error message stating that the row has one colmun too much.
Data Models
No schema is needed when creating a database when it comes to NoSQL but SQL databases requires that you do.
Data Representation
SQL databases are table based databases with n rows and q columns.NoSQL databases are document based, key-value pairs, graph databases or wide-column stores which do not have standard schema definitions which it needs to adhere to.