Create movie database locally using this SQL file
Then extract movies database from MySQL into a dataframe and print out top and bottom to confirm data came in OK.
# Connect to MySQL database
con <- dbConnect(MySQL(),user='CUNY_SPS',password='',
host='localhost',dbname='tb')
# Download all data in database
rs<- dbSendQuery(con, "select * from tb_hw ORDER BY rating DESC")
# Store data in dataframe
movies <- fetch(rs)
head(movies)## movie year genre mins rating initials
## 1 Avatar 2009 Fantasy 302 5 sjv
## 2 Braveheart 1995 Drama 318 5 sjv
## 3 Heat 1995 Action 310 5 sjv
## 4 Avatar 2009 Fantasy 302 5 jv
## 5 Okja 2017 Drama 120 5 jv
## 6 Avatar 2009 Fantasy 302 5 cl
tail(movies)## movie year genre mins rating initials
## 25 Terminator 1984 Sci-fi 107 2 jv
## 26 Okja 2017 Drama 120 2 cl
## 27 Braveheart 1995 Drama 318 2 np
## 28 Rocky 1976 Drama 120 1 jv
## 29 Heat 1995 Action 310 1 jv
## 30 Rocky 1976 Drama 120 1 np
Inserting a dataframe from R into a MongoDB collection is supereasy thanks to the mongolite package
my_db <- mongo(collection='movies', db="ratings")
my_db$insert(movies)## List of 5
## $ nInserted : num 30
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Confirm data imported correctly:
my_db$count() == dim(movies)[1]## [1] TRUE
Check first record in MongoDB:
my_db$iterate()$one()## $movie
## [1] "Avatar"
##
## $year
## [1] 2009
##
## $genre
## [1] "Fantasy"
##
## $mins
## [1] 302
##
## $rating
## [1] 5
##
## $initials
## [1] "sjv"
Export data from MongoDB to a data frame and confirm data frame matches the one uploaded:
# return data from MongoDB into a tmp data frame
tmp <- my_db$find()
# test if two data frames are identical
identical(movies,tmp)## [1] TRUE
Return all movies rated >3 from any user:
top_5 <- my_db$distinct("movie", query = '{"rating":{"$gt":3}}')
print(top_5)## [1] "Avatar" "Braveheart" "Heat" "Okja" "Terminator"
## [6] "Rocky"
Return all movies rated by user “sjv”:
sjv <- my_db$find(query = '{"initials":"sjv"}')
sjv## movie year genre mins rating initials
## 1 Avatar 2009 Fantasy 302 5 sjv
## 2 Braveheart 1995 Drama 318 5 sjv
## 3 Heat 1995 Action 310 5 sjv
## 4 Terminator 1984 Sci-fi 107 4 sjv
## 5 Rocky 1976 Drama 120 4 sjv
## 6 Okja 2017 Drama 120 3 sjv
Return all movies created after year 2000:
post2000 <- my_db$distinct("movie",query = '{"year":{"$gt":2000}}')
print(post2000)## [1] "Avatar" "Okja"
Return average rating per movie and plot results:
avg_rating <- my_db$aggregate('[{"$group":{"_id":"$movie","average":{"$avg":"$rating"}}}]')
names(avg_rating) <- c("movie","average_rating")
ggplot(aes(movie,average_rating),data=avg_rating) +
geom_col(fill="sky blue") + ggtitle("Average Rating per Movie") + theme_classic()The pros of NOSQL are that developers don’t have to worry about schemas, which can be time consuming to create if data connections are complex. NOSQL tends to be faster to retrieve data as well.
Some cons of NOSQL is the greater skill set needed to create/query NOSQL databases. Additionally, sometimes businesses want to keep/store the relationship between one item and another. For example, if I want to know how employees report to Bob, this could arguably be faster to query in a relational database.