Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.
library(tidyverse)
library(RMySQL)
Requirement: Created Database DATA607_MOVIEDB, user ‘DATA607User’ and 3 tables: Movie, Reviewer, Rating
Note : SQL Scripts provided separately (github link)
Once the connection is established, view the tables.
mydb =dbConnect(MySQL(),user='DATA607User',password=mypassword,dbname='DATA607_MOVIEDB',host='localhost')
dbListTables(mydb)
## [1] "movie" "rating" "reviewer"
Select the tables: movie, reviwer and ratings and fetch the resultset in their respective data frames
movie_resultSet = dbSendQuery(mydb, 'select * from movie')
movie_df <- fetch(movie_resultSet)
movie_df
## movie_id movie_name
## 1 1 JUMANJI: THE NEXT LEVEL
## 2 2 Spies in Disguise
## 3 3 Dolittle
## 4 4 Fronzen 2
## 5 5 Star Wars: The Rise of Skywalker
## 6 6 1917
reviewer_resultSet = dbSendQuery(mydb, 'select * from reviewer')
reviewer_df <- fetch(reviewer_resultSet)
reviewer_df
## reviewer_id reviewer_name gender
## 1 1 Tom M
## 2 2 Alice F
## 3 3 Nathan M
## 4 4 Dan M
## 5 5 Lucy F
rating_resultSet = dbSendQuery(mydb, 'select * from rating')
rating_df <- fetch(rating_resultSet)
head(rating_df,10)
## reviewer_id movie_id rating
## 1 1 1 6.5
## 2 1 2 6.0
## 3 1 3 7.0
## 4 1 4 8.0
## 5 1 5 9.0
## 6 1 6 7.5
## 7 2 1 7.0
## 8 2 2 6.5
## 9 2 3 7.0
## 10 2 4 8.5
reviewer_movie_rating <- left_join(movie_df,rating_df,by.x ="movie_id",by.y = "movie_id")
## Joining, by = "movie_id"
Now, checking the combined dataframe of movie and rating data
head(reviewer_movie_rating,30)
## movie_id movie_name reviewer_id rating
## 1 1 JUMANJI: THE NEXT LEVEL 1 6.5
## 2 1 JUMANJI: THE NEXT LEVEL 2 7.0
## 3 1 JUMANJI: THE NEXT LEVEL 3 7.7
## 4 1 JUMANJI: THE NEXT LEVEL 4 7.5
## 5 1 JUMANJI: THE NEXT LEVEL 5 7.0
## 6 2 Spies in Disguise 1 6.0
## 7 2 Spies in Disguise 2 6.5
## 8 2 Spies in Disguise 3 6.8
## 9 2 Spies in Disguise 4 6.5
## 10 2 Spies in Disguise 5 6.5
## 11 3 Dolittle 1 7.0
## 12 3 Dolittle 2 7.0
## 13 3 Dolittle 3 7.0
## 14 3 Dolittle 4 7.5
## 15 3 Dolittle 5 7.5
## 16 4 Fronzen 2 1 8.0
## 17 4 Fronzen 2 2 8.5
## 18 4 Fronzen 2 3 7.0
## 19 4 Fronzen 2 4 8.0
## 20 4 Fronzen 2 5 8.0
## 21 5 Star Wars: The Rise of Skywalker 1 9.0
## 22 5 Star Wars: The Rise of Skywalker 2 7.0
## 23 5 Star Wars: The Rise of Skywalker 3 8.0
## 24 5 Star Wars: The Rise of Skywalker 4 7.0
## 25 5 Star Wars: The Rise of Skywalker 5 5.5
## 26 6 1917 1 7.5
## 27 6 1917 2 9.0
## 28 6 1917 3 7.0
## 29 6 1917 4 NA
## 30 6 1917 5 NA
reviewer_movie_rating %>%
filter(is.na(rating))
## movie_id movie_name reviewer_id rating
## 1 6 1917 4 NA
## 2 6 1917 5 NA
Imputing the missing rating values with mean rating value
reviewer_rating_cleaned_df <- reviewer_movie_rating %>%
mutate(
rating = chron::times(rating),
rating = if_else(is.na(rating), mean(rating), rating)
)
reviewer_rating_cleaned_df
## movie_id movie_name reviewer_id rating
## 1 1 JUMANJI: THE NEXT LEVEL 1 6.500000
## 2 1 JUMANJI: THE NEXT LEVEL 2 7.000000
## 3 1 JUMANJI: THE NEXT LEVEL 3 7.700000
## 4 1 JUMANJI: THE NEXT LEVEL 4 7.500000
## 5 1 JUMANJI: THE NEXT LEVEL 5 7.000000
## 6 2 Spies in Disguise 1 6.000000
## 7 2 Spies in Disguise 2 6.500000
## 8 2 Spies in Disguise 3 6.800000
## 9 2 Spies in Disguise 4 6.500000
## 10 2 Spies in Disguise 5 6.500000
## 11 3 Dolittle 1 7.000000
## 12 3 Dolittle 2 7.000000
## 13 3 Dolittle 3 7.000000
## 14 3 Dolittle 4 7.500000
## 15 3 Dolittle 5 7.500000
## 16 4 Fronzen 2 1 8.000000
## 17 4 Fronzen 2 2 8.500000
## 18 4 Fronzen 2 3 7.000000
## 19 4 Fronzen 2 4 8.000000
## 20 4 Fronzen 2 5 8.000000
## 21 5 Star Wars: The Rise of Skywalker 1 9.000000
## 22 5 Star Wars: The Rise of Skywalker 2 7.000000
## 23 5 Star Wars: The Rise of Skywalker 3 8.000000
## 24 5 Star Wars: The Rise of Skywalker 4 7.000000
## 25 5 Star Wars: The Rise of Skywalker 5 5.500000
## 26 6 1917 1 7.500000
## 27 6 1917 2 9.000000
## 28 6 1917 3 7.000000
## 29 6 1917 4 7.267857
## 30 6 1917 5 7.267857
Now, checking for the reviewer rating after the data imputation.
movie_rating <- reviewer_rating_cleaned_df %>%
group_by(movie_name,movie_id) %>%
summarize(reviewer_rating =mean(rating)) %>%
arrange(desc(reviewer_rating))
movie_rating
## # A tibble: 6 x 3
## # Groups: movie_name [6]
## movie_name movie_id reviewer_rating
## <chr> <int> <S3: times>
## 1 Fronzen 2 4 7.900000
## 2 1917 6 7.607143
## 3 Star Wars: The Rise of Skywalker 5 7.300000
## 4 Dolittle 3 7.200000
## 5 JUMANJI: THE NEXT LEVEL 1 7.140000
## 6 Spies in Disguise 2 6.460000
The three different datasets of movie, reiviewer and rating data fetched from SQL database “DATA607_MOVIEDB” and have been analyzed to find the final average score of each movie rating. After handling the missing values present in the reviewer dataset it’s been found that “Frozen 2” topped the list and “Spies in Disguise”" finished last.