Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.
Most popular movies from 2018 to 2021 are picked up for this. Not all the users (rators) would have watched all the movies, so I decided to use 9 movies for survey. SurveyMonkey online tool is used to collect ratings from 20 users. Following task to be performed for this assignment.
Data Collection A survey is performed to collect movie ratings from various people. This survey include 10 questions to find -
* Preferred mode of watching movies
* Rating of listed 9 movies
* Survey was answered by 20 people by using a SurveyMonkey(https://www.surveymonkey.com/r/G6N2ZZL)
Each key table has a primary key except user_movie_rating table which has userId repeated for each movie. Tables user_preference and user_movie_rating has foreign keys defined to maintain database normalization.
Following is entity relationship design to show objects, attributes and associations between objects.
DROP DATABASE IF EXISTS movie_ratings;
CREATE DATABASE `movie_ratings`
USE movie_ratings;
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS movies;
DROP TABLE IF EXISTS ratings;
DROP TABLE IF EXISTS preference;
DROP TABLE IF EXISTS user_movie_rating;
DROP TABLE IF EXISTS user_preference;
CREATE TABLE `users` (
`userId` int NOT NULL,
`user_name` varchar(100) DEFAULT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
PRIMARY KEY (`userId`)
);
CREATE TABLE `movies` (
`movieId` int NOT NULL,
`movie_name` varchar(100) DEFAULT NULL,
`year` int DEFAULT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
PRIMARY KEY (`movieId`)
);
CREATE TABLE `ratings` (
`rating` int NOT NULL,
`rating_desc` varchar(100) DEFAULT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
PRIMARY KEY (`rating`)
);
CREATE TABLE `preference` (
`preferenceId` int NOT NULL,
`preference` varchar(100) DEFAULT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
PRIMARY KEY (`preferenceId`)
);
CREATE TABLE `user_movie_rating` (
`userId` int NOT NULL,
`movieId` int NOT NULL,
`rating` int DEFAULT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
KEY `userId_idx` (`userId`),
KEY `movieId_idx` (`movieId`),
KEY `rating_idx` (`rating`),
CONSTRAINT `movieId` FOREIGN KEY (`movieId`) REFERENCES `movies` (`movieId`),
CONSTRAINT `rating` FOREIGN KEY (`rating`) REFERENCES `ratings` (`rating`),
CONSTRAINT `userId` FOREIGN KEY (`userId`) REFERENCES `users` (`userId`)
);
CREATE TABLE `user_preference` (
`userId` int NOT NULL,
`preferenceId` int NOT NULL,
`addedOn` datetime DEFAULT NULL,
`addedBy` varchar(100) DEFAULT NULL,
`updatedOn` datetime DEFAULT NULL,
`updatedBy` varchar(100) DEFAULT NULL,
PRIMARY KEY (`userId`),
KEY `preferenceId_idx` (`preferenceId`),
CONSTRAINT `preferenceId` FOREIGN KEY (`preferenceId`) REFERENCES `preference` (`preferenceId`),
CONSTRAINT `userIdx` FOREIGN KEY (`userId`) REFERENCES `users` (`userId`)
);
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\users.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\movies.csv'
INTO TABLE movies
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\ratings.csv'
INTO TABLE ratings
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\preference.csv'
INTO TABLE preference
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\user_movie_rating.csv'
INTO TABLE user_movie_rating
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\user_preference.csv'
INTO TABLE user_preference
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Reproducibility for this assignment can be accomplished in two steps:
1. Run movies_ratings.sql from GitHub. Script can be accessed from this Github Location
Password less code - R is connected to MySQL database using RMySQL Database Interface driver. While running R Markdown, database user and password is asked.
Using provided database user and password, a connection to MySQL Database is made. Database query executed to select movie, users and rating data set
library(RMySQL)## Loading required package: DBI
#dbUser <- rstudioapi::askForPassword("Database username")
#dbPassword <- rstudioapi::askForPassword("Database password")
host ='flights.cbs1lxtno2zh.us-east-2.rds.amazonaws.com'
user ='devops_user'
password='devops_password'
#Above user and password can be saved as environment variable.
#Also there is point to point security between RDS MySQL instance and local R Studio so RDS MySQL can not be misused
connection = dbConnect(MySQL(), user = user, password =password, dbname = 'movie_ratings', host = host)
sqlQuery='SELECT usr.user_name, mv.movie_name, rt.rating FROM movie_ratings.user_movie_rating umr JOIN movies mv ON mv.movieId=umr.movieId JOIN users usr ON usr.userId=umr.userId JOIN ratings rt ON rt.rating=umr.rating'
result = dbSendQuery(connection, sqlQuery)Movie, users and rating data set is used to create Data Frame
data.frame = fetch(result)
summary(data.frame)## user_name movie_name rating
## Length:168 Length:168 Min. :1.00
## Class :character Class :character 1st Qu.:2.00
## Mode :character Mode :character Median :2.00
## Mean :2.31
## 3rd Qu.:3.00
## Max. :5.00
movie.reviews <- as.data.frame(data.frame)So far we have created a database with tables and loaded the data into these tables. We also have connected R with MySQL database and selected Movie, users and rating data set to to create Data Frame. Now data frame is ready and we will be performing data analysis. This analysis contains * Average rating for each movie in survey
* Average rating given by each user
* Most preferred mode to watch a movie
mean.movie<- aggregate(rating ~ movie_name, movie.reviews, mean)
mean.movie## movie_name rating
## 1 Alladin 2.105263
## 2 Avengers: Endgame 3.000000
## 3 Avengers: Infinity War 2.833333
## 4 Deadpool 2 2.315789
## 5 Joker 2.000000
## 6 Mulan 2.526316
## 7 Space Sweepers 1.684211
## 8 Tenet 2.666667
## 9 The Midnight Sky 1.722222
This metric show - ‘Avengers: Endgame’ as highest rated where as Space Sweepers as lowest rated movie
Bar plot is used to to show average movie rating
library(ggplot2)
PivotData = aggregate(rating ~ movie_name, data=movie.reviews,mean)
ggplot(data=PivotData, aes(x=(reorder(movie_name, rating)), y = rating, label = rating))+
geom_bar(stat="identity", fill="#9999FF")+ coord_flip()+
labs(title="Average movie rating", x= "Average Rating", y = "Movie")+
geom_text(aes(label=round(rating, digits = 2)))+
theme(plot.title=element_text(hjust=0.5))Bar plot is used to show average rating by user
mean.user<- aggregate(rating ~ user_name, movie.reviews, mean)
mean.user## user_name rating
## 1 Albert 2.333333
## 2 Alton 2.333333
## 3 Andoh 1.888889
## 4 Aravi 2.666667
## 5 Arjun 2.777778
## 6 David 1.666667
## 7 Frank 2.444444
## 8 Garre 2.666667
## 9 Jasse 2.222222
## 10 John 2.111111
## 11 Manoj 2.333333
## 12 Mike 2.333333
## 13 Munar 1.888889
## 14 Naura 2.000000
## 15 Palak 3.000000
## 16 Paul 2.222222
## 17 Steve 3.222222
## 18 Taylor 2.000000
## 19 William 1.777778
## 20 Yang 1.000000
Lets use Bar plot show average rating by user
PivotData = aggregate(rating ~ user_name, data=movie.reviews,mean)
ggplot(data=PivotData, aes(x=(reorder(user_name, rating)), y = rating, label = rating))+
geom_bar(stat="identity", fill="#9999FF")+
coord_flip()+
labs(title="Average rating by user", x= "User", y = "Average Rating")+
geom_text(aes(label=round(rating, digits = 2)))+
theme(plot.title=element_text(hjust=0.5)) Lets drill down to next level to see rating on each movie given by individual users
ggplot(data=movie.reviews, aes(x=movie_name, y = rating, fill = movie_name, label = rating))+
geom_bar(stat="identity")+ facet_wrap(~user_name)+ ggtitle("Movie ratings by user")+
theme(axis.text.x = element_blank(),plot.title = element_text(hjust=1),legend.position = "right")sqlQuery='SELECT pr.preference, count(*) as count FROM movie_ratings.user_preference up JOIN preference pr on pr.preferenceId=up.preferenceId group by up.preferenceId'
result = dbSendQuery(connection, sqlQuery)
prefData.frame = fetch(result)
prefData.frame## preference count
## 1 Cinema 4
## 2 Online 16
By seeing average movie ratings and average rating by the users, listed below are conclusions:
* ‘Avengers: Endgame’ is highest rated where as Space Sweepers as lowest rated movie
* Steve watched most of the movies and liked most of the movies
* Most users preferred to watch movie online instead going to Cinema or watching on DVD or TV telecast
* I feel there should have added another category ‘Didn’t watch’ to further collect more ratings. That would have improved average ratings for other movies.