Data Title: Movie Ratings Data Set

Source: Collected through indenpendent survey

Database Components:

This analysis involves a MySQL source database called ‘movieratings’. This database has 3 tables -

1. movies

2. users

3. ratings

Note: MySQL database software need to be installed in the local computer in order to successfully execute the R markdown code.

Below are the steps involved in preparing the MySQL DB environment necesaary for the R markdown code to run successfully -

  1. Run the DDL script (MovieRatings_DDL Script.sql) in order to create a database called ‘movieratings’ along with the necessary tables as mentioned above - GitHub location:DDL Script

  2. Run the Data Load script (Data Load DML Script.sql) to load necessary data into the database - GitHub Location: Load Script

Note: All the necessary CSV data files are also available in the below mention location - GitHub Location: Source Data Files

  1. Create a ODBC System DSN called ‘MySQL_MovieRatings’ in the local machine with ‘movieratings’ as the default database with appropriate root credentials or with an appropriate user id having READ access to ‘movieratings’ database.

R Libraries:

Load necessary libraries -

library(RODBC)
library(magrittr)
library(kableExtra)
library(data.table)
library(plyr)
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

Establish DB Connection:

I have used an ODBC data source called ‘MySQL_MovieRatings’ in order to connect to the database.

con <- odbcConnect("MySQL_MovieRatings")

Data Sets:

A. Movies:

I have taken a list of six Hollywood movies released in 2018 for this analysis. Movies belonging to multiple genres have been captured as pipe (‘|’) separated values.

movies_df <- as.data.frame(sqlFetch(con,"movies"),stringsAsFactors = FALSE)

movies_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Movie_ID Movie_Title Movie_Genre Movie_Year Director_Name
1 Black Panther Action|Adventure|Sci-Fi 2018 Ryan Coogler
2 Incredibles 2 Animation|Action 2018 Brand Bird
3 Avengers: Infinity War Action|Adventure|Fantasy 2018 Anthony Russo
4 Annihilation Adventure|Drama|Horror 2018 Alex Garland
5 Ant-Man and the Wasp Action|Adventure|Sci-Fi 2018 Peyton Reed
6 Mission Impossible: Fallout Action|Adventure|Thriller 2018 Christopher McQuarrie

B. Users:

These users contributed in the movie rating survey process.

users_df <- as.data.frame(sqlFetch(con,"users"),stringsAsFactors = FALSE)

users_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
User_ID First_Name Last_Name Gender Age
1 Soumya Ghosh M 36
2 Oliver Asmus M 36
3 Ricky Sharma M 34
4 Idhant Ghosh M 6
5 Bidisha Kar F 35
6 Krishanu Ghosh M 34
7 Jim Duffy M 42
8 Arron Higgins M 41
9 Alex Smith M 43
10 Reese McGarvey F 7
11 Jacob Oram M 45
12 Pritam Mukherjee M 33

C. Ratings:

Each user rated all the 6 movies which has been loaded through SQL scripts into the database. A rank has been created in SQL based on the ratings provided by the users using SQL window function.

Query = "SELECT 
             User_ID
            ,Movie_ID
            ,Rating
            ,RANK() OVER (PARTITION BY User_ID ORDER BY Rating desc) AS Movie_Rank
         FROM
             ratings 
         ORDER BY 
             User_ID
            ,Movie_Rank;"

ratings_df <- sqlQuery(con,Query)

ratings_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
User_ID Movie_ID Rating Movie_Rank
1 3 4.2 1
1 6 4.1 2
1 1 4.0 3
1 2 3.5 4
1 5 3.0 5
1 4 2.5 6
2 3 4.3 1
2 1 3.2 2
2 4 3.0 3
2 5 2.8 4
2 2 2.4 5
2 6 2.2 6
3 6 4.5 1
3 4 4.2 2
3 3 3.9 3
3 5 3.7 4
3 1 3.5 5
3 2 2.8 6
4 1 4.6 1
4 5 4.1 2
4 3 3.7 3
4 6 3.6 4
4 2 3.2 5
4 4 2.9 6
5 5 4.3 1
5 2 4.1 2
5 1 3.7 3
5 3 3.2 4
5 6 3.0 5
5 4 2.4 6
6 1 4.6 1
6 3 4.2 2
6 4 3.7 3
6 2 3.6 4
6 6 3.0 5
6 5 2.9 6
10 1 4.6 1
10 4 3.7 2
10 2 3.6 3
10 6 3.0 4
10 5 2.0 5
10 3 1.3 6

Prepare Analytics data:

Here I have joined all the 3 data sets and loaded into a data frame to prepare for further analysis along with dynamically created movie rank informtaion.

Query = "SELECT 
          m.Movie_ID
         ,m.Movie_Title
         ,m.Director_Name
         ,m.Movie_Year
         ,u.User_ID
         ,u.First_Name
         ,u.Last_Name
         ,u.Gender
         ,u.Age
         ,r.Rating
         ,RANK() OVER (PARTITION BY r.User_ID ORDER BY r.Rating desc) AS Movie_Rank
        FROM
          movieratings.ratings as r
        INNER JOIN
          movieratings.movies as m
        ON r.Movie_ID = m.Movie_ID
        INNER JOIN
        movieratings.users as u
        ON r.User_ID = u.User_ID;"

detail_ratings_df <- sqlQuery(con,Query)

detail_ratings_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Movie_ID Movie_Title Director_Name Movie_Year User_ID First_Name Last_Name Gender Age Rating Movie_Rank
3 Avengers: Infinity War Anthony Russo 2018 1 Soumya Ghosh M 36 4.2 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 1 Soumya Ghosh M 36 4.1 2
1 Black Panther Ryan Coogler 2018 1 Soumya Ghosh M 36 4.0 3
2 Incredibles 2 Brand Bird 2018 1 Soumya Ghosh M 36 3.5 4
5 Ant-Man and the Wasp Peyton Reed 2018 1 Soumya Ghosh M 36 3.0 5
4 Annihilation Alex Garland 2018 1 Soumya Ghosh M 36 2.5 6
3 Avengers: Infinity War Anthony Russo 2018 2 Oliver Asmus M 36 4.3 1
1 Black Panther Ryan Coogler 2018 2 Oliver Asmus M 36 3.2 2
4 Annihilation Alex Garland 2018 2 Oliver Asmus M 36 3.0 3
5 Ant-Man and the Wasp Peyton Reed 2018 2 Oliver Asmus M 36 2.8 4
2 Incredibles 2 Brand Bird 2018 2 Oliver Asmus M 36 2.4 5
6 Mission Impossible: Fallout Christopher McQuarrie 2018 2 Oliver Asmus M 36 2.2 6
6 Mission Impossible: Fallout Christopher McQuarrie 2018 3 Ricky Sharma M 34 4.5 1
4 Annihilation Alex Garland 2018 3 Ricky Sharma M 34 4.2 2
3 Avengers: Infinity War Anthony Russo 2018 3 Ricky Sharma M 34 3.9 3
5 Ant-Man and the Wasp Peyton Reed 2018 3 Ricky Sharma M 34 3.7 4
1 Black Panther Ryan Coogler 2018 3 Ricky Sharma M 34 3.5 5
2 Incredibles 2 Brand Bird 2018 3 Ricky Sharma M 34 2.8 6
1 Black Panther Ryan Coogler 2018 4 Idhant Ghosh M 6 4.6 1
5 Ant-Man and the Wasp Peyton Reed 2018 4 Idhant Ghosh M 6 4.1 2
3 Avengers: Infinity War Anthony Russo 2018 4 Idhant Ghosh M 6 3.7 3
6 Mission Impossible: Fallout Christopher McQuarrie 2018 4 Idhant Ghosh M 6 3.6 4
2 Incredibles 2 Brand Bird 2018 4 Idhant Ghosh M 6 3.2 5
4 Annihilation Alex Garland 2018 4 Idhant Ghosh M 6 2.9 6
5 Ant-Man and the Wasp Peyton Reed 2018 5 Bidisha Kar F 35 4.3 1
2 Incredibles 2 Brand Bird 2018 5 Bidisha Kar F 35 4.1 2
1 Black Panther Ryan Coogler 2018 5 Bidisha Kar F 35 3.7 3
3 Avengers: Infinity War Anthony Russo 2018 5 Bidisha Kar F 35 3.2 4
6 Mission Impossible: Fallout Christopher McQuarrie 2018 5 Bidisha Kar F 35 3.0 5
4 Annihilation Alex Garland 2018 5 Bidisha Kar F 35 2.4 6
1 Black Panther Ryan Coogler 2018 6 Krishanu Ghosh M 34 4.6 1
3 Avengers: Infinity War Anthony Russo 2018 6 Krishanu Ghosh M 34 4.2 2
4 Annihilation Alex Garland 2018 6 Krishanu Ghosh M 34 3.7 3
2 Incredibles 2 Brand Bird 2018 6 Krishanu Ghosh M 34 3.6 4
6 Mission Impossible: Fallout Christopher McQuarrie 2018 6 Krishanu Ghosh M 34 3.0 5
5 Ant-Man and the Wasp Peyton Reed 2018 6 Krishanu Ghosh M 34 2.9 6
1 Black Panther Ryan Coogler 2018 10 Reese McGarvey F 7 4.6 1
4 Annihilation Alex Garland 2018 10 Reese McGarvey F 7 3.7 2
2 Incredibles 2 Brand Bird 2018 10 Reese McGarvey F 7 3.6 3
6 Mission Impossible: Fallout Christopher McQuarrie 2018 10 Reese McGarvey F 7 3.0 4
5 Ant-Man and the Wasp Peyton Reed 2018 10 Reese McGarvey F 7 2.0 5
3 Avengers: Infinity War Anthony Russo 2018 10 Reese McGarvey F 7 1.3 6

Data Transformtaion:

I have parsed the pipe(‘|’) delimited genre column in the movies data frame to identify what all genres a particular movie belongs to as separate columns.

genres_df <- data.frame(sapply(tstrsplit(movies_df$Movie_Genre,'[|]',type.convert = TRUE),c),stringsAsFactors = FALSE)

genres_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
X1 X2 X3
Action Adventure Sci-Fi
Animation Action NA
Action Adventure Fantasy
Adventure Drama Horror
Action Adventure Sci-Fi
Action Adventure Thriller

I have Prepared a matrix with all the movie genres and assigned an indicator ‘1’ if the genre applicable for a given movie.

genre_list <- c("Action","Adventure","Animation","Drama","Fantasy","Horror","Sci-Fi","Thriller")
genre_matrix <- matrix(0,nrow(genres_df)+1,length(genre_list))
genre_matrix[1,] <- genre_list #set first row to genre list
colnames(genre_matrix) <- genre_list

#iterate through matrix
for (i in 1:nrow(genres_df)) {
 for (c in 1:ncol(genres_df)) {
 genmat_col = which(genre_matrix[1,] == genres_df[i,c])
 genre_matrix[i+1,genmat_col] <- 1
 }
}

genre_matrix %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Action Adventure Animation Drama Fantasy Horror Sci-Fi Thriller
Action Adventure Animation Drama Fantasy Horror Sci-Fi Thriller
1 1 0 0 0 0 1 0
1 0 1 0 0 0 0 0
1 1 0 0 1 0 0 0
0 1 0 1 0 1 0 0
1 1 0 0 0 0 1 0
1 1 0 0 0 0 0 1

Converted the matrix into a data frame with integer indcator values.

genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(genre_matrix2)) {
  genre_matrix2[,c] <- as.integer(genre_matrix2[,c])
} #convert from characters to integers

genre_matrix2 %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Action Adventure Animation Drama Fantasy Horror Sci-Fi Thriller
1 1 0 0 0 0 1 0
1 0 1 0 0 0 0 0
1 1 0 0 1 0 0 0
0 1 0 1 0 1 0 0
1 1 0 0 0 0 1 0
1 1 0 0 0 0 0 1

I have combined the genre indicators to the movies_df data frame.

movies_df <- cbind(movies_df,genre_matrix2)
movies_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Movie_ID Movie_Title Movie_Genre Movie_Year Director_Name Action Adventure Animation Drama Fantasy Horror Sci-Fi Thriller
1 Black Panther Action|Adventure|Sci-Fi 2018 Ryan Coogler 1 1 0 0 0 0 1 0
2 Incredibles 2 Animation|Action 2018 Brand Bird 1 0 1 0 0 0 0 0
3 Avengers: Infinity War Action|Adventure|Fantasy 2018 Anthony Russo 1 1 0 0 1 0 0 0
4 Annihilation Adventure|Drama|Horror 2018 Alex Garland 0 1 0 1 0 1 0 0
5 Ant-Man and the Wasp Action|Adventure|Sci-Fi 2018 Peyton Reed 1 1 0 0 0 0 1 0
6 Mission Impossible: Fallout Action|Adventure|Thriller 2018 Christopher McQuarrie 1 1 0 0 0 0 0 1

Here I have joined the Data Frames ‘detail_ratings_df’ and ‘movies_df’ based on ‘Movie_ID’ to add genres related indicators to analytical data sets.

detail_ratings_df <- merge(x = detail_ratings_df, y = subset(movies_df,select = c("Movie_ID",genre_list)), by = "Movie_ID")
detail_ratings_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Movie_ID Movie_Title Director_Name Movie_Year User_ID First_Name Last_Name Gender Age Rating Movie_Rank Action Adventure Animation Drama Fantasy Horror Sci-Fi Thriller
1 Black Panther Ryan Coogler 2018 6 Krishanu Ghosh M 34 4.6 1 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 10 Reese McGarvey F 7 4.6 1 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 2 Oliver Asmus M 36 3.2 2 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 4 Idhant Ghosh M 6 4.6 1 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 3 Ricky Sharma M 34 3.5 5 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 1 Soumya Ghosh M 36 4.0 3 1 1 0 0 0 0 1 0
1 Black Panther Ryan Coogler 2018 5 Bidisha Kar F 35 3.7 3 1 1 0 0 0 0 1 0
2 Incredibles 2 Brand Bird 2018 4 Idhant Ghosh M 6 3.2 5 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 3 Ricky Sharma M 34 2.8 6 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 2 Oliver Asmus M 36 2.4 5 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 1 Soumya Ghosh M 36 3.5 4 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 5 Bidisha Kar F 35 4.1 2 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 10 Reese McGarvey F 7 3.6 3 1 0 1 0 0 0 0 0
2 Incredibles 2 Brand Bird 2018 6 Krishanu Ghosh M 34 3.6 4 1 0 1 0 0 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 1 Soumya Ghosh M 36 4.2 1 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 3 Ricky Sharma M 34 3.9 3 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 10 Reese McGarvey F 7 1.3 6 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 4 Idhant Ghosh M 6 3.7 3 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 6 Krishanu Ghosh M 34 4.2 2 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 2 Oliver Asmus M 36 4.3 1 1 1 0 0 1 0 0 0
3 Avengers: Infinity War Anthony Russo 2018 5 Bidisha Kar F 35 3.2 4 1 1 0 0 1 0 0 0
4 Annihilation Alex Garland 2018 1 Soumya Ghosh M 36 2.5 6 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 6 Krishanu Ghosh M 34 3.7 3 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 5 Bidisha Kar F 35 2.4 6 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 2 Oliver Asmus M 36 3.0 3 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 3 Ricky Sharma M 34 4.2 2 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 4 Idhant Ghosh M 6 2.9 6 0 1 0 1 0 1 0 0
4 Annihilation Alex Garland 2018 10 Reese McGarvey F 7 3.7 2 0 1 0 1 0 1 0 0
5 Ant-Man and the Wasp Peyton Reed 2018 4 Idhant Ghosh M 6 4.1 2 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 5 Bidisha Kar F 35 4.3 1 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 2 Oliver Asmus M 36 2.8 4 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 6 Krishanu Ghosh M 34 2.9 6 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 10 Reese McGarvey F 7 2.0 5 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 3 Ricky Sharma M 34 3.7 4 1 1 0 0 0 0 1 0
5 Ant-Man and the Wasp Peyton Reed 2018 1 Soumya Ghosh M 36 3.0 5 1 1 0 0 0 0 1 0
6 Mission Impossible: Fallout Christopher McQuarrie 2018 4 Idhant Ghosh M 6 3.6 4 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 2 Oliver Asmus M 36 2.2 6 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 3 Ricky Sharma M 34 4.5 1 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 10 Reese McGarvey F 7 3.0 4 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 5 Bidisha Kar F 35 3.0 5 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 6 Krishanu Ghosh M 34 3.0 5 1 1 0 0 0 0 0 1
6 Mission Impossible: Fallout Christopher McQuarrie 2018 1 Soumya Ghosh M 36 4.1 2 1 1 0 0 0 0 0 1

Data Analysis:

Structure of the analytical data set.

str(detail_ratings_df)
## 'data.frame':    42 obs. of  19 variables:
##  $ Movie_ID     : int  1 1 1 1 1 1 1 2 2 2 ...
##  $ Movie_Title  : Factor w/ 6 levels "Annihilation",..: 4 4 4 4 4 4 4 5 5 5 ...
##  $ Director_Name: Factor w/ 6 levels "Alex Garland",..: 6 6 6 6 6 6 6 3 3 3 ...
##  $ Movie_Year   : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ User_ID      : int  6 10 2 4 3 1 5 4 3 2 ...
##  $ First_Name   : Factor w/ 7 levels "Bidisha","Idhant",..: 3 5 4 2 6 7 1 2 6 4 ...
##  $ Last_Name    : Factor w/ 5 levels "Asmus","Ghosh",..: 2 4 1 2 5 2 3 2 5 1 ...
##  $ Gender       : Factor w/ 2 levels "F","M": 2 1 2 2 2 2 1 2 2 2 ...
##  $ Age          : int  34 7 36 6 34 36 35 6 34 36 ...
##  $ Rating       : num  4.6 4.6 3.2 4.6 3.5 4 3.7 3.2 2.8 2.4 ...
##  $ Movie_Rank   : int  1 1 2 1 5 3 3 5 6 5 ...
##  $ Action       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Adventure    : int  1 1 1 1 1 1 1 0 0 0 ...
##  $ Animation    : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ Drama        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fantasy      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sci-Fi       : int  1 1 1 1 1 1 1 0 0 0 ...
##  $ Thriller     : int  0 0 0 0 0 0 0 0 0 0 ...

Movie Ratings Analysis:

Here I have created a side by side Boxplot for Ratings by Movie titles. This plot provides a comparative between the 6 selected titles.

p<-ggplot(detail_ratings_df, aes(x = Movie_Title, y = Rating, color = Movie_Title)) +
  geom_boxplot()
p + theme(legend.position="bottom") + labs(title = "Movie Rating Boxplot By Title", x = "Movie Title") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic"),
axis.title.x = element_text(color="blue", size=14, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)

Movie Genre Analysis:

Here, I have calculate Mean Percentage of Movie Genre by Gender:

summary_df <- subset(detail_ratings_df,select = c(8,12:19))
summary_df <- aggregate(summary_df,by = list(summary_df$Gender),FUN=mean, na.action=na.omit)
summary_df <- summary_df[-2] 
summary_df <- cbind(summary_df,gender=c(1:2))
summary_df <-summary_df[,2:length(summary_df)]
summary_df.long <- melt(summary_df,id.vars="gender")

ggplot(summary_df.long,aes(x=variable,y=value,fill=factor(gender)))+
  geom_bar(stat="identity",position="dodge")+
  scale_fill_discrete(name="Gender",
                      breaks=c(1, 2),
                      labels=c("Female", "Male")) + labs(title = "Movie Genre Popularity By Gender", x = "Genre", y = "Mean Percentage")