Movie_Rating with SQL

I chose six recent films and asked five different imaginary individuals to rate each of the movies they had seen from a scale of 1 to 5.

Data Collection

I created a database called movie_rating in a MySQL workbench to store the data in individual tables:

Table 1: movie Table 2: name Table 3: review

The SQL code for table creation:

CREATE TABLE ‘movie’ ( ‘movie_id’ int NOT NULL, ‘title’ varchar(45) NOT NULL, ‘length’ varchar(45) NOT NULL, PRIMARY KEY (‘movie_id’))

CREATE TABLE ‘name’( ‘name_id’ int NOT NULL, ‘first_name’ varchar(45) NOT NULL, ‘age’ varchar(45) NOT NULL, PRIMARY KEY (’name_id))

CREATE TABLE ‘review’ ( ‘review_id’ int NOT NULL, ‘movie_id’ varchar(45) NOT NULL, ‘name_id’ varchar(45) NOT NULL, ‘rating’ int DEFAULT NULL, ‘review’ varchar(45) DEFAULT NULL, PRIMARY KEY (‘review_id’) )

First I installed and uploaded the packages I needed.

# First I installed and uploaded the packages I needed. 

#install.packages("RMySQL")
#install.packages("DBI")
library(RMySQL)

## Loading required package: DBI

library(DBI)
library(dbplyr)
library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::ident()  masks dbplyr::ident()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::sql()    masks dbplyr::sql()

library(ggplot2)

You need to create a password to access the local database.

# I then connected MySQL to R to upload my database, movie_rating

mydb = dbConnect(RMySQL::MySQL(),
       dbname='movie_rating',
       host='127.0.0.1',
       port=3306,
       user='root',
       password=rstudioapi::askForPassword("database password"))

# I previewed the tables.

dbListTables(mydb)

## [1] "movie"  "name"   "review"

Here I see the oldest of the individual is 39 and the youngest is 7.

# I wrote a query to show name table.

name_table <- dbSendQuery(mydb, "SELECT * FROM name;")
dbFetch(name_table)

##   name_id first_name age
## 1       1    Anthony  27
## 2       2   Josaline  17
## 3       3      Leila   7
## 4       4      Angie  37
## 5       5   Jonathan  39

Here I see the longest in length time is The Good Nurse and the shortest is Texas Chainsaw Massacre.

# I wrote a query to show movie table. 

movie_table <- dbSendQuery(mydb, "SELECT * FROM movie;")
dbFetch(movie_table)

##   movie_id                   title length
## 1        1        The Adam Project    106
## 2        2 Texas Chainsaw Massacre     81
## 3        3               Pinocchio     90
## 4        4          The Good Nurse    121
## 5        5             Tall Girl 2     97
## 6        6         Against The Ice    102

# I wrote a query to show review table.

review_table <- dbSendQuery(mydb, "SELECT * FROM review")
dbFetch(review_table)

##    review_id movie_id name_id rating      review
## 1          1        1       1      5 Great movie
## 2          2        2       1      4      Bloody
## 3          3        3       1      3 Didn't Care
## 4          4        4       1      3      Boring
## 5          5        5       1      2         Not
## 6          6        6       1      4          Ok
## 7          7        1       2      5    Loved it
## 8          8        2       2      5      Bloody
## 9          9        3       2      5       Enjoy
## 10        10        4       2      4          Ok
## 11        11        5       2      4       Funny
## 12        12        6       2      3          Ok
## 13        13        1       3      4        Good
## 14        14        2       3      0          NA
## 15        15        3       3      5    Loved it
## 16        16        4       3      2  Don't care
## 17        17        5       3      4       Funny
## 18        18        6       3      2      Boring
## 19        19        1       4      5       Great
## 20        20        2       4      4       Scary
## 21        21        3       4      5       Loved
## 22        22        4       4      4        Good
## 23        23        5       4      2      Stupid
## 24        24        6       4      4        Good
## 25        25        1       5      5       Great
## 26        26        2       5      5      Bloody
## 27        27        3       5      4        Good
## 28        28        4       5      4          Ok
## 29        29        5       5      2      Really
## 30        30        6       5      4        Good

# I then joined all three tables movie, name, and review to create one table called movie_rating.

movie_rating <- dbSendQuery(mydb, "SELECT 
M.title As 'Title',
M.length AS 'Length',
N.first_name AS 'Name',
N.age AS 'Age',
R.rating As 'Rating',
R.review AS 'Review'
FROM movie AS M
JOIN review AS R
ON M.movie_id = R.movie_id
JOIN name AS N
ON N.name_id = R.name_id;")
#dbFetch(movie_rating)
data<-fetch(movie_rating)
print(data)

##                      Title Length     Name Age Rating      Review
## 1         The Adam Project    106  Anthony  27      5 Great movie
## 2  Texas Chainsaw Massacre     81  Anthony  27      4      Bloody
## 3                Pinocchio     90  Anthony  27      3 Didn't Care
## 4           The Good Nurse    121  Anthony  27      3      Boring
## 5              Tall Girl 2     97  Anthony  27      2         Not
## 6          Against The Ice    102  Anthony  27      4          Ok
## 7         The Adam Project    106 Josaline  17      5    Loved it
## 8  Texas Chainsaw Massacre     81 Josaline  17      5      Bloody
## 9                Pinocchio     90 Josaline  17      5       Enjoy
## 10          The Good Nurse    121 Josaline  17      4          Ok
## 11             Tall Girl 2     97 Josaline  17      4       Funny
## 12         Against The Ice    102 Josaline  17      3          Ok
## 13        The Adam Project    106    Leila   7      4        Good
## 14 Texas Chainsaw Massacre     81    Leila   7      0          NA
## 15               Pinocchio     90    Leila   7      5    Loved it
## 16          The Good Nurse    121    Leila   7      2  Don't care
## 17             Tall Girl 2     97    Leila   7      4       Funny
## 18         Against The Ice    102    Leila   7      2      Boring
## 19        The Adam Project    106    Angie  37      5       Great
## 20 Texas Chainsaw Massacre     81    Angie  37      4       Scary
## 21               Pinocchio     90    Angie  37      5       Loved
## 22          The Good Nurse    121    Angie  37      4        Good
## 23             Tall Girl 2     97    Angie  37      2      Stupid
## 24         Against The Ice    102    Angie  37      4        Good
## 25        The Adam Project    106 Jonathan  39      5       Great
## 26 Texas Chainsaw Massacre     81 Jonathan  39      5      Bloody
## 27               Pinocchio     90 Jonathan  39      4        Good
## 28          The Good Nurse    121 Jonathan  39      4          Ok
## 29             Tall Girl 2     97 Jonathan  39      2      Really
## 30         Against The Ice    102 Jonathan  39      4        Good

# Checked the structure of the data. 30 rows. 6 columns. 

str(data, vec.len = 1)

## 'data.frame':    30 obs. of  6 variables:
##  $ Title : chr  "The Adam Project" ...
##  $ Length: chr  "106" ...
##  $ Name  : chr  "Anthony" ...
##  $ Age   : chr  "27" ...
##  $ Rating: int  5 4 ...
##  $ Review: chr  "Great movie" ...

Here I see The Adam Project had the highest average rating with 4.8 and Tall Girl 2 has the lowest with 2.8.

# I did a group by to see the average score for each movie rated. 

new_data <- data %>%
  filter(!is.na(Rating)) %>%
   group_by(Title) %>%
    summarise(Avg_Score = mean(as.numeric(Rating))) %>%
      arrange(desc(Avg_Score))
        new_data

## # A tibble: 6 × 2
##   Title                   Avg_Score
##   <chr>                       <dbl>
## 1 The Adam Project              4.8
## 2 Pinocchio                     4.4
## 3 Texas Chainsaw Massacre       3.6
## 4 Against The Ice               3.4
## 5 The Good Nurse                3.4
## 6 Tall Girl 2                   2.8

Same results as the above.

# I did a ggplot of the Average Score for each movie for visualization purpose. 

new_data %>%
ggplot +
geom_col(aes(Avg_Score, Title))

The Adam Project has the most 5 rating with 4 counts. Texas Mascare had 0 rating from a 7 year old who thought it was too bloody.

The Good Nurse got 1 - 2 Rating and 1 - 3 Rating and 3 - 4 Rating. That is about average rating.

The Adam Project got 1 - 3 Rating and 3 - 4 Rating. That is above average, 1st in place.

Texas Chainsaw Massacre got 1 - 0 Rating, 2 - 4 Rating, 2 - 5 Rating.

Tall Girl got 3 - 2 Rating and 2 - 4 Rating. That is below average. Ranked the lowest.

Pinocchio got 1 - 3 Rating, 1 - 4 Rating, and 3 - 5 Rating. This is above average, 2nd in place.

Against The Ice got 1 - 2 Rating, 1 - 3 Rating, and 3 - 5 Rating. This is average.

# I did another group by to see the count for each rating per movie. 

count_data <- data 
count_data %>% group_by(Title, Rating) %>% summarise(count = n())%>%
arrange(desc(Title))

## `summarise()` has grouped output by 'Title'. You can override using the
## `.groups` argument.

## # A tibble: 16 × 3
## # Groups:   Title [6]
##    Title                   Rating count
##    <chr>                    <int> <int>
##  1 The Good Nurse               2     1
##  2 The Good Nurse               3     1
##  3 The Good Nurse               4     3
##  4 The Adam Project             4     1
##  5 The Adam Project             5     4
##  6 Texas Chainsaw Massacre      0     1
##  7 Texas Chainsaw Massacre      4     2
##  8 Texas Chainsaw Massacre      5     2
##  9 Tall Girl 2                  2     3
## 10 Tall Girl 2                  4     2
## 11 Pinocchio                    3     1
## 12 Pinocchio                    4     1
## 13 Pinocchio                    5     3
## 14 Against The Ice              2     1
## 15 Against The Ice              3     1
## 16 Against The Ice              4     3

In conlusion more research has to be done on the ratings for these 5 movies to see what motivated these individuals to give them the rating they received. For now we can go as per the written reviews section for the reason of their ratings. For example Texas Chainsaw was bloody. Some gave it high ranking for that because the like horrow movies and some gave 0 because it was too bloody for her. We might want to look at the length of the movies. For example, The Good Nurse has a longest length of 121 minutes out of the 6 movies. Texas Chainsaw Massacre was the shortest with 81 minutes. The length of the movie could also affect the rating of the movies. Also we might want to look at the age of the individuals. The age could affect the rating of the movies. We would also need to take a take rating from a larger group to get a broader analysis.

For now The Adam Project ranks # 1 and Tall Girl ranks # 6 as per the 5 individuals.

Movie_Rating with SQL

Enid Roman

2022-09-11