Summary

We are looking at the ratings of five movies. Respondents were given a google form to rate each movie from 1-5. If the respondent had not watched the movie they were asked to leave it blank.

Libraries Used

The libraries used throughout the assignment are shared below. Googlesheets was used to import the responses provided.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)
library(dplyr)
library(RMySQL)

Connecting my database

mydb <- dbConnect(MySQL(),
                  user = 'william.forero99', 
                  password = '54e9f1ec38640291', 
                  dbname = 'william.forero99', 
                  host = 'cuny607sql.mysql.database.azure.com')

Loading the responses

We will load the responses of the google survey from the associated google sheets that was created and placed on github

response_url <- 'https://raw.githubusercontent.com/william-forero/Data-607/main/Movie_Ratings.csv'
response_data <- read.csv(response_url)

summary(response_data)
##   Timestamp          First.Name         Last.Name          Borderlands   
##  Length:13          Length:13          Length:13          Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :2.500  
##                                                           Mean   :2.625  
##                                                           3rd Qu.:4.000  
##                                                           Max.   :5.000  
##                                                           NA's   :5      
##     Moana.2    Mufasa.The.Lion.King Venom.The.Last.Dance     Wicked     
##  Min.   :1.0   Min.   :2.000        Min.   :1.0          Min.   :1.000  
##  1st Qu.:3.0   1st Qu.:3.000        1st Qu.:3.0          1st Qu.:2.000  
##  Median :3.0   Median :4.000        Median :4.0          Median :3.000  
##  Mean   :3.2   Mean   :3.636        Mean   :3.5          Mean   :2.444  
##  3rd Qu.:4.0   3rd Qu.:4.500        3rd Qu.:4.0          3rd Qu.:3.000  
##  Max.   :5.0   Max.   :5.000        Max.   :5.0          Max.   :4.000  
##  NA's   :3     NA's   :2            NA's   :3            NA's   :4

We have 13 responses, not all movies have a rating for every response, which tells us that not all movies were watched by every respondent

Creating the tables to store the data

Three tables were created to store this information, the intent of creating the three tables would be to achieve third normal form - users: user_id first_name last_name

movies: movie_id movie_name

ratings: user_id movie_id rating

dbExecute(mydb, “DROP TABLE IF EXISTS ratings”) dbExecute(mydb, “DROP TABLE IF EXISTS users”) dbExecute(mydb, “DROP TABLE IF EXISTS movies”)

# Allows the code to be run again without inserting duplicate entries
dbExecute(mydb, "DROP TABLE IF EXISTS ratings")
## [1] 0
dbExecute(mydb, "DROP TABLE IF EXISTS users")
## [1] 0
dbExecute(mydb, "DROP TABLE IF EXISTS movies")
## [1] 0
# Create the users table
dbExecute(mydb, "
    CREATE TABLE IF NOT EXISTS users (
        user_id INT PRIMARY KEY AUTO_INCREMENT,
        first_name VARCHAR(50),
        last_name VARCHAR(50)
    )
")
## [1] 0
# Create movies table
dbExecute(mydb, "
    CREATE TABLE IF NOT EXISTS movies (
        movie_id INT PRIMARY KEY AUTO_INCREMENT,
        movie_name VARCHAR(255) UNIQUE
    )
")
## [1] 0
# Create ratings table
dbExecute(mydb, "
    CREATE TABLE IF NOT EXISTS ratings (
        user_id INT,
        movie_id INT,
        rating INT,
        FOREIGN KEY (user_id) REFERENCES users(user_id),
        FOREIGN KEY (movie_id) REFERENCES movies(movie_id)
    )
")
## [1] 0

Now we insert the rows corresponding to the survey results.

# Inserting the users information
dbExecute(mydb, "
    INSERT INTO users (first_name, last_name) VALUES 
    ('Torren', 'Jarvis '), 
    ('Kalleen', 'Marte'), 
    ('Orlando', 'G'), 
    ('Edwin', 'Edouard'), 
    ('Rene', 'J'), 
    ('Paul', 'Daley'), 
    ('Dorrelle', 'Caine'), 
    ('Richard', 'Aguilar Navarrette'), 
    ('Zulfiqar', 'Soomro'), 
    ('Brexnak', 'Rojas'), 
    ('Ben', 'Lawrence'), 
    ('Marcia', 'Hernandez'), 
    ('Melissa', 'Baez')
")
## [1] 13
# Inserting the movies information
dbExecute(mydb, "
    INSERT INTO movies (movie_name) VALUES 
    ('Borderlands'),
    ('Moana 2'),
    ('Mufasa The Lion King'),
    ('Venom The Last Dance'),
    ('Wicked')
")
## [1] 5
# Inserting the responses into ratings
dbExecute(mydb, "
    INSERT INTO ratings (user_id, movie_id, rating) VALUES 
    (1, 1, 1), 
    (1, 2, 3), 
    (1, 3, 3), 
    (1, 4, 3), 
    (1, 5, 1), 
    (2, 1, 2), 
    (2, 2, 4), 
    (2, 3, 3), 
    (2, 4, 3), 
    (2, 5, 4), 
    (3, 2, 3), 
    (3, 3, 2), 
    (3, 4, 4), 
    (3, 5, 3), 
    (4, 1, 1), 
    (4, 2, 4), 
    (4, 3, 4), 
    (4, 4, 4), 
    (4, 5, 2), 
    (6, 1, 5), 
    (6, 2, 3), 
    (6, 3, 4), 
    (6, 4, 4), 
    (6, 5, 3), 
    (7, 1, 4), 
    (7, 2, 3), 
    (7, 3, 5), 
    (7, 4, 4), 
    (7, 5, 3), 
    (8, 3, 3), 
    (9, 1, 3), 
    (9, 2, 1), 
    (9, 3, 2), 
    (9, 4, 3), 
    (9, 5, 3), 
    (10, 1, 4), 
    (10, 2, 5), 
    (10, 3, 5), 
    (10, 4, 4), 
    (10, 5, 2), 
    (11, 1, 1), 
    (11, 2, 1), 
    (11, 3, 4), 
    (11, 4, 1), 
    (11, 5, 1), 
    (12, 2, 5), 
    (12, 4, 5), 
    (13, 3, 5)
")
## [1] 48

Taking the data from MySQL and creating R dataframes

Now that the tables were created and the responses loaded, lets create dataframes in R that represent the data.

users_df <- dbGetQuery(mydb, "SELECT * FROM users")

movies_df <- dbGetQuery(mydb, "SELECT * FROM movies")

ratings_df <- dbGetQuery(mydb, "SELECT * FROM ratings")

Then we can join the three into a single dataframe

responses_df <- ratings_df %>%
    left_join(users_df, by = "user_id") %>%
    left_join(movies_df, by = "movie_id")

head(responses_df)
##   user_id movie_id rating first_name last_name           movie_name
## 1       1        1      1     Torren   Jarvis           Borderlands
## 2       1        2      3     Torren   Jarvis               Moana 2
## 3       1        3      3     Torren   Jarvis  Mufasa The Lion King
## 4       1        4      3     Torren   Jarvis  Venom The Last Dance
## 5       1        5      1     Torren   Jarvis                Wicked
## 6       2        1      2    Kalleen     Marte          Borderlands

Missing value treatment

Not all respondednts viewed all 5 of the movies. When a rating was left blank, it was omitted from the final results. It was omitted because assigning a 0 value would not give an accurate representation. It can be mistaken as a very bad rating when in reality it was just no rating. Imputing a median or average rating to nulls also would not be an accurate representation as there was no rating from the respondednt which potentially reflects an underlying observation. That observation being that the respondednt may not have been interested in seeing the movie, which could speak to genre choices etc.