We are looking at the ratings of five movies. Respondents were given a google form to rate each movie from 1-5. If the respondent had not watched the movie they were asked to leave it blank.
The libraries used throughout the assignment are shared below. Googlesheets was used to import the responses provided.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)
library(dplyr)
library(RMySQL)
Connecting my database
mydb <- dbConnect(MySQL(),
user = 'william.forero99',
password = '54e9f1ec38640291',
dbname = 'william.forero99',
host = 'cuny607sql.mysql.database.azure.com')
We will load the responses of the google survey from the associated google sheets that was created and placed on github
response_url <- 'https://raw.githubusercontent.com/william-forero/Data-607/main/Movie_Ratings.csv'
response_data <- read.csv(response_url)
summary(response_data)
## Timestamp First.Name Last.Name Borderlands
## Length:13 Length:13 Length:13 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Mode :character Median :2.500
## Mean :2.625
## 3rd Qu.:4.000
## Max. :5.000
## NA's :5
## Moana.2 Mufasa.The.Lion.King Venom.The.Last.Dance Wicked
## Min. :1.0 Min. :2.000 Min. :1.0 Min. :1.000
## 1st Qu.:3.0 1st Qu.:3.000 1st Qu.:3.0 1st Qu.:2.000
## Median :3.0 Median :4.000 Median :4.0 Median :3.000
## Mean :3.2 Mean :3.636 Mean :3.5 Mean :2.444
## 3rd Qu.:4.0 3rd Qu.:4.500 3rd Qu.:4.0 3rd Qu.:3.000
## Max. :5.0 Max. :5.000 Max. :5.0 Max. :4.000
## NA's :3 NA's :2 NA's :3 NA's :4
We have 13 responses, not all movies have a rating for every response, which tells us that not all movies were watched by every respondent
Three tables were created to store this information, the intent of creating the three tables would be to achieve third normal form - users: user_id first_name last_name
movies: movie_id movie_name
ratings: user_id movie_id rating
dbExecute(mydb, “DROP TABLE IF EXISTS ratings”) dbExecute(mydb, “DROP TABLE IF EXISTS users”) dbExecute(mydb, “DROP TABLE IF EXISTS movies”)
# Allows the code to be run again without inserting duplicate entries
dbExecute(mydb, "DROP TABLE IF EXISTS ratings")
## [1] 0
dbExecute(mydb, "DROP TABLE IF EXISTS users")
## [1] 0
dbExecute(mydb, "DROP TABLE IF EXISTS movies")
## [1] 0
# Create the users table
dbExecute(mydb, "
CREATE TABLE IF NOT EXISTS users (
user_id INT PRIMARY KEY AUTO_INCREMENT,
first_name VARCHAR(50),
last_name VARCHAR(50)
)
")
## [1] 0
# Create movies table
dbExecute(mydb, "
CREATE TABLE IF NOT EXISTS movies (
movie_id INT PRIMARY KEY AUTO_INCREMENT,
movie_name VARCHAR(255) UNIQUE
)
")
## [1] 0
# Create ratings table
dbExecute(mydb, "
CREATE TABLE IF NOT EXISTS ratings (
user_id INT,
movie_id INT,
rating INT,
FOREIGN KEY (user_id) REFERENCES users(user_id),
FOREIGN KEY (movie_id) REFERENCES movies(movie_id)
)
")
## [1] 0
Now we insert the rows corresponding to the survey results.
# Inserting the users information
dbExecute(mydb, "
INSERT INTO users (first_name, last_name) VALUES
('Torren', 'Jarvis '),
('Kalleen', 'Marte'),
('Orlando', 'G'),
('Edwin', 'Edouard'),
('Rene', 'J'),
('Paul', 'Daley'),
('Dorrelle', 'Caine'),
('Richard', 'Aguilar Navarrette'),
('Zulfiqar', 'Soomro'),
('Brexnak', 'Rojas'),
('Ben', 'Lawrence'),
('Marcia', 'Hernandez'),
('Melissa', 'Baez')
")
## [1] 13
# Inserting the movies information
dbExecute(mydb, "
INSERT INTO movies (movie_name) VALUES
('Borderlands'),
('Moana 2'),
('Mufasa The Lion King'),
('Venom The Last Dance'),
('Wicked')
")
## [1] 5
# Inserting the responses into ratings
dbExecute(mydb, "
INSERT INTO ratings (user_id, movie_id, rating) VALUES
(1, 1, 1),
(1, 2, 3),
(1, 3, 3),
(1, 4, 3),
(1, 5, 1),
(2, 1, 2),
(2, 2, 4),
(2, 3, 3),
(2, 4, 3),
(2, 5, 4),
(3, 2, 3),
(3, 3, 2),
(3, 4, 4),
(3, 5, 3),
(4, 1, 1),
(4, 2, 4),
(4, 3, 4),
(4, 4, 4),
(4, 5, 2),
(6, 1, 5),
(6, 2, 3),
(6, 3, 4),
(6, 4, 4),
(6, 5, 3),
(7, 1, 4),
(7, 2, 3),
(7, 3, 5),
(7, 4, 4),
(7, 5, 3),
(8, 3, 3),
(9, 1, 3),
(9, 2, 1),
(9, 3, 2),
(9, 4, 3),
(9, 5, 3),
(10, 1, 4),
(10, 2, 5),
(10, 3, 5),
(10, 4, 4),
(10, 5, 2),
(11, 1, 1),
(11, 2, 1),
(11, 3, 4),
(11, 4, 1),
(11, 5, 1),
(12, 2, 5),
(12, 4, 5),
(13, 3, 5)
")
## [1] 48
Now that the tables were created and the responses loaded, lets create dataframes in R that represent the data.
users_df <- dbGetQuery(mydb, "SELECT * FROM users")
movies_df <- dbGetQuery(mydb, "SELECT * FROM movies")
ratings_df <- dbGetQuery(mydb, "SELECT * FROM ratings")
Then we can join the three into a single dataframe
responses_df <- ratings_df %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id")
head(responses_df)
## user_id movie_id rating first_name last_name movie_name
## 1 1 1 1 Torren Jarvis Borderlands
## 2 1 2 3 Torren Jarvis Moana 2
## 3 1 3 3 Torren Jarvis Mufasa The Lion King
## 4 1 4 3 Torren Jarvis Venom The Last Dance
## 5 1 5 1 Torren Jarvis Wicked
## 6 2 1 2 Kalleen Marte Borderlands
Not all respondednts viewed all 5 of the movies. When a rating was left blank, it was omitted from the final results. It was omitted because assigning a 0 value would not give an accurate representation. It can be mistaken as a very bad rating when in reality it was just no rating. Imputing a median or average rating to nulls also would not be an accurate representation as there was no rating from the respondednt which potentially reflects an underlying observation. That observation being that the respondednt may not have been interested in seeing the movie, which could speak to genre choices etc.