Data 607 Assignment 2

Overview / Introduction

For this assignment, we were tasked with choosing six recent popular movies and asking at least five people that we know to rate each of these movies that they have seen on a scale of 1 to 5. We then had to take the results and store them in the class MySQL database to later transfer over to and load into an R data frame.

Approach

Seeing as how the majority of people I know aren’t avid movie watchers, I decided to take a different approach to collecting the data. Using the popular movie ratings site, IMDb, I went ahead and searched up the most popular movies from the years 2023 and 2024 and arbitrarily chose six of them for the analysis. IMDb has a peculiar set up in that, in addition to looking into ratings info for movies, you can also look into the ratings activity of individual users. I decided to use this to my advantage and accumulate my ratings data from specific users and not just any random ratings.

Instead of inputting the ratings data directly into a spreadsheet, I decided to simulate rater input by designing a Google form and entering a submission for every rater with their respective movie ratings. The form submission data was then collected in this Google sheet. I entered a total of seven submissions since I wanted to have at least five ratings for each movie.

After obtaining the required data, I downloaded a csv extract of the Google sheet and imported it into my MySQL database. Upon loading, I noticed that some of the observations were missing from the table. Further inspection showed that the import wizard had difficulty interpreting the blanks within the data. In order to circumvent the issue, I manually inputted ‘null’ into the blank spaces. This was feasible for such a small data set but a much larger set would have definitely required a different approach. Either way, this fixed the issue and the file was able to be completely imported to the database. The remainder of this file describes the data loading and manipulation process on RStudio.

Load Packages

Let’s start off by loading the necessary packages.

library(tidyverse)
library(dplyr)
library(digest)
library(RMariaDB)
library(DBI)

Define Log In Information

Let’s then create variables containing the information necessary to sign into the MySQL database. These variables will be called upon later to establish a connection and load the desired table into our RStudio environment.

hash_hex <- digest('23246087', algo = "sha256", serialize = FALSE)
quarter_hash <- substr(hash_hex, 1, nchar(hash_hex) %/% 4)
user <- 'steven.gonzalez87'
host <- 'cunydata607sql.mysql.database.azure.com'
port <- 3306

Establish Connection

Using the dbConnect() function, along with the previously defined variables, we can now establish a connection between RStudio and our MySQL database.

connection <- dbConnect(
  drv = MariaDB(),
  dbname = user,
  user = user,
  password = quarter_hash,
  host = host,
  port = port
)

Load Table and Disconnect from Database

Now that we’re connected, we can easily load our desired table into a local data frame and disconnect from the database once done. A glimpse of the movie_ratings data frame can be seen below.

movie_ratings <- tbl(connection,'imdb_movie_ratings') %>% 
  collect()
dbDisconnect(connection)
glimpse(movie_ratings)

## Rows: 7
## Columns: 8
## $ Timestamp                           <dttm> 2024-10-21 01:14:31, 2024-10-21 0…
## $ Name                                <chr> "SnoopyStyle", "kosmasp", "claudio…
## $ Barbie                              <int> 8, 9, 3, 6, 7, 10, 6
## $ `Deadpool & Wolverine`              <int64> 7, 10, NA, 8, 8, 8, NA
## $ `Dune: Part Two`                    <int64> 8, 10, 8, NA, 7, 9, NA
## $ `Inside Out 2`                      <int64> 8, 10, NA, NA, 8, 9, 7
## $ `Kingdom of the Planet of the Apes` <int64> 8, 8, 6, NA, 6, 7, 6
## $ Oppenheimer                         <int64> 8, 10, NA, 9, 7, 10, 7

Tidy Data and Transform Ratings

Tidying up the data frame can render it more useful for data analysis. We also have to transform the ratings from a scale of 1-10 to a scale of 1-5 since that was the scale initially requested of us. A glimpse of the tidy_movie_ratings data frame can be seen below.

tidy_movie_ratings <- movie_ratings %>%
  pivot_longer(cols=c(3:8), 
               names_to="Movie",
               values_to="Rating") %>% 
  mutate(Rating = round(Rating/2,0))
glimpse(tidy_movie_ratings)

## Rows: 42
## Columns: 4
## $ Timestamp <dttm> 2024-10-21 01:14:31, 2024-10-21 01:14:31, 2024-10-21 01:14:…
## $ Name      <chr> "SnoopyStyle", "SnoopyStyle", "SnoopyStyle", "SnoopyStyle", …
## $ Movie     <chr> "Barbie", "Deadpool & Wolverine", "Dune: Part Two", "Inside …
## $ Rating    <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 4, 5, 2, NA, 4, NA, 3, NA, 3, …