knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RMySQL)
## Loading required package: DBI
library(keyring)
# Create a connection to the database
con <- dbConnect(MySQL(),
user = key_get("data_606_user_id"),
password = key_get("data_606_sqldb_pword"),
dbname = "maxfield.raynolds773",
host = "cuny607sql.mysql.database.azure.com",
port = 3306)
movie_ratings_raw <- dbGetQuery(con, "SELECT * FROM movie_ratings")
glimpse(movie_ratings_raw)
## Rows: 30
## Columns: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ movie <chr> "Anora", "Challengers", "Conclave", "Emilia Perez", "Everything…
## $ rating <chr> "", "", "4", "2.5", "", "", "", "", "", "", "5", "", "1", "4", …
## $ rater <chr> "Anna", "Anna", "Anna", "Anna", "Anna", "Anna", "Evan", "Evan",…
When the data was collected empty cells were left for any film not seen by a reviewer. The following code replaces the blank strings with “NA” values to indicate that the data is not available. The column is then converted to a numeric (dbl) type.
Treating the missing data this way ensures that the lack of rating is retained but actively marks that the data is unavailable while allowing the column to be treated as a numeric type which will allow for mathematical and statistical calculations.
movie_ratings <- movie_ratings_raw
movie_ratings[movie_ratings == ""] <- NA
movie_ratings <- movie_ratings |>
mutate(
rating = as.numeric(rating)
)
glimpse(movie_ratings)
## Rows: 30
## Columns: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ movie <chr> "Anora", "Challengers", "Conclave", "Emilia Perez", "Everything…
## $ rating <dbl> NA, NA, 4.0, 2.5, NA, NA, NA, NA, NA, NA, 5.0, NA, 1.0, 4.0, 3.…
## $ rater <chr> "Anna", "Anna", "Anna", "Anna", "Anna", "Anna", "Evan", "Evan",…
Standardized ratings would have been a useful tool. Some reviewers opted to make “half” ratings, between integers. This limits treating the data as integers. While it makes for a potentially more nuanced expression of the individual rating, it might make the data less universally translatable.
One potential data collection approach would be to request a star rating of 1 to 5, but only accept integers and assign a short desciptor phrase to each integer. For example a “5” rating could mean “One of the best films of all time; everyone should see”, while a “1” could mean “Terrible; never should have been made”.
To conceal the user login id and the password, the “keyring” package was used. While this is vulnerable on an individual system level, it allows for a quick and easy way to conceal the information in a document that will be distributed. There are individual options for more direct solutions.