This week, I asked five people in my life to score 6 recent movies on a scale from 1-5. The results are stored in a MySQL database in three tables: movies, people, and scores. The movies and people tables include a bit of metadata about each, most of which is irrelevant to the analysis in this notebook. I included them for practice and realism; in most contexts, there will be columns that, however important, do not need to be included in every analysis.
The scores table has a couple notable traits. For one, it treats “score” as one feature, rather than having a separate feature for each film. It also contains no real information about people or movies, only id’s for their respective tables. Thus, a sensical analysis will require some thoughtful joins and querying.
NOTE: The people referenced in the people table are real, but I have replaced their last names for anonymity’s sake. See if you notice a pattern in the naming convention!
RMySQL for accessing the database, tidyverse for analysis and visualization, and keyring for obscuring and accessing the MySQL password.
library(RMySQL)
## Loading required package: DBI
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(keyring)
The password has been obscured using keyring for security purposes.
mysqlconnection = dbConnect(RMySQL::MySQL(),
dbname='movies',
host='localhost',
port=3306,
user='root',
password=key_get('MySQL', 'root'))
# Got help from: https://www.projectpro.io/recipes/connect-mysql-r
Want to make sure each was connected properly.
dbListTables(mysqlconnection)
## [1] "movies" "people" "scores"
Pull in all 3 tables and glance at their structure:
df_movies <- fetch(
dbSendQuery(mysqlconnection, "SELECT * FROM movies"),
n = -1)
df_people <- fetch(
dbSendQuery(mysqlconnection, "SELECT * FROM people"),
n = -1)
df_scores <- fetch(
dbSendQuery(mysqlconnection, "SELECT * FROM scores"),
n = -1)
head(df_movies)
## movie_id title year director
## 1 1 Everything Everywhere All at Once 2022 Daniel Kwan, Daniel Scheinert
## 2 2 Elvis 2022 Baz Luhrmann
## 3 3 NOPE 2022 Jordan Peele
## 4 4 Top Gun: Maverick 2022 Joseph Kosinski
## 5 5 Glass Onion 2022 Rian Johnson
## 6 6 RRR 2022 S. S. Rajamouli
head(df_people)
## person_id first_name last_name relationship_to_jacob age gender
## 1 1 Victoria Spielberg wife 28 f
## 2 2 Adam Campion friend 30 m
## 3 3 Lily Mann friend 32 f
## 4 4 Ethan Bigelow friend 29 m
## 5 5 Emily Scorcese friend 29 f
head(df_scores)
## movie_id person_id score
## 1 1 1 4
## 2 3 1 3
## 3 4 1 4
## 4 5 1 2
## 5 1 2 3
## 6 2 2 2
Note that if we want to see who rated which movies in what way, the “scores” table would be the place to see that, but it is incomprehensible to the human eye. We need to conduct some joins to see what names are attached to what movies and scores.
We now have all the tables read in from the SQL database, and can conduct joins using the R syntax. However, we should be able to pull in the proper table in one fell swoop with the right SQL query.
df_movie_scores <- fetch(
dbSendQuery(
mysqlconnection, "SELECT people.first_name,
movies.title,
scores.score
FROM scores
INNER JOIN people ON
scores.person_id = people.person_id
INNER JOIN movies ON
scores.movie_id = movies.movie_id"
),
n = -1
)
df_movie_scores
## first_name title score
## 1 Emily Everything Everywhere All at Once 5
## 2 Ethan Everything Everywhere All at Once 3
## 3 Lily Everything Everywhere All at Once 4
## 4 Adam Everything Everywhere All at Once 3
## 5 Victoria Everything Everywhere All at Once 4
## 6 Emily Elvis 3
## 7 Ethan Elvis 3
## 8 Lily Elvis 4
## 9 Adam Elvis 2
## 10 Ethan NOPE 3
## 11 Lily NOPE 5
## 12 Victoria NOPE 3
## 13 Emily Top Gun: Maverick 4
## 14 Ethan Top Gun: Maverick 5
## 15 Lily Top Gun: Maverick 4
## 16 Victoria Top Gun: Maverick 4
## 17 Emily Glass Onion 4
## 18 Ethan Glass Onion 1
## 19 Lily Glass Onion 1
## 20 Adam Glass Onion 3
## 21 Victoria Glass Onion 2
## 22 Ethan RRR 5
## 23 Adam RRR 4
We now have one table where each row contains one person, one movie, and that persons 1-5 score for that particular film.
I’m curious to get a sense of how each movie fared among my friends in general. Perhaps a box plot would be a good way to visualize the distribution of results across the films.
p <- ggplot(
df_movie_scores,
aes(x = score,
y = reorder(title, score, mean)
)
) +
geom_boxplot(aes(fill = title),
show.legend = FALSE) +
labs(title = "Movies: Aggregate Scores",
y = "Films")
p
Here, we can see that the top rated movies overall were RRR and Top Gun: Maverick. These are arguably the most action-heavy films in the data-set, so perhaps we have a thrill-seeking crowd! However, it is important to note the small sample sizes, particular for RRR.
In a future analysis, I would be interested to uncover patterns in film preferences. Does gender or age seem to play a role? With such a small sample, it would be hard to say, but our database tables have plenty of room for more rows should the need arise!