data607

Overview

This week, I asked five people in my life to score 6 recent movies on a scale from 1-5. The results are stored in a MySQL database in three tables: movies, people, and scores. The movies and people tables include a bit of metadata about each, most of which is irrelevant to the analysis in this notebook. I included them for practice and realism; in most contexts, there will be columns that, however important, do not need to be included in every analysis.

The scores table has a couple notable traits. For one, it treats “score” as one feature, rather than having a separate feature for each film. It also contains no real information about people or movies, only id’s for their respective tables. Thus, a sensical analysis will require some thoughtful joins and querying.

NOTE: The people referenced in the people table are real, but I have replaced their last names for anonymity’s sake. See if you notice a pattern in the naming convention!

Import relevant packages

RMySQL for accessing the database, tidyverse for analysis and visualization, and keyring for obscuring and accessing the MySQL password.

library(RMySQL)

## Loading required package: DBI

library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(keyring)

Connect to the database

The password has been obscured using keyring for security purposes.

mysqlconnection = dbConnect(RMySQL::MySQL(),
                            dbname='movies',
                            host='localhost',
                            port=3306,
                            user='root',
                            password=key_get('MySQL', 'root'))

# Got help from: https://www.projectpro.io/recipes/connect-mysql-r

View tables in database

Want to make sure each was connected properly.

dbListTables(mysqlconnection)

## [1] "movies" "people" "scores"

Examine tables

Pull in all 3 tables and glance at their structure:

df_movies <- fetch(
  dbSendQuery(mysqlconnection, "SELECT * FROM movies"), 
  n = -1)
df_people <- fetch(
  dbSendQuery(mysqlconnection, "SELECT * FROM people"), 
  n = -1) 
df_scores <- fetch(
  dbSendQuery(mysqlconnection, "SELECT * FROM scores"), 
  n = -1) 

head(df_movies)

##   movie_id                             title year                      director
## 1        1 Everything Everywhere All at Once 2022 Daniel Kwan, Daniel Scheinert
## 2        2                             Elvis 2022                  Baz Luhrmann
## 3        3                              NOPE 2022                  Jordan Peele
## 4        4                 Top Gun: Maverick 2022               Joseph Kosinski
## 5        5                       Glass Onion 2022                  Rian Johnson
## 6        6                               RRR 2022               S. S. Rajamouli

head(df_people)

##   person_id first_name last_name relationship_to_jacob age gender
## 1         1   Victoria Spielberg                  wife  28      f
## 2         2       Adam   Campion                friend  30      m
## 3         3       Lily      Mann                friend  32      f
## 4         4      Ethan   Bigelow                friend  29      m
## 5         5      Emily  Scorcese                friend  29      f

head(df_scores)

##   movie_id person_id score
## 1        1         1     4
## 2        3         1     3
## 3        4         1     4
## 4        5         1     2
## 5        1         2     3
## 6        2         2     2

Reflect on tables

Note that if we want to see who rated which movies in what way, the “scores” table would be the place to see that, but it is incomprehensible to the human eye. We need to conduct some joins to see what names are attached to what movies and scores.

Create readable scores table

We now have all the tables read in from the SQL database, and can conduct joins using the R syntax. However, we should be able to pull in the proper table in one fell swoop with the right SQL query.

df_movie_scores <- fetch(
  dbSendQuery(
    mysqlconnection, "SELECT people.first_name, 
                      movies.title, 
                      scores.score 
                      FROM scores
                      INNER JOIN people ON 
                      scores.person_id = people.person_id
                      INNER JOIN movies ON 
                      scores.movie_id = movies.movie_id"
    ),
  n = -1
  )

df_movie_scores

##    first_name                             title score
## 1       Emily Everything Everywhere All at Once     5
## 2       Ethan Everything Everywhere All at Once     3
## 3        Lily Everything Everywhere All at Once     4
## 4        Adam Everything Everywhere All at Once     3
## 5    Victoria Everything Everywhere All at Once     4
## 6       Emily                             Elvis     3
## 7       Ethan                             Elvis     3
## 8        Lily                             Elvis     4
## 9        Adam                             Elvis     2
## 10      Ethan                              NOPE     3
## 11       Lily                              NOPE     5
## 12   Victoria                              NOPE     3
## 13      Emily                 Top Gun: Maverick     4
## 14      Ethan                 Top Gun: Maverick     5
## 15       Lily                 Top Gun: Maverick     4
## 16   Victoria                 Top Gun: Maverick     4
## 17      Emily                       Glass Onion     4
## 18      Ethan                       Glass Onion     1
## 19       Lily                       Glass Onion     1
## 20       Adam                       Glass Onion     3
## 21   Victoria                       Glass Onion     2
## 22      Ethan                               RRR     5
## 23       Adam                               RRR     4

Visualize the scores

We now have one table where each row contains one person, one movie, and that persons 1-5 score for that particular film.

I’m curious to get a sense of how each movie fared among my friends in general. Perhaps a box plot would be a good way to visualize the distribution of results across the films.

p <- ggplot(
  df_movie_scores,
  aes(x = score,
      y = reorder(title, score, mean)
      )
  ) +
  geom_boxplot(aes(fill = title), 
               show.legend = FALSE) +
  labs(title = "Movies: Aggregate Scores",
       y = "Films")

p

Conclusion

Here, we can see that the top rated movies overall were RRR and Top Gun: Maverick. These are arguably the most action-heavy films in the data-set, so perhaps we have a thrill-seeking crowd! However, it is important to note the small sample sizes, particular for RRR.

In a future analysis, I would be interested to uncover patterns in film preferences. Does gender or age seem to play a role? With such a small sample, it would be hard to say, but our database tables have plenty of room for more rows should the need arise!

data607_hw2

2023-02-04