Week2 - Romantic Movies

How does the Murray’s CX team feel about Romantic Movies?

In honor of Valentine’s Day, we’ll conduct a brief and informal survey of the Murray’s Cheese CX team. How do they feel about the best romantic movies of the last 30 years, as aggregated by Rotten Tomatoes?

Import movies from Rotten Tomatoes

We’ll pull the movie titles with their years directly from Rotten Tomatoes…

url <- "https://www.rottentomatoes.com/top/bestofrt/top_100_romance_movies"
rt_romanticfilms <- read_html(url)

table <- rt_romanticfilms %>%
  html_table(fill = TRUE)

table <- as.data.frame(table[[3]])

## <https://rpubs.com/jasdumas/rotten-tomatoes> was an informative resource in writing this code

Organize the data by year

We want to identify a subset of movies that the CX team is likely to have seen. Since they are a younger team, we’ll first subset the movies released just in the last 30 years. In order to do so effectively, we need to pull the years out of the “Title” column and into their own “Year” column using some regular expressions.

year <- as.numeric(str_extract_all(table$Title, "[:digit:]{4}", simplify = FALSE))

table <- table %>%
  mutate(Year = year)

Subset the data by year

Now, we can subset the data by limiting the entries to those movies released after 1980. To keep the survey to around 5 minutes, we’ll arbitrarily cut the number of entries at 12.

after80 <- subset(table, Year >= 1980)
after80_top12 <- slice_head(after80, n = 12)

The Survey

Participants are given the link https://bit.ly/3daRpYr to a Microsoft Forms page where they can anonymously rate each of the films using a simple 1-5 star scale. They are given the following directions:

Please rate from 1-5 stars. 1 = “hated it!”, 3 = “it was OK”, 5 = “loved it!”
Please leave the question blank if you have not seen the movie.

The list of entries are as follows:

##    Rank RatingTomatometer                       Title No. of Reviews Year
## 1     3               92%   The Shape of Water (2017)            451 2017
## 2     6               98%         The Big Sick (2017)            301 2017
## 3     8               94% Call Me by Your Name (2018)            358 2018
## 4    16               94%                Carol (2015)            309 2015
## 5    17               98%      Before Midnight (2013)            205 2013
## 6    21               95%           The Artist (2011)            318 2011
## 7    25               94%                  Her (2013)            282 2013
## 8    26               99%  Ash Is Purest White (2019)            153 2019
## 9    33              100%       Only Yesterday (1991)             56 1991
## 10   34               97%             Sideways (2004)            233 2004
## 11   35               97%   The Princess Bride (1987)             77 1987
## 12   37               94% Beauty and the Beast (1991)            117 1991

Survey Results to SQL Workbench

7 team members complete the survey with an average completion time of 04:36. Despite the author’s earnest efforts to keep the list of options relevant and engaging, the team is furious about the omission of such recent classics as When Harry Met Sally, Dirty Dancing, and The Wedding Singer.

Nevertheless, they fill out the survey and the results provide a small dataset, and we proceed to transform them for our analysis using SQL Workbench.

The data is exported from Microsoft Forms as a .csv file. View/ obtain this file from GitHub: https://raw.githubusercontent.com/curdferguson/607-wk2-romcoms/main/cx_movies.csv.

MySQL Workbench is used to import the cx_movies.csv data into a SQL table. MySQL DDL and DML language are used to subset the data for just those movies which have at least 3 ratings, and to export them in a new .csv file with their number of ratings and average rating. View and download the 607-wk2-movies.sql script and new wk2-relevant-movies.csv file from https://github.com/curdferguson/607-wk2-romcoms.

Import the relevant data from SQL back into R

Let’s import the summary data we’ve wrangled in SQL back into R. We’ll sort them in descending order of Response_Average.

relevant_movies <- read.csv("https://raw.githubusercontent.com/curdferguson/607-wk2-romcoms/main/wk2-relevant-movies.csv", header=FALSE)

names(relevant_movies) <- c("id", "Title", "Response_Count", "Average_Rating")

relevant_movies <- relevant_movies %>% mutate(Title = fct_reorder(Title, Average_Rating, .desc=FALSE))
## based on code from <https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html>

relevant_movies <- arrange(relevant_movies, desc(Title))

relevant_movies

##   id                       Title Response_Count Average_Rating
## 1  4   The Princess Bride (1987)              7         4.5714
## 2  5 Beauty and the Beast (1991)              7         4.5714
## 3  2                  Her (2013)              3         4.3333
## 4  1         The Big Sick (2017)              5         4.0000
## 5  3             Sideways (2004)              3         2.6667

Analysis

The final analysis shows that The Princess Bride (1987) and Beauty and the Beast (1991) tied for first place, with an average rating of 4.5714. The only film in the final “relevant” dataset that scored below 4.0 was Sideways (2004), with an average rating of 2.6667.

ggplot(relevant_movies, aes(x=Title, y=Average_Rating)) +
      geom_col(fill="#f68060") +
      coord_flip()