Introduction
For this quick survey, respondents were asked via text-messaging app to rate six recent major Hollywood films on a 1-5 scale (with 1 being the worst and 5 the best.)
- Black Widow
- Cruella
- F9: The Fast Saga
- Godzilla vs. Kong
- Jungle Cruise
- The Green Knight
Results were manually entered into an Excel spreadsheet and saved to .csv format for import into a database.
Setup: MySQL
For this step we’re working with a MySQL database installed on the local machine (“localhost”).
For a dataset of such a small size, a simple approach with a single file containing all the survey data is sufficient. However for practice working with much larger datasets, the use of separate, normalized tables is also demonstrated.
Survey data were converted into four .csv files in order to demonstrate the simple and normalized approaches:
- all_scores.csv : a simple file with all the response data.
- films.csv: a list of the 6 film titles, each with a unique id number.
- critics.csv: a list of survey respondent names, each with a unique id number.
- scores.csv: a normalized list of the response scores and unique id, with corresponding film_id and critic_id.
First, we create the database, create the four tables, and import the .csv data into the tables:
Click here to view code
-- mysql
create database film_scores;
use film_scores;
-- Method 1: simple table with all data
create table all_scores (
id int primary key not null,
film_title varchar(255) not null,
critic_name varchar(255) not null,
score int
);
load data infile 'data/all_scores.csv'
into table all_scores
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
-- Method 2: three normalized tables
create table films (
id int primary key not null,
film_title varchar(255) not null
);
load data infile 'data/films.csv'
into table all_scores
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
--
create table critics (
id int primary key not null,
critic_name varchar(255) not null
);
load data infile 'data/critics.csv'
into table all_scores
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
--
create table scores (
id int primary key not null,
film_id int not null,
critic_id int not null,
score int
);
load data infile 'data/scores.csv'
into table all_scores
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
Setup: R
Since we’re working with MySQL, let’s install the DBI and RMySQL packages along with the tidyverse.
library(tidyverse)
library(RMySQL)
library(DBI)(It’s also possible to just use dbplyer and keep everything in R language, but this will allow us to view and use SQL queries directly)
Next, we create our database connection (allowing the local machine to prompt us for a username and password):
db <- DBI::dbConnect(RMySQL::MySQL(),
host = "localhost",
user = rstudioapi::askForPassword("user"),
password = rstudioapi::askForPassword("password"),
dbname= "film_scores")Query Data
Let’s create two versions .. the simple query grabs all data from the all_scores table, while the normalized query grabs data from the scores, films, and critics tables, and joins them all based on their ids.
# simple
q_simple <-
"select * from all_scores"
df_simple <- dbFetch(dbSendQuery(db,q_simple))
# normalized
q_norm <-
"select
s.id,
f.film_title,
c.critic_name,
s.score
from scores as s
left join films as f
on s.film_id = f.id
left join critics as c
on s.critic_id = c.id;"
df_norm <- dbFetch(dbSendQuery(db,q_norm))Let’s compare results .. they should be identical.
str(df_simple)## 'data.frame': 42 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ film_title : chr "Black Widow" "Cruella" "F9: The Fast Saga" "Godzilla vs. Kong" ...
## $ critic_name: chr "Ariel" "Ariel" "Ariel" "Ariel" ...
## $ score : int NA 3 1 NA 1 NA NA 5 3 4 ...
str(df_norm)## 'data.frame': 42 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ film_title : chr "Black Widow" "Cruella" "F9: The Fast Saga" "Godzilla vs. Kong" ...
## $ critic_name: chr "Ariel" "Ariel" "Ariel" "Ariel" ...
## $ score : int NA 3 1 NA 1 NA NA 5 3 4 ...
Transformations
Looking good so far, but apparently some of our critics haven’t seen every single one of these Hollywood screen gems! To clean up let’s drop all the rows with a score of ‘NA’.
df <- df_norm
df <- df[!is.na(df$score),]
str(df)## 'data.frame': 30 obs. of 4 variables:
## $ id : int 2 3 5 8 9 10 11 12 14 15 ...
## $ film_title : chr "Cruella" "F9: The Fast Saga" "Jungle Cruise" "Cruella" ...
## $ critic_name: chr "Ariel" "Ariel" "Ariel" "Jeff" ...
## $ score : int 3 1 1 5 3 4 2 3 4 5 ...
Now let’s create a couple of datasets to help us perform some analysis. One based on films, and one based on our critics:
df_films <- df %>%
group_by(film_title) %>%
summarize(views = n(),
avg_score = mean(score),
hi_score = max(score),
lo_score = min(score),
sd = sd(score)) %>%
arrange(desc(avg_score))
df_films## # A tibble: 6 × 6
## film_title views avg_score hi_score lo_score sd
## <chr> <int> <dbl> <int> <int> <dbl>
## 1 Cruella 6 4 5 3 0.632
## 2 Godzilla vs. Kong 4 3.75 5 3 0.957
## 3 The Green Knight 6 3.67 5 3 1.03
## 4 Black Widow 4 2.75 4 2 0.957
## 5 F9: The Fast Saga 5 2.4 5 1 1.67
## 6 Jungle Cruise 5 2 3 1 0.707
df_critics <- df %>%
group_by(critic_name) %>%
summarize(views = n(),
avg_score = mean(score),
hi_score = max(score),
lo_score = min(score),
sd = sd(score)) %>%
arrange(desc(avg_score))
df_critics## # A tibble: 7 × 6
## critic_name views avg_score hi_score lo_score sd
## <chr> <int> <dbl> <int> <int> <dbl>
## 1 Sam 3 4.33 5 3 1.15
## 2 Josiah 4 3.5 5 2 1.29
## 3 Jeff 5 3.4 5 2 1.14
## 4 Rachel 6 3.17 5 1 1.47
## 5 Naomi 4 3 4 2 0.816
## 6 Rebecca 5 2.8 4 2 0.837
## 7 Ariel 3 1.67 3 1 1.15
"“The Rock” Is Not Going to be Happy About This
I personally enjoy the self-aware, ironic screen stylings of Dwayne “The Rock” Johnson, but sadly, the numbers tell another story. Our critics overwhelmingly rejected Jungle Cruise while ranking Cruella at the top. Looks like Disney wins this round either way.
df_films %>%
ggplot(aes(x=reorder(film_title, -avg_score), y=avg_score, label=round(avg_score,1))) +
geom_bar(stat='identity', fill='blue', alpha=.6) +
geom_text(vjust=2, color='white') +
ggtitle("Avg Score by Film") +
xlab('Film Title') +
ylab('Avg Score') +
theme(axis.text.x = element_text(angle=45, hjust=1))Divided Opinions
Interestingly, while the critics were near-unanimous in thier praise of Cruella, they were most divided in their opinion of the latest installation of the Fast and Furious franchise, as evidenced in the Standard Deviation (SD) amongst scores:
df_films %>%
ggplot(aes(x=reorder(film_title, -sd), y=sd, label=round(sd,2))) +
geom_bar(stat='identity', fill='red', alpha=.6) +
geom_text(vjust=2, color='white') +
ggtitle("Divided Opinions") +
xlab('Film Title') +
ylab('SD') +
theme(axis.text.x = element_text(angle=45, hjust=1))Tough Crowd
Finally, who was the toughest critic in the theatre? While Sam and Josiah were handing out high scores all around, Rebecca and Ariel proved to be the most discriminating viewers.
df_critics %>%
ggplot(aes(x=reorder(critic_name, -avg_score), y=avg_score, label=round(avg_score,1))) +
geom_bar(stat='identity', fill='purple', alpha=.6) +
geom_text(vjust=2, color='white') +
ggtitle("Toughest Critics") +
xlab('Critic Name') +
ylab('Avg Score') +
theme(axis.text.x = element_text(angle=45, hjust=1))