Project overview
This project uses the Pixar Films dataset from the
TidyTuesday project for the week of 11 March 2025. The analysis combines
two publicly available datasets:
- pixar_films.csv: general information about Pixar films, including
release order, title, release date, runtime, and film rating.
- public_response.csv: public and critical response measures,
including Rotten Tomatoes, Metacritic, CinemaScore, and Critics Choice
scores.
The main goal is to understand how Pixar films differ across time,
rating categories, runtime, and public/critical reception. The project
does not try to prove causality. Instead, it uses descriptive analysis,
data transformation, and visualizations to identify patterns.
Research questions
The analysis is guided by the following questions:
- How has Pixar’s film output changed over time?
- Are longer Pixar films rated better, worse, or similarly to shorter
films?
- Do audience-oriented and critic-oriented rating measures tell the
same story?
- Which films perform strongest across multiple rating systems?
- Are some film rating categories associated with different runtime or
score patterns?
- Which films have the largest gap between Rotten Tomatoes and
Metacritic scores?
Required packages
library(data.table)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(RColorBrewer)
library(scales)
library(knitr)
Loading the TidyTuesday data
The data is read directly from the official TidyTuesday GitHub
repository.
pixar_films <- read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/pixar_films.csv"
)
public_response <- read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/public_response.csv"
)
head(pixar_films)
head(public_response)
Data structure
str(pixar_films)
## spc_tbl_ [27 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ number : num [1:27] 1 2 3 4 5 6 7 8 9 10 ...
## $ film : chr [1:27] "Toy Story" "A Bug's Life" "Toy Story 2" "Monsters, Inc." ...
## $ release_date: Date[1:27], format: "1995-11-22" "1998-11-25" ...
## $ run_time : num [1:27] 81 95 92 92 100 115 117 111 98 96 ...
## $ film_rating : chr [1:27] "G" "G" "G" "G" ...
## - attr(*, "spec")=
## .. cols(
## .. number = col_double(),
## .. film = col_character(),
## .. release_date = col_date(format = ""),
## .. run_time = col_double(),
## .. film_rating = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(public_response)
## spc_tbl_ [24 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ film : chr [1:24] "Toy Story" "A Bug's Life" "Toy Story 2" "Monsters, Inc." ...
## $ rotten_tomatoes: num [1:24] 100 92 100 96 99 97 74 96 95 98 ...
## $ metacritic : num [1:24] 95 77 88 79 90 90 73 96 95 88 ...
## $ cinema_score : chr [1:24] "A" "A" "A+" "A+" ...
## $ critics_choice : num [1:24] NA NA 100 92 97 88 89 91 90 95 ...
## - attr(*, "spec")=
## .. cols(
## .. film = col_character(),
## .. rotten_tomatoes = col_double(),
## .. metacritic = col_double(),
## .. cinema_score = col_character(),
## .. critics_choice = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The first dataset contains the film-level information, while the
second dataset contains public and critic response variables. The shared
variable is film, which makes it possible to merge the two
datasets.
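Before merging on film, it can help to confirm that the key is unique in both tables and to see which films have no match. The following is a minimal sketch on toy stand-in tables; the names films, responses, and unmatched and all values are illustrative assumptions, not part of the original analysis.

```r
library(data.table)

# Toy stand-ins for the two tables (illustrative values, not the real data)
films <- data.table(
  film = c("Film A", "Film B", "Film C"),
  run_time = c(81, 95, 100)
)
responses <- data.table(
  film = c("Film A", "Film B"),
  rotten_tomatoes = c(100, 92)
)

# The join key should be unique in both tables,
# otherwise a merge would duplicate rows
stopifnot(!anyDuplicated(films$film), !anyDuplicated(responses$film))

# Anti-join: films that have no matching row in the response table
unmatched <- films[!responses, on = "film"]
unmatched$film
```

The same two checks could be run on pixar_dt and response_dt once they exist; any film listed by the anti-join would end up with NA scores after a left join.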
Converting to data.table
pixar_dt <- as.data.table(pixar_films)
response_dt <- as.data.table(public_response)
class(pixar_dt)
## [1] "data.table" "data.frame"
class(response_dt)
## [1] "data.table" "data.frame"
Merging datasets
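This is one of the extra requirements for an A grade. I merge the
film metadata with the public response data using the shared
film column. Because all.x = TRUE, this is a left join: all 27 films
are kept, and any film without a matching row in public_response
receives NA in the score columns.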
pixar_merged <- merge(
pixar_dt,
response_dt,
by = "film",
all.x = TRUE
)
head(pixar_merged)
dim(pixar_merged)
## [1] 27 9
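The later sections work with a table called analysis_dt that carries derived columns (release_year, release_period, runtime_group, average_score). How those columns were built is not shown here, so the following is only a sketch of one way such columns could be derived from a merged table. The toy data, the cut points, the labels, and the choice to average whichever scores are available are all assumptions.

```r
library(data.table)

# Toy merged table (illustrative values, not the real data)
merged <- data.table(
  film = c("Film A", "Film B", "Film C"),
  release_date = as.Date(c("1995-11-22", "2008-06-27", "2020-03-06")),
  run_time = c(81, 98, 102),
  rotten_tomatoes = c(100, 95, NA),
  metacritic = c(95, 95, NA),
  critics_choice = c(NA, 90, NA)
)

analysis_sketch <- copy(merged)  # copy() so := does not modify `merged`

# Release year extracted from the date column
analysis_sketch[, release_year := year(release_date)]

# Hypothetical period and runtime bins; the cut points are assumptions
analysis_sketch[, release_period := cut(
  release_year,
  breaks = c(-Inf, 1999, 2009, 2019, Inf),
  labels = c("1990s", "2000s", "2010s", "2020s")
)]
analysis_sketch[, runtime_group := cut(
  run_time,
  breaks = c(-Inf, 95, 105, Inf),
  labels = c("Short", "Medium", "Long")
)]

# Row-wise mean of whichever numeric scores are available (an assumption)
score_cols <- c("rotten_tomatoes", "metacritic", "critics_choice")
analysis_sketch[, average_score := rowMeans(.SD, na.rm = TRUE), .SDcols = score_cols]

# rowMeans() gives NaN when every score is missing; store that as NA
analysis_sketch[is.nan(average_score), average_score := NA_real_]
```

Averaging only the available scores keeps films with one missing measure in the ranking, but it also means their combined score rests on fewer rating systems than the others.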
Required item: Filtering rows with data.table
The following table filters for Pixar films released from 2010
onwards with Rotten Tomatoes scores of 90 or higher.
high_rt_recent <- analysis_dt[
release_year >= 2010 & rotten_tomatoes >= 90,
.(film, release_year, film_rating, run_time, rotten_tomatoes, metacritic, critics_choice, average_score)
][order(-rotten_tomatoes)]
high_rt_recent
This filter shows which newer Pixar films had especially strong
Rotten Tomatoes results. Filtering is useful because it lets us focus on
a specific part of the dataset instead of looking at all films at
once.
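One detail worth keeping in mind with this kind of filter is how missing scores behave: in data.table, a row whose condition evaluates to NA is treated as not matching and is silently dropped. A small sketch on toy data (the table toy and its values are illustrative assumptions):

```r
library(data.table)

# Toy data (illustrative values, not the real scores)
toy <- data.table(
  film = c("Film A", "Film B", "Film C", "Film D"),
  release_year = c(2005, 2012, 2017, 2022),
  rotten_tomatoes = c(97, 78, 97, NA)
)

# Both conditions must hold; Film D is dropped because NA >= 90 is NA
recent_high <- toy[release_year >= 2010 & rotten_tomatoes >= 90]

# Listing recent films without a score makes that exclusion visible
recent_unscored <- toy[release_year >= 2010 & is.na(rotten_tomatoes)]
```

Reporting the unscored films next to the filtered table avoids the impression that they failed the score threshold.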
Required item: Aggregating data with data.table
Aggregation by film rating
rating_summary <- analysis_dt[
,
.(
films = .N,
average_runtime = round(mean(run_time, na.rm = TRUE), 1),
average_rotten_tomatoes = round(mean(rotten_tomatoes, na.rm = TRUE), 1),
average_metacritic = round(mean(metacritic, na.rm = TRUE), 1),
average_critics_choice = round(mean(critics_choice, na.rm = TRUE), 1),
average_combined_score = round(mean(average_score, na.rm = TRUE), 1)
),
by = film_rating
][order(-average_combined_score)]
rating_summary
Aggregation by release period
period_summary <- analysis_dt[
,
.(
films = .N,
average_runtime = round(mean(run_time, na.rm = TRUE), 1),
median_runtime = median(run_time, na.rm = TRUE),
average_score = round(mean(average_score, na.rm = TRUE), 1),
highest_score = round(max(average_score, na.rm = TRUE), 1),
lowest_score = round(min(average_score, na.rm = TRUE), 1)
),
by = release_period
][order(release_period)]
period_summary
These aggregated tables provide a compact summary of the dataset.
They make it easier to compare groups instead of interpreting every film
separately.
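When the same summary function is applied to several columns, the grouped aggregation can be written more compactly with .SD and .SDcols. This is a sketch of that alternative on toy data, not the code used for the tables above; toy, num_cols, and summary_sketch are hypothetical names.

```r
library(data.table)

# Toy data (illustrative values, not the real data)
toy <- data.table(
  film_rating = c("G", "G", "PG", "PG", "PG"),
  run_time = c(80, 90, 100, 110, 120),
  rotten_tomatoes = c(100, 90, 80, NA, 90)
)

num_cols <- c("run_time", "rotten_tomatoes")

# Group size plus the rounded mean of every column listed in .SDcols
summary_sketch <- toy[
  ,
  c(
    list(films = .N),
    lapply(.SD, function(x) round(mean(x, na.rm = TRUE), 1))
  ),
  by = film_rating,
  .SDcols = num_cols
]
summary_sketch
```

Keeping the films count next to the means is a reminder that some groups are small, so their averages can shift with a single film.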
Visualization theme and palette
For the visualizations, I use a consistent minimal theme and
ColorBrewer palettes.
main_palette <- brewer.pal(8, "Set2")
sequential_palette <- brewer.pal(9, "YlGnBu")
theme_project <- theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(size = 11),
axis.title = element_text(face = "bold"),
legend.position = "bottom"
)
Plot 1: Pixar film releases over time
ggplot(analysis_dt, aes(x = release_year)) +
geom_histogram(binwidth = 5, fill = main_palette[1], color = "white") +
scale_x_continuous(breaks = pretty_breaks()) +
labs(
title = "Pixar film releases over time",
subtitle = "Films are grouped into five-year intervals",
x = "Release year",
y = "Number of films"
) +
theme_project

The plot shows how Pixar releases are distributed across time. This
provides background for the rest of the analysis, because later patterns
may partly reflect changes in the number of films released during
different periods.
Plot 2: Runtime trend across release order
ggplot(analysis_dt, aes(x = number, y = run_time)) +
geom_line(color = "grey60", linewidth = 0.8) +
geom_point(aes(color = film_rating), size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "black", linewidth = 0.9) +
scale_color_brewer(palette = "Set2") +
labs(
title = "Runtime trend across Pixar release order",
subtitle = "The black smooth line shows the general runtime trend",
x = "Release order",
y = "Runtime in minutes",
color = "Film rating"
) +
theme_project

This plot uses multiple geom layers: a line, points, and a smoothed
trend line. It helps show whether Pixar films became longer or shorter
over successive releases.
Plot 3: Runtime distribution by film rating
ggplot(analysis_dt, aes(x = film_rating, y = run_time, fill = film_rating)) +
geom_boxplot(alpha = 0.75, outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.65, size = 2) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Runtime distribution by film rating",
subtitle = "Each dot represents one film",
x = "Film rating",
y = "Runtime in minutes",
fill = "Film rating"
) +
theme_project

The boxplot compares runtimes across rating groups. The added
jittered points show individual films, making the distribution more
transparent.
Plot 5: Combined score over release order
ggplot(analysis_dt, aes(x = number, y = average_score)) +
geom_line(color = "grey50", linewidth = 0.8) +
geom_point(aes(color = release_period), size = 3) +
geom_hline(yintercept = mean(analysis_dt$average_score, na.rm = TRUE), linetype = "dashed") +
scale_color_brewer(palette = "Set1") +
labs(
title = "Combined rating score across Pixar films",
subtitle = "Dashed line shows the overall average combined score",
x = "Release order",
y = "Average score across three rating systems",
color = "Release period"
) +
theme_project

The combined score reduces dependence on one rating system. This
makes it easier to identify films that perform strongly across different
forms of evaluation.
Plot 6: Top 10 films by combined score
top_10_plot <- analysis_dt[
order(-average_score)
][1:10]
ggplot(top_10_plot, aes(x = fct_reorder(film, average_score), y = average_score, fill = film_rating)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Top 10 Pixar films by combined score",
subtitle = "Combined score averages Rotten Tomatoes, Metacritic, and Critics Choice",
x = "Film",
y = "Combined score",
fill = "Film rating"
) +
theme_project

This plot provides a clear ranking of the strongest films based on
multiple rating measures. It is more balanced than using only one
score.
Plot 7: Rating system heatmap for top films
top_heatmap <- analysis_dt[
order(-average_score)
][1:12, .(
film,
Rotten_Tomatoes = rotten_tomatoes,
Metacritic = metacritic,
Critics_Choice = critics_choice
)]
top_heatmap_long <- melt(
top_heatmap,
id.vars = "film",
variable.name = "rating_system",
value.name = "score"
)
ggplot(top_heatmap_long, aes(x = rating_system, y = fct_reorder(film, score), fill = score)) +
geom_tile(color = "white") +
scale_fill_distiller(palette = "YlGnBu", direction = 1) +
labs(
title = "Heatmap of rating scores for top Pixar films",
subtitle = "Darker cells indicate higher scores",
x = "Rating system",
y = "Film",
fill = "Score"
) +
theme_project

The heatmap makes it easy to compare the same films across several
rating systems. This is useful because a film may perform very well in
one system but less strongly in another.
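The heatmap depends on reshaping the scores from one column per rating system to one row per film and system. A minimal sketch of that reshape on toy data, followed by a per-film spread that puts a number on the disagreement between systems (wide, long, spread, and the values are illustrative assumptions):

```r
library(data.table)

# Toy wide table: one row per film, one column per rating system
wide <- data.table(
  film = c("Film A", "Film B"),
  rotten_tomatoes = c(100, 92),
  metacritic = c(95, 77)
)

# Long format: one row per film and rating system
long <- melt(
  wide,
  id.vars = "film",
  variable.name = "rating_system",
  value.name = "score"
)

# Spread between the highest and lowest score for each film
spread <- long[, .(score_range = max(score) - min(score)), by = film]
spread
```

A large score_range flags a film that one system rates clearly higher than another, which is the pattern the heatmap shows visually.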
Plot 9: Average score by runtime group and release period
runtime_period_summary <- analysis_dt[
,
.(
films = .N,
average_score = mean(average_score, na.rm = TRUE)
),
by = .(runtime_group, release_period)
]
ggplot(runtime_period_summary, aes(x = runtime_group, y = average_score, fill = release_period)) +
geom_col(position = "dodge") +
scale_fill_brewer(palette = "Set3") +
labs(
title = "Average combined score by runtime group and release period",
subtitle = "Comparing runtime categories across different periods",
x = "Runtime group",
y = "Average combined score",
fill = "Release period"
) +
theme_project

This plot adds a more detailed comparison by combining two grouping
variables. It checks whether the relationship between runtime and scores
looks different across release periods.
Plot 10: Missing data visualization
missing_long <- pixar_merged[, lapply(.SD, function(x) sum(is.na(x)))]
missing_long <- melt(
missing_long,
measure.vars = names(missing_long),
variable.name = "variable",
value.name = "missing_values"
)
ggplot(missing_long, aes(x = fct_reorder(variable, missing_values), y = missing_values)) +
geom_col(fill = main_palette[3]) +
coord_flip() +
labs(
title = "Missing values by variable",
subtitle = "Checking data completeness after merging",
x = "Variable",
y = "Number of missing values"
) +
theme_project

This plot is an extra diagnostic step. It shows whether some
variables are less complete than others, which is important before
interpreting results.
Main findings
The analysis suggests several main findings:
- Pixar films vary meaningfully in runtime, release period, and rating
performance.
- Rotten Tomatoes, Metacritic, and Critics Choice scores are related,
but they do not always rank films in the same way.
- The merged dataset allows a stronger analysis because film
characteristics can be studied together with public and critic
response.
- Runtime alone does not fully explain film reception. Both shorter
and longer films appear among the well-received titles.
- Looking at score gaps is useful because it identifies films where
rating systems disagree.
- A film's top ranking is more convincing when it scores well across
multiple rating systems, not only one.
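The score-gap idea from the findings can be sketched in a few lines: take the difference between Rotten Tomatoes and Metacritic for each film and sort by its absolute size. The toy table and the column name score_gap below are illustrative assumptions, not the original code.

```r
library(data.table)

# Toy data (illustrative values, not the real scores)
toy <- data.table(
  film = c("Film A", "Film B", "Film C"),
  rotten_tomatoes = c(100, 74, 92),
  metacritic = c(95, 73, 77)
)

# Positive gap: Rotten Tomatoes is higher; negative gap: Metacritic is higher
toy[, score_gap := rotten_tomatoes - metacritic]

# Films where the two systems disagree most come first
largest_gaps <- toy[order(-abs(score_gap))]
largest_gaps
```

Sorting by the absolute gap keeps disagreements in both directions at the top, while the sign of score_gap still shows which system was more favorable.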
Conclusion
This project used a real TidyTuesday dataset to analyze Pixar films
using data.table transformations, dataset merging, grouped
aggregation, filtering, and ggplot2 visualizations. The
analysis goes beyond the minimum requirements by adding data quality
checks, feature engineering, a combined rating index, correlation
analysis, score-gap analysis, a missing-data visualization, and a simple
exploratory regression model. The regression model did not identify
strong statistical predictors of average score, which suggests that
Pixar film reception cannot be explained well by simple structural
variables such as runtime, release year, or rating alone.
The strongest part of the analysis is the combination of film
metadata and public response data. This makes it possible to compare
Pixar films not only by when they were released or how long they are,
but also by how they were received across different rating systems.
Requirement checklist
| Requirement | Status | Where |
|---|---|---|
| Publicly accessible TidyTuesday dataset | Yes | Loading the TidyTuesday data |
| Filtering rows using data.table | Yes | Filtering section |
| Aggregating data using data.table | Yes | Aggregation section |
| At least 7 plots | Yes | 10 plots included |
| At least 3 ggplot2 geoms | Yes | histogram, point, line, smooth, boxplot, jitter, col, tile, hline |
| Merge datasets | Yes | pixar_films + public_response |
| Apply a theme | Yes | theme_project |
| Axis and plot titles | Yes | All plots use labs() |
| ColorBrewer palette | Yes | scale_*_brewer() and scale_fill_distiller() |
| Multiple geom layers on same plot | Yes | Plots 2, 3, 4, and 5 |
| Extra analysis beyond requirements | Yes | quality checks, engineered variables, ranking, correlation, model, missingness |
| Publish on RPubs/Medium | To do after knitting | Using RStudio Publish button |