1 Project overview
2 Research questions
3 Required packages
4 Loading the TidyTuesday data
5 Data structure
6 Converting to data.table
7 Extra work 1: Data quality checks
8 Merging datasets
9 Extra work 2: Feature engineering
10 Extra work 3: A clean analysis dataset
11 Required item: Filtering rows with data.table
12 Required item: Aggregating data with data.table
- 12.1 Aggregation by film rating
- 12.2 Aggregation by release period
13 Extra work 4: Ranking films across multiple rating systems
14 Extra work 5: Correlation table
15 Visualization theme and palette
16 Plot 1: Pixar film releases over time
17 Plot 2: Runtime trend across release order
18 Plot 3: Runtime distribution by film rating
19 Plot 4: Rotten Tomatoes vs Metacritic
20 Plot 5: Combined score over release order
21 Plot 6: Top 10 films by combined score
22 Plot 7: Rating system heatmap for top films
23 Plot 8: Score gap between Rotten Tomatoes and Metacritic
24 Plot 9: Average score by runtime group and release period
25 Plot 10: Missing data visualization
26 Extra work 6: Simple regression model
- 26.0.1 Interpretation of the regression model
27 Extra work 7: Film-level interpretation table
28 Main findings
29 Conclusion
30 Requirement checklist

1 Project overview

This project uses the Pixar Films dataset from the TidyTuesday project for the week of 11 March 2025. The analysis combines two publicly available datasets:

pixar_films.csv: general information about Pixar films, including release order, title, release date, runtime, and film rating.
public_response.csv: public and critical response measures, including Rotten Tomatoes, Metacritic, CinemaScore, and Critics Choice scores.

The main goal is to understand how Pixar films differ across time, rating categories, runtime, and public/critical reception. The project does not try to prove causality. Instead, it uses descriptive analysis, data transformation, and visualizations to identify patterns.

2 Research questions

The analysis is guided by the following questions:

How has Pixar’s film output changed over time?
Are longer Pixar films rated better, worse, or similarly to shorter films?
Do audience-oriented and critic-oriented rating measures tell the same story?
Which films perform strongest across multiple rating systems?
Are some film rating categories associated with different runtime or score patterns?
Which films have the largest gap between Rotten Tomatoes and Metacritic scores?

3 Required packages

library(data.table)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(RColorBrewer)
library(scales)
library(knitr)

4 Loading the TidyTuesday data

The data is read directly from the official TidyTuesday GitHub repository.

pixar_films <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/pixar_films.csv"
)

public_response <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/public_response.csv"
)

head(pixar_films)

head(public_response)

5 Data structure

str(pixar_films)

## spc_tbl_ [27 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ number      : num [1:27] 1 2 3 4 5 6 7 8 9 10 ...
##  $ film        : chr [1:27] "Toy Story" "A Bug's Life" "Toy Story 2" "Monsters, Inc." ...
##  $ release_date: Date[1:27], format: "1995-11-22" "1998-11-25" ...
##  $ run_time    : num [1:27] 81 95 92 92 100 115 117 111 98 96 ...
##  $ film_rating : chr [1:27] "G" "G" "G" "G" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   number = col_double(),
##   ..   film = col_character(),
##   ..   release_date = col_date(format = ""),
##   ..   run_time = col_double(),
##   ..   film_rating = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(public_response)

## spc_tbl_ [24 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ film           : chr [1:24] "Toy Story" "A Bug's Life" "Toy Story 2" "Monsters, Inc." ...
##  $ rotten_tomatoes: num [1:24] 100 92 100 96 99 97 74 96 95 98 ...
##  $ metacritic     : num [1:24] 95 77 88 79 90 90 73 96 95 88 ...
##  $ cinema_score   : chr [1:24] "A" "A" "A+" "A+" ...
##  $ critics_choice : num [1:24] NA NA 100 92 97 88 89 91 90 95 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   film = col_character(),
##   ..   rotten_tomatoes = col_double(),
##   ..   metacritic = col_double(),
##   ..   cinema_score = col_character(),
##   ..   critics_choice = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The first dataset contains the film-level information, while the second dataset contains public and critic response variables. The shared variable is film, which makes it possible to merge the two datasets.

6 Converting to data.table

pixar_dt <- as.data.table(pixar_films)
response_dt <- as.data.table(public_response)

class(pixar_dt)

## [1] "data.table" "data.frame"

class(response_dt)

## [1] "data.table" "data.frame"

7 Extra work 1: Data quality checks

Before analysis, I check the number of rows, columns, duplicate films, and missing values.

data_quality <- data.table(
  dataset = c("pixar_films", "public_response"),
  rows = c(nrow(pixar_dt), nrow(response_dt)),
  columns = c(ncol(pixar_dt), ncol(response_dt)),
  duplicate_films = c(
    pixar_dt[, sum(duplicated(film))],
    response_dt[, sum(duplicated(film))]
  )
)

data_quality

missing_pixar <- pixar_dt[, lapply(.SD, function(x) sum(is.na(x)))]
missing_response <- response_dt[, lapply(.SD, function(x) sum(is.na(x)))]

missing_pixar

missing_response

This step is useful because missing values and duplicate rows can affect summaries and plots. There are no duplicate film titles in the main film-level files, so the merge can be done safely at the film level.
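
Duplicate and missing-value counts do not show whether every film title in one file has a counterpart in the other. A minimal sketch of that extra check follows, using small made-up tables. The rows and the names `films_demo`, `response_demo`, `unmatched_films`, and `unmatched_responses` are illustrative assumptions, not the TidyTuesday data.

```r
library(data.table)

# Toy stand-ins for the two files; titles and values are invented for illustration
films_demo <- data.table(
  film = c("Film A", "Film B", "Film C"),
  run_time = c(90, 100, 110)
)
response_demo <- data.table(
  film = c("Film A", "Film B"),
  rotten_tomatoes = c(95, 80)
)

# Titles present in the film table but absent from the response table
unmatched_films <- setdiff(films_demo$film, response_demo$film)

# Titles present in the response table but absent from the film table
unmatched_responses <- setdiff(response_demo$film, films_demo$film)

unmatched_films
unmatched_responses
```

Applied to the real tables, the same two calls on `pixar_dt$film` and `response_dt$film` would list the films that end up with NA response values after a left join.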

8 Merging datasets

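This is one of the extra requirements for an A grade. I merge the film metadata with the public response data using the shared film column. Because all.x = TRUE makes this a left join, all 27 films are kept, and any film without a matching row in public_response gets NA in the response columns.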

pixar_merged <- merge(
  pixar_dt,
  response_dt,
  by = "film",
  all.x = TRUE
)

head(pixar_merged)

dim(pixar_merged)

## [1] 27  9

9 Extra work 2: Feature engineering

I add several new variables to make the analysis more meaningful:

release_year: year of release.
decade: release decade.
runtime_group: shorter, medium, or longer films.
rating_group: simplified film rating category.
cinema_score_numeric: numeric version of CinemaScore.
average_score: combined score using Rotten Tomatoes, Metacritic, and Critics Choice.
rt_metacritic_gap: difference between Rotten Tomatoes and Metacritic (Rotten Tomatoes minus Metacritic).
release_period: before 2010, 2010-2019, or 2020 onwards.

pixar_merged[, release_year := as.integer(format(release_date, "%Y"))]

pixar_merged[, decade := paste0(floor(release_year / 10) * 10, "s")]

pixar_merged[, runtime_group := fifelse(
  run_time < 90, "Shorter than 90 min",
  fifelse(run_time <= 105, "90-105 min", "Longer than 105 min")
)]

pixar_merged[, rating_group := fifelse(
  film_rating %in% c("G"), "G",
  fifelse(film_rating %in% c("PG"), "PG", "Other/Unknown")
)]

cinema_lookup <- data.table(
  cinema_score = c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"),
  cinema_score_numeric = c(100, 95, 90, 87, 83, 80, 77, 73, 70, 60, 40)
)

pixar_merged <- merge(
  pixar_merged,
  cinema_lookup,
  by = "cinema_score",
  all.x = TRUE
)

# Row-wise mean of whichever of the three scores are available for each film
pixar_merged[
  ,
  average_score := rowMeans(.SD, na.rm = TRUE),
  .SDcols = c("rotten_tomatoes", "metacritic", "critics_choice")
]

pixar_merged[, rt_metacritic_gap := rotten_tomatoes - metacritic]

pixar_merged[, release_period := fifelse(
  release_year < 2010, "Before 2010",
  fifelse(release_year < 2020, "2010-2019", "2020 onwards")
)]

head(pixar_merged)
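
Because average_score is computed with na.rm = TRUE, it averages whichever of the three scores are available for a film. The sketch below uses invented values (the matrix `scores_demo` and the helper vectors are hypothetical) to show what that implies for partially and fully missing rows.

```r
# Invented scores: row 1 complete, row 2 missing one value, row 3 missing all
scores_demo <- cbind(
  rotten_tomatoes = c(90, 80, NA),
  metacritic      = c(80, 70, NA),
  critics_choice  = c(70, NA, NA)
)

# Same call pattern as in the feature-engineering step
average_demo <- rowMeans(scores_demo, na.rm = TRUE)

# Number of scores that actually contributed to each average
scores_used <- rowSums(!is.na(scores_demo))

data.frame(average_demo, scores_used)
```

The second row gets an average based on only two systems, and the third yields NaN. This matters if average_score is read from pixar_merged directly rather than from a subset restricted to complete rows.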

10 Extra work 3: A clean analysis dataset

I keep only complete rows for the variables used most often in the analysis. I also arrange the films by release order.

analysis_dt <- pixar_merged[
  !is.na(rotten_tomatoes) &
    !is.na(metacritic) &
    !is.na(critics_choice) &
    !is.na(run_time)
][order(number)]

analysis_dt[, .(
  film,
  release_year,
  run_time,
  film_rating,
  rotten_tomatoes,
  metacritic,
  critics_choice,
  cinema_score,
  average_score
)]

11 Required item: Filtering rows with data.table

The following table filters for Pixar films released from 2010 onwards with Rotten Tomatoes scores of 90 or higher.

high_rt_recent <- analysis_dt[
  release_year >= 2010 & rotten_tomatoes >= 90,
  .(film, release_year, film_rating, run_time, rotten_tomatoes, metacritic, critics_choice, average_score)
][order(-rotten_tomatoes)]

high_rt_recent

This filter shows which newer Pixar films had especially strong Rotten Tomatoes results. Filtering is useful because it lets us focus on a specific part of the dataset instead of looking at all films at once.

12 Required item: Aggregating data with data.table

12.1 Aggregation by film rating

rating_summary <- analysis_dt[
  ,
  .(
    films = .N,
    average_runtime = round(mean(run_time, na.rm = TRUE), 1),
    average_rotten_tomatoes = round(mean(rotten_tomatoes, na.rm = TRUE), 1),
    average_metacritic = round(mean(metacritic, na.rm = TRUE), 1),
    average_critics_choice = round(mean(critics_choice, na.rm = TRUE), 1),
    average_combined_score = round(mean(average_score, na.rm = TRUE), 1)
  ),
  by = film_rating
][order(-average_combined_score)]

rating_summary

12.2 Aggregation by release period

period_summary <- analysis_dt[
  ,
  .(
    films = .N,
    average_runtime = round(mean(run_time, na.rm = TRUE), 1),
    median_runtime = median(run_time, na.rm = TRUE),
    average_score = round(mean(average_score, na.rm = TRUE), 1),
    highest_score = round(max(average_score, na.rm = TRUE), 1),
    lowest_score = round(min(average_score, na.rm = TRUE), 1)
  ),
  by = release_period
][order(release_period)]

period_summary

These aggregated tables provide a compact summary of the dataset. They make it easier to compare groups instead of interpreting every film separately.

13 Extra work 4: Ranking films across multiple rating systems

A single rating system may not tell the full story. For this reason, I rank films using a combined score based on Rotten Tomatoes, Metacritic, and Critics Choice.

top_films <- analysis_dt[
  order(-average_score),
  .(
    rank = seq_len(.N),
    film,
    release_year,
    film_rating,
    run_time,
    rotten_tomatoes,
    metacritic,
    critics_choice,
    average_score = round(average_score, 1)
  )
][1:10]

top_films
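
seq_len(.N) assigns consecutive ranks even when two films have the same combined score, so the order among tied films is arbitrary. One possible alternative, sketched here and not part of the original analysis, is data.table's frank() with ties.method = "min". The four rows reuse scores from the final film-level table in section 27; `rank_demo` and `rank_with_ties` are names introduced only for this illustration.

```r
library(data.table)

# Four films and their scores as listed in the final film-level table
rank_demo <- data.table(
  film = c("Up", "The Incredibles", "Toy Story 4", "Soul"),
  rotten_tomatoes = c(98, 97, 97, 96),
  metacritic = c(88, 90, 84, 83),
  critics_choice = c(95, 88, 94, 93)
)

# Combined score: plain mean of the three systems
rank_demo[, average_score := (rotten_tomatoes + metacritic + critics_choice) / 3]

# ties.method = "min" gives tied films the same (lowest) rank
rank_demo[, rank_with_ties := frank(-average_score, ties.method = "min")]

rank_demo[order(rank_with_ties)]
```

The Incredibles and Toy Story 4 both sum to 275 points across the three systems, so they share a rank here, whereas seq_len(.N) would place one above the other.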

14 Extra work 5: Correlation table

This table checks whether the numerical rating variables and runtime tend to move together.

score_vars <- analysis_dt[, .(
  rotten_tomatoes,
  metacritic,
  critics_choice,
  run_time,
  cinema_score_numeric
)]

correlation_table <- round(cor(score_vars, use = "pairwise.complete.obs"), 2)

correlation_table

##                      rotten_tomatoes metacritic critics_choice run_time
## rotten_tomatoes                 1.00       0.80           0.85    -0.15
## metacritic                      0.80       1.00           0.86     0.00
## critics_choice                  0.85       0.86           1.00    -0.13
## run_time                       -0.15       0.00          -0.13     1.00
## cinema_score_numeric            0.62       0.53           0.58     0.00
##                      cinema_score_numeric
## rotten_tomatoes                      0.62
## metacritic                           0.53
## critics_choice                       0.58
## run_time                             0.00
## cinema_score_numeric                 1.00

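The correlation table helps compare the rating measures. The three critic-based scores are strongly correlated with each other (0.80 to 0.86), while the numeric CinemaScore is only moderately correlated with them (0.53 to 0.62), which suggests that it captures a somewhat different aspect of film reception. Runtime is essentially uncorrelated with every score (-0.15 to 0.00).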
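
The use = "pairwise.complete.obs" option means each correlation is computed from the rows where both variables are present, so different cells of the table can rest on different sets of films. The sketch below contrasts it with listwise deletion on invented values; `x`, `y`, `z`, and `demo` are hypothetical and unrelated to the Pixar data.

```r
# Invented values; NA marks an observation without that measurement
x <- c(90, 80, 70, NA, 60)
y <- c(85, NA, 65, 75, 50)
z <- c(1, 2, 3, 4, 5)

demo <- cbind(x, y, z)

# Pairwise deletion: each correlation uses the rows where both columns are present
pairwise <- cor(demo, use = "pairwise.complete.obs")

# Listwise deletion: every correlation uses only rows with no NA at all
listwise <- cor(demo, use = "complete.obs")

# x and z share four usable rows pairwise, but only three rows are fully complete
round(c(pairwise = pairwise["x", "z"], listwise = listwise["x", "z"]), 4)
```

In the analysis itself the three critic scores and runtime are complete in analysis_dt, so the distinction mainly affects cinema_score_numeric, which is missing for at least one film.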

15 Visualization theme and palette

For the visualizations, I use a consistent minimal theme and ColorBrewer palettes.

main_palette <- brewer.pal(8, "Set2")
sequential_palette <- brewer.pal(9, "YlGnBu")

theme_project <- theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(size = 11),
    axis.title = element_text(face = "bold"),
    legend.position = "bottom"
  )

16 Plot 1: Pixar film releases over time

ggplot(analysis_dt, aes(x = release_year)) +
  geom_histogram(binwidth = 5, fill = main_palette[1], color = "white") +
  scale_x_continuous(breaks = pretty_breaks()) +
  labs(
    title = "Pixar film releases by year",
    subtitle = "Films are grouped into five-year intervals",
    x = "Release year",
    y = "Number of films"
  ) +
  theme_project

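The plot shows how Pixar releases are distributed across time. Because it is built from analysis_dt, it counts only the 21 films with complete score data rather than all 27 films in pixar_films. This provides background for the rest of the analysis, because later patterns may partly reflect changes in the number of films released during different periods.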

17 Plot 2: Runtime trend across release order

ggplot(analysis_dt, aes(x = number, y = run_time)) +
  geom_line(color = "grey60", linewidth = 0.8) +
  geom_point(aes(color = film_rating), size = 3) +
  geom_smooth(method = "loess", se = FALSE, color = "black", linewidth = 0.9) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Runtime trend across Pixar release order",
    subtitle = "The black smooth line shows the general runtime trend",
    x = "Release order",
    y = "Runtime in minutes",
    color = "Film rating"
  ) +
  theme_project

This plot uses multiple geom layers: line, points, and a smoothed trend line. It helps show whether Pixar films became longer or shorter across time.

18 Plot 3: Runtime distribution by film rating

ggplot(analysis_dt, aes(x = film_rating, y = run_time, fill = film_rating)) +
  geom_boxplot(alpha = 0.75, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.65, size = 2) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Runtime distribution by film rating",
    subtitle = "Each dot represents one film",
    x = "Film rating",
    y = "Runtime in minutes",
    fill = "Film rating"
  ) +
  theme_project

The boxplot compares runtimes across rating groups. The added jittered points show individual films, making the distribution more transparent.

19 Plot 4: Rotten Tomatoes vs Metacritic

ggplot(analysis_dt, aes(x = metacritic, y = rotten_tomatoes)) +
  geom_point(aes(size = run_time, color = film_rating), alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.8) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title = "Relationship between Metacritic and Rotten Tomatoes scores",
    subtitle = "Point size represents runtime",
    x = "Metacritic score",
    y = "Rotten Tomatoes score",
    color = "Film rating",
    size = "Runtime"
  ) +
  theme_project

This plot compares two rating systems. If the points follow the fitted line closely, it suggests the two measures tell a similar story. If some points are far from the line, those films may be rated differently by the two systems.
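
"Far from the line" can be made concrete with the residuals of the same linear fit. The sketch below is only an illustration on six films whose scores are copied from the final film-level table, so its fitted line differs from the one in the plot; `fit_demo` and `line_fit` are names introduced here.

```r
# Six films and their scores taken from the final film-level table
fit_demo <- data.frame(
  film = c("Toy Story 2", "Ratatouille", "Coco", "Onward", "Cars 3", "Cars 2"),
  rotten_tomatoes = c(100, 96, 97, 88, 69, 40),
  metacritic = c(88, 96, 81, 61, 59, 57)
)

# Same model form as geom_smooth(method = "lm") in the plot
line_fit <- lm(rotten_tomatoes ~ metacritic, data = fit_demo)

# Vertical distance of each film from the fitted line
fit_demo$residual <- round(residuals(line_fit), 1)

# Films ordered from furthest to closest to the line
fit_demo[order(-abs(fit_demo$residual)), ]
```

Running the same steps on analysis_dt would give the residuals that correspond to the plotted line.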

20 Plot 5: Combined score over release order

ggplot(analysis_dt, aes(x = number, y = average_score)) +
  geom_line(color = "grey50", linewidth = 0.8) +
  geom_point(aes(color = release_period), size = 3) +
  geom_hline(yintercept = mean(analysis_dt$average_score, na.rm = TRUE), linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Combined rating score across Pixar films",
    subtitle = "Dashed line shows the overall average combined score",
    x = "Release order",
    y = "Average score across three rating systems",
    color = "Release period"
  ) +
  theme_project

The combined score reduces dependence on one rating system. This makes it easier to identify films that perform strongly across different forms of evaluation.

21 Plot 6: Top 10 films by combined score

top_10_plot <- analysis_dt[
  order(-average_score)
][1:10]

ggplot(top_10_plot, aes(x = fct_reorder(film, average_score), y = average_score, fill = film_rating)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Top 10 Pixar films by combined score",
    subtitle = "Combined score averages Rotten Tomatoes, Metacritic, and Critics Choice",
    x = "Film",
    y = "Combined score",
    fill = "Film rating"
  ) +
  theme_project

This plot provides a clear ranking of the strongest films based on multiple rating measures. It is more balanced than using only one score.

22 Plot 7: Rating system heatmap for top films

top_heatmap <- analysis_dt[
  order(-average_score)
][1:12, .(
  film,
  Rotten_Tomatoes = rotten_tomatoes,
  Metacritic = metacritic,
  Critics_Choice = critics_choice
)]

top_heatmap_long <- melt(
  top_heatmap,
  id.vars = "film",
  variable.name = "rating_system",
  value.name = "score"
)

ggplot(top_heatmap_long, aes(x = rating_system, y = fct_reorder(film, score), fill = score)) +
  geom_tile(color = "white") +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  labs(
    title = "Heatmap of rating scores for top Pixar films",
    subtitle = "Darker cells indicate higher scores",
    x = "Rating system",
    y = "Film",
    fill = "Score"
  ) +
  theme_project

The heatmap makes it easy to compare the same films across several rating systems. This is useful because a film may perform very well in one system but less strongly in another.
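
The heatmap depends on reshaping the scores from one column per rating system to one row per film and rating system. A small sketch of that step with two films from the final film-level table; `wide_demo` and `long_demo` are names used only here.

```r
library(data.table)

# Two films with scores taken from the final film-level table
wide_demo <- data.table(
  film = c("Toy Story 2", "Cars 2"),
  Rotten_Tomatoes = c(100, 40),
  Metacritic = c(88, 57),
  Critics_Choice = c(100, 67)
)

# One row per film and rating system, so each row can become one tile
long_demo <- melt(
  wide_demo,
  id.vars = "film",
  variable.name = "rating_system",
  value.name = "score"
)

long_demo
```

Two films and three rating systems give six rows, one for each cell that geom_tile() draws.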

23 Plot 8: Score gap between Rotten Tomatoes and Metacritic

gap_plot <- analysis_dt[
  order(abs(rt_metacritic_gap), decreasing = TRUE)
][1:12]

ggplot(gap_plot, aes(x = fct_reorder(film, rt_metacritic_gap), y = rt_metacritic_gap, fill = rt_metacritic_gap > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Paired", labels = c("Metacritic higher", "Rotten Tomatoes higher")) +
  labs(
    title = "Largest gaps between Rotten Tomatoes and Metacritic",
    subtitle = "Positive values mean Rotten Tomatoes is higher than Metacritic",
    x = "Film",
    y = "Rotten Tomatoes minus Metacritic",
    fill = "Direction of gap"
  ) +
  theme_project

This plot identifies films where the two rating systems disagree most. A large positive gap means Rotten Tomatoes is higher than Metacritic, while a negative gap means Metacritic is higher.
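
The gap itself is a simple subtraction. The sketch below works it through for three films from the final film-level table and adds a three-way label, because the logical test rt_metacritic_gap > 0 would group an exact tie with "Metacritic higher". The `direction` column and the name `gap_demo` are illustrative additions, not part of the original code.

```r
# Scores for three films, taken from the final film-level table
gap_demo <- data.frame(
  film = c("Onward", "Ratatouille", "Cars 2"),
  rotten_tomatoes = c(88, 96, 40),
  metacritic = c(61, 96, 57)
)

# Signed gap: positive when Rotten Tomatoes is higher
gap_demo$rt_metacritic_gap <- gap_demo$rotten_tomatoes - gap_demo$metacritic

# Three-way label that keeps exact ties separate from "Metacritic higher"
gap_demo$direction <- ifelse(
  gap_demo$rt_metacritic_gap > 0, "Rotten Tomatoes higher",
  ifelse(gap_demo$rt_metacritic_gap < 0, "Metacritic higher", "No gap")
)

gap_demo
```

The gaps come out as 27, 0, and -17. In the plot the tie case does not arise, since the twelve largest absolute gaps in the final table are all non-zero.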

24 Plot 9: Average score by runtime group and release period

runtime_period_summary <- analysis_dt[
  ,
  .(
    films = .N,
    average_score = mean(average_score, na.rm = TRUE)
  ),
  by = .(runtime_group, release_period)
]

ggplot(runtime_period_summary, aes(x = runtime_group, y = average_score, fill = release_period)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Average combined score by runtime group and release period",
    subtitle = "Comparing runtime categories across different periods",
    x = "Runtime group",
    y = "Average combined score",
    fill = "Release period"
  ) +
  theme_project

This plot adds a more detailed comparison by combining two grouping variables. It checks whether the relationship between runtime and scores looks different across release periods.

25 Plot 10: Missing data visualization

missing_long <- pixar_merged[, lapply(.SD, function(x) sum(is.na(x)))]
missing_long <- melt(
  missing_long,
  measure.vars = names(missing_long),
  variable.name = "variable",
  value.name = "missing_values"
)

ggplot(missing_long, aes(x = fct_reorder(variable, missing_values), y = missing_values)) +
  geom_col(fill = main_palette[3]) +
  coord_flip() +
  labs(
    title = "Missing values by variable",
    subtitle = "Checking data completeness after merging",
    x = "Variable",
    y = "Number of missing values"
  ) +
  theme_project

This plot is an extra diagnostic step. It shows whether some variables are less complete than others, which is important before interpreting results.
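
Raw counts can be easier to judge next to the share of rows they represent. A minimal sketch on an invented table (`missing_demo` and its values are hypothetical):

```r
# Invented table with some missing entries
missing_demo <- data.frame(
  film = c("Film A", "Film B", "Film C", "Film D"),
  metacritic = c(90, NA, 70, 60),
  critics_choice = c(NA, NA, 80, 75)
)

# Count and percentage of missing values per column
missing_count <- colSums(is.na(missing_demo))
missing_share <- round(100 * colMeans(is.na(missing_demo)), 1)

data.frame(
  variable = names(missing_count),
  missing_count,
  missing_share,
  row.names = NULL
)
```

The same two lines applied to pixar_merged would give the percentage of the 27 films that lack each variable.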

26 Extra work 6: Simple regression model

As an additional step, I estimate a simple linear model predicting the combined score from runtime, release year, and film rating. This is not meant to prove causality. It is only used as an exploratory model.

model_dt <- analysis_dt[
  !is.na(average_score) &
    !is.na(run_time) &
    !is.na(release_year) &
    !is.na(film_rating)
]

score_model <- lm(
  average_score ~ run_time + release_year + film_rating,
  data = model_dt
)

summary(score_model)

## 
## Call:
## lm(formula = average_score ~ run_time + release_year + film_rating, 
##     data = model_dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.432  -6.031   3.040   8.163  15.351 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1677.93483  918.38691   1.827   0.0853 .
## run_time        -0.08804    0.32345  -0.272   0.7888  
## release_year    -0.78891    0.45858  -1.720   0.1035  
## film_ratingPG    6.06937    5.72659   1.060   0.3040  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.41 on 17 degrees of freedom
## Multiple R-squared:  0.1598, Adjusted R-squared:  0.01153 
## F-statistic: 1.078 on 3 and 17 DF,  p-value: 0.3849

26.0.1 Interpretation of the regression model

The regression model examines whether a film’s run_time, release_year, and film_rating help explain differences in the average_score.

Overall, the model has limited explanatory power. The Multiple R-squared value is 0.1598, meaning that the variables included in the model explain only about 16% of the variation in the average score. The adjusted R-squared is much lower, at 0.0115, which suggests that after adjusting for the number of predictors, the model explains almost none of the variation in the outcome. This means that other factors not included in the model are likely much more important in explaining why some Pixar films receive higher or lower average scores.

The overall model is also not statistically significant. The F-statistic has a p-value of 0.3849, which is above the usual 0.05 threshold. This means that, taken together, run_time, release_year, and film_rating do not provide strong statistical evidence for predicting the average score in this dataset.

Looking at the individual coefficients, none of the predictors is statistically significant at the 5% level. run_time has a very small negative coefficient (about -0.09 points per additional minute), suggesting that longer films are associated with slightly lower average scores, but this relationship is weak and not significant. release_year also has a negative coefficient (about -0.79 points per year), suggesting that more recent films may have slightly lower average scores in this sample, but again the result is not statistically significant. The film_ratingPG coefficient is positive (about 6.1 points), suggesting that PG-rated films may have higher average scores than G-rated films, the reference category, but this effect is also not statistically significant.

Therefore, the model should be interpreted as exploratory rather than conclusive. However, including the model is still useful because it adds an analytical layer beyond visualizations, and its weak fit is consistent with film reception being influenced by more complex factors such as story quality, cultural context, franchise popularity, nostalgia, marketing, and audience expectations.
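
The adjusted R-squared and the F-statistic in the summary output can be recomputed from the Multiple R-squared, the number of films, and the number of predictors. The sketch below does this with the rounded values printed above, so small rounding differences are expected. The sample size of 21 is inferred from the 17 residual degrees of freedom plus 4 estimated coefficients.

```r
# Values read from the summary() output above
r_squared <- 0.1598
n_obs <- 21        # 17 residual degrees of freedom + 4 estimated coefficients
n_predictors <- 3  # run_time, release_year, film_ratingPG

df_residual <- n_obs - n_predictors - 1

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# F-statistic for the null hypothesis that all slopes are zero
f_stat <- (r_squared / n_predictors) / ((1 - r_squared) / df_residual)

# Upper-tail probability of the F distribution with (3, 17) degrees of freedom
p_value <- pf(f_stat, n_predictors, df_residual, lower.tail = FALSE)

round(c(adj_r_squared = adj_r_squared, f_stat = f_stat, p_value = p_value), 4)
```

Up to rounding, the results should match the reported adjusted R-squared of 0.01153, F-statistic of 1.078, and p-value of 0.3849. With only 21 films, three predictors consume enough degrees of freedom to pull the adjusted value close to zero.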

27 Extra work 7: Film-level interpretation table

This table combines the main transformed variables and creates a clear final output.

final_table <- analysis_dt[
  order(-average_score),
  .(
    film,
    release_year,
    film_rating,
    run_time,
    runtime_group,
    rotten_tomatoes,
    metacritic,
    critics_choice,
    cinema_score,
    average_score = round(average_score, 1),
    rt_metacritic_gap
  )
]

kable(
  final_table,
  caption = "Final film-level analysis table ordered by combined score"
)

Final film-level analysis table ordered by combined score
film	release_year	film_rating	run_time	runtime_group	rotten_tomatoes	metacritic	critics_choice	cinema_score	average_score	rt_metacritic_gap
Toy Story 2	1999	G	92	90-105 min	100	88	100	A+	96.0	12
Toy Story 3	2010	G	103	90-105 min	98	92	97	A	95.7	6
Finding Nemo	2003	G	100	90-105 min	99	90	97	A+	95.3	9
Inside Out	2015	PG	95	90-105 min	98	94	93	A	95.0	4
Ratatouille	2007	G	111	Longer than 105 min	96	96	91	A	94.3	0
Up	2009	PG	96	90-105 min	98	88	95	A+	93.7	10
WALL-E	2008	G	98	90-105 min	95	95	90	A	93.3	0
The Incredibles	2004	PG	115	Longer than 105 min	97	90	88	A+	91.7	7
Toy Story 4	2019	G	100	90-105 min	97	84	94	A	91.7	13
Soul	2020	PG	100	90-105 min	96	83	93	NA	90.7	13
Monsters, Inc.	2001	G	92	90-105 min	96	79	92	A+	89.0	17
Coco	2017	PG	105	90-105 min	97	81	89	A+	89.0	16
Finding Dory	2016	PG	97	90-105 min	94	77	89	A	86.7	17
Incredibles 2	2018	PG	118	Longer than 105 min	93	80	86	A+	86.3	13
Cars	2006	G	117	Longer than 105 min	74	73	89	A	78.7	1
Brave	2012	PG	93	90-105 min	78	69	81	A	76.0	9
Onward	2020	PG	102	90-105 min	88	61	79	A-	76.0	27
Monsters University	2013	G	104	90-105 min	80	65	79	A	74.7	15
The Good Dinosaur	2015	PG	93	90-105 min	76	66	75	A	72.3	10
Cars 3	2017	G	102	90-105 min	69	59	66	A	64.7	10
Cars 2	2011	G	106	Longer than 105 min	40	57	67	A-	54.7	-17

28 Main findings

The analysis suggests several main findings:

Pixar films vary meaningfully in runtime, release period, and rating performance.
Rotten Tomatoes, Metacritic, and Critics Choice scores are strongly related (pairwise correlations of 0.80 to 0.86), but they do not always rank films in the same way.
The merged dataset allows a stronger analysis because film characteristics can be studied together with public and critic response.
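Runtime is at most weakly related to film reception: its correlations with the three critic scores range from -0.15 to 0.00, and both shorter and longer films appear among the top performers.
Looking at score gaps is useful because it identifies films where rating systems disagree. Among the 21 films with complete scores, Onward has the largest gap (27 points), and Cars 2 is the only film that Metacritic rates higher than Rotten Tomatoes.
The top-ranked films tend to score highly on all three rating systems, not only one.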

29 Conclusion

This project used a real TidyTuesday dataset to analyze Pixar films using data.table transformations, dataset merging, grouped aggregation, filtering, and ggplot2 visualizations. The analysis goes beyond the minimum requirements by adding data quality checks, feature engineering, a combined rating index, correlation analysis, score-gap analysis, a missing-data visualization, and a simple exploratory regression model. The regression model did not identify strong statistical predictors of average score, which suggests that Pixar film reception cannot be explained well by simple structural variables such as runtime, release year, or rating alone.

The strongest part of the analysis is the combination of film metadata and public response data. This makes it possible to compare Pixar films not only by when they were released or how long they are, but also by how they were received across different rating systems.

30 Requirement checklist

Requirement	Completed?	Where
Publicly accessible TidyTuesday dataset	Yes	Loading the TidyTuesday data
Filtering rows using data.table	Yes	Filtering section
Aggregating data using data.table	Yes	Aggregation section
At least 7 plots	Yes	10 plots included
At least 3 ggplot2 geoms	Yes	histogram, point, line, smooth, boxplot, jitter, col, tile, hline
Merge datasets	Yes	`pixar_films` + `public_response`
Apply a theme	Yes	`theme_project`
Axis and plot titles	Yes	All plots use `labs()`
ColorBrewer palette	Yes	`scale_*_brewer()` and `scale_fill_distiller()`
Multiple geom layers on same plot	Yes	Plots 2, 3, 4, and 5
Extra analysis beyond requirements	Yes	quality checks, engineered variables, ranking, correlation, model, missingness
Publish on RPubs/Medium	To do after knitting	Using RStudio Publish button

Pixar Films: Ratings, Runtime, and Audience Response

Yllke Berisha

30.05.2026