What makes a great screenplay? That question drives this project.
We analyzed eleven critically acclaimed screenplays spanning 1952 to 2025 — Ikiru, Thelma & Louise, Boogie Nights, Eyes Wide Shut, Moonlight, Get Out, Parasite, Portrait of a Lady on Fire, Aftersun, Sentimental Value, and Sinners. Together they cover six countries, seven decades, and a wide range of genre, authorship, and cultural tradition. Nine are original screenplays. Two — Eyes Wide Shut and Moonlight — are adapted from existing source material.
The analysis is organized into five sections. Section 1 looks at the script on the page: how dense is the writing, how much is dialogue versus action, how fast does it read? Section 2 examines structure and pacing: where do key story beats fall, how long is each act, and which films follow conventional structure versus breaking from it? Section 3 looks at character concentration: how many voices does the script have, and how much of the story runs through the protagonist? Section 4 examines the physical world of each screenplay: how many locations, how interior or exterior, how concentrated is the dramatic space? Section 5 pulls it all together and asks what, if anything, the best scripts actually share — not as a formula, but as a set of instincts that show up again and again even in films that look nothing alike.
The underlying question throughout is one that anyone at A24 or Neon or any company betting on difficult, ambitious films has to answer: is there a pattern underneath the scripts that break the rules but still work?
act_plot_data <- full_data |>
select(title, act1_pct, act2_pct, act3_pct) |>
pivot_longer(cols = -title, names_to = "act", values_to = "pct") |>
mutate(act = factor(act,
levels = c("act1_pct", "act2_pct", "act3_pct"),
labels = c("Act 1", "Act 2", "Act 3")))
act1_order <- full_data |>
arrange(act1_pct) |>
pull(title)
act_breaks_overlay2 <- full_data |>
select(title, inciting_incident_pct, act1_climax_pct, act2_climax_pct) |>
pivot_longer(-title, names_to = "beat", values_to = "pct") |>
mutate(beat = factor(beat,
levels = c("inciting_incident_pct", "act1_climax_pct", "act2_climax_pct"),
labels = c("Inciting Incident", "Act I Break", "Act II Break")))
Ikiru is omitted from this chart due to an unresolved Act II break — the beat could not be coded with confidence and is excluded rather than estimated.
# --- Act structure chart without Ikiru ---
act_plot_data <- full_data |>
filter(title != "Ikiru") |>
mutate(
act1_pct = round(act1_climax_page / page_count, 3),
act2_pct = round((act2_climax_page - act1_climax_page) / page_count, 3),
act3_pct = round((page_count - act2_climax_page) / page_count, 3),
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
) |>
select(title_label, act1_pct, act2_pct, act3_pct) |>
pivot_longer(cols = -title_label,
names_to = "act",
values_to = "pct") |>
mutate(act = factor(act,
levels = c("act1_pct", "act2_pct", "act3_pct"),
labels = c("Act 1", "Act 2", "Act 3")))
act1_order <- full_data |>
filter(title != "Ikiru") |>
mutate(
act1_pct = round(act1_climax_page / page_count, 3),
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
) |>
arrange(act1_pct) |>
pull(title_label)
act_breaks_overlay2 <- full_data |>
filter(title != "Ikiru") |>
mutate(
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
) |>
select(title_label, inciting_incident_pct, act1_climax_pct, act2_climax_pct) |>
pivot_longer(-title_label, names_to = "beat", values_to = "pct") |>
mutate(beat = factor(beat,
levels = c("inciting_incident_pct", "act1_climax_pct", "act2_climax_pct"),
labels = c("Inciting Incident", "Act I Break", "Act II Break")))
label_colors_act <- ifelse(grepl("\\*", rev(act1_order)), "#E74C3C", "black")
act_plot_data |>
mutate(title_label = factor(title_label, levels = act1_order),
act = factor(act, levels = c("Act 3", "Act 2", "Act 1"))) |>
ggplot(aes(x = title_label, y = pct, fill = act)) +
geom_col(width = 0.7) +
geom_point(data = act_breaks_overlay2 |>
mutate(title_label = factor(title_label, levels = act1_order)),
aes(x = title_label, y = pct, shape = beat, color = beat),
size = 3.5, inherit.aes = FALSE) +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c("Act 1" = "#1A5276",
"Act 2" = "#85C1E9",
"Act 3" = "#D5F5E3")) +
scale_shape_manual(values = c("Inciting Incident" = 16,
"Act I Break" = 17,
"Act II Break" = 15)) +
scale_color_manual(values = c("Inciting Incident" = "#F39C12",
"Act I Break" = "#E74C3C",
"Act II Break" = "#8E44AD")) +
labs(
title = "Estimated Act Structure with Act Break Positions",
subtitle = "Sorted by Act I length — * = comparison film\nActs estimated from interpreted Act I and Act II turning points",
x = NULL,
y = "Share of Total Pages",
fill = "Act",
shape = "Beat",
color = "Beat"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(color = "gray40", size = 11),
axis.title = element_text(face = "bold"),
legend.position = "right",
axis.text.y = element_text(size = 11, color = rev(label_colors_act))
) +
guides(
fill = guide_legend(reverse = TRUE, order = 2),
shape = guide_legend(order = 1),
color = guide_legend(order = 1)
)
The most immediate thing this chart shows is how widely Act I length
varies across the sample. Amsterdam, one of the comparison films,
commits to its main conflict by page 16, or 11% of the screenplay.
Sentimental Value, an acclaimed script, doesn’t break until page 75, or
54%. That range alone is enough to complicate any claim that screenplays
follow a fixed structural template. What most of them do share is a long
middle: Act II consistently takes up the largest share of the
screenplay regardless of where Act I ends. The comparison films don’t
look obviously different here, which is itself a finding: structural
timing in broad strokes doesn’t separate good scripts from weak
ones.
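As a rough illustration of how the act shares in the chart are derived, the sketch below turns two act-break pages and a page count into three shares, mirroring the `act1_pct`, `act2_pct`, and `act3_pct` arithmetic. The helper name `act_shares()` and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow; they are not values from the dataset.

```r
# Hypothetical helper: convert act-break pages into shares of the script.
# Mirrors the act1_pct / act2_pct / act3_pct arithmetic used for the chart.
act_shares <- function(act1_break, act2_break, page_count) {
  stopifnot(act1_break > 0,
            act2_break >= act1_break,
            page_count >= act2_break)
  c(
    act1 = round(act1_break / page_count, 3),
    act2 = round((act2_break - act1_break) / page_count, 3),
    act3 = round((page_count - act2_break) / page_count, 3)
  )
}

# Illustrative input only (not a film from the sample):
# a 120-page script with act breaks at pages 30 and 90.
example_shares <- act_shares(act1_break = 30, act2_break = 90, page_count = 120)
print(example_shares)
```

Because each share is rounded separately, the three values can sum to slightly more or less than 1 for other inputs.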
full_data |>
select(title, inciting_incident_page, act1_climax_page,
midpoint_page, act2_climax_page, climax_page,
resolution_page, page_count) |>
mutate(
i1 = act1_climax_page - inciting_incident_page,
i2 = midpoint_page - act1_climax_page,
i3 = act2_climax_page - midpoint_page,
i4 = climax_page - act2_climax_page,
i5 = resolution_page - climax_page
) |>
select(title, i1, i2, i3, i4, i5) |>
print(n = 14)
## # A tibble: 14 × 6
## title i1 i2 i3 i4 i5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ikiru 24 11 NA NA 8
## 2 Thelma & Louise 11 35 44 18 9
## 3 Boogie Nights 46 30 53 2 11
## 4 Eyes Wide Shut 19 16 60 1 13
## 5 Moonlight 24 30 29 9 13
## 6 Get Out 41 18 16 11 9
## 7 Parasite 38 31 50 0 10
## 8 Portrait of a Lady on Fire 21 18 11 0 16
## 9 Aftersun 30 34 16 0 10
## 10 Sentimental Value 45 29 16 0 18
## 11 Sinners 21 28 40 0 22
## 12 Amsterdam 10 70 34 23 1
## 13 Don't Worry Darling 25 17 33 9 4
## 14 The Mummy 12 2 54 4 7
beats_clean <- read_csv("data/script_structure_dataset.csv") |>
select(-notes, -source_file, -scene_count, -scene_density) |>
rename(title = film) |>
mutate(
inciting_incident_pct = round(inciting_incident_page / page_count, 3),
act1_climax_pct = round(act1_climax_page / page_count, 3),
midpoint_pct = round(midpoint_page / page_count, 3),
act2_climax_pct = round(act2_climax_page / page_count, 3),
climax_pct = round(climax_page / page_count, 3),
resolution_pct = round(resolution_page / page_count, 3)
)
## Rows: 14 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): film, notes, source_file
## dbl (9): page_count, scene_count, scene_density, inciting_incident_page, act...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
full_data <- screenplays_clean |>
left_join(beats_clean, by = "title") |>
select(-page_count.y) |>
rename(page_count = page_count.x) |>
left_join(
tibble(
title = c("Aftersun", "Boogie Nights", "Eyes Wide Shut", "Get Out",
"Ikiru", "Moonlight", "Parasite", "Portrait of a Lady on Fire",
"Sentimental Value", "Sinners", "Thelma & Louise",
"Amsterdam", "Don't Worry Darling", "The Mummy"),
dialogue_ratio = c(0.132, NA, 0.239, 0.191, 0.142, 0.137, 0.068, NA,
0.127, 0.168, NA, NA, NA, NA),
avg_lines_page = c(36.0, NA, 21.3, 38.3, 40.4, 37.2, 39.1, NA,
41.9, 38.2, NA, NA, NA, NA),
quality = c(rep("Acclaimed", 11), rep("Comparison", 3))
),
by = "title"
) |>
mutate(
pages_per_min = round(page_count / runtime, 3),
act1_pct = round(act1_climax_page / page_count, 3),
act2_pct = round((act2_climax_page - act1_climax_page) / page_count, 3),
act3_pct = round((page_count - act2_climax_page) / page_count, 3)
)
# --- Check intervals ---
full_data |>
  filter(title != "Ikiru") |>
  mutate(
    i1 = act1_climax_page - inciting_incident_page,
    i2 = midpoint_page - act1_climax_page,
    i3 = act2_climax_page - midpoint_page,
    i4 = climax_page - act2_climax_page,
    i5 = resolution_page - climax_page
  ) |>
  select(title, quality, i1, i2, i3, i4, i5)
## # A tibble: 13 × 7
## title quality i1 i2 i3 i4 i5
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Thelma & Louise Acclaimed 11 35 44 18 9
## 2 Boogie Nights Acclaimed 46 30 53 2 11
## 3 Eyes Wide Shut Acclaimed 19 16 60 1 13
## 4 Moonlight Acclaimed 24 30 29 9 13
## 5 Get Out Acclaimed 41 18 16 11 9
## 6 Parasite Acclaimed 38 31 50 0 10
## 7 Portrait of a Lady on Fire Acclaimed 21 18 11 0 16
## 8 Aftersun Acclaimed 30 34 16 0 10
## 9 Sentimental Value Acclaimed 45 29 16 0 18
## 10 Sinners Acclaimed 21 28 40 0 22
## 11 Amsterdam Comparison 10 70 34 23 1
## 12 Don't Worry Darling Comparison 25 17 33 9 4
## 13 The Mummy Comparison 12 2 54 4 7
Ikiru is omitted from this chart due to an unresolved Act II break — the beat could not be coded with confidence and is excluded rather than estimated.
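The exclusion is hard-coded by title throughout this analysis. An alternative sketch, assuming the unresolved beat is stored as `NA` (which the `NA` intervals printed for Ikiru suggest), is to drop any film with a missing interval. The two rows are copied from the interval table above; the object names are illustrative.

```r
# Two rows copied from the interval table printed earlier; Ikiru's
# Midpoint-to-Act-II and Act-II-to-Climax intervals are NA.
intervals <- data.frame(
  title = c("Ikiru", "Moonlight"),
  i1 = c(24, 24),
  i2 = c(11, 30),
  i3 = c(NA, 29),
  i4 = c(NA, 9),
  i5 = c(8, 13),
  stringsAsFactors = FALSE
)

# Keep only films whose five intervals are all coded.
beat_cols <- c("i1", "i2", "i3", "i4", "i5")
complete_films <- intervals[complete.cases(intervals[, beat_cols]), ]
print(complete_films$title)
```

Filtering on missingness rather than on a title would also catch any later film with an uncoded beat, at the cost of making the exclusion less visible in the code.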
# --- Build heatmap data ---
heatmap_data <- full_data |>
filter(title != "Ikiru") |>
mutate(
i1 = round((act1_climax_page - inciting_incident_page) / page_count, 3),
i2 = round((midpoint_page - act1_climax_page) / page_count, 3),
i3 = round((act2_climax_page - midpoint_page) / page_count, 3),
i4 = round((climax_page - act2_climax_page) / page_count, 3),
i5 = round((resolution_page - climax_page) / page_count, 3)
) |>
select(title, quality, i1, i2, i3, i4, i5) |>
pivot_longer(-c(title, quality),
names_to = "interval",
values_to = "pct") |>
mutate(
interval = factor(interval,
levels = c("i1", "i2", "i3", "i4", "i5"),
labels = c("Inciting to Act I Break",
"Act I Break to Midpoint",
"Midpoint to Act II Break",
"Act II Break to Climax",
"Climax to Resolution")),
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
)
# --- Sort order ---
title_order <- full_data |>
filter(title != "Ikiru") |>
mutate(
mid_to_act2 = (act2_climax_page - midpoint_page) / page_count,
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
) |>
arrange(mid_to_act2) |>
pull(title_label)
# --- Label colors ---
label_colors <- ifelse(grepl("\\*", title_order), "#E74C3C", "black")
# --- Plot ---
heatmap_data |>
mutate(title_label = factor(title_label, levels = title_order)) |>
ggplot(aes(x = interval, y = title_label, fill = pct)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = scales::percent(pct, accuracy = 1)),
size = 3, color = "white") +
scale_fill_gradient(low = "#AED6F1", high = "#1A5276",
labels = scales::percent) +
labs(
title = "Where Do Screenplays Spend Their Pages?",
subtitle = "Sorted by Midpoint to Act II Break length — * = comparison film",
x = NULL,
y = NULL,
fill = "Share of\nTotal Pages"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(color = "gray40"),
axis.text.x = element_text(angle = 25, hjust = 1, face = "bold"),
axis.text.y = element_text(size = 11, color = label_colors),
legend.position = "right",
panel.grid = element_blank()
)
This chart asks a more granular version of the same question: within the
screenplay, where does the time actually go? The Midpoint to Act II
Break column is the darkest cell in the row for 7 of the 13 films: 5 of
the 10 acclaimed scripts and 2 of the 3 comparison films. That
post-midpoint escalation phase is the most common longest segment
regardless of quality, which suggests it may be less a marker of
greatness and more a structural tendency. The clearest outlier is
Amsterdam, which spends 49% of its screenplay between the Act I break
and the midpoint, an unusually long first half before the story locks
in. The Mummy concentrates an unusual amount in the Midpoint to Act II
Break segment, suggesting a script that front-loads setup and then
rushes through its final turns. Ikiru is omitted from this chart due to
an unresolved Act II break.
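One way to check which segment dominates each film is to tabulate the longest interval per row. The sketch below uses base R and the page intervals copied from the 13-film check table above; the object names are illustrative.

```r
# Page intervals copied from the 13-film check table (Ikiru excluded).
intervals <- data.frame(
  title = c("Thelma & Louise", "Boogie Nights", "Eyes Wide Shut", "Moonlight",
            "Get Out", "Parasite", "Portrait of a Lady on Fire", "Aftersun",
            "Sentimental Value", "Sinners", "Amsterdam",
            "Don't Worry Darling", "The Mummy"),
  quality = c(rep("Acclaimed", 10), rep("Comparison", 3)),
  i1 = c(11, 46, 19, 24, 41, 38, 21, 30, 45, 21, 10, 25, 12),
  i2 = c(35, 30, 16, 30, 18, 31, 18, 34, 29, 28, 70, 17, 2),
  i3 = c(44, 53, 60, 29, 16, 50, 11, 16, 16, 40, 34, 33, 54),
  i4 = c(18, 2, 1, 9, 11, 0, 0, 0, 0, 0, 23, 9, 4),
  i5 = c(9, 11, 13, 13, 9, 10, 16, 10, 18, 22, 1, 4, 7),
  stringsAsFactors = FALSE
)

segment_names <- c("Inciting to Act I Break", "Act I Break to Midpoint",
                   "Midpoint to Act II Break", "Act II Break to Climax",
                   "Climax to Resolution")

# Index of the longest interval in each row, mapped to its segment label.
longest_idx <- apply(intervals[, c("i1", "i2", "i3", "i4", "i5")], 1, which.max)
intervals$longest <- segment_names[longest_idx]

# Cross-tabulate the longest segment against the quality group.
longest_counts <- table(intervals$longest, intervals$quality)
print(longest_counts)
```

Raw pages rank the same as page shares within a film, since each film's intervals are divided by the same page count.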
# --- Average climax-to-resolution share by group ---
full_data |>
filter(title != "Ikiru") |>
mutate(climax_to_res = round((resolution_page - climax_page) / page_count, 3)) |>
group_by(quality) |>
summarise(avg = mean(climax_to_res, na.rm = TRUE))
## # A tibble: 2 × 2
## quality avg
## <chr> <dbl>
## 1 Acclaimed 0.112
## 2 Comparison 0.035
full_data |>
filter(title != "Ikiru") |>
mutate(
climax_to_res = round((resolution_page - climax_page) / page_count, 3),
title_label = ifelse(quality == "Comparison",
paste0(title, " *"), title)
) |>
arrange(climax_to_res) |>
mutate(title_label = factor(title_label, levels = title_label)) |>
ggplot(aes(x = climax_to_res, y = title_label,
color = quality, shape = quality)) +
geom_segment(aes(x = 0, xend = climax_to_res,
y = title_label, yend = title_label),
linewidth = 0.8) +
geom_point(size = 4) +
scale_x_continuous(labels = scales::percent) +
scale_color_manual(values = c("Acclaimed" = "#1A5276",
"Comparison" = "#E74C3C")) +
scale_shape_manual(values = c("Acclaimed" = 16,
"Comparison" = 17)) +
labs(
title = "Climax to Resolution Length",
subtitle = "Share of total pages remaining after the climax — * = comparison film",
x = "Share of Total Pages",
y = NULL,
color = NULL,
shape = NULL
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(color = "gray40"),
axis.title = element_text(face = "bold"),
legend.position = "none",
panel.grid.minor = element_blank()
)
This is one of the more filmmaker-relevant findings in the project. The
comparison films (Amsterdam, Don’t Worry Darling, and The Mummy) all
leave relatively little space after the climax: 3.5% of total pages on
average, against 11.2% for the acclaimed films. The acclaimed films tend
to leave more, though with real variation. Portrait of a Lady on Fire,
Sinners, and Moonlight all let the story settle significantly after the
final confrontation. That willingness to sit with the aftermath rather
than end immediately may be one of the subtler traits that distinguish
stronger scripts in this sample. It is a tendency worth noting, not a
rule.
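The same comparison can be sketched in raw pages instead of page shares, using the `i5` column from the interval check table above. This is a rough cross-check only: raw pages ignore differences in script length, and with three comparison films the averages are illustrative rather than conclusive. The object names are illustrative.

```r
# Climax-to-resolution pages (i5) copied from the 13-film check table.
climax_to_resolution <- data.frame(
  title = c("Thelma & Louise", "Boogie Nights", "Eyes Wide Shut", "Moonlight",
            "Get Out", "Parasite", "Portrait of a Lady on Fire", "Aftersun",
            "Sentimental Value", "Sinners", "Amsterdam",
            "Don't Worry Darling", "The Mummy"),
  quality = c(rep("Acclaimed", 10), rep("Comparison", 3)),
  pages = c(9, 11, 13, 13, 9, 10, 16, 10, 18, 22, 1, 4, 7),
  stringsAsFactors = FALSE
)

# Mean number of pages between climax and resolution, by group.
avg_pages <- tapply(climax_to_resolution$pages,
                    climax_to_resolution$quality,
                    mean)
print(round(avg_pages, 1))
```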
This project set out to ask whether great screenplays share a common structure. The short answer is no — not in any precise, measurable sense. Act I length varies enormously across the acclaimed sample. Inciting incidents arrive anywhere from page 6 to page 53. Some of the best scripts in this dataset are structurally unconventional by any standard definition.
But that’s not the whole answer. What the data does suggest is that strong scripts tend to share something closer to structural discipline than structural uniformity. The middle almost always does the most work. The ending almost always lands with some deliberateness — not necessarily long, but not abrupt either. And the major turning points, wherever they fall, tend to feel earned rather than arbitrary.
The comparison films are instructive here. Amsterdam, Don’t Worry Darling, and The Mummy don’t look dramatically different from the acclaimed films in terms of broad beat placement. They hit similar structural marks at roughly similar points. What they appear to lack is harder to quantify — the sense that the time between those marks is being used purposefully. That’s a craft question, not a structural one, and it’s largely beyond what page counts can tell us.
The strongest takeaway from this project may be negative: structural timing alone does not explain quality. A screenplay can hit every conventional beat at every conventional page and still fail. What separates the films in this sample isn’t where the turns happen — it’s what happens between them.