screenplaysanalysis

Introduction

What makes a great screenplay? That question drives this project.

We analyzed eleven critically acclaimed screenplays spanning 1952 to 2025 — Ikiru, Thelma & Louise, Boogie Nights, Eyes Wide Shut, Moonlight, Get Out, Parasite, Portrait of a Lady on Fire, Aftersun, Sentimental Value, and Sinners. Together they cover six countries, seven decades, and a wide range of genre, authorship, and cultural tradition. Nine are original screenplays. Two — Eyes Wide Shut and Moonlight — are adapted from existing source material.

The analysis is organized into five sections. Section 1 looks at the script on the page: how dense is the writing, how much is dialogue versus action, how fast does it read? Section 2 examines structure and pacing: where do key story beats fall, how long is each act, and which films follow conventional structure versus breaking from it? Section 3 looks at character concentration: how many voices does the script have, and how much of the story runs through the protagonist? Section 4 examines the physical world of each screenplay: how many locations, how interior or exterior, how concentrated is the dramatic space? Section 5 pulls it all together and asks what, if anything, the best scripts actually share — not as a formula, but as a set of instincts that show up again and again even in films that look nothing alike.

The underlying question throughout is one that anyone at A24 or Neon or any company betting on difficult, ambitious films has to answer: is there a pattern underneath the scripts that break the rules but still work?

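library(tidyverse)  # dplyr, tidyr, readr, tibble, ggplot2

# Reshape the act shares to long format for plotting
# (full_data is assembled in the data-preparation code later in this document)
act_plot_data <- full_data |>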
  select(title, act1_pct, act2_pct, act3_pct) |>
  pivot_longer(cols = -title, names_to = "act", values_to = "pct") |>
  mutate(act = factor(act,
                      levels = c("act1_pct", "act2_pct", "act3_pct"),
                      labels = c("Act 1", "Act 2", "Act 3")))

act1_order <- full_data |>
  arrange(act1_pct) |>
  pull(title)

act_breaks_overlay2 <- full_data |>
  select(title, inciting_incident_pct, act1_climax_pct, act2_climax_pct) |>
  pivot_longer(-title, names_to = "beat", values_to = "pct") |>
  mutate(beat = factor(beat,
    levels = c("inciting_incident_pct", "act1_climax_pct", "act2_climax_pct"),
    labels = c("Inciting Incident", "Act I Break", "Act II Break")))

Ikiru is omitted from this chart due to an unresolved Act II break — the beat could not be coded with confidence and is excluded rather than estimated.
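
One way to make that exclusion data-driven, rather than tied to a title string, is to drop any film whose Act II break is missing. The sketch below uses a small made-up table: the film names and page numbers are hypothetical placeholders, not values from the dataset, and only the column name `act2_climax_page` matches the analysis.

```r
# Toy beat table: titles and page numbers are hypothetical placeholders.
beats <- data.frame(
  title            = c("Film A", "Film B", "Film C"),
  act1_climax_page = c(30, 25, 40),
  act2_climax_page = c(90, NA, 95),   # Film B: Act II break could not be coded
  stringsAsFactors = FALSE
)

# Keep only films whose Act II break was coded; nothing is imputed.
coded   <- beats[!is.na(beats$act2_climax_page), ]
dropped <- setdiff(beats$title, coded$title)

print(coded$title)   # films that stay in the chart
print(dropped)       # films excluded for a missing beat
```

In the tidyverse pipeline the equivalent step would be `filter(!is.na(act2_climax_page))`. Judging by the interval table, Ikiru appears to be the only film with that beat missing, so the result should match the title-based filter.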

# --- Act structure chart without Ikiru ---
act_plot_data <- full_data |>
  filter(title != "Ikiru") |>
  mutate(
    act1_pct    = round(act1_climax_page / page_count, 3),
    act2_pct    = round((act2_climax_page - act1_climax_page) / page_count, 3),
    act3_pct    = round((page_count - act2_climax_page) / page_count, 3),
    title_label = ifelse(quality == "Comparison",
                         paste0(title, " *"), title)
  ) |>
  select(title_label, act1_pct, act2_pct, act3_pct) |>
  pivot_longer(cols = -title_label,
               names_to  = "act",
               values_to = "pct") |>
  mutate(act = factor(act,
                      levels = c("act1_pct", "act2_pct", "act3_pct"),
                      labels = c("Act 1", "Act 2", "Act 3")))

act1_order <- full_data |>
  filter(title != "Ikiru") |>
  mutate(
    act1_pct    = round(act1_climax_page / page_count, 3),
    title_label = ifelse(quality == "Comparison",
                         paste0(title, " *"), title)
  ) |>
  arrange(act1_pct) |>
  pull(title_label)

act_breaks_overlay2 <- full_data |>
  filter(title != "Ikiru") |>
  mutate(
    title_label = ifelse(quality == "Comparison",
                         paste0(title, " *"), title)
  ) |>
  select(title_label, inciting_incident_pct, act1_climax_pct, act2_climax_pct) |>
  pivot_longer(-title_label, names_to = "beat", values_to = "pct") |>
  mutate(beat = factor(beat,
    levels = c("inciting_incident_pct", "act1_climax_pct", "act2_climax_pct"),
    labels = c("Inciting Incident", "Act I Break", "Act II Break")))

label_colors_act <- ifelse(grepl("\\*", rev(act1_order)), "#E74C3C", "black")

act_plot_data |>
  mutate(title_label = factor(title_label, levels = act1_order),
         act = factor(act, levels = c("Act 3", "Act 2", "Act 1"))) |>
  ggplot(aes(x = title_label, y = pct, fill = act)) +
  geom_col(width = 0.7) +
  geom_point(data = act_breaks_overlay2 |>
               mutate(title_label = factor(title_label, levels = act1_order)),
             aes(x = title_label, y = pct, shape = beat, color = beat),
             size = 3.5, inherit.aes = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("Act 1" = "#1A5276",
                               "Act 2" = "#85C1E9",
                               "Act 3" = "#D5F5E3")) +
  scale_shape_manual(values = c("Inciting Incident" = 16,
                                "Act I Break"       = 17,
                                "Act II Break"      = 15)) +
  scale_color_manual(values = c("Inciting Incident" = "#F39C12",
                                "Act I Break"       = "#E74C3C",
                                "Act II Break"      = "#8E44AD")) +
  labs(
    title    = "Estimated Act Structure with Act Break Positions",
    subtitle = "Sorted by Act I length — * = comparison film\nActs estimated from interpreted Act I and Act II turning points",
    x        = NULL,
    y        = "Share of Total Pages",
    fill     = "Act",
    shape    = "Beat",
    color    = "Beat"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15),
    plot.subtitle   = element_text(color = "gray40", size = 11),
    axis.title      = element_text(face = "bold"),
    legend.position = "right",
    axis.text.y     = element_text(size = 11, color = rev(label_colors_act))
  ) +
  guides(
    fill  = guide_legend(reverse = TRUE, order = 2),
    shape = guide_legend(order = 1),
    color = guide_legend(order = 1)
  )

The most immediate thing this chart shows is how widely Act I length varies across the sample. Amsterdam, one of the comparison films, commits to its main conflict by page 16, which is 11% of the screenplay. Sentimental Value, an acclaimed film, doesn’t break until page 75, or 54%. That range alone is enough to complicate any claim that screenplays, great or otherwise, follow a fixed structural template. What most of them do share is a long middle. Act II usually takes up the largest share of the screenplay regardless of where Act I ends. The comparison films don’t look obviously different here, which is itself a finding: structural timing in broad strokes doesn’t separate good scripts from weak ones.
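
To make the arithmetic behind those percentages concrete, here is a minimal sketch of how the three act shares follow from the two break pages. The 120-page script and its break pages are invented for illustration. The formulas mirror the `act1_pct`, `act2_pct`, and `act3_pct` definitions used in the analysis code.

```r
# Hypothetical script: 120 pages, Act I break on page 30, Act II break on page 90.
page_count       <- 120
act1_climax_page <- 30
act2_climax_page <- 90

# Same formulas as the analysis code: each act's share of total pages.
act1_pct <- round(act1_climax_page / page_count, 3)
act2_pct <- round((act2_climax_page - act1_climax_page) / page_count, 3)
act3_pct <- round((page_count - act2_climax_page) / page_count, 3)

shares <- c(act1 = act1_pct, act2 = act2_pct, act3 = act3_pct)
print(shares)        # 0.25, 0.50, 0.25 for this toy script
print(sum(shares))   # the three shares cover the whole script
```

Because each share is rounded separately, the three values can miss 1 by a thousandth or so on real page counts.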

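# Pages elapsed between consecutive story beats
# (Ikiru shows NA wherever the uncoded Act II break is involved)
full_data |>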
  select(title, inciting_incident_page, act1_climax_page, 
         midpoint_page, act2_climax_page, climax_page, 
         resolution_page, page_count) |>
  mutate(
    i1 = act1_climax_page - inciting_incident_page,
    i2 = midpoint_page - act1_climax_page,
    i3 = act2_climax_page - midpoint_page,
    i4 = climax_page - act2_climax_page,
    i5 = resolution_page - climax_page
  ) |>
  select(title, i1, i2, i3, i4, i5) |>
  print(n = 14)

## # A tibble: 14 × 6
##    title                         i1    i2    i3    i4    i5
##    <chr>                      <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ikiru                         24    11    NA    NA     8
##  2 Thelma & Louise               11    35    44    18     9
##  3 Boogie Nights                 46    30    53     2    11
##  4 Eyes Wide Shut                19    16    60     1    13
##  5 Moonlight                     24    30    29     9    13
##  6 Get Out                       41    18    16    11     9
##  7 Parasite                      38    31    50     0    10
##  8 Portrait of a Lady on Fire    21    18    11     0    16
##  9 Aftersun                      30    34    16     0    10
## 10 Sentimental Value             45    29    16     0    18
## 11 Sinners                       21    28    40     0    22
## 12 Amsterdam                     10    70    34    23     1
## 13 Don't Worry Darling           25    17    33     9     4
## 14 The Mummy                     12     2    54     4     7

beats_clean <- read_csv("data/script_structure_dataset.csv") |>
  select(-notes, -source_file, -scene_count, -scene_density) |>
  rename(title = film) |>
  mutate(
    inciting_incident_pct = round(inciting_incident_page / page_count, 3),
    act1_climax_pct       = round(act1_climax_page / page_count, 3),
    midpoint_pct          = round(midpoint_page / page_count, 3),
    act2_climax_pct       = round(act2_climax_page / page_count, 3),
    climax_pct            = round(climax_page / page_count, 3),
    resolution_pct        = round(resolution_page / page_count, 3)
  )

## Rows: 14 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): film, notes, source_file
## dbl (9): page_count, scene_count, scene_density, inciting_incident_page, act...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

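# Combine screenplay-level metrics with beat pages, then attach the
# inline dialogue and line-density figures and the Acclaimed/Comparison label
full_data <- screenplays_clean |>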
  left_join(beats_clean, by = "title") |>
  select(-page_count.y) |>
  rename(page_count = page_count.x) |>
  left_join(
    tibble(
      title          = c("Aftersun", "Boogie Nights", "Eyes Wide Shut", "Get Out",
                         "Ikiru", "Moonlight", "Parasite", "Portrait of a Lady on Fire",
                         "Sentimental Value", "Sinners", "Thelma & Louise",
                         "Amsterdam", "Don't Worry Darling", "The Mummy"),
      dialogue_ratio = c(0.132, NA, 0.239, 0.191, 0.142, 0.137, 0.068, NA,
                         0.127, 0.168, NA, NA, NA, NA),
      avg_lines_page = c(36.0, NA, 21.3, 38.3, 40.4, 37.2, 39.1, NA,
                         41.9, 38.2, NA, NA, NA, NA),
      quality        = c(rep("Acclaimed", 11), rep("Comparison", 3))
    ),
    by = "title"
  ) |>
  mutate(
    pages_per_min = round(page_count / runtime, 3),
    act1_pct      = round(act1_climax_page / page_count, 3),
    act2_pct      = round((act2_climax_page - act1_climax_page) / page_count, 3),
    act3_pct      = round((page_count - act2_climax_page) / page_count, 3)
  )

# --- Check intervals ---
full_data |>
  filter(title != "Ikiru") |>
  mutate(
    i1 = act1_climax_page - inciting_incident_page,
    i2 = midpoint_page - act1_climax_page,
    i3 = act2_climax_page - midpoint_page,
    i4 = climax_page - act2_climax_page,
    i5 = resolution_page - climax_page
  ) |>
  select(title, quality, i1, i2, i3, i4, i5)

## # A tibble: 13 × 7
##    title                      quality       i1    i2    i3    i4    i5
##    <chr>                      <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Thelma & Louise            Acclaimed     11    35    44    18     9
##  2 Boogie Nights              Acclaimed     46    30    53     2    11
##  3 Eyes Wide Shut             Acclaimed     19    16    60     1    13
##  4 Moonlight                  Acclaimed     24    30    29     9    13
##  5 Get Out                    Acclaimed     41    18    16    11     9
##  6 Parasite                   Acclaimed     38    31    50     0    10
##  7 Portrait of a Lady on Fire Acclaimed     21    18    11     0    16
##  8 Aftersun                   Acclaimed     30    34    16     0    10
##  9 Sentimental Value          Acclaimed     45    29    16     0    18
## 10 Sinners                    Acclaimed     21    28    40     0    22
## 11 Amsterdam                  Comparison    10    70    34    23     1
## 12 Don't Worry Darling        Comparison    25    17    33     9     4
## 13 The Mummy                  Comparison    12     2    54     4     7

# --- Build heatmap data ---
heatmap_data <- full_data |>
  filter(title != "Ikiru") |>
  mutate(
    i1 = round((act1_climax_page - inciting_incident_page) / page_count, 3),
    i2 = round((midpoint_page - act1_climax_page) / page_count, 3),
    i3 = round((act2_climax_page - midpoint_page) / page_count, 3),
    i4 = round((climax_page - act2_climax_page) / page_count, 3),
    i5 = round((resolution_page - climax_page) / page_count, 3)
  ) |>
  select(title, quality, i1, i2, i3, i4, i5) |>
  pivot_longer(-c(title, quality),
               names_to  = "interval",
               values_to = "pct") |>
  mutate(
    interval = factor(interval,
      levels = c("i1", "i2", "i3", "i4", "i5"),
      labels = c("Inciting to Act I Break",
                 "Act I Break to Midpoint",
                 "Midpoint to Act II Break",
                 "Act II Break to Climax",
                 "Climax to Resolution")),
    title_label = ifelse(quality == "Comparison",
                         paste0(title, " *"), title)
  )

# --- Sort order ---
title_order <- full_data |>
  filter(title != "Ikiru") |>
  mutate(
    mid_to_act2 = (act2_climax_page - midpoint_page) / page_count,
    title_label = ifelse(quality == "Comparison",
                         paste0(title, " *"), title)
  ) |>
  arrange(mid_to_act2) |>
  pull(title_label)

# --- Label colors ---
label_colors <- ifelse(grepl("\\*", title_order), "#E74C3C", "black")

# --- Plot ---
heatmap_data |>
  mutate(title_label = factor(title_label, levels = title_order)) |>
  ggplot(aes(x = interval, y = title_label, fill = pct)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = scales::percent(pct, accuracy = 1)),
            size = 3, color = "white") +
  scale_fill_gradient(low = "#AED6F1", high = "#1A5276",
                      labels = scales::percent) +
  labs(
    title    = "Where Do Screenplays Spend Their Pages?",
    subtitle = "Sorted by Midpoint to Act II Break length — * = comparison film",
    x        = NULL,
    y        = NULL,
    fill     = "Share of\nTotal Pages"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15),
    plot.subtitle   = element_text(color = "gray40"),
    axis.text.x     = element_text(angle = 25, hjust = 1, face = "bold"),
    axis.text.y     = element_text(size = 11, color = label_colors),
    legend.position = "right",
    panel.grid      = element_blank()
  )

This chart asks a more granular version of the same question: within the screenplay, where does the time actually go? The Midpoint to Act II Break column is the darkest cell for just over half of the films (7 of 13 in the interval table), acclaimed and comparison alike. That post-midpoint escalation phase is the longest segment more often than any other interval regardless of quality, which suggests it may be less a marker of greatness and more a structural tendency. The clearest outlier is Amsterdam, which spends 49% of its screenplay (70 pages) between the Act I break and the midpoint, an unusually long first half before the story locks in. The Mummy concentrates an unusual amount in the Midpoint to Act II Break segment, suggesting a script that front-loads setup and then rushes through its final turns. Ikiru is omitted from this chart due to an unresolved Act II break.
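
As a cross-check on that reading, the sketch below takes the raw page counts from the interval table printed earlier (Ikiru excluded) and finds each film's longest interval. Within a single film, page counts and page shares rank the intervals identically, so the raw counts are enough for this check. It is written in base R so it runs on its own, and the object names are illustrative rather than taken from the project.

```r
# Raw pages per interval, copied from the interval table (Ikiru excluded).
intervals <- data.frame(
  title = c("Thelma & Louise", "Boogie Nights", "Eyes Wide Shut", "Moonlight",
            "Get Out", "Parasite", "Portrait of a Lady on Fire", "Aftersun",
            "Sentimental Value", "Sinners", "Amsterdam",
            "Don't Worry Darling", "The Mummy"),
  i1 = c(11, 46, 19, 24, 41, 38, 21, 30, 45, 21, 10, 25, 12),
  i2 = c(35, 30, 16, 30, 18, 31, 18, 34, 29, 28, 70, 17,  2),
  i3 = c(44, 53, 60, 29, 16, 50, 11, 16, 16, 40, 34, 33, 54),
  i4 = c(18,  2,  1,  9, 11,  0,  0,  0,  0,  0, 23,  9,  4),
  i5 = c( 9, 11, 13, 13,  9, 10, 16, 10, 18, 22,  1,  4,  7)
)

# For each film, name the interval with the most pages.
cols <- c("i1", "i2", "i3", "i4", "i5")
intervals$longest <- cols[apply(intervals[, cols], 1, which.max)]

# How often each interval is a film's longest.
longest_counts <- table(intervals$longest)
print(longest_counts)
```

On these numbers, Midpoint to Act II Break (`i3`) is the longest interval for 7 of the 13 films, with `i1` and `i2` leading in 3 films each.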

# --- Average climax-to-resolution share by group ---
full_data |>
  filter(title != "Ikiru") |>
  mutate(climax_to_res = round((resolution_page - climax_page) / page_count, 3)) |>
  group_by(quality) |>
  summarise(avg = mean(climax_to_res, na.rm = TRUE))

## # A tibble: 2 × 2
##   quality      avg
##   <chr>      <dbl>
## 1 Acclaimed  0.112
## 2 Comparison 0.035

full_data |>
  filter(title != "Ikiru") |>
  mutate(
    climax_to_res = round((resolution_page - climax_page) / page_count, 3),
    title_label   = ifelse(quality == "Comparison",
                           paste0(title, " *"), title)
  ) |>
  arrange(climax_to_res) |>
  mutate(title_label = factor(title_label, levels = title_label)) |>
  ggplot(aes(x = climax_to_res, y = title_label,
             color = quality, shape = quality)) +
  geom_segment(aes(x = 0, xend = climax_to_res,
                   y = title_label, yend = title_label),
               linewidth = 0.8) +
  geom_point(size = 4) +
  scale_x_continuous(labels = scales::percent) +
  scale_color_manual(values = c("Acclaimed"  = "#1A5276",
                                "Comparison" = "#E74C3C")) +
  scale_shape_manual(values = c("Acclaimed"  = 16,
                                "Comparison" = 17)) +
  labs(
    title    = "Climax to Resolution Length",
    subtitle = "Share of total pages remaining after the climax — * = comparison film",
    x        = "Share of Total Pages",
    y        = NULL,
    color    = NULL,
    shape    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", size = 15),
    plot.subtitle    = element_text(color = "gray40"),
    axis.title       = element_text(face = "bold"),
    legend.position  = "none",
    panel.grid.minor = element_blank()
  )

This is one of the more filmmaker-relevant findings in the project. The comparison films (Amsterdam, Don’t Worry Darling, and The Mummy) all leave relatively little space after the climax, averaging about 3.5% of total pages between climax and resolution. The acclaimed films tend to leave more, about 11% on average, though with real variation. Portrait of a Lady on Fire, Sinners, and Moonlight all let the story settle significantly after the final confrontation. That willingness to sit with the aftermath rather than end immediately may be one of the subtler traits that distinguish stronger scripts in this sample. With only three comparison films it is not a rule, but it is a tendency worth noting.
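
The same gap shows up in raw pages. The sketch below takes the Climax to Resolution page counts from the interval table (Ikiru excluded, to match the chart) and averages them by group. It is a rough check in base R, not the share-based calculation behind the averages reported above, and with only three comparison films it should be read as descriptive.

```r
# Pages between climax and resolution (column i5 of the interval table).
climax_to_res_pages <- c(9, 11, 13, 13, 9, 10, 16, 10, 18, 22,   # acclaimed
                         1, 4, 7)                                 # comparison
quality <- c(rep("Acclaimed", 10), rep("Comparison", 3))

# Mean page count per group.
group_means <- tapply(climax_to_res_pages, quality, mean)
print(group_means)
```

That works out to about 13 pages for the acclaimed films against 4 for the comparison films.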

Mark Hamer

2026-03-13
