Quantile analysis of motion picture shot lengths

Nick Redfern

2022-02-28

This tutorial accompanies my article ‘Distributional thinking about film style: quantile comparisons of motion picture shot length data.’

Abstract

In this article I illustrate the use of quantiles as a means of describing and comparing motion picture shot length distributions. This approach is conceptually and computationally simple and leads us to think distributionally about shot lengths rather than focussing on individual values. The result is a better understanding of how this element of film style of two (or more) films differs.

In this tutorial I demonstrate the code used in my article so that those who would like to use quantile-based methods in their own research into shot length distributions may do so.

The GitHub repository for this tutorial contains the shot length data used.

Packages

In this tutorial we will use five packages.

The pacman package installs the other four packages if they are not already installed and loads them into the workspace.

# If the pacman package is not already installed on your computer un-comment the following line and run
# install.packages("pacman")

# Load the required packages
pacman::p_load(Hmisc, ggridges, tidyverse, viridis)

The table below summarises the use of these packages in this tutorial.

Package	Description
Hmisc	Calculate the quantiles of a film’s shot lengths
ggridges	Visualise multiple distributions on a single axis
tidyverse	Data wrangling and plotting
viridis	Colour schemes for plotting

Quantiles

A quantile \(Q_{p}\) is a cut point dividing a data set arranged from the smallest to the largest value so that a specified proportion of data points \(p\) is less than or equal to that point.

\[ Q_{p} = \{x : \Pr(X \leq x) = p\} \]

We can calculate a quantile for any value of \(p\), but commonly used quantiles include the median (\(Q_{0.5}\)), which divides a data set into two equal parts, and the lower (\(Q_{0.25}\)) and upper (\(Q_{0.75}\)) quartiles that cut off the lower and upper 25% of a data set, respectively.
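
As a minimal illustration of these definitions, the sketch below computes the median and quartiles of a small set of made-up shot lengths. It uses base R's quantile() function with its default estimator, not the Harrell-Davis estimator used later in this tutorial, and the data values are invented for the example.

```r
# Toy shot lengths in seconds (illustrative values, not taken from any film)
shot_lengths <- c(1.5, 1.9, 2.2, 3.0, 4.0, 5.0, 7.0, 8.0, 10.7)

# Lower quartile, median, and upper quartile using base R's default estimator
quartiles <- quantile(shot_lengths, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Proportion of shots that are less than or equal to the median
prop_below_median <- mean(shot_lengths <= quartiles[["50%"]])
print(prop_below_median)
```

With only nine shots the proportion at or below the median is 5/9 rather than exactly 0.5, which is a reminder that sample quantiles only approximate the definition above.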

Quantiles can be used to describe the shape of a distribution in terms of its asymmetry or skewness:

\[ S = \frac{Q_{0.25} + Q_{0.75} - 2Q_{0.5}}{Q_{0.75} - Q_{0.25}} \]

and its kurtosis, which describes how much of the data is in the tails of the distribution relative to the centre:

\[ T = \frac{(Q_{0.875} - Q_{0.625}) + (Q_{0.375} - Q_{0.125})}{Q_{0.75} - Q_{0.25}} \]
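
To make the two formulas concrete, here is a small sketch that evaluates \(S\) and \(T\) for a toy data set. The helper quantile_shape() is a hypothetical name introduced for this example, and it uses base R's quantile() rather than the Harrell-Davis estimator. The last lines evaluate \(T\) from the theoretical quantiles of a normal distribution to give a reference point.

```r
# Quantile-based skewness (S) and kurtosis (T) as defined above.
# quantile_shape() is a hypothetical helper for illustration only.
quantile_shape <- function(x) {
  # Octiles Q_0.125, Q_0.25, ..., Q_0.875
  q <- unname(quantile(x, probs = seq(0.125, 0.875, 0.125)))
  iqr <- q[6] - q[2]
  c(S = (q[2] + q[6] - 2 * q[4]) / iqr,
    T = ((q[7] - q[5]) + (q[3] - q[1])) / iqr)
}

# Toy shot lengths in seconds (invented values)
shot_lengths <- c(1.5, 1.9, 2.2, 3.0, 4.0, 5.0, 7.0, 8.0, 10.7)
shape <- quantile_shape(shot_lengths)
print(shape)

# Reference value of T for a normal distribution, from its theoretical octiles
z <- qnorm(seq(0.125, 0.875, 0.125))
T_normal <- ((z[7] - z[5]) + (z[3] - z[1])) / (z[6] - z[2])
print(T_normal)
```

\(S\) lies between -1 and 1, with positive values indicating that the upper quartile is further from the median than the lower quartile (right skew). For a normal distribution \(T\) is approximately 1.23, so larger values suggest heavier tails than the normal.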

To calculate the Harrell-Davis quantiles of the shot length data for each film we will use the hdquantile() function from the Hmisc package.
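
For readers who want to see what the Harrell-Davis estimator does, the following base R sketch computes it directly: the estimate is a weighted average of all the ordered data values, with weights taken from a beta distribution with parameters \((n + 1)p\) and \((n + 1)(1 - p)\). The function hd_sketch() is a hypothetical name used only for this illustration; for real analyses use Hmisc::hdquantile() as in the rest of this tutorial.

```r
# hd_sketch() is a hypothetical helper written for illustration only;
# in practice use Hmisc::hdquantile().
hd_sketch <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  a <- (n + 1) * p
  b <- (n + 1) * (1 - p)
  # Weight on the i-th ordered value: the Beta(a, b) probability
  # of the interval from (i - 1)/n to i/n
  w <- pbeta(seq_len(n) / n, a, b) - pbeta((seq_len(n) - 1) / n, a, b)
  sum(w * x)
}

# For symmetric data the estimated median is the centre of the data
hd_median <- hd_sketch(1:9, 0.5)
print(hd_median)

# The weights sum to one, so a constant vector returns that constant
hd_constant <- hd_sketch(rep(3, 5), 0.25)
print(hd_constant)
```

Because every data point contributes to each estimate, the resulting quantiles tend to vary more smoothly from one value of \(p\) to the next than quantiles based on one or two order statistics.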

Comparing shot lengths in two films

The shot length data of You’re Darn Tootin’ (1928) and Hog Wild (1930) is available on the GitHub repository of this tutorial as a csv file in the Data folder. To read this data directly from GitHub we run the following code:

# Note that the URL must point to the raw version of the data and not the display version
df_LH <- read_csv("https://raw.githubusercontent.com/DrNickRedfern/SL-quantiles/main/Data/SL-quantiles-Figure-1.csv")

We can inspect the top of the data frame df_LH using the head() function.

head(df_LH)

## # A tibble: 6 x 2
##   film                values
##   <chr>                <dbl>
## 1 You're Darn Tootin'   10.7
## 2 You're Darn Tootin'    5  
## 3 You're Darn Tootin'    1.9
## 4 You're Darn Tootin'    8  
## 5 You're Darn Tootin'    7  
## 6 You're Darn Tootin'    1.5

Summarise the shot length data

To get a quantile summary of the shot length data of each film we can create a simple function that takes a vector of shot lengths as an input and returns a list of values that includes the number of shots in a film, the average (mean) shot length, the five-number summary of the data set (the minimum, lower quartile, median, upper quartile, and maximum shot lengths), the interquartile range, and the quantile-based measures of skewness (\(S\)) and kurtosis (\(T\)) defined above.

# The input to the function is a vector containing the shot lengths for a film
SL_summary <- function(x){
  
  # Check that the Hmisc package is installed
  if (!requireNamespace("Hmisc", quietly = TRUE)) 
    stop("This function requires the Hmisc package to be installed.")
  
  # Calculate the quantiles required to produce the summaries of the data
  q <- Hmisc::hdquantile(x, probs = seq(0.125, 0.875, 0.125), na.rm = TRUE, 
                         names = FALSE, se = FALSE, weights = FALSE)
  
  # create a list containing the output values
  SL_sum <- list(`Shots (N)` = round(length(x), 0),
             `Mean (s)` = round(mean(x), 1),
             `Minimum (s)` = min(x),
             `Lower quantile (s)` = round(q[2], 1),
             `Median (s)` = round(q[4], 1),
             `Upper quantile (s)` = round(q[6], 1),
             `Maximum (s)` = max(x),
             `Interquartile range` = round(q[6] - q[2], 1),
             `Quantile skewness` = round((q[2] + q[6] - (2 * q[4]))/(q[6] - q[2]), 2),
             `Quantile kurtosis` = round(((q[7] - q[5]) + (q[3] - q[1]))/(q[6] - q[2]), 2))
  
  # Return the list explicitly (an assignment on its own is returned invisibly)
  return(SL_sum)
}

Next, we apply the SL_summary() function to the data for each film and rearrange the result so it will display nicely.

df_LH %>% 
  # apply the SL_summary function to each film
  group_by(film) %>%
  group_modify(~{.x %>% 
      map_dfc(SL_summary)}) %>%  
  # rearrange the result
  tibble() %>% 
  column_to_rownames(var = "film") %>%
  t()

	Hog Wild	You’re Darn Tootin’
Shots (N)	169.00	189.00
Mean (s)	6.60	6.60
Minimum (s)	0.50	0.70
Lower quantile (s)	1.80	2.20
Median (s)	3.00	4.00
Upper quantile (s)	7.30	8.10
Maximum (s)	65.10	49.80
Interquartile range	5.50	5.90
Quantile skewness	0.57	0.39
Quantile kurtosis	1.93	1.28
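
The table shows that the two films have the same mean shot length (6.6 s) but different medians (3.0 s and 4.0 s). The base R sketch below reproduces this situation with two invented films, to illustrate why a summary built on quantiles can separate distributions that the mean alone cannot. The data, the film labels, and the object names are all made up for the example.

```r
# Toy data: two hypothetical films with the same mean shot length
toy <- data.frame(
  film = rep(c("Film A", "Film B"), each = 5),
  SL   = c(1, 2, 3, 4, 20,
           4, 5, 6, 7, 8)
)

# Split the shot lengths by film and summarise each group
toy_summary <- sapply(split(toy$SL, toy$film), function(x) {
  c(mean = mean(x), median = median(x), max = max(x))
})
print(toy_summary)
```

Both toy films have a mean of 6 s, but the single long shot in Film A pulls its mean well above its median of 3 s, whereas the mean and median of Film B coincide.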

Plotting the kernel densities

Now we can plot the kernel densities of the two films in order to compare their respective distributions. In the example below I have set the bandwidth of the density function (bw) to 0.1.

# Plot the kernel densities of the two films
ggplot(data = df_LH, aes(x = values, fill = film)) +
  geom_density(bw = 0.1, alpha = 0.8, trim = FALSE) +
  scale_x_continuous(limits = c(0.1, 100), breaks = c(0.1, 1, 10, 100), 
                     minor_breaks = c(seq(0.2, 0.9, 0.1), seq(2, 9, 1), seq(20, 90, 10)), 
                     trans = "log10") +
  scale_y_continuous(breaks = seq(0, 1, 0.25)) +
  labs(x = "Shot length (s)", y = "Density") +
  scale_fill_manual(name = NULL, values = c("#440154", "#21908C")) +
  theme(legend.position = "bottom",
        axis.title = element_text(size = 10,face = "bold"),
        plot.title = element_text(size = 10),
        panel.background = element_rect(fill = "gray85"))

Figure 1: Kernel density estimates of shot length distributions of two Laurel and Hardy films: You’re Darn Tootin’ (1928) and Hog Wild (1930).
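
The choice of bandwidth affects how smooth the density estimate looks. Because the x-axis is log10-transformed, and ggplot2 applies scale transformations before computing statistics, the bandwidth of 0.1 in the plot above is in log10 units rather than seconds. The sketch below uses base R's density() function on simulated data (hypothetical lognormal shot lengths, not the Laurel and Hardy data) to compare a narrow and a wide bandwidth on the log10 scale.

```r
set.seed(1)
# Simulated, right-skewed shot lengths in seconds (hypothetical data)
shot_lengths <- rlnorm(200, meanlog = 1.3, sdlog = 0.9)

# Kernel density estimate on the log10 scale with a fixed bandwidth of 0.1
dens_narrow <- density(log10(shot_lengths), bw = 0.1)

# A wider bandwidth gives a smoother estimate with a lower peak
dens_wide <- density(log10(shot_lengths), bw = 0.3)

# Approximate area under the narrow-bandwidth curve (should be close to 1)
area_narrow <- sum(dens_narrow$y) * diff(dens_narrow$x[1:2])

print(c(peak_narrow = max(dens_narrow$y), peak_wide = max(dens_wide$y)))
```

A smaller bandwidth preserves more local detail in the distribution at the cost of a bumpier curve, so it is worth trying a few values before settling on one.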

Calculate and plot the quantiles

To compare the quantiles of the shot lengths of the two films, we first need to calculate the quantiles. We will calculate the quantiles from \(Q_{0.05}\) to \(Q_{0.95}\) at intervals of 0.05.

df_LH_quantiles <- df_LH %>%
  group_by(film) %>%
  group_modify(~ {
     round(hdquantile(.x$values, probs = seq(0.05, 0.95, 0.05), na.rm = TRUE, 
                      names = FALSE, se = FALSE, weights = FALSE), 1) %>%
     enframe(name = "quantile", value = "SL")
  }) %>%
  mutate(quantile = seq(0.05, 0.95, 0.05))

film	quantile	SL
Hog Wild	0.05	1.0
Hog Wild	0.10	1.2
Hog Wild	0.15	1.5
Hog Wild	0.20	1.6
Hog Wild	0.25	1.8
Hog Wild	0.30	2.0

Next we plot the quantile profiles for each film.

ggplot(data = df_LH_quantiles, aes(x = quantile, y = SL, group = film)) +
  geom_line(aes(colour = film)) +
  geom_point(aes(colour = film)) +
  scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
  scale_y_continuous(trans = "log10", limits = c(0.9, 30), 
                     breaks = c(1, 10, 30), minor_breaks= c(seq(2, 9, 1), 20)) +
  labs(title = "Quantile profiles", x ="Quantile", y = " Shot length (s)") +
  scale_colour_manual(name = NULL, values = c("#440154", "#21908C")) +
  theme(legend.position = "bottom",
        axis.title = element_text(size = 10, face = "bold"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_rect(fill = "gray85"))

Figure 2A: The quantile profiles for two Laurel and Hardy films.

To calculate the difference between the quantiles of the two films we re-arrange the data into a wide format. The differences are calculated by subtracting each quantile of You’re Darn Tootin’ from the corresponding quantile of Hog Wild, so that negative differences indicate quantiles at which shots in You’re Darn Tootin’ tend to be longer and positive differences indicate quantiles at which shots in Hog Wild tend to be longer.

df_LH_quantile_diff <- df_LH_quantiles %>% 
  pivot_wider(names_from = film, values_from = SL) %>%
  mutate(diff = .[[2]] - .[[3]])

quantile	Hog Wild	You’re Darn Tootin’	diff
0.05	1.0	1.2	-0.2
0.10	1.2	1.5	-0.3
0.15	1.5	1.7	-0.2
0.20	1.6	2.0	-0.4
0.25	1.8	2.2	-0.4
0.30	2.0	2.5	-0.5
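
The subtraction can be checked by hand. The base R sketch below re-enters the first six quantiles of each film from the tables above and computes the differences in the same direction (Hog Wild minus You’re Darn Tootin’). The object and column names are chosen for this example only.

```r
# First six quantiles for each film, copied from the tables above
quantile_table <- data.frame(
  quantile = seq(0.05, 0.30, 0.05),
  hog_wild = c(1.0, 1.2, 1.5, 1.6, 1.8, 2.0),
  tootin   = c(1.2, 1.5, 1.7, 2.0, 2.2, 2.5)
)

# Difference = Hog Wild minus You're Darn Tootin'
quantile_table$diff <- quantile_table$hog_wild - quantile_table$tootin
print(quantile_table)

# TRUE if every difference is negative, i.e. the You're Darn Tootin'
# quantile is the larger one at each of these values of p
all_negative <- all(quantile_table$diff < 0)
print(all_negative)
```

All six differences are negative, so in the lower part of the distributions the shots in You’re Darn Tootin’ are consistently longer than those in Hog Wild.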

Now we plot the result.

ggplot(data = df_LH_quantile_diff,
                    aes(x = quantile, y = diff, label = sprintf("%0.1f", round(diff, digits = 1)))) +
  geom_point(aes(colour = diff), size = 8, show.legend = FALSE) +
  geom_text(colour = "white", size = 2.4, fontface = "bold") +
  scale_colour_gradient(low = "#21908C", high = "#440154") +
  scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
  scale_y_continuous(breaks = c(-2, 0, 2, 4), 
                     labels = sprintf("%0.1f", round(c(-2, 0, 2, 4), digits = 1))) +
  expand_limits(y = c(-2,5)) +
  labs(title = "Quantile differences", x = "Quantile", y = "Quantile difference (s)") +
  theme(axis.title = element_text(size = 10, face = "bold"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.grid.minor.y = element_blank(),
        panel.background = element_rect(fill = "gray85"))

Figure 2B: Quantile differences for two Laurel and Hardy films.

Comparing shot lengths in two groups of films

In this part of the tutorial we will use the early Hitchcock shot length data, which includes shot length data for four silent (The Ring (1927), The Farmer’s Wife (1928), Champagne (1928), and The Manxman (1929)) and five sound films (Blackmail (1929), Murder! (1930), The Skin Game (1931), Rich and Strange (1931), Number Seventeen (1932)) directed by Alfred Hitchcock.

First, we need to load the data from GitHub.

# Note that the URL must point to the raw version of the data and not the display version
df_Hitchcock <- read_csv("https://raw.githubusercontent.com/DrNickRedfern/SL-quantiles/main/Data/SL-quantiles-Hitchcock_SL.csv")

head(df_Hitchcock)

## # A tibble: 6 x 9
##   `The Ring (1927)` `The Farmers Wife (1928)` `Champagne (192~` `The Manxman (~`
##               <dbl>                     <dbl>             <dbl>            <dbl>
## 1              4.04                      5.79             19.2               7.8
## 2              8.46                     17.1              17.0              25.1
## 3              5.58                      8.67              3.42              7.5
## 4              6                         5.08             14.7               3.1
## 5              1.71                      4.5               2.67              3.2
## 6              5.04                     11.1               2.79              2.2
## # ... with 5 more variables: `Blackmail (1929)` <dbl>, `Murder (1930)` <dbl>,
## #   `The Skin Game (1931)` <dbl>, `Rich and Strange (1931)` <dbl>,
## #   `Number Seventeen (1932)` <dbl>

We need to re-arrange the data so it is in a long format.

df_Hitchcock <- df_Hitchcock %>% 
  pivot_longer(cols = 1:9, names_to = "film", values_to = "SL", values_drop_na = TRUE) %>%
  arrange(film)

film	SL
Blackmail (1929)	6.88
Blackmail (1929)	4.67
Blackmail (1929)	2.96
Blackmail (1929)	1.75
Blackmail (1929)	2.67
Blackmail (1929)	6.21
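
The wide data has one column per film, and because the films have different numbers of shots the shorter columns are padded with missing values; values_drop_na = TRUE removes that padding. The sketch below illustrates the same reshaping on a toy data frame using base R's stack() function instead of pivot_longer(), purely so that the example runs without any packages. The film labels and values are invented.

```r
# Toy wide data: Film B has one shot fewer than Film A, so it is padded with NA
wide <- data.frame(`Film A` = c(4.0, 8.5, 5.6),
                   `Film B` = c(5.8, 17.1, NA),
                   check.names = FALSE)

# Stack the columns into a long format: one row per shot
long <- stack(wide)
names(long) <- c("SL", "film")

# Drop the padding and put the columns in the same order as the table above
long <- long[!is.na(long$SL), c("film", "SL")]
print(long)
```

The long data frame has five rows, one for each real shot, and the padded value has gone.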

Summarise the shot length data

To summarise the shot length data we apply the SL_summary function we created earlier.

df_Hitchcock %>%
  # apply the SL_summary function to each film
  group_by(film) %>%
  group_modify(~{.x %>%
      map_dfc(SL_summary)}) %>%
  # rearrange the result
  tibble() %>%
  column_to_rownames(var = "film") %>%
  t()

	Blackmail (1929)	Champagne (1928)	Murder (1930)	Number Seventeen (1932)	Rich and Strange (1931)	The Farmers Wife (1928)	The Manxman (1929)	The Ring (1927)	The Skin Game (1931)
Shots (N)	438.00	893.00	438.00	655.00	687.00	1007.00	808.00	1056.00	269.00
Mean (s)	11.50	5.90	13.70	5.70	7.10	5.70	6.10	5.00	18.10
Minimum (s)	0.92	0.04	0.50	0.10	0.20	0.25	0.60	0.04	0.80
Lower quantile (s)	2.80	2.30	2.20	1.30	2.00	2.30	2.60	2.20	2.20
Median (s)	5.50	4.00	4.80	2.40	3.70	4.00	4.40	3.50	5.00
Upper quantile (s)	12.70	7.40	15.30	5.60	7.30	6.90	7.60	6.20	14.70
Maximum (s)	148.17	63.71	223.90	77.70	79.10	78.83	52.50	57.63	281.00
Interquartile range	9.80	5.10	13.10	4.30	5.20	4.60	4.90	4.00	12.50
Quantile skewness	0.46	0.32	0.60	0.49	0.37	0.25	0.26	0.32	0.55
Quantile kurtosis	1.63	1.40	1.86	1.94	1.93	1.39	1.30	1.48	2.60

Plot the distributions

We want to identify films as being either silent films or sound films so that we can compare differences between groups. We will also set the titles of the films as factors to plot in the correct order.

# Classify the films
df_Hitchcock <- df_Hitchcock %>% mutate(type = ifelse(film %in% c("The Ring (1927)", 
                                                                  "The Manxman (1929)", 
                                                                  "The Farmers Wife (1928)",
                                                                  "Champagne (1928)"),
                                                      "Silent", "Sound"))

# Set film titles as a factor
df_Hitchcock$film <- factor(df_Hitchcock$film,
                            levels = c("The Ring (1927)",
                                       "The Farmers Wife (1928)",
                                       "Champagne (1928)",
                                       "The Manxman (1929)",
                                       "Blackmail (1929)",
                                       "Murder (1930)",
                                       "The Skin Game (1931)",
                                       "Rich and Strange (1931)",
                                       "Number Seventeen (1932)"))

film	SL	type
Blackmail (1929)	6.88	Sound
Blackmail (1929)	4.67	Sound
Blackmail (1929)	2.96	Sound
Blackmail (1929)	1.75	Sound
Blackmail (1929)	2.67	Sound
Blackmail (1929)	6.21	Sound
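
Setting the levels explicitly matters because, by default, R sorts factor levels alphabetically, which would not give the films in order of release. A short sketch with three of the titles shows the difference. The object names are chosen for this example only.

```r
films <- c("Blackmail (1929)", "The Ring (1927)", "Champagne (1928)")

# Default behaviour: levels are sorted alphabetically
default_order <- levels(factor(films))
print(default_order)

# Explicit levels: the films appear in order of release
release_order <- levels(factor(films,
                               levels = c("The Ring (1927)",
                                          "Champagne (1928)",
                                          "Blackmail (1929)")))
print(release_order)
```

Plotting functions use the order of the levels, so the explicit ordering carries through to the figures.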

To compare the distribution of shot lengths in these films we can use the ggridges package to plot the kernel densities on a single x-axis for easy comparison, with the median and upper and lower quartiles added for reference.

ggplot(data = df_Hitchcock, aes(x = SL, y = reorder(film, desc(film)), fill = type)) + 
  geom_density_ridges(scale = 1, quantile_lines = TRUE, alpha = 0.8) +
  scale_x_continuous(expand = c(0.01, 0), breaks = c(0.1, 1, 10, 100), 
                     minor_breaks = c(seq(0.02, 0.09, 0.01), 
                                      seq(0.2, 0.9, 0.1), 
                                      seq(2, 9, 1), 
                                      seq(20, 90, 10), 
                                      seq(200, 900, 100)),
                     labels = c("0.1", "1.0", "10.0", "100.0"), trans = "log10") +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_fill_manual(name = NULL, values = c("#440154", "#21908C")) +
  xlab("\nShot length (s)") +
  theme(legend.position = "bottom", 
        axis.title.y = element_blank(),
        axis.title = element_text(size = 10,face = "bold"),
        panel.background = element_rect(fill = "gray85"))

Figure 3: Shot length distributions of films directed by Alfred Hitchcock, 1927-1932. The quantile lines in each density plot show the lower quartile, median, and upper quartile of each distribution.

Compare the quantiles

Moving on to the quantile-based comparison of the shot length distributions, we need to calculate the quantiles for each film. As before, we will calculate the quantiles from \(Q_{0.05}\) to \(Q_{0.95}\) with steps of 0.05.

df_Hitchcock_quantiles <- df_Hitchcock %>%
  group_by(film) %>%
  group_modify(~ { 
   round(hdquantile(.x$SL, probs = seq(0.05, 0.95, 0.05), na.rm = TRUE,
                    names = FALSE, se = FALSE, weights = FALSE), 1) %>%
      enframe(name = "quantile", value = "SL")
  }) %>%
  mutate(type = ifelse(film %in% c("The Ring (1927)", 
                                   "The Manxman (1929)", 
                                   "The Farmers Wife (1928)", 
                                   "Champagne (1928)"), 
                       "Silent", "Sound"),
         quantile = seq(0.05, 0.95, 0.05))

film	quantile	SL	type
The Ring (1927)	0.05	1.2	Silent
The Ring (1927)	0.10	1.5	Silent
The Ring (1927)	0.15	1.7	Silent
The Ring (1927)	0.20	1.9	Silent
The Ring (1927)	0.25	2.2	Silent
The Ring (1927)	0.30	2.4	Silent

To plot the quantile profiles of the silent and sound films:

# create plot
ggplot(data = df_Hitchcock_quantiles, aes(x = quantile, y = SL, group = film)) +
  geom_line(aes(colour = factor(type))) +
  geom_point(aes(colour = factor(type))) +
  scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
  scale_y_continuous(trans = "log10", limits = c(0.5, 100), 
                     breaks = c(1, 10, 100), 
                     minor_breaks= c(seq(0.2, 0.9, 0.1), seq(2, 9, 1), seq(20, 90, 10))) +
  labs(title = "Quantile profiles", x ="Quantile", y = " Shot length (s)") +
  scale_colour_manual(name = NULL, values = c("#440154", "#21908C")) +
  theme(legend.position = "bottom",
        axis.title = element_text(size = 10, face = "bold"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_rect(fill = "gray85"))

Figure 4A: Quantile profiles of shot length distributions of films directed by Alfred Hitchcock, 1927-1932.

To plot the differences between the quantiles of the two groups of films, we need to calculate the difference between each quantile \(Q_{p}\) of each sound film and the corresponding quantile \(Q_{p}\) of each silent film, subtracting the silent value from the sound value.

First, we need to split the data into silent and sound films and re-arrange the new data frames into a wide format.

# Split the data frame containing the quantiles for each Hitchcock film into silent and sound films
df_split <- df_Hitchcock_quantiles %>% 
  group_by(type) %>% 
  group_split()

# Create data frames for each type of film and re-arrange to a wide format
df_Hitchcock_silent <- df_split[[1]] %>% 
  pivot_wider(names_from = quantile, values_from = SL)

df_Hitchcock_sound <- df_split[[2]] %>% 
  pivot_wider(names_from = quantile, values_from = SL)

film	type	0.05	0.1	0.15	0.2	0.25	0.3	0.35	0.4	0.45	0.5	0.55	0.6	0.65	0.7	0.75	0.8	0.85	0.9	0.95
Blackmail (1929)	Sound	1.6	1.9	2.2	2.5	2.8	3.3	3.8	4.3	4.9	5.5	6.5	7.6	9.0	10.8	12.7	15.2	19.0	26.5	42.9
Murder (1930)	Sound	1.0	1.3	1.5	1.9	2.2	2.6	3.1	3.5	4.1	4.8	5.8	7.1	8.7	11.3	15.3	19.8	25.8	35.1	53.4
The Skin Game (1931)	Sound	1.0	1.2	1.5	1.8	2.2	2.5	2.9	3.5	4.1	5.0	5.8	6.7	8.2	10.3	14.7	21.8	31.0	52.2	97.3
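
The code above selects the silent and sound films by position, which relies on the groups being returned in sorted order of the grouping variable ("Silent" before "Sound"). As an illustration of that ordering, the sketch below uses base R's split(), which returns a named list, on a toy data frame. This is an analogy using a different function, not a replacement for group_split(), and the object names are hypothetical.

```r
# Toy data frame with four of the film titles and their type
toy <- data.frame(
  film = c("Blackmail (1929)", "The Ring (1927)", "Champagne (1928)", "Murder (1930)"),
  type = c("Sound", "Silent", "Silent", "Sound")
)

# split() orders the groups by the sorted values of the grouping variable
by_type <- split(toy, toy$type)
print(names(by_type))

# Selecting by name does not depend on the position of the group in the list
silent <- by_type[["Silent"]]
sound  <- by_type[["Sound"]]
```

If further categories were added to the type column the positions of the groups could change, so it is worth checking the type column of each piece after splitting.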

The function to calculate the pairwise quantile differences, qdiff, is simple. It takes two matrices of quantiles as inputs, with one row per film and one column per quantile, and assumes that the matrices have the same number of columns and that the quantiles are equally spaced, starting at p. The number of rows may differ, as it does here with four silent and five sound films. The argument p is the step between quantiles, which in this case is 0.05.

qdiff <- function(x, y, p = 0.05){

  df <- data.frame()
  for (i in seq_len(ncol(x))){
    diff <- as.vector(outer(x[, i], y[, i], '-'))
    q <- i * p
    df_a <- cbind(rep(q, length(diff)), diff)
    df <- rbind(df, df_a)
  }
  colnames(df) <- c("quantile", "difference")
  return(df)
  
}

To apply the qdiff function to the silent and sound films, we first drop the columns containing the film title and type, which are not used in the calculations, and then convert each data frame to a matrix, the input type the function expects.

# Arrange the data and convert to matrix
df_Hitchcock_silent <- df_Hitchcock_silent %>%
  select(-c(film, type)) %>%
  as.matrix()

df_Hitchcock_sound <- df_Hitchcock_sound%>%
  select(-c(film, type)) %>%
  as.matrix()

# Calculate the quantile differences
df_Hitchcock_q_diff <- qdiff(df_Hitchcock_sound, df_Hitchcock_silent)

quantile	difference
0.05	0.4
0.05	-0.2
0.05	-0.2
0.05	-0.1
0.05	-0.6
0.05	0.2
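
As a check on how the pairwise differences are formed, the sketch below applies a compact variant of the calculation to the 0.05 and 0.10 quantiles of three sound films (Blackmail, Murder, and The Skin Game) and one silent film (The Ring), with values copied from the tables above. The function qdiff_sketch() is a hypothetical name for this variant and is not the qdiff function defined earlier, although it is intended to perform the same subtraction.

```r
# qdiff_sketch() is a hypothetical, compact variant of the pairwise calculation
qdiff_sketch <- function(x, y, p = 0.05) {
  pieces <- lapply(seq_len(ncol(x)), function(i) {
    # All pairwise differences between column i of x and column i of y
    data.frame(quantile = i * p,
               difference = as.vector(outer(x[, i], y[, i], "-")))
  })
  do.call(rbind, pieces)
}

# Rows are films, columns are Q_0.05 and Q_0.10 (values from the tables above)
sound <- matrix(c(1.6, 1.9,    # Blackmail (1929)
                  1.0, 1.3,    # Murder (1930)
                  1.0, 1.2),   # The Skin Game (1931)
                ncol = 2, byrow = TRUE)
silent <- matrix(c(1.2, 1.5),  # The Ring (1927)
                 ncol = 2, byrow = TRUE)

result <- qdiff_sketch(sound, silent)
print(result)
```

The first three differences (0.4, -0.2, and -0.2) agree with the first three rows of the output above. With five sound films and four silent films there are 20 pairwise differences for each quantile.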

Now to plot the distributions of the differences:

ggplot(data = df_Hitchcock_q_diff, 
       aes(x = difference, group = as.factor(quantile), colour = as.factor(quantile))) +  
  geom_density(size = 0.6) + 
  labs(title = "Quantile differences", x = "Difference (s)", y = "Density") +
  scale_x_continuous(limits = c(-10, 90), breaks = seq(-10, 90, 10)) + 
  scale_y_continuous(limits = c(0, 1.25), breaks = seq(0, 1.25, 0.25)) +
  scale_colour_viridis(name = "Quantile", 
                       labels = c("0.05","0.10","0.15","0.20","0.25","0.30","0.35",
                                  "0.40","0.45","0.50","0.55","0.60","0.65","0.70",
                                  "0.75","0.80", "0.85","0.90", "0.95"), 
                       discrete = TRUE, direction = -1) +
  guides(colour = guide_legend(ncol = 2)) +
  theme(legend.key.width = unit(0.6, "cm"), 
        legend.key.height = unit(0.4, "cm"),
        legend.position = c(0.75, 0.595),
        legend.title = element_text(size = 10), 
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 10, face = "bold"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_rect(fill = "gray85"))

Figure 4B: Difference distributions for pairwise differences between quantiles of shot length distributions of films directed by Alfred Hitchcock, 1927-1932. Positive differences indicate that quantiles of the sound films in the sample are higher than those of the silent films.

Summary

If our goal is to understand differences in shot lengths as differences in style between films, the quantile approach described in my article and demonstrated in this tutorial is simple to apply and intuitive to interpret. It accurately describes both the nature of the differences between shot length distributions and the size of those differences.