This tutorial accompanies my article ‘Distributional thinking about film style: quantile comparisons of motion picture shot length data.’
Abstract
In this article I illustrate the use of quantiles as a means of describing and comparing motion picture shot length distributions. This approach is conceptually and computationally simple and leads us to think distributionally about shot lengths rather than focussing on individual values. The result is a better understanding of how this element of film style of two (or more) films differs.
In this tutorial I demonstrate the code used in my article so that those who would like to use quantile-based methods in their own research into shot length distributions may do so.
The GitHub repository for this tutorial contains the shot length data used.
Packages
In this tutorial we will use five packages.
The pacman package installs (if they are not already installed) the other packages, and loads them into the workspace.
# If the pacman package is not already installed on your computer un-comment the following line and run
# install.packages("pacman")
# Load the required packages
pacman::p_load(Hmisc, ggridges, tidyverse, viridis)
The table below summarises the use of these packages in this tutorial.
| Package | Description |
|---|---|
| Hmisc | Calculate the quantiles of a film’s shot lengths |
| ggridges | Visualise multiple distributions on a single axis |
| tidyverse | Data wrangling and plotting |
| viridis | Colour schemes for plotting |
Quantiles
A quantile \(Q_{p}\) is a cut point dividing a data set arranged from the smallest to the largest value so that a specified proportion of data points \(p\) is less than or equal to that point.
\[ Q_{p} = \{x : \Pr(X \leq x) = p\} \]
We can calculate a quantile for any value of \(p\), but commonly used quantiles include the median (\(Q_{0.5}\)), which divides a data set into two equal parts, and the lower (\(Q_{0.25}\)) and upper (\(Q_{0.75}\)) quartiles that cut off the lower and upper 25% of a data set, respectively.
Quantiles can be used to describe the shape of a distribution in terms of its asymmetry or skewness:
\[ S = \frac{Q_{0.25} + Q_{0.75} - 2Q_{0.5}}{Q_{0.75} - Q_{0.25}} \]
and its kurtosis, which describes how much of the data is in the tails of the distribution relative to the centre:
\[ T = \frac{(Q_{0.875} - Q_{0.625}) + (Q_{0.375} - Q_{0.125})}{Q_{0.75} - Q_{0.25}} \]
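As a quick check of these two formulas, the sketch below computes \(S\) and \(T\) for two small hypothetical vectors. It assumes base R's quantile() as a stand-in for the Harrell-Davis estimator used in the rest of this tutorial, and the function name quantile_shape() is hypothetical.

```r
# Sketch only: quantile-based skewness (S) and kurtosis (T) for toy vectors.
# Assumption: base R's quantile() stands in for Hmisc::hdquantile().
quantile_shape <- function(x) {
  # Octiles from Q_0.125 to Q_0.875; q[2], q[4], q[6] are the quartiles and median
  q <- quantile(x, probs = seq(0.125, 0.875, 0.125), names = FALSE)
  iqr <- q[6] - q[2]
  c(S = (q[2] + q[6] - 2 * q[4]) / iqr,
    T = ((q[7] - q[5]) + (q[3] - q[1])) / iqr)
}

# Evenly spaced values are symmetric, so S is 0 (and T is 1 for this vector)
shape_even <- quantile_shape(1:9)

# Values that double at each step are right-skewed, so S is positive
shape_skew <- quantile_shape(c(1, 2, 4, 8, 16, 32, 64, 128, 256))

print(rbind(even = shape_even, skewed = shape_skew))
```

For the second vector the octiles are 2, 4, 8, 16, 32, 64, and 128, so \(S = 36/60 = 0.6\) and \(T = 102/60 = 1.7\).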
To calculate the Harrell-Davis quantiles of the shot length data for each film we will use the hdquantile() function from the Hmisc package.
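For readers who want to see what the Harrell-Davis estimator does, the sketch below writes it out in base R: each quantile is a weighted average of all the ordered values, with weights taken from a beta distribution. The function name hd_sketch() and the toy shot lengths are hypothetical and serve only as an illustration; the analyses in this tutorial use Hmisc::hdquantile().

```r
# Sketch only: the Harrell-Davis estimate of Q_p as a weighted average of the
# order statistics, with weights from a Beta(p(n + 1), (1 - p)(n + 1)) distribution.
# hd_sketch() is a hypothetical name; use Hmisc::hdquantile() in practice.
hd_sketch <- function(x, p) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  a <- p * (n + 1)
  b <- (1 - p) * (n + 1)
  # Weight for the i-th ordered value: beta CDF at i/n minus beta CDF at (i - 1)/n
  w <- pbeta((1:n) / n, a, b) - pbeta((0:(n - 1)) / n, a, b)
  sum(w * x)
}

# Hypothetical shot lengths (in seconds)
sl_toy <- c(1.5, 1.9, 5.0, 7.0, 8.0, 10.7)
hd_quartiles <- sapply(c(0.25, 0.5, 0.75), function(p) hd_sketch(sl_toy, p))
print(hd_quartiles)
```

Because every ordered value contributes to every quantile, the estimates change smoothly with \(p\), which is one reason this estimator is convenient for comparing quantile profiles.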
Comparing shot lengths in two films
The shot length data of You’re Darn Tootin’ (1928) and Hog Wild (1930) is available on the GitHub repository of this tutorial as a csv file in the Data folder. To read this data directly from GitHub we run the following code:
# Note that the URL must point to the raw version of the data and not the display version
df_LH <- read_csv("https://raw.githubusercontent.com/DrNickRedfern/SL-quantiles/main/Data/SL-quantiles-Figure-1.csv")
We can inspect the top of the data frame df_LH using the head() function.
head(df_LH)
## # A tibble: 6 x 2
## film values
## <chr> <dbl>
## 1 You're Darn Tootin' 10.7
## 2 You're Darn Tootin' 5
## 3 You're Darn Tootin' 1.9
## 4 You're Darn Tootin' 8
## 5 You're Darn Tootin' 7
## 6 You're Darn Tootin' 1.5
Summarise the shot length data
To get a quantile summary of the shot length data of each film we can create a simple function that takes a vector of shot lengths as an input and returns a list of values that includes the number of shots in a film, the average (mean) shot length, the five-number summary of the data set (the minimum, lower quartile, median, upper quartile, and maximum shot lengths), the interquartile range, and the quantile-based measures of skewness (\(S\)) and kurtosis (\(T\)) defined above.
# The input to the function is a vector containing the shot lengths for a film
SL_summary <- function(x){
  # Check that the Hmisc package is installed
  if (!requireNamespace("Hmisc", quietly = TRUE))
    stop("This function requires the Hmisc package to be installed.")
  # Calculate the quantiles required to produce the summaries of the data:
  # q[1] to q[7] correspond to p = 0.125, 0.25, ..., 0.875
  q <- Hmisc::hdquantile(x, probs = seq(0.125, 0.875, 0.125), na.rm = TRUE,
                         names = FALSE, se = FALSE, weights = FALSE)
  # Create a list containing the output values
  SL_sum <- list(`Shots (N)` = round(length(x), 0),
                 `Mean (s)` = round(mean(x), 1),
                 `Minimum (s)` = min(x),
                 `Lower quantile (s)` = round(q[2], 1),
                 `Median (s)` = round(q[4], 1),
                 `Upper quantile (s)` = round(q[6], 1),
                 `Maximum (s)` = max(x),
                 `Interquartile range` = round(q[6] - q[2], 1),
                 `Quantile skewness` = round((q[2] + q[6] - (2 * q[4]))/(q[6] - q[2]), 2),
                 `Quantile kurtosis` = round(((q[7] - q[5]) + (q[3] - q[1]))/(q[6] - q[2]), 2))
  # Return the list of summary values
  SL_sum
}
Next, we apply the SL_summary() function to the data for each film and rearrange the result so it will display nicely.
df_LH %>%
# apply the SL_summary function to each film
group_by(film) %>%
group_modify(~{.x %>%
map_dfc(SL_summary)}) %>%
# rearrange the result
tibble() %>%
column_to_rownames(var = "film") %>%
t()
| | Hog Wild | You’re Darn Tootin’ |
|---|---|---|
| Shots (N) | 169.00 | 189.00 |
| Mean (s) | 6.60 | 6.60 |
| Minimum (s) | 0.50 | 0.70 |
| Lower quantile (s) | 1.80 | 2.20 |
| Median (s) | 3.00 | 4.00 |
| Upper quantile (s) | 7.30 | 8.10 |
| Maximum (s) | 65.10 | 49.80 |
| Interquartile range | 5.50 | 5.90 |
| Quantile skewness | 0.57 | 0.39 |
| Quantile kurtosis | 1.93 | 1.28 |
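The skewness values can be approximated by hand from the rounded quartiles in the table. The base R sketch below does this with a hypothetical helper, quantile_skew(). Small discrepancies from the tabulated values are expected, because SL_summary() computes \(S\) from the unrounded quantiles and rounds only the final result.

```r
# Sketch only: recompute the quantile skewness from the rounded table values.
# quantile_skew() is a hypothetical helper, not part of the tutorial's code.
quantile_skew <- function(q25, q50, q75) (q25 + q75 - 2 * q50) / (q75 - q25)

S_hog_wild <- quantile_skew(1.8, 3.0, 7.3)   # table reports 0.57
S_tootin   <- quantile_skew(2.2, 4.0, 8.1)   # table reports 0.39

print(round(c(`Hog Wild` = S_hog_wild, `You're Darn Tootin'` = S_tootin), 2))
```

From the rounded inputs this gives approximately 0.56 and 0.39, which agrees with the table to within rounding error.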
Plotting the kernel densities
Now we can plot the kernel densities of the two films in order to compare their respective distributions. In the example below I have set the bandwidth of the density function (bw) to 0.1.
# Plot the kernel densities of the two films
ggplot(data = df_LH, aes(x = values, fill = film)) +
geom_density(bw = 0.1, alpha = 0.8, trim = FALSE) +
scale_x_continuous(limits = c(0.1, 100), breaks = c(0.1, 1, 10, 100),
minor_breaks = c(seq(0.2, 0.9, 0.1), seq(2, 9, 1), seq(20, 90, 10)),
trans = "log10") +
scale_y_continuous(breaks = seq(0, 1, 0.25)) +
labs(x = "Shot length (s)", y = "Density") +
scale_fill_manual(name = NULL, values = c("#440154", "#21908C")) +
theme(legend.position = "bottom",
axis.title = element_text(size = 10,face = "bold"),
plot.title = element_text(size = 10),
panel.background = element_rect(fill = "gray85"))
Figure 1: Kernel density estimates of shot length distributions of two Laurel and Hardy films: You’re Darn Tootin’ (1928) and Hog Wild (1930).
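Because the x-axis uses a log10 transformation, ggplot2 transforms the shot lengths before estimating the density, so the bandwidth of 0.1 is measured in log10 units rather than in seconds. The base R sketch below illustrates the equivalent calculation with density(). The simulated shot lengths are hypothetical and are not the Laurel and Hardy data.

```r
# Sketch only: a kernel density estimated on the log10 scale with bw = 0.1.
# Assumption: simulated lognormal shot lengths stand in for real data.
set.seed(1)
sl_sim <- rlnorm(200, meanlog = 1.2, sdlog = 0.9)

# The bandwidth is expressed in log10 units because the data are log-transformed first
dens_log <- density(log10(sl_sim), bw = 0.1)

# Back-transform the evaluation points to seconds for reading against the x-axis
dens_df <- data.frame(shot_length = 10^dens_log$x, density = dens_log$y)
print(head(dens_df))
```

A smaller bandwidth shows more local detail in the distribution at the cost of a rougher curve, so it is worth trying a few values before settling on one.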
Calculate and plot the quantiles
To compare the quantiles of the shot lengths of the two films, we first need to calculate the quantiles. We will calculate the quantiles from \(Q_{0.05}\) to \(Q_{0.95}\) at intervals of 0.05.
df_LH_quantiles <- df_LH %>%
group_by(film) %>%
group_modify(~ {
round(hdquantile(.x$values, probs = seq(0.05, 0.95, 0.05), na.rm = TRUE,
names = FALSE, se = FALSE, weights = FALSE), 1) %>%
enframe(name = "quantile", value = "SL")
}) %>%
mutate(quantile = seq(0.05, 0.95, 0.05))
| film | quantile | SL |
|---|---|---|
| Hog Wild | 0.05 | 1.0 |
| Hog Wild | 0.10 | 1.2 |
| Hog Wild | 0.15 | 1.5 |
| Hog Wild | 0.20 | 1.6 |
| Hog Wild | 0.25 | 1.8 |
| Hog Wild | 0.30 | 2.0 |
Next we plot the quantile profiles for each film.
ggplot(data = df_LH_quantiles, aes(x = quantile, y = SL, group = film)) +
geom_line(aes(colour = film)) +
geom_point(aes(colour = film)) +
scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
scale_y_continuous(trans = "log10", limits = c(0.9, 30),
breaks = c(1, 10, 30), minor_breaks= c(seq(2, 9, 1), 20)) +
labs(title = "Quantile profiles", x ="Quantile", y = " Shot length (s)") +
scale_colour_manual(name = NULL, values = c("#440154", "#21908C")) +
theme(legend.position = "bottom",
axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_rect(fill = "gray85"))
Figure 2A: The quantile profiles for two Laurel and Hardy films
To calculate the difference between the quantiles of each film we re-arrange the data to a wide format. The differences are calculated by subtracting the value of each quantile for You’re Darn Tootin’ from the corresponding quantile for Hog Wild, so that negative differences indicate quantiles at which shots in You’re Darn Tootin’ tend to be of greater duration and positive differences identify quantiles at which shots in Hog Wild tend to be longer.
df_LH_quantile_diff <- df_LH_quantiles %>%
pivot_wider(names_from = film, values_from = SL) %>%
mutate(diff = .[[2]] - .[[3]])
| quantile | Hog Wild | You’re Darn Tootin’ | diff |
|---|---|---|---|
| 0.05 | 1.0 | 1.2 | -0.2 |
| 0.10 | 1.2 | 1.5 | -0.3 |
| 0.15 | 1.5 | 1.7 | -0.2 |
| 0.20 | 1.6 | 2.0 | -0.4 |
| 0.25 | 1.8 | 2.2 | -0.4 |
| 0.30 | 2.0 | 2.5 | -0.5 |
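The code above selects the two film columns by position. One possible alternative is to refer to the columns by name, which makes the direction of the subtraction explicit. The base R sketch below illustrates this using the first six rows of the table, typed in by hand; the object name q_wide is hypothetical.

```r
# Sketch only: quantile differences with the columns referenced by name.
# The values are the first six rows of the table above, entered by hand.
q_wide <- data.frame(
  quantile = c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30),
  `Hog Wild` = c(1.0, 1.2, 1.5, 1.6, 1.8, 2.0),
  `You're Darn Tootin'` = c(1.2, 1.5, 1.7, 2.0, 2.2, 2.5),
  check.names = FALSE
)

# Hog Wild minus You're Darn Tootin': negative values mean the shots in
# You're Darn Tootin' are longer at that quantile
q_wide$diff <- round(q_wide[["Hog Wild"]] - q_wide[["You're Darn Tootin'"]], 1)
print(q_wide)
```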
Now we plot the result.
ggplot(data = df_LH_quantile_diff,
aes(x = quantile, y = diff, label = sprintf("%0.1f", round(diff, digits = 1)))) +
geom_point(aes(colour = diff), size = 8, show.legend = FALSE) +
geom_text(colour = "white", size = 2.4, fontface = "bold") +
scale_colour_gradient(low = "#21908C", high = "#440154") +
scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
scale_y_continuous(breaks = c(-2, 0, 2, 4),
labels = sprintf("%0.1f", round(c(-2, 0, 2, 4), digits = 1))) +
expand_limits(y = c(-2,5)) +
labs(title = "Quantile differences", x = "Quantile", y = "Quantile difference (s)") +
theme(axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 12, face = "bold"),
panel.grid.minor.y = element_blank(),
panel.background = element_rect(fill = "gray85"))
Figure 2B: Quantile differences for two Laurel and Hardy films.
Comparing shot lengths in two groups of films
In this part of the tutorial we will use the early Hitchcock shot length data, which includes shot length data for four silent (The Ring (1927), The Farmer’s Wife (1928), Champagne (1928), and The Manxman (1929)) and five sound films (Blackmail (1929), Murder! (1930), The Skin Game (1931), Rich and Strange (1931), Number Seventeen (1932)) directed by Alfred Hitchcock.
First, we need to load the data from GitHub.
# Note that the URL must point to the raw version of the data and not the display version
df_Hitchcock <- read_csv("https://raw.githubusercontent.com/DrNickRedfern/SL-quantiles/main/Data/SL-quantiles-Hitchcock_SL.csv")
head(df_Hitchcock)
## # A tibble: 6 x 9
## `The Ring (1927)` `The Farmers Wife (1928)` `Champagne (192~` `The Manxman (~`
## <dbl> <dbl> <dbl> <dbl>
## 1 4.04 5.79 19.2 7.8
## 2 8.46 17.1 17.0 25.1
## 3 5.58 8.67 3.42 7.5
## 4 6 5.08 14.7 3.1
## 5 1.71 4.5 2.67 3.2
## 6 5.04 11.1 2.79 2.2
## # ... with 5 more variables: `Blackmail (1929)` <dbl>, `Murder (1930)` <dbl>,
## # `The Skin Game (1931)` <dbl>, `Rich and Strange (1931)` <dbl>,
## # `Number Seventeen (1932)` <dbl>
We need to re-arrange the data so it is in a long format.
df_Hitchcock <- df_Hitchcock %>%
pivot_longer(cols = 1:9, names_to = "film", values_to = "SL", values_drop_na = TRUE) %>%
arrange(film)
| film | SL |
|---|---|
| Blackmail (1929) | 6.88 |
| Blackmail (1929) | 4.67 |
| Blackmail (1929) | 2.96 |
| Blackmail (1929) | 1.75 |
| Blackmail (1929) | 2.67 |
| Blackmail (1929) | 6.21 |
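In the wide format each film has its own column, and films with fewer shots have missing values at the bottom of their columns; the values_drop_na argument removes these when reshaping. The base R sketch below shows the same idea on a small hypothetical data frame, using stack() as a stand-in for pivot_longer(). The film names and values are made up.

```r
# Sketch only: wide-to-long reshaping with missing values dropped.
# Assumption: base R's stack() stands in for tidyr::pivot_longer().
wide_toy <- data.frame(`Film A` = c(4.0, 8.5, 5.6),
                       `Film B` = c(5.8, 17.1, NA),   # Film B has one fewer shot
                       check.names = FALSE)

long_toy <- stack(wide_toy)                # columns: values, ind
names(long_toy) <- c("SL", "film")
long_toy <- long_toy[!is.na(long_toy$SL), c("film", "SL")]
print(long_toy)
```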
Summarise the shot length data
To summarise the shot length data we apply the SL_summary function we created earlier.
df_Hitchcock %>%
# apply the SL_summary function to each film
group_by(film) %>%
group_modify(~{.x %>%
map_dfc(SL_summary)}) %>%
# rearrange the result
tibble() %>%
column_to_rownames(var = "film") %>%
t()
| | Blackmail (1929) | Champagne (1928) | Murder (1930) | Number Seventeen (1932) | Rich and Strange (1931) | The Farmers Wife (1928) | The Manxman (1929) | The Ring (1927) | The Skin Game (1931) |
|---|---|---|---|---|---|---|---|---|---|
| Shots (N) | 438.00 | 893.00 | 438.00 | 655.00 | 687.00 | 1007.00 | 808.00 | 1056.00 | 269.00 |
| Mean (s) | 11.50 | 5.90 | 13.70 | 5.70 | 7.10 | 5.70 | 6.10 | 5.00 | 18.10 |
| Minimum (s) | 0.92 | 0.04 | 0.50 | 0.10 | 0.20 | 0.25 | 0.60 | 0.04 | 0.80 |
| Lower quantile (s) | 2.80 | 2.30 | 2.20 | 1.30 | 2.00 | 2.30 | 2.60 | 2.20 | 2.20 |
| Median (s) | 5.50 | 4.00 | 4.80 | 2.40 | 3.70 | 4.00 | 4.40 | 3.50 | 5.00 |
| Upper quantile (s) | 12.70 | 7.40 | 15.30 | 5.60 | 7.30 | 6.90 | 7.60 | 6.20 | 14.70 |
| Maximum (s) | 148.17 | 63.71 | 223.90 | 77.70 | 79.10 | 78.83 | 52.50 | 57.63 | 281.00 |
| Interquartile range | 9.80 | 5.10 | 13.10 | 4.30 | 5.20 | 4.60 | 4.90 | 4.00 | 12.50 |
| Quantile skewness | 0.46 | 0.32 | 0.60 | 0.49 | 0.37 | 0.25 | 0.26 | 0.32 | 0.55 |
| Quantile kurtosis | 1.63 | 1.40 | 1.86 | 1.94 | 1.93 | 1.39 | 1.30 | 1.48 | 2.60 |
Plot the distributions
We want to identify films as being either silent films or sound films so that we can compare differences between groups. We will also set the titles of the films as factors so that they plot in the correct order.
# Classify the films
df_Hitchcock <- df_Hitchcock %>% mutate(type = ifelse(film %in% c("The Ring (1927)",
"The Manxman (1929)",
"The Farmers Wife (1928)",
"Champagne (1928)"),
"Silent", "Sound"))
# Set film titles as factors
df_Hitchcock$film <- factor(df_Hitchcock$film,
levels = c("The Ring (1927)",
"The Farmers Wife (1928)",
"Champagne (1928)",
"The Manxman (1929)",
"Blackmail (1929)",
"Murder (1930)",
"The Skin Game (1931)",
"Rich and Strange (1931)",
"Number Seventeen (1932)"))
| film | SL | type |
|---|---|---|
| Blackmail (1929) | 6.88 | Sound |
| Blackmail (1929) | 4.67 | Sound |
| Blackmail (1929) | 2.96 | Sound |
| Blackmail (1929) | 1.75 | Sound |
| Blackmail (1929) | 2.67 | Sound |
| Blackmail (1929) | 6.21 | Sound |
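The order of the factor levels, rather than the alphabetical order of the titles, determines the order in which the films are plotted. A minimal base R sketch with three of the titles illustrates the difference; the object names are hypothetical.

```r
# Sketch only: factor levels control the ordering of categories
films <- c("Blackmail (1929)", "The Ring (1927)", "Champagne (1928)")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(films))

# Explicit levels: chronological order of release
films_f <- factor(films,
                  levels = c("The Ring (1927)", "Champagne (1928)", "Blackmail (1929)"))

print(default_levels)
print(levels(films_f))
print(as.integer(films_f))   # position of each title in the chronological ordering
```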
To compare the distribution of shot lengths in these films we can use the ggridges package to plot the kernel densities on a single x-axis for easy comparison, with the median and upper and lower quartiles added for reference.
ggplot(data = df_Hitchcock, aes(x = SL, y = reorder(film, desc(film)), fill = type)) +
geom_density_ridges(scale = 1, quantile_lines = TRUE, alpha = 0.8) +
scale_x_continuous(expand = c(0.01, 0), breaks = c(0.1, 1, 10, 100),
minor_breaks = c(seq(0.02, 0.09, 0.01),
seq(0.2, 0.9, 0.1),
seq(2, 9, 1),
seq(20, 90, 10),
seq(200, 900, 100)),
labels = c("0.1", "1.0", "10.0", "100.0"), trans = "log10") +
scale_y_discrete(expand = c(0.01, 0)) +
scale_fill_manual(name = NULL, values = c("#440154", "#21908C")) +
xlab("\nShot length (s)") +
theme(legend.position = "bottom",
axis.title.y = element_blank(),
axis.title = element_text(size = 10,face = "bold"),
panel.background = element_rect(fill = "gray85"))
Figure 3: Shot length distributions of films directed by Alfred Hitchcock, 1927-1932. The quantile lines in each density plot show the lower quartile, median, and upper quartile of each distribution.
Compare the quantiles
Moving on to the quantile-based comparison of the shot length distributions, we need to calculate the quantiles for each film. As before, we will calculate the quantiles from \(Q_{0.05}\) to \(Q_{0.95}\) with steps of 0.05.
df_Hitchcock_quantiles <- df_Hitchcock %>%
group_by(film) %>%
group_modify(~ {
round(hdquantile(.x$SL, probs = seq(0.05, 0.95, 0.05), na.rm = TRUE,
names = FALSE, se = FALSE, weights = FALSE), 1) %>%
enframe(name = "quantile", value = "SL")
}) %>%
mutate(type = ifelse(film %in% c("The Ring (1927)",
"The Manxman (1929)",
"The Farmers Wife (1928)",
"Champagne (1928)"),
"Silent", "Sound"),
quantile = seq(0.05, 0.95, 0.05))
| film | quantile | SL | type |
|---|---|---|---|
| The Ring (1927) | 0.05 | 1.2 | Silent |
| The Ring (1927) | 0.10 | 1.5 | Silent |
| The Ring (1927) | 0.15 | 1.7 | Silent |
| The Ring (1927) | 0.20 | 1.9 | Silent |
| The Ring (1927) | 0.25 | 2.2 | Silent |
| The Ring (1927) | 0.30 | 2.4 | Silent |
To plot the quantile profiles of the silent and sound films:
# create plot
ggplot(data = df_Hitchcock_quantiles, aes(x = quantile, y = SL, group = film)) +
geom_line(aes(colour = factor(type))) +
geom_point(aes(colour = factor(type))) +
scale_x_continuous(breaks = seq(0.1, 0.9, 0.1)) +
scale_y_continuous(trans = "log10", limits = c(0.5, 100),
breaks = c(1, 10, 100),
minor_breaks= c(seq(0.2, 0.9, 0.1), seq(2, 9, 1), seq(20, 90, 10))) +
labs(title = "Quantile profiles", x ="Quantile", y = " Shot length (s)") +
scale_colour_manual(name = NULL, values = c("#440154", "#21908C")) +
theme(legend.position = "bottom",
axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_rect(fill = "gray85"))
Figure 4A: Quantile profiles of shot length distributions of films directed by Alfred Hitchcock, 1927-1932.
To plot the differences between the quantiles of the two groups of films, we need to subtract each quantile \(Q_{p}\) of each silent film from the corresponding quantile \(Q_{p}\) of each sound film.
First, we need to split the data into silent and sound films and re-arrange the new data frames into a wide format.
# Split the data frame containing the quantiles for each Hitchcock film into silent and sound films
df_split <- df_Hitchcock_quantiles %>%
group_by(type) %>%
group_split()
# Create data frames for each type of film and re-arrange to a wide format
df_Hitchcock_silent <- df_split[[1]] %>%
pivot_wider(names_from = quantile, values_from = SL)
df_Hitchcock_sound <- df_split[[2]] %>%
pivot_wider(names_from = quantile, values_from = SL)
| film | type | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blackmail (1929) | Sound | 1.6 | 1.9 | 2.2 | 2.5 | 2.8 | 3.3 | 3.8 | 4.3 | 4.9 | 5.5 | 6.5 | 7.6 | 9.0 | 10.8 | 12.7 | 15.2 | 19.0 | 26.5 | 42.9 |
| Murder (1930) | Sound | 1.0 | 1.3 | 1.5 | 1.9 | 2.2 | 2.6 | 3.1 | 3.5 | 4.1 | 4.8 | 5.8 | 7.1 | 8.7 | 11.3 | 15.3 | 19.8 | 25.8 | 35.1 | 53.4 |
| The Skin Game (1931) | Sound | 1.0 | 1.2 | 1.5 | 1.8 | 2.2 | 2.5 | 2.9 | 3.5 | 4.1 | 5.0 | 5.8 | 6.7 | 8.2 | 10.3 | 14.7 | 21.8 | 31.0 | 52.2 | 97.3 |
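The function to calculate the pairwise quantile differences, qdiff, is simple. The function takes two matrices of quantiles as inputs, with one row per film and one column per quantile. It assumes that the matrices have the same number of columns and that the quantiles are equally spaced, with the first quantile equal to the step size. The number of rows may differ, as it does here (five sound films and four silent films). The argument p is the step increase for the quantiles, which in this case is 0.05.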
qdiff <- function(x, y, p = 0.05){
df <- data.frame()
for (i in 1:dim(x)[2]){
diff <- as.vector(outer(x[, i], y[, i], '-'))
q <- i * p
df_a <- cbind(rep(q, length(diff)), diff)
df <- rbind(df, df_a)
}
colnames(df) <- c("quantile", "difference")
return(df)
}
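To see what the function returns, the sketch below applies the same idea to two small hypothetical matrices with one row per film and one column per quantile. qdiff_sketch() is a hypothetical restatement of qdiff() written for this example, and the values are made up; note that the two groups need not contain the same number of films.

```r
# Sketch only: pairwise differences between every row of x and every row of y,
# computed one quantile (column) at a time. Hypothetical restatement of qdiff().
qdiff_sketch <- function(x, y, p = 0.05) {
  stopifnot(ncol(x) == ncol(y))
  pieces <- lapply(seq_len(ncol(x)), function(i) {
    data.frame(quantile = i * p,
               difference = as.vector(outer(x[, i], y[, i], "-")))
  })
  do.call(rbind, pieces)
}

# Two films in group x and three in group y, each with two quantiles (made-up values)
x_toy <- rbind(c(5, 10), c(6, 12))
y_toy <- rbind(c(1, 4), c(2, 5), c(3, 6))

toy_diff <- qdiff_sketch(x_toy, y_toy)
print(toy_diff)   # 2 x 3 pairs for each of 2 quantiles = 12 rows
```

Each quantile therefore contributes one difference for every pairing of a film from the first group with a film from the second group, and it is the distribution of these differences that is plotted.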
To apply the qdiff function to the data for silent and sound films, we first drop the columns containing the film title and type, because these are not used in the calculations. We then convert each data frame to a matrix, the input type expected by the function.
# Arrange the data and convert to matrix
df_Hitchcock_silent <- df_Hitchcock_silent %>%
select(-c(film, type)) %>%
as.matrix()
df_Hitchcock_sound <- df_Hitchcock_sound%>%
select(-c(film, type)) %>%
as.matrix()
# Calculate the quantile differences
df_Hitchcock_q_diff <- qdiff(df_Hitchcock_sound, df_Hitchcock_silent)
| quantile | difference |
|---|---|
| 0.05 | 0.4 |
| 0.05 | -0.2 |
| 0.05 | -0.2 |
| 0.05 | -0.1 |
| 0.05 | -0.6 |
| 0.05 | 0.2 |
Now to plot the distributions of the differences:
ggplot(data = df_Hitchcock_q_diff,
aes(x = difference, group = as.factor(quantile), colour = as.factor(quantile))) +
geom_density(size = 0.6) +
labs(title = "Quantile differences", x = "Difference (s)", y = "Density") +
scale_x_continuous(limits = c(-10, 90), breaks = seq(-10, 90, 10)) +
scale_y_continuous(limits = c(0, 1.25), breaks = seq(0, 1.25, 0.25)) +
scale_colour_viridis(name = "Quantile",
labels = c("0.05","0.10","0.15","0.20","0.25","0.30","0.35",
"0.40","0.45","0.50","0.55","0.60","0.65","0.70",
"0.75","0.80", "0.85","0.90", "0.95"),
discrete = TRUE, direction = -1) +
guides(colour = guide_legend(ncol = 2)) +
theme(legend.key.width = unit(0.6, "cm"),
legend.key.height = unit(0.4, "cm"),
legend.position = c(0.75, 0.595),
legend.title = element_text(size = 10),
panel.grid.minor = element_blank(),
axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_rect(fill = "gray85"))
Figure 4B: Difference distributions for pairwise differences between quantiles of shot length distributions of films directed by Alfred Hitchcock, 1927-1932. Positive differences indicate that the quantiles of the sound films in the sample are higher than those of the silent films.
Summary
If our goal is to understand differences in shot lengths as differences in style between films, the quantile approach described in my article and demonstrated in this tutorial is simple and intuitive. It accurately describes both the nature of the differences between shot length distributions and the size of those differences.