Computational analysis of a horror film trailer soundtrack

The R code presented here supports the article “Computational analysis of a horror film trailer soundtrack.”

Abstract

In this article I demonstrate the visualisation and analysis of the soundtrack of the trailer for the Korean horror film Into the Mirror (2003) using the packages tuneR and seewave for the statistical programming language R. I address a range of practical considerations relevant to the method demonstrated. No prior experience with R is assumed and examples of code are provided throughout so that the reader can begin to analyse audio files in a short space of time.

Keywords

Computational film analysis, film sound, horror cinema, R, seewave, tuneR

Into the Mirror

The raw material for analysis is the soundtrack of the trailer for the 2003 Korean horror film, Into the Mirror. The version used for the analysis in this tutorial and in my article is taken from the DVD, but I include the online version here for reference. (NB: the online version includes some corporate logos not included in the DVD version and so has a different running time).

Getting started

We will use tuneR to load and process an audio file in R, and seewave to analyse audio files and visualise the results. The viridis package supplies the colour palette that we pass to seewave::spectro() when plotting a spectrogram.

# install required packages (if not already installed): 
# note the use of quote marks when installing a package
install.packages(c("tuneR", "seewave", "viridis"))

# load packages: note the absence of quote marks when loading a package
library(tuneR)
library(seewave)
library(viridis)

Media processing

I exported the Into the Mirror trailer from the PAL DVD (25 fps) as an mp4 file. I then loaded the trailer into DaVinci Resolve (v. 16.1.2.026) and rendered its audio as a stereo 16-bit linear PCM wave file sampled at 48 kHz.

Once we have an audio file named itm.wav to work with, we can process and analyse it using R.

To load the wave file itm.wav from the working directory into the workspace, we use the tuneR::readWave() function and assign the result to an object of class Wave named itm.

itm <- readWave("itm.wav")

We can print a summary of the attributes of the wave object by calling its name, itm.

itm

## 
## Wave Object
##  Number of Samples:      7119360
##  Duration (seconds):     148.32
##  Samplingrate (Hertz):   48000
##  Channels (Mono/Stereo): Stereo
##  PCM (integer format):   TRUE
##  Bit (8/16/24/32/64):    16

To visualise the stereo wave object itm, we can use tuneR::plot().

plot(itm,
     col = "#440154",                 # the colour of the wave
     xlab = "Time (s)",
     ylab = c("Left channel", "Right channel"),
     yaxt = "n")                      # suppresses the left y-axis

# add a label to the left y-axis
mtext("Amplitude", side = 2, line = 1)

The wave object itm is derived from a stereo wave file, but we need a mono wave object to calculate the spectrogram and the time contour in the next section. To convert the stereo wave object to mono, we create a new object called itm_mono using tuneR::mono() with which = "both". This averages the samples of the left and right channels.

itm_mono <- mono(itm, which = "both")
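Averaging the two channels can be illustrated with a few invented sample values. average_channels() is a hypothetical helper for illustration, not tuneR's implementation, which may handle details such as rounding integer PCM data differently. The example also shows why the channel comparisons later in this tutorial are worth making. Material that is identical in both channels survives averaging unchanged, but material in opposite phase cancels out.

```r
# Hypothetical helper: average two channels sample by sample.
# This illustrates the idea only; it is not tuneR's implementation.
average_channels <- function(left, right) {
  (left + right) / 2
}

# invented sample values for two channels
left  <- c(1000, -2000, 3000, 0)
right <- c(1000, -1000, -3000, 500)

average_channels(left, right)   # 1000 -1500 0 250

# identical channels are preserved; opposite-phase channels cancel to 0
average_channels(left, left)
average_channels(left, -left)
```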

Data processing

The next stage applies the short-time Fourier transform (STFT) to the wave object itm_mono to produce a 2D time-frequency representation of the signal called a spectrogram.

The seewave function spectro() calculates the STFT and returns the spectrogram of a wave object. The window length (wl) is 2048 samples, which gives a temporal resolution of 2048/48000 ≈ 0.0427 seconds per window. Successive windows overlap by 50%. The frequency resolution is 48000/2048 ≈ 23.44 Hz per frequency band. The spectrum of a real signal is symmetric, so each column of the spectrogram contains half as many frequency bands as there are samples in the window (2048/2 = 1024).

spectro(itm_mono,
        f = 48000,                    # the sampling rate
        wl = 2048,                    # window length in samples
        wn = "hanning",               # the shape of the window
        ovlp = 50,                    # overlap of the windows
        norm = TRUE,                  # amplitude is normalised to 0.0 dB
        fastdisp = TRUE,              # draws the plot quicker
        flog = TRUE,                  # plot frequencies on a log scale
        collevels = seq(-150, 0, 5),  # the range of the amplitude scale
        palette = viridis,            # colour palette selection
        axisX = FALSE)                # suppress the x-axis 

# label the x-axis every 20 seconds from 0 to 140 seconds
axis(1, at = seq(0, 140, 20), labels = seq(0, 140, 20))
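The resolution figures quoted above follow directly from the STFT settings and can be recomputed in base R. The sketch also includes the hop between successive windows, wl * (1 - ovlp / 100) samples. It shows that a 50% overlap places the columns of the spectrogram half a window apart in time.

```r
# STFT settings used for the spectrogram of itm_mono
f    <- 48000   # sampling rate (Hz)
wl   <- 2048    # window length (samples)
ovlp <- 50      # overlap between successive windows (%)

window_s <- wl / f                   # duration of one window (s)
hop      <- wl * (1 - ovlp / 100)    # samples between window starts
hop_s    <- hop / f                  # time step between columns (s)
freq_res <- f / wl                   # width of each frequency band (Hz)
n_bins   <- wl / 2                   # frequency bands per column

round(c(window_s = window_s, hop_s = hop_s,
        freq_res = freq_res, n_bins = n_bins), 4)
```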

seewave::acoustat() produces the time contour, along with a range of other statistics not discussed here. To create an object containing the time contour data, we call acoustat() with the same values of wl, ovlp, and wn used above. This ensures that the normalised aggregated power envelope is consistent with the spectrogram.

The resulting object itm_tc is a matrix with two columns:

- Column one (itm_tc[, 1]) contains the time codes of the individual short-time spectra in the spectrogram.
- Column two (itm_tc[, 2]) contains the normalised aggregated power of the envelope at each time code.

The argument plot = FALSE suppresses the default plot produced by acoustat(), so that we can create our own plot of the data later.

itm_tc <- acoustat(itm_mono, 
                   f = 48000, 
                   wl = 2048, 
                   ovlp = 50, 
                   wn = "hanning", 
                   plot = FALSE
                   )$time.contour
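To make the structure of itm_tc concrete, the sketch below builds a two-column contour from a small invented matrix standing in for a spectrogram. Rows are frequency bands and columns are time windows. The sketch sums each column and scales the result to sum to 1, which illustrates the idea of a normalised aggregated envelope. Two points are assumptions of the sketch:

- That acoustat() aggregates in exactly this way. For example, it does not establish whether acoustat() squares the amplitudes first.
- The time codes. Here they are simply window start times spaced by the 1024-sample hop.

```r
# A toy "spectrogram": rows are frequency bands, columns are time windows.
# The values are invented for illustration only.
spec <- matrix(c(1, 2, 1,
                 0, 1, 3,
                 2, 2, 2,
                 1, 0, 0),
               nrow = 3)            # 3 frequency bands x 4 time windows
hop_s <- 1024 / 48000               # time step between windows (s)

# aggregate across frequency for each window, then scale to sum to 1
contour <- colSums(spec)
contour <- contour / sum(contour)

# two-column layout like itm_tc: time in column 1, contour in column 2
toy_tc <- cbind(time = (seq_along(contour) - 1) * hop_s,
                contour = contour)
toy_tc
```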

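We fit a loess trendline to reveal the structure of the soundtrack more clearly, creating an object itm_fit that contains the fitted values. Two parameters control the trendline:

- span is the proportion of the data used in each local fit.
- degree is the degree of the local polynomials.

Experimenting with different values of span and degree will help to produce a trendline that is informative without being too noisy. The choice of family determines how the local polynomials are fitted. "gaussian" uses least squares. "symmetric" uses a robust, iteratively reweighted fit based on Tukey's biweight function, which is less affected by outliers.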

itm_fit <- loess(itm_tc[, 2] ~ itm_tc[, 1], 
                 span = 0.015, 
                 family = "symmetric", 
                 degree = 1)$fitted
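To get a feel for span before tuning it on itm_tc, we can fit the same kind of loess model to synthetic data. The sketch below, in base R, compares the span used above with a wider span of 0.2 on a noisy invented signal. The value 0.2 is chosen purely for illustration. The roughness of each trendline is measured as the sum of squared differences between successive fitted values, a simple measure chosen for this example.

```r
set.seed(1)

# invented "contour": a slow swell plus noise, one value every 1024/48000 s
time_s <- seq(0, 20, by = 1024 / 48000)
y <- 1 + sin(2 * pi * time_s / 10) + rnorm(length(time_s), sd = 0.3)

# same family and degree as above, with two illustrative spans
fit_narrow <- loess(y ~ time_s, span = 0.015,
                    family = "symmetric", degree = 1)$fitted
fit_wide   <- loess(y ~ time_s, span = 0.2,
                    family = "symmetric", degree = 1)$fitted

# roughness: sum of squared first differences of the fitted values
roughness <- function(x) sum(diff(x)^2)
c(narrow = roughness(fit_narrow), wide = roughness(fit_wide))
```

The narrow span tracks short-lived fluctuations, while the wide span smooths them away. A useful value for the soundtrack lies somewhere between the two extremes.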

Plotting the normalised aggregated power envelope as a time contour with the fitted trendline lets us look at the structure of the soundtrack at different scales, identifying local features and placing them within larger structures at the meso- and macro-levels.

plot(itm_tc[, 1], itm_tc[, 2], 
   type = "l",                       # draws a line graph
   col = "#1F968B",                  # the colour of the time contour
   log = "y",                        # plot the y-axis on a logarithmic scale
   xlab = "Time (s)", 
   ylab = "Normalised aggregated power")

# add the loess trendline to the plot
lines(itm_tc[, 1], itm_fit, 
   col = "#440154",                 # the colour of the trendline
   lwd = 1.7)                       # the width of the trendline

Practical considerations

In this section, we compare versions of the soundtrack with different properties in order to address two practical questions for audio analysis:

- whether to use mono or stereo tracks;
- what sampling rate to use for audio files.

Mono versus stereo

To separate the left and right channels of the original itm wave object we create two new wave objects, itm_left and itm_right, using tuneR::channel().

itm_left <- channel(itm, "left")   # the left channel of itm
itm_right <- channel(itm, "right") # the right channel of itm

seewave includes functions for comparing wave objects, which can help us to determine whether using a mono wave object will result in an unacceptable loss of information about a soundtrack. First, we can use diffwave() to get an overall comparison. The result is a unitless number in the range [0, 1] that is the L1-norm describing the relative distance between the two wave objects. For itm_left and itm_right the difference between the wave objects is 0.006. This indicates that the two channels are not identical, but that there is only a small difference between them.

diffwave(itm_left, itm_right, wl = 2048, envt = "hil")

## [1] 0.006204744
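The idea of an L1 distance between two envelopes can be sketched in base R. l1_distance() is a hypothetical helper and is not claimed to reproduce seewave's exact computation. It scales each envelope to sum to 1, sums the absolute differences, and halves the total so that the result lies in [0, 1]. Because each envelope is scaled before comparison, a difference in overall level alone does not register in this sketch.

```r
# Hypothetical helper: L1 distance between two non-negative envelopes,
# each scaled to sum to 1; halving the sum keeps the result in [0, 1].
# An illustration of the idea, not seewave's implementation.
l1_distance <- function(env1, env2) {
  p1 <- env1 / sum(env1)
  p2 <- env2 / sum(env2)
  sum(abs(p1 - p2)) / 2
}

a <- c(0, 1, 2, 1, 0)          # an invented envelope
l1_distance(a, a)              # identical shapes: 0
l1_distance(a, 3 * a)          # same shape, different level: 0
l1_distance(c(1, 0), c(0, 1))  # no overlap at all: 1
```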

The distance measure returned by seewave::diffwave() is the product of the difference between the envelopes and the difference between the spectra of the two wave objects. We can evaluate these features individually. To look at the difference between the envelopes, we use seewave::diffenv().

diffenv(itm_left, itm_right, 
   envt = "hil",            # type of envelope used for comparison
   plot = TRUE,             # plot the envelopes
   lty1 = 1,                # line type for the left channel 
   lty2 = 1,                # line type for the right channel 
   col1 = "#440154",        # colour of the left channel envelope
   col2 = "#1F968B",        # colour of the right channel envelope
   cold = "#FDE725",        # colour of the difference surface
   xlab = "Time (s)", 
   ylab = "Amplitude", 
   legend = FALSE)          # suppress the legend

## [1] 0.2549887

The difference between the channels' spectra is calculated from their mean spectra, which we obtain with seewave::meanspec() and compare with seewave::diffspec().
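The idea of a mean spectrum can be sketched in base R in three steps:

1. Split the signal into overlapping frames and apply a Hanning window to each frame.
2. Take the magnitude of each frame's Fourier transform.
3. Average the magnitudes across frames.

mean_spectrum() is a hypothetical helper for illustration, not seewave's implementation. The 1000 Hz test tone and 8 kHz sampling rate are invented so that the peak falls exactly on a frequency band: 8000/256 = 31.25 Hz per band, and 1000/31.25 = 32.

```r
# Hypothetical helper: average the magnitude spectra of overlapping
# Hanning-windowed frames. Not seewave's implementation.
mean_spectrum <- function(x, f, wl = 256, ovlp = 50) {
  hop    <- wl * (1 - ovlp / 100)
  starts <- seq(1, length(x) - wl + 1, by = hop)
  win    <- 0.5 - 0.5 * cos(2 * pi * (0:(wl - 1)) / (wl - 1))  # Hanning
  mags <- sapply(starts, function(s) {
    frame <- x[s:(s + wl - 1)] * win
    Mod(fft(frame))[1:(wl / 2)]   # bands from 0 Hz up to just below f/2
  })
  spec <- rowMeans(mags)          # average across frames
  data.frame(freq_hz = (0:(wl / 2 - 1)) * f / wl,
             amp = spec / max(spec))  # scale so the peak is 1
}

# invented test signal: 1 s of a 1000 Hz sine sampled at 8000 Hz
f  <- 8000
x  <- sin(2 * pi * 1000 * (0:(f - 1)) / f)
ms <- mean_spectrum(x, f)
ms$freq_hz[which.max(ms$amp)]   # 1000
```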

# calculate the mean spectrum for each channel 
# using the same settings used above for the spectrogram and time contour plot
itm_left_spec <- meanspec(itm_left, wl = 2048, ovlp = 50, plot = FALSE)
itm_right_spec <- meanspec(itm_right, wl = 2048, ovlp = 50, plot = FALSE)

# compare mean spectra
diffspec(itm_left_spec, itm_right_spec, 
   f = 48000, 
   plot = TRUE, 
   type ="l",                                   # draws a line plot
   lty = c(1, 1),                               # line type for the spectra
   col = c("#440154", "#1F968B", "#FDE725"),    # line colours
   flab = "Frequency (kHz)",                    # label the frequency axis
   alab = "Amplitude",                          # label the amplitude axis
   alim = c(0, 0.042),                          # range of the amplitude axis
   title = FALSE,                               # suppress the plot title
   legend = FALSE,                              # suppress the legend
   yaxt = "n")                                  # suppress the y-axis

# draw a nicer y axis
axis(2, at = seq(0, 0.04, 0.01), labels = seq(0, 0.04, 0.01))

The relative distance between the envelopes is 0.255 and the relative distance between the spectra is 0.023. We can conclude that there is little difference between the envelopes and the mean spectra of the two channels.

We can also plot the cumulative time contours of the mono wave object and of the left and right channels of the stereo wave object.

# calculate the time contour of each wave object
itm_mono_tc <- acoustat(itm_mono, 
                        wl = 2048, 
                        ovlp = 50, 
                        plot = FALSE)$time.contour

itm_left_tc <- acoustat(itm_left, 
                        wl = 2048, 
                        ovlp = 50, 
                        plot = FALSE)$time.contour

itm_right_tc <- acoustat(itm_right, 
                         wl = 2048, 
                         ovlp = 50, 
                         plot = FALSE)$time.contour

# calculate the cumulative time contour of each wave object
itm_mono_cum <- cumsum(itm_mono_tc[, 2])
itm_left_cum <- cumsum(itm_left_tc[, 2])
itm_right_cum <- cumsum(itm_right_tc[, 2])

# plot the cumulative time contours
plot(itm_left_tc[, 1], itm_left_cum, 
     col = "#440154", 
     type = "l", 
     lwd = 1.7, 
     xaxt = "n", 
     xlab = "Time (s)", 
     ylab = "Cumulative power")

axis(1, at = seq(0, 140, 20), labels = seq(0, 140, 20))

lines(itm_right_tc[, 1], itm_right_cum, col = "#1F968B", lwd = 1.7)
lines(itm_mono_tc[, 1], itm_mono_cum, col = "#FDE725", lwd = 1.7)

legend("top", legend = c("Left", "Right", "Mono"), 
       col = c("#440154", "#1F968B", "#FDE725"), 
       lty = 1, box.lty = 0, horiz = TRUE, cex = 0.8)

The three cumulative time contours differ little. There is therefore no reason to believe that using the mono soundtrack as the basis for analysis has caused any key information about the sound design of this trailer to be overlooked.
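The cumulative plots can also be summarised numerically. The sketch below uses base R and an invented contour. It reads off the time by which 25%, 50%, and 75% of the total power has accumulated. Applied to itm_mono_tc, itm_left_tc, and itm_right_tc, similar times for the three objects would support the visual comparison. time_at_share() is a hypothetical helper, and it assumes the contour values are non-negative and sum to 1.

```r
# Hypothetical helper: the first time at which the cumulative contour
# reaches a given share of the total power. Assumes the contour values
# are non-negative and normalised to sum to 1.
time_at_share <- function(time_s, contour, share) {
  cum <- cumsum(contour)
  time_s[which(cum >= share)[1]]
}

# invented contour: five windows 0.5 s apart, with most power near the end
time_s  <- seq(0, 2, by = 0.5)
contour <- c(0.1, 0.1, 0.2, 0.2, 0.4)

cumsum(contour)   # rises monotonically to 1

quartile_times <- sapply(c(0.25, 0.5, 0.75),
                         function(p) time_at_share(time_s, contour, p))
quartile_times    # 1.0 1.5 2.0
```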

Sampling rate

How will using different sampling rates affect the analysis?

The sampling rate is the number of samples recorded per second, measured in hertz (Hz). It determines the range of frequencies captured by a digital audio file: the highest frequency that can be represented is half the sampling rate (the Nyquist frequency). The standard sampling rate of audio for video is 48 kHz (48,000 Hz).

To assess the impact of different sampling rates on the analysis of film audio, we can compare the soundtrack of the Into the Mirror trailer at 48 kHz and at 22.05 kHz. For simplicity, we compare mono wave objects, downsampling the 48 kHz mono soundtrack to produce a version with a sampling rate of 22.05 kHz.

To downsample a wave object, use tuneR::downsample(). The code below downsamples the 48 kHz wave object itm_mono to create a new wave object itm_22 with a sampling rate of 22.05 kHz for comparison.

itm_22 <- tuneR::downsample(itm_mono, 22050)

To plot the spectrogram of the mono soundtrack at a sampling rate of 48 kHz:

spectro(itm_mono, 
        f = 48000, 
        wl = 2048, 
        ovlp = 50, 
        wn = "hanning", 
        flog = TRUE, 
        palette = viridis, 
        fastdisp = TRUE, 
        collevels = seq(-150, 0, 5), 
        axisX = FALSE)

axis(1, at = seq(0, 140, 20), labels = seq(0, 140, 20))

To plot the spectrogram of the mono soundtrack at a sampling rate of 22.05 kHz:

spectro(itm_22, 
        f = 22050, 
        wl = 2048, 
        ovlp = 50, 
        wn = "hanning", 
        flog = TRUE, 
        palette = viridis, 
        fastdisp = TRUE, 
        collevels = seq(-150, 0, 5), 
        axisX = FALSE)

axis(1, at = seq(0, 140, 20), labels = seq(0, 140, 20))

For a signal sampled at 48 kHz, a spectrogram can represent frequencies from 0 to 24 kHz. For a signal sampled at 22.05 kHz, only frequencies from 0 to 11.025 kHz can be represented. Some information will inevitably be lost by sampling at the lower rate. However, the spectrogram of the soundtrack sampled at 48 kHz shows little energy above 11.025 kHz. The loss of information is therefore minimal and would not lead to substantially incorrect analyses of the soundtrack. In fact, there is almost no energy above 16 kHz, so a sampling rate of 32 kHz, which can represent frequencies up to 16 kHz, may be optimal.
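The argument about 32 kHz rests on the Nyquist frequency, and it can be sketched in base R. nyquist() and lowest_rate_for() are hypothetical helpers. The list of candidate rates is an assumption, chosen to cover common audio sampling rates.

```r
# The highest frequency a sampling rate can represent is half that rate
# (the Nyquist frequency).
nyquist <- function(samp_rate) samp_rate / 2

rates <- c(22050, 32000, 44100, 48000)
setNames(nyquist(rates), rates)   # 11025 16000 22050 24000

# Hypothetical helper: the lowest candidate rate whose Nyquist frequency
# reaches the highest frequency of interest (Hz).
lowest_rate_for <- function(max_freq,
                            candidates = c(22050, 32000, 44100, 48000)) {
  min(candidates[nyquist(candidates) >= max_freq])
}

lowest_rate_for(16000)   # 32000
lowest_rate_for(11025)   # 22050
```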

To compare the time contours at different sampling rates directly, we can de-normalise the aggregated power envelope of each contour. To do this, we multiply its values by the number of data points, which is equal to the number of short-time spectra. The two contours contain different numbers of short-time spectra, so normalisation scales them differently. De-normalising removes this scaling effect. To plot the results, we use the following code.

# calculate the time contours for itm_22 and itm_mono
itm_22_tc <- acoustat(itm_22, 
                      wl = 2048, 
                      ovlp = 50, 
                      plot = FALSE)$time.contour

itm_48_tc <- acoustat(itm_mono, 
                      wl = 2048, 
                      ovlp = 50, 
                      plot = FALSE)$time.contour

# de-normalise the contours
itm_22_denorm <- itm_22_tc[, 2] * length(itm_22_tc[, 2])
itm_48_denorm <- itm_48_tc[, 2] * length(itm_48_tc[, 2])

# plot the de-normalised contours
plot(itm_22_tc[,1], itm_22_denorm, 
     log = "y", 
     col = "#440154", 
     type = "l", 
     xaxt = "n", 
     yaxt = "n", 
     xlab = "Time (s)", 
     ylab = " ")

lines(itm_48_tc[,1], itm_48_denorm, col = "#1F968B")

axis(1, at = seq(0, 140, 20), labels = seq(0, 140, 20))
axis(2, at = c(0.01, 0.1, 1.0), labels = c("0.01", "0.10", "1.00"))

legend("bottom", 
       legend=c("48 kHz", "22.05 kHz"), 
       col = c("#1F968B", "#440154"), 
       lty = 1, 
       cex = 0.9, 
       box.lty = 0, 
       horiz = TRUE)

mtext("Aggregated power", 2, line = 3, las = 0)
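The de-normalisation step can be sketched in base R. n_windows() is a hypothetical helper. It assumes that windows start at the first sample, that they advance by wl * (1 - ovlp / 100) samples, and that the signal is not padded. Its counts may therefore differ slightly from those produced by acoustat(). The sample count at 22.05 kHz is estimated from the 148.32 s duration rather than read from itm_22. The second part of the sketch uses an invented contour to show why multiplying by the number of data points puts contours of different lengths on a common scale.

```r
# Hypothetical helper: number of STFT windows, assuming windows start at
# sample 1, advance by `hop` samples, and the signal is not padded.
n_windows <- function(n_samples, wl = 2048, ovlp = 50) {
  hop <- wl * (1 - ovlp / 100)
  floor((n_samples - wl) / hop) + 1
}

n48 <- n_windows(7119360)                 # 148.32 s at 48 kHz
n22 <- n_windows(round(148.32 * 22050))   # the same duration at 22.05 kHz
c(n48, n22)                               # 6951 3192

# A contour normalised to sum to 1 has mean 1/n; multiplying by n restores
# a mean of 1, whatever the number of windows. The contour is invented.
set.seed(42)
toy    <- runif(n22)
toy_n  <- toy / sum(toy)
denorm <- toy_n * length(toy_n)
mean(denorm)   # 1
```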

There is little difference between the time contours at the two sampling rates. The differences that are apparent lie in the aggregated power of the envelope rather than in its temporal structure. Using the lower sampling rate would not have altered the conclusions reached through analysis of the soundtrack.

For this soundtrack, sampling rates of 22.05 kHz, 32 kHz, and 48 kHz all seem likely to yield satisfactory results.