This code converts raw audio into a categorical time series that represents how pitch unfolds over time. The goal is a feature set suited to a categorical time series prediction model: a sequence of states that we can use to predict the next state in a sequence, taking long-term dependencies into account.
The data processing pipeline follows Räsänen et al. (2018):
set.seed(12345)
path_to_wav <- "data/02_processed_data/pilot-segments-norm"
min_prop_voiced <- 0.1 # min proportion of voiced samples (need to ask AW about this number)
time_bin_width <- 100 # in ms
pitch_min <- 75 # clips extreme pitch estimates
pitch_max <- 750 # clips extreme pitch estimates
silence_min <- 0.01 # lower bound of silence in recording, silent frames are not analyzed for pitch
pathfinding_alg <- "fast" # method of finding the optimal path through pitch candidates
pitch_methods <- c("autocor", "spec", "dom") # pitch tracking methods to combine
min_n_samples_loess <- 20 # min number of pitch candidates needed to fit loess (20 = 500 ms of pitch data)
frac_points_loess <- 0.20 # proportion of points used for the loess interpolation (higher = less wiggly)
preds_sample_rate <- 10 # how frequently to sample from the fitted loess (ms)
min_samples_bin <- 10 # min number of samples in a time bin
degree_poly <- 2 # degree of polynomial curve fit to the temporal segments
n_q_shapes <- 10 # number of q-shapes for k-means clustering step
seq_max_len <- 10 # max length of the sub-sequences for training the LSTM
skip_val <- 1 # how many steps to shift each training sub-sequence
prop_train <- 0.9 # prop of utterances to use for training vs. test
prop_train_cds <- 0.5 # prop of CDS utterances in the training dataset
# path to audio
files_to_analyze <- list.files(here(path_to_wav),
                               pattern = "\\.wav$",
                               recursive = TRUE)
# add the full path
files_to_analyze <- here(path_to_wav, files_to_analyze)
Extract pitch contours for each .wav file using our own get_pitch_contour() function, which we map over the list of files.
d <- files_to_analyze %>% map_df(get_pitch_contour,
pitchFloor = pitch_min,
pitchCeiling = pitch_max,
pitchMethods = pitch_methods,
pathfinding = pathfinding_alg,
silence = silence_min)
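For reference, here is a minimal sketch of what get_pitch_contour() might look like, assuming it wraps soundgen::analyze() (which accepts all of the arguments above) and that a recent soundgen returns frame-level estimates in the $detailed element; the real helper presumably also parses dataset, seg_id, and speech_register metadata from the file path.
get_pitch_contour_sketch <- function(path, pitchFloor, pitchCeiling,
                                     pitchMethods, pathfinding, silence) {
  a <- soundgen::analyze(path,
                         pitchFloor = pitchFloor,
                         pitchCeiling = pitchCeiling,
                         pitchMethods = pitchMethods,
                         pathfinding = pathfinding,
                         silence = silence,
                         plot = FALSE)
  tibble::tibble(path_to_wav = path,
                 time = a$detailed$time,   # frame midpoints (ms)
                 pitch = a$detailed$pitch) # final pitch estimate (Hz); NA when unvoiced
}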
Find the first and last voiced regions and remove time values outside them. This ensures that the interpolation step doesn’t extrapolate beyond the range of the data.
batch_filter_voiced(d) -> d
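A hypothetical sketch of the trimming logic (the actual batch_filter_voiced() may differ): within each segment, drop samples before the first and after the last voiced (non-NA pitch) frame.
batch_filter_voiced_sketch <- function(d) {
  d %>%
    dplyr::group_by(seg_id) %>%
    dplyr::filter(time >= min(time[!is.na(pitch)]),  # first voiced frame
                  time <= max(time[!is.na(pitch)])) %>% # last voiced frame
    dplyr::ungroup()
}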
Plot a sample of eight pitch contours: two from each combination of dataset and speech register.
segs_to_plot <- d %>%
distinct(dataset, seg_id, speech_register, path_to_wav) %>%
group_by(dataset, speech_register) %>%
sample_n(2)
d %>%
filter(seg_id %in% segs_to_plot$seg_id) %>%
ggplot(aes(x = time, y = pitch, color = speech_register)) +
geom_line(size = 1) +
labs(x = "Time (ms)", y = "Pitch ") +
lims(y = c(0, 750)) +
facet_wrap(dataset + speech_register~seg_id,
#scales = "free_x",
ncol = 4
) +
theme(legend.position = 'top') +
ggthemes::scale_color_ptol()
Looks like we can extract pitch estimates from audio files in both datasets.
Create a blacklist of segment ids with too few pitch estimates to do any kind of reliable loess interpolation. Here we require more than 20 pitch samples, i.e., 500 ms of pitch data.
seg_id_blacklist <- d %>%
filter(!is.na(pitch)) %>%
count(seg_id) %>%
filter(n <= min_n_samples_loess) %>%
pull(seg_id)
d %>%
filter(!(seg_id %in% seg_id_blacklist)) %>%
split(.$seg_id) %>%
map_df(interpolate_loess, frac_points = frac_points_loess, sample_rate = preds_sample_rate) %>%
filter(!is.na(pitch_interpolated)) -> d_interp
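A hypothetical sketch of what interpolate_loess() might do for a single segment (the span and grid logic here are assumptions): fit a loess to the voiced samples and resample the smooth on a regular grid every sample_rate ms.
interpolate_loess_sketch <- function(d_seg, frac_points, sample_rate) {
  voiced <- dplyr::filter(d_seg, !is.na(pitch))
  fit <- loess(pitch ~ time, data = voiced, span = frac_points)
  grid <- seq(min(voiced$time), max(voiced$time), by = sample_rate)
  tibble::tibble(seg_id = d_seg$seg_id[1],
                 dataset = d_seg$dataset[1],
                 speech_register = d_seg$speech_register[1],
                 time = grid,
                 pitch_interpolated = predict(fit, newdata = data.frame(time = grid)))
}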
# log transform and Z-score
d_interp %>% mutate(log_pitch = log(pitch_interpolated), z_log_pitch = as.numeric(scale(log_pitch))) -> d_interp
Let’s plot the original pitch estimates (points) with our interpolated pitch contours (black lines) to sanity check the interpolation step.
d_interp %>%
filter(seg_id %in% segs_to_plot$seg_id) %>%
ggplot(aes(x = time, y = pitch_interpolated)) +
geom_line(size = 1, color = "grey20") +
geom_point(data = filter(d, seg_id %in% segs_to_plot$seg_id),
aes(time, pitch, color = speech_register),
size = 2,
alpha = 0.6) +
labs(x = "Time (ms)", y = "Pitch") +
lims(y = c(0, 750)) +
facet_wrap(dataset + speech_register~seg_id,
#scales = "free_x",
ncol = 4) +
theme(legend.position = 'top') +
ggthemes::scale_color_ptol()
These curves look reasonable to me, but the span of the loess (frac_points_loess above), which controls its wiggliness, is a free parameter that we should experiment with.
Divide each audio clip into fixed-width 100 ms temporal segments.
d_interp %>%
create_time_bins(bin_width = time_bin_width) %>%
get_time_in_bin() %>%
relabel_bins() -> d_interp
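A hypothetical sketch of the binning step (column names are assumptions): assign each interpolated sample to a 100 ms bin based on its timestamp and give each (segment, bin) pair a unique id.
create_time_bins_sketch <- function(d, bin_width) {
  d %>%
    dplyr::mutate(time_bin = floor(time / bin_width), # 0, 1, 2, ... within segment
                  time_bin_id = paste(seg_id, time_bin, sep = "_"))
}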
# Keep only complete 100 ms bins, i.e., bins containing the full number of samples.
d_interp %>% filter(n_bins_in_seg == min_samples_bin) -> d_interp
Make a plot to sanity check the temporal segmentation step, coloring each point by its 100 ms time bin.
one_seg_to_plot <- segs_to_plot$seg_id[1]
d_interp %>%
filter(seg_id %in% one_seg_to_plot) %>%
ggplot(aes(x = time, y = z_log_pitch, color = time_bin)) +
geom_point(size = 1) +
guides(color = "none") +
labs(x = "Time (ms)", y = "Normalized Log(Pitch)") +
facet_wrap(~seg_id)
From the plot, it looks like the temporal segmentation step is working.
We pass in the time and normalized log-transformed pitch values and get back the coefficients of a second-order polynomial for each time bin (a vector of length 3). We hold onto the seg_id, dataset, speech register, and time_bin as metadata.
d_interp %>%
group_by(seg_id, dataset, speech_register, time_bin_id) %>%
nest() -> d_by_bin
# fit polynomial and make predictions based on those fits
d_by_bin %>%
mutate(poly_coefs = map(data, fit_poly, degree_poly)) %>%
mutate(poly_preds = map(poly_coefs, predict_poly)) -> d_by_bin
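A hypothetical sketch of fit_poly() (the coefficient names and the bin-relative time column are assumptions): regress normalized log pitch on time within the bin using a raw degree-2 polynomial and return the coefficients.
fit_poly_sketch <- function(d_bin, degree) {
  # time_in_bin is presumably the bin-relative time created by get_time_in_bin()
  fit <- lm(z_log_pitch ~ poly(time_in_bin, degree, raw = TRUE), data = d_bin)
  tibble::tibble(coef_name = paste0("b", 0:degree), # hypothetical names
                 value = unname(coef(fit)))
}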
We pass a matrix of coefficient values (one row of three coefficients per 100 ms segment of the pitch contour) and get back a cluster assignment for each segment.
d_coefs <- unnest(d_by_bin, poly_coefs, .drop = TRUE)
d_coefs %>% get_cluster_assignments(k = n_q_shapes, scale_coefs = T) -> d_final
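A hypothetical sketch of the clustering inside get_cluster_assignments() (column names and nstart are assumptions): optionally scale the coefficient columns, run k-means, and return both the per-bin assignments and the cluster centers.
get_cluster_assignments_sketch <- function(d_coefs, k, scale_coefs = TRUE) {
  coef_mat <- as.matrix(d_coefs[, c("b0", "b1", "b2")]) # hypothetical coefficient columns
  if (scale_coefs) coef_mat <- scale(coef_mat)
  km <- stats::kmeans(coef_mat, centers = k, nstart = 25)
  list(d_clusters = dplyr::mutate(d_coefs, cluster = km$cluster),
       centers = tibble::as_tibble(km$centers))
}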
We can plot the distribution of cluster assignments for each dataset.
plot_clusters_scatter(d_final$d_clusters)
The first plot shows the polynomial shape at the center of each cluster generated by the k-means clustering. The second plot shows a reconstructed pitch contour: the fitted 2nd-order polynomial for each 100 ms time bin (bottom row) alongside the interpolated pitch contour (top row). The number displayed in each time bin facet is the k-means cluster assignment for that pitch shape.
shapes_plot <- plot_cluster_shapes(d_final$centers, scaled = TRUE)
shapes_plot
# add cluster assignments to display on plot
d_by_bin %>% left_join(select(d_final$d_clusters, seg_id, time_bin_id, cluster),
                       by = c("seg_id", "time_bin_id")) -> d_by_bin
# plot
seg_id_to_plot <- "B-01_06_1167_4_15_fids"
reconstruct_plot <- plot_reconstructed_pitch(seg_id_to_plot, df_raw = d_interp, df_preds = d_by_bin)
reconstruct_plot
Our final step is to take the sequence of cluster assignments for each pitch contour and split it into training and test sets for the LSTM modeling step. Intuitively, we want to generate “question and answer” sub-sequences where the question/input is the prior sequence of pitch shapes and the answer/target is the next pitch shape in the sequence.
First, we add the number of segments (q-shapes) in each clip, which will be useful for reconstructing the utterances from the training/test sequences.
add_utt_duration_segs(d_final$d_clusters) -> d_final$d_clusters
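A hypothetical sketch of what add_utt_duration_segs() might do (the column name is an assumption): count the number of q-shape bins per clip and attach it as a column.
add_utt_duration_segs_sketch <- function(d_clusters) {
  dplyr::add_count(d_clusters, seg_id, name = "n_segs_in_utt") # bins per utterance
}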
Now we create the dataset. We want to be able to control the proportion of data used for training, and we also want to be able to experiment with the proportion of CDS utterances in the training set. To do this, we add the relevant knobs to our dataset generator function.
d_lstm <- generate_lstm_dataset(d_final$d_clusters,
max_seq_len = seq_max_len,
skip = skip_val,
train_test_split = prop_train,
prop_cds = prop_train_cds)
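A hypothetical sketch of the windowing logic presumably inside generate_lstm_dataset(): slide a window of length max_seq_len over one clip's cluster sequence, stepping by skip; each window is an input and the cluster that follows it is the target.
make_subsequences_sketch <- function(clusters, max_seq_len, skip) {
  if (length(clusters) <= max_seq_len) return(list()) # clip too short to window
  starts <- seq(1, length(clusters) - max_seq_len, by = skip)
  purrr::map(starts, function(i) {
    list(x = clusters[i:(i + max_seq_len - 1)], # "question": prior pitch shapes
         y = clusters[i + max_seq_len])         # "answer": the next pitch shape
  })
}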
# save coefs and cluster sequences
fst::write_fst(d_final$centers, here("data/03_summaries/lena-pred-kmeans-centers.fst"))
fst::write_fst(d_final$d_clusters, here("data/03_summaries/lena-pred-poly-coefs.fst"))
write_rds(d_lstm, here("data/03_summaries/lena-pred-lstm-train-test.rds"))