A Tutorial on Recent Methods for Estimating Working Vowel Space from Connected Speech

Daniel Stehr

University of California, Irvine
dstehr@uci.edu

19 January, 2018

Abstract

Vowel space area (VSA) has emerged as a useful metric for quantifying articulatory range of motion and vowel working space. However, traditional static measures of VSA, such as the triangular or quadrilateral area, are laborious and time intensive to measure and have been criticised as sometimes lacking adequate sensitivity to characterize clarity related changes. In recent years, Whitfield & Goberman (2014) and Sandoval, Berisha, Utianski, Liss, & Spanias (2013) have proposed new data-driven methods for estimating vowel space from continuously sampled formant trajectories. Both techniques promise greater sensitivity and enable VSA to be estimated from natural connected speech (without the need for hand segmentation). In this article, I demonstrate how both algorithms can be implemented in the programming language R and I evaluate their relative strengths and weaknesses.

1 Background
2 Preprocessing
- 2.1 Formant Extraction
- 2.2 Outlier Rejection
3 Articulatory-Acoustic Vowel Space
4 Sandoval et al. (2013) Approach
- 4.1 k-means clustering
- 4.2 Convex hull area calculation
5 Spectral Density Maps
- 5.1 Statistical Properties
- 5.2 2D KDE in R
References

1 Background

Vowel space is a long standing metric in acoustic phonetic research that is used to quantify where a vowel sound originates in both “acoustic” and “articulatory” dimensions. The primary acoustic dimensions which characterize a vowel sound are the first (F1) and second (F2) formant frequencies. These acoustic dimensions are generally accepted as relating to the size and shape of the cavities created by the two main articulatory dimensions of jaw opening/tongue height (F1) and vocal tract openness/tongue advancement (F2) (Lee, Shaiman, & Weismer, 2016).

Traditionally, F1 and F2 are measured from the middle 50% (steady state) portion of repeated vowel productions and F1 is plotted as a function of F2. When the axes are reversed, the result closely resembles the standard IPA vowel chart.

Below is a plot of the steady state portion of four peripheral vowels from each of 48 talkers in the Hillenbrand dataset (1995). This dataset is freely downloadable at http://homepages.wmich.edu/~hillenbr/voweldata/bigdata.dat

Vowel space area (VSA) is generally computed as the 2D area of the triangle or quadrilateral formed by either the three point or four corner vowels which represent the extremes of kinematic displacements of the articulators. VSA therefore serves as an acoustic proxy for a speaker’s articulatory working space.

The traditional method of measuring VSA from hand segmented vowel sounds has been criticised as lacking sensitivity for the reasons that
1. it is only based on a subset of the total vowel sounds in a speaker’s language and
2. the points used to calculate the area are category averages (repeated measures) that may not be truly reflective of the full distribution of a speaker’s vowel tokens.

This tutorial investigates in detail a few recent methodological advances that attempt to circumvent these problems as well as enable the possibility to measure VSA from natural, connected speech.

2 Preprocessing

For illustration, I will use a recording of a sentence I made using very high quality equipment (my iPhone). The sentence being spoken was,

“She saw the stack of keys outside the blue box”.

This sentence was chosen because it contains a wide variety of monophthong vowel sounds (6 if I counted correctly).

2.1 Formant Extraction

We first need to extract a F1/F2 time series from the sound file. This is best done using Praat, an open source phonetics program. Thankfully, we can send a shell command to Praat directly from R using the PraatR package, enabling us to do all our analysis without leaving R.

For formant extraction I will use the following settings recommended in the Praat manual
* 1 msec frame advance
* 25 msec analysis window
* 5000 Hz ceiling value (for male voice)
* pre-emphasis starting from 50 Hz

library(PraatR)
# Set parameters for formant analysis
formantArguments <- list( 0.001, # Time step (s)
                         5,     # Max. number of formants
                         5000,  # Maximum formant (Hz)
                         0.025, # Window length (s)
                         50    )# Pre-emphasis from (Hz)
#Path to wav file (Must use full path!)
wavFname <- 'D:/R/RPUB/VSA/CornerSentence.wav'
#Specify output path
formantFname <- 'D:/R/RPUB/VSA/CornerSentence.Formant'

#Use PraatR to extract out the formant contour
praat( "To Formant (burg)...",
       arguments = formantArguments,
       input = wavFname,
       output = formantFname,
       overwrite = TRUE)

# Specify the parameters for how the formant data should be summarized in a table
tabulationArguments = list( TRUE, # Include frame number
                            TRUE, # Include time
                            3,    # Time decimals
                            TRUE, # Include intensity
                            3,    # Intensity decimals
                            TRUE, # Include number of formants
                            3,    # Frequency decimals
                            TRUE )# Include bandwidths
tableFname <- 'D:/R/RPUB/VSA/CornerSentence_table.txt'
# Now, Summarize the formant object as a table 
praat( "Down to Table...",
       arguments = tabulationArguments,
       input = formantFname,
       output = tableFname,
       filetype = "tab-separated",
       overwrite=TRUE)

#Read in the formant data and take a peak
fData <- read.table( file=tableFname, header=TRUE, sep="\t", na.strings="--undefined--" )
head(fData)

##   frame time.s. intensity nformants  F1.Hz.  B1.Hz.   F2.Hz.  B2.Hz.
## 1     1   0.025     3e-05         4 403.519 988.842 1770.957 473.065
## 2     2   0.026     3e-05         4 463.710 939.272 1783.308 437.152
## 3     3   0.027     3e-05         4 506.959 897.316 1795.918 400.249
## 4     4   0.028     4e-05         4 534.001 865.924 1806.316 363.075
## 5     5   0.029     4e-05         4 546.770 845.113 1812.702 327.212
## 6     6   0.030     4e-05         4 547.817 832.822 1814.390 295.178
##     F3.Hz.  B3.Hz.   F4.Hz.  B4.Hz. F5.Hz. B5.Hz.
## 1 2766.828 274.896 3842.407 751.692     NA     NA
## 2 2771.586 278.213 3860.847 760.949     NA     NA
## 3 2776.024 282.955 3897.276 754.887     NA     NA
## 4 2780.632 287.448 3945.012 716.209     NA     NA
## 5 2785.607 291.351 3989.575 647.327     NA     NA
## 6 2791.385 295.966 4021.892 573.278     NA     NA

Next, we will subset the data to contain only F1 and F2 (since that is all we are going to work with) and transform the data into the perceptually motivated bark scale.

To convert a frequency \(f\) (Hz) into bark I will use the following equation from (Traunmüller, 1990)

Critical band rate (bark) = \([(26.81f)/(1960+f)] - 0.53\)
for calculated \(z < 2.0\) bark: \(z' = z + 0.15(2-z)\)
for calculated \(z > 20.1\) bark: \(z' = z + 0.22(z-20.1)\)

# subset data from previous step
library(dplyr)
fData <- select(fData, F1.Hz., F2.Hz.)

# Create function to convert frequency from Hertz to bark
hz2bark <- function(f) {
    z <- 26.81/(1 + 1960/f) - 0.53
    if (z < 2) {
        z <- z + (0.15 * (2 - z))
    } else if (z > 20.1) {
        z <- z + (0.22 * (z - 20.1))
    }
    return(z)
}

# apply bark transform
fData <- as.data.frame(apply(fData, 2, function(x) hz2bark(x)))
names(fData) <- c("F1.bark", "F2.bark")  #rename cols
head(fData)

##    F1.bark  F2.bark
## 1 4.047219 12.19579
## 2 4.599353 12.24226
## 3 4.979443 12.28939
## 4 5.210401 12.32801
## 5 5.317726 12.35163
## 6 5.326477 12.35786

Finally, let’s plot the formant data in a scatterplot.

# plot scatterplot
library(ggplot2)
ggplot(as.data.frame(fData), aes(F2.bark, F1.bark)) + geom_point(alpha = 0.5, 
    color = "salmon") + scale_y_reverse() + scale_x_reverse() + labs(x = "-F2 (bark)", 
    y = "-F1 (bark)") + theme_minimal()

2.2 Outlier Rejection

Automated formant extraction algorithms can sometimes produce outliers due to error. There are many approaches to outlier detection but I will utilize probabalistic mixture modeling of the datapoints as recommended by Sandoval et al. (2013). First, we will assume a generative model (a mixture of Gaussians) and estimate the parameters of this model using the expectation-maximization (EM) algorithm. This produces parameter estimates such that the observed data has a maximum likelihood fit to the generative model. Afterwards, for each individual datapoint we can estimate the generative probability (or fit probability) to the model. Datapoints that fit the probability distribution of a speaker’s formants will have high fit values, whereas anomalies will have low fit values.

The broad principle of the Gaussian mixture model (GMM) is to assume that the data, \(\textbf{X}\), was generated by a mixture of \(k\) gaussian distributions with probability density functions \(f_1(x), ..., f_k(x)\) according to the following iterative process:

Select the \(k\)th probability distribution with probability \(\alpha_k\) where \(\alpha_k \in (0,1)\) and \(\sum\limits_{k=1}^{k}\alpha_k = 1\).
Generate a data point from \(f_k(x)\)

The value \(\alpha_k\) represents the prior probability or the fraction of the data generated by mixture component \(k\). These values are not known in advance but need to be learned in a data-driven manner. The distributions \(f_k(x)\) are assumed to be bivariate Gaussians each characterized by a unique mean vector \(\mu_k\) and variance-covariance matrix \(\Sigma_k\) and have the following probability density function

\(f_k(x_i;\mu_k,\Sigma_k) = \frac{\exp(-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1}(x_i-\mu_k))} {\sqrt{det(2\pi\Sigma_k)}}\)

The parameters \(\mu_k\) and \(\Sigma_k\) will be estimated by the EM algorithm so that the data has a maximum likelihood fit of being generated. To do this, we need to define the fit of the dataset to the generative model, denoted by \(\mathcal{M}\). The probability density of a datapoint \(X_i\) being generated by the model is given by the following:

\(f^{point}(x_i;\mathcal{M}) = \sum\limits_{i=1}^{k}\alpha_i f_k(x_i;\mu_k,\Sigma_k)\)

The likelihood for data consisting of \(n\) observations (i.e. formant samples) assuming a Gaussian mixture model with \(k\) multivariate mixture components is computed as the product of the corresponding individual point-wise probabilities (or probability densities):

\(\mathcal{L}(\mu,\sigma;\mathbf{X}) = \prod\limits_{i=1}^{n}\sum\limits_{j=1}^{k}\alpha_kf(x_i;\mu_k,\Sigma_k)\)

Taking the logarithm of the likelihood enables the expression to be (more conveniently) represented as a sum over datapoints instead of a product:
\(\ln(\mathcal{L}(\mu,\sigma;\mathbf{X})) = \sum\limits_{i=1}^{n}\ln(\sum\limits_{j=1}^{k}\alpha_kf(x_i;\mu_k,\Sigma_k))\)

Following application of the EM algorithm, which is beyond the scope of this article (see Aggarwal (2013) for more depth), we have a probabalistic model which describes the entire dataset as the outcome of a generative process. The mixture likelihood provides a probabilistic fit value for each datapoint in the dataset which will serve as our outlier score. The basic intuition is that ouliers, which lie far away from the dense regions of the data, will be identified by extremely low fit values.

Let’s apply a Gaussian Mixture Model to our formant data using the MClust package. The number of Gaussian components in the mixture model must be specified in advance. Since our sentence, “She saw the stack of keys outside the blue box”, contains at least 6 vowel types, we will set the number of components to 6. The function densityMclust() will produce a density estimate for each data point using a GMM with a predefined number of mixing components.

library(mclust)
G <- 6  #Set number of Gaussian components in mixture model
# gmmfit <- densityMclust(fData,G)
gmmfit <- readRDS("gmmfit.RData")
summary(gmmfit, parameters = TRUE)

## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling 
## -------------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 6 components:
## 
##  log.likelihood    n df       BIC       ICL
##       -20578.21 5342 35 -41456.84 -42312.19
## 
## Clustering table:
##    1    2    3    4    5    6 
##  419 2219  952  541  587  624 
## 
## Mixing probabilities:
##          1          2          3          4          5          6 
## 0.07841708 0.42641797 0.17838916 0.10044153 0.10724902 0.10908524 
## 
## Means:
##              [,1]      [,2]     [,3]      [,4]     [,5]     [,6]
## F1.bark  3.001473  5.226115 11.38109  9.770184 3.156303 6.799952
## F2.bark 13.768002 11.516560 14.75576 13.636434 8.913564 9.407958
## 
## Variances:
## [,,1]
##            F1.bark    F2.bark
## F1.bark 0.04302751 0.04801546
## F2.bark 0.04801546 0.16186607
## [,,2]
##            F1.bark    F2.bark
## F1.bark  4.3166491 -0.2934772
## F2.bark -0.2934772  0.7756983
## [,,3]
##           F1.bark   F2.bark
## F1.bark 0.7276015 0.2640038
## F2.bark 0.2640038 0.2762549
## [,,4]
##            F1.bark    F2.bark
## F1.bark  1.3499038 -0.4833783
## F2.bark -0.4833783  0.9589697
## [,,5]
##           F1.bark   F2.bark
## F1.bark 0.3496242 0.1158769
## F2.bark 0.1158769 0.8760845
## [,,6]
##           F1.bark   F2.bark
## F1.bark 0.5484650 0.1723567
## F2.bark 0.1723567 0.1006297

Let me offer a short word of explanation on the output shown above. Mclust offers a variety of parameterizations of the Guassian mixture components. For instance, all mixing components may be constrained to be spherical and of the same volume or spherical but allowed to vary in volume, or allowed to vary in both regards. See the package documentation for a full list of parameterizations. By default, densityMclust() fits models with all different parameterizations to the data by maximum likelihood and then applies a statistical criterion for model selection. The criterion used is Bayesian Information Criterion (BIC) which is nice in that it includes a penalty term on the loglikelihood for increasing numbers of parameters. As indicated in the top line, the model with the highest BIC is ellipsoidal (varying volume, shape, and orientation) which is referred to with the three letter code VVV

Clustering table is a frequency table of the total formant samples assigned to each of the 6 Gaussian components.
Mixing probabilities lists the prior probabilites, \(\alpha_k\), associated with each component
Means lists the mean for each component, and
Variances lists the variance-covariance matrix, \(\Sigma_k\), for each component

To visualize the estimated density, we can produce a 3d perspective plot …

# Plot density surface
plot(gmmfit, what = "density", type = "persp")

In order to identify the extrema, we need to retrieve a density estimate for each of our datapoints and rank order them. The idea is that datapoints which are far away from the dense regions in the plot above are likely to be anomalies. Here is a boxplot of the fit values

# Add density estimates as new column to our formant data
fData$density <- gmmfit$density
# plot boxplot of fit values per datapoint
ggplot(fData, aes(x = "", y = density, fill = "gray")) + geom_boxplot() + guides(fill = FALSE) + 
    scale_y_continuous(name = "Density Estimate") + scale_x_discrete(name = "") + 
    theme_bw()

As can be seen, many of the datapoints have very high fit values. In order to reduce the relative influence of inliers, let’s apply the logarithm function to our fit values and reproduce the boxplot …

# Add density estimates as new column to our formant data
fData$log_density <- log(gmmfit$density)
# plot boxplot of fit values per datapoint
ggplot(fData, aes(x = "", y = log_density, fill = "gray")) + geom_boxplot() + 
    guides(fill = FALSE) + scale_y_continuous(name = "Log of Density Estimate") + 
    scale_x_discrete(name = "") + theme_bw()

We can now label the outliers as all datapoints with fit values more than 1.5 times the interquartile range (IQR) below the first quartile.

# label outliers based on 1.5*IQR
lowerq <- quantile(fData$log_density)[2]
iqr <- IQR(fData$log_density)
threshold.lower <- lowerq - (1.5 * iqr)
fData$outlier <- ifelse(fData$log_density < threshold.lower, 1, 0)
fData$outlier <- factor(fData$outlier, levels = c(0, 1), labels = c("valid", 
    "outlier"))

# Make new copy of our filtered data
fDataClean <- filter(fData, outlier == "valid")

# Plot scatterplot showing location of outliers
ggplot(fData, aes(x = F2.bark, y = F1.bark, shape = outlier, color = outlier)) + 
    geom_point(alpha = 0.9) + scale_y_reverse() + scale_x_reverse() + labs(x = "-F2 (bark)", 
    y = "-F1 (bark)") + theme_minimal()

The GMM removed 3.99% of the data as outliers.

3 Articulatory-Acoustic Vowel Space

Recently, Whitfield & Goberman (2014) introduced a novel measure of vowel space, which they named the Articulatory-acoustic vowel space (AAVS). The AAVS is calculated from time-varying formant trajectories and is claimed to be more sensitive than the traditional method of measuring vowel space from segmented corner vowels.

The AAVS measure is simply the standardized general variance (SGV) of the bivariate F1/F2 samples across a speaker’s utterance. The generalized variance (GV) was a measure introduced by Wilks as a scalar measure of overall multidimensional scatter. For any \(p\)-dimensional random vector \(\mathbf{X}\), the GV is defined as the determinant of its variance-covariance matrix \(\Sigma\). The SGV then is the positive \(p\)th root of the GV.

\(SGV = \sqrt[p]{\det{(\Sigma)}}\)

The SGV provides a measure of formant variability in F1 x F2 space that is equivalent to a bivariate standard deviation. Therefore, an increase in the dispersion of formants will result in an increased SGV. This property makes SGV an attractive global metric for articulatory range of motion or working formant space.

Whitfield & Goberman (2014) examined the ability of AAVS to track clarity-related changes in the speech of patients with Parkinson’s disease (PD), which is known to involve speech motor deficits The habitual speech of individual’s with PD had significantly lower AAVS than habitual speech of typicals. Furthermore, when individual’s with PD were instructed to speak more clearly, their AAVS increased relative to their habitual speaking style. This suggests that the AAVS is capable of tracking both between and within subject changes in vocal clarity.

The AAVS (or SGV) is easy to compute in R from our filtered formant data.

sgv <- sqrt(det(cov(as.matrix(select(fDataClean, F1.bark, F2.bark)))))

The obtained AAVS (or SGV) for our speaker is 5.4072105

4 Sandoval et al. (2013) Approach

Sandoval et al. (2013) describe a different (unsupervised learning) technique to estimate peripheral vowel space from continuous speech. Their method also advertizes improved sensitivity over the traditional VSA measure for the following reasons:
* It is fully automated
* It can be collected from any length of natural connected speech
* It considers all vowels rather than just the usual three (point) or four (corner) vowels.

Their method involves the following 4 steps:

Formant extraction
Outlier rejection using GMM filtering
k-means clustering
Convex hull area calculation

Steps 1 & 2 have already been covered in sections 2.1 and 2.2 of this tutorial, so we will dive right into step 3.

4.1 k-means clustering

Following formant extraction and outlier rejection, Sandoval et al. (2013) cluster the remaining (filtered) datapoints using a k-means algorithm. Use of the kmeans algorithm requires deciding in advance how many cluster centers to initialize. In their original paper, Sandoval et al. (2013) initialized 12 clusters (one for each of the 12 English vowels) while working with the TIMIT database. While this seems like a legitimate decision given you are working with a phonetically rich and varied speech sample, I argue that the number of initialized clusters should be tailored to the specific utterance you are anlyzing. Since our sentence, “she saw the stack of keys outside the blue box”, contains 6 distinct vowel sounds, let’s go with 6 for our number of clusters.

# Perform k-means clustering using the 'stats' package included in base R
# nstart is the number of random sets chosen iter.max is the max number of
# iterations allowed
km <- kmeans(as.matrix(select(fDataClean, F1.bark, F2.bark)), centers = 6, nstart = 100, 
    iter.max = 100)

# Produce scatterplot of raw data with cluster nodes found by k-means
# algorithm superimposed
ggplot() + # The raw data (from Praat)
geom_point(data = fDataClean, aes(F2.bark, F1.bark), color = "gray") + # The cluster nodes
geom_point(data = as.data.frame(km$centers), aes(F2.bark, F1.bark), color = "blue", 
    size = 2) + scale_y_reverse() + scale_x_reverse() + guides(fill = F, color = F, 
    alpha = F) + labs(x = "-F2 (Hz)", y = "-F1 (Hz)") + theme_minimal()

4.2 Convex hull area calculation

Finally, the convex hull of the set of points returned by the k-means analysis is found and the area of the resulting polygon is computed. This value is the estimate of vowel space.

In the code below, the subset of points which lie on the convex hull of the kmeans cluster is calculated using the chull() function found in the base installation of R. Then, the area of the polygon formed by the convex hull is computed using the Polygon() function in the sp package.

# Find the indices of the unique points lying on the convex hull, in
# clockwise order
chIn <- chull(km$centers)
# Note: in order to calculate the AREA of the convex hull, the list of
# indices must begin and end with the same point
chIn <- c(chIn, chIn[1])
# Get Coordinates of convex hull
chCoords <- km$centers[chIn, ]

# Compute area of polygon formed by convex hull points
library(sp)
chPoly <- Polygon(chCoords, hole = F)
chArea <- chPoly@area

The obtained convex hull area is 23.6018708

Below is a plot of the convex hull polygon overlaid on the filtered formant samples.

ggplot() + # The raw data (from Praat)
geom_point(data = fDataClean, aes(F2.bark, F1.bark), color = "gray") + # The cluster nodes
geom_point(data = as.data.frame(km$centers), aes(F2.bark, F1.bark), color = "blue", 
    size = 2) + # The area bounded by the convex hull
geom_polygon(data = as.data.frame(chCoords), aes(x = F2.bark, y = F1.bark, fill = "salmon", 
    alpha = 0.8)) + # annotate('text',label=paste('VSA = ', round(chArea,2)), x=2000,y=2100)+
scale_y_reverse() + scale_x_reverse() + guides(fill = F, color = F, alpha = F) + 
    labs(x = "-F2 (Hz)", y = "-F1 (Hz)") + # coord_fixed()+
theme_minimal()

Using the 630 speakers from the TIMIT database, Sandoval et al. (2013) compared this new automated method for estimating VSA with the traditional quadrilateral VSA and found high correlations for both male, \(\rho\) = 0.79, and female, \(\rho\) = 0.78, speakers. When the two methods diverge, the method of Sandoval et al. (2013) may be argued to be more accurate since unlike the traditional quadrilateral measure it doesn’t limit the definition of vowel space to the area interior of the four corner vowels but includes all 12 vowel categories. This also highlights a key requirement of this algorithm which is that the recording must include an adequate sample of vowel sounds. Furthermore, for comparisons across different individuals to be meaningful, the phonentic content being spoken should be identical.

5 Spectral Density Maps

Fox & Jacewicz (2017) demonstrated that the vowel quadrilateral fails to characterize variation in regional dialects of American English. They proposed that although two speakers from different regions may utilize the same working space, differences in speaking style may be better captured by the internal distribution of F1/F2 points (or spectral density regions). They call their proposed formant density approach the “formant space” to emphasize that the focus is on the absolute frequency distribution of F1/F2 points regardless of what vowel category they came from. By using continuously sampled formant trajectories from connected speech, this approach inherently takes into account formant dynamics, for it can be argued that if two speakers differ in their formant dynamics, this will also appear in their “formant space” as different regions of spectral overlap.

To derive their spectral density regions in “formant space”, Fox & Jacewicz (2017) began by plotting a scatterplot of F1/F2 values from a particular speech recording. They then produced a 3D histrogram displaying the frequency of points in distinct F1 by F2 bins. Next, they spatially smooth the historgram using grid-based interpolation. The final product is a contour map of the smoothed histogram enabling one to easily visualize the internal distribution of formants in the vowel space.

This approach amounts to a form of density estimation in which the underlying probability density of a dataset from an unknown distribution is flexibly estimated nonparametrically. The histogram approach, employed by Fox & Jacewicz (2017), is probably the oldest and least sophisticated form of density estimation. A better approach is to use kernal density estimation which centers a smooth kernel function at each datapoint and then sums to get a density estimate.

5.1 Statistical Properties

For a \(d\)-variate random sample \(\mathbf{X}_1, \mathbf{X}_2,...,\mathbf{X}_n\) drawn from a density \(f\), the kernel density estimate is defined by:

\(\hat{f}_{kde}(x) = \frac{1}{n}\sum\limits_{i=1}^{n}K_{\mathbf{H}}(x-\mathbf{X}_i)\)

where \(K_{\mathbf{H}}\) is the kernel which is a symmetric probability density function and \(\mathbf{H}\) is the bandwidth matrix which is symmetric and positive-definite. The choice of the bandwidth is much more important than the choice of the kernel function in determining the performance of \(\hat{f}\). If the choice of bandwidth is too large, the density estimate will be overly smooth and may hide local features of interest. On the other hand, if the bandwidth is chosen too small, the estimate may introduce too much variability in the form of spurious bumps. The problem of choosing an optimal bandwidth in a data-driven way is typically approached by minimizing the mean integrated square error,

\(\textrm{MISE}(\hat{f}) = \textrm{E}\lbrack\int{(\hat{f}(x) - f(x))}^2 dx\rbrack\)

This optimization problem is not easy though since \(f(x)\) is unknown. Nevertheless, there is wide consensus that plug-in selectors and cross validation selectors are generally recommended across a wide range of inputs (Duong, 2007).

5.2 2D KDE in R

There are many packages for doing kernel density estimation in R (for a review see Deng & Wickham (2011)). However, in this tutorial I will showcase the ks package since in allows multivariate kernel smoothing and offers a wide range and bandwidth selectors to choose from (see Duong (2007) for an excellent tutorial).

Multivariate plug-in bandwidth selection is performed with the function Hpi. The argument pilot specifies the type of pilot estimation (I choose the Sum of Asymptotic Mean Square Error (SAMSE) criterion) and the argument pre specifies the type of pre-transformation (“scale” or “sphere”).

Next, 2D KDE is computed using the function kde. This produces a kde class object which contains a list including the x and y axis increments and a matrix of density estimates that we can use to plot a density map.

The argument binned=TRUE can be passed to both Hpi and kde and executes binned approximation which greatly increases speed of computation for larger datasets.

library(ks)
# Load color pallette
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
r <- rf(32)
# Select bandwidth
H_hpi <- Hpi(x = fDataClean[, c(2, 1)], pilot = "samse", pre = "scale", binned = T)
# Compute 2d kde
k <- kde(x = fDataClean[, c(2, 1)], H = H_hpi, binned = T)

# Before we can plot the density estimate we need to melt it into long
# format
library(reshape2)
mat.melted <- melt(k$estimate)
names(mat.melted) <- c("x", "y", "density")
# We need to add two more colums to preserve the axes units
mat.melted$F2.bark <- rep(k$eval.points[[1]], times = nrow(k$estimate))
mat.melted$F1.bark <- rep(k$eval.points[[2]], each = nrow(k$estimate))

# Load color pallette
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
r <- rf(32)

# Finally, plot the kde
ggplot(mat.melted, aes(x = F2.bark, y = F1.bark, fill = density)) + geom_tile() + 
    coord_equal() + scale_fill_gradientn(colours = r) + scale_x_reverse(expand = c(0, 
    0), breaks = round(seq(min(mat.melted$F2.bark), max(mat.melted$F2.bark)))) + 
    scale_y_reverse(expand = c(0, 0), breaks = round(seq(min(mat.melted$F1.bark), 
        max(mat.melted$F1.bark)))) + ylab("F1 (bark)") + xlab("F2 (bark)")

References

Aggarwal, C. C. (2013). Outlier analysis (pp. 1–446). https://doi.org/10.1007/978-1-4614-6396-2

Deng, H., & Wickham, H. (2011). Density Estimation In R. Density Estimation In R, (September), 17.

Duong, T. (2007). ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. Journal of Statistical Software, 21(7), 1–16. https://doi.org/http://dx.doi.org/10.18637/jss.v021.i07

Fox, R. A., & Jacewicz, E. (2017). Reconceptualizing the vowel space in analyzing regional dialect variation and sound change in American English. The Journal of the Acoustical Society of America, 142(1), 444–459. https://doi.org/10.1121/1.4991021

Hillenbrand, J. M., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099–3111. https://doi.org/10.1121/1.411872

Lee, J., Shaiman, S., & Weismer, G. (2016). Relationship between tongue positions and formant frequencies in female speakers. The Journal of the Acoustical Society of America, 139(1), 426–440. https://doi.org/10.1121/1.4939894

Sandoval, S., Berisha, V., Utianski, R. L., Liss, J. M., & Spanias, A. (2013). Automatic assessment of vowel space area. The Journal of the Acoustical Society of America, 134(5), EL477–83. https://doi.org/10.1121/1.4826150

Traunmüller, H. (1990). Analytical expressions for the tonotopic sensory scale. The Journal of the Acoustical Society of America, 88(1), 97–100. https://doi.org/10.1121/1.399849

Whitfield, J. A., & Goberman, A. M. (2014). Articulatory – acoustic vowel space : Application to clear speech in individuals with Parkinson ’ s disease. Journal of Communication Disorders, 51, 19–28. https://doi.org/10.1016/j.jcomdis.2014.06.005