Signal processing is concerned with the representation, transformation and manipulation of signals and information they contain.

When doing Digital Signal Processing, R is not typically the first computer language to ring a bell. DSP seems to be widely done using languages such as MATLAB, Python and C++. 10 years ago, I probably wouldn’t have checked R's capabilities for DSP, I mean, MATLAB and C++ were the de facto back then for these tasks (Oh! Did I mention I would still be in elementary school? 😄 👩‍🏫). Fast forwarding, there are lots of tools in R’s ecosystem that have been developed.

It’s safe to say that it has never been easier to use R in almost any domain as demonstrated by Jared Lander’s talk at e-Rum 2020. In that same spirit of using R Anywhere and Everywhere, after implementing the lab: Speech in the time and frequency domain using MATLAB (can be found here), the next adventure was doing it in R too and Voila, here it is:

Speech in the time and frequency domain lab: An R Version

In this lab, we analysed a speech signal in the time and frequency domain. The objectives of this experiment were:

  1. To create a two second recording of one of the 10 digits in Kiswahili.
  2. To compute a spectrogram of the recorded speech signal.
  3. To obtain a segment 32ms long from a region with speech and another segment of the same duration from a segment without speech, plot and analyze them in the time domain.
  4. To compute the Discrete Fourier, Transform of both signals in (c) above.

Background: Putting it in very simple terms

Discrete-time signal processing is based on processing of numeric sequences indexed on integer variables rather than functions of a continuous independent variable. In Digital Signal Processing, signals are represented by sequences of finite precision numbers and processing is implemented using digital computation. Although there are many examples in which signals to be processed are inherently discrete-time sequences, most applications involve the use of discrete-time technology for processing signals that originate as continuous-time signals. In this case, a continuous-time signal is typically converted into a sequence of samples. After discrete-time processing, the output sequence is converted back to a continuous-time signal.

The Discrete Fourier Transform (DFT) is considered as the frequency domain representation of the original input sequence.The DFT is itself a sequence rather than a function of a continuous variable, and it corresponds to samples, equally spaced in frequency. A continuous time signal x(t) is first sampled to form the discrete time signal x[n] of length N samples. The transformed signal within the frequency domain is of finite extent but discrete and of the same length as the original signal. The DFT then decomposes a time domain signal defined over a finite range into a sum of weighted complex sinusoids as:

\(X(k) =\sum_{n=0}^{N-1} x[n] {e^{-j\frac{2\pi}Nkn}}\)

This operation is useful in many fields but computing it directly from the definition is often too slow to be practical. A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier Transform of a sequence or its inverse.

Data Analysis in R

Let’s now take a look at how we can analyze a speech signal in the time and frequency domain. The following table lists common quantities used in signal processing to characterize and interpret signal properties. They will be quite useful in our analysis, so yes, they are worth a look:


Quantity Description
y Sampled data
N <- length(y) Number of samples
Fs Sample frequency (samples per unit time or space)
dt <- 1/Fs Time or space increment per sample
t <- (seq_len(n) - 1)/Fs Time or space range for data
dft <- fft(y) Discrete Fourier transform of data (DFT)
abs(dft) Amplitude of the DFT
(abs(dft)^2)/N Power of the DFT
Fs/N Frequency increment/Frequency resolution
f <- (seq_len(n) - 1)*(Fs/N) Frequency range


A single channel 2s audio signal of the word moja (Swahili word for digit one) sampled at 8 kHz was recorded using a Raspberry Pi and stored as a .wav file. The recording can also be done using a laptop or mobile phone too.

If you would like to follow along using the audio file used in this post, you can download it from here: moja wav

For this analysis, we’ll be requiring the following packages in R:

suppressMessages({
library(seewave)
library(tuneR)
library(tidyverse)
library(audio)
library(plotly) # for adding some interactivity dust
library(fftw)
library(cowplot)
})

You can have them installed as:

`install.packages(c(“tidyverse”, “tuneR”,

“audio”, “plotly”, “oce”, “fftw”, “cowplot”))`

Loading the audio data and some pre-processing

Let’s begin by loading the speech signal recorded, reading it and then rescaling the amplitude of the waveform to a canonical interval [-1,1].

# loading wav file
one = "C:/Users/keras/Documents/DSP_R/moja.wav"

# read and normalize wav file using tuneR
data <- tuneR::readWave(one) %>%
  tuneR::normalize(unit = c("1"), center = FALSE, rescale = FALSE)

The audio signal was recorded for 2 seconds and then reduced to a discrete time signal by sampling at 8 kHz to obtain 16000 samples as below:

\(sampling \, period \, (T_s) =\, \frac{1}{sampling\,frequency \, (F_s)} \, =\, \frac{1}{8000}\)

\(T_s \, = \,0.125\, _{ms} \, (time\,difference\,between\,samples)\)

\(Number\,of\,samples\, = segment\,length(sec) \times sampling\,frequency \,\)

\(Number\,of\,samples\, = \,16000\)

All these summaries of the Wave-object can be obtained as:

summary(data)
## 
## Wave Object
##  Number of Samples:      16000
##  Duration (seconds):     2
##  Samplingrate (Hertz):   8000
##  Channels (Mono/Stereo): Mono
##  PCM (integer format):   TRUE
##  Bit (8/16/24/32/64):    32
## 
## Summary statistics for channel(s):
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.578278 -0.027527  0.003113 -0.001011  0.029671  0.763031

Plotting the audio data

Now we’ll make a plot of the signal’s amplitude vs the corresponding sample number.

# extracting sampled data, y
y = data@left

# extracting the sample rate, Fs
Fs = data@samp.rate

# you can listen to the audio too!
# audio::play.audioSample(y, Fs)

# getting our ggplot on
theme_set(theme_light())
wvfm <- ggplot(mapping = aes(x = seq_len(length(y)), y = y))+
  geom_line(color = 'blue')+
  labs(x = "Sample Number", y = "Amplitude", title = "Speech waveform")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(wvfm)

From the Speech waveform plot, one can easily identify the segments that contain speech and those without (very faint speech).

We’ll need these segments for some later analysis, so let’s define sample numbers that contain speech and those without.

# sample index of segments with speech and those without
speech_start <- 5100
silence_start <- 9850

Computing a spectrogram of the signal

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It is usually depicted as a heat map with intensity being shown by varying the colour or brightness.

We will compute the spectrogram by dividing the speech signal into segments 32ms long (256 samples at 8kHz) with 50% ovelap between neighboring segments and taking the FFT of each segment.

We will then plot the magnitude of this FFT for all the segments as a 2-dimensional image showing how the frequency varies with time. ggspectro {seewave} (an alternative to spectro {seewave}) allows us to draw a spectrogram with the package ggplot2. Maybe some few things to note is that I have specified a Fourier Transform window and used the Fast Fourier Transform (by setting fftw = TRUE) to speed up the process.

# defining the window size (number of samples in a 32ms long segment)
N <- 32e-3*Fs

# computing a spectrogram of the audio showing how frequency varies with time
spect <- ggspectro(y, Fs, wl = N, wn = "hamming", ovlp = 50, fftw = T)+
  geom_tile(aes(fill = amplitude))+
  scale_fill_viridis_c()
ggplotly(spect)

The spectrogram appears as a heatmap with the various colours representing power in decibels. High values are represented by yellow, low values by deep blue and intermediate values appear as green. It is evident that the digit was spoken in the interval between 0.4s and 1s where there are high values of frequency.


Now, let’s divide the speech signal into segments 64ms, take the FFT of each segment and whip up a spectrogram.

# defining the window size (number of samples in a 32ms long segment)
N_2 <- 64e-3*Fs

# computing a spectrogram of the audio showing how frequency varies with time
spect <- ggspectro(y, Fs, wl = N_2, wn = "hamming", ovlp = 50, fftw = TRUE)+
  geom_tile(aes(fill = amplitude))+
  scale_fill_viridis_c()
ggplotly(spect)

Increasing the segment length to 64ms consequently increases the window length to 512 samples. This effect of this is that the frequency resolution increases. The frequency resolution is obtained by the ratio Fs/N. So, for a given sampling frequency (Fs), the more samples (N) in the signal, the smaller the frequency increment between successive DFT data points. Since more points are being sampled, the spectral resolution increases.

In what frequency range is most of the energy of the signal?

To answer this questions, we will have to come up with a magnitude spectrum of the speech waveform which shows the magnitude of the signal at a particular frequency. This means we will have to convert the signal from the time domain to a representation in the frequency domain by computing the Discrete Fourier Transform (using a FFT algorithm). Once we have the amplitude of the DFT and the frequency range, we’ll be ready to go!

Due to mathematical convenience, the arrangement of the FFT output has positive and negative frequencies side by side in a slightly non-intuitive order (the frequency increases from 0 until N/2 and then it decreases for negative frequencies). As the positive and negative frequency magnitudes are identical for a real signal it is common just to plot the positive frequencies from 0 to Fs/2 (in our case 0 to 4000 Hz).

# Discrete Fourier transform of data (DFT)
dft_speech <- fftw::FFT(y)

# amplitude of DFT
dft_amp <- abs(dft_speech) # DFT is a sequence of complex sinusoids 

# number of samples
n <- length(y)

# frequency range
f <- (seq_len(n)-1)*Fs/n

freq_plot <- ggplot(mapping = aes(x = f[1:Fs], y = dft_amp[1:Fs]))+
  geom_line(color = 'blue')+
  labs(x = "Frequency (Hz)", y = "Magnitude",
title = "Frequency domain representation of the entire speech sequence")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(freq_plot)

From the magnitude spectrum of the Discrete Fourier Transform, we can observe that most of the energy is concentrated in the frequency range 0 to 1 KHz.

Time domain, Frequency domain and Energy comparison of a segment containing speech and one without

The time domain representation of a speech signal shows how the sound changes in intensity over time

Earlier on in this analysis, we defined some sample numbers in regions containing speech and those without. First, we’ll obtain a segment 32ms long from a region with speech and another segment of the same duration from a segment without speech and plot these time domain signals side by side with time in seconds as the x-axis. A 32ms long segment corresponds to 256 samples:

\(Number\,of\,samples\, = \,segment\,length(sec) \times sampling\,frequency \,\)

\(= 0.032 \times 8000 = 256\)

# sample length
sample_len = 32e-3*Fs

# obtaining a 32ms long segment/256 samples without speech
moja_sl <- y[silence_start : (silence_start + sample_len - 1)]

# obtaining a 32ms long segment/256 samples with speech
moja_sp <- y[speech_start : (speech_start + sample_len - 1)]

# corresponding time-length of the sample (32ms long)
t <- (seq_len(sample_len) - 1) * 1/Fs

# plotting the waveform of the silent segment
wvfm_sl <- ggplot(mapping = aes(x = t, y = moja_sl)) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Time (s)", y = "Amplitude",
       title = "Time domain of silent segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(wvfm_sl)

# plotting the waveform of the segment with speech
wvfm_sp <- ggplot(mapping = aes(x = t, y = moja_sp)) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Time (s)", y = "Amplitude",
       title = "Time domain of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(wvfm_sp)

cowplot::plot_grid(wvfm_sl, wvfm_sp)

From the above plots above, the time domain plot of the segment containing speech has higher intensity (amplitude ranging from -0.27 to 0.24) than that without speech/faint speech due to noise (amplitude ranging from -0.02 to 0.06).

The time domain plot of the segment containing speech is also more erratic due to more fluctuations in amplitude of sound.

One common metric that can be easily computed from the sample values is a signal's energy.

The energy of a discrete time signal is defined as the sum of the square of the sampled signal values: \(E(y) =\sum_{n=-\infty}^{\infty} y^2\)

# Energy of silent segment
E_sl <- sum(moja_sl^2)
E_sl
## [1] 0.1733198
# Energy of segment with speech
E_sp <- sum(moja_sp^2)
E_sp
## [1] 3.440872

The silent segment has an energy of 0.1733 whereas the segment containing speech has an energy of 3.4409. Due to higher intensities of sound in the region containing speech, the sample values in this region have higher amplitude values which in turn account for the higher energy. The energy in the silent segment could be attributed to noise and can be filtered out using an appropriate filter.


The time domain analysis can also be done in the frequency domain by computing the Discrete Fourier Transform (using a FFT algorithm) of the two segments.

# DFT of silent segment
dft_sl <- fftw::FFT(moja_sl)

# DFT of speech segment
dft_sp <- fftw::FFT(moja_sp)

# frequency range
fs <- (seq_len(sample_len)-1)*Fs/sample_len

# plotting the waveform of the silent segment in freq domain
freqd_sl <- ggplot(mapping = aes(x = fs[1 : (sample_len)/2],
  y = abs(dft_sl[1: (sample_len)/2]))) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Amplitude",
       title = "Frequency domain of silent segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(freqd_sl)

# plotting the waveform of the speech segment in freq domain
freqd_sp <- ggplot(mapping = aes(x = fs[1 : (sample_len)/2], 
  y = abs(dft_sp[1: (sample_len)/2]))) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Amplitude",
       title = "Frequency domain of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(freqd_sp)

cowplot::plot_grid(freqd_sl, freqd_sp)

When the two segments are converted from the time domain to a representation in the frequency domain the same inferences made earlier still hold. The segment containing speech has higher magnitude than that without speech at various frequencies.

Signal power as a function of frequency is another common metric used in signal processing. Power is the squared magnitude of a signal’s Fourier transform, normalized by the number of frequency samples:

\(P_{dft}\, =\, [\mid {(dft)}\mid^2]/N\)

Let’s plot a power spectrum of the segment containing speech, shall we?

# power of the dft
power <-  ((abs(dft_sp))^2)/sample_len

# frequency range
fp <- (seq_len(sample_len)-1)*Fs/sample_len

# power spectrum
power_sp <- ggplot(mapping = aes(x = fp[1 : (sample_len)/2],
  y = power[1: (sample_len)/2])) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Power",
       title = "Power Spectrum of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(power_sp)

The plot indicates that the speech segment consists of a fundamental frequency around 156.25 Hz and a sequence of harmonics, where the third harmonic is emphasized.

Time to it a wrap! It’s been quite the adventure, right 😊? Anyhow, we got to analyze a speech signal and at the same time cover some fundamental concepts in Digital Signal Processing.

This is by no means a thorough walk-through but I hope it will ignite some R-DSP spark or something.

Hoping to cover some more exciting concepts … soon!

Be sure to check out great blogs, tutorials and other formats of R resources coming out every day at RWeekly.org!

Till then,

Happy Learning 👩🏽‍💻 👨‍💻 👨🏾‍💻 👩‍💻 ,

Eric (R_ic), Microsoft Learn Student Ambassador.

Reference Material

  • Andrew O. Finley, Jeffrey W. Doser, Vince Melfi, “Digital Signal Processing”, in R Programming for Data Sciences. 2020

  • I. V. McLoughlin, Speech and Audio Processing: A MATLAB®-based approach. 2016.

  • MATLAB Documentation, Basic Spectral Analysis

  • A. V Oppenheim and R. W. Schafer, “Discrete Time Signal Processing 2nd Edition,” Book. 1998.

  • Eric Wanjau, Samuel Kariuki, Ian Muchiri, Joshua Ndemenge, Analysis of Speech in the Time and Frequency Domain using MATLAB.

---
title: "**A gentle introduction: R in Digital Signal Processing**"
output:
  html_document:
    css: style_3.css
    df_print: paged
    theme: flatly
    highlight: breezedark
    toc: yes
    toc_float: yes
    code_download: TRUE
    includes:
      after_body: footer.html
  html_notebook:
    toc: yes
---

Signal processing is concerned with the representation, transformation and manipulation of signals and information they contain. 

When doing Digital Signal Processing, `R` is not typically the first computer language to ring a bell. DSP seems to be widely done using languages such as `MATLAB`, `Python` and `C++`. 10 years ago, I probably wouldn't have checked `R's` capabilities for DSP, I mean, MATLAB and C++ were the de facto back then for these tasks (Oh! Did I mention I would still be in elementary school? 😄 👩‍🏫). Fast forwarding, there are lots of tools in R's ecosystem that have been developed. 

It's safe to say that it has never been easier to use R in almost any domain as demonstrated by [Jared Lander's talk](https://jaredlander.com/content/2020/06/ApplyingRAtWork.html#1) at [e-Rum 2020](https://2020.erum.io/). In that same spirit of `using R Anywhere and Everywhere`, after implementing the lab: `Speech in the time and frequency domain` using MATLAB (can be found [here](https://github.com/R-icntay/DSP_MATLAB)), the next adventure was doing it in R too and Voila, here it is:

# **Speech in the time and frequency domain lab:** An R Version

In this lab, we analysed a speech signal in the time and frequency domain. The objectives of this experiment were:

a. To create a two second recording of one of the 10 digits in Kiswahili.
b. To compute a spectrogram of the recorded speech signal.
c. To obtain a segment 32ms long from a region with speech and another segment of the same duration from a segment without speech, plot and analyze them in the time domain.
d. To compute the Discrete Fourier, Transform of both signals in (c) above.


# **Background:** Putting it in very simple terms

Discrete-time signal processing is based on processing of numeric sequences indexed on integer variables rather than functions of a continuous independent variable. In Digital Signal Processing, signals are represented by sequences of finite precision numbers and processing is implemented using digital computation. Although there are many examples in which signals to be processed are inherently discrete-time sequences, most applications involve the use of discrete-time technology for processing signals that originate as continuous-time signals. In this case, a continuous-time signal is typically converted into a sequence of samples. After discrete-time processing, the output sequence is converted back to a continuous-time signal.

The Discrete Fourier Transform (DFT) is considered as the frequency domain representation of the original input sequence.The DFT is itself a sequence rather than a function of a continuous variable, and it corresponds to samples, equally spaced in frequency. A continuous time signal x(t) is first sampled to form the discrete time signal x[n] of length N samples. The transformed signal within the frequency domain is of finite extent but discrete and of the same length as the original signal. The DFT then decomposes a time domain signal defined over a finite range into a sum of weighted complex sinusoids as:

$X(k) =\sum_{n=0}^{N-1} x[n] {e^{-j\frac{2\pi}Nkn}}$

This operation is useful in many fields but computing it directly from the definition is often too slow to be practical.
A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier Transform of a sequence or its inverse. 


# **Data Analysis in R**

Let's now take a look at how we can analyze a speech signal in the time and frequency domain. The following table lists common quantities used in signal processing to characterize and interpret signal properties. They will be quite useful in our analysis, so yes, they are worth a look:

<br>


```{r echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
df <- tibble(
  Quantity = c("y", "N <-  length(y)", "Fs", "dt <- 1/Fs", "t <- (seq_len(n) - 1)/Fs", "dft <- fft(y)", "abs(dft)", "(abs(dft)^2)/N" , "Fs/N", "f <- (seq_len(n) - 1)*(Fs/N)"),
  
  
  Description = c("Sampled data", "Number of samples", "Sample frequency (samples per unit time or space)", "Time or space increment per sample", "Time or space range for data", "Discrete Fourier transform of data (DFT)", "Amplitude of the DFT", "Power of the DFT", "Frequency increment/Frequency resolution", "Frequency range"  )
)

library(kableExtra)

df %>%
  kbl() %>%
  kable_material_dark()

#library(formattable)
#formattable(df, list(
 # Quantity = color_tile("white", "grey"),
 # Description = color_tile("white", "orange")
#))

```

<br>

A single channel 2s audio signal of the word `moja` (Swahili word for digit one) sampled at 8 kHz was recorded using a Raspberry Pi and stored as a `.wav` file. The recording can also be done using a laptop or mobile phone too.

If you would like to follow along using the audio file used in this post, you can download it from here: [moja wav](https://stdntpartners-my.sharepoint.com/:u:/g/personal/eric_wanjau_studentambassadors_com/EXpE2v0Mi8lLrLRBXCHBZtsBGjOYTwlWil5vYs1pR9RR_Q?e=sZfvpd)

For this analysis, we'll be requiring the following packages in R:

```{r message=FALSE, warning=FALSE, results='hide'}
suppressMessages({
library(seewave)
library(tuneR)
library(tidyverse)
library(audio)
library(plotly) # for adding some interactivity dust
library(fftw)
library(cowplot)
})
```

You can have them installed as: 

`install.packages(c("tidyverse", "tuneR",

"audio", "plotly", "oce", "fftw", "cowplot"))`

### **Loading the audio data and some pre-processing**

Let's begin by loading the speech signal recorded, reading it and then rescaling the amplitude of the waveform to a canonical interval [-1,1].

```{r}
# loading wav file
one = "C:/Users/keras/Documents/DSP_R/moja.wav"

# read and normalize wav file using tuneR
data <- tuneR::readWave(one) %>%
  tuneR::normalize(unit = c("1"), center = FALSE, rescale = FALSE)

```

The audio signal was recorded for 2 seconds and then reduced to a discrete time signal by sampling at 8 kHz to obtain 16000 samples as below:

$sampling \, period \, (T_s) =\, \frac{1}{sampling\,frequency \, (F_s)} \, =\, \frac{1}{8000}$

$T_s \, = \,0.125\, _{ms} \, (time\,difference\,between\,samples)$

$Number\,of\,samples\, = segment\,length(sec) \times sampling\,frequency \,$

$Number\,of\,samples\, = \,16000$

All these summaries of the Wave-object can be obtained as:

```{r}
summary(data)
```


### **Plotting the audio data**

Now we'll make a plot of the signal's `amplitude` vs the corresponding `sample number`.

```{r}
# extracting sampled data, y
y = data@left

# extracting the sample rate, Fs
Fs = data@samp.rate

# you can listen to the audio too!
# audio::play.audioSample(y, Fs)

# getting our ggplot on
theme_set(theme_light())
wvfm <- ggplot(mapping = aes(x = seq_len(length(y)), y = y))+
  geom_line(color = 'blue')+
  labs(x = "Sample Number", y = "Amplitude", title = "Speech waveform")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(wvfm)

```


From the Speech waveform plot, one can easily identify the segments that contain speech and those without (very faint speech).

We'll need these segments for some later analysis, so let's define sample numbers that contain speech and those without.

```{r}
# sample index of segments with speech and those without
speech_start <- 5100
silence_start <- 9850
```


### **Computing a spectrogram of the signal**

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It is usually depicted as a heat map with intensity being shown by varying the colour or brightness.

We will compute the spectrogram by dividing the speech signal into segments 32ms long (256 samples at 8kHz) with 50% ovelap between neighboring segments and taking the FFT of each segment.



We will then plot the magnitude of this FFT for all the segments as a 2-dimensional image showing how the frequency varies with time. `ggspectro {seewave}` (an alternative to `spectro {seewave}`) allows us to draw a spectrogram with the package `ggplot2`. Maybe some few things to note is that I have specified a [Fourier Transform window](https://en.wikipedia.org/wiki/Window_function) and used the [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (by setting fftw = TRUE) to speed up the process.

```{r message=FALSE}
# defining the window size (number of samples in a 32ms long segment)
N <- 32e-3*Fs

# computing a spectrogram of the audio showing how frequency varies with time
spect <- ggspectro(y, Fs, wl = N, wn = "hamming", ovlp = 50, fftw = T)+
  geom_tile(aes(fill = amplitude))+
  scale_fill_viridis_c()
ggplotly(spect)
  
```


The spectrogram appears as a heatmap with the various colours representing
power in decibels. High values are represented by yellow, low values by deep blue and intermediate values appear as green. It is evident that the digit was spoken in the interval between **0.4s** and **1s** where there are high values of frequency.

<br>

Now, let's divide the speech signal into segments 64ms, take the FFT of each segment and whip up a spectrogram. 

```{r message=FALSE}
# defining the window size (number of samples in a 32ms long segment)
N_2 <- 64e-3*Fs

# computing a spectrogram of the audio showing how frequency varies with time
spect <- ggspectro(y, Fs, wl = N_2, wn = "hamming", ovlp = 50, fftw = TRUE)+
  geom_tile(aes(fill = amplitude))+
  scale_fill_viridis_c()
ggplotly(spect)
```


Increasing the segment length to `64ms` consequently increases the `window length` to `512 samples`. This effect of this is that the `frequency resolution` increases. The frequency resolution is obtained by the ratio `Fs/N`.  So, for a given sampling frequency (Fs), the more samples (N) in the signal, the smaller the frequency increment between successive DFT data points. Since more points are being sampled, the spectral resolution increases.


### **In what frequency range is most of the energy of the signal?**

To answer this questions, we will have to come up with a magnitude spectrum of the speech waveform which shows the magnitude of the signal at a particular frequency. This means we will have to convert the signal from the time domain to a representation in the `frequency domain` by computing the `Discrete Fourier Transform` (using a FFT algorithm). Once we have the amplitude of the DFT and the frequency range, we'll be ready to go!


Due to mathematical convenience, the arrangement of the FFT output has
positive and negative frequencies side by side in a slightly non-intuitive order (the frequency increases from 0 until N/2 and then it decreases for negative frequencies). As the positive and negative frequency magnitudes are identical for a real signal it is common just to plot the positive frequencies from `0 to Fs/2` (in our case `0 to 4000 Hz`).



```{r}
# Discrete Fourier transform of data (DFT)
dft_speech <- fftw::FFT(y)

# amplitude of DFT
dft_amp <- abs(dft_speech) # DFT is a sequence of complex sinusoids 

# number of samples
n <- length(y)

# frequency range
f <- (seq_len(n)-1)*Fs/n

freq_plot <- ggplot(mapping = aes(x = f[1:Fs], y = dft_amp[1:Fs]))+
  geom_line(color = 'blue')+
  labs(x = "Frequency (Hz)", y = "Magnitude",
title = "Frequency domain representation of the entire speech sequence")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(freq_plot)
```


From the magnitude spectrum of the Discrete Fourier Transform, we can observe that most of the energy is concentrated in the frequency range **0** to **1 KHz**.


### **Time domain, Frequency domain and Energy comparison of a segment containing speech and one without**

The time domain representation of a speech signal shows how the sound changes in intensity over time


Earlier on in this analysis, we defined some sample numbers in regions containing speech and those without. First, we'll obtain a segment `32ms` long from a region with speech and another segment of the same duration from a segment without speech and plot these time domain signals side
by side with time in seconds as the x-axis. A `32ms` long segment corresponds to `256 samples`:

$Number\,of\,samples\, = \,segment\,length(sec) \times sampling\,frequency \,$

$= 0.032 \times 8000 = 256$

```{r}
# sample length
sample_len = 32e-3*Fs

# obtaining a 32ms long segment/256 samples without speech
moja_sl <- y[silence_start : (silence_start + sample_len - 1)]

# obtaining a 32ms long segment/256 samples with speech
moja_sp <- y[speech_start : (speech_start + sample_len - 1)]

# corresponding time-length of the sample (32ms long)
t <- (seq_len(sample_len) - 1) * 1/Fs

# plotting the waveform of the silent segment
wvfm_sl <- ggplot(mapping = aes(x = t, y = moja_sl)) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Time (s)", y = "Amplitude",
       title = "Time domain of silent segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(wvfm_sl)

# plotting the waveform of the segment with speech
wvfm_sp <- ggplot(mapping = aes(x = t, y = moja_sp)) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Time (s)", y = "Amplitude",
       title = "Time domain of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(wvfm_sp)

cowplot::plot_grid(wvfm_sl, wvfm_sp)

```

From the above plots above, the time domain plot of the segment containing speech has higher intensity (amplitude ranging from -0.27 to 0.24) than that without speech/faint speech due to noise (amplitude ranging from -0.02 to 0.06). 

The time domain plot of the segment containing speech is also more erratic due to more fluctuations in amplitude of sound.

One common metric that can be easily computed from the sample values is a `signal's energy`.

The energy of a discrete time signal is defined as the sum of the square of the sampled signal values:
$E(y) =\sum_{n=-\infty}^{\infty} y^2$

```{r}
# Energy of silent segment
E_sl <- sum(moja_sl^2)
E_sl

# Energy of segment with speech
E_sp <- sum(moja_sp^2)
E_sp
```

The silent segment has an energy of **0.1733** whereas the segment containing speech has an energy of **3.4409**. Due to higher intensities of sound in the region containing speech, the sample values in this region have higher amplitude values which in turn account for the higher energy. The energy in the silent segment could be attributed to noise and can be filtered out using an appropriate filter.

<br>

The time domain analysis can also be done in the frequency domain by computing the `Discrete Fourier Transform` (using a FFT algorithm) of the two segments.

```{r}
# DFT of silent segment
dft_sl <- fftw::FFT(moja_sl)

# DFT of speech segment
dft_sp <- fftw::FFT(moja_sp)

# frequency range
fs <- (seq_len(sample_len)-1)*Fs/sample_len

# plotting the waveform of the silent segment in freq domain
freqd_sl <- ggplot(mapping = aes(x = fs[1 : (sample_len)/2],
  y = abs(dft_sl[1: (sample_len)/2]))) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Amplitude",
       title = "Frequency domain of silent segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(freqd_sl)

# plotting the waveform of the speech segment in freq domain
freqd_sp <- ggplot(mapping = aes(x = fs[1 : (sample_len)/2], 
  y = abs(dft_sp[1: (sample_len)/2]))) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Amplitude",
       title = "Frequency domain of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

# ggplotly(freqd_sp)

cowplot::plot_grid(freqd_sl, freqd_sp)

```

When the two segments are converted from the time domain to a representation in the frequency domain the same inferences made earlier still hold. The segment containing speech has higher magnitude than that without speech at various frequencies.

`Signal power` as a function of frequency is another common metric used in signal processing. Power is the squared magnitude of a signal's Fourier transform, normalized by the number of frequency samples:

$P_{dft}\, =\, [\mid {(dft)}\mid^2]/N$

Let's plot a power spectrum of the segment containing speech, shall we?

```{r}
# power of the dft
power <-  ((abs(dft_sp))^2)/sample_len

# frequency range
fp <- (seq_len(sample_len)-1)*Fs/sample_len

# power spectrum
power_sp <- ggplot(mapping = aes(x = fp[1 : (sample_len)/2],
  y = power[1: (sample_len)/2])) +
  geom_line(color = 'blue', lwd = 1.1)+
  labs(x = "Frequency (Hz)", y = "Power",
       title = "Power Spectrum of speech segment")+
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(power_sp)
```


The plot indicates that the speech segment consists of a fundamental frequency around 156.25 Hz and a sequence of harmonics, where the third harmonic is emphasized.


Time to it a wrap! It's been quite the adventure, right 😊? 
Anyhow, we got to analyze a speech signal and at the same time cover some fundamental concepts in Digital Signal Processing.

This is by no means a thorough walk-through but I hope it will ignite some `R-DSP spark` or something.

Hoping to cover some more exciting concepts ... soon!

Be sure to check out great blogs, tutorials and other formats of R resources coming out every day at [RWeekly.org](https://rweekly.org/)!


Till then, 

Happy Learning 👩🏽‍💻 👨‍💻 👨🏾‍💻 👩‍💻 ,

[Eric (R_ic)](https://twitter.com/ericntay), Microsoft Learn Student Ambassador.


# **Reference Material**

* Andrew O. Finley, Jeffrey W. Doser, Vince Melfi, "Digital Signal Processing", in *R Programming for Data Sciences*. 2020

* I. V. McLoughlin, *Speech and Audio Processing: A MATLAB®-based approach*. 2016.

* MATLAB Documentation, *[Basic Spectral Analysis](https://www.mathworks.com/help/matlab/math/basic-spectral-analysis.html)*

* A. V Oppenheim and R. W. Schafer, “Discrete Time Signal Processing 2nd
Edition,” Book. 1998.

* Eric Wanjau, Samuel Kariuki, Ian Muchiri, Joshua Ndemenge, *[Analysis of Speech in the Time and Frequency Domain using MATLAB](https://github.com/R-icntay/DSP_MATLAB)*.

