Audio Recording and Transcription via Azure Cognitive Services using R

Author

Ian Muchiri

This notebook contains a step by step guide on how to record audio in R using the audio package and convert the recording to text using Microsoft Azure Cognitive Services, specifically the speech-to-text service.

1 Introduction

I recently found myself working on a chess play automation project. The user would issue a voice command describing the location of the piece they want to move and their desired destination location, the data would then processed and a chess move made. This was so cool and being the R-nista that I am (i don’t think this is a word, i have only come across Pythonistas ) ,I thought to myself, maybe this can also be done in R and voila, this article was born. For any feedback feel free to reach out at Ian

2 Recording Audio in R using the audio Package

We begin with the input which we plan to transcribe. In order to record audio,we need to install and load the audio package

#install the audio package
install.packages("audio")

We then proceed to setup the audio driver which we will use to record as shown below:

#Load the library
library(audio)
#Check which audio driver is set
current.audio.driver()
#if there is no driver set, view what drivers are available
audio.drivers()
#from the list provided, set which driver to use (default driver is always best)
set.audio.driver(NULL)# sets the default audio driver
#option 2
set.audio.driver (insert_name_here ) #sets the audio driver to a driver other than the default
#Checks and verifies that indeed the audio driver is set as per the command above
current.audio.driver() 

2.1 Recording Audio

Some of the key parameters in audio processing which we are going to encounter during this exercise include:

Sample rate: This is the number of times per second a sound is sampled and recorded.Therefore, if we use a sampling rate of say 8000 Hertz, a recording with a duration of 5 seconds will contain 40,000 samples (8000 * 5). The industry standard sampling rate commonly used is 44100 Hertz.

Mono vs Stereo: If you are a sound enthusiast, you’ve probably come across these terms. Simply put, Mono sound is recorded and played back using only one audio channel e.g. a guitarist recording using one mic to pick up sound of the guitar and Stereo sound is recorded and played through more than one channel.

In our case we will use one audio channel(mono) and a sampling rate of 44100Hz. That being said let’s start our recording

#Set our recording time
rec_time <- 5 

#Recording
Samples<-rep(NA_real_, 44100 * rec_time) #Defining number of samples recorded
print("Start speaking")
audio_obj <-record(Samples, 44100, 1) #Create an audio instance with sample rate 44100 and mono channel
wait(6)
rec <- audio_obj$data # get the recorded data from the audio object
 
#Save the recorded audio
file.create("sample.wav")#gets created in your current working directory 
save.wave(rec,"Insert_path_to_sample.wav_here")

On recording and saving the audio file to sample.wav , we have to clear the audio instance object audio_obj before proceeding to make the next recording. From the audio package documentation, this is achieved by using the close(con,…) method, where con is the audioInstance object .However, using this method proved to be cumbersome as it causes the console to freeze anytime you want to record audio for more than one time. After doing some research, I discovered that restarting the console clears the audio instance object therefore allowing for one to record audio multiple times with no issue. I know this is not an elegant solution (more like tying duct tape around a leaking pipe )and i am actively looking for a better solution to fix this issue. For now, we go by the saying: if it works, don’t touch it and implement this step to clear the audio instance object.

.rs.restartR() # clear the audio instance object(looking for a more elegant solution)
wait(3)

Play the recording we just created just to confirm that indeed we recorded something.

play(rec)

That’s it for our input. We now proceed to process this and transcribe the audio we just recorded

3 House Keeping on Azure

First we begin with some light housekeeping, i.e setting up a Cognitive Services resource in our Azure subscription. You can create a free Azure account here and if you have an account already, you can skip this step.

We then proceed to create a cognitive resource group by following these steps:\\ 1. Open the Azure portal https://portal.azure.com and sign in using your Microsoft account.\\ 2. Click + Create a resource and on the search bar, type Speech, click create and create a Cognitive service resource with the following settings:

  • Subscription: Enter your Azure subscription

  • Resource group: Create one with a unique name or an existing one

  • Region: Choose any

  • Name: Enter a unique name

  • Pricing tier: Standard S0

  • Select I have read and understood the notices

  • Click on Review + create and we are good to go.

As a guide, these are the settings I used however, you can go ahead and get creative with the names.

4 Speech-to-text API call

Since now we have our audio sample input sample.wav ,we can now go ahead to the transcribing bit. It is noteworthy that the speech to text rest API for short audio has several limitations which are:

  • Requests cannot contain more than 60 seconds of audio. For batch transcriptions and custom speech use Speech to Text API v3.0

  • The API does not provide partial results.

  • It does not support Speech translation. For translation, you can checkout the Speech SDK

Now that we are done with that, we can then go ahead and get coding.

Note: To access the speech service we just created from R, we need the URL of the endpoint, location/region details and KEY 1 & 2 parameters. These can be found in the azure portal under Keys and Endpoint page of your cognitive service resource as illustrated below:

Copy these details and save them since we are going to need them when make our API call. Note: you can use either KEY 1 or KEY 2 as your subscription key

To begin our transcription, install the packages we are going to need if you don’t have them already installed:

#Enables us to work with HTTP in R
install.packages("httr") 
install.packages("jsonlite")

Note You will have to modify the subscription key, language and data path parameters to match the ones on the Cognitive Services resource you created earlier. Also modify the URL 'https://eastus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1' by inserting the aforementioned location details as shown in the modified URL : 'https://_ENTER_LOCATION_HERE_.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1' . At the end you should have something like this:

library(httr)
library(jsonlite)  
#Documentation on the two packages can be found by running ?httr and ?jsonlite commands respectively on the console.

#Define headers containing subscription key and content type
headers = c(
  `Ocp-Apim-Subscription-Key` = '_ENTER_YOUR_SUBSCRIPTION_KEY_HERE',  #Key 1 or 2
  `Content-Type` = 'audio/wav' #Since we are transcribing a WAV file
)

#Create a parameters list, in this case we specify the languange parameter
params = list(
  `language` = 'en-US'
)

#Enter path to the audio file we just recorded and saved
data = upload_file ('Insert_path_to_sample.wav_here') 

#Make the API call and save the response recived
response <- httr::POST(url = 'https://eastus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1', 
httr::add_headers(.headers=headers), query = params, body = data)

#Convert response received to a dataframe
result <- fromJSON(content(response, as  = 'text')) 
txt_output <- as.data.frame(result)

#Extract transcribed text
txt_input <- txt_output[1,2]
txt_input

And with that, we have successfully recorded audio and transcribed it with the aid of Microsoft Azure Cognitive Services. The applications of the cognitive services are wide and this is just but a glimpse of what one can accomplish using the Speech to Text service. I’ll leave it at that. Thank you foR youR time. Cheers.