The Open AIR Library is an online catalog of samples that can be downloaded and used as impulse responses to simulate real environments in reverb-based audio applications. The catalog is hosted by the University of York at www.openairlib.net and is the only online resource that provides detailed metadata and acoustic analysis along with each sample. Each entry has a unique URL, with its audio data analyzed by octave band and grouped by space category and generation type.
The goal of this project is to place the contents of the Open AIR Library into a database in order to find out which categories and locations of samples have the highest and lowest average reverb times.
Analysis of reverberation times would be useful to sound designers and audio engineers who want to make more informed decisions about the IR samples they use in a reverb application. If an engineer wants to transform a sound with particular sonic characteristics, they could narrow the IR selection down to just a few choices based on category, rather than relying on trial and error.
This project is reproducible and can be built using scripts from this GitHub repo.
The process of building the dataset begins with data extraction. As of this writing, the database contains 56 separate entries, each with its own unique URL. Each entry contains multiple tabs with different sets of information. A sample entry from the database can be found here.
The extraction focuses on two of these tabs: Information and Analysis. Almost all of the data extraction/web scraping is handled through the rvest and stringr libraries.
The first step in extracting the data was to gather the URLs so the page content for each entry could be downloaded systematically. This was done by building a web crawler that pulls the entry links from the database listing page and follows the "next" button until it runs out of pages.
crawler = function(){
  library(rvest)
  sess = html_session("http://www.openairlib.net/auralizationdb") # creates session object
  web_urls = sess %>% # pulls entry links from the first listing page
    html_nodes(".views_title a") %>%
    html_attr("href")
  while(TRUE){
    sess = try(follow_link(sess, "next"), silent = TRUE) # crawls pages until follow_link errors out
    if (!inherits(sess, "try-error")){
      web_urls = c(web_urls, html_attr(
        html_nodes(sess, ".views_title a"), "href"))
    } else {
      break # no "next" link left -- exit the loop
    }
  }
  return(web_urls)
}
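Since the later scripts load this function from an .rda file, the crawler is assumed to have been serialized once with save(); a minimal sketch of that one-time step:
save(crawler, file = "./crawler.rda") # assumed setup step, so extraction_script.R can load() it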
The extraction script uses this function to download each page into a local Html folder, with each file named after the final segment of its URL. Two entries were eliminated from the project because they did not contain any frequency data.
### extraction_script.R ###
load("./crawler.rda") # loads crawler
library(stringr)
### Extracts websites, filters out 2 bad entries and builds into a local directory
web_urls = crawler()
web_urls = web_urls[c(-37, -44)] # no freq data for these pages - eliminating
if (!dir.exists("Html")){
  dir.create("Html")
}
sapply(web_urls, function(i){
  download.file(
    paste("http://www.openairlib.net", i, sep=""),
    destfile = paste("./Html/",
                     str_extract(i, pattern="[^\\/]*$"), sep=""), # file named after final URL segment
    mode = 'wb'
  )
})
Now that the set of source pages is saved locally, they can be scraped using rvest. There are two helper functions: one to scrape the data from the Analysis tab and one to scrape the metadata for each location from the Information tab. The scraped data is then reshaped into a tall dataframe using dplyr and tidyr.
# freq_tab_builder function - scrapes Analysis tab
freq_tab_builder = function(url_txt){
  # returns dataframe of freq table data
  library(dplyr)
  library(rvest)
  page = read_html(url_txt)
  ## build the freq table
  tab = page %>% # pulls raw freq data from a table
    html_node("#analysis .field-items table") %>%
    html_table()
  tab = tbl_df(tab) # convert to dplyr table for convenience
  # separate even and odd rows
  response_type = tab[c(TRUE, FALSE), ] # odd rows -- response-type labels
  freq_data = tab[c(FALSE, TRUE), ] # even rows -- numeric values
  # this column parses as character, so convert it to numeric
  freq_data$`31.25 Hz` = as.numeric(freq_data$`31.25 Hz`)
  # keep only the first column (the response-type labels) and rename it
  response_type = response_type[1]
  names(response_type) = "Response_type"
  # bind the labels to freq_data
  freq_data = bind_cols(response_type, freq_data)
  return(freq_data)
}
## page_information function -- scrapes Information tab
page_information = function(url_txt){
  library(rvest)
  library(stringr)
  library(dplyr)
  library(tidyr)
  page = read_html(url_txt)
  title = page %>% # grabs page title
    html_node("title") %>%
    html_text() %>%
    str_replace(pattern="( \\| .*)", replacement="")
  tab_info = page %>% # pulls information fields
    html_nodes("#information .field-items") %>%
    html_text() %>%
    str_trim()
  info_headers = page %>%
    html_nodes("#information .field-label") %>%
    html_text() %>%
    str_replace(pattern=":", replacement="") %>%
    str_trim()
  names(tab_info) = info_headers
  page_info = c(
    title,
    tab_info["Source Sound"],
    tab_info["Source Sound Category"],
    tab_info["Input"],
    tab_info["Space Category"],
    tab_info["Generation Type"]
  )
  names(page_info) = NULL # remove names
  ## replicate page_info down 5 rows (one per response type) -- a 5 x 6 matrix
  n = c("Location", "Source Sound", "Source Sound Category", "Input",
        "Space Category", "Generation Type")
  page_info = tbl_df(
    as.data.frame(
      matrix(rep(page_info, each=5), nrow = 5, ncol = 6),
      stringsAsFactors = FALSE
    )
  )
  colnames(page_info) = n
  return(page_info)
}
The results of these two functions are combined by the build_single_location function, which outputs a tall dataframe of all the relevant information for each page.
build_single_location = function(url_txt){
  library(dplyr)
  library(tidyr)
  page = url_txt
  page_info = page_information(page)
  freq_data = freq_tab_builder(page)
  long_df = bind_cols(page_info, freq_data) # wide intermediate: 7 id columns + 10 octave bands
  ## tidyr -- gather the 10 octave-band columns into a tall df
  tall_df = long_df %>% gather("Octave_band", "Freq", 8:17)
  return(tall_df)
}
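For illustration, a hypothetical call on one of the locally saved pages (the file name below is illustrative, not an actual entry slug):
one_page = build_single_location("./Html/some-entry") # hypothetical path
# each page should yield 50 rows: 5 response types x 10 octave bands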
Finally, in the main script, the build_single_location function is applied to every Html document in the local folder, producing a list of dataframes. These are then combined with a loop and a call to dplyr::union on each frame to build the complete dataset.
For the sake of reproducibility, the main script also builds the Html folder in the working directory by downloading the html content directly from the project GitHub repo. All of the functions, along with an object that contains the filenames, are loaded directly into main.R.
load("./page_information.rda") # loads page_info helper
load("./freq_data_builder.rda") # loads freq table helper
load("./build_location.rda") # loads build-single location function
load("./page_names.rda") # loads page_names for file reference download
library(stringr)
library(dplyr)
### ::main script::
### builds local HTML folder from github content...
### creates file names from local Html folder
### builds dataframe using lapply
### saves resulting image for use with analysis script
# 1. Load Html files locally for scraping.
# Creates folder Html in working directory if it does not exist and loads
# content from project github site.
base_url = "https://raw.githubusercontent.com/bsnacks000/IS607_Final/master/Html/"
if (!dir.exists("Html")){
dir.create("Html")
sapply(page_names, function(i){
web_raw_url = paste(base_url,i,sep="")
download.file(
web_raw_url,
destfile=paste("./Html/", str_extract(i,pattern="[^\\/]*$"),sep=""),
mode = 'wb'
)
})
}
###
filenames = list.files("./Html", full.names = TRUE)
dfs = lapply(filenames, build_single_location) # builds list of dataframes
# build large dataframe for project master and sort by Location
large_df = dfs[[1]] # start with the first frame
dfs = dfs[-1] # pop the first one from the list
for (i in seq_along(dfs)){
  large_df = dplyr::union(large_df, dfs[[i]]) # union on each frame
}
large_df = dplyr::arrange(large_df, Location) # arrange by Location, ascending
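An equivalent, more compact formulation would fold union over the list in a single call; a sketch, meant to replace the loop above (it assumes dfs still holds the full, unpopped list):
large_df = Reduce(dplyr::union, dfs) # fold union across the whole list
large_df = dplyr::arrange(large_df, Location)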
The result is the final dataset, displayed three columns at a time:
large_df[1:3]
## Source: local data frame [2,700 x 3]
##
## Location Source Sound Source Sound Category
## (chr) (chr) (chr)
## 1 Abernyte Grain Silo Balloon pop Balloon Pop
## 2 Abernyte Grain Silo Balloon pop Balloon Pop
## 3 Abernyte Grain Silo Balloon pop Balloon Pop
## 4 Abernyte Grain Silo Balloon pop Balloon Pop
## 5 Abernyte Grain Silo Balloon pop Balloon Pop
## 6 Abernyte Grain Silo Balloon pop Balloon Pop
## 7 Abernyte Grain Silo Balloon pop Balloon Pop
## 8 Abernyte Grain Silo Balloon pop Balloon Pop
## 9 Abernyte Grain Silo Balloon pop Balloon Pop
## 10 Abernyte Grain Silo Balloon pop Balloon Pop
## .. ... ... ...
large_df[4:6]
## Source: local data frame [2,700 x 3]
##
## Input Space Category Generation Type
## (chr) (chr) (chr)
## 1 Balloon pop Chamber Real World
## 2 Balloon pop Chamber Real World
## 3 Balloon pop Chamber Real World
## 4 Balloon pop Chamber Real World
## 5 Balloon pop Chamber Real World
## 6 Balloon pop Chamber Real World
## 7 Balloon pop Chamber Real World
## 8 Balloon pop Chamber Real World
## 9 Balloon pop Chamber Real World
## 10 Balloon pop Chamber Real World
## .. ... ... ...
large_df[7:9]
## Source: local data frame [2,700 x 3]
##
## Response_type Octave_band Freq
## (chr) (fctr) (dbl)
## 1 Reverberation Time RT60 T30 (seconds) 250 Hz 7.93
## 2 Clarity C50 (dB) 2 kHz -3.79
## 3 Clarity C50 (dB) 4 kHz -1.50
## 4 Early Decay Time EDT (seconds) 250 Hz 7.52
## 5 Reverberation Time RT60 T30 (seconds) 2 kHz 3.48
## 6 Reverberation Time RT60 T30 (seconds) 62.5 Hz 11.51
## 7 Clarity C80 (dB) 250 Hz -11.11
## 8 Clarity C80 (dB) 4 kHz -0.99
## 9 Definition D50 125 Hz 0.06
## 10 Clarity C80 (dB) 8 kHz 5.42
## .. ... ... ...
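As a quick sanity check, the row count matches the construction: 54 entries (56 minus the two dropped) times 5 response types times 10 octave bands gives 2,700 rows.
stopifnot(nrow(large_df) == 54 * 5 * 10) # 2,700 rows expected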
In order to find out which combinations of spaces and locations contained the highest and lowest reverberation times, I used dplyr to subset and aggregate the large dataset constructed by the main script. I eliminated Source Sound, Input, and Generation Type from the dataset and filtered the results to include only the reverberation times across each octave band. Some cleaning was required to remove newlines and whitespace from the result sets.
load("current_main.rda") # load image from main.R
library(dplyr)
library(ggplot2)
library(stringr)
## Remove cols and filter Reverb times
reverb_df = large_df %>%
  select(-c(`Source Sound`, `Input`, `Generation Type`)) %>%
  filter(Response_type == "Reverberation Time RT60 T30 (seconds)")
# need to strip \n and whitespace for correct display
reverb_df[1:4] = lapply(reverb_df[1:4], function(i){
  str_replace_all(i, fixed(" "), "")
})
reverb_df[1:4] = lapply(reverb_df[1:4], function(i){
  str_replace_all(i, "[\n]", "")
})
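The two passes could also be collapsed into a single regex substitution, since \s matches spaces and newlines alike; an equivalent sketch:
reverb_df[1:4] = lapply(reverb_df[1:4], function(i){
  str_replace_all(i, "\\s+", "") # strip all whitespace in one pass
})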
The resulting dataframe was separately grouped by Space Category and by Location. Reverb times were then averaged to produce the following result sets.
# Reverb avg by space category
category_avg_df = reverb_df %>%
  group_by(`Space Category`) %>%
  summarise(avg_reverb = mean(Freq)) %>%
  arrange(desc(avg_reverb)) %>%
  na.omit()
# Reverb avg by location
location_avg_df = reverb_df %>%
  group_by(`Location`) %>%
  summarise(avg_reverb = mean(Freq)) %>%
  arrange(desc(avg_reverb))
# By location plot
ggplot(data=location_avg_df,
       aes(x=reorder(Location, avg_reverb), y=avg_reverb)) +
  geom_bar(stat="identity") +
  coord_flip()
## By category plot
ggplot(data=category_avg_df,
       aes(x=reorder(`Space Category`, avg_reverb), y=avg_reverb)) +
  geom_bar(stat="identity") +
  coord_flip()
I also wanted to look at how the source sound, alongside the space category, might play a role in average reverberation time.
# Reverb avg by source and space category
cat_source_avg_df = reverb_df %>%
  group_by(`Source Sound Category`, `Space Category`) %>%
  summarise(avg_reverb = mean(Freq)) %>%
  arrange(`Space Category`, `Source Sound Category`, desc(avg_reverb)) %>%
  na.omit()
## By category and source input category plot
ggplot(
  data=cat_source_avg_df,
  aes(x=reorder(`Space Category`, avg_reverb),
      y=avg_reverb, fill=`Source Sound Category`)) +
  geom_bar(stat="identity") +
  coord_flip()
Based on the above results, space categories that are some type of hall seem to produce the longest reverb times in seconds. To find the distribution of hall reverberation over each octave band, the dataset is filtered to the top three hall-type categories (the labels appear with their whitespace removed because of the earlier cleaning step) and the results averaged across octave bands.
halls = c("Hall", "HallSportsHall", "ConcertHall")
hall_octaves = reverb_df %>%
  filter(`Space Category` %in% halls) %>%
  group_by(Octave_band) %>%
  summarise(avg_reverb = mean(Freq))
ggplot(
  data=hall_octaves,
  aes(x=Octave_band, y=avg_reverb)) +
  geom_bar(stat="identity") +
  coord_flip()
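One caveat: because Octave_band comes from the table's column headers, its default factor order will not match ascending frequency on the plot axis. A sketch that reorders the levels before plotting; the full set of band labels is assumed from the site's table, since only some of them appear in the printed output above:
band_levels = c("31.25 Hz", "62.5 Hz", "125 Hz", "250 Hz", "500 Hz",
                "1 kHz", "2 kHz", "4 kHz", "8 kHz", "16 kHz") # assumed full set of 10 bands
hall_octaves$Octave_band = factor(hall_octaves$Octave_band, levels = band_levels)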
The results of the exploratory analysis show that, in general, halls and cathedrals tend to produce the most overall reverb, while rooms and open-air spaces produce the least. Among the locations in the dataset, Terry's Factory Warehouse produced the most overall reverberation, with averages close to 30 seconds. The swept-sine signal accounted for many of the observations across the dataset, including a substantial portion of the top three categories.
In terms of distribution, lower-frequency reverberation constituted the greatest proportion of the overall reverb for the top three space categories. This implies that spaces with higher overall reverb might have a stronger low-frequency response, though more research is needed to draw a definitive conclusion.