Learn & Grow: Hearing-aid data logging for field trials
The following notebook presents code that you can copy/paste in R. The code will load an SQLite file for the “Evaluate HA” EMA app and analyse the hearing-aid logging data.
First, the relevant libraries need to be installed/loaded in R.
if (!require('RSQLite'))
install.packages('RSQLite')
library('RSQLite')
if (!require('ggpubr'))
install.packages('ggpubr')
library('ggpubr')
if (!require('lubridate'))
install.packages('lubridate')
library('lubridate')
if (!require('dplyr'))
install.packages('dplyr')
library('dplyr')
if (!require('tidyr'))
install.packages('tidyr')
library('tidyr')
if (!require('rstudioapi'))
install.packages('rstudioapi')
library('rstudioapi')
if (!require('reshape2'))
install.packages('reshape2')
library('reshape2')
if (!require('ggthemes'))
install.packages('ggthemes')
library('ggthemes')
theme_set(new = theme_few())
Next, locate and load the SQLite file. You can find a file under the “Learn&Grow” folder in the EWB SharePoint. The file is extracted from a participant's iPhone. The following code will open a dialog window for selecting the SQLite file.
path <- rstudioapi::selectFile(caption = "Select SQLite File",
filter = "SQLite Files (*.sqlite)",
existing = TRUE)
The “path” variable contains the full path to the SQLite file. Using the “RSQLite” library, the file can be read into R as follows.
# connect to SQLite file
con <- dbConnect(drv = RSQLite::SQLite(),dbname = path)
# extract tables from file
tables <- dbListTables(con)
# convert tables to a list of R data frames
lDataFrames <- vector("list", length = length(tables))
for (i in seq(along = tables)) {
lDataFrames[[i]] <-
dbGetQuery(conn = con,
statement = paste("SELECT * FROM '", tables[[i]], "'", sep = ""))
}
# disconnect from SQLite file
dbDisconnect(con)
All tables from the SQLite file will now be in the “lDataFrames” variable, which is an R list of data frames. Tables can then be queried as regular data frames in R. For example, “lDataFrames[[6]]” indexes table 6, which holds the EMA responses.
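Which index holds which table can vary between exports, so it can be useful to list the table names next to their indices before picking one (a small sketch using the “tables” vector created above):
# show each table name next to its index in lDataFrames
print(data.frame(index = seq_along(tables), name = tables))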
To make it easier to handle the data in R and to prepare it as input for statistical functions, a new data frame “df” gets assigned the relevant columns for further processing.
df <-
data.frame(
Timestamp = as.character(lDataFrames[[9]]$ZSOUNDSCAPETIMESTAMP),
SPL = lDataFrames[[8]]$ZINPUTAWLEDARWIN,
NL1 = lDataFrames[[8]]$ZINPUTNL1DARWIN,
NL2 = lDataFrames[[8]]$ZINPUTNL2DARWIN,
NL3 = lDataFrames[[8]]$ZINPUTNL3DARWIN,
NL4 = lDataFrames[[8]]$ZINPUTNL4DARWIN,
SNR1 = lDataFrames[[8]]$ZINPUTSNR1,
SNR2 = lDataFrames[[8]]$ZINPUTSNR2,
SNR3 = lDataFrames[[8]]$ZINPUTSNR3,
SNR4 = lDataFrames[[8]]$ZINPUTSNR4,
Volume = lDataFrames[[9]]$ZVOLUME,
Program = lDataFrames[[9]]$ZPROGRAM,
Activity = lDataFrames[[9]]$ZACTIVITY
)
The data frame “df” now contains all samples of the selected columns, which in this case are all the sound-data logs plus the Volume, Program and Activity logs. The timestamps are saved in a format R does not recognize, so they need to be transformed to R timestamps of the “POSIXct” class. It is then straightforward to use functions from the “lubridate” library to compute various timestamp aggregations or transformations.
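The exact conversion depends on how the app stores the timestamps. A minimal sketch, assuming they are stored as text in a standard “YYYY-MM-DD HH:MM:SS” format (if they are numeric offsets instead, pass the appropriate origin to “as.POSIXct()”):
# convert timestamp strings to POSIXct so lubridate functions and date arithmetic work
df$Timestamp <- as.POSIXct(df$Timestamp, tz = "UTC")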
As the SQLite file also contains EMAs, we can extract the relevant columns for those separately into the “df.ema” data frame.
df.ema <-
data.frame(
Timestamp = as.character(lDataFrames[[6]]$ZSOUNDSCAPETIMESTAMP),
Program = lDataFrames[[6]]$ZPROGRAMNUMBER,
ListeningActivity = lDataFrames[[6]]$ZQ5PICKERVIEW,
Q1 = lDataFrames[[6]]$ZANSWERLOUDNESSP1,
Q2 = lDataFrames[[6]]$ZANSWERNOISINESS,
Q3 = lDataFrames[[6]]$ZANSWERSOUNDQUALITY,
Q4 = lDataFrames[[6]]$ZANSWERSUITABILITYP1
)
As before, timestamps need to be transformed.
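Again as a sketch, under the same assumption about the stored format:
# convert the EMA timestamps to POSIXct as well
df.ema$Timestamp <- as.POSIXct(df.ema$Timestamp, tz = "UTC")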
The two separate data frames, one with sound-data logs occurring every 20 seconds and one with EMAs occurring whenever the participant initiated them, can be merged using the “plyr” library. This will bind them together so we have both sound data and EMA data in one data frame, and it will simply put NAs into columns that are not present in both data frames.
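A minimal sketch of the merge; “plyr::rbind.fill()” is called through the namespace so that plyr does not mask any dplyr functions loaded above:
# install plyr if needed (not attached, to avoid masking dplyr functions)
if (!requireNamespace('plyr', quietly = TRUE))
install.packages('plyr')
# bind sound-data logs and EMAs into one data frame; columns present in only
# one of the data frames are filled with NA in the other
df <- plyr::rbind.fill(df, df.ema)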
As it can be helpful to aggregate data across time windows or days, new columns with “Hour” and “Date” information are added using the lubridate library. The “Hour” column here is rounded to 5-minute bins (this can be adjusted), so we can use it to aggregate data for each 5 minutes of the day.
df$Hour <- (round((hour(df$Timestamp)*60+minute(df$Timestamp))/5)*5)/60
df$Date <- date(df$Timestamp)
In order to link EMAs with the sound data, we need to aggregate the sound-data logs prior to each EMA. Usually, I use a 5-minute window since I instruct participants to think about the last 5 minutes when completing an EMA. But there is no gold standard; some research studies have used windows of up to 30 minutes.
# identify the rows in df that belong to an EMA and therefore do not have sound data
ema_index <- which(!is.na(df$Q1))
# initiate variables to hold the aggregated sound data
df$SPL_5MIN_median <- df$SNR1_5MIN_median <- df$NL1_5MIN_median <- NA
# run through all EMAs
for(i in ema_index){
# select data 5 minutes prior to EMA
tmp <- df[df$Timestamp <= df$Timestamp[i] & df$Timestamp >= (df$Timestamp[i] - 5*60),]
# aggregate data
df$SPL_5MIN_median[i] <- median(tmp$SPL, na.rm = T)
df$SNR1_5MIN_median[i] <- median(tmp$SNR1, na.rm = T)
df$NL1_5MIN_median[i] <- median(tmp$NL1, na.rm = T)
}
# remove variables from workspace
rm(con,df.ema,lDataFrames,tmp,ema_index,i,path,tables)
# order data frame by time
df <- df[order(df$Timestamp),]
Before association analysis, I usually inspect the data for irregularities. This could be visual inspection or summary statistics. The code below will plot density distributions (using a Gaussian kernel) of the SNR variables before and after excluding the most extreme 0.2% of values (0.1% in each tail).
# format to long-format
df.plot <- melt(df,id.vars = 'Timestamp',measure.vars = c('SNR1','SNR2','SNR3','SNR4','SNR1_5MIN_median'))
# compute plot of SNR with all data
p1 <- ggplot(df.plot) + geom_density(aes(x=value,fill=variable),alpha=0.2)+labs(x='SNR (dB)',title='All logs')
# now, remove outliers using extreme quantiles
QR <- quantile(df.plot$value,c(0.001,0.999),na.rm = T)
df.plot$value[df.plot$value<QR[1] | df.plot$value>QR[2]] <- NA
# compute plot of SNR with cleaned data
p2 <- ggplot(df.plot) + geom_density(aes(x=value,fill=variable),alpha=0.2)+labs(x='SNR (dB)',title='Extreme values removed')
p <- ggarrange(p1,p2,ncol=2,common.legend = T)
Now, plot the distributions.
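Printing the arranged plot object displays the two panels:
p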
We can also plot the raw observed SPL together with the median SPL aggregated for each EMA. This way, we can check whether the SPL associated with each EMA is representative of the SPL observed during daily life when not performing EMAs.
ggplot(df) + geom_density(aes(x=SPL,fill='SPL'),alpha=0.2) + geom_density(aes(x=SPL_5MIN_median,fill='SPL_5MIN_median'),alpha=0.2)+labs(fill='',x='SPL (dBA)')
In the figure above, the distribution of SPL during normal wear time is plotted in red and the distribution of median SPL observed in the 5-min time window prior to each EMA is shown in blue. Except at the extreme (>70 dB SPL), they overlap, which indicates that the listening conditions associated with EMAs are representative of everyday life. If the blue distribution had been centered around lower SPLs, this would indicate that the participant only performed EMAs in very quiet situations.
Besides the distributions, it can be helpful to inspect data across the hours of the day. This shows how well the data cover a typical day and how the sound exposure fluctuates across the hours. Typically, in a cross-over design, it is important to consider whether the sound exposure (i.e., the levels across the hours of the day) is comparable between test periods and participants.
Below, we compute the median and inter-quartile range of SPL and SNR per hour.
df.hour <- aggregate(SPL~Hour,FUN=function(x)quantile(x,0.5),data=df)
df.hour$SPL_25 <- aggregate(SPL~Hour,FUN=function(x)quantile(x,0.25),data=df)$SPL
df.hour$SPL_75 <- aggregate(SPL~Hour,FUN=function(x)quantile(x,0.75),data=df)$SPL
df.hour$SNR <- aggregate(SNR1~Hour,FUN=function(x)quantile(x,0.5),data=df)$SNR1
df.hour$SNR_25 <- aggregate(SNR1~Hour,FUN=function(x)quantile(x,0.25),data=df)$SNR1
df.hour$SNR_75 <- aggregate(SNR1~Hour,FUN=function(x)quantile(x,0.75),data=df)$SNR1
Then, plot it using ggplot line plots.
p1 <- ggplot(df.hour)+
geom_line(aes(x=Hour,y=SPL,linetype='Median'))+
geom_point(aes(x=Hour,y=SPL,color='Median'))+
geom_line(aes(x=Hour,y=SPL_75,linetype='75% percentile'))+
geom_point(aes(x=Hour,y=SPL_75,color='75% percentile'))+
geom_line(aes(x=Hour,y=SPL_25,linetype='25% percentile'))+
geom_point(aes(x=Hour,y=SPL_25,color='25% percentile'))+
labs(x='Hour',y='SPL (dBA)',linetype='',color='')+
grids('xy')
p2 <- ggplot(df.hour)+
geom_line(aes(x=Hour,y=SNR,linetype='Median'))+
geom_point(aes(x=Hour,y=SNR,color='Median'))+
geom_line(aes(x=Hour,y=SNR_75,linetype='75% percentile'))+
geom_point(aes(x=Hour,y=SNR_75,color='75% percentile'))+
geom_line(aes(x=Hour,y=SNR_25,linetype='25% percentile'))+
geom_point(aes(x=Hour,y=SNR_25,color='25% percentile'))+
labs(x='Hour',y='SNR (dB)',linetype='',color='')+
grids('xy')
p <- ggarrange(p1,p2,ncol=2,common.legend = T)
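As before, printing the arranged plot object displays the two panels:
p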
Finally, we can do a very basic association analysis hypothesizing that answers on the EMAs are influenced by the SPL and SNR at the time of completion.
# first, scale each predictor
df$SPLz <- as.numeric(scale(df$SPL_5MIN_median))
df$SNRz <- as.numeric(scale(df$SNR1_5MIN_median))
# then, model simple interaction for each EMA question
fm1 <- lm(Q1~SNRz*SPLz,data=df) # "How important is it to listen?"
fm2 <- lm(Q2~SNRz*SPLz,data=df) # "How difficult is it to listen?"
fm3 <- lm(Q3~SNRz*SPLz,data=df) # "How much effort do you need to listen?"
fm4 <- lm(Q4~SNRz*SPLz,data=df) # "How (mentally) tired are you?"
The R library “sjPlot” includes some nice functions for making inferences on regression models. For example, it can produce nice tables with coefficients and statistics as shown below.
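A minimal sketch, assuming the “sjPlot” package; its “tab_model()” function produces an HTML table with coefficients, confidence intervals and p-values similar to the one shown below:
if (!require('sjPlot'))
install.packages('sjPlot')
library('sjPlot')
# summary table for the Q1 ("Importance") and Q2 ("Difficulty") models
tab_model(fm1, fm2)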
Predictors | Q1: Estimates | Q1: CI | Q1: p | Q2: Estimates | Q2: CI | Q2: p
---|---|---|---|---|---|---
(Intercept) | 3.66 | 3.38 – 3.94 | <0.001 | 4.64 | 4.13 – 5.14 | <0.001
SNRz | 0.00 | -0.29 – 0.30 | 0.976 | 1.11 | 0.59 – 1.64 | <0.001
SPLz | 0.81 | 0.49 – 1.13 | <0.001 | 0.04 | -0.54 – 0.61 | 0.896
SNRz × SPLz | 0.29 | 0.04 – 0.55 | 0.025 | 0.07 | -0.39 – 0.53 | 0.762
Observations | 127 | | | 127 | |
R2 / R2 adjusted | 0.202 / 0.182 | | | 0.160 / 0.139 | |
Evidently, for Q1 (‘Importance’), higher SPL is related to higher ratings, while the interaction suggests that this effect is even stronger at high SNR. For Q2 (‘Difficulty’), a positive association with SNR suggests that higher ratings of difficulty are usually related to higher SNRs.
Interaction effects in particular can be a bit tricky to understand, so plotting them helps with interpreting the direction of effects.
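For example, a small sketch using sjPlot's “plot_model()” (loaded above), which plots the model-predicted ratings for the interaction term:
# plot predicted Q1 ratings across SNRz at representative levels of SPLz
plot_model(fm1, type = "int")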
Dynamic visualizations
For fun, data can be presented in an animated plot (GIF). For demonstration purposes, this can sometimes make it easier to convey insights from multidimensional data.
The libraries for this are “gganimate”, for producing the images that go into the GIF, and “gifski”, a renderer for compiling the images into a GIF.
if (!require('gganimate'))
install.packages('gganimate')
library('gganimate')
if (!require('gifski'))
install.packages('gifski')
library('gifski')
The code below will aggregate SPL and SNR across each 15-minute window of the day and then aggregate these across weekdays. Rendering the GIF using “animate()” can take some time (minutes) if the frames per second (fps) and level of detail (currently detail = 2) are high, but these parameters also make the GIF play more smoothly. When running the code, I suggest lowering the values (for example, fps = 10) to see the output quickly.
# re-compute time-window for averaging data
df$Hour <- (round((hour(df$Timestamp)*60+minute(df$Timestamp))/15)*15)/60
# extract weekday names from the timestamps (weekdays() is base R; lubridate's wday() is an alternative)
df$Weekday <- weekdays(df$Timestamp)
df$Weekday <- as.factor(df$Weekday)
# re-order factor levels so the week runs Monday to Sunday (the index order might need changing if using English weekdays)
df$Weekday <- factor(df$Weekday, levels = levels(df$Weekday)[c(3,6,4,7,1,2,5)])
# aggregate SPL and SNR across the time-widows and weekdays
df.animate <- aggregate(.~Hour+Weekday,FUN=function(x)median(x,na.rm = T),data=df[,c('Hour','Weekday','SPL','SNR1')])
# make some nice colors
custom.col <- c("#FFDB6D", "#C4961A","#D16103", "#C3D7A4", "#52854C", "#4E84C4", "#293352")
# make a simple scatterplot
graph1 <- ggplot(df.animate)+
geom_point(aes(x=SNR1,y=SPL,color=Weekday),shape=16,alpha=1,size=6)+
coord_cartesian(xlim = c(-20,20),ylim = c(35,85))+
labs(x='SNR (dB)',y='SPL (dBA)',color='Weekday')+
theme_few()+
grids('xy')+
scale_color_manual(values = custom.col)+
geom_hline(aes(yintercept=mean(SPL)),linetype='dashed')+
geom_vline(aes(xintercept=mean(SNR1)),linetype='dashed')
# add the animation information to the plot ("transition_time" defines what variable is used for timing)
graph1.animation = graph1 +
transition_time(Hour)+
labs(subtitle = paste("Hour of day: {round(frame_time)}",':00',sep = ""),x='SNR (dB)',y='SPL (dBA)')+
shadow_mark(past = T,size=3,alpha=0.3)
# render the GIF for displaying or saving
pani1 <- animate(graph1.animation, height = 1200, width = 1500, fps = 40, duration = 20,detail = 2,
end_pause = 30, res = 300,renderer = gifski_renderer())
Then, display the GIF.
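A small sketch for displaying and saving the result; the file name is just an example:
# print the animation object to display it in the viewer
pani1
# optionally save it to disk using gganimate's anim_save()
anim_save("spl_snr_by_weekday.gif", animation = pani1)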