Load libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Defining signal and noise

We will simulate a measurement in this excise. The measurement is observation of peaks that could be in any setting such as chromatography, spectroscopy or mass spectrometry.

Noise is defined as variation of observations and is quantified by standard deviation. Lets assume we have two Gaussian peaks in our measurements.

define a Gaussian function

peak= function (sigma, mu, height, x) {
  return(height * exp(-(x-mu)^2/(2*sigma^2)))
}

lets simulate a noise-free chromatogram with two peaks but a baseline above zero

times= seq(0, 50, 0.1)

height1=100
sigma1= 2
mu1=15

height2=80
sigma2=3
mu2=30

baseline= 20

trueoutput =  baseline + peak(sigma1, mu1, height1, times) + peak(sigma2, mu2, height2, times)

graph=ggplot()+
  geom_line(aes(x=times, y=trueoutput)) + labs(x="time(s)", y="signal")

ggplotly(graph, dynamicTicks=TRUE)

Good data acqusition practices

To quantitatively capture peaks, there should be about 10 points across FWHM of peaks. This translates to frequency of data acquisition in chromatography. What frequency would you choose for the above chromatogram? Was the data acquisition sufficient in the above data?

FWHM is about 5 seconds, this means to collect data every 0.5 seconds to collect 10 data points

Now lets add some random noise to this chromatogram

We will assume that the output is measured with RSD=5% variation

noisyoutput = trueoutput * rnorm(length(trueoutput), 1, 0.05)

graph=ggplot()+
  geom_line(aes(x=times, y=noisyoutput)) + labs(x="time(s)", y="signal")

ggplotly(graph, dynamicTicks=TRUE)

calculate noise

To detect these compounds, the height of the peak should be differentiated from baseline. So the signal is height of the peak. Noise is the variation of baseline where the peak of interest is. But in this experiment we do no not have data about variation of baseline at the elution time of each peak. We can estimate that variation of baseline at the peak position can be estimated fom the variation of baseline next to each peak where there is no other compound.

Finding desired values:

find the index of time values you are interested in. Then use the indices to pull the values from the vector of output.

desiredindex = times < 6 & times >1
sd(noisyoutput[desiredindex])
## [1] 0.9656406

More tidy way of doing the same using tidyvrese

To work with the data more easily we fist place it into a dataframe in R. The package tidyverse helps streamline filtering.

mydata= data.frame(time=times, output=noisyoutput)

now we filter the data for the desired time and calculate the noise from standard deviation

noise = sd(mydata %>% filter (time < 6 & time >1) %>% pull(output))
noise
## [1] 0.9656406

How do we improve signal to noise?

We can take several measurements and average the values.

# define a function that simulates several trials and averages the output of the measurements

averageTrials = function (trials=10, RSD=5){

  # create a blank vector the length of the output
  totaloutput=numeric(length(trueoutput))

for (i in 1:trials){
  totaloutput= totaloutput + trueoutput * rnorm(length(trueoutput), 1, 5/100)
}

  return (totaloutput/trials)
  
}

Simulate 10 trials and plot the data and calculate the SD

averageoutput = averageTrials(25, 5)

graph = ggplot()+
  geom_line(aes(x=times, y=averageoutput ))

ggplotly(graph, dynamicTicks = TRUE)

now calculate sd

desiredindex = times < 6 & times >1
sd(averageoutput[desiredindex])
## [1] 0.1762032

How does noise depend on the number of trials averaged? try 4, 10, 20, 40, 80, 200 averages. What do you conclude?

As the number of trials increases, the noise decreases. I am assuming this is because of the decreasing standard deviation that decreases based on the number of trials. It looks like an exponential decay curve when you look at the number of trials over the standard deviation curve.

Mathematical S/N improvement

Graph SD as a function of # of trials

Doing experiments is costly and we may not be able to repeat the experiment easily. So we are interested in ways of improving S/N without doing additional experiments.

One way is to assume that the points adjacent to each other have similar values and noise and thus each point can be replaced with a point whose value is the average of the adjacent points to it. So we apply an averaging within a window centered at the point of interest. Then we move the window one point at a time. This is called moving average smoothing.

# number of adjacent data points to smooth, using an odd number helps to put the window center at one of the data points
points= 9

# nine point moving average
smooth.coef= rep(1, points)/points

# the filter function is pulled from base R. We use stats:: here because there is a filter function in tidyverse that has become the default due to the loading of the package and will not do smoothing

smoothed.output = stats::filter(noisyoutput, smooth.coef)

lets plot the results

graph = ggplot()+
  geom_line(aes(x=times, y=noisyoutput))+
  geom_line(aes(x=times, y=smoothed.output), col="red")+ labs(x="time(s)", y="signal")

ggplotly(graph, dynamicTicks=TRUE)

Calculate the noise

desiredindex = times < 6 & times >1
sd(smoothed.output[desiredindex])
## [1] 0.2303507

try different number of points for smoothing

9, 25, 75 What do you conclude?

For smoothing, as you increase the number of points, the standard deviation would decrease. Peak maximums and minimums decreased as points went up, but too many points skewed and distorted the data because of too much averaging.

Savitzy-Golay smoothing

Another way to smooth the data is to fit a polynomial to the data in the selected window of points and replace the center point with the value predicted by the fitting. This is called Savitzky-Golay smoothing.

pracma package in R implemenets this smoothing. Install the package if you do not have it: install.packages(“pracma”)

You can then use the Savitzky_Golay filter function in the package by pracma::savgol command

#define filter length (fl) which is the number of points and should be an odd number
# order is quadratic=2 or quatric=4
# dorder is for derivative. for smoothing use 0

sg.filtered = pracma::savgol(noisyoutput, fl=9, forder=2, dorder=0)
graph = ggplot()+
  geom_line(aes(x=times, y=noisyoutput))+
  geom_line(aes(x=times, y=sg.filtered), col="red")+ labs(x="time(s)", y="signal")

ggplotly(graph, dynamicTicks=TRUE)

try window sizes

test various window sizes and compare to moving average approach. What do you conclude?

The greater the filter length, the more smooth the line is, but it can run into the same problem as the smoothing point method, that the line can get distorted and flattened the bigger the filter length is.

FFT filtering

A completely different approach to noise filtration is offered by Fourier transform. Here, we assume that the signal and noise have distinct frequencies. For example, in a chromatograpm, the variation in signal due to a compound eluting is much slower compared to the fast fluctuations from noise. Therefore, we can filter all components of the signal with high frequency and keep low frequency components to discriminate against noise.

To see the advantages of fourier transform, we add a periodic noise to the output with a frequency of 2 Hz in addition to the random noise.

Observations

Frequency of noise is greater than frequency of signal because there are more peaks of noise per unit of time

newnoisy= noisyoutput + 5*sin(2*pi*2*times)
graph= ggplot()+
  geom_line(aes(x=times, y= newnoisy))

ggplotly(graph, dynamicTicks = T)
# do a Fourier transform of the output, note that the calculated numbers are complex and reflect the amplitude of the signal for each frequency
FFTed= fft(newnoisy)

# plot the values of amplitude against the index
# note the symmetry in graph which indicates that only half the indices are used for frequency calculation
graph= ggplot()+
  geom_line(aes(x=seq_along(FFTed), y=abs(FFTed)))

ggplotly(graph, dynamicTicks=T)

plotting against frequency

To convert the index of FFTed vector to frequency, we need to know the sampling frequency or how fast the data was acquired. This can be calculated from inverse of the time difference between two adjacent points.

The first member of the FFTed vector corresponds to frequency=0, the frequency for the next member is calculated from: samp.freq/length(FFTed)

samp.freq= 1/(times[2]-times[1])
freqs = (seq_along(FFTed)-1)*samp.freq/length(FFTed)

# remember that fft function creates a symmetric vector so we only need the first half of the FTTed vector to examine the frequencies:
max.index= floor(length(FFTed)/2)

graph= ggplot()+
  geom_line(aes(x= freqs[1:max.index], y=abs(FFTed[1:max.index])))+ labs(x="freq (Hz)", y="Intensity")

ggplotly(graph, dynamicTicks = T)

Foilter the unwanted frequencies

Define the the threshold frequency that you want to eliminate and calculate which index corresponds to the threshold Then set the value of all amplitudes corresponding to the indices to zero

thresh.freq=2.5
# rearrange the index formula above and convert to integer so that it is a whole number
thresh.index= as.integer(thresh.freq * length(FFTed)/samp.freq + 1)

# now we zero out all of the intensities corresponding to frequencies above the threshold which should be all noise
# note that we zero out the full FFTed vector not just half.

filtered.FFTed= FFTed
filtered.FFTed[thresh.index:(length(FFTed)-thresh.index-2)]=0 +0i

graph=ggplot()+
  geom_line(aes(x=seq_along(filtered.FFTed), y= abs(filtered.FFTed)))

ggplotly(graph, dynamicTicks = T)
# now we do an inverse FFT to get the intensities of output back
fftsmoothed= fft(filtered.FFTed, inverse=TRUE)/length(FFTed)

# now we plot the real part of the inverse FFTed output

graph= ggplot()+
  geom_line(aes(x=times, y=newnoisy))+
  geom_line( aes(x=times, y=Re(fftsmoothed)), col="red")+
  
  geom_line(aes(x=times, y=noisyoutput))+
  geom_line(aes(x=times, y=sg.filtered), col="blue")+ labs(x="time(s)", y="signal")

ggplotly(graph, dynamicTicks = TRUE)

test various frequency thresholds for filtering

What do you conclude?

The lower the frequency threshold, the less data that is getting filtered out. It seems to get filtered out from the center, so the higher the frequency threshold, less data is getting filtered out from the center, and more from each side is getting filtered back in. It seems like using a moderate threshold is the best way to filter necessary data as it removes the unnecessary noise.

now add the Savitzky-Golay and fourier transform filtered output into one graph

What do you conclude for advantages of various smoothing methods?

FFT seemed to be better when looking at the ends. SG lowered intensity of data peaks allowing to not really seeing big difference in data; both behave very similar. The larger # data points you use, more fluctuations you smooth out. Another advantage of FFT is if signal is periodic, you can use FFT to filter is out; choose what frequency to keep.