Intro to R

Every programming language has similar elements. We need to input data, identify variables, perform calculations, create functions, control program flow, and output data and graphics. The R language is attractive in that it has it has a full online development environment -RStudio Cloud- and built in functions for regression analysis, solving equations, and producing graphics. R is increasingly used for those involved in data analysis and statistics, particularly social sciences, bio-statistics, and medicine. References on R tend to emphasize different features of R more suitable for social sciences, data mining, ecology, and medicine. A good reference for scientists using numerical methods is available1, and an Analytical Chemistry textbook2 that provides examples in R is available. This manual will present the elements of R gradually, as needed, mostly through chemical data examples. This document itself is created with RMarkdown, which integrates a document creation with R code.

R in RStudio Cloud

We will write our programs on a online platform (an IDE) called Rstudio Cloud (RStudio Cloud is being renamed Posit Cloud), which makes it platform independent - we access the program the same way on a PC, Mac, or Chromebook. The Rstudio Cloud environment is divided into 4 main sections, with the most important elements listed below.

  • Top left: Script - where you write R code
  • Bottom left: Console - show output
  • Top right: Environment - we will mostly ignore this
  • Bottom right: show Plots, help, packages

To start writing a program in Rstudio Cloud:

create a Space (like a file folder that can contain related programs) , select New project, and from the menu:
File ——> Newfile —–> Rscript

& start typing!

To run a program, highlight the code and select “run”.

You can create multiple Projcts and R scripts in a single Space. Spaces are displayed on the left and you can create multiple spaces. In the tools icon (upper right) spaces can be shared.

When you run a script program, results and error messages will appear on the console, and plots appear on the plot area.

In this document, the shaded area is R code, and the output appears in a white background following thr shaded R sceipt. The theory is described in the text, and the details of R code are described in the code comment (text preceded by #)

Vectors & Numerics

In R we use “<-” to set a variable to a value. R uses “vector” calculation, which avoids use of loops in more traditional programming. For instance, below we set x equal to a series of values (a vector), and then calculate a series of values of y.

#  A comment line 

x <- 5.0  
#  Set x equal to 5.0

y <- x^2
# y is equal to x squared. 

x <- c(1.0, 2.0, 3.0)
# x is equal to a numbered list of values - a “vector” 

x <- seq(1,2,0.2)      
# create an incremented sequence

xx <-matrix(c(2,4,6,8),2,2)
# a matrix row 1 = 2 6 row 2 = 4 8
#  length(x) returns the length of the vector x.
#  x returns all the values of x

length(x)    
## [1] 6
x           
## [1] 1.0 1.2 1.4 1.6 1.8 2.0
#  length(x) returns the length of the vector x.
#  x returns all the values of x
xx
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
#  exponents and logs

10^(-2) 
## [1] 0.01
# power of 10

exp(1)
## [1] 2.718282
#  power of 2.718282 .....

log(10)  
## [1] 2.302585
# base e

log10(10)   # base 10
## [1] 1

Once we have a series of x and y, of equal length, we can easily create a graph.

x   <-  seq(0,10,0.5) 
# a sequence from 1 to 10, increments 0f 0.5

y  <-   x^2 
#  Note that y is calculated for every x.
#  This is called vectorized.

plot(x,y)  

# Create a plot. We can add a lot of formatting!

Below is an example of basic formatting commands.

plot(x,y,type = "b",main = "A Formatted Graph",
col = "darkblue", xlab = "X Label", ylab = "Y Label")

type=“p”: for points (by default) type=“l”: for lines type=“b”: for both; points are connected by a line

Functions

We can define a function and evaluate it later in the R script.

Afunc <- function(x)  {x^2 + log(x)}

y <- c(1,2,3)

Afunc(y)
## [1]  1.000000  4.693147 10.098612
# note the vectorized evaluation 

Reading and Writing Data Files

With larger data sets, as in a titration, we prefer to be able to read a data file, such as output from a spreadsheet. A common filetype is csv (comma separated values) files. Usually, the first row contains the headers, and them remaining rows the values. If you upload a csv file into your working directory, they can be read into Dataframes. Dataframes are columns of values that do not have to be of the same type (Names, dates, income, etc), and they are very important in many R applications. However, here’ we are focusing on vectors, and we can convert a dataframe with two or more numeric columns (such as volume and pH) into respective vectors. The data file “mydat.csv” looks like this:

vol, pH
1.5 , 3.2
3.0, 6.6
7.0, 14.0

# Mydata <- read.csv("mydat.csv")
# a file with headers vol and pH 

#  Mydata 
#  it's a dataframe

#  volume <- Mydata$vol  
# extract column vol as vector

#   pH <- Mydata$pH       
# extract column pH as vector 

#  volume 

#  pH

#  and to write a vector file

# write(volume,file="newfile")

Are they Different?: Data and Gaussian Distributions

Every measurement has random error: Error that is inherent in the nature of the measurement.For instance, while the true mass of an object might be 1.0000 g, the actual measured mass will deviate from that value in a random manner, and the measured mass cold well be 1.02 grams, or 0.98 grams. These deviations follow a Gaussian distribution. Random error is also referred to as noise.

We can simulate the results of a number of measurements with the use of the R command rnorm. Below we simulate 10 measured values assuming a true value 0f 1. The rnorm command is useful in simulating relistic data. Using set.seed will reproduce the same set of random numbers.

  noisy1 <- rnorm(5,0,0.2)
  
  noisy1
## [1] -0.3895936 -0.1783962 -0.1596842  0.1169115 -0.4421683
  noisey2 <- 1 +  rnorm(5,0,0.2)
  
  noisey2
## [1] 0.8969841 0.7237467 1.0489504 1.1670184 1.1020308
set.seed(1)
# if set.seed(1) then gives same sequence 

noisey3 <- 1.0 + rnorm(5,0,0.2) 

noisey3
## [1] 0.8747092 1.0367287 0.8328743 1.3190562 1.0659016

rnorm(n,mean=0,sd=1) Generates Gaussian Random Values
input: n is the number of values, mean, statndard deviation. Default values are 0 for mean and standad deviation of 1.
Comments: Most used here to simulate realitic data by adding noise to and idealized model. If we want a reproducible simulation we can use set.seed.

Significant Figures: The number of significant figures is all the certain figures plus one uncertain figure. The degree of uncertainty in the last digit is ultimately determined by a statistical analysis.

The Mean (or average) of a set of n measurements x is defined:

\[\large\overline{x} = \frac{\sum_{i=1}^{n}x_i}{n}\] The mean and standard deviation can be related to the Guassian distribution - that gives the probability of observing a particular value of x. For a finite number of measurements, the Gaussian distribution can be approximated as:

\[\large y =\frac{1}{s\sqrt(2\pi)}e^\frac{-(x-\overline{x})^2}{2s^2}\] Data that we can collect from individuals or society can roughly follow a Gaussian distribution. In a study3, the average male weight is 76.7 kg with a standard deviation of 12.1 kilograms, while women had an average weight of 61.5 with a standard deviation of 11.1 kg. Substituting this into the Gaussian distribution formula

s <- 12.1
# standard deviation

meanw <-  76.7
# mean weight

xval <- seq(20,100,2)

xval
##  [1]  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56
## [20]  58  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88  90  92  94
## [39]  96  98 100
#  here we calculate a gaussian distribution

yp <- (1/(s*sqrt(2*pi)))*exp(1)^((-(xval-meanw)^2)/(2*s^2))

yp
##  [1] 5.623116e-07 1.203433e-06 2.506117e-06 5.078268e-06 1.001301e-05
##  [6] 1.921095e-05 3.586475e-05 6.515106e-05 1.151622e-04 1.980767e-04
## [11] 3.315062e-04 5.398645e-04 8.554856e-04 1.319093e-03 1.979124e-03
## [16] 2.889386e-03 4.104619e-03 5.673812e-03 7.631536e-03 9.988120e-03
## [21] 1.272010e-02 1.576275e-02 1.900677e-02 2.230075e-02 2.546041e-02
## [26] 2.828436e-02 3.057469e-02 3.215974e-02 3.291531e-02 3.278070e-02
## [31] 3.176678e-02 2.995457e-02 2.748450e-02 2.453847e-02 2.131778e-02
## [36] 1.802068e-02 1.482297e-02 1.186409e-02 9.239917e-03 7.002235e-03
## [41] 5.163452e-03
plot(xval,yp,xlab="Weight in kg",ylab="Probability") 

#  R has a command for Gaussian distribution, dnorm.

dp  <- dnorm(xval,meanw,s)    

plot(xval,dp,xlab="Weight in kg",ylab="Probability",ylim=c(0,.04),
     main="Women (x) and Men (o) Weight Distributions")

fdp <-   dnorm(xval,61.5,11.1)
#  women weight distribution

points(xval,fdp,pch=4)  

#  adding points to previous plot

Models + Data: The Linear Case

Beer’s law for UV-Vis Spectroscopy is: \[\large Abs = \epsilon_\lambda b c \]

where A = absorbance \(\large\epsilon_\lambda\) = absorption coefficient at a particular wavelength, b = pathlength (typically cm), and c = concentration (usually in molarity or \(\large\mu\)grams)

There is a linear relationship between the abosorbance and the concentration, given a fixed pathlength and wavelength. In any experimental situation, the inherent noise in the data will mean that we need to include a constant (I = intercept).

\[\large Abs = \epsilon_\lambda b c + I\] In order to use Beer’s Law to analyze for a particular component in solution (an ~analyte), we construct a calibration curve using experimentally determined data. The object is to determine the absorption coeficcient, \(\epsilon\). We prepare solutions of varying concentrations of the analyte, and measure the absorbance. According to Beer’s Law, a plot of Absorbance versus concentration should result in a straight line with a slope equal to \(\epsilon\).

Lets take an example. The data comes from the widely use Bradford assay for protein4. In the Bradford assay, a dye (Coomassie brilliant blue G-250) changes from red to blue when attched to proteins, and this color absorbtion at 595 nm) is used to measure protein concentration. Here is sample data for a calibration5 for the BSA (Bovine Serum Albumin) standard.

Conc <- c(2.0, 5.0, 10.0, 15.0, 18.0)
# in micrograms/ml
Abs <- c(0.115,0.266,0.413,0.701,0.811)
# Absorbance

Xdata <- data.frame(Conc,Abs) 
#  x and y to a data frame

knitr::kable(Xdata[,],
col.names = c('Conc(microg/ml)','Abs'), 
caption = "Bradford assay data")     #  makes a table
Bradford assay data
Conc(microg/ml) Abs
2 0.115
5 0.266
10 0.413
15 0.701
18 0.811

Linear Regression

We use the “lm” command to do linear regression - to find the best fitto the data according to the least squares criterion: the slope (\(\epsilon\)) and intercept are adjusted to minimize the sum of the squares of the deviations of the experimental data from the fitted line.The concept of regression is used throughout this book and is central to analyzing data in chemistry. Models in Chemistry have adjustable parameters - for Beer’s Law they are the absorptivity coefficient (the slope) and the intercept. Regression methods find the “best fit” line: the choice of slope and intercept which generates a line which will have the smallest sum of the squares of the deviations from the experimental data.

BL <- lm(Abs ~ Conc)   
# linear regression
  
summary(BL)
## 
## Call:
## lm(formula = Abs ~ Conc)
## 
## Residuals:
##       1       2       3       4       5 
##  0.0018  0.0223 -0.0482  0.0223  0.0018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.026200   0.029054   0.902 0.433649    
## Conc        0.043500   0.002495  17.435 0.000411 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03329 on 3 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.987 
## F-statistic:   304 on 1 and 3 DF,  p-value: 0.0004113
# summarize results of linear regression

 cf <- coef(BL)
int <- cf[1]
sl <-  cf[2]
int
## (Intercept) 
##      0.0262
sl
##   Conc 
## 0.0435
# vector: intercept and slope

plot(Conc, Abs, main = "Beer's Law ", xlab = "", 
ylab = "Absorbance", xlim = c(0.0,20.0), 
ylim =  c(0.0, 1.0), 
sub = "Figure 1. Beer's Law Plot for BSA")
title(xlab="Concentration, ug/ml", line=2, cex.lab=1.0)    
library(plotrix)
abline(BL,col = "red")
ablineclip(a =0.0, b = 0.03,x1=1,x2=18)

ablineclip
## function (a = NULL, b = NULL, h = NULL, v = NULL, reg = NULL, 
##     coef = NULL, untf = FALSE, x1 = NULL, x2 = NULL, y1 = NULL, 
##     y2 = NULL, ...) 
## {
##     if (!is.null(c(x1, x2, y1, y2))) {
##         oldclip <- par("usr")
##         if (is.null(x1)) 
##             x1 <- oldclip[1]
##         if (is.null(x2)) 
##             x2 <- oldclip[2]
##         if (is.null(y1)) 
##             y1 <- oldclip[3]
##         if (is.null(y2)) 
##             y2 <- oldclip[4]
##         clip(x1, x2, y1, y2)
##         abline(h = oldclip[4] + 1)
##         clip(x1, x2, y1, y2)
##     }
##     abline(a = a, b = b, h = h, v = v, reg = reg, coef = coef, 
##         untf = untf, ...)
##     if (!is.null(c(x1, x2, y1, y2))) 
##         do.call("clip", as.list(oldclip))
## }
## <bytecode: 0x5b629eca3eb0>
## <environment: namespace:plotrix>
# adding lines that show other slopes
Coomassie Blue. Associates with Proteins through non-covalent interactions
Coomassie Blue. Associates with Proteins through non-covalent interactions

From summary(BL) we learn the best fit values for the slope( \(\epsilon\)) and the intercept, the standard deviations of these fitted values, the p values, and the \(r^2\) value: how well a linear model explains the data.

lm(ydata,xdata) Linear regression on a set of paied x and y values.
input: xdata and ydata are the names of x and y vectors previosly defined.
Comments: Output of lm are the intercept and slope of the best fit line, the standard error of these values, and r squared, a measure of how well a linear model explains the data, and the p value indicates the the probability pf thr “null” hypothesis: there is no linear relationship between x and y.

The final step in an analysis is to calculate the concentration of an unknown based on the absorbance measured. Rearranging Beer’s Law, we have, for example, for an absorbance = 0.350

\[\large c = \frac{Abs-I}{\epsilon}\]

val <- (0.350 - cf[1] )/cf[2]
val
## (Intercept) 
##    7.443678
# the unknown concentratiom

Residual Plots

In a regression analysis, it can be helpful to plot the residuals (the deviations) of the experimental data from the theoretical best fit line (or curve). The residuals should appear randomly distributed above and below the best fit line.

For the Bradford determination, we have:

res <- residuals(BL)
plot(Conc,res,ylab="residuals",ylim=c(0.05,-.05),
     main="Residual Plot Bradford Method")
abline(a = 0.0, b = 0.0, col = "blue") 

Making Reports with Rmarkdown and Rpub

RStudio Cloud contains a powerful document creation tool Rmarkdown, suitable for scientific reports, thesis, publication quality research papers. Some nice features

  1. Simple mark up language 2, latex style equations
  2. Sharing through rpubs or other systems
  3. embedded R code to generate graphs in document

Creating Simple Report with RMarkdown

In rstudio cloud we can choose file -> new file -> R markdown. A new file (.rmd) will be created which already includes the first part of an rmd file the “YAML” header, which declares the title, and html and pdf outputs.

---
title: "Untitled"
output:
  pdf_document: default
  html_document: default
date: "2022-10-02"
---

The default created Rmarkdown file also includes examples of simple formatting and embedding r code.

Now you can input and format text, equations, images, and references, and R code. You can view your formatted document: Select “knit” to produce an html or pdf document. For detailed reference to rmarkdown see (xie guide) or a cheatsheet.

Headers and subheaders are created with #, ##, ###.

Header Levels 

# Header 1
## Header 2
### Header 3

Below are examples of Rmarkdown code and the output they produce.

Basic Formatting

**Boldface**

Boldface


[I'm an inline-style link](https://www.google.com)

I’m an inline-style link

Embedding R code in Rmarkdown


```r
x <- seq(1,20,1)
y <- x^2
plot(x,y)
```

<img src="R-for-Science_files/figure-html/unnamed-chunk-12-1.png" width="672" />
x <- seq(1,20,1)
y <- x^2
plot(x,y)

For scientific work the ability to insert latex equations is very useful. To insert an inline equation bracket the latex code with dollar signs.

Examples of latex equations

$y^2$

$k_{cat}$

$a \cdot b$

$\frac {a \cdot b}{c}$

$\alpha \cdot \beta$

$$\alpha \cdot \beta$$

which result in:

\(y^2\)

\(k_{cat}\)

\(a \cdot b\)

\(\frac {a \cdot b}{c}\)

\(\alpha \cdot \beta\)

\[\alpha \cdot \beta\] (next line, centered)

After knitting to html, a simple way to share is to publish the document in an rpub account, which provides an attractive html based document.In Rstudiocloud, after you knit to html, choose “publish” and choose rpub and create an account.

Populations: Exponential Growth and its limits

# https://www.nature.com/scitable/knowledge/library/how-populations-grow-the-exponential-and-logistic-13240157/

#  Exponential fit, US population growth 1790 t0 1850

year <- c(0,10,20,30,40,50,60)  
length(year)
## [1] 7
pop <-  c( 3.929, 5.308, 7.240, 9.638, 12.866, 17.069, 23.192)
length(pop)
## [1] 7
#   A = Ai * exp(-k*t)
tryfit <- nls(pop ~ alpha * exp(k*year) ,  
              start = c(alpha = 2.0, k = 0.05))

plot(year,pop,main = "US Population Growth, 1790-1850", xlab = "Starting 1790", ylab = "Population / Million")
lines(year,predict(tryfit),col="red")

summary(tryfit)
## 
## Formula: pop ~ alpha * exp(k * year)
## 
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)    
## alpha 3.9744064  0.0407277   97.58 2.14e-09 ***
## k     0.0293421  0.0002023  145.02 2.96e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09817 on 5 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 5.672e-08
#   us population logistic  

year <- seq(0,150, by = 10)    
year
##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150
pop <-  c( 3.929, 5.308, 7.240, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558,
  50.156, 62.948, 75.996, 91.97,  105.711, 122.775, 131.66)


length(year)
## [1] 16
length(pop)
## [1] 16
pop
##  [1]   3.929   5.308   7.240   9.638  12.866  17.069  23.192  31.443  38.558
## [10]  50.156  62.948  75.996  91.970 105.711 122.775 131.660
Pi <- 3.929     #  Initial Population

tryfit <- nls((pop ~   (K*Pi)/(Pi + (K - Pi)*exp(-r*year))), 
                start = list( K = 155,  r = 0.03))

summary(tryfit)
## 
## Formula: pop ~ (K * Pi)/(Pi + (K - Pi) * exp(-r * year))
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## K 1.882e+02  2.967e+00   63.44   <2e-16 ***
## r 3.158e-02  1.615e-04  195.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9336 on 14 degrees of freedom
## 
## Number of iterations to convergence: 3 
## Achieved convergence tolerance: 1.954e-06
length(predict(tryfit))
## [1] 16
x <-   predict(tryfit)
x 
##  [1]   3.929000   5.346405   7.255018   9.808444  13.195035  17.635352
##  [7]  23.370397  30.635572  39.616557  50.388586  62.850930  76.680934
## [13]  91.337289 106.129855 120.342752 133.367268
r <- 0.029
xpo <- 3.97 *  exp(r* year)
xpo
##  [1]   3.970000   5.305617   7.090573   9.476036  12.664035  16.924565
##  [7]  22.618453  30.227923  40.397427  53.988232  72.151357  96.425057
## [13] 128.865097 172.218857 230.158015 307.589498
plot(year,pop,main = "US Population Growth",xlab = "Year, 1790 to 1940", ylab = "Population / Millions")
# plot(year,pop)
  lines(year,xpo,col = "blue")
  lines(year,predict(tryfit), col = "red")

  plot(year,residuals(tryfit))

  residuals(tryfit)
##  [1]  0.00000000 -0.03840518 -0.01501772 -0.17044402 -0.32903453 -0.56635183
##  [7] -0.17839748  0.80742798 -1.05855750 -0.23258552  0.09706987 -0.68493353
## [13]  0.63271053 -0.41885454  2.43224831 -1.70726806
## attr(,"label")
## [1] "Residuals"
  #  Bacteria Growth, Exponential
  
  
  minutes <- seq(0,64, by = 16)  
  length(minutes)
## [1] 5
  pop <-  c(0.022,0.036,0.060,0.101,0.169)
  length(pop)
## [1] 5
  tryfit <- nls(pop ~ alpha * exp(k*minutes) ,  
                  start = c(alpha = 0.05, k = 0.05))
  
  plot(minutes,pop,main = "Bacterial Density", xlab = "Minutes", ylab = "Relative Population Density")
  lines(minutes,predict(tryfit),col="red")

  summary(tryfit)
## 
## Formula: pop ~ alpha * exp(k * minutes)
## 
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)    
## alpha 0.0215382  0.0001423   151.3 6.36e-07 ***
## k     0.0321848  0.0001155   278.6 1.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0003288 on 3 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 2.66e-08
  #    Logistic 
  
  minutes <- seq(0,160, by = 16)    
  minutes
##  [1]   0  16  32  48  64  80  96 112 128 144 160
  pop <-  c(0.022,0.036,0.060,0.101,0.169,0.266,0.360,0.510,0.704,0.827,0.928)
  
  
  length(minutes)
## [1] 11
  length(pop)
## [1] 11
  pop
##  [1] 0.022 0.036 0.060 0.101 0.169 0.266 0.360 0.510 0.704 0.827 0.928
  Pi <- 0.022     #  Initial Population
  
  tryfit <- nls((pop ~   (K*Pi)/(Pi + (K - Pi)*exp(-r*minutes))), 
                  start = list( K = 1.5,  r = 0.03))
  
  summary(tryfit)
## 
## Formula: pop ~ (K * Pi)/(Pi + (K - Pi) * exp(-r * minutes))
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## K 1.1653827  0.0275806   42.25 1.16e-11 ***
## r 0.0335128  0.0003406   98.39 5.87e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01226 on 9 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 3.757e-06
  length(predict(tryfit))
## [1] 11
  x <-   predict(tryfit)
  x 
##  [1] 0.02200000 0.03711208 0.06204156 0.10220017 0.16447787 0.25558274
##  [7] 0.37808856 0.52540371 0.68050449 0.82254397 0.93694239
  r <- 0.032
  xpo <- 0.0215 *  exp(r* minutes)
  xpo
##  [1] 0.02150000 0.03587544 0.05986266 0.09988834 0.16667619 0.27812007
##  [7] 0.46407814 0.77437243 1.29213728 2.15609272 3.59771045
  plot(minutes,pop,main = "Bacterial Growth",xlab = "Minutes", ylab = "Relative Population Density")
  # plot(year,pop)
  lines(minutes,xpo,col = "blue")
  lines(minutes,predict(tryfit), col = "red")

  #  ********************************************************
 
  
  #  Simulate population density

Periodic Data and Fourier Transform

Audio Analogy

Sound waves share useful similarities to electromagnetic radiation. The are both represented as travelling sine waves. Sound waves are a periodic disturbance in physical medium (air) whereas em radiation is a periodic change in an electric field.

After a pulse excitation, the nmr signal decay contains that reflect multiple relaxations at different frequencies.

To explore the nature of wave motion and the Fourier Transform we can use audio files and R packages that create, read, and manipuate audio files (.wav).

packages:

fftw
TuneR

Below we used R to generate representations of sound. The graphs are of sound waves of different frequency, as might be detected by a microphone: the plots are amplitude of sound versus time.

library(tuneR,fftw)

     Wobj1 <- 0.75*sine(880) 
     #  here is unit = freq not degree in radians 
     tdir <- tempdir()
     tfile <- file.path(tdir, "myWave.wav")
     writeWave(Wobj1, filename = tfile)
     sdat1 <- tuneR::readWave(tfile) 
     
     Wobj2 <- 1.0*sine(4400)   
     tdir <- tempdir()
     tfile2 <- file.path(tdir, "myWave2.wav")
     writeWave(Wobj2, filename = tfile2)
     sdat2 <- tuneR::readWave(tfile2) 
     
     
     Wobj3 <- 0.5*sine(2000)  
     tdir <- tempdir()
     tfile3 <- file.path(tdir, "myWave3.wav")
     writeWave(Wobj3, filename = tfile3)
     sdat3 <- tuneR::readWave(tfile3) 
      
      N <- 44100
      
      x <- seq(1,44100,1)
      w1 <- sdat1@left
      w2 <- sdat2@left
      w3 <-  sdat3@left
      
      wt <- w1 + w2 + w3
      
      times <- x/44100  
      #  same as freq because time = 1  sec
      
      length(times)
## [1] 44100
      par(mfrow=c(2,2))
      #  following plots 2x2 
      
      plot(times,w1,type="l",xlab="Time/sec",ylab="Amplitude",xlim=c(0,0.01))
      plot(times,w2,type="l",xlab="Time/sec",ylab="Amplitude",xlim=c(0,0.01))
      plot(times,w3,type="l",xlab="Time/sec",ylab="Amplitude",xlim=c(0,0.01))
      plot(times,wt,type="l",xlab="Time/sec",ylab="Amplitude",xlim=c(0,0.01))

Questions for discussion

  1. Consider the first graph. Imagine you are the microphone, and you measure amplitude by raising and lowering your hand.

  2. According to the graph estimate how long does it take to go up and down once?

  3. How many times will it go up and down in one second? This is the frequency.

  4. Sound amplitude is related to compression of air. Sketch what compression of air might look like as a function of distance.

  5. What is oscillating up and down in the case of EM radiation?

Now we will consider how the Fourier Transform can translate a time domain representation to a frequency representation. For study of sound, and in spectroscopy, frequency representations are more useful. We use the R fftw package command FFT to do this. Now we have a graph of intensity versus frequency. We call this a transformation from the time domain to the frequency domain. For the individual time domain graphs you could easily sketch a frequency domain representation. But what about the fourth graph? Somehow the FFT is able to unscramble the waves and do the conversion.

      ft1  <- abs(fftw::FFT(w1)) 
      
      ft2  <- abs(fftw::FFT(w2)) 
      
      ft3  <- abs(fftw::FFT(w3)) 
      
      ft4  <- abs(fftw::FFT(wt)) 
      
      par(mfrow=c(2,2))
      
      plot(x,ft1/N,type="l",xlim=c(0,5000))
      plot(x,ft2/N,type="l",xlim=c(0,5000))
      plot(x,ft3/N,type="l",xlim=c(0,5000))
      plot(x,ft4/N,type="l",xlim=c(0,5000))

The Fourier Transform

As we have seen, the FT transforms time domain data to frequency domain. This turns out to be widely important, and is key part of modern NMR (Earning Richard Ernst the Noble Prize), other spectrocopies (FTIR), signal processing, image processing, acoustics, and biology (bioacoustics).

Even though we may not encounter the Fourier Transform in our math courses, we can try to get a conceptual understanding with the math we know. The key is in the idea of orthogonality.

Questions: Orthogonality

To try to understand what the FFT is doing, we can start with someting familiar, a vector represented on x y coordinates.

  1. Draw a vector v(.75,.50) on an x y grid. Now we know that if we drop a vertical line from the end of the vector, we will find the x component. How d we find the y component? Illustrate both.

  2. In each case, more formally, what you just did was to multiply the vector v(0.75, 0.50) by x and y unit vectors to find the scalar product.

vec1 = v(1,0) vec2 = v(0,1) vec3 <- v(2,3)

Vector multiplication

v(x1,y1) * v(x2,y2) = v(x1x2,y1y2)

  1. What is the scalar product of the x and y unit vectors? The unit x and y vectors are othogonal. We used orthogonality to break apart a vector into its x and y components.

Ok, what does this have to do sound waves and the fourier transform. We see that a complex waveform is composed of many individual waves. Now it turns out that the individual waves are orthogonal to each other, and we can use that fact to “pick out” the individual waves using the fourier transform.

Since the math is more complicated, let’s use R to explore.

If we multiply two functions, and find the area under the curve of the resulting function, if it is zero then those are orthogonal.

We can sometimes see if the area under the curve is zero because the positive and negative area are symmetrical.

using R:

  1. graph the product for the following:

    sin(x)sin(x) sine(x)sin(2x) sin(x)sin(5x) sin(x)sin(2x)sin(5x)sin(x)

  2. can use estimate the area under the curve of these graph? finite or zero

The fourier transform is simply doing this over and over again, with many frequencies. when it “hits” a part of the curve that matches the frequncy (that is non-orthogonal) it has a non zero result and picks that frequency out.

Here is example of actual soud wave & fft using R.

  #  **********************************************  
    # reading a music sample wav file
      
    mwave = "CantinaBand3.wav" 
#  loading the wav file
    
    data <- tuneR::readWave(mwave)   
    
  #  sampw = "sample.wav"
    
 #   sdata <- tuneR::readWave(sampw) 
    
    summary(data)    
## 
## Wave Object
##  Number of Samples:      66150
##  Duration (seconds):     3
##  Samplingrate (Hertz):   22050
##  Channels (Mono/Stereo): Mono
##  PCM (integer format):   TRUE
##  Bit (8/16/24/32/64):    16
## 
## Summary statistics for channel(s):
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8587.000  -568.000     1.000    -2.556   588.000  9002.000
    #  sampling rate 44,100
    
    ysig <- data@left
    
    N <-  length(ysig)
    
    x <- seq(1,N,1)
    
    T <- 6.85
    
    freq <- x/T
    
    length(freq)
## [1] 66150
    time <- T/x
    ysig <- ysig / 2^(data@bit -1)   
    plot(time,ysig,xlim=c(0,0.002))

    length(time)
## [1] 66150
   par(mar=c(1,1,1,1))  
   #  figure margin error fix
  
  
  ffty <- fftw::FFT(ysig)     #  discrete fourier transform 
  
  amp <- abs(ffty)
  # DFT is a sequence of complex sinusoids 
  
  plot(freq,amp/N,type="l",xlim = c(0,2000))

FT and Signal Averaging

Since FT spectrocopies are faster than continous wave spectroscopies, more spectra can be collected, and we can take advantage of signal averaging. Use R to generate noisy sine waves and use signal averaging to see reduction in S/N ratio.

rnorm can be used to add noise to a signal. Here we simply demonstrate the impact of signal averaging by lookind at the standard deviation as a function of number of signals.

after 1 after 10 after 100

how does the average noise change with n (=number of signals)

   rn1 <- rnorm(10,1,0.5)

   arn1 <-mean(rn1)
   
   arn1
## [1] 1.043315
    rn2 <- rnorm(50,1,0.5)

arn2 <-mean(rn2)
   
   arn2
## [1] 1.072812
 rn3 <- rnorm(1000,1,0.5)

arn3 <-mean(rn3)
   
   arn3
## [1] 0.9902493

Big Data and Visualization

plotly https://plotly.com/r/ plotly has 3d plots


  1. O Jones, R Maillardet, A Robinson. Introduction to Scientific Programming and Simulation Using R, CRC Press (2009)↩︎

  2. D. Harvey. Analytical Chemistry↩︎

  3. [Distribution of body weight and height. W. Millar, Journal of Epidemiology and Community Health, 1986, 40, 319-323]↩︎

  4. Bradford, M.M. (1976), Rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding, Anal. Biochem., 72 (1–2): 248–254↩︎

  5. Brady and Macnaughtan, Anal Biochem. 2015 Dec 15; 491: 43–51.[Evaluation of Colorimetric Assays for Analyzing Reductively Methylated Proteins: Biases and Mechanistic Insights](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4631703/↩︎