1 Preface

This book originated from an effort to integrate programming into the teaching of Analytial Chemistry (or Quantitative Chemical Analysis, as it is sometimes called). In choosing a language, I found in R the following desireable qualities for teaching:

  1. A comprehensive online platform in R Studio Cloud.
  2. Easy to get up & running.
  3. Quality graphics
  4. Wide range of statistical packages.
  5. Commonly used in scientific research
  6. Vibrant online community

Each chapter introduces one or more problems that can be solved in R, and includes a suggested project (typically involving student collected or literature data) through which thr student can deepen their understanding of analytical chemistry and also practice new R skills.

2 About the Author

My original training in programming traces back to FORTRAN and punchcards & Mainframe at Brown University, where I completed a PhD in Chemistry & developed a simulation program to analyze electrchemical experiment, soon ported to PC with the lovely Turbo Pascal (now still available using FreePascal).

3 The R Language

(Anderson, Anderson, and Anderson 2015)

Every programming language has similar elements. We need to input data, identify variables, perform calculations, create functions, control program flow, and output data and graphics. The R language is attractive in that it has it has a full online development environment -RStudio Cloud- and built in functions for regression analysis, solving equations, and producing graphics. R is increasingly used for those involved in data analysis and statistics, particularly social sciences, bio-statistics, and medicine. It is currently available in a free version. References on R tend to emphasize different features of R more suitable for social sciences, data mining, ecology, and medicine. A good reference for scientists using numerical methods is available1, as well as an Analytical Chemistry textbook2. This manual will present the elements of R gradually, as needed, mostly through chemical data examples. This document itself is created with RMarkdown, which integrates a document creation with R code.

3.1 R in RStudio Cloud

We will write our programs on a online platform (an IDE) called Rstudio Cloud, which makes it platform independent - we access the program the same way on a PC, Mac, or Chromebook. The Rstudio Cloud environment is divided into 4 main sections, with the most important elements listed below.

  • Top left: Script - where you write R code
  • Bottom left: Console - show output
  • Top right: Environment - we will mostly ignore this
  • Bottom right: show Plots, help, packages

A package is set of programs that R can call, such as a statistical package. There are base packages that come with R, and there are many that have to be added. Default Packages are listed in the bttom right pane. To add a package use install and type name of packages.

Some important packages used here:

To start writing a program in Rstudio Cloud:

from a Workspace (like a file folder that can contain related programs) , select New project, and fromthe menu:
File ——> Newfile —–> Rscript

& start typing!

To run a program, highlight the code and select “run”,

When you run a script program, results and error messages will appear on the console, and plots appear on the plot area.

In this document, the shaded area is R code, and the white background appears as output from a previous R command.

In this section, we present only the very basics of calculations and graphics.

3.2 Vectors & Numerics

In R we commonly use “<-” to set a variable to a value. R uses “vector” calculation, which avoids use of loops in more traditional programming. For instance, below we set x equal to a series of values (a vector), and then calculate a series of values of y.

#  A comment lineR 

x <- 5.0    #  Set x equal to 5.0

y <- x^2    # y is equal to x squared. 

x <- c(1.0, 2.0, 3.0)   #   x is equal to a numbered list of values - a “vector” 

x <- seq(1,2,0.2)       # create an incremented sequence

xx <-matrix(c(2,4,6,8),2,2)   # a matrix row 1 = 2 6 row 2 = 4 8
#  length(x) returns the length of the vector x.
#  x returns all the values of x

length(x)    # the number of values in the vector x
## [1] 6
x           
## [1] 1.0 1.2 1.4 1.6 1.8 2.0
xx
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8

Once we have a series of x and y, of equal length, we can easily create a graph.

x   <-  seq(0,10,0.5)     # a sequence from 1 to 10, increments 0f 0.5

y  <-   x^2     #  Note that y is calculated for every x. This is called vectorized.

plot(x,y)     # Create a plot. We can add a lot of formatting!

Below is an example of basic formatting commands.

plot(x,y,type = "b",main = "A Formatted Graph",col = "darkblue", xlab = "X Label", ylab = "Y Label")

3.3 Functions

We can define a function and evaluate it later in the R script.

Afunc <- function(x)  {x^2 + log(x)}

y <- c(1,2,3)

Afunc(y)       # note the vectorized evaluation 
## [1]  1.000000  4.693147 10.098612

3.4 Reading Data Files

With larger data sets, as in a titration, we prefer to be able to read a data file, such as output from a spreadsheet. A common filetype is csv (comma separated values) files. Usually, the first row contains the headers, and them remaining rows the values. If you upload a csv file into your working directory, they can be read into Dataframes. Dataframes are columns of values that do not have to be of the same type (Names, dates, income, etc), and they are very important in many R applications. However, here’ we are focusing on vectors, and we can convert a dataframe with two or more numeric columns (such as volume and pH) into respective vectors. The data file “mydat.csv” looks like this:

volume, pH 1.01, 2.2 2.03, 2.4 3.20, 2.6

Mydata <- read.csv("mydat.csv") # a file with headers vol and pH 
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'mydat.csv'
Mydata      #  it's a dataframe
##    vol  pH
## 1 1.01 2.2
## 2 2.03 2.4
## 3 3.20 2.6
volume <- Mydata$vol  # extract column vol as vector

pH <- Mydata$pH       # extract column pH as vector 

volume 
## [1] 1.01 2.03 3.20
pH
## [1] 2.2 2.4 2.6

3.5 Control Structures: For Loop

Control structures allow iterative operations (for loop) and decision directed operations (if then else).

  x <- 2  

 for(i in 1:4) {
         x <- x*x
         
 }
x
## [1] 65536

4 Data and Statistics

Every measurement has random error: Error that is inherent in the nature of the measurement.

Significant Figures: The number of significant figures is all the certain figures plus one uncertain figure. The degree of uncertainty in the last digit is ultimately determined by a statistical analysis.

The Mean (or average) of a set of n measurements x is defined:

\[\large\overline{x} = \frac{\sum_{i=1}^{n}x_i}{n}\] The mean and standard deviation can be related to the Guassian distribution - that gives the probability of observing a particular value of x. For a finite number of measurements, the Gaussian distribution can be approximated as:

\[\large y =\frac{1}{s\sqrt(2\pi)}e^\frac{-(x-\overline{x})^2}{2s^2}\] According to one reference, the average (mean) male weight is 172 pounds with a standard deviation of 29 pounds. Substituting this into the Gaussian distribution formula

s <- 29

xmean <-  172

xval <- seq(72,282,5)

xval
##  [1]  72  77  82  87  92  97 102 107 112 117 122 127 132 137 142 147 152 157 162
## [20] 167 172 177 182 187 192 197 202 207 212 217 222 227 232 237 242 247 252 257
## [39] 262 267 272 277 282
#  here we calculate a gaussian distribution

yp <- (1/(s*sqrt(2*pi)))*exp(1)^((-(xval-xmean)^2)/(2*s^2))

yp
##  [1] 3.601635e-05 6.430517e-05 1.114505e-04 1.875030e-04 3.062134e-04
##  [6] 4.854339e-04 7.470093e-04 1.115865e-03 1.618034e-03 2.277473e-03
## [11] 3.111781e-03 4.127191e-03 5.313615e-03 6.640726e-03 8.056213e-03
## [16] 9.487162e-03 1.084505e-02 1.203419e-02 1.296259e-02 1.355367e-02
## [21] 1.375663e-02 1.355367e-02 1.296259e-02 1.203419e-02 1.084505e-02
## [26] 9.487162e-03 8.056213e-03 6.640726e-03 5.313615e-03 4.127191e-03
## [31] 3.111781e-03 2.277473e-03 1.618034e-03 1.115865e-03 7.470093e-04
## [36] 4.854339e-04 3.062134e-04 1.875030e-04 1.114505e-04 6.430517e-05
## [41] 3.601635e-05 1.958139e-05 1.033421e-05
plot(xval,yp) 

#  R has a command for Gaussian distribution, dnorm.

dp  <- dnorm(xval,199.8,29)    #  we can also get probability distribution using the command dnorm distribution

plot(xval,dp)
lines(xval,dp)
fdp <-   dnorm(xval,170.8,29)

lines(xval,fdp)

Each point represents the probability of a particular observation,and the area under the curve (the sum of all probabilities) is 1.

The distance from the mean of a particular measurement can be discussed in terms of “deviations from the mean” as multiples of the standard deviation. For instance:

68.3% of measurements lie within plus or minus one standard deviation

95.5% within plus or minus two standard deviations

99.7% within plus or minus three standard deviations.

The R command for the cumulative distribution, which approaches 1, is pnorm.

cdp <-  pnorm(xval,172,29)    #   men  weight distribution
fdp <-   pnorm(xval,160,29)   #   women weight distribution

plot(xval,cdp)
lines(xval,fdp)

4.1 Comparing two sets of data: the t test

Argon was discovered because the mass of chemically generated nitrogen was significantly different than nitrogen obtained from air. Significant, in statistics, is carefully defined in terms of probabilities. An interesting example comes from history: Rayleigh’s investigation of the mass of chemically generated nitrogen (for instance, from the decomposition of pure NO) and nitrogen obtained from air was about 0.5% greater than that obtained from chemical decomposition. Was this slight difference attributable to experimental error? The mass of the The two sets of masses were as follows, and the results of a t-test in R are shown.

Rdata <- c(2.31017,2.30986,2.31010,2.31001,2.310024,2.31010,2.31028)
Tdata <- c(2.30143,2.29890,2.29816,2.30182,2.29869,2.29940,2.29848)

Xdata <- data.frame(Rdata,Tdata)
knitr::kable(Xdata[,], col.names = c('Chem Data','Air Data'), caption = "Gas Data for t.test/grams")
Gas Data for t.test/grams
Chem Data Air Data
2.310170 2.30143
2.309860 2.29890
2.310100 2.29816
2.310010 2.30182
2.310024 2.29869
2.310100 2.29940
2.310280 2.29848
t.test(Rdata,Tdata)
## 
##  Welch Two Sample t-test
## 
## data:  Rdata and Tdata
## t = 18.876, df = 6.0976, p-value = 1.218e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.009164539 0.011882319
## sample estimates:
## mean of x mean of y 
##  2.310078  2.299554

The important value to look for is p (probability) value: it tells us the probability that this overlap is due to merely random experimental error. The p value reported indicates about a one in a million chance that these results would occur if the masses were actually the same. We can also see that the size standard deviations, and see that they are much smaller than the difference between the two average values.


4.2 Gaussian Project

  1. Find a data set that is likely random distribution. This could be baseball batting averages, heights, or some other attribute. Should be > 20 values.

  2. Using R, create a vector with c command. X <- c(1,2,5,8,9)

  3. Find the mean and the standard deviation. The commands are - if X is the vector: mean(X) and sd(X).

  4. Use the dnrom command & graph the estimated Gaussian distribution over a reasonable range (created with seq())

5 Linear Regression & Beer’s Law

Beer’s law for UV-Vis Spectroscopy is: \[\large Abs = \epsilon_\lambda b c \]

where A = absorbance \(\epsilon_\lambda\) = absorption coefficient at a particular wavelength b = pathlength (typically cm) c = concentration (usually in molarity or \(\mu\)grams)

There is a linear relationship between the abosorbance and the concentration, given a fixed pathlength and wavelength. In any experimental situation, the inherent noise in the data will mean that we need to include a constant (I = intercept).

\[\large Abs = \epsilon_\lambda b c + I\] In order to use Beer’s Law to analyze for a particular component in solution (an ~analyte), we construct a calibration curve (in this case, a line!) using experimentally determined data. The object is to determine the absorption coeficcient, \(\epsilon\). We prepare solutions of varying concentrations of the analyte, and measure the absorbance. According to Beer’s Law, a plot of Absorbance versus concentration should result in a straight line with a slope equal to \(\epsilon\).

Lets take an example. The data comes from the widely use Bradford assay for protein3. In the Bradford assay, a dye (Coomassie brilliant blue G-250) changes from red to blue when attched to proteins, and this color absorbtion at 595 nm) is used to measure protein concentration. Here is sample data for a calibration4 for the BSA (Bovine Serum Albumin) standard.

Conc <- c(2.0, 5.0, 10.0, 15.0, 18.0)   # [BSA], in micrograms / ml
Abs <- c(0.115,0.266,0.413,0.701,0.811) # Absorbance

Xdata <- data.frame(Conc,Abs)    #converts x and y data to a data frame
knitr::kable(Xdata[,], col.names = c('Conc(microg/ml)','Abs'), caption = "Bradford assay data")     #  makes a table
Bradford assay data
Conc(microg/ml) Abs
2 0.115
5 0.266
10 0.413
15 0.701
18 0.811

We use the “lm” command to do linear regression - to find the best fit to the data according to the least squares criterion: the slope (\(\epsilon\)) and intercept are adjusted to minimize the sum of the squares of the deviations of the experimental data from the fitted line .

BL <- lm(Abs ~ Conc)   # linear regression, lm(y(x)~x)
  
summary(BL)            # summarize results of linear regression
## 
## Call:
## lm(formula = Abs ~ Conc)
## 
## Residuals:
##       1       2       3       4       5 
##  0.0018  0.0223 -0.0482  0.0223  0.0018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.026200   0.029054   0.902 0.433649    
## Conc        0.043500   0.002495  17.435 0.000411 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03329 on 3 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.987 
## F-statistic:   304 on 1 and 3 DF,  p-value: 0.0004113
coef(BL) 
## (Intercept)        Conc 
##      0.0262      0.0435
plot(Conc, Abs, main = "Beer's Law ", xlab = "", ylab = "Absorbance", xlim = c(0.0,20.0), ylim =  c(0.0, 1.0), sub = "Figure 1. Beer's Law Plot for BSA")

title(xlab="Concentration, ug/ml", line=2, cex.lab=1.0)    

abline(a = 0.0262, b = 0.0435, col = "blue")  # add a line with intercept = 0.262 and slope = 0.0435, taken from regression.

fitted <- coef(BL)

int <- fitted[1]

slope <- fitted[2]



slope
##   Conc 
## 0.0435

Coomassie Blue. Associates with Proteins through non-covalent interactions

Interpreting lm() output. We used lm(y ~ f(x)) to do linear regression. summary(SBL)provides a summary of the results of linear regression.

From summary(BL) we learn the best fit values for the slope( \(\epsilon\)) and the intercept, the standard deviations of these fitted values, the p values, and the r value - which should be close to 1.

The final step in an analysis is to calculate the concentration of an unknown based on the absorbance measured. Rearranging Beer’s Law, we have, for example, for an absorbance = 0.350

\[\large c = \frac{Abs-I}{\epsilon}\]

val <- (0.350 - int)/slope
val
## (Intercept) 
##    7.443678

The unknown concentration is 7.45 micrgrams/ml.

5.1 Residual Plots

In a regression analysis, it can be helpful to plot the residuals (the deviations) of the experimental data from the theoretical best fit line(or curve). The residuals should appear randomly distributed above and below a zero line.

We can simulate an example, based on the Bradford assay. The simulated data exaggerates the magnitude of typical deviations to help visualize.

In this program we used rnorm() to create normally distributed random numbers. This is very useful in creating realistic simuations of data. We used lm(y ~ f(x)) to do linear regression. summary(SBL)provides a summary of the results of linear regression.

Scon <- seq(0.5,20,0.5)     #  create a series 0f concentrations
length(Scon)
## [1] 40
rnoise <- rnorm(40,0,0.1)   #  40 gaussian random, mean 0, sd 0.1
length(rnoise)
## [1] 40
Sabs <- 0.0435 * Scon + 0.026 + rnoise   # create a series of abs  values

SBL <- lm(Sabs ~ Scon)    # linear regression applied to the vectors Sabs(y value) and Scon (function y(x))                 

summary(SBL)
## 
## Call:
## lm(formula = Sabs ~ Scon)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27649 -0.09056  0.01325  0.09197  0.19722 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.001256   0.039161  -0.032    0.975    
## Scon         0.046618   0.003329  14.003   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1215 on 38 degrees of freedom
## Multiple R-squared:  0.8377, Adjusted R-squared:  0.8334 
## F-statistic: 196.1 on 1 and 38 DF,  p-value: < 2.2e-16
cf <- coef(SBL)          #  Extraction the Intercept and Slope
Int <- cf[1]
Slope <- cf[2]
plot(Scon,Sabs)
abline(a = Int, b = Slope, col = "blue") 

plot(Scon,residuals(SBL))

abline(a = 0.0, b = 0.0, col = "blue") 

5.2 What is regression?

Since we will rely on the so called “regression” methods, it is useful to examine in a bit more detail what programs are doing when we use the lm command, and later nls.

6 Solving Equilibrium

In solving chemical equilibrium, we often make approximation to simplify the mathematical solution. For instance, the approximation of negligible dissociation of weak acids. However, this approximation can lead to significant error in important problems, such as the pKas of diprotic acids. Here, we will explore the use of R to solve equilibrium without approximations. We find the equation to solve through aplication of a systematic analysis of equilibrium, which takes into account the equilibrium, charge balance, and mass balance expressions.

6.1 pH of hydrofluoric acid

We can show four equations:

HF dissociation

\(HF \rightleftarrows H^+ + F^-\)

\(K_a = \large \frac{[H^+][F^-]}{[HF]} = 6.8 \times 10^{-4}\)

Water Dissociation

\(H_2O \rightleftarrows H^+ + OH^-\)

\(K_w = [H^+][OH^-]] = 10^{-14}\)

Mass Balance

\([HF]_i = [HF] + [F^-]\)

Charge Balance

\([H^+] = [F^-]+ [OH^-]\)

Successively substituting the mass balance, charge balance, and \(K_w\) into the dissociation expression:

eliminate \([HF]\)

\(K_a = \large \frac{[H^+][F^-]}{[HF]_i-[[F^-]]}\)

eliminate \([F^-]\)

\(K_a = \large \frac{[H^+]([H^+]-[OH^-])}{[HF]_i-[H^+]+[OH^-]}\)

eliminate \([OH^-]\)

\(K_a = \large \frac{[H^+]([H^+]-\frac{K_w}{[H^+]})}{[HF]_i-[H^+]+\frac{K_w}{[H^+]}}\)

Finally, simplifying a bit:

\(K_a\times([HF]_i[H^+]-[H^+]^2+K_w)-([H^+]^3-K_w[H^+] = 0\)

We can solve this for \([H^+]\) using the uniroot command.

Ka <- 6.8 * 10^-4

Kw <-  10^(-14)

CHF <- 0.0010

Ka
## [1] 0.00068
Kw
## [1] 1e-14
HFFunc <-  function(H) {
  
  Ka * (CHF*H - H^2 + Kw) -(H^3-Kw*H)
  
}

  Hroot <-  uniroot(HFFunc,c(10^-6,10^-2),tol = 10^-18)

  hC  <-  Hroot$root
  
  hC
## [1] 0.0005519641
   solpH <- -log(hC,base = 10)
   
   solpH
## [1] 3.258089
  # compare with approximate method
   
   hapx <- sqrt(0.001 *  Ka)
   
   hapx
## [1] 0.0008246211
   pHapx <- -log(hapx,base=10)
   
   pHapx
## [1] 3.083746

We can also make a graphical illustration of root finding. We can explore the value of HFFunc for values of H+ ariund the root.

phseq <- seq(3.2,3.4,0.01)
      
      phseq
##  [1] 3.20 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.30 3.31 3.32 3.33 3.34
## [16] 3.35 3.36 3.37 3.38 3.39 3.40
      hval <- 10^-phseq
      
      hval
##  [1] 0.0006309573 0.0006165950 0.0006025596 0.0005888437 0.0005754399
##  [6] 0.0005623413 0.0005495409 0.0005370318 0.0005248075 0.0005128614
## [11] 0.0005011872 0.0004897788 0.0004786301 0.0004677351 0.0004570882
## [16] 0.0004466836 0.0004365158 0.0004265795 0.0004168694 0.0004073803
## [21] 0.0003981072
    val  <- HFFunc(hval)
     
    val
##  [1] -9.285051e-11 -7.366706e-11 -5.592871e-11 -3.954115e-11 -2.441606e-11
##  [6] -1.047071e-11  2.372399e-12  1.418583e-11  2.503756e-11  3.499124e-11
## [11]  4.410651e-11  5.243921e-11  6.004165e-11  6.696282e-11  7.324858e-11
## [16]  7.894192e-11  8.408307e-11  8.870971e-11  9.285714e-11  9.655839e-11
## [21]  9.984442e-11
     plot(phseq,val)

6.2 Carbon Dioxide & Acidification

Added carbon dioxide in atmosphere not only affects climate, but also pH of water. Lets explore using the systematic method. We assume aqueous carbon dioxide is fixed by the constant atmospheric carbon dioxide. This is provided by the Henry’s Law constant.

\(CO_2(g) \rightleftharpoons CO_2(aq)\)
\(K_H = 0.034\) where units atmospheres and molarity. Carbon dioxide is typically reported in ppm, and a recent measurement show it is 415 ppm. This is equivalent to 0.000415 atm, because ppm is also mole fraction, and partial pressure is mole fraction time total pressure, which is 1 atm.

$ CO_2$ = 0.034 x 0.000415 = \(1.41\cdot 10^{-5}\) M$

Given CO2(aq) is constant & equilibrium with atmospheric, we have four coupled equilibrium to consider, and a charge balance and five unknowns.

Hydrolysis Reaction

\(CO_2(aq) + H_2O ⇄ H_2CO_3(aq)\) \(K_{hyd} = 0.0026\)

Carbonic Acid Dissociation

\(H_2CO_3(aq) \rightleftharpoons H_2CO_3^-(aq) + H^+(aq)\) \(K_{a1} = 1.67 \cdot10^{-4}\)

\(HCO_3^-(aq) ⇄ CO_3^{2-}(aq) + H^+(aq)\) \(K_{a2} = 5.6 \cdot 10^{-11}\)

Water Dissociation

\(H_2O ⇄ H^+(aq) + OH^-(aq)\) \(K_w = 1.00 \cdot 10^{-14}\)

And a charge balance

\([H^+] = [OH^-] + [HCO_3^-] + 2\cdot [CO_3^{2-}]\)

Strategy:

substitute OH- = kw/H+ Substitute CO32- = Ka2 x HCO3- / H+ Substitute HCO3- = Ka1 x H2CO3/H+ H2CO3 = Khyd CO2(ag)

Substitute 1,2 and 3 into the charge balance

H+ = Kw/H+ + Ka1 * H2CO3 / H+ + 2 * Ka2 HCO3- / H+

Substitute 4 and 3

H+ = Kw/H+ + Ka1 * Khyd * CO2(aq)/ H+ + 2* Ka1 * Ka2 * H2CO3 /H+2

Finally, 4 again

H+ = Kw/H+ + Ka1 * Khyd * CO2(aq)/ H+ + 2* Ka1 * Ka2 * Khyd * CO2 /H+2

Multiply by H+2 and total to zero

H+3 - Kw * H+ + Ka1 * Khyd * CO2(aq) * H+ + 2* Ka1 * Ka2 * Khyd * CO2 = 0

Project download csv file use uniroot / sapply to get pH

Kh <- 0.0034
Khyd  <-  0.0026 
Ka1   <-  1.67 * 10^-4
Ka2   <-   5.6 * 10^-11
Kw    <-  1.00 * 10^-14
ca <-  1.41 * 10^-5              # concentration of co2(aq)

  froot3 <- function(H)       
  
  {H^3 - Kw*H - Ka1 * Khyd  * ca * H - 2.0 * Ka1 * Ka2 * Khyd * ca}
  
  X <- uniroot(froot3, c(10^(-6),10^(-4)), tol = 10^(-18))$root 
  
  X
## [1] 2.476388e-06
  pH <-  -log10(X)

  pH
## [1] 5.606181
  #   graph of froot3 in area around solution
  
  
      ph <- seq(5.5,5.7,0.02)
      
      ph
##  [1] 5.50 5.52 5.54 5.56 5.58 5.60 5.62 5.64 5.66 5.68 5.70
      h <- 10^-ph
      
      h
##  [1] 3.162278e-06 3.019952e-06 2.884032e-06 2.754229e-06 2.630268e-06
##  [6] 2.511886e-06 2.398833e-06 2.290868e-06 2.187762e-06 2.089296e-06
## [11] 1.995262e-06
     val  <- froot3(h)
     
     val
##  [1]  1.223031e-17  9.022593e-18  6.302128e-18  4.002739e-18  2.066941e-18
##  [6]  4.448060e-19 -9.070142e-19 -2.026146e-18 -2.945236e-18 -3.692601e-18
## [11] -4.292791e-18
     plot(ph,val)

     #   the pH  of standing water is 5.6

7 Titrations

##  volume as function pH

# Generating Titration  Curve

#    pH f vol


cA = 0.100   #  concentration acid Molar
cB = 0.100   #               base Molar
VA = 25.00   #  volume of acid ml
Ka = 1.38 * 10^(-4)
Kw = 1.00 * 10^(-14)

 
 pH <- seq(3,12,0.5)
 pH
##  [1]  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0
## [16] 10.5 11.0 11.5 12.0
 #   try using sequence 
length(pH)
## [1] 19
   H <- 10^-pH
   

 VB <-  (Ka/(H + Ka) - (H - Kw/H)/cA) /
 (1.00 +  (H - Kw/H)/cB) * cA*VA/cB   
   
   
VB
##  [1]  2.754094  7.492558 14.456342 20.324917 23.305981 24.438395 24.819651
##  [8] 24.942701 24.981897 24.994415 24.998684 25.001007 25.004819 25.015759
## [15] 25.050032 25.158610 25.505049 26.632771 30.555555
pH
##  [1]  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0
## [16] 10.5 11.0 11.5 12.0
plot(VB,pH,main = "Weak Acid Titration",
     xlab = "Volume Base / ml", xlim = c(0,35),ylim = c(2,13) )

7.1 pH as function of volume

cA = 0.100   #  concentration acid Molar
cB = 0.100   #               base Molar
VA = 25.00   #  volume of acid ml
Ka = 1.00 * 10^-4
Kw = 1.00 * 10^-14

VB = 12.500  #   midpoint test



froot <- function(H)  {
  
  
   VB - (Ka/(H + Ka) - (H - Kw/H)/cA) /
  (1.00 +  (H - Kw/H)/cB) * cA*VA/cB   }


 #  VB  - (1.00 +  (H*cB - Kw/H)/cB) *  (Ka/(H + Ka) - (H - Kw/H)/cA) *  cA*VA/cB
                                   

 X <- uniroot(froot, c(10^(-12),10^(-3)), tol = 10^(-14)) 


X
## $root
## [1] 9.940534e-05
## 
## $f.root
## [1] 3.215206e-12
## 
## $iter
## [1] 10
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 5.000042e-15

7.2 Using sapply with uniroot

# numerical soution of titration weak acid / strong base
#  derivation of acid base titratin curve in Harris 

#  used sapply to run unirooot for different values

#   generate full  titration curve

cA = 0.100   #  concentration acid Molar
cB = 0.100   #               base Molar
VA = 25.00   #  volume of acid ml
Ka = 1.00 * 10^(-3)
Kw = 1.00 * 10^(-14)

Kw
## [1] 1e-14
VB <- 0

 froot <- function(VB,H)
     
     
    { VB - (Ka/(H + Ka) - (H - Kw/H)/cA) /
          (1.00 +  (H - Kw/H)/cB) * cA*VA/cB  }  

    numb <- sapply(seq(0,30,1), function(VB) uniroot(froot, c(10^-12,10^-1),tol = 10^-16, VB=VB)$root)
    
    # sapply can use sequential functions 
    
 VB <- seq(0,30,1)
    numb
##  [1] 9.512492e-03 7.485454e-03 5.938106e-03 4.767242e-03 3.878322e-03
##  [6] 3.195567e-03 2.662759e-03 2.239722e-03 1.898090e-03 1.617788e-03
## [11] 1.384459e-03 1.187692e-03 1.019821e-03 8.751112e-04 7.492070e-04
## [16] 6.387531e-04 5.411300e-04 4.542669e-04 3.765081e-04 3.065155e-04
## [21] 2.431973e-04 1.856545e-04 1.331406e-04 8.503107e-05 4.079969e-05
## [26] 1.400290e-08 5.100016e-12 2.600001e-12 1.766666e-12 1.350022e-12
## [31] 1.099991e-12
    length(numb)
## [1] 31
    ph <- -log10(numb)
    ph
##  [1]  2.021706  2.125782  2.226352  2.321733  2.411356  2.495452  2.574668
##  [8]  2.649806  2.721683  2.791078  2.858720  2.925296  2.991476  3.057937
## [15]  3.125398  3.194667  3.266698  3.342689  3.424226  3.513548  3.614041
## [22]  3.731295  3.875690  4.070422  4.389343  7.853782 11.292428 11.585027
## [29] 11.752846 11.869659 11.958611
 VB
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [26] 25 26 27 28 29 30
    length(VB)
## [1] 31
    length(ph)
## [1] 31
    plot(VB,ph,xlab = "volume of base/ml", main = "Titration")

8 Binding Curves

Binding Curves (or Binding Assays) are an approach to determining the equilibrium constant for the interaction of two molecules, commonly used in biochemistry5. Typically, the a small molecule (commonly called the “ligand”) attaches to a larger molecule such as a protein. If we denote the protein as P and the ligand as L we can express the dissociation as:

\(PL \rightleftharpoons P + L\)

and where the total protein is the bound plus unbound.

\(P_{tot} = PL + P\)

With the dissociation constant:

\(K_d = \frac{[P][L]}{[PL]}\)

The concentration of the ligand (in excess) is varied and the fraction bound \(\frac{[PL]}{P_{tot}}\) is plotted versus the L6.

Substituting \(P_{tot} = P + PL\) into the \(K_d\) expression:

we get

\(K_d = \frac{([P_{tot}]-[PL])[L]}{PL]} = \frac{([P_{tot}][L]-[PL]{L]}}{[PL]}\)

\(K_d[PL] + [PL][L] = [P_{tot}][L]\)

\([PL](K_d +[L]) = [P_{tot}][L]\)

and we get expression for fraction bound as a function of \(K_d\) and L.

\(\frac{[PL]}{P_{tot}} = \frac{[L]}{K_d + [L]}\)

note as L increases the fraction bound goes from 0 to 1.

The key to binding assay is to find an experimental method to determine the concentration of PL, the bound form.

Although it is called a “binding assay” the equilibrium is discussed in terms of the dissociation, with \(K_d\)

8.1 Myoglobin Bindng

knitr::include_graphics('250px-Myoglobin.png')
Myoglobin Binding

Myoglobin Binding

#  Nonlinear Regression / Binding Curve 
#  simulate noisy data for myoglobin  KD = 0.26  [L] in kPa
#  data generated with random noise added 

library(nls2)
## Loading required package: proto
 KD <- 0.26              # initializing variables
 L <- 0.0  
 y <- 0.0
 
 L <-  seq(.1,1.5,0.1)   #  set of 15 ligand concentrations 
 
 rnd <- rnorm(15,0,0.04)   #   set of 15 random values

 y <-  L/(KD+L) + rnd     #  binding curve fuction with noise
 
 plot(L,y, main = "Simulated binding data", xlab = "pO2 Kpa",
      ylab = "Fractional Binding")

tryfit <- nls2(y ~ L/(KD+L),  
              start = c(KD = 5))

summary(tryfit)
## 
## Formula: y ~ L/(KD + L)
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## KD  0.27248    0.01039   26.21 2.67e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02848 on 14 degrees of freedom
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 1.189e-06
plot(L,y, main = "Myoglobin Binding Curve", xlab = "pO2 Kpa", ylab = "Fractional Binding")
  
lines(L,predict(tryfit), col = "blue")    # add a line fitted to points

8.2 L not in Excess

For the Binding Assay, with P = Initial Protein and L = Initial Ligand, and PL = x = amount of bound protein, we have the equilibrium expression:

\(K_d = \frac{([P]-x)([L]-x)}{x}\)

if we solve for x and divide by L, we will get fraction bound.

\(K_d = \frac{[P][L] -[P]x -x[L] + x^2}{x}\)

\(K_dx = [P][L] -[P]x -x[L] + x^2\)

\([P][L] -[P]x -x[L] -K_dx + x^2 = 0\)

\([P][L] -(K_d + [P] + [L])x + x^2 = 0\)

The quadratic solution, divlded by [L],

\(\frac{x}{L} = \frac{(K_d + [P] + [L]) - \sqrt{(K_d + [P] + [L])^2-4([P]+[L])}}{2[L]}\)

8.3 Acid/Base Binding Curve

Adapt to titration data, framed as fraction binding as determined from the endpoint, vs pH (which is actually equilibrium ligand H)

We can transform titration data to a binding curve representation.

For the equilibrium:

\(HA \rightleftharpoons H^+ + A^-\)

\(K_a = \frac{[H^+][A^-]}{[HA]}\)

we are measuring H+ and for a weak base strong acid titration, the fraction bound is fraction of the way to the endpoint.

\(f = \frac{[HA]}{[A^-]+[HA]} = \frac{vol}{vol_t}\)

substituting the equilibrium expressions into this:

\(f = \frac{[HA]}{\frac{K_a[HA]}{[H^+]}+[HA]}\)

\(\frac{1}{\frac{K^a}{[H^+]}+1}\)

\(f = \frac{[H^+]}{[H^+]+K_a}\)

It is useful to note that we are meauring [H^+], so unlike the similar binding equation, there is no approximation of excess “ligand”.

This is a very useful way to analyze acid base data to determie \(K_a\) vaues.

Where is the error? The error is in the estimation of the endpoint, which translates into a sytematic error in the fraction bound.

KD <- 1.75 * 10^(-5)             # initializing variables
L <- 0.0  
y <- 0.0

KD
## [1] 1.75e-05
#  A mote carlo determination of sd, consistent with nls
#  can use with duplex 

pH <-  seq(3,6,0.2)  #   generating a list of values 

pH
##  [1] 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0
length(pH)
## [1] 16
L <- 10^(-pH)

L
##  [1] 1.000000e-03 6.309573e-04 3.981072e-04 2.511886e-04 1.584893e-04
##  [6] 1.000000e-04 6.309573e-05 3.981072e-05 2.511886e-05 1.584893e-05
## [11] 1.000000e-05 6.309573e-06 3.981072e-06 2.511886e-06 1.584893e-06
## [16] 1.000000e-06
ra <-rnorm(16,0,0.005)    #   generate 15 random numbers 

length(ra)
## [1] 16
y <-  L/(KD+L)




y
##  [1] 0.98280098 0.97301287 0.95789293 0.93486885 0.90056215 0.85106383
##  [7] 0.78286692 0.69464699 0.58938371 0.47524556 0.36363636 0.26500153
## [13] 0.18532929 0.12551972 0.08304438 0.05405405
y <-  L/(KD+L) + ra

y
##  [1] 0.97969162 0.97755767 0.96124512 0.93705068 0.90225273 0.85848376
##  [7] 0.78786350 0.69243102 0.58615082 0.48011077 0.36482037 0.26557091
## [13] 0.18291901 0.12154736 0.08234533 0.06309297
plot(pH,y)

tryfit <- nls(y ~ L/(KD+L),  
              start = c(KD = 0.0001))

summary(tryfit)
## 
## Formula: y ~ L/(KD + L)
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## KD 1.739e-05  1.207e-07   144.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004163 on 15 degrees of freedom
## 
## Number of iterations to convergence: 4 
## Achieved convergence tolerance: 1.028e-06
plot(pH,y, main = "Acid/Base Binding", xlab = "pH", ylab = "Fractional Binding")

lines(pH,predict(tryfit), col = "blue")  # add a line fitted line

This approach to acid base binding can be generalized, and the solution for a multi-protic system is:

\[f =\frac{ \frac{[H^+]}{K_1} + \frac{2[H^+]^2}{K_1K_2} + \frac{3[H^+]^3}{K_1K{_2K_3}}}{1+\frac{[H^+]}{K_1}+\frac{[H^+]^2}{K_1K_2}+\frac{[H^+]^3}{K_1K_2K_3}}\]

ould us monte carlo to add error to x values after calculation of theoretical curve & examine validity of fit and error estimates on ka.

8.3.1 Exploration of diprotic titrations

research questions 1. what is the error of midpoint est of pk for amino acids? 2. what is the error in assuming small dissociation for binding curves for dirotic acids? can we select points
3, can we fit volume total? 4. what about error in x? pH in binding curve analysis

8.3.2 hemoglobin

9 Optimization Methods

library(subplex)
 #  test of  sublex (nelder mead variation) 

     #   use monte carlo to get error estimate 

     #  compare with nls2

     re <- 0
#   generate exponential data 

      xv  <- seq(0,9,1)
      
      alpha <- 7.5
      
      k  <- 0.2
      
      y  <- alpha*exp(k*xv)
      
      y
##  [1]  7.500000  9.160521 11.188685 13.665891 16.691557 20.387114 24.900877
##  [8] 30.414000 37.147743 45.372356
      plot(xv,y)

      length(xv)
## [1] 10
      rn <- rnorm(10,0,2)
      
      y <- y + rn      #   exp with noise
      
      plot(xv,y)  

      yfit <- 0
      
      library(nls2)
      
      yfit <- nls2(y~alpha*exp(k*xv),start=c(alpha=5,k=0.5))
      
      yfit     #  nls or nls2
## Nonlinear regression model
##   model: y ~ alpha * exp(k * xv)
##    data: <environment>
##  alpha      k 
## 7.3781 0.1978 
##  residual sum-of-squares: 21.13
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 3.252e-07
      #   nelder mead function to be minimized
      
         
     
    
      #   exponential fit using subplex
      
      #  this looks similar to uniroot can do sapply?
      
      #  y is previously defined, data set
      
      myfun <- function (x) { 
        x1 <- x[1]
        x2 <- x[2]
        sum((y -  x1*exp(x2*xv))^2)
      }
      re <- subplex(par=c(1,6),fn=myfun)
      
      re  
## $par
## [1] 7.3781374 0.1977538
## 
## $value
## [1] 21.13417
## 
## $counts
## [1] 523
## 
## $convergence
## [1] 1
## 
## $message
## [1] "limit of machine precision reached"
## 
## $hessian
## NULL
    #  sucess!!!!
      
      #  this method described in phd thesis
      
      #  can we iterate this / sapply to do monte carlo 
      
     #  or solve equations 

10 Fourier Transform and NMR

11 Electrochemistry

11.1 Pourbaix Diagrams

12 Kinetic Analysis

12.1 First order reactions

12.2 Enzyme Kinetics

13 Reports with Rmarkdown

Appendix

Anderson, S. W., S. P. Anderson, and R. S Anderson. 2015. Exhumation by Debris Flows in the 2013 Colorado Front Range Storm. Geology. http://www.pltl.org/.

  1. Introduction to Scientific Programming and Simulation Using R, CRC Press (2014)↩︎

  2. *Analytical Chemistry↩︎

  3. Bradford, M.M. (1976), Rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding, Anal. Biochem., 72 (1–2): 248–254↩︎

  4. Brady and Macnaughtan, Anal Biochem. 2015 Dec 15; 491: 43–51.Evaluation of Colorimetric Assays for Analyzing Reductively Methylated Proteins: Biases and Mechanistic Insights↩︎

  5. See Pollard, Thomas (2017). A Guide to Simple and Informative Binding Assays. Molecular Biology of the Cell 21(23)↩︎

  6. While the mathematics of equilibrium is similar to a simple acid, note the difference in a titration curve, where the -log(H) is measured↩︎