For our biochemistry labs we need to use regression for a few different analyses:

  1. To fit a line to a standard curve.
  2. To get the rate of a single reaction.
  3. To estimate the parameters from a Michaelis-Menten function.

Before we can do any hypothesis testing, we will format our data and estimate a few parameters. In the process you’ll get some more experience doing conceptually simple things with R. My experience teaching R is this: when you try to use R to do a statistical analysis that you aren’t familiar with, its hard to focus on learning R and the statistics at the same time. If you use R for things that are conceptually simple (like the first part of this analysis), you can focus on getting R to do what you want, without also trying to understand what it is you want R to do.

Fitting the standard curve

library(tidyverse)
library(broom)
theme_set(theme_classic())

Let’s imagine that we’ve got the following data for our standard curve. (These are made up values. You’ll have real ones!)

conc_pnp <-  c(0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.10,
               0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20) 

abs <- c(-0.010,0.112,0.190,0.289,0.393,0.495,0.589,0.681,0.789,0.932,
         1.014,1.108,1.239,1.302,1.353,1.398,1.423,1.449,1.462,1.473,1.488)

standard_curve <- tibble(conc_pnp, abs)

Once you your data into R, you can plot it:

standard_curve %>% 
  ggplot(aes(conc_pnp, abs)) +
  geom_point() +
  labs( x = "Concentration pNP (mM)",
        y = "Absorbance at 405 nm")

In this case, the (made up) data look to form a straight line over most of the range of substrate concentrations, but perhaps not at the highest concentrations.

If we would like to plot a regression line fit to the full data, we can do this:

standard_curve %>% 
  ggplot(aes(conc_pnp, abs)) +
  geom_point() +
  labs( x = "Concentration pNP (mM)",
        y = "Absorbance at 405 nm") +
  geom_smooth(method = "lm", 
              se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

The line fits pretty well, but perhaps not at the highest concentrations. If we would like to exclude some high values where Beer’s law may not hold, we can use the following:

standard_curve %>%
  ggplot(aes(conc_pnp, abs)) +
  geom_point() +
  labs(x = "Concentration pNP (mM)",
       y = "Absorbance at 405 nm") +
  geom_smooth(
    data = subset(standard_curve, abs < 0.75),
    method = "lm",
    se = FALSE
  )
## `geom_smooth()` using formula = 'y ~ x'

If we wanted to calculate the slope of this line, we could use the approaches we already know

standard_curve %>%
  filter(abs < 0.75) %>%
  lm(abs ~ conc_pnp, data = .) %>%
  tidy()

The threshold used here (abs < 0.75) is excessively conservative. It will be up to you to choose a threshold appropriate for your data.

Your turn

Enter your own standard curve results. Use the tools presented above to decide on the value of absorbance above which you do not believe the relationship between pNP concentration and absorbance is linear for your data, and record it in your notebook. You’ll need it for setting up your kinetics experiments.

Once you’ve done so, you can return to instructions in your lab manual.

Part 2: Calculating the initial rate of individual reactions

Once you’ve recorded all your kinetics data, you’ll need to find the initial rate of each reaction.

Start by importing data into R. If the .csv file you export from the kinetics software is like mine, you’ll have a large rectangular set of data at the top, with a pair of columns for each experiment. Below all of that and in the left-most column is a whole bunch data from the instrument on each kinetic run. That could be super useful data to keep, but it will cause some problems for R. The easiest thing to do for now is to delete it and save the file as a new .csv file. (Its possible to leave the file unchanged and process it within R as well- ask Prof. Stoebel if you’re interested in how.)

You can then import the data into R.

bio54_trial_data <- read_csv("http://pages.hmc.edu/stoebel/bio54_trial_data.csv")
## New names:
## Rows: 602 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (14): Sample 1, ...2, Sample 2, ...4, Sample 3, ...6, Sample 4, ...8, Sa...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...4`
## • `` -> `...6`
## • `` -> `...8`
## • `` -> `...10`
## • `` -> `...12`
## • `` -> `...14`
glimpse(bio54_trial_data) 
## Rows: 602
## Columns: 14
## $ `Sample 1` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...2       <chr> "Abs", "0.212629199", "0.214399874", "0.215875015", "0.2176…
## $ `Sample 2` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...4       <chr> "Abs", "0.158929765", "0.161100715", "0.162367597", "0.1638…
## $ `Sample 3` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...6       <chr> "Abs", "0.135142401", "0.136546865", "0.13818723", "0.13987…
## $ `Sample 4` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...8       <chr> "Abs", "0.099794328", "0.101361834", "0.102812424", "0.1044…
## $ `Sample 5` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...10      <chr> "Abs", "0.088778839", "0.090153083", "0.091593221", "0.0927…
## $ `Sample 6` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...12      <chr> "Abs", "0.074440829", "0.075701468", "0.076760612", "0.0778…
## $ `Sample 7` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...14      <chr> "Abs", "0.067592159", "0.068157755", "0.06861911", "0.06977…

This data frame is most definitely not a tidy data set. First, it doesn’t have proper column headers. The column header tells us the specific experiment, not the what each variable is. The first row is the names of the variables, but it is being treated as data. Further, each vector in this data frame is a character vector. This is because when R imported the data, the first row was characters rather than numbers.

Our next step will be to fix the formatting issues and extract data from a single trial for analysis. (Once we’ve figured out how to do one trial, doing the rest will be easy.)

To get data from a single trial we will do the following:

  1. Extract the part of the large data frame with results for that trial.
  2. Set the column names correctly.
  3. Make sure that the vectors are coded as numeric.

We can do that for a single trial as follows:

trial_1 <- bio54_trial_data %>%
  slice(2:602) %>% #i.e. slice out rows 2 to 602 from the whole tibble, discard the rest
  select(c(1, 2)) %>% #i.e select columns 1-2
  rename(time = 1,
         abs = 2) %>% #I.e. rename the
  mutate(time = as.numeric(time),
         abs = as.numeric(abs))
head(trial_1) #The function head() displays the top few rows

An aside: Writing a function

Since we need to go through these steps for each of the samples, we can save ourselves lots of copying and pasting (and troubleshooting headaches later) by writing a function and using to do all of these steps. This function will take four arguments:

  • the name of the data frame,
  • the rows to get. (Refer to this as vector of numbers using :, e.g 2:206),
  • the column that contains the time data, referred to by is number in the data set (e.g. 1 for the example above)
  • the column that contains the absorbance data, referred to by is number in the data set (e.g. 2 for the example above)

The function will return the extracted and reformatted data.

extract_data_set <- function(data, rows_to_get, time_col, abs_col) {
  new_data <- data %>%
    slice(rows_to_get) %>%
    select(c(all_of(time_col), all_of(abs_col))) %>%
    rename(time = 1,
           abs = 2) %>%
    mutate(time = as.numeric(time),
           abs = as.numeric(abs))
}

Once we have this function, we can extract a single experiment as follows:

trial_1 <-
  extract_data_set(
    data = bio54_trial_data,
    rows_to_get = 2:602,
    time_col = 1,
    abs_col = 2
  )

To analyze your data, you can once again fit a line to your data.

trial_1 %>%
  ggplot(aes(time, abs)) +
  geom_point() +
  labs(x = "time (s)",
       y = "Absorbance at 405 nm") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

The regression line (in blue) very closely matches the raw data (in black). Given this good fit, we can definitely use this regression over the full time span.

trial_1 %>%
  lm(abs ~ time, data = .) %>%
  tidy()

R labels the calculated slope with the name of the independent variable. (I.e. our slope is labeled with time, and has a value of 0.01560258. (What are the units for each of those coefficients?)

As you proceed to analyze your data, you may find that a line does not fit your data well for the full time course of some trials. For example, here is some real data:

trial_7 <-
  extract_data_set(
    data = bio54_trial_data,
    rows_to_get = 2:602,
    time_col = 13,
    abs_col = 14
  )

trial_7 %>%
  ggplot(aes(time, abs)) +
  geom_point() +
  labs(x = "time (s)",
       y = "Absorbance at 405 nm") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

In these cases, remember that we are trying to measure the initial rate of the reaction. You’ll want to restrict your regression line to just the intial part that looks linear.

Your turn

Find the initial rates of the reaction for each of your trials using the approach outlined above.

Convert to the right units

The initial rates of reaction you just measured were in units of Abs at 405 nm per sec. We should estimate our rates in terms of molecules / time (e.g µmol / sec ).

Your turn

Use Beer’s law to do this conversion. (The extinction coefficient is reported in your lab manual, and the cuvette you used had a path length of 1 cm.)

Estimate Km and Vmax

Once you have estimated the initial rate of the reaction (i.e. v) and converted to the proper units, you can fit the data.

To fit our rate data to a Michaelis-Menten function, we’ll need to use the nls function in R.

First, I’ll create some data to work with. You’ll need to import or enter your own.

initial_concentration<-1:10 # substrate concentrations for this example
rate<-c(142.5881, 197.7911, 221.6461, 239.3364, 246.4013, 
     256.5914, 256.1170, 262.9451, 270.5476, 268.5552) # The example rates of reaction
mic_men_data <- tibble(initial_concentration, rate)

mic_men_data %>%
  ggplot(aes(initial_concentration, rate)) +
  geom_point() +
  labs(x = "Substrate concentration (mM)",
       "Initial rate (µmol / sec)")

To fit the data to the Michaelis-Menten function, use:

mm_fit <- nls(rate ~ SSmicmen(initial_concentration, Vmax, Km), data = mic_men_data)
#Replace x and y with the names of your vectors of data.
#Leave the Vmax and Km: R is going to calculate them for us.

We can get the values of Vmax and Km using:

coefficients(mm_fit)
##       Vmax         Km 
## 299.203999   1.057428

The 95% confidence intervals on Vmax and Km are:

confint(mm_fit)
##            2.5%      97.5%
## Vmax 294.373742 304.202951
## Km     0.973144   1.146852

To draw your curve on top of the points:

mic_men_data %>%
  ggplot(aes(initial_concentration, rate)) +
  geom_point() +
  labs(x = "Substrate concentration (mM)",
       y = "Initial rate (µmol / sec)") +
  geom_smooth(
    method = "nls",
    formula = y ~ SSmicmen(x, Vmax, Km),
    se = FALSE
  )

Bonus example

Imagine that I had two experiments that I wanted to plot on a single graph, but give points and lines a color unique to the individual experiment. That could be accomplished by a simply.

Start by creating the new tibble in tidy format, and then take a look at it.

# Create the new table of two experiments, in tidy format.
combined_sample_data <- tibble(
    initial_concentration = c(initial_concentration, initial_concentration),
    rate = c(rate, rate * 2),
    exp = c(rep("Condition 1", 10), rep("Condition 2", 10))
  )

#Take a look at the data
head(combined_sample_data)

With data in this format, the plot is straightforward.

combined_sample_data %>% 
  ggplot(aes(x = initial_concentration, y = rate, color = exp)) +
  geom_point() +
  geom_smooth(
    method = "nls",
    formula = y ~ SSmicmen(x, Vmax, Km),
    se = FALSE
  ) +  
  labs(x = "Substrate concentration (mM)", 
       y = "Initial rate (µmol / sec)",
       col = "Experiment") 

Your turn

Use this example code to analyze your own data.