For our biochemistry labs we need to use regression for a few different analyses:
Before we can do any hypothesis testing, we will format our data and estimate a few parameters. In the process you’ll get some more experience doing conceptually simple things with R. My experience teaching R is this: when you try to use R to do a statistical analysis that you aren’t familiar with, its hard to focus on learning R and the statistics at the same time. If you use R for things that are conceptually simple (like the first part of this analysis), you can focus on getting R to do what you want, without also trying to understand what it is you want R to do.
library(tidyverse)
library(broom)
theme_set(theme_classic())
Let’s imagine that we’ve got the following data for our standard curve. (These are made up values. You’ll have real ones!)
conc_pnp <- c(0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.10,
0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20)
abs <- c(-0.010,0.112,0.190,0.289,0.393,0.495,0.589,0.681,0.789,0.932,
1.014,1.108,1.239,1.302,1.353,1.398,1.423,1.449,1.462,1.473,1.488)
standard_curve <- tibble(conc_pnp, abs)
Once you your data into R, you can plot it:
standard_curve %>%
ggplot(aes(conc_pnp, abs)) +
geom_point() +
labs( x = "Concentration pNP (mM)",
y = "Absorbance at 405 nm")
In this case, the (made up) data look to form a straight line over most of the range of substrate concentrations, but perhaps not at the highest concentrations.
If we would like to plot a regression line fit to the full data, we can do this:
standard_curve %>%
ggplot(aes(conc_pnp, abs)) +
geom_point() +
labs( x = "Concentration pNP (mM)",
y = "Absorbance at 405 nm") +
geom_smooth(method = "lm",
se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
The line fits pretty well, but perhaps not at the highest concentrations. If we would like to exclude some high values where Beer’s law may not hold, we can use the following:
standard_curve %>%
ggplot(aes(conc_pnp, abs)) +
geom_point() +
labs(x = "Concentration pNP (mM)",
y = "Absorbance at 405 nm") +
geom_smooth(
data = subset(standard_curve, abs < 0.75),
method = "lm",
se = FALSE
)
## `geom_smooth()` using formula = 'y ~ x'
If we wanted to calculate the slope of this line, we could use the approaches we already know
standard_curve %>%
filter(abs < 0.75) %>%
lm(abs ~ conc_pnp, data = .) %>%
tidy()
The threshold used here (abs < 0.75) is excessively
conservative. It will be up to you to choose a threshold appropriate for
your data.
Enter your own standard curve results. Use the tools presented above to decide on the value of absorbance above which you do not believe the relationship between pNP concentration and absorbance is linear for your data, and record it in your notebook. You’ll need it for setting up your kinetics experiments.
Once you’ve done so, you can return to instructions in your lab manual.
Once you’ve recorded all your kinetics data, you’ll need to find the initial rate of each reaction.
Start by importing data into R. If the .csv file you export from the kinetics software is like mine, you’ll have a large rectangular set of data at the top, with a pair of columns for each experiment. Below all of that and in the left-most column is a whole bunch data from the instrument on each kinetic run. That could be super useful data to keep, but it will cause some problems for R. The easiest thing to do for now is to delete it and save the file as a new .csv file. (Its possible to leave the file unchanged and process it within R as well- ask Prof. Stoebel if you’re interested in how.)
You can then import the data into R.
bio54_trial_data <- read_csv("http://pages.hmc.edu/stoebel/bio54_trial_data.csv")
## New names:
## Rows: 602 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (14): Sample 1, ...2, Sample 2, ...4, Sample 3, ...6, Sample 4, ...8, Sa...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...4`
## • `` -> `...6`
## • `` -> `...8`
## • `` -> `...10`
## • `` -> `...12`
## • `` -> `...14`
glimpse(bio54_trial_data)
## Rows: 602
## Columns: 14
## $ `Sample 1` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...2 <chr> "Abs", "0.212629199", "0.214399874", "0.215875015", "0.2176…
## $ `Sample 2` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...4 <chr> "Abs", "0.158929765", "0.161100715", "0.162367597", "0.1638…
## $ `Sample 3` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...6 <chr> "Abs", "0.135142401", "0.136546865", "0.13818723", "0.13987…
## $ `Sample 4` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...8 <chr> "Abs", "0.099794328", "0.101361834", "0.102812424", "0.1044…
## $ `Sample 5` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...10 <chr> "Abs", "0.088778839", "0.090153083", "0.091593221", "0.0927…
## $ `Sample 6` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...12 <chr> "Abs", "0.074440829", "0.075701468", "0.076760612", "0.0778…
## $ `Sample 7` <chr> "Time (sec)", "0.050000001", "0.150000006", "0.25", "0.3500…
## $ ...14 <chr> "Abs", "0.067592159", "0.068157755", "0.06861911", "0.06977…
This data frame is most definitely not a tidy data set. First, it doesn’t have proper column headers. The column header tells us the specific experiment, not the what each variable is. The first row is the names of the variables, but it is being treated as data. Further, each vector in this data frame is a character vector. This is because when R imported the data, the first row was characters rather than numbers.
Our next step will be to fix the formatting issues and extract data from a single trial for analysis. (Once we’ve figured out how to do one trial, doing the rest will be easy.)
To get data from a single trial we will do the following:
We can do that for a single trial as follows:
trial_1 <- bio54_trial_data %>%
slice(2:602) %>% #i.e. slice out rows 2 to 602 from the whole tibble, discard the rest
select(c(1, 2)) %>% #i.e select columns 1-2
rename(time = 1,
abs = 2) %>% #I.e. rename the
mutate(time = as.numeric(time),
abs = as.numeric(abs))
head(trial_1) #The function head() displays the top few rows
Since we need to go through these steps for each of the samples, we can save ourselves lots of copying and pasting (and troubleshooting headaches later) by writing a function and using to do all of these steps. This function will take four arguments:
:, e.g 2:206),The function will return the extracted and reformatted data.
extract_data_set <- function(data, rows_to_get, time_col, abs_col) {
new_data <- data %>%
slice(rows_to_get) %>%
select(c(all_of(time_col), all_of(abs_col))) %>%
rename(time = 1,
abs = 2) %>%
mutate(time = as.numeric(time),
abs = as.numeric(abs))
}
Once we have this function, we can extract a single experiment as follows:
trial_1 <-
extract_data_set(
data = bio54_trial_data,
rows_to_get = 2:602,
time_col = 1,
abs_col = 2
)
To analyze your data, you can once again fit a line to your data.
trial_1 %>%
ggplot(aes(time, abs)) +
geom_point() +
labs(x = "time (s)",
y = "Absorbance at 405 nm") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
The regression line (in blue) very closely matches the raw data (in black). Given this good fit, we can definitely use this regression over the full time span.
trial_1 %>%
lm(abs ~ time, data = .) %>%
tidy()
R labels the calculated slope with the name of the independent
variable. (I.e. our slope is labeled with time, and has a
value of 0.01560258. (What are the units for each of those
coefficients?)
As you proceed to analyze your data, you may find that a line does not fit your data well for the full time course of some trials. For example, here is some real data:
trial_7 <-
extract_data_set(
data = bio54_trial_data,
rows_to_get = 2:602,
time_col = 13,
abs_col = 14
)
trial_7 %>%
ggplot(aes(time, abs)) +
geom_point() +
labs(x = "time (s)",
y = "Absorbance at 405 nm") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
In these cases, remember that we are trying to measure the initial rate of the reaction. You’ll want to restrict your regression line to just the intial part that looks linear.
Find the initial rates of the reaction for each of your trials using the approach outlined above.
The initial rates of reaction you just measured were in units of Abs at 405 nm per sec. We should estimate our rates in terms of molecules / time (e.g µmol / sec ).
Use Beer’s law to do this conversion. (The extinction coefficient is reported in your lab manual, and the cuvette you used had a path length of 1 cm.)
Once you have estimated the initial rate of the reaction (i.e. v) and converted to the proper units, you can fit the data.
To fit our rate data to a Michaelis-Menten function, we’ll need to
use the nls function in R.
First, I’ll create some data to work with. You’ll need to import or enter your own.
initial_concentration<-1:10 # substrate concentrations for this example
rate<-c(142.5881, 197.7911, 221.6461, 239.3364, 246.4013,
256.5914, 256.1170, 262.9451, 270.5476, 268.5552) # The example rates of reaction
mic_men_data <- tibble(initial_concentration, rate)
mic_men_data %>%
ggplot(aes(initial_concentration, rate)) +
geom_point() +
labs(x = "Substrate concentration (mM)",
"Initial rate (µmol / sec)")
To fit the data to the Michaelis-Menten function, use:
mm_fit <- nls(rate ~ SSmicmen(initial_concentration, Vmax, Km), data = mic_men_data)
#Replace x and y with the names of your vectors of data.
#Leave the Vmax and Km: R is going to calculate them for us.
We can get the values of Vmax and Km using:
coefficients(mm_fit)
## Vmax Km
## 299.203999 1.057428
The 95% confidence intervals on Vmax and Km are:
confint(mm_fit)
## 2.5% 97.5%
## Vmax 294.373742 304.202951
## Km 0.973144 1.146852
To draw your curve on top of the points:
mic_men_data %>%
ggplot(aes(initial_concentration, rate)) +
geom_point() +
labs(x = "Substrate concentration (mM)",
y = "Initial rate (µmol / sec)") +
geom_smooth(
method = "nls",
formula = y ~ SSmicmen(x, Vmax, Km),
se = FALSE
)
Imagine that I had two experiments that I wanted to plot on a single graph, but give points and lines a color unique to the individual experiment. That could be accomplished by a simply.
Start by creating the new tibble in tidy format, and then take a look at it.
# Create the new table of two experiments, in tidy format.
combined_sample_data <- tibble(
initial_concentration = c(initial_concentration, initial_concentration),
rate = c(rate, rate * 2),
exp = c(rep("Condition 1", 10), rep("Condition 2", 10))
)
#Take a look at the data
head(combined_sample_data)
With data in this format, the plot is straightforward.
combined_sample_data %>%
ggplot(aes(x = initial_concentration, y = rate, color = exp)) +
geom_point() +
geom_smooth(
method = "nls",
formula = y ~ SSmicmen(x, Vmax, Km),
se = FALSE
) +
labs(x = "Substrate concentration (mM)",
y = "Initial rate (µmol / sec)",
col = "Experiment")
Use this example code to analyze your own data.