This project is based on my research internship at the University of California, San Diego. I will discuss data from four experiments where we treated plants with different salt concentrations and measured their photon emission.

Lets first set the working directory and load the raw data.

library(tidyverse)
library(knitr)

### set working dir
setwd("~/osmotic_pressure")

### reads all files containing csv extension and stores them in read data
osm_names <- list.files(pattern = "osm.csv")
osm_data <- lapply(osm_names, function (i){
  read.csv(i,skip = 2, na.strings = c(""," ","NA","PER WELL"))})

aeq_names <- list.files(pattern = "aeq.csv")
aeq_data <- lapply(aeq_names, function (i){
  read.csv(i,skip = 2, na.strings = c(""," ","NA","PER WELL"))})

### modify theme components

theme_set(theme_bw(base_size = 15)+theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()))

great this worked!

This is data from multiple experiments - lets have an exploratory look by checking out the dimensions of each data frame.

### print dimensions of each data frame
for (df in osm_data){
  print(dim(df))
}

[1] 452 99 [1] 402 99 [1] 452 99 [1] 452 99

There are 4 dataframes with each having roughly 450 observations in 95 variables. Lets only look at the first 10 rows of the first data frame to get an idea.

example_df <- osm_data[[1]]
kable(head(example_df[,1:10]))
X Time T..Lum Col.0 X9.3CO9K Col.0.1 X9.3CO9K.1 Col.0.2 X9.3CO9K.2 Col.0.3
NA 00:00.0 NA 84 94 51 34 35 38 39
NA 00:00.1 NA 133 69 24 57 50 17 78
NA 00:00.2 NA 128 90 26 30 19 30 42
NA 00:00.3 NA 130 90 26 43 39 49 36
NA 00:00.4 NA 134 96 43 36 60 38 51
NA 00:00.5 NA 168 117 19 27 41 36 34

This looks quite messy. Lets remove the NAs and convert the times to decimal numbers and make the data tidy. I will also add a variable with the file name.

### function to remove NAs
remove_na <- function (df){
  dummy <- df[, colSums(is.na(df)) == 0]
  return(dummy)
}

### function to convert time format
convert_time <- function (df) {
  dummi <- mutate(df, Time = as.numeric(Time)/10) 
  return(dummi)
}

### function to tidy data
tidyit <- function(df){
  tidy_df <- gather(df, line, lum, -Time, -plate)
  return(tidy_df)
}

### remove columns containing nas
osm_data <- lapply(osm_data, remove_na)
aeq_data <- lapply(aeq_data, remove_na)

### change time column 
osm_data <- lapply(osm_data, convert_time)
aeq_data <- lapply(aeq_data, convert_time)

### add plate variable
for (i in  seq_along(osm_data)){osm_data[[i]] <- mutate(osm_data[[i]], plate = osm_names[i])}
for (i in  seq_along(aeq_data)){aeq_data[[i]] <- mutate(aeq_data[[i]], plate = osm_names[i])}

### make data tidy
tidy_osm <- lapply(osm_data, tidyit)
tidy_aeq <- lapply(aeq_data, tidyit)
kable(head(tidy_osm[[1]]))
Time plate line lum
0.1 2016_12_19_600mM_sorb_osm.csv Col.0 84
0.2 2016_12_19_600mM_sorb_osm.csv Col.0 133
0.3 2016_12_19_600mM_sorb_osm.csv Col.0 128
0.4 2016_12_19_600mM_sorb_osm.csv Col.0 130
0.5 2016_12_19_600mM_sorb_osm.csv Col.0 134
0.6 2016_12_19_600mM_sorb_osm.csv Col.0 168

Much better, the new data frame has only four variables of interest and is neat and tidy.

Since we also added a file identifier we can now merge all dataframes. In this case we need information of the total counts of the variable lum in the aeq files as a normalization reference

### combine all data frames into one
bind_osm <- bind_rows(tidy_osm)
bind_aeq <- bind_rows(tidy_aeq)

### calculate total aequorin function

aequorin_sum <- function(df){
  
  total_aeq <- group_by(df, line, plate)
  total_aeq_1 <- summarize(total_aeq, tot_aeq = sum(lum))
  
  return(total_aeq_1)
  
}

### calculate total aequorin needed for normalization between experiments
total_aeq <- aequorin_sum(bind_aeq)
total_aeq <- total_aeq %>% select(plate,line,tot_aeq)
osm_merge <- left_join(bind_osm, total_aeq)
## Joining, by = c("plate", "line")
osm_rename <- osm_merge %>% mutate(plant = substr(line,1,2))

### rename lines
osm_final <-osm_rename %>% mutate(plant = ifelse(plant == "X9","9.3CO9K", plant))
osm_final <-osm_final %>% mutate(plant = ifelse(plant == "Co","Col-0", plant))

To analyze the data, we first need to correct for experimental noise. Therefore we look at the data that were collected in the first five seconds before the actual experiment was started.

### first cut off values below 40000 to prevent analysis of dead plants (empirical estimation)
osm_cut <- osm_final %>% filter(tot_aeq > 40000)

### set factor levels for plate variable so that plotting is nicer
osm_cut$plate <- as.factor(osm_cut$plate)

plate_names <- c("600 mM",
                 "1200 mM",
                 "300 mM",
                 "900 mM")

levels(osm_cut$plate) <- plate_names

### change factor levels order
osm_cut$plate <- factor(osm_cut$plate, c("300 mM", "600 mM","900 mM","1200 mM"))
osm_cut$plant <- factor(osm_cut$plant, c("Col-0","9.3CO9K"))

### plot baselines
baseline_test <- osm_cut %>% filter(Time<5)

ggplot(baseline_test, aes(x=Time, y= lum))+geom_point(alpha = 0.15)+facet_grid(plate~plant)+labs(x="Time [s]",y="photons")

The baseline noise appears normally distributed. Since all distributions appear similar, it seems reasonable to build the overall mean and subtract it from each experiment.

### calculate means
means <- baseline_test %>% filter(Time < 5) %>% group_by(plant,plate) %>% summarize(mean_lum = mean(lum))

mean_of_means <- mean(means$mean_lum)
osm_base <- osm_cut %>% mutate(lum = lum - mean_of_means)
kable(means)
plant plate mean_lum
Col-0 300 mM 28.63084
Col-0 600 mM 40.78784
Col-0 900 mM 36.81165
Col-0 1200 mM 39.72283
9.3CO9K 300 mM 26.96016
9.3CO9K 600 mM 38.13393
9.3CO9K 900 mM 36.06217
9.3CO9K 1200 mM 40.03560

To normalize, we devide by the sum of all luminescence counts for each experiment. Then, we apply an empirical function to calculate calcium concentrations based on the luminescence:

\(Calcium\, concentration = 0.332588 (−log(\frac{photons\, from\, plant}{total\, no.\, of\, photons})) + 5.5593\)

The next step is to look at the calcium time series.

### normalize function
normalize <- function(df){
  norm <- mutate(df, luminescence = lum / (tot_aeq + lum))
}

#### calculate cacl2
calc <- function(df){
  ca <- mutate(df, luminescence = ifelse(luminescence > 0, luminescence, 0.0000005))
  ca <-mutate(ca, cacl2 = (10^9)*(10^-(5.5593-0.332588*log10(luminescence*10))))
}

osm_norm <- normalize(osm_base)

### rename wt and mutant to wt and mu
osm_cacl2 <- calc(osm_norm)
osm_mean <- osm_cacl2 %>% group_by(Time, plant,plate) %>% mutate(avg = mean(cacl2))


ggplot(osm_cacl2, aes(x=Time,y=cacl2))+geom_line()+facet_grid(plate~plant)

You can see that the time traces for individual plants are very heterogeneous which is typical for biological systems. Lets focus on the mean trace to make the visualization more clear even for the sake of losing some information.

### plot cacl2 traces r
ca_traces <- osm_cacl2 %>% group_by(Time, plant,plate) %>%
  summarize(mean_ca = mean(cacl2), dev = sd(cacl2), counts = n(), se = dev / sqrt(counts))


#### change linetype for plate
ca_traces_plot <- ggplot(ca_traces, aes(x = Time, y = mean_ca, color = plate))
ca_traces_plot+geom_line(size=1.0, aes(color = plant))+
  theme(legend.title = element_blank(),
        legend.justification = c(0,1),
        legend.position = c(0.04,0.96))+
  facet_grid(.~plate)+
  labs(x="Time [s]", y ="[Ca2+]i [nM]")+
  coord_cartesian(ylim = c(0,800))+
  scale_x_continuous(expand = c(0,1))+
  scale_y_continuous(expand = c(0,1))

Thats awesome. Now we can see that the calcium makes a sharp peak in both plant types but its always stronger for Col-0 than for 9.3CO9K. The peak also increases with salt concentration. To quantify this, lets use the primary peak as a metric.

#### plot peaks with error bars
error_test <- osm_cacl2 %>% group_by(Time, plant, plate) %>% summarize(avg = mean(cacl2), dev = sd(cacl2), counts = n(), se = dev / sqrt(counts))
error_test <- error_test %>% group_by(plant,plate) %>% filter(avg == max(avg)) %>% select(-Time)
error_test <- ungroup(error_test)

error_plot <- ggplot(error_test,aes(plate, avg, group=plant))

#define dodge position
pd <- position_dodge(0.4)

# actual peak plot comes here
error_plot +geom_col(width = 0.3, aes(fill = plant), color = "black", position = pd)+
  labs(x="Osmolarity [mOsmol]",y ="1. Peak [Ca2+]i [nM]")+coord_cartesian(ylim = c(0,800))+
  geom_errorbar(aes(ymin=avg-se, ymax=avg+se), width=.1, position = pd)+
  scale_y_continuous(expand = c(0,1))+
  theme(legend.title = element_blank(), legend.justification = c(0,1), legend.position = c(0.04,0.96))

Great, here you can see the peaks for each plant type and again we see that there are huge differences. The error bars show standard deviations. To test, whether the peak differences are indeed significant we can apply some statistics…even though it seems obvious if we look at the standard deviations. To compare, means between two independent samples, the two sample t-test seems reasonable. Note that we then assume that the data is normally distributed.

### calculate peak for each individual seedling
peak_by_seedling <- osm_cacl2 %>% group_by(line,plate) %>% summarize(peak = max(cacl2))
peak_by_seedling <-ungroup(peak_by_seedling)
peak_by_seedling <- peak_by_seedling %>% mutate(line = substr(line,1,2))

for(i in plate_names){
  sample_a <- peak_by_seedling %>% filter(line == "Co") %>% filter(plate == i) %>% select(peak)
  sample_b <- peak_by_seedling %>% filter(line != "Co") %>% filter(plate == i) %>% select(peak)
  
  print(paste0("Testing significance for ",i," Sorbitol"))
  print(paste0("p value = ",t.test(as.data.frame(sample_a),as.data.frame(sample_b), var.equal = TRUE)$p.value))
  print(" ")
}
## [1] "Testing significance for 600 mM Sorbitol"
## [1] "p value = 4.36154921447432e-11"
## [1] " "
## [1] "Testing significance for 1200 mM Sorbitol"
## [1] "p value = 9.70400726570581e-05"
## [1] " "
## [1] "Testing significance for 300 mM Sorbitol"
## [1] "p value = 0.000878890198574683"
## [1] " "
## [1] "Testing significance for 900 mM Sorbitol"
## [1] "p value = 8.06111988200361e-07"
## [1] " "

As expected, the p values are really low. Lets look at the p value = 0.0008 for 300 mM sorbitol. This means that with a probability of 0.0008 the means of the two sample populations are actually the same. The other p values are even smaller.