library(tidyverse)
library(ggplot2)
library(dplyr)
library(mgcv)
library(broom)

Import data

Bash pre-process the data files

The data files are organized in three folders under data/d06..., each corresponding to a replicate experiment of the same design. Within each folder, the separate CSV files are the different time points. In this case because the time point information is already encoded in the files, to make the parsing easier, we will first concatenate all the files in each folder into a single one. To make sure that one and only one header is included in the concatenated file, we use a trick thanks to this post

cd data
for i in d06*;
do
  awk '(NR == 1) || (FNR > 1)' $i/*.csv > ./merge-${i%-statistics}.csv
done
gzip merge-*.csv
cd ..

Import data into R

files <- dir("data", pattern = "merge-*") # get the file names of the merged
files <- file.path("data", files)         # this appends the directory to the file names
names(files) <- paste0("Rep", 1:3)        # name the replicates
tp <- c("0min",  "30min", "60min", "90min", "120min", "150min",  "180min", "210min", "240min") # ordered time points

# import and process the data to the format we want
dat <- files %>% 
  map(~read_csv(., na = c("", "NA", "N/A"), col_types = cols())) %>%   
  # ~read_csv(): apply read_csv() function to each file, returns a list
  #   using the readr::read_csv() is much faster than the base read.csv
  #   it doesn't "repair" the column names, like "X Parameter" -> "X.Parameter"
  #   to refer to them, use the backsticks, like `X Parameter`
  # na = c(): the output files use N/A to encode missing data
  #   to be able to read all data in one go, we need to specify that
  # col_types = cols(): this tells the function to guess the colum types
  bind_rows(.id = "Replicate") %>%
  # bind all the rows from the list items into a single data frame
  filter(`X Parameter` == "BL1-H", is.na(`Y Parameter`)) %>% 
  # only save the GFP channel, ane make sure that Y parameter is NA
  select(Replicate, Group, Sample, Count, median = `X Median`, peak =`X Peak`, 
         sd = `X SD`, cv = `X %CV`, rCV = `X %rCV`) %>% 
  mutate(Group = ordered(Group, levels = tp)) %>% # make sure time points are ordered
  arrange(Group) # sort by time points

write_csv(dat, file = "data/20210616-Msn2-Msn4-KO-flow-data.csv")

Subset the dataset and merge with the genotype information:

# get genotype info
Genotype <- read.csv("data/Genotype.csv")

## select rows based on genotype info
pi_condition <- Genotype %>% filter(Treatment == "no_Pi")

pi.dat <- dat %>% 
  filter(Sample %in% pi_condition$Sample) %>% 
  select(time = Group, Sample, replicate = Replicate, median) %>% 
  left_join(pi_condition, by = "Sample") %>% 
  select(sample = Sample, everything(), -Treatment) %>% 
  arrange(time, genotype, replicate)

write_csv(pi.dat, file = "data/20210616-Msn2-Msn4-KO-flow-data.csv.gz")

Analysis

Visualize the data

pi.dat %>% 
  ggplot(aes(x = time, y = median, color = replicate)) + geom_point(size = 0.2) + 
  facet_wrap(~genotype) + stat_summary(fun = "median", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  theme_bw()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Since the three replicates are rather similar to each other, let’s treat them as one large pool, and combine the four genotypes

pi.dat %>% 
  ggplot(aes(x = time, y = median, color = genotype)) + geom_point(size = 0.2) + 
  stat_summary(fun = "mean", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  theme_bw()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Data transformation

We want to transform the median fluorescent intensity into fold change relative to time point 0 for each genotype. Then we would like to normalize that for each non-wt genotype by the wild type fold change at each time point. The goal is to highlight the deficiency in gene induction in the knock out.

# first let's calculate the fold change by dividing the median fluorescent indensity of each genotype at each time point by the mean MFI for that genotype at time point 0.
# we begin by calculating the baseline for each genotype
pi.summary <- pi.dat %>% 
  group_by(genotype, time) %>% 
  summarize(baseline = mean(median)) %>% 
  filter(time == "0min") %>% 
  select(-time)
`summarise()` has grouped output by 'genotype'. You can override using the `.groups` argument.
# next we calculate the fold change
pi.dat1 <- pi.dat %>% 
  left_join(pi.summary, by = "genotype") %>% 
  mutate(fc = median / baseline)
# for each time point, we calculate the mean fold change for the wild type strain  
pi.wt.fc <- pi.dat1 %>% 
  filter(genotype == "wt") %>% 
  group_by(time) %>% 
  summarize(wt.mFC = mean(fc))
# finally, we normalize the non-wt strains by the wild type mean fold change at each time point
pi.dat1 <- pi.dat1 %>% 
  left_join(pi.wt.fc, by = "time") %>% 
  mutate(normFC = fc / wt.mFC)

plot the normalized fold changes

pi.dat1 %>% 
  ggplot(aes(x = time, y = normFC, color = genotype)) + geom_point(size = 0.2) + 
  stat_summary(fun = "mean", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  ylab("fold change relative to 0 min, normalized to wild type") +
  theme_bw()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

We could perform linear regression on the data to test if the slope is significantly different from zero. We use a “nest-map-unnest” workflow (see this link)

pi.dat1 %>% 
  filter(genotype != "wt") %>% 
  mutate(time = as.numeric(time)) %>% 
  group_by(genotype) %>% 
  nest() %>% 
  mutate(
    fit = map(data, ~ lm(normFC ~ time, data = .x)),
    tidied = map(fit, tidy)
  ) %>% 
  unnest(tidied)
NA

For each non-wt genotype, we can read the estimates of the intercept and slope (time) and the significance of them. We see that msn2∆ has a slope that is not siginificantly different from zero while both msn4∆ and msn2∆ msn4∆ show significant slopes that are below zero, indicating that they both show a less effective induction of CTA1 compared with the wild type. This allows us to conclude that most likely Msn4 is a positive regulator of CTA1 in C. glabrata

---
title: "data_collection_visualizing"
author: "Jinye Liang"
date: "June 13, 2021"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_depth: 5
    code_folding: hide
---

```{r}
library(tidyverse)
library(ggplot2)
library(dplyr)
library(mgcv)
library(broom)
```

## Import data
### Bash pre-process the data files
The data files are organized in three folders under `data/d06...`, each corresponding to a replicate experiment of the same design. Within each folder, the separate CSV files are the different time points. In this case because the time point information is already encoded in the files, to make the parsing easier, we will first concatenate all the files in each folder into a single one. To make sure that one and only one header is included in the concatenated file, we use a trick thanks to [this post](https://apple.stackexchange.com/questions/80611/merging-multiple-csv-files-without-merging-the-header)

```{bash}
cd data
for i in d06*;
do
  awk '(NR == 1) || (FNR > 1)' $i/*.csv > ./merge-${i%-statistics}.csv
done
gzip merge-*.csv
cd ..
```

### Import data into R
```{r}
files <- dir("data", pattern = "merge-*") # get the file names of the merged
files <- file.path("data", files)         # this appends the directory to the file names
names(files) <- paste0("Rep", 1:3)        # name the replicates
tp <- c("0min",  "30min", "60min", "90min", "120min", "150min",  "180min", "210min", "240min") # ordered time points

# import and process the data to the format we want
dat <- files %>% 
  map(~read_csv(., na = c("", "NA", "N/A"), col_types = cols())) %>%   
  # ~read_csv(): apply read_csv() function to each file, returns a list
  #   using the readr::read_csv() is much faster than the base read.csv
  #   it doesn't "repair" the column names, like "X Parameter" -> "X.Parameter"
  #   to refer to them, use the backsticks, like `X Parameter`
  # na = c(): the output files use N/A to encode missing data
  #   to be able to read all data in one go, we need to specify that
  # col_types = cols(): this tells the function to guess the colum types
  bind_rows(.id = "Replicate") %>%
  # bind all the rows from the list items into a single data frame
  filter(`X Parameter` == "BL1-H", is.na(`Y Parameter`)) %>% 
  # only save the GFP channel, ane make sure that Y parameter is NA
  select(Replicate, Group, Sample, Count, median = `X Median`, peak =`X Peak`, 
         sd = `X SD`, cv = `X %CV`, rCV = `X %rCV`) %>% 
  mutate(Group = ordered(Group, levels = tp)) %>% # make sure time points are ordered
  arrange(Group) # sort by time points
```

### Subset the dataset and merge with the genotype information:
```{r}
# get genotype info
Genotype <- read.csv("data/Genotype.csv")

## select rows based on genotype info
pi_condition <- Genotype %>% filter(Treatment == "no_Pi")

pi.dat <- dat %>% 
  filter(Sample %in% pi_condition$Sample) %>% 
  select(time = Group, Sample, replicate = Replicate, median) %>% 
  left_join(pi_condition, by = "Sample") %>% 
  select(sample = Sample, everything(), -Treatment) %>% 
  arrange(time, genotype, replicate)

write_csv(pi.dat, file = "data/20210616-Msn2-Msn4-KO-flow-data.csv.gz")
```

## Analysis
### Visualize the data
```{r}
pi.dat %>% 
  ggplot(aes(x = time, y = median, color = replicate)) + geom_point(size = 0.2) + 
  facet_wrap(~genotype) + stat_summary(fun = "median", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  theme_bw()
```
Since the three replicates are rather similar to each other, let's treat them as one large pool, and combine the four genotypes
```{r}
pi.dat %>% 
  ggplot(aes(x = time, y = median, color = genotype)) + geom_point(size = 0.2) + 
  stat_summary(fun = "mean", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  theme_bw()
```

### Data transformation
We want to transform the median fluorescent intensity into fold change relative to time point 0 for each genotype. Then we would like to normalize that for each non-wt genotype by the wild type fold change at each time point. The goal is to highlight the deficiency in gene induction in the knock out.

```{r}
# first let's calculate the fold change by dividing the median fluorescent indensity of each genotype at each time point by the mean MFI for that genotype at time point 0.
# we begin by calculating the baseline for each genotype
pi.summary <- pi.dat %>% 
  group_by(genotype, time) %>% 
  summarize(baseline = mean(median)) %>% 
  filter(time == "0min") %>% 
  select(-time)
# next we calculate the fold change
pi.dat1 <- pi.dat %>% 
  left_join(pi.summary, by = "genotype") %>% 
  mutate(fc = median / baseline)
# for each time point, we calculate the mean fold change for the wild type strain  
pi.wt.fc <- pi.dat1 %>% 
  filter(genotype == "wt") %>% 
  group_by(time) %>% 
  summarize(wt.mFC = mean(fc))
# finally, we normalize the non-wt strains by the wild type mean fold change at each time point
pi.dat1 <- pi.dat1 %>% 
  left_join(pi.wt.fc, by = "time") %>% 
  mutate(normFC = fc / wt.mFC)
```
plot the normalized fold changes
```{r}
pi.dat1 %>% 
  ggplot(aes(x = time, y = normFC, color = genotype)) + geom_point(size = 0.2) + 
  stat_summary(fun = "mean", geom = "point", size = 2) +
  stat_smooth(aes(x = as.numeric(time)), se = FALSE) +
  ylab("fold change relative to 0 min, normalized to wild type") +
  theme_bw()
```
We could perform linear regression on the data to test if the slope is significantly different from zero. We use a "nest-map-unnest" workflow (see this [link](https://cran.r-project.org/web/packages/broom/vignettes/broom_and_dplyr.html))
```{r}
pi.dat1 %>% 
  filter(genotype != "wt") %>% 
  mutate(time = as.numeric(time)) %>% 
  group_by(genotype) %>% 
  nest() %>% 
  mutate(
    fit = map(data, ~ lm(normFC ~ time, data = .x)),
    tidied = map(fit, tidy)
  ) %>% 
  unnest(tidied)
  
```

For each non-wt genotype, we can read the estimates of the intercept and slope (time) and the significance of them. We see that _msn2∆_ has a slope that is not siginificantly different from zero while both _msn4∆_ and _msn2∆ msn4∆_ show significant slopes that are below zero, indicating that they both show a less effective induction of _CTA1_ compared with the wild type. This allows us to conclude that most likely Msn4 is a positive regulator of _CTA1_ in _C. glabrata_