1. Summary of the analysis task

Given an input table listing the proportion of cells expressing a neuronal marker MAP2 for each combination of treatments (a dose of NGN2 virus +/- NT3 supplement; 3 replicates per condition), select the treatment combination that yields the highest proportion of MAP2-expressing cells (i.e., neurons).

We will use data visualizations and a formal statistical analysis by logistic regression to decide on the best treatment combination.

2. Getting started

Welcome to the R notebook that will guide you through your data analysis task. Whether you are an R aficionado or it is the first time using it, we hope that you will find this analysis feasible, enjoyable and educational.

We’ll start by installing several external packages (also known as libraries) that we will need for this analysis. This should only be run once - after this we recommend that you comment out these lines by adding the “#” sign to each of them.

#install.packages("effects")
#install.packages("dplyr")
#install.packages("ggplot2")

Now that these libraries are installed, we need to load them. This will need to be done every time.

library(dplyr)
library(ggplot2)
library(effects)

3. Load the data

The input data table contains the proportion of MAP2-positive cells in each well of our multiplexed experiment (in the “fractMAP2” column). The experimental conditions in each well - the concentrations of the NGN2 and control viruses (in MOI), and the amount of the NT3 supplement - are recorded in the NGN2_MOI, CONTROL_MOI and NT3_ng columns, respectively.

Let’s load our dataset using the read.table() function.

Replace the path in quotes with the path to the input data file on your machine
data <- read.table("C:/Users/Dell/Downloads/Telegram Desktop/FORAGE/LIFEARC/Input data.txt", sep='\t', header=TRUE)

Let’s print out the data as a table to have a look.

data

As you can see, this experimental design tests four doses of the NGN2 virus (0, 2, 5 and 10 MOI), using 5 MOI of the empty virus vector as a negative control in the wells where no NGN2 virus was added. In addition, for each dose of the NGN2 virus, the effects of the NT3 supplement were tested at 0 and 10 ng/mL. There are eight experimental conditions, with each condition replicated over three wells, giving 24 samples overall (one 24-well plate).

4. Represent treatment doses as categorical data

Currently, our treatment doses are represented as numbers. However, we don’t necessarily expect the doses to show a consistent trend with respect to the outcome (i.e., it is possible that both “too much” and “not enough” of the virus will impair differentiation efficiency). It may therefore more appropriate in our setting to treat each dose as a separate “category”, in the same way as we would typically encode colours or genders. (Note that categorical encoding does not mean that the underlying data must be inherently discrete!).

To tell R that our treatment doses should be considered as categorical data, we need to transform the corresponding columns from the numerical into the so-called factor format. We will define the “levels” of each factor such that the dose of 0 will be considered as a “control”, with the following levels corresponding to increasing doses. This will help us in data visualization and regression analysis further down below.

data$NGN2_MOI <- factor(data$NGN2_MOI, levels=c(0,2,5,10))
data$NT3_ng <- factor(data$NT3_ng, levels=c(0,10))

We will print just the top bit of the modified table now using the head() function.

head(data)
Question: Can you see what has changed in the output?
The head() function displays the first 6 experiments, after being categorized based on the different levels of NGN2_MOI and NT3_ng

Answer: ** From the categorization, it can be realized that all the 5 Control_MOI supplemented only the 0 doses of NGN2_MOI **.

5. Plot the data

Let’s now generate a bar plot the data. We will construct the plot such that the heights of the bars show the mean fraction of MAP2-positive cells for each treatment condition, and the error bars showing standard deviations across the three replicates.

To do this, we will first generate a summary table with conditions in rows (8 in total), and the means and standard deviations across replicates in columns. We will call the columns meanFractMAP2 and sdFractMAP2, respectively.

We will use the group_by() and summarise() functions from the dplyr package as below.

sumdata <- data %>% 
  group_by(NGN2_MOI, NT3_ng) %>%
  summarise(meanFractMAP2=mean(fractMAP2), sdFractMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.

We can now visualise the data from the summary table as a barplot. We will use the powerful ggplot2 package for this.

You don’t need to understand each parameter in the code below, but note how the plot is “assembled” from a “sum” of different function calls:

ggplot(sumdata, aes(x=NGN2_MOI, y=meanFractMAP2, fill=NT3_ng)) +
  geom_bar(position=position_dodge(), stat="identity", colour='black') +
  geom_errorbar(aes(ymin=meanFractMAP2-sdFractMAP2, ymax=meanFractMAP2+sdFractMAP2), 
                width=0.2, position=position_dodge(0.9))

Let’s now try replotting the data by averaging across NT3-treated and untreated wells. This way we will only have four bars in the bar plot, one per each NGN2 virus dose.

First, we will create a different summary table for this, where we will group the data differently.

Can you have a go at the correct parameter for this group_by() function in this case?
sumdataNGN2 <- data %>% 
  group_by(NGN2_MOI, NT3_ng=="" ) %>%
  summarise(meanFracMAP2=mean(fractMAP2), sdFracMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.

The rest of the code is ready to go (note that we’ve removed the fill parameter from ggplot() as with just one parameter to plot - NGN2 virus dose - we no longer need to use the fill colour to represent the second condition).

ggplot(sumdataNGN2, aes(x=NGN2_MOI, y=meanFracMAP2)) +
  geom_bar(stat="identity") +
  geom_errorbar(aes(ymin=meanFracMAP2-sdFracMAP2, ymax=meanFracMAP2+sdFracMAP2))

Let’s now do the same by averaging across all NGN2 doses, such that we only have two bars in the bar plot, one per each NT3 dose (treated and untreated).

Please insert the correct grouping into group_by() for this visualisation.
sumdataNGN2 <- data %>% 
  group_by(NT3_ng) %>%
  summarise(meanFracMAP2=mean(fractMAP2), sdFracMAP2 = sd(fractMAP2))

ggplot(sumdataNGN2, aes(x=NT3_ng, y=meanFracMAP2)) +
  geom_bar(stat="identity") +
  geom_errorbar(aes(ymin=meanFracMAP2-sdFracMAP2, ymax=meanFracMAP2+sdFracMAP2), width=0.2,
                position=position_dodge(0.9))

Question: Look at the error bars in the plot above. What do you notice and how would you interpret this result?


Answer: ** The untreated group has a shorter error bar compared to that of the treated group. This suggests that the mean may fall within that range. The treated group has a minimum error that is below the minimum value and this suggests that the sample size may be small for the analysis, or perhaps the biological variability of the data is very wide. The differences in the error bars of the treated and untreated groups defines a statistical significance between the two groups, suggesting that, neurons that were treated with the NT3_ng may have grown or differentiated [the more], whereas the opposite was true for the contrary. **.

6*. Formally test the effects of treatment doses on differentiation efficiency

From the plots above, we can probably have some idea already of what combination of treatment doses has worked best, and which treatment had a larger effect. However, we still need to test this formally.

Importantly, the outcomes that we are comparing are given by proportions (of MAP2-positive cells), and proportions are not distributed normally. Therefore, we shouldn’t use standard significance tests such as the T test or ANOVA for this analysis.

In addition, we have many conditions, so rather than doing lots of pairwise comparisons (and accumulating multiple testing errors), we are going to tie them all together into a regression model.

When the outcome is given by a proportion, one suitable regression framework is logistic regression. Formally, logistic regression is a type of generalised linear models of the form:

f(y) = k1x1 + k2x2 + … + b

Regression analysis can be used to assess which (if any) explanatory variables (in our case, the two treatments we have varied) affect the chances of a positive outcome (in our case, the fraction of MAP2-expressing cells generated in the differentiation experiment).

One key additional piece of information that we need for this analysis is the total number of cells scanned in each well.

The input file you are given doesn’t provide this information, so you had to check with the colleague who has performed this experiment. She told you that the data are based on roughly 10,000 cells scanned in each well.

We will now use R to fit our logistic regression model using the glm() function, which has the following key parameters:

The measurements of both the outcome variable y and all explanatory variables x1, …, xN should be provided in the input data table under the corresponding column names.

Please think of the outcome variable and explanatory variables in our model and input them into the formula below.
mod = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=data, 
          weights = rep(10000,nrow(data)))

Let’s now inspect the summary of this model generated by R.

summary(mod) 
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = data, weights = rep(10000, nrow(data)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.30098    0.08862  -71.10   <2e-16 ***
## NGN2_MOI2    2.12408    0.09307   22.82   <2e-16 ***
## NGN2_MOI5    3.97231    0.08883   44.72   <2e-16 ***
## NGN2_MOI10   3.11278    0.08992   34.62   <2e-16 ***
## NT3_ng10     0.31290    0.02084   15.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10204.34  on 23  degrees of freedom
## Residual deviance:   712.97  on 19  degrees of freedom
## AIC: 890.95
## 
## Number of Fisher Scoring iterations: 5

This command prints a few things, but we will focus on the following:

We can now plot the predictions of this model by using just one line of code in R.

plot(allEffects(mod))

Questions:
- What treatments/doses have a significant effect on the outcome compared with the no-treatment control?


Answer: ** All treatment groups have a significant effect on the outcome variable **.

- What treatment/dose has the strongest positve effect on the outcome compared with the control?


Answer: ** NGN2_MOI of 5 doses **.

Finally, as you may remember, the regression coefficients and their significance are reported with respect to the no-treatment controls. In the case of NGN2 virus, they clearly show that using the virus works better than not doing it. But is the difference between the various doses of the virus significant?

The most straightforward way to test this is to swap the control dose of the virus against which we are testing from 0 MOI to the best-working dose: 5 MOI (that is currently level 3). Do other doses doses perform significantly worse?

We will use the relevel() function to do this, rerun regression analysis, and then return the levels back to the original order.

Please insert the model formula into the glm function (it didn’t change from the analyses above)
levels(data$NGN2_MOI) # the original level order
## [1] "0"  "2"  "5"  "10"
data$NGN2_MOI = relevel(data$NGN2_MOI, ref=3) 
levels(data$NGN2_MOI) # the new level order
## [1] "5"  "0"  "2"  "10"
mod3 = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=data, 
          weights = rep(10000,nrow(data)))

summary(mod3)
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = data, weights = rep(10000, nrow(data)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.32867    0.01782 -130.66   <2e-16 ***
## NGN2_MOI0   -3.97231    0.08883  -44.72   <2e-16 ***
## NGN2_MOI2   -1.84823    0.03365  -54.92   <2e-16 ***
## NGN2_MOI10  -0.85953    0.02360  -36.42   <2e-16 ***
## NT3_ng10     0.31290    0.02084   15.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10204.34  on 23  degrees of freedom
## Residual deviance:   712.97  on 19  degrees of freedom
## AIC: 890.95
## 
## Number of Fisher Scoring iterations: 5
# Restore the original order - note that relevel() won't work as it will put 5 before 2 
data$NGN2_MOI = factor(data$NGN2_MOI, levels = c("0", "2", "5", "10"))
levels(data$NGN2_MOI) # back to how they were originally
## [1] "0"  "2"  "5"  "10"
Question: Did 5 MOI of the NGN2 virus perform significantly better than 2 and 10 MOI?


Answer: ** It did but in the negative direction **.

7. Outlier removal

Let’s have another look at the very first plot. Did you notice that one treatment condition has a surprisingly large standard deviation (almost equal to the mean).

Question: Can you see what condition it is?


Answer: ** It is the treatment NGN2_MOI with 2 doses **.

If you locate this condition in the experimental table, you’ll see that it’s likely to do with one of the three replicates for this condition being a “dropout”, in which either the differentiation or the immunostaining hasn’t really worked for some technical reasons.

Let’s remove this dropout replicate and rerun the analysis. For this, we will locate the row in the original table corresponding to this replicate and create a new data table with this row filtered out.

Enter the correct row instead of ROW_NUMBER below.
dataFilt = data[-10,]

We can now rerun the analysis.

First, let’s regenerate the summary table and replot the data. You can see that the error bar for the affected condition is now much smaller.

sumdataFilt <- dataFilt %>% 
  group_by(NGN2_MOI, NT3_ng) %>%
  summarise(meanFractMAP2=mean(fractMAP2), sdFractMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.
ggplot(sumdataFilt, aes(x=NGN2_MOI, y=meanFractMAP2, fill=NT3_ng)) +
  geom_bar(position=position_dodge(), stat="identity", colour='black') +
  geom_errorbar(aes(ymin=meanFractMAP2-sdFractMAP2, ymax=meanFractMAP2+sdFractMAP2), 
                width=0.2,position=position_dodge(0.9))

Let’s rerun the logistic regression model and compare the results with those run on unfiltered data.

Enter the model formula in the glm() function below.
mod2 = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=dataFilt, 
           weights = rep(10000,nrow(dataFilt)))

message("With the outlier:")
## With the outlier:
summary(mod)
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = data, weights = rep(10000, nrow(data)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.30098    0.08862  -71.10   <2e-16 ***
## NGN2_MOI2    2.12408    0.09307   22.82   <2e-16 ***
## NGN2_MOI5    3.97231    0.08883   44.72   <2e-16 ***
## NGN2_MOI10   3.11278    0.08992   34.62   <2e-16 ***
## NT3_ng10     0.31290    0.02084   15.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10204.34  on 23  degrees of freedom
## Residual deviance:   712.97  on 19  degrees of freedom
## AIC: 890.95
## 
## Number of Fisher Scoring iterations: 5
message("Without the outlier:")
## Without the outlier:
summary(mod2) 
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = dataFilt, weights = rep(10000, nrow(dataFilt)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.32706    0.08866  -71.36   <2e-16 ***
## NGN2_MOI2    2.33672    0.09316   25.08   <2e-16 ***
## NGN2_MOI5    3.97305    0.08883   44.73   <2e-16 ***
## NGN2_MOI10   3.11310    0.08992   34.62   <2e-16 ***
## NT3_ng10     0.35765    0.02088   17.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9415.18  on 22  degrees of freedom
## Residual deviance:  331.88  on 18  degrees of freedom
## AIC: 505.7
## 
## Number of Fisher Scoring iterations: 4

And let’s now do the same using 5 MOI of NGN2 virus as reference.

Enter the model formula in the glm() function call below.
levels(dataFilt$NGN2_MOI) # the original level order
## [1] "0"  "2"  "5"  "10"
dataFilt$NGN2_MOI = relevel(dataFilt$NGN2_MOI, ref=3) 
levels(dataFilt$NGN2_MOI) # the new level order
## [1] "5"  "0"  "2"  "10"
modFilt5MOI = glm( fractMAP2 ~ NGN2_MOI + NT3_ng, family=binomial(link="logit"), 
                   data=dataFilt, weights = rep(10000,nrow(dataFilt)))

message("With the outlier")
## With the outlier
summary(mod3)
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = data, weights = rep(10000, nrow(data)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.32867    0.01782 -130.66   <2e-16 ***
## NGN2_MOI0   -3.97231    0.08883  -44.72   <2e-16 ***
## NGN2_MOI2   -1.84823    0.03365  -54.92   <2e-16 ***
## NGN2_MOI10  -0.85953    0.02360  -36.42   <2e-16 ***
## NT3_ng10     0.31290    0.02084   15.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10204.34  on 23  degrees of freedom
## Residual deviance:   712.97  on 19  degrees of freedom
## AIC: 890.95
## 
## Number of Fisher Scoring iterations: 5
message("Without the outlier")
## Without the outlier
summary(modFilt5MOI)
## 
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"), 
##     data = dataFilt, weights = rep(10000, nrow(dataFilt)))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.35401    0.01796 -131.06   <2e-16 ***
## NGN2_MOI0   -3.97305    0.08883  -44.73   <2e-16 ***
## NGN2_MOI2   -1.63633    0.03389  -48.29   <2e-16 ***
## NGN2_MOI10  -0.85995    0.02360  -36.43   <2e-16 ***
## NT3_ng10     0.35765    0.02088   17.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9415.18  on 22  degrees of freedom
## Residual deviance:  331.88  on 18  degrees of freedom
## AIC: 505.7
## 
## Number of Fisher Scoring iterations: 4
# Restore the original order - note that relevel() won't work as it will put 5 before 2 
dataFilt$NGN2_MOI = factor(dataFilt$NGN2_MOI, levels = c("0", "2", "5", "10"))
levels(dataFilt$NGN2_MOI) # back to how they were originally
## [1] "0"  "2"  "5"  "10"
Questions:
- What regression coefficients have changed?

Answer: ** The 2 doses of treatment has been removed. The 5 doses is now showing. The 10 doses of NGN2 is now positively significant together with the 5 doses of NGN2, but 0 dose still maintains the negative significance. **.

- Has the apparent optimal experimental condition changed after removing the outlier (and if so, how)?

Answer: ** Yes, it has. The new conditions give us the notice that 2 MOI of NGN is not enough to induce differentiation in the cells. Ideally, a dose of 5 MOI of NGN2 is the optimal concentration required for a successful differentiation of the cells, in the presence of 10 ng of NT3 supplementation **.

8. Final conclusions

And we are done. It’s time for your verdict!

Question: Based on the analyses above, what combination of treatment doses do you recommend to use in the differentiation protocol?

Answer: ** 5MOI of NGN2 and 10ng/ml of NT3 **.

9. Reporting the results

We will now save this notebook in a human-readable format (in our case, as an HTML page). For R notebooks, this process is called “knitting” it, and so we will click the “Knit” button at the top of this window to generate our HTML report.