Given an input table listing the proportion of cells expressing a neuronal marker MAP2 for each combination of treatments (a dose of NGN2 virus +/- NT3 supplement; 3 replicates per condition), select the treatment combination that yields the highest proportion of MAP2-expressing cells (i.e., neurons).
We will use data visualizations and a formal statistical analysis by logistic regression to decide on the best treatment combination.
Welcome to the R notebook that will guide you through your data analysis task. Whether you are an R aficionado or it is the first time using it, we hope that you will find this analysis feasible, enjoyable and educational.
We’ll start by installing several external packages (also known as libraries) that we will need for this analysis. This should only be run once - after this we recommend that you comment out these lines by adding the “#” sign to each of them.
#install.packages("effects")
#install.packages("dplyr")
#install.packages("ggplot2")
Now that these libraries are installed, we need to load them. This will need to be done every time.
library(dplyr)
library(ggplot2)
library(effects)
The input data table contains the proportion of MAP2-positive cells in each well of our multiplexed experiment (in the “fractMAP2” column). The experimental conditions in each well - the concentrations of the NGN2 and control viruses (in MOI), and the amount of the NT3 supplement - are recorded in the NGN2_MOI, CONTROL_MOI and NT3_ng columns, respectively.
Let’s load our dataset using the read.table() function.
data <- read.table("C:/Users/Dell/Downloads/Telegram Desktop/FORAGE/LIFEARC/Input data.txt", sep='\t', header=TRUE)
Let’s print out the data as a table to have a look.
data
As you can see, this experimental design tests four doses of the NGN2 virus (0, 2, 5 and 10 MOI), using 5 MOI of the empty virus vector as a negative control in the wells where no NGN2 virus was added. In addition, for each dose of the NGN2 virus, the effects of the NT3 supplement were tested at 0 and 10 ng/mL. There are eight experimental conditions, with each condition replicated over three wells, giving 24 samples overall (one 24-well plate).
Currently, our treatment doses are represented as numbers. However, we don’t necessarily expect the doses to show a consistent trend with respect to the outcome (i.e., it is possible that both “too much” and “not enough” of the virus will impair differentiation efficiency). It may therefore more appropriate in our setting to treat each dose as a separate “category”, in the same way as we would typically encode colours or genders. (Note that categorical encoding does not mean that the underlying data must be inherently discrete!).
To tell R that our treatment doses should be considered as categorical data, we need to transform the corresponding columns from the numerical into the so-called factor format. We will define the “levels” of each factor such that the dose of 0 will be considered as a “control”, with the following levels corresponding to increasing doses. This will help us in data visualization and regression analysis further down below.
data$NGN2_MOI <- factor(data$NGN2_MOI, levels=c(0,2,5,10))
data$NT3_ng <- factor(data$NT3_ng, levels=c(0,10))
We will print just the top bit of the modified table now using the head() function.
head(data)
Answer: ** From the categorization, it can be realized that all the 5 Control_MOI supplemented only the 0 doses of NGN2_MOI **.
Let’s now generate a bar plot the data. We will construct the plot such that the heights of the bars show the mean fraction of MAP2-positive cells for each treatment condition, and the error bars showing standard deviations across the three replicates.
To do this, we will first generate a summary table with conditions in rows (8 in total), and the means and standard deviations across replicates in columns. We will call the columns meanFractMAP2 and sdFractMAP2, respectively.
We will use the group_by() and summarise() functions from the dplyr package as below.
sumdata <- data %>%
group_by(NGN2_MOI, NT3_ng) %>%
summarise(meanFractMAP2=mean(fractMAP2), sdFractMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.
We can now visualise the data from the summary table as a barplot. We will use the powerful ggplot2 package for this.
You don’t need to understand each parameter in the code below, but note how the plot is “assembled” from a “sum” of different function calls:
ggplot() defines what table should be used as input (in our case, sumdata) and which of its columns should be used for the plotting and how. Here we will ask it to plot the fraction of MAP2+ cells (meanFractMAP2) along the y axis, plot the bars for different concentrations of the NGN2 virus (NGN2_MOI) along the x axis, and use different fill colour for the conditions with and without NT3 (NT3_ng).
geom_bar() determines that these data should be plotted as a barplot (i.e., with the y axis reflecting the height of the bars). The position_dodge() call inside this function determines that the two barplots for the same virus concentration (with and without NT3, respectively) should be plotted next to each other (rather than stacked).
geom_errorbar() determines how the error bars should be plotted. Here we define that the bottom and the top boundaries of the error bars should correspond to the (mean+/-standard deviation) for each condition, respectively.
ggplot(sumdata, aes(x=NGN2_MOI, y=meanFractMAP2, fill=NT3_ng)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=meanFractMAP2-sdFractMAP2, ymax=meanFractMAP2+sdFractMAP2),
width=0.2, position=position_dodge(0.9))
Let’s now try replotting the data by averaging across NT3-treated and untreated wells. This way we will only have four bars in the bar plot, one per each NGN2 virus dose.
First, we will create a different summary table for this, where we will group the data differently.
sumdataNGN2 <- data %>%
group_by(NGN2_MOI, NT3_ng=="" ) %>%
summarise(meanFracMAP2=mean(fractMAP2), sdFracMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.
The rest of the code is ready to go (note that we’ve removed the fill parameter from ggplot() as with just one parameter to plot - NGN2 virus dose - we no longer need to use the fill colour to represent the second condition).
ggplot(sumdataNGN2, aes(x=NGN2_MOI, y=meanFracMAP2)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=meanFracMAP2-sdFracMAP2, ymax=meanFracMAP2+sdFracMAP2))
Let’s now do the same by averaging across all NGN2 doses, such that we only have two bars in the bar plot, one per each NT3 dose (treated and untreated).
sumdataNGN2 <- data %>%
group_by(NT3_ng) %>%
summarise(meanFracMAP2=mean(fractMAP2), sdFracMAP2 = sd(fractMAP2))
ggplot(sumdataNGN2, aes(x=NT3_ng, y=meanFracMAP2)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=meanFracMAP2-sdFracMAP2, ymax=meanFracMAP2+sdFracMAP2), width=0.2,
position=position_dodge(0.9))
Answer: ** The untreated group has a shorter error bar compared to that of the treated group. This suggests that the mean may fall within that range. The treated group has a minimum error that is below the minimum value and this suggests that the sample size may be small for the analysis, or perhaps the biological variability of the data is very wide. The differences in the error bars of the treated and untreated groups defines a statistical significance between the two groups, suggesting that, neurons that were treated with the NT3_ng may have grown or differentiated [the more], whereas the opposite was true for the contrary. **.
From the plots above, we can probably have some idea already of what combination of treatment doses has worked best, and which treatment had a larger effect. However, we still need to test this formally.
Importantly, the outcomes that we are comparing are given by proportions (of MAP2-positive cells), and proportions are not distributed normally. Therefore, we shouldn’t use standard significance tests such as the T test or ANOVA for this analysis.
In addition, we have many conditions, so rather than doing lots of pairwise comparisons (and accumulating multiple testing errors), we are going to tie them all together into a regression model.
When the outcome is given by a proportion, one suitable regression framework is logistic regression. Formally, logistic regression is a type of generalised linear models of the form:
f(y) = k1x1 + k2x2 + … + b
y is a binary outcome variable (such as “pass”/“fail” or “heads”/“tails”);
f(y) is typically the log-odds of success: f(y) = logit(y) = log(y/1-y);
x1, x2, … are explanatory (predictor) variables;
k1, k1, … are the regression coefficients to be determined by the regression analysis;
b is the intercept term that affects the outcome uniformly and does not depend on the predictor variables.
Regression analysis can be used to assess which (if any) explanatory variables (in our case, the two treatments we have varied) affect the chances of a positive outcome (in our case, the fraction of MAP2-expressing cells generated in the differentiation experiment).
Note: The explanatory variables can be either continuous or categorical (as in our case, as we are using treatment doses as categorical variables for reasons explained above).
Also note that in our case instead of single outcomes for each observation (which would be: “Is a given cell MAP2-positive?”), we will use the fraction of MAP2-positive cells per well as the outcome variable.
One key additional piece of information that we need for this analysis is the total number of cells scanned in each well.
We will now use R to fit our logistic regression model using the glm() function, which has the following key parameters:
data - our input table as the data parameter
family - the type of the generalised regression model (for logistic regression, it should be set to “binomial”)
the model formula in the form y ~ x1 + … + xN.
The measurements of both the outcome variable y and all explanatory variables x1, …, xN should be provided in the input data table under the corresponding column names.
mod = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=data,
weights = rep(10000,nrow(data)))
Note how we have entered the number of cells tested in each row as the weights parameter. In our case, they are the same for each well, so we are just repeating the same number for each row, but they could also be different.
Also note that in this model we are assuming that the effects of virus concentration and NT3 treatment are fully independent. If we aren’t sure about it, it may make sense to test their interaction effects as well. If you are interested, you can read up on how to do this in R very straightforwardly - for example, here: https://www.theanalysisfactor.com/generalized-linear-models-glm-r-part4/. Feel free to try this analysis in a separate chunk below.
Let’s now inspect the summary of this model generated by R.
summary(mod)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = data, weights = rep(10000, nrow(data)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.30098 0.08862 -71.10 <2e-16 ***
## NGN2_MOI2 2.12408 0.09307 22.82 <2e-16 ***
## NGN2_MOI5 3.97231 0.08883 44.72 <2e-16 ***
## NGN2_MOI10 3.11278 0.08992 34.62 <2e-16 ***
## NT3_ng10 0.31290 0.02084 15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10204.34 on 23 degrees of freedom
## Residual deviance: 712.97 on 19 degrees of freedom
## AIC: 890.95
##
## Number of Fisher Scoring iterations: 5
This command prints a few things, but we will focus on the following:
The values of each regression coefficient given in the Estimate column.
For our purposes, it is sufficient to know that the sign of the regression coefficient corresponds to whether inreasing the value of the predictor results in either increasing (+) or decreasing (-) the chances of success (in our case, on MAP2 expression).
Also, the higher is the absolute value of the coefficient, the stronger is the effect of the predictor on the chances of success.
The significance of the coefficients’ difference from zero (with a coefficient of zero corresponding to the predictor not affecting the outcome). The p-values for this significance are given in the Pr(>|z|) column.
You have also probably noticed that the names of explanatory variables listed in the summary have changed a little from what we have provided originally. This is because for categorical variables defined this way, R is testing the effect of each level separately compared with the first one. This way, each treatment level becomes a separate independent explanatory variable, which R names by adding the treatment level (in our case, concentration) to the original variable name.
We can now plot the predictions of this model by using just one line of code in R.
plot(allEffects(mod))
Answer: ** All treatment groups have a significant effect on the outcome variable **.
Answer: ** NGN2_MOI of 5 doses **.
Finally, as you may remember, the regression coefficients and their significance are reported with respect to the no-treatment controls. In the case of NGN2 virus, they clearly show that using the virus works better than not doing it. But is the difference between the various doses of the virus significant?
The most straightforward way to test this is to swap the control dose of the virus against which we are testing from 0 MOI to the best-working dose: 5 MOI (that is currently level 3). Do other doses doses perform significantly worse?
We will use the relevel() function to do this, rerun regression analysis, and then return the levels back to the original order.
levels(data$NGN2_MOI) # the original level order
## [1] "0" "2" "5" "10"
data$NGN2_MOI = relevel(data$NGN2_MOI, ref=3)
levels(data$NGN2_MOI) # the new level order
## [1] "5" "0" "2" "10"
mod3 = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=data,
weights = rep(10000,nrow(data)))
summary(mod3)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = data, weights = rep(10000, nrow(data)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32867 0.01782 -130.66 <2e-16 ***
## NGN2_MOI0 -3.97231 0.08883 -44.72 <2e-16 ***
## NGN2_MOI2 -1.84823 0.03365 -54.92 <2e-16 ***
## NGN2_MOI10 -0.85953 0.02360 -36.42 <2e-16 ***
## NT3_ng10 0.31290 0.02084 15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10204.34 on 23 degrees of freedom
## Residual deviance: 712.97 on 19 degrees of freedom
## AIC: 890.95
##
## Number of Fisher Scoring iterations: 5
# Restore the original order - note that relevel() won't work as it will put 5 before 2
data$NGN2_MOI = factor(data$NGN2_MOI, levels = c("0", "2", "5", "10"))
levels(data$NGN2_MOI) # back to how they were originally
## [1] "0" "2" "5" "10"
Answer: ** It did but in the negative direction **.
Let’s have another look at the very first plot. Did you notice that one treatment condition has a surprisingly large standard deviation (almost equal to the mean).
Answer: ** It is the treatment NGN2_MOI with 2 doses **.
If you locate this condition in the experimental table, you’ll see that it’s likely to do with one of the three replicates for this condition being a “dropout”, in which either the differentiation or the immunostaining hasn’t really worked for some technical reasons.
Let’s remove this dropout replicate and rerun the analysis. For this, we will locate the row in the original table corresponding to this replicate and create a new data table with this row filtered out.
dataFilt = data[-10,]
We can now rerun the analysis.
First, let’s regenerate the summary table and replot the data. You can see that the error bar for the affected condition is now much smaller.
sumdataFilt <- dataFilt %>%
group_by(NGN2_MOI, NT3_ng) %>%
summarise(meanFractMAP2=mean(fractMAP2), sdFractMAP2 = sd(fractMAP2))
## `summarise()` has grouped output by 'NGN2_MOI'. You can override using the
## `.groups` argument.
ggplot(sumdataFilt, aes(x=NGN2_MOI, y=meanFractMAP2, fill=NT3_ng)) +
geom_bar(position=position_dodge(), stat="identity", colour='black') +
geom_errorbar(aes(ymin=meanFractMAP2-sdFractMAP2, ymax=meanFractMAP2+sdFractMAP2),
width=0.2,position=position_dodge(0.9))
Let’s rerun the logistic regression model and compare the results with those run on unfiltered data.
mod2 = glm( fractMAP2 ~ NGN2_MOI + NT3_ng , family=binomial(link="logit"), data=dataFilt,
weights = rep(10000,nrow(dataFilt)))
message("With the outlier:")
## With the outlier:
summary(mod)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = data, weights = rep(10000, nrow(data)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.30098 0.08862 -71.10 <2e-16 ***
## NGN2_MOI2 2.12408 0.09307 22.82 <2e-16 ***
## NGN2_MOI5 3.97231 0.08883 44.72 <2e-16 ***
## NGN2_MOI10 3.11278 0.08992 34.62 <2e-16 ***
## NT3_ng10 0.31290 0.02084 15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10204.34 on 23 degrees of freedom
## Residual deviance: 712.97 on 19 degrees of freedom
## AIC: 890.95
##
## Number of Fisher Scoring iterations: 5
message("Without the outlier:")
## Without the outlier:
summary(mod2)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = dataFilt, weights = rep(10000, nrow(dataFilt)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.32706 0.08866 -71.36 <2e-16 ***
## NGN2_MOI2 2.33672 0.09316 25.08 <2e-16 ***
## NGN2_MOI5 3.97305 0.08883 44.73 <2e-16 ***
## NGN2_MOI10 3.11310 0.08992 34.62 <2e-16 ***
## NT3_ng10 0.35765 0.02088 17.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.18 on 22 degrees of freedom
## Residual deviance: 331.88 on 18 degrees of freedom
## AIC: 505.7
##
## Number of Fisher Scoring iterations: 4
And let’s now do the same using 5 MOI of NGN2 virus as reference.
levels(dataFilt$NGN2_MOI) # the original level order
## [1] "0" "2" "5" "10"
dataFilt$NGN2_MOI = relevel(dataFilt$NGN2_MOI, ref=3)
levels(dataFilt$NGN2_MOI) # the new level order
## [1] "5" "0" "2" "10"
modFilt5MOI = glm( fractMAP2 ~ NGN2_MOI + NT3_ng, family=binomial(link="logit"),
data=dataFilt, weights = rep(10000,nrow(dataFilt)))
message("With the outlier")
## With the outlier
summary(mod3)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = data, weights = rep(10000, nrow(data)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32867 0.01782 -130.66 <2e-16 ***
## NGN2_MOI0 -3.97231 0.08883 -44.72 <2e-16 ***
## NGN2_MOI2 -1.84823 0.03365 -54.92 <2e-16 ***
## NGN2_MOI10 -0.85953 0.02360 -36.42 <2e-16 ***
## NT3_ng10 0.31290 0.02084 15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10204.34 on 23 degrees of freedom
## Residual deviance: 712.97 on 19 degrees of freedom
## AIC: 890.95
##
## Number of Fisher Scoring iterations: 5
message("Without the outlier")
## Without the outlier
summary(modFilt5MOI)
##
## Call:
## glm(formula = fractMAP2 ~ NGN2_MOI + NT3_ng, family = binomial(link = "logit"),
## data = dataFilt, weights = rep(10000, nrow(dataFilt)))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.35401 0.01796 -131.06 <2e-16 ***
## NGN2_MOI0 -3.97305 0.08883 -44.73 <2e-16 ***
## NGN2_MOI2 -1.63633 0.03389 -48.29 <2e-16 ***
## NGN2_MOI10 -0.85995 0.02360 -36.43 <2e-16 ***
## NT3_ng10 0.35765 0.02088 17.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.18 on 22 degrees of freedom
## Residual deviance: 331.88 on 18 degrees of freedom
## AIC: 505.7
##
## Number of Fisher Scoring iterations: 4
# Restore the original order - note that relevel() won't work as it will put 5 before 2
dataFilt$NGN2_MOI = factor(dataFilt$NGN2_MOI, levels = c("0", "2", "5", "10"))
levels(dataFilt$NGN2_MOI) # back to how they were originally
## [1] "0" "2" "5" "10"
Answer: ** The 2 doses of treatment has been removed. The 5 doses is now showing. The 10 doses of NGN2 is now positively significant together with the 5 doses of NGN2, but 0 dose still maintains the negative significance. **.
Answer: ** Yes, it has. The new conditions give us the notice that 2 MOI of NGN is not enough to induce differentiation in the cells. Ideally, a dose of 5 MOI of NGN2 is the optimal concentration required for a successful differentiation of the cells, in the presence of 10 ng of NT3 supplementation **.
And we are done. It’s time for your verdict!
Answer: ** 5MOI of NGN2 and 10ng/ml of NT3 **.
We will now save this notebook in a human-readable format (in our case, as an HTML page). For R notebooks, this process is called “knitting” it, and so we will click the “Knit” button at the top of this window to generate our HTML report.
Important notes:
The first chunk of code (where we were installing new packages) should be commented out (by prepending each line with ‘#’) for the knitting to work.
If you skipped the logistic regression analysis parts, you also need to comment out (or delete) code chunks 12-15, 18 and 19. You can locate them quickly in the dropdown menu on the bottom right of this window.
Also note that Rstudio will likely ask you to download additional packages if it is your first time knitting a notebook into an HTML report.