We will re-make the figures that appeared in the following publications.
Figure 1b, Yi P and Melton DA. 2014. Perspectives on the Activities of ANGPTL8/Betatrophin. Cell 159:467-468.
Figure 5b, Yi P, Park JS, Melton DA (2013) Betatrophin:a hormone that controls pancreatic beta cell proliferation. Cell 153:747-758.
These figures present the same data but in somewhat different forms. The original publication (Yi et al 2013) presented the raw data; the follow up (Yi et al 2014) presented the raw data in a dot plot.
Data from: Figure 1b, Yi P and Melton DA. 2014. Perspectives on the Activities of ANGPTL8/Betatrophin. Cell 159:467-468.
Data Originally presented: Figure 5b, Yi P, Park JS, Melton DA (2013) Betatrophin:a hormone that controls pancreatic beta cell proliferation. Cell 153:747-758.
Here is my first chunk:
1+1
#Control: "beta-cell proliferation ratio" (Ki67/insulin) from gfp-treated animals
gfp <- c(0.56,0.32,0.24,0.32,0.56)
#Treatment:"beta-cell proliferation ratio" (Ki67/insulin) from cells treated with betatropin
betat <- c(0.71,0.79,1.43,2.14,7.70,8.65,8.73)
We can queary R about how long each of these “vectors” are using the length() command
length(gfp)
## [1] 5
length(betat)
## [1] 7
“dataframes”" are the principal way we work with data in R.
This code is a bit complex for a beginner - don’t worry about what its doing.
beta.df <- data.frame(trt = c(rep("gfp",5),
rep("beta.T",7)),
response = c(gfp,betat))
The data now looks like this
beta.df
## trt response
## 1 gfp 0.56
## 2 gfp 0.32
## 3 gfp 0.24
## 4 gfp 0.32
## 5 gfp 0.56
## 6 beta.T 0.71
## 7 beta.T 0.79
## 8 beta.T 1.43
## 9 beta.T 2.14
## 10 beta.T 7.70
## 11 beta.T 8.65
## 12 beta.T 8.73
Data in R almost alywas is organized in “long” format with each row being a seperate observation, organism, etc.
We can see how big our dataframe is uing the dim() command.
dim(beta.df)
## [1] 12 2
R tells us its 12 rows by 2 columns.
We can see if R does this right. We know that our original data has 7 and 5 data points each, and if we can’t do math in our head, we can
5+7
## [1] 12
Or, if we can’t remember how many datapoint there are
length(gfp)+length(betat)
## [1] 12
summary(beta.df)
## trt response
## beta.T:7 Min. :0.240
## gfp :5 1st Qu.:0.500
## Median :0.750
## Mean :2.679
## 3rd Qu.:3.530
## Max. :8.730
mean.gfp <- mean(gfp)
mean.betat <- mean(betat)
sd.gfp <- sd(gfp)
sd.betat <- sd(betat)
n.gfp <- length(gfp)
n.betat <- length(betat)
SE.gfp <- sd.gfp/sqrt(n.gfp)
SE.betat <- sd.betat/sqrt(n.betat)
Don’t worry so much about this code - we normally won’t do all of this!
summary.df <- data.frame(trt = c("gfp","betat"),
mean = c(mean.gfp, mean.betat),
SD = c(sd.gfp,sd.betat),
SE = c(SE.gfp,SE.betat))
Take a look at our little dataframe
summary.df
## trt mean SD SE
## 1 gfp 0.400000 0.1496663 0.0669328
## 2 betat 4.307143 3.8344435 1.4492834
The upper bar. It will take both means and add both SEs to give 2 new values.
summary.df$EB.up <- summary.df$mean + summary.df$SE
The lower bar.
summary.df$EB.low <- summary.df$mean - summary.df$SE
The new dataframe
summary.df
## trt mean SD SE EB.up EB.low
## 1 gfp 0.400000 0.1496663 0.0669328 0.4669328 0.3330672
## 2 betat 4.307143 3.8344435 1.4492834 5.7564263 2.8578594
Plot just the bars
barplot(height = summary.df$mean)
Use the “arguement” “names.arg=” to label x-axis
barplot(height = summary.df$mean,
names.arg = summary.df$trt)
R’s barplot() function is very basic. For the sake of demonstrating what R can do I’ll add error bars to the plot; however, there are functions that make this much, much easier.
Did I mention we won’t really ever do this?
#Make the plot
## note that I have to set ylim = ... to accomdate the height of the error bar
the.plot <- barplot(height = summary.df$mean,
names.arg = summary.df$trt,
ylim = c(0,6))
#this object cotains the centers of the bars along the x-axis
the.plot
## [,1]
## [1,] 0.7
## [2,] 1.9
#Add error bars made with the segemetns() comamnd
## lwd() makes the bars thick
segments(x0 = the.plot,
y0 = summary.df$EB.up,
x1=the.plot,
y1=summary.df$EB.low,
lwd = 5)
#Make the plot
## note that I have to set ylim = ... to accomdate the height of the error bar
the.plot <- barplot(height = summary.df$mean,
names.arg = summary.df$trt,
col = 1,
ylim = c(0,6))
#this object cotains the centers of the bars along the x-axis
the.plot
## [,1]
## [1,] 0.7
## [2,] 1.9
#Add error bars made with the segemetns() comamnd
## lwd() makes the bars thick
segments(x0 = the.plot,
y0 = summary.df$EB.up,
x1=the.plot,
y1=summary.df$EB.low,
lwd = 5)
What we just did was pretty labor intensive. Let’s speed it up a bit using a handing function that can apply functions to dataframes, tapply()
Here we can get both means from our dataframe in one step
my.means <- tapply(X = beta.df$response, #the x value
INDEX = beta.df$trt, #the grouping variable
FUN = mean) #the function
my.means
## beta.T gfp
## 4.307143 0.400000
Now both SDs
my.sd <- tapply(X = beta.df$response, #the x value
INDEX = beta.df$trt, #the grouping variable
FUN = sd) #the function
my.sd
## beta.T gfp
## 3.8344435 0.1496663
And the sample sizes
my.n <- tapply(X = beta.df$response, #the x value
INDEX = beta.df$trt, #the grouping variable
FUN = sd) #the function
my.n
## beta.T gfp
## 3.8344435 0.1496663
Use SD and n to calcualte SE for the error bars
my.SE <- my.sd/sqrt(my.n)
my.SE
## beta.T gfp
## 1.9581735 0.3868673
Put things together into a dataframe
summary.df2 <- data.frame(trt = c("beta.T", "gfp"), my.means,my.SE)
summary.df2
## trt my.means my.SE
## beta.T beta.T 4.307143 1.9581735
## gfp gfp 0.400000 0.3868673
Thing got reversed; this will flip them. For now we’ll ignore why, but it has to do with the topic of “indexing” dataframes.
summary.df2 <- summary.df2[c(2,1),]
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
barplot2 still takes a “height” arguement
barplot2(height = summary.df2$my.means)
barplot2() also has an internal error bar builder!
It requries 3 arguements
barplot2(height = summary.df2$my.means, #the means
plot.ci = T, #turn on errorbars
ci.l = summary.df2$my.means-
summary.df2$my.SE, #calcualte lower bar
ci.u = summary.df2$my.means+
summary.df2$my.SE) #and the upper bar
Now, just turn it black so it looks like the original
barplot2(height = summary.df2$my.means, #the means
col = 1,
plot.ci = T, #turn on errorbars
ci.l = summary.df2$my.means-
summary.df2$my.SE, #calcualte lower bar
ci.u = summary.df2$my.means+
summary.df2$my.SE) #and the upper bar
Let’s make things even easier
#for summaryBy
library(doBy)
#std.error
library(plotrix)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:gplots':
##
## plotCI
REcall that our original dataframe looked like this,
beta.df
## trt response
## 1 gfp 0.56
## 2 gfp 0.32
## 3 gfp 0.24
## 4 gfp 0.32
## 5 gfp 0.56
## 6 beta.T 0.71
## 7 beta.T 0.79
## 8 beta.T 1.43
## 9 beta.T 2.14
## 10 beta.T 7.70
## 11 beta.T 8.65
## 12 beta.T 8.73
summary.df3 <- summaryBy(response ~ trt,
data = beta.df,
FUN = c(mean,
std.error) )
Now take a look at our df
summary.df3
## trt response.mean response.std.error
## 1 beta.T 4.307143 1.4492834
## 2 gfp 0.400000 0.0669328
flip it so its in original order
summary.df3 <- summary.df3[c(2,1),]
barplot2(height = summary.df3$response.mean, #the means
col = 1,
names.arg = summary.df3$trt,
plot.ci = T, #turn on errorbars
ci.l = summary.df3$response.mean-
summary.df3$response.std.error, #calcualte lower bar
ci.u = summary.df3$response.mean+
summary.df3$response.std.error)
#the names
my.names <- c("GFP", "Betatropin")
Now plot with better names and annotations
#The plot
theplot <- barplot2(height = summary.df3$response.mean, #the means
col = 1,
names.arg = my.names,
xlab = "Treatment",
ylab = "Ki67+/insulin %",
ylim = c(0,6.25),
plot.ci = T, #turn on errorbars
ci.l = summary.df3$response.mean-
summary.df3$response.std.error, #calcualte lower bar
ci.u = summary.df3$response.mean+
summary.df3$response.std.error)
theplot
## [,1]
## [1,] 0.7
## [2,] 1.9
#adding text
text(x = theplot[2],y = 5.875,labels = "* *",cex = 2)
mtext(text = c("n = 7","n = 5"),
side = 1,line = 1.95,at = theplot)
Now you to can make a plot that can be published in Cell!
The methods say “All of the p values were calculated by a standard Student’s t test with two-tails distribution.” and they note that the p-value is p < 0.005
So let’s do a t-test and check
t.test(response ~ trt,
data = beta.df)
##
## Welch Two Sample t-test
##
## data: response by trt
## t = 2.693, df = 6.0256, p-value = 0.03576
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3607443 7.4535414
## sample estimates:
## mean in group beta.T mean in group gfp
## 4.307143 0.400000
Wait - they report their p-value as < 0.005. What’s up? Maybe they didn’t use Welch’s correction for some reason
t.test(response ~ trt,
data = beta.df,
var.equal = T)
##
## Two Sample t-test
##
## data: response by trt
## t = 2.2455, df = 10, p-value = 0.04855
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03012982 7.78415589
## sample estimates:
## mean in group beta.T mean in group gfp
## 4.307143 0.400000
Nope, p got bigger!
Maybe they tranformed that data?
#log transform
t.test(log(response) ~ trt,
data = beta.df)
##
## Welch Two Sample t-test
##
## data: log(response) by trt
## t = 4.2856, df = 7.7561, p-value = 0.002865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8994077 3.0199304
## sample estimates:
## mean in group beta.T mean in group gfp
## 0.9865447 -0.9731244
P is now 0.002 which is < 0.005. (WE can check that with R)
0.002 < 0.005
## [1] TRUE
Or like this
my.p <- t.test(log(response) ~ trt,
data = beta.df)
my.p$p.value < 0.005
## [1] TRUE
This could account for this. BUt they didn’t say the did a transformation. THat is frustrating.
#sorry, annoying code to make the orders correct
beta.df$trt <- factor(beta.df$trt, levels = c("gfp","beta.T"))
boxplot(response ~ trt,
data = beta.df)
boxplot(response ~ trt,
data = beta.df)
points(response ~ trt,
data = beta.df)
Stats usually uses ln
(it also seemd to help get their original p-value…)
boxplot(log(response) ~ trt,
data = beta.df)
points(log(response) ~ trt,
data = beta.df)
We can now make a plot like in Yi et al. 2014
#install.packages("beeswarm")
library(beeswarm)
## Warning: package 'beeswarm' was built under R version 3.3.2
beeswarm(response ~ trt,
data = beta.df)
Let’s add some stuff from out summary dataframe
#fix plotting issue...
summary.df3$trt <- factor(summary.df3$trt,levels = c("gfp","beta.T") )
#dotplot
beeswarm(response ~ trt,
data = beta.df)
#mean
points(response.mean ~ trt, data = summary.df3, pch = "+", cex = 2)
Notice anyting?…
#set up for 2 panels side by side
par(mfrow = c(1,2))
#Barplot
#The plot
theplot <- barplot2(height = summary.df3$response.mean, #the means
col = 1,
names.arg = my.names,
xlab = "Treatment",
ylab = "Ki67+/insulin %",
ylim = c(0, 9),
plot.ci = T, #turn on errorbars
ci.l = summary.df3$response.mean-
summary.df3$response.std.error, #calcualte lower bar
ci.u = summary.df3$response.mean+
summary.df3$response.std.error)
theplot
## [,1]
## [1,] 0.7
## [2,] 1.9
#adding text
text(x = theplot[2],y = 5.875,labels = "* *",cex = 2)
mtext(text = c("n = 7","n = 5"),
side = 1,line = 1.95,at = theplot)
#Dotplot
#fix plotting issue...
summary.df3$trt <- factor(summary.df3$trt,levels = c("gfp","beta.T") )
#dotplot
beeswarm(response ~ trt,
data = beta.df, ylim = c(0,9),
ylab = "")
#mean
points(response.mean ~ trt, data = summary.df3, pch = "+", cex = 2)
Load data from csv
lab3.df <- read.csv(file = "Yi_et_al_PLOS_betatropin_data_subset_lab_3_CSV.csv")
Look at data
lab3.df
## animal.ID trt slide total.beta.cell Ki67.cells percent.Ki67
## 1 CM7721 MBP 3 1887 28 1.48
## 2 CM7722 MBP 3 897 6 0.67
## 3 CM7725 MBP 3 1997 11 0.55
## 4 CM7726 MBP 3 4513 17 0.38
## 5 CM7729 MBP 3 930 3 0.32
## 6 CM7731 MBP 3 3535 12 0.34
## 7 CM7733 MBP 3 2260 40 1.77
## 8 CM7734 MBP 3 4879 29 0.59
## 9 CM7735 MBP 3 4218 33 0.78
## 10 CM7737 MBP 3 4484 47 1.05
## 11 CM7720 MBP+hBT 3 1295 18 1.39
## 12 CM7723 MBP+hBT 3 1897 26 1.37
## 13 CM7724 MBP+hBT 3 1527 24 1.57
## 14 CM7727 MBP+hBT 3 3179 72 2.26
## 15 CM7728 MBP+hBT 3 4117 62 1.51
## 16 CM7730 MBP+hBT 3 4210 77 1.83
## 17 CM7732 MBP+hBT 3 3444 65 1.89
## 18 CM7736 MBP+hBT 3 9505 104 1.09
## 19 CM7738 MBP+hBT 3 7375 140 1.90
## 20 CM7739 MBP+hBT 3 4275 73 1.71
beeswarm(percent.Ki67 ~ trt, data= lab3.df)
The report a p-value of 0.0003 in Excel.
t.test(percent.Ki67 ~ trt, data= lab3.df)
##
## Welch Two Sample t-test
##
## data: percent.Ki67 by trt
## t = -4.5404, df = 15.826, p-value = 0.0003433
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2604237 -0.4575763
## sample estimates:
## mean in group MBP mean in group MBP+hBT
## 0.793 1.652
Still significant
t.test(log(percent.Ki67) ~ trt, data= lab3.df)
##
## Welch Two Sample t-test
##
## data: log(percent.Ki67) by trt
## t = -4.4056, df = 11.188, p-value = 0.001012
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3152160 -0.4400854
## sample estimates:
## mean in group MBP mean in group MBP+hBT
## -0.3948425 0.4828082
Comments on assumptions
Biologists often get really hung up on normality, but often its not a big deal.
Failure to address homogeneity of variance is a MAJOR PROBLEM
Its better to plot your raw data and look at the “residauls” (errors) of the model