This is an R Markdown document. Markdown is a simple formatting synta? for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R co?e chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE
parameter was ad?ed to the code chunk to prevent printing of the R code that generated the plot.
#3.1.1 Describing the central tendency: Mean and median#
#create 10 random numbers from uniform distribution x=runif(10) #create 100 randomm numvbers from uniform distribution y=runif(100) # calculate mean mean(x) mean(y) #3.1.2 Describing the spread: Measurem?nts of variation #We can calculate the sample standard deviation and variation with the sd() and var() functions in R. These functions take a vector of numeric values as input and calculate the desired quantities. Below we use those functions on a randomly?generated vector of numbers x=rnorm(20,mean=6,sd=0.7) var(x) sd(x) #and y=rnorm(100, mean=6, sd=0.7) var(y) sd(y) #One common way is to look at the difference between 75th percentile and 25th percentile, this effectively removes a lot of potential outliers?which will be towards the edges of the range of values. This is called the interquartile range, and can be easily calculated using R via the IQR() function and the quantiles of a vector are calculated with the quantile() function. x=rnorm(20,mean=6,sd=0.7)?IQR(x) quantile(x) #and y=rnorm(100, mean=6, sd=0.7) IQR(y) quantile(y) #boxplot boxplot(x,horizontal = T) boxplot(y, horizontal = T) #The normal distribution #In R, the family of *norm functions (rnorm,dnorm,qnorm and pnorm) can be used to operate with ?he normal distribution, such as calculating probabilities and generating random numbers drawn from a normal distribution. We show some of those capabilities below # get the value of probability density function when X= -2, # where mean=0 and sd=2 dnorm(-2,?mean=0, sd=2) dnorm(-3, mean=0, sd=3) dnorm(-4, mean=0, sd=4) # get the probability of P(X =< -2) where mean=0 and sd=2 pnorm(-2, mean=0, sd=2) pnorm(-3, mean=0, sd=3) # get the probability of P(X > -2) where mean=0 and sd=2 pnorm(-2, mean=0, sd=2,lower.ta?l = FALSE) pnorm(-3, mean=0, sd=3, lower.tail = FALSE) # get 5 random numbers from normal dist with mean=0 and sd=2 rnorm(5, mean=0 , sd=2) # get y value corresponding to P(X > y) = 0.15 with mean=0 and sd=2 qnorm( 0.15, mean=0 , sd=2) # We will estimate?the precision of the mean of the sample using bootstrapping to build confidence intervals, the resulting plot after this procedure #This procedure, resampling with replacement to estimate the precision of population parameter estimates, is known as the bo?tstrap resampling or bootstraping. library(parallel) reproducible_runif <- function(seed = NULL) {
if(is.null(seed)) {
seed <- .Random.seed
} else {
# .Random.seed <<- seed
# mind the double arrow to assign in the paren? enviroment or
assign(x = ".Random.seed", value = seed, envir = .GlobalEnv)
}
return(list(x = runif(1), seed = seed))
} library(mosaic) set.seed(21) sample1= rnorm(50,20,5) #simulate a sample
boot.means=do(1000) * mean(resample(sample1))
q=quantile(boot.means[,1],p=c(0.025,0.975))
hist(boot.means[,1],col=“cornflowerblue”,border=“white”, xlab=“sa?ple means”) abline(v=c(q[1], q[2] ),col=“red”) text(x=q[1],y=200,round(q[1],3),adj=c(1,0)) text(x=q[2],y=200,round(q[2],3),adj=c(0,0)) #In R, we can do this using the qnorm() function to get Z-scores associated with ??/2 and 1?????/2 As you can see, the confidence intervals we calculated using CLT are very similar to the ones we got from the bootstrap for the same sample. alpha=0.05 sd=5 n=50 mean(sample1)+qnorm(c(alpha/2,1-alpha/2))*sd/sqrt(n) #3.2.1 Randomization-based testing for?difference of the means #we are doing this process in R. We are first simulating two samples from two different distributions. These would be equivalent to gene expression measurements obtained under different conditions. Then, we calculate the differences?in the means and do the randomization procedure to get a null distribution when we assume there is no difference between samples, set.seed(100) gene1=rnorm(30,mean=4,sd=2) gene2=rnorm(30,mean=2,sd=2) org.diff=mean(gene1)-mean(gene2) gene.df=data.frame(exp=?(gene1,gene2), group=c(rep(“test”,30),rep(“control”,30) ) )
exp.null <- do(1000) * diff(mosaic::mean(exp ~ shuffle(group), data=gene.df)) hist(exp.null[,1],xlab=“null distribution | no difference in samples”, main=expression(paste(?[0]," :no difference in means“) ), xlim=c(-2,2),col=”cornflowerblue“,border=”white“) abline(v=quantile(exp.null[,1],0.95),col=”red" ) abline(v=org.diff,col=“blue” ) text(x=quantile(exp.null[,1],0.95),y=200,“0.05”,adj=c(1,0),col=“red”) text(x=org.diff,?=200,“org. diff.”,adj=c(1,0),col=“blue”) #The red line marks the value that corresponds to P-value of 0.05 p.val=sum(exp.null[,1]>org.diff)/length(exp.null[,1]) p.val #3.2.2 Using t-test for difference of the means between two samples #We can also calculat? the difference between means using a t-test. Sometimes we will have too few data points in a sample to do a meaningful randomization test, also randomization takes more time than doing a t-test. This is a test that depends on the t distribution. # Welch’s?t-test stats::t.test(gene1,gene2) # t-test with equal variance assumption stats::t.test(gene1,gene2,var.equal=TRUE) #One final method that is also popular is called the “q-value” method and related to the method above. This procedure relies on estimating t?e proportion of true null hypotheses from the distribution of raw p-values and using that quantity to come up with what is called a “q-value”, which is also an FDR-adjusted P-value (Storey and Tibshirani 2003). # R, the base function p.adjust() implements?most of the p-value correction methods described above. For the q-value, we can use the qvalue package from Bioconductor. Below we demonstrate how to use them on a set of simulated p-values library(qvalue) data(hedenfalk)
qvalues <- qvalue(hedenfalk\(p)\)q ?onf.pval=p.adjust(hedenfalk\(p,method ="bonferroni") fdr.adj.pval=p.adjust(hedenfalk\)p,method =“fdr”)
plot(hedenfalk\(p,qvalues,pch=19,ylim=c(0,1), xlab="raw P-values",ylab="adjusted P-values") points(hedenfalk\)p,bonf.pval,pch=19,col=“red”) points(hede?falk$p,fdr.adj.pval,pch=19,col=“blue”) legend(“bottomright”,legend=c(“q-value”,“FDR (BH)”,“Bonferroni”), fill=c(“black”,“blue”,“red”))
set?seed(100)
#sample data matrix from normal distribution
gset=rnorm(3000,mean=200,sd=70) data=matrix(gset,ncol=6)
group1=1:3 group2=4:6 n1=3 n2=3 dx=rowMeans(data[,group1])-rowMeans(data[,group2])
require(matrixStats)
stderr = sqrt( (rowVars(data[,group1])(n1-1) + rowVars(data[,group2])(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))
mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation
t.mod <- dx / mod.stderr
p.mod = 2*pt( -abs(t.mod), n1+n2-2 )
t = dx / stderr
p = 2*pt( -abs(t), ?1+n2-2 )
par(mfrow=c(1,2)) hist(p,col=“cornflowerblue”,border=“white”,main="“,xlab=”P-values t-test“) mtext(paste(”signifcant tests:“,sum(p<0.05)) ) hist(p.mod,col=”cornflowerblue“,border=”white“,main=”“, xlab=”P-values mod. t-test“) mtext(paste(”si?nifcant tests:",sum(p.mod<0.05)) ) .