R Markdown

This is an R Markdown document. Markdown is a simple formatting synta? for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R co?e chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was ad?ed to the code chunk to prevent printing of the R code that generated the plot.

#3.1.1 Describing the central tendency: Mean and median#

In R, the median() function will calculate the mean of a provided vector of numbers. Let’s create a set of random ?umbers and calculate their mean and median using R.

#create 10 random numbers from uniform distribution x=runif(10) #create 100 randomm numvbers from uniform distribution y=runif(100) # calculate mean mean(x) mean(y) #3.1.2 Describing the spread: Measurem?nts of variation #We can calculate the sample standard deviation and variation with the sd() and var() functions in R. These functions take a vector of numeric values as input and calculate the desired quantities. Below we use those functions on a randomly?generated vector of numbers x=rnorm(20,mean=6,sd=0.7) var(x) sd(x) #and y=rnorm(100, mean=6, sd=0.7) var(y) sd(y) #One common way is to look at the difference between 75th percentile and 25th percentile, this effectively removes a lot of potential outliers?which will be towards the edges of the range of values. This is called the interquartile range, and can be easily calculated using R via the IQR() function and the quantiles of a vector are calculated with the quantile() function. x=rnorm(20,mean=6,sd=0.7)?IQR(x) quantile(x) #and y=rnorm(100, mean=6, sd=0.7) IQR(y) quantile(y) #boxplot boxplot(x,horizontal = T) boxplot(y, horizontal = T) #The normal distribution #In R, the family of *norm functions (rnorm,dnorm,qnorm and pnorm) can be used to operate with ?he normal distribution, such as calculating probabilities and generating random numbers drawn from a normal distribution. We show some of those capabilities below # get the value of probability density function when X= -2, # where mean=0 and sd=2 dnorm(-2,?mean=0, sd=2) dnorm(-3, mean=0, sd=3) dnorm(-4, mean=0, sd=4) # get the probability of P(X =< -2) where mean=0 and sd=2 pnorm(-2, mean=0, sd=2) pnorm(-3, mean=0, sd=3) # get the probability of P(X > -2) where mean=0 and sd=2 pnorm(-2, mean=0, sd=2,lower.ta?l = FALSE) pnorm(-3, mean=0, sd=3, lower.tail = FALSE) # get 5 random numbers from normal dist with mean=0 and sd=2 rnorm(5, mean=0 , sd=2) # get y value corresponding to P(X > y) = 0.15 with mean=0 and sd=2 qnorm( 0.15, mean=0 , sd=2) # We will estimate?the precision of the mean of the sample using bootstrapping to build confidence intervals, the resulting plot after this procedure #This procedure, resampling with replacement to estimate the precision of population parameter estimates, is known as the bo?tstrap resampling or bootstraping. library(parallel) reproducible_runif <- function(seed = NULL) {

if(is.null(seed)) {
    seed <- .Random.seed
} else {
    # .Random.seed <<- seed
    # mind the double arrow to assign in the paren? enviroment or
    assign(x = ".Random.seed", value = seed, envir = .GlobalEnv)
}

return(list(x = runif(1), seed = seed))

} library(mosaic) set.seed(21) sample1= rnorm(50,20,5) #simulate a sample

do bootstrap resampling, sampling w?th replacement

boot.means=do(1000) * mean(resample(sample1))

get percentiles from the bootstrap means

q=quantile(boot.means[,1],p=c(0.025,0.975))

plot the histogram

hist(boot.means[,1],col=“cornflowerblue”,border=“white”, xlab=“sa?ple means”) abline(v=c(q[1], q[2] ),col=“red”) text(x=q[1],y=200,round(q[1],3),adj=c(1,0)) text(x=q[2],y=200,round(q[2],3),adj=c(0,0)) #In R, we can do this using the qnorm() function to get Z-scores associated with ??/2 and 1?????/2 As you can see, the confidence intervals we calculated using CLT are very similar to the ones we got from the bootstrap for the same sample. alpha=0.05 sd=5 n=50 mean(sample1)+qnorm(c(alpha/2,1-alpha/2))*sd/sqrt(n) #3.2.1 Randomization-based testing for?difference of the means #we are doing this process in R. We are first simulating two samples from two different distributions. These would be equivalent to gene expression measurements obtained under different conditions. Then, we calculate the differences?in the means and do the randomization procedure to get a null distribution when we assume there is no difference between samples, set.seed(100) gene1=rnorm(30,mean=4,sd=2) gene2=rnorm(30,mean=2,sd=2) org.diff=mean(gene1)-mean(gene2) gene.df=data.frame(exp=?(gene1,gene2), group=c(rep(“test”,30),rep(“control”,30) ) )

exp.null <- do(1000) * diff(mosaic::mean(exp ~ shuffle(group), data=gene.df)) hist(exp.null[,1],xlab=“null distribution | no difference in samples”, main=expression(paste(?[0]," :no difference in means“) ), xlim=c(-2,2),col=”cornflowerblue“,border=”white“) abline(v=quantile(exp.null[,1],0.95),col=”red" ) abline(v=org.diff,col=“blue” ) text(x=quantile(exp.null[,1],0.95),y=200,“0.05”,adj=c(1,0),col=“red”) text(x=org.diff,?=200,“org. diff.”,adj=c(1,0),col=“blue”) #The red line marks the value that corresponds to P-value of 0.05 p.val=sum(exp.null[,1]>org.diff)/length(exp.null[,1]) p.val #3.2.2 Using t-test for difference of the means between two samples #We can also calculat? the difference between means using a t-test. Sometimes we will have too few data points in a sample to do a meaningful randomization test, also randomization takes more time than doing a t-test. This is a test that depends on the t distribution. # Welch’s?t-test stats::t.test(gene1,gene2) # t-test with equal variance assumption stats::t.test(gene1,gene2,var.equal=TRUE) #One final method that is also popular is called the “q-value” method and related to the method above. This procedure relies on estimating t?e proportion of true null hypotheses from the distribution of raw p-values and using that quantity to come up with what is called a “q-value”, which is also an FDR-adjusted P-value (Storey and Tibshirani 2003). # R, the base function p.adjust() implements?most of the p-value correction methods described above. For the q-value, we can use the qvalue package from Bioconductor. Below we demonstrate how to use them on a set of simulated p-values library(qvalue) data(hedenfalk)

qvalues <- qvalue(hedenfalk$p)$q ?onf.pval=p.adjust(hedenfalk$p,method ="bonferroni") fdr.adj.pval=p.adjust(hedenfalk$p,method =“fdr”)

plot(hedenfalk$p,qvalues,pch=19,ylim=c(0,1), xlab="raw P-values",ylab="adjusted P-values") points(hedenfalk$p,bonf.pval,pch=19,col=“red”) points(hede?falk$p,fdr.adj.pval,pch=19,col=“blue”) legend(“bottomright”,legend=c(“q-value”,“FDR (BH)”,“Bonferroni”), fill=c(“black”,“blue”,“red”))

calculate moderated and unmoderated t-test statistics and plot the P-value distributions for tests.

set?seed(100)

#sample data matrix from normal distribution

gset=rnorm(3000,mean=200,sd=70) data=matrix(gset,ncol=6)

set groups

group1=1:3 group2=4:6 n1=3 n2=3 dx=rowMeans(data[,group1])-rowMeans(data[,group2])

require(matrixStats)

get the esimate of?pooled variance

stderr = sqrt( (rowVars(data[,group1])(n1-1) + rowVars(data[,group2])(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))

do the shrinking towards median

mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation

esimate t stat?stic with moderated variance

t.mod <- dx / mod.stderr

calculate P-value of rejecting null

p.mod = 2*pt( -abs(t.mod), n1+n2-2 )

esimate t statistic without moderated variance

t = dx / stderr

calculate P-value of rejecting null

p = 2*pt( -abs(t), ?1+n2-2 )

par(mfrow=c(1,2)) hist(p,col=“cornflowerblue”,border=“white”,main="“,xlab=”P-values t-test“) mtext(paste(”signifcant tests:“,sum(p<0.05)) ) hist(p.mod,col=”cornflowerblue“,border=”white“,main=”“, xlab=”P-values mod. t-test“) mtext(paste(”si?nifcant tests:",sum(p.mod<0.05)) ) .

StatisticInRPartOne

Arash Refai

April 1, 2021

R Markdown

Including Plots

In R, the median() function will calculate the mean of a provided vector of numbers. Let’s create a set of random ?umbers and calculate their mean and median using R.

do bootstrap resampling, sampling w?th replacement

get percentiles from the bootstrap means

plot the histogram

calculate moderated and unmoderated t-test statistics and plot the P-value distributions for tests.

set groups

get the esimate of?pooled variance

do the shrinking towards median

esimate t stat?stic with moderated variance

calculate P-value of rejecting null

esimate t statistic without moderated variance

calculate P-value of rejecting null