Literate programing and reproducable research:

Go To: http://rpubs.com/nstrayer/71833 to see hosted version of this document.

What is R-Markdown?

It is a literate programing method for generating reports using R.
Similar to iPython notebooks.
Comments are written like a normal text document in a format called (Markdown)[http://daringfireball.net/projects/markdown/syntax]
Code is interspursed between the comments by being wrapped in blocks started by ```{r} and ended by another ```.
When compiled all code is run in sequence and an html (or word, or pdf) document is generated.

In this demo we will be implimenting the algorithm for calculating the q-value for a given p-value given an array of p-values from Storey and Tibshiran, 2003.

The following is an example of a use of RMarkdown.

The data is p-values of correlation to obesity among words used on twitter.

Comes from the UVM’s own (Hedonometer)[http://hedonometer.org/index.html]

First we load in the data from the hedonometer paper:

setwd("/Users/Nick/spring15/independentStudy")#have to set this by chunk of code
obesityRaw = read.table("data/obesity_censored.txt", header = TRUE, fill = TRUE)
obesity = na.omit(obesityRaw) #There are some NAs in the data.
head(obesity)

##   WordFrequencyRank LabMTword AttributeCorrelation      pvalue
## 1             00658      5.98               -0.553 1.33318e-16
## 2             02001      6.78               -0.509 6.06703e-14
## 3             00873      6.88               -0.493 4.86716e-13
## 4             02326      5.40               -0.487 9.93030e-13
## 5             00828      5.54               -0.479 2.68595e-12
## 6             03226      6.22               -0.476 3.69366e-12
##   WordHappiness
## 1           pic
## 2          cafe
## 3         photo
## 4         sushi
## 5        posted
## 6          thai

Now we will set up an object that we will use throughout the process:

tests = data.frame(obesity$WordHappiness, obesity$pvalue)
names(tests) = c("word", "pval")
head(tests)

##     word        pval
## 1    pic 1.33318e-16
## 2   cafe 6.06703e-14
## 3  photo 4.86716e-13
## 4  sushi 9.93030e-13
## 5 posted 2.68595e-12
## 6   thai 3.69366e-12

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Now the algorithm:

1) Order the p-values.

Note: The data are already in order, but I will include the step anyways.

tests = arrange(tests, pval)
head(tests)

##     word        pval
## 1    pic 1.33318e-16
## 2   cafe 6.06703e-14
## 3  photo 4.86716e-13
## 4  sushi 9.93030e-13
## 5      i 1.00012e-12
## 6 posted 2.68595e-12

2) Look at the potential \(\hat{\pi_0}(\lambda)\) values:

\(\hat{\pi_0}(\lambda)=\frac{\#\{p_i>\lambda\}}{m(1-\lambda)}\)

#Generate lambdas
m = length(tests$pval)
lambdas = seq(0.01,0.95,0.01)
piHat   = sapply(lambdas, function(l) sum(tests$pval > l)/m*(1 - l) )

piHat_df = data.frame(lambdas, piHat)

ggplot(piHat_df,aes(x=lambdas, y=piHat)) + geom_point()

3) Fit a spline to the \(\hat{\pi_0}\) values

Note: Not super sure why the spline is so weird. Investigating further methods of modeling this trend could be useful.

spline = smooth.Pspline(lambdas,piHat, norder = 3, df = 3, method = 2)

piHat_df$spline = spline$ysmth

ggplot(piHat_df,aes(x=lambdas)) + geom_point(aes(y = piHat)) + geom_line(aes(y = spline))

4) Set \(\hat{\pi_0}\) to the spline at 1.

pi0 = predict(spline, 1)
print(pi0)

##           [,1]
## [1,] 0.0646744

5) Calculate \(\hat{q_{m}}\)

\[\hat{q_{m}} = \hat{\pi_o} \cdot p_m\]

#Start the array of values by putting in the last one
qvals = pi0 * tail(tests$pval, 1)

6) Find all of the rest of the q vals:

\[min((\hat{\pi_o} \cdot m \cdot p_i)/i, q_{i + 1})\]

for(i in (m-1):1 ){
  
  latest = pi0 * m * tests$pval[i]/ i
  last   = qvals[1] #because we are filling qvals reverse, 1 will always be the last computed val
  
  q = min(latest, last)
  qvals = c(q, qvals)
}

tests$qval = qvals


# ggplot(tests, aes(x = qval)) + 
#   geom_histogram(aes(y=..count../sum(..count..)), color = "black", fill = "#7fc97f") + 
#   labs(title = "Histogram of q-values for obesity words (normalized)",
#       x = "q-Value", y = "% of words at q-val")

# par(mfrow=c(2,1))
# hist(tests$pval, breaks=20, xlim = c(0,1) )
# hist(tests$qval, breaks=20, xlim = c(0,1) )

head(tests)

##     word        pval         qval
## 1    pic 1.33318e-16 3.413812e-13
## 2   cafe 6.06703e-14 7.767781e-11
## 3  photo 4.86716e-13 4.154370e-10
## 4  sushi 9.93030e-13 5.121922e-10
## 5      i 1.00012e-12 5.121922e-10
## 6 posted 2.68595e-12 1.146298e-09

Point of note:

I think it would be interesting to look more into the distribution of \(\hat{pi}\) and also the type of curve fitted to it.

R Markdown Example

Nick Strayer

April 4, 2015