It is a literate programing method for generating reports using R.
Similar to iPython notebooks.
Comments are written like a normal text document in a format called (Markdown)[http://daringfireball.net/projects/markdown/syntax]
Code is interspursed between the comments by being wrapped in blocks started by ```{r} and ended by another ```.
When compiled all code is run in sequence and an html (or word, or pdf) document is generated.
In this demo we will be implimenting the algorithm for calculating the q-value for a given p-value given an array of p-values from Storey and Tibshiran, 2003.
The data is p-values of correlation to obesity among words used on twitter.
Comes from the UVM’s own (Hedonometer)[http://hedonometer.org/index.html]
setwd("/Users/Nick/spring15/independentStudy")#have to set this by chunk of code
obesityRaw = read.table("data/obesity_censored.txt", header = TRUE, fill = TRUE)
obesity = na.omit(obesityRaw) #There are some NAs in the data.
head(obesity)
## WordFrequencyRank LabMTword AttributeCorrelation pvalue
## 1 00658 5.98 -0.553 1.33318e-16
## 2 02001 6.78 -0.509 6.06703e-14
## 3 00873 6.88 -0.493 4.86716e-13
## 4 02326 5.40 -0.487 9.93030e-13
## 5 00828 5.54 -0.479 2.68595e-12
## 6 03226 6.22 -0.476 3.69366e-12
## WordHappiness
## 1 pic
## 2 cafe
## 3 photo
## 4 sushi
## 5 posted
## 6 thai
Now we will set up an object that we will use throughout the process:
tests = data.frame(obesity$WordHappiness, obesity$pvalue)
names(tests) = c("word", "pval")
head(tests)
## word pval
## 1 pic 1.33318e-16
## 2 cafe 6.06703e-14
## 3 photo 4.86716e-13
## 4 sushi 9.93030e-13
## 5 posted 2.68595e-12
## 6 thai 3.69366e-12
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Note: The data are already in order, but I will include the step anyways.
tests = arrange(tests, pval)
head(tests)
## word pval
## 1 pic 1.33318e-16
## 2 cafe 6.06703e-14
## 3 photo 4.86716e-13
## 4 sushi 9.93030e-13
## 5 i 1.00012e-12
## 6 posted 2.68595e-12
\(\hat{\pi_0}(\lambda)=\frac{\#\{p_i>\lambda\}}{m(1-\lambda)}\)
#Generate lambdas
m = length(tests$pval)
lambdas = seq(0.01,0.95,0.01)
piHat = sapply(lambdas, function(l) sum(tests$pval > l)/m*(1 - l) )
piHat_df = data.frame(lambdas, piHat)
ggplot(piHat_df,aes(x=lambdas, y=piHat)) + geom_point()
Note: Not super sure why the spline is so weird. Investigating further methods of modeling this trend could be useful.
spline = smooth.Pspline(lambdas,piHat, norder = 3, df = 3, method = 2)
piHat_df$spline = spline$ysmth
ggplot(piHat_df,aes(x=lambdas)) + geom_point(aes(y = piHat)) + geom_line(aes(y = spline))
pi0 = predict(spline, 1)
print(pi0)
## [,1]
## [1,] 0.0646744
\[\hat{q_{m}} = \hat{\pi_o} \cdot p_m\]
#Start the array of values by putting in the last one
qvals = pi0 * tail(tests$pval, 1)
\[min((\hat{\pi_o} \cdot m \cdot p_i)/i, q_{i + 1})\]
for(i in (m-1):1 ){
latest = pi0 * m * tests$pval[i]/ i
last = qvals[1] #because we are filling qvals reverse, 1 will always be the last computed val
q = min(latest, last)
qvals = c(q, qvals)
}
tests$qval = qvals
# ggplot(tests, aes(x = qval)) +
# geom_histogram(aes(y=..count../sum(..count..)), color = "black", fill = "#7fc97f") +
# labs(title = "Histogram of q-values for obesity words (normalized)",
# x = "q-Value", y = "% of words at q-val")
# par(mfrow=c(2,1))
# hist(tests$pval, breaks=20, xlim = c(0,1) )
# hist(tests$qval, breaks=20, xlim = c(0,1) )
head(tests)
## word pval qval
## 1 pic 1.33318e-16 3.413812e-13
## 2 cafe 6.06703e-14 7.767781e-11
## 3 photo 4.86716e-13 4.154370e-10
## 4 sushi 9.93030e-13 5.121922e-10
## 5 i 1.00012e-12 5.121922e-10
## 6 posted 2.68595e-12 1.146298e-09