We will be using the algorithm for calculating the q-value for a given p-value given an array of p-values from Storey and Tibshiran, 2003.


First we load in the data from the hedonometer paper:

setwd("/Users/Nick/spring15/independentStudy")#have to set this by chunk of code
obesityRaw = read.table("data/obesity_censored.txt", header = TRUE, fill = TRUE)
obesity = na.omit(obesityRaw) #There are some NAs in the data.
head(obesity)
##   WordFrequencyRank LabMTword AttributeCorrelation      pvalue
## 1             00658      5.98               -0.553 1.33318e-16
## 2             02001      6.78               -0.509 6.06703e-14
## 3             00873      6.88               -0.493 4.86716e-13
## 4             02326      5.40               -0.487 9.93030e-13
## 5             00828      5.54               -0.479 2.68595e-12
## 6             03226      6.22               -0.476 3.69366e-12
##   WordHappiness
## 1           pic
## 2          cafe
## 3         photo
## 4         sushi
## 5        posted
## 6          thai

Now we will set up an object that we will use throughout the process:

tests = data.frame(obesity$WordHappiness, obesity$pvalue)
names(tests) = c("word", "pval")
head(tests)
##     word        pval
## 1    pic 1.33318e-16
## 2   cafe 6.06703e-14
## 3  photo 4.86716e-13
## 4  sushi 9.93030e-13
## 5 posted 2.68595e-12
## 6   thai 3.69366e-12
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Now the algorithm:

1) Order the p-values.

Note: The data are already in order, but I will include the step anyways.

tests = arrange(tests, pval)
head(tests)
##     word        pval
## 1    pic 1.33318e-16
## 2   cafe 6.06703e-14
## 3  photo 4.86716e-13
## 4  sushi 9.93030e-13
## 5      i 1.00012e-12
## 6 posted 2.68595e-12

2) Look at the potential \(\hat{\pi_0}(\lambda)\) values:

\(\hat{\pi_0}(\lambda)=\frac{\#\{p_i>\lambda\}}{m(1-\lambda)}\)

#Generate lambdas
m = length(tests$pval)
lambdas = seq(0.01,0.95,0.01)
piHat   = sapply(lambdas, function(l) sum(tests$pval > l)/m*(1 - l) )

piHat_df = data.frame(lambdas, piHat)

ggplot(piHat_df,aes(x=lambdas, y=piHat)) + geom_point() 

3) Fit a spline to the \(\hat{\pi_0}\) values

Note: Not super sure why the spline is so weird. Investigating further methods of modeling this trend could be useful.

spline = smooth.Pspline(lambdas,piHat, norder = 3, df = 3, method = 2)

piHat_df$spline = spline$ysmth

ggplot(piHat_df,aes(x=lambdas)) + geom_point(aes(y = piHat)) + geom_line(aes(y = spline))

4) Set \(\hat{\pi_0}\) to the spline at 1.

pi0 = predict(spline, 1)
print(pi0)
##           [,1]
## [1,] 0.0646744

5) Calculate \(\hat{q_{m}}\)

\[\hat{q_{m}} = \hat{\pi_o} \cdot p_m\]

#Start the array of values by putting in the last one
qvals = pi0 * tail(tests$pval, 1)

6) Find all of the rest of the q vals:

\[min((\hat{\pi_o} \cdot m \cdot p_i)/i, q_{i + 1})\]

for(i in (m-1):1 ){
  
  latest = pi0 * m * tests$pval[i]/ i
  last   = qvals[1] #because we are filling qvals reverse, 1 will always be the last computed val
  
  q = min(latest, last)
  qvals = c(q, qvals)
}

tests$qval = qvals

head(tests)
##     word        pval         qval
## 1    pic 1.33318e-16 3.413812e-13
## 2   cafe 6.06703e-14 7.767781e-11
## 3  photo 4.86716e-13 4.154370e-10
## 4  sushi 9.93030e-13 5.121922e-10
## 5      i 1.00012e-12 5.121922e-10
## 6 posted 2.68595e-12 1.146298e-09

Point of note:

  • I think it would be interesting to look more into the distribution of \(\hat{pi}\) and also the type of curve fitted to it.