HNSCC genomics

Background

Matthias has created a very good figure showing the various mutations, deletions or amplifications together with several clinical variables and HPV status of HNSCC.

Fig2

It has been suggested by the reviewers though that there are no stats describing the relation between variables and Matthias, Chris, Stephan etc want to address this by adding p-values.

The main points re stats which Chris mentioned were:
‘we can see what could be tested by statistical analysis, e.g. p53 mutation and smoking status; FBX7 and HPV status; PIK3CA/PTEN combined and HPV status; ? p values

Cyclin D1, p16 (CDKN2A) and p15 (CDKN2B) and HPV status, p value
to reviewer’s comment:
If the paper is proposing that these mutations and pathways are significant, especially in a clinical context they need at least basic statistical analysis to demonstrate that the frequency of alterations are worth pursuing. No analysis seems to have been attempted with regard to the basic patient demographics relative to genetic alterations. They simply reported and not analyzed. With a small sample size the likelihood that it is significant is low, however, if the authors would like to draw any conclusions based upon their findings, some attempt must be made.)
‘We should add p values’

Matthias however suggests using the FDR as in a similar Fig in a recent Science paper, The Mutational Landscape of Head and Neck Squamous Cell Carcinoma.

SienceFig

Initial thoughts

OK The data looks like this. I have added WT for wild-type where there were many blank spaces. Otherwise the software would see it as NA (missing or not applicable) and would exclude it from any stats - which is incorrect.

load("~/Google Drive/Matthias/genm.RData")
head(genm)

##        ID Gender Age           Site       Grade  T  N  M Survival Death
## 1  P6_pos      F  57 base_of_tongue poorly diff  2  2  0     1337     1
## 2  P8_pos      M  63         tonsil poorly diff  3  2  0     2711     0
## 3 P13_pos      F  56         tonsil    mod diff  2  1  0      792     1
## 4 P19_pos      F  63         tonsil poorly diff  3  3  0     2586     0
## 5 P26_pos      M  53         tonsil poorly diff NA NA NA     1277     1
## 6 P28_pos      M  57         tonsil poorly diff  4  3  0      734     1
##      Smoking            Alcohol TP53 CCND1 CCND3 CDKN2A CDKN2B EGFR MDM2
## 1 no_tobacco         no_alcohol    P    WT    WT     WT     WT   WT   WT
## 2 no_tobacco         no_alcohol   WT    WT    WT     WT     WT   WT   WT
## 3 no_tobacco         no_alcohol   WT    WT    WT     WT     WT   WT   WT
## 4  ex_smoker               <NA>   WT    WT    WT     WT     WT   WT   WT
## 5       <NA>               <NA>   WT    WT    WT     WT     WT   WT   WT
## 6     smoker occasional_alcohol   WT    WT    WT     WT     WT   WT   WT
##   CCND3.1 TET2 SUFU PKHD1 LRP1B KDM6A FGFR1 FGFR3 KDM6A.1 MYC MCL1 RICTOR
## 1      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
## 2      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
## 3      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
## 4      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
## 5      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
## 6      WT   WT   WT    WT    WT    WT    WT    WT      WT  WT   WT     WT
##   STK11 RB1 NF1 KRAS BCL2L1 SOX2 PTEN FBXW7 PIK3CA HPVstatus
## 1     P  WT  WT   WT     WT    G   WT     P     PG    HPVpos
## 2    WT  WT  WT   WT     WT   WT   WT     P     WT    HPVpos
## 3    WT  WT  WT   WT     WT   WT    L     p     WT    HPVpos
## 4    WT  WT  WT   WT     WT   WT   WT    WT      P    HPVpos
## 5    WT  WT  WT   WT     WT   WT   WT    WT     WT    HPVpos
## 6    WT  WT  WT   WT     WT   WT   WT    WT     WT    HPVpos

# I need to reorder some variables so that they are correctly handled by R
# - by default factors are not ordered - and when ordered you have to make
# sure they are in the correct order not alphabetical.  reorder Grade
gr = genm$Grade
gr2 = as.vector(gr)
gr2[gr == "poorly diff"] <- 3
gr2[gr == "mod diff"] <- 2
gr2[gr == "well diff"] <- 1
gr2 = as.numeric(gr2)
# genm$Grade= ordered(gr2, labels=c('well diff', 'mod diff', 'poorly
# diff'))
rm(gr, gr2)

# reorder Smoking
sm = genm$Smoking
sm2 = as.vector(sm)
sm2[sm == "no tobacco exposure"] <- 1
sm2[sm == "ex_smoker"] <- 2
sm2[sm == "active"] <- 3
sm2[sm == "smoker"] <- 3
sm2[is.na(sm)] <- NA
sm2 = as.numeric(sm2)

## Warning: NAs introduced by coercion

# genm$Smoking = ordered(sm2, labels=c('no_tobacco', 'ex_smoker',
# 'smoker'))
rm(sm, sm2)

# reorder Smoking
al = genm$Alcohol
al2 = as.vector(al)
al2[al == "no alcohol intake"] <- 1
al2[al == "occasional alcohol intake"] <- 2
al2[al == "Heavy drinker"] <- 3
al2[is.na(al)] <- NA
al2 = as.numeric(al2)

## Warning: NAs introduced by coercion

# genm$Alcohol = ordered(al2, labels=c('no_alcohol', 'occasional_alcohol',
# 'heavy_alcohol'))
rm(al, al2)

# The histological factors are in the right order but are numeric whereas
# they should be ordinal factors
genm$T = ordered(genm$T)
genm$N = ordered(genm$N)
genm$M = ordered(genm$M)

Essentially the analysis requested would simply be a list of Chi-square tests (HPVstatus vs ManyOtherFactors), plus a couple of other tests (Smoking vs TP53) and then the more complex (PIK3CA/PTEN vs HPVstatus). Then I would choose some adjustment (p or q-value) and he would reorder the fig to follow the pvals like in the Science Fig above.

However note that all the factors/genes tested in the Science paper are binomial MUTANT or WT. Here we have a lot of factors that are ordinal (e.g notobacco < exsmoker < smoker). We can use kendalls tau and test the significance of correlation on these - as using the ordinal information should be more sensitive (powerful).

This is long winded though and there is another way: Log-Linear Analysis (LLA). The problem is though that the long winded way (lots of chisq.test) is simple but LLA - whilst technically more correct for lots of comparisons and ostensibly quicker (a few lines of code) is far more complex to interpret. It would look for interactions between factors (of which there would be hundreds) (e.g. such as PIK3CA/PTEN vs HPVstatus) and the you would try and cut this down usign a step down procedure. Unwieldy and judging by initial tests unlikely to work.

… We'll have a look at the chisq.tests first:

colnames(genm)

##  [1] "ID"        "Gender"    "Age"       "Site"      "Grade"    
##  [6] "T"         "N"         "M"         "Survival"  "Death"    
## [11] "Smoking"   "Alcohol"   "TP53"      "CCND1"     "CCND3"    
## [16] "CDKN2A"    "CDKN2B"    "EGFR"      "MDM2"      "CCND3.1"  
## [21] "TET2"      "SUFU"      "PKHD1"     "LRP1B"     "KDM6A"    
## [26] "FGFR1"     "FGFR3"     "KDM6A.1"   "MYC"       "MCL1"     
## [31] "RICTOR"    "STK11"     "RB1"       "NF1"       "KRAS"     
## [36] "BCL2L1"    "SOX2"      "PTEN"      "FBXW7"     "PIK3CA"   
## [41] "HPVstatus"

#  example of what the table command is doing
with(genm, table(HPVstatus, Gender))

##          Gender
## HPVstatus  F  M
##    HPVneg  4 12
##    HPVpos  5 13

# the chisq.test takes this table format as input
geneP = list()
genePsim = list()

geneP$tp53 = chisq.test(with(genm, table(HPVstatus, TP53)))$p.value
geneP$ccnd1 = chisq.test(with(genm, table(HPVstatus, CCND1)))$p.value
geneP$ccnd3 = chisq.test(with(genm, table(HPVstatus, CCND3)))$p.value
geneP$cdkn2a = chisq.test(with(genm, table(HPVstatus, CDKN2A)))$p.value
geneP$cdkn2b = chisq.test(with(genm, table(HPVstatus, CDKN2B)))$p.value
geneP$egfr = chisq.test(with(genm, table(HPVstatus, EGFR)))$p.value
geneP$mdm2 = chisq.test(with(genm, table(HPVstatus, MDM2)))$p.value
geneP$ccnd3.1 = chisq.test(with(genm, table(HPVstatus, CCND3.1)))$p.value
geneP$tet2 = chisq.test(with(genm, table(HPVstatus, TET2)))$p.value
geneP$sufu = chisq.test(with(genm, table(HPVstatus, SUFU)))$p.value
geneP$pkhd1 = chisq.test(with(genm, table(HPVstatus, PKHD1)))$p.value
geneP$lrp1b = chisq.test(with(genm, table(HPVstatus, LRP1B)))$p.value
geneP$kdm6a = chisq.test(with(genm, table(HPVstatus, KDM6A)))$p.value
geneP$fgfr1 = chisq.test(with(genm, table(HPVstatus, FGFR1)))$p.value
geneP$fgfr3 = chisq.test(with(genm, table(HPVstatus, FGFR3)))$p.value
geneP$kdm6a.1 = chisq.test(with(genm, table(HPVstatus, KDM6A.1)))$p.value
geneP$myc = chisq.test(with(genm, table(HPVstatus, MYC)))$p.value
geneP$mcl1 = chisq.test(with(genm, table(HPVstatus, MCL1)))$p.value
geneP$rictor = chisq.test(with(genm, table(HPVstatus, RICTOR)))$p.value
geneP$stk11 = chisq.test(with(genm, table(HPVstatus, STK11)))$p.value
geneP$rb1 = chisq.test(with(genm, table(HPVstatus, RB1)))$p.value
geneP$nf1 = chisq.test(with(genm, table(HPVstatus, NF1)))$p.value
geneP$kras = chisq.test(with(genm, table(HPVstatus, KRAS)))$p.value
geneP$bcl2l1 = chisq.test(with(genm, table(HPVstatus, BCL2L1)))$p.value
geneP$sox2 = chisq.test(with(genm, table(HPVstatus, SOX2)))$p.value
geneP$pten = chisq.test(with(genm, table(HPVstatus, PTEN)))$p.value
geneP$fbxw7 = chisq.test(with(genm, table(HPVstatus, FBXW7)))$p.value
geneP$pik3ca = chisq.test(with(genm, table(HPVstatus, PIK3CA)))$p.value


genePsim$tp53 = chisq.test(with(genm, table(HPVstatus, TP53)), simulate.p.value = T)$p.value
genePsim$ccnd1 = chisq.test(with(genm, table(HPVstatus, CCND1)), simulate.p.value = T)$p.value
genePsim$ccnd3 = chisq.test(with(genm, table(HPVstatus, CCND3)), simulate.p.value = T)$p.value
genePsim$cdkn2a = chisq.test(with(genm, table(HPVstatus, CDKN2A)), simulate.p.value = T)$p.value
genePsim$cdkn2b = chisq.test(with(genm, table(HPVstatus, CDKN2B)), simulate.p.value = T)$p.value
genePsim$egfr = chisq.test(with(genm, table(HPVstatus, EGFR)), simulate.p.value = T)$p.value
genePsim$mdm2 = chisq.test(with(genm, table(HPVstatus, MDM2)), simulate.p.value = T)$p.value
genePsim$ccnd3.1 = chisq.test(with(genm, table(HPVstatus, CCND3.1)), simulate.p.value = T)$p.value
genePsim$tet2 = chisq.test(with(genm, table(HPVstatus, TET2)), simulate.p.value = T)$p.value
genePsim$sufu = chisq.test(with(genm, table(HPVstatus, SUFU)), simulate.p.value = T)$p.value
genePsim$pkhd1 = chisq.test(with(genm, table(HPVstatus, PKHD1)), simulate.p.value = T)$p.value
genePsim$lrp1b = chisq.test(with(genm, table(HPVstatus, LRP1B)), simulate.p.value = T)$p.value
genePsim$kdm6a = chisq.test(with(genm, table(HPVstatus, KDM6A)), simulate.p.value = T)$p.value
genePsim$fgfr1 = chisq.test(with(genm, table(HPVstatus, FGFR1)), simulate.p.value = T)$p.value
genePsim$fgfr3 = chisq.test(with(genm, table(HPVstatus, FGFR3)), simulate.p.value = T)$p.value
genePsim$kdm6a.1 = chisq.test(with(genm, table(HPVstatus, KDM6A.1)), simulate.p.value = T)$p.value
genePsim$myc = chisq.test(with(genm, table(HPVstatus, MYC)), simulate.p.value = T)$p.value
genePsim$mcl1 = chisq.test(with(genm, table(HPVstatus, MCL1)), simulate.p.value = T)$p.value
genePsim$rictor = chisq.test(with(genm, table(HPVstatus, RICTOR)), simulate.p.value = T)$p.value
genePsim$stk11 = chisq.test(with(genm, table(HPVstatus, STK11)), simulate.p.value = T)$p.value
genePsim$rb1 = chisq.test(with(genm, table(HPVstatus, RB1)), simulate.p.value = T)$p.value
genePsim$nf1 = chisq.test(with(genm, table(HPVstatus, NF1)), simulate.p.value = T)$p.value
genePsim$kras = chisq.test(with(genm, table(HPVstatus, KRAS)), simulate.p.value = T)$p.value
genePsim$bcl2l1 = chisq.test(with(genm, table(HPVstatus, BCL2L1)), simulate.p.value = T)$p.value
genePsim$sox2 = chisq.test(with(genm, table(HPVstatus, SOX2)), simulate.p.value = T)$p.value
genePsim$pten = chisq.test(with(genm, table(HPVstatus, PTEN)), simulate.p.value = T)$p.value
genePsim$fbxw7 = chisq.test(with(genm, table(HPVstatus, FBXW7)), simulate.p.value = T)$p.value
genePsim$pik3ca = chisq.test(with(genm, table(HPVstatus, PIK3CA)), simulate.p.value = T)$p.value

plot(unlist(geneP), unlist(genePsim))

plot of chunk chisq

From these two sets of chisq.test the simulated values are probably more correct. There is a rule of thumb that if there are n<5 in any cell of the contingency table then you should think about the Fisher exact test - or simulate the p-values. However looking at the plot you can see that it really isn't going to make any difference. The significant HPV-gene associations are very significant and there are no marginal calls that would be altered. So I use the simulated p.values and now adjust them.

# I reorder the list of p.values so that the most significant are at the
# top
genePsim = genePsim[order(unlist(genePsim))]
# this would be the most conservative p-value adjustment when
p.bon = p.adjust(unlist(genePsim), method = "bonferroni")
p.fdr = p.adjust(unlist(genePsim), method = "fdr")
genePsim.tab = data.frame(unlist(genePsim), bonferroni = p.bon, fdr = p.fdr)
rm(p.bon, p.fdr)
genePsim.tab

##         unlist.genePsim. bonferroni      fdr
## tp53           0.0004998    0.01399 0.006997
## ccnd1          0.0004998    0.01399 0.006997
## cdkn2b         0.0014993    0.04198 0.013993
## cdkn2a         0.0019990    0.05597 0.013993
## egfr           0.2238881    1.00000 0.605958
## fbxw7          0.2368816    1.00000 0.605958
## pten           0.3018491    1.00000 0.605958
## sox2           0.3463268    1.00000 0.605958
## lrp1b          0.4517741    1.00000 0.605958
## tet2           0.4557721    1.00000 0.605958
## rictor         0.4557721    1.00000 0.605958
## sufu           0.4622689    1.00000 0.605958
## mcl1           0.4622689    1.00000 0.605958
## kdm6a.1        0.4642679    1.00000 0.605958
## pkhd1          0.4672664    1.00000 0.605958
## fgfr1          0.4712644    1.00000 0.605958
## bcl2l1         0.4717641    1.00000 0.605958
## mdm2           0.4747626    1.00000 0.605958
## fgfr3          0.4747626    1.00000 0.605958
## ccnd3          0.4837581    1.00000 0.605958
## myc            0.4892554    1.00000 0.605958
## kras           0.4897551    1.00000 0.605958
## nf1            0.4977511    1.00000 0.605958
## pik3ca         0.7876062    1.00000 0.918874
## ccnd3.1        0.8535732    1.00000 0.919233
## kdm6a          0.8535732    1.00000 0.919233
## stk11          1.0000000    1.00000 1.000000
## rb1            1.0000000    1.00000 1.000000

If we wanted to follow the Science Fig exactly then we would use this Fig ordering and the false discovery rate (q-value, fdr). Personally I think that the bonferroni is best as it is the most conservative and there are not so many tests as in the Science paper. The need for a bar as the difference betwen significant and not is absolute so asterisks (*) to mark significant should be fine. We focus on 28 genes that have mutations, they seem to have sequenced and aligned the whole exome???

Now for the other clinical variable comparisons… firstly I should say that I think there is no point in comparison of Smoking to TP53. TP53 is highly associated with HPV- status which is associated with Non-smoking. They are confounded and there are too few samples to disentangle - So you should be consistent and just report the HPV to Smoking relation. I don't think there are enough samples for multivariate logistic regression (though that would be quicker to type) so I'm just going to do every variable singly as before

genm.var = genm[, c(2:7, 10:12, 41)]
head(genm.var)

##   Gender Age           Site       Grade    T    N Death    Smoking
## 1      F  57 base_of_tongue poorly diff    2    2     1 no_tobacco
## 2      M  63         tonsil poorly diff    3    2     0 no_tobacco
## 3      F  56         tonsil    mod diff    2    1     1 no_tobacco
## 4      F  63         tonsil poorly diff    3    3     0  ex_smoker
## 5      M  53         tonsil poorly diff <NA> <NA>     1       <NA>
## 6      M  57         tonsil poorly diff    4    3     1     smoker
##              Alcohol HPVstatus
## 1         no_alcohol    HPVpos
## 2         no_alcohol    HPVpos
## 3         no_alcohol    HPVpos
## 4               <NA>    HPVpos
## 5               <NA>    HPVpos
## 6 occasional_alcohol    HPVpos


# after all that messing about with the ordering of factors it turns out R
# (or at least the wilcoxon.test) doesn't handle the data how I expected
# so I'm changing them back to numeric order
genm.var$Site = as.numeric(genm$Site)
genm.var$Grade = as.numeric(genm$Grade)
genm.var$T = as.numeric(genm$T)
genm.var$N = as.numeric(genm$N)
genm.var$Smoking = as.numeric(genm$Smoking)
genm.var$Alcohol = as.numeric(genm$Alcohol)
head(genm.var)

##   Gender Age Site Grade  T  N Death Smoking Alcohol HPVstatus
## 1      F  57    1     3  2  3     1       1       1    HPVpos
## 2      M  63    2     3  3  3     0       1       1    HPVpos
## 3      F  56    2     2  2  2     1       1       1    HPVpos
## 4      F  63    2     3  3  4     0       2      NA    HPVpos
## 5      M  53    2     3 NA NA     1      NA      NA    HPVpos
## 6      M  57    2     3  4  4     1       3       2    HPVpos


clinP = list()
clinP$Site = with(genm.var, wilcox.test(Site ~ HPVstatus))
clinP$Grade = with(genm.var, wilcox.test(Grade ~ HPVstatus))
clinP$T = with(genm.var, wilcox.test(T ~ HPVstatus))
clinP$N = with(genm.var, wilcox.test(N ~ HPVstatus))
clinP$Smoking = with(genm.var, wilcox.test(Smoking ~ HPVstatus))
clinP$Alcohol = with(genm.var, wilcox.test(Alcohol ~ HPVstatus))
# NB I slip in a chisq test for Death and Gender plus a (single) logistic
# regression for Age and lastly a quick comparison of the survival curves
clinP$Death = chisq.test(with(genm.var, table(HPVstatus, Death)))
clinP$Gender = chisq.test(with(genm.var, table(HPVstatus, Gender)))
summary(glm(HPVstatus ~ Age, data = genm.var, family = "binomial", na.action = na.omit))

## 
## Call:
## glm(formula = HPVstatus ~ Age, family = "binomial", data = genm.var, 
##     na.action = na.omit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.380  -1.210   0.962   1.101   1.388  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.6469     2.2109    0.74     0.46
## Age          -0.0263     0.0375   -0.70     0.48
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 47.016  on 33  degrees of freedom
## Residual deviance: 46.517  on 32  degrees of freedom
## AIC: 50.52
## 
## Number of Fisher Scoring iterations: 4
##

clinP$Age$p.value = 0.484
require(survival)

## Loading required package: survival

## Loading required package: splines

with(genm, survdiff(Surv(Survival, Death) ~ HPVstatus))

## Call:
## survdiff(formula = Surv(Survival, Death) ~ HPVstatus)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## HPVstatus=HPVneg 16        9     7.06     0.534     0.892
## HPVstatus=HPVpos 18        9    10.94     0.345     0.892
## 
##  Chisq= 0.9  on 1 degrees of freedom, p= 0.345

clinP$Survival$p.value = 0.345

# I'm adjusting all tests together and main a tidy table to view this
p.all = c(unlist(genePsim), sapply(clinP, function(x) x$p.value))
p.all.bon = p.adjust(p.all, method = "bonferroni")
p.all.fdr = p.adjust(p.all, method = "fdr")
test = c(rep("chisq", 28), rep("wilcox", 6), rep("chisq", 2), "log-reg", "survival")
p.all.tab = data.frame(unadjustedP = p.all, bonferroni = p.all.bon, fdr = p.all.fdr, 
    test = test)

p.all.tab

##          unadjustedP bonferroni      fdr     test
## tp53       0.0004998    0.01899 0.009495    chisq
## ccnd1      0.0004998    0.01899 0.009495    chisq
## cdkn2b     0.0014993    0.05697 0.018991    chisq
## cdkn2a     0.0019990    0.07596 0.018991    chisq
## egfr       0.2238881    1.00000 0.610147    chisq
## fbxw7      0.2368816    1.00000 0.610147    chisq
## pten       0.3018491    1.00000 0.610147    chisq
## sox2       0.3463268    1.00000 0.610147    chisq
## lrp1b      0.4517741    1.00000 0.610147    chisq
## tet2       0.4557721    1.00000 0.610147    chisq
## rictor     0.4557721    1.00000 0.610147    chisq
## sufu       0.4622689    1.00000 0.610147    chisq
## mcl1       0.4622689    1.00000 0.610147    chisq
## kdm6a.1    0.4642679    1.00000 0.610147    chisq
## pkhd1      0.4672664    1.00000 0.610147    chisq
## fgfr1      0.4712644    1.00000 0.610147    chisq
## bcl2l1     0.4717641    1.00000 0.610147    chisq
## mdm2       0.4747626    1.00000 0.610147    chisq
## fgfr3      0.4747626    1.00000 0.610147    chisq
## ccnd3      0.4837581    1.00000 0.610147    chisq
## myc        0.4892554    1.00000 0.610147    chisq
## kras       0.4897551    1.00000 0.610147    chisq
## nf1        0.4977511    1.00000 0.610147    chisq
## pik3ca     0.7876062    1.00000 0.935282    chisq
## ccnd3.1    0.8535732    1.00000 0.953994    chisq
## kdm6a      0.8535732    1.00000 0.953994    chisq
## stk11      1.0000000    1.00000 1.000000    chisq
## rb1        1.0000000    1.00000 1.000000    chisq
## Site       0.1830444    1.00000 0.610147   wilcox
## Grade      0.1368043    1.00000 0.577618   wilcox
## T          0.0320464    1.00000 0.152220   wilcox
## N          0.0176130    0.66929 0.095613   wilcox
## Smoking    0.0173409    0.65895 0.095613   wilcox
## Alcohol    0.0025951    0.09861 0.019722   wilcox
## Death      0.9838468    1.00000 1.000000    chisq
## Gender     1.0000000    1.00000 1.000000    chisq
## Age        0.4840000    1.00000 0.610147  log-reg
## Survival   0.3450000    1.00000 0.610147 survival

OK when I add in all the clinical variables each tested separately and singly using the appropriate test then do adjustment on all tests (genes and clinical) together, then using the FDR makes more sense. I presume that Matthias can take these values and alter the figure appropriately.

Lastly there was a question about a possible interaction between PIK3CA, PTEN and HPVStatus. this is a specific test selection (rather than dredging of all possible interactions) so no adjustment is required.

chisq.test(with(genm, table(PIK3CA, PTEN)), simulate.p.value = T)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  with(genm, table(PIK3CA, PTEN)) 
## X-squared = 2.804, df = NA, p-value = 0.7031
##

chisq.test(with(genm, table(PIK3CA, HPVstatus)), simulate.p.value = T)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  with(genm, table(PIK3CA, HPVstatus)) 
## X-squared = 2.076, df = NA, p-value = 0.7851
##

chisq.test(with(genm, table(HPVstatus, PTEN)), simulate.p.value = T)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  with(genm, table(HPVstatus, PTEN)) 
## X-squared = 2.927, df = NA, p-value = 0.2989
##


with(genm, table(PIK3CA, PTEN, HPVstatus))

## , , HPVstatus = HPVneg
## 
##       PTEN
## PIK3CA  L  P WT
##     G   0  0  1
##     P   0  0  3
##     PG  0  0  0
##     WT  0  1 11
## 
## , , HPVstatus = HPVpos
## 
##       PTEN
## PIK3CA  L  P WT
##     G   0  0  1
##     P   0  0  4
##     PG  0  0  2
##     WT  3  1  7
##

chisq.test(with(genm, table(PIK3CA, PTEN, HPVstatus)), simulate.p.value = T)

## 
##  Chi-squared test for given probabilities with simulated p-value
##  (based on 2000 replicates)
## 
## data:  with(genm, table(PIK3CA, PTEN, HPVstatus)) 
## X-squared = 115.6, df = NA, p-value = 0.0004998
##

There is a simple logic to these results. You maybe suspicious that there is a relation between PIK3CA and PTEN based on prior evidence but it's not significant on its own. However whilst the relation of PIK3CA or PTEN separately to HPVstatus is not signifcant (NB if the numbers are slightly different to the table that is simulation error), their joint relationship is significant. So you can make a statement in the text that given prior knowledge they are on the same path there is a significant joint relation between PIK3CA, PTEN and HPVstatus.

As I said before I'm not doing TP53 to Smoking - it doesn't make sense to me when you have too few samples and a strongly confounded comparison to HPVstatus.

# this was the code to create the report in HTML Tumblr format# knit('MatthiasGenm.Rmd'); markdownToHTML('MatthiasGenm.md',# 'Test-Post.html', fragment.only = TRUE)