Reanalysis of sarcoma predictive signatures from GEO GSE21050

Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity Chibon et al.. Nature Medicine, 2010

Until recently the actual sample data was not in the GEO database. We had to contact the Nat Med editor to get it put in. Chibon claims that this was a mistake/oversight by GEO. I'm not sure how this happened. We wanted to use this data for our own purposes but given the trouble we had getting it I thought I should check through it.

The data comprises: 310 expression data microarrays , metastasis free survival (time and metastasis status), their 67 gene CINSARC signature that was used to predict either C1 (low risk) or C2 (high risk). Then there is a split into train (n= 183) and test groups (n=127).

I saved all this data into sarcSurv.RData and CINSARCids.txt (the Affymetrix IDs of the CINSARC signature). Here is some summary code showing the data objects in R

load("sarcSurv.RData")
CINSARCids = readLines("CINSARCids.txt")
## Warning: incomplete final line found on 'CINSARCids.txt'

ls()
## [1] "CINSARCids" "sarc.study"
# These are the 67 affymetrix probeIDs that make up the reported CINSARC
# signature
length(CINSARCids)
## [1] 67
head(CINSARCids)
## [1] "219918_s_at"  "1552619_a_at" "204092_s_at"  "239219_at"   
## [5] "202095_s_at"  "215509_s_at"

#  The rest of the data is in a generic list format that I use for many
# different expression/survival studies
attributes(sarc.study)
## $names
## [1] "idStud"    "idPlat"    "geneDescr" "genes"     "survival"  "samples"  
## [7] "data"
# The GEO accession
sarc.study$idStud
## [1] "GSE21050"
# The GEO platform ID here is for Affy hgu133plus2
sarc.study$idPlat
## [1] "GPL570"
# some identifiers for the Affy Ids
head(sarc.study$geneDescr)
##               probe entrezgene   symb
## 1007_s_at 1007_s_at        780   DDR1
## 1053_at     1053_at       5982   RFC2
## 117_at       117_at       3310  HSPA6
## 121_at       121_at       7849   PAX8
## 1255_g_at 1255_g_at       2978 GUCA1A
## 1294_at     1294_at       7318   UBA7
# the Affy Ids
sarc.study$genes[1:6]
## [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
# the survival data e.g. Surv(time=time, status=Metastasis)
head(sarc.study$survival)
## [1] 9.00 1.22 0.92 1.48 0.21 0.95
# Cohort 1/2 were train/test, CINSARC is their diagnosis based on the
# signature
head(sarc.study$samples)
##    Cohort   SampleSource CINSARC Metastasis Time
## 1 cohort1 Internal trunk      C1          1 9.00
## 2 cohort1    Extremities      C2          2 1.22
## 3 cohort1    Extremities      C2          2 0.92
## 4 cohort1    Extremities      C2          2 1.48
## 5 cohort1    Extremities      C2          1 0.21
## 6 cohort1 Internal trunk      C1          2 0.95
##                        Diagnosis    ID_REF
## 1 Liposarcoma - dedifferentiated GSM525806
## 2                 Leiomyosarcoma GSM525807
## 3       Undifferentiated sarcoma GSM525808
## 4                          Other GSM525809
## 5       Undifferentiated sarcoma GSM525810
## 6 Liposarcoma - dedifferentiated GSM525811
# the data appears unlogged or transformed
head(sarc.study$data[, 1:4])
##   GSM525806 GSM525807 GSM525808 GSM525809
## 1    106.15    108.38    199.47    195.36
## 2     91.77    121.10    177.29    118.60
## 3    108.38    107.63    357.05    121.94
## 4      6.68      8.51      4.76      9.65
## 5      4.26      4.23      4.29      4.35
## 6    237.21    131.60     24.25     81.01

# but it appears pretty well normalised so I leave it alone....discuss
# this more later...  NB this is 4 samples cohort1 and 4 samples chort 2
# so it all seems normalised
boxplot(log2(sarc.study$data[, c(1:4, 300:304)]))

plot of chunk unnamed-chunk-1

Within the paper in Fig3a Chibon et al. report that their signture predicst metsatasis free survival for both train (Cohort1) and test groups (Cohort2).

Fig3a

Using the sample data supplied we can replicate these figures.

library(survival)
## Loading required package: splines
# Copy Fig3a format 1 row 2 cols
par(mfrow = c(1, 2))

# The 1st 183 samples were Cohort1 the training samples
Cohort1Cols = 1:183
train.CINSARC = sarc.study$samples$CINSARC[Cohort1Cols]
train.survival = sarc.study$survival[Cohort1Cols]

# the vertical line is where their Fig3a Train was cropped
survdiff(train.survival ~ train.CINSARC)
## Call:
## survdiff(formula = train.survival ~ train.CINSARC)
## 
## n=182, 1 observation deleted due to missingness.
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## train.CINSARC=C1 83       18     41.1      13.0      28.1
## train.CINSARC=C2 99       60     36.9      14.4      28.1
## 
##  Chisq= 28.1  on 1 degrees of freedom, p= 1.17e-07
plot(survfit(train.survival ~ train.CINSARC), col = c("blue", "red"), main = "Cohort1")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))


# The other samples are Cohort2: the test samples
test.CINSARC = sarc.study$samples$CINSARC[-Cohort1Cols]
test.survival = sarc.study$survival[-Cohort1Cols]

# the vertical line is where their Fig3a Train was cropped
survdiff(test.survival ~ test.CINSARC)
## Call:
## survdiff(formula = test.survival ~ test.CINSARC)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## test.CINSARC=C1 42        6       17      7.11      12.1
## test.CINSARC=C2 85       37       26      4.65      12.1
## 
##  Chisq= 12.1  on 1 degrees of freedom, p= 0.00051
plot(survfit(test.survival ~ test.CINSARC), col = c("blue", "red"), main = "Cohort2")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

plot of chunk Fig3aReplicated

So far ..so good. The CINSARC predictions that they have released match the Figs in the paper. I now move on to check that their reported signatures do classify/predict the samples as reported.

The predictor is described in the Methods section somewhat unclearly as follows:

To assign prognosis, we applied the nearest centroid method. Centroids represent a centered mean of expression for the signature genes for each patient outcome (metastatic and nonmetastatic). Thus, centroids were calculated from cohort 1 samples for sarcomas and GISTs and from training set samples for breast cancers and lymphomas. Each sample of the training (for example, sarcomas cohort 1) and the validation (for example, sarcomas cohort 2) sets was allocated to the prognostic class (centroid) with the highest Spearman correlation.

I am presuming from this that two prognostic centroid vectors metCentroid and nonmetCentroid were calculated from only the training data as follows.

# The 1st 183 samples were Cohort1 the training samples
Cohort1Cols = 1:183
# these rows of data match the CINSARCids
cinsarcRows = match(CINSARCids, sarc.study$genes)
# so this is the training data
train.data = sarc.study$data[cinsarcRows, Cohort1Cols]
# the survival is of class Surv but the second column has the status where
# non-met is = 0 and met is = 1
train.status = sarc.study$survival[Cohort1Cols, 2]

nonmetCentroid = rowMeans(train.data[, which(train.status == 0)])
metCentroid = rowMeans(train.data[, which(train.status == 1)])

Curiously the next step is (to partly repeat above):

Each sample of the training (for example, sarcomas cohort 1) and the validation (for example, sarcomas cohort 2) sets was allocated to the prognostic class (centroid) with the highest Spearman correlation.

Why the Spearman? That seems the least powerful and unlikely method to relate their well normalised samples to the centroid Predictors? At least as it is Spearman there is no need to worry about whether to scale the data as only the rank is important.

# I will use a loop for clarity of coding
mytrain.CINSARC = vector()
for (i in 1:183) # OK this is a bit complicated but it says if the sample is closer to
# nonmetCentroid then its C1 or else its a C2
mytrain.CINSARC[i] = ifelse((cor(train.data[, i], nonmetCentroid, method = "spearman") > 
    cor(train.data[, i], metCentroid, method = "spearman")), "C1", "C2")

# my results are a bit different from their reported results
table(mytrain.CINSARC, train.CINSARC)
##                train.CINSARC
## mytrain.CINSARC C1 C2
##              C1 72 27
##              C2 11 73

# I presume the same centroids are used for Cohort2 the test data as this
# would be proper validation and the Methods though ambiguously worded
# 'sounds' like this is what was done.
test.data = sarc.study$data[cinsarcRows, -Cohort1Cols]
mytest.CINSARC = vector()
for (i in 1:127) # OK this is a bit complicated but it says if the sample is closer to
# nonmetCentroid then its C1 or else its a C2
mytest.CINSARC[i] = ifelse((cor(test.data[, i], nonmetCentroid, method = "spearman") > 
    cor(test.data[, i], metCentroid, method = "spearman")), "C1", "C2")

# my results for the test data are even more different
table(mytrain.CINSARC, train.CINSARC)
##                train.CINSARC
## mytrain.CINSARC C1 C2
##              C1 72 27
##              C2 11 73

I do not understand why I get large differences here. As we are using a Spearman correlation biases due to normalisation, transformation(log2) of the data should be minimal. The interesting thing is the survival differences I obtain from mytest.CINSARC and mytrain.CINSARC.

par(mfrow = c(1, 2))
# the vertical line is where their Fig3a Train was cropped
survdiff(train.survival ~ mytrain.CINSARC)
## Call:
## survdiff(formula = train.survival ~ mytrain.CINSARC)
## 
## n=182, 1 observation deleted due to missingness.
## 
##                     N Observed Expected (O-E)^2/E (O-E)^2/V
## mytrain.CINSARC=C1 99       30     45.8      5.45      13.4
## mytrain.CINSARC=C2 83       48     32.2      7.76      13.4
## 
##  Chisq= 13.4  on 1 degrees of freedom, p= 0.00025
plot(survfit(train.survival ~ mytrain.CINSARC), col = c("blue", "red"), main = "myCohort1Predictor")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

# the vertical line is where their Fig3a Train was cropped
survdiff(test.survival ~ mytest.CINSARC)
## Call:
## survdiff(formula = test.survival ~ mytest.CINSARC)
## 
##                    N Observed Expected (O-E)^2/E (O-E)^2/V
## mytest.CINSARC=C1 65       18     24.2      1.61      3.85
## mytest.CINSARC=C2 62       25     18.8      2.08      3.85
## 
##  Chisq= 3.8  on 1 degrees of freedom, p= 0.0498
plot(survfit(test.survival ~ mytest.CINSARC), col = c("blue", "red"), main = "myCohort2Predictor")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

plot of chunk myFig3a

My predictor is decent but not quite as good as reported. My best guess here is that whilst they have deposited the full survival data they have used right-censored data in their publication. The Fig3a from the paper is over 10 years whilst the table below is over 5 years. It's possible this would be a good idea as metastasis after that time may arise from further sporadic aberrations rather than latent factors in the original tumour. I'll try this…

# make a couple of copies
train5.survival = sarc.study$survival[Cohort1Cols]
train10.survival = train5.survival

# i will right censor all > 5 years or >10 years i.e they all become
# non-metastasis
wh5 = which(train5.survival[, 1] > 5)
wh10 = which(train5.survival[, 1] > 10)
train5.survival[wh5, 2] = 0
train5.status = train5.survival[, 2]
train10.survival[wh10, 2] = 0
train10.status = train10.survival[, 2]

# I rebuild the prognostic centroid with the new Metastasis status
nonmetCentroid5 = rowMeans(train.data[, which(train5.status == 0)])
metCentroid5 = rowMeans(train.data[, which(train5.status == 1)])

nonmetCentroid10 = rowMeans(train.data[, which(train10.status == 0)])
metCentroid10 = rowMeans(train.data[, which(train10.status == 1)])

mytrain5.CINSARC = vector()
mytrain10.CINSARC = vector()
for (i in 1:183) {
    # going copy paste mad here
    mytrain5.CINSARC[i] = ifelse((cor(train.data[, i], nonmetCentroid5, method = "spearman") > 
        cor(train.data[, i], metCentroid5, method = "spearman")), "C1", "C2")
    mytrain10.CINSARC[i] = ifelse((cor(train.data[, i], nonmetCentroid10, method = "spearman") > 
        cor(train.data[, i], metCentroid10, method = "spearman")), "C1", "C2")
}

mytest5.CINSARC = vector()
mytest10.CINSARC = vector()

for (i in 1:127) {
    # OK this is a bit complicated but it says if the sample is closer to
    # nonmetCentroid then its C1 or else its a C2
    mytest5.CINSARC[i] = ifelse((cor(test.data[, i], nonmetCentroid5, method = "spearman") > 
        cor(test.data[, i], metCentroid5, method = "spearman")), "C1", "C2")
    mytest10.CINSARC[i] = ifelse((cor(test.data[, i], nonmetCentroid10, method = "spearman") > 
        cor(test.data[, i], metCentroid10, method = "spearman")), "C1", "C2")
}

# we can review all of this using table but NOPE I'm still not getting the
# same results as Chibon et al... I'ma ctually further from their data

table(test.CINSARC, mytest.CINSARC)
##             mytest.CINSARC
## test.CINSARC C1 C2
##           C1 32 10
##           C2 33 52
table(test.CINSARC, mytest5.CINSARC)
##             mytest5.CINSARC
## test.CINSARC C1 C2
##           C1 34  8
##           C2 30 55
table(test.CINSARC, mytest10.CINSARC)
##             mytest10.CINSARC
## test.CINSARC C1 C2
##           C1 34  8
##           C2 36 49

table(train.CINSARC, mytrain.CINSARC)
##              mytrain.CINSARC
## train.CINSARC C1 C2
##            C1 72 11
##            C2 27 73
table(train.CINSARC, mytrain5.CINSARC)
##              mytrain5.CINSARC
## train.CINSARC C1 C2
##            C1 68 15
##            C2 24 76
table(train.CINSARC, mytrain10.CINSARC)
##              mytrain10.CINSARC
## train.CINSARC C1 C2
##            C1 73 10
##            C2 29 71

I'm flummoxed. I don't know how to get the same predictions as them following what I think are their methods. Censoring makes it even less like their predictions.

par(mfrow = c(1, 2))
survdiff(train.survival ~ mytrain5.CINSARC)
## Call:
## survdiff(formula = train.survival ~ mytrain5.CINSARC)
## 
## n=182, 1 observation deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## mytrain5.CINSARC=C1 92       29     43.9      5.08      11.8
## mytrain5.CINSARC=C2 90       49     34.1      6.56      11.8
## 
##  Chisq= 11.8  on 1 degrees of freedom, p= 0.000579
plot(survfit(train.survival ~ mytrain5.CINSARC), col = c("blue", "red"), main = "myCohort1 Predictor 5years")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

survdiff(test.survival ~ mytest5.CINSARC)
## Call:
## survdiff(formula = test.survival ~ mytest5.CINSARC)
## 
##                     N Observed Expected (O-E)^2/E (O-E)^2/V
## mytest5.CINSARC=C1 64       18     23.4      1.23      2.82
## mytest5.CINSARC=C2 63       25     19.6      1.47      2.82
## 
##  Chisq= 2.8  on 1 degrees of freedom, p= 0.0932
plot(survfit(test.survival ~ mytest5.CINSARC), col = c("blue", "red"), main = "myCohort2 Predictor 5years")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

plot of chunk FiveTenYears



par(mfrow = c(1, 2))
survdiff(train.survival ~ mytrain10.CINSARC)
## Call:
## survdiff(formula = train.survival ~ mytrain10.CINSARC)
## 
## n=182, 1 observation deleted due to missingness.
## 
##                        N Observed Expected (O-E)^2/E (O-E)^2/V
## mytrain10.CINSARC=C1 102       31     47.4      5.67      14.7
## mytrain10.CINSARC=C2  80       47     30.6      8.79      14.7
## 
##  Chisq= 14.7  on 1 degrees of freedom, p= 0.000128
plot(survfit(train.survival ~ mytrain10.CINSARC), col = c("blue", "red"), main = "myCohort1 Predictor 10years")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

survdiff(test.survival ~ mytest10.CINSARC)
## Call:
## survdiff(formula = test.survival ~ mytest10.CINSARC)
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## mytest10.CINSARC=C1 70       21     25.6     0.836      2.16
## mytest10.CINSARC=C2 57       22     17.4     1.233      2.16
## 
##  Chisq= 2.2  on 1 degrees of freedom, p= 0.142
plot(survfit(test.survival ~ mytest10.CINSARC), col = c("blue", "red"), main = "myCohort2 Predictor 10years")
abline(v = 10, lty = 3)
legend(1, 0.2, c("C1", "C2"), col = c("blue", "red"))

plot of chunk FiveTenYears

All the predictions that I make following their method are pretty good but nowhere near as good as their reported success. Really not sure what to do with this…

Bringing in ANGPTL2

I have previously seen that there is a significant relation between survival and ANGPTL2 expression. Leonid has asked me to relate this to the CINSARC signatures in both cohorts.

load("~/Google Drive/Pablo_Lucy/GSE21050/survSarcDF.RData")
# Confusion arises because there is no relation between ANGPTL2 and
# survival in the training set Cohort 1
survdiff(study.survival ~ ANGPTL2, data = survSarc.df, subset = Cohort == "cohort1")
## Call:
## survdiff(formula = study.survival ~ ANGPTL2, data = survSarc.df, 
##     subset = Cohort == "cohort1")
## 
## n=182, 1 observation deleted due to missingness.
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## ANGPTL2=high 96       39     40.3    0.0444     0.093
## ANGPTL2=low  86       39     37.7    0.0475     0.093
## 
##  Chisq= 0.1  on 1 degrees of freedom, p= 0.76
plot(survfit(study.survival ~ ANGPTL2, data = survSarc.df, subset = Cohort == 
    "cohort1"), lty = 1:2, col = 1:2, main = "cohort1")
legend(0.25, 0.2, c("high ANGTPL2", "low ANGPTL2"), lty = 1:2, col = 1:2)

plot of chunk ANGPTL2surv


# But there is a clear association in the test set Cohort 2
survdiff(study.survival ~ ANGPTL2, data = survSarc.df, subset = Cohort == "cohort2")
## Call:
## survdiff(formula = study.survival ~ ANGPTL2, data = survSarc.df, 
##     subset = Cohort == "cohort2")
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## ANGPTL2=high 60       13     23.3      4.56      10.4
## ANGPTL2=low  67       30     19.7      5.41      10.4
## 
##  Chisq= 10.4  on 1 degrees of freedom, p= 0.00127
plot(survfit(study.survival ~ ANGPTL2, data = survSarc.df, subset = Cohort == 
    "cohort2"), , lty = 1:2, col = 1:2, main = "cohort2")
legend(0.25, 0.2, c("high ANGTPL2", "low ANGPTL2"), lty = 1:2, col = 1:2)

plot of chunk ANGPTL2surv

It is odd if we remember that the CINSARC is a poorer predictor in the test set. We can split these up to check relation to CINSARC (as Leonid suggests)

survdiff(study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, subset = Cohort == 
    "cohort1")
## Call:
## survdiff(formula = study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, 
##     subset = Cohort == "cohort1")
## 
## n=182, 1 observation deleted due to missingness.
## 
##                           N Observed Expected (O-E)^2/E (O-E)^2/V
## ANGPTL2=high, CINSARC=C1 50       13     24.0      5.05      7.40
## ANGPTL2=high, CINSARC=C2 46       26     16.3      5.74      7.41
## ANGPTL2=low, CINSARC=C1  33        5     17.1      8.53     11.08
## ANGPTL2=low, CINSARC=C2  53       34     20.6      8.73     12.02
## 
##  Chisq= 28.7  on 3 degrees of freedom, p= 2.56e-06
plot(survfit(study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, subset = Cohort == 
    "cohort1"), lty = 1:2, col = c(1, 1, 2, 2), main = "cohort1")
legend(0.25, 0.2, c("high ANGTPL2/C1", "high ANGPTL2/C2", "low ANGTPL2/C1", 
    "low ANGPTL2/C2"), lty = 1:2, col = c(1, 1, 2, 2))

plot of chunk ANGPTL2_CINSARC


# But there is a clear association in the test set Cohort 2
survdiff(study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, subset = Cohort == 
    "cohort2")
## Call:
## survdiff(formula = study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, 
##     subset = Cohort == "cohort2")
## 
##                           N Observed Expected (O-E)^2/E (O-E)^2/V
## ANGPTL2=high, CINSARC=C1 25        1     10.7    8.7882    12.004
## ANGPTL2=high, CINSARC=C2 35       12     12.6    0.0305     0.044
## ANGPTL2=low, CINSARC=C1  17        5      6.3    0.2685     0.323
## ANGPTL2=low, CINSARC=C2  50       25     13.4   10.0812    15.227
## 
##  Chisq= 20  on 3 degrees of freedom, p= 0.000173
plot(survfit(study.survival ~ ANGPTL2 + CINSARC, data = survSarc.df, subset = Cohort == 
    "cohort2"), lty = 1:2, col = c(1, 1, 2, 2), main = "cohort2")
legend(0.25, 0.2, c("high ANGTPL2/C1", "high ANGPTL2/C2", "low ANGTPL2/C1", 
    "low ANGPTL2/C2"), lty = 1:2, col = c(1, 1, 2, 2))

plot of chunk ANGPTL2_CINSARC

Not sure what to make of this. The ANGPTL2 signature seems to carry extra info above and beyond the CINSARC signature but what the difference between cohort 1 and 2 is I cannot imagine. It's probably instructive to look at the relation between CINSARC signature and ANGPTL2 expression…

CINSARC.wh = match(CINSARCids, sarc.study$geneDescr$probe)
sym = sarc.study$geneDescr$symb
ANGPTL2.wh = match("ANGPTL2", sarc.study$geneDescr$symb)
wh = c(ANGPTL2.wh, CINSARC.wh)
# log2 and scaled
scaledSarcData = as.matrix(log2(sarc.study$data))

library(gplots)
csc = rep("Gray", 310)
csc[which(survSarc.df$Cohort == "cohort1")] <- "Black"
heatmap.2(scaledSarcData[wh, ], scale = "r", trace = "n", Colv = F, labRow = sym[wh], 
    col = "bluered", labCol = F, ColSideColors = csc)
## Warning: Discrepancy: Colv is FALSE, while dendrogram is `row'. Omitting
## column dendogram.

plot of chunk ANGPTL2_CINSARC_corr

ANGPTL2 shows some correlation to the CINSARC signature but it's an out-group. The gray and white bars above show Cohort 1 and 2 samples.