November 7, 2015

####################################################################################################

Some time ago I was reading a web page on classification. The page was about Receiver Operating Characteristic (ROC) graphs, a helpful technique for summarizing and visualizing the performance of classifiers.

A ROC graph looks something like this:

[Image: an example ROC graph]

The author of the post argued that people often do not understand the Area Under the ROC Curve (AUC), but that they find it easier to understand the grouping of cases as true positives, false positives, and so on.

Essentially, a ROC curve is a graphical representation of the relationship between the sensitivity and the specificity of a classifier, and it helps to identify the best threshold for a diagnostic test.

Sensitivity and specificity are derived from the counts of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP). When your test correctly identifies a case with the condition of interest, it produces a true positive (TP). If the test fails to identify a case with the condition of interest, marking it as ‘negative’, it produces a false negative (FN). When the test correctly marks as ‘negative’ a case without the condition of interest, it produces a true negative (TN). And when the test marks as ‘positive’ a case without the condition of interest, it produces a false positive (FP).
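
In R, these four counts are simply the cells of a 2 x 2 table of predicted versus actual labels. A tiny made-up example (the two vectors are hypothetical, just for illustration):

actual    <- factor(c("pos","pos","pos","pos","neg","neg","neg","neg"))   ### true status
predicted <- factor(c("pos","pos","neg","pos","neg","pos","neg","neg"))   ### test result
table(predicted, actual)   ### diagonal = TN and TP; off-diagonal = FN and FP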

Sensitivity = TP/(TP + FN), i.e., sensitivity is the ratio of the correctly identified cases to all positive cases (the sum of the true positives and the false negatives).

Conversely, Specificity = TN/(TN + FP), i.e., specificity is the ratio of the non-cases (those without the condition of interest) correctly identified as negative to all negative cases (the sum of the true negatives and the false positives).

Finally, Accuracy = (TN + TP)/(TN + TP + FN + FP), i.e., accuracy (which is not always a good measure) is the ratio of the correctly identified cases to all cases. Accuracy is the proportion of true results: the higher, the better. However, when the condition is rare, a classifier can achieve high accuracy simply by classifying all cases as “negative”.
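
Just to fix the formulas, here is the same arithmetic in R, using the counts of the fictional infogrid example shown further down (TP = 26, FN = 5, FP = 15, TN = 54):

tp <- 26; fn <- 5; fp <- 15; tn <- 54
sensitivity <- tp / (tp + fn)                     ### 26/31  = 0.84
specificity <- tn / (tn + fp)                     ### 54/69  = 0.78
accuracy    <- (tp + tn) / (tp + tn + fp + fn)    ### 80/100 = 0.80
round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 2)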

A ROC curve is a plot of Sensitivity versus 1 - Specificity (although different ways of plotting a ROC curve do exist). The area under the curve (AUC) is a measure of how good the classifier is at detecting the condition of interest in a sample. The AUC is not affected by the criterion used to set the boundary between cases and non-cases, and it is also independent of the prevalence of the condition, since it is based entirely on the performance of the classifier (i.e., sensitivity and specificity). Moreover, optimal cut-off values can be derived from the curve, and several diagnostic classifiers can be compared on the same subjects within the ROC space.
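
As a reminder of how the curve is built, here is a minimal sketch with made-up scores (my own toy example, not taken from the web page): sweep a threshold over the classifier’s scores and record sensitivity and 1 - specificity at each cut-off.

set.seed(1)
score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))   ### hypothetical classifier scores
truth <- rep(c(1, 0), each = 50)                       ### 1 = condition present
cuts  <- sort(unique(score), decreasing = TRUE)        ### candidate thresholds
sens  <- sapply(cuts, function(cut) mean(score[truth == 1] >= cut))
fpr   <- sapply(cuts, function(cut) mean(score[truth == 0] >= cut))   ### 1 - specificity
plot(fpr, sens, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   ### the diagonal = a classifier no better than chance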

Anyway, this post is not about ROC analysis and the AUC; it is about how to make clear to people how a classifier works. The web page I was reading suggested the use of an infogrid as an easy visual method to make things clear.

[Image: the infogrid from the original web page]

This image is all I saved from the web page I was reading. I saved the image, then did some typing in my mail client, and… Boom! The PC crashed. When I tried to return to the web page, all my browsing history was lost, and I was no longer able to find the page…

However, the idea behind the infogrid is appealing. Essentially, you plot a 10 x 10 grid, which gives you a space of 100 cells, i.e., 100%. Then, all you have to do is paint the TP cases in one color, the FN cases in another color, and so on. Eventually, you’ll have a grid showing the percentages of TP, FN, FP, and TN. It is a clear way of having the ROC space at hand.

Let’s see an example.

[Image: an example infogrid]

In this example, you have 26% TP, 5% FN, 15% FP, and 54% TN. The prevalence of cases in this fictional example is 26% + 5% = 31%, which you can read directly off the plot by summing TP and FN.

There is also a measure of the cost of not recognition, which is the ratio of FN to TP (FN/TP): the lower, the better.

Similarly, there is the cost of false recognition, which is the ratio of FP to TN (FP/TN): again, the lower, the better.

In the example, the cost of not recognition is 5/26 = 0.19, i.e., 19%.

The cost of false recognition is 15/54 ≈ 0.28, i.e., 28% (rounded).
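
In R, with the same counts:

fn <- 5; tp <- 26; fp <- 15; tn <- 54
round(fn / tp, 2)   ### cost of not recognition   = 0.19
round(fp / tn, 2)   ### cost of false recognition = 0.28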

You do not want a classifier with a high cost of not recognition (too many cases left behind).

But sometimes you cannot afford a classifier with a high cost of false recognition, particularly when classifying a case as positive has negative consequences (e.g., severe side effects of treatment, or a high chance of negative labelling, i.e., stigma, as can happen with mental disorders or lethal infectious diseases such as HIV).

I’ve prepared a function to plot this infogrid. The code is really, really ugly, but it does work. If anyone has a better solution, please let me know.

This is the function:

#### infogrid for results ROC analysis

### call it with source("Infogrid.ROC.R")

cat("\n Utility for drawing an infogrid with percent distribution of true positive, true negative, false positive, false negative. \n
  You may call the function by typing 'infogrid' and including the value in this sequence: tp, tn, fp, fn.
    E.g., : infogrid(30,12,34,24) \n
    You may also call the function by specifiying the values in any order: infogrid(tp=30, tn=24, fn=12, fp=34) \n
    If the values are already available in the platfom environment, you may get the plot just typying 'infogrid()'
    ")
## 
##  Utility for drawing an infogrid with the percent distribution of true positive, true negative, false positive, false negative. 
## 
##   You may call the function by typing 'infogrid' and including the values in this sequence: tp, tn, fp, fn.
##  E.g.: infogrid(30,12,34,24) 
## 
##  You may also call the function by specifying the values in any order: infogrid(tp=30, tn=24, fn=12, fp=34) 
##  
infogrid <- function(tp,tn,fp,fn)
    {

######################################################
# Plot infogrid
######################################################


### Expand right side of clipping rect to make room for the legend

par(xpd=T, mar=par()$mar+c(0,0,0,6))

###################
### plot a square
###################

plot(c(10,0), c(10,0), type= "n", xlab = "", ylab = "", 
    xlim = c(0,10), ylim = c(10,0), axes=FALSE) ### axes = FALSE --> do not plot axes

axis(3, at=c(0:10))     ### Draw the x-axis above the plot area
axis(2, at=c(0:10))     ### Draw the y-axis to the left of the plot area

box()       ### Draw the usual box


#### draw the infogrid

###################
### true positive
###################

for(i in 1:tp)

    {
    

    k <- tp
    k <- ifelse(k>10, 10, k)

rect(0,0,1,k, col = "pink", lwd=2) # coloured


    k1 <- tp-10
    k1 <- ifelse(k1>10, 10, k1)
    k1 <- ifelse(k1<1, 0, k1)

rect(1,0,2,k1, col = "pink", lwd=2) # coloured


    k2 <- tp-(10*2)
    k2 <- ifelse(k2>10, 10, k2)
    k2 <- ifelse(k2<1, 0, k2)

rect(2,0,3,k2, col = "pink", lwd=2) # coloured

    k3 <- tp-(10*3)
    k3 <- ifelse(k3>10, 10, k3)
    k3 <- ifelse(k3<1, 0, k3)

rect(3,0,4,k3, col = "pink", lwd=2) # coloured


    k4 <- tp-(10*4)
    k4 <- ifelse(k4>10, 10, k4)
    k4 <- ifelse(k4<1, 0, k4)

rect(4,0,5,k4, col = "pink", lwd=2) # coloured


    k5 <- tp-(10*5)
    k5 <- ifelse(k5>10, 10, k5)
    k5 <- ifelse(k5<1, 0, k5)

rect(5,0,6,k5, col = "pink", lwd=2) # coloured


    k6 <- tp-(10*6)
    k6 <- ifelse(k6>10, 10, k6)
    k6 <- ifelse(k6<1, 0, k6)

rect(6,0,7,k6, col = "pink", lwd=2) # coloured


    k7 <- tp-(10*7)
    k7 <- ifelse(k7>10, 10, k7)
    k7 <- ifelse(k7<1, 0, k7)

rect(7,0,8,k7, col = "pink", lwd=2) # coloured


    k8 <- tp-(10*8)
    k8 <- ifelse(k8>10, 10, k8)
    k8 <- ifelse(k8<1, 0, k8)

rect(8,0,9,k8, col = "pink", lwd=2) # coloured

    k9 <- tp-(10*9)
    k9 <- ifelse(k9>10, 10, k9)
    k9 <- ifelse(k9<1, 0, k9)

rect(9,0,10,k9, col = "pink", lwd=2) # coloured


    }

###################
### false negative 
###################

digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))}
vb <- digits(tp)

    
    vb[2] <-  ifelse(is.na(vb[2]),vb[1],vb[2])
    vb[1] <-  ifelse(tp<10,0,vb[1])

    h <- vb[2]+fn
    h <- ifelse(h>10, 10, h)

rect(vb[1],vb[2],vb[1]+1,h, col = "red", lwd=2) # coloured


start <- vb[1]+1

for(i in start:10)

    {

rect(i,0,10,10, col = "red") # coloured

    }

###################
### false positive 
###################

vb <- digits(tp+fn)

    vb[2] <-  ifelse(is.na(vb[2]),vb[1],vb[2])
    vb[1] <-  ifelse(tp+fn<10,0,vb[1])

    h <- vb[2]+fp
    h <- ifelse(h>10, 10, h)

rect(vb[1],vb[2],vb[1]+1,h, col = "yellow", lwd=2) # coloured

start <- vb[1]+1

for(i in start:10)

    {

rect(i,0,10,10, col = "yellow") # coloured

    }

###################
### true negative 
###################

vb <- digits(tp+fn+fp)

    vb[2] <-  ifelse(is.na(vb[2]),vb[1],vb[2])
    vb[1] <-  ifelse(tp+fn+fp<10,0,vb[1])

    h <- vb[2]+tn
    h <- ifelse(h>10, 10, h)

rect(vb[1],vb[2],vb[1]+1, h, col = "darkolivegreen1") # coloured

start <- vb[1]+1

for(i in start:10)

    {

rect(i,0,10,10, col = "darkolivegreen1") # coloured

    }

### grid lines

for(i in 1:9)

    {
    
segments(i,0,i,10, lwd=2, col="black")
segments(0,i,10,i, lwd=2, col="black")

    }

### Add legend

### Plot the legend where you want it (outside the margin, at 11 on the x axis)
### lty = c(0,0,0,0) eliminates the legend lines 

legend(11,1,c("True positive", "False negative", "False positive", "True negative"), 
    pch = c(22,22,22,22), lty = c(0,0,0,0), 
    pt.bg = c("pink", "red", "yellow", "darkolivegreen1"), pt.cex=1.75,
    bty="y", bg= "cornsilk1", cex=1)

# Restore default clipping rect

par(mar=c(5, 4, 4, 2) + 0.1)

title("Infogrid for ROC analysis\n")

mtext(paste("Cost of not recognition:", round(fn/tp,2), "; Cost of false recognition: ", round(fp/tn,2)), 
    side=1, line=1)

### done
###


    }

The infogrid() function accepts values in the following order: tp, tn, fp, fn.

So, if you type ‘infogrid(30,30,30,10)’, the function will interpret the values as true positive, true negative, false positive, and false negative cases.

Let ’s see:

infogrid(30,30,30,10)

[Plot: infogrid(30,30,30,10)]

However, you can specify the values in whatever order you prefer, by naming them: infogrid(tp=40, fn=10, fp=20, tn=30)

infogrid(tp=40, fn=10, fp=20, tn=30)

[Plot: infogrid(tp=40, fn=10, fp=20, tn=30)]
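
As an aside, a more compact way to draw essentially the same grid might be to fill a 10 x 10 matrix cell by cell and pass it to image(). This is only a rough sketch of the idea (the orientation and the legend placement differ from my function above):

infogrid2 <- function(tp, tn, fp, fn)
    {
    counts <- c(TP = tp, FN = fn, FP = fp, TN = tn)
    stopifnot(sum(counts) == 100)                           ### the grid holds exactly 100 cells
    cols <- c("pink", "red", "yellow", "darkolivegreen1")
    m <- matrix(rep(seq_along(counts), counts), nrow = 10)  ### class id of each of the 100 cells
    image(1:10, 1:10, m, col = cols, axes = FALSE, xlab = "", ylab = "")
    abline(h = 0:10 + 0.5, v = 0:10 + 0.5)                  ### grid lines
    legend("topright", names(counts), fill = cols, bg = "cornsilk1")
    }

E.g., infogrid2(26, 54, 15, 5) would reproduce (roughly) the fictional example shown earlier.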

Let’s see an example with some data.

Meet the ‘ovarian’ dataset, which is available at the repository http://www.dmi.units.it/borelli/dataset/ovarian.csv

The dataset is curated by Massimo Borelli, who teaches statistics at the University of Trieste (Italy).

The dataset includes information on the risk of cancer in pre- and post-menopausal women. There are also data on four putative cancer biomarkers and on the age of the patients.

The dataset comes from this study: Richard G. Moore, Moune Jabre-Raughley, Amy K. Brown, Katina M. Robison, M. Craig Miller, W. Jeffery Allard, Robert J. Kurman, Robert C. Bast, and Steven J. Skates. Comparison of a novel multiple marker assay vs the risk of malignancy index for the prediction of epithelial ovarian cancer in patients with a pelvic mass. American Journal of Obstetrics and Gynecology, 203(3):228.e1, 2010.
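
If the repository is still reachable, the data could presumably be read directly from the URL with something like the line below (an untested guess on my part; the field separator may need adjusting). Here, instead, the dataset is embedded via dput() so that the post remains self-contained.

### possible direct download (untested; adjust sep/dec if the file uses other conventions)
ovarian <- read.csv("http://www.dmi.units.it/borelli/dataset/ovarian.csv")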

ovarian <- structure(list(HE4 = c(3576L, 3046L, 29376L, 6296L, 3546L, 4056L, 
130456L, 3546L, 5286L, 6066L, 3526L, 3836L, 4886L, 4066L, 5766L, 
4466L, 6386L, 9536L, 4576L, 3866L, 4236L, 4796L, 5606L, 9466L, 
4626L, 3396L, 2156L, 3426L, 6686L, 4746L, 4626L, 3586L, 4756L, 
3566L, 11216L, 3936L, 5006L, 3606L, 3996L, 8606L, 11526L, 2906L, 
8216L, 4796L, 11006L, 3726L, 6126L, 6736L, 424126L, 3226L, 5506L, 
3736L, 5096L, 4306L, 3556L, 94506L, 5876L, 7576L, 6666L, 4026L, 
4196L, 5196L, 37841L, 4226L, 4956L, 4556L, 4046L, 3436L, 6496L, 
3516L, 3626L, 5996L, 62176L, 5636L, 13816L, 5506L, 2796L, 6996L, 
7536L, 5836L, 2786L, 4936L, 1276L, 5376L, 7446L, 41036L, 125776L, 
6536L, 7766L, 8786L, 7366L, 12226L, 6666L, 4416L, 3476L, 4236L, 
6116L, 4516L, 3856L, 6246L, 4286L, 11046L, 6376L, 2926L, 5186L, 
4076L, 4316L, 3676L, 4176L, 5176L, 4626L, 4276L, 3876L, 5706L, 
4806L, 6556L, 20326L, 9106L, 11536L, 3776L, 99206L, 3666L, 2796L, 
3566L, 5226L, 4886L, 4476L, 4126L, 4306L, 3186L, 16239L, 5316L, 
5546L, 27366L, 3786L, 3056L, 9286L, 5296L, 5566L, 4876L, 4416L, 
3486L, 3044L, 54726L, 4116L, 5986L, 3086L, 3496L, 4036L, 3306L, 
7406L, 2646L, 24816L, 86336L, 4766L, 4526L, 18086L, 3306L, 4406L, 
6846L, 4486L, 3486L, 7216L, 6076L, 4446L, 4606L, 5006L, 3486L, 
4956L, 3496L, 5466L, 4716L, 3276L, 4586L, 4226L, 3906L, 4916L, 
2706L, 5396L, 8196L, 5166L, 8466L, 3256L, 4426L, 3576L, 3446L, 
3706L, 3496L, 5306L, 2446L, 210796L, 7266L, 5026L, 3366L, 7396L, 
4246L, 3826L, 7306L, 4436L, 3676L, 4926L, 7806L, 4336L, 2926L, 
7046L, 3776L, 3276L, 4116L, 5816L, 5226L), CA125 = c(7014L, 23216L, 
11226L, 5224L, 2069L, 6084L, 196336L, 1196L, 3819L, 5606L, 9809L, 
1696L, 30636L, 20866L, 3706L, 1167L, 953L, 1419L, 1876L, 986L, 
9776L, 17006L, 2222L, 34226L, 1796L, 3756L, 596L, 1130L, 824L, 
3056L, 2716L, 2086L, 6408L, 9286L, 242636L, 1099L, 7756L, 1906L, 
1076L, 3156L, 9961L, 6451L, 5616L, 1523L, 1246L, 1636L, 1749L, 
1746L, 16976L, 3056L, 1795L, 3362L, 2259L, 6556L, 2117L, 43386L, 
2858L, 14686L, 1335L, 1796L, 2981L, 8699L, 35936L, 2336L, 3890L, 
2486L, 1326L, 2246L, 4756L, 937L, 7741L, 28576L, 30616L, 666L, 
1012L, 2678L, 1486L, 1593L, 1744L, 46366L, 1346L, 54346L, 4836L, 
52516L, 5614L, 98086L, 48016L, 879L, 3346L, 1190L, 646L, 2916L, 
9986L, 5496L, 574L, 1236L, 1882L, 7394L, 1076L, 960L, 1176L, 
14276L, 1669L, 3816L, 1261L, 1399L, 2702L, 1176L, 3776L, 5562L, 
1516L, 1386L, 2707L, 976L, 1226L, 33026L, 241236L, 1360L, 1145L, 
1477L, 142736L, 6226L, 3506L, 2136L, 2092L, 2166L, 1116L, 5356L, 
1486L, 2314L, 77456L, 879L, 935L, 46936L, 4806L, 5571L, 716L, 
2825L, 18396L, 516L, 1156L, 1326L, 1500L, 131836L, 776L, 5291L, 
1965L, 1226L, 96956L, 1933L, 2415L, 1718L, 8416L, 18426L, 4726L, 
1466L, 3326L, 2626L, 3360L, 2582L, 7336L, 1683L, 6426L, 1173L, 
2816L, 7712L, 4826L, 5486L, 661L, 11866L, 886L, 5245L, 3927L, 
3394L, 1323L, 744L, 1930L, 2416L, 10866L, 85736L, 477L, 18806L, 
3246L, 937L, 1956L, 2966L, 2059L, 2241L, 1166L, 2196L, 238436L, 
14466L, 1546L, 2843L, 4736L, 1284L, 2136L, 766L, 2466L, 1998L, 
2436L, 1236L, 1612L, 1140L, 3169L, 6259L, 1896L, 6156L, 903L, 
5616L), CA199 = c(2785L, 12656L, 2462L, 3450L, 96L, 3123L, 1156L, 
432L, 996L, 11366L, 6449L, 2001L, 1123L, 5588L, 886L, 444L, 2575L, 
1116L, 3116L, 96L, 10136L, 3152L, 775L, 901L, 106L, 662L, 222L, 
735L, 845L, 789L, 3664L, 824L, 4702L, 1596L, 2914L, 1787L, 1010L, 
232L, 644L, 3589L, 207036L, 5184L, 96L, 1835L, 407L, 1448L, 580L, 
1021L, 679L, 833L, 795L, 2582L, 2405L, 96L, 1410L, 1194L, 96L, 
904L, 646L, 9917L, 1789L, 1968L, 3334L, 661L, 1329L, 1909L, 310L, 
1244L, 477L, 9699L, 2189L, 641L, 553L, 1664L, 1350L, 4262L, 1210L, 
479L, 2108L, 2266L, 794L, 21836L, 341L, 2625L, 40206L, 9484L, 
1246L, 930L, 492L, 2612L, 826L, 1738L, 20916L, 3364L, 905L, 1609L, 
128L, 2181L, 96L, 677L, 458L, 3112L, 1420L, 7361L, 2613L, 4944L, 
9117L, 1892L, 3182L, 1135L, 956L, 614L, 2267L, 2395L, 2088L, 
1290L, 19666L, 2683L, 96L, 432L, 1940L, 1434L, 536L, 372L, 412L, 
522L, 2468L, 1936L, 2108L, 690L, 298126L, 421L, 1145L, 1337L, 
3048L, 96L, 96L, 908L, 49556L, 679L, 1184L, 96L, 203L, 6544L, 
786L, 866L, 6516L, 346L, 1434L, 639L, 2180L, 1393L, 3001L, 879L, 
1253L, 633L, 1878L, 625L, 9563L, 1539L, 96L, 1021L, 1494L, 923L, 
1142L, 2585L, 1114L, 1558L, 185L, 1672L, 537L, 3564L, 1666L, 
2944L, 747L, 1011L, 541L, 1080L, 826L, 1312L, 687L, 189L, 442L, 
1034L, 763L, 1657L, 229L, 2285L, 595L, 1030L, 1606L, 1168L, 1959L, 
2243L, 2964L, 4279L, 2246L, 791L, 96L, 876L, 1901L, 1221L, 1728L, 
875L, 2340L, 2818L, 96L, 842L, 98L, 530L), CEA = c(124L, 127L, 
250L, 583L, 280L, 178L, 155L, 110L, 115L, 228L, 140L, 203L, 156L, 
80L, 136L, 141L, 194L, 219L, 257L, 210L, 191L, 105L, 461L, 813L, 
150L, 58L, 94L, 77L, 222L, 246L, 385L, 272L, 124L, 96L, 114L, 
193L, 154L, 329L, 98L, 269L, 496L, 56L, 695L, 203L, 115L, 265L, 
527L, 233L, 109L, 138L, 125L, 110L, 260L, 106L, 144L, 139L, 337L, 
99L, 127L, 149L, 272L, 122L, 182L, 170L, 71L, 140L, 200L, 96L, 
256L, 203L, 70L, 128L, 279L, 310L, 152L, 123L, 106L, 139L, 173L, 
1370L, 150L, 146L, 74L, 336L, 910L, 200L, 145L, 144L, 915L, 232L, 
266L, 201L, 227L, 142L, 140L, 258L, 272L, 114L, 189L, 196L, 80L, 
395L, 290L, 101L, 587L, 204L, 175L, 146L, 314L, 128L, 180L, 85L, 
390L, 215L, 255L, 83L, 177L, 190L, 752L, 112L, 99L, 149L, 101L, 
545L, 217L, 133L, 434L, 89L, 146L, 117L, 5712L, 300L, 324L, 477L, 
181L, 133L, 269L, 138L, 199L, 208L, 83L, 141L, 178L, 1076L, 217L, 
158L, 112L, 116L, 76L, 70L, 273L, 123L, 437L, 115L, 161L, 85L, 
188L, 127L, 293L, 168L, 145L, 174L, 149L, 170L, 166L, 212L, 101L, 
119L, 185L, 135L, 98L, 295L, 98L, 107L, 170L, 195L, 64L, 356L, 
419L, 208L, 93L, 248L, 106L, 234L, 143L, 166L, 316L, 240L, 145L, 
320L, 56L, 370L, 334L, 162L, 114L, 302L, 246L, 373L, 219L, 56L, 
484L, 857L, 195L, 154L, 266L, 299L, 255L, 260L, 146L, 203L), 
    ETA = c(34L, 21L, 64L, 58L, 74L, 40L, 51L, 21L, 27L, 75L, 
    37L, 30L, 71L, 40L, 54L, 45L, 77L, 65L, 42L, 60L, 37L, 33L, 
    47L, 64L, 51L, 33L, 44L, 24L, 59L, 27L, 62L, 46L, 60L, 36L, 
    57L, 40L, 52L, 18L, 34L, 61L, 62L, 37L, 60L, 62L, 71L, 42L, 
    36L, 69L, 55L, 24L, 26L, 43L, 63L, 41L, 16L, 66L, 77L, 63L, 
    48L, 58L, 41L, 43L, 81L, 25L, 37L, 37L, 55L, 32L, 25L, 56L, 
    38L, 68L, 75L, 64L, 75L, 12L, 31L, 36L, 54L, 26L, 31L, 25L, 
    24L, 60L, 61L, 41L, 73L, 58L, 54L, 73L, 41L, 74L, 52L, 38L, 
    58L, 44L, 44L, 21L, 35L, 56L, 28L, 58L, 69L, 48L, 79L, 41L, 
    46L, 57L, 51L, 57L, 52L, 52L, 41L, 72L, 45L, 30L, 59L, 75L, 
    66L, 32L, 55L, 48L, 42L, 47L, 34L, 66L, 70L, 40L, 33L, 24L, 
    62L, 76L, 46L, 72L, 55L, 43L, 48L, 50L, 42L, 69L, 38L, 42L, 
    28L, 56L, 72L, 32L, 47L, 41L, 27L, 26L, 50L, 34L, 77L, 65L, 
    49L, 36L, 37L, 48L, 39L, 45L, 44L, 44L, 52L, 70L, 42L, 50L, 
    48L, 33L, 74L, 45L, 69L, 47L, 31L, 35L, 56L, 70L, 35L, 51L, 
    68L, 46L, 57L, 77L, 33L, 51L, 34L, 57L, 68L, 47L, 58L, 51L, 
    51L, 84L, 52L, 45L, 53L, 73L, 61L, 45L, 46L, 47L, 48L, 84L, 
    53L, 52L, 43L, 51L, 42L, 36L, 55L, 63L), MENOPAUSA = structure(c(2L, 
    2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 
    1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 
    1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
    2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 
    1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 
    2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 
    2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 
    2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
    1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("POST", 
    "PRE"), class = "factor"), OUTCOME = structure(c(1L, 1L, 
    2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
    1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 
    1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("BENIGNO", 
    "MALIGNO"), class = "factor")), .Names = c("HE4", "CA125", 
"CA199", "CEA", "ETA", "MENOPAUSA", "OUTCOME"), class = "data.frame", row.names = c(NA, 
-210L))

### rename the headers and the factor levels (they were in Italian)

names(ovarian) <- c("HE4", "CA125", "CA199", "CEA", "Age", "Menopause", "Outcome")

levels(ovarian$Outcome) <- c("Benign", "Malignant")

### check the data

dim(ovarian)
## [1] 210   7
head(ovarian)
##     HE4 CA125 CA199 CEA Age Menopause   Outcome
## 1  3576  7014  2785 124  34       PRE    Benign
## 2  3046 23216 12656 127  21       PRE    Benign
## 3 29376 11226  2462 250  64      POST Malignant
## 4  6296  5224  3450 583  58      POST Malignant
## 5  3546  2069    96 280  74      POST    Benign
## 6  4056  6084  3123 178  40       PRE    Benign
str(ovarian)
## 'data.frame':    210 obs. of  7 variables:
##  $ HE4      : int  3576 3046 29376 6296 3546 4056 130456 3546 5286 6066 ...
##  $ CA125    : int  7014 23216 11226 5224 2069 6084 196336 1196 3819 5606 ...
##  $ CA199    : int  2785 12656 2462 3450 96 3123 1156 432 996 11366 ...
##  $ CEA      : int  124 127 250 583 280 178 155 110 115 228 ...
##  $ Age      : int  34 21 64 58 74 40 51 21 27 75 ...
##  $ Menopause: Factor w/ 2 levels "POST","PRE": 2 2 1 1 1 2 2 2 2 1 ...
##  $ Outcome  : Factor w/ 2 levels "Benign","Malignant": 1 1 2 2 1 1 2 1 1 2 ...
summary(ovarian)
##       HE4             CA125            CA199             CEA      
##  Min.   :  1276   Min.   :   477   Min.   :    96   Min.   :  56  
##  1st Qu.:  3776   1st Qu.:  1389   1st Qu.:   661   1st Qu.: 124  
##  Median :  4751   Median :  2534   Median :  1216   Median : 174  
##  Mean   : 11836   Mean   : 13122   Mean   :  4931   Mean   : 250  
##  3rd Qu.:  6526   3rd Qu.:  6251   3rd Qu.:  2402   3rd Qu.: 260  
##  Max.   :424126   Max.   :242636   Max.   :298126   Max.   :5712  
##       Age       Menopause       Outcome   
##  Min.   :12.0   POST: 92   Benign   :171  
##  1st Qu.:37.2   PRE :118   Malignant: 39  
##  Median :48.0                             
##  Mean   :49.3                             
##  3rd Qu.:60.0                             
##  Max.   :84.0
########################################################
### relationship of menopause to cancer's outcome
########################################################

tab <- table(ovarian$Menopause, ovarian$Outcome)

### table and proportional table of the relationship of menopause to cancer's outcome

tab
##       
##        Benign Malignant
##   POST     65        27
##   PRE     106        12
round(prop.table(tab),3)*100
##       
##        Benign Malignant
##   POST   31.0      12.9
##   PRE    50.5       5.7
### Chi.square

c.s <- chisq.test(tab)

### simplify the p-value for printing

p <- ifelse(c.s$p.value < 0.0001, 0.0001, round(c.s$p.value,4))

### The plot

options(scipen=999)     ### avoid scientific notation

plot(tab,col=c("gold","red"), main="", xlab="Time in relation to the menopause")
title(paste0("Pearson's Chi-squared = ", round(c.s$statistic,2), "; df = ", c.s$parameter, "; p-value = ", p))

[Plot: mosaic plot of menopause status versus cancer outcome, with the chi-squared test in the title]

options(scipen=0)   ### reinstate the default


########################################################
### relationship of age to cancer's outcome
########################################################

t <- t.test(ovarian$Age~ovarian$Outcome, var.equal = TRUE)

p <- ifelse(t$p.value < 0.0001, 0.0001, round(t$p.value,4))


### The plot

options(scipen=999)     ### avoid scientific notation

plot(ovarian$Age~ovarian$Outcome, col = c("gold", "red"), main = "", xlab="Cancer's outcome", ylab = "Age")
title(paste0("Two Sample t-test = ", round(t$statistic,2), "; df = ", t$parameter, "; p-value = ", p))

[Plot: boxplot of age by cancer outcome, with the t-test in the title]

options(scipen=0)   ### reinstate the default

OK, there is a link between menopause and the probability that an ovarian neoplasm is a malignant cancer.

Since post-menopausal women are usually older than pre-menopausal women, age, too, is related to the cancer outcome.

However, what we are interested in is whether or not our biomarkers could predict the risk of malignant cancer.

This is a classification problem, and we can apply logistic regression to investigate the issue.

Let’s see how these biomarkers are related to the risk of cancer.

### density plot of the biomarkers


par(mfrow=c(2,2))

plot(density(ovarian[,1]), main="HE4", lwd=3, col = "blue")

plot(density(ovarian[,2]), main="CA125", lwd=3, col = "blue")

plot(density(ovarian[,3]), main="CA199", lwd=3, col = "blue")

plot(density(ovarian[,4]), main="CEA", lwd=3, col = "blue")

[Plot: density plots of the four biomarkers]

par(mfrow=c(1,1))
### done

The data are rather skewed, so we will use a log transformation.

### comparison of biomarkers by cancer status

par(mfrow=c(2,2))

boxplot(log(ovarian[,1])~ovarian$Outcome, col = c("gold", "red"), ylab="log(HE4)")

boxplot(log(ovarian[,2])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CA125)")

boxplot(log(ovarian[,3])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CA199)")

boxplot(log(ovarian[,4])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CEA)")

[Plot: boxplots of the log-transformed biomarkers by cancer outcome]

par(mfrow=c(1,1))
### done

At first sight, the most promising biomarker is HE4.

Let ‘s try a logistic regression of log(HE4) on cancer’ s outcome.

######################################################################## 
### logistic regression
######################################################################## 

target <- ovarian$Outcome
predictor <- log(ovarian[,1])

logistic <- glm(target ~ predictor, data = ovarian, family = binomial(logit))

summary(logistic)
## 
## Call:
## glm(formula = target ~ predictor, family = binomial(logit), data = ovarian)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.910  -0.463  -0.323  -0.212   2.495  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -28.37       5.00   -5.68  1.4e-08 ***
## predictor       3.06       0.57    5.37  7.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 201.58  on 209  degrees of freedom
## Residual deviance: 121.21  on 208  degrees of freedom
## AIC: 125.2
## 
## Number of Fisher Scoring iterations: 6
#################################
### pseudoR2
#################################

suppressWarnings(library(BaylorEdPsych))

options(scipen=999)

PseudoR2(logistic)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
##           0.3987           0.3689           0.3180           0.5153 
## McKelvey.Zavoina           Effron            Count        Adj.Count 
##           0.6457           0.4281           0.8810           0.3590 
##              AIC    Corrected.AIC 
##         125.2131         125.2711
options(scipen=0)


######################################################################## 
### Validate model by resampling with 'rms' package
######################################################################## 

suppressWarnings(library(rms))
## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     ovarian
## 
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## 
## The following object is masked from 'package:base':
## 
##     backsolve
set.seed(123)

mod.val <- lrm(target ~ predictor, data = ovarian, x=TRUE, y=TRUE)
print(mod.val)
## 
## Logistic Regression Model
## 
## lrm(formula = target ~ predictor, data = ovarian, x = TRUE, y = TRUE)
## 
##                       Model Likelihood     Discrimination    Rank Discrim.    
##                          Ratio Test            Indexes          Indexes       
## Obs           210    LR chi2      80.37    R2       0.515    C       0.879    
##  Benign       171    d.f.             1    g        2.170    Dxy     0.758    
##  Malignant     39    Pr(> chi2) <0.0001    gr       8.763    gamma   0.760    
## max |deriv| 4e-07                          gp       0.225    tau-a   0.230    
##                                            Brier    0.086                     
## 
##           Coef     S.E.   Wald Z Pr(>|Z|)
## Intercept -28.3658 4.9977 -5.68  <0.0001 
## predictor   3.0618 0.5698  5.37  <0.0001
validate(mod.val, B = 10)
##           index.orig training    test optimism index.corrected  n
## Dxy           0.7569   0.7581  0.7569   0.0012          0.7557 10
## R2            0.5153   0.5073  0.5153  -0.0080          0.5233 10
## Intercept     0.0000   0.0000 -0.0066   0.0066         -0.0066 10
## Slope         1.0000   1.0000  0.9786   0.0214          0.9786 10
## Emax          0.0000   0.0000  0.0059   0.0059          0.0059 10
## D             0.3779   0.3669  0.3779  -0.0110          0.3890 10
## U            -0.0095  -0.0095 -0.0051  -0.0044         -0.0051 10
## Q             0.3875   0.3764  0.3830  -0.0066          0.3941 10
## B             0.0865   0.0834  0.0871  -0.0037          0.0902 10
## g             2.1705   2.1800  2.1705   0.0095          2.1610 10
## gp            0.2246   0.2156  0.2246  -0.0090          0.2336 10
######################################################################## 
### Extraction of estimates with ORs
######################################################################## 

OR <- exp(coef(logistic))

OR.CI <- exp(confint(logistic)) # 95% CI for exponentiated coefficients
## Waiting for profiling to be done...
Estimates <- data.frame(cbind(OR[2], (rbind(OR.CI[2])),(rbind(OR.CI[4]))))
names(Estimates)<- c("OR", "lower", "upper")
round(Estimates,2)
##              OR lower upper
## predictor 21.37  7.87 73.63

OK, all seems good, with an excellent pseudo-R2. However, the 95% CI of the predictor’s OR is rather wide (7.87 to 73.63), which is never a good thing.

Let’s check the model with the usual diagnostics.

par(mfrow=c(2,2))

plot(logistic)

[Plot: standard glm diagnostic plots]

par(mfrow=c(1,1))
### done

With logistic regression I always add the Tukey-Pregibon link test (to evaluate whether the model is misspecified), the le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test for global goodness of fit (with the package “rms”), and Tjur’s coefficient of discrimination (with the package “binomTools”). However, this post is not about doing a logistic regression; we will look at these diagnostics in another post.
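
Just to give the flavour, here are quick hand-rolled sketches of those checks (my own rough approximations, not the exact calls from the packages mentioned above):

### Pregibon's link test (a sketch): refit with the squared linear predictor added;
### a significant squared term suggests the model is misspecified
lp <- predict(logistic)                                  ### linear predictor
link.test <- glm(target ~ lp + I(lp^2), family = binomial(logit))
summary(link.test)$coefficients["I(lp^2)", ]

### le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test,
### available from the 'rms' fit created above (it needs x = TRUE, y = TRUE)
resid(mod.val, "gof")

### Tjur's coefficient of discrimination: mean fitted probability in the cases
### minus mean fitted probability in the non-cases
mean(fitted(logistic)[target == "Malignant"]) - mean(fitted(logistic)[target == "Benign"])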

Let’s do a ROC analysis.

################################################################
#### ROC analysis on log(HE4)
################################################################

# Again, to simplify things, we will assign the data to convenience vectors

# gold standard

gold<-ovarian$Outcome

# predictor in the evaluation

pred<-log(ovarian[,1])

################################################################
### ROC analysis
################################################################

suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
par(mfcol = c (1,2))

rocobj <- plot.roc(gold, pred,

                main="Confidence intervals", percent=TRUE,

                ci=TRUE, # compute the confidence interval (of the AUC, by default)

                print.auc=TRUE) # print the AUC (will contain the CI)

ciobj <- ci.se(rocobj, # CI of sensitivity

               specificities=seq(0, 100, 5)) # over a select set of specificities

plot(ciobj, type="shape", col="#1c61b6AA") # plot as a blue shape

plot(ci(rocobj, of="thresholds", thresholds="best")) # add one threshold

# done


# best threshold

plot.roc(gold, pred,

         main="Confidence interval of a threshold", percent=TRUE,

         ci=TRUE, of="thresholds", # compute AUC (of threshold)

         thresholds="best", # select the (best) threshold

         print.thres="best") # also highlight this threshold on the plot

[Plot: ROC curve with confidence intervals and the best threshold highlighted]

## 
## Call:
## plot.roc.default(x = gold, predictor = pred, main = "Confidence interval of a threshold",     percent = TRUE, ci = TRUE, of = "thresholds", thresholds = "best",     print.thres = "best")
## 
## Data: pred in 171 controls (gold Benign) < 39 cases (gold Malignant).
## Area under the curve: 87.8%
## 95% CI (2000 stratified bootstrap replicates):
##  thresholds sp.low sp.median sp.high se.low se.median se.high
##       8.705   76.6      82.5    88.3   69.2      82.1    92.3
par(mfcol = c (1,1))
# done

######################################################################## 
### description of the main metrics
######################################################################## 


prob=predict(logistic,type=c("response"))   ### probability prediction on the basis of the model

### append the prediction to the dataset

ovarian$prob=prob

library(pROC)

g <- roc(ovarian$Outcome ~ prob, data = ovarian)

### tn, tp, fn, fp

mt <- coords(g, "best", ret=c("tn", "tp", "fn", "fp"))

mt
##  tn  tp  fn  fp 
## 141  32   7  30
### proportion at 2 digits

mt.prop <- round(prop.table(mt),2)      

mt.prop 
##   tn   tp   fn   fp 
## 0.67 0.15 0.03 0.14
tn <- mt.prop[[1]]*100
tp <- mt.prop[[2]]*100
fn <- mt.prop[[3]]*100
fp <- mt.prop[[4]]*100

tn; tp; fn; fp;
## [1] 67
## [1] 15
## [1] 3
## [1] 14
### Since we forced the percentages to be rounded to integers, we lost some precision
### We force the sum to be 100 by attributing the remainder to tn

tot <- sum(tn+tp+fn+fp)

tn <- ifelse(tot<100, tn+(100-tot), tn)

And now, the infogrid…

infogrid(tn=tn, tp=tp, fn=fn, fp=fp)    ### i.e., infogrid(tn=68, tp=15, fn=3, fp=14)

[Plot: infogrid of the ROC analysis results]

Well, the AUC is great (87.8%, with a 95% CI from 81.5% to 94.2%). The cost of not recognition is low, and the cost of false recognition is low as well.

In the sample there are 39 cases of malignant ovarian cancer, i.e., 18.6%. However, you cannot have a fraction of a person, so the function forces the percentages to integers. You can see in the infogrid that the correctly identified cases are 15% of the sample and the missed cases are 3%, totalling 18% (as expected).

Of course, we could improve the prediction, maybe by including the CA125 biomarker, which also looked promising. But this post is not about predicting ovarian cancer in these patients.
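
Just for reference, such a model might look something like this (a sketch only, not evaluated here):

### add CA125 to the model (quick sketch, not explored in this post)
logistic2 <- glm(Outcome ~ log(HE4) + log(CA125), data = ovarian, family = binomial(logit))
summary(logistic2)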

The post is about using R at its best…

I hope you’ve enjoyed this!

Have a nice day!


Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For more details on using R Markdown see http://rmarkdown.rstudio.com.