Some time ago I was reading a web page on classification. The page was on Receiver Operating Characteristic (ROC) graphs, a helpful technique for summarizing and visualizing the performance of classifiers.
A ROC graph looks something like this:
The author of the post argued that people often do not understand the Area Under the ROC Curve (AUC), but find it easier to understand the grouping of cases into true positives, false positives, and so on.
Essentially, a ROC curve is a graphical representation of the relationship between the sensitivity and specificity of a classifier, and it helps to identify the best threshold for a diagnostic test.
Sensitivity and specificity result from the balance between true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP). When your test correctly identifies a case with the condition of interest, it produces a true positive (TP). If the test fails to identify a case with the condition of interest, marking it as ‘negative’, it produces a false negative (FN). When the test correctly marks as ‘negative’ a case without the condition of interest, it produces a true negative (TN). And when the test marks as ‘positive’ a case without the condition of interest, it produces a false positive (FP).
Sensitivity = TP/(TP + FN), i.e., sensitivity is the ratio of the correctly identified cases to all actual positive cases (i.e., the sum of the true positive and false negative cases).
Conversely, Specificity = TN/(TN + FP), i.e., specificity is the ratio of the non-cases (those without the condition of interest) correctly identified as negative to all actual negative cases (i.e., the sum of the true negative and false positive cases).
Finally, Accuracy = (TN + TP)/(TN + TP + FN + FP), i.e., accuracy (which is not a good measure) is the ratio of the correctly identified cases to all cases. Accuracy is the proportion of true results: the higher, the better. However, with a rare condition a classifier can achieve high accuracy simply by classifying all cases as “negative”.
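As a quick sketch of these formulas in R (with hypothetical counts, unrelated to the data used later in this post):
### the four cells, as hypothetical counts
tp <- 90; fn <- 10; fp <- 20; tn <- 80
tp/(tp + fn) ### sensitivity = 0.90
tn/(tn + fp) ### specificity = 0.80
(tn + tp)/(tn + tp + fn + fp) ### accuracy = 0.85
### the accuracy trap with a rare condition: label everyone as 'negative'
tp <- 0; fn <- 1; fp <- 0; tn <- 99 ### prevalence = 1%
(tn + tp)/(tn + tp + fn + fp) ### accuracy = 0.99, yet sensitivity = 0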
A ROC curve is a plot of Sensitivity versus 1 - Specificity (though different ways of plotting a ROC curve do exist). The area under the curve (AUC) measures how good the classifier is at detecting the condition of interest in a sample. The AUC is not affected by the criterion used to set the boundary between cases and non-cases, and it is also independent of the prevalence of the condition, since it is based entirely on the performance of the classifier (i.e., sensitivity and specificity). Moreover, optimal cut-off values can be derived from the curve, and several diagnostic classifiers can be compared on the same subjects within the ROC space.
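To make the ‘Sensitivity versus 1 - Specificity’ idea concrete, here is a minimal, self-contained sketch that builds a ROC curve by sweeping a threshold over simulated scores (all values are made up):
### simulated scores: cases score higher, on average, than non-cases
set.seed(1)
score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))
truth <- rep(c(1, 0), each = 50) ### 1 = condition present
ths <- sort(unique(score)) ### candidate thresholds
sens <- sapply(ths, function(t) mean(score[truth == 1] >= t))
spec <- sapply(ths, function(t) mean(score[truth == 0] < t))
plot(1 - spec, sens, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2) ### the chance diagonal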
Anyway, this post is not about ROC analysis and the AUC; it is about how to make clear to people how a classifier works. The web page I was reading suggested the use of an infogrid as an easy visual method to make things clear.
This image is all I saved from the web page I was reading. I saved the image, then did some typing in my mail client, and... boom! The PC crashed. When I tried to return to the web page, my whole browser history was lost, and I was no longer able to find the page…
However, the idea behind the infogrid is appealing. Essentially, you plot a 10 × 10 grid, which gives a space of 100 cells, each representing 1%. Then, all you have to do is paint the TP cells in one color, the FN cells in another color, and so on. Eventually, you will have a grid with the percentages of TP, FN, FP, and TN. It is a clear way to keep the ROC space at hand.
Let’s see an example.
In this example, you have 26% TP, 5% FN, 15% FP, and 54% TN. The prevalence of cases in this fictional example is 26% + 5% = 31%, which you can see clearly in the plot by summing TP and FN.
There is also a calculation of the cost of not recognition, which is the ratio of FN to TP (FN/TP): the lower, the better.
Likewise, there is a calculation of the cost of false recognition, which is the ratio of FP to TN (FP/TN): again, the lower, the better.
In the example, the cost of not recognition is 5/26 = 0.19 = 19%.
The cost of false recognition is 15/54 = 0.278 = 28% (when rounded).
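The same quantities can be verified in R with the counts of this fictional example:
tp <- 26; fn <- 5; fp <- 15; tn <- 54
(tp + fn)/(tp + fn + fp + tn) ### prevalence = 0.31
fn/tp ### cost of not recognition = 0.19
fp/tn ### cost of false recognition = 0.28 (rounded)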
You do not want a classifier with a high cost of not recognition (too many cases left behind).
But sometimes you cannot afford a classifier with a high cost of false recognition either, particularly when inclusion in the ‘positive’ group has negative consequences (e.g., severe side effects of treatment, or a high chance of negative labelling, i.e., stigma, as can happen with mental disorders or lethal infectious diseases such as HIV).
I’ve prepared a function to plot this infogrid. The code is really, really ugly, but it does work. If anyone has a better solution, please let me know.
This is the function:
#### infogrid for results ROC analysis
### call it with source("Infogrid.ROC.R")
cat("\n Utility for drawing an infogrid with percent distribution of true positive, true negative, false positive, false negative. \n
You may call the function by typing 'infogrid' and including the value in this sequence: tp, tn, fp, fn.
E.g., : infogrid(30,12,34,24) \n
You may also call the function by specifiying the values in any order: infogrid(tp=30, tn=24, fn=12, fp=34) \n
If the values are already available in the platfom environment, you may get the plot just typying 'infogrid()'
")
infogrid <- function(tp,tn,fp,fn)
{
######################################################
# Plot infogrid
######################################################
### Expand the right side of the clipping rect to make room for the legend
par(xpd=TRUE, mar=par()$mar+c(0,0,0,6))
###################
### plot a square
###################
plot(c(10,0), c(10,0), type= "n", xlab = "", ylab = "",
xlim = c(0,10), ylim = c(10,0), axes=FALSE) ### axes = FALSE --> do not plot axes
axis(3, at=c(0:10)) ### Draw the x-axis above the plot area
axis(2, at=c(0:10)) ### Draw the y-axis to the left of the plot area
box() ### Draw the usual box
#### draw the infogrid
###################
### true positive
###################
### fill the columns from left to right, 10 cells per column
for(j in 0:9)
{
k <- tp - 10*j # cells remaining for column j
k <- max(0, min(10, k)) # clamp to the 0-10 range
if(k > 0) rect(j, 0, j+1, k, col = "pink", lwd=2) # coloured
}
###################
### false negative
###################
digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))} ### helper: split a number into its digits
vb <- digits(tp)
vb[2] <- ifelse(is.na(vb[2]),vb[1],vb[2]) ### one-digit tp: the single digit is the row
vb[1] <- ifelse(tp<10,0,vb[1]) ### tens digit = column where the tp fill stopped
h <- vb[2]+fn
h <- ifelse(h>10, 10, h) ### clamp to the top of the column
rect(vb[1],vb[2],vb[1]+1,h, col = "red", lwd=2) # finish the partial column
start <- vb[1]+1
for(i in start:10)
{
rect(i,0,10,10, col = "red") # flood-fill the rest; later sections overpaint it
}
###################
### false positive
###################
vb <- digits(tp+fn)
vb[2] <- ifelse(is.na(vb[2]),vb[1],vb[2])
vb[1] <- ifelse(tp+fn<10,0,vb[1])
h <- vb[2]+fp
h <- ifelse(h>10, 10, h)
rect(vb[1],vb[2],vb[1]+1,h, col = "yellow", lwd=2) # coloured
start <- vb[1]+1
for(i in start:10)
{
rect(i,0,10,10, col = "yellow") # coloured
}
###################
### true negative
###################
vb <- digits(tp+fn+fp)
vb[2] <- ifelse(is.na(vb[2]),vb[1],vb[2])
vb[1] <- ifelse(tp+fn+fp<10,0,vb[1])
h <- vb[2]+tn
h <- ifelse(h>10, 10, h)
rect(vb[1],vb[2],vb[1]+1, h, col = "darkolivegreen1") # coloured
start <- vb[1]+1
for(i in start:10)
{
rect(i,0,10,10, col = "darkolivegreen1") # coloured
}
### grid lines
for(i in 1:9)
{
segments(i,0,i,10, lwd=2, col="black")
segments(0,i,10,i, lwd=2, col="black")
}
### Add legend
### Plot the legend where you want (outside the margin, at 11 on the x-axis)
### lty = c(0,0,0,0) removes the lines from the legend keys
legend(11,1,c("True positive", "False negative", "False positive", "True negative"),
pch = c(22,22,22,22), lty = c(0,0,0,0),
pt.bg = c("pink", "red", "yellow", "darkolivegreen1"), pt.cex=1.75,
bty="o", bg= "cornsilk1", cex=1)
# Restore the default margins
par(mar=c(5, 4, 4, 2) + 0.1)
title("Infogrid for ROC analysis\n")
mtext(paste("Cost of not recognition:", round(fn/tp,2), "; Cost of false recognition: ", round(fp/tn,2)),
side=1, line=1)
### done
###
}
The infogrid() function accepts values in the following order: tp, tn, fp, fn.
So, if you type ‘infogrid(30,30,30,10)’, the function will interpret the values as true positive, true negative, false positive, and false negative cases.
Let’s see:
infogrid(30,30,30,10)
However, you can pass the values in any order you prefer, by naming them: infogrid(tp=40, fn=10, fp=20, tn=30)
infogrid(tp=40, fn=10, fp=20, tn=30)
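And since I have asked for a better solution, here is a compact alternative I sketched myself with image(); it assumes the four values are integer percentages summing to 100, and it fills the columns bottom-up rather than top-down (legend and cost annotations left out for brevity):
infogrid2 <- function(tp, tn, fp, fn)
{
stopifnot(tp + tn + fp + fn == 100) ### integer percentages only
cells <- rep(1:4, times = c(tp, fn, fp, tn)) ### 1=TP, 2=FN, 3=FP, 4=TN
m <- t(matrix(cells, nrow = 10)) ### 10 cells per column, filled in order
image(0:10, 0:10, m, breaks = 0.5 + 0:4, ### fixed breaks keep the colors stable
col = c("pink", "red", "yellow", "darkolivegreen1"),
axes = FALSE, xlab = "", ylab = "")
abline(h = 0:10, v = 0:10) ### the grid
box()
}
infogrid2(tp=26, tn=54, fp=15, fn=5) ### the fictional example from above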
Let’s see an example with some data.
Meet the ‘ovarian’ dataset, which is available at the repository http://www.dmi.units.it/borelli/dataset/ovarian.csv
The dataset is curated by Massimo Borelli, who teaches statistics at the University of Trieste (Italy).
The dataset includes information on the risk of cancer in women with or without menopause, along with four putative cancer biomarkers and the age of each patient.
The dataset comes from this study: Richard G Moore, Moune Jabre-Raughley, Amy K Brown, Katina M Robison, M Craig Miller, W Jeffery Allard, Robert J Kurman, Robert C Bast, and Steven J Skates. Comparison of a novel multiple marker assay vs the risk of malignancy index for the prediction of epithelial ovarian cancer in patients with a pelvic mass. American Journal of Obstetrics and Gynecology, 203(3):228.e1, 2010.
ovarian <- structure(list(HE4 = c(3576L, 3046L, 29376L, 6296L, 3546L, 4056L,
130456L, 3546L, 5286L, 6066L, 3526L, 3836L, 4886L, 4066L, 5766L,
4466L, 6386L, 9536L, 4576L, 3866L, 4236L, 4796L, 5606L, 9466L,
4626L, 3396L, 2156L, 3426L, 6686L, 4746L, 4626L, 3586L, 4756L,
3566L, 11216L, 3936L, 5006L, 3606L, 3996L, 8606L, 11526L, 2906L,
8216L, 4796L, 11006L, 3726L, 6126L, 6736L, 424126L, 3226L, 5506L,
3736L, 5096L, 4306L, 3556L, 94506L, 5876L, 7576L, 6666L, 4026L,
4196L, 5196L, 37841L, 4226L, 4956L, 4556L, 4046L, 3436L, 6496L,
3516L, 3626L, 5996L, 62176L, 5636L, 13816L, 5506L, 2796L, 6996L,
7536L, 5836L, 2786L, 4936L, 1276L, 5376L, 7446L, 41036L, 125776L,
6536L, 7766L, 8786L, 7366L, 12226L, 6666L, 4416L, 3476L, 4236L,
6116L, 4516L, 3856L, 6246L, 4286L, 11046L, 6376L, 2926L, 5186L,
4076L, 4316L, 3676L, 4176L, 5176L, 4626L, 4276L, 3876L, 5706L,
4806L, 6556L, 20326L, 9106L, 11536L, 3776L, 99206L, 3666L, 2796L,
3566L, 5226L, 4886L, 4476L, 4126L, 4306L, 3186L, 16239L, 5316L,
5546L, 27366L, 3786L, 3056L, 9286L, 5296L, 5566L, 4876L, 4416L,
3486L, 3044L, 54726L, 4116L, 5986L, 3086L, 3496L, 4036L, 3306L,
7406L, 2646L, 24816L, 86336L, 4766L, 4526L, 18086L, 3306L, 4406L,
6846L, 4486L, 3486L, 7216L, 6076L, 4446L, 4606L, 5006L, 3486L,
4956L, 3496L, 5466L, 4716L, 3276L, 4586L, 4226L, 3906L, 4916L,
2706L, 5396L, 8196L, 5166L, 8466L, 3256L, 4426L, 3576L, 3446L,
3706L, 3496L, 5306L, 2446L, 210796L, 7266L, 5026L, 3366L, 7396L,
4246L, 3826L, 7306L, 4436L, 3676L, 4926L, 7806L, 4336L, 2926L,
7046L, 3776L, 3276L, 4116L, 5816L, 5226L), CA125 = c(7014L, 23216L,
11226L, 5224L, 2069L, 6084L, 196336L, 1196L, 3819L, 5606L, 9809L,
1696L, 30636L, 20866L, 3706L, 1167L, 953L, 1419L, 1876L, 986L,
9776L, 17006L, 2222L, 34226L, 1796L, 3756L, 596L, 1130L, 824L,
3056L, 2716L, 2086L, 6408L, 9286L, 242636L, 1099L, 7756L, 1906L,
1076L, 3156L, 9961L, 6451L, 5616L, 1523L, 1246L, 1636L, 1749L,
1746L, 16976L, 3056L, 1795L, 3362L, 2259L, 6556L, 2117L, 43386L,
2858L, 14686L, 1335L, 1796L, 2981L, 8699L, 35936L, 2336L, 3890L,
2486L, 1326L, 2246L, 4756L, 937L, 7741L, 28576L, 30616L, 666L,
1012L, 2678L, 1486L, 1593L, 1744L, 46366L, 1346L, 54346L, 4836L,
52516L, 5614L, 98086L, 48016L, 879L, 3346L, 1190L, 646L, 2916L,
9986L, 5496L, 574L, 1236L, 1882L, 7394L, 1076L, 960L, 1176L,
14276L, 1669L, 3816L, 1261L, 1399L, 2702L, 1176L, 3776L, 5562L,
1516L, 1386L, 2707L, 976L, 1226L, 33026L, 241236L, 1360L, 1145L,
1477L, 142736L, 6226L, 3506L, 2136L, 2092L, 2166L, 1116L, 5356L,
1486L, 2314L, 77456L, 879L, 935L, 46936L, 4806L, 5571L, 716L,
2825L, 18396L, 516L, 1156L, 1326L, 1500L, 131836L, 776L, 5291L,
1965L, 1226L, 96956L, 1933L, 2415L, 1718L, 8416L, 18426L, 4726L,
1466L, 3326L, 2626L, 3360L, 2582L, 7336L, 1683L, 6426L, 1173L,
2816L, 7712L, 4826L, 5486L, 661L, 11866L, 886L, 5245L, 3927L,
3394L, 1323L, 744L, 1930L, 2416L, 10866L, 85736L, 477L, 18806L,
3246L, 937L, 1956L, 2966L, 2059L, 2241L, 1166L, 2196L, 238436L,
14466L, 1546L, 2843L, 4736L, 1284L, 2136L, 766L, 2466L, 1998L,
2436L, 1236L, 1612L, 1140L, 3169L, 6259L, 1896L, 6156L, 903L,
5616L), CA199 = c(2785L, 12656L, 2462L, 3450L, 96L, 3123L, 1156L,
432L, 996L, 11366L, 6449L, 2001L, 1123L, 5588L, 886L, 444L, 2575L,
1116L, 3116L, 96L, 10136L, 3152L, 775L, 901L, 106L, 662L, 222L,
735L, 845L, 789L, 3664L, 824L, 4702L, 1596L, 2914L, 1787L, 1010L,
232L, 644L, 3589L, 207036L, 5184L, 96L, 1835L, 407L, 1448L, 580L,
1021L, 679L, 833L, 795L, 2582L, 2405L, 96L, 1410L, 1194L, 96L,
904L, 646L, 9917L, 1789L, 1968L, 3334L, 661L, 1329L, 1909L, 310L,
1244L, 477L, 9699L, 2189L, 641L, 553L, 1664L, 1350L, 4262L, 1210L,
479L, 2108L, 2266L, 794L, 21836L, 341L, 2625L, 40206L, 9484L,
1246L, 930L, 492L, 2612L, 826L, 1738L, 20916L, 3364L, 905L, 1609L,
128L, 2181L, 96L, 677L, 458L, 3112L, 1420L, 7361L, 2613L, 4944L,
9117L, 1892L, 3182L, 1135L, 956L, 614L, 2267L, 2395L, 2088L,
1290L, 19666L, 2683L, 96L, 432L, 1940L, 1434L, 536L, 372L, 412L,
522L, 2468L, 1936L, 2108L, 690L, 298126L, 421L, 1145L, 1337L,
3048L, 96L, 96L, 908L, 49556L, 679L, 1184L, 96L, 203L, 6544L,
786L, 866L, 6516L, 346L, 1434L, 639L, 2180L, 1393L, 3001L, 879L,
1253L, 633L, 1878L, 625L, 9563L, 1539L, 96L, 1021L, 1494L, 923L,
1142L, 2585L, 1114L, 1558L, 185L, 1672L, 537L, 3564L, 1666L,
2944L, 747L, 1011L, 541L, 1080L, 826L, 1312L, 687L, 189L, 442L,
1034L, 763L, 1657L, 229L, 2285L, 595L, 1030L, 1606L, 1168L, 1959L,
2243L, 2964L, 4279L, 2246L, 791L, 96L, 876L, 1901L, 1221L, 1728L,
875L, 2340L, 2818L, 96L, 842L, 98L, 530L), CEA = c(124L, 127L,
250L, 583L, 280L, 178L, 155L, 110L, 115L, 228L, 140L, 203L, 156L,
80L, 136L, 141L, 194L, 219L, 257L, 210L, 191L, 105L, 461L, 813L,
150L, 58L, 94L, 77L, 222L, 246L, 385L, 272L, 124L, 96L, 114L,
193L, 154L, 329L, 98L, 269L, 496L, 56L, 695L, 203L, 115L, 265L,
527L, 233L, 109L, 138L, 125L, 110L, 260L, 106L, 144L, 139L, 337L,
99L, 127L, 149L, 272L, 122L, 182L, 170L, 71L, 140L, 200L, 96L,
256L, 203L, 70L, 128L, 279L, 310L, 152L, 123L, 106L, 139L, 173L,
1370L, 150L, 146L, 74L, 336L, 910L, 200L, 145L, 144L, 915L, 232L,
266L, 201L, 227L, 142L, 140L, 258L, 272L, 114L, 189L, 196L, 80L,
395L, 290L, 101L, 587L, 204L, 175L, 146L, 314L, 128L, 180L, 85L,
390L, 215L, 255L, 83L, 177L, 190L, 752L, 112L, 99L, 149L, 101L,
545L, 217L, 133L, 434L, 89L, 146L, 117L, 5712L, 300L, 324L, 477L,
181L, 133L, 269L, 138L, 199L, 208L, 83L, 141L, 178L, 1076L, 217L,
158L, 112L, 116L, 76L, 70L, 273L, 123L, 437L, 115L, 161L, 85L,
188L, 127L, 293L, 168L, 145L, 174L, 149L, 170L, 166L, 212L, 101L,
119L, 185L, 135L, 98L, 295L, 98L, 107L, 170L, 195L, 64L, 356L,
419L, 208L, 93L, 248L, 106L, 234L, 143L, 166L, 316L, 240L, 145L,
320L, 56L, 370L, 334L, 162L, 114L, 302L, 246L, 373L, 219L, 56L,
484L, 857L, 195L, 154L, 266L, 299L, 255L, 260L, 146L, 203L),
ETA = c(34L, 21L, 64L, 58L, 74L, 40L, 51L, 21L, 27L, 75L,
37L, 30L, 71L, 40L, 54L, 45L, 77L, 65L, 42L, 60L, 37L, 33L,
47L, 64L, 51L, 33L, 44L, 24L, 59L, 27L, 62L, 46L, 60L, 36L,
57L, 40L, 52L, 18L, 34L, 61L, 62L, 37L, 60L, 62L, 71L, 42L,
36L, 69L, 55L, 24L, 26L, 43L, 63L, 41L, 16L, 66L, 77L, 63L,
48L, 58L, 41L, 43L, 81L, 25L, 37L, 37L, 55L, 32L, 25L, 56L,
38L, 68L, 75L, 64L, 75L, 12L, 31L, 36L, 54L, 26L, 31L, 25L,
24L, 60L, 61L, 41L, 73L, 58L, 54L, 73L, 41L, 74L, 52L, 38L,
58L, 44L, 44L, 21L, 35L, 56L, 28L, 58L, 69L, 48L, 79L, 41L,
46L, 57L, 51L, 57L, 52L, 52L, 41L, 72L, 45L, 30L, 59L, 75L,
66L, 32L, 55L, 48L, 42L, 47L, 34L, 66L, 70L, 40L, 33L, 24L,
62L, 76L, 46L, 72L, 55L, 43L, 48L, 50L, 42L, 69L, 38L, 42L,
28L, 56L, 72L, 32L, 47L, 41L, 27L, 26L, 50L, 34L, 77L, 65L,
49L, 36L, 37L, 48L, 39L, 45L, 44L, 44L, 52L, 70L, 42L, 50L,
48L, 33L, 74L, 45L, 69L, 47L, 31L, 35L, 56L, 70L, 35L, 51L,
68L, 46L, 57L, 77L, 33L, 51L, 34L, 57L, 68L, 47L, 58L, 51L,
51L, 84L, 52L, 45L, 53L, 73L, 61L, 45L, 46L, 47L, 48L, 84L,
53L, 52L, 43L, 51L, 42L, 36L, 55L, 63L), MENOPAUSA = structure(c(2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("POST",
"PRE"), class = "factor"), OUTCOME = structure(c(1L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("BENIGNO",
"MALIGNO"), class = "factor")), .Names = c("HE4", "CA125",
"CA199", "CEA", "ETA", "MENOPAUSA", "OUTCOME"), class = "data.frame", row.names = c(NA,
-210L))
### rename the headers and levels (they were in Italian)
names(ovarian) <- c("HE4", "CA125", "CA199", "CEA", "Age", "Menopause", "Outcome")
levels(ovarian$Outcome) <- c("Benign", "Malignant")
### check the data
dim(ovarian)
## [1] 210 7
head(ovarian)
## HE4 CA125 CA199 CEA Age Menopause Outcome
## 1 3576 7014 2785 124 34 PRE Benign
## 2 3046 23216 12656 127 21 PRE Benign
## 3 29376 11226 2462 250 64 POST Malignant
## 4 6296 5224 3450 583 58 POST Malignant
## 5 3546 2069 96 280 74 POST Benign
## 6 4056 6084 3123 178 40 PRE Benign
str(ovarian)
## 'data.frame': 210 obs. of 7 variables:
## $ HE4 : int 3576 3046 29376 6296 3546 4056 130456 3546 5286 6066 ...
## $ CA125 : int 7014 23216 11226 5224 2069 6084 196336 1196 3819 5606 ...
## $ CA199 : int 2785 12656 2462 3450 96 3123 1156 432 996 11366 ...
## $ CEA : int 124 127 250 583 280 178 155 110 115 228 ...
## $ Age : int 34 21 64 58 74 40 51 21 27 75 ...
## $ Menopause: Factor w/ 2 levels "POST","PRE": 2 2 1 1 1 2 2 2 2 1 ...
## $ Outcome : Factor w/ 2 levels "Benign","Malignant": 1 1 2 2 1 1 2 1 1 2 ...
summary(ovarian)
## HE4 CA125 CA199 CEA
## Min. : 1276 Min. : 477 Min. : 96 Min. : 56
## 1st Qu.: 3776 1st Qu.: 1389 1st Qu.: 661 1st Qu.: 124
## Median : 4751 Median : 2534 Median : 1216 Median : 174
## Mean : 11836 Mean : 13122 Mean : 4931 Mean : 250
## 3rd Qu.: 6526 3rd Qu.: 6251 3rd Qu.: 2402 3rd Qu.: 260
## Max. :424126 Max. :242636 Max. :298126 Max. :5712
## Age Menopause Outcome
## Min. :12.0 POST: 92 Benign :171
## 1st Qu.:37.2 PRE :118 Malignant: 39
## Median :48.0
## Mean :49.3
## 3rd Qu.:60.0
## Max. :84.0
########################################################
### relationship of menopause to cancer's outcome
########################################################
tab <- table(ovarian$Menopause, ovarian$Outcome)
### table and proportional table of the relationship of menopause to cancer's outcome
tab
##
## Benign Malignant
## POST 65 27
## PRE 106 12
round(prop.table(tab),3)*100
##
## Benign Malignant
## POST 31.0 12.9
## PRE 50.5 5.7
### Chi.square
c.s <- chisq.test(tab)
### simplify the p-value for printing
p <- ifelse(c.s$p.value < 0.0001, 0.0001, round(c.s$p.value,4))
### The plot
options(scipen=999) ### avoid scientific notation
plot(tab,col=c("gold","red"), main="", xlab="Time in relation to the menopause")
title(paste0("Pearson's Chi-squared = ", round(c.s$statistic,2), "; df = ", c.s$parameter, "; p-value = ", p))
options(scipen=0) ### reinstate the default
########################################################
### relationship of age to cancer's outcome
########################################################
t <- t.test(ovarian$Age~ovarian$Outcome, var.equal = TRUE)
p <- ifelse(t$p.value < 0.0001, 0.0001, round(t$p.value,4))
### The plot
options(scipen=999) ### avoid scientific notation
plot(ovarian$Age~ovarian$Outcome, col = c("gold", "red"), main = "", xlab="Cancer's outcome", ylab = "Age")
title(paste0("Two Sample t-test = ", round(t$statistic,2), "; df = ", t$parameter, "; p-value = ", p))
options(scipen=0) ### reinstate the default
OK, there is a link between menopause and the probability that an ovarian neoplasm is a malignant cancer.
Since women in menopause are usually older than those who are not, age, too, is related to the cancer outcome.
However, what we are interested in is whether our biomarkers can predict the risk of malignant cancer.
This is a classification problem, and we can apply logistic regression to investigate the issue.
Let’s see how these biomarkers are related to the risk of cancer.
### density plot of the biomarkers
par(mfrow=c(2,2))
plot(density(ovarian[,1]), main="HE4", lwd=3, col = "blue")
plot(density(ovarian[,2]), main="CA125", lwd=3, col = "blue")
plot(density(ovarian[,3]), main="CA199", lwd=3, col = "blue")
plot(density(ovarian[,4]), main="CEA", lwd=3, col = "blue")
par(mfrow=c(1,1))
### done
The data are rather skewed, so we will use a log transformation.
### comparison of the biomarkers by cancer status
par(mfrow=c(2,2))
boxplot(log(ovarian[,1])~ovarian$Outcome, col = c("gold", "red"), ylab="log(HE4)")
boxplot(log(ovarian[,2])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CA125)")
boxplot(log(ovarian[,3])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CA199)")
boxplot(log(ovarian[,4])~ovarian$Outcome, col = c("gold", "red"), ylab="log(CEA)")
par(mfrow=c(1,1))
### done
At first sight, the most promising is HE4.
Let’s try a logistic regression of cancer outcome on log(HE4).
########################################################################
### logistic regression
########################################################################
target <- ovarian$Outcome
predictor <- log(ovarian[,1])
logistic <- glm(target ~ predictor, data = ovarian, family = binomial(logit))
summary(logistic)
##
## Call:
## glm(formula = target ~ predictor, family = binomial(logit), data = ovarian)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.910 -0.463 -0.323 -0.212 2.495
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.37 5.00 -5.68 1.4e-08 ***
## predictor 3.06 0.57 5.37 7.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 201.58 on 209 degrees of freedom
## Residual deviance: 121.21 on 208 degrees of freedom
## AIC: 125.2
##
## Number of Fisher Scoring iterations: 6
#################################
### pseudoR2
#################################
suppressWarnings(library(BaylorEdPsych))
options(scipen=999)
PseudoR2(logistic)
## McFadden Adj.McFadden Cox.Snell Nagelkerke
## 0.3987 0.3689 0.3180 0.5153
## McKelvey.Zavoina Effron Count Adj.Count
## 0.6457 0.4281 0.8810 0.3590
## AIC Corrected.AIC
## 125.2131 125.2711
options(scipen=0)
########################################################################
### Validate model by resampling with 'rms' package
########################################################################
suppressWarnings(library(rms))
## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked _by_ '.GlobalEnv':
##
## ovarian
##
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
##
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
set.seed(123)
mod.val <- lrm(target ~ predictor, data = ovarian, x=TRUE, y=TRUE)
print(mod.val)
##
## Logistic Regression Model
##
## lrm(formula = target ~ predictor, data = ovarian, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 210 LR chi2 80.37 R2 0.515 C 0.879
## Benign 171 d.f. 1 g 2.170 Dxy 0.758
## Malignant 39 Pr(> chi2) <0.0001 gr 8.763 gamma 0.760
## max |deriv| 4e-07 gp 0.225 tau-a 0.230
## Brier 0.086
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -28.3658 4.9977 -5.68 <0.0001
## predictor 3.0618 0.5698 5.37 <0.0001
validate(mod.val, B = 10)
## index.orig training test optimism index.corrected n
## Dxy 0.7569 0.7581 0.7569 0.0012 0.7557 10
## R2 0.5153 0.5073 0.5153 -0.0080 0.5233 10
## Intercept 0.0000 0.0000 -0.0066 0.0066 -0.0066 10
## Slope 1.0000 1.0000 0.9786 0.0214 0.9786 10
## Emax 0.0000 0.0000 0.0059 0.0059 0.0059 10
## D 0.3779 0.3669 0.3779 -0.0110 0.3890 10
## U -0.0095 -0.0095 -0.0051 -0.0044 -0.0051 10
## Q 0.3875 0.3764 0.3830 -0.0066 0.3941 10
## B 0.0865 0.0834 0.0871 -0.0037 0.0902 10
## g 2.1705 2.1800 2.1705 0.0095 2.1610 10
## gp 0.2246 0.2156 0.2246 -0.0090 0.2336 10
########################################################################
### Extraction of estimates with ORs
########################################################################
OR <- exp(coef(logistic))
OR.CI <- exp(confint(logistic)) # 95% CI for exponentiated coefficients
## Waiting for profiling to be done...
Estimates <- data.frame(cbind(OR[2], (rbind(OR.CI[2])),(rbind(OR.CI[4]))))
names(Estimates)<- c("OR", "lower", "upper")
round(Estimates,2)
## OR lower upper
## predictor 21.37 7.87 73.63
OK, all seems good, with an excellent pseudo-R2. However, the 95% CI of the predictor’s OR is rather wide (7.87 to 73.63), which is never a good thing.
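A quick back-of-the-envelope check helps to interpret that OR (computed from the coefficients above; remember that the predictor is log(HE4)):
b <- coef(logistic)[2] ### the slope, about 3.06
exp(b) ### 21.37 = odds multiplier per one-unit increase of log(HE4)
exp(b * log(2)) ### ~8.4 = odds multiplier when HE4 doubles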
Let’s check the model with the diagnostics.
par(mfrow=c(2,2))
plot(logistic)
par(mfrow=c(1,1))
### done
With logistic regression I always add the Tukey-Pregibon link test (to evaluate whether the model is misspecified), the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit (with the ‘rms’ package), and Tjur’s coefficient of discrimination (with the ‘binomTools’ package). However, this post is not about doing a logistic regression; we will look at these diagnostics in another post.
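Just to sketch what I mean, here is a hedged outline using the objects fitted above (Tjur’s coefficient is computed by hand here, instead of with ‘binomTools’):
### Tukey-Pregibon link test: refit on the linear predictor and its square;
### a significant squared term suggests misspecification
lp <- predict(logistic, type = "link")
summary(glm(target ~ lp + I(lp^2), family = binomial(logit)))
### le Cessie - van Houwelingen global goodness-of-fit test, via 'rms'
### (mod.val is the lrm() fit above, stored with x=TRUE, y=TRUE)
resid(mod.val, type = "gof")
### Tjur's coefficient of discrimination: mean fitted probability
### in the cases minus mean fitted probability in the non-cases
p.hat <- fitted(logistic)
mean(p.hat[target == "Malignant"]) - mean(p.hat[target == "Benign"])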
Let’s do a ROC analysis.
################################################################
#### ROC analysis on log(HE4)
################################################################
# Again, to simplify things, we will assign the data to convenience vectors
# gold standard
gold <- ovarian$Outcome
# predictor under evaluation
pred <- log(ovarian[,1])
################################################################
### ROC analysis
################################################################
suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
par(mfcol = c(1,2))
rocobj <- plot.roc(gold, pred,
main="Confidence intervals", percent=TRUE,
ci=TRUE, # compute the CI (of the AUC, by default)
print.auc=TRUE) # print the AUC (will contain the CI)
ciobj <- ci.se(rocobj, # CI of sensitivity
specificities=seq(0, 100, 5)) # over a select set of specificities
plot(ciobj, type="shape", col="#1c61b6AA") # plot as a blue shape
plot(ci(rocobj, of="thresholds", thresholds="best")) # add one threshold
# done
# best threshold
plot.roc(gold, pred,
main="Confidence interval of a threshold", percent=TRUE,
ci=TRUE, of="thresholds", # compute the CI of a threshold
thresholds="best", # select the (best) threshold
print.thres="best") # also highlight this threshold on the plot
##
## Call:
## plot.roc.default(x = gold, predictor = pred, main = "Confidence interval of a threshold", percent = TRUE, ci = TRUE, of = "thresholds", thresholds = "best", print.thres = "best")
##
## Data: pred in 171 controls (gold Benign) < 39 cases (gold Malignant).
## Area under the curve: 87.8%
## 95% CI (2000 stratified bootstrap replicates):
## thresholds sp.low sp.median sp.high se.low se.median se.high
## 8.705 76.6 82.5 88.3 69.2 82.1 92.3
par(mfcol = c(1,1))
# done
########################################################################
### description of the main metrics
########################################################################
prob <- predict(logistic, type = "response") ### predicted probabilities from the model
### append the predictions to the dataset
ovarian$prob <- prob
library(pROC)
g <- roc(ovarian$Outcome ~ prob, data = ovarian)
### tn, tp, fn, fp
mt <- coords(g, "best", ret=c("tn", "tp", "fn", "fp"))
mt
## tn tp fn fp
## 141 32 7 30
### proportion at 2 digits
mt.prop <- round(prop.table(mt),2)
mt.prop
## tn tp fn fp
## 0.67 0.15 0.03 0.14
tn <- mt.prop[[1]]*100
tp <- mt.prop[[2]]*100
fn <- mt.prop[[3]]*100
fp <- mt.prop[[4]]*100
tn; tp; fn; fp;
## [1] 67
## [1] 15
## [1] 3
## [1] 14
### Since we rounded the percentages to integers, we may have lost a point
### Force the sum to 100 by adding the remainder to the tn (the reference category here)
tot <- tn + tp + fn + fp
tn <- ifelse(tot < 100, tn + (100 - tot), tn)
And now, the infogrid…
infogrid(tn=68, tp=15, fn=3, fp=14)
Well, the AUC is great (87.8%, with 95% CI = 81.5 to 94.2). The cost of not recognition is low, and the cost of false recognition is low as well.
In the sample there are 39 cases of malignant ovarian cancer, i.e., 18.6%. However, you cannot have 18.6% of a person, so the function forces the percentages to integers. You can see in the infogrid that the correctly identified cases are 15% of the sample and the missed cases are 3%, totalling 18% (as expected).
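For the record, the raw prevalence can be checked directly:
sum(ovarian$Outcome == "Malignant") / nrow(ovarian)
## [1] 0.1857143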
Of course, we could improve the prediction, perhaps by including the CA125 biomarker, which also looked promising. But this post is not about predicting ovarian cancer.
The post is about using R at its best…
I hope you’ve enjoyed this!
Have a nice day!