This is an exercise on criterion validity. Different statistics can be used to assess criterion validity (both concurrent and predictive validity), depending on the measurement scales of the tests. These approaches are listed in the table below.
Figure 1: Table 6.2, chapter 6, page 163 from the COSMIN book
The goal of this exercise is to calculate the different statistics in
the table above. To keep this exercise simple and to avoid too much data
cleaning, we stick to an artificial dataset. You can find this data on
moodle as “criterion_validity.RData”.
Normally, we give hints about which R functions to use. If you don't know exactly how to use a function, consult the help page by typing ?function_name.
The dataset “criterion_validity.RData” contains the values of five measurement instruments, together with the values of the corresponding gold standards. For each of the five measurement instruments, the criterion validity should be evaluated. Use the table above to decide which statistic should be used in each situation.
Import the file “criterion_validity.RData”. You can use the import() function from the rio package. Print the first 6 rows of the data frame using the head() function.
Here are the first 6 rows of the data frame:
library(rio)
data <- import("criterion_validity.RData")
head(data)
## measure1 gold1 measure2 gold2 measure3 gold3 measure4 gold4
## 1 severity_1 severity_1 4.2 423.85 positive positive 48.9 49.7
## 2 severity_1 severity_1 4.0 375.14 positive positive 51.3 50.9
## 3 severity_1 severity_1 6.7 472.79 positive positive 54.5 64.8
## 4 severity_1 severity_1 5.7 514.15 positive positive 48.5 55.8
## 5 severity_1 severity_1 3.7 314.37 positive positive 54.1 52.0
## 6 severity_1 severity_1 5.7 368.03 positive positive 57.8 63.6
## measure5 gold5
## 1 130 positive
## 2 110 positive
## 3 107 negative
## 4 110 positive
## 5 85 negative
## 6 130 negative
Also note that RData files preserve the classes of the variables (the factors are already nicely labelled):
str(data)
## 'data.frame': 500 obs. of 10 variables:
## $ measure1: Factor w/ 4 levels "severity_1","severity_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gold1 : Factor w/ 4 levels "severity_1","severity_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ measure2: num 4.2 4 6.7 5.7 3.7 5.7 4.4 2.5 4.8 4.7 ...
## $ gold2 : num 424 375 473 514 314 ...
## $ measure3: Factor w/ 2 levels "positive","negative": 1 1 1 1 1 1 1 1 1 1 ...
## $ gold3 : Factor w/ 2 levels "positive","negative": 1 1 1 1 1 1 1 1 1 1 ...
## $ measure4: num 48.9 51.3 54.5 48.5 54.1 57.8 54.1 49.3 46.6 48.2 ...
## $ gold4 : num 49.7 50.9 64.8 55.8 52 63.6 55 43 50.3 51.3 ...
## $ measure5: int 130 110 107 110 85 130 130 107 130 112 ...
## $ gold5 : Factor w/ 2 levels "negative","positive": 2 2 1 2 1 1 1 1 1 2 ...
What are the measurement scales of the two measurement instruments (measure1 and gold1)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?
They are both categorical variables. Since the goal seems to be to rate the severity (e.g. of a disease), the variables measure1 and gold1 are ordinal. They both have the same unit. Therefore, a weighted kappa statistic seems to be appropriate.
Explore the data using the table() function. Alternatively, the ctable() function from the summarytools package gives you an even nicer output.
Calculate the weighted kappa and the corresponding 95% CI using the cohen.kappa() function from the psych package. Interpret the result.
First, we explore the data:
library(summarytools)
ctable(data$measure1, data$gold1, prop = "t") # the prop = "t" is to set total proportions
## Cross-Tabulation, Total Proportions
## measure1 * gold1
## Data Frame: data
##
## ------------ ------- ------------- ------------- ------------- ------------ --------------
## gold1 severity_1 severity_2 severity_3 severity_4 Total
## measure1
## severity_1 56 (11.2%) 29 ( 5.8%) 19 ( 3.8%) 5 ( 1.0%) 109 ( 21.8%)
## severity_2 27 ( 5.4%) 91 (18.2%) 46 ( 9.2%) 7 ( 1.4%) 171 ( 34.2%)
## severity_3 21 ( 4.2%) 42 ( 8.4%) 84 (16.8%) 21 ( 4.2%) 168 ( 33.6%)
## severity_4 2 ( 0.4%) 9 ( 1.8%) 21 ( 4.2%) 20 ( 4.0%) 52 ( 10.4%)
## Total 106 (21.2%) 171 (34.2%) 170 (34.0%) 53 (10.6%) 500 (100.0%)
## ------------ ------- ------------- ------------- ------------- ------------ --------------
Second, we calculate the weighted kappa. Here, we use the default quadratic weights (i.e. the squared distance from the diagonal).
library(psych)
tab <- table(data$measure1, data$gold1)
cohen.kappa(tab)
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.24 0.30 0.36
## weighted kappa 0.34 0.45 0.56
##
## Number of subjects = 500
The coefficient is 0.45. According to Landis & Koch, this indicates moderate agreement. We are 95% confident that the true value is between 0.34 and 0.56. As this 95% CI does not include 0 (no better agreement than expected by chance), we can also say that the coefficient is significantly different from 0. Sometimes, researchers test more specific hypotheses (e.g. kappa > 0.4).
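To make the role of the weights concrete, here is a minimal sketch that reproduces the weighted kappa by hand from the cross-table tab. The disagreement weights grow with the squared distance from the diagonal, which matches the default quadratic weighting of cohen.kappa() up to a constant factor:
p_obs <- tab / sum(tab)                          # observed proportions
p_exp <- outer(rowSums(p_obs), colSums(p_obs))   # expected under independence
k <- nrow(tab)
w <- outer(1:k, 1:k, function(i, j) (i - j)^2)   # squared-distance weights
1 - sum(w * p_obs) / sum(w * p_exp)              # should reproduce the weighted kappa above (about 0.45)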
What are the measurement scales of the two measurement instruments (measure2 and gold2)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?
They are both continuous variables, but it is obvious that they do not have the same unit. To evaluate criterion validity, the Pearson or the Spearman correlation coefficient seems appropriate.
Explore the data using the plot() function. Based on the scatter plot, which correlation coefficient should be used? Calculate the correlation coefficient and the corresponding 95% CI using the cor.test() function.
First, we explore the data:
plot(data$measure2, data$gold2)
We see that the relationship follows a linear pattern. Therefore, we calculate the Pearson correlation coefficient:
cor.test(data$measure2, data$gold2, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: data$measure2 and data$gold2
## t = 16.655, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5386801 0.6516304
## sample estimates:
## cor
## 0.5981173
The coefficient is about 0.6, and we are 95% confident that the true coefficient is between 0.54 and 0.65. The coefficient is significantly different from 0. In practice, however, a more specific hypothesis may be tested (e.g. r > 0.5), depending on the topic.
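As a quick additional check (not required by the exercise), the Spearman rank correlation can be computed with the same function; a large discrepancy from the Pearson coefficient would hint at a monotonic but non-linear relationship:
cor.test(data$measure2, data$gold2, method = "spearman")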
What are the measurement scales of the two measurement instruments (measure3 and gold3)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?
This is a classical example of two dichotomous tests. Therefore, sensitivity, specificity and the predictive values should be calculated.
Calculate the statistics using the epi.tests() function from the epiR package.
First, we explore the data:
tab <- table(data$measure3, data$gold3, dnn = c("measure", "gold standard"))
tab
## gold standard
## measure positive negative
## positive 42 46
## negative 8 404
Second, we use the epi.tests() function to calculate the statistics:
library(epiR)
epi.tests(tab)
## Outcome + Outcome - Total
## Test + 42 46 88
## Test - 8 404 412
## Total 50 450 500
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.18 (0.14, 0.21)
## True prevalence * 0.10 (0.08, 0.13)
## Sensitivity * 0.84 (0.71, 0.93)
## Specificity * 0.90 (0.87, 0.92)
## Positive predictive value * 0.48 (0.37, 0.59)
## Negative predictive value * 0.98 (0.96, 0.99)
## Positive likelihood ratio 8.22 (6.09, 11.09)
## Negative likelihood ratio 0.18 (0.09, 0.34)
## False T+ proportion for true D- * 0.10 (0.08, 0.13)
## False T- proportion for true D+ * 0.16 (0.07, 0.29)
## False T+ proportion for T+ * 0.52 (0.41, 0.63)
## False T- proportion for T- * 0.02 (0.01, 0.04)
## Correctly classified proportion * 0.89 (0.86, 0.92)
## --------------------------------------------------------------
## * Exact CIs
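As a sketch of where the headline numbers come from, the four main statistics can be recomputed by hand from the 2x2 table above (rows = test result, columns = gold standard):
sens <- tab[1, 1] / sum(tab[, 1])   # 42/50: diseased correctly detected
spec <- tab[2, 2] / sum(tab[, 2])   # 404/450: healthy correctly detected
ppv  <- tab[1, 1] / sum(tab[1, ])   # 42/88: P(disease | positive test)
npv  <- tab[2, 2] / sum(tab[2, ])   # 404/412: P(no disease | negative test)
c(sensitivity = sens, specificity = spec, PPV = ppv, NPV = npv)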
What are the measurement scales of the two measurement instruments (measure4 and gold4)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?
They are both continuous variables and, judging from the values, they seem to have the same unit. To evaluate criterion validity, ICCs can be calculated and a Bland-Altman plot can be used.
Calculate the ICC and the 95% CI using the ICC() function from the psych package.
First, we explore the data:
plot(data$measure4, data$gold4)
Second, we calculate the ICC and the 95% CI:
library(psych)
ICC(data[,7:8])
## Call: ICC(x = data[, 7:8])
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.47 2.8 499 500 1.0e-29 0.403 0.54
## Single_random_raters ICC2 0.53 5.2 499 499 3.2e-69 0.066 0.75
## Single_fixed_raters ICC3 0.68 5.2 499 499 3.2e-69 0.629 0.72
## Average_raters_absolute ICC1k 0.64 2.8 499 500 1.0e-29 0.574 0.70
## Average_random_raters ICC2k 0.69 5.2 499 499 3.2e-69 0.123 0.86
## Average_fixed_raters ICC3k 0.81 5.2 499 499 3.2e-69 0.772 0.84
##
## Number of subjects = 500 Number of Judges = 2
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
Depending on the situation, it is probably wise to take systematic error into account. In this case, we would interpret the ICC(2,1), i.e. the Single_random_raters row in the output above.
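As a small convenience sketch (assuming the structure of the object returned by psych::ICC()), the relevant row can be pulled out of the results table directly:
icc_res <- ICC(data[, 7:8])
# $results is a data frame with one row per ICC type
icc_res$results[icc_res$results$type == "ICC2",
                c("ICC", "lower bound", "upper bound")]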
To investigate systematic error, we can create a Bland-Altman plot. Use the bland.altman.plot() function from the BlandAltmanLeh package.
library(BlandAltmanLeh)
bland.altman.plot(data$measure4, data$gold4)
## NULL
We see that there is indeed a systematic error. We also see the limits of agreement, which indicate that in 95% of the cases, the difference between measure4 and the gold standard lies between about -12.5 and 5.
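A minimal sketch of how these limits of agreement arise (mean difference plus/minus 1.96 times the SD of the differences):
d <- data$measure4 - data$gold4
mean(d)                            # systematic error (bias)
mean(d) + c(-1.96, 1.96) * sd(d)   # limits of agreement (about -12.5 and 5)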
If we are really sure that the gold standard is perfect, then we can plot the values of the gold standard on the x-axis instead of the mean of measure4 and gold4. See here for more information on this topic.
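A sketch of that variant, treating the gold standard as error-free:
d <- data$measure4 - data$gold4
plot(data$gold4, d, xlab = "gold standard", ylab = "difference (measure4 - gold4)")
abline(h = mean(d), lty = 2)                           # bias
abline(h = mean(d) + c(-1.96, 1.96) * sd(d), lty = 3)  # limits of agreement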
What are the measurement scales of the two measurement instruments (measure5 and gold5)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?
In this case, the measurement instrument (measure5) is continuous and the gold standard is dichotomous. This is a typical example where continuous scores are used for the diagnosis of a disease. Therefore, we should calculate the ROC curve and the area under the curve (AUC). In a next step, we can also calculate the optimal cut-off point and the corresponding sensitivity and specificity.
Explore the data by creating boxplots of the scores (measure5) for the positive people and the negative people. Calculate the ROC curve using the roc() function from the pROC package. Plot the curve using the ggroc() function. Use the cutpointr() function from the cutpointr package to calculate the AUC, the optimal cut-off and the corresponding sensitivity and specificity.
First, we explore the data using boxplots:
boxplot(measure5 ~ gold5, data)
We see that the scores for positive people are generally lower.
Second, we calculate the ROC curve. We store the calculations in an object. This makes it easier to plot.
library(pROC)
roc <- roc(data$gold5, data$measure5)
ggroc(roc)
Lastly, we can calculate the corresponding statistics:
library(cutpointr)
roc_details <- cutpointr(data = data, measure5, gold5)
cat("AUC = ", roc_details$AUC)
## AUC = 0.6417004
cat("Optimal cut-off point = ", roc_details$optimal_cutpoint)
## Optimal cut-off point = 113
cat("Sensitivity = ", roc_details$sensitivity)
## Sensitivity = 0.7529412
cat("Specificity = ", roc_details$specificity)
## Specificity = 0.50625
Note that the pROC package and the cutpointr package do not use exactly the same algorithms. Therefore, the results are not 100% identical.
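As a sketch of that comparison, pROC can report its own AUC and "best" cut-off (Youden index) for the roc object created above:
auc(roc)              # AUC as computed by pROC
ci.auc(roc)           # 95% CI for the AUC (DeLong method by default)
coords(roc, "best")   # cut-off maximising sensitivity + specificity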