About this exercise

This is an exercise on criterion validity. Depending on the measurement scales of the tests, different statistics can be used to assess criterion validity (both concurrent and predictive). These approaches are listed in the table below.

Figure 1: Table 6.2, chapter 6, page 163 from the COSMIN book


The goal of this exercise is to calculate the different statistics in the table above. To keep this exercise simple and to avoid too much data cleaning, we stick to an artificial dataset. You can find this data on Moodle as “criterion_validity.RData”.
Normally, we give hints about which R functions to use. If you don’t know exactly how to use a function, consult the help by typing ?function_name.



About the data

The dataset “criterion_validity.RData” contains the values of five measurement instruments, together with the values of the corresponding gold standards. For each of the five measurement instruments, the criterion validity should be evaluated. Use the table above to decide which statistic should be used in each situation.



Exercises

Data import

Task

Import the file “criterion_validity.RData”. You can use the import() function from the rio package. Print the first 6 rows of the data frame using the head() function.



Solution

Here are the first 6 rows of the data frame:

library(rio)
data <- import("criterion_validity.RData")
head(data)
##     measure1      gold1 measure2  gold2 measure3    gold3 measure4 gold4
## 1 severity_1 severity_1      4.2 423.85 positive positive     48.9  49.7
## 2 severity_1 severity_1      4.0 375.14 positive positive     51.3  50.9
## 3 severity_1 severity_1      6.7 472.79 positive positive     54.5  64.8
## 4 severity_1 severity_1      5.7 514.15 positive positive     48.5  55.8
## 5 severity_1 severity_1      3.7 314.37 positive positive     54.1  52.0
## 6 severity_1 severity_1      5.7 368.03 positive positive     57.8  63.6
##   measure5    gold5
## 1      130 positive
## 2      110 positive
## 3      107 negative
## 4      110 positive
## 5       85 negative
## 6      130 negative


Also note that RData files remember the data classes (the factors are already nicely labelled):

str(data)
## 'data.frame':    500 obs. of  10 variables:
##  $ measure1: Factor w/ 4 levels "severity_1","severity_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gold1   : Factor w/ 4 levels "severity_1","severity_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ measure2: num  4.2 4 6.7 5.7 3.7 5.7 4.4 2.5 4.8 4.7 ...
##  $ gold2   : num  424 375 473 514 314 ...
##  $ measure3: Factor w/ 2 levels "positive","negative": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gold3   : Factor w/ 2 levels "positive","negative": 1 1 1 1 1 1 1 1 1 1 ...
##  $ measure4: num  48.9 51.3 54.5 48.5 54.1 57.8 54.1 49.3 46.6 48.2 ...
##  $ gold4   : num  49.7 50.9 64.8 55.8 52 63.6 55 43 50.3 51.3 ...
##  $ measure5: int  130 110 107 110 85 130 130 107 130 112 ...
##  $ gold5   : Factor w/ 2 levels "negative","positive": 2 2 1 2 1 1 1 1 1 2 ...
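
As an aside, the same file could also be read with base R alone; a minimal sketch, assuming the file stores a single data frame (load() restores objects under the names they were saved with):

# load() returns the name(s) of the restored object(s); get() fetches
# the first one so we can keep working with a variable called `data`.
obj_name <- load("criterion_validity.RData")
data <- get(obj_name[1])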


Example 1

Task 1

What are the measurement scales of the two measurement instruments (measure1 and gold1)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?



Solution 1

They are both categorical variables. Since the goal seems to be to rate the severity (e.g. of a disease), the variables measure1 and gold1 are ordinal. They both have the same unit. Therefore, a weighted kappa statistic seems to be appropriate.



Task 2

  1. Explore the data using the table() function. Alternatively, the ctable() function from the summarytools package gives you an even nicer output.

  2. Calculate the weighted Kappa and the corresponding 95% CI using the cohen.kappa() function from the psych package.

  3. Interpret the result.



Solution 2

First, we explore the data:

library(summarytools)
ctable(data$measure1, data$gold1, prop = "t") # prop = "t" shows total proportions
## Cross-Tabulation, Total Proportions  
## measure1 * gold1  
## Data Frame: data  
## 
## ------------ ------- ------------- ------------- ------------- ------------ --------------
##                gold1    severity_1    severity_2    severity_3   severity_4          Total
##     measure1                                                                              
##   severity_1            56 (11.2%)    29 ( 5.8%)    19 ( 3.8%)    5 ( 1.0%)   109 ( 21.8%)
##   severity_2            27 ( 5.4%)    91 (18.2%)    46 ( 9.2%)    7 ( 1.4%)   171 ( 34.2%)
##   severity_3            21 ( 4.2%)    42 ( 8.4%)    84 (16.8%)   21 ( 4.2%)   168 ( 33.6%)
##   severity_4             2 ( 0.4%)     9 ( 1.8%)    21 ( 4.2%)   20 ( 4.0%)    52 ( 10.4%)
##        Total           106 (21.2%)   171 (34.2%)   170 (34.0%)   53 (10.6%)   500 (100.0%)
## ------------ ------- ------------- ------------- ------------- ------------ --------------


Second, we calculate the weighted kappa. Here, we use the default weights (i.e. squared distance from the diagonal).

library(psych)
tab <- table(data$measure1, data$gold1)
cohen.kappa(tab)
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                  lower estimate upper
## unweighted kappa  0.24     0.30  0.36
## weighted kappa    0.34     0.45  0.56
## 
##  Number of subjects = 500


The coefficient is 0.45. According to Landis & Koch, this is moderate agreement. We are 95% confident that the true value is between 0.34 and 0.56. As this 95% CI does not include 0 (no better agreement than expected by chance), we can also say that the coefficient is significantly different from 0. Sometimes, researchers test more specific hypotheses (e.g. kappa > 0.4).
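
For such a check, a minimal sketch, assuming the object returned by cohen.kappa() stores the printed confidence limits in an element named confid (with the row and column names shown in the output above):

ck <- cohen.kappa(tab)
# Is the weighted kappa significantly above 0.4? The lower 95% CI bound
# is 0.34 here, so this check returns FALSE.
ck$confid["weighted kappa", "lower"] > 0.4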



Example 2

Task 1

What are the measurement scales of the two measurement instruments (measure2 and gold2)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?



Solution 1

They are both continuous variables, but it is obvious that they do not have the same unit. To evaluate criterion validity, the Pearson or the Spearman correlation coefficient seems to be appropriate.



Task 2

  1. Explore the data using a scatter plot using the plot() function. Based on the scatter plot, which correlation coefficient should be used?
  2. Calculate the corresponding coefficient and its 95% CI using the cor.test() function.
  3. Interpret the result.



Solution 2

First, we explore the data:

plot(data$measure2, data$gold2)

We see that the relationship follows a linear pattern. Therefore, we calculate the Pearson correlation coefficient:

cor.test(data$measure2, data$gold2, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  data$measure2 and data$gold2
## t = 16.655, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5386801 0.6516304
## sample estimates:
##       cor 
## 0.5981173

The coefficient is about 0.6, and we are 95% confident that the true coefficient is between 0.54 and 0.65. The coefficient is significantly different from 0. However, in practice a more specific hypothesis may be tested (e.g. r > 0.5), depending on the topic.
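
Had the scatter plot shown a monotonic but non-linear pattern, the Spearman coefficient would have been the safer choice. A minimal sketch (note that cor.test() reports no confidence interval for Spearman's rho):

# Rank-based alternative for monotonic, non-linear relationships:
cor.test(data$measure2, data$gold2, method = "spearman")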



Example 3

Task 1

What are the measurement scales of the two measurement instruments (measure3 and gold3)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?



Solution 1

This is a classical example of two dichotomous tests. Therefore, sensitivity, specificity and the predictive values should be calculated.



Task 2

  1. Explore the data by creating the 2x2 table.
  2. Calculate the sensitivity, specificity, the predictive values (etc.) using the epi.tests() function from the epiR package.



Solution 2

First, we explore the data:

tab <- table(data$measure3, data$gold3, dnn = c("measure", "gold standard"))
tab
##           gold standard
## measure    positive negative
##   positive       42       46
##   negative        8      404


Second, we use the epi.tests() function to calculate the statistics:

library(epiR)
epi.tests(tab)
##           Outcome +    Outcome -      Total
## Test +           42           46         88
## Test -            8          404        412
## Total            50          450        500
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.18 (0.14, 0.21)
## True prevalence *                      0.10 (0.08, 0.13)
## Sensitivity *                          0.84 (0.71, 0.93)
## Specificity *                          0.90 (0.87, 0.92)
## Positive predictive value *            0.48 (0.37, 0.59)
## Negative predictive value *            0.98 (0.96, 0.99)
## Positive likelihood ratio              8.22 (6.09, 11.09)
## Negative likelihood ratio              0.18 (0.09, 0.34)
## False T+ proportion for true D- *      0.10 (0.08, 0.13)
## False T- proportion for true D+ *      0.16 (0.07, 0.29)
## False T+ proportion for T+ *           0.52 (0.41, 0.63)
## False T- proportion for T- *           0.02 (0.01, 0.04)
## Correctly classified proportion *      0.89 (0.86, 0.92)
## --------------------------------------------------------------
## * Exact CIs
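
As a sanity check, sensitivity and specificity can also be computed by hand from the 2x2 table above (rows = measure, columns = gold standard):

# Sensitivity: true positives / all gold standard positives (42/50 = 0.84)
sens <- tab["positive", "positive"] / sum(tab[, "positive"])
# Specificity: true negatives / all gold standard negatives (404/450 = 0.90)
spec <- tab["negative", "negative"] / sum(tab[, "negative"])
c(sensitivity = sens, specificity = spec)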



Example 4

Task 1

What are the measurement scales of the two measurement instruments (measure4 and gold4)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?



Solution 1

They are both continuous variables and, judging by the values, they seem to have the same unit. To evaluate criterion validity, ICCs can be calculated and a Bland-Altman plot can be used.



Task 2

  1. Explore the data with a scatter plot using the plot() function.
  2. Calculate the ICC and the 95% CI using the ICC() function from the psych package.
  3. Which of the ICCs in the output would you consider?



Solution 2

First, we explore the data:

plot(data$measure4, data$gold4)

Second, we calculate the ICC and the 95% CI:

library(psych)
ICC(data[,7:8])
## Call: ICC(x = data[, 7:8])
## 
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2       p lower bound upper bound
## Single_raters_absolute   ICC1 0.47 2.8 499 500 1.0e-29       0.403        0.54
## Single_random_raters     ICC2 0.53 5.2 499 499 3.2e-69       0.066        0.75
## Single_fixed_raters      ICC3 0.68 5.2 499 499 3.2e-69       0.629        0.72
## Average_raters_absolute ICC1k 0.64 2.8 499 500 1.0e-29       0.574        0.70
## Average_random_raters   ICC2k 0.69 5.2 499 499 3.2e-69       0.123        0.86
## Average_fixed_raters    ICC3k 0.81 5.2 499 499 3.2e-69       0.772        0.84
## 
##  Number of subjects = 500     Number of Judges =  2
## See the help file for a discussion of the other 4 McGraw and Wong estimates,

Depending on the situation, it is probably wise to consider systematic error. In this case, we would interpret ICC(2,1) (the Single_random_raters row, ICC2, in the output above).
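
A small sketch of how to extract just that row programmatically, assuming ICC() returns the printed table in an element named results with a type column (as the output above suggests):

icc_res <- ICC(data[, 7:8])$results
# Keep only the ICC(2,1) row (single random raters, absolute agreement):
icc_res[icc_res$type == "ICC2", ]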



Task 3

To investigate systematic error, we can create a Bland-Altman plot. Use the bland.altman.plot() function from the BlandAltmanLeh package.



Solution 3

library(BlandAltmanLeh)
bland.altman.plot(data$measure4, data$gold4)

## NULL

We see that there is indeed a systematic error. We also see the limits of agreement, which indicate that in 95% of the cases, the difference between measure4 and the gold standard is between about -12.5 and 5.
If we are really sure that the gold standard is perfect, we can plot the values of the gold standard on the x-axis instead of the mean of measure4 and gold4.
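
The bias and the limits of agreement shown in the plot can also be computed by hand; a minimal sketch:

d <- data$measure4 - data$gold4         # paired differences
bias <- mean(d)                         # systematic error (bias)
loa <- bias + c(-1.96, 1.96) * sd(d)    # 95% limits of agreement
c(bias = bias, lower = loa[1], upper = loa[2])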



Example 5

Task 1

What are the measurement scales of the two measurement instruments (measure5 and gold5)? Do they both have the same unit? Which statistic should be calculated to evaluate the criterion validity?



Solution 1

In this case, the measurement instrument (measure5) is continuous and the gold standard is dichotomous. This is a typical example where continuous scores are used for the diagnosis of a disease. Therefore, we should calculate a ROC curve and the area under the curve (AUC). In a next step, we can also calculate the optimal cut-off point and the corresponding sensitivity and specificity.



Task 2

  1. Explore the data using boxplots. Plot the scores (measure5) for the positive people and the negative people.
  2. Calculate the ROC curve using the roc() function from the pROC package. Plot the curve using the ggroc() function.
  3. For the motivated: use the cutpointr() function from the cutpointr package to calculate the AUC, the optimal cut-off and the corresponding sensitivity and specificity.



Solution 2

First, we explore the data using boxplots:

boxplot(measure5 ~ gold5, data)

We see that the scores for positive people are generally lower.


Second, we calculate the ROC curve. We store the result in an object, which makes plotting easier.

library(pROC)
roc <- roc(data$gold5, data$measure5)
ggroc(roc)


Lastly, we can calculate the corresponding statistics:

library(cutpointr)
roc_details <- cutpointr(data = data, measure5, gold5)
cat("AUC = ", roc_details$AUC)
## AUC =  0.6417004
cat("Optimal cut-off point = ", roc_details$optimal_cutpoint)
## Optimal cut-off point =  113
cat("Sensitivity = ", roc_details$sensitivity)
## Sensitivity =  0.7529412
cat("Specificity = ", roc_details$specificity)
## Specificity =  0.50625

Note that the pROC package and the cutpointr package do not use exactly the same algorithms. Therefore, the results are not 100% identical.
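
For comparison, the AUC according to pROC can be obtained directly from the roc object created above:

# AUC as computed by pROC, for comparison with cutpointr's value:
auc(roc)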