Case 4: Studying the agreement between two measurement methods

Foreword: The Elementary Statistics for Medical Students (ESMS) project

This is our 4th tutorial of ESMS project. This project was initialised in February 2017 by a team of Vietnamese doctors and statisticians who interested in data analysis and R statistical programming language. The main objective of this project is to help Vietnamese medical students in developing their skills of data analysis. We provide the problem-based tutorials, each one will show the student how to resolve a common study question using the appropriate methods in R.

Introduction

The study on agreement could be used every time we want to make decision on adopting (purchasing) one among all available methods on the market, or when we have to replace a current method by a new one. This also allows to verify whether an alternative or a new developed measurement method is equivalent to a reference method.

There is no absolute tool for evaluating the degree of agreement. The searcher must usually combine two or more statistical methods to assess the agreement in different aspects. Most of studies adopt the correlation analysis, Bland-Altman method, Reliability analysis and eventually the mixed models to gather all information about the agreement between two technologies or instruments.

In this tutorial, the author would like to share every methods she knew about the study of agreement. Those include :

Linear regression and Correlation analysis
The Bland-Altman method (1983)
The robust method by Carstensen (2008) that based on mixed linear model

Context

The Oxygen saturation in blood could be directly measured from an arterial blood sample using CO-oximetry technique. Pulse oximetry is an alternative and non-invasive method for assessing the oxygen saturation. In this study, a physician want to evaluate the agreement between a Pulse oximeter and a standard CO oximeter for measuring the SaO2. The study was conducted on 61 patients in an Intensive Care Unit (ICU). The SaO2 was measured for each patient using a CO-oximeter and a pulse oximeters, the measurement was replicated 2 or 3 times for each equipment. The objective is to verify whether the agreement between two measurement methods are clinically acceptable ? as well as to evaluate the reproducibility of each method.

Data preparation

The dataset could be extracted from the methComp package as follows:

library(tidyverse)

data(ox,package="MethComp")
ox$item=factor(ox$item)
ox$repl=factor(ox$repl)

ox%>%head()%>%knitr::kable()

meth	item	repl	y
CO	1	1	78.0
CO	1	2	76.4
CO	1	3	77.2
CO	2	1	68.7
CO	2	2	67.6
CO	2	3	68.3

The original dataset (ox) was set in long format, i.e a dataframe that contains 4 variables (columns): Meth for Method’s names, Item or Idx for subject’s identity number, Repl for Replication and Y for the Outcome of SaO2 measurement. Each line represents a single observation.

We will transform this dataset into wide format, and add following variables:

dif: absolute difference between paired mesurement method

mean indicate the mean between paired SaO2 measured by two methods

ClassCO indicates a classification of Hypoxemia severity (Normal, Mild, Moderate and Severe), based on the CO-oximetry technique with thresholds of 40,59 and 79 %

ClassPulse indicates a classification of Hypoxemia severity, based on the Pulse-oximetry technique with thresholds of 40,59 and 79 %

difpc: indicates the difference between 2 methods, in percent of the reference method (CO oximeter)

badf<-ox%>%spread(key=meth,value=y,convert=FALSE,drop=TRUE,sep = NULL)%>%as_tibble()%>%mutate(.,dif=.$pulse-.$CO,mean=0.5*(.$CO+.$pulse),difpc=100*(.$pulse-.$CO)/.$CO)

badf$ClassCO <- cut(badf$CO,breaks=c(-Inf,40,59,79,Inf),labels=c("Severe","Moderate","Mild","Normal"))

badf$ClassPulse <- cut(badf$pulse,breaks=c(-Inf,40,59,79,Inf),labels=c("Severe","Moderate","Mild","Normal"))

badf%>%head()%>%knitr::kable()

item	repl	CO	pulse	dif	mean	difpc	ClassCO	ClassPulse
1	1	78.0	71	-7.0	74.50	-8.9743590	Mild	Mild
1	2	76.4	72	-4.4	74.20	-5.7591623	Mild	Mild
1	3	77.2	73	-4.2	75.10	-5.4404145	Mild	Mild
2	1	68.7	68	-0.7	68.35	-1.0189229	Mild	Mild
2	2	67.6	67	-0.6	67.30	-0.8875740	Mild	Mild
2	3	68.3	68	-0.3	68.15	-0.4392387	Mild	Mild

1) Getting it wrong: Linear correlation analysis

Though the correlation analysis is not recommended as the right method for assessing the agreement, this is frequently adopted as the first line tool for detecting the relationship between two methods. The correlation analysis can only verify whether two outcomes are associated, and determine how strong the association could be. This may include different methods, such as Pearson’s r or Spearman’s rho coefficients. The Pearson’s r is the most commonly used method. A r’s value close to 1 shows a good sign of strong association between two methods. Linear regression model is an alternative way for evaluating the linear relationship between our methods.

badf%>%ggplot(aes(x=CO,y=pulse))+geom_point(shape=21,size=3,aes(fill=ClassPulse),alpha=0.5)+geom_smooth(method="lm",alpha=0.3,color="red3",fill="red")+geom_abline(intercept=0,slope=1,size=1,linetype="dashed",color="blue")+theme_bw()+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6)+geom_hline(yintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6)

cor.test(badf$CO,badf$pulse)

## 
##  Pearson's product-moment correlation
## 
## data:  badf$CO and badf$pulse
## t = 23.165, df = 175, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8267726 0.9005306
## sample estimates:
##       cor 
## 0.8683753

badf%>%lm(pulse~CO,data=.)%>%summary()

## 
## Call:
## lm(formula = pulse ~ CO, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6856  -3.3391   0.2401   3.4701  25.4105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.00966    2.71893   4.049 7.71e-05 ***
## CO           0.82174    0.03547  23.165  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.79 on 175 degrees of freedom
## Multiple R-squared:  0.7541, Adjusted R-squared:  0.7527 
## F-statistic: 536.6 on 1 and 175 DF,  p-value: < 2.2e-16

The linear correlation (Pearson’s r) and the linear regression model showed that the Pulse oximetry is strongly correlated to the CO method (r=0.87, 95%CI from 0.83 to 0.90, p<0.001). Those two method have 75.41% common proportion in their variances. The pulse oximetry result is expected to be lower than that measured by CO method, as the Pulse oximetry SaO2 would increases only 0.82 unit (<1) for each one increased unit in CO oximetry.

This result indicates a very good relation between two methods, they are linearly associated. However, neither correlation analysis nor linear regression can warrant a good agreement between 2 methods. This could be seen on the graph:

In our example, the scatter dot plot shows a good linear correlation between CO and Pulse methods, however even a perfect correlation between 2 methods cannot warrant a good agreement, as we could see on the graph, the regression line between CO and Pulse does not overlap the identity line (y=x). In addition, as the Pulse oximeter is expected to underestimate the true SaO2, several patients have been missclassified among the Normal, Mild and Moderate degrees of hypoxemia.

2) Getting it almost right: Bland-Altman method

In 1983, Bland and Altman introduced a pragmatic method for assessing the agreement between two quantitative measurements. Their method aims to determine the limits of agreement, using the mean and standard deviation of the difference between 2 measurements.

Such approach implied two assumptions that:

the variation of the differences is constant over the range of measurement and
The difference between 2 methods is constant over the range of measurements.

The Bland-Altman plot is set for checking these 2 assumptions, by plotting the differences against averages of method.

By a simple rule, Bland-Altman suggest that 95% of observations should locate within +/- 1.96 standard deviations of the mean difference.

LL=mean(badf$dif)-1.96*(sd(badf$dif))
UL=mean(badf$dif)+1.96*(sd(badf$dif))

LLpc=mean(badf$difpc)-1.96*(sd(badf$difpc))
ULpc=mean(badf$difpc)+1.96*(sd(badf$difpc))

psych::describe(badf$dif)

##    vars   n  mean   sd median trimmed  mad   min  max range skew kurtosis
## X1    1 177 -2.48 6.18   -2.9   -2.69 5.04 -21.5 27.4  48.9 0.73     3.54
##      se
## X1 0.46

psych::describe(badf$difpc)

##    vars   n mean   sd median trimmed  mad    min   max range skew kurtosis
## X1    1 177 -2.7 9.98  -3.35   -3.24 6.31 -43.43 54.15 97.58 1.21     8.24
##      se
## X1 0.75

LL

## [1] -14.58157

UL

## [1] 9.626771

LLpc

## [1] -22.26205

ULpc

## [1] 16.86019

A simple descriptive analysis can describe the behaviors of difference between two methods:

The Pulse oximeter underestimates the SaO2 averagely by -2.48 unit or -2.7%, and the sd of difference was 6.18 or 9.98 %

Based on those results we can determine the limits of agreement by Mean difference +/- 1.96 sd of difference. The limits of agreement between two methods were -14.58 to +9.63 units, or -22.26% to 16.86%

There are also a wide variability of Bland Altman plots. For example it’s also possible to display the differences in percentage, ratio. Some authors also recommended using the reference method instead of mean of both methods.

Basically, the Bland-Altman plot represents every difference between two method in function of the average of the measurements, allowing is to detect a potential relationship between measurement error and the measure scale (or variability of true value). As we never know the true value, the mean of two measurements is the best approximation we have.

badf%>%ggplot(aes(x=CO,y=dif,fill=ClassPulse))+geom_jitter(alpha=0.7,size=3,shape=21,color="black")+geom_hline(yintercept=mean(badf$dif),color="red",size=0.9)+geom_hline(yintercept=0,color="blue",linetype="dotted",size=0.9)+geom_hline(yintercept=c(UL,LL),color="blue",linetype="dashed",size=0.9)+theme_bw()+scale_x_continuous("Reference method")+scale_y_continuous("Difference between 2 methods")+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6)

badf%>%ggplot(aes(x=CO,y=difpc,fill=ClassPulse))+geom_jitter(alpha=0.7,size=3,shape=21,color="black")+geom_hline(yintercept=mean(badf$dif),color="red",size=0.9)+geom_hline(yintercept=0,color="blue",linetype="dotted",size=0.9)+geom_hline(yintercept=c(ULpc,LLpc),color="blue",linetype="dashed",size=0.9)+theme_bw()+scale_x_continuous("Reference method")+scale_y_continuous("Difference between 2 methods (in %)")+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6)

If one method is considered as the reference method or gold standard, its values could be used instead of the mean of two. This presentation mode is better for implying a clinical classification rule, but it’s still controversial, since plotting the difference against a measurement can be misleading about the linear relationship between error (difference) and measurement magnitude.

We can get following information from a Bland-Altman plot:

The Pulse oximetry underestimates the SaO2 averagely by – 2.477 units compared to the CO-oximetry.

The limits of agreement between 2 methods are estimated to be -14.58 and +9.63 units.

The difference between two methods seems to be randomly distributed, there was no obvious pattern of correlation between the difference and the magnitude of measurement.

3) Getting it better: Clinical based interpretation

The limits of agreement is only valid if the difference is normally distributed, this must always be verified by histogram, density curve or a test for normal distribution (Agostino, Kolmogorov Smirnov).

badf%>%ggplot(aes(x=dif))+geom_density(fill="red",alpha=0.4)+theme_bw()

badf%>%select(.,dif,difpc)%>%map(~fBasics::dagoTest(.))

## $dif
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 35.1038
##     Z3  | Skewness: 3.7771
##     Z4  | Kurtosis: 4.5648
##   P VALUE:
##     Omnibus  Test: 2.384e-08 
##     Skewness Test: 0.0001587 
##     Kurtosis Test: 4.999e-06 
## 
## Description:
##  Mon Mar 06 10:23:31 2017 by user: Admin
## 
## 
## $difpc
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 70.3075
##     Z3  | Skewness: 5.5924
##     Z4  | Kurtosis: 6.2476
##   P VALUE:
##     Omnibus  Test: 5.551e-16 
##     Skewness Test: 2.24e-08 
##     Kurtosis Test: 4.168e-10 
## 
## Description:
##  Mon Mar 06 10:23:31 2017 by user: Admin

In this case, the normality test showed that the difference between two methods is NOT normally distributed. Thus our interpretation cannot entirely base on the limits of agreement.

For clinical application purpose, we should decide in advance how much error would be acceptable (for example: 5% or 10% ?). The best way to make such decision, is applying a cut-off based clinical classification rule. Then we verify whether too much cases were misclassified due to the discrepancy between two methods, when one of them is considered as reference method (gold standard). For doing this, we recommend plotting the Bland-Altman plot with X-axis representing the Reference method instead of the mean of two methods, and setting-up a clear classification rule using color or by vertical lines. The customized plot allows better detection of misclassified points.

On the Bland-Altman plot, despite that there were only 5/354 data points that fall outside the limits of agreement (1.4%), some patients have been overdiagnosed, i.e they were misclassified from Normality to Mild hypoxemia or from Mild to moderate hypoxemia. One patient with Moderate hypoxemia under CO oximetry was misclassified as Severe hypoxemia.

The agreement in Hypoxia classification between two methods could be evaluated using the Confusion matrix analysis and Cohen’s Kappa coefficient as follows:

caret::confusionMatrix(reference=badf$ClassCO,data=badf$ClassPulse,mode="everything",dnn = c("Pulse", "ReferenceCO"))

## Confusion Matrix and Statistics
## 
##           ReferenceCO
## Pulse      Severe Moderate Mild Normal
##   Severe        3        1    0      0
##   Moderate      0        7    6      0
##   Mild          0        8   69     34
##   Normal        0        0    4     45
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7006         
##                  95% CI : (0.6273, 0.767)
##     No Information Rate : 0.4463         
##     P-Value [Acc > NIR] : 7.308e-12      
##                                          
##                   Kappa : 0.4921         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Severe Class: Moderate Class: Mild
## Sensitivity                1.00000         0.43750      0.8734
## Specificity                0.99425         0.96273      0.5714
## Pos Pred Value             0.75000         0.53846      0.6216
## Neg Pred Value             1.00000         0.94512      0.8485
## Precision                  0.75000         0.53846      0.6216
## Recall                     1.00000         0.43750      0.8734
## F1                         0.85714         0.48276      0.7263
## Prevalence                 0.01695         0.09040      0.4463
## Detection Rate             0.01695         0.03955      0.3898
## Detection Prevalence       0.02260         0.07345      0.6271
## Balanced Accuracy          0.99713         0.70012      0.7224
##                      Class: Normal
## Sensitivity                 0.5696
## Specificity                 0.9592
## Pos Pred Value              0.9184
## Neg Pred Value              0.7344
## Precision                   0.9184
## Recall                      0.5696
## F1                          0.7031
## Prevalence                  0.4463
## Detection Rate              0.2542
## Detection Prevalence        0.2768
## Balanced Accuracy           0.7644

k=xtabs(data=badf,~ClassCO+ClassPulse)%>%vcd::Kappa(.)
k

##             value     ASE      z  Pr(>|z|)
## Unweighted 0.4921 0.05584  8.812 1.225e-18
## Weighted   0.5708 0.05262 10.847 2.053e-27

k%>%confint()

##             
## Kappa              lwr       upr
##   Unweighted 0.3826272 0.6015083
##   Weighted   0.4676331 0.6738906

The results show that the agreement between two methods was moderate, as evaluated by a weighted Cohen’s Kappa of 0.57 (95CI from 0.47 to 0.67). The weighted Kappa coefficient is used when the severity are evaluated in ascending or descending orders, i.e the Severe class is considered more clinically relevant than the Moderate class.

The Pulse oximetry has a tendency to overdiagnose the Hypoxemia. The balanced accuracy of Pulse oxymetry was excellent to rule-in Severe hypoxia (99.7%), but was merely acceptable (70 % to 72,2 %) for detecting Mild and Moderate Hypoxia, as well as for ruling out hypoxia (76,4%).

The remaining question is whether this is clinically acceptable ? The answer is Yes, because an underdiagnosed hymoxemia is more dangerous to patient than an overdiagnosis.

4) Getting it correctly in a robust way: The mixed linear model approach

In 2008, Bendix Carstensen, Julie Simpson and Lyle C. Gurrin introduced a new approach for assessing the agreement between 2 measurement methods (The International Journal of Biostatistics, 2008). Their method is based on mixed linear model and provides more advantages over the classical Bland-Altman method, including the ability to handle replicate measurements data and to extract directly the limits of agreement from the model’s covariance matrix.

In fact, the classical Bland-Altman method is not correct, as the calculation did not take into account the errors in estimation of the parameters. It could be expected that the accuracy of Bland Altman method depends on the sample size, and this method could underestimate the limits of agreement for small samples.

Another important limitation of the Bland-Altman method is that such method cannot handle the study designs with replicate measurements. According to Bland-Altman, the limit of agreement can only be determined on pooled data, thus the result can only be interpreted as a prediction limits for difference between MEANS of R replications by 2 methods, which is not relevant for A SINGLE measurement in the future.

To deal with replicate measurements data, we must setup a new model framework that allow modelling all the random effects, including the within subject and between-subject variances, as well as the interaction between Method and Replication factors. Such model is ROBUST, as this allows to compare of any number of methods (meth’s level >=2). It’s also polyvalent, since the model allows to determine both limits of agreement between 2 methods and the repeatability of each method.

Under this model, the limits of agreement could be determined based on the standard deviation of the difference between a pair of measurement by two methods on a new individual. The calculation of those limits might inlcude either the variance of interaction between method and replication and variance of each method.

This model could be fitted using the nlme package in R.This only accepts the dataframe in long format. Before fitting the mixed model, the Meth, Item and Repl variables should be treated as factors.

The function for fitting a mixed model is lme. The formula includes 3 terms as follows:

Fix term : y ~ meth + item (y is modeled by 2 fix factors: method and subject’s Id)

Random term : random=list(item=pdIdent(~ meth - 1),repl= ~ 1). This is a list that combineds random effects of 2 distinct methods and Replication

Weight term: weights = varIdent(form= ~1|meth), the model is weighted by Method

data(ox,package="MethComp")
ox$item=factor(ox$item)
ox$repl=factor(ox$repl)

mod=(lme(y~meth+item,random=list(item=pdIdent(~meth-1),repl=~1),weights = varIdent(form=~1|meth),data=ox))

mod

## Linear mixed-effects model fit by REML
##   Data: ox 
##   Log-restricted-likelihood: -911.7401
##   Fixed: y ~ meth + item 
## (Intercept)   methpulse       item2       item3       item4       item5 
##  76.0428384  -2.4704462  -7.0216227   5.1497034 -10.7281860  -1.1137199 
##       item6       item7       item8       item9      item10      item11 
##   3.1649923   9.7065633   3.5568599  -4.1821374 -14.4222446  12.7503731 
##      item12      item13      item14      item15      item16      item17 
## -47.3135668   3.3219575  -1.1293724   6.2565251  -0.5367298  13.9153464 
##      item18      item19      item20      item21      item22      item23 
##   1.5322522  -2.0861271  -1.0351969   6.4653272  -0.4416475   4.5820321 
##      item24      item25      item26      item27      item28      item29 
##   8.2772197   2.1049893   2.7779659 -10.3186089 -10.8197187   0.7833716 
##      item30      item31      item32      item33      item34      item35 
##   2.6444795 -29.2466418   5.6528703   6.8769613   8.7365767   0.9285974 
##      item36      item37      item38      item39      item40      item41 
##   3.0492155   3.6735649   7.5298316   2.7392939  -8.6159587  -0.1044011 
##      item42      item43      item44      item45      item46      item47 
##  -4.3450727 -20.7468236 -16.2943647   2.0329985   4.5130501   3.4254305 
##      item48      item49      item50      item51      item52      item53 
##  -3.0309415  10.4662553 -24.8350417 -20.8508611  -0.3525354  -3.6222924 
##      item54      item55      item56      item57      item58      item59 
##   1.4299081  12.8385572   9.7971680  13.3501148  13.4953406  15.6657386 
##      item60      item61 
##   7.3963452  -1.7503731 
## 
## Random effects:
##  Formula: ~meth - 1 | item
##  Structure: Multiple of an Identity
##           methCO methpulse
## StdDev: 2.928042  2.928042
## 
##  Formula: ~1 | repl %in% item
##         (Intercept) Residual
## StdDev:    3.415692 2.224868
## 
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | meth 
##  Parameter estimates:
##       CO    pulse 
## 1.000000 1.795365 
## Number of Observations: 354
## Number of Groups: 
##           item repl %in% item 
##             61            177

mdif=mod$coefficients$fixed[[2]]

tau=VarCorr(mod)[[8]]%>%as.numeric()

omega=VarCorr(mod)[[11]]%>%as.numeric()

sigma1=mod$sigma

varfunc=coef(mod$modelStruct$varStruct, uncons = FALSE)%>%.[[1]]

sigma2=varfunc*sigma1

The output consists of a nlme class object.

First, the regression coefficient of fix effect for Method predictor could be considered as mean difference between 2 methods, its value is mdif=-2.4704462

This also provides the interaction standard deviation and ONE of the residual standard deviation under the Random Effects section;

The ratio of the residual s.d could be found under the Variance function section : it’s value is 1.795365

In our case, the interaction sd (Tau) is 2.928042

The omega entity indicates a random effect of replication, or the random measurement error due to replication, its value = 3.415692

The residual s.d fof method 1 (CO) is denotted by Sigma 1 = 2.2248679 and for the Method 2 (Pulse) is denotted by Sigma 2.

The sigma 2 value is not explicite in the output, but should be calculated by multiplying the sigma 1 by the ratio between them, which is 1.795365. thus sigma 2= 1.795365 * sigma 1 = 3.99445082

The limits of agreement are given by:

Upper bound = Mean difference + 1.96 * sqrt (Tau^2 + Tau^2 + Sigma1^2 + Sigma2^2 + Omega^2)

Lower bound = Mean difference - 1.96 * sqrt (Tau^2 + Tau^2 + Sigma1^2 + Sigma2^2 + Omega^2)

The Tau value was multiplied by 2 since we have only 2 methods, thus the residual s.d for each method can only be assumed to be equal. In generalised situations, when more than 2 methods are compared against each other, this formula should include the variance (sd^2) of all method, i.e Tau1^2 +Tau2^2+ Tau3^2 +… Taun^2

lim1=mdif+1.96*sqrt(2*tau+sigma1^2+sigma2^2+omega^2)
lim2=mdif-1.96*sqrt(2*tau+sigma1^2+sigma2^2+omega^2)

lim1

## [1] 9.679769

lim2

## [1] -14.62066

By taking into account the within subject, between subject variance and interaction random effect between Replication and Method, we have a new limit of agreement of -14.62 to 9.68

The limits of agreement is not the only issue of our interest, as our study also implied a replicate measurement. We must also determine the method specific repeatability or reproducibility. Repeatability of a method can only be assessed when replicate measurements are available for that method.

repeatCO=1.96*sqrt(1.96)*sqrt(omega^2+sigma1^2)
repeatPulse=1.96*sqrt(1.96)*sqrt(omega^2+sigma2^2)

repeatCO

## [1] 11.18563

repeatPulse

## [1] 14.42169

Those results indicate that the repeatabilities for CO oximetry and Pulse oximetry were 11.19% and 14.42%, respectively; i.e the upper 95% limits for the absolute difference between two repeat measurements by the two methods is 11.2 and 14.4% respectively.

Since the upper and lower limits of agreement of those methods were +9.68 and -14.62, that cover the repeatability range (11.2 and 14.42), we could say that discrepancy between the two methods is largely attributable to the poor repeatability of both methods.

The results of the Mixed linear model could be integrated into a traditional Bland-Altman plot as follows:

badf%>%ggplot(aes(x=repl,y=dif))+geom_jitter(shape=21,size=3,alpha=0.8,aes(fill=repl))+geom_hline(yintercept = c(0,mdif,lim1,lim2),size=1,linetype=c("dashed","solid","longdash","longdash"),color=c("black","red4","blue4","blue4"))+theme_bw()

badf%>%ggplot(aes(x=CO,y=dif))+geom_jitter(shape=21,size=3,alpha=0.8,aes(fill=ClassPulse))+geom_hline(yintercept = c(0,mdif,lim1,lim2),size=1,linetype=c("dashed","solid","longdash","longdash"),color=c("black","red4","blue4","blue4"))+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6,size=1)+theme_bw()

library(viridis)

badf%>%ggplot(aes(x=CO,y=dif))+geom_jitter(shape=21,alpha=0.6,aes(fill=dif,size=dif))+geom_hline(yintercept = c(0,mdif,lim1,lim2),size=1,linetype=c("dashed","solid","longdash","longdash"),color=c("black","red4","blue4","blue4"))+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6,size=1)+theme_bw()+scale_fill_viridis(option="C",begin=0.2,end=0.8)

badf%>%ggplot(aes(x=CO,y=dif))+geom_jitter(shape=21,alpha=0.6,aes(fill=ClassPulse,size=dif))+geom_hline(yintercept = c(0,mdif,lim1,lim2),size=1,linetype=c("dashed","solid","longdash","longdash"),color=c("black","red4","blue4","blue4"))+geom_vline(xintercept=c(40,59,79),linetype="dashed",color="black",alpha=0.6,size=1)+theme_bw()

Those graphs always plot the absolute difference between CO and Pulse oximetries against the values of CO oximetry as reference method, but the horizontal lines indicate the NEW limits of agreement, as determined by the mixed model, instead of using the simplified limits as defined by Bland-Altman method.

Conclusion

The agreement between CO and Pulse oximeters upon replicate measurements have been evaluated using The Carstensen-Simpson method (2008).

Overall, the two methods are well correlated and have an acceptable repeatability. The upper limits for the absolute difference between two repeat measurements by the two methods were 11.2 and 14.4% respectively. The Pulse oximetry underestimate the SaO2 by averagely -2.47 units compared to the standard CO method. The limits of agreement between two methods were -14.62 to +9.68 units. The difference between two methods might lead to an overdiagnosis of hypoxemia when using Pulse oximeter, as the balanced accuracy of this method is only 70% for ruling in Mild and Moderate hypoxemia. However, the discrepancy between two methods does not alter the ability to detect the patients with severe hypoxia. Therefore the Pulse oximeter could be used as a noninvasive, alternative tool for evaluating the oxygen desaturation in patient with a clinically acceptable agreement with the standard CO oximetry method.

Exercise

In the methComp package, there are many datasets for practicing the analysis of agreement between two or more measurement method. For example; in the cardiac dataset: cardiac output is measured repeatedly (three to six times) by impedance cardiography (IC) and radionuclide ventriculography (RV).

Try to reproduce the statistical approaches that you have learnt on this dataset.

Thank you

ESMS - Case study 4: A toolbox for agreement study

Lê Ngọc Khả Nhi

04 March 2017

Case 4: Studying the agreement between two measurement methods