This document outlines the technical fundamentals of de-duplication (dedupe) in DATIM. We will start with a few diagrams to illustrate what is meant by duplicated data in the context of DATIM.
All data values in DATIM are described by five dimensions: organisation unit, period, data element, disaggregation, and mechanism.
Every data value stored in DATIM must have all of these dimensions associated with it.
A duplicate occurs in DATIM when the first four dimensions are the same, but the values have been reported under different mechanisms. Conceptually, this can be thought of as recording data for the same site, the same indicator, and the same period under two different mechanisms. When this situation occurs in DATIM, the two values are flagged as a potential duplicate and need to be de-duplicated.
Duplication can also occur between related pairs of DSD-TA data elements.
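Before turning to the examples, the following is a minimal sketch (not part of the DATIM workflow itself) of how pure duplicates could be flagged programmatically. It assumes a hypothetical data frame dv whose columns mirror the five dimensions described above, plus a value column.
# Sketch only: "dv" is a hypothetical data frame with the columns ou, period,
# dataElement, disagg, mechanism and value. We flag combinations of the first
# four dimensions which have been reported by more than one mechanism.
dupes<-aggregate(mechanism ~ ou + period + dataElement + disagg,
                 data=dv, FUN=function(x) {length(unique(x))})
dupes<-dupes[dupes$mechanism > 1,]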
First, let's consider a very simple set of values which would be classified as duplicates in DATIM.
d<-read.csv("simpleDupe.csv")
print(d)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 1 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 2 90
In this example, we have two values of HTC_TST for the same site, the same period, and the same disaggregation, but under different mechanisms: Mechanism 1 has reported 50 and Mechanism 2 has reported 90.
First, let us consider the case of total overlap of patients. This situation would occur when all 50 of the individuals reported by Mechanism 1 are completely contained within the 90 individuals reported by Mechanism 2. The following diagram illustrates this situation.
require("VennDiagram")
## Loading required package: VennDiagram
## Loading required package: grid
venn.plot <- draw.pairwise.venn(d[d$mechanism=="Mech 2","value"], d[d$mechanism=="Mech 1","value"], d[d$mechanism=="Mech 1","value"],
category=c("Mech 2", "Mech 1"),
fill=c("red","green"));
By default, DATIM will aggregate data across dimensions. When all results are aggregated across mechanisms (without any deduplication adjustment), we would obtain the following result.
d.agg<-aggregate(value ~ ou + period + dataElement + disagg,data=d,FUN="sum")
print(d.agg)
## ou period dataElement disagg value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default 140
Thus, by default, DATIM would report 140 for the number of individuals tested, which in this scenario is incorrect, because 50 of those individuals have already been reported by another partner.
The solution is to apply a deduplication adjustment. DATIM has a special mechanism, “00000 Deduplication adjustment”, which allows adjustments to be added to duplicate pairs so that, when the values are aggregated by DATIM, the aggregated result is correct while the underlying data as reported by partners remains unchanged. In this case, we need to decrease the aggregate total by 50, so we simply add an additional value under this mechanism with the appropriate adjustment.
When using the De-duplication app within DATIM, this would correspond to a user selecting the “MAX” option; the app would calculate the appropriate adjustment. We simulate this below with a small function which determines the de-duplication adjustment, and then display it alongside the original data.
dedupe.max<-function(x) {max(x)-sum(x)}
d.dupe<-aggregate(value ~ ou + period + dataElement + disagg, data=d,FUN="dedupe.max")
d.dupe$mechanism<-"Dedupe adjustment"
d.deduped<-rbind(d,d.dupe)
print(d.deduped)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 1 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 2 90
## 3 Site 1 2015Q1 HTC_TST (N, DSD) default Dedupe adjustment -50
Effectively, a new value of -50 has been added, which cancels out the overlap of 50 individuals between Mechanism 1 and Mechanism 2. When we aggregate the new values, we see the correct total.
d.dedupe.agg<-aggregate(value ~ ou + period + dataElement + disagg,data=d.deduped,FUN="sum")
print(d.dedupe.agg)
## ou period dataElement disagg value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default 90
Another common case of duplication is when there is effectively no overlap between partners reporting the same indicator. In this case, we would use the “SUM” operation provided in the de-duplication app. To see how this works in practice, we again begin with a diagram.
venn.plot <- draw.pairwise.venn(d[d$mechanism=="Mech 2","value"], d[d$mechanism=="Mech 1","value"], 0,
category=c("Mech 2", "Mech 1"),
fill=c("red","green"));
In this hypothetical case, we see that the 50 individuals from Mechanism 1 and the 90 individuals from Mechanism 2 are completely distinct. To mark the duplication as having been resolved in DATIM, we simply add a zero to the duplicate pair, indicating that there is in fact no overlap between the two groups.
d.dupe.sum<-unique(d[,c("ou","period","dataElement","disagg")])
d.dupe.sum$mechanism<-"Dedupe adjustment"
d.dupe.sum$value<-0
d.deduped.sum<-rbind(d,d.dupe.sum)
print(d.deduped.sum)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 1 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 2 90
## 3 Site 1 2015Q1 HTC_TST (N, DSD) default Dedupe adjustment 0
In this case, the de-duplication adjustment is zero, meaning that when the values are aggregated there is no change to the overall total. However, because the adjustment is present alongside the potentially duplicated values, DATIM understands this potential duplicate to have been marked as resolved.
d.dedupe.sum.agg<-aggregate(value ~ ou + period + dataElement + disagg,data=d.deduped.sum,FUN="sum")
print(d.dedupe.sum.agg)
## ou period dataElement disagg value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default 140
The third case corresponds to the “CUSTOM” function in the de-duplication app in DATIM. This occurs when there is partial overlap within the duplicate pair. Let us assume that 30 individuals have been duplicated between the two mechanisms. Below, we show this diagrammatically.
d.dupes<-30
venn.plot <- draw.pairwise.venn(d[d$mechanism=="Mech 2","value"], d[d$mechanism=="Mech 1","value"], d.dupes,
category=c("Mech 2", "Mech 1"),
fill=c("red","green"));
In this case, we see that 30 of the individuals reported are common to Mechanism 1 and Mechanism 2. In practice, this means that a total of 110 individuals should be reported, as 30 of them have already been counted by one mechanism or the other (although in DATIM it would remain unknown which particular mechanism has reported the duplicated data). In order to achieve this total, we need to add an adjustment of -30, corresponding to the overlap of individuals between the mechanisms.
d.dupe.custom<-unique(d[,c("ou","period","dataElement","disagg")])
d.dupe.custom$mechanism<-"Dedupe adjustment"
d.dupe.custom$value<--30
d.deduped.custom<-rbind(d,d.dupe.custom)
print(d.deduped.custom)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 1 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 2 90
## 3 Site 1 2015Q1 HTC_TST (N, DSD) default Dedupe adjustment -30
Finally, we display the aggregate result of the custom calculation for deduplication below.
d.dedupe.sum.agg<-aggregate(value ~ ou + period + dataElement + disagg,data=d.deduped.custom,FUN="sum")
print(d.dedupe.sum.agg)
## ou period dataElement disagg value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default 110
Duplication can also occur between pairs of DSD and TA data elements which are related. As an example of how this type of duplication can occur, let us consider the data elements “HTC_TST (N, DSD): HTC received results” and “HTC_TST (N, TA): HTC received results”.
d.cross<-read.csv(file="crosswalkDupes.csv",row.names=NULL,stringsAsFactors=F)
print(d.cross)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 1 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default Mech 2 90
## 3 Site 1 2015Q1 HTC_TST (N, DSD) default Dedupe Adjustment -50
## 4 Site 1 2015Q1 HTC_TST (N, TA) default Mech 2 120
This example builds on the one above. Here, there are duplicated DSD values (50 and 90) which have already been de-duplicated with an adjustment of -50.
d.cross$mechanism[grepl("DSD",d.cross$dataElement)]<-"DSD Value"
d.cross.agg<-aggregate(value ~ ou + period + dataElement + disagg + mechanism,data=d.cross,FUN="sum")
print(d.cross.agg)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default DSD Value 90
## 2 Site 1 2015Q1 HTC_TST (N, TA) default Mech 2 120
Thus, in this example, there is a final value of 90 (including the de-duplication adjustment) for DSD and 120 for TA. Supposing that the 90 individuals who were part of DSD were also part of TA, we would need to adjust the TA value with an adjustment of -90. To achieve this, DATIM has another special mechanism named “00001 De-duplication adjustment (DSD-TA)”, which is used to adjust DSD-TA crosswalk duplicates.
Our default logic in this case will be to cancel out the TA which overlaps with DSD. So, if we have 120 TA and 90 total DSD, after de-duplication we will effectively have 30 TA following the addition of the -90 adjustment. Note that the DSD-TA de-duplication adjustment is always attributed to the TA data element but, like the pure de-duplication adjustment, it is never attributed to a particular partner or mechanism. Lastly, we are not necessarily advocating this approach for resolving your own duplicates; it is simply an illustration of how the DSD-TA de-duplication mechanism works from a technical perspective.
d.cross.agg$dataElement<-"HTC_TST (N, TA)"
d.cross.agg.deduped<-aggregate(value ~ ou + period + dataElement + disagg ,data=d.cross.agg,FUN="dedupe.max")
d.cross.agg.deduped$mechanism<-"DSD-TA adjustment"
d.cross.agg.deduped<-rbind(d.cross,d.cross.agg.deduped)
print(d.cross.agg.deduped)
## ou period dataElement disagg mechanism value
## 1 Site 1 2015Q1 HTC_TST (N, DSD) default DSD Value 50
## 2 Site 1 2015Q1 HTC_TST (N, DSD) default DSD Value 90
## 3 Site 1 2015Q1 HTC_TST (N, DSD) default DSD Value -50
## 4 Site 1 2015Q1 HTC_TST (N, TA) default Mech 2 120
## 5 Site 1 2015Q1 HTC_TST (N, TA) default DSD-TA adjustment -90
Note that, looking at both the DSD and TA data values together along with their respective adjustments, we now have five values which aggregate to the intended totals: 90 for DSD and 30 for TA.
Due to the large number of duplicates in certain operating units, manual resolution is a tedious and time-consuming task. In the following example, we will demonstrate with real (but scrubbed) data how it would be conceptually possible to process large numbers of duplicates using the export/import features of DATIM.
First, we will load the duplicates from a file produced by the DATIM server. In general, we can obtain a complete list of duplicates by using the DATIM API. The following URL will provide a CSV file which we can use to calculate our de-duplication adjustments.
https://www.datim.org/api/sqlViews/wzpSd6j89wc/data.csv?
&var=ou:AAAAAAAA&var=pe:2015Q1&var=rs:false&var=ps:20000
&var=pg:1&var=dt:RESULTS&var=ty:PURE
The parameters for the API call are as follows:
Variable | Description |
---|---|
ou | Refers to the operating unit UID |
pe | Refers to the period in question |
rs | If false, show unresolved. If true, show all duplicates, including those which have been resolved. |
ps | Page size. If you want all duplicates, be sure to set this to a number at least as large as the number of duplicates which exist. |
pg | Page number. |
dt | Data set type: This should be either “RESULTS” or “TARGETS”, depending on which data sets should be retrieved. |
ty | Duplicate type: This should be either “PURE” or “CROSSWALK”, depending on which type of duplicates should be retrieved. |
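As a side note, the same CSV can also be retrieved directly from R. The following is a hedged sketch using the httr package, assuming it is installed and that “username” and “password” are placeholders for valid DATIM credentials; the query parameters mirror the table above.
# Sketch only: retrieve the duplicates CSV from the DATIM API with httr.
# "username" and "password" are placeholders for valid credentials.
require("httr")
base.url<-"https://www.datim.org/api/sqlViews/wzpSd6j89wc/data.csv"
query<-list(var="ou:AAAAAAAA", var="pe:2015Q1", var="rs:false",
            var="ps:20000", var="pg:1", var="dt:RESULTS", var="ty:PURE")
r<-GET(base.url, query=query, authenticate("username","password"))
d.api<-read.csv(text=content(r,"text"), stringsAsFactors=FALSE)
In this example, however, we will work from a file which has already been downloaded and scrubbed.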
d<-read.csv(file="ManyDedupes.csv")
d$value<-as.numeric(d$value)
head(d[,2:8])
## site de disagg agency mechanism
## 1 Site 1 TB_ART (N, DSD): TB/HIV on ART (default) Agency 1 Mechanism 1
## 2 Site 1 TB_ART (N, DSD): TB/HIV on ART (default) Agency 2 Mechanism 2
## 3 Site 2 TB_ART (N, DSD): TB/HIV on ART (default) Agency 2 Mechanism 2
## 4 Site 2 TB_ART (N, DSD): TB/HIV on ART (default) Agency 1 Mechanism 3
## 5 Site 3 TB_ART (N, DSD): TB/HIV on ART (default) Agency 1 Mechanism 1
## 6 Site 3 TB_ART (N, DSD): TB/HIV on ART (default) Agency 2 Mechanism 2
## partner value
## 1 Partner 1 6
## 2 Partner 2 52
## 3 Partner 2 251
## 4 Partner 3 2
## 5 Partner 1 7
## 6 Partner 2 52
Note that this file contains around 18,000 duplicates, which would be very tedious and time-consuming to resolve by hand. To demonstrate how we could resolve these with a semi-automated process, let us assume that we will apply the “MAX” function to all data elements except OVC-related data elements, to which we will apply the “SUM” function. Note that we are not endorsing this approach in practice; it is merely a technical example of how certain groups of data elements could be handled differently.
We will start by separating the data.
d.max<-d[!grepl("OVC",d$de),]
d.sum<-d[grepl("OVC",d$de),]
For the non-OVC data, we will apply the “MAX” logic.
d.max.dedupes<-aggregate(value ~ de + disagg + site + ou_uid + de_uid + coc_uid, data=d.max,FUN=function(x) {max(x)-sum(x)})
head(d.max.dedupes)
## de disagg
## 1 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## 2 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## 3 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## 4 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## 5 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## 6 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment (Male, 20-24)
## site ou_uid de_uid coc_uid value
## 1 Site 14 A4GnoCEIH0l gQ61aSBIn72 Ae0SMRtpsJW -2
## 2 Site 1 av1Aonm89RM gQ61aSBIn72 Ae0SMRtpsJW -1
## 3 Site 134 bFIPtqe4NSw gQ61aSBIn72 Ae0SMRtpsJW -1
## 4 Site 116 bJ2FOiMS7UK gQ61aSBIn72 Ae0SMRtpsJW -1
## 5 Site 93 cyTBsZneQ8U gQ61aSBIn72 Ae0SMRtpsJW -1
## 6 Site 107 D0uW6vlX3wt gQ61aSBIn72 Ae0SMRtpsJW -1
We can see that we now have negative values corresponding to the de-duplication adjustment, which has been calculated for each data element / disaggregation / site combination.
Let's check one to see what it looks like.
d.max[d.max$ou_uid =="A4GnoCEIH0l" & d.max$de_uid =="gQ61aSBIn72" & d.max$coc_uid=="xBa6iiCU4pI",
c("de","disagg","site","ou_uid","de_uid","coc_uid","value","mechanism")]
## de
## 4097 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment
## 4098 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment
## 4099 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment
## disagg site ou_uid de_uid coc_uid value
## 4097 (Female, 20-24) Site 14 A4GnoCEIH0l gQ61aSBIn72 xBa6iiCU4pI 2
## 4098 (Female, 20-24) Site 14 A4GnoCEIH0l gQ61aSBIn72 xBa6iiCU4pI 2
## 4099 (Female, 20-24) Site 14 A4GnoCEIH0l gQ61aSBIn72 xBa6iiCU4pI 1
## mechanism
## 4097 Mechanism 2
## 4098 Mechanism 10
## 4099 Mechanism 9
In this case, we see that the duplicate is composed of three values from three different mechanisms. Without any de-duplication adjustment, DATIM would calculate the total for this group as follows: 2 + 2 + 1 = 5. Assuming we choose the “MAX” logic, we would expect the de-duplication adjustment in this case to be max(2, 2, 1) - (2 + 2 + 1) = -3. Let's see if we got it right.
d.max.dedupes[d.max.dedupes$ou_uid =="A4GnoCEIH0l" & d.max.dedupes$de_uid =="gQ61aSBIn72" & d.max.dedupes$coc_uid=="xBa6iiCU4pI",
c("de","disagg","site","ou_uid","de_uid","coc_uid","value")]
## de
## 15289 CARE_NEW (N, DSD, Age/Sex): Received CD4, VL or Assessment
## disagg site ou_uid de_uid coc_uid value
## 15289 (Female, 20-24) Site 14 A4GnoCEIH0l gQ61aSBIn72 xBa6iiCU4pI -3
For this example, we see that the correct de-duplication adjustment has been calculated.
Although there is really nothing to calculate for the data elements which we chose to resolve with the “SUM” function, DATIM requires that we specify a zero so that the system knows the duplicate has been resolved.
d.sum<-d[grepl("OVC",d$de),]
d.sum.dedupes<-unique(d.sum[,c("ou_uid","de_uid","coc_uid")])
d.sum.dedupes$value<-0
Now, let's reshape the data so that we can import it into DATIM. The different data formats for importing data into DATIM are described in detail here. Although XML or JSON could also be used, for the sake of clarity we will demonstrate how to create a CSV file which could be uploaded to DATIM.
First, we need to get the ID of the special mechanism which is used for the de-duplication adjustment. From this URL we can see that DATIM has two mechanisms for de-duplication adjustments, one for normal duplicates and another for DSD-TA duplicates. For now, we will choose the normal de-duplication mechanism, which has a UID of “X8hrDf6bLDC”.
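If needed, these UIDs could in principle be confirmed from R as well. The following is a sketch only; it assumes API access via httr (as above) and that the adjustment mechanisms are exposed as category option combos whose names contain “duplication”.
# Sketch only: look up the de-duplication adjustment mechanisms by name.
# Assumes valid credentials; the name filter is illustrative.
require("httr")
r<-GET("https://www.datim.org/api/categoryOptionCombos.json",
       query=list(filter="name:ilike:duplication", fields="id,name", paging="false"),
       authenticate("username","password"))
content(r,"parsed")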
Now, we will simply reshape our de-duplication adjustments into the correct format. We will convert all fields to character, as required by DATIM, while combining the two types (SUM and MAX) into a single consolidated file.
#The manual states the import file needs to look like this.
#"dataelement","period","orgunit","catoptcombo","attroptcombo","value","storedby","lastupd","comment"
d.dedupes<-rbind(d.max.dedupes[,c("de_uid","ou_uid","coc_uid","value")],
d.sum.dedupes[,c("de_uid","ou_uid","coc_uid","value")])
d.upload<-data.frame(dataelement=as.character(d.dedupes$de_uid),
period="2015Q1",
orgunit=as.character(d.dedupes$ou_uid),
catoptcombo=as.character(d.dedupes$coc_uid),
attroptcombo="X8hrDf6bLDC",
value=as.character(d.dedupes$value),
storedby="",
lastupd="",
comment="")
head(d.upload[,1:6])
## dataelement period orgunit catoptcombo attroptcombo value
## 1 gQ61aSBIn72 2015Q1 A4GnoCEIH0l Ae0SMRtpsJW X8hrDf6bLDC -2
## 2 gQ61aSBIn72 2015Q1 av1Aonm89RM Ae0SMRtpsJW X8hrDf6bLDC -1
## 3 gQ61aSBIn72 2015Q1 bFIPtqe4NSw Ae0SMRtpsJW X8hrDf6bLDC -1
## 4 gQ61aSBIn72 2015Q1 bJ2FOiMS7UK Ae0SMRtpsJW X8hrDf6bLDC -1
## 5 gQ61aSBIn72 2015Q1 cyTBsZneQ8U Ae0SMRtpsJW X8hrDf6bLDC -1
## 6 gQ61aSBIn72 2015Q1 D0uW6vlX3wt Ae0SMRtpsJW X8hrDf6bLDC -1
write.table(d.upload,file="dedupes_resolved.csv",quote=TRUE,sep=",",na="",row.names=F,col.names=F)
This file could then be imported into DATIM using the normal import/export functionality of the system. Although this was a trivial example, the approach could easily be extended to more complex groupings of data elements and organisation units using custom logic, as sketched below.
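As a purely hypothetical sketch of such an extension, the resolution method could be assigned per data element (rather than through a single grepl() condition) and the adjustment computed for each duplicate group accordingly; the OVC/MAX/SUM split below is illustrative only.
# Hypothetical sketch: assign a resolution method per data element and compute
# the corresponding adjustment for each duplicate group in "d" (the duplicates
# loaded from ManyDedupes.csv above). The method assignment is illustrative.
dedupe.adjust<-function(values, method) {
  switch(method,
         "MAX"=max(values)-sum(values), # keep only the largest reported value
         "SUM"=0)                       # values are distinct; no adjustment
}
d$method<-ifelse(grepl("OVC",d$de), "SUM", "MAX")
d.groups<-split(d, list(d$ou_uid, d$de_uid, d$coc_uid), drop=TRUE)
d.adjustments<-do.call(rbind, lapply(d.groups, function(x) {
  data.frame(ou_uid=x$ou_uid[1], de_uid=x$de_uid[1], coc_uid=x$coc_uid[1],
             value=dedupe.adjust(x$value, x$method[1]),
             stringsAsFactors=FALSE)
}))
The resulting data frame could then be reshaped for import in exactly the same way as shown above.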