In this document, a possible approach to validation rule analytics in DHIS2 will be discussed. In order to reproduce the example, you should have a fully functional copy of DHIS running on your machine using the Sierra Leone database. One of the major challenges with the current implementation of the validation rules is that they must be run in real-time. If the rules are complex or numerous, this may lead to high server loads. It would be useful to be able to run the validation rules at a time when the system usage is low, and then to persist the evaluation of the rules for later analysis.
In this document, an approach to bulk evaluation of validation rules will be shown. The basic approach will be to obtain the required data from the datavalue table, creating a staging table which will be useful for later analysis. The DHIS2 API will be used to retrieve the validation rules. Some limitation of the approach will be discussed.
First we will load a few packages (you may need to install these with your copy of R).
require(RPostgreSQL)
## Loading required package: RPostgreSQL
## Loading required package: DBI
require(reshape2)
## Loading required package: reshape2
require(data.table)
## Loading required package: data.table
require(stringr)
## Loading required package: stringr
Next, we will define a database connection to the Sierra Leone demo database. In this case, it is on the local machine, so you may need to adjust the paramaters to suit your enviornment.
dbUserName<-"postgres"
dbPassword<-"postgres"
dbHost<-"localhost"
dbName<-"demo"
dbPort<-"5432"
m<-dbDriver("PostgreSQL")
con <- dbConnect(PostgreSQL(), user= dbUserName, password=dbPassword,host=dbHost, port=dbPort,dbname=dbName)
expression.pattern<-"[a-zA-Z][a-zA-Z0-9]{10}(\\.[a-zA-Z][a-zA-Z0-9]{10})?"
Next, we define a function to parse the XML metadata from the DHIS2 server and transform validation rules into a form which is more convenient.
#Override the current method, as we need a slightly different form of the rules. Fix this.
getValidationRules<-function(base.url,username,password) {
#Get a copy of the metadata from the server
r<-httr::GET(paste0(base.url,"api/validationRules.xml?fields=id,name,description,leftSide[expression,missingValueStrategy],rightSide[expression,missingValueStrategy],operator,periodType&paging=false"),
httr::authenticate(username,password),httr::timeout(60))
vr.xml<- httr::content(r, "parsed","application/xml")
vr.names<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule","o"),XML::xmlGetAttr,"name")
vr.op<-sapply(XML::getNodeSet(vr.xml,"//o:operator","o"),XML::xmlValue)
vr.ls<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule/o:leftSide/o:expression","o"),XML::xmlValue)
vr.rs<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule/o:rightSide/o:expression","o"),XML::xmlValue)
vr.ls.strategy<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule/o:leftSide/o:missingValueStrategy","o"),XML::xmlValue)
vr.rs.strategy<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule/o:rightSide/o:missingValueStrategy","o"),XML::xmlValue)
vr.periodType<-sapply(XML::getNodeSet(vr.xml,"//o:validationRule/o:periodType","o"),XML::xmlValue)
vr<-data.frame(name=vr.names,ls=vr.ls,op=vr.op,rs=vr.rs,ls.strategy=vr.ls.strategy,rs.strategy=vr.rs.strategy,periodType=vr.periodType)
#Static predefined map of operators
op.map<-data.frame(x=c("greater_than_or_equal_to","greater_than","equal_to","not_equal_to","less_than_or_equal_to","less_than"),
y=c(">=",">","==","!=","<=","<"),stringsAsFactors=F)
#Strategies
strat.map<-data.frame(x=c("SKIP_IF_ANY_VALUE_MISSING","SKIP_IF_ALL_VALUES_MISSING","NEVER_SKIP"))
#Remap the operators
vr$op<-plyr::mapvalues(vr$op,op.map$x,op.map$y,warn_missing=FALSE)
#Remove decorations
#vr$ls<-gsub("[#{}]","",vr$ls)
#vr$rs<-gsub("[#{}]","",vr$rs)
#Count the left and right side operators
vr$rs.ops<-stringr::str_count(vr$rs,expression.pattern)
vr$ls.ops<-stringr::str_count(vr$ls,expression.pattern)
#vr$rs.ops<-ifelse(vr$rs.ops==0,1,vr$rs.ops)
#vr$ls.ops<-ifelse(vr$ls.ops==0,1,vr$ls.ops)
return(vr) }
Next, we will retreive the validation rules, and parse them into an R data frame object.
#Start with getting the validation rules, to filter the data which we do not actually need
username<-"admin"
password<-"district"
#Adjust this if you need to test against a different server
base.url<-"http://localhost/dhis/"
#Get the rules
vr<-getValidationRules(base.url ,username,password)
Next, we need to get all possible data elements from the validation rules. This is required, so that we can filter the raw data from the datavalue table, only populating our staging table with data which may be required.
#Get all of the possible data elements. These are the only ones we will concern our selves with
foo<-unlist(unique(unlist(str_extract_all(vr$rs,expression.pattern))))
bar<-unlist(unique(unlist(str_extract_all(vr$ls,expression.pattern))))
vr_operands<-data.frame(operand=unique(c(foo,bar)))
vr_des<-unique(as.character(vapply(as.character(vr_operands$operand),function(x){unlist(strsplit(x,"[.]"))[[1]]},FUN.VALUE=character(1))))
head(vr_des)
## [1] "fbfJHSPpUQD" "cYeuwXTCPkU" "s46m5MS0hxu" "eY5ehpbEsB7" "Boy3QwztgeZ"
## [6] "d9vZ3HOlzAd"
The ‘vr_des’ object should contain all of the data elements which are part of any validation rule, which we can use to filter the datavalue table with.
Next, we will retreive the data from the database, taking care to only get the data elements which may be required, and only if they are numeric.
sql<-paste0("
SELECT de.uid as dataElement,
p.iso as period,
pt.name as periodtype,
ou.uid as orgUnit,
coc.uid as categoryOptionCombo,
acoc.uid as attributeOptionCombo,
dv.value
FROM datavalue dv
INNER JOIN dataelement de on dv.dataelementid = de.dataelementid
INNER JOIN _periodstructure p on dv.periodid = p.periodid
INNER JOIN period pp on dv.periodid = pp.periodid
INNER JOIN periodtype pt on pp.periodtypeid = pt.periodtypeid
INNER JOIN categoryoptioncombo coc on dv.categoryoptioncomboid = coc.categoryoptioncomboid
INNER JOIN categoryoptioncombo acoc on dv.attributeoptioncomboid = acoc.categoryoptioncomboid
INNER JOIN organisationunit ou on dv.sourceid = ou.organisationunitid
AND dv.value ~('[1-9][0-9]*')
AND dv.dataelementid in (SELECT dataelementid from dataelement where uid in ('",paste(vr_des,sep="",collapse="','"),"'));")
data<-dbGetQuery(con,sql)
names(data)<-c("dataElement","period","periodtype","orgUnit","categoryOptionCombo","attributeOptionCombo","value")
head(data)
## dataElement period periodtype orgUnit categoryOptionCombo
## 1 cYeuwXTCPkU 201509 Monthly uedNhvYPMNu PT59n8BQbqM
## 2 cYeuwXTCPkU 201509 Monthly GhXvo3BpCvo pq2XI5kz2BY
## 3 fbfJHSPpUQD 201506 Monthly sLKHXoBIqSs pq2XI5kz2BY
## 4 xc8gmAKfO95 201509 Monthly DiszpKrYNg8 V6L425pT3A0
## 5 mGN1az8Xub6 201507 Monthly DiszpKrYNg8 Prlt0C1RF0s
## 6 mGN1az8Xub6 201509 Monthly DiszpKrYNg8 V6L425pT3A0
## attributeOptionCombo value
## 1 bRowv6yZOF2 8
## 2 bRowv6yZOF2 4
## 3 bRowv6yZOF2 4
## 4 bRowv6yZOF2 1
## 5 bRowv6yZOF2 23
## 6 bRowv6yZOF2 23
Thus, we have an data structure consisting of data elemements, periods, organisation units, category and attribute combos, along with the actual value. Next, we will convert all values to integers. While this may not be strictly correct, using an integer as the column type will allow for the construction of a wide table format. The use of “double precision” requires a larger byte size per row, and since we will be creating a very wide table, this may lead to problems in creation of the table. Obviously, using a column-store database is one possible solution to this problem, which would allow for very wide tables, which may not be supported by Postgres
#Be sure these are integers, otherwise Postgresql may blow up
data$value<-as.integer(data$value) #This may throw a warning
foo<-apply(apply(data,2,is.na),1,sum) == 0 #Filter out anything which is not complete.
data<-data[foo,]
Next, we will create a new column in the data , which will hold the “combi”, which consists of the data element UID along with the category option combo UID. Totals are then calculated and appended to the data frame containing the values.
#Calculate the totals and append to the data frame
data$combi<-paste0(data$dataElement,".",data$categoryOptionCombo)
data.totals<-aggregate(value ~ dataElement + period + periodtype + orgUnit + attributeOptionCombo, data = data,FUN=sum)
data.totals$combi<-data.totals$dataElement
data.totals$categoryOptionCombo<-NA
data.totals<-data.totals[,names(data)]
data<-rbind(data,data.totals)
tail(data)
## dataElement period periodtype orgUnit categoryOptionCombo
## 1887951 DTtCy7Nx5jH 2015 Yearly ZZmMpGIE7pD <NA>
## 1887952 DTVRnCGamkV 2015 Yearly ZZmMpGIE7pD <NA>
## 1887953 WUg3MYWQ7pt 2015 Yearly ZZmMpGIE7pD <NA>
## 1887954 s46m5MS0hxu 201509 Monthly DiszpKrYNg8 <NA>
## 1887955 s46m5MS0hxu 201509 Monthly DiszpKrYNg8 <NA>
## 1887956 s46m5MS0hxu 201508 Monthly DiszpKrYNg8 <NA>
## attributeOptionCombo value combi
## 1887951 bRowv6yZOF2 843 DTtCy7Nx5jH
## 1887952 bRowv6yZOF2 207 DTVRnCGamkV
## 1887953 bRowv6yZOF2 5173 WUg3MYWQ7pt
## 1887954 pO5CEqK6c1s 58 s46m5MS0hxu
## 1887955 sSeEjeQ0Rgt 49 s46m5MS0hxu
## 1887956 V6L425pT3A0 26 s46m5MS0hxu
We can see, that the totals have no categoryOptionCombo, and correspond to the same syntax as used by the validation rules when considering totals.
Next, we will extract out what we need from the data, and factorize everything in prepartion of the reshaping of the data from long to wide format.
#Extract what is needed and factorize
foo<-data[,c("period","periodtype","orgUnit","attributeOptionCombo","combi","value")]
foo$period<-as.factor(foo$period)
foo$periodtype<-as.factor(foo$periodtype)
foo$orgUnit<-as.factor(foo$orgUnit)
foo$attributeOptionCombo<-as.factor(foo$attributeOptionCombo)
foo$combi<-as.factor(foo$combi)
Next, we will reshape the data from “long” to wide format. The reason for this is the relatively easy translation of the validation rules syntax to an SQL query, whereby we can construct an SQL statement directly from the validation rules as they are defined in DHIS2, and run this directly against the table.
#Reshape the data to wide format, so that we can call the validation rules as SQL
data_wide<-recast(foo, period + periodtype + orgUnit + attributeOptionCombo ~ combi,fill=0)
## Using period, periodtype, orgUnit, attributeOptionCombo, combi as id variables
#Cast to integers again. Fix this.
data_wide<-cbind(data_wide[,1:4],plyr::numcolwise(as.integer)(data_wide))
Note, in this case, we obtain a table with only 158 columns, which is not that many, but could easily reach several thousand depending on the number of validation rules and operands used.
Next, we will save the wide data object to a database table.
#Drop and write the table
dbSendQuery(con,"DROP TABLE IF EXISTS data_validation_wide;")
## <PostgreSQLResult:(19660,0,2)>
dbWriteTable(con,"data_validation_wide",data_wide)
## [1] TRUE
Next, we will determine which operands are actually present in the data. For which are not, we will determine whether we actually need to evaluate the validation rule or not. If there is no need to run the validation rule, we will simply skip it.
ops_present<-names(data_wide)[5:(length(names(data_wide)))]
vr$required<-FALSE
#Must get rid of validation rules which are not applicable
for (i in 1:length(ops_present)){
vr$required<-vr$required | grepl(ops_present[i],paste0(vr$ls,vr$rs))
}
Next, for any operands which are never present, we will substitute these with zero. This is a limitation in the current approach, as it effectively ignores the “strategy”. This effectively treats all rules as “NEVER_SKIP”.
#Fill in operands which are always going to be zero.
vr<-vr[vr$required,]
expression.pattern<-"[a-zA-Z][a-zA-Z0-9]{10}(\\.[a-zA-Z][a-zA-Z0-9]{10})?"
vr<-vr[vr$required,]
foo<-unlist(unique(unlist(str_extract_all(vr$rs,expression.pattern))))
bar<-unlist(unique(unlist(str_extract_all(vr$ls,expression.pattern))))
vr_operands<-data.frame(operand=unique(c(foo,bar)))
vr_operands$has_data<-vr_operands$operand %in% ops_present
missing_operands<-vr_operands[!vr_operands$has_data,]
if (nrow(missing_operands) > 0) {
for (i in 1:nrow(missing_operands)){
foo<-paste0('"',missing_operands$operand[i],'"')
x<-paste0("gsub('",missing_operands$operand[i],"',0,vr$ls)")
vr$ls<-eval(parse(text=x))
x<-paste0("gsub('",missing_operands$operand[i],"',0,vr$rs)")
vr$rs<-eval(parse(text=x))
} }
Finaly, we will alter the syntax slightly, to make it compatible with the column names and SQL statements.
#Massage the syntax
vr$rs<-gsub("#\\{0\\}","0",vr$rs)
vr$ls<-gsub("#\\{0\\}","0",vr$ls)
vr$ls<-gsub("#\\{",'"',vr$ls)
vr$rs<-gsub("#\\{",'"',vr$rs)
vr$ls<-gsub("\\}",'"',vr$ls)
vr$rs<-gsub("\\}",'"',vr$rs)
vr$op<-gsub("==","=",vr$op)
Next, we will create a table to store the validation rule results.
dbGetQuery(con,"DROP TABLE IF EXISTS validation_rule_results;
CREATE TABLE validation_rule_results(
orgunit text,
period text,
attributeoptioncombo text,
vr text,
ls integer,
p text,
rs double precision,
validation_result boolean);")
## NULL
Next, we construct the SQL queries for each rule and then call the SQL.
#Construct the SQL to run on the wide table
vr$sql<-paste0('INSERT INTO validation_rule_results
SELECT * FROM (
SELECT "orgUnit","period","attributeOptionCombo",',"'",vr$name,"' as vr ,", vr$ls," as ls,'",vr$op,"' as op
,",vr$rs, ' as rs,', vr$ls,vr$op,vr$rs," as validation_result from
data_validation_wide
WHERE periodtype = '",vr$periodType,"'
) as foo WHERE validation_result=FALSE;")
#Execute the queries
start_time<-Sys.time()
for (i in 1:nrow(vr) ) {
dbGetQuery(con,vr$sql[i])
}
end_time<-Sys.time()
end_time-start_time
## Time difference of 1.56968 secs
Note that the total execution time was under 1 second on this machine.
The constructed SQL looks like this. Note that in the data_validation_wide table, a column for each operand is present, which can be called with SQL directly.
INSERT INTO validation_rule_results
SELECT * FROM (SELECT "orgUnit","period","attributeOptionCombo",'ANC 2 <= ANC 1' as vr
,"cYeuwXTCPkU.pq2XI5kz2BY"+"cYeuwXTCPkU.PT59n8BQbqM" as ls,'<=' as op,"fbfJHSPpUQD.pq2XI5kz2BY"+"fbfJHSPpUQD.PT59n8BQbqM" as rs,
"cYeuwXTCPkU.pq2XI5kz2BY"+"cYeuwXTCPkU.PT59n8BQbqM"<="fbfJHSPpUQD.pq2XI5kz2BY"+"fbfJHSPpUQD.PT59n8BQbqM"
WHERE periodtype = 'Monthly
as validation_result from data_validation_wide ) as foo WHERE validation_result=FALSE;"
After running the rules, we will see what happens
vr_results<-dbGetQuery(con,"SELECT * FROM validation_rule_results;")
head(vr_results)
## orgunit period attributeoptioncombo vr ls p rs
## 1 a1dP5m3Clw4 201401 bRowv6yZOF2 ANC 2 <= ANC 1 27 <= 26
## 2 a1E6QWBTEwX 201401 bRowv6yZOF2 ANC 2 <= ANC 1 12 <= 8
## 3 aBfyTU5Wgds 201401 bRowv6yZOF2 ANC 2 <= ANC 1 45 <= 10
## 4 ABM75Q1UfoP 201401 bRowv6yZOF2 ANC 2 <= ANC 1 17 <= 10
## 5 agM0BKQlTh3 201401 bRowv6yZOF2 ANC 2 <= ANC 1 15 <= 10
## 6 AiGBODidxPw 201401 bRowv6yZOF2 ANC 2 <= ANC 1 50 <= 8
## validation_result
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
foo<-data.frame(table(vr_results$vr))
head(foo[with(foo, order(-Freq)), ])
## Var1 Freq
## 11 Commodities Injectable Antibiotics 21997
## 1 ANC 2 <= ANC 1 8891
## 8 Commodities Emergency Contraception 5821
## 13 Commodities Misoprostol 5789
## 5 Commodities Amoxicillin 5752
## 14 Commodities Oral Rehydration Salts 5739
It seems that the most problematic rule is related to “Commodities Injectable Antibiotics” followed by " ANC 2 <= ANC 1“. Once we have the data in this analytical format, other types of analyses (such as number of validation rule violations by district or over time) would be fairly simple to implement.