Examining Logs

Objective

We want to know the contents of the summary sheet presented at each patient encounter.

Problem

There are a few challenges. First, facilities do not track whether summary sheets are present at a specific encounter for a specific patient. Our team of DAs is conducting spot checks to generate facility-level numbers, but we are also interested in knowing (i) whether a summary sheet was present at a specific encounter and, if so, (ii) the content of that summary sheet (e.g., which reminders, how many reminders).

Since it is not possible to collect summary sheets after encounters for retrospective evaluation, (i) is unknowable for every patient. The question at hand is whether we can come up with an approximation for (ii).

To do this, we asked Win to create a script that logs when summary sheets are generated or viewed (note: neither is equivalent to “printed”). Win wrote the script, but we are left with another challenge: (iii) summary sheets are generated every time there is a change in the AMRS, so there is not a 1-to-1 match between log entries and patient encounters.

So how can we make an educated guess about the content of the summary sheet for a specific patient that, if actually printed/delivered/viewed, would have been available to a clinician during a specific encounter?

Approach

Use the patient ID and date of encounter to match each encounter to the patient's most recent summary sheet. For instance, if patient A has an encounter on Feb 1, and if we know summary sheets were generated/viewed on Jan 5, Jan 20, and Feb 5, then we match the Feb 1 encounter to the most recent log entry on or before Feb 1: Jan 20.

This basic rule makes no assumptions. We could add some additional specifications: if (a) the site gets summary sheets delivered and (b) the time between the patient encounter and the most recent summary sheet generation is less than X days, then that summary sheet could not have been present at the encounter, and we would instead match the encounter to the previous summary sheet.
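
To make the basic rule and the X-day variant concrete, here is a minimal sketch on toy data (patient "A", the dates, and the X = 7 day threshold are invented for illustration; the real matching below is done on the merged encounter and log data):

# toy illustration of the matching rule; not part of the real pipeline
  toyLogs <- data.frame(patientId = "A",
                        logDate   = as.Date(c("2014-01-05", "2014-01-20", "2014-02-05")))
  encDate <- as.Date("2014-02-01")
# basic rule: most recent log on or before the encounter date
  eligible <- toyLogs[toyLogs$logDate <= encDate, ]
  eligible$logDate[which.max(eligible$logDate)]    # "2014-01-20"
# delivery-lag variant: require the log to be at least X days before the encounter
  X <- 7
  lagged <- toyLogs[as.numeric(encDate - toyLogs$logDate) >= X, ]
  lagged$logDate[which.max(lagged$logDate)]        # still "2014-01-20"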

Let's go

Keny queried all return encounters from April 11, 2014 (study launch) to June 16, 2014. He also grabbed all log data, generated and viewed, since the study launched. So in the first step we import all datasets and merge the different log files. File structures vary, so we have to do a bit of munging.

# load packages for date parsing (dmy/mdy below come from lubridate)
  library(lubridate)
# import logs
  logs1v <- read.csv("logs/log1v.csv", stringsAsFactors=FALSE, header=F)
  logs1g <- read.csv("logs/log1g.csv", stringsAsFactors=FALSE, header=F)
  logs2v <- read.csv("logs/log2v.csv", stringsAsFactors=FALSE, header=F)
  logs2g <- read.csv("logs/log2g.csv", stringsAsFactors=FALSE, header=F)
  logs3v <- read.csv("logs/log3v.csv", stringsAsFactors=FALSE, header=F)
  logs3g <- read.csv("logs/log3g.csv", stringsAsFactors=FALSE, header=F)
# rename and format date
  names(logs1v) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                    "nonTB", "ieLoc", "ieDate", "TBreminder1", "TBreminder2")
  names(logs1g) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                      "nonTB", "ieLoc", "ieID", "TBreminder1", "TBreminder2",
                     "TBreminder3")
  logs1v$logDate <- dmy(substr(logs1v$logDate, 1, 12))
  logs1g$logDate <- dmy(substr(logs1g$logDate, 1, 12))
  logs1v <- logs1v[,c(-1, -4)]
  logs1g <- logs1g[,c(-1, -4)]

  names(logs2v) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                      "nonTB", "TBreminder1", "TBreminder2",
                     "TBreminder3")
  names(logs2g) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                      "nonTB", "TBreminder1", "TBreminder2",
                     "TBreminder3")
# some logs2 dates use 2-digit years; split on "-", pad the year, then re-parse
  logs2v$temp1 <- substr(logs2v$logDate, 1, 12)
  foo <- data.frame(do.call('rbind', 
                            strsplit(as.character(logs2v$temp1),
                                     '-',
                                     fixed=TRUE)))
  foo$X4 <- ifelse(nchar(as.character(foo$X3))==2, 
                   paste0("20", as.character(foo$X3)), as.character(foo$X3))
  logs2v$logDate <- paste(foo$X1, foo$X2, foo$X4, sep="-")
  logs2v$logDate <- dmy(logs2v$logDate)
## Warning: 1 failed to parse.
  remove(foo)
  logs2v$temp1 <- NULL

# same 2-digit-year fix for the generated-sheet log
  logs2g$temp1 <- substr(logs2g$logDate, 1, 12)
  foo <- data.frame(do.call('rbind', 
                            strsplit(as.character(logs2g$temp1),
                                     '-',
                                     fixed=TRUE)))
## Warning: number of columns of result is not a multiple of vector length
## (arg 2427)
  foo$X4 <- ifelse(nchar(as.character(foo$X3))==2, 
                   paste0("20", as.character(foo$X3)), as.character(foo$X3))
  logs2g$logDate <- paste(foo$X1, foo$X2, foo$X4, sep="-")
  logs2g$logDate <- dmy(logs2g$logDate)
## Warning: 1 failed to parse.
  remove(foo)
  logs2g$temp1 <- NULL

  logs2v <- logs2v[,c(-1, -4)]
  logs2g <- logs2g[,c(-1, -4)]

  names(logs3v) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                      "nonTB", "ieLoc", "ieID", "ieDate", "TBreminder1",
                     "TBreminder2", "TBreminder3")
  names(logs3g) <- c("drop", "patientId", "logDate", "drop2", "requestedby",
                      "nonTB", "ieLoc", "ieID", "ieDate", "TBreminder1", 
                     "TBreminder2", "TBreminder3")
  logs3v$logDate <- dmy(substr(logs3v$logDate, 1, 12))
  logs3g$logDate <- dmy(substr(logs3g$logDate, 1, 12))
  logs3v <- logs3v[,c(-1, -4)]
  logs3g <- logs3g[,c(-1, -4)]

# row bind
  library(plyr)
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:lubridate':
## 
##     here
  logs1 <- rbind.fill(logs1g, logs2g, logs3g)
  logs2 <- rbind.fill(logs1v, logs2v, logs3v)
# add a column to each file to indicate generated or viewed
  logs1$type <- "g"
  logs2$type <- "v"
# combine log files
  logs <- rbind.fill(logs1, logs2)
# encounters
  encs <- read.csv("logs/enc.csv", stringsAsFactors=FALSE)
  names(encs) <- c("patientId", "encLocID", "encDate", "encLocname")
  encs$encDate <- mdy(encs$encDate)
  #encs <- subset(encs, encs$encDate >= mdy("01-01-2014"))

We can sort the data in ascending order by patient and date. Keny says that it is possible to have duplicate encounters (patient + date), so we check and remove them.

# remove duplicates
  encs <- encs[!duplicated(encs[, c("patientId", "encDate")]), ]
# sort
  logs <- logs[order(logs$patientId, logs$logDate),]
  encs <- encs[order(encs$patientId, encs$encDate),]

This leaves us with 32710 return encounters and 980451 log entries (going back to 2013). Now we can merge the two datasets by patientId and keep all combinations.

  dat <- merge(encs, logs, by="patientId", all=T)
  dat <- subset(dat, !is.na(dat$patientId))

The dataframe is now arranged in such a way that patients can have multiple summary sheet dates for each encounter date. Let's create a new column that calculates the difference in days.

  dat$diff <- difftime(dat$encDate, dat$logDate, units="days")

Under the simple assumptions, we want to drop any duplicates according to patientId and encDate, keeping one log entry per encounter. When we use !duplicated() below, R keeps only the first instance of each duplicate (patientId + encDate). So we first sort the dataframe so that the first instance of each duplicate is the combination with the smallest non-negative value of diff. We only want non-negative differences because a log has to be generated on or before the encounter date for the summary to have any chance of being viewed at the encounter.

# remove negative values
  dat <- dat[!is.na(dat$diff) & dat$diff >= 0, ]
  # if we wanted, we could say if encLoc=="a" | encLoc=="b", then >=7
# sort ascending again, this time with diff
  dat <- dat[order(dat$patientId, dat$encDate, dat$diff),]
# remove duplicates leaving first match (smallest diff value)
  dat2 <- dat[!duplicated(dat[, c("patientId", "encDate")]), ]

Now every return encounter is matched to the most recent log entry (generated or viewed) on or before the encounter. Let's tally the number of return encounters by location and see if all of these encounters have data in the logs. We're not being picky here in this demonstration: any log entry on or before the encounter date counts. We'll limit ourselves to study mother sites.

# matched return encounters by site
tbl <- data.frame(table(dat2$encLocname))
names(tbl) <- c("encLocname", "matched")
# total return encounters by site
encnum <- aggregate(patientId ~ encLocname, data = encs, FUN = length)
names(encnum) <- c("encLocname", "total.enc")
# merge and compute the proportion of encounters matched to a log entry
dat3 <- merge(encnum, tbl, by = "encLocname", all.x = TRUE)
dat3$matched[is.na(dat3$matched)] <- 0
dat3$per <- dat3$matched/dat3$total.enc
# limit to study mother sites
sites <- c("Khuyangu", "Mukhobola", "Turbo", "Ziwa", "Kitale", "Uasin Gishu District Hospital", 
    "Webuye", "Port Victoria", "Bumala A", "Mt. Elgon", "Iten", "Mois Bridge", 
    "Teso", "Mosoriot", "Busia", "ANGURAI", "Huruma SDH", "Chulaimbo", "Bumala B", 
    "Burnt Forest")
dat3 <- dat3[dat3$encLocname %in% sites, ]
dat3
##                       encLocname total.enc matched    per
## 5                        ANGURAI       335     334 0.9970
## 7                       Bumala A       861     860 0.9988
## 8                       Bumala B       349     347 0.9943
## 9                   Burnt Forest       825     814 0.9867
## 10                         Busia       123     123 1.0000
## 17                     Chulaimbo        26      26 1.0000
## 21                    Huruma SDH       167     164 0.9820
## 22                          Iten       450     444 0.9867
## 28                      Khuyangu      1310    1298 0.9908
## 29                        Kitale      2991    2963 0.9906
## 39                   Mois Bridge       414     411 0.9928
## 40                      Mosoriot      1424    1413 0.9923
## 41                     Mt. Elgon       143     142 0.9930
## 46                     Mukhobola       697     685 0.9828
## 51                 Port Victoria      1569    1556 0.9917
## 60                          Teso       673     670 0.9955
## 62                         Turbo      1398    1380 0.9871
## 63 Uasin Gishu District Hospital       681     674 0.9897
## 64                        Webuye      2144    2140 0.9981
## 65                          Ziwa       281     275 0.9786

So we were able to match a plausible log entry to nearly every return encounter across all of the study facilities. Success!

Here's an example to show how the matching works. Take a look at patient 656. She had an encounter on April 28, 2014, and we matched this encounter to every log entry we found for her that occurred on or before this date.

subset(dat, dat$patientId == 656)
##        patientId encLocID    encDate                    encLocname
## 602328       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602329       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602321       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602323       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602326       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602324       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602325       656       70 2014-04-28 Pioneer Sub-District Hospital
## 602322       656       70 2014-04-28 Pioneer Sub-District Hospital
##           logDate requestedby nonTB ieLoc ieID TBreminder1 TBreminder2
## 602328 2014-03-02     rkkisia     1    NA   NA                        
## 602329 2014-03-02      daemon     1    NA   NA                        
## 602321 2013-12-21      daemon     1    NA   NA                        
## 602323 2013-12-21   rjkeitany     1    NA   NA                        
## 602326 2013-12-21   rjkeitany     1    NA   NA                        
## 602324 2013-10-14     rkkisia     0    NA   NA                        
## 602325 2013-10-14      daemon     0    NA   NA                        
## 602322 2013-10-10      daemon     0    NA   NA                        
##        TBreminder3 ieDate type     diff
## 602328               <NA>    v  57 days
## 602329               <NA>    g  57 days
## 602321               <NA>    g 128 days
## 602323               <NA>    v 128 days
## 602326               <NA>    v 128 days
## 602324               <NA>    v 196 days
## 602325               <NA>    g 196 days
## 602322               <NA>    g 200 days

In the end, we matched her encounter to the most recent log entry on March 2. [Sadly, in this example, the most recent summary sheet possible was 57 days old. But this is not our concern here.]

subset(dat2, dat2$patientId == 656)
##        patientId encLocID    encDate                    encLocname
## 602328       656       70 2014-04-28 Pioneer Sub-District Hospital
##           logDate requestedby nonTB ieLoc ieID TBreminder1 TBreminder2
## 602328 2014-03-02     rkkisia     1    NA   NA                        
##        TBreminder3 ieDate type    diff
## 602328               <NA>    v 57 days

Does this mean that when patient 656 arrived for her appointment on April 28 that her clinician viewed her summary sheet? No. We have no way to know this.

Does it mean that a summary sheet was printed and in her file when she met with the clinician? Again, no.

Does it mean that, if a summary sheet was printed, we have a reasonable guess about the content? Here, I think the answer is yes. If a sheet was printed and placed in her file, patient 656's provider would most likely have been exposed to 1 non-TB reminder.
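
To read that content off the matched row, one could pull the reminder fields directly (this just re-displays columns from the dat2 row shown above):

# reminder content on the matched sheet for patient 656
subset(dat2, dat2$patientId == 656,
       select = c(nonTB, TBreminder1, TBreminder2, TBreminder3))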

One more check on the implementation of randomization

Before and after we turned “on” the new TB reminders for facilities randomized to treatment, Keny verified that only treatment sites were getting reminders. Let's check to make sure that is still the case.

We count the number of patients who have received TB reminders since the study launched. Then we aggregate by site and merge with each site's treatment assignment.
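
The code that builds the per-site counts is not shown here. A rough sketch of how cttb and rand could be assembled (the use of dat2, the rule that a non-empty TBreminder field marks a reminder, and the randomization file name and columns are all assumptions for illustration):

# hypothetical sketch of the inputs to the merge below
# flag rows whose matched summary sheet carried any TB reminder text
hasTB <- function(x) !is.na(x) & x != ""
dat2$anyTB <- as.integer(hasTB(dat2$TBreminder1) | hasTB(dat2$TBreminder2) |
                         hasTB(dat2$TBreminder3))
# cttb: per-site count of reminder-bearing rows (swap in unique patients if that
# is the intended count)
cttb <- aggregate(anyTB ~ encLocID, data = dat2, FUN = sum)
names(cttb) <- c("encLocID", "ctTBreminders")
# rand: one row per site with its randomization arm (assumed columns: site.id, trt)
rand <- read.csv("logs/randomization.csv", stringsAsFactors = FALSE)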

The (potentially very) bad news is that it looks like patients with encounters at CONTROL sites are getting TB reminders. I've coded “contamination==1” if the site is a control site and any of its patients are receiving TB reminders.

# merge per-site reminder counts with the randomization assignment
dat6 <- merge(cttb, rand, by.y = "site.id", by.x = "encLocID")
dat6$contamination <- ifelse(dat6$trt == 0 & dat6$ctTBreminders > 0, 1, 0)
dat6 <- dat6[, -c(1, 3)]
dat6 <- dat6[order(-dat6$contamination), ]
dat6
##    ctTBreminders trt contamination
## 4             12   0             1
## 6              5   0             1
## 7            119   0             1
## 9             13   0             1
## 10            24   0             1
## 12           199   0             1
## 16           108   0             1
## 22            18   0             1
## 27            72   0             1
## 29             1   0             1
## 30            55   0             1
## 32             3   0             1
## 35             7   0             1
## 37            63   0             1
## 38             1   0             1
## 40             1   0             1
## 1             89   1             0
## 2             69   1             0
## 3             43   1             0
## 5             50   1             0
## 8             59   1             0
## 11           129   1             0
## 13             7   1             0
## 14            50   1             0
## 15            10   1             0
## 17             0   0             0
## 18             0   1             0
## 19             0   1             0
## 20            23   1             0
## 21            20   1             0
## 23             7   1             0
## 24             0   1             0
## 25             1   1             0
## 26             8   1             0
## 28             0   0             0
## 31             0   0             0
## 33            10   1             0
## 34            10   1             0
## 36             1   1             0
## 39             0   0             0
## 41             0   1             0
## 42             0   0             0
## 43             0   1             0