Log example

Objective

We want to know the contents of the summary sheet presented at each patient encounter.

Problem

Summary sheets are generated every time there is a change in the AMRS, and there is not a 1-to-1 match with patient encounters. Since it is not possible to collect summary sheets after encounters for retrospective evaluation, we have to use the logging script Win created to match the summary sheets to encounters.

Approach

Use the patient ID and date of encounter to match to the patient's most recent summary sheet. For instance, if patient A has encounters on Jan 1, Feb 1, and Mar 1, and if we know a summary sheet was generated on Jan 20, then we match it to the Feb 1 encounter record.

This is with no assumptions. We could add some additional specifications to say that (a) if the site gets summary sheets delivered and if (b) the time between the patient encounter and the most recent summary sheet generation is less than X days, then it would not be possible for this summary sheet to have been present at the encounter. Instead, match the encounter to the previous summary sheet.

Let's go

First, let's create two fake datasets: logs and encounters. The logs dataframe is what you get from Win's script. The encs dataframe would come from an AMRS query of encounters. Two caveats: (1) I'm only including the most relevant data for the matching; and (2) I don't know if I'm using the correct names for the encounter fields.

  logs <- data.frame(patientId=rep(seq(1, 5, 1), 2),
                     generatedOn=c("1-1-2014", "1-10-2014", "1-20-2014",
                                   "1-30-2014", "2-1-2014", 
                                   "2-10-2014", "2-20-2014", "3-1-2014", 
                                   "3-10-2014", "3-20-2014"))
  logs

##    patientId generatedOn
## 1          1    1-1-2014
## 2          2   1-10-2014
## 3          3   1-20-2014
## 4          4   1-30-2014
## 5          5    2-1-2014
## 6          1   2-10-2014
## 7          2   2-20-2014
## 8          3    3-1-2014
## 9          4   3-10-2014
## 10         5   3-20-2014


  encs <- data.frame(patientId=rep(seq(1, 5, 1), 2),
                     encDate=c("1-30-2014", "1-12-2014", "2-01-2014",
                               "2-15-2014", "2-10-2014", 
                               "2-17-2014", "2-29-2014", "3-15-2014", 
                               "3-30-2014", "3-27-2014"),
                     encLoc=c("a", "b", "c", "a", "b",
                              "a", "b", "c", "a", "b"))
  encs

##    patientId   encDate encLoc
## 1          1 1-30-2014      a
## 2          2 1-12-2014      b
## 3          3 2-01-2014      c
## 4          4 2-15-2014      a
## 5          5 2-10-2014      b
## 6          1 2-17-2014      a
## 7          2 2-29-2014      b
## 8          3 3-15-2014      c
## 9          4 3-30-2014      a
## 10         5 3-27-2014      b

We can sort the data ascending order by patient and date.

# convert date
  library(lubridate)
  logs$generatedOn <- mdy(logs$generatedOn)
  encs$encDate <- mdy(encs$encDate)
# sort
  logs <- logs[order(logs$patientId, logs$generatedOn),]
  encs <- encs[order(encs$patientId, encs$encDate),]

Now we can merge the two datasets by patientId and keep all combinations.

  dat <- merge(encs, logs, by="patientId", all=T)
  dat

##    patientId    encDate encLoc generatedOn
## 1          1 2014-01-30      a  2014-01-01
## 2          1 2014-01-30      a  2014-02-10
## 3          1 2014-02-17      a  2014-01-01
## 4          1 2014-02-17      a  2014-02-10
## 5          2 2014-01-12      b  2014-01-10
## 6          2 2014-01-12      b  2014-02-20
## 7          2 2014-03-01      b  2014-01-10
## 8          2 2014-03-01      b  2014-02-20
## 9          3 2014-02-01      c  2014-01-20
## 10         3 2014-02-01      c  2014-03-01
## 11         3 2014-03-15      c  2014-01-20
## 12         3 2014-03-15      c  2014-03-01
## 13         4 2014-02-15      a  2014-01-30
## 14         4 2014-02-15      a  2014-03-10
## 15         4 2014-03-30      a  2014-01-30
## 16         4 2014-03-30      a  2014-03-10
## 17         5 2014-02-10      b  2014-02-01
## 18         5 2014-02-10      b  2014-03-20
## 19         5 2014-03-27      b  2014-02-01
## 20         5 2014-03-27      b  2014-03-20

  dat[dat$patientId==1,]

##   patientId    encDate encLoc generatedOn
## 1         1 2014-01-30      a  2014-01-01
## 2         1 2014-01-30      a  2014-02-10
## 3         1 2014-02-17      a  2014-01-01
## 4         1 2014-02-17      a  2014-02-10

If you look at patient 1, you'll see that she has two summary sheet dates for each encounter date. Let's create a new column that calculates the difference in days.

  dat$diff <- difftime(dat$encDate, dat$generatedOn, units="days")
  dat

##    patientId    encDate encLoc generatedOn     diff
## 1          1 2014-01-30      a  2014-01-01  29 days
## 2          1 2014-01-30      a  2014-02-10 -11 days
## 3          1 2014-02-17      a  2014-01-01  47 days
## 4          1 2014-02-17      a  2014-02-10   7 days
## 5          2 2014-01-12      b  2014-01-10   2 days
## 6          2 2014-01-12      b  2014-02-20 -39 days
## 7          2 2014-03-01      b  2014-01-10  50 days
## 8          2 2014-03-01      b  2014-02-20   9 days
## 9          3 2014-02-01      c  2014-01-20  12 days
## 10         3 2014-02-01      c  2014-03-01 -28 days
## 11         3 2014-03-15      c  2014-01-20  54 days
## 12         3 2014-03-15      c  2014-03-01  14 days
## 13         4 2014-02-15      a  2014-01-30  16 days
## 14         4 2014-02-15      a  2014-03-10 -23 days
## 15         4 2014-03-30      a  2014-01-30  59 days
## 16         4 2014-03-30      a  2014-03-10  20 days
## 17         5 2014-02-10      b  2014-02-01   9 days
## 18         5 2014-02-10      b  2014-03-20 -38 days
## 19         5 2014-03-27      b  2014-02-01  54 days
## 20         5 2014-03-27      b  2014-03-20   7 days

Under the simple assumptions, we want to drop any duplicates according to generatedOn and patientID. When we use the !duplicated() function below, R is going to keep only the first instance of the duplicate (patientId + encDate). So we can sort the dataframe so that the first instance of each duplicate will be the combination with the smallest, positive value for diff. For patientId==1 and her encounter on 2014-01-30, we want to match to the observation with a difference of 29 days. We can discard the other observation that has a difference of -11 days. This negative value means that the summary sheet was generated after the encounter.

# remove negative values
  dat <- dat[dat$diff>=0,]
  # if we wanted, we could say if encLoc=="a" | encLoc=="b", then >=7
# sort ascending again, this time with diff
  dat <- dat[order(dat$patientId, dat$encDate, dat$diff),]
# remove duplicates leaving first match (smallest diff value)
  dat2 <- dat[ !duplicated(dat$patientId, dat$encDate), ]
  dat2

##    patientId    encDate encLoc generatedOn    diff
## 1          1 2014-01-30      a  2014-01-01 29 days
## 5          2 2014-01-12      b  2014-01-10  2 days
## 9          3 2014-02-01      c  2014-01-20 12 days
## 13         4 2014-02-15      a  2014-01-30 16 days
## 17         5 2014-02-10      b  2014-02-01  9 days

Compare this final match with the orginal merged dataframe with all possible combinations. You should be able to see that every encounter is matched to the correct summary sheet generated diff days before the encounter. This difference is the smallest positive value.

The actual log and encounter data are more complex, but I think the approach should hold. That said, there are probably easier methods. Best to check in with the real programmers.