Fraud Detection Challenge

The challenge

Explore the data using data visualisation and discover which site contains fabricated data
Use one or more data visualisations to provide convincing evidence to influence the study team to conduct further investigation

Further Information

Fabricated data often has the following properties:

Digit preference: fraudsters often favour round numbers
Variability: fraudsters unconsciously produce data that is too consistent
Missing data: fraudsters often underestimate the drop-out rate or level of missing data
Correlation: fabricated longitudinal data often overestimates the correlation between visits
Visit regularity: visit dates are often more variable in real data compared to fabricated data

Exploring digit preference

The histogram displays the distribution of observed blood pressure measurements within each site:

ggplot(data = ADVS, aes(x = AVAL, fill = PARAM)) +
  geom_histogram(bins = 50, colour = "grey3") +
  scale_x_continuous(breaks = seq(50, 200, by=25)) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  theme_bw() +
  facet_wrap( ~ SITEID) +
  labs(x = "Analysis Value", y = "Frequency", fill = "Parameter") +
  theme(legend.position = "bottom",
        strip.background = element_rect(fill = "white"))

Let’s examine the 20 most frequently reported values across sites:

counts <- ADVS %>%
  group_by(SITEID, PARAM, AVAL) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) 

ggplot(data = counts[1:20, ], aes(x = reorder(AVAL, -count), y = count, fill = SITEID)) +
  geom_bar(stat = "identity", colour = "grey2") +
  guides(fill = guide_legend(nrow = 1)) +
  theme_bw() +
  scale_x_discrete(drop = FALSE) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  scale_y_continuous(breaks = seq(0, 150, by=25)) +
  labs(x = "Value", y = "Frequency", fill = "Site Identifier") +
  theme(legend.position = "bottom",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10))

Observations

SITE03 appears to favour “nice” numbers ending in 0 or 5 more frequently than the other sites. This pattern is consistent with digit preference, as seemingly the measurements are rounded or estimated rather than recorded exactly.

Exploring variability

To explore variability, we examine how SBP and DBP values are distributed across participants at different sites.

ggplot(data = ADVS, aes(x = reorder(AVISIT, AVISITN), y = AVAL)) +
  geom_line(aes(group = USUBJID, colour = USUBJID)) +
  scale_colour_discrete_sequential(palette = "Red-Blue") +
  facet_grid(vars(PARAM), vars(SITEID)) +
  labs(x = "Analysis Visit", y = "Analysis Value") +
  theme_bw() +
  theme(legend.position = "none",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10),
        axis.text.x = element_text(size = 4))

Observations

Measurements from SITE03 appear more homogeneous across participants than those from the other sites. The lines are smoother and less scattered, which indicates smaller changes between visits and reduced variability over time.

To further explore variability within and between participants, we calculate the coefficient of determination (R-squared) for each site and parameter (SBP and DBP). Higher R-squared values will indicate that measurements follow a more predictable pattern over time.

results <- data.frame()
residuals <- data.frame()
sites <- unique(ADVS$SITEID)
params <- unique(ADVS$PARAM)

for (i in seq(sites))
{
  for (j in seq(params))
  {
    subset_i <- ADVS[ADVS$SITEID %in% sites[i], ]
    subset_ij <- subset_i[subset_i$PARAM %in% params[j],]
    fit <- lm(AVAL ~ AVISITN, data = subset_ij)
    
    y = subset_ij$AVAL
    y_hat = fitted(fit)
    y_mean = mean(y)

    ESS = sum((y_hat - y_mean)^2)
    RSS = sum((y - y_hat)^2)
    TSS = ESS + RSS
    R2 = summary(fit)$r.squared
    
    result_ij = data.frame(
      SITEID = sites[i],
      PARAM = params[j],
      N = nrow(subset_ij),
      ESS = ESS,
      RSS = RSS,
      TSS = TSS,
      R2 = R2
    )
    
    results = rbind(results, result_ij)
    
    residuals_ij = data.frame(
      SITEID = sites[i],
      PARAM = params[j],
      USUBJID = subset_ij$USUBJID,
      AVISIT = subset_ij$AVISIT,
      AVAL = subset_ij$AVAL,
      Y_hat = y_hat)
  }
}

ggplot(data = results, aes(x = SITEID, y = R2, colour = PARAM)) +
  geom_point(size = 2) +
  scale_color_discrete_qualitative(palette = "Dark 3") +
  labs(x = "Site Identifier", y = "Coefficient of Determination", colour = "Parameter") +
  theme_bw() +
  theme(legend.position = "bottom")

Observations

SITE03 demonstrates higher R-squared values than the other sites, indicating that outcomes are more predictable over time.

Exploring dropouts and missing data

To explore missing data, we examine the number of participants with available non-missing measurements at each visit across sites.

summary_df <- data.frame(table(ADVS$AVISIT, ADVS$SITEID))

ggplot(data = summary_df, aes(x = Var1, y = Freq)) +
  geom_line(aes(group = 1), linetype = "dashed", colour = "grey") +
  geom_point(size = 2, aes(colour = Var2)) +
  scale_color_discrete_qualitative(palette = "Dark 3") +
  facet_wrap( ~ Var2, nrow = 2) +
  labs(x = "Analysis Visit", y = "Frequency") +
  theme_bw() +
  theme(legend.position = "none",
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10),
        axis.text.x = element_text(size = 5))

Observations

SITE03 demonstrates perfect retention compared to other sites.

Exploring visit regularity

Based on the study schedule, visits are planned every four weeks. This allows us to calculate the expected study day for each visit and assess how the actual visit dates differ from the planned dates.

ADVS$AWTARGET = NA
ADVS$AWTARGET = ifelse (ADVS$VISIT == "SCREENING", 1, 
                        ifelse(ADVS$VISIT == "WEEK 4", 7*4+1, 
                               ifelse(ADVS$VISIT == "WEEK 8", 7*8+1,
                                      ifelse(ADVS$VISIT == "WEEK 12", 7*12+1,
                                             ifelse(ADVS$VISIT == "WEEK 16", 7*16+1,
                                                    ifelse(ADVS$VISIT == "WEEK 20", 7*20+1, 7*24+1))))))
ADVS$AWTDIFF = ADVS$AWTARGET - ADVS$ADY

We define visit compliance (%) as:

Compliance (%) = \(\frac{N_0}{N}\) * 100,

where \({N_0}\) is the Number of visits with AWTDIFF = 0, and \({N}\) is the total number of visits.

ADVS_unique <- ADVS %>%
  select(USUBJID, SITEID, AVISIT, AVISITN, AWTDIFF) %>%
  group_by(USUBJID, SITEID, AVISIT, AVISITN, AWTDIFF)
  
VisCompliance <- aggregate(
  AWTDIFF ~ SITEID + AVISIT + AVISITN,
  data = ADVS_unique,
  FUN = function(x)
    100 * mean(x == 0, na.rm = TRUE)
  )

The bar charts below show visit compliance for each site and visit.

ggplot(data = VisCompliance, aes(x = reorder(AVISIT, AVISITN), y = AWTDIFF)) +
  geom_bar(stat = "identity", colour = "grey2", fill = "darksalmon") +
  facet_wrap(~ SITEID) +
  theme_bw() +
  scale_x_discrete(drop = FALSE) +
  scale_y_continuous(breaks = seq(0, 150, by=25)) +
  labs(x = "Analysis Visit", y = "Visit Compliance (%)") +
  theme(strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10),
        axis.text.x = element_text(size = 6))

Observations

All visits at SITE03 occurred exactly on the planned visit date, resulting in 100% visit compliance across all visits. This was not observed at the other sites, where compliance was generally below 25%.

Conclusion

Across several indicators (including digit preference, lower variability in measurements, perfect participant retention, perfect adherence to the visit schedule), SITE03 appears different from the other sites. Although none of these findings alone proves possible data fabrication, seeing all of them at the same site suggests that SITE03 may require further investigation.

Fraud Detection Challenge

Viktoriia Fursenko

2026-06-06

The challenge

Further Information

Exploring digit preference

Observations

Exploring variability

Observations

Observations

Exploring dropouts and missing data

Observations

Exploring visit regularity

Observations

Conclusion