Fabricated data often has the following properties:
The histogram displays the distribution of observed blood pressure measurements within each site:
ggplot(data = ADVS, aes(x = AVAL, fill = PARAM)) +
geom_histogram(bins = 50, colour = "grey3") +
scale_x_continuous(breaks = seq(50, 200, by=25)) +
scale_fill_discrete_qualitative(palette = "Dark 3") +
theme_bw() +
facet_wrap( ~ SITEID) +
labs(x = "Analysis Value", y = "Frequency", fill = "Parameter") +
theme(legend.position = "bottom",
strip.background = element_rect(fill = "white"))
Let’s examine the 20 most frequently reported values across sites:
counts <- ADVS %>%
group_by(SITEID, PARAM, AVAL) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(data = counts[1:20, ], aes(x = reorder(AVAL, -count), y = count, fill = SITEID)) +
geom_bar(stat = "identity", colour = "grey2") +
guides(fill = guide_legend(nrow = 1)) +
theme_bw() +
scale_x_discrete(drop = FALSE) +
scale_fill_discrete_qualitative(palette = "Dark 3") +
scale_y_continuous(breaks = seq(0, 150, by=25)) +
labs(x = "Value", y = "Frequency", fill = "Site Identifier") +
theme(legend.position = "bottom",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 10))
SITE03 appears to favour “nice” numbers ending in 0 or 5 more frequently than the other sites. This pattern is consistent with digit preference, as seemingly the measurements are rounded or estimated rather than recorded exactly.
To explore variability, we examine how SBP and DBP values are distributed across participants at different sites.
ggplot(data = ADVS, aes(x = reorder(AVISIT, AVISITN), y = AVAL)) +
geom_line(aes(group = USUBJID, colour = USUBJID)) +
scale_colour_discrete_sequential(palette = "Red-Blue") +
facet_grid(vars(PARAM), vars(SITEID)) +
labs(x = "Analysis Visit", y = "Analysis Value") +
theme_bw() +
theme(legend.position = "none",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 10),
axis.text.x = element_text(size = 4))
Measurements from SITE03 appear more homogeneous across participants than those from the other sites. The lines are smoother and less scattered, which indicates smaller changes between visits and reduced variability over time.
To further explore variability within and between participants, we calculate the coefficient of determination (R-squared) for each site and parameter (SBP and DBP). Higher R-squared values will indicate that measurements follow a more predictable pattern over time.
results <- data.frame()
residuals <- data.frame()
sites <- unique(ADVS$SITEID)
params <- unique(ADVS$PARAM)
for (i in seq(sites))
{
for (j in seq(params))
{
subset_i <- ADVS[ADVS$SITEID %in% sites[i], ]
subset_ij <- subset_i[subset_i$PARAM %in% params[j],]
fit <- lm(AVAL ~ AVISITN, data = subset_ij)
y = subset_ij$AVAL
y_hat = fitted(fit)
y_mean = mean(y)
ESS = sum((y_hat - y_mean)^2)
RSS = sum((y - y_hat)^2)
TSS = ESS + RSS
R2 = summary(fit)$r.squared
result_ij = data.frame(
SITEID = sites[i],
PARAM = params[j],
N = nrow(subset_ij),
ESS = ESS,
RSS = RSS,
TSS = TSS,
R2 = R2
)
results = rbind(results, result_ij)
residuals_ij = data.frame(
SITEID = sites[i],
PARAM = params[j],
USUBJID = subset_ij$USUBJID,
AVISIT = subset_ij$AVISIT,
AVAL = subset_ij$AVAL,
Y_hat = y_hat)
}
}
ggplot(data = results, aes(x = SITEID, y = R2, colour = PARAM)) +
geom_point(size = 2) +
scale_color_discrete_qualitative(palette = "Dark 3") +
labs(x = "Site Identifier", y = "Coefficient of Determination", colour = "Parameter") +
theme_bw() +
theme(legend.position = "bottom")
SITE03 demonstrates higher R-squared values than the other sites, indicating that outcomes are more predictable over time.
To explore missing data, we examine the number of participants with available non-missing measurements at each visit across sites.
summary_df <- data.frame(table(ADVS$AVISIT, ADVS$SITEID))
ggplot(data = summary_df, aes(x = Var1, y = Freq)) +
geom_line(aes(group = 1), linetype = "dashed", colour = "grey") +
geom_point(size = 2, aes(colour = Var2)) +
scale_color_discrete_qualitative(palette = "Dark 3") +
facet_wrap( ~ Var2, nrow = 2) +
labs(x = "Analysis Visit", y = "Frequency") +
theme_bw() +
theme(legend.position = "none",
strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 10),
axis.text.x = element_text(size = 5))
SITE03 demonstrates perfect retention compared to other sites.
Based on the study schedule, visits are planned every four weeks. This allows us to calculate the expected study day for each visit and assess how the actual visit dates differ from the planned dates.
ADVS$AWTARGET = NA
ADVS$AWTARGET = ifelse (ADVS$VISIT == "SCREENING", 1,
ifelse(ADVS$VISIT == "WEEK 4", 7*4+1,
ifelse(ADVS$VISIT == "WEEK 8", 7*8+1,
ifelse(ADVS$VISIT == "WEEK 12", 7*12+1,
ifelse(ADVS$VISIT == "WEEK 16", 7*16+1,
ifelse(ADVS$VISIT == "WEEK 20", 7*20+1, 7*24+1))))))
ADVS$AWTDIFF = ADVS$AWTARGET - ADVS$ADY
We define visit compliance (%) as:
Compliance (%) = \(\frac{N_0}{N}\) * 100,
where \({N_0}\) is the Number of visits with AWTDIFF = 0, and \({N}\) is the total number of visits.
ADVS_unique <- ADVS %>%
select(USUBJID, SITEID, AVISIT, AVISITN, AWTDIFF) %>%
group_by(USUBJID, SITEID, AVISIT, AVISITN, AWTDIFF)
VisCompliance <- aggregate(
AWTDIFF ~ SITEID + AVISIT + AVISITN,
data = ADVS_unique,
FUN = function(x)
100 * mean(x == 0, na.rm = TRUE)
)
The bar charts below show visit compliance for each site and visit.
ggplot(data = VisCompliance, aes(x = reorder(AVISIT, AVISITN), y = AWTDIFF)) +
geom_bar(stat = "identity", colour = "grey2", fill = "darksalmon") +
facet_wrap(~ SITEID) +
theme_bw() +
scale_x_discrete(drop = FALSE) +
scale_y_continuous(breaks = seq(0, 150, by=25)) +
labs(x = "Analysis Visit", y = "Visit Compliance (%)") +
theme(strip.background = element_rect(fill = "white"),
strip.text = element_text(size = 10),
axis.text.x = element_text(size = 6))
All visits at SITE03 occurred exactly on the planned visit date, resulting in 100% visit compliance across all visits. This was not observed at the other sites, where compliance was generally below 25%.
Across several indicators (including digit preference, lower variability in measurements, perfect participant retention, perfect adherence to the visit schedule), SITE03 appears different from the other sites. Although none of these findings alone proves possible data fabrication, seeing all of them at the same site suggests that SITE03 may require further investigation.