This project developed an automated R-based pipeline for monitoring data quality and visualizing missing data patterns in a multi-center kidney transplant registry.
The pipeline evaluates data completeness across variables, patient
records, and transplant centers, and can be rerun each time the
registry dataset is updated.
Large multi-center clinical registries often suffer from incomplete data due to heterogeneous data collection practices across participating centers. Incomplete data can introduce bias in epidemiologic analyses, reduce statistical power, and compromise the reliability of survival models.
Therefore, systematic monitoring of data completeness is essential
for maintaining data quality in large collaborative registries.
The dataset represents a large multi-center kidney transplant registry designed to support epidemiologic and clinical outcome research across Asia.
Data were collected through standardized case report forms and include clinical variables across several domains:
The objectives of this project were to:
The following workflow summarizes the automated pipeline developed to detect and visualize missing data patterns.
A variable was considered missing when:
NA
Additional rule-based adjustments were applied for conditionally relevant variables to avoid misclassification of missing values.
For example, variables applicable only to living donors or deceased donors were excluded from missingness calculations when not relevant to the donor type.
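The conditional-relevance rule above can be sketched in base R. This is a minimal illustration, not the registry's actual code: the column names (`donor_type`, `donor_relationship`, `cold_ischemia_hr`), the toy values, and the choice to treat blank strings as missing alongside `NA` are all assumptions made for the example.

```r
# Hypothetical toy data; column names and values are illustrative only.
records <- data.frame(
  donor_type         = c("living", "deceased", "living", "deceased", "living"),
  donor_relationship = c("sibling", NA, NA, NA, "parent"),  # applies to living donors only
  cold_ischemia_hr   = c(NA, 12.5, NA, NA, NA),             # applies to deceased donors only
  stringsAsFactors   = FALSE
)

# Assumption: a cell counts as missing when it is NA or a blank string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Eligibility mask: TRUE where the variable applies to the record's donor type.
eligible <- data.frame(
  donor_relationship = records$donor_type == "living",
  cold_ischemia_hr   = records$donor_type == "deceased"
)

vars <- names(eligible)

# Logical matrix of missing flags (rows = records, columns = variables).
missing_flag <- sapply(records[vars], is_blank)

# Cells that are not applicable are set to NA so they drop out of both
# the numerator and the denominator.
missing_flag[!as.matrix(eligible)] <- NA

naive_rate    <- colMeans(sapply(records[vars], is_blank))
adjusted_rate <- colSums(missing_flag, na.rm = TRUE) / colSums(!is.na(missing_flag))
```

In this toy data, `naive_rate` counts every blank as missing, while `adjusted_rate` only counts blanks among records for which the variable is relevant, so the two differ noticeably.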
To ensure a meaningful assessment of data quality, the analysis focused on variables commonly used in clinical practice and considered essential for kidney transplant research.
Missing data were evaluated at three levels: record, variable, and center.
Record-level missingness was assessed by examining how missing values were distributed across individual subjects and variables. Each cell (one variable for one record) was classified as missing or present, and the resulting matrix was visualized as a heatmap to reveal patterns of incomplete data across records.
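A record-level heatmap of this kind might be built with `ggplot2` roughly as follows. The data, variable names, and section mapping are hypothetical, and using `facet_wrap()` for the section panels is an assumption rather than a description of how the pipeline assembles its figures.

```r
library(ggplot2)

# Hypothetical records; variable names are illustrative only.
records <- data.frame(
  record_id     = 1:4,
  recipient_age = c(45, NA, 60, 38),
  donor_age     = c(NA, NA, 52, 41),
  hla_mismatch  = c(2, 3, NA, NA)
)

# Hypothetical mapping from variable to clinical section.
section_of <- c(recipient_age = "Recipient",
                donor_age     = "Donor",
                hla_mismatch  = "Immunology")

vars <- names(section_of)

# One row per cell (record x variable). expand.grid() varies record_id
# fastest, which matches the column-major order of as.vector() below.
cells <- expand.grid(record_id = records$record_id,
                     variable  = vars,
                     stringsAsFactors = FALSE)
cells$missing <- as.vector(is.na(as.matrix(records[vars])))
cells$status  <- factor(ifelse(cells$missing, "Missing", "Present"),
                        levels = c("Present", "Missing"))
cells$section <- unname(section_of[cells$variable])

# Records on the y-axis, variables on the x-axis, one panel per section.
heatmap_plot <- ggplot(cells, aes(x = variable, y = factor(record_id), fill = status)) +
  geom_tile(color = "white") +
  scale_fill_manual(values = c(Present = "grey85", Missing = "firebrick"), name = NULL) +
  facet_wrap(~ section, scales = "free_x") +
  labs(x = "Variable", y = "Record") +
  theme_minimal()
```

Printing `heatmap_plot` draws the figure; rows that are mostly red correspond to records that would need follow-up.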
Variable-level missing rates were defined as:
\[ \text{Missing rate} = \frac{\text{Number of missing values}}{\text{Total eligible observations}} \]
Missing rates were calculated for each variable and expressed as percentages within each clinical section. Variables were sorted in descending order of missing rate so that the least complete fields are easy to identify, and the results were visualized using bar charts.
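As a sketch of the variable-level calculation, the snippet below computes the missing rate per variable with `dplyr` and `tidyr`, sorts the result, and draws a bar chart. The toy data and the `sections` lookup table are hypothetical, and every record is assumed to be eligible for every variable to keep the example short.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical records; all variables are assumed applicable to all records.
records <- tibble(
  record_id     = 1:5,
  recipient_age = c(45, 51, NA, 38, 62),
  donor_age     = c(NA, NA, 52, 41, NA),
  hla_mismatch  = c(2, NA, NA, 1, 0)
)

# Hypothetical lookup from variable to clinical section.
sections <- tibble(
  variable = c("recipient_age", "donor_age", "hla_mismatch"),
  section  = c("Recipient", "Donor", "Immunology")
)

variable_rates <- records %>%
  pivot_longer(-record_id, names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(n_eligible = n(),
            n_missing  = sum(is.na(value)),
            .groups    = "drop") %>%
  mutate(missing_pct = 100 * n_missing / n_eligible) %>%  # missing / eligible, as a percentage
  left_join(sections, by = "variable") %>%
  arrange(desc(missing_pct))                              # highest missing rate first

# Horizontal bars, one panel per clinical section.
variable_bar <- ggplot(variable_rates,
                       aes(x = missing_pct, y = reorder(variable, missing_pct))) +
  geom_col() +
  facet_wrap(~ section, ncol = 1, scales = "free_y") +
  labs(x = "Missing (%)", y = NULL)
```

When conditionally relevant variables are involved, `n_eligible` would count only the applicable records instead of all rows.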
Center-level missingness was evaluated by aggregating missing data across variables within each clinical section for each center. Missing rates were calculated as the proportion of missing values among all eligible observations and expressed as percentages. Centers were ordered based on their overall missingness to highlight relative differences in data quality, and results were visualized using heatmaps with annotated percentage values.
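The center-level aggregation can be illustrated with a small example. It assumes the data have already been reduced to one row per eligible cell with a logical `missing` flag; the `cell_flags` table, its column names, and the color scale are hypothetical choices made for this sketch.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format flags: one row per eligible cell.
cell_flags <- tibble(
  center  = rep(c("Center A", "Center B"), each = 6),
  section = rep(rep(c("Recipient", "Immunology"), each = 3), times = 2),
  missing = c(FALSE, FALSE, TRUE,  TRUE, TRUE,  FALSE,   # Center A
              FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)   # Center B
)

# Pooled missing rate per center and section: missing cells / eligible cells.
center_rates <- cell_flags %>%
  group_by(center, section) %>%
  summarise(missing_pct = 100 * mean(missing), .groups = "drop")

# Order centers by their overall missing rate, highest first.
center_order <- cell_flags %>%
  group_by(center) %>%
  summarise(overall_pct = 100 * mean(missing), .groups = "drop") %>%
  arrange(desc(overall_pct)) %>%
  pull(center)

center_rates <- center_rates %>%
  mutate(center = factor(center, levels = center_order))

# Heatmap with the percentage printed in each tile.
center_heatmap <- ggplot(center_rates, aes(x = section, y = center, fill = missing_pct)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.1f%%", missing_pct))) +
  scale_fill_gradient(low = "white", high = "firebrick",
                      limits = c(0, 100), name = "Missing (%)") +
  labs(x = "Clinical section", y = "Center") +
  theme_minimal()
```

Because the rate is pooled over all eligible cells in a section, variables with more eligible records carry more weight than they would in a simple average of variable-level rates.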
Missingness was evaluated at three levels: record, variable, and center.
Overall, missing data patterns varied across variables, clinical sections, and transplant centers. Some clinical sections showed consistently higher missing rates, suggesting potential differences in data collection practices across centers.
The heatmap visualizes missingness across individual patient records (y-axis) and variables (x-axis). Data from one transplant center are presented as an example, with panels corresponding to different clinical sections.
This visualization allows rapid identification of records with incomplete data and supports targeted follow-up with centers for data completion.
Figure 1. Record-level missingness heatmap
Variable-level missing rates were examined to identify variables contributing most to overall data incompleteness. Results from another transplant center are shown as an example across clinical sections.
Several variables showed higher levels of missing data, particularly in donor and immunology-related fields.
Figure 2. Variable-level missingness rates
Missing rates were summarized by transplant center and clinical section to evaluate differences in data completeness across centers. Centers are displayed in anonymized form (Center A–H).
Higher missing rates were observed in immunology and induction-related variables in several centers, whereas recipient variables showed relatively lower levels of missing data.
Figure 3. Center-level missingness rates
This framework provides a scalable approach for monitoring data completeness in multi-center clinical registries.
The analysis enables researchers to:
Such monitoring is essential for improving the reliability of epidemiologic analyses derived from registry data and supporting high-quality multi-center research.
All analyses were conducted using R.
Key libraries used in the analysis included:
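# Key libraries
library(dplyr)      # data manipulation
library(tidyr)      # reshaping between wide and long formats
library(ggplot2)    # bar charts and heatmaps
library(patchwork)  # combining multiple plots into one figure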
This framework focuses on identifying patterns of missing data but does not fully account for differences in data collection systems and coding practices across countries and centers. In multi-center registries, distinctions between data states such as “No”, “Unknown”, “Not Applicable”, and missing (left blank) are not always consistently defined or recorded. This ambiguity may lead to misclassification of missingness and affect the interpretation and comparability of data completeness across centers.
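A small base R example shows how much the choice of coding convention can matter. The response values and the two rules compared here are hypothetical; they are not the definitions used by any particular center.

```r
# Hypothetical responses from one field; "" stands for a field left blank.
responses <- c("Yes", "No", "Unknown", "", "Not Applicable", "No", NA, "Unknown")

blank          <- is.na(responses) | responses == ""
unknown        <- responses %in% "Unknown"
not_applicable <- responses %in% "Not Applicable"

# Rule 1 (assumed): only blank entries count as missing; every record is eligible.
strict_pct <- 100 * mean(blank)

# Rule 2 (assumed): blank or "Unknown" counts as missing, and
# "Not Applicable" records are removed from the denominator.
broad_pct <- 100 * sum(blank | unknown) / sum(!not_applicable)
```

The same eight entries give a missing rate of 25% under the first rule and about 57% under the second, so two centers applying different conventions would not be directly comparable.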
In addition, variations in data definitions and terminology across centers may result in differences in how similar clinical concepts are recorded, further limiting comparability.
Furthermore, the analysis is based on a selected set of clinically relevant variables commonly used in kidney transplant research. While this approach ensures practical relevance, the definition of important variables may vary depending on specific research objectives, and therefore the results may not generalize to all variables in the registry.
Finally, center-level summaries are based on aggregated missingness across variables within each clinical section. While this approach facilitates comparison across centers, it may obscure variable-specific patterns and mask extreme missingness in individual variables.
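The masking effect can be seen in a short worked example with invented numbers: a section of ten variables in which one variable is almost entirely missing and the others are nearly complete.

```r
# Hypothetical section: ten variables, 200 eligible records each.
n_records <- 200
n_missing <- c(180, rep(2, 9))  # one variable 90% missing, nine variables 1% missing

# Variable-level missing rates (%).
variable_pct <- 100 * n_missing / n_records

# Section-level rate: all missing cells over all eligible cells.
section_pct <- 100 * sum(n_missing) / (n_records * length(n_missing))
```

Here the section-level rate is 9.9%, which looks unremarkable even though one variable is 90% missing, so center-level summaries are best read alongside the variable-level results.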
Future work will focus on extending the current framework to better account for heterogeneity in data collection practices across countries and centers. This includes developing standardized definitions and data harmonization strategies to improve comparability of missingness across sites.
A further extension of this work would involve the development of an integrated system or platform that enables real-time monitoring of data entry and missingness rates. Such a system could provide immediate feedback during data entry, allowing users to identify and address missing or incomplete fields at the point of input.
Integration with data query systems will further enable systematic identification and correction of missing or incomplete entries.