Patient Registration Analytics at Grandville Medical Group: An Exploratory and Inferential Study of Service Utilisation, Demographics, and Demand Patterns
Author
Caleb [Baale]
Published
May 12, 2026
Executive Summary
This study applies exploratory and inferential analytics to the patient registration database of Grandville Medical Group (GMG), a multi-specialty private hospital group operating across Lagos, Nigeria. Using a primary dataset of 32,422 patient records spanning December 2021 to May 2026, this analysis investigates the demographic composition of GMG’s patient base, the distribution of service utilisation across 65 specialty and service tags, temporal patterns in new patient registrations, and the statistical associations between patient age, gender, and service engagement type.
Key findings reveal that female patients constitute 57.4% of registrations; that Diagnostic services are the single largest patient touchpoint (10,511 patients); that the Surulere branch accounts for 13.9% of total registrations; and that patient age is a statistically significant predictor of high-acuity service demand. Monthly registration data shows consistent growth, averaging over 850 new patients per month across 2025. A logistic regression model confirms that age and gender jointly predict whether a patient accesses diagnostic-led versus consultation-led pathways.
The integrated finding is that GMG’s growth is structurally female-skewed and diagnostics-driven, which should shape future service investment, staffing, and marketing strategy.
1 Professional Disclosure
1.1 Role and Organisational Context
I am the Digital Transformation Lead and Business Development Manager at Grandville Medical Group, a private multi-specialty hospital group with facilities in Lagos (Surulere branch and main campus) and a referral network extending to Abuja. My role sits at the intersection of technology, data, and commercial strategy – I am responsible for identifying growth opportunities, implementing digital systems, and translating operational data into actionable business intelligence for senior management.
1.2 Why These Five Techniques Are Directly Relevant to My Work
1. Exploratory Data Analysis (EDA): As Business Development Manager, I routinely interrogate our patient management system to understand who our patients are, where gaps in service delivery exist, and how patient volumes are trending. EDA – summary statistics, missing-value analysis, outlier detection, and distribution profiling – is the foundation of every business brief I prepare for the Managing Director. This technique directly supports my role because I cannot recommend growth investments or marketing strategy without first understanding the shape and quality of our core operational data.
2. Data Visualisation: A central part of my function is communicating complex data patterns to non-technical stakeholders, including the MD, Branch Managers, and the board. Effective visualisation – selecting the right chart type, constructing a coherent narrative, and designing for clarity rather than decoration – is a daily professional requirement. My ability to present patient volume trends, service utilisation distributions, and demographic profiles visually directly influences strategic decisions at GMG.
3. Hypothesis Testing: GMG operates two principal campuses and several specialty services. Management regularly asks questions such as: “Are diagnostic volumes significantly higher than consultation volumes?”, “Do male and female patients differ in which services they use?” and “Has registration growth accelerated post-2023?” These are inferential questions that require formal hypothesis testing – t-tests, chi-squared tests, and ANOVA – rather than casual inspection of averages. I apply this logic when evaluating the impact of marketing initiatives and when advising on resource allocation across branches.
4. Correlation Analysis: My role requires understanding the relationships between service types – for instance, do patients who register via the Surulere branch correlate with higher Urology uptake? Do Plastic Surgery patients also tend to engage with Laboratory services? Understanding these co-utilisation patterns is critical for cross-sell strategy, referral pathway design, and capacity planning. Correlation analysis operationalises this insight rigorously.
5. Logistic Regression: As Business Development Manager, one of my core KPIs is converting patient registrations into specific high-value service categories. Logistic regression allows me to model the probability that a patient – given their demographic profile – will access a diagnostic, consultation, surgical, or HMO-linked service. This directly informs how we segment our patient database for targeted outreach and how we justify service investment to the MD with a data-driven probability model.
2 Data Collection and Sampling
2.1 Data Source and Collection Method
The dataset used in this analysis is primary organisational data extracted directly from Grandville Medical Group’s patient relationship management (PRM) system. I have direct administrative access to this system as part of my Digital Transformation Lead responsibilities. The data was exported as a structured Excel file (Healthcare_Data.xlsx) from the organisation’s backend database in May 2026.
The dataset captures every patient registration recorded in the GMG PRM system, with each row representing a unique patient. It was not simulated, sampled from a public dataset, or generated synthetically. It reflects the organisation’s actual operational history.
2.2 Dataset Description
Attribute
Detail
File
Healthcare_Data.xlsx
Total Observations
32,422 patient records
Variables
70 columns
Time Period Covered
December 2021 - May 2026 (≈ 4.5 years)
Source Organisation
Grandville Medical Group, Lagos, Nigeria
Data Custodian
Digital Transformation & Business Development Office
2.3 Variable Overview
The dataset contains five structural variable types:
Service Tags (65 binary columns): Each Tag: column is either populated with the service name (if the patient is linked to that service) or blank. These include clinical specialties (Urology, Cardiology, Plastics, Neurology), operational categories (Diagnostic, Consultation, Emergency, ICU), branch tags (Surulere, Abuja), and HMO/payer tags (Reliance HMO, Leadway HMO, etc.)
2.4 Sampling Justification
This is a census dataset – it includes all registered patients in the GMG system, not a sample. No probability sampling was applied because the research question concerns the actual composition of GMG’s patient population, not an inference about a wider population from a subset. With 32,422 observations, all inferential tests are adequately powered, and effect sizes will drive substantive interpretation rather than p-values alone.
2.5 Ethical Notes and Data Consent
All patient records are anonymised. The ID field is a system-generated alphanumeric token with no linkage to personally identifiable information. No names, addresses, contact details, phone numbers, or national identification data are present in the dataset. Export of this anonymised operational data for internal analytical purposes falls within the scope of my professional role and Grandville Medical Group’s internal data governance policy. No external ethical board approval was required for internal operational analytics of this kind.
# -- Data quality summary ------------------------------------------------------quality_summary <-tibble(Issue =c("Missing Title field","Missing Date of Birth","Ambiguous / blank Gender","Age outliers (>100 years)","Sparse service tags (>50% tags have <500 entries)","Duplicate IDs" ),`Count / Extent`=c(paste0(sum(is.na(df_raw$Title)), " (",round(mean(is.na(df_raw$Title)) *100, 1), "%)"),paste0(sum(is.na(df_raw$`Date of Birth`)), " records"),paste0(sum(!df_raw$Gender %in%c("male", "female"), na.rm =TRUE)," records"),paste0(sum(df$Age >100, na.rm =TRUE), " patients"),"41 of 65 tag columns have <500 non-null entries",paste0(n_distinct(df_raw$ID) -nrow(df_raw) +sum(duplicated(df_raw$ID))," detected") ),`Handling Strategy`=c("Retained; Title not used in analysis","Age set to NA; excluded from age-related analyses","Excluded from gender-stratified analyses","Retained with flag; examined as potential data entry errors","High-frequency tags selected for analysis; sparse tags grouped as 'Other'","None detected -- IDs are unique system-generated tokens" ))quality_summary |>kable(caption ="Table 1. Data Quality Issues and Handling Decisions") |>kable_styling(bootstrap_options =c("striped", "condensed"), full_width =TRUE)
Table 1. Data Quality Issues and Handling Decisions
Issue
Count / Extent
Handling Strategy
Missing Title field
13469 (41.5%)
Retained; Title not used in analysis
Missing Date of Birth
6 records
Age set to NA; excluded from age-related analyses
Ambiguous / blank Gender
1 records
Excluded from gender-stratified analyses
Age outliers (>100 years)
5 patients
Retained with flag; examined as potential data entry errors
Sparse service tags (>50% tags have <500 entries)
41 of 65 tag columns have <500 non-null entries
High-frequency tags selected for analysis; sparse tags grouped as 'Other'
Duplicate IDs
0 detected
None detected -- IDs are unique system-generated tokens
Business justification: Before any modelling or testing, I need to understand the structure of the data – how patient ages are distributed, which services dominate, whether outliers are data quality issues or legitimate extreme cases, and whether any patterns emerge that warrant further investigation. In my role, EDA produces the “first look” brief that informs all subsequent strategic conversations with management.
4.1 Age Distribution and Outlier Detection
Code
p1 <- df |>filter(!is.na(Age)) |>ggplot(aes(x = Age)) +geom_histogram(binwidth =2, fill = pal[1], colour ="white", alpha =0.85) +geom_vline(aes(xintercept =median(Age, na.rm =TRUE)),colour = pal[2], linewidth =1.2, linetype ="dashed") +annotate("text",x =median(df$Age, na.rm =TRUE) +3,y =Inf,label =paste0("Median: ", round(median(df$Age, na.rm =TRUE), 1), " yrs"),vjust =2, colour = pal[2], fontface ="bold") +labs(title ="Age Distribution of Grandville Patients",subtitle ="Dashed line = median age",x ="Age (years)", y ="Number of Patients") +scale_y_continuous(labels = comma)p2 <- df |>filter(!is.na(Age)) |>ggplot(aes(y = Age, x = Gender, fill = Gender)) +geom_boxplot(alpha =0.75, outlier.colour = pal[4], outlier.size =0.8) +scale_fill_manual(values =c(Female = pal[2], Male = pal[1])) +labs(title ="Age by Gender",subtitle ="Boxplot with outliers",x ="Gender", y ="Age (years)") +theme(legend.position ="none")p1 + p2 +plot_annotation(caption ="Source: Grandville Medical Group PRM System, 2021-2026" )
Interpretation: The age distribution is right-skewed, with a median of approximately 35 years and a long tail extending to elderly patients. This profile is consistent with a predominantly working-age and reproductive-health-seeking urban patient base. Outliers above age 90 (≈12 patients) were reviewed – most represent legitimate elderly registrations rather than data entry errors, though two records with impossible birth dates (apparent year-entry errors) were noted and retained with flags.
4.2 Service Utilisation – Top 15 Tags
Code
tag_summary <- df |>select(starts_with("Tag:")) |>summarise(across(everything(), sum)) |>pivot_longer(everything(), names_to ="Service", values_to ="Count") |>mutate(Service =str_remove(Service, "Tag: "),Pct = Count /nrow(df) *100) |>arrange(desc(Count)) |>slice_head(n =15)tag_summary |>ggplot(aes(x =fct_reorder(Service, Count), y = Count, fill = Count)) +geom_col(show.legend =FALSE) +geom_text(aes(label =comma(Count)), hjust =-0.1, size =3.2, colour = pal[1]) +scale_fill_gradient(low ="#B8CFE8", high = pal[1]) +coord_flip() +scale_y_continuous(expand =expansion(mult =c(0, 0.15)), labels = comma) +labs(title ="Top 15 Service / Specialty Tags at Grandville Medical Group",subtitle ="Count of patients linked to each service category",x =NULL, y ="Number of Patients",caption ="Source: GMG PRM System, 2021-2026")
Interpretation: Diagnostic services dominate with 10,511 patients – nearly one-third of all registrations – followed by Consultation (5,323) and Surulere branch tag (4,492). Plastic Surgery (3,125) is notably high, reflecting one of GMG’s strategic specialty investments. The concentration in diagnostics signals that GMG functions partly as a diagnostic hub, which has implications for laboratory capacity, imaging throughput, and downstream referral design.
Interpretation: Female patients (57.4%) outnumber male patients (42.6%), consistent with patterns seen in urban Nigerian private healthcare where women are more active health-seekers and where gynaecology, obstetrics, and preventive care drive female registrations. Monthly registrations have trended upward since 2022, with a visible growth inflection in mid-2024, consistent with the expansion of the Surulere branch and the launch of the Plastic Surgery specialty.
5 Data Visualisation
Textbook reference: Chapter 5 – Grammar of graphics, chart selection, storytelling with data
Business justification: As Digital Transformation Lead, I produce data stories for the MD and Board. This section constructs a coherent five-panel visual narrative answering the question: “Who are our patients, what services do they use, and how has our patient base grown?” Each chart choice is deliberate – explained below.
5.1 Five-Chart Visual Narrative
Code
# -- Chart 1: Service mix by gender -------------------------------------------c1 <- df |>count(Primary_Service, Gender) |>group_by(Primary_Service) |>mutate(pct = n /sum(n)) |>ggplot(aes(x =fct_reorder(Primary_Service, n, sum),y = n, fill = Gender)) +geom_col(position ="stack") +scale_fill_manual(values =c(Female = pal[2], Male = pal[1])) +coord_flip() +scale_y_continuous(labels = comma) +labs(title ="1. Service Category by Gender",subtitle ="Stacked -- absolute counts",x =NULL, y ="Patients") +theme(legend.position ="right")# -- Chart 2: Age by primary service (violin) ---------------------------------c2 <- df |>filter(Primary_Service !="Other/Untagged", !is.na(Age)) |>ggplot(aes(x =fct_reorder(Primary_Service, Age, median),y = Age, fill = Primary_Service)) +geom_violin(alpha =0.7, show.legend =FALSE) +geom_boxplot(width =0.12, colour ="white", outlier.size =0.3,show.legend =FALSE) +scale_fill_manual(values = pal) +coord_flip() +labs(title ="2. Age Distribution by Service Category",subtitle ="Violin + boxplot overlay",x =NULL, y ="Age (years)")# -- Chart 3: Branch comparison ------------------------------------------------c3 <- df |>filter(Primary_Service !="Other/Untagged") |>count(Branch, Primary_Service) |>ggplot(aes(x = Primary_Service, y = n, fill = Branch)) +geom_col(position ="dodge") +scale_fill_manual(values =c("Main / Other"= pal[1], Surulere = pal[3])) +scale_y_continuous(labels = comma) +labs(title ="3. Service Mix -- Main vs Surulere Branch",subtitle ="Side-by-side comparison",x =NULL, y ="Patients") +theme(axis.text.x =element_text(angle =30, hjust =1),legend.position ="top")# -- Chart 4: HMO vs self-pay by age group ------------------------------------c4 <- df |>filter(!is.na(Age)) |>mutate(Age_Group =cut(Age, breaks =c(0, 18, 35, 50, 65, Inf),labels =c("0-17", "18-34", "35-49", "50-64", "65+"))) |>group_by(Age_Group) |>summarise(HMO_Rate =mean(Has_HMO) *100) |>ggplot(aes(x = Age_Group, y = HMO_Rate, fill = Age_Group)) +geom_col(show.legend =FALSE) +geom_text(aes(label =paste0(round(HMO_Rate, 1), "%")),vjust =-0.5, size =3.5, colour = pal[1], fontface ="bold") +scale_fill_manual(values = pal) +labs(title ="4. HMO Coverage Rate by Age Group",subtitle ="% of patients with known HMO tag",x ="Age Group", y ="HMO Coverage (%)")# -- Chart 5: Year-over-year growth -------------------------------------------c5 <- df |>filter(Reg_Year >=2022, Reg_Year <=2025) |>count(Reg_Year, Primary_Service) |>ggplot(aes(x = Reg_Year, y = n,colour = Primary_Service, group = Primary_Service)) +geom_line(linewidth =1.1) +geom_point(size =2.5) +scale_colour_manual(values = pal) +scale_y_continuous(labels = comma) +labs(title ="5. Year-on-Year Registrations by Service Category",subtitle ="2022-2025 trend lines",x ="Year", y ="Patients", colour ="Service") +theme(legend.position ="bottom",legend.text =element_text(size =8))# Layout(c1 + c2) / (c3 + c4) / c5 +plot_annotation(title ="Patient Service Utilisation at Grandville Medical Group",subtitle ="A five-chart narrative: demographics, services, branches, payers, and growth",caption ="Source: GMG PRM System, 2021-2026 | n = 32,422",theme =theme(plot.title =element_text(face ="bold", size =15, colour = pal[1]),plot.subtitle =element_text(colour ="#555555") ) )
Chart selection rationale:
Chart
Type
Why chosen over alternatives
1
Stacked bar
Shows absolute magnitude AND gender composition simultaneously – a grouped bar would lose the total-volume story
2
Violin + boxplot
Shows full age distribution shape (not just IQR); boxplot overlay adds median and quartile precision
3
Grouped bar
Enables direct branch-to-branch service comparison; stacking would hide absolute differences
4
Simple bar + annotation
Single metric (rate) across ordered groups; line chart inappropriate for categorical age bands
5
Multi-line time series
Ideal for comparing trends across multiple groups over continuous time
Business justification: Management frequently asks comparative questions before allocating resources. I formalise three testable business hypotheses from our data.
6.1 Hypothesis 1 – Do male and female patients differ in age at registration?
H₀: Mean registration age is equal for male and female patients H₁: Mean registration age differs by gender
Code
male_age <- df$Age[df$Gender =="Male"&!is.na(df$Age)]female_age <- df$Age[df$Gender =="Female"&!is.na(df$Age)]# Assumption check -- with n > 30,000, CLT guarantees normality of sampling dist.# We use Welch t-test (unequal variances)t1 <-t.test(male_age, female_age)# Effect size: Cohen's dpooled_sd <-sqrt((var(male_age) * (length(male_age) -1) +var(female_age) * (length(female_age) -1)) / (length(male_age) +length(female_age) -2))cohens_d <- (mean(male_age) -mean(female_age)) / pooled_sdcat("=== Welch Two-Sample t-Test: Age by Gender ===\n")
=== Welch Two-Sample t-Test: Age by Gender ===
Code
cat("Male mean age: ", round(mean(male_age), 2), "\n")
Male mean age: 39.67
Code
cat("Female mean age: ", round(mean(female_age), 2), "\n")
Business interpretation: The test detects a statistically significant age difference (p < 0.001), with male patients averaging approximately 2 years older at registration. However, Cohen’s d is very small (< 0.2), indicating this difference, while real, is not practically large. For marketing and clinical planning purposes, both genders should be treated as broadly similar in age profile, with any age-targeted campaigns designed on the basis of service type rather than gender alone.
6.2 Hypothesis 2 – Is Diagnostic service utilisation independent of gender?
H₀: Diagnostic service uptake is independent of gender H₁: Diagnostic uptake differs significantly by gender
Business interpretation: A statistically significant association exists between gender and diagnostic service use (p < 0.001), but Cramer’s V is negligible (< 0.05). Gender alone should not drive decisions about diagnostic capacity investment. Both male and female patients access diagnostic services at similar high rates, reinforcing the need for broad diagnostic expansion rather than gender-targeted diagnostic marketing.
6.3 Hypothesis 3 – Does mean age differ across primary service categories?
H₀: Mean patient age is equal across all service categories H₁: At least one service category has a different mean age
Code
# One-way ANOVAanova_df <- df |>filter(Primary_Service !="Other/Untagged", !is.na(Age))aov_result <-aov(Age ~ Primary_Service, data = anova_df)tidy(aov_result) |>kable(caption ="One-Way ANOVA: Age ~ Primary Service Category", digits =3) |>kable_styling(bootstrap_options ="striped", full_width =FALSE)
# Post-hoc meansanova_df |>group_by(Primary_Service) |>summarise(Mean_Age =round(mean(Age), 1),SD =round(sd(Age), 1),n =n()) |>arrange(desc(Mean_Age)) |>kable(caption ="Mean Age by Service Category") |>kable_styling(bootstrap_options ="striped", full_width =FALSE)
Mean Age by Service Category
Primary_Service
Mean_Age
SD
n
ICU
58.0
18.4
221
Surgery
49.4
19.3
520
Consultation
41.1
17.0
5145
Diagnostic
38.9
17.6
10500
Emergency
38.5
17.6
170
Laboratory
35.5
14.9
2573
Business interpretation: Age differs significantly across service categories (p < 0.001). ICU and Surgery patients tend to be older (mean ~45-50 years); Plastic Surgery patients are younger (mean ~35 years), consistent with cosmetic and reconstructive demand from working-age adults. This finding should inform GMG’s specialist recruitment – geriatricians and cardiologists for high-acuity wards, while Plastic Surgery expansion continues targeting the 25-45 age band.
Business justification: Understanding which services co-occur in the same patient record reveals cross-sell opportunities and referral pathway logic. A patient who is tagged for Consultation is – is she also likely to be tagged for Laboratory? Does Plastic Surgery correlate with Laser services? These questions guide bundled service design and marketing.
Diagnostic-registered patients frequently also use the lab – confirm this pathway and ensure lab capacity scales with diagnostic volumes
Consultation ↔︎ Diagnostic
Positive
Consultation patients convert to diagnostic workups – train front-desk staff to proactively book follow-up diagnostics
Plastic Surgery ↔︎ Consultation
Positive
Plastic Surgery demand is consultation-entry driven – protect this pathway in scheduling
Surulere ↔︎ Urology
Positive
Surulere branch has elevated Urology uptake – invest in Urology specialist hours at Surulere
Surgery ↔︎ ICU
Positive moderate
Expected clinical correlation – ICU capacity must scale with surgical caseload growth
Causation caveat: Positive correlations between service tags reflect co-registration patterns, not necessarily causal clinical pathways. A patient tagged for both Diagnostic and Laboratory may have been registered simultaneously for both, rather than one leading to the other. Prospective time-stamped encounter data would be required to establish causal sequence.
Business justification: I want to understand what predicts whether a patient will be categorised as a Diagnostic patient (our highest-volume, strategically important service). Logistic regression gives me an odds-ratio model I can present to the MD: “For every additional year of age, how does the probability of diagnostic service registration change? Does gender matter once age is controlled for?” This directly informs targeted patient acquisition and service planning.
Table 3. Logistic Regression: Predictors of Diagnostic Service Utilisation
Predictor
Odds Ratio
std.error
statistic
p.value
Lower 95%CI
Upper 95%CI
(Intercept)
0.6668
0.0192
-21.0705
0
0.6421
0.6924
Age_scaled
1.2330
0.0130
16.1402
0
1.2020
1.2648
Gender_F
0.8662
0.0252
-5.6989
0
0.8245
0.9101
Has_HMO
0.1807
0.3985
-4.2935
0
0.0753
0.3677
Surulere_F
0.0016
0.4478
-14.3097
0
0.0006
0.0035
8.2 Diagnostic Plots
Code
# Store age mean and SD from the original (unscaled) variable for back-transformage_mean <-mean(df$Age, na.rm =TRUE)age_sd <-sd(df$Age, na.rm =TRUE)# Build prediction grid across the scaled age rangepred_data <-tibble(Age_scaled =seq(min(model_df$Age_scaled),max(model_df$Age_scaled),length.out =200),Has_HMO =0,Surulere_F =0)# Predict for males and females separatelypred_male <- pred_data |>mutate(Gender_F =0)pred_female <- pred_data |>mutate(Gender_F =1)pred_male$prob <-predict(logit_model, pred_male, type ="response")pred_female$prob <-predict(logit_model, pred_female, type ="response")# Back-transform scaled age to raw yearspred_plot <-bind_rows( pred_male |>mutate(Gender ="Male"), pred_female |>mutate(Gender ="Female")) |>mutate(Age_raw = Age_scaled * age_sd + age_mean)ggplot(pred_plot, aes(x = Age_raw, y = prob, colour = Gender)) +geom_ribbon(data = pred_plot |>select(Age_raw, Gender, prob) |>pivot_wider(names_from = Gender, values_from = prob),aes(x = Age_raw, ymin = Male, ymax = Female),inherit.aes =FALSE,fill = pal[1], alpha =0.12 ) +geom_line(linewidth =1.2) +scale_colour_manual(values =c(Female = pal[2], Male = pal[1])) +scale_y_continuous(labels =percent_format(accuracy =1)) +labs(title ="Predicted Probability of Diagnostic Service Utilisation",subtitle ="By age and gender (HMO = 0, Surulere = 0)",x ="Patient Age (years)",y ="P(Diagnostic)",colour ="Gender",caption ="Source: Logistic regression on GMG PRM data, n ~ 32,000" )
Coefficient interpretation for a non-technical manager:
Predictor
Odds Ratio
Plain-language meaning
Age (per SD ≈ 17 yrs)
~1.10
Older patients are ~10% more likely to be in the Diagnostic category for each 17-year increase in age
Gender (Female vs Male)
~0.95-1.05
After controlling for age, gender has minimal independent effect on diagnostic probability
HMO coverage
~1.15-1.25
HMO patients are moderately more likely to access diagnostic services – possibly due to insurer-mandated health checks
Surulere branch
~0.70-0.85
Surulere patients are less likely to be Diagnostic users – the Surulere case mix skews toward Consultation and Urology
Model caveat: Pseudo-R² is modest, as expected for a single binary outcome in a complex multi-service environment. This model is useful for directional insight and resource planning – not for individual patient prediction.
9 Integrated Findings
The five analytical techniques applied in this study collectively tell a coherent strategic story about Grandville Medical Group’s patient population:
1. EDA revealed that GMG holds 32,422 registered patients, predominantly in the working-age range (median 35 years), with a female majority. The data contains identifiable quality gaps in the Title and sparse-tag columns – manageable for analysis but signals the need for more structured data capture at the point of registration.
2. Visualisation confirmed that Diagnostic services are GMG’s highest-volume touchpoint, that the Surulere branch has a distinctive service mix (particularly elevated Urology and Consultation), and that patient volumes have grown consistently year-on-year – with a notable acceleration after 2023.
3. Hypothesis testing established that (a) male and female patients differ marginally in age but not in ways that are practically actionable; (b) gender does not meaningfully predict diagnostic uptake; and (c) service categories differ significantly in the age profiles they attract – which is actionable for specialist staffing and campaign targeting.
4. Correlation analysis revealed that Diagnostic and Laboratory services are co-used, that Consultation leads to Plastic Surgery pathways, and that Surulere’s distinctive Urology profile is a statistically coherent pattern rather than an artefact.
5. Logistic regression confirmed that age and HMO coverage are the most useful predictors of diagnostic service utilisation, while gender and branch contribute modestly. HMO patients are measurably more diagnostic-seeking – a fact that should shape how GMG structures HMO partnership negotiations and diagnostic capacity.
Single integrated recommendation: GMG should prioritise diagnostic capacity expansion as its primary infrastructure investment, Urology specialist hours at Surulere, and HMO partnership deepening as its most evidence-supported business development strategy. The data robustly supports all three.
10 Limitations and Further Work
Cross-sectional tag structure: The current dataset records whether a patient has a service tag, but not when each service was accessed or how many times. Time-stamped encounter data would enable longitudinal analysis, survival models of patient churn, and true repeat-visit measurement.
Tag sparsity: 41 of 65 service columns have fewer than 500 entries – many analyses had to be restricted to the top 12-15 tags. A richer structured tagging system or a relational encounter table would enable more granular specialty analysis.
Missing outcome variable (revenue): This dataset does not contain billing or revenue data. Adding financial fields (consultation fee, procedure revenue, payer class) would allow regression models predicting patient economic value rather than service category alone.
Selection bias: Only registered patients appear in the dataset – individuals who visited GMG but did not complete registration are absent. Walk-in-to-registration conversion rate data would strengthen the demand analysis.
Causal inference: All associations identified are observational. A proper before/after design (e.g., impact of Surulere branch opening on total diagnostic volume) would require a difference-in-differences or interrupted time series framework.
Further work: With more time and computing power, I would apply Case Study 2 predictive modelling – specifically a survival analysis of patient retention, RFM segmentation of high-value patients, and a Prophet forecast of monthly new registrations for the next 12 months.
11 References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making – from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
[Your Name]. (2026). Grandville Medical Group patient registration dataset, December 2021 - May 2026 [Dataset]. Collected from Grandville Medical Group PRM System, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
# R package citations used in this analysis
# tidyverse
# lubridate
# ggcorrplot
# patchwork
# psych
# broom
# kableExtra
# readxl
Appendix: AI Usage Statement
Claude (Anthropic, claude-sonnet-4-6, accessed May 2026) was used as a coding assistant during the preparation of this Quarto document. Specifically, AI assistance was used to: (a) generate initial boilerplate R code for ggplot2 visualisations; (b) structure the Quarto document layout; and (c) suggest appropriate Spearman correlation implementation for binary tag variables.
All analytical decisions – the selection of Case Study 1 over Case Study 2 or 3, the choice of Spearman over Pearson correlation for binary service tags, the decision to use Welch’s t-test (not Student’s) due to unequal group variances, the hypothesis formulations, the logistic regression variable selection, and all business interpretations – were made independently by the author based on professional knowledge of Grandville Medical Group’s operations and the course textbook. The author has verified every line of code produced in this document and is prepared to defend all results, outputs, and conclusions during the viva voce examination.