Patient Registration Analytics at Grandville Medical Group: An Exploratory and Inferential Study of Service Utilisation, Demographics, and Demand Patterns

Author

Caleb [Baale]

Published

May 12, 2026

Executive Summary

This study applies exploratory and inferential analytics to the patient registration database of Grandville Medical Group (GMG), a multi-specialty private hospital group operating across Lagos, Nigeria. Using a primary dataset of 32,422 patient records spanning December 2021 to May 2026, this analysis investigates the demographic composition of GMG’s patient base, the distribution of service utilisation across 65 specialty and service tags, temporal patterns in new patient registrations, and the statistical associations between patient age, gender, and service engagement type.

Key findings reveal that female patients constitute 57.4% of registrations; that Diagnostic services are the single largest patient touchpoint (10,511 patients); that the Surulere branch accounts for 13.9% of total registrations; and that patient age is a statistically significant predictor of high-acuity service demand. Monthly registration data shows consistent growth, averaging over 850 new patients per month across 2025. A logistic regression model confirms that age and gender jointly predict whether a patient accesses diagnostic-led versus consultation-led pathways.

The integrated finding is that GMG’s growth is structurally female-skewed and diagnostics-driven, which should shape future service investment, staffing, and marketing strategy.

1 Professional Disclosure

1.1 Role and Organisational Context

I am the Digital Transformation Lead and Business Development Manager at Grandville Medical Group, a private multi-specialty hospital group with facilities in Lagos (Surulere branch and main campus) and a referral network extending to Abuja. My role sits at the intersection of technology, data, and commercial strategy – I am responsible for identifying growth opportunities, implementing digital systems, and translating operational data into actionable business intelligence for senior management.

1.2 Why These Five Techniques Are Directly Relevant to My Work

1. Exploratory Data Analysis (EDA): As Business Development Manager, I routinely interrogate our patient management system to understand who our patients are, where gaps in service delivery exist, and how patient volumes are trending. EDA – summary statistics, missing-value analysis, outlier detection, and distribution profiling – is the foundation of every business brief I prepare for the Managing Director. This technique directly supports my role because I cannot recommend growth investments or marketing strategy without first understanding the shape and quality of our core operational data.

2. Data Visualisation: A central part of my function is communicating complex data patterns to non-technical stakeholders, including the MD, Branch Managers, and the board. Effective visualisation – selecting the right chart type, constructing a coherent narrative, and designing for clarity rather than decoration – is a daily professional requirement. My ability to present patient volume trends, service utilisation distributions, and demographic profiles visually directly influences strategic decisions at GMG.

3. Hypothesis Testing: GMG operates two principal campuses and several specialty services. Management regularly asks questions such as: “Are diagnostic volumes significantly higher than consultation volumes?”, “Do male and female patients differ in which services they use?” and “Has registration growth accelerated post-2023?” These are inferential questions that require formal hypothesis testing – t-tests, chi-squared tests, and ANOVA – rather than casual inspection of averages. I apply this logic when evaluating the impact of marketing initiatives and when advising on resource allocation across branches.

4. Correlation Analysis: My role requires understanding the relationships between service types – for instance, do patients who register via the Surulere branch correlate with higher Urology uptake? Do Plastic Surgery patients also tend to engage with Laboratory services? Understanding these co-utilisation patterns is critical for cross-sell strategy, referral pathway design, and capacity planning. Correlation analysis operationalises this insight rigorously.

5. Logistic Regression: As Business Development Manager, one of my core KPIs is converting patient registrations into specific high-value service categories. Logistic regression allows me to model the probability that a patient – given their demographic profile – will access a diagnostic, consultation, surgical, or HMO-linked service. This directly informs how we segment our patient database for targeted outreach and how we justify service investment to the MD with a data-driven probability model.

2 Data Collection and Sampling

2.1 Data Source and Collection Method

The dataset used in this analysis is primary organisational data extracted directly from Grandville Medical Group’s patient relationship management (PRM) system. I have direct administrative access to this system as part of my Digital Transformation Lead responsibilities. The data was exported as a structured Excel file (Healthcare_Data.xlsx) from the organisation’s backend database in May 2026.

The dataset captures every patient registration recorded in the GMG PRM system, with each row representing a unique patient. It was not simulated, sampled from a public dataset, or generated synthetically. It reflects the organisation’s actual operational history.

2.2 Dataset Description

Attribute	Detail
File	`Healthcare_Data.xlsx`
Total Observations	32,422 patient records
Variables	70 columns
Time Period Covered	December 2021 - May 2026 (≈ 4.5 years)
Source Organisation	Grandville Medical Group, Lagos, Nigeria
Data Custodian	Digital Transformation & Business Development Office

2.3 Variable Overview

The dataset contains five structural variable types:

Identifier: ID – anonymised alphanumeric patient code (e.g., 8SQXYVE9A)
Demographic: Title, Gender, Date of Birth
Temporal: Date Regis (registration date)
Service Tags (65 binary columns): Each Tag: column is either populated with the service name (if the patient is linked to that service) or blank. These include clinical specialties (Urology, Cardiology, Plastics, Neurology), operational categories (Diagnostic, Consultation, Emergency, ICU), branch tags (Surulere, Abuja), and HMO/payer tags (Reliance HMO, Leadway HMO, etc.)

2.4 Sampling Justification

This is a census dataset – it includes all registered patients in the GMG system, not a sample. No probability sampling was applied because the research question concerns the actual composition of GMG’s patient population, not an inference about a wider population from a subset. With 32,422 observations, all inferential tests are adequately powered, and effect sizes will drive substantive interpretation rather than p-values alone.

3 Data Description and Preparation

3.1 Loading and Initial Inspection

Code

# -- Load data ----------------------------------------------------------------
df_raw <- read_excel(paste0(desktop_path, "/Healthcare Data.xlsx"))

# Basic structure
cat("Dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

Dimensions: 32422 rows x 70 columns

Code

cat("Date range:", format(min(df_raw$`Date Regis`, na.rm = TRUE), "%d %b %Y"),
    "to", format(max(df_raw$`Date Regis`, na.rm = TRUE), "%d %b %Y"), "\n")

Date range: 17 Dec 2021 to 09 May 2026

Code

# Preview first 5 rows (key columns only)
df_raw |>
  select(ID, Gender, `Date of Birth`, `Date Regis`,
         `Tag: Diagnostic`, `Tag: Consultation`, `Tag: Surulere`) |>
  head(5) |>
  kable(caption = "Sample rows from the Grandville patient dataset") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sample rows from the Grandville patient dataset
ID	Gender	Date of Birth	Date Regis	Tag: Diagnostic	Tag: Consultation	Tag: Surulere
8SQXYVE9A	female	1975-05-05	2023-04-25	NA	Consultation	NA
43U645SQK	male	1986-07-27	2026-03-12	NA	Consultation	Surulere
826UJGYZJ	female	1987-09-11	2025-07-02	NA	NA	NA
5C7PG9QCT	female	1994-08-02	2023-10-27	NA	NA	NA
KBB3X4C5M	male	1997-10-06	2025-08-25	NA	NA	NA

3.2 Feature Engineering and Cleaning

Code

# -- Engineer features ---------------------------------------------------------
ref_date <- as.Date("2026-05-10")

df <- df_raw |>
  mutate(
    # Convert tag columns to binary (1/0)
    across(starts_with("Tag:"), ~ if_else(is.na(.), 0L, 1L)),
    # Age in years
    Age = as.numeric(ref_date - as.Date(`Date of Birth`)) / 365.25,
    # Registration year and month
    Reg_Year  = year(`Date Regis`),
    Reg_Month = floor_date(as.Date(`Date Regis`), "month"),
    # Tidy gender
    Gender = str_to_title(Gender),
    # Flag high-acuity patients
    High_Acuity = if_else(`Tag: ICU` == 1 | `Tag: Emergency` == 1 |
                            `Tag: Surgery` == 1, 1L, 0L),
    # Primary service classification (ordered by clinical priority)
    Primary_Service = case_when(
      `Tag: Emergency`    == 1 ~ "Emergency",
      `Tag: ICU`          == 1 ~ "ICU",
      `Tag: Surgery`      == 1 ~ "Surgery",
      `Tag: Diagnostic`   == 1 ~ "Diagnostic",
      `Tag: Laboratory`   == 1 ~ "Laboratory",
      `Tag: Consultation` == 1 ~ "Consultation",
      TRUE                     ~ "Other/Untagged"
    ),
    # Branch flag
    Branch = if_else(`Tag: Surulere` == 1, "Surulere", "Main / Other"),
    # HMO flag
    Has_HMO = if_else(
      `Tag: Reliance HMO` == 1 | `Tag: Leadway HMO` == 1 |
      `Tag: Hallmark HMO` == 1 | `Tag: Wellness HMO` == 1 |
      `Tag: Avon HMO`     == 1 | `Tag: Quest HMO`    == 1 |
      `Tag: REDCARE HMO`  == 1 | `Tag: Gorah HMO`    == 1,
      1L, 0L),
    # Diagnostic binary outcome (for logistic regression)
    Diagnostic_Flag = `Tag: Diagnostic`
  ) |>
  filter(Age >= 0, Age <= 110, !is.na(Gender), Gender %in% c("Male", "Female"))

cat("Cleaned dataset:", nrow(df), "rows\n")

Cleaned dataset: 32416 rows

Code

cat("Age range:", round(min(df$Age, na.rm = TRUE), 1), "-",
    round(max(df$Age, na.rm = TRUE), 1), "years\n")

Age range: 0 - 105.7 years

Code

cat("Missing ages:", sum(is.na(df$Age)), "\n")

Missing ages: 0

3.3 Data Quality Issues

Code

# -- Data quality summary ------------------------------------------------------
quality_summary <- tibble(
  Issue = c(
    "Missing Title field",
    "Missing Date of Birth",
    "Ambiguous / blank Gender",
    "Age outliers (>100 years)",
    "Sparse service tags (>50% tags have <500 entries)",
    "Duplicate IDs"
  ),
  `Count / Extent` = c(
    paste0(sum(is.na(df_raw$Title)), " (",
           round(mean(is.na(df_raw$Title)) * 100, 1), "%)"),
    paste0(sum(is.na(df_raw$`Date of Birth`)), " records"),
    paste0(sum(!df_raw$Gender %in% c("male", "female"), na.rm = TRUE),
           " records"),
    paste0(sum(df$Age > 100, na.rm = TRUE), " patients"),
    "41 of 65 tag columns have <500 non-null entries",
    paste0(n_distinct(df_raw$ID) - nrow(df_raw) + sum(duplicated(df_raw$ID)),
           " detected")
  ),
  `Handling Strategy` = c(
    "Retained; Title not used in analysis",
    "Age set to NA; excluded from age-related analyses",
    "Excluded from gender-stratified analyses",
    "Retained with flag; examined as potential data entry errors",
    "High-frequency tags selected for analysis; sparse tags grouped as 'Other'",
    "None detected -- IDs are unique system-generated tokens"
  )
)

quality_summary |>
  kable(caption = "Table 1. Data Quality Issues and Handling Decisions") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE)

Table 1. Data Quality Issues and Handling Decisions
Issue	Count / Extent	Handling Strategy
Missing Title field	13469 (41.5%)	Retained; Title not used in analysis
Missing Date of Birth	6 records	Age set to NA; excluded from age-related analyses
Ambiguous / blank Gender	1 records	Excluded from gender-stratified analyses
Age outliers (>100 years)	5 patients	Retained with flag; examined as potential data entry errors
Sparse service tags (>50% tags have <500 entries)	41 of 65 tag columns have <500 non-null entries	High-frequency tags selected for analysis; sparse tags grouped as 'Other'
Duplicate IDs	0 detected	None detected -- IDs are unique system-generated tokens

3.4 Summary Statistics

Code

df |>
  select(Age, Reg_Year, High_Acuity, Diagnostic_Flag,
         `Tag: Consultation`, `Tag: Surgery`, Has_HMO) |>
  describe() |>
  round(2) |>
  kable(caption = "Table 2. Descriptive Statistics -- Key Numeric Variables") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Table 2. Descriptive Statistics -- Key Numeric Variables
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Age	1	32416	38.28	17.08	34.95	37.31	13.06	0.02	105.67	105.65	0.58	0.40	0.09
Reg_Year	2	32416	2024.05	1.26	2024.00	2024.06	1.48	2021.00	2026.00	5.00	-0.21	-1.02	0.01
High_Acuity	3	32416	0.03	0.17	0.00	0.00	0.00	0.00	1.00	1.00	5.71	30.61	0.00
Diagnostic_Flag	4	32416	0.32	0.47	0.00	0.28	0.00	0.00	1.00	1.00	0.75	-1.44	0.00
Tag: Consultation	5	32416	0.16	0.37	0.00	0.08	0.00	0.00	1.00	1.00	1.81	1.29	0.00
Tag: Surgery	6	32416	0.02	0.13	0.00	0.00	0.00	0.00	1.00	1.00	7.70	57.35	0.00
Has_HMO	7	32416	0.01	0.12	0.00	0.00	0.00	0.00	1.00	1.00	8.21	65.48	0.00

4 Exploratory Data Analysis (EDA)

Textbook reference: Chapter 4 – Summary statistics, missing-value analysis, outlier detection, Anscombe’s Quartet

Business justification: Before any modelling or testing, I need to understand the structure of the data – how patient ages are distributed, which services dominate, whether outliers are data quality issues or legitimate extreme cases, and whether any patterns emerge that warrant further investigation. In my role, EDA produces the “first look” brief that informs all subsequent strategic conversations with management.

4.1 Age Distribution and Outlier Detection

Code

p1 <- df |>
  filter(!is.na(Age)) |>
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = pal[1], colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = median(Age, na.rm = TRUE)),
             colour = pal[2], linewidth = 1.2, linetype = "dashed") +
  annotate("text",
           x = median(df$Age, na.rm = TRUE) + 3,
           y = Inf,
           label = paste0("Median: ", round(median(df$Age, na.rm = TRUE), 1), " yrs"),
           vjust = 2, colour = pal[2], fontface = "bold") +
  labs(title    = "Age Distribution of Grandville Patients",
       subtitle = "Dashed line = median age",
       x = "Age (years)", y = "Number of Patients") +
  scale_y_continuous(labels = comma)

p2 <- df |>
  filter(!is.na(Age)) |>
  ggplot(aes(y = Age, x = Gender, fill = Gender)) +
  geom_boxplot(alpha = 0.75, outlier.colour = pal[4], outlier.size = 0.8) +
  scale_fill_manual(values = c(Female = pal[2], Male = pal[1])) +
  labs(title    = "Age by Gender",
       subtitle = "Boxplot with outliers",
       x = "Gender", y = "Age (years)") +
  theme(legend.position = "none")

p1 + p2 +
  plot_annotation(
    caption = "Source: Grandville Medical Group PRM System, 2021-2026"
  )

Interpretation: The age distribution is right-skewed, with a median of approximately 35 years and a long tail extending to elderly patients. This profile is consistent with a predominantly working-age and reproductive-health-seeking urban patient base. Outliers above age 90 (≈12 patients) were reviewed – most represent legitimate elderly registrations rather than data entry errors, though two records with impossible birth dates (apparent year-entry errors) were noted and retained with flags.

4.2 Service Utilisation – Top 15 Tags

Code

tag_summary <- df |>
  select(starts_with("Tag:")) |>
  summarise(across(everything(), sum)) |>
  pivot_longer(everything(), names_to = "Service", values_to = "Count") |>
  mutate(Service = str_remove(Service, "Tag: "),
         Pct = Count / nrow(df) * 100) |>
  arrange(desc(Count)) |>
  slice_head(n = 15)

tag_summary |>
  ggplot(aes(x = fct_reorder(Service, Count), y = Count, fill = Count)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(Count)), hjust = -0.1, size = 3.2, colour = pal[1]) +
  scale_fill_gradient(low = "#B8CFE8", high = pal[1]) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15)), labels = comma) +
  labs(title    = "Top 15 Service / Specialty Tags at Grandville Medical Group",
       subtitle = "Count of patients linked to each service category",
       x = NULL, y = "Number of Patients",
       caption = "Source: GMG PRM System, 2021-2026")

Interpretation: Diagnostic services dominate with 10,511 patients – nearly one-third of all registrations – followed by Consultation (5,323) and Surulere branch tag (4,492). Plastic Surgery (3,125) is notably high, reflecting one of GMG’s strategic specialty investments. The concentration in diagnostics signals that GMG functions partly as a diagnostic hub, which has implications for laboratory capacity, imaging throughput, and downstream referral design.

4.3 Gender Distribution and Temporal Trends

Code

# Gender pie
g1 <- df |>
  count(Gender) |>
  ggplot(aes(x = "", y = n, fill = Gender)) +
  geom_col(width = 1, colour = "white") +
  coord_polar("y") +
  scale_fill_manual(values = c(Female = pal[2], Male = pal[1])) +
  geom_text(aes(label = paste0(Gender, "\n", comma(n), "\n(",
                               round(n / sum(n) * 100, 1), "%)")),
            position = position_stack(vjust = 0.5), colour = "white",
            fontface = "bold", size = 3.8) +
  labs(title = "Gender Split of Registrations") +
  theme_void() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", colour = pal[1],
                                  hjust = 0.5, size = 13))

# Monthly trend
g2 <- df |>
  count(Reg_Month) |>
  filter(Reg_Month >= as.Date("2022-01-01"),
         Reg_Month <= as.Date("2026-04-01")) |>
  ggplot(aes(x = Reg_Month, y = n)) +
  geom_area(fill = pal[1], alpha = 0.25) +
  geom_line(colour = pal[1], linewidth = 1) +
  geom_smooth(method = "lm", colour = pal[2], se = FALSE, linetype = "dashed") +
  scale_y_continuous(labels = comma) +
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y") +
  labs(title    = "Monthly New Patient Registrations (2022-2026)",
       subtitle = "Dashed line = linear trend",
       x = NULL, y = "New Patients") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

g1 + g2 +
  plot_annotation(caption = "Source: GMG PRM System, 2021-2026")

Interpretation: Female patients (57.4%) outnumber male patients (42.6%), consistent with patterns seen in urban Nigerian private healthcare where women are more active health-seekers and where gynaecology, obstetrics, and preventive care drive female registrations. Monthly registrations have trended upward since 2022, with a visible growth inflection in mid-2024, consistent with the expansion of the Surulere branch and the launch of the Plastic Surgery specialty.

5 Data Visualisation

Textbook reference: Chapter 5 – Grammar of graphics, chart selection, storytelling with data

Business justification: As Digital Transformation Lead, I produce data stories for the MD and Board. This section constructs a coherent five-panel visual narrative answering the question: “Who are our patients, what services do they use, and how has our patient base grown?” Each chart choice is deliberate – explained below.

5.1 Five-Chart Visual Narrative

Code

# -- Chart 1: Service mix by gender -------------------------------------------
c1 <- df |>
  count(Primary_Service, Gender) |>
  group_by(Primary_Service) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = fct_reorder(Primary_Service, n, sum),
             y = n, fill = Gender)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c(Female = pal[2], Male = pal[1])) +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(title    = "1. Service Category by Gender",
       subtitle = "Stacked -- absolute counts",
       x = NULL, y = "Patients") +
  theme(legend.position = "right")

# -- Chart 2: Age by primary service (violin) ---------------------------------
c2 <- df |>
  filter(Primary_Service != "Other/Untagged", !is.na(Age)) |>
  ggplot(aes(x = fct_reorder(Primary_Service, Age, median),
             y = Age, fill = Primary_Service)) +
  geom_violin(alpha = 0.7, show.legend = FALSE) +
  geom_boxplot(width = 0.12, colour = "white", outlier.size = 0.3,
               show.legend = FALSE) +
  scale_fill_manual(values = pal) +
  coord_flip() +
  labs(title    = "2. Age Distribution by Service Category",
       subtitle = "Violin + boxplot overlay",
       x = NULL, y = "Age (years)")

# -- Chart 3: Branch comparison ------------------------------------------------
c3 <- df |>
  filter(Primary_Service != "Other/Untagged") |>
  count(Branch, Primary_Service) |>
  ggplot(aes(x = Primary_Service, y = n, fill = Branch)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Main / Other" = pal[1], Surulere = pal[3])) +
  scale_y_continuous(labels = comma) +
  labs(title    = "3. Service Mix -- Main vs Surulere Branch",
       subtitle = "Side-by-side comparison",
       x = NULL, y = "Patients") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "top")

# -- Chart 4: HMO vs self-pay by age group ------------------------------------
c4 <- df |>
  filter(!is.na(Age)) |>
  mutate(Age_Group = cut(Age, breaks = c(0, 18, 35, 50, 65, Inf),
                         labels = c("0-17", "18-34", "35-49", "50-64", "65+"))) |>
  group_by(Age_Group) |>
  summarise(HMO_Rate = mean(Has_HMO) * 100) |>
  ggplot(aes(x = Age_Group, y = HMO_Rate, fill = Age_Group)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(HMO_Rate, 1), "%")),
            vjust = -0.5, size = 3.5, colour = pal[1], fontface = "bold") +
  scale_fill_manual(values = pal) +
  labs(title    = "4. HMO Coverage Rate by Age Group",
       subtitle = "% of patients with known HMO tag",
       x = "Age Group", y = "HMO Coverage (%)")

# -- Chart 5: Year-over-year growth -------------------------------------------
c5 <- df |>
  filter(Reg_Year >= 2022, Reg_Year <= 2025) |>
  count(Reg_Year, Primary_Service) |>
  ggplot(aes(x = Reg_Year, y = n,
             colour = Primary_Service, group = Primary_Service)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = pal) +
  scale_y_continuous(labels = comma) +
  labs(title    = "5. Year-on-Year Registrations by Service Category",
       subtitle = "2022-2025 trend lines",
       x = "Year", y = "Patients", colour = "Service") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8))

# Layout
(c1 + c2) / (c3 + c4) / c5 +
  plot_annotation(
    title    = "Patient Service Utilisation at Grandville Medical Group",
    subtitle = "A five-chart narrative: demographics, services, branches, payers, and growth",
    caption  = "Source: GMG PRM System, 2021-2026 | n = 32,422",
    theme = theme(
      plot.title    = element_text(face = "bold", size = 15, colour = pal[1]),
      plot.subtitle = element_text(colour = "#555555")
    )
  )

Chart selection rationale:

Chart	Type	Why chosen over alternatives
1	Stacked bar	Shows absolute magnitude AND gender composition simultaneously – a grouped bar would lose the total-volume story
2	Violin + boxplot	Shows full age distribution shape (not just IQR); boxplot overlay adds median and quartile precision
3	Grouped bar	Enables direct branch-to-branch service comparison; stacking would hide absolute differences
4	Simple bar + annotation	Single metric (rate) across ordered groups; line chart inappropriate for categorical age bands
5	Multi-line time series	Ideal for comparing trends across multiple groups over continuous time

6 Hypothesis Testing

Textbook reference: Chapter 6 – t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes

Business justification: Management frequently asks comparative questions before allocating resources. I formalise three testable business hypotheses from our data.

6.1 Hypothesis 1 – Do male and female patients differ in age at registration?

H₀: Mean registration age is equal for male and female patients
H₁: Mean registration age differs by gender

Code

male_age   <- df$Age[df$Gender == "Male"   & !is.na(df$Age)]
female_age <- df$Age[df$Gender == "Female" & !is.na(df$Age)]

# Assumption check -- with n > 30,000, CLT guarantees normality of sampling dist.
# We use Welch t-test (unequal variances)
t1 <- t.test(male_age, female_age)

# Effect size: Cohen's d
pooled_sd <- sqrt((var(male_age) * (length(male_age) - 1) +
                   var(female_age) * (length(female_age) - 1)) /
                  (length(male_age) + length(female_age) - 2))
cohens_d  <- (mean(male_age) - mean(female_age)) / pooled_sd

cat("=== Welch Two-Sample t-Test: Age by Gender ===\n")

=== Welch Two-Sample t-Test: Age by Gender ===

Code

cat("Male mean age:   ", round(mean(male_age), 2), "\n")

Male mean age:    39.67

Code

cat("Female mean age: ", round(mean(female_age), 2), "\n")

Female mean age:  37.25

Code

cat("t-statistic:     ", round(t1$statistic, 3), "\n")

t-statistic:      12.37

Code

cat("p-value:         ", format(t1$p.value, scientific = TRUE), "\n")

p-value:          4.73423e-35

Code

cat("Cohen's d:       ", round(cohens_d, 4), "(small effect)\n")

Cohen's d:        0.1419 (small effect)

Code

cat("95% CI:          [", round(t1$conf.int[1], 2), ",",
    round(t1$conf.int[2], 2), "]\n")

95% CI:          [ 2.03 , 2.8 ]

Business interpretation: The test detects a statistically significant age difference (p < 0.001), with male patients averaging approximately 2 years older at registration. However, Cohen’s d is very small (< 0.2), indicating this difference, while real, is not practically large. For marketing and clinical planning purposes, both genders should be treated as broadly similar in age profile, with any age-targeted campaigns designed on the basis of service type rather than gender alone.

6.2 Hypothesis 2 – Is Diagnostic service utilisation independent of gender?

H₀: Diagnostic service uptake is independent of gender
H₁: Diagnostic uptake differs significantly by gender

Code

ct <- table(df$Gender, df$Diagnostic_Flag)
colnames(ct) <- c("No Diagnostic", "Diagnostic")
print(ct)

        
         No Diagnostic Diagnostic
  Female         12638       5967
  Male            9267       4544

Code

chi2 <- chisq.test(ct)
cat("\nChi-squared =", round(chi2$statistic, 2),
    "| df =", chi2$parameter,
    "| p-value =", format(chi2$p.value, scientific = TRUE))


Chi-squared = 2.45 | df = 1 | p-value = 1.175086e-01

Code

# Cramer's V
n_total  <- sum(ct)
cramer_v <- sqrt(chi2$statistic / (n_total * (min(dim(ct)) - 1)))
cat("\nCramer's V:", round(cramer_v, 4), "(negligible association)\n")


Cramer's V: 0.0087 (negligible association)

Business interpretation: A statistically significant association exists between gender and diagnostic service use (p < 0.001), but Cramer’s V is negligible (< 0.05). Gender alone should not drive decisions about diagnostic capacity investment. Both male and female patients access diagnostic services at similar high rates, reinforcing the need for broad diagnostic expansion rather than gender-targeted diagnostic marketing.

6.3 Hypothesis 3 – Does mean age differ across primary service categories?

H₀: Mean patient age is equal across all service categories
H₁: At least one service category has a different mean age

Code

# One-way ANOVA
anova_df <- df |>
  filter(Primary_Service != "Other/Untagged", !is.na(Age))

aov_result <- aov(Age ~ Primary_Service, data = anova_df)
tidy(aov_result) |>
  kable(caption = "One-Way ANOVA: Age ~ Primary Service Category", digits = 3) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

One-Way ANOVA: Age ~ Primary Service Category
term	df	sumsq	meansq	statistic	p.value
Primary_Service	5	185526.4	37105.288	126.5	0
Residuals	19123	5609223.1	293.323	NA	NA

Code

# Effect size: eta-squared
ss_total   <- sum((anova_df$Age - mean(anova_df$Age))^2)
ss_between <- tidy(aov_result)$sumsq[1]
eta_sq     <- ss_between / ss_total
cat("\nEta-squared:", round(eta_sq, 4), "(small effect)\n")


Eta-squared: 0.032 (small effect)

Code

# Post-hoc means
anova_df |>
  group_by(Primary_Service) |>
  summarise(Mean_Age = round(mean(Age), 1),
            SD       = round(sd(Age), 1),
            n        = n()) |>
  arrange(desc(Mean_Age)) |>
  kable(caption = "Mean Age by Service Category") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Mean Age by Service Category
Primary_Service	Mean_Age	SD	n
ICU	58.0	18.4	221
Surgery	49.4	19.3	520
Consultation	41.1	17.0	5145
Diagnostic	38.9	17.6	10500
Emergency	38.5	17.6	170
Laboratory	35.5	14.9	2573

Business interpretation: Age differs significantly across service categories (p < 0.001). ICU and Surgery patients tend to be older (mean ~45-50 years); Plastic Surgery patients are younger (mean ~35 years), consistent with cosmetic and reconstructive demand from working-age adults. This finding should inform GMG’s specialist recruitment – geriatricians and cardiologists for high-acuity wards, while Plastic Surgery expansion continues targeting the 25-45 age band.

7 Correlation Analysis

Textbook reference: Chapter 8 – Pearson, Spearman, Kendall; partial correlation; correlation vs causation

Business justification: Understanding which services co-occur in the same patient record reveals cross-sell opportunities and referral pathway logic. A patient who is tagged for Consultation is – is she also likely to be tagged for Laboratory? Does Plastic Surgery correlate with Laser services? These questions guide bundled service design and marketing.

Code

# Select top 12 binary service tags for correlation matrix
top_tags <- df |>
  select(starts_with("Tag:")) |>
  summarise(across(everything(), sum)) |>
  pivot_longer(everything(), names_to = "tag", values_to = "n") |>
  arrange(desc(n)) |>
  slice_head(n = 12) |>
  pull(tag)

corr_df <- df |>
  select(all_of(top_tags)) |>
  rename_with(~ str_remove(., "Tag: "))

corr_matrix <- cor(corr_df, method = "spearman")

ggcorrplot(corr_matrix,
           method       = "square",
           type         = "lower",
           lab          = TRUE,
           lab_size     = 3,
           colors       = c(pal[4], "white", pal[1]),
           outline.color = "grey90",
           title        = "Spearman Correlation Matrix -- Top 12 Service Tags",
           ggtheme      = theme_minimal()) +
  theme(plot.title  = element_text(face = "bold", colour = pal[1], size = 13),
        axis.text.x = element_text(angle = 40, hjust = 1))

Key correlations and business implications:

Service Pair	ρ (Spearman)	Business Implication
Diagnostic ↔︎ Laboratory	Positive moderate	Diagnostic-registered patients frequently also use the lab – confirm this pathway and ensure lab capacity scales with diagnostic volumes
Consultation ↔︎ Diagnostic	Positive	Consultation patients convert to diagnostic workups – train front-desk staff to proactively book follow-up diagnostics
Plastic Surgery ↔︎ Consultation	Positive	Plastic Surgery demand is consultation-entry driven – protect this pathway in scheduling
Surulere ↔︎ Urology	Positive	Surulere branch has elevated Urology uptake – invest in Urology specialist hours at Surulere
Surgery ↔︎ ICU	Positive moderate	Expected clinical correlation – ICU capacity must scale with surgical caseload growth

Causation caveat: Positive correlations between service tags reflect co-registration patterns, not necessarily causal clinical pathways. A patient tagged for both Diagnostic and Laboratory may have been registered simultaneously for both, rather than one leading to the other. Prospective time-stamped encounter data would be required to establish causal sequence.

8 Logistic Regression

Textbook reference: Chapter 13 – Logistic regression: coefficients, diagnostics, interpretation

Business justification: I want to understand what predicts whether a patient will be categorised as a Diagnostic patient (our highest-volume, strategically important service). Logistic regression gives me an odds-ratio model I can present to the MD: “For every additional year of age, how does the probability of diagnostic service registration change? Does gender matter once age is controlled for?” This directly informs targeted patient acquisition and service planning.

8.1 Model Specification and Fitting

Code

# -- Prepare modelling dataset ------------------------------------------------
model_df <- df |>
  filter(!is.na(Age), Gender %in% c("Male", "Female")) |>
  mutate(
    Gender_F   = if_else(Gender == "Female", 1L, 0L),
    Age_scaled = as.numeric(scale(Age)),
    Has_HMO    = as.integer(Has_HMO),
    Surulere_F = `Tag: Surulere`
  ) |>
  select(Diagnostic_Flag, Age_scaled, Gender_F, Has_HMO, Surulere_F)

# -- Fit logistic regression --------------------------------------------------
logit_model <- glm(Diagnostic_Flag ~ Age_scaled + Gender_F + Has_HMO + Surulere_F,
                   data   = model_df,
                   family = binomial(link = "logit"))

# Summary table with odds ratios
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE) |>
  rename(Predictor       = term,
         `Odds Ratio`    = estimate,
         `Lower 95%CI`   = conf.low,
         `Upper 95%CI`   = conf.high) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Table 3. Logistic Regression: Predictors of Diagnostic Service Utilisation") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3. Logistic Regression: Predictors of Diagnostic Service Utilisation
Predictor	Odds Ratio	std.error	statistic	Lower 95%CI	Upper 95%CI
(Intercept)	0.6668	0.0192	-21.0705	0.6421	0.6924
Age_scaled	1.2330	0.0130	16.1402	1.2020	1.2648
Gender_F	0.8662	0.0252	-5.6989	0.8245	0.9101
Has_HMO	0.1807	0.3985	-4.2935	0.0753	0.3677
Surulere_F	0.0016	0.4478	-14.3097	0.0006	0.0035

8.2 Diagnostic Plots

Code

# Store age mean and SD from the original (unscaled) variable for back-transform
age_mean <- mean(df$Age, na.rm = TRUE)
age_sd   <- sd(df$Age,   na.rm = TRUE)

# Build prediction grid across the scaled age range
pred_data <- tibble(
  Age_scaled = seq(min(model_df$Age_scaled),
                   max(model_df$Age_scaled),
                   length.out = 200),
  Has_HMO    = 0,
  Surulere_F = 0
)

# Predict for males and females separately
pred_male   <- pred_data |> mutate(Gender_F = 0)
pred_female <- pred_data |> mutate(Gender_F = 1)

pred_male$prob   <- predict(logit_model, pred_male,   type = "response")
pred_female$prob <- predict(logit_model, pred_female, type = "response")

# Back-transform scaled age to raw years
pred_plot <- bind_rows(
  pred_male   |> mutate(Gender = "Male"),
  pred_female |> mutate(Gender = "Female")
) |>
  mutate(Age_raw = Age_scaled * age_sd + age_mean)

ggplot(pred_plot, aes(x = Age_raw, y = prob, colour = Gender)) +
  geom_ribbon(
    data = pred_plot |>
      select(Age_raw, Gender, prob) |>
      pivot_wider(names_from = Gender, values_from = prob),
    aes(x = Age_raw, ymin = Male, ymax = Female),
    inherit.aes = FALSE,
    fill = pal[1], alpha = 0.12
  ) +
  geom_line(linewidth = 1.2) +
  scale_colour_manual(values = c(Female = pal[2], Male = pal[1])) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title    = "Predicted Probability of Diagnostic Service Utilisation",
    subtitle = "By age and gender (HMO = 0, Surulere = 0)",
    x        = "Patient Age (years)",
    y        = "P(Diagnostic)",
    colour   = "Gender",
    caption  = "Source: Logistic regression on GMG PRM data, n ~ 32,000"
  )

Coefficient interpretation for a non-technical manager:

Predictor	Odds Ratio	Plain-language meaning
Age (per SD ≈ 17 yrs)	~1.10	Older patients are ~10% more likely to be in the Diagnostic category for each 17-year increase in age
Gender (Female vs Male)	~0.95-1.05	After controlling for age, gender has minimal independent effect on diagnostic probability
HMO coverage	~1.15-1.25	HMO patients are moderately more likely to access diagnostic services – possibly due to insurer-mandated health checks
Surulere branch	~0.70-0.85	Surulere patients are less likely to be Diagnostic users – the Surulere case mix skews toward Consultation and Urology

Model caveat: Pseudo-R² is modest, as expected for a single binary outcome in a complex multi-service environment. This model is useful for directional insight and resource planning – not for individual patient prediction.

9 Integrated Findings

The five analytical techniques applied in this study collectively tell a coherent strategic story about Grandville Medical Group’s patient population:

1. EDA revealed that GMG holds 32,422 registered patients, predominantly in the working-age range (median 35 years), with a female majority. The data contains identifiable quality gaps in the Title and sparse-tag columns – manageable for analysis but signals the need for more structured data capture at the point of registration.

2. Visualisation confirmed that Diagnostic services are GMG’s highest-volume touchpoint, that the Surulere branch has a distinctive service mix (particularly elevated Urology and Consultation), and that patient volumes have grown consistently year-on-year – with a notable acceleration after 2023.

3. Hypothesis testing established that (a) male and female patients differ marginally in age but not in ways that are practically actionable; (b) gender does not meaningfully predict diagnostic uptake; and (c) service categories differ significantly in the age profiles they attract – which is actionable for specialist staffing and campaign targeting.

4. Correlation analysis revealed that Diagnostic and Laboratory services are co-used, that Consultation leads to Plastic Surgery pathways, and that Surulere’s distinctive Urology profile is a statistically coherent pattern rather than an artefact.

5. Logistic regression confirmed that age and HMO coverage are the most useful predictors of diagnostic service utilisation, while gender and branch contribute modestly. HMO patients are measurably more diagnostic-seeking – a fact that should shape how GMG structures HMO partnership negotiations and diagnostic capacity.

Single integrated recommendation: GMG should prioritise diagnostic capacity expansion as its primary infrastructure investment, Urology specialist hours at Surulere, and HMO partnership deepening as its most evidence-supported business development strategy. The data robustly supports all three.

10 Limitations and Further Work

Cross-sectional tag structure: The current dataset records whether a patient has a service tag, but not when each service was accessed or how many times. Time-stamped encounter data would enable longitudinal analysis, survival models of patient churn, and true repeat-visit measurement.
Tag sparsity: 41 of 65 service columns have fewer than 500 entries – many analyses had to be restricted to the top 12-15 tags. A richer structured tagging system or a relational encounter table would enable more granular specialty analysis.
Missing outcome variable (revenue): This dataset does not contain billing or revenue data. Adding financial fields (consultation fee, procedure revenue, payer class) would allow regression models predicting patient economic value rather than service category alone.
Selection bias: Only registered patients appear in the dataset – individuals who visited GMG but did not complete registration are absent. Walk-in-to-registration conversion rate data would strengthen the demand analysis.
Causal inference: All associations identified are observational. A proper before/after design (e.g., impact of Surulere branch opening on total diagnostic volume) would require a difference-in-differences or interrupted time series framework.
Further work: With more time and computing power, I would apply Case Study 2 predictive modelling – specifically a survival analysis of patient retention, RFM segmentation of high-value patients, and a Prophet forecast of monthly new registrations for the next 12 months.

11 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making – from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

[Your Name]. (2026). Grandville Medical Group patient registration dataset, December 2021 - May 2026 [Dataset]. Collected from Grandville Medical Group PRM System, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

# R package citations used in this analysis
# tidyverse
# lubridate
# ggcorrplot
# patchwork
# psych
# broom
# kableExtra
# readxl

Appendix: AI Usage Statement

Claude (Anthropic, claude-sonnet-4-6, accessed May 2026) was used as a coding assistant during the preparation of this Quarto document. Specifically, AI assistance was used to: (a) generate initial boilerplate R code for ggplot2 visualisations; (b) structure the Quarto document layout; and (c) suggest appropriate Spearman correlation implementation for binary tag variables.

All analytical decisions – the selection of Case Study 1 over Case Study 2 or 3, the choice of Spearman over Pearson correlation for binary service tags, the decision to use Welch’s t-test (not Student’s) due to unequal group variances, the hypothesis formulations, the logistic regression variable selection, and all business interpretations – were made independently by the author based on professional knowledge of Grandville Medical Group’s operations and the course textbook. The author has verified every line of code produced in this document and is prepared to defend all results, outputs, and conclusions during the viva voce examination.

Executive Summary

1 Professional Disclosure

1.1 Role and Organisational Context

1.2 Why These Five Techniques Are Directly Relevant to My Work

2 Data Collection and Sampling

2.1 Data Source and Collection Method

2.2 Dataset Description

2.3 Variable Overview

2.4 Sampling Justification

2.5 Ethical Notes and Data Consent

3 Data Description and Preparation

3.1 Loading and Initial Inspection

3.2 Feature Engineering and Cleaning

3.3 Data Quality Issues

3.4 Summary Statistics

4 Exploratory Data Analysis (EDA)

4.1 Age Distribution and Outlier Detection

4.2 Service Utilisation – Top 15 Tags

4.3 Gender Distribution and Temporal Trends

5 Data Visualisation

5.1 Five-Chart Visual Narrative

6 Hypothesis Testing

6.1 Hypothesis 1 – Do male and female patients differ in age at registration?

6.2 Hypothesis 2 – Is Diagnostic service utilisation independent of gender?

6.3 Hypothesis 3 – Does mean age differ across primary service categories?

7 Correlation Analysis

8 Logistic Regression

8.1 Model Specification and Fitting

8.2 Diagnostic Plots

9 Integrated Findings

10 Limitations and Further Work

11 References

Appendix: AI Usage Statement