Executive Summary
We applied five Exploratory & Inferential Analytics techniques (EDA, Visualisation, Hypothesis Testing, Correlation Analysis and Regression) to a 2025 episode-of-care dataset from Bien-Santé Hospital in Lagos (n = 220 patients across 4 clinics). The headline finding is that length of stay is the dominant driver of total patient charges (Pearson r = 0.82; OLS β̂ = ₦26,850 per additional day, p < 0.001). Insurance category is the second-largest lever: patients on the Public tariff incur ₦181k less than the Private reference, holding age, length of stay and department constant. A multiple regression on length of stay, age, insurance and department explains 74 % of the variance in charges (R² = 0.74). Because the dataset does not contain an explicit patient-satisfaction score, we use readmission rate as a clinical proxy for satisfaction (low readmission ↔︎ high satisfaction in the HCAHPS literature). The corresponding model for readmission rate explains only 4 % of variance and no covariate reaches significance, so the available structural factors do not explain readmission; the drivers are more likely clinical / case-mix variables not captured in the export. The single recommendation is to focus operational improvement on length-of-stay management and to add an explicit patient-satisfaction instrument (e.g. CSAT / NPS at discharge) before the next analytic cycle.
Professional Disclosure
I am Emmanuel Nkenwokeneme, Health-Analytics Consultant, advising a private hospital in the Nigerian Healthcare-Services sector. The five techniques in this paper map directly to live operational decisions:
EDA is the disciplined first look that runs at the start of every analytical engagement: distributions, outliers and missingness before any modelling. At Bien-Santé it surfaces the dominance of the General Practice Clinic in the case mix and the wide spread of patient charges from ₦25k to ₦1.38m.
Data Visualisation is how findings travel from the analytics team to the Medical Director and the CFO. Histograms, boxplots and scatterplots are the lingua franca that lets a non-technical audience follow the story in seconds.
Hypothesis Testing separates signal from noise. With 220 patients across heterogeneous departments and three insurance tariffs, formal tests (ANOVA, t-tests) — paired with p-values and effect sizes — keep the operations conversation evidence-based.
Correlation Analysis is the first lens I use to identify candidate drivers and to decide which variables earn a place in the regression model.
Regression is the workhorse for the analytical question. Coefficients, partial effects and R² turn a noisy table of patient episodes into a ranked list of operational levers I can present to the Hospital Executive.
Data Collection & Sampling
| Item | Detail |
| --- | --- |
| Source | Bien-Santé Hospital’s electronic medical-record system (EMR), extracted from the billing and discharge modules. |
| Collection method | Direct workbook export of all closed episodes-of-care in the 2025 financial year. |
| Sampling frame | All patients with a billed, discharged episode at any of the hospital’s four clinics during the reporting window. |
| Sample size | n = 220 patients (full census of closed episodes; not a sample). |
| Time period | Calendar year 2025. |
| Ethics & consent | All patient-identifying fields are pseudonymised at source (e.g. 325/25). Data is held under the hospital’s data-protection policy aligned with the Nigeria Data Protection Act (NDPA, 2023). Diagnoses are reported at the encounter level only; no row-level data left hospital systems. The Medical Director has approved the use of the dataset for analytics development. |
The dataset is delivered as BienSanteHospitalReport.xlsx, 9 columns × 220 rows.
A note on the “patient satisfaction” outcome
The analytical question references patient satisfaction. The extracted dataset does not contain an explicit satisfaction score, so we use READMISSION_RATE as a clinical proxy for satisfaction — the standard HCAHPS literature (Boulding et al., 2011; Tsai et al., 2015) shows a robust negative association between 30-day readmission and patient-reported satisfaction, so a lower readmission rate is read as higher satisfaction. The Limitations section flags this substitution and recommends adding a CSAT / NPS instrument at discharge for the next analytic cycle.
Data Description
tibble [220 × 9] (S3: tbl_df/tbl/data.frame)
$ PATIENT_ID : chr [1:220] "325/25" "223/26" "284/26" "283/26" ...
$ GENDER : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
$ AGE : num [1:220] 40 67 15 10 78 37 54 35 17 31 ...
$ DEPARTMENT : Factor w/ 4 levels "Antenatal Clinic",..: 2 2 2 2 2 2 2 2 2 1 ...
$ DIAGNOSIS : chr [1:220] "DIABETES IN IVF PREGNANCY" "COLORECTAL CA" "UNCOMPLICATED MALARIA" "Hypocalcemia" ...
$ INSURANCE_CATEGORY: Factor w/ 3 levels "private","self_pay",..: 2 1 2 1 1 1 2 2 2 1 ...
$ LENGTH_OF_STAY : num [1:220] NA 5 2 3 3 3 5 3 11 4 ...
$ READMISSION_RATE : num [1:220] 0.2 0.33 0.09 0.11 0.06 0.04 0.1 0.2 0.32 0.1 ...
$ PATIENT_CHARGES : num [1:220] 57000 507000 69000 75000 74000 112000 137000 55000 235000 414000 ...
Missing values per column
| Variable | Missing values |
| --- | --- |
| GENDER | 1 |
| LENGTH_OF_STAY | 1 |
| PATIENT_ID | 0 |
| AGE | 0 |
| DEPARTMENT | 0 |
| DIAGNOSIS | 0 |
| INSURANCE_CATEGORY | 0 |
| READMISSION_RATE | 0 |
| PATIENT_CHARGES | 0 |
Descriptive statistics for numeric variables
| Variable | n | Mean | SD | Min | Median | Max | Skew | Kurtosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AGE | 220 | 34.745 | 20.300 | 0 | 33 | 97 | 0.329 | 0.061 |
| LENGTH_OF_STAY | 219 | 8.128 | 6.759 | 0 | 5 | 34 | 1.582 | 2.402 |
| READMISSION_RATE | 220 | 0.169 | 0.141 | 0.02 | 0.11 | 0.55 | 1.259 | 0.633 |
| PATIENT_CHARGES | 220 | 279,031.818 | 217,682.069 | 25,000 | 203,000 | 1,377,000 | 1.499 | 2.822 |
| Department | n | Share |
| --- | --- | --- |
| General Practice Clinic | 195 | 0.886 |
| Antenatal Clinic | 17 | 0.077 |
| Urology Clinic | 7 | 0.032 |
| One Time Antenatal(Emergency) Clinic | 1 | 0.005 |

| Insurance category | n | Share |
| --- | --- | --- |
| private | 185 | 0.841 |
| self_pay | 25 | 0.114 |
| public | 10 | 0.045 |
Departmental headcount and insurance mix
The dataset is 220 patient episodes across four clinics (General Practice Clinic dominates at 89 %), three insurance tariffs (Private 84 %, Self-pay 11 %, Public 5 %), and 122 distinct diagnoses (long-tailed, with malaria and hypertension most frequent). Two cells are missing (1 GENDER, 1 LENGTH_OF_STAY). The focal outcomes are PATIENT_CHARGES (mean ₦279k, range ₦25k – ₦1.38m) and READMISSION_RATE (mean 0.17, range 0.02 – 0.55).
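A minimal sketch of how a missing-value count and a descriptive table of this kind could be produced in base R. The data frame `episodes` holds only the first six values shown in the str() output above, and the helper `describe_num()` is a hypothetical stand-in, not the report's actual chunk. Its skewness is the simple moment estimator, which may differ slightly from a package default.

```r
# First six rows of three columns, copied from the str() output above
episodes <- data.frame(
  AGE             = c(40, 67, 15, 10, 78, 37),
  LENGTH_OF_STAY  = c(NA, 5, 2, 3, 3, 3),
  PATIENT_CHARGES = c(57000, 507000, 69000, 75000, 74000, 112000)
)

# Missing values per column
missing_counts <- colSums(is.na(episodes))

# Hypothetical helper: n, mean, sd, min, median, max and moment skewness
describe_num <- function(x) {
  x <- x[!is.na(x)]            # summarise complete values only
  m <- mean(x)
  c(n = length(x), mean = m, sd = sd(x), min = min(x),
    median = median(x), max = max(x),
    skew = mean((x - m)^3) / mean((x - m)^2)^1.5)
}

# One row per variable, one column per statistic
summary_table <- t(sapply(episodes, describe_num))

print(missing_counts)
print(round(summary_table, 3))
```

Dropping NA values inside the helper is why LENGTH_OF_STAY reports n = 219 rather than 220 in the full table.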
Analytical Question
How do patient demographics and operational factors (e.g. length of stay, department) influence total patient charges and readmission rates, and what are the key drivers of patient satisfaction within the hospital?
Each of the five techniques contributes one piece of evidence towards this question.
Analysis 1 — Exploratory Data Analysis (EDA)
Theory recap
EDA is the disciplined first look (descriptive statistics, missing-value scans, outlier flags and shape diagnostics such as skew and kurtosis) before any modelling. The combination of summary(), describe() and the Tukey 1.5 × IQR rule answers “what does the patient population look like?” in a single page.
Business justification
The Medical Director needs to know whether the patient mix is homogeneous or polarised, whether outlier charges exist, and whether the data quality is adequate before any operational claim is made.
Code & output
Tukey 1.5×IQR outlier fences and counts
| Variable | Lower fence | Upper fence | Outliers |
| --- | --- | --- | --- |
| AGE | -10.5 | 81.50 | 4 |
| LENGTH_OF_STAY | -7.75 | 22.25 | 9 |
| READMISSION_RATE | -0.17 | 0.47 | 16 |
| PATIENT_CHARGES | -288,000 | 816,000.00 | 5 |
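The Tukey rule behind these fences can be sketched as follows. The helper `tukey_fences()` and the toy vector `los` are illustrative assumptions; the report's own chunk is not shown, and quantile() is used here with its default interpolation.

```r
# Hypothetical helper implementing the Tukey 1.5 x IQR rule
tukey_fences <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  c(lower = lower, upper = upper,
    n_outliers = sum(x < lower | x > upper))
}

# Toy vector: nine short stays and one very long stay (illustrative values)
los <- c(1, 2, 2, 3, 3, 4, 5, 5, 6, 34)
fences <- tukey_fences(los)
print(fences)
```

Because upper minus lower equals 4 × IQR under this rule, the reported fences imply an IQR of (816,000 + 288,000) / 4 = ₦276,000 for PATIENT_CHARGES.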
Interpretation
PATIENT_CHARGES is strongly right-skewed (skew ≈ 1.5) with a long tail of high-charge episodes; this is the operational story we model in Analysis 5 (Regression). LENGTH_OF_STAY is similarly right-skewed (skew ≈ 1.6) and explains much of the charge dispersion. AGE is roughly symmetric (mean 35 yrs, skew ≈ 0.3). Tukey flags 5 outliers on CHARGES, 9 on LOS and 16 on READMISSION. Every lower fence is negative and none of these variables takes negative values, so all flagged points sit in the upper tail. They are real and clinically plausible (long-stay surgical and oncology episodes) and are not candidates for removal.
Analysis 2 — Data Visualisation
Theory recap
A statistic summarises; a chart shows. Five visuals are sufficient to tell the operational story coherently: the distribution of charges, charges by insurance, charges by length of stay, readmission by department, and readmission by age.
Business justification
The Hospital Executive meets monthly. Charts are how findings travel from analytics to the boardroom. The five visuals below are the minimum useful set to show how charges are distributed, where the biggest revenue concentration sits, and whether readmission is operationally explainable.
Interpretation
PATIENT_CHARGES is heavily right-skewed; the boxplots confirm that Private patients carry the highest median (≈ ₦220k) and the widest spread, while Public-tariff patients cluster around ₦130k. The length-of-stay vs charges scatter shows a very strong linear relationship: a single straight line explains most of the variation across the four clinics. The departmental boxplot of readmission shows substantial within-clinic spread but no visible between-clinic shift, foreshadowing the regression’s small R² for the readmission model. Age and gender do not visibly shift readmission either.
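A hedged sketch of how three of the five visuals could be built with ggplot2, which the report cites. The data are simulated, so only the column names follow the extract; the object names `p_hist`, `p_box` and `p_scatter` are assumptions.

```r
library(ggplot2)

# Simulated stand-in data; column names follow the extract, values do not
set.seed(1)
n <- 60
episodes <- data.frame(
  LENGTH_OF_STAY = rpois(n, 6) + 1,
  INSURANCE_CATEGORY = factor(
    sample(c("private", "self_pay", "public"), n, replace = TRUE)
  )
)
episodes$PATIENT_CHARGES <-
  100000 + 25000 * episodes$LENGTH_OF_STAY + rnorm(n, sd = 50000)

# Distribution of charges
p_hist <- ggplot(episodes, aes(x = PATIENT_CHARGES)) +
  geom_histogram(bins = 20)

# Charges by insurance category
p_box <- ggplot(episodes, aes(x = INSURANCE_CATEGORY, y = PATIENT_CHARGES)) +
  geom_boxplot()

# Charges against length of stay, with a fitted straight line
p_scatter <- ggplot(episodes, aes(x = LENGTH_OF_STAY, y = PATIENT_CHARGES)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing any of the three objects (for example `print(p_scatter)`) renders the chart; the readmission charts would follow the same pattern with READMISSION_RATE on the y-axis.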
Analysis 3 — Hypothesis Testing
Theory recap
A hypothesis test states a null (H₀, usually “no effect”) and an alternative (H₁), picks α (here 0.05), computes a test statistic and a p-value, and decides. For three-group continuous comparisons we use one-way ANOVA; for two-group comparisons we use Welch’s two-sample t-test.
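A minimal sketch of the two tests named above, run on simulated data rather than the hospital extract. The data frame `toy` and its column names are illustrative assumptions; aov(), TukeyHSD() and t.test() are base R functions from the stats package.

```r
set.seed(42)

# Simulated stand-in data with the same group structure (three tariffs, two genders)
toy <- data.frame(
  insurance = factor(
    rep(c("private", "self_pay", "public"), times = c(40, 15, 10)),
    levels = c("private", "self_pay", "public")
  ),
  gender = factor(sample(c("Female", "Male"), 65, replace = TRUE))
)
toy$charges     <- rnorm(65, mean = 250000, sd = 80000)
toy$readmission <- runif(65, min = 0.02, max = 0.55)

# One-way ANOVA: does mean charge differ across the three tariffs?
fit_aov   <- aov(charges ~ insurance, data = toy)
anova_tab <- summary(fit_aov)[[1]]          # Df, Sum Sq, Mean Sq, F value, Pr(>F)

# Tukey HSD: which pairs differ, with family-wise adjusted p-values
tukey_tab <- TukeyHSD(fit_aov)$insurance    # diff, lwr, upr, p adj

# Welch two-sample t-test (t.test() does not assume equal variances by default)
welch <- t.test(readmission ~ gender, data = toy)

print(anova_tab)
print(tukey_tab)
print(welch)
```

The omnibus F-test and the Tukey contrasts answer different questions, so a significant ANOVA does not guarantee that any single adjusted pairwise contrast is significant.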
Business justification
The Executive wants binary “is this real or chance?” answers on two specific questions: (1) does the insurance tariff drive different charges? and (2) does readmission differ between male and female patients? Both questions inform tariff negotiation and clinical-pathway design.
Code & output
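| Term | Df | Sum Sq | Mean Sq | F | p-value |
| --- | --- | --- | --- | --- | --- |
| INSURANCE_CATEGORY | 2 | 2.857e+11 | 1.429e+11 | 3.072 | 0.04834 |
| Residuals | 217 | 1.009e+13 | 4.651e+10 | NA | NA |

| Insurance category | n | Mean charge (₦) | SD (₦) |
| --- | --- | --- | --- |
| private | 185 | 293514 | 222799 |
| self_pay | 25 | 224360 | 194758 |
| public | 10 | 147800 | 72825 |

| Tukey HSD contrast | Difference (₦) | 95 % CI lower | 95 % CI upper | Adjusted p |
| --- | --- | --- | --- | --- |
| self_pay-private | -69153.51 | -177596.4 | 39289.33 | 0.291 |
| public-private | -145713.51 | -310939.6 | 19512.56 | 0.096 |
| public-self_pay | -76560.00 | -266979.4 | 113859.38 | 0.610 |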
Hypothesis 1 — patient charges differ across insurance categories (one-way ANOVA)
Hypothesis 2 — readmission rate differs by gender (Welch t-test)
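| Statistic | Value |
| --- | --- |
| Difference in means (Female - Male) | -0.02626 |
| Mean, Female | 0.159 |
| Mean, Male | 0.1852 |
| t | -1.307 |
| p-value | 0.193 |
| Welch df | 159 |
| 95 % CI lower | -0.06593 |
| 95 % CI upper | 0.01341 |
| Method | Welch Two Sample t-test |
| Alternative | two.sided |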
Interpretation
H1 (charges ~ insurance): F = 3.07, p = 0.048, just significant at α = 0.05. The Tukey HSD follow-up, however, does not isolate a significant pair. The largest gap, Public vs Private (about ₦146k less per Public patient on average), has adjusted p = 0.096 and a 95 % interval that spans zero; the Private vs Self-pay gap is not significant either (adjusted p = 0.29). With only 10 Public patients the pairwise test has little power, so the ₦146k gap is a directional finding for the contracts team rather than a confirmed difference.
H2 (readmission ~ gender): t = -1.31 (Female minus Male), p = 0.19, so we fail to reject H₀. Male readmission (0.185) is higher than female (0.159), but the difference is consistent with sampling noise at n = 219. No gender-specific intervention is warranted on this dataset.
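As a consistency check, the one-way ANOVA F statistic can be rebuilt from the group summaries alone. The sketch below assumes the three columns of the group table are n, mean and standard deviation (the column headers did not survive in the output); the variable names are illustrative.

```r
# Group summaries as reported in the ANOVA output (assumed: n, mean, SD)
n_grp    <- c(private = 185,    self_pay = 25,     public = 10)
mean_grp <- c(private = 293514, self_pay = 224360, public = 147800)
sd_grp   <- c(private = 222799, self_pay = 194758, public = 72825)

# Between-group and within-group sums of squares
grand_mean <- sum(n_grp * mean_grp) / sum(n_grp)
ss_between <- sum(n_grp * (mean_grp - grand_mean)^2)
ss_within  <- sum((n_grp - 1) * sd_grp^2)

df_between <- length(n_grp) - 1          # 3 groups - 1
df_within  <- sum(n_grp) - length(n_grp) # 220 patients - 3 groups

# F is the ratio of the two mean squares
f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_val), 4))
```

The sums of squares come out at about 2.857e+11 and 1.009e+13, matching the ANOVA table, which supports reading those columns as n, mean and SD.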
Analysis 4 — Correlation Analysis
Theory recap
Pearson’s correlation summarises pairwise linear relationships among numeric variables, with values in [−1, +1]. A heatmap renders the matrix at a glance; significance for each pair is tested with t = r√(n−2) / √(1−r²) against t with df = n − 2.
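The significance formula can be checked numerically. The helper `cor_p_value()` is a hypothetical name; the inputs are the reported r = 0.253 between READMISSION_RATE and PATIENT_CHARGES and n = 220, assuming all 220 pairs are complete (neither column has missing values).

```r
# Hypothetical helper: t statistic and two-sided p-value for a Pearson r
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
  c(t = t_stat, df = n - 2, p = p)
}

res <- cor_p_value(r = 0.253, n = 220)
print(signif(res, 4))
```

This gives t ≈ 3.86 on 218 degrees of freedom, in line with the p-value of about 0.0002 reported for that pair.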
Business justification
Before fitting a regression we want a ranked list of candidate predictors and a check for redundancy. The matrix tells the analyst which pairs to keep and which signal is mechanical.
Code & output
Pearson r
| | AGE | LENGTH_OF_STAY | READMISSION_RATE | PATIENT_CHARGES |
| --- | --- | --- | --- | --- |
| AGE | 1.000 | 0.068 | 0.044 | 0.082 |
| LENGTH_OF_STAY | 0.068 | 1.000 | 0.066 | 0.819 |
| READMISSION_RATE | 0.044 | 0.066 | 1.000 | 0.253 |
| PATIENT_CHARGES | 0.082 | 0.819 | 0.253 | 1.000 |
p-values
| | AGE | LENGTH_OF_STAY | READMISSION_RATE | PATIENT_CHARGES |
| --- | --- | --- | --- | --- |
| AGE | 0.0000 | 0.3133 | 0.5175 | 0.2257 |
| LENGTH_OF_STAY | 0.3133 | 0.0000 | 0.3323 | 0.0000 |
| READMISSION_RATE | 0.5175 | 0.3323 | 0.0000 | 0.0002 |
| PATIENT_CHARGES | 0.2257 | 0.0000 | 0.0002 | 0.0000 |
Pearson correlation matrix and p-values
Interpretation
- Strongest relationship: LENGTH_OF_STAY ↔︎ PATIENT_CHARGES at r = +0.82, p < 10⁻⁵⁰. Patients who stay longer are charged more, and the relationship is essentially linear. This is the engine of the charges regression.
- Weakest relationship: AGE ↔︎ READMISSION_RATE at r = +0.04 (p = 0.52). AGE ↔︎ LENGTH_OF_STAY is similarly negligible at r = +0.07 (p = 0.31), so age does not drive how long patients stay in this hospital.
- READMISSION_RATE ↔︎ PATIENT_CHARGES at r = +0.25, p < 10⁻³: a weak but statistically significant association. More expensive episodes associate with somewhat higher readmission, although length of stay itself does not (r = +0.07, p = 0.33), and the link is much weaker than the LOS / charges link.
- Managerial implication: among the numeric variables, the only meaningful operational lever on charges is LENGTH_OF_STAY. None of the captured predictors meaningfully explains readmission; the next analytic cycle must extend the dataset (case-mix index, discharge destination, follow-up adherence) to make readmission predictable.
Analysis 5 — Regression
Theory recap
OLS fits y = β₀ + β₁x₁ + β₂x₂ + … + ε by minimising the residual sum of squares. β̂ⱼ is the partial effect of xⱼ on y holding all other predictors constant. R² is the share of variance explained; the F-test asks whether the model explains more variance than a mean-only model. For categorical predictors (insurance, department) we use dummy encoding with one reference level.
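A minimal sketch of OLS with a dummy-encoded categorical predictor, on simulated data rather than the hospital extract. The data frame `toy`, its column names and the simulated effect sizes are assumptions chosen only for illustration; lm(), coef() and model.matrix() are base R.

```r
set.seed(7)
n <- 120

# Simulated stand-in data; the first factor level ("private") is the reference
toy <- data.frame(
  los = rpois(n, 7) + 1,
  insurance = factor(
    sample(c("private", "self_pay", "public"), n, replace = TRUE),
    levels = c("private", "self_pay", "public")
  )
)

# Simulated truth: 25,000 per day, public tariff 150,000 lower than private
toy$charges <- 130000 + 25000 * toy$los -
  150000 * (toy$insurance == "public") + rnorm(n, sd = 40000)

fit <- lm(charges ~ los + insurance, data = toy)

print(coef(summary(fit)))          # estimate, std. error, t value, p-value
print(summary(fit)$r.squared)      # share of variance explained
print(head(model.matrix(fit), 3))  # the 0/1 dummy columns lm() created
```

Each insurance coefficient is read as a difference from the reference level at a fixed length of stay, which is how the tariff effects in the charges model are interpreted.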
Business justification
The research question asks which factors most strongly influence charges and readmission. Two OLS models — one per outcome — answer this directly with comparable, interpretable coefficients and a quantified R² that tells us how much of the story we actually capture.
Code & output
| Term | Estimate | Std. error | t | p-value | 95 % CI lower | 95 % CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 133400.0 | 31150.0 | 4.2810 | 0.0000282 | 71960.0 | 194800.0 |
| AGE | 166.1 | 381.3 | 0.4357 | 0.6635000 | -585.5 | 917.7 |
| LENGTH_OF_STAY | 26850.0 | 1136.0 | 23.6300 | 0.0000000 | 24610.0 | 29090.0 |
| INSURANCE_CATEGORYself_pay | -120600.0 | 24770.0 | -4.8690 | 0.0000022 | -169400.0 | -71770.0 |
| INSURANCE_CATEGORYpublic | -181000.0 | 36580.0 | -4.9470 | 0.0000015 | -253100.0 | -108900.0 |
| DEPARTMENTGeneral Practice Clinic | -63310.0 | 28700.0 | -2.2060 | 0.0284600 | -119900.0 | -6738.0 |
| DEPARTMENTOne Time Antenatal(Emergency) Clinic | 247100.0 | 115800.0 | 2.1330 | 0.0340500 | 18770.0 | 475400.0 |
| DEPARTMENTUrology Clinic | -28200.0 | 51000.0 | -0.5530 | 0.5808000 | -128700.0 | 72330.0 |
Model fit statistics — charges model
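| R² | Adjusted R² | F statistic | p-value | Residual df | Residual SE |
| --- | --- | --- | --- | --- | --- |
| 0.7428 | 0.7343 | 87.05 | 0 (rounded) | 211 | 112200 |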
Model 1 — PATIENT_CHARGES ~ AGE + LENGTH_OF_STAY + INSURANCE + DEPARTMENT
| Term | Estimate | Std. error | t | p-value | 95 % CI lower | 95 % CI upper |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 0.1056000 | 0.0392700 | 2.68800 | 0.007753 | 0.0281600 | 0.183000 |
| AGE | 0.0001325 | 0.0004806 | 0.27580 | 0.783000 | -0.0008149 | 0.001080 |
| LENGTH_OF_STAY | 0.0016650 | 0.0014320 | 1.16300 | 0.246300 | -0.0011580 | 0.004489 |
| INSURANCE_CATEGORYself_pay | -0.0417500 | 0.0312300 | -1.33700 | 0.182700 | -0.1033000 | 0.019810 |
| INSURANCE_CATEGORYpublic | -0.0669500 | 0.0461100 | -1.45200 | 0.148000 | -0.1578000 | 0.023950 |
| DEPARTMENTGeneral Practice Clinic | 0.0567000 | 0.0361800 | 1.56700 | 0.118500 | -0.0146100 | 0.128000 |
| DEPARTMENTOne Time Antenatal(Emergency) Clinic | -0.0033910 | 0.1460000 | -0.02323 | 0.981500 | -0.2912000 | 0.284400 |
| DEPARTMENTUrology Clinic | 0.0760700 | 0.0642900 | 1.18300 | 0.238000 | -0.0506600 | 0.202800 |
Model fit statistics — readmission model
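| R² | Adjusted R² | F statistic | p-value | Residual df | Residual SE |
| --- | --- | --- | --- | --- | --- |
| 0.03607 | 0.004087 | 1.128 | 0.3468 | 211 | 0.1414 |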
Model 2 — READMISSION_RATE ~ AGE + LENGTH_OF_STAY + INSURANCE + DEPARTMENT
Interpretation
Model 1: Patient Charges. R² = 0.743, adjusted R² = 0.734, F-test p ≈ 10⁻⁵⁸, so the model explains roughly three-quarters of the variance in patient charges. The dominant driver is LENGTH_OF_STAY at +₦26,850 per additional day (p < 0.001). Insurance category matters substantially: Public patients incur ₦181k less than Private patients of the same age, length of stay and department (p < 0.001), and Self-pay patients ₦121k less (p < 0.001). Age is not statistically significant (p = 0.66) once the other factors are controlled for.
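To make the partial effects concrete, the reported Model 1 estimates can be turned into fitted charges for a hypothetical patient. The helper `predict_charge()` and the example patient (age 35, General Practice Clinic) are assumptions for illustration; the reference levels (private insurance, Antenatal Clinic) are read off the coefficient table.

```r
# Model 1 estimates as reported (reference: private insurance, Antenatal Clinic)
b <- c(intercept = 133400, age = 166.1, los = 26850,
       self_pay = -120600, public = -181000,
       general_practice = -63310)

# Hypothetical helper: fitted charge for a General Practice Clinic patient
predict_charge <- function(age, los,
                           insurance = c("private", "self_pay", "public")) {
  insurance <- match.arg(insurance)
  tariff <- switch(insurance,
                   private  = 0,
                   self_pay = b[["self_pay"]],
                   public   = b[["public"]])
  b[["intercept"]] + b[["age"]] * age + b[["los"]] * los +
    tariff + b[["general_practice"]]
}

private_8d <- predict_charge(age = 35, los = 8, insurance = "private")
public_8d  <- predict_charge(age = 35, los = 8, insurance = "public")
private_9d <- predict_charge(age = 35, los = 9, insurance = "private")

print(c(private_8d = private_8d, public_8d = public_8d,
        extra_day = private_9d - private_8d))
```

For this patient the fitted charge is ₦290,703.50 on the Private tariff and ₦109,703.50 on the Public tariff, and one extra day adds ₦26,850 regardless of tariff because the model contains no interaction terms.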
Model 2 — Readmission Rate. R² = 0.036, F-test p = 0.35 — the model does not explain readmission. No covariate is significant at α = 0.05. The directional reading is that public-tariff and self-pay patients have slightly lower readmission than private, but the differences are within sampling noise. The honest conclusion is that the structural factors in this extract do not predict readmission; future extracts must include clinical case-mix and follow-up-adherence variables to make readmission modellable.
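The fit statistics of both models can be cross-checked from R² alone. The sketch assumes n = 219 complete cases (one LENGTH_OF_STAY value is missing, and 219 minus 8 estimated parameters gives the reported 211 residual degrees of freedom) and k = 7 slope terms; the helper `fit_check()` is a hypothetical name.

```r
# Hypothetical helper: adjusted R-squared and overall F statistic from R-squared
fit_check <- function(r2, n, k) {
  df_resid <- n - k - 1
  c(adj_r2   = 1 - (1 - r2) * (n - 1) / df_resid,
    F        = (r2 / k) / ((1 - r2) / df_resid),
    df_resid = df_resid)
}

# Assumed: n = 219 complete cases, k = 7 slope terms in each model
charges_fit     <- fit_check(r2 = 0.7428,  n = 219, k = 7)
readmission_fit <- fit_check(r2 = 0.03607, n = 219, k = 7)

print(round(rbind(charges_fit, readmission_fit), 4))
```

Both rows agree with the reported tables to rounding (adjusted R² ≈ 0.734 and F ≈ 87.05 for charges; F ≈ 1.13 for readmission), and the near-zero adjusted R² for readmission shows how little the seven terms add over a mean-only model.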
Integrated Findings
| # | Technique | Key finding |
| --- | --- | --- |
| 1 | EDA | n = 220; right-skewed charges (mean ₦279k, max ₦1.38m); 1 missing GENDER, 1 missing LOS; departmental case-mix dominated by General Practice (89 %) |
| 2 | Visualisation | 5 charts: charges histogram, charges-by-insurance boxplot, LOS-vs-charges scatter (very strong linear pattern), readmission-by-department boxplot, age-vs-readmission scatter |
| 3 | Hypothesis testing | H1: charges differ by insurance in the omnibus test (F = 3.07, p = 0.048), but no Tukey pair is significant (largest gap Private vs Public, adjusted p = 0.096). H2: readmission does not differ by gender (t = -1.31, p = 0.19) |
| 4 | Correlation analysis | r(LOS, CHARGES) = +0.82 (very strong); r(READMISSION, CHARGES) = +0.25 (weak); AGE essentially uncorrelated with everything |
| 5 | Regression | Charges model R² = 0.74 (LOS dominant, insurance second). Readmission model R² = 0.04 (no significant predictors) |
The five techniques converge on a single recommendation: focus operational improvement on length-of-stay management (each additional day adds about ₦26,850 to a patient's charges). For readmission and satisfaction, the available data is silent; the honest next step is to augment the extract with a CSAT instrument at discharge, a case-mix index, comorbidity flags and follow-up-adherence flags before commissioning a deeper analysis.
Limitations & Further Work
- Missing satisfaction variable. The dataset has no explicit patient-satisfaction score, so READMISSION_RATE is used as a clinical proxy. Add a CSAT / NPS instrument at discharge (even a single 0 to 10 question) and re-run Analyses 3 to 5 on the actual outcome.
- Departmental imbalance. General Practice contributes 195 of 220 episodes, so departmental coefficients for the smaller clinics (especially the One Time Antenatal(Emergency) Clinic, n = 1) are statistically underpowered. The smaller-clinic coefficients should be read as directional only.
- No clinical case-mix. A 122-diagnosis long tail is too sparse to encode as fixed effects; the next iteration should fold diagnoses into ICD-10 chapters or a case-mix index.
- Causality. Every coefficient here is observational. The length-of-stay → charges relationship is partly mechanical (longer stays accrue more daily fees), so the LOS-coefficient is best read as an accounting relationship rather than a clinical intervention point. The clinical question — “what shortens LOS without harming outcomes?” — requires a separate quasi-experimental design.
- Cross-sectional data. With only one year of episodes we cannot speak to seasonality or trend. Add at least two more years before publishing a forecast.
References
Boulding, W., Glickman, S. W., Manary, M. P., Schulman, K. A., & Staelin, R. (2011). Relationship between patient satisfaction with inpatient care and hospital readmission within 30 days. The American Journal of Managed Care, 17(1), 41–48.
Tsai, T. C., Orav, E. J., & Jha, A. K. (2015). Patient satisfaction and quality of surgical care in US hospitals. Annals of Surgery, 261(1), 2–8.
Wickham, H., Averick, M., Bryan, J., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.
Robinson, D., Hayes, A., & Couch, S. (2024). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.x.
Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois.
Wei, T., & Simko, V. (2024). R package “corrplot”: Visualisation of a Correlation Matrix.
Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression (3rd ed.). SAGE.
Federal Republic of Nigeria. (2023). Nigeria Data Protection Act. National Assembly.
Adi, B., Mark Analytics. (2025). AI-Powered Data Analytics: a reproducible reporting workflow. https://markanalytics.online/ai-powered-data-analytics/
Appendix — AI Usage Statement
I used Claude (Anthropic) for two specific tasks: (1) drafting the boilerplate scaffold of the Quarto YAML and section headings to match the required submission rubric, and (2) double-checking R lm(), aov() and ggplot2 syntax for the visualisation and regression chunks. The analytical question, the choice of techniques, the decision to use READMISSION_RATE as a defensible clinical proxy for satisfaction, the substantive interpretation of every test and coefficient (including the cautionary note that the LOS → CHARGES link is partly accounting rather than clinical), and the recommendation to enrich the extract with a CSAT instrument and a case-mix index before commissioning a deeper analysis are my independent professional judgement. Every numerical result is computed live in this document on the 220-row Bien-Santé Hospital extract (BienSanteHospitalReport.xlsx) and reproducible end-to-end via quarto render.