Patient Charges, Readmission & Satisfaction at Bien-Santé Hospital

Exploratory & Inferential Analytics — EDA · Visualisation · Hypothesis Testing · Correlation · Regression

Author

Olumide Banjo — Chief Medical Director

Published

May 19, 2026

1 Executive Summary

We applied five Exploratory & Inferential Analytics techniques (EDA, Visualisation, Hypothesis Testing, Correlation Analysis and Regression) to a 2025 episode-of-care dataset from Bien-Santé Hospital in Lagos (n = 220 patients across 4 clinics). The headline finding is that length of stay is the dominant driver of total patient charges (Pearson r = 0.82; OLS β̂ = ₦26,850 per additional day, p < 0.001), with insurance category the second-largest lever: patients on the Public tariff incur ₦181k less than the Private reference, holding age, length of stay and department constant. A multiple regression on length of stay, age, insurance and department explains 74 % of the variance in charges (R² = 0.74). The corresponding model for readmission rate explains only 4 % of variance, and no covariate reaches significance: the structural factors available in the export do not predict readmission, which points instead at clinical / case-mix variables that were not captured. Because the dataset does not contain an explicit patient-satisfaction score, we use readmission rate as a clinical proxy for satisfaction (low readmission ↔︎ high satisfaction in the HCAHPS literature). The single recommendation is to focus operational improvement on length-of-stay management and to add an explicit patient-satisfaction instrument (e.g. CSAT / NPS at discharge) before the next analytic cycle.

2 Professional Disclosure

I am Emmanuel Nkenwokeneme, Health-Analytics Consultant, advising a private hospital in the Nigerian Healthcare-Services sector. The five techniques in this paper map directly to live operational decisions:

EDA is the disciplined first look that runs at the start of every analytical engagement: distributions, outliers and missingness before any modelling. At Bien-Santé it surfaces the dominance of the General Practice Clinic in the case mix and the wide spread of patient charges from ₦25k to ₦1.38m.
Data Visualisation is how findings travel from the analytics team to the Medical Director and the CFO. Histograms, boxplots and scatterplots are the lingua franca that lets a non-technical audience follow the story in seconds.
Hypothesis Testing separates signal from noise. With 220 patients across heterogeneous departments and three insurance tariffs, formal tests (ANOVA, t-tests) — paired with p-values and effect sizes — keep the operations conversation evidence-based.
Correlation Analysis is the first lens I use to identify candidate drivers and to decide which variables earn a place in the regression model.
Regression is the workhorse for the analytical question. Coefficients, partial effects and R² turn a noisy table of patient episodes into a ranked list of operational levers I can present to the Hospital Executive.

3 Data Collection & Sampling

Field	Value
Source	Bien-Santé Hospital’s electronic medical-record system (EMR), extracted from the billing and discharge modules.
Collection method	Direct workbook export of all closed episodes-of-care in the 2025 financial year.
Sampling frame	All patients with a billed, discharged episode at any of the hospital’s four clinics during the reporting window.
Sample size	n = 220 patients (full census of closed episodes; not a sample).
Time period	Calendar year 2025.
Ethics & consent	All patient-identifying fields are pseudonymised at source (e.g. `325/25`). Data is held under the hospital’s data-protection policy aligned with the Nigeria Data Protection Act (NDPA, 2023). Diagnoses are reported at the encounter level only; no row-level data left hospital systems. The Medical Director has approved the use of the dataset for analytics development.

The dataset is delivered as BienSanteHospitalReport.xlsx, 9 columns × 220 rows.

3.1 A note on the “patient satisfaction” outcome

The analytical question references patient satisfaction. The extracted dataset does not contain an explicit satisfaction score, so we use READMISSION_RATE as a clinical proxy for satisfaction — the standard HCAHPS literature (Boulding et al., 2011; Tsai et al., 2015) shows a robust negative association between 30-day readmission and patient-reported satisfaction, so a lower readmission rate is read as higher satisfaction. The Limitations section flags this substitution and recommends adding a CSAT / NPS instrument at discharge for the next analytic cycle.

4 Data Description

Rows: 220   Columns: 9

tibble [220 × 9] (S3: tbl_df/tbl/data.frame)
 $ PATIENT_ID        : chr [1:220] "325/25" "223/26" "284/26" "283/26" ...
 $ GENDER            : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
 $ AGE               : num [1:220] 40 67 15 10 78 37 54 35 17 31 ...
 $ DEPARTMENT        : Factor w/ 4 levels "Antenatal Clinic",..: 2 2 2 2 2 2 2 2 2 1 ...
 $ DIAGNOSIS         : chr [1:220] "DIABETES IN IVF PREGNANCY" "COLORECTAL CA" "UNCOMPLICATED MALARIA" "Hypocalcemia" ...
 $ INSURANCE_CATEGORY: Factor w/ 3 levels "private","self_pay",..: 2 1 2 1 1 1 2 2 2 1 ...
 $ LENGTH_OF_STAY    : num [1:220] NA 5 2 3 3 3 5 3 11 4 ...
 $ READMISSION_RATE  : num [1:220] 0.2 0.33 0.09 0.11 0.06 0.04 0.1 0.2 0.32 0.1 ...
 $ PATIENT_CHARGES   : num [1:220] 57000 507000 69000 75000 74000 112000 137000 55000 235000 414000 ...

Missing values per column
Column	Missing
GENDER	1
LENGTH_OF_STAY	1
PATIENT_ID	0
AGE	0
DEPARTMENT	0
DIAGNOSIS	0
INSURANCE_CATEGORY	0
READMISSION_RATE	0
PATIENT_CHARGES	0
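
A missingness scan of this kind can be sketched in base R as follows. The data frame `toy` and its values are invented for illustration; the report's own chunk is not shown in the rendered output, so this is an assumption about the approach rather than the author's code.

```r
# Toy data frame standing in for the hospital extract (values are invented).
toy <- data.frame(
  GENDER         = c("Female", NA, "Male", "Female"),
  LENGTH_OF_STAY = c(NA, 5, 2, 3),
  AGE            = c(40, 67, 15, 10),
  stringsAsFactors = FALSE
)

# Count NA cells per column; sort so the incomplete columns come first.
missing_per_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(missing_per_column)

# Rows that would survive listwise deletion if every column entered a model.
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

By default `lm()` drops rows with a missing value in any model variable, which is consistent with the 211 residual degrees of freedom reported later: 219 rows with a known LENGTH_OF_STAY minus 8 coefficients.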

Descriptive statistics for numeric variables
	n	mean	sd	min	median	max	skew	kurtosis
AGE	220	34.745	20.300	0	33	97	0.329	0.061
LENGTH_OF_STAY	219	8.128	6.759	0	5	34	1.582	2.402
READMISSION_RATE	220	0.169	0.141	0.02	0.11	0.55	1.259	0.633
PATIENT_CHARGES	220	279031.818	217682.069	25000	203000	1377000	1.499	2.822
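
For readers who want to see what the skew column measures, here is a minimal moment-based sketch in base R. The helper `skewness()` is hypothetical, and the exact small-sample convention used by `psych::describe()` may differ slightly, so treat the formula as an assumption.

```r
# Moment-based skewness: third central moment divided by the cubed
# sample standard deviation. Positive values indicate a right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  (sum((x - m)^3) / length(x)) / sd(x)^3
}

symmetric_toy   <- c(1, 2, 3, 4, 5)        # invented, perfectly symmetric
right_tail_toy  <- c(1, 1, 2, 2, 3, 20)    # invented, one large value

print(skewness(symmetric_toy))
print(skewness(right_tail_toy))
```

A value near zero (as for AGE) suggests a roughly symmetric distribution, while values above 1 (as for LENGTH_OF_STAY and PATIENT_CHARGES) indicate a pronounced right tail.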

DEPARTMENT	n	share
General Practice Clinic	195	0.886
Antenatal Clinic	17	0.077
Urology Clinic	7	0.032
One Time Antenatal(Emergency) Clinic	1	0.005

INSURANCE_CATEGORY	n	share
private	185	0.841
self_pay	25	0.114
public	10	0.045

Departmental headcount and insurance mix

The dataset is 220 patient episodes across four clinics (General Practice Clinic dominates at 89 %), three insurance tariffs (Private 84 %, Self-pay 11 %, Public 5 %), and 122 distinct diagnoses (long-tailed, with malaria and hypertension most frequent). Two cells are missing (1 GENDER, 1 LENGTH_OF_STAY). The focal outcomes are PATIENT_CHARGES (mean ₦279k, range ₦25k – ₦1.38m) and READMISSION_RATE (mean 0.17, range 0.02 – 0.55).

5 Analytical Question

How do patient demographics and operational factors (e.g. length of stay, department) influence total patient charges and readmission rates, and what are the key drivers of patient satisfaction within the hospital?

Each of the five techniques contributes one piece of evidence towards this question.

6 Analysis 1 — Exploratory Data Analysis (EDA)

6.1 Theory recap

EDA is the disciplined first look (descriptive statistics, missing-value scans, outlier flags and shape diagnostics such as skew and kurtosis) before any modelling. The combination of summary(), describe() and the Tukey 1.5 × IQR rule answers “what does the patient population look like?” in a single page.

6.2 Business justification

The Medical Director needs to know whether the patient mix is homogeneous or polarised, whether outlier charges exist, and whether the data quality is adequate before any operational claim is made.

6.3 Code & output

Tukey 1.5×IQR outlier fences and counts
	lower_fence	upper_fence	n_outliers
AGE	-10.50	81.50	4
LENGTH_OF_STAY	-7.75	22.25	9
READMISSION_RATE	-0.17	0.47	16
PATIENT_CHARGES	-288000.00	816000.00	5
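
The fence calculation behind a table like this can be sketched in base R. The helper `tukey_fences()` and the vector `toy_los` are hypothetical, and the sketch assumes R's default quantile definition (type 7).

```r
# Tukey rule: flag values below Q1 - k*IQR or above Q3 + k*IQR (k = 1.5).
tukey_fences <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  c(lower = lower, upper = upper, n_outliers = sum(x < lower | x > upper))
}

# Invented length-of-stay values with one long-stay episode.
toy_los <- c(1, 2, 2, 3, 3, 4, 5, 5, 6, 30)
fences <- tukey_fences(toy_los)
print(fences)
```

A negative lower fence on a variable that cannot go below zero simply means the rule can only flag high values, which is the situation for all four variables in the table.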

6.4 Interpretation

PATIENT_CHARGES is strongly right-skewed (skew ≈ 1.5) with a long tail of high-charge episodes; this is the operational story we will model in §10. LENGTH_OF_STAY is similarly right-skewed (skew ≈ 1.6) and explains much of the charge dispersion. AGE is roughly symmetric (mean 35 yrs). Tukey flags 5 high outliers on PATIENT_CHARGES, 9 on LENGTH_OF_STAY and 16 on READMISSION_RATE. These are real and clinically plausible (long-stay surgical and oncology episodes), so they are not candidates for removal.

7 Analysis 2 — Data Visualisation

7.1 Theory recap

A statistic summarises; a chart shows. Five visuals are sufficient to tell the operational story coherently: the distribution of charges, charges by insurance, charges by length of stay, readmission by department, and readmission by age.

7.2 Business justification

The Hospital Executive meets monthly. Charts are how findings travel from analytics to the boardroom. The five visuals below are the minimum useful set to show how charges are distributed, where the biggest revenue concentration sits, and whether readmission is operationally explainable.

7.3 Code & output

1. Distribution of total patient charges (₦)

2. Patient charges by insurance category (boxplot, log scale)

3. Length of stay vs patient charges (with OLS line)

4. Readmission rate by department (boxplot)

5. Age vs readmission rate (scatter with OLS line, gender-coloured)
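
The rendered charts are not reproduced here. As a rough illustration of the first three, the sketch below uses base R graphics on simulated data; the report itself cites ggplot2, so the plotting calls, the data frame `toy` and every number in it are assumptions made to keep the example free of package dependencies.

```r
set.seed(1)
n <- 60
# Simulated episodes: charges rise linearly with length of stay plus noise.
toy <- data.frame(
  LENGTH_OF_STAY = rpois(n, lambda = 6) + 1,
  INSURANCE_CATEGORY = factor(
    sample(c("private", "self_pay", "public"), n, replace = TRUE)
  )
)
toy$PATIENT_CHARGES <- 100000 + 25000 * toy$LENGTH_OF_STAY +
  runif(n, min = -30000, max = 30000)

pdf(NULL)  # null device; swap for png("file.png") or pdf("file.pdf") to save

# 1. Distribution of charges.
hist(toy$PATIENT_CHARGES, breaks = 20,
     main = "Distribution of patient charges", xlab = "Charges")

# 2. Charges by insurance category on a log-scaled y axis.
boxplot(PATIENT_CHARGES ~ INSURANCE_CATEGORY, data = toy, log = "y",
        main = "Charges by insurance category", ylab = "Charges (log scale)")

# 3. Length of stay vs charges with the fitted OLS line overlaid.
plot(toy$LENGTH_OF_STAY, toy$PATIENT_CHARGES,
     xlab = "Length of stay (days)", ylab = "Charges")
fit <- lm(PATIENT_CHARGES ~ LENGTH_OF_STAY, data = toy)
abline(fit)

invisible(dev.off())
print(coef(fit))
```

The log scale in the boxplot keeps a few very large episodes from compressing the rest of the distribution, which matters for a right-skewed variable such as charges.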

7.4 Interpretation

PATIENT_CHARGES is heavily right-skewed. The boxplots confirm that Private patients carry the highest median (≈ ₦220k) and the widest spread, while Public-tariff patients cluster around ₦130k. The length-of-stay vs charges scatter shows a very strong linear relationship: a single straight line explains most of the variation across the four clinics. The departmental boxplot of readmission shows substantial within-clinic spread but no visible between-clinic shift, foreshadowing the regression’s small R² for the readmission model. Age and gender do not visibly shift readmission either.

8 Analysis 3 — Hypothesis Testing

8.1 Theory recap

A hypothesis test states a null (H₀, usually “no effect”) and an alternative (H₁), picks α (here 0.05), computes a test statistic and a p-value, and decides. For three-group continuous comparisons we use one-way ANOVA; for two-group comparisons we use Welch’s two-sample t-test.
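
The two tests named above can be sketched with base R's `aov()`, `TukeyHSD()` and `t.test()`. The data are simulated with no built-in group effect, and the column names merely mirror the extract, so the printed numbers are illustrative only.

```r
set.seed(42)
# Simulated episodes; group sizes and values are invented.
toy <- data.frame(
  INSURANCE_CATEGORY = factor(
    rep(c("private", "self_pay", "public"), times = c(30, 15, 10)),
    levels = c("private", "self_pay", "public")
  ),
  GENDER = factor(rep(c("Female", "Male"), length.out = 55))
)
toy$PATIENT_CHARGES  <- rnorm(55, mean = 250000, sd = 80000)
toy$READMISSION_RATE <- runif(55, min = 0.02, max = 0.55)

# One-way ANOVA: do mean charges differ across the three tariffs?
anova_fit <- aov(PATIENT_CHARGES ~ INSURANCE_CATEGORY, data = toy)
print(summary(anova_fit))

# Tukey HSD: which pairs differ, with family-wise adjusted p-values?
tukey <- TukeyHSD(anova_fit)
print(tukey)

# Welch two-sample t-test: does mean readmission differ by gender?
welch <- t.test(READMISSION_RATE ~ GENDER, data = toy, var.equal = FALSE)
print(welch)
```

A pairwise contrast counts as significant at α when its adjusted p-value falls below α, which is equivalent to its lwr to upr interval excluding zero.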

8.2 Business justification

The Executive wants binary “is this real or chance?” answers on two specific questions: (1) does the insurance tariff drive different charges? and (2) does readmission differ between male and female patients? Both questions inform tariff negotiation and clinical-pathway design.

8.3 Code & output

term	df	sumsq	meansq	statistic	p.value
INSURANCE_CATEGORY	2	2.857e+11	1.429e+11	3.072	0.04834
Residuals	217	1.009e+13	4.651e+10	NA	NA


Group means (₦):

INSURANCE_CATEGORY	n	mean_charges	sd_charges
private	185	293514	222799
self_pay	25	224360	194758
public	10	147800	72825


Tukey HSD post-hoc:

	diff	lwr	upr	p adj
self_pay-private	-69153.51	-177596.4	39289.33	0.291
public-private	-145713.51	-310939.6	19512.56	0.096
public-self_pay	-76560.00	-266979.4	113859.38	0.610

Hypothesis 1 — patient charges differ across insurance categories (one-way ANOVA)

Hypothesis 2 — readmission rate differs by gender (Welch t-test)
estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.02626	0.159	0.1852	-1.307	0.193	159	-0.06593	0.01341	Welch Two Sample t-test	two.sided

8.4 Interpretation

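H1 (charges ~ insurance): F = 3.07, p = 0.048, which is just significant at α = 0.05. The Tukey HSD post-hoc test, however, does not isolate any single pair: the largest contrast is Private vs Public (difference ≈ ₦146k, adjusted p ≈ 0.10, confidence interval spanning zero), and the Private vs Self-pay and Self-pay vs Public gaps are also not significant. The evidence for a tariff effect on unadjusted charges is therefore borderline. On average the Public tariff brings in about ₦146k less per patient than the Private tariff, a directional finding for the contracts team that the regression in §10 re-examines with length of stay controlled.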

H2 (readmission ~ gender): t = −1.31, p = 0.19, so we fail to reject H₀. Male readmission (0.185) is higher than female (0.159), but the difference is consistent with sampling noise at n = 219. No gender-specific intervention is warranted on this dataset.

9 Analysis 4 — Correlation Analysis

9.1 Theory recap

Pearson’s correlation summarises pairwise linear relationships among numeric variables, with values in [−1, +1]. A heatmap renders the matrix at a glance; significance for each pair is tested with t = r√(n−2) / √(1−r²) against t with df = n − 2.
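
As a worked example of the significance formula, the sketch below plugs in r = 0.253 and n = 220, the rounded READMISSION_RATE / PATIENT_CHARGES figures from this report, and then checks the hand-rolled version against `cor.test()` on simulated data. The helper `cor_t_test()` is hypothetical.

```r
# t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against t with n - 2 df.
cor_t_test <- function(r, n) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  c(t = t_stat, df = df, p = p_value)
}

# Rounded inputs taken from the correlation matrix in this report.
res <- cor_t_test(r = 0.253, n = 220)
print(res)

# Cross-check against base R's cor.test() on simulated data.
set.seed(7)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
builtin <- cor.test(x, y)
manual  <- cor_t_test(cor(x, y), n = 50)
print(c(builtin = unname(builtin$statistic), manual = manual[["t"]]))
```

Because the input r is rounded to three decimals, the resulting p-value will only approximately match the one printed in the report's p-value table.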

9.2 Business justification

Before fitting a regression we want a ranked list of candidate predictors and a check for redundancy. The matrix tells the analyst which pairs to keep and which signal is mechanical.

9.3 Code & output

Pearson r
	AGE	LENGTH_OF_STAY	READMISSION_RATE	PATIENT_CHARGES
AGE	1.000	0.068	0.044	0.082
LENGTH_OF_STAY	0.068	1.000	0.066	0.819
READMISSION_RATE	0.044	0.066	1.000	0.253
PATIENT_CHARGES	0.082	0.819	0.253	1.000

p-values
	AGE	LENGTH_OF_STAY	READMISSION_RATE	PATIENT_CHARGES
AGE	0.0000	0.3133	0.5175	0.2257
LENGTH_OF_STAY	0.3133	0.0000	0.3323	0.0000
READMISSION_RATE	0.5175	0.3323	0.0000	0.0002
PATIENT_CHARGES	0.2257	0.0000	0.0002	0.0000

Pearson correlation matrix and p-values

9.4 Interpretation

Strongest relationship: LENGTH_OF_STAY ↔︎ PATIENT_CHARGES at r = +0.82, p < 10⁻⁵⁰ — patients who stay longer cost more, and the relationship is essentially linear. This is the engine of the charges regression.
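Weakest relationship: AGE ↔︎ READMISSION_RATE at r = +0.04 (p = 0.52). AGE ↔︎ LENGTH_OF_STAY is almost as weak at r = +0.07 (p = 0.31), so age does not drive how long patients stay in this hospital.
READMISSION_RATE ↔︎ PATIENT_CHARGES at r = +0.25, p < 10⁻³: a weak-to-moderate correlation. More expensive episodes associate with somewhat higher readmission, but the link is much weaker than the LOS / charges link, and LENGTH_OF_STAY itself is not significantly correlated with readmission (r = +0.07, p = 0.33).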
Managerial implication: if we want to control charges, the only meaningful operational lever in this dataset is LENGTH_OF_STAY. None of the captured predictors meaningfully explain readmission; the next analytic cycle must extend the dataset (case-mix index, discharge destination, follow-up adherence) to make readmission predictable.

10 Analysis 5 — Regression

10.1 Theory recap

OLS fits y = β₀ + β₁x₁ + β₂x₂ + … + ε by minimising the residual sum of squares. β̂ⱼ is the partial effect of xⱼ on y holding all other predictors constant. R² is the share of variance explained; the F-test asks whether the model explains more variance than a mean-only model. For categorical predictors (insurance, department) we use dummy encoding with one reference level.
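
Dummy encoding with a reference level can be illustrated with a small `lm()` fit. The data are simulated, with effect sizes chosen arbitrarily near the magnitudes discussed in this report; only the column names are borrowed from the extract.

```r
set.seed(3)
n <- 90
# "private" is listed first, so it becomes the reference level.
toy <- data.frame(
  LENGTH_OF_STAY = sample(1:20, n, replace = TRUE),
  INSURANCE_CATEGORY = factor(
    rep(c("private", "self_pay", "public"), each = 30),
    levels = c("private", "self_pay", "public")
  )
)
# Invented tariff shifts relative to the private reference.
shift <- c(private = 0, self_pay = -120000, public = -180000)
toy$PATIENT_CHARGES <- 130000 + 27000 * toy$LENGTH_OF_STAY +
  unname(shift[as.character(toy$INSURANCE_CATEGORY)]) +
  rnorm(n, sd = 20000)

fit <- lm(PATIENT_CHARGES ~ LENGTH_OF_STAY + INSURANCE_CATEGORY, data = toy)

# The design matrix has one 0/1 column per non-reference level.
print(colnames(model.matrix(fit)))
print(round(coef(fit)))
print(summary(fit)$r.squared)
```

Each dummy coefficient is the expected difference from the reference level at the same values of the other predictors; calling `relevel()` on the factor before fitting changes which level plays that role.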

10.2 Business justification

The research question asks which factors most strongly influence charges and readmission. Two OLS models — one per outcome — answer this directly with comparable, interpretable coefficients and a quantified R² that tells us how much of the story we actually capture.

10.3 Code & output

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	133400.0	31150.0	4.2810	0.0000282	71960.0	194800.0
AGE	166.1	381.3	0.4357	0.6635000	-585.5	917.7
LENGTH_OF_STAY	26850.0	1136.0	23.6300	0.0000000	24610.0	29090.0
INSURANCE_CATEGORYself_pay	-120600.0	24770.0	-4.8690	0.0000022	-169400.0	-71770.0
INSURANCE_CATEGORYpublic	-181000.0	36580.0	-4.9470	0.0000015	-253100.0	-108900.0
DEPARTMENTGeneral Practice Clinic	-63310.0	28700.0	-2.2060	0.0284600	-119900.0	-6738.0
DEPARTMENTOne Time Antenatal(Emergency) Clinic	247100.0	115800.0	2.1330	0.0340500	18770.0	475400.0
DEPARTMENTUrology Clinic	-28200.0	51000.0	-0.5530	0.5808000	-128700.0	72330.0

Model fit statistics — charges model
r.squared	adj.r.squared	statistic	p.value	df.residual	sigma
0.7428	0.7343	87.05	0	211	112200

Model 1 — PATIENT_CHARGES ~ AGE + LENGTH_OF_STAY + INSURANCE + DEPARTMENT

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.1056000	0.0392700	2.68800	0.007753	0.0281600	0.183000
AGE	0.0001325	0.0004806	0.27580	0.783000	-0.0008149	0.001080
LENGTH_OF_STAY	0.0016650	0.0014320	1.16300	0.246300	-0.0011580	0.004489
INSURANCE_CATEGORYself_pay	-0.0417500	0.0312300	-1.33700	0.182700	-0.1033000	0.019810
INSURANCE_CATEGORYpublic	-0.0669500	0.0461100	-1.45200	0.148000	-0.1578000	0.023950
DEPARTMENTGeneral Practice Clinic	0.0567000	0.0361800	1.56700	0.118500	-0.0146100	0.128000
DEPARTMENTOne Time Antenatal(Emergency) Clinic	-0.0033910	0.1460000	-0.02323	0.981500	-0.2912000	0.284400
DEPARTMENTUrology Clinic	0.0760700	0.0642900	1.18300	0.238000	-0.0506600	0.202800

Model fit statistics — readmission model
r.squared	adj.r.squared	statistic	p.value	df.residual	sigma
0.03607	0.004087	1.128	0.3468	211	0.1414

Model 2 — READMISSION_RATE ~ AGE + LENGTH_OF_STAY + INSURANCE + DEPARTMENT

Residuals vs fitted — charges model (linearity / homoscedasticity check)

10.4 Interpretation

Model 1 — Patient Charges. R² = 0.743, adjusted R² = 0.734, F-test p ≈ 10⁻⁵⁸: the model explains roughly three-quarters of the variance in patient charges. The dominant driver is LENGTH_OF_STAY at +₦26,850 per additional day (p < 0.001). Insurance category matters substantially: Public patients incur ₦181k less than Private patients with the same age, length of stay and department (p < 0.001), and Self-pay patients ₦121k less (p < 0.001). Relative to the Antenatal Clinic reference, General Practice Clinic episodes are billed about ₦63k less (p = 0.028); the One Time Antenatal(Emergency) Clinic coefficient rests on a single episode and should not be interpreted. Age is not statistically significant (p = 0.66) once the other factors are controlled for.
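
To make the coefficients concrete, the sketch below turns the rounded Model 1 estimates into a point prediction. It assumes a patient in the reference department (Antenatal Clinic, inferred from the dummy columns in the coefficient table), and the helper `predict_charges()` is hypothetical.

```r
# Rounded estimates copied from the Model 1 coefficient table.
coefs <- c(intercept = 133400, age = 166.1, los = 26850,
           self_pay = -120600, public = -181000)

# Predicted charges for a reference-department patient (assumption).
predict_charges <- function(age, los,
                            insurance = c("private", "self_pay", "public")) {
  insurance <- match.arg(insurance)
  tariff_shift <- switch(insurance,
                         private  = 0,
                         self_pay = coefs[["self_pay"]],
                         public   = coefs[["public"]])
  coefs[["intercept"]] + coefs[["age"]] * age + coefs[["los"]] * los +
    tariff_shift
}

private_5 <- predict_charges(age = 40, los = 5, insurance = "private")
private_6 <- predict_charges(age = 40, los = 6, insurance = "private")
public_5  <- predict_charges(age = 40, los = 5, insurance = "public")

print(private_5)              # 133400 + 6644 + 134250
print(private_6 - private_5)  # effect of one extra day
print(private_5 - public_5)   # Private vs Public gap at equal covariates
```

Since the inputs are rounded to four significant figures, these predictions are approximate, and they carry the residual standard error of about ₦112k reported in the fit statistics.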

Model 2 — Readmission Rate. R² = 0.036, F-test p = 0.35 — the model does not explain readmission. No covariate is significant at α = 0.05. The directional reading is that public-tariff and self-pay patients have slightly lower readmission than private, but the differences are within sampling noise. The honest conclusion is that the structural factors in this extract do not predict readmission; future extracts must include clinical case-mix and follow-up-adherence variables to make readmission modellable.

11 Integrated Findings

Step	Technique	What it produced
1	EDA	n = 220; right-skewed charges (mean ₦279k, max ₦1.38m); 1 missing GENDER, 1 missing LOS; departmental case-mix dominated by General Practice (89 %)
2	Visualisation	5 charts: charges histogram, charges-by-insurance boxplot, LOS-vs-charges scatter (very strong linear pattern), readmission-by-department boxplot, age-vs-readmission scatter
3	Hypothesis testing	H1: charges differ by insurance (F = 3.07, p = 0.048; no Tukey pairwise contrast is significant, the largest gap being Private vs Public at adjusted p ≈ 0.10). H2: readmission does not differ by gender (t = −1.31, p = 0.19)
4	Correlation analysis	r(LOS, CHARGES) = +0.82 (very strong); r(READMISSION, CHARGES) = +0.25; AGE essentially uncorrelated with everything
5	Regression	Charges model R² = 0.74 (LOS dominant, insurance second). Readmission model R² = 0.04 (no significant predictors)

The five techniques converge on a single recommendation: focus operational improvement on length-of-stay management (each additional day adds ₦26,850 to a patient's charges). For readmission and satisfaction, the available data is silent. The honest next step is to augment the extract with a CSAT instrument at discharge, a case-mix index, comorbidity flags and follow-up-adherence flags before commissioning a deeper analysis.

12 Limitations & Further Work

Missing satisfaction variable. The dataset has no explicit patient-satisfaction score, so READMISSION_RATE is used as a clinical proxy. Add a CSAT / NPS instrument at discharge (even a single 0–10 question) and re-run §8–§10 on the actual outcome.
Departmental imbalance. General Practice contributes 195 of 220 episodes, so departmental coefficients for the smaller clinics (especially “One Time Antenatal(Emergency) Clinic”, n = 1) are statistically underpowered. The smaller-clinic coefficients should be read as directional only.
No clinical case-mix. A 122-diagnosis long tail is too sparse to encode as fixed effects; the next iteration should fold diagnoses into ICD-10 chapters or a case-mix index.
Causality. Every coefficient here is observational. The length-of-stay → charges relationship is partly mechanical (longer stays accrue more daily fees), so the LOS-coefficient is best read as an accounting relationship rather than a clinical intervention point. The clinical question — “what shortens LOS without harming outcomes?” — requires a separate quasi-experimental design.
Cross-sectional data. With only one year of episodes we cannot speak to seasonality or trend. Add at least two more years before publishing a forecast.

References

Boulding, W., Glickman, S. W., Manary, M. P., Schulman, K. A., & Staelin, R. (2011). Relationship between patient satisfaction with inpatient care and hospital readmission within 30 days. The American Journal of Managed Care, 17(1), 41–48.
Tsai, T. C., Orav, E. J., & Jha, A. K. (2015). Patient satisfaction and quality of surgical care in US hospitals. Annals of Surgery, 261(1), 2–8.
Wickham, H., Averick, M., Bryan, J., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.
Robinson, D., Hayes, A., & Couch, S. (2024). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.x.
Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois.
Wei, T., & Simko, V. (2024). R package “corrplot”: Visualisation of a Correlation Matrix.
Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression (3rd ed.). SAGE.
Federal Republic of Nigeria. (2023). Nigeria Data Protection Act. National Assembly.
Adi, B., Mark Analytics. (2025). AI-Powered Data Analytics: a reproducible reporting workflow. https://markanalytics.online/ai-powered-data-analytics/

Appendix — AI Usage Statement

I used Claude (Anthropic) for two specific tasks: (1) drafting the boilerplate scaffold of the Quarto YAML and section headings to match the required submission rubric, and (2) double-checking R lm(), aov() and ggplot2 syntax for the visualisation and regression chunks. The analytical question, the choice of techniques, the decision to use READMISSION_RATE as a defensible clinical proxy for satisfaction, the substantive interpretation of every test and coefficient (including the cautionary note that the LOS → CHARGES link is partly accounting rather than clinical), and the recommendation to enrich the extract with a CSAT instrument and a case-mix index before commissioning a deeper analysis are my independent professional judgement. Every numerical result is computed live in this document on the 220-row Bien-Santé Hospital extract (BienSanteHospitalReport.xlsx) and reproducible end-to-end via quarto render.