This dataset originates from a retrospective study involving pediatric patients who were admitted to Children’s Hospital St. Hedwig in Regensburg, Germany, with symptoms of abdominal pain. For most patients, multiple B-mode abdominal ultrasound images were collected, with the number of views ranging from 1 to 15. These images capture key anatomical areas such as the right lower quadrant, appendix, intestines, lymph nodes, and reproductive organs. In addition to the ultrasound images, the dataset includes comprehensive clinical data such as laboratory results, physical examination findings, clinical scoring metrics (e.g., Alvarado and pediatric appendicitis scores), and expert interpretations of the ultrasound scans. Each patient is also labeled according to three clinical outcomes: final diagnosis (appendicitis vs. no appendicitis), treatment approach (surgical vs. conservative), and disease severity (complicated vs. uncomplicated or no appendicitis). The study received approval from the Ethics Committee of the University of Regensburg (reference numbers 18-1063-101, 18-1063_1-101, and 18-1063_2-101) and was conducted in accordance with all applicable ethical guidelines and regulations.
The dataset is structured into several clinically relevant feature groups that collectively support diagnostic modeling for pediatric appendicitis:
- Demographics and physiological measurements: basic patient characteristics and physiological measurements (e.g., age, sex, height, weight, BMI, body temperature).
- Clinical course and management: information about diagnostic stages and clinical decisions (e.g., length of stay).
- Clinical scores: standardized scores used to assess the likelihood of appendicitis (Alvarado Score, Pediatric Appendicitis Score).
- Laboratory tests: blood- and urine-based diagnostic markers.
- Symptoms and physical examination: binary indicators representing patient-reported symptoms or physical exam findings.
- Ultrasonographic findings: ultrasound results and radiologic signs suggestive of appendicitis.
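As a rough illustration, the tabular columns can be organized along these groups. The assignment below is our own and is not part of the dataset's metadata; only the column names themselves come from the repository.
# Illustrative grouping of the tabular columns (our own assignment, not dataset metadata)
feature_groups = {
    "demographics_and_vitals": ["Age", "Sex", "BMI", "Height", "Weight", "Body_Temperature"],
    "clinical_course_and_management": ["Length_of_Stay", "Management", "Severity", "Diagnosis_Presumptive"],
    "clinical_scores": ["Alvarado_Score", "Paedriatic_Appendicitis_Score"],
    "laboratory_tests": ["WBC_Count", "Neutrophil_Percentage", "Segmented_Neutrophils", "Neutrophilia",
                         "RBC_Count", "Hemoglobin", "RDW", "Thrombocyte_Count", "CRP",
                         "Ketones_in_Urine", "RBC_in_Urine", "WBC_in_Urine"],
    "symptoms_and_exam": ["Migratory_Pain", "Lower_Right_Abd_Pain", "Contralateral_Rebound_Tenderness",
                          "Ipsilateral_Rebound_Tenderness", "Coughing_Pain", "Nausea", "Loss_of_Appetite",
                          "Dysuria", "Stool", "Peritonitis", "Psoas_Sign"],
    "ultrasound_findings": ["US_Performed", "Appendix_on_US", "Appendix_Diameter", "Free_Fluids",
                            "Appendix_Wall_Layers", "Target_Sign", "Appendicolith", "Perfusion",
                            "Perforation", "Surrounding_Tissue_Reaction", "Appendicular_Abscess",
                            "Abscess_Location", "Pathological_Lymph_Nodes", "Lymph_Nodes_Location",
                            "Bowel_Wall_Thickening", "Conglomerate_of_Bowel_Loops", "Ileus",
                            "Coprostasis", "Meteorism", "Enteritis", "Gynecological_Findings"],
}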
Loading libraries and data sources
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ucimlrepo import fetch_ucirepo
# fetch dataset
regensburg_pediatric_appendicitis = fetch_ucirepo(id=938)
# data (as pandas dataframes)
X = regensburg_pediatric_appendicitis.data.features
y = regensburg_pediatric_appendicitis.data.targets
# metadata
print(regensburg_pediatric_appendicitis.metadata)
## {'uci_id': 938, 'name': 'Regensburg Pediatric Appendicitis', 'repository_url': 'https://archive.ics.uci.edu/dataset/938/regensburg+pediatric+appendicitis', 'data_url': 'https://archive.ics.uci.edu/static/public/938/data.csv', 'abstract': 'This repository holds the data from a cohort of pediatric patients with suspected appendicitis admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany, between 2016 and 2021. Each patient has (potentially multiple) ultrasound (US) images, aka views, tabular data comprising laboratory, physical examination, scoring results and ultrasonographic findings extracted manually by the experts, and three target variables, namely, diagnosis, management and severity.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Tabular', 'Image'], 'num_instances': 782, 'num_features': 53, 'feature_types': ['Real', 'Categorical', 'Integer'], 'demographics': ['Age', 'Sex'], 'target_col': ['Management', 'Severity', 'Diagnosis'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2023, 'last_updated': 'Tue Feb 06 2024', 'dataset_doi': '10.5281/zenodo.7669442', 'creators': ['Ricards Marcinkevics', 'Patricia Reis', 'Ugne Klimiene', 'Ece Ozkan', 'Kieran Chin-Cheong', 'Alyssia Paschke', 'Julia Zerres', 'Markus Denzinger', 'David Niederberger', 'S. Wellmann', 'C. Knorr', 'Julia E.'], 'intro_paper': {'ID': 354, 'type': 'NATIVE', 'title': 'Interpretable and Intervenable Ultrasonography-based Machine Learning Models for Pediatric Appendicitis', 'authors': 'Ricards Marcinkevics, Patricia Reis Wolfertstetter, Ugne Klimiene, Ece Ozkan, Kieran Chin-Cheong, Alyssia Paschke, Julia Zerres, Markus Denzinger, David Niederberger, S. Wellmann, C. Knorr, Julia E. Vogt', 'venue': 'Medical Image Analysis', 'year': 2023, 'journal': None, 'DOI': None, 'URL': 'https://arxiv.org/abs/2302.14460v2', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': 'This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. Multiple abdominal B-mode ultrasound images were acquired for most patients, with the number of views varying from 1 to 15. The images depict various regions of interest, such as the abdomen’s right lower quadrant, appendix, intestines, lymph nodes and reproductive organs. Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). The study was approved by the Ethics Committee of the University of Regensburg (no. 18-1063-101, 18-1063_1-101 and 18-1063_2-101) and was performed following applicable guidelines and regulations.', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': None, 'citation': None}, 'external_url': 'https://zenodo.org/records/7669442'}
# variable information
print(regensburg_pediatric_appendicitis.variables)
## name role ... units missing_values
## 0 Age Feature ... years yes
## 1 BMI Feature ... None yes
## 2 Sex Feature ... None yes
## 3 Height Feature ... None yes
## 4 Weight Feature ... None yes
## 5 Length_of_Stay Feature ... None yes
## 6 Management Target ... None yes
## 7 Severity Target ... None yes
## 8 Diagnosis_Presumptive Other ... None yes
## 9 Diagnosis Target ... None yes
## 10 Alvarado_Score Feature ... None yes
## 11 Paedriatic_Appendicitis_Score Feature ... None yes
## 12 Appendix_on_US Feature ... None yes
## 13 Appendix_Diameter Feature ... None yes
## 14 Migratory_Pain Feature ... None yes
## 15 Lower_Right_Abd_Pain Feature ... None yes
## 16 Contralateral_Rebound_Tenderness Feature ... None yes
## 17 Coughing_Pain Feature ... None yes
## 18 Nausea Feature ... None yes
## 19 Loss_of_Appetite Feature ... None yes
## 20 Body_Temperature Feature ... None yes
## 21 WBC_Count Feature ... None yes
## 22 Neutrophil_Percentage Feature ... None yes
## 23 Segmented_Neutrophils Feature ... None yes
## 24 Neutrophilia Feature ... None yes
## 25 RBC_Count Feature ... None yes
## 26 Hemoglobin Feature ... None yes
## 27 RDW Feature ... None yes
## 28 Thrombocyte_Count Feature ... None yes
## 29 Ketones_in_Urine Feature ... None yes
## 30 RBC_in_Urine Feature ... None yes
## 31 WBC_in_Urine Feature ... None yes
## 32 CRP Feature ... None yes
## 33 Dysuria Feature ... None yes
## 34 Stool Feature ... None yes
## 35 Peritonitis Feature ... None yes
## 36 Psoas_Sign Feature ... None yes
## 37 Ipsilateral_Rebound_Tenderness Feature ... None yes
## 38 US_Performed Feature ... None yes
## 39 US_Number Other ... None yes
## 40 Free_Fluids Feature ... None yes
## 41 Appendix_Wall_Layers Feature ... None yes
## 42 Target_Sign Feature ... None yes
## 43 Appendicolith Feature ... None yes
## 44 Perfusion Feature ... None yes
## 45 Perforation Feature ... None yes
## 46 Surrounding_Tissue_Reaction Feature ... None yes
## 47 Appendicular_Abscess Feature ... None yes
## 48 Abscess_Location Feature ... None yes
## 49 Pathological_Lymph_Nodes Feature ... None yes
## 50 Lymph_Nodes_Location Feature ... None yes
## 51 Bowel_Wall_Thickening Feature ... None yes
## 52 Conglomerate_of_Bowel_Loops Feature ... None yes
## 53 Ileus Feature ... None yes
## 54 Coprostasis Feature ... None yes
## 55 Meteorism Feature ... None yes
## 56 Enteritis Feature ... None yes
## 57 Gynecological_Findings Feature ... None yes
##
## [58 rows x 7 columns]
Dataset Overview
Total patients (rows): 782
Total recorded variables: 58 (53 features, 3 targets, 2 columns marked "Other")
Target variables: Management, Severity, and Diagnosis; this analysis focuses on Diagnosis ("appendicitis" vs. "no appendicitis").
y.Diagnosis.value_counts()
## Diagnosis
## appendicitis 463
## no appendicitis 317
## Name: count, dtype: int64
# Basic summary of the dataset
summary_info = {
"Shape": X.shape,
"Columns": X.columns.tolist(),
"Data Types": X.dtypes,
"Missing Values": X.isnull().sum(),
"Sample Rows": X.head()
}
summary_info
## {'Shape': (782, 53), 'Columns': ['Age', 'BMI', 'Sex', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score', 'Paedriatic_Appendicitis_Score', 'Appendix_on_US', 'Appendix_Diameter', 'Migratory_Pain', 'Lower_Right_Abd_Pain', 'Contralateral_Rebound_Tenderness', 'Coughing_Pain', 'Nausea', 'Loss_of_Appetite', 'Body_Temperature', 'WBC_Count', 'Neutrophil_Percentage', 'Segmented_Neutrophils', 'Neutrophilia', 'RBC_Count', 'Hemoglobin', 'RDW', 'Thrombocyte_Count', 'Ketones_in_Urine', 'RBC_in_Urine', 'WBC_in_Urine', 'CRP', 'Dysuria', 'Stool', 'Peritonitis', 'Psoas_Sign', 'Ipsilateral_Rebound_Tenderness', 'US_Performed', 'Free_Fluids', 'Appendix_Wall_Layers', 'Target_Sign', 'Appendicolith', 'Perfusion', 'Perforation', 'Surrounding_Tissue_Reaction', 'Appendicular_Abscess', 'Abscess_Location', 'Pathological_Lymph_Nodes', 'Lymph_Nodes_Location', 'Bowel_Wall_Thickening', 'Conglomerate_of_Bowel_Loops', 'Ileus', 'Coprostasis', 'Meteorism', 'Enteritis', 'Gynecological_Findings'], 'Data Types': Age float64
## BMI float64
## Sex object
## Height float64
## Weight float64
## Length_of_Stay float64
## Alvarado_Score float64
## Paedriatic_Appendicitis_Score float64
## Appendix_on_US object
## Appendix_Diameter float64
## Migratory_Pain object
## Lower_Right_Abd_Pain object
## Contralateral_Rebound_Tenderness object
## Coughing_Pain object
## Nausea object
## Loss_of_Appetite object
## Body_Temperature float64
## WBC_Count float64
## Neutrophil_Percentage float64
## Segmented_Neutrophils float64
## Neutrophilia object
## RBC_Count float64
## Hemoglobin float64
## RDW float64
## Thrombocyte_Count float64
## Ketones_in_Urine object
## RBC_in_Urine object
## WBC_in_Urine object
## CRP float64
## Dysuria object
## Stool object
## Peritonitis object
## Psoas_Sign object
## Ipsilateral_Rebound_Tenderness object
## US_Performed object
## Free_Fluids object
## Appendix_Wall_Layers object
## Target_Sign object
## Appendicolith object
## Perfusion object
## Perforation object
## Surrounding_Tissue_Reaction object
## Appendicular_Abscess object
## Abscess_Location object
## Pathological_Lymph_Nodes object
## Lymph_Nodes_Location object
## Bowel_Wall_Thickening object
## Conglomerate_of_Bowel_Loops object
## Ileus object
## Coprostasis object
## Meteorism object
## Enteritis object
## Gynecological_Findings object
## dtype: object, 'Missing Values': Age 1
## BMI 27
## Sex 2
## Height 26
## Weight 3
## Length_of_Stay 4
## Alvarado_Score 52
## Paedriatic_Appendicitis_Score 52
## Appendix_on_US 5
## Appendix_Diameter 284
## Migratory_Pain 9
## Lower_Right_Abd_Pain 8
## Contralateral_Rebound_Tenderness 15
## Coughing_Pain 16
## Nausea 8
## Loss_of_Appetite 10
## Body_Temperature 7
## WBC_Count 6
## Neutrophil_Percentage 103
## Segmented_Neutrophils 728
## Neutrophilia 50
## RBC_Count 18
## Hemoglobin 18
## RDW 26
## Thrombocyte_Count 18
## Ketones_in_Urine 200
## RBC_in_Urine 206
## WBC_in_Urine 199
## CRP 11
## Dysuria 29
## Stool 17
## Peritonitis 9
## Psoas_Sign 37
## Ipsilateral_Rebound_Tenderness 163
## US_Performed 4
## Free_Fluids 63
## Appendix_Wall_Layers 564
## Target_Sign 644
## Appendicolith 713
## Perfusion 719
## Perforation 701
## Surrounding_Tissue_Reaction 530
## Appendicular_Abscess 697
## Abscess_Location 769
## Pathological_Lymph_Nodes 579
## Lymph_Nodes_Location 661
## Bowel_Wall_Thickening 683
## Conglomerate_of_Bowel_Loops 739
## Ileus 722
## Coprostasis 711
## Meteorism 642
## Enteritis 716
## Gynecological_Findings 756
## dtype: int64, 'Sample Rows': Age BMI Sex ... Meteorism Enteritis Gynecological_Findings
## 0 12.68 16.9 female ... NaN NaN NaN
## 1 14.10 31.9 male ... yes NaN NaN
## 2 14.14 23.3 female ... yes yes NaN
## 3 16.37 20.6 female ... NaN yes NaN
## 4 11.08 16.9 female ... NaN yes NaN
##
## [5 rows x 53 columns]}
numerical_cols = X.select_dtypes(include=['float64','int64']).columns.to_list()
categorical_cols = X.select_dtypes(include=['object']).columns.to_list()
df = X.copy()  # work on an explicit copy so the original feature table stays untouched
df['Diagnosis'] = y['Diagnosis']
df.head()
## Age BMI Sex ... Enteritis Gynecological_Findings Diagnosis
## 0 12.68 16.9 female ... NaN NaN appendicitis
## 1 14.10 31.9 male ... NaN NaN no appendicitis
## 2 14.14 23.3 female ... yes NaN no appendicitis
## 3 16.37 20.6 female ... yes NaN no appendicitis
## 4 11.08 16.9 female ... yes NaN appendicitis
##
## [5 rows x 54 columns]
diagnosis_distribution = df['Diagnosis'].value_counts(dropna=False)
grouped_stats = df.groupby('Diagnosis')[numerical_cols].agg(['mean', 'median', 'std'])
grouped_stats
## Age ... CRP
## mean median std ... mean median std
## Diagnosis ...
## appendicitis 11.082782 11.36 3.557869 ... 44.902188 16.0 68.515050
## no appendicitis 11.720189 11.90 3.459236 ... 11.716561 1.0 24.920434
##
## [2 rows x 51 columns]
The graph below shows that while the average age is similar between children with and without appendicitis, C-reactive protein (CRP) levels are notably higher in the appendicitis group. Patients diagnosed with appendicitis had a mean CRP of 44.9 mg/L compared to just 11.7 mg/L in those without, highlighting CRP’s potential as a strong inflammatory marker for identifying acute appendicitis in pediatric patients.
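As a quick numeric check of these figures, a small sketch pulling the per-diagnosis CRP statistics out of the grouped_stats table computed above:
# Per-diagnosis mean and median CRP from the grouped statistics computed earlier
crp_summary = grouped_stats['CRP'][['mean', 'median']].round(1)
print(crp_summary)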
import seaborn as sns
selected_features = ['WBC_Count', 'CRP', 'Appendix_Diameter']
fig, axes = plt.subplots(1,3, figsize = (18,5))
for idx, feature in enumerate(selected_features):
sns.boxplot(data=df, x='Diagnosis', y= feature, ax=axes[idx])
axes[idx].set_title(f'{feature} by Diagnosis')
axes[idx].set_xlabel('Diagnosis')
axes[idx].set_ylabel(feature)
plt.tight_layout()
Handling Missing Values
In the Regensburg Pediatric Appendicitis dataset, several features exhibit high missingness not because of data entry errors or loss, but because the variables were intentionally left unrecorded under specific clinical conditions. This is known as structured or conditional missingness, where data is only collected if relevant. For instance, variables such as Abscess_Location (98.34% missing), Gynecological_Findings (96.68%), and Conglomerate_of_Bowel_Loops (94.5%) relate to findings that only apply when certain complications are present or when specific tests, typically imaging such as ultrasound or CT, are performed. Similarly, Segmented_Neutrophils is missing in 93.09% of records, suggesting that this detailed blood test was not part of routine labs for most patients. The same applies to other features such as Ileus (92.33%), Perfusion (91.94%), and Appendicolith (91.18%), all of which rely on imaging being performed and relevant findings being observed. These patterns strongly indicate that the data is not missing at random, but rather skipped because the measurement was not clinically indicated or not observed. It should therefore be handled carefully during modeling, for example by creating binary indicators for test execution or by using imputation methods appropriate for informative missingness.
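One way to keep this information visible to a model is scikit-learn's add_indicator option, which appends a binary "was missing" flag alongside each imputed column. Below is a minimal sketch; the column choice and the median strategy are illustrative and not the pipeline used later in this report.
from sklearn.impute import SimpleImputer
# add_indicator=True appends a binary missingness flag for every column containing NaNs
imputer = SimpleImputer(strategy='median', add_indicator=True)
num_cols = ['Appendix_Diameter', 'Segmented_Neutrophils']  # illustrative numeric columns
imputed = imputer.fit_transform(X[num_cols])
print(imputed.shape)  # imputed values plus one indicator column per column with missing entries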
df.info(show_counts=True)
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 782 entries, 0 to 781
## Data columns (total 54 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Age 781 non-null float64
## 1 BMI 755 non-null float64
## 2 Sex 780 non-null object
## 3 Height 756 non-null float64
## 4 Weight 779 non-null float64
## 5 Length_of_Stay 778 non-null float64
## 6 Alvarado_Score 730 non-null float64
## 7 Paedriatic_Appendicitis_Score 730 non-null float64
## 8 Appendix_on_US 777 non-null object
## 9 Appendix_Diameter 498 non-null float64
## 10 Migratory_Pain 773 non-null object
## 11 Lower_Right_Abd_Pain 774 non-null object
## 12 Contralateral_Rebound_Tenderness 767 non-null object
## 13 Coughing_Pain 766 non-null object
## 14 Nausea 774 non-null object
## 15 Loss_of_Appetite 772 non-null object
## 16 Body_Temperature 775 non-null float64
## 17 WBC_Count 776 non-null float64
## 18 Neutrophil_Percentage 679 non-null float64
## 19 Segmented_Neutrophils 54 non-null float64
## 20 Neutrophilia 732 non-null object
## 21 RBC_Count 764 non-null float64
## 22 Hemoglobin 764 non-null float64
## 23 RDW 756 non-null float64
## 24 Thrombocyte_Count 764 non-null float64
## 25 Ketones_in_Urine 582 non-null object
## 26 RBC_in_Urine 576 non-null object
## 27 WBC_in_Urine 583 non-null object
## 28 CRP 771 non-null float64
## 29 Dysuria 753 non-null object
## 30 Stool 765 non-null object
## 31 Peritonitis 773 non-null object
## 32 Psoas_Sign 745 non-null object
## 33 Ipsilateral_Rebound_Tenderness 619 non-null object
## 34 US_Performed 778 non-null object
## 35 Free_Fluids 719 non-null object
## 36 Appendix_Wall_Layers 218 non-null object
## 37 Target_Sign 138 non-null object
## 38 Appendicolith 69 non-null object
## 39 Perfusion 63 non-null object
## 40 Perforation 81 non-null object
## 41 Surrounding_Tissue_Reaction 252 non-null object
## 42 Appendicular_Abscess 85 non-null object
## 43 Abscess_Location 13 non-null object
## 44 Pathological_Lymph_Nodes 203 non-null object
## 45 Lymph_Nodes_Location 121 non-null object
## 46 Bowel_Wall_Thickening 99 non-null object
## 47 Conglomerate_of_Bowel_Loops 43 non-null object
## 48 Ileus 60 non-null object
## 49 Coprostasis 71 non-null object
## 50 Meteorism 140 non-null object
## 51 Enteritis 66 non-null object
## 52 Gynecological_Findings 26 non-null object
## 53 Diagnosis 780 non-null object
## dtypes: float64(17), object(37)
## memory usage: 330.0+ KB
# Calculate missing percentages
missing_percentages = (df.isnull().sum() / len(df)) * 100
missing_percentages = missing_percentages.sort_values(ascending=False)
# Format top 10 missing variables for display
top_missing = missing_percentages.head(15).round(2).astype(str) + '%'
top_missing_df = top_missing.reset_index()
top_missing_df.columns = ['Feature', 'Missing Percentage']
import ace_tools_open as tools
tools.display_dataframe_to_user(name="Top Missing Features", dataframe=top_missing_df)
## Top Missing Features
## Feature Missing Percentage
## 0 Abscess_Location 98.34%
## 1 Gynecological_Findings 96.68%
## 2 Conglomerate_of_Bowel_Loops 94.5%
## 3 Segmented_Neutrophils 93.09%
## 4 Ileus 92.33%
## 5 Perfusion 91.94%
## 6 Enteritis 91.56%
## 7 Appendicolith 91.18%
## 8 Coprostasis 90.92%
## 9 Perforation 89.64%
## 10 Appendicular_Abscess 89.13%
## 11 Bowel_Wall_Thickening 87.34%
## 12 Lymph_Nodes_Location 84.53%
## 13 Target_Sign 82.35%
## 14 Meteorism 82.1%
We create new binary indicator features that flag whether certain high-missingness variables were recorded (i.e., not null) for each patient. These indicators help capture the informative nature of missingness, especially when data is absent due to conditional testing, such as imaging not being performed.
# Create "_reported" indicator features for the top high-missingness variables
df['Appendicular_Abscess_reported'] = df['Appendicular_Abscess'].notnull().astype(int)
df['Abscess_Location_reported'] = df['Abscess_Location'].notnull().astype(int)
df['Conglomerate_of_Bowel_Loops_reported'] = df['Conglomerate_of_Bowel_Loops'].notnull().astype(int)
df['Ileus_reported'] = df['Ileus'].notnull().astype(int)
df['Segmented_Neutrophils_reported'] = df['Segmented_Neutrophils'].notnull().astype(int)
df['Enteritis_reported'] = df['Enteritis'].notnull().astype(int)
#df['Perfusion_reported'] = df['Perfusion'].notnull().astype(int)
df['Appendicolith_reported'] = df['Appendicolith'].notnull().astype(int)
df['Coprostasis_reported'] = df['Coprostasis'].notnull().astype(int)
df['Perforation_reported'] = df['Perforation'].notnull().astype(int)
df['Meteorism_reported'] = df['Meteorism'].notnull().astype(int)
df['Lymph_Nodes_Location_reported'] = df['Lymph_Nodes_Location'].notnull().astype(int)
df['Target_Sign_reported'] = df['Target_Sign'].notnull().astype(int)
df['Bowel_Wall_Thickening_reported'] = df['Bowel_Wall_Thickening'].notnull().astype(int)
Here we compare how often these imaging-related features were reported (i.e., not missing) between patients diagnosed with appendicitis and those without. Each feature (e.g., Target_Sign_reported, Appendicolith_reported) indicates whether a particular ultrasound or imaging finding was recorded.
Overall, patients with appendicitis show a much higher number of reported features across the board, particularly for critical indicators like Target_Sign_reported, Bowel_Wall_Thickening_reported, and Appendicolith_reported. This suggests that imaging was more frequently performed, or yielded more documented findings, in patients with appendicitis, reinforcing the idea that missingness itself is informative and may reflect diagnostic pathways.
# Create features indicating whether a report was available (not null)
report_features = [
'Appendicular_Abscess', 'Abscess_Location', 'Conglomerate_of_Bowel_Loops', 'Ileus',
'Segmented_Neutrophils', 'Enteritis', 'Appendicolith', 'Coprostasis', 'Perforation',
'Meteorism', 'Lymph_Nodes_Location', 'Target_Sign', 'Bowel_Wall_Thickening'
]
# Add binary indicator features for each
for feature in report_features:
reported_col = feature + '_reported'
df[reported_col] = df[feature].notnull().astype(int)
# Subset just the report indicators + diagnosis
reported_cols = [col + '_reported' for col in report_features]
report_summary = df.groupby("Diagnosis")[reported_cols].sum().T
# Plot
report_summary.plot(kind='barh', stacked=False, figsize=(12, 8))
plt.title("Number of Reported Features by Diagnosis")
plt.xlabel("Number of Reports")
plt.ylabel("Feature")
plt.legend(title='Diagnosis')
plt.tight_layout()
plt.show()
When we look at the distribution of sex across the diagnosis groups, we notice that males are more frequently diagnosed with appendicitis, while females are more prevalent in the no-appendicitis group, suggesting a potential sex-based difference in clinical presentation or diagnostic patterns.
# Set plot style
sns.set(style="whitegrid")
# Plot 1: Count of Sex by Diagnosis
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='Diagnosis', hue='Sex')
plt.title("Sex Distribution by Diagnosis")
plt.xlabel("Diagnosis")
plt.ylabel("Count")
plt.legend(title="Sex")
plt.tight_layout()
plt.show()
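To put numbers behind the count plot, a short sketch cross-tabulating sex against diagnosis (proportions within each diagnosis group):
# Share of each sex within the appendicitis and no-appendicitis groups
sex_by_diagnosis = pd.crosstab(df['Diagnosis'], df['Sex'], normalize='index').round(2)
print(sex_by_diagnosis)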
The first plot illustrates the distribution of length of hospital stay by sex, showing that the majority of both male and female pediatric patients stayed between 2 and 4 days, with a peak at 3 days. Male patients had slightly more cases in shorter stays (2–4 days), while females were more evenly distributed across slightly longer stays, though still within a short admission window. The second plot displays the age distribution by sex using grouped age bands. Both males and females were most commonly diagnosed with appendicitis between the ages of 11–15, followed by the 5–10 and 16–20 age ranges. Interestingly, males are slightly more represented in each age group, especially during the peak adolescent years (11–15), which aligns with known clinical trends where appendicitis incidence is marginally higher in adolescent boys. Overall, these plots highlight that both length of stay and age distributions show subtle but important sex-based patterns relevant to diagnosis and hospital resource use.
# Plot 2: Length of Stay counts by Sex
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='Length_of_Stay', hue='Sex')
plt.title("Length of Stay Distribution by Sex")
plt.xlabel("Length of Stay (days)")
plt.ylabel("Count")
plt.legend(title="Sex")
plt.tight_layout()
plt.show()
# Create age category
bins = [0, 5,10,15,20]
labels = ['0-4', '5-10', '11-15', '16-20']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True, include_lowest=True)
# Display count of patients in each age group
age_group_counts = df['Age_Group'].value_counts().sort_index()
age_group_counts_df = age_group_counts.reset_index()
age_group_counts_df.columns = ['Age_Group', 'Count']
import ace_tools_open as tools; tools.display_dataframe_to_user(name="Age Group Counts", dataframe=age_group_counts_df)
## Age Group Counts
## Age_Group Count
## 0 0-4 43
## 1 5-10 209
## 2 11-15 399
## 3 16-20 130
# Plot 3: Age Group counts by Sex
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='Age_Group', hue='Sex')
plt.title("Sex Distribution by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Count")
plt.legend(title="Sex")
plt.tight_layout()
plt.show()
The histogram and boxplot visualizations provide valuable insight into the distribution and variability of numerical features in the pediatric appendicitis dataset. The histograms reveal that variables such as Age, BMI, Height, and Weight follow relatively normal or mildly skewed distributions, while others like CRP, Length of Stay, and RDW exhibit strong right-skewness, indicating that a small number of patients have significantly higher values than the rest. The score-based features like the Alvarado Score and Pediatric Appendicitis Score appear more uniformly distributed due to their discrete nature. Complementing this, the boxplots highlight a wide spread and the presence of multiple outliers in variables like CRP, Length of Stay, and RDW, which may represent clinically severe or atypical cases. These observations underscore the need for appropriate preprocessing steps such as scaling, transformation, or careful outlier treatment before model training, especially in a clinical setting where outliers may carry important diagnostic relevance.
# Plot histograms
num_plots = len(numerical_cols)
cols = 4
rows = (num_plots + cols - 1) // cols
plt.figure(figsize=(cols * 5, rows * 4))
for i, col in enumerate(numerical_cols):
plt.subplot(rows, cols, i + 1)
df[col].dropna().hist(bins=30)
plt.title(col)
plt.tight_layout()
plt.suptitle("Distribution of Numerical Variables", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
# Select only numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Plot boxplots for each numerical variable
num_plots = len(numerical_cols)
cols = 4
rows = (num_plots + cols - 1) // cols
plt.figure(figsize=(cols * 5, rows * 4))
for i, col in enumerate(numerical_cols):
plt.subplot(rows, cols, i + 1)
sns.boxplot(x=df[col], orient='h')
plt.title(col)
plt.tight_layout()
plt.suptitle("Boxplots of Numerical Variables", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
This correlation plot highlights how numerical and indicator features relate to the diagnosis of appendicitis. Variables such as Appendix Diameter, Segmented Neutrophils, and Alvarado Score show strong positive correlations with the diagnosis, indicating their significant predictive value, while others like BMI and Weight show minimal or no association.
# Convert diagnosis to binary
df['Diagnosis_Binary'] = df['Diagnosis'].map({'appendicitis': 1, 'no appendicitis': 0})
# Select only numerical columns and compute correlation
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
correlation_with_diagnosis = df[numerical_cols].corr()['Diagnosis_Binary'].drop('Diagnosis_Binary')
corr_sorted = correlation_with_diagnosis.sort_values()
# Plot with extended height and smaller font
plt.figure(figsize=(10, len(corr_sorted) * 0.5)) # Dynamic height based on number of features
bars = plt.barh(corr_sorted.index, corr_sorted.values, color='skyblue')
plt.title("Correlation of Numerical Features with Diagnosis", fontsize=14)
plt.xlabel("Correlation with Diagnosis (appendicitis = 1)", fontsize=12)
# Add value labels
for bar in bars:
width = bar.get_width()
plt.text(width + 0.01 if width >= 0 else width - 0.05,
bar.get_y() + bar.get_height() / 2,
f'{width:.2f}',
va='center', ha='left' if width >= 0 else 'right', fontsize=9)
plt.xticks(fontsize=10)
## (array([-0.2, -0.1, 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]), [Text(-0.2, 0, '−0.2'), Text(-0.1, 0, '−0.1'), Text(0.0, 0, '0.0'), Text(0.10000000000000003, 0, '0.1'), Text(0.2, 0, '0.2'), Text(0.3, 0, '0.3'), Text(0.4000000000000001, 0, '0.4'), Text(0.5, 0, '0.5'), Text(0.6000000000000001, 0, '0.6'), Text(0.7, 0, '0.7')])
plt.yticks(fontsize=10)
## ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [Text(0, 0, 'BMI'), Text(0, 1, 'Weight'), Text(0, 2, 'Enteritis_reported'), Text(0, 3, 'Lymph_Nodes_Location_reported'), Text(0, 4, 'Meteorism_reported'), Text(0, 5, 'Age'), Text(0, 6, 'Height'), Text(0, 7, 'RBC_Count'), Text(0, 8, 'Segmented_Neutrophils_reported'), Text(0, 9, 'Hemoglobin'), Text(0, 10, 'Thrombocyte_Count'), Text(0, 11, 'Coprostasis_reported'), Text(0, 12, 'RDW'), Text(0, 13, 'Bowel_Wall_Thickening_reported'), Text(0, 14, 'Abscess_Location_reported'), Text(0, 15, 'Conglomerate_of_Bowel_Loops_reported'), Text(0, 16, 'Body_Temperature'), Text(0, 17, 'Target_Sign_reported'), Text(0, 18, 'Ileus_reported'), Text(0, 19, 'Appendicular_Abscess_reported'), Text(0, 20, 'Appendicolith_reported'), Text(0, 21, 'Perforation_reported'), Text(0, 22, 'CRP'), Text(0, 23, 'Paedriatic_Appendicitis_Score'), Text(0, 24, 'Neutrophil_Percentage'), Text(0, 25, 'WBC_Count'), Text(0, 26, 'Length_of_Stay'), Text(0, 27, 'Alvarado_Score'), Text(0, 28, 'Segmented_Neutrophils'), Text(0, 29, 'Appendix_Diameter')])
plt.tight_layout()
plt.show()
# Determine subplot layout
n = len(numerical_cols)
cols = 4
rows = (n + cols - 1) // cols
# Create figure
fig, axes = plt.subplots(rows, cols, figsize=(cols * 4.5, rows * 3))
axes = axes.flatten()
# Plot histograms with log scaling for skewed features
for i, col in enumerate(numerical_cols):
ax = axes[i]
data = df[col].dropna()
# Log-transform highly skewed features for readability
if data.skew() > 2:
data = np.log1p(data)
ax.set_title(f"{col} (log)")
else:
ax.set_title(col)
ax.hist(data, bins=30, color='steelblue', edgecolor='black')
ax.set_ylabel('Frequency')
# Remove any empty plots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
fig.suptitle("Distribution of Numerical Variables", fontsize=16)
plt.subplots_adjust(top=0.94, hspace=0.6, wspace=0.4)
plt.show()
# Select numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Remove outliers using IQR method for each numerical column
df_clean = df.copy()
for col in numerical_cols:
if df_clean[col].isnull().all():
continue # skip columns with all nulls
Q1 = df_clean[col].quantile(0.25)
Q3 = df_clean[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df_clean[(df_clean[col].isnull()) | ((df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound))]
df_clean.head()
## Age BMI ... Age_Group Diagnosis_Binary
## 6 8.98 19.4 ... 5-10 0.0
## 9 14.34 14.9 ... 11-15 1.0
## 10 11.87 15.7 ... 11-15 1.0
## 11 16.28 20.5 ... 16-20 0.0
## 12 9.40 16.6 ... 5-10 0.0
##
## [5 rows x 69 columns]
Model Development
# Required imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, RocCurveDisplay
import matplotlib.pyplot as plt
df_clean = df_clean.dropna(subset=['Age_Group'])
df_clean = df_clean[df_clean['Diagnosis_Binary'].isin([0, 1])]
# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.difference(['Diagnosis'])
df_encoded = df.copy()
label_encoders = {}
for col in categorical_cols:
le = LabelEncoder()
df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
label_encoders[col] = le
df_encoded = df_encoded.dropna(subset=['Age_Group'])
df_encoded = df_encoded[df_encoded['Diagnosis_Binary'].isin([0, 1])]
# Define features and target
X = df_encoded.drop(columns=['Diagnosis_Binary'])
y = df_encoded['Diagnosis_Binary']
In this code, we drop a set of columns from the feature set X because they have been replaced with alternative versions that retain their informational value in a more structured way. Specifically, high-missingness features such as 'Appendicular_Abscess' and 'Target_Sign' were replaced by their corresponding binary "_reported" indicators, which capture whether the information was available at all. Similarly, 'Age_Group' is a binned version of 'Age', and 'Diagnosis' is excluded because it serves as the target variable in the classification task. This step helps streamline the dataset, reduce noise from missing values, and avoid data leakage during model training.
columns_to_drop = [
'Appendicular_Abscess', 'Abscess_Location', 'Conglomerate_of_Bowel_Loops',
'Segmented_Neutrophils', 'Appendicolith', 'Coprostasis', 'Perforation', 'Meteorism',
'Lymph_Nodes_Location', 'Target_Sign', 'Bowel_Wall_Thickening', 'Age_Group', 'Diagnosis'
]
# Reassign to ensure the columns are dropped
X = X.drop(columns=columns_to_drop, errors='ignore')
# Confirm
print("Remaining columns:", X.columns.tolist())
## Remaining columns: ['Age', 'BMI', 'Sex', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score', 'Paedriatic_Appendicitis_Score', 'Appendix_on_US', 'Appendix_Diameter', 'Migratory_Pain', 'Lower_Right_Abd_Pain', 'Contralateral_Rebound_Tenderness', 'Coughing_Pain', 'Nausea', 'Loss_of_Appetite', 'Body_Temperature', 'WBC_Count', 'Neutrophil_Percentage', 'Neutrophilia', 'RBC_Count', 'Hemoglobin', 'RDW', 'Thrombocyte_Count', 'Ketones_in_Urine', 'RBC_in_Urine', 'WBC_in_Urine', 'CRP', 'Dysuria', 'Stool', 'Peritonitis', 'Psoas_Sign', 'Ipsilateral_Rebound_Tenderness', 'US_Performed', 'Free_Fluids', 'Appendix_Wall_Layers', 'Perfusion', 'Surrounding_Tissue_Reaction', 'Pathological_Lymph_Nodes', 'Ileus', 'Enteritis', 'Gynecological_Findings', 'Appendicular_Abscess_reported', 'Abscess_Location_reported', 'Conglomerate_of_Bowel_Loops_reported', 'Ileus_reported', 'Segmented_Neutrophils_reported', 'Enteritis_reported', 'Appendicolith_reported', 'Coprostasis_reported', 'Perforation_reported', 'Meteorism_reported', 'Lymph_Nodes_Location_reported', 'Target_Sign_reported', 'Bowel_Wall_Thickening_reported']
#Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Classifiers to test
classifiers = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(random_state=42),
'Support Vector Machine': SVC(probability=True),
'Decision Tree': DecisionTreeClassifier(random_state=42)
}
# Initialize ROC plot
plt.figure(figsize=(10, 8))
# Evaluate each model
for name, model in classifiers.items():
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
pipeline.fit(X_train, y_train)
## Pipeline(steps=[('scaler', StandardScaler()),
##                 ('classifier', DecisionTreeClassifier(random_state=42))])
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
# Print metrics
print(f"\n--- {name} ---")
##
## --- Decision Tree ---
print(classification_report(y_test, y_pred))
## precision recall f1-score support
##
## 0.0 0.93 0.89 0.91 63
## 1.0 0.93 0.96 0.94 93
##
## accuracy 0.93 156
## macro avg 0.93 0.92 0.93 156
## weighted avg 0.93 0.93 0.93 156
print(f"ROC AUC Score: {roc_auc_score(y_test, y_proba):.4f}")
## ROC AUC Score: 0.9229
This analysis uses four baseline classification models (Logistic Regression, Random Forest, Support Vector Machine (SVM), and Decision Tree) to predict appendicitis using the cleaned and preprocessed dataset. Each model is embedded in a pipeline that applies mean imputation for missing values and standard scaling; the categorical features were label-encoded beforehand.
The evaluation shows that all four models performed well, with the Random Forest achieving the highest overall accuracy (95%) and ROC AUC score (0.9805), followed closely by SVM (ROC AUC: 0.9660) and Logistic Regression (ROC AUC: 0.9636). Decision Tree also performed strongly, though with a slightly lower AUC of 0.8858. These results demonstrate that the models are effectively learning patterns in the data, and Random Forest in particular shows excellent balance between precision and recall, making it a strong candidate for further tuning or deployment.
# Define preprocessing pipeline (imputation + scaling)
preprocessor = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
# Fit and evaluate models
plt.figure(figsize=(10, 8))
for name, model in classifiers.items():
pipeline = Pipeline([
('preprocess', preprocessor),
('classifier', model)
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
# Print classification metrics
print(f"--- {name} ---")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_proba):.4f}\n")
# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc_score(y_test, y_proba):.2f})')
## Pipeline(steps=[('preprocess',
## Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler())])),
## ('classifier', LogisticRegression(max_iter=1000))])
## --- Logistic Regression ---
## precision recall f1-score support
##
## 0.0 0.88 0.83 0.85 63
## 1.0 0.89 0.92 0.91 93
##
## accuracy 0.88 156
## macro avg 0.88 0.88 0.88 156
## weighted avg 0.88 0.88 0.88 156
##
## ROC AUC Score: 0.9636
##
## [<matplotlib.lines.Line2D object at 0x0000021671A19ED0>]
## Pipeline(steps=[('preprocess',
## Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler())])),
## ('classifier', RandomForestClassifier(random_state=42))])
## --- Random Forest ---
## precision recall f1-score support
##
## 0.0 0.95 0.92 0.94 63
## 1.0 0.95 0.97 0.96 93
##
## accuracy 0.95 156
## macro avg 0.95 0.94 0.95 156
## weighted avg 0.95 0.95 0.95 156
##
## ROC AUC Score: 0.9805
##
## [<matplotlib.lines.Line2D object at 0x000002167A055550>]
## Pipeline(steps=[('preprocess',
## Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler())])),
## ('classifier', SVC(probability=True))])
## --- Support Vector Machine ---
## precision recall f1-score support
##
## 0.0 0.89 0.90 0.90 63
## 1.0 0.93 0.92 0.93 93
##
## accuracy 0.92 156
## macro avg 0.91 0.91 0.91 156
## weighted avg 0.92 0.92 0.92 156
##
## ROC AUC Score: 0.9660
##
## [<matplotlib.lines.Line2D object at 0x0000021676E5A950>]
## Pipeline(steps=[('preprocess',
## Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler())])),
## ('classifier', DecisionTreeClassifier(random_state=42))])
## --- Decision Tree ---
## precision recall f1-score support
##
## 0.0 0.91 0.83 0.87 63
## 1.0 0.89 0.95 0.92 93
##
## accuracy 0.90 156
## macro avg 0.90 0.89 0.89 156
## weighted avg 0.90 0.90 0.90 156
##
## ROC AUC Score: 0.8858
##
## [<matplotlib.lines.Line2D object at 0x0000021676D99590>]
# Final plot adjustments
plt.plot([0, 1], [0, 1], 'k--')
## [<matplotlib.lines.Line2D object at 0x0000021671253410>]
plt.title("ROC Curve Comparison")
## Text(0.5, 1.0, 'ROC Curve Comparison')
plt.xlabel("False Positive Rate")
## Text(0.5, 0, 'False Positive Rate')
plt.ylabel("True Positive Rate")
## Text(0, 0.5, 'True Positive Rate')
plt.legend(loc="lower right")
## <matplotlib.legend.Legend object at 0x0000021671964990>
plt.grid(True)
plt.tight_layout()
plt.show()
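For reference, the categorical variables above were label-encoded up front; an alternative is to keep imputation, scaling, and one-hot encoding inside the pipeline with a ColumnTransformer. The sketch below illustrates that idea starting again from the raw feature table; it is not the preprocessing actually used in this report.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Start from the raw (un-encoded) features so the categorical columns are still strings
X_raw = regensburg_pediatric_appendicitis.data.features
y_raw = regensburg_pediatric_appendicitis.data.targets['Diagnosis']
mask = y_raw.notna()  # keep only rows with a recorded diagnosis
X_raw, y_raw = X_raw[mask], y_raw[mask]

numeric_cols_raw = X_raw.select_dtypes(include=['float64', 'int64']).columns
categorical_cols_raw = X_raw.select_dtypes(include=['object']).columns

# Numeric columns: mean imputation + scaling; categorical columns: mode imputation + one-hot encoding
preprocess_ct = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric_cols_raw),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols_raw),
])

clf_ct = Pipeline([('preprocess', preprocess_ct),
                   ('classifier', LogisticRegression(max_iter=1000))])
print(cross_val_score(clf_ct, X_raw, y_raw, cv=5).mean())  # rough cross-validated accuracy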
# Import required libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import pandas as pd
# Define classifiers and their parameter grids
param_grid = {
'Logistic Regression': {
'classifier__C': [0.1, 1, 10],
'classifier__solver': ['lbfgs']
},
'Random Forest': {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [None, 10, 20]
},
'SVM': {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['linear', 'rbf']
},
'Decision Tree': {
'classifier__max_depth': [None, 5, 10],
'classifier__criterion': ['gini', 'entropy']
}
}
# Classifiers to search over
classifiers = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(random_state=42),
'SVM': SVC(probability=True),
'Decision Tree': DecisionTreeClassifier(random_state=42)
}
# Store best models and results
best_models = {}
results = {}
# Run grid search for each classifier
for name, clf in classifiers.items():
print(f"Running GridSearchCV for {name}...")
pipe = Pipeline([
('imputer', SimpleImputer(strategy='mean')), # handles NaNs
('scaler', StandardScaler()), # scales numeric features
('classifier', clf)]) # plugs in the classifier
grid = GridSearchCV(pipe, param_grid[name], cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
best_models[name] = grid.best_estimator_
results[name] = grid.best_score_
print(f"Best score for {name}: {grid.best_score_:.4f}")
print(f"Best parameters: {grid.best_params_}")
print("-" * 40)
## Running GridSearchCV for Logistic Regression...
## GridSearchCV(cv=5,
## estimator=Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler()),
## ('classifier',
## LogisticRegression(max_iter=1000))]),
## n_jobs=-1,
## param_grid={'classifier__C': [0.1, 1, 10],
## 'classifier__solver': ['lbfgs']},
## scoring='accuracy')
## Best score for Logistic Regression: 0.8705
## Best parameters: {'classifier__C': 10, 'classifier__solver': 'lbfgs'}
## ----------------------------------------
## Running GridSearchCV for Random Forest...
## GridSearchCV(cv=5,
## estimator=Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler()),
## ('classifier',
## RandomForestClassifier(random_state=42))]),
## n_jobs=-1,
## param_grid={'classifier__max_depth': [None, 10, 20],
## 'classifier__n_estimators': [100, 200]},
## scoring='accuracy')
## Best score for Random Forest: 0.9141
## Best parameters: {'classifier__max_depth': None, 'classifier__n_estimators': 100}
## ----------------------------------------
## Running GridSearchCV for SVM...
## GridSearchCV(cv=5,
## estimator=Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler()),
## ('classifier', SVC(probability=True))]),
## n_jobs=-1,
## param_grid={'classifier__C': [0.1, 1, 10],
## 'classifier__kernel': ['linear', 'rbf']},
## scoring='accuracy')
## Best score for SVM: 0.8782
## Best parameters: {'classifier__C': 10, 'classifier__kernel': 'linear'}
## ----------------------------------------
## Running GridSearchCV for Decision Tree...
## GridSearchCV(cv=5,
## estimator=Pipeline(steps=[('imputer', SimpleImputer()),
## ('scaler', StandardScaler()),
## ('classifier',
## DecisionTreeClassifier(random_state=42))]),
## n_jobs=-1,
## param_grid={'classifier__criterion': ['gini', 'entropy'],
## 'classifier__max_depth': [None, 5, 10]},
## scoring='accuracy')
## Best score for Decision Tree: 0.9179
## Best parameters: {'classifier__criterion': 'entropy', 'classifier__max_depth': None}
## ----------------------------------------
results
## {'Logistic Regression': np.float64(0.8705128205128204), 'Random Forest': np.float64(0.9141025641025642), 'SVM': np.float64(0.8782051282051283), 'Decision Tree': np.float64(0.9179487179487179)}
To improve model performance, we applied GridSearchCV with predefined hyperparameter grids for four classifiers: Logistic Regression, Random Forest, Support Vector Machine (SVM), and Decision Tree. This process allowed us to identify the best combination of hyperparameters for each model using cross-validation.
The results show that the Decision Tree achieved the highest cross-validation accuracy (0.9179) with criterion='entropy' and unlimited depth, closely followed by the Random Forest (0.9141, with n_estimators=100 and max_depth=None) and the SVM (0.8782, best with a linear kernel and C=10). Logistic Regression performed well too (0.8705) with weaker regularization (C=10). On the held-out test set, these improvements are reflected in the ROC curve, where the Random Forest demonstrated the best AUC (0.98), suggesting it is the most effective classifier for this pediatric appendicitis prediction task.
# Initialize ROC plot
plt.figure(figsize=(10, 8))
## <Figure size 1000x800 with 0 Axes>
# Plot ROC curve for each best model
for name, model in best_models.items():
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.2f})')
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', LogisticRegression(C=10, max_iter=1000))])
## [<matplotlib.lines.Line2D object at 0x0000021675169F10>]
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', RandomForestClassifier(random_state=42))])
## [<matplotlib.lines.Line2D object at 0x0000021676E02290>]
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', SVC(C=10, kernel='linear', probability=True))])
## [<matplotlib.lines.Line2D object at 0x0000021676C37F10>]
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier',
## DecisionTreeClassifier(criterion='entropy', random_state=42))])
## [<matplotlib.lines.Line2D object at 0x000002167514A6D0>]
# Add reference line and labels
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
## [<matplotlib.lines.Line2D object at 0x000002167514A690>]
plt.title("ROC Curve Comparison of Best Models")
## Text(0.5, 1.0, 'ROC Curve Comparison of Best Models')
plt.xlabel("False Positive Rate")
## Text(0.5, 0, 'False Positive Rate')
plt.ylabel("True Positive Rate")
## Text(0, 0.5, 'True Positive Rate')
plt.legend(loc="lower right")
## <matplotlib.legend.Legend object at 0x0000021676BD4390>
plt.grid(True)
plt.tight_layout()
plt.show()
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Plot confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for idx, (name, model) in enumerate(best_models.items()):
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No Appendicitis', 'Appendicitis'])
disp.plot(ax=axes[idx], cmap='Blues', colorbar=False)
axes[idx].set_title(f'{name} - Confusion Matrix')
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', LogisticRegression(C=10, max_iter=1000))])
## <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021676F67C90>
## Text(0.5, 1.0, 'Logistic Regression - Confusion Matrix')
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', RandomForestClassifier(random_state=42))])
## <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021676DF0790>
## Text(0.5, 1.0, 'Random Forest - Confusion Matrix')
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier', SVC(C=10, kernel='linear', probability=True))])
## <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021676D82910>
## Text(0.5, 1.0, 'SVM - Confusion Matrix')
## Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
## ('classifier',
## DecisionTreeClassifier(criterion='entropy', random_state=42))])
## <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021676B040D0>
## Text(0.5, 1.0, 'Decision Tree - Confusion Matrix')
plt.tight_layout()
plt.show()
The Random Forest model demonstrated the highest precision and recall, with the fewest errors, indicating it’s the most reliable for distinguishing between appendicitis and non-appendicitis cases. Other models performed well but showed slightly higher misclassification rates, particularly in distinguishing false positives or negatives. These confusion matrices validate the ROC and accuracy metrics and provide a clearer picture of how each model handles real-world classification errors.
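As a compact wrap-up, a small sketch that gathers the test-set accuracy and ROC AUC of the tuned models (using the best_models, X_test, and y_test objects defined above) into one comparison table:
from sklearn.metrics import accuracy_score, roc_auc_score

# Collect test-set metrics for the tuned models into a single comparison table
rows = []
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    rows.append({'Model': name,
                 'Accuracy': round(accuracy_score(y_test, y_pred), 3),
                 'ROC_AUC': round(roc_auc_score(y_test, y_proba), 3)})
summary_df = pd.DataFrame(rows).sort_values('ROC_AUC', ascending=False)
print(summary_df.to_string(index=False))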