Thieu Nguyen - May-10-2025
Diabetes Reports | Python¶
Synthea healthcare data is stored across multiple tables linked by key columns. In this post, I outline how to:
- Define specific patient groups such as diabetes patients, inpatients, and deceased patients
- Extract outcomes related to diabetes, not general
- Extract time points for diabetes diagnoses and lab tests
- Aggregate and visualize lab test values, outcomes, and death counts by year
Loading data and packages¶
%pip install pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dt
# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
supplies = pd.read_csv('/Users/nnthieu/SyntheaData/SyntheaCovid19/supplies.csv')
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")
Define Ids for specific patient groups¶
# define diabetes patients Ids
diab_patient_ids = conditions[conditions['CODE'] == 44054006]['PATIENT'].unique()
print("Number of diabetes patients:", diab_patient_ids.size)
Number of diabetes patients: 8555
diab_patients = patients[patients['Id'].isin(diab_patient_ids)]
print(diab_patients.shape)
diab_patients.head(2)
(8555, 25)
Id | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | LAST | SUFFIX | ... | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 1969-03-22 | NaN | 999-12-7459 | S99953931 | X68702974X | Mr. | Stanford577 | Goodwin327 | NaN | ... | Arlington Massachusetts US | 314 Torp Byway | Somerville | Massachusetts | Middlesex County | 2140.0 | 42.440865 | -71.086388 | 1113259.27 | 6304.04 |
65 | 87b26350-3798-4289-a2ee-f6d88145537f | 1940-05-03 | 2010-04-20 | 999-30-7132 | S99937049 | X73545664X | Mrs. | Tia76 | Spinka232 | NaN | ... | Hopedale Massachusetts US | 537 Steuber Heights Unit 8 | Middleborough | Massachusetts | Plymouth County | NaN | 41.855252 | -70.847341 | 1433642.99 | 20767.46 |
2 rows × 25 columns
# extract conditions for diabetes patients
diab_conditions = conditions[conditions['PATIENT'].isin(diab_patient_ids)]
diab_conditions.head(2)
START | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|
391 | 1992-03-21 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 | 410429000 | Cardiac Arrest |
392 | 1992-03-21 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 | 429007001 | History of cardiac arrest (situation) |
# for inpatients
inpatient_ids = encounters[encounters['ENCOUNTERCLASS'] == 'inpatient']['PATIENT'].unique()
diab_inpatient_ids = np.intersect1d(diab_patient_ids, inpatient_ids)
print("number of inpatient patients:", diab_inpatient_ids.size)
diab_inpatient_patients = patients[patients['Id'].isin(diab_inpatient_ids)]
number of inpatient patients: 4174
# for diabetes patient deaths
death_ids = patients[patients.DEATHDATE.notna()].Id.unique()
diab_death_ids = np.intersect1d(diab_patient_ids, death_ids)
print("number of diabetes patient deaths:", diab_death_ids.size)
number of diabetes patient deaths: 2860
Health outcomes related to diabetes¶
Health outcomes are stored in the table 'condition' with date to get disease and other symptoms and outcomes. Two steps to get specific data for these information.
- disab_date contains date to get diabetes (diab_cons = conditions[conditions['CODE'] == 44054006].unique())
- diab_conditions contains all symptoms and outcomes included in 'DESCRIPTION' values (diab_conditions = conditions[conditions['PATIENT'].isin(diab_patient_ids)]). This table contains many 'DESCRIPTION'['value'], create a 'diabetes_related_keywords' to filter the symptoms or outcomes that we want to analyze.
- how to build a results table of symptoms or outcomes
import re
# Make an explicit copy if needed
diab_conditions = diab_conditions.copy()
# Normalize to lowercase using .loc
diab_conditions.loc[:, 'DESCRIPTION'] = diab_conditions['DESCRIPTION'].str.lower()
# Define diabetes-related keywords
diabetes_related_keywords = [
'diabetes',
'prediabetes',
'hyperglycemia',
'hypoglycemia',
'neuropathy',
'nephropathy',
'retinopathy',
'vision',
'foot ulcer',
'amputation',
'cardiovascular',
'metabolic syndrome',
'hypertriglyceridemia',
'obesity',
'hypertension',
'anemia',
'infection',
'slow healing',
'renal failure',
'fatigue',
'blurred vision'
]
# Build regex pattern with word boundaries
pattern = '|'.join([r'\b' + re.escape(word) + r'\b' for word in diabetes_related_keywords])
# Create mask with regex
mask = diab_conditions['DESCRIPTION'].str.contains(pattern, regex=True)
diabetes_outcomes = diab_conditions[mask]
# Further filter rows where 'DESCRIPTION' contains '(disorder)'
disorder_mask = diabetes_outcomes['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)
diabetes_outcomes = diabetes_outcomes[disorder_mask]
diabetes_outcomes['DESCRIPTION'] = diabetes_outcomes['DESCRIPTION'].str.replace(
r'\(disorder\)', '', regex=True).str.strip()
# Calculate value counts and percentages
counts = diabetes_outcomes['DESCRIPTION'].value_counts().head(12)
percentages = round((counts / diab_patient_ids.size) * 100, 2)
# Combine counts and percentages into a DataFrame
diabetes_outcomes = pd.DataFrame({
'DESCRIPTION' : counts.index,
'Count': counts.values,
'Percentage': percentages.values
})
# Display the DataFrame
print(diabetes_outcomes)
DESCRIPTION Count Percentage 0 hypertriglyceridemia 8369 97.83 1 metabolic syndrome x 8290 96.90 2 anemia 6786 79.32 3 hyperglycemia 4445 51.96 4 neuropathy due to type 2 diabetes mellitus 3045 35.59 5 diabetic retinopathy associated with type ii d... 2326 27.19 6 nonproliferative diabetic retinopathy due to t... 1383 16.17 7 microalbuminuria due to type 2 diabetes mellitus 558 6.52 8 macular edema and retinopathy due to type 2 di... 511 5.97 9 proliferative diabetic retinopathy due to type... 350 4.09 10 proteinuria due to type 2 diabetes mellitus 45 0.53 11 blindness due to type 2 diabetes mellitus 15 0.18
counts = diabetes_outcomes['Count']
plt.figure(figsize=(9, 5))
ax = counts.plot(kind='barh', color='teal')
plt.xlabel("Frequency")
plt.title("Top 10 Diabetes-Related Outcomes")
plt.gca().invert_yaxis()
plt.yticks(ticks=range(len(diabetes_outcomes)), labels=diabetes_outcomes['DESCRIPTION'])
# Add value labels inside the bars
for index, value in enumerate(counts):
plt.text(value, index, str(value), va='center', ha='left', fontsize=10)
plt.tight_layout()
plt.show()
diab_conditions = conditions[conditions['CODE'] == 44054006].copy()
diab_conditions.rename(columns = {'START' : 'diab_date'}, inplace=True)
diab_conditions.head(2)
diab_date | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|
393 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes |
577 | 1965-07-09 | NaN | 87b26350-3798-4289-a2ee-f6d88145537f | 7a1ddd9e-6900-433a-aa38-bb7e29558909 | 44054006 | Diabetes |
retino_conditions = diab_conditions.merge(conditions, on='PATIENT')
retino_conditions.head(2)
diab_date | STOP_x | PATIENT | ENCOUNTER_x | CODE_x | DESCRIPTION_x | START | STOP_y | ENCOUNTER_y | CODE_y | DESCRIPTION_y | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes | 1992-03-21 | NaN | fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 | 410429000 | Cardiac Arrest |
1 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes | 1992-03-21 | NaN | fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 | 429007001 | History of cardiac arrest (situation) |
Diabetic retinopathy complications by years of the diabetes diagnosis.¶
# Filter for diabetic retinopathy descriptions
retino_conditions = retino_conditions[(retino_conditions['CODE_y'] == 422034002) |
(retino_conditions['CODE_y'] == 1551000119108)]
# Ensure START and diab_date columns are in datetime format
retino_conditions['START'] = pd.to_datetime(retino_conditions['START'], errors='coerce')
retino_conditions['diab_date'] = pd.to_datetime(retino_conditions['diab_date'], errors='coerce')
# Calculate diab_years
retino_conditions['diab_years'] = (((retino_conditions['START'] - retino_conditions['diab_date']).dt.days) / 365.25).round(2)
retino_conditions.head(2)
diab_date | STOP_x | PATIENT | ENCOUNTER_x | CODE_x | DESCRIPTION_x | START | STOP_y | ENCOUNTER_y | CODE_y | DESCRIPTION_y | diab_years | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes | 2017-04-15 | NaN | 374c2bf5-baaf-4f67-9e02-e4521addb728 | 422034002 | Diabetic retinopathy associated with type II d... | 2.01 |
19 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes | 2020-05-02 | NaN | caa9c2dc-a809-48e1-9d69-e62a719295a4 | 1551000119108 | Nonproliferative diabetic retinopathy due to t... | 5.06 |
# Define age group bins and labels
bins = [0, 10, 20, 30, 40]
labels = ['0-9', '10-19', '20-29', '30+']
# Create age groups
retino_conditions['year_group'] = pd.cut(retino_conditions['diab_years'], bins=bins, labels=labels, right=False)
retino_conditions.pivot_table(
values='DESCRIPTION_y',
index='year_group',
columns='CODE_y',
aggfunc='count',
observed=False
)
CODE_y | 422034002 | 1551000119108 |
---|---|---|
year_group | ||
0-9 | 1474 | 533 |
10-19 | 578 | 504 |
20-29 | 197 | 240 |
30+ | 62 | 73 |
422034002 : Diabetic retinopathy associated with type II diabetes mellitus 1551000119108 : Nonproliferative diabetic retinopathy due to type 2 diabetes mellitus
Diabetic retinopathy happens at most first 19 years of the disease.
Labs values¶
observations['DESCRIPTION'][observations['PATIENT'].isin(diab_inpatient_ids)].value_counts().head()
DESCRIPTION Diastolic Blood Pressure 44903 Systolic Blood Pressure 44903 Body Weight 43576 Respiratory rate 43576 Heart rate 43576 Name: count, dtype: int64
labs_code = ["2339-0", "4548-4", "2571-8", "38483-4", "2947-0", "6298-4"]
# Filter observations for the specified lab codes and inpatient IDs
diab_observations = observations[
observations['CODE'].isin(labs_code) &
observations['PATIENT'].isin(diab_inpatient_ids)
]
diab_observations.head(2)
DATE | PATIENT | ENCOUNTER | CODE | DESCRIPTION | VALUE | UNITS | TYPE | |
---|---|---|---|---|---|---|---|---|
7223 | 2020-05-02 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | caa9c2dc-a809-48e1-9d69-e62a719295a4 | 2339-0 | Glucose | 147.9 | mg/dL | numeric |
7225 | 2020-05-02 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | caa9c2dc-a809-48e1-9d69-e62a719295a4 | 38483-4 | Creatinine | 1.1 | mg/dL | numeric |
To analyze lab tests, use table 'observations' containing information about lab tests, merge with 'diab_date' to create the column 'days' of disease course.
# Ensure 'diab_date' is in datetime format
diab_conditions['diab_date'] = pd.to_datetime(diab_conditions['diab_date'], errors='coerce')
# Filter rows where 'diab_date' year is greater than 2010
diab_conditions = diab_conditions[diab_conditions['diab_date'].dt.year > 2010]
diab_patients_labs = diab_conditions.merge(diab_observations, on='PATIENT')
diab_patients_labs['DATE'] = pd.to_datetime(diab_patients_labs.DATE)
diab_patients_labs['labs_days'] = (diab_patients_labs.DATE - diab_patients_labs.diab_date).dt.days
diab_patients_labs['years'] = round((diab_patients_labs['labs_days'] / 365.25), 2)
diab_patients_labs['VALUE'] = pd.to_numeric(diab_patients_labs['VALUE'], errors='coerce')
diab_patients_labs = diab_patients_labs[diab_patients_labs['years'] > 0]
diab_patients_labs.rename(columns={'CODE_y': 'CODE'}, inplace=True)
diab_patients_labs.rename(columns={'DESCRIPTION_y': 'DESCRIPTION'}, inplace=True)
diab_patients_labs = diab_patients_labs[['diab_date', 'PATIENT', 'CODE', 'DESCRIPTION', 'VALUE', 'years']]
diab_patients_labs['hospitalized'] = diab_patients_labs.PATIENT.isin(inpatient_ids)
diab_patients_labs.head(2)
diab_date | PATIENT | CODE | DESCRIPTION | VALUE | years | hospitalized | |
---|---|---|---|---|---|---|---|
0 | 2015-04-11 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 2339-0 | Glucose | 147.9 | 5.06 | True |
1 | 2015-04-11 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 38483-4 | Creatinine | 1.1 | 5.06 | True |
loinc_to_display = {'CODE = 2339-0': 'Glucose', 'CODE = 4548-4': 'Hemoglobin A1c/Hemoglobin.total in Blood', 'CODE = 2571-8': 'Triglycerides', 'CODE = 38483-4': 'Creatinine', 'CODE = 2947-0': 'Sodium', 'CODE = 6298-4': 'Potassium'}
catplt = sns.catplot(x="years", y="VALUE", kind="box", col='CODE',
col_wrap=2, sharey=False, sharex=False, data = diab_patients_labs[diab_patients_labs['years']<= 3])
# Set the title for the entire figure
catplt.fig.suptitle("Labs Values of Diabetes Patients", y=1.02)
for axis in catplt.fig.axes:
axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
axis.set_title(loinc_to_display[axis.get_title()])
plt.show()
# --- Step 1: Prepare the A1C Data ---
diab_A1C = diab_observations[diab_observations['CODE'] == '4548-4'].copy()
diab_A1C['DATE'] = pd.to_datetime(diab_A1C['DATE'], errors='coerce')
diab_A1C['A1C'] = pd.to_numeric(diab_A1C['VALUE'], errors='coerce')
diab_A1C = diab_A1C[['DATE', 'PATIENT', 'A1C']].dropna()
# --- Step 2: Find Baseline Date per Patient ---
baseline_dates = diab_A1C.groupby('PATIENT')['DATE'].min().reset_index()
baseline_dates.rename(columns={'DATE': 'BASELINE_DATE'}, inplace=True)
# --- Step 3: Merge and Calculate Days Since Baseline ---
diab_A1C = diab_A1C.merge(baseline_dates, on='PATIENT')
diab_A1C['DAYS_SINCE_BASELINE'] = (diab_A1C['DATE'] - diab_A1C['BASELINE_DATE']).dt.days
# --- Step 4: Assign Timepoints ---
def assign_timepoint(days):
if days == 0:
return 'baseline'
elif 80 <= days <= 100:
return '3_month'
elif 160 <= days <= 200:
return '6_month'
else:
return None
diab_A1C['TIMEPOINT'] = diab_A1C['DAYS_SINCE_BASELINE'].apply(assign_timepoint)
diab_A1C = diab_A1C.dropna(subset=['TIMEPOINT'])
# --- Step 5: Aggregate Statistics ---
summary = diab_A1C.groupby('TIMEPOINT')['A1C'].agg(
N='count',
mean='mean',
sd='std'
).reset_index()
# --- Output Results ---
order = ['baseline', '3_month', '6_month']
summary = summary.set_index('TIMEPOINT').loc[order].reset_index()
print(summary)
TIMEPOINT N mean sd 0 baseline 4041 4.868077 1.546044 1 3_month 1120 3.762768 1.004048 2 6_month 1432 3.759567 1.005509
plt.figure(figsize=(10, 6))
sns.boxplot(x='TIMEPOINT', y='A1C', data=diab_A1C, order=['baseline', '3_month', '6_month'])
plt.xticks(rotation=0, ha='right')
plt.title('A1C at Time Points for Diabetes Patients')
plt.xlabel('Time Points')
plt.ylabel('A1C')
plt.tight_layout()
plt.show()
from scipy.stats import f_oneway
# Filter only valid timepoints with enough data
valid_timepoints = diab_A1C['TIMEPOINT'].unique()
groups = [diab_A1C.loc[diab_A1C['TIMEPOINT'] == tp, 'A1C'] for tp in valid_timepoints if diab_A1C.loc[diab_A1C['TIMEPOINT'] == tp, 'A1C'].size > 1]
# Perform one-way ANOVA
anova_result = f_oneway(*groups)
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)
ANOVA F-statistic: 419.73032761941977 ANOVA p-value: 1.767089456558457e-171
Care plans for diabetes patients¶
diab_care = care_plans[care_plans['PATIENT'].isin(diab_patient_ids)]
# Normalize REASONDESCRIPTION to lowercase
diab_care.loc[:, 'REASONDESCRIPTION'] = diab_care['REASONDESCRIPTION'].str.lower()
# Use the existing diabetes-related keywords pattern
reason_mask = diab_care['REASONDESCRIPTION'].str.contains(pattern, regex=True, na=False)
# Filter the medications DataFrame
diabetes_related_reasons = diab_care[reason_mask]
# Calculate value counts and percentages
counts = diabetes_related_reasons['REASONDESCRIPTION'].value_counts()
percentages = round((counts / diab_patient_ids.size) * 100, 2)
# Combine counts and percentages into a DataFrame
diabetes_careplans = pd.DataFrame({
'DESCRIPTION' : counts.index,
'Count': counts.values,
'Percentage': percentages.values
})
diabetes_careplans
DESCRIPTION | Count | Percentage | |
---|---|---|---|
0 | diabetes | 6049 | 70.71 |
1 | hypertension | 5063 | 59.18 |
2 | prediabetes | 2506 | 29.29 |
3 | escherichia coli urinary tract infection | 28 | 0.33 |
4 | diabetes from cystic fibrosis | 16 | 0.19 |
Mortality¶
death_ids = patients[patients.DEATHDATE.notna()].Id.unique()
diab_death_ids = np.intersect1d(diab_patient_ids, death_ids)
diab_conditions = conditions[conditions['CODE'] == 44054006].copy()
diab_conditions.rename(columns = {'START' : 'diab_date'}, inplace=True)
diab_conditions.head(2)
diab_date | STOP | PATIENT | ENCOUNTER | CODE | DESCRIPTION | |
---|---|---|---|---|---|---|
393 | 2015-04-11 | NaN | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 5db5168f-1c3d-45f0-8d54-15d2c04d861a | 44054006 | Diabetes |
577 | 1965-07-09 | NaN | 87b26350-3798-4289-a2ee-f6d88145537f | 7a1ddd9e-6900-433a-aa38-bb7e29558909 | 44054006 | Diabetes |
dp = diab_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
dp = dp[['diab_date', 'PATIENT', 'BIRTHDATE', 'DEATHDATE', 'GENDER']]
dp['diab_date'] = pd.to_datetime(dp['diab_date'])
dp['BIRTHDATE'] = pd.to_datetime(dp['BIRTHDATE'])
dp['DEATHDATE'] = pd.to_datetime(dp['DEATHDATE'], errors='coerce') # Handle NaT for missing dates
dp['admit'] = dp.PATIENT.isin(inpatient_ids)
dp['death'] = dp.DEATHDATE.notna()
dp['death_age'] = ((dp['DEATHDATE'] - dp['BIRTHDATE']).dt.days / 365.25).round(2)
dp['diab_age'] = ((dp['diab_date'] - dp['BIRTHDATE']).dt.days / 365.25).round(2)
print("mean age at death:", dp['death_age'].mean().round(2), "years-old")
dp.head(3)
mean age at death: 69.11 years-old
diab_date | PATIENT | BIRTHDATE | DEATHDATE | GENDER | admit | death | death_age | diab_age | |
---|---|---|---|---|---|---|---|---|---|
0 | 2015-04-11 | a2a8e809-cc34-4f8e-93b8-74e4a87f35ef | 1969-03-22 | NaT | M | True | False | NaN | 46.05 |
1 | 1965-07-09 | 87b26350-3798-4289-a2ee-f6d88145537f | 1940-05-03 | 2010-04-20 | F | True | True | 69.96 | 25.18 |
2 | 1985-05-22 | 30ce94e8-8c50-4588-97f6-b8aa1df0e9d2 | 1957-03-13 | NaT | M | True | False | NaN | 28.19 |
pd.crosstab(dp['GENDER'], dp['death'])
death | False | True |
---|---|---|
GENDER | ||
F | 2862 | 882 |
M | 2833 | 1978 |
# Create a cross-tabulation with percentages
cross_tab = pd.crosstab(dp['GENDER'], dp['death'], normalize='index') * 100
# Display the cross-tabulation
print(cross_tab.round(2))
death False True GENDER F 76.44 23.56 M 58.89 41.11
dp_death = dp[dp['DEATHDATE'].notna()]
dp_death.head(2)
diab_date | PATIENT | BIRTHDATE | DEATHDATE | GENDER | admit | death | death_age | diab_age | |
---|---|---|---|---|---|---|---|---|---|
1 | 1965-07-09 | 87b26350-3798-4289-a2ee-f6d88145537f | 1940-05-03 | 2010-04-20 | F | True | True | 69.96 | 25.18 |
3 | 1988-12-04 | 8dbd714f-fae1-4a21-bdd4-960d3b08f52d | 1964-09-06 | 2014-10-09 | M | True | True | 50.09 | 24.24 |
death = pd.DataFrame(dp_death['death'].value_counts()).reset_index()
death.columns = ['death', 'count']
deaths = death.loc[death['death'] == True, 'count'].iloc[0] # counting the deaths
death_rate = (deaths/diab_patient_ids.size)
print("Diabetes death rate is:", round(death_rate*100, 2), "%")
Diabetes death rate is: 33.43 %
# Filter only the rows where death occurred
diabetes_deaths = dp[dp['death'] == 1]
# Plot the histogram
plt.figure(figsize=(8, 4))
plt.hist(diabetes_deaths['diab_age'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age at Diabetes Diagnosis (Years)')
plt.ylabel('Number of Deaths')
plt.title('Number of Diabetes Deaths by Age at Diagnosis')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
dp.pivot_table('death', index='admit', columns = 'GENDER').round(3)
GENDER | F | M |
---|---|---|
admit | ||
False | 0.133 | 0.287 |
True | 0.337 | 0.548 |
Male patients admitted to hospital have a higher death rate, 70% compared to female.
Diabetes medications¶
Top 14 medications diabetes patients use
diab_meds = medications[medications['PATIENT'].isin(diab_patient_ids)]
count = diab_meds['DESCRIPTION'].value_counts()
percentages = round((count / count.sum()) * 100, 2)
# Combine counts and percentages into a DataFrame
diabetes_meds = pd.DataFrame({
'Medications' : count.index,
'Count': count.values,
'Percentage': percentages.values
})
diabetes_meds.head(14)
Medications | Count | Percentage | |
---|---|---|---|
0 | insulin human isophane 70 UNT/ML / Regular In... | 499654 | 27.56 |
1 | 24 HR Metformin hydrochloride 500 MG Extended ... | 348524 | 19.22 |
2 | Hydrochlorothiazide 25 MG Oral Tablet | 168005 | 9.27 |
3 | amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG ... | 118518 | 6.54 |
4 | Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet | 114742 | 6.33 |
5 | Hydrochlorothiazide 12.5 MG | 59509 | 3.28 |
6 | Digoxin 0.125 MG Oral Tablet | 52711 | 2.91 |
7 | Warfarin Sodium 5 MG Oral Tablet | 52711 | 2.91 |
8 | Verapamil Hydrochloride 40 MG | 50712 | 2.80 |
9 | Nitroglycerin 0.4 MG/ACTUAT Mucosal Spray | 46784 | 2.58 |
10 | Simvastatin 20 MG Oral Tablet | 46340 | 2.56 |
11 | Amlodipine 5 MG Oral Tablet | 43117 | 2.38 |
12 | NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ... | 34975 | 1.93 |
13 | 120 ACTUAT Fluticasone propionate 0.044 MG/ACT... | 33773 | 1.86 |
# use function size()
diab_meds.groupby('DESCRIPTION').size().reset_index(name='count').sort_values('count', ascending=False).head()
DESCRIPTION | count | |
---|---|---|
157 | insulin human isophane 70 UNT/ML / Regular In... | 499654 |
29 | 24 HR Metformin hydrochloride 500 MG Extended ... | 348524 |
87 | Hydrochlorothiazide 25 MG Oral Tablet | 168005 |
145 | amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG ... | 118518 |
54 | Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet | 114742 |
Hospital stay length¶
diab_encounters = encounters[encounters['ENCOUNTERCLASS'] == 'inpatient']
diab_encounters = diab_encounters.copy()
diab_encounters['stay_days'] = (
pd.to_datetime(diab_encounters['STOP']) - pd.to_datetime(diab_encounters['START'])
) / np.timedelta64(1, 'D')
diab_encounters['stay_days'].describe()
count 95273.000000 mean 16.316658 std 333.316660 min 1.000000 25% 1.000000 50% 1.041667 75% 4.029167 max 22248.147222 Name: stay_days, dtype: float64
Average hospital stay days of diabetes patients is: 77.9 days
# Remove rows with the abnormal maximum stay_days value
diab_encounters = diab_encounters[diab_encounters['stay_days'] != diab_encounters['stay_days'].max()]
(diab_encounters['stay_days'].sum() / diab_inpatient_ids.size).round(2)
np.float64(367.1)