Thieu Nguyen - May-10-2025

Diabetes Reports | Python¶

Synthea healthcare data is stored across multiple tables linked by key columns. In this post, I outline how to:

  • Define specific patient groups such as diabetes patients, inpatients, and deceased patients
  • Extract outcomes related to diabetes, not general
  • Extract time points for diabetes diagnoses and lab tests
  • Aggregate and visualize lab test values, outcomes, and death counts by year

Loading data and packages¶

In [ ]:
%pip install pandas numpy matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime as dt
In [66]:
# Load the data
conditions = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/conditions.csv")
patients = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/patients.csv")
observations = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/observations.csv")
care_plans = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/careplans.csv")
encounters = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/encounters.csv")
supplies = pd.read_csv('/Users/nnthieu/SyntheaData/SyntheaCovid19/supplies.csv')
procedures = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/procedures.csv")
medications = pd.read_csv("/Users/nnthieu/SyntheaData/SyntheaCovid19/medications.csv")

Define Ids for specific patient groups¶

In [67]:
# define diabetes patients Ids
diab_patient_ids = conditions[conditions['CODE'] == 44054006]['PATIENT'].unique()
print("Number of diabetes patients:", diab_patient_ids.size)
Number of diabetes patients: 8555
In [68]:
diab_patients = patients[patients['Id'].isin(diab_patient_ids)]
print(diab_patients.shape)
diab_patients.head(2)
(8555, 25)
Out[68]:
Id BIRTHDATE DEATHDATE SSN DRIVERS PASSPORT PREFIX FIRST LAST SUFFIX ... BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
43 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 1969-03-22 NaN 999-12-7459 S99953931 X68702974X Mr. Stanford577 Goodwin327 NaN ... Arlington Massachusetts US 314 Torp Byway Somerville Massachusetts Middlesex County 2140.0 42.440865 -71.086388 1113259.27 6304.04
65 87b26350-3798-4289-a2ee-f6d88145537f 1940-05-03 2010-04-20 999-30-7132 S99937049 X73545664X Mrs. Tia76 Spinka232 NaN ... Hopedale Massachusetts US 537 Steuber Heights Unit 8 Middleborough Massachusetts Plymouth County NaN 41.855252 -70.847341 1433642.99 20767.46

2 rows × 25 columns

In [69]:
# extract conditions for diabetes patients
diab_conditions = conditions[conditions['PATIENT'].isin(diab_patient_ids)]
diab_conditions.head(2)
Out[69]:
START STOP PATIENT ENCOUNTER CODE DESCRIPTION
391 1992-03-21 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 410429000 Cardiac Arrest
392 1992-03-21 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 429007001 History of cardiac arrest (situation)
In [70]:
# for inpatients
inpatient_ids = encounters[encounters['ENCOUNTERCLASS'] == 'inpatient']['PATIENT'].unique()
diab_inpatient_ids = np.intersect1d(diab_patient_ids, inpatient_ids)
print("number of inpatient patients:", diab_inpatient_ids.size)
diab_inpatient_patients = patients[patients['Id'].isin(diab_inpatient_ids)]
number of inpatient patients: 4174
In [71]:
# for diabetes patient deaths
death_ids = patients[patients.DEATHDATE.notna()].Id.unique()
diab_death_ids = np.intersect1d(diab_patient_ids, death_ids)
print("number of diabetes patient deaths:", diab_death_ids.size)
number of diabetes patient deaths: 2860

Health outcomes related to diabetes¶

Health outcomes are stored in the table 'condition' with date to get disease and other symptoms and outcomes. Two steps to get specific data for these information.

  • disab_date contains date to get diabetes (diab_cons = conditions[conditions['CODE'] == 44054006].unique())
  • diab_conditions contains all symptoms and outcomes included in 'DESCRIPTION' values (diab_conditions = conditions[conditions['PATIENT'].isin(diab_patient_ids)]). This table contains many 'DESCRIPTION'['value'], create a 'diabetes_related_keywords' to filter the symptoms or outcomes that we want to analyze.
  • how to build a results table of symptoms or outcomes
In [72]:
import re
# Make an explicit copy if needed
diab_conditions = diab_conditions.copy()

# Normalize to lowercase using .loc
diab_conditions.loc[:, 'DESCRIPTION'] = diab_conditions['DESCRIPTION'].str.lower()

# Define diabetes-related keywords
diabetes_related_keywords = [
    'diabetes',
    'prediabetes',
    'hyperglycemia',
    'hypoglycemia',
    'neuropathy',
    'nephropathy',
    'retinopathy',
    'vision',
    'foot ulcer',
    'amputation',
    'cardiovascular',
    'metabolic syndrome',
    'hypertriglyceridemia',
    'obesity',
    'hypertension',
    'anemia',
    'infection',
    'slow healing',
    'renal failure',
    'fatigue',
    'blurred vision'
]

# Build regex pattern with word boundaries
pattern = '|'.join([r'\b' + re.escape(word) + r'\b' for word in diabetes_related_keywords])

# Create mask with regex
mask = diab_conditions['DESCRIPTION'].str.contains(pattern, regex=True)
diabetes_outcomes = diab_conditions[mask]

# Further filter rows where 'DESCRIPTION' contains '(disorder)'
disorder_mask = diabetes_outcomes['DESCRIPTION'].str.contains(r'\(disorder\)', na=False)
diabetes_outcomes = diabetes_outcomes[disorder_mask]
diabetes_outcomes['DESCRIPTION'] = diabetes_outcomes['DESCRIPTION'].str.replace(
    r'\(disorder\)', '', regex=True).str.strip()

# Calculate value counts and percentages
counts = diabetes_outcomes['DESCRIPTION'].value_counts().head(12)
percentages = round((counts / diab_patient_ids.size) * 100, 2)

# Combine counts and percentages into a DataFrame
diabetes_outcomes = pd.DataFrame({
    'DESCRIPTION' : counts.index,
    'Count': counts.values,
    'Percentage': percentages.values
    })

# Display the DataFrame
print(diabetes_outcomes)
                                          DESCRIPTION  Count  Percentage
0                                hypertriglyceridemia   8369       97.83
1                                metabolic syndrome x   8290       96.90
2                                              anemia   6786       79.32
3                                       hyperglycemia   4445       51.96
4          neuropathy due to type 2 diabetes mellitus   3045       35.59
5   diabetic retinopathy associated with type ii d...   2326       27.19
6   nonproliferative diabetic retinopathy due to t...   1383       16.17
7    microalbuminuria due to type 2 diabetes mellitus    558        6.52
8   macular edema and retinopathy due to type 2 di...    511        5.97
9   proliferative diabetic retinopathy due to type...    350        4.09
10        proteinuria due to type 2 diabetes mellitus     45        0.53
11          blindness due to type 2 diabetes mellitus     15        0.18
In [73]:
counts = diabetes_outcomes['Count']

plt.figure(figsize=(9, 5))
ax = counts.plot(kind='barh', color='teal')
plt.xlabel("Frequency")
plt.title("Top 10 Diabetes-Related Outcomes")
plt.gca().invert_yaxis()
plt.yticks(ticks=range(len(diabetes_outcomes)), labels=diabetes_outcomes['DESCRIPTION'])

# Add value labels inside the bars
for index, value in enumerate(counts):
    plt.text(value, index, str(value), va='center', ha='left', fontsize=10)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [74]:
diab_conditions = conditions[conditions['CODE'] == 44054006].copy()
diab_conditions.rename(columns = {'START' : 'diab_date'}, inplace=True)
diab_conditions.head(2)
Out[74]:
diab_date STOP PATIENT ENCOUNTER CODE DESCRIPTION
393 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes
577 1965-07-09 NaN 87b26350-3798-4289-a2ee-f6d88145537f 7a1ddd9e-6900-433a-aa38-bb7e29558909 44054006 Diabetes
In [75]:
retino_conditions = diab_conditions.merge(conditions, on='PATIENT')
retino_conditions.head(2)
Out[75]:
diab_date STOP_x PATIENT ENCOUNTER_x CODE_x DESCRIPTION_x START STOP_y ENCOUNTER_y CODE_y DESCRIPTION_y
0 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes 1992-03-21 NaN fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 410429000 Cardiac Arrest
1 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes 1992-03-21 NaN fdd8a4d9-5fab-4b76-9e2e-4d16753e1703 429007001 History of cardiac arrest (situation)

Diabetic retinopathy complications by years of the diabetes diagnosis.¶

In [76]:
# Filter for diabetic retinopathy descriptions
retino_conditions = retino_conditions[(retino_conditions['CODE_y'] == 422034002) | 
                                      (retino_conditions['CODE_y'] == 1551000119108)]
# Ensure START and diab_date columns are in datetime format
retino_conditions['START'] = pd.to_datetime(retino_conditions['START'], errors='coerce')
retino_conditions['diab_date'] = pd.to_datetime(retino_conditions['diab_date'], errors='coerce')

# Calculate diab_years
retino_conditions['diab_years'] = (((retino_conditions['START'] - retino_conditions['diab_date']).dt.days) / 365.25).round(2)
retino_conditions.head(2)
Out[76]:
diab_date STOP_x PATIENT ENCOUNTER_x CODE_x DESCRIPTION_x START STOP_y ENCOUNTER_y CODE_y DESCRIPTION_y diab_years
4 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes 2017-04-15 NaN 374c2bf5-baaf-4f67-9e02-e4521addb728 422034002 Diabetic retinopathy associated with type II d... 2.01
19 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes 2020-05-02 NaN caa9c2dc-a809-48e1-9d69-e62a719295a4 1551000119108 Nonproliferative diabetic retinopathy due to t... 5.06
In [77]:
# Define age group bins and labels
bins = [0, 10, 20, 30, 40]
labels = ['0-9', '10-19', '20-29', '30+']

# Create age groups
retino_conditions['year_group'] = pd.cut(retino_conditions['diab_years'], bins=bins, labels=labels, right=False)
retino_conditions.pivot_table(
    values='DESCRIPTION_y',
    index='year_group',
    columns='CODE_y',
    aggfunc='count',
    observed=False
)
Out[77]:
CODE_y 422034002 1551000119108
year_group
0-9 1474 533
10-19 578 504
20-29 197 240
30+ 62 73

422034002 : Diabetic retinopathy associated with type II diabetes mellitus 1551000119108 : Nonproliferative diabetic retinopathy due to type 2 diabetes mellitus

Diabetic retinopathy happens at most first 19 years of the disease.

Labs values¶

In [78]:
observations['DESCRIPTION'][observations['PATIENT'].isin(diab_inpatient_ids)].value_counts().head()
Out[78]:
DESCRIPTION
Diastolic Blood Pressure    44903
Systolic Blood Pressure     44903
Body Weight                 43576
Respiratory rate            43576
Heart rate                  43576
Name: count, dtype: int64
In [79]:
labs_code = ["2339-0", "4548-4", "2571-8", "38483-4", "2947-0", "6298-4"]
# Filter observations for the specified lab codes and inpatient IDs
diab_observations = observations[
    observations['CODE'].isin(labs_code) & 
    observations['PATIENT'].isin(diab_inpatient_ids)
]
diab_observations.head(2)
Out[79]:
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE
7223 2020-05-02 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef caa9c2dc-a809-48e1-9d69-e62a719295a4 2339-0 Glucose 147.9 mg/dL numeric
7225 2020-05-02 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef caa9c2dc-a809-48e1-9d69-e62a719295a4 38483-4 Creatinine 1.1 mg/dL numeric

To analyze lab tests, use table 'observations' containing information about lab tests, merge with 'diab_date' to create the column 'days' of disease course.

In [80]:
# Ensure 'diab_date' is in datetime format
diab_conditions['diab_date'] = pd.to_datetime(diab_conditions['diab_date'], errors='coerce')

# Filter rows where 'diab_date' year is greater than 2010
diab_conditions = diab_conditions[diab_conditions['diab_date'].dt.year > 2010]
In [81]:
diab_patients_labs = diab_conditions.merge(diab_observations, on='PATIENT')
diab_patients_labs['DATE'] = pd.to_datetime(diab_patients_labs.DATE)
diab_patients_labs['labs_days'] = (diab_patients_labs.DATE - diab_patients_labs.diab_date).dt.days
diab_patients_labs['years'] = round((diab_patients_labs['labs_days'] / 365.25), 2)
diab_patients_labs['VALUE'] = pd.to_numeric(diab_patients_labs['VALUE'], errors='coerce')
diab_patients_labs = diab_patients_labs[diab_patients_labs['years'] > 0]
diab_patients_labs.rename(columns={'CODE_y': 'CODE'}, inplace=True)
diab_patients_labs.rename(columns={'DESCRIPTION_y': 'DESCRIPTION'}, inplace=True)
diab_patients_labs = diab_patients_labs[['diab_date', 'PATIENT', 'CODE', 'DESCRIPTION', 'VALUE', 'years']]
diab_patients_labs['hospitalized'] = diab_patients_labs.PATIENT.isin(inpatient_ids)
diab_patients_labs.head(2)
Out[81]:
diab_date PATIENT CODE DESCRIPTION VALUE years hospitalized
0 2015-04-11 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 2339-0 Glucose 147.9 5.06 True
1 2015-04-11 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 38483-4 Creatinine 1.1 5.06 True
In [82]:
loinc_to_display = {'CODE = 2339-0': 'Glucose', 'CODE = 4548-4': 'Hemoglobin A1c/Hemoglobin.total in Blood', 'CODE = 2571-8': 'Triglycerides',  'CODE = 38483-4': 'Creatinine', 'CODE = 2947-0': 'Sodium', 'CODE = 6298-4': 'Potassium'}

catplt = sns.catplot(x="years", y="VALUE", kind="box", col='CODE', 
            col_wrap=2, sharey=False, sharex=False, data = diab_patients_labs[diab_patients_labs['years']<= 3])

# Set the title for the entire figure
catplt.fig.suptitle("Labs Values of Diabetes Patients", y=1.02)

for axis in catplt.fig.axes:
    axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
    axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
    axis.set_title(loinc_to_display[axis.get_title()])
        
plt.show()
No description has been provided for this image
In [109]:
# --- Step 1: Prepare the A1C Data ---
diab_A1C = diab_observations[diab_observations['CODE'] == '4548-4'].copy()
diab_A1C['DATE'] = pd.to_datetime(diab_A1C['DATE'], errors='coerce')
diab_A1C['A1C'] = pd.to_numeric(diab_A1C['VALUE'], errors='coerce')
diab_A1C = diab_A1C[['DATE', 'PATIENT', 'A1C']].dropna()

# --- Step 2: Find Baseline Date per Patient ---
baseline_dates = diab_A1C.groupby('PATIENT')['DATE'].min().reset_index()
baseline_dates.rename(columns={'DATE': 'BASELINE_DATE'}, inplace=True)

# --- Step 3: Merge and Calculate Days Since Baseline ---
diab_A1C = diab_A1C.merge(baseline_dates, on='PATIENT')
diab_A1C['DAYS_SINCE_BASELINE'] = (diab_A1C['DATE'] - diab_A1C['BASELINE_DATE']).dt.days

# --- Step 4: Assign Timepoints ---
def assign_timepoint(days):
    if days == 0:
        return 'baseline'
    elif 80 <= days <= 100:
        return '3_month'
    elif 160 <= days <= 200:
        return '6_month'
    else:
        return None

diab_A1C['TIMEPOINT'] = diab_A1C['DAYS_SINCE_BASELINE'].apply(assign_timepoint)
diab_A1C = diab_A1C.dropna(subset=['TIMEPOINT'])

# --- Step 5: Aggregate Statistics ---
summary = diab_A1C.groupby('TIMEPOINT')['A1C'].agg(
    N='count',
    mean='mean',
    sd='std'
).reset_index()

# --- Output Results ---
order = ['baseline', '3_month', '6_month']
summary = summary.set_index('TIMEPOINT').loc[order].reset_index()
print(summary)
  TIMEPOINT     N      mean        sd
0  baseline  4041  4.868077  1.546044
1   3_month  1120  3.762768  1.004048
2   6_month  1432  3.759567  1.005509
In [113]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='TIMEPOINT', y='A1C', data=diab_A1C, order=['baseline', '3_month', '6_month'])
plt.xticks(rotation=0, ha='right')
plt.title('A1C at Time Points for Diabetes Patients')
plt.xlabel('Time Points')
plt.ylabel('A1C')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [85]:
from scipy.stats import f_oneway

# Filter only valid timepoints with enough data
valid_timepoints = diab_A1C['TIMEPOINT'].unique()
groups = [diab_A1C.loc[diab_A1C['TIMEPOINT'] == tp, 'A1C'] for tp in valid_timepoints if diab_A1C.loc[diab_A1C['TIMEPOINT'] == tp, 'A1C'].size > 1]

# Perform one-way ANOVA
anova_result = f_oneway(*groups)
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)
ANOVA F-statistic: 419.73032761941977
ANOVA p-value: 1.767089456558457e-171

Care plans for diabetes patients¶

In [86]:
diab_care = care_plans[care_plans['PATIENT'].isin(diab_patient_ids)]
# Normalize REASONDESCRIPTION to lowercase
diab_care.loc[:, 'REASONDESCRIPTION'] = diab_care['REASONDESCRIPTION'].str.lower()
# Use the existing diabetes-related keywords pattern
reason_mask = diab_care['REASONDESCRIPTION'].str.contains(pattern, regex=True, na=False)

# Filter the medications DataFrame
diabetes_related_reasons = diab_care[reason_mask]

# Calculate value counts and percentages
counts = diabetes_related_reasons['REASONDESCRIPTION'].value_counts()
percentages = round((counts / diab_patient_ids.size) * 100, 2)

# Combine counts and percentages into a DataFrame
diabetes_careplans = pd.DataFrame({
    'DESCRIPTION' : counts.index,
    'Count': counts.values,
    'Percentage': percentages.values
    })
diabetes_careplans
Out[86]:
DESCRIPTION Count Percentage
0 diabetes 6049 70.71
1 hypertension 5063 59.18
2 prediabetes 2506 29.29
3 escherichia coli urinary tract infection 28 0.33
4 diabetes from cystic fibrosis 16 0.19

Mortality¶

In [87]:
death_ids = patients[patients.DEATHDATE.notna()].Id.unique()
diab_death_ids = np.intersect1d(diab_patient_ids, death_ids)
In [88]:
diab_conditions = conditions[conditions['CODE'] == 44054006].copy()
diab_conditions.rename(columns = {'START' : 'diab_date'}, inplace=True)
diab_conditions.head(2)
Out[88]:
diab_date STOP PATIENT ENCOUNTER CODE DESCRIPTION
393 2015-04-11 NaN a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 5db5168f-1c3d-45f0-8d54-15d2c04d861a 44054006 Diabetes
577 1965-07-09 NaN 87b26350-3798-4289-a2ee-f6d88145537f 7a1ddd9e-6900-433a-aa38-bb7e29558909 44054006 Diabetes
In [89]:
dp = diab_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
dp = dp[['diab_date', 'PATIENT', 'BIRTHDATE', 'DEATHDATE', 'GENDER']]
dp['diab_date'] = pd.to_datetime(dp['diab_date'])
dp['BIRTHDATE'] = pd.to_datetime(dp['BIRTHDATE'])
dp['DEATHDATE'] = pd.to_datetime(dp['DEATHDATE'], errors='coerce')  # Handle NaT for missing dates
dp['admit'] = dp.PATIENT.isin(inpatient_ids)
dp['death'] = dp.DEATHDATE.notna()
dp['death_age'] = ((dp['DEATHDATE'] - dp['BIRTHDATE']).dt.days / 365.25).round(2)  
dp['diab_age'] = ((dp['diab_date'] - dp['BIRTHDATE']).dt.days / 365.25).round(2)
print("mean age at death:", dp['death_age'].mean().round(2), "years-old")
dp.head(3)
mean age at death: 69.11 years-old
Out[89]:
diab_date PATIENT BIRTHDATE DEATHDATE GENDER admit death death_age diab_age
0 2015-04-11 a2a8e809-cc34-4f8e-93b8-74e4a87f35ef 1969-03-22 NaT M True False NaN 46.05
1 1965-07-09 87b26350-3798-4289-a2ee-f6d88145537f 1940-05-03 2010-04-20 F True True 69.96 25.18
2 1985-05-22 30ce94e8-8c50-4588-97f6-b8aa1df0e9d2 1957-03-13 NaT M True False NaN 28.19
In [90]:
pd.crosstab(dp['GENDER'], dp['death'])
Out[90]:
death False True
GENDER
F 2862 882
M 2833 1978
In [91]:
# Create a cross-tabulation with percentages
cross_tab = pd.crosstab(dp['GENDER'], dp['death'], normalize='index') * 100

# Display the cross-tabulation
print(cross_tab.round(2))
death   False  True 
GENDER              
F       76.44  23.56
M       58.89  41.11
In [92]:
dp_death = dp[dp['DEATHDATE'].notna()]
dp_death.head(2)
Out[92]:
diab_date PATIENT BIRTHDATE DEATHDATE GENDER admit death death_age diab_age
1 1965-07-09 87b26350-3798-4289-a2ee-f6d88145537f 1940-05-03 2010-04-20 F True True 69.96 25.18
3 1988-12-04 8dbd714f-fae1-4a21-bdd4-960d3b08f52d 1964-09-06 2014-10-09 M True True 50.09 24.24
In [93]:
death = pd.DataFrame(dp_death['death'].value_counts()).reset_index()
death.columns = ['death', 'count']
deaths = death.loc[death['death'] == True, 'count'].iloc[0] # counting the deaths
death_rate = (deaths/diab_patient_ids.size)
print("Diabetes death rate is:", round(death_rate*100, 2), "%")
Diabetes death rate is: 33.43 %
In [94]:
# Filter only the rows where death occurred
diabetes_deaths = dp[dp['death'] == 1]

# Plot the histogram
plt.figure(figsize=(8, 4))
plt.hist(diabetes_deaths['diab_age'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age at Diabetes Diagnosis (Years)')
plt.ylabel('Number of Deaths')
plt.title('Number of Diabetes Deaths by Age at Diagnosis')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [95]:
dp.pivot_table('death', index='admit', columns = 'GENDER').round(3)
Out[95]:
GENDER F M
admit
False 0.133 0.287
True 0.337 0.548

Male patients admitted to hospital have a higher death rate, 70% compared to female.

Diabetes medications¶

Top 14 medications diabetes patients use

In [96]:
diab_meds = medications[medications['PATIENT'].isin(diab_patient_ids)]
count = diab_meds['DESCRIPTION'].value_counts()
percentages = round((count / count.sum()) * 100, 2)

# Combine counts and percentages into a DataFrame
diabetes_meds = pd.DataFrame({
    'Medications' : count.index,
    'Count': count.values,
    'Percentage': percentages.values
    })
diabetes_meds.head(14)
Out[96]:
Medications Count Percentage
0 insulin human isophane 70 UNT/ML / Regular In... 499654 27.56
1 24 HR Metformin hydrochloride 500 MG Extended ... 348524 19.22
2 Hydrochlorothiazide 25 MG Oral Tablet 168005 9.27
3 amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG ... 118518 6.54
4 Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet 114742 6.33
5 Hydrochlorothiazide 12.5 MG 59509 3.28
6 Digoxin 0.125 MG Oral Tablet 52711 2.91
7 Warfarin Sodium 5 MG Oral Tablet 52711 2.91
8 Verapamil Hydrochloride 40 MG 50712 2.80
9 Nitroglycerin 0.4 MG/ACTUAT Mucosal Spray 46784 2.58
10 Simvastatin 20 MG Oral Tablet 46340 2.56
11 Amlodipine 5 MG Oral Tablet 43117 2.38
12 NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ... 34975 1.93
13 120 ACTUAT Fluticasone propionate 0.044 MG/ACT... 33773 1.86
In [97]:
# use function size()
diab_meds.groupby('DESCRIPTION').size().reset_index(name='count').sort_values('count', ascending=False).head()
Out[97]:
DESCRIPTION count
157 insulin human isophane 70 UNT/ML / Regular In... 499654
29 24 HR Metformin hydrochloride 500 MG Extended ... 348524
87 Hydrochlorothiazide 25 MG Oral Tablet 168005
145 amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG ... 118518
54 Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet 114742

Hospital stay length¶

In [98]:
diab_encounters = encounters[encounters['ENCOUNTERCLASS'] == 'inpatient']
diab_encounters = diab_encounters.copy()
diab_encounters['stay_days'] = (
    pd.to_datetime(diab_encounters['STOP']) - pd.to_datetime(diab_encounters['START'])
) / np.timedelta64(1, 'D')

diab_encounters['stay_days'].describe()
Out[98]:
count    95273.000000
mean        16.316658
std        333.316660
min          1.000000
25%          1.000000
50%          1.041667
75%          4.029167
max      22248.147222
Name: stay_days, dtype: float64

Average hospital stay days of diabetes patients is: 77.9 days

In [99]:
# Remove rows with the abnormal maximum stay_days value
diab_encounters = diab_encounters[diab_encounters['stay_days'] != diab_encounters['stay_days'].max()]
(diab_encounters['stay_days'].sum() / diab_inpatient_ids.size).round(2)
Out[99]:
np.float64(367.1)