Analyzing data in OBSERVATIONS Table in The Synthea Healthcare Database.
Data Summary
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('E:\Statistics\PowerBI\Data Synthea\observations.csv')
# Show basic structure and missing values in 'encounter'
df_info = df.info()
null_encounter_count = df['ENCOUNTER'].isnull().sum()
df_preview = df.head()
df_info, null_encounter_count, df_preview
<class 'pandas.core.frame.DataFrame'> RangeIndex: 299697 entries, 0 to 299696 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DATE 299697 non-null object 1 PATIENT 299697 non-null object 2 ENCOUNTER 269334 non-null object 3 CODE 299697 non-null object 4 DESCRIPTION 299697 non-null object 5 VALUE 299697 non-null object 6 UNITS 286962 non-null object 7 TYPE 299697 non-null object dtypes: object(8) memory usage: 18.3+ MB
Out[1]:
(None, np.int64(30363), DATE PATIENT \ 0 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 1 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 2 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 3 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 4 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 ENCOUNTER CODE \ 0 e88bc3a9-007c-405e-aabc-792a38f4aa2b 8302-2 1 e88bc3a9-007c-405e-aabc-792a38f4aa2b 72514-3 2 e88bc3a9-007c-405e-aabc-792a38f4aa2b 29463-7 3 e88bc3a9-007c-405e-aabc-792a38f4aa2b 39156-5 4 e88bc3a9-007c-405e-aabc-792a38f4aa2b 8462-4 DESCRIPTION VALUE UNITS TYPE 0 Body Height 193.3 cm numeric 1 Pain severity - 0-10 verbal numeric rating [Sc... 2.0 {score} numeric 2 Body Weight 87.8 kg numeric 3 Body Mass Index 23.5 kg/m2 numeric 4 Diastolic Blood Pressure 82.0 mm[Hg] numeric )
In [12]:
df['CODE'].nunique()
Out[12]:
125
In [4]:
print("Distinct patients:", df['PATIENT'].nunique())
Distinct patients: 1171
Because some rows is 'null' in 'ENCOUNTER' so I use 'DATE'_'PATIENT' as a proxy variable for 'ENCOUNTER'. Now I đo some analysis for 'observations' table
In [ ]:
# Extract unique CODE and DESCRIPTION combinations
df2 = df[['CODE', 'DESCRIPTION']].drop_duplicates()
print(df2.shape)
df2.head()
# Save to CSV (Windows path with raw string to handle backslashes)
df2.to_csv(r"E:\Statistics\PowerBI\Data Synthea\unique_codes.csv", index=False)
(130, 2)
Out[ ]:
CODE | DESCRIPTION | |
---|---|---|
0 | 8302-2 | Body Height |
1 | 72514-3 | Pain severity - 0-10 verbal numeric rating [Sc... |
2 | 29463-7 | Body Weight |
3 | 39156-5 | Body Mass Index |
4 | 8462-4 | Diastolic Blood Pressure |
Analyzing aggreations for BMI, height, weight, SBP
In [2]:
# Create synthetic encounter ID from PATIENT + DATE
df['SYNTH_ENCOUNTER'] = df['PATIENT'] + "_" + df['DATE']
# Convert VALUE to numeric safely
df['VALUE_NUM'] = pd.to_numeric(df['VALUE'], errors='coerce')
# Define relevant clinical codes and their labels
relevant_codes = {
'29463-7': 'weight',
'8302-2': 'height',
'39156-5': 'bmi',
'8480-6': 'systolic_bp',
'8462-4': 'diastolic_bp'
}
# Filter for only the relevant codes
df_filtered = df[df['CODE'].isin(relevant_codes.keys())].copy()
df_filtered['metric'] = df_filtered['CODE'].map(relevant_codes)
# Pivot: aggregate by SYNTH_ENCOUNTER and compute mean value per metric
pivot = df_filtered.pivot_table(
index='SYNTH_ENCOUNTER',
columns='metric',
values='VALUE_NUM',
aggfunc='mean'
).reset_index()
# Extract meta columns: PATIENT, DATE (remove duplicates)
encounter_meta = df[['SYNTH_ENCOUNTER', 'PATIENT', 'DATE']].drop_duplicates()
# Merge metadata with the summary stats
summary_table = encounter_meta.merge(pivot, on='SYNTH_ENCOUNTER', how='left')
# Save to CSV or view
summary_table.to_csv("patient_health_summary.csv", index=False)
print(summary_table.head(10))
SYNTH_ENCOUNTER \ 0 034e9e3b-2def-4559-bb2a-7850888ae060_2012-01-2... 1 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae_2011-11-1... 2 10339b10-3cd1-4ac3-ac13-ec26728cb592_2010-07-2... 3 1d604da9-9a81-4ba9-80c2-de3375d59b40_2011-07-2... 4 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2010-11-2... 5 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-02-0... 6 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-04-1... 7 034e9e3b-2def-4559-bb2a-7850888ae060_2015-01-2... 8 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-11-2... 9 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae_2012-08-1... PATIENT DATE bmi \ 0 034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23T17:45:28Z 23.5 1 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2011-11-17T00:26:23Z NaN 2 10339b10-3cd1-4ac3-ac13-ec26728cb592 2010-07-27T12:58:08Z 23.8 3 1d604da9-9a81-4ba9-80c2-de3375d59b40 2011-07-28T15:02:18Z 20.1 4 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2010-11-20T03:04:34Z 22.2 5 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2011-02-07T03:04:34Z NaN 6 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2011-04-19T03:04:34Z NaN 7 034e9e3b-2def-4559-bb2a-7850888ae060 2015-01-26T17:45:28Z 23.5 8 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2011-11-26T03:04:34Z 26.6 9 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2012-08-12T00:26:23Z 29.3 diastolic_bp height systolic_bp weight 0 82.0 193.3 119.0 87.8 1 NaN NaN NaN NaN 2 70.0 165.0 121.0 64.7 3 83.0 181.0 107.0 65.8 4 84.0 169.6 108.0 63.8 5 NaN NaN NaN NaN 6 NaN NaN NaN NaN 7 88.0 193.3 134.0 87.8 8 70.0 175.2 119.0 81.7 9 87.0 162.5 127.0 77.4