Analyzing data in OBSERVATIONS Table in The Synthea Healthcare Database.

Data Summary

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('E:\Statistics\PowerBI\Data Synthea\observations.csv')
# Show basic structure and missing values in 'encounter'
df_info = df.info()
null_encounter_count = df['ENCOUNTER'].isnull().sum()
df_preview = df.head()

df_info, null_encounter_count, df_preview
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299697 entries, 0 to 299696
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   DATE         299697 non-null  object
 1   PATIENT      299697 non-null  object
 2   ENCOUNTER    269334 non-null  object
 3   CODE         299697 non-null  object
 4   DESCRIPTION  299697 non-null  object
 5   VALUE        299697 non-null  object
 6   UNITS        286962 non-null  object
 7   TYPE         299697 non-null  object
dtypes: object(8)
memory usage: 18.3+ MB
Out[1]:
(None,
 np.int64(30363),
                    DATE                               PATIENT  \
 0  2012-01-23T17:45:28Z  034e9e3b-2def-4559-bb2a-7850888ae060   
 1  2012-01-23T17:45:28Z  034e9e3b-2def-4559-bb2a-7850888ae060   
 2  2012-01-23T17:45:28Z  034e9e3b-2def-4559-bb2a-7850888ae060   
 3  2012-01-23T17:45:28Z  034e9e3b-2def-4559-bb2a-7850888ae060   
 4  2012-01-23T17:45:28Z  034e9e3b-2def-4559-bb2a-7850888ae060   
 
                               ENCOUNTER     CODE  \
 0  e88bc3a9-007c-405e-aabc-792a38f4aa2b   8302-2   
 1  e88bc3a9-007c-405e-aabc-792a38f4aa2b  72514-3   
 2  e88bc3a9-007c-405e-aabc-792a38f4aa2b  29463-7   
 3  e88bc3a9-007c-405e-aabc-792a38f4aa2b  39156-5   
 4  e88bc3a9-007c-405e-aabc-792a38f4aa2b   8462-4   
 
                                          DESCRIPTION  VALUE    UNITS     TYPE  
 0                                        Body Height  193.3       cm  numeric  
 1  Pain severity - 0-10 verbal numeric rating [Sc...    2.0  {score}  numeric  
 2                                        Body Weight   87.8       kg  numeric  
 3                                    Body Mass Index   23.5    kg/m2  numeric  
 4                           Diastolic Blood Pressure   82.0   mm[Hg]  numeric  )
In [12]:
df['CODE'].nunique()
Out[12]:
125
In [4]:
print("Distinct patients:", df['PATIENT'].nunique())
Distinct patients: 1171

Because some rows is 'null' in 'ENCOUNTER' so I use 'DATE'_'PATIENT' as a proxy variable for 'ENCOUNTER'. Now I đo some analysis for 'observations' table

In [ ]:
# Extract unique CODE and DESCRIPTION combinations
df2 = df[['CODE', 'DESCRIPTION']].drop_duplicates()
print(df2.shape)
df2.head()

# Save to CSV (Windows path with raw string to handle backslashes)
df2.to_csv(r"E:\Statistics\PowerBI\Data Synthea\unique_codes.csv", index=False)
(130, 2)
Out[ ]:
CODE DESCRIPTION
0 8302-2 Body Height
1 72514-3 Pain severity - 0-10 verbal numeric rating [Sc...
2 29463-7 Body Weight
3 39156-5 Body Mass Index
4 8462-4 Diastolic Blood Pressure

Analyzing aggreations for BMI, height, weight, SBP

In [2]:
# Create synthetic encounter ID from PATIENT + DATE
df['SYNTH_ENCOUNTER'] = df['PATIENT'] + "_" + df['DATE']

# Convert VALUE to numeric safely
df['VALUE_NUM'] = pd.to_numeric(df['VALUE'], errors='coerce')

# Define relevant clinical codes and their labels
relevant_codes = {
    '29463-7': 'weight',
    '8302-2': 'height',
    '39156-5': 'bmi',
    '8480-6': 'systolic_bp',
    '8462-4': 'diastolic_bp'
}

# Filter for only the relevant codes
df_filtered = df[df['CODE'].isin(relevant_codes.keys())].copy()
df_filtered['metric'] = df_filtered['CODE'].map(relevant_codes)

# Pivot: aggregate by SYNTH_ENCOUNTER and compute mean value per metric
pivot = df_filtered.pivot_table(
    index='SYNTH_ENCOUNTER',
    columns='metric',
    values='VALUE_NUM',
    aggfunc='mean'
).reset_index()

# Extract meta columns: PATIENT, DATE (remove duplicates)
encounter_meta = df[['SYNTH_ENCOUNTER', 'PATIENT', 'DATE']].drop_duplicates()

# Merge metadata with the summary stats
summary_table = encounter_meta.merge(pivot, on='SYNTH_ENCOUNTER', how='left')

# Save to CSV or view
summary_table.to_csv("patient_health_summary.csv", index=False)
print(summary_table.head(10))
                                     SYNTH_ENCOUNTER  \
0  034e9e3b-2def-4559-bb2a-7850888ae060_2012-01-2...   
1  8d4c4326-e9de-4f45-9a4c-f8c36bff89ae_2011-11-1...   
2  10339b10-3cd1-4ac3-ac13-ec26728cb592_2010-07-2...   
3  1d604da9-9a81-4ba9-80c2-de3375d59b40_2011-07-2...   
4  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2010-11-2...   
5  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-02-0...   
6  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-04-1...   
7  034e9e3b-2def-4559-bb2a-7850888ae060_2015-01-2...   
8  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a_2011-11-2...   
9  8d4c4326-e9de-4f45-9a4c-f8c36bff89ae_2012-08-1...   

                                PATIENT                  DATE   bmi  \
0  034e9e3b-2def-4559-bb2a-7850888ae060  2012-01-23T17:45:28Z  23.5   
1  8d4c4326-e9de-4f45-9a4c-f8c36bff89ae  2011-11-17T00:26:23Z   NaN   
2  10339b10-3cd1-4ac3-ac13-ec26728cb592  2010-07-27T12:58:08Z  23.8   
3  1d604da9-9a81-4ba9-80c2-de3375d59b40  2011-07-28T15:02:18Z  20.1   
4  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a  2010-11-20T03:04:34Z  22.2   
5  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a  2011-02-07T03:04:34Z   NaN   
6  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a  2011-04-19T03:04:34Z   NaN   
7  034e9e3b-2def-4559-bb2a-7850888ae060  2015-01-26T17:45:28Z  23.5   
8  f5dcd418-09fe-4a2f-baa0-3da800bd8c3a  2011-11-26T03:04:34Z  26.6   
9  8d4c4326-e9de-4f45-9a4c-f8c36bff89ae  2012-08-12T00:26:23Z  29.3   

   diastolic_bp  height  systolic_bp  weight  
0          82.0   193.3        119.0    87.8  
1           NaN     NaN          NaN     NaN  
2          70.0   165.0        121.0    64.7  
3          83.0   181.0        107.0    65.8  
4          84.0   169.6        108.0    63.8  
5           NaN     NaN          NaN     NaN  
6           NaN     NaN          NaN     NaN  
7          88.0   193.3        134.0    87.8  
8          70.0   175.2        119.0    81.7  
9          87.0   162.5        127.0    77.4