RetinaAI
Data loading and preprocessing
Data cleaning
Missing data
- As the data preview below shows, some observations have missing values.
| No. missing data | patient | age | sex | systemic_background | ophthalmic_background | symptoms |
|---|---|---|---|---|---|---|
| 33 | 7 | NA | NA | NA | NA | NA |
| 33 | 13 | NA | NA | NA | NA | NA |
| 33 | 37 | NA | NA | NA | NA | NA |
| 33 | 38 | NA | NA | NA | NA | NA |
| 33 | 39 | NA | NA | NA | NA | NA |
| 33 | 40 | NA | NA | NA | NA | NA |
Below is the distribution of the number of missing entries per patient.
There is a clear cutoff between patients with many missing values (30 or more) and patients with only a few (4 or fewer).
The excluded patients are those whose age is not a number (NA), and they coincide with the group that has many missing values.
| Characteristic | N = 119¹ |
|---|---|
| n_miss_all | |
| 0 | 2 (1.7%) |
| 1 | 22 (18%) |
| 2 | 60 (50%) |
| 3 | 7 (5.9%) |
| 4 | 4 (3.4%) |
| 31 | 1 (0.8%) |
| 32 | 3 (2.5%) |
| 33 | 20 (17%) |
| ¹ n (%) | |
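The cutoff described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names other than `patient`, and the helper names `count_missing` and `split_by_missingness` are hypothetical, and `None` stands in for `NA`.

```python
# Hypothetical toy records; None marks a missing entry.
records = [
    {"patient": 1, "age": 67, "sex": "f", "symptoms": "blurred vision"},
    {"patient": 2, "age": 54, "sex": None, "symptoms": "floaters"},
    {"patient": 7, "age": None, "sex": None, "symptoms": None},
]

def count_missing(record):
    """Count the fields (other than the patient id) that are None."""
    return sum(value is None for key, value in record.items() if key != "patient")

def split_by_missingness(records, max_missing):
    """Split records into (kept, excluded) by their number of missing fields."""
    kept, excluded = [], []
    for record in records:
        target = kept if count_missing(record) <= max_missing else excluded
        target.append(record)
    return kept, excluded

# The toy records have only three fields, so the threshold is scaled down to 1;
# for the real data the text above suggests a threshold of 4.
kept, excluded = split_by_missingness(records, max_missing=1)
kept_ids = [r["patient"] for r in kept]
excluded_ids = [r["patient"] for r in excluded]
print(kept_ids, excluded_ids)
```

Because the two groups are so far apart (at most 4 versus at least 31 missing entries), any threshold between them gives the same split.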
- The lines in ?@fig-missing indicate that out of n = 119 patients, there are
The following plots describe the missingness in the data provided (results 16.5.23.xlsx).
- TODO: check missingness separately for the human and GPT responses.
Wrangling
Sex
Sex is converted to lowercase, and one observation whose value is “HD, S/P MI/ PASCEMAKER” is set to NA for the time being.
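A minimal Python sketch of this step is shown here. The set of valid codes (`m`, `f`) is an assumption, since the coding used in the spreadsheet is not shown, and `clean_sex` is a hypothetical helper name.

```python
# Assumption: the valid codes are "m" and "f"; the real coding may differ.
VALID_SEX = {"m", "f"}

def clean_sex(raw):
    """Lowercase the value and return None (NA) for anything not in VALID_SEX."""
    if raw is None:
        return None
    value = raw.strip().lower()
    return value if value in VALID_SEX else None

print(clean_sex("F"))
print(clean_sex("HD, S/P MI/ PASCEMAKER"))
```

Returning `None` rather than raising an error keeps the row in the data, so the stray value can be reviewed later.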
Systemic background
I created a dictionary for cleaning the data.
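The dictionary itself is not shown in this report, so the following Python sketch only illustrates the general approach. Every entry in `CLEANING_DICT` is a made-up example, and `clean_background` is a hypothetical helper that assumes entries are comma-separated.

```python
# Hypothetical cleaning dictionary: raw spelling -> canonical label.
# These entries are illustrative only, not the mapping used in the analysis.
CLEANING_DICT = {
    "htn": "hypertension",
    "hypertension": "hypertension",
    "dm": "diabetes mellitus",
    "dm2": "diabetes mellitus",
}

def clean_background(raw):
    """Split a comma-separated field and map each item to its canonical label."""
    if raw is None:
        return []
    cleaned = []
    for item in raw.split(","):
        key = item.strip().lower()
        if not key:
            continue
        # Unknown items pass through unchanged so they can be reviewed by hand.
        label = CLEANING_DICT.get(key, key)
        if label not in cleaned:
            cleaned.append(label)
    return cleaned

print(clean_background("HTN, DM2, dm, S/P CVA"))
```

Letting unmapped values pass through makes it easy to list what the dictionary does not yet cover.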
Agreement
- Is “S/P CVA and hemiparesis” one or two diagnoses?
- I assumed so.
- Can we reduce the number of unique values?
| name | chat | human |
|---|---|---|
| f_u_recommendations_other | 6 | 9 |
| f_u_recommendations_time | 9 | 15 |
| modalities_for_f_u | 41 | 14 |
| most_probable_diagnosis | 35 | 29 |
| protocol | 8 | 9 |
| suggested_treatment | 26 | 10 |
- What’s a match? Are OCT and OCTA a mismatch?
Most probable Dx
For each patient, we calculate whether the response from Chat was exactly the same as the human’s.
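A sketch of this exact comparison in Python follows. Ignoring case and surrounding whitespace is an assumption on my part (the text only says "exactly the same"), and `normalize` and `exact_match` are hypothetical names.

```python
def normalize(text):
    """Collapse runs of whitespace and lowercase the text (an assumed normalization)."""
    return " ".join(text.split()).lower()

def exact_match(chat, human):
    """True when both responses are present and identical after normalization."""
    if chat is None or human is None:
        return False
    return normalize(chat) == normalize(human)

print(exact_match("OCT", " oct "))
print(exact_match("OCT", "OCTA"))
```

Under this definition `OCT` and `OCTA` count as different responses; treating them as equivalent would need an explicit synonym list.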
How to define “difference” or “distance” between Chat and Human?
- How to define agreement when multiple diagnoses?
Analysis of Diagnostic Data
Most Probable Diagnosis
How do we resolve double/triple diagnoses?
What is W/E? Is it like W,E (two diagnoses)? Or a separate level?
Each pair of responses is assigned to one of four categories, described here with abstract diagnosis labels:
Full Match: This indicates that the diagnosis generated by the AI (chat) matches entirely with the diagnosis made by the medical professional (human). For instance, if the physician and the AI both diagnose “A, B, C”, it is considered a full match.
Partial Match - Human Subset: Every diagnosis identified by the AI also appears in the physician’s diagnosis, but the AI missed at least one condition. For example, if the physician’s diagnosis is “A, B, C, D” and the AI’s diagnosis is “A, B, C”, the AI missed condition “D”. This is a partial match in which the AI’s diagnosis is a strict subset of the human’s diagnosis.
Partial Match - Chat Subset: The AI identifies every condition diagnosed by the physician, plus at least one more. For example, if the physician’s diagnosis is “A, B, C” and the AI’s diagnosis is “A, B, C, D”, the AI has over-diagnosed with condition “D”. This is a partial match in which the human’s diagnosis is a strict subset of the AI’s diagnosis.
Mismatch: A mismatch occurs when there’s no overlap between the AI’s and the physician’s diagnoses. For example, if the physician diagnoses “A, B, C”, and the AI diagnoses “D, E, F”, there are no common conditions, making it a mismatch.
This classification measures how closely the AI’s diagnoses align with those of the medical professional.
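The four categories above can be expressed with set comparisons. The Python sketch below is a hedged illustration, not the code used for the tables. It assumes diagnoses are comma-separated and compared case-insensitively, and it uses descriptive return labels rather than the `chat_partial` / `human_partial` labels in the tables, because the mapping between those labels and the two partial categories is not stated here. A partial overlap in which neither set contains the other (for example “A, B” versus “B, C”) is not covered by the definitions; treating it as a mismatch is my assumption.

```python
def parse_diagnoses(text):
    """Turn "A, B, C" into the set {"a", "b", "c"} (assumed comma-separated)."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}

def classify_match(chat, human):
    """Classify a chat/human pair into one of the four categories."""
    chat_set = parse_diagnoses(chat)
    human_set = parse_diagnoses(human)
    if chat_set == human_set:
        return "full_match"
    if chat_set and chat_set < human_set:
        # The AI's diagnoses are a strict subset of the human's: the AI missed some.
        return "chat_subset_of_human"
    if human_set and human_set < chat_set:
        # The human's diagnoses are a strict subset of the AI's: the AI added some.
        return "human_subset_of_chat"
    # Assumption: no overlap AND partial overlap without containment both land here.
    return "mismatch"

print(classify_match("A, B, C", "A, B, C, D"))
```

Using sets makes the result independent of the order in which diagnoses are listed, so “A, B” and “B, A” count as a full match.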
| Characteristic | Overall, N = 190¹ | left, N = 95¹ | right, N = 95¹ |
|---|---|---|---|
| diagnosis_match | | | |
| chat_partial | 3 (1.6%) | 1 (1.1%) | 2 (2.1%) |
| full_match | 76 (40%) | 39 (41%) | 37 (39%) |
| human_partial | 20 (11%) | 8 (8.4%) | 12 (13%) |
| mismatch | 91 (48%) | 47 (49%) | 44 (46%) |
| ¹ n (%) | | | |
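As a quick arithmetic check on the overall column of the table above, the counts sum to 190 and each percentage is the count divided by that total. The Python sketch below reproduces the calculation; `percentages` is a hypothetical helper, and it rounds to one decimal place, whereas the table uses a coarser rounding for larger values.

```python
# Counts copied from the "Overall" column of the diagnosis table.
counts = {"chat_partial": 3, "full_match": 76, "human_partial": 20, "mismatch": 91}

def percentages(counts):
    """Return each count as a percentage of the column total, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total = sum(counts.values())
shares = percentages(counts)
print(total, shares)
```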
Suggested Treatment
- Removed the duration.
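One way to strip a duration from a treatment string is a regular expression, sketched below in Python. The duration formats it handles (such as “for 3 months” or “x 2 weeks”) and the example treatment strings are assumptions, since the raw values are not shown here; `strip_duration` is a hypothetical helper.

```python
import re

# Assumed duration formats: an optional "for" or "x", a number, and a time unit.
DURATION = re.compile(
    r"\s*(?:\bfor\b|\bx\b)?\s*\d+\s*(?:day|week|month|year)s?\b",
    flags=re.IGNORECASE,
)

def strip_duration(treatment):
    """Remove duration phrases so treatments can be compared by name only."""
    return DURATION.sub("", treatment).strip()

print(strip_duration("anti-VEGF injections for 3 months"))
```

Strings without a recognizable duration are returned unchanged, so the step is safe to apply to every row.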
| Characteristic | Overall, N = 190¹ | left, N = 95¹ | right, N = 95¹ |
|---|---|---|---|
| treatment_match | | | |
| chat_partial | 25 (13%) | 21 (22%) | 4 (4.2%) |
| full_match | 66 (35%) | 36 (38%) | 30 (32%) |
| human_partial | 85 (45%) | 32 (34%) | 53 (56%) |
| mismatch | 14 (7.4%) | 6 (6.3%) | 8 (8.4%) |
| ¹ n (%) | | | |
Protocol
| Characteristic | Overall, N = 190¹ | left, N = 95¹ | right, N = 95¹ |
|---|---|---|---|
| protocol_match | | | |
| full_match | 149 (78%) | 76 (80%) | 73 (77%) |
| human_partial | 3 (1.6%) | 2 (2.1%) | 1 (1.1%) |
| mismatch | 38 (20%) | 17 (18%) | 21 (22%) |
| ¹ n (%) | | | |
Follow-up Recommendations Time
| Characteristic | N = 95¹ |
|---|---|
| time_match | |
| full_match | 35 (37%) |
| mismatch | 60 (63%) |
| ¹ n (%) | |
Follow-up Recommendations Other
| Characteristic | N = 95¹ |
|---|---|
| other_match | |
| chat_partial | 23 (24%) |
| full_match | 50 (53%) |
| human_partial | 19 (20%) |
| mismatch | 3 (3.2%) |
| ¹ n (%) | |
Modalities for Follow-up
| Characteristic | N = 95¹ |
|---|---|
| modality_match | |
| human_partial | 8 (8.4%) |
| mismatch | 87 (92%) |
| ¹ n (%) | |