---
title: "FAM_HF v1.2"
author: "alice"
date: "9/9/2020"
output:
  html_document:
    highlight: haddock
---
The purpose of this notebook is to examine electronic health record (EHR) data to better understand the complex relationships between heart failure, air pollution, and family history. Here, we match ~21,000 individual EHRs with ~11,700 instances of family health history. Below, these data are loaded into the workspace, and the family history is converted from long to wide format.
In the Family History data, some family members report multiple diseases, which creates duplicate columns for a single individual. Presently, the dataframe fam_wide takes each potential family member as a factor and reports the first listed instance of disease, or reports NA for none listed.
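A minimal base R sketch of this long-to-wide step, assuming a long table with one row per patient, relative, and disease. The column names `pt_id`, `relative`, and `disease` and the toy values are hypothetical; the actual code behind fam_wide is not shown in this notebook.

```r
# Hypothetical long-format family history: one row per patient, relative, and disease.
fam_long <- data.frame(
  pt_id    = c(1, 1, 1, 2, 2),
  relative = c("Mother", "Mother", "Father", "Mother", "Sister"),
  disease  = c("Diabetes", "Cancer", "Heart disease", "Stroke", "Asthma"),
  stringsAsFactors = FALSE
)

# Keep only the first listed disease for each patient/relative pair.
first_only <- fam_long[!duplicated(fam_long[c("pt_id", "relative")]), ]

# Spread relatives into columns; relatives with no report become NA.
fam_wide <- reshape(
  first_only,
  idvar     = "pt_id",
  timevar   = "relative",
  direction = "wide"
)

# reshape() names the new columns "disease.<relative>"; strip the prefix.
names(fam_wide) <- sub("^disease\\.", "", names(fam_wide))
rownames(fam_wide) <- NULL
```

Dropping the duplicated patient/relative pairs first is what avoids the duplicate columns, at the cost of discarding every disease after the first.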
The dataframe fam_wide_dx is a wide-format display of each possible disease as a factor, with the first listed family member given. I doubt this approach is useful and will likely not use this dataframe, unless it can serve as a way to capture all diseases reported by all family members for each patient.
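If the goal is to capture every disease reported for a family member rather than only the first, one option is to collapse the diseases into a single string per patient/relative pair. This is a sketch with hypothetical column names (`pt_id`, `relative`, `disease`) and toy values, not the method used in this notebook.

```r
# Hypothetical long-format family history with a relative reporting two diseases.
fam_long <- data.frame(
  pt_id    = c(1, 1, 1, 2, 2),
  relative = c("Mother", "Mother", "Father", "Mother", "Sister"),
  disease  = c("Diabetes", "Cancer", "Heart disease", "Stroke", "Asthma"),
  stringsAsFactors = FALSE
)

# Collapse all diseases for each patient/relative pair into one "; "-separated string.
all_dx <- aggregate(
  disease ~ pt_id + relative,
  data = fam_long,
  FUN  = function(x) paste(unique(x), collapse = "; ")
)
```

The collapsed table still has one row per patient/relative pair, so it can be widened the same way as a first-disease-only table without creating duplicate columns.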
Here, the two dataframes are merged so that each EHR is paired with its affiliated family history (fam_wide). The family history contained extraneous records for which we do not have corresponding EHR data, so these were removed.
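A sketch of how such a merge could look in base R, assuming both dataframes share a patient identifier. The key name `pt_id` and the toy rows are hypothetical, and I am assuming a left join from the EHR side, since the summary tables keep all 20,920 patients.

```r
# Hypothetical EHR and family history tables sharing a patient key.
ehr <- data.frame(pt_id = c(1, 2, 3), AGE = c(70, 59, 80))
fam_wide <- data.frame(
  pt_id  = c(1, 3, 99),
  Mother = c("Diabetes", "Cancer", "Stroke"),
  stringsAsFactors = FALSE
)

# Left join: keep every EHR row and attach family history where it exists.
ehr_fam <- merge(ehr, fam_wide, by = "pt_id", all.x = TRUE)

# Family history rows with no matching EHR; these never enter ehr_fam.
orphans <- fam_wide[!fam_wide$pt_id %in% ehr$pt_id, ]
```

Inspecting `orphans` before discarding them is a cheap check that the extraneous records are really unmatched rather than the result of a key mismatch.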
Below, summary tables of the data are shown. The first table describes all patient data. The next three tables group by sex, race, and aliveness.
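For reference, the two statistics reported in these tables, median (IQR) for continuous variables and n (%) for categorical ones, can be reproduced by hand. The sketch below uses base R on a toy dataframe; the values are made up and only the column names AGE and SEX_CD come from the tables.

```r
# Toy patient data (values are illustrative, not from the study).
pts <- data.frame(
  AGE    = c(70, 59, 80, 65, 74),
  SEX_CD = c("F", "M", "F", "F", "M")
)

# Continuous variable: median with first and third quartiles.
age_q <- quantile(pts$AGE, probs = c(0.25, 0.5, 0.75))
age_summary <- sprintf("%.0f (%.0f, %.0f)", age_q[2], age_q[1], age_q[3])

# Categorical variable: count and percentage per level.
sex_n   <- table(pts$SEX_CD)
sex_pct <- 100 * prop.table(sex_n)
sex_summary <- sprintf("%d (%.0f%%)", as.integer(sex_n), as.numeric(sex_pct))
names(sex_summary) <- names(sex_n)

# Two-group comparison of a continuous variable (Wilcoxon rank-sum test).
age_by_sex_p <- wilcox.test(AGE ~ SEX_CD, data = pts)$p.value
```

With more than two groups, as in the table by race, `kruskal.test()` takes the place of `wilcox.test()`.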
The last two tables come from experimenting with glm; they show nothing meaningful and are only there to see how the output looks.
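A sketch of how odds ratios like those in the glm tables can be obtained from a logistic regression. The data are simulated, the coefficients used to generate them are arbitrary, and the Wald intervals from `confint.default()` are my choice for illustration; the interval method behind the tables is not stated.

```r
set.seed(1)
n <- 500

# Simulated predictors (not study data).
sim <- data.frame(
  AGE    = round(rnorm(n, mean = 70, sd = 12)),
  SEX_CD = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Simulated binary outcome; these coefficients are arbitrary.
lin_pred <- -4 + 0.05 * sim$AGE + 0.3 * (sim$SEX_CD == "M")
sim$ALL_CAUSE <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Logistic regression of death on age and sex.
fit <- glm(ALL_CAUSE ~ AGE + SEX_CD, data = sim, family = binomial)

# Exponentiate the log-odds coefficients to get odds ratios with Wald 95% CIs.
or_table <- cbind(
  OR = exp(coef(fit)),
  exp(confint.default(fit))
)
```

The row `SEX_CDM` is the odds ratio for M relative to the reference level F, which is why reference levels appear as dashes in the tables.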

SUMMARY OF ALL PT DATA

| Characteristic | N = 20,920¹ |
|---|---|
| AGE | 70 (59, 80) |
| SEX_CD | |
| F | 10,998 (53%) |
| M | 9,922 (47%) |
| RACE | |
| BLACK | 5,564 (27%) |
| OTHER | 1,481 (7.1%) |
| WHITE | 13,875 (66%) |
| ANNUAL_AVG_PM | 9.50 (8.73, 10.59) |
| visit_year | 2014 (2010, 2016) |
| income | 49,318 (35,357, 67,216) |
| med_house_value | 150,800 (108,100, 227,800) |
| pubassist | 0.89 (0.00, 2.91) |
| urban | 88 (13, 100) |
| poverty | 14 (7, 24) |
| NEW_TOT_VISITS | 10 (4, 27) |
| N_OUTPATIENT | 9 (3, 24) |
| NEW_INPATIENT | 1.00 (0.00, 2.00) |
| N_EMERGENCY | 0.00 (0.00, 2.00) |
| EMERG_NOT_INPATIENT | 0.00 (0.00, 0.00) |
| log_FU | 0.52 (-0.55, 1.35) |
| ALL_CAUSE | 5,321 (25%) |
| PM_5day_avg | 9.1 (7.2, 11.3) |
| HARMONIZED_SMK | |
| CURRENT | 2,029 (9.7%) |
| FORMER | 6,515 (31%) |
| NEVER | 6,176 (30%) |
| UNKNOWN | 6,200 (30%) |
| CKD | 13,183 (63%) |
| IHD | 13,438 (64%) |
| BP_PRI | 16,602 (79%) |
| COPD | 9,025 (43%) |
| T2D | 7,627 (36%) |
| LIPID | 17,042 (81%) |
| PAD | 9,229 (44%) |

¹ Statistics presented: median (IQR); n (%)

SUMMARY OF PT DATA BY SEX

| Characteristic | F, N = 10,998¹ | M, N = 9,922¹ | p-value² |
|---|---|---|---|
| AGE | 71 (60, 82) | 69 (58, 78) | <0.001 |
| RACE | | | <0.001 |
| BLACK | 3,160 (29%) | 2,404 (24%) | |
| OTHER | 804 (7.3%) | 677 (6.8%) | |
| WHITE | 7,034 (64%) | 6,841 (69%) | |
| ANNUAL_AVG_PM | 9.53 (8.75, 10.64) | 9.46 (8.70, 10.55) | 0.003 |
| visit_year | 2014 (2010, 2015) | 2014 (2010, 2016) | 0.6 |
| income | 48,726 (35,000, 65,752) | 50,000 (36,098, 67,765) | <0.001 |
| med_house_value | 149,500 (107,500, 223,700) | 152,200 (109,200, 228,800) | 0.010 |
| pubassist | 0.90 (0.00, 2.94) | 0.86 (0.00, 2.91) | 0.2 |
| urban | 91 (19, 100) | 86 (6, 100) | <0.001 |
| poverty | 14 (7, 25) | 13 (6, 23) | <0.001 |
| NEW_TOT_VISITS | 10 (4, 27) | 11 (4, 27) | 0.12 |
| N_OUTPATIENT | 9 (3, 24) | 9 (3, 25) | 0.015 |
| NEW_INPATIENT | 1.00 (0.00, 2.00) | 1.00 (0.00, 2.00) | 0.012 |
| N_EMERGENCY | 0.00 (0.00, 2.00) | 0.00 (0.00, 2.00) | <0.001 |
| EMERG_NOT_INPATIENT | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | <0.001 |
| log_FU | 0.53 (-0.51, 1.37) | 0.50 (-0.61, 1.32) | 0.010 |
| ALL_CAUSE | 2,656 (24%) | 2,665 (27%) | <0.001 |
| PM_5day_avg | 9.1 (7.2, 11.3) | 9.1 (7.1, 11.3) | 0.8 |
| HARMONIZED_SMK | | | <0.001 |
| CURRENT | 905 (8.2%) | 1,124 (11%) | |
| FORMER | 2,751 (25%) | 3,764 (38%) | |
| NEVER | 4,121 (37%) | 2,055 (21%) | |
| UNKNOWN | 3,221 (29%) | 2,979 (30%) | |
| CKD | 6,902 (63%) | 6,281 (63%) | 0.4 |
| IHD | 6,634 (60%) | 6,804 (69%) | <0.001 |
| BP_PRI | 8,797 (80%) | 7,805 (79%) | 0.019 |
| COPD | 4,972 (45%) | 4,053 (41%) | <0.001 |
| T2D | 3,999 (36%) | 3,628 (37%) | 0.8 |
| LIPID | 8,850 (80%) | 8,192 (83%) | <0.001 |
| PAD | 4,999 (45%) | 4,230 (43%) | <0.001 |

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

SUMMARY OF PT DATA BY RACE

| Characteristic | BLACK, N = 5,564¹ | OTHER, N = 1,481¹ | WHITE, N = 13,875¹ | p-value² |
|---|---|---|---|---|
| AGE | 63 (53, 74) | 69 (57, 79) | 73 (63, 82) | <0.001 |
| SEX_CD | | | | <0.001 |
| F | 3,160 (57%) | 804 (54%) | 7,034 (51%) | |
| M | 2,404 (43%) | 677 (46%) | 6,841 (49%) | |
| ANNUAL_AVG_PM | 9.64 (8.89, 11.20) | 9.47 (8.73, 10.34) | 9.45 (8.68, 10.43) | <0.001 |
| visit_year | 2014 (2009, 2015) | 2014 (2011, 2016) | 2014 (2010, 2016) | <0.001 |
| income | 40,969 (29,167, 56,460) | 46,758 (34,464, 67,765) | 52,669 (38,551, 70,612) | <0.001 |
| med_house_value | 128,300 (89,200, 182,325) | 141,200 (97,500, 213,800) | 160,800 (115,100, 240,400) | <0.001 |
| pubassist | 1.27 (0.00, 3.50) | 0.90 (0.00, 2.94) | 0.61 (0.00, 2.74) | <0.001 |
| urban | 98 (29, 100) | 91 (15, 100) | 82 (7, 100) | <0.001 |
| poverty | 19 (10, 32) | 14 (6, 26) | 12 (6, 21) | <0.001 |
| NEW_TOT_VISITS | 11 (4, 29) | 7 (3, 17) | 11 (4, 27) | <0.001 |
| N_OUTPATIENT | 9 (3, 26) | 6 (2, 16) | 9 (3, 25) | <0.001 |
| NEW_INPATIENT | 1.00 (0.00, 2.00) | 0.00 (0.00, 1.00) | 1.00 (0.00, 2.00) | <0.001 |
| N_EMERGENCY | 1.00 (0.00, 2.00) | 0.00 (0.00, 1.00) | 0.00 (0.00, 2.00) | <0.001 |
| EMERG_NOT_INPATIENT | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | <0.001 |
| log_FU | 0.68 (-0.28, 1.59) | 0.46 (-0.67, 1.22) | 0.45 (-0.64, 1.25) | <0.001 |
| ALL_CAUSE | 1,337 (24%) | 304 (21%) | 3,680 (27%) | <0.001 |
| PM_5day_avg | 9.2 (7.3, 11.5) | 9.1 (7.2, 11.2) | 9.0 (7.1, 11.2) | <0.001 |
| HARMONIZED_SMK | | | | <0.001 |
| CURRENT | 662 (12%) | 122 (8.2%) | 1,245 (9.0%) | |
| FORMER | 1,526 (27%) | 387 (26%) | 4,602 (33%) | |
| NEVER | 1,694 (30%) | 504 (34%) | 3,978 (29%) | |
| UNKNOWN | 1,682 (30%) | 468 (32%) | 4,050 (29%) | |
| CKD | 3,751 (67%) | 810 (55%) | 8,622 (62%) | <0.001 |
| IHD | 3,267 (59%) | 869 (59%) | 9,302 (67%) | <0.001 |
| BP_PRI | 4,780 (86%) | 1,067 (72%) | 10,755 (78%) | <0.001 |
| COPD | 2,321 (42%) | 532 (36%) | 6,172 (44%) | <0.001 |
| T2D | 2,484 (45%) | 501 (34%) | 4,642 (33%) | <0.001 |
| LIPID | 4,359 (78%) | 1,106 (75%) | 11,577 (83%) | <0.001 |
| PAD | 2,113 (38%) | 516 (35%) | 6,600 (48%) | <0.001 |

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Kruskal-Wallis test; chi-square test of independence

SUMMARY OF PT DATA BY ALIVENESS

| Variable | LIVING, N = 15,599¹ | DECEASED, N = 5,321¹ | p-value² |
|---|---|---|---|
| AGE | 69 (58, 79) | 74 (63, 84) | <0.001 |
| SEX_CD | | | <0.001 |
| F | 8,342 (53%) | 2,656 (50%) | |
| M | 7,257 (47%) | 2,665 (50%) | |
| RACE | | | <0.001 |
| BLACK | 4,227 (27%) | 1,337 (25%) | |
| OTHER | 1,177 (7.5%) | 304 (5.7%) | |
| WHITE | 10,195 (65%) | 3,680 (69%) | |
| ANNUAL_AVG_PM | 9.30 (8.60, 10.06) | 10.54 (9.34, 12.76) | <0.001 |
| visit_year | 2015 (2012, 2016) | 2009 (2006, 2014) | <0.001 |
| income | 48,276 (35,000, 65,333) | 52,550 (36,967, 70,795) | <0.001 |
| med_house_value | 146,300 (105,600, 214,700) | 167,400 (114,200, 263,400) | <0.001 |
| pubassist | 0.92 (0.00, 2.93) | 0.57 (0.00, 2.91) | <0.001 |
| urban | 89 (13, 100) | 87 (11, 100) | 0.8 |
| poverty | 14 (7, 25) | 13 (6, 23) | <0.001 |
| NEW_TOT_VISITS | 11 (4, 26) | 10 (3, 31) | 0.001 |
| N_OUTPATIENT | 9 (3, 24) | 8 (1, 27) | <0.001 |
| NEW_INPATIENT | 1.00 (0.00, 2.00) | 1.00 (1.00, 3.00) | <0.001 |
| N_EMERGENCY | 0.00 (0.00, 1.00) | 1.00 (0.00, 3.00) | <0.001 |
| EMERG_NOT_INPATIENT | 0.00 (0.00, 0.00) | 0.00 (0.00, 1.00) | <0.001 |
| log_FU | 0.64 (-0.29, 1.47) | 0.02 (-1.54, 1.04) | <0.001 |
| PM_5day_avg | 8.8 (7.0, 10.9) | 9.8 (7.7, 12.6) | <0.001 |
| HARMONIZED_SMK | | | <0.001 |
| CURRENT | 1,785 (11%) | 244 (4.6%) | |
| FORMER | 5,686 (36%) | 829 (16%) | |
| NEVER | 5,545 (36%) | 631 (12%) | |
| UNKNOWN | 2,583 (17%) | 3,617 (68%) | |
| CKD | 9,234 (59%) | 3,949 (74%) | <0.001 |
| IHD | 9,635 (62%) | 3,803 (71%) | <0.001 |
| BP_PRI | 12,248 (79%) | 4,354 (82%) | <0.001 |
| COPD | 6,468 (41%) | 2,557 (48%) | <0.001 |
| T2D | 5,480 (35%) | 2,147 (40%) | <0.001 |
| LIPID | 12,590 (81%) | 4,452 (84%) | <0.001 |
| PAD | 6,631 (43%) | 2,598 (49%) | <0.001 |

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

## [1] "Death Predicted by Various Factors, ** No normalization or var control"
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| AGE | 1.03 | 1.02, 1.03 | <0.001 |
| SEX_CD | | | |
| F | — | — | |
| M | 1.25 | 1.15, 1.37 | <0.001 |
| ANNUAL_AVG_PM | 1.88 | 1.82, 1.95 | <0.001 |
| RACE | | | |
| BLACK | — | — | |
| OTHER | 0.74 | 0.61, 0.90 | 0.002 |
| WHITE | 0.95 | 0.86, 1.06 | 0.4 |
| med_house_value | 1.00 | 1.00, 1.00 | <0.001 |
| pubassist | 1.00 | 0.98, 1.01 | 0.6 |
| urban | 1.00 | 0.99, 1.00 | <0.001 |
| poverty | 1.00 | 1.00, 1.00 | 0.6 |
| NEW_TOT_VISITS | 0.97 | 0.95, 0.99 | 0.008 |
| N_OUTPATIENT | 1.04 | 1.02, 1.06 | 0.002 |
| NEW_INPATIENT | 1.21 | 1.18, 1.25 | <0.001 |
| EMERG_NOT_INPATIENT | | | |
| log_FU | 0.39 | 0.37, 0.40 | <0.001 |
| PM_5day_avg | 0.99 | 0.97, 1.00 | 0.013 |
| HARMONIZED_SMK | | | |
| CURRENT | — | — | |
| FORMER | 0.92 | 0.77, 1.11 | 0.4 |
| NEVER | 0.72 | 0.60, 0.87 | <0.001 |
| UNKNOWN | 7.84 | 6.58, 9.37 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval

| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| CKD | 1.84 | 1.71, 1.98 | <0.001 |
| IHD | 1.38 | 1.29, 1.49 | <0.001 |
| BP_PRI | 0.94 | 0.86, 1.03 | 0.2 |
| COPD | 1.12 | 1.05, 1.19 | <0.001 |
| T2D | 1.09 | 1.02, 1.17 | 0.008 |
| LIPID | 0.93 | 0.85, 1.01 | 0.10 |
| PAD | 1.09 | 1.02, 1.16 | 0.013 |

¹ OR = Odds Ratio, CI = Confidence Interval

Next, the EHR data are visualized. I examine the general distribution of particulate matter (PM) exposure, how it relates to other factors, and other general aspects of the data.

| Disease | True | False | Proportion with Disease |
|---|---|---|---|
| ckd | 13183 | 7737 | 63.02% |
| ihd | 13438 | 7482 | 64.24% |
| bp_pri | 16602 | 4318 | 79.36% |
| copd | 9025 | 11895 | 43.14% |
| t2d | 7627 | 13293 | 36.46% |
| lipid | 17042 | 3878 | 81.46% |
| pad | 9229 | 11691 | 44.12% |
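The proportions in the table above are simply the True count divided by the total number of patients, for example 13,183 / 20,920 = 63.02% for ckd. A base R sketch of building such a table from logical disease indicators follows; the toy dataframe `dx` is hypothetical.

```r
# Hypothetical logical disease indicators, one row per patient.
dx <- data.frame(
  ckd = c(TRUE, TRUE, FALSE, TRUE),
  ihd = c(FALSE, TRUE, FALSE, TRUE)
)

# Count patients with each disease; the remainder are without it.
n_true  <- colSums(dx)
n_false <- nrow(dx) - n_true

# One row per disease, with the proportion formatted as a percentage.
dx_summary <- data.frame(
  True       = n_true,
  False      = n_false,
  Proportion = sprintf("%.2f%%", 100 * n_true / nrow(dx)),
  stringsAsFactors = FALSE
)
```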
Following PM plots, bar charts of each documented disease per patient are generated. Frequency of each disease is shown by race. These data are not normalized and thus have limited interpretability.
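To see why normalization matters, note that in the summary by race the raw T2D count is largest for WHITE (4,642) while the proportion is largest for BLACK (45%). One way to normalize is to convert counts to within-race proportions before plotting. The sketch below uses base R on a toy dataframe; the rows are made up.

```r
# Toy patient data (illustrative values only).
pts <- data.frame(
  RACE = c("BLACK", "BLACK", "WHITE", "WHITE", "WHITE", "OTHER"),
  T2D  = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Raw counts: rows are race, columns are disease status.
counts <- table(pts$RACE, pts$T2D)

# margin = 1 normalizes within each row, so each race sums to 1.
props <- prop.table(counts, margin = 1)

# Proportion with T2D in each race group.
t2d_by_race <- props[, "TRUE"]
```

Passing `t2d_by_race` to `barplot()` would then give bars that are comparable across groups of different size.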
Here, I begin examining family history data. My first exploratory approach was to look at Mother & Father data. I create a table of each possibly reported disease as factors, sort them by frequency for Mothers/Fathers, and save the top ten. This is where I have unrelentingly run into trouble. I cannot figure out how to re-factor / re-level / re-type / re-classify / whatever this dataframe for it to become a recognized class for plotting. It seems that everything I have tried results in new errors. I’m hoping I will look away from it for a bit, then the solution will come to me.
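One common way around this kind of class problem is to convert the frequency table to a plain data frame first, then sort and truncate the data frame rather than the table object. The sketch below is base R with a made-up factor `mom_dx`; it may or may not match the structure of the real data.

```r
# Hypothetical factor of diseases reported for mothers (NA = none listed).
mom_dx <- factor(c("Cancer", "Diabetes", "Cancer", "Stroke", "Diabetes", "Cancer", NA))

# table() drops NA by default; as.data.frame() turns it into two plain columns.
freq_df <- as.data.frame(table(mom_dx), stringsAsFactors = FALSE)
names(freq_df) <- c("disease", "frequency")

# Sort by descending frequency and keep the top ten rows.
freq_df  <- freq_df[order(-freq_df$frequency), ]
top10_df <- head(freq_df, 10)
rownames(top10_df) <- NULL
```

The result is an ordinary data frame with a character column and an integer column, so it can go straight into `barplot(top10_df$frequency, names.arg = top10_df$disease)` or a plotting package.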
My goal is to get a good initial understanding of the family disease data and how it relates to the patient. From here, I can apply these exploratory methods to each of the other patient family members. I also need to consider that only the first instance of a disease for a family member is shown, and that data display needs to be corrected before any relationships can be made.
| Top 10 Mom Diseases | freq (mom) | Top 10 Dad Diseases | freq (dad) | Top 10 Brother Diseases | freq (bro) | Top 10 Sister Diseases | freq (sis) |
|---|---|---|---|---|---|---|---|
| Cancer | 965 | Heart disease | 1089 | Cancer | 556 | Cancer | 512 |
| Diabetes | 886 | Cancer | 946 | Diabetes | 398 | Diabetes | 418 |
| Heart disease | 817 | Heart attack | 578 | Heart disease | 344 | Heart disease | 210 |
| Hypertension | 483 | Diabetes | 569 | Heart attack | 157 | Breast cancer | 196 |
| Heart attack | 276 | Hypertension | 368 | Hypertension | 129 | Hypertension | 165 |
| Arthritis | 256 | Stroke | 214 | Alcohol abuse | 87 | Arthritis | 77 |
| Stroke | 230 | Coronary artery disease | 190 | COPD | 61 | Heart attack | 65 |
| Breast cancer | 199 | Alcohol abuse | 186 | Coronary artery disease | 54 | Stroke | 51 |
| Heart failure | 165 | Heart failure | 145 | Arthritis | 48 | COPD | 49 |
| Coronary artery disease | 125 | COPD | 141 | Asthma | 48 | Asthma | 47 |
Below: Outputting initial Data Viz plots to a single pdf
## quartz_off_screen
## 2