title: "FAM_HF v1.2"
author: "alice"
date: "9/9/2020"
output: html_document
highlight: haddock

HEART FAILURE, FAMILY HISTORY

This notebook examines EHR data to better understand the relationships between heart failure, air pollution, and family history. We match ~21,000 individual EHRs with ~11,700 instances of family health history. Below, the data are loaded into the workspace and the family history is converted from long to wide format.

In the Family History data, some family members report multiple diseases, which creates duplicate columns for a single individual. Presently, the dataframe fam_wide takes each potential family member as a factor and reports the first listed instance of disease, or reports NA for none listed.
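The notebook's own code is not shown, so the following is only a rough plain-Python sketch of the long-to-wide step described above, not the author's implementation. The row layout `(patient_id, relation, disease)`, the sample values, and the helper name `pivot_first` are all assumptions made for illustration; `None` stands in for NA.

```python
# Hypothetical long-format rows: (patient_id, relation, disease), in listed order.
long_rows = [
    ("p1", "Mother", "Cancer"),
    ("p1", "Mother", "Diabetes"),      # second disease for the same relative
    ("p1", "Father", "Heart disease"),
    ("p2", "Sister", "Hypertension"),
]
relations = ["Mother", "Father", "Sister"]


def pivot_first(rows, relations):
    """Return one row per patient and one column per relation,
    keeping only the first listed disease (None when nothing is listed)."""
    wide = {}
    for patient_id, relation, disease in rows:
        row = wide.setdefault(patient_id, {r: None for r in relations})
        if row[relation] is None:      # later diseases for this relative are ignored
            row[relation] = disease
    return wide


fam_wide = pivot_first(long_rows, relations)
print(fam_wide["p1"])
```

In this sketch the mother's second disease ("Diabetes") is silently dropped, which is the limitation the paragraph above describes.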

The dataframe fam_wide_dx is a wide-format display of each possible disease as a factor, with the first listed family member given. I doubt this approach is useful and will likely not use this dataframe, unless it can serve as a way to capture all diseases reported by all family members for each patient.

Here, the two dataframes are merged so that each EHR is paired with the affiliated family history (fam_wide). The family history contained extraneous records for which we do not have corresponding EHR data, so these were removed.
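As a sketch of the merge described above, the snippet below keeps every EHR record and attaches family history where it exists, so family records without a matching EHR drop out. It assumes a left join keyed on a patient id; the key, the sample records, and the helper name `merge_ehr_family` are hypothetical, and the notebook's actual join may differ.

```python
# Hypothetical records keyed by patient id.
ehr = {"p1": {"AGE": 70}, "p2": {"AGE": 63}, "p3": {"AGE": 81}}
fam_wide = {
    "p1": {"Mother": "Cancer"},
    "p2": {"Father": "Diabetes"},
    "p9": {"Sister": "Asthma"},        # no matching EHR, so it is not carried over
}


def merge_ehr_family(ehr, fam_wide):
    """Left join on patient id: every EHR row is kept,
    family rows without an EHR are left out."""
    merged = {}
    for patient_id, record in ehr.items():
        family = fam_wide.get(patient_id, {})   # empty when no family history exists
        merged[patient_id] = {**record, **family}
    return merged


merged = merge_ehr_family(ehr, fam_wide)
dropped = sorted(set(fam_wide) - set(ehr))      # family ids with no EHR
print(len(merged), dropped)
```

Listing the dropped ids, as `dropped` does here, is a cheap way to confirm how many extraneous family records were removed.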

summary tables

Below, summary tables of the data are shown. The first table describes all patient data. The next three tables group by sex, race, and aliveness.

The last two tables come from experimenting with glm. They are not meant to be meaningful; they only show what the output looks like.

SUMMARY OF ALL PT DATA
Characteristic N = 20,920¹
AGE 70 (59, 80)
SEX_CD
F 10,998 (53%)
M 9,922 (47%)
RACE
BLACK 5,564 (27%)
OTHER 1,481 (7.1%)
WHITE 13,875 (66%)
ANNUAL_AVG_PM 9.50 (8.73, 10.59)
visit_year 2014 (2010, 2016)
income 49,318 (35,357, 67,216)
med_house_value 150,800 (108,100, 227,800)
pubassist 0.89 (0.00, 2.91)
urban 88 (13, 100)
poverty 14 (7, 24)
NEW_TOT_VISITS 10 (4, 27)
N_OUTPATIENT 9 (3, 24)
NEW_INPATIENT 1.00 (0.00, 2.00)
N_EMERGENCY 0.00 (0.00, 2.00)
EMERG_NOT_INPATIENT 0.00 (0.00, 0.00)
log_FU 0.52 (-0.55, 1.35)
ALL_CAUSE 5,321 (25%)
PM_5day_avg 9.1 (7.2, 11.3)
HARMONIZED_SMK
CURRENT 2,029 (9.7%)
FORMER 6,515 (31%)
NEVER 6,176 (30%)
UNKNOWN 6,200 (30%)
CKD 13,183 (63%)
IHD 13,438 (64%)
BP_PRI 16,602 (79%)
COPD 9,025 (43%)
T2D 7,627 (36%)
LIPID 17,042 (81%)
PAD 9,229 (44%)

¹ Statistics presented: median (IQR); n (%)
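For readers unfamiliar with the two summary formats named in the footnote, here is a small sketch of how "median (IQR)" and "n (%)" values can be computed. The `ages` list is hypothetical; the sex counts are taken from the table above. Quartile definitions vary between software packages, so the quartiles from this sketch may not match the notebook's on real data.

```python
from statistics import median, quantiles


def median_iqr(values):
    """Return (median, Q1, Q3) using the 'inclusive' quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


def n_pct(count, total):
    """Return the count and its share of the total as a whole-number percent."""
    return count, round(100 * count / total)


ages = [55, 59, 70, 80, 84]            # hypothetical values, not patient data
age_summary = median_iqr(ages)

total = 20920                          # N from the table header
female = n_pct(10998, total)           # SEX_CD = F
male = n_pct(9922, total)              # SEX_CD = M
print(age_summary, female, male)
```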

SUMMARY OF PT DATA BY SEX
Characteristic F, N = 10,998¹ M, N = 9,922¹ p-value²
AGE 71 (60, 82) 69 (58, 78) <0.001
RACE <0.001
BLACK 3,160 (29%) 2,404 (24%)
OTHER 804 (7.3%) 677 (6.8%)
WHITE 7,034 (64%) 6,841 (69%)
ANNUAL_AVG_PM 9.53 (8.75, 10.64) 9.46 (8.70, 10.55) 0.003
visit_year 2014 (2010, 2015) 2014 (2010, 2016) 0.6
income 48,726 (35,000, 65,752) 50,000 (36,098, 67,765) <0.001
med_house_value 149,500 (107,500, 223,700) 152,200 (109,200, 228,800) 0.010
pubassist 0.90 (0.00, 2.94) 0.86 (0.00, 2.91) 0.2
urban 91 (19, 100) 86 (6, 100) <0.001
poverty 14 (7, 25) 13 (6, 23) <0.001
NEW_TOT_VISITS 10 (4, 27) 11 (4, 27) 0.12
N_OUTPATIENT 9 (3, 24) 9 (3, 25) 0.015
NEW_INPATIENT 1.00 (0.00, 2.00) 1.00 (0.00, 2.00) 0.012
N_EMERGENCY 0.00 (0.00, 2.00) 0.00 (0.00, 2.00) <0.001
EMERG_NOT_INPATIENT 0.00 (0.00, 0.00) 0.00 (0.00, 0.00) <0.001
log_FU 0.53 (-0.51, 1.37) 0.50 (-0.61, 1.32) 0.010
ALL_CAUSE 2,656 (24%) 2,665 (27%) <0.001
PM_5day_avg 9.1 (7.2, 11.3) 9.1 (7.1, 11.3) 0.8
HARMONIZED_SMK <0.001
CURRENT 905 (8.2%) 1,124 (11%)
FORMER 2,751 (25%) 3,764 (38%)
NEVER 4,121 (37%) 2,055 (21%)
UNKNOWN 3,221 (29%) 2,979 (30%)
CKD 6,902 (63%) 6,281 (63%) 0.4
IHD 6,634 (60%) 6,804 (69%) <0.001
BP_PRI 8,797 (80%) 7,805 (79%) 0.019
COPD 4,972 (45%) 4,053 (41%) <0.001
T2D 3,999 (36%) 3,628 (37%) 0.8
LIPID 8,850 (80%) 8,192 (83%) <0.001
PAD 4,999 (45%) 4,230 (43%) <0.001

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence
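To make the chi-square test of independence concrete, the sketch below recomputes it for sex versus all-cause death, using the counts reported in the by-sex and by-aliveness tables (F: 2,656 deceased and 8,342 living; M: 2,665 deceased and 7,257 living). It uses the plain Pearson statistic without a continuity correction, which is an assumption; the notebook's software may apply a correction, so the exact statistic could differ slightly.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].
    Returns the statistic and its p-value on 1 degree of freedom."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(stat / 2))     # upper tail of chi-square with 1 df
    return stat, p_value


# Rows: F, M. Columns: deceased, living.
stat, p_value = chi_square_2x2(2656, 8342, 2665, 7257)
print(round(stat, 1), p_value < 0.001)
```

The resulting p-value is below 0.001, in line with the ALL_CAUSE row of the by-sex table.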

SUMMARY OF PT DATA BY RACE
Characteristic BLACK, N = 5,564¹ OTHER, N = 1,481¹ WHITE, N = 13,875¹ p-value²
AGE 63 (53, 74) 69 (57, 79) 73 (63, 82) <0.001
SEX_CD <0.001
F 3,160 (57%) 804 (54%) 7,034 (51%)
M 2,404 (43%) 677 (46%) 6,841 (49%)
ANNUAL_AVG_PM 9.64 (8.89, 11.20) 9.47 (8.73, 10.34) 9.45 (8.68, 10.43) <0.001
visit_year 2014 (2009, 2015) 2014 (2011, 2016) 2014 (2010, 2016) <0.001
income 40,969 (29,167, 56,460) 46,758 (34,464, 67,765) 52,669 (38,551, 70,612) <0.001
med_house_value 128,300 (89,200, 182,325) 141,200 (97,500, 213,800) 160,800 (115,100, 240,400) <0.001
pubassist 1.27 (0.00, 3.50) 0.90 (0.00, 2.94) 0.61 (0.00, 2.74) <0.001
urban 98 (29, 100) 91 (15, 100) 82 (7, 100) <0.001
poverty 19 (10, 32) 14 (6, 26) 12 (6, 21) <0.001
NEW_TOT_VISITS 11 (4, 29) 7 (3, 17) 11 (4, 27) <0.001
N_OUTPATIENT 9 (3, 26) 6 (2, 16) 9 (3, 25) <0.001
NEW_INPATIENT 1.00 (0.00, 2.00) 0.00 (0.00, 1.00) 1.00 (0.00, 2.00) <0.001
N_EMERGENCY 1.00 (0.00, 2.00) 0.00 (0.00, 1.00) 0.00 (0.00, 2.00) <0.001
EMERG_NOT_INPATIENT 0.00 (0.00, 0.00) 0.00 (0.00, 0.00) 0.00 (0.00, 0.00) <0.001
log_FU 0.68 (-0.28, 1.59) 0.46 (-0.67, 1.22) 0.45 (-0.64, 1.25) <0.001
ALL_CAUSE 1,337 (24%) 304 (21%) 3,680 (27%) <0.001
PM_5day_avg 9.2 (7.3, 11.5) 9.1 (7.2, 11.2) 9.0 (7.1, 11.2) <0.001
HARMONIZED_SMK <0.001
CURRENT 662 (12%) 122 (8.2%) 1,245 (9.0%)
FORMER 1,526 (27%) 387 (26%) 4,602 (33%)
NEVER 1,694 (30%) 504 (34%) 3,978 (29%)
UNKNOWN 1,682 (30%) 468 (32%) 4,050 (29%)
CKD 3,751 (67%) 810 (55%) 8,622 (62%) <0.001
IHD 3,267 (59%) 869 (59%) 9,302 (67%) <0.001
BP_PRI 4,780 (86%) 1,067 (72%) 10,755 (78%) <0.001
COPD 2,321 (42%) 532 (36%) 6,172 (44%) <0.001
T2D 2,484 (45%) 501 (34%) 4,642 (33%) <0.001
LIPID 4,359 (78%) 1,106 (75%) 11,577 (83%) <0.001
PAD 2,113 (38%) 516 (35%) 6,600 (48%) <0.001

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Kruskal-Wallis test; chi-square test of independence

SUMMARY OF PT DATA BY ALIVENESS
Variable LIVING, N = 15,599¹ DECEASED, N = 5,321¹ p-value²
AGE 69 (58, 79) 74 (63, 84) <0.001
SEX_CD <0.001
F 8,342 (53%) 2,656 (50%)
M 7,257 (47%) 2,665 (50%)
RACE <0.001
BLACK 4,227 (27%) 1,337 (25%)
OTHER 1,177 (7.5%) 304 (5.7%)
WHITE 10,195 (65%) 3,680 (69%)
ANNUAL_AVG_PM 9.30 (8.60, 10.06) 10.54 (9.34, 12.76) <0.001
visit_year 2015 (2012, 2016) 2009 (2006, 2014) <0.001
income 48,276 (35,000, 65,333) 52,550 (36,967, 70,795) <0.001
med_house_value 146,300 (105,600, 214,700) 167,400 (114,200, 263,400) <0.001
pubassist 0.92 (0.00, 2.93) 0.57 (0.00, 2.91) <0.001
urban 89 (13, 100) 87 (11, 100) 0.8
poverty 14 (7, 25) 13 (6, 23) <0.001
NEW_TOT_VISITS 11 (4, 26) 10 (3, 31) 0.001
N_OUTPATIENT 9 (3, 24) 8 (1, 27) <0.001
NEW_INPATIENT 1.00 (0.00, 2.00) 1.00 (1.00, 3.00) <0.001
N_EMERGENCY 0.00 (0.00, 1.00) 1.00 (0.00, 3.00) <0.001
EMERG_NOT_INPATIENT 0.00 (0.00, 0.00) 0.00 (0.00, 1.00) <0.001
log_FU 0.64 (-0.29, 1.47) 0.02 (-1.54, 1.04) <0.001
PM_5day_avg 8.8 (7.0, 10.9) 9.8 (7.7, 12.6) <0.001
HARMONIZED_SMK <0.001
CURRENT 1,785 (11%) 244 (4.6%)
FORMER 5,686 (36%) 829 (16%)
NEVER 5,545 (36%) 631 (12%)
UNKNOWN 2,583 (17%) 3,617 (68%)
CKD 9,234 (59%) 3,949 (74%) <0.001
IHD 9,635 (62%) 3,803 (71%) <0.001
BP_PRI 12,248 (79%) 4,354 (82%) <0.001
COPD 6,468 (41%) 2,557 (48%) <0.001
T2D 5,480 (35%) 2,147 (40%) <0.001
LIPID 12,590 (81%) 4,452 (84%) <0.001
PAD 6,631 (43%) 2,598 (49%) <0.001

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

## [1] "Death Predicted by Various Factors, ** No normalization or var control"

Characteristic OR¹ 95% CI¹ p-value
AGE 1.03 1.02, 1.03 <0.001
SEX_CD
F
M 1.25 1.15, 1.37 <0.001
ANNUAL_AVG_PM 1.88 1.82, 1.95 <0.001
RACE
BLACK
OTHER 0.74 0.61, 0.90 0.002
WHITE 0.95 0.86, 1.06 0.4
med_house_value 1.00 1.00, 1.00 <0.001
pubassist 1.00 0.98, 1.01 0.6
urban 1.00 0.99, 1.00 <0.001
poverty 1.00 1.00, 1.00 0.6
NEW_TOT_VISITS 0.97 0.95, 0.99 0.008
N_OUTPATIENT 1.04 1.02, 1.06 0.002
NEW_INPATIENT 1.21 1.18, 1.25 <0.001
EMERG_NOT_INPATIENT
log_FU 0.39 0.37, 0.40 <0.001
PM_5day_avg 0.99 0.97, 1.00 0.013
HARMONIZED_SMK
CURRENT
FORMER 0.92 0.77, 1.11 0.4
NEVER 0.72 0.60, 0.87 <0.001
UNKNOWN 7.84 6.58, 9.37 <0.001

¹ OR = Odds Ratio, CI = Confidence Interval

Characteristic OR¹ 95% CI¹ p-value
CKD 1.84 1.71, 1.98 <0.001
IHD 1.38 1.29, 1.49 <0.001
BP_PRI 0.94 0.86, 1.03 0.2
COPD 1.12 1.05, 1.19 <0.001
T2D 1.09 1.02, 1.17 0.008
LIPID 0.93 0.85, 1.01 0.10
PAD 1.09 1.02, 1.16 0.013

¹ OR = Odds Ratio, CI = Confidence Interval
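As a reminder of what an odds ratio measures, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% confidence interval for CKD and death, using counts from the aliveness table: 3,949 of 5,321 deceased patients and 9,234 of 15,599 living patients have CKD. This is not the glm fit reported above. The model estimates adjust for the other covariates, so the crude value is expected to differ from the tabulated OR of 1.84. The helper name `odds_ratio_wald` is hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_wald(exposed_cases, exposed_noncases,
                    unexposed_cases, unexposed_noncases, z=1.96):
    """Crude odds ratio from a 2x2 table with a Wald confidence interval."""
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases
    )
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se = sqrt(
        1 / exposed_cases + 1 / exposed_noncases
        + 1 / unexposed_cases + 1 / unexposed_noncases
    )
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


ckd_deceased, ckd_living = 3949, 9234
no_ckd_deceased = 5321 - ckd_deceased      # deceased without CKD
no_ckd_living = 15599 - ckd_living         # living without CKD

crude_or, ci_low, ci_high = odds_ratio_wald(
    ckd_deceased, ckd_living, no_ckd_deceased, no_ckd_living
)
print(round(crude_or, 2), round(ci_low, 2), round(ci_high, 2))
```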

Data Viz

Some visualizations of the hospitalizations data

[Figures: Frequency of Total Hospital Visits; Frequency of Out-Patient Visits; Frequency of In-Patient Visits; Frequency of Emergency Room Visits]

Next, the EHR data are visualized. I examine the general distribution of particulate matter exposure, how it relates to other factors, and other general aspects of the data.

Disease Dist for All Patients

General Distributions of Diseases for All Patients
Disease  True   False  Proportion with Disease
ckd      13183  7737   63.02%
ihd      13438  7482   64.24%
bp_pri   16602  4318   79.36%
copd     9025   11895  43.14%
t2d      7627   13293  36.46%
lipid    17042  3878   81.46%
pad      9229   11691  44.12%

Following the PM plots, bar charts of each documented disease per patient are generated. Frequency of each disease is shown by race. These data are not normalized and thus have limited interpretability.
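One possible normalization, sketched here as an assumption rather than as the notebook's plan, is to divide each disease count by the size of its race group so the bars show within-group percentages. The T2D counts and group sizes below come from the by-race summary table; the helper name `within_group_pct` is hypothetical.

```python
# Group sizes and T2D counts as reported in the by-race summary table.
group_sizes = {"BLACK": 5564, "OTHER": 1481, "WHITE": 13875}
t2d_counts = {"BLACK": 2484, "OTHER": 501, "WHITE": 4642}


def within_group_pct(counts, sizes):
    """Convert raw counts to the percent of each group that has the disease."""
    return {group: round(100 * counts[group] / sizes[group], 1) for group in counts}


t2d_pct = within_group_pct(t2d_counts, group_sizes)
print(t2d_pct)
```

On raw counts the WHITE group has the most T2D cases simply because it is the largest group; as a share of each group, T2D is most common in the BLACK group.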

Here, I begin examining the family history data. My first exploratory approach was to look at Mother & Father data. I create a table of each reported disease as a factor, sort the diseases by frequency for Mothers/Fathers, and save the top ten. This is where I keep running into trouble. I cannot figure out how to re-factor / re-level / re-type / re-classify this dataframe so that it becomes a recognized class for plotting. Everything I have tried seems to result in new errors. I’m hoping that if I look away from it for a bit, the solution will come to me.
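One way to sidestep the reshaping trouble described above is to keep the counts in a tidy "one row per relation and disease" layout, which most plotting tools accept directly. The sketch below shows the idea in plain Python; it is not the notebook's R code, and the row layout, sample data, and helper name `top_diseases` are assumptions.

```python
from collections import Counter

# Hypothetical long-format rows: (patient_id, relation, disease).
rows = [
    ("p1", "Mother", "Cancer"),
    ("p2", "Mother", "Cancer"),
    ("p2", "Mother", "Diabetes"),
    ("p3", "Father", "Heart disease"),
]


def top_diseases(rows, relation, k=10):
    """Count diseases reported for one relation and return the k most frequent
    as tidy records: one dict per (relation, disease) pair."""
    counts = Counter(disease for _, rel, disease in rows if rel == relation)
    return [
        {"relation": relation, "disease": disease, "freq": freq}
        for disease, freq in counts.most_common(k)
    ]


top_mom = top_diseases(rows, "Mother")
top_dad = top_diseases(rows, "Father")
tidy = top_mom + top_dad               # stacked records, ready for a grouped bar chart
print(tidy)
```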

My goal is to get a good initial understanding of the family disease data and how it relates to the patient. From here, I can apply these exploratory methods to each of the patient's other family members. I also need to consider that only the first instance of a disease for a family member is shown, and that this needs to be corrected before any relationships can be assessed.
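A possible fix for the first-instance limitation is to collect every reported disease per family member instead of only the first. The sketch below illustrates this under assumed names and sample data (`collect_all`, `family_has`, and the row layout are hypothetical); it is not the notebook's implementation.

```python
from collections import defaultdict

# Hypothetical long-format rows: (patient_id, relation, disease).
rows = [
    ("p1", "Mother", "Cancer"),
    ("p1", "Mother", "Heart failure"),   # would be lost by a first-instance pivot
    ("p1", "Father", "Diabetes"),
    ("p2", "Sister", "Asthma"),
]


def collect_all(rows):
    """Map patient -> relation -> list of all distinct diseases, in listed order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, relation, disease in rows:
        if disease not in grouped[patient_id][relation]:
            grouped[patient_id][relation].append(disease)
    # Convert to plain dicts so missing keys do not get created on lookup.
    return {pid: dict(relatives) for pid, relatives in grouped.items()}


def family_has(all_dx, patient_id, disease):
    """True if any listed relative of the patient reported the disease."""
    relatives = all_dx.get(patient_id, {})
    return any(disease in reported for reported in relatives.values())


all_dx = collect_all(rows)
print(all_dx["p1"]["Mother"], family_has(all_dx, "p1", "Heart failure"))
```

A per-patient flag such as `family_has(all_dx, pid, "Heart failure")` could then be compared against the patient's own diagnoses.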

Top Ten Diseases by Patient Immediate Family
Mom diseases             Freq (mom)  Dad diseases             Freq (dad)  Brother diseases         Freq (bro)  Sister diseases          Freq (sis)
Cancer                   965         Heart disease            1089        Cancer                   556         Cancer                   512
Diabetes                 886         Cancer                   946         Diabetes                 398         Diabetes                 418
Heart disease            817         Heart attack             578         Heart disease            344         Heart disease            210
Hypertension             483         Diabetes                 569         Heart attack             157         Breast cancer            196
Heart attack             276         Hypertension             368         Hypertension             129         Hypertension             165
Arthritis                256         Stroke                   214         Alcohol abuse            87          Arthritis                77
Stroke                   230         Coronary artery disease  190         COPD                     61          Heart attack             65
Breast cancer            199         Alcohol abuse            186         Coronary artery disease  54          Stroke                   51
Heart failure            165         Heart failure            145         Arthritis                48          COPD                     49
Coronary artery disease  125         COPD                     141         Asthma                   48          Asthma                   47

The initial Data Viz plots are output to a single PDF.