Some Pretty Plots

HEART FAILURE, FAMILY HISTORY

The purpose of this notebook is to examine EHR data to better understand the complex relationships between heart failure, air pollution, and family history. Here, we match ~21,000 individual EHRs with ~11,700 instances of family health history. Below, this data is loaded into the workspace. The family history is converted from long to wide format.

In the Family History data, some family members report multiple diseases, which creates duplicate columns for a single individual. Presently, the dataframe fam_wide takes each potential family member as a factor and reports the first listed instance of disease, or reports NA for none listed.

Here, the two dataframes are merged so that each EHR is paired with the affiliated family history (fam_wide). The family history contained extraneous records for which we do not having corresponding EHR data, so these were removed.

library(faraway); library(gtsummary);library(ggplot2);library(gridExtra);library(kableExtra);library(gt);library(tidyverse);library(glue);library(formattable);library(DT)

load("alice_data.RData")
fam <- read.csv("fam_hx.csv")
fam_wide <- reshape(fam, idvar = "MRN", timevar = "FAM_RLTNP", direction = "wide")

x <- merge(alice_data,fam_wide, all=TRUE)
df <- x[!is.na(x$AGE),]

summary table

Below, a summary table examines the patient history data by whether or not the patient is living. To be honest, I can’t remember off the top of my head which package we are using here, but will confirm on Thursday.

tbl_summary(df[,2:27], by= ALL_CAUSE) %>% #this is grabbing only the factors I care about from the patient data 
  add_p() %>% #add_p will add a column with a p-value of some inferential stat between the two variables
    modify_spanning_header(everything() ~ "**SUMMARY OF PT DATA BY ALIVENESS**") %>%
  modify_header(label = "**Variable**", stat_1 = "**LIVING, N = 15,599**",stat_2="**DECEASED, N = 5,321**") %>% 
  bold_labels()

SUMMARY OF PT DATA BY ALIVENESS
Variable	LIVING, N = 15,599¹	DECEASED, N = 5,321¹	p-value²
AGE	69 (58, 79)	74 (63, 84)	<0.001
SEX_CD			<0.001
F	8,342 (53%)	2,656 (50%)
M	7,257 (47%)	2,665 (50%)
RACE			<0.001
BLACK	4,227 (27%)	1,337 (25%)
OTHER	1,177 (7.5%)	304 (5.7%)
WHITE	10,195 (65%)	3,680 (69%)
ANNUAL_AVG_PM	9.30 (8.60, 10.06)	10.54 (9.34, 12.76)	<0.001
visit_year	2,015 (2,012, 2,016)	2,009 (2,006, 2,014)	<0.001
income	48,276 (35,000, 65,333)	52,550 (36,967, 70,795)	<0.001
med_house_value	146,300 (105,600, 214,700)	167,400 (114,200, 263,400)	<0.001
pubassist	0.92 (0.00, 2.93)	0.57 (0.00, 2.91)	<0.001
urban	89 (13, 100)	87 (11, 100)	0.8
poverty	14 (7, 25)	13 (6, 23)	<0.001
NEW_TOT_VISITS	11 (4, 26)	10 (3, 31)	0.001
N_OUTPATIENT	9 (3, 24)	8 (1, 27)	<0.001
NEW_INPATIENT	1.00 (0.00, 2.00)	1.00 (1.00, 3.00)	<0.001
N_EMERGENCY	0.00 (0.00, 1.00)	1.00 (0.00, 3.00)	<0.001
EMERG_NOT_INPATIENT	0.00 (0.00, 0.00)	0.00 (0.00, 1.00)	<0.001
log_FU	0.64 (-0.29, 1.47)	0.02 (-1.54, 1.04)	<0.001
PM_5day_avg	8.8 (7.0, 10.9)	9.8 (7.7, 12.6)	<0.001
HARMONIZED_SMK			<0.001
CURRENT	1,785 (11%)	244 (4.6%)
FORMER	5,686 (36%)	829 (16%)
NEVER	5,545 (36%)	631 (12%)
UNKNOWN	2,583 (17%)	3,617 (68%)
CKD	9,234 (59%)	3,949 (74%)	<0.001
IHD	9,635 (62%)	3,803 (71%)	<0.001
BP_PRI	12,248 (79%)	4,354 (82%)	<0.001
COPD	6,468 (41%)	2,557 (48%)	<0.001
T2D	5,480 (35%)	2,147 (40%)	<0.001
LIPID	12,590 (81%)	4,452 (84%)	<0.001
PAD	6,631 (43%)	2,598 (49%)	<0.001
¹ Statistics presented: median (IQR); n (%) ² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

Data Viz

Some visualizations of the hospitalizations data

The code chunk below is two different lists of cosmetic attributes we will apply to ggplots later. These aren’t necessary but make the plots look nicer to me.

gghisto <- list(
  theme(axis.text.x = element_text(face="bold", color="cornflowerblue", size=14,angle=17),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=16, angle=25),
          axis.title=element_text(size=17,face="italic"),
          plot.title = element_text(size=20,face="bold.italic"))
)

ggbar <- list(
  theme(axis.text.x = element_text(face="bold", color="gray18", size=10,angle=20),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=12, angle=25),
          axis.title=element_text(size=12,face="italic"),
          plot.title = element_text(size=16,face="bold.italic"))
)

Frequency of Total Hospital Visits

p1 <- ggplot(df, aes(x=NEW_TOT_VISITS)) + #grabs the total visits from the df
  geom_histogram(aes(y=..density..),      #creates a histogram of dist (bars)
                 binwidth=6,
                color="gray18",fill="white")+
  geom_density(alpha=.2, fill="red",size=.5, color="gray14") + #creates a density distribution (pink)
  ggtitle("Frequency Dist of Total Hospital Visits")+
  scale_x_continuous(limits = c(0, 150))+ #I added x-axis limits because there are some funky outliers that go out to over 700. 
  gghisto

p1 #display p1

t <- as.data.frame(table(df$NEW_TOT_VISITS)) #create a frequency table to accompany the plot we made
colnames(t) <- c("No. of Hospital Admissions","Frequency")
as.datatable(formattable(t, align=c("c","r"),list(Frequency = normalize_bar(color="pink"), title)),rownames=FALSE,options = list(
    autoWidth=TRUE,
    columnDefs = list(list(className = 'dt-center', targets = 0)),
    columnDefs = list(list(width = '150px', targets = list(0:1)))))

Frequency of Out-Patient Visits

p2 <- ggplot(df, aes(x=N_OUTPATIENT)) + 
  geom_histogram(aes(y=..density..),      
                 binwidth=3,
                color="gray18",fill="white")+
  geom_density(alpha=.2, fill="red",size=.5, color="gray14") +
  ggtitle("Frequency Dist of Outpatient Visits")+
  scale_x_continuous(limits = c(-1, 100))+ #Using `-1` in the parameter is cosmetic and technically makes the distribution inaccurate, but here I like the way it displays the plot better
  gghisto

p2

t <- as.data.frame(table(df$N_OUTPATIENT))
colnames(t) <- c("No. of OutPatient Admissions","Frequency")
as.datatable(formattable(t, align=c("c","r"),list(Frequency = normalize_bar(color="pink"), title)),rownames=FALSE,options = list(
    autoWidth=TRUE,
    columnDefs = list(list(className = 'dt-center', targets = 0)),
    columnDefs = list(list(width = '100px', targets = list(0:1)))))

Next, we do further data visualization of patient data. This code must be run all at once or in order, as I re-use variable names. This is bad practice and probably best if you don’t re-use variable names like I did. I was being lazy

The first three plots are density plots of continuous data we have in our data set. They are arranged on a page in a single column. The next three plots look at annual exposure to particulate matter and relation to smoking status, race, and whether the patient is alive or dead. The last four plots examine the distribution of exposure to particulate matter and different diseases the patient may have.

pa <- ggplot(df, aes(x=ANNUAL_AVG_PM)) + 
  geom_histogram(aes(y=..density..),      
                 binwidth=.5,
                color="gray18",fill="white")+
  geom_density(alpha=.2, fill="red",size=.5, color="gray14") +
  ggtitle("pm exposure distribution for entire population")+
  gghisto

pb <- ggplot(df, aes(x=income)) + 
  geom_histogram(aes(y=..density..),      
                 binwidth=10000,
                color="gray18",fill="white")+
  geom_density(alpha=.2, fill="red",size=.5, color="gray14") +
  ggtitle("Data Distribution of Pt Income")+
  gghisto

pc <- ggplot(df, aes(x=med_house_value)) + 
  geom_histogram(aes(y=..density..),      
                 binwidth=28000,
                color="gray18",fill="white")+
  geom_density(alpha=.2, fill="red",size=.5, color="gray14") +
  ggtitle("Data Dist of Median House Val")+
  gghisto

gen_data_plots <- grid.arrange(pa,pb,pc) #put plots on page

a <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = HARMONIZED_SMK)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x SMOKING")+gghisto
b <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = RACE)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x RACE")+gghisto
c <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = ALL_CAUSE)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x DEATH")+gghisto

ppm_plots <- grid.arrange(a,b,c) #put plots on page

a <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = CKD)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x CKD")+gghisto
b <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = IHD)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x ihd")+gghisto
c <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = COPD)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x copd")+gghisto
d <- ggplot(df, aes(x = ANNUAL_AVG_PM, fill = T2D)) +geom_density(alpha = .3)+ ggtitle("AVG ANNUAL PM x t2d")+gghisto

disease_ppm_plots1 <- grid.arrange(a,b,c,d, nrow=3,ncol=2) #put plots on page

Disease Distribution for All Patients

Here, we create a table with information on each disease and its prevalence among patients in our data set. A loop or some other function is better and faster here, so I recommend doing that for the “correct” way. Here, I made an individual table for each of the seven diseases, then combined each of them into one dataframe.

a <- table(df[,21])["TRUE"]
b <- table(df[,21])["FALSE"]
c <- a/(a+b)
ckd <- c(a,b,c)

a <- table(df[,22])["TRUE"]
b <- table(df[,22])["FALSE"]
c <- a/(a+b)
ihd <- c(a,b,c)

a <- table(df[,23])["TRUE"]
b <- table(df[,23])["FALSE"]
c <- a/(a+b)
bp_pri <- c(a,b,c)

a <- table(df[,24])["TRUE"]
b <- table(df[,24])["FALSE"]
c <- a/(a+b)
copd <- c(a,b,c)

a <- table(df[,25])["TRUE"]
b <- table(df[,25])["FALSE"]
c <- a/(a+b)
t2d <- c(a,b,c)

a <- table(df[,26])["TRUE"]
b <- table(df[,26])["FALSE"]
c <- a/(a+b)
lipid <- c(a,b,c)

a <- table(df[,27])["TRUE"]
b <- table(df[,27])["FALSE"]
c <- a/(a+b)
pad <- c(a,b,c)

d <- as.data.frame(rbind(ckd,ihd,bp_pri,copd,t2d,lipid,pad))
colnames(d) <- c("True","False","Proportion with Disease")

formattable(d,list(`True`= color_bar("#B1F4FF"),`False` = color_bar("pink"),`Proportion with Disease`= percent), caption="General Distributions of Diseases for All Patients")

General Distributions of Diseases for All Patients
	True	False	Proportion with Disease
ckd	13183	7737	63.02%
ihd	13438	7482	64.24%
bp_pri	16602	4318	79.36%
copd	9025	11895	43.14%
t2d	7627	13293	36.46%
lipid	17042	3878	81.46%
pad	9229	11691	44.12%

Here, we begin examining family history data.

t <- table(df$FAM_MDCL_HIST.Father)
dad <- sort(t, decreasing=TRUE)[1:10] 
dad <- data.frame(dad)
colnames(dad) <- c("disease","freq")

t <- table(df$FAM_MDCL_HIST.Mother)
mom <- sort(t, decreasing=TRUE)[1:10] 
mom <- data.frame(mom)
colnames(mom) <- c("disease","freq")

fam <- cbind(mom,dad)
colnames(fam) <- c("Top 10 Mom Diseases","frequency","Top 10 Dad Diseases","freq (dad)")
formattable(fam,list(`frequency` = color_bar("#FF78A4"),`freq (dad)` = color_bar("#59E7FF")),caption="Top Ten Diseases by Parents")

Top Ten Diseases by Parents
Top 10 Mom Diseases	frequency	Top 10 Dad Diseases	freq (dad)
Cancer	965	Heart disease	1089
Diabetes	886	Cancer	946
Heart disease	817	Heart attack	578
Hypertension	483	Diabetes	569
Heart attack	276	Hypertension	368
Arthritis	256	Stroke	214
Stroke	230	Coronary artery disease	190
Breast cancer	199	Alcohol abuse	186
Heart failure	165	Heart failure	145
Coronary artery disease	125	COPD	141