Introduction - Heart Disease and its Consequences

Determining whether or not if a patient with chest pain is the result of a disease process, acute coronary syndrome, can sometimes be difficult to differentiate for physicians. Often times, when a patient arrives to the emergency department, a physician is responsible for determining if the chest pain is from a dangerous acute pathology versus a benign pathology, such as heartburn, muscle cramps, etc. In medical literature, the consensus is that physical complaints (such as the characteristic, intensity, duration, location) can be unreliable predictors for bad cardiac outcomes. It does seem intuitive that if a patient is a smoker, has bad eating habits, or has history of hypertension, diabetes, high cholesterol, the patient will have a higher pretest probability of underlying coronary artery disease.

Ultimately, if the chest pain is deemed likely to be secondary to athlerosclerotic coronary arteries, a cardiologist must determine the severity of the heart disease. A very common method is the stress test. This test is design such that when a patient walks a treadmill, he will start to reach a target heart rate (ie. 150 beats per minute). Once he reaches that target heart rate, the cardiologist uses an ultrasound to visualize any abnormal movements of the heart walls. If there are abnormal cardiac movements, that would be considered a “positive” stress test, and that would require further testing and intervention (such as the cardiac cathertization).

However, many patients are simply too debilitated to walk the treadmill. Different comorbidities such as arthritis, obesity, etc. can limit the patient from reaching their target heart rate, thus making the exam limited in interpretation. An alternative is the nuclear stress test. Instead of walking on a treadmill, the patient is injected with a medication, dobutamine, which artifically elevates the heart rate to the desired heart rate. With significant review, the nuclear stress test is more sensitive to coronary artery disease than the walking treadmill stress test. Again, if there are any cardiac wall motion abnormalities, then it’s considered a positive stress test, and that will require further testing and/or intervention.

Ultimately, what physicians care about is whether or not significant morbidiy will arise from a positive (or perhaps even a negative) stress test. In this study that was published in the “Journal of the American College of Cardiology” (1999), data was collected on the demographics of the patient, their medical history, whether or not they had a postiive stress test, and if they sustained any “bad” outcomes over the course of 1 year. In this study, a “bad” outcome was defined as: myocardial infarction (heart attack), the need to undergo an invasive cardiac procedure, percutaneous transluminal coronary angioplasty (PTCA: this implies that the patient had one of his coronary arteries block which can more or less be a proxy for a heart attack) or coronary artery bypass grafting (CABG: a major cardiac surgery where they need to “create” new coronary arteries for the patient given that the native ones are blocked or badly damaged), or death.

However, one must remember that because a stress test is positive, this does not imply that a patient will have one of those adverse outcomes. If a stress test is positive, but no medical consequence results out of it, then in the medical community, we call it “clinically insignificant”. It is also important to note that this paper is technically a cohort study, which is an observational study and cannot be used to determine causation.

Given the consequences of heart disease and its potential impact on morbidity and mortality of a person, a physician must always be on the lookout for myocardiac infarctions (heart attacks), acute heart failure (congestive heart failure), etc. When a patient presents to the emergency department with chest pain, how do the physicians decide which chest pain is benign and which is not. Therefore, below are three questions that I had proposed in my DATA 606 proposal to help assist the physician in his/her decision making process of either admitting the patient for a cardiac workup or discharging the patient home.

  1. Does a positive stress test lead to higher probabilities of having an adverse outcome, such as MI, CABG/PTCA, or death in 1 year?

  2. Does having a history of a previous MI put you at higher risk for adverse outcomes in 1 year from the stress test?

  3. Are older patients more likely to have an adverse outcome in 1 year from the stress test compared to younger patients.?

Data Preparation

Initially, a total of 1183 patients (wich cardiac disease) was referred to the UCLA Adult Cardiac Imaging and Hemodynamics Laborators for a Dobuatmine Stress Evaluation between March 1991 and March 1996. The patients had given informed consent and their medical charts were reviewed for potential enrollment. If the patient had underwent multiple dobutamine stress tests (DSE), they only considered the first DSE in the study. Patients who were under consideration for liver or heart transplants, and patients who had prior cardiac catheterizations (PCTA) within the last 6 months were excluded. An additional 376 patients were also excluded from the final analysis secondary to lack of follow up. Ultimately, the final study population comprised of 558 patients and were utilized for data analysis.

While the study had performed their own analysis on their patients, I will be performing my own independent review (analysis) of their patients. In this case, I had evaluated multiple variables pertaining to the patients’ heart diseases. Given that there are three questions, the explanatory variables that will be explored are 1: positive stress tests, 2: history of previous myocardial infarctions, and 3: Age. The response variable is the adverse outcome as described in the introduction section. Let’s inspect the variables closer. Variables 1 (positive stress test) and 2 (history of previous MI) are categorical explanatory variables, whereas variable 3 (age) is a numerical explanatory variable. The response variable (which is the same for all three questions) is a categorical response.

As noted in the introduction, this was a cohort study. A cohort study is “an analysis of risk factors and follows a group of people who typically do not have a given disease…(It is) largely about the life histories of segments of populations, and the individual people who constitute these segments.” (Wikipedia). A cohort is a group of people who share a common characteristic within a defined period of time. As in this case, 1183 patients were followed over a course of 12 months (with many lost to follow up). Given each patient’s risks factors, the investigators had attempted to track whether or not the patient sustained an adverse outcome.

These studies can be either prospective or retrospective, depending on the study design. In this study, this was a prospective cohort study. The advatange of a prospective cohort study is that data is collected over time in regular intervals, helping reduce recall bias. But these studies can be particularly expensive to maintain and many patients can fall into attrition and lost to follow-up. Though this type of study is not quite as robust as double blinded randomized control study, these prospective cohort studies can be pretty informative.

In the medicine world, we are often very careful of what we consider generalizable from a study. Though it appears that many types of patients with history of hypertension, diabetes, smoking history, and so forth had participated in the study, this cannot be generalized throughout the rest of the country until it is validated by many other hospitals/university centers. The limiting factor in this study is that the patients were 1: referred patients and 2: patients from southern California. We cannot assume that patients in Southern California are similar to patients in southern Alabama or in Chicago. A prime example that demonstrates this problem was a study from regarding feinting called the San Francisco Syncope Rule. (Quinn, James V., et al. “Derivation of the San Francisco Syncope Rule to predict patients with short-term serious outcomes.” Annals of emergency medicine 43.2 (2004): 224-232). This study investigated which patients who had feinted required admission and which patients could be slotted for discharge. Though the study had performed with high sensitivity in San Francisco, it failed to do in a repeat study in the Bronx at Jacobi Medical Center. Again, patients in the Bronx are not the same as in San Francisco, so extra caution needs to be taken before we decide to generalize a study to the general patient population. What we could conclude from this study could be generalized a subset of referred cardiac patients who had underwent stress test from Southern California.

Observation studies in general should technically not be used to establish causal links, as observations are not technically studies. However, in medicine, cohort studies can be considered quasi-studies that can infer some causality (though with caution). In thise case, we could potentially implicate risk factors to causing an adverse outcome, but we have to remember that observation studies cannot take place of a true double-blinded randomized control study, where a causation can be inferred.

Explatory Data Analysis and Inference - Cardiac Disease

Let’s explore the data and answer the three questions that were proposed. Below is the R code and the analysis to reach those answers.

Load libraries.

library(RCurl)
library(psych)
library(tidyr)
library(dplyr)
library(ggplot2)
library(gmodels)

Import the data.

# Import data in from Github
# In this case, import in the Stress Echo dataset.
# Garfinkel, Alan, et. al. "Prognostic Value of Dobutamine Stress Echocardiography in Predicting Cardiac Events in Patients With Known or Suspected Coronary Artery Disease." Journal of the American College of Cardiology 33.3 (1999) 708-16.

#This data was collected from the Department of Biostatistics at Vanderbilt University. The .csv file was downloaded and uploaded onto GitHub. The file was subsequently downloaded with the 'RCurl' package.

# http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

stress.echo <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/stressEcho.csv", header = TRUE)

# We will now subset our data to answer the 3 questions listed below.

# For the 1st question:
stress1 <- subset(stress.echo, select = c(posSE, newMI, newPTCA, newCABG, death))
colnames(stress1) <- c("Stress.Test", "New.MI", "New.PTCA", "New.CABG", "Death")
# We will need to subset these data sets again into positive stress test vs. negative stress test
stress.test.positive <- stress1 %>% filter(Stress.Test == 1)
stress.test.negative <- stress1 %>% filter(Stress.Test == 0)

# For the 2nd question:
stress2 <- subset(stress.echo, select = c(hxofMI, newMI, newPTCA, newCABG, death))
colnames(stress2) <- c("Hx.MI", "New.MI", "New.PTCA", "New.CABG", "Death")
# Subset this data into patients with history of MI and patients without history of MI.
With.MI <- stress2 %>% filter(Hx.MI == 1)
Without.MI <- stress2 %>% filter(Hx.MI == 0)

# For the 3rd question:
stress3 <- subset(stress.echo, select = c(age, newMI, newPTCA, newCABG, death))
colnames(stress3) <- c("Age", "New.MI", "New.PTCA", "New.CABG", "Death")
# According to the recent validated scoring system to risk stratify cardiac patients, Six AJ, Backus BE, Kelder JC. Chest pain in the emergency room: value of the HEART score. Netherlands Heart Journal. 2008;16(6):191-196, they had subsetted their patients to < 45 years old, 45 to 65 years old, and greater than 65 years old, This is what I will be subsetting them into.
Age.45 <- stress3 %>% filter(Age < 45)
Age.45.65 <- stress3 %>% filter(Age >= 45 & Age <= 65)
Age.65 <- stress3 %>% filter(Age > 65)

Research question

  1. Does a positive stress test lead to higher probabilities of having an adverse outcome, such as MI, CABG/PTCA, or death in 1 year?
# Using the describe and summary function to take a look at the overall characteristics of stress test positive patients vs. the stress test negative patients.
describe(stress.test.positive)
##             vars   n mean   sd median trimmed mad min max range skew
## Stress.Test    1 136 1.00 0.00      1    1.00   0   1   1     0  NaN
## New.MI         2 136 0.10 0.31      0    0.01   0   0   1     1 2.58
## New.PTCA       3 136 0.10 0.30      0    0.00   0   0   1     1 2.72
## New.CABG       4 136 0.15 0.36      0    0.07   0   0   1     1 1.89
## Death          5 136 0.08 0.27      0    0.00   0   0   1     1 3.04
##             kurtosis   se
## Stress.Test      NaN 0.00
## New.MI          4.71 0.03
## New.PTCA        5.44 0.03
## New.CABG        1.59 0.03
## Death           7.30 0.02
summary(stress.test.positive)
##   Stress.Test     New.MI          New.PTCA          New.CABG     
##  Min.   :1    Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1    1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1    Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :1    Mean   :0.1029   Mean   :0.09559   Mean   :0.1544  
##  3rd Qu.:1    3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1    Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##      Death        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.08088  
##  3rd Qu.:0.00000  
##  Max.   :1.00000
describe(stress.test.negative)
##             vars   n mean   sd median trimmed mad min max range skew
## Stress.Test    1 422 0.00 0.00      0       0   0   0   0     0  NaN
## New.MI         2 422 0.03 0.18      0       0   0   0   1     1 5.19
## New.PTCA       3 422 0.03 0.18      0       0   0   0   1     1 5.19
## New.CABG       4 422 0.03 0.17      0       0   0   0   1     1 5.65
## Death          5 422 0.03 0.17      0       0   0   0   1     1 5.41
##             kurtosis   se
## Stress.Test      NaN 0.00
## New.MI         25.04 0.01
## New.PTCA       25.04 0.01
## New.CABG       30.04 0.01
## Death          27.35 0.01
summary(stress.test.negative)
##   Stress.Test     New.MI           New.PTCA          New.CABG      
##  Min.   :0    Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0    1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0    Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0    Mean   :0.03318   Mean   :0.03318   Mean   :0.02844  
##  3rd Qu.:0    3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :0    Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##      Death        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.03081  
##  3rd Qu.:0.00000  
##  Max.   :1.00000
# Let's visualize some of the data

barplot(table(stress1$Stress.Test))

# There are 422 patients with a negative stress test a 136 patients with a positive stress test. 

paste("The total deaths with patients with a positive stress test: ", sum(stress.test.positive$Death))
## [1] "The total deaths with patients with a positive stress test:  11"
paste("The total deaths with patients with a negative stress test: ", sum(stress.test.negative$Death))
## [1] "The total deaths with patients with a negative stress test:  13"
paste("Percentage of patients who died with a positive stress test: ", round(sum(stress.test.positive$Death)/length(stress.test.positive$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who died with a positive stress test:  8 %"
paste("Percentage of patients who died with a negative stress test: ", round(sum(stress.test.positive$Death)/length(stress.test.negative$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who died with a negative stress test:  3 %"
mosaicplot(table(stress1$Stress.Test, stress1$Death), xlab = "Positive vs. Negative Stress Test", ylab = "Death", main = "Mosaic Plot of Stress Test vs. Death")

# With the table() function, we can determine in each subset group the absolute number of MI's in the positive (1) group vs. the negative (0) group.
pos.stress.MI <- table(stress.test.positive$New.MI)
neg.stress.MI <- table(stress.test.negative$New.MI)
pos.stress.MI
## 
##   0   1 
## 122  14
neg.stress.MI
## 
##   0   1 
## 408  14
# Let's use a bar plot to see the amount of MI's for each group.
barplot(pos.stress.MI, main = "Myocardial Infarctions in 1 year with a Positive Stress Test", ylim = c(0,400), col=c("darkblue","red"))

barplot(neg.stress.MI, main = "Myocardial Infarctions in 1 year with a Negative Stress Test", ylim = c(0, 400), col=c("darkblue","red"))

paste("The total patients with a positive stress test who ended up with a cardiac bypass: ", sum(stress.test.positive$New.CABG))
## [1] "The total patients with a positive stress test who ended up with a cardiac bypass:  21"
paste("The total patients with a negative stress test who ended up with a cardiac bypass: ", sum(stress.test.negative$New.CABG))
## [1] "The total patients with a negative stress test who ended up with a cardiac bypass:  12"
paste("Percentage of patients who had a cardiac bypass with a positive stress test: ", round(sum(stress.test.positive$New.CABG)/length(stress.test.positive$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had a cardiac bypass with a positive stress test:  15 %"
paste("Percentage of patients who had a cardiac bypass with a negative stress test: ", round(sum(stress.test.positive$New.CABG)/length(stress.test.negative$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had a cardiac bypass with a negative stress test:  5 %"
mosaicplot(table(stress1$Stress.Test, stress1$New.CABG), xlab = "Positive vs. Negative Stress Test", ylab = "CABG", main = "Mosaic Plot of Stress Test vs. CABG")

paste("The total patients with a positive stress test who ended up with a new MI: ", sum(stress.test.positive$New.MI))
## [1] "The total patients with a positive stress test who ended up with a new MI:  14"
paste("The total patients with a negative stress test who ended up with a new MI: ", sum(stress.test.negative$New.MI))
## [1] "The total patients with a negative stress test who ended up with a new MI:  14"
paste("Percentage of patients who had another heart attack (MI) with a positive stress test: ", round(sum(stress.test.positive$New.MI)/length(stress.test.positive$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had another heart attack (MI) with a positive stress test:  10 %"
paste("Percentage of patients who had another heart attack (MI) with a negative stress test: ", round(sum(stress.test.positive$New.MI)/length(stress.test.negative$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had another heart attack (MI) with a negative stress test:  3 %"
mosaicplot(table(stress1$Stress.Test, stress1$New.MI), xlab = "Positive vs. Negative Stress Test", ylab = "MI", main = "Mosaic Plot of Stress Test vs. MI")

paste("The total patients with a positive stress test who ended up with a PTCA: ", sum(stress.test.positive$New.PTCA))
## [1] "The total patients with a positive stress test who ended up with a PTCA:  13"
paste("The total patients with a negative stress test who ended up with a PTCA: ", sum(stress.test.negative$New.PTCA))
## [1] "The total patients with a negative stress test who ended up with a PTCA:  14"
paste("Percentage of patients who had a PTCA (cardic cath) with a positive stress test: ", round(sum(stress.test.positive$New.PTCA)/length(stress.test.positive$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had a PTCA (cardic cath) with a positive stress test:  10 %"
paste("Percentage of patients who had a PTCA (cardiac cath) with a negative stress test: ", round(sum(stress.test.positive$New.PTCA)/length(stress.test.negative$Stress.Test), 2) * 100, "%")
## [1] "Percentage of patients who had a PTCA (cardiac cath) with a negative stress test:  3 %"
mosaicplot(table(stress1$Stress.Test, stress1$New.PTCA), xlab = "Positive vs. Negative Stress Test", ylab = "PTCA", main = "Mosaic Plot of Stress Test vs. PTCA")

There are 422 patients with a negative stress test a 136 patients with a positive stress test. And from looking at the absolute numbers and percentages, there seems to be a trend to worse outcomes in patients who have had positive stress tests, which makes intuitive sense. If we assume that there is truly a difference, we can perform conditional probability and determine what is the likliehood of dying if the patient will die if there is a positive stress test, in other words (Probability of Death given Positive Stress Test). In order to do this, we will need to create a contingency table.

CrossTable(stress1$Stress.Test, stress1$Death, prop.t= TRUE, pop.r=TRUE, prop.c=TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  558 
## 
##  
##                     | stress1$Death 
## stress1$Stress.Test |         0 |         1 | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   0 |       409 |        13 |       422 | 
##                     |     0.066 |     1.462 |           | 
##                     |     0.969 |     0.031 |     0.756 | 
##                     |     0.766 |     0.542 |           | 
##                     |     0.733 |     0.023 |           | 
## --------------------|-----------|-----------|-----------|
##                   1 |       125 |        11 |       136 | 
##                     |     0.204 |     4.535 |           | 
##                     |     0.919 |     0.081 |     0.244 | 
##                     |     0.234 |     0.458 |           | 
##                     |     0.224 |     0.020 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |       534 |        24 |       558 | 
##                     |     0.957 |     0.043 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 

So, assuming that this data is generalizable and true, what is the probability that a cardiac patient with a positive stress test will die in the 12 months? Let’s compare that to the probability of a cardiac patient with a negative stress test.

# P(outcome = Death | stress test = Positive) = P(result = Death AND stress test = Positive)/P(stress test = Positive)
round((0.020)/(.244),3)
## [1] 0.082
# P(outcome = Death | stress test = Negative) = P(result = Death AND stress test = Negative)/P(stress test = Negative)
round((0.023)/(0.756),3)
## [1] 0.03

The same can be repeated for the other adverse outcomes, but this all relies on the assumptions that this is generalizable. More importantly, we need to know if these numbers have any true statistical significance. While the numbers above suggest that having a positive stress test leads to bad outcomes, is this really true? Or is this by accident. Let’s create two hypotheses for this scenario.

H0 - Null Hypothesis: There is no difference in adverse outcomes between patients with a positive stress test and patients with a negative stress test.

HA - Alternate Hypothesis: There IS a statistical difference in adverse outcomes between patients with a positive stress test and a negative stress test.

We can help investigate further into the truth with the help of statistics. (We will refer to Chapter 6 of Introduction to Open Statistics 3rd Edition, “Inference for categorical data”).

We would like to make conclusions about the difference in two population proportions: p1 - p2 (p1 = positive stress patients, p2 = negative stress patients). In this investigation, we must identify a reasonable point estimate of p1 - p2 based on this sample size. Then we must verify that the point estimate follows the normal model by checking its conditions. And finally, we will compute the estimate’s standard error and create an inference from this information.

Our reasonable point estimate here will be the difference between p1 and p2 (p1 - p2). In order to start this analysis, we must satisfy some conditions. First, the samples must be independent. This is a reasonable assumption here given that most (if not all) patients come from the general population of Southern California and not likely to be all relatives i.e. cousins, newphews, etc. or come from one town. In addition, 558 patients is certainly way less than 10% of the Southern California population. Secondly, the sample distribution for each sample proportion must be nearly normal. Let’s take a look at this second condition more closely.

To satisfy the conditions for the sampling distribution of p1, p2 being nearly normal, we must expect to see at least 10 successes and 10 failures in the sample i.e. np >= 10 and n(1-p) >= 10, thus satisfying the success-failure condition, where n = sample size and p = true proportion. If both conditions are satisfied, then the standard error = sqrt((p (1-p)) / n). Given that the true proportion is not truly known in many instances (including many proportions in the medicine world), we will use the proportion claimed in the null hypothesis in place of p.

Let’s see if the criteria are satisfied:

# 1. The samples are independent --> As noted above, we will assume that this is true.

# 2. Success-Failure Condition:
# Define n:
n <- length(as.integer(stress1$Stress.Test)) # n = 558
p <- sum(stress.test.negative$Death)/length(stress.test.negative$Stress.Test) # p = 0.03080569

# Is n*p => 10
paste("n*p = ", n*p, ", and is n*p >= 10: ", n*p >= 10)
## [1] "n*p =  17.1895734597156 , and is n*p >= 10:  TRUE"
# Is n*(1-p) => 10
paste("n*(1-p) = ", n*(1-p), ", and is n*(1-p) >= 10: ", n*(1-p) >= 10)
## [1] "n*(1-p) =  540.810426540284 , and is n*(1-p) >= 10:  TRUE"
# This the criteria are satisfied, we can calculate the standard error
SE <- sqrt((p * (1-p)) / n)
paste("The standard error for this sample is: ", round(SE,4))
## [1] "The standard error for this sample is:  0.0073"

Great, the criteria is satisfied. We also find that the standard error is 0.0086.

A confidence interval is a range of values so defined that there is a specified probability that the value of a parameter lies within it (Google Dictionary). This gives us a range where we expect a population parameter to lie within. Typically, the general accepted confidence interval is 95%. It is calculated by: point estimate +/- z * SE

# 95% Confidence Interval
z.95 <- 1.96
Upper95 <- p + z.95 * SE
Lower95 <- p - z.95 * SE
paste("95% Confidence Interval range: ", round(Lower95,4), "to", round(Upper95,4))
## [1] "95% Confidence Interval range:  0.0165 to 0.0451"
# Point estimate for patients with positive stress test
p.alt <- sum(stress.test.positive$Death)/length(stress.test.positive$Stress.Test)
paste("Proportion of deaths with patients with positive stress test: ", round(p.alt, 4))
## [1] "Proportion of deaths with patients with positive stress test:  0.0809"

We are 95% confidence that the true proportion of death for patients with cardiac disease lies between 0.0165 to 0.0451. With the proportion of deaths with patients with a positive stress test lying greater than 0.0451 (at 0.0809), I suspect that there is a statistical difference. Let’s calculate the Z score for 0.0809 assuming that the null hypothesis is true.

# Calculate the Z score
Z  <- (p.alt - p)/SE
paste("Z score: ", round(Z, 4))
## [1] "Z score:  6.8459"
# Calculate the two sided p score
paste("Two sided P value: ", 2* pnorm(Z, lower.tail = FALSE))
## [1] "Two sided P value:  7.59875679395945e-12"
# This value is clearly statistically significant. 

We can make conclusions about the differences in the two population proportions as well, by framing it differently. Again, to analyze a sample distribution of the “difference of two proportions”, we must check two conditions before applying the normal model to (p1 - p2).

  1. The sample distribution for each sample portion must be normal.
  2. The samples must be independent. (All of this was demonstrated above.)

If these two criteria are satisfied, then we can calculate the SE of (p1 - p2) = sqrt((p1(1-p1))/n1 + (p2(1-p2))/n2) and a confidence interval.

We will create a null and alternate hypothesis:

H0 - Null Hypothesis: (p1 - p2) = 0 HA - Alternate Hypothesis: (p1 - p1) != 0

The steps taken will be similar to our prior exercise but instead of using the null hypothesis proportion as the patient population (p), we will be utilizing a special proportion called the pooled proportion to check the success-failure condition (assuming that p1 - p2 = 0).

pooled.p <- sum(stress1$Death)/length(stress1$Stress.Test)
# We will also be using the pooled proportion when computing the standard error

# Let us verify the conditions
# Is n*pooled.p => 10
paste("n*pooled.p = ", n*pooled.p, ", and is n*pooled.p >= 10: ", n*pooled.p >= 10)
## [1] "n*pooled.p =  24 , and is n*pooled.p >= 10:  TRUE"
# Is n*(1-pooled.p) => 10
paste("n*(1-pooled.p) = ", n*(1-pooled.p), ", and is n*(1-pooled.p) >= 10: ", n*(1-pooled.p) >= 10)
## [1] "n*(1-pooled.p) =  534 , and is n*(1-pooled.p) >= 10:  TRUE"
# We can safely apply the normal model.

# Calculating the standard error.
# Compute the point estimate of the difference in deaths between the two groups, using the pooled proportion to calculate the standard error.

p.diff <- (sum(stress.test.positive$Death)/length(stress1$Stress.Test) - sum(stress.test.negative$Death)/length(stress1$Stress.Test))
# There is a -0.003584229 difference between the two groups.

SE.p1 <- (pooled.p * (1 - pooled.p))/length(stress.test.positive$Stress.Test)
SE.p2 <- (pooled.p * (1 - pooled.p))/length(stress.test.negative$Stress.Test)

SE.diff <- sqrt(SE.p1 + SE.p2)
# SE is 0.0200477

# Now let's calculate a Z score
Z.diff <- (pooled.p - 0)/SE.diff
pv.diff <- 2 * pnorm(Z.diff, lower.tail = FALSE)
paste("Deaths: The Z-score is: ", Z.diff, "with a two tailed P value of: ", pv.diff)
## [1] "Deaths: The Z-score is:  2.15002531048579 with a two tailed P value of:  0.0315532127751224"

The two tailed P value here is 0.03155 which is less than alpha = 0.05, thus we can reject the null hypothesis and claim that there is a statistical differences in death between patients who had a positive stress test compared to those with a negative stress test, thus confirming the previous statistical testing. We can perform the same calculations for all of the other adverse outcomes: new MI, new CABG, new PTCA. In this scenario, we will be using the pooled differences to determine the p values and using any p value less than 0.05 to be statistically significant.

# Calculating for statistical difference between new myocardial infarctions between patients with positive and negative stress test. 

# Does it satisfy criteria?
pooled.MI.p <- sum(stress1$New.MI)/length(stress1$Stress.Test)

# Is n*pooled.MI.p => 10
paste("n*pooled.MI.p = ", n*pooled.MI.p, ", and is n*pooled.MI.p >= 10: ", n*pooled.MI.p >= 10)
## [1] "n*pooled.MI.p =  28 , and is n*pooled.MI.p >= 10:  TRUE"
# Is n*(1-pooled.MI.p) => 10
paste("n*(1-pooled.MI.p) = ", n*(1-pooled.MI.p), ", and is n*(1-pooled.MI.p) >= 10: ", n*(1-pooled.MI.p) >= 10)
## [1] "n*(1-pooled.MI.p) =  530 , and is n*(1-pooled.MI.p) >= 10:  TRUE"
SE.MI.p1 <- (pooled.MI.p * (1 - pooled.MI.p))/length(stress.test.positive$Stress.Test)
SE.MI.p2 <- (pooled.MI.p * (1 - pooled.MI.p))/length(stress.test.negative$Stress.Test)

SE.MI.diff <- sqrt(SE.MI.p1 + SE.MI.p2)
# SE is 0.02152654

# Calculating a Z score for new MIs
Z.MI.diff <- (pooled.MI.p - 0)/SE.MI.diff
pv.MI.diff <- 2 * pnorm(Z.MI.diff, lower.tail = FALSE)
paste("MI: The Z-score is: ", Z.MI.diff, "with a two tailed P value of: ", pv.MI.diff)
## [1] "MI: The Z-score is:  2.33103965171082 with a two tailed P value of:  0.0197512689463128"
# Calculating for statistical difference between requiring a cardiac bypass (in the next 12 months) between patients with positive and negative stress test. 

# Does it satisfy criteria?
pooled.CABG.p <- sum(stress1$New.CABG)/length(stress1$Stress.Test)

# Is n*pooled.CABG.p => 10
paste("n*pooled.CABG.p = ", n*pooled.CABG.p, ", and is n*pooled.CABG.p >= 10: ", n*pooled.CABG.p >= 10)
## [1] "n*pooled.CABG.p =  33 , and is n*pooled.CABG.p >= 10:  TRUE"
# Is n*(1-pooled.CABG.p) => 10
paste("n*(1-pooled.CABG.p) = ", n*(1-pooled.CABG.p), ", and is n*(1-pooled.CABG.p) >= 10: ", n*(1-pooled.CABG.p) >= 10)
## [1] "n*(1-pooled.CABG.p) =  525 , and is n*(1-pooled.CABG.p) >= 10:  TRUE"
SE.CABG.p1 <- (pooled.CABG.p * (1 - pooled.CABG.p))/length(stress.test.positive$Stress.Test)
SE.CABG.p2 <- (pooled.CABG.p * (1 - pooled.CABG.p))/length(stress.test.negative$Stress.Test)

SE.CABG.diff <- sqrt(SE.CABG.p1 + SE.CABG.p2)
# SE is 0.02325915

# Calculating a Z score for new CABGs
Z.CABG.diff <- (pooled.CABG.p - 0)/SE.CABG.diff
pv.CABG.diff <- 2 * pnorm(Z.CABG.diff, lower.tail = FALSE)
paste("CABG: The Z-score is: ", Z.CABG.diff, "with a two tailed P value of: ", pv.CABG.diff)
## [1] "CABG: The Z-score is:  2.54264599248758 with a two tailed P value of:  0.0110016630244688"
# Calculating for statistical difference between requiring a cardiac catheterization (PTCA) (in the next 12 months) between patients with positive and negative stress test. 

# Does it satisfy criteria?
pooled.PTCA.p <- sum(stress1$New.PTCA)/length(stress1$Stress.Test)

# Is n*pooled.PTCA.p => 10
paste("n*pooled.PTCA.p = ", n*pooled.PTCA.p, ", and is n*pooled.PTCA.p >= 10: ", n*pooled.PTCA.p >= 10)
## [1] "n*pooled.PTCA.p =  27 , and is n*pooled.PTCA.p >= 10:  TRUE"
# Is n*(1-pooled.PTCA.p) => 10
paste("n*(1-pooled.PTCA.p) = ", n*(1-pooled.PTCA.p), ", and is n*(1-pooled.PTCA.p) >= 10: ", n*(1-pooled.PTCA.p) >= 10)
## [1] "n*(1-pooled.PTCA.p) =  531 , and is n*(1-pooled.PTCA.p) >= 10:  TRUE"
SE.PTCA.p1 <- (pooled.PTCA.p * (1 - pooled.PTCA.p))/length(stress.test.positive$Stress.Test)
SE.PTCA.p2 <- (pooled.PTCA.p * (1 - pooled.PTCA.p))/length(stress.test.negative$Stress.Test)

SE.PTCA.diff <- sqrt(SE.PTCA.p1 + SE.PTCA.p2)
# SE is 0.02115857

# Calculating a Z score for new PTCA
Z.PTCA.diff <- (pooled.PTCA.p - 0)/SE.PTCA.diff
pv.PTCA.diff <- 2 * pnorm(Z.PTCA.diff, lower.tail = FALSE)
paste("PTCA: The Z-score is: ", Z.PTCA.diff, "with a two tailed P value of: ", pv.PTCA.diff)
## [1] "PTCA: The Z-score is:  2.28687908044094 with a two tailed P value of:  0.0222028819854056"

After analyzing all the numbers, it appears that all adverse outcomes are all statistically significant. Let’s take a look at see if look at the second question and study a subset of patients with a different risk factor.

  1. Does having a history of a previous MI put you at higher risk for adverse outcomes in 1 year from the stress test?

In medicine, patients with myocardial infarctions (heart attacks) are patients with significant comorbidities and high risk for death. The events leading to an MI are typically not sudden and usually is led by a multitude of risk factors such as poor dieting, lack of exercise, hypertension, diabetes, high cholesterol, smoking history, family history and so forth. A patient can have coronary artery disease (narrowing of the coronary arteries from athlerosclerosis) but not have a heart attack. But if the coronary artery disease gets particularly severe, one of the clots can “break off” and cause a heart attack, causing myocardial tissue to die and necrose. Once the heart tissue cells die, it puts a tremendous strain on the heart and typically leads to dysfunction later in life including congestive heart failure. Intuitively, patients who have had heart attacks should in theory be at higher risk for adverse outcomes, including death? But is there a statistical difference?

Many of the techniques above will be used again here to answer the second question. To avoid repetition from the 1st question, we will look at one particular part of the question and answer it. Does a history of MI lead to higher chances of developing another MI compared to a person who never had an MI prior?

# The subset of information is under `stress2`, `With.MI`, and `Without.MI`.
# Let's get some summary information for both groups
describe(With.MI)
##          vars   n mean   sd median trimmed mad min max range skew kurtosis
## Hx.MI       1 154 1.00 0.00      1    1.00   0   1   1     0  NaN      NaN
## New.MI      2 154 0.09 0.29      0    0.00   0   0   1     1 2.82     5.98
## New.PTCA    3 154 0.11 0.31      0    0.02   0   0   1     1 2.46     4.09
## New.CABG    4 154 0.08 0.27      0    0.00   0   0   1     1 3.12     7.78
## Death       5 154 0.06 0.24      0    0.00   0   0   1     1 3.73    11.98
##            se
## Hx.MI    0.00
## New.MI   0.02
## New.PTCA 0.03
## New.CABG 0.02
## Death    0.02
describe(Without.MI)
##          vars   n mean   sd median trimmed mad min max range skew kurtosis
## Hx.MI       1 404 0.00 0.00      0       0   0   0   0     0  NaN      NaN
## New.MI      2 404 0.03 0.18      0       0   0   0   1     1 5.07    23.76
## New.PTCA    3 404 0.02 0.16      0       0   0   0   1     1 6.09    35.24
## New.CABG    4 404 0.05 0.22      0       0   0   0   1     1 4.02    14.21
## Death       5 404 0.04 0.19      0       0   0   0   1     1 4.88    21.85
##            se
## Hx.MI    0.00
## New.MI   0.01
## New.PTCA 0.01
## New.CABG 0.01
## Death    0.01
# With the table() function, we can determine in each subset group the absolute number of MI's in the positive (1) group vs. the negative (0) group.
pos.MI.MI <- table(With.MI$New.MI)
neg.MI.MI <- table(Without.MI$New.MI)
pos.MI.MI
## 
##   0   1 
## 140  14
neg.MI.MI
## 
##   0   1 
## 390  14
paste("Percentage of patients who had another heart attack (MI) with a history of MI: ", round(sum(With.MI$New.MI)/length(With.MI$Hx.MI), 2) * 100, "%")
## [1] "Percentage of patients who had another heart attack (MI) with a history of MI:  9 %"
paste("Percentage of patients who had another heart attack (MI) without a history of MI: ", round(sum(Without.MI$New.MI)/length(Without.MI$Hx.MI), 2) * 100, "%")
## [1] "Percentage of patients who had another heart attack (MI) without a history of MI:  3 %"

It appears that there is a difference between the two with 9% of patients with history of MI had another MI, whereas 3% of patients without history of MI had another MI in the next 12 months. We will take a pool difference between the two and create a null and alternative hypothesis:

H0 - Null Hypothesis: There is no difference in adverse outcomes between patients with a history of MI and patients with no history of MI.

HA - Alternate Hypothesis: There IS a statistical difference in adverse outcomes between patients with a history of MI and no history of MI.

n <- length(stress2$Hx.MI)
pooled.p <- sum(stress2$New.MI)/length(stress2$Hx.MI)
# We will also be using the pooled proportion when computing the standard error

# Let us verify the conditions
# Is n*pooled.p => 10
paste("n*pooled.p = ", n*pooled.p, ", and is n*pooled.p >= 10: ", n*pooled.p >= 10)
## [1] "n*pooled.p =  28 , and is n*pooled.p >= 10:  TRUE"
# Is n*(1-pooled.p) => 10
paste("n*(1-pooled.p) = ", n*(1-pooled.p), ", and is n*(1-pooled.p) >= 10: ", n*(1-pooled.p) >= 10)
## [1] "n*(1-pooled.p) =  530 , and is n*(1-pooled.p) >= 10:  TRUE"
# We can safely apply the normal model.

# Calculating the standard error.

SE.p1 <- (pooled.p * (1 - pooled.p))/length(With.MI$Hx.MI)
SE.p2 <- (pooled.p * (1 - pooled.p))/length(Without.MI$Hx.MI)

SE.diff <- sqrt(SE.p1 + SE.p2)
# SE is 0.02067516

# Now let's calculate a Z score
Z.diff <- (pooled.p - 0)/SE.diff
pv.diff <- 2 * pnorm(Z.diff, lower.tail = FALSE)
paste("New MIs: The Z-score is: ", Z.diff, "with a two tailed P value of: ", pv.diff)
## [1] "New MIs: The Z-score is:  2.4270293282073 with a two tailed P value of:  0.0152230251851025"

The two tailed P value is < 0.05, making this statistically significant.

  1. Are older patients more likely to have an adverse outcome in 1 year from the stress test compared to younger patients? (We will only examine death for this question.)

According to the HEART trial, age plays a factor in calculating a risk score for the patient. Essentially, the older the patient, the more likely the patient will sustain an adverse outcome.

describe(Age.45)
##          vars  n  mean   sd median trimmed  mad min max range  skew
## Age         1 22 36.45 5.41     38   36.72 5.93  26  44    18 -0.43
## New.MI      2 22  0.00 0.00      0    0.00 0.00   0   0     0   NaN
## New.PTCA    3 22  0.00 0.00      0    0.00 0.00   0   0     0   NaN
## New.CABG    4 22  0.05 0.21      0    0.00 0.00   0   1     1  4.07
## Death       5 22  0.00 0.00      0    0.00 0.00   0   0     0   NaN
##          kurtosis   se
## Age         -1.20 1.15
## New.MI        NaN 0.00
## New.PTCA      NaN 0.00
## New.CABG    15.27 0.05
## Death         NaN 0.00
describe(Age.45.65)
##          vars   n  mean   sd median trimmed  mad min max range  skew
## Age         1 193 57.47 5.83     59   57.87 7.41  45  65    20 -0.48
## New.MI      2 193  0.04 0.20      0    0.00 0.00   0   1     1  4.57
## New.PTCA    3 193  0.04 0.20      0    0.00 0.00   0   1     1  4.57
## New.CABG    4 193  0.03 0.17      0    0.00 0.00   0   1     1  5.36
## Death       5 193  0.03 0.16      0    0.00 0.00   0   1     1  5.92
##          kurtosis   se
## Age         -0.96 0.42
## New.MI      18.94 0.01
## New.PTCA    18.94 0.01
## New.CABG    26.89 0.01
## Death       33.25 0.01
describe(Age.65)
##          vars   n  mean   sd median trimmed  mad min max range skew
## Age         1 343 74.88 6.49     74   74.32 7.41  66  93    27 0.66
## New.MI      2 343  0.06 0.23      0    0.00 0.00   0   1     1 3.75
## New.PTCA    3 343  0.06 0.23      0    0.00 0.00   0   1     1 3.87
## New.CABG    4 343  0.08 0.27      0    0.00 0.00   0   1     1 3.19
## Death       5 343  0.06 0.23      0    0.00 0.00   0   1     1 3.87
##          kurtosis   se
## Age         -0.42 0.35
## New.MI      12.12 0.01
## New.PTCA    13.02 0.01
## New.CABG     8.21 0.01
## Death       13.02 0.01
paste("The total number of cases:", length(stress3$Age))
## [1] "The total number of cases: 558"
# Graphics describing the distribution of the age of the patients in this sample.
# There appears to be a small left skew to this data, but not likely to make an impact on the overall approach given that n = 558
ggplot(stress3, aes(x = Age)) + geom_dotplot(binwidth = 1) + labs(title = "Age of Patients")

qqnorm(stress3$Age)
qqline(stress3$Age)

# Let's draw an overlying distribution curve
hist(stress3$Age, probability = TRUE, main = "Histogram of Age", xlab = "Age")
x <- 10:110
y <- dnorm(x = x, mean = mean(stress3$Age), sd = sd(stress3$Age))
lines(x = x, y = y, col = "blue")

# Summary of the Ages
summary(stress3$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   60.00   69.00   67.34   75.00   93.00
paste("Mean age of 45 and younger group:", mean(Age.45$Age))
## [1] "Mean age of 45 and younger group: 36.4545454545455"
paste("Mean age between 45 and 65:", mean(Age.45.65$Age))
## [1] "Mean age between 45 and 65: 57.4663212435233"
paste("Mean age of 65 and older group:", mean(Age.65$Age))
## [1] "Mean age of 65 and older group: 74.8833819241982"
paste("Number of deaths in 45 and younger group:", sum(Age.45$Death))
## [1] "Number of deaths in 45 and younger group: 0"
paste("Number of deaths between 45 and 65 group:", sum(Age.45.65$Death))
## [1] "Number of deaths between 45 and 65 group: 5"
paste("Number of deaths in 65 and older group:", sum(Age.65$Death))
## [1] "Number of deaths in 65 and older group: 19"

Listed above is the summaries of the data itself. The mean age is 67.34 years old. The distribution appears more or less normal (with perhaps a slight left skew, though this can be ignored given the large sample size).

Let’s calculate a 95% confidence interval for the mean age of the group, using a Z score of 1.96.

# 95% Confidence Intervals
se <- sd(stress3$Age)/sqrt(length(stress3$Age))
lower <- mean(stress3$Age) - 1.96 * se
upper <- mean(stress3$Age) + 1.96 * se
c(lower,upper)
## [1] 66.34430 68.34387
# We are 95% certain that this range contains the actual true mean.

Also apparent from the data is that there are more (absolute number of) deaths as the person gets older. However is this a statistical trend? Let’s take a look at this by fitting a linear model where x = Age and y = Deaths. However, we must verify if a linear model is appropriate for this dataset.

m1 <- lm(Death ~ Age, data = stress3)
summary(m1)
## 
## Call:
## lm(formula = Death ~ Age, data = stress3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07116 -0.05141 -0.04373 -0.03385  0.96615 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0308887  0.0487905  -0.633    0.527
## Age          0.0010973  0.0007132   1.539    0.124
## 
## Residual standard error: 0.2028 on 556 degrees of freedom
## Multiple R-squared:  0.00424,    Adjusted R-squared:  0.002449 
## F-statistic: 2.367 on 1 and 556 DF,  p-value: 0.1245
plot(stress3$Death ~ stress3$Age)
abline(m1)

plot(m1$residuals ~ stress3$Age)
abline(h = 0, lty = 3)

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Looks like by age alone, the adjusted R-squared was 0.0025 and the p-value was 0.1245, which is not statistically significant. However, there are some serious issues here with this data. In order to appropriate apply the linear model, it must satisfy:

  1. Linearity (unclear if there is truly one)
  2. Nearly normal residuals (does not appear to have one)
  3. Constant variability (does not)
  4. Independent observations

Given the fact that there are some violations of the criteria, applying a linear model is not appropriate in this dataset.

ANOVA (analysis of variance) is a comparisons of means across many groups. This utilizes the F test statistic to detect statistical significance. We will attempt to use the ANOVA in this exercise.

H0: The mean outcome is the same across all groups. In statistical notation, μ1 = μ2 = · · · = μk where μi represents the mean of the outcome for observations in category i. HA: At least one mean is different.

Generally we must check three conditions on the data before performing ANOVA: • the observations are independent within and across groups, • the data within each group are nearly normal, and • the variability across the groups is about equal.

When these three conditions are met, we may perform an ANOVA to determine whether the data provide strong evidence against the null hypothesis that all the μi are equal.

# We must satisfy the three conditions to determine if the data is fit for ANOVA.
# 1. The observations are independent within and across groups. This is assumed true given that one person's age should not affect another person's age's outcome.
# 2. The data within each group are nearly normal.
# Let's determine the n size in each group

paste("Age 45 and younger:", length(Age.45$Age))
## [1] "Age 45 and younger: 22"
paste("Between ages 45 and 65:", length(Age.45.65$Age))
## [1] "Between ages 45 and 65: 193"
paste("Ages 65 and older:", length(Age.65$Age))
## [1] "Ages 65 and older: 343"
# Unfortunately, we cannot satisfy criteria number 2, so we cannot perform an ANOVA at this time.
# However, what we will attempt to do is a simulation. We will perform bootstrapping. 
# In this situation, we will 10k replacement samples of each categeory

set.seed(123)
samp.45 <- sample(Age.45$Death, size = 50, replace = TRUE)
for (i in 1:10000){
  x <- sample(Age.45$Death, size = 50, replace = TRUE)
  samp.45 <- c(samp.45, x)
}
paste("Number of Deaths for patients younger than 45 years old")
## [1] "Number of Deaths for patients younger than 45 years old"
table(samp.45)
## samp.45
##      0 
## 500050
samp.45.65 <- sample(Age.45.65$Death, size = 50, replace = TRUE)
for (i in 1:10000){
  x <- sample(Age.45.65$Death, size = 50, replace = TRUE)
  samp.45.65 <- c(samp.45.65, x)
}
paste("Number of Deaths for patients between the ages of 45 and 65")
## [1] "Number of Deaths for patients between the ages of 45 and 65"
table(samp.45.65)
## samp.45.65
##      0      1 
## 487140  12910
samp.65 <- sample(Age.65$Death, size = 50, replace = TRUE)
for (i in 1:10000){
  x <- sample(Age.65$Death, size = 50, replace = TRUE)
  samp.65 <- c(samp.65, x)
}
paste("Number of Deaths for patients older than 65")
## [1] "Number of Deaths for patients older than 65"
table(samp.65)
## samp.65
##      0      1 
## 472516  27534
# Now we have satisfied criteria 2.
# 3. the variability across the groups is about equal.

describe(samp.45)
##    vars      n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 500050    0  0      0       0   0   0   0     0  NaN      NaN  0
describe(samp.45.65)
##    vars      n mean   sd median trimmed mad min max range skew kurtosis se
## X1    1 500050 0.03 0.16      0       0   0   0   1     1 5.98    33.76  0
describe(samp.65)
##    vars      n mean   sd median trimmed mad min max range skew kurtosis se
## X1    1 500050 0.06 0.23      0       0   0   0   1     1  3.9    13.22  0
# This appears to be satisfed as well.
# Now let's run the ANOVA

Group.45 <- "45 and Under"
Group.45.65 <- "45 to 65"
Group.65 <- "65 and Older"

death.45 <- data.frame(Category = rep(Group.45, length(samp.45)), Death = samp.45)
death.45.65 <- data.frame(Category = rep(Group.45.65, length(samp.45.65)), Death = samp.45.65)
death.65 <- data.frame(Category = rep(Group.65, length(samp.65)), Death = samp.65)
comb.death.data <- rbind(death.45, death.45.65, death.65)

# Source: https://www.r-bloggers.com/one-way-analysis-of-variance-anova/
m.death <- lm(Death ~ Category, data = comb.death.data)
summary(m.death)
## 
## Call:
## lm(formula = Death ~ Category, data = comb.death.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05506 -0.05506 -0.02582  0.00000  0.97418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.171e-13  2.268e-04    0.00        1    
## Category45 to 65     2.582e-02  3.208e-04   80.48   <2e-16 ***
## Category65 and Older 5.506e-02  3.208e-04  171.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1604 on 1500147 degrees of freedom
## Multiple R-squared:  0.01929,    Adjusted R-squared:  0.01929 
## F-statistic: 1.475e+04 on 2 and 1500147 DF,  p-value: < 2.2e-16
anova(m.death) # ANOVA
## Analysis of Variance Table
## 
## Response: Death
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## Category        2    759  379.51   14751 < 2.2e-16 ***
## Residuals 1500147  38595    0.03                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(m.death) # 95% Confidence Intervals
##                              2.5 %       97.5 %
## (Intercept)          -0.0004445679 0.0004445679
## Category45 to 65      0.0251887043 0.0264461322
## Category65 and Older  0.0544337798 0.0556912077

This table confirms that there are statistical differences between the age groups.

We can also resort to performing a pooled proportion difference (again) and see if we can determine a difference between “Between ages 45 and 65” and “Greater than 65”. We will exclude the Age 45 and younger because of a small sample size of n.

#Combine these two data sets to calculate the pooled.p and standard error 
age.data <- rbind(Age.45.65, Age.65)

n <- length(age.data$Age)
pooled.p <- sum(age.data$Death)/n
# We will also be using the pooled proportion when computing the standard error

# Let us verify the conditions
# Is n*pooled.p => 10
paste("n*pooled.p = ", n*pooled.p, ", and is n*pooled.p >= 10: ", n*pooled.p >= 10)
## [1] "n*pooled.p =  24 , and is n*pooled.p >= 10:  TRUE"
# Is n*(1-pooled.p) => 10
paste("n*(1-pooled.p) = ", n*(1-pooled.p), ", and is n*(1-pooled.p) >= 10: ", n*(1-pooled.p) >= 10)
## [1] "n*(1-pooled.p) =  512 , and is n*(1-pooled.p) >= 10:  TRUE"
# We can safely apply the normal model.

# Calculating the standard error.

SE.p1 <- (pooled.p * (1 - pooled.p))/length(Age.45.65$Age)
SE.p2 <- (pooled.p * (1 - pooled.p))/length(Age.65$Age)

SE.diff <- sqrt(SE.p1 + SE.p2)
# SE is 0.01860941

# Now let's calculate a Z score
Z.diff <- (pooled.p - 0)/SE.diff
pv.diff <- 2 * pnorm(Z.diff, lower.tail = FALSE)
paste("New Deaths: The Z-score is: ", Z.diff, "with a two tailed P value of: ", pv.diff)
## [1] "New Deaths: The Z-score is:  2.40610158972447 with a two tailed P value of:  0.0161237802865838"

So there are statistical significance with a p value < 0.05, thus allowing us to reject the null hypothesis.

Though this question was not directly asked in the project proposal, I wanted to provide an opportunity to demonstrate different statistical tools to explore “Inference for Numerical Data”. The particular example I want to focus is on is Age vs. Base Blood Pressure, Maximum HR. Again, the presumption is that with increasing age, blood pressure and maximum heart rate typically increases as well. Let’s explore this further.

# from the stress.echo data frame. 
plot(stress.echo$basebp ~ stress.echo$age, main = "Age vs. Base Blood Pressure", xlab = "Age", ylab = "Base Blood Pressure")

# It is unclear if there potential linear component on visual inspection. Let's see if one exists.
cor(stress.echo$sbp, stress.echo$age)
## [1] -0.07265941
# Maximum HR
plot(stress.echo$maxhr ~ stress.echo$age, main = "Age vs. Maximum HR", xlab = "Age", ylab = "HR")

cor(stress.echo$maxhr, stress.echo$age)
## [1] -0.1142203
# As expected, there is some negative correlation with increasing age and HR. (As the body ages, it's ability to mount a very elevated HR becomes more difficult for the body to do.)

# describe()
describe(stress.echo$basebp)
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 558 135.32 20.77    133  134.43 19.27  85 203   118 0.41     0.03
##      se
## X1 0.88
describe(stress.echo$maxhr)
##    vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 558 119.37 21.91    120  119.58 20.76  58 200   142 -0.03     0.12
##      se
## X1 0.93
# boxplot()
par(mfrow=c(1,2))
boxplot(stress.echo$basebp, main = "Base Blood Pressure")
boxplot(stress.echo$maxhr, main = "Maximum HR")

# Calculating linear models
m.bbp <- lm(basebp ~ age, data = stress.echo)
summary(m.bbp)
## 
## Call:
## lm(formula = basebp ~ age, data = stress.echo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.248 -15.032  -2.138  13.090  68.637 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120.4151     4.9597  24.279  < 2e-16 ***
## age           0.2214     0.0725   3.054  0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.62 on 556 degrees of freedom
## Multiple R-squared:  0.0165, Adjusted R-squared:  0.01473 
## F-statistic: 9.325 on 1 and 556 DF,  p-value: 0.002368
plot(m.bbp$residuals ~ stress.echo$age)
abline(h = 0, lty = 3)
hist(m.bbp$residuals)

qqnorm(m.bbp$residuals)
qqline(m.bbp$residuals)

# There is statistical difference with age and base blood pressure with for each year as the patient ages, his baseline BP increase by 0.2214. The P value here is 0.00237. However, the adjusted R-squared is 0.01473 suggesting that this variable does not contribute much variance to the base BP. Also, it appears that the criteria for linear regression is satisfied (i.e. Is there linearity? Nearly normal residuals? constant variability? The answer to all of them is yes.)

# Now to maximum heart rate
m.mhr <- lm(maxhr ~ age, data = stress.echo)
summary(m.mhr)
## 
## Call:
## lm(formula = maxhr ~ age, data = stress.echo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.233 -14.085   0.521  13.317  80.767 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 133.3542     5.2404  25.448  < 2e-16 ***
## age          -0.2077     0.0766  -2.711  0.00692 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.78 on 556 degrees of freedom
## Multiple R-squared:  0.01305,    Adjusted R-squared:  0.01127 
## F-statistic:  7.35 on 1 and 556 DF,  p-value: 0.006915
plot(m.mhr$residuals ~ stress.echo$age)
abline(h = 0, lty = 3)

hist(m.mhr$residuals)
qqnorm(m.mhr$residuals)
qqline(m.mhr$residuals)

# The criteria are again satisfied. There also appears to be statistical significant with a decrease in maximum HR as the patient with a p value at 0.00692.

Taking a look at several of the factors in stress.echo, and creating a multiple logistic regression for the likelihood of an MI.

m_full <- lm(newMI ~ bhr + basebp + sbp + maxhr + age + gender + baseEF + hxofCig + hxofMI + 
               hxofPTCA + hxofCABG, data = stress.echo)
summary(m_full)
## 
## Call:
## lm(formula = newMI ~ bhr + basebp + sbp + maxhr + age + gender + 
##     baseEF + hxofCig + hxofMI + hxofPTCA + hxofCABG, data = stress.echo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21339 -0.07382 -0.03925 -0.01250  1.00988 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       -0.0333960  0.1226126  -0.272   0.7854  
## bhr                0.0002926  0.0007239   0.404   0.6863  
## basebp             0.0009139  0.0004954   1.845   0.0656 .
## sbp                0.0002432  0.0002810   0.866   0.3870  
## maxhr             -0.0001788  0.0004955  -0.361   0.7184  
## age                0.0005357  0.0008013   0.669   0.5041  
## gendermale        -0.0011692  0.0204851  -0.057   0.9545  
## baseEF            -0.0023625  0.0009914  -2.383   0.0175 *
## hxofCigmoderate    0.0010044  0.0274395   0.037   0.9708  
## hxofCignon-smoker  0.0178464  0.0236430   0.755   0.4507  
## hxofMI             0.0488466  0.0234916   2.079   0.0381 *
## hxofPTCA          -0.0483296  0.0365165  -1.324   0.1862  
## hxofCABG          -0.0030334  0.0276684  -0.110   0.9127  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2161 on 545 degrees of freedom
## Multiple R-squared:  0.04273,    Adjusted R-squared:  0.02166 
## F-statistic: 2.028 on 12 and 545 DF,  p-value: 0.02021

Judging from the information, it appears that the significant risk variables are baseline ejection fraction and history of MI as predictors for new MIs in the next 12 months.

# Let us eliminate all of the statistically insignificant variables and see if that improves the adjusted R-squared value.
m_adjusted <- lm(newMI ~ baseEF + hxofMI, data = stress.echo)
summary(m_adjusted)
## 
## Call:
## lm(formula = newMI ~ baseEF + hxofMI, data = stress.echo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15669 -0.06377 -0.03652 -0.02545  0.99667 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.1626090  0.0559284   2.907  0.00379 **
## baseEF      -0.0022122  0.0009488  -2.331  0.02009 * 
## hxofMI       0.0383197  0.0218893   1.751  0.08056 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2164 on 555 degrees of freedom
## Multiple R-squared:  0.02284,    Adjusted R-squared:  0.01932 
## F-statistic: 6.486 on 2 and 555 DF,  p-value: 0.001643
# It does not, so we will revert back to the full model

Conclusion - Cardiac Disease

This data confirms many of the suspicions for many physicians when they consider working up a cardiac patient. Patients who are typically older, with history of heart attacks, and positive stress tests, tend to have worse outcomes. As a result, these are the patients that should be considered more highly for admission to the hospital for further management. However, to attempt to identify how likelihood of an event occurring like MI (example: regression), this can be quite difficult (as seen with low adjusted R-squared values.) But this is not surprising given the complexity of a human body. Though, it should be noted that adverse events such as deaths were not common at all in this data set. This may be the fact that with advanced medical science, people are living longer despite all of the risk factors they maintain. However, we must also consider that the study may be underpowered given a few cases of new MIs, CABGs, PTCAs, and deaths and that further studies should be investigated with numbers that could potentially detect a statistical difference. This particular study appeared to be taken at one center at UCLA. Ideas for possible future research would be performing the same research but over multiple centers (i.e. 10 centers). Not only will this increase our sample size but allow the researchers to truly generalize the research to the general US population. With more information and more data, much can be done to improve the overall quality of human life.