HeartFailureAnalysis.utf8

Survival Analysis of Heart Failure Patients: Student Report

Alexandra Sciocchetti, MS Data Analytics student, Oregon State University

Abstract

The purpose of this analysis was the evaluate the effect of physical traits, relevant pre-existing conditions, heart ejection fraction (%), and select blood tests on mortality among heart failure patients. The original data was collected at the Institute of Cardiology and Allied hospital Faisalabad-Pakistan, and donated to the University of California Irvine Machine Learning Repository for academic use in 2020. Kaplan-Meier plots were used to estimate survival over time for each variable, and identify survival differences within groups (validated by non- weighted log-rank test). Blood Pressure, Age, Ejection Fraction, Serum Creatinine levels, and Serum Sodium levels were all found to be significant. A Cox Proportional Hazard Regression Model was fit to test variable significance in a multivariate model (validated by likelihood ratio tests). Martingale and Schoenfeld residual tests were used to verify the assumptions of the proportional hazard mode. Age, Ejection Fraction, and Serum Creatinine levels all significantly impact risk of mortality for heart failure patients in the Cox Proportional Hazard model.

Introduction

Heart failure is a medical condition in which the heart cannot pump enough blood to meet the body’s needs. This is either because the heart is unable to fill with enough blood, or the heart is unable to pump with enough force to expel oxygenated blood to all necessary areas of the body (Heart Failure, 2017). Both situations can occur simultaneously. Factors such as age, the presence of diabetes (often from obesity), and heart attacks have been identified as possible likely causes for heart failure, among others.

Heart failure rates are exceptionally high in South Asia, including Pakistan where this data originated. South Asian heart failure cases account for 60% of global cases, while South Asia houses 25% of global population (Martinez-Amezcua, 2020). Initial causes of Heart Failure in this region have been examined at length in other academic reports and will not be discussed further here. The heart failure clinical records data used for this analysis was the first study of its kind to examine death rates due to heart failure in Pakistan. Differences in health care, diet, and exercises in Pakistan versus Western countries make the data unique to that region (Ahmaud, 2017).

The heart failure clinical records data has been the subject of multiple analyses and publications. In a 2017 case study, researchers used both Cox Regression and Kaplan-Meier plots to conclude that renal dysfunction, blood pressure, ejection fraction and anemia were significant risk factors for mortality (Ahmaud, 2017). A 2020 paper used alternative feature ranking analysis and machine learning algorithms to determine that a two variable model with serum creatinine and ejection fraction could be used to accurately predict survival rates of heart failure patients (Chicco, 2020).

The aim of this report is to evaluate the impact of age, anemia, diabetes, creatinine phosphokinase, ejection fraction (%), high blood pressure, serum creatinine, serum sodium, sex, and smoking on heart failure patient mortality, using the Kaplan-Meier estimator and the Cox Proportional Hazard Model. This report is not intended to contradict or supersede any previous publications associated with the heart failure data, it is merely an academic exercise where I attempt to apply survival analysis tools and methods to better understand the implications of the heart failure data set.

Dataset

The heart failure clinical records data set contains a total of 299 observations, measured on 2 response variables, time and death_event with death-event indicating the presence of censoring.At the end of the 285 day study, 96 patients were dead, and the remaining 203 were censored (no observed death event occurred). Of the censored observations, only 1 was recorded as reaching the 285 day limit; the remaining 202 censored observations were recorded during the 285 days. The exact causes of censoring are unknown, other than that the individuals left the study for some reason before they died. The time variable signifies the time until either death occurred or an individual was censored. The death_event binary variable denotes the event with a 1 for death, and 0 for censoring. The average time until death or censoring was 130 days.

The data contains 6 binary and 5 continuous predictor variables, each of which has been briefly summarized to provide context to the results of the study. The minimum age for entry into the study was 40 years old, and the average age was 60 years. Heart failure in younger patients was not examined.65.8% of participants were male. Of the relevant pre-existing conditions measured, 35% participants had high blood pressure, 43% had anemia, and 41% had diabetes. 32% were smokers. The average heart ejection fraction among participants was 38%. Of select blood tests recorded, the average platelet count was 263,358, average serum sodium levels were 136 milliequivalents per liter (mEq/L), and average Serum Creatinine was 1.39 mc/dL. Due to the relatively normal distributions of all variables listed thus far, mean and median values were approximately equivalent, so only the mean is reported. For Creatinine Phosphokinase, the remaining continuous predictor, the presence of larger outlying values resulted in a median value of 250 mcg/L, and a mean value of 581 mcg/L. See Appendix A for univariate summaries of each variable.

For reference, and for those of us who are not medical professionals, the heart ejection fraction in the average healthy adult (male or female) is between 55% and 70% (Cleveland Clinic), normal platelet counts per microliter of blood are between 150,000 and 450,000 (Hopkins Medical), normal Serum Sodium levels are between 135 to 145 mEq/L (UCSF Health), and normal creatinine phosphokinase levels are between 10 and 120 mcg/L (UCSF Health). Normal results for serum creatinine are between 0.7 and 1.3 mc/dL for men, and 0.6 to 1.1 mg/dL for women (UCSF Health).

No outliers were removed from the data, and a correlation analysis of all response variables indicated no incidence of strong correlation (Figure 1). For this analysis, strong correlation is defined as any positive correlation greater than 0.5 and any negative correlation less than -0.5. The highest correlation in the data, between the sex and smoking variables, is 0.44.

Figure 1. Correlation Analysis of Response Variables. Low correlation values indicate that interactions between variables will not need to be included in regression models.

Methods

To better understand which analysis methods are available for this data, censored data is evaluated to determine the type of censoring, and whether there is any indication that the censoring was non-informative.The presence of censoring dictates the necessity for alternative analysis methods; logistic regression, multivariate regression models and other tactics generally recommended for normalized data, will lead to biased estimation and inference results because censored observations would have to be removed from the data in order to use these methods (Liu, 2012). All censored observations in the heart failure study are right censored, meaning that for each censored observation, the event (death) did not occur during the time that the individual was enrolled in the study, and that individual’s actual time of death (survival time) is unknown. Censoring time is shorter, or less than, the time to death/survival time because it occurs first (Liu, 2012). This can be written as

\[\begin{aligned} T^0 &> C_T\\ \text{where}\\ T^0 &= \text{survival time}\\ \small C_T &= \text{censoring time}\\ \end{aligned}\]

The data further fits the right censoring assumption because no death events were recorded before the study began. All participants were on alive on the first day of the study, day 1, and all were either censored or died during the course of 285 day observation period. The reasons that individuals left the study were not recorded, but for the purpose of this analysis I will assume that all censoring is non-informative. All univariate and multivariate analysis techniques in this paper must account for non-informative right censoring.

Univariate Analysis

The Kaplan-Meier estimator is one of the most widely used methods for estimating the survival functions of data with non-informative right censoring, and is implemented to conduct univariate analysis on effects of population groups within predictor variable, on survival time. Also known as the product-limit estimator, the Kaplan-Meier estimator is non-parametric; a distribution of survival time does not need to be specified. The Kaplan-Meier estimator is calculated as a point estimate for survival time at each recorded event time (each death, in this case) where the conditional probability of surviving at time \(t_i\) (Liu, 2012) and is written as

\(\begin{aligned} \hat{S}(t_i) &= 1 - \frac{d_i}{n_i}\\ \text{where}\\ \hat{S} &= \text{survival probability}\\ t_i &= \text{event time}\\ n_i &= \text{number of individuals alive (at risk) right before } t_i\\ d_i &= \text{number of deaths at } t_i\\ \end{aligned}\)

The cumulative Kaplan-Meier estimator for the survival function S(t) at all recorded event times in the data is the product of all \(\hat{S}(t)\) point estimates (Liu, 2012):

\(\begin{aligned} S(t) &= \prod_{i=1}^{j}( 1- \frac{d_i}{n_i})\\ \text{where }\\ j &= \text{the total number of recorded events}\\ \end{aligned}\)

While the formula for the Kaplan-Meier estimator is only calculated at recorded event times, and not at censoring times, censoring does influence the result because once an individual is censored, that person is no longer included in the group of individuals still alive right before an event time \(t_i\). In Kaplan-Meier plots, the survival curve will only change at event times, and censoring times are denoted by tick marks across the survival curve.

To determine differences in survival time within each predictor variable in the data, variables are split into population groups. Binary variables are relatively easy to split, as they are already grouped into two populations, those with a 0 and those with a 1. The survival function of each group is calculated using the Kaplan-Meier estimator. Continuous variables are grouped into four populations, each representing one quartile of the data. Kaplan-Meier plots are used to visualize differences in survival functions between variable populations, and will also be used to identify the presence of any non-proportionality in survival curves.

Non-weighted log-rank tests are used to determine statistically significant differences between Kaplan-Meier survival functions. Log-rank tests are non-parametric and do not rely on actual event times, but rather the order or rank in which the events occur. The first death to occur in the study, the death with the shortest time \(t\) will be ranked 1, followed by the second death to occur, etc. Note that log-rank tests are not used in a multivariate setting, they are used to compare differences in specified populations which in this analysis are either binary or quartile, within each variable. When using long-rank test to compare two survival functions, the null hypothesis is that the survival function of the first group is equal to the survival function of the second group. The alternative hypothesis is that the survival functions are statistically not equal. This will be labeled as the two sample log-rank test.

\(\begin{aligned} H_0: S(t_1) &= S(t_2)\\ H_A: S(t_1) &\neq S(t_2) \end{aligned}\)

The null hypothesis for log rank tests of more than two survival curves is that every survival curve in the test is equal. The alternative hypothesis is that at least one of the survival curves is different. The log-rank test does not specify which curve is the different one. Hypotheses for continuous variables, grouped into four quartiles, have four survival survival functions to test, and will be labeled as the four sample log-rank test (LaMorte, 2016)

\(\begin{aligned} H_0 &: S(t_1) = S(t_2) = S_(t_3) = S(t_4)\\ H_A &: \text{At least one survival function } S(t_i) \text{is not equal to the other survival functions} \end{aligned}\)

To calculate the log rank test, the number of observed events is the sum of all events recorded at each time \(t\) (LaMorte, 2016). The number of expected events is defined as

\(\begin{aligned} E_{jt} &= N_{jt}*\frac{O_t}{N_t}\\ \text{where}\\ E_{jt} &= \text{Expected number of events at time t in sample j}\\ N_{jt} &= \text{Number of people at risk right before time t in sample j}\\ O_t &= \text{Number of Observed events across all samples}\\ N_t &= \text{Number of people at risk right before time t across all samples}\\ \end{aligned}\)

The log rank test statistic is calculated as a summation of the difference between the number observed events and expected events at each time interval, for each sample (LaMorte, 2016).

\(\begin{aligned} X^2 &= \sum \frac{(\sum O_{jt} - \sum E_{jt})^2}{\sum E_{jt}}\\ \text{where}\\ X^2 &= \text{the log-rank test statistic on a chi-square distribution}\\ O_{jt} &= \text{the number of observed events for the jth group over time}\\ E_{jt} &= \text{the number of expected events for the jth group over time}\\ \end{aligned}\)

The log-rank test statistic follows a chi-square distribution on k-1 degrees of freedom, where k is the number of groups. A two sample log-rank test will have 1 degree of freedom (k=2), and a four sample log-rank test will have 3 degrees of freedom (k=4). The significance, or p-value of the test can be found in a table of Critical Values or through calculation in R using the pchisq() function. For this analysis, a significance level of \(\alpha\) = 0.05 is used to determine test significance. P-values greater than 0.05 fail to reject the null hypothesis \(H_o\), and p-values less than 0.05 reject the null hypothesis in favor of the alternative hypothesis. In this report, p-values calculated with log rank test are displayed on the Kaplan-Meier plots for each significant variable in the Results section, and for all other variables in Appendix A. Log-rank test significance will help determine which variables should be included in a multivariate model, and indicate which variables are significant to survival time in a univariate setting.

Log-rank tests were non-weighted because there is no evidence in the data description that survival differences are more meaningful if they occur closer to the beginning of the study, or closer to the end.

Multivariate Analysis

Cox Proportional Hazard model is used to determine the simultaneous effects of multiple variables on the risk of death for heart failure patients at a particular time, \(t\), while relying on the assumption that relative risk of death does not change over time (proportional hazard assumption) (Liu, 2012). The Cox model evaluates the risk of death, also known as the hazard function, and an extension of the formula can be used to estimate the survival probability of new test subjects. The Cox model was chosen because it is widely used when working with clinical data due to its applicability to a wide variety of studies (Singh, 2011), and is generally more appealing than Accelerated Failure Time models, which are relatively ineffective for non-parametric data. The basic formula for the Cox Proportional Hazard model can be written as

\(\begin{aligned} h(t|x) &= h_0exp(x\prime\beta)\\ \text{where}\\ h(t|x) &= \text{the predicted value of the hazard rate given }x\\ h_0 &= \text{an arbitrary and unspecified baseline hazard function}\\ x\prime &= \text{all specified risk factors, or variables}\\ \beta &= \text{variable coefficient values}\\ \end{aligned}\)

There are several advantages to the Cox Proportional Hazard model. Exponentiation of the regression coefficients ensure that the effects of the hazard function are multiplicative and always positive (Liu, 2012). The Cox model is non-parametric - a survival distribution does not need to be specified - and because the baseline hazard is separate from the risk factors and coefficients, risk factors and coefficient values can be quantified without solving for the baseline hazard value.

The Cox Proportional Hazard Model uses the partial likelihood function to estimate the variable coefficient values and account for censored data. Similar to the log rank test, the partial likelihood function is constructed using an ordered ranking of event times, rather than the actual time that the event occurred. Censored observations effect the final result because once an individual is censored, that person is no longer still at risk, even though an event did not occur. At each event, the partial likelihood function can be defined as the conditional probability that an individual had an event, given that an event occurred, divided by sum of all possible events that could have occurred at that time (Xue, 2017). If a person is still alive and still part of the study, then they are considered at risk, so the sum of all possible events is equivalent to the sum of all participants still part of the study. The complete partial likelihood function is the product of the partial likelihood at each event (Liu, 2012) and is used to estimate the coefficient value \(\beta\)

\(\begin{aligned} PL(\beta) &= \prod_{i=1}^{D}\frac{exp(x_i\prime\beta)}{\sum exp(x_l\prime\beta)}\\ \text{where}\\ PL(\beta) &= \text{the partial likelihood for \beta}\\ exp(x_i\prime\beta) &= \text{the risk of an event occurring at one event time for one individual}\\ exp(x_l\prime\beta) &= \text{the risk of an event occuring for every individual still at risk at one event time}\\ \end{aligned}\)

In this report, the partial likelihood function is automatically computed for each variable \(\beta\) when fitting the Cox Proportional Hazard model.

The Cox Proportional hazard model relies on the assumptions that the all continuous covariates have a linear form, and that relative risk does not change over time (proportional hazard assumption); Martingale and Schoenfeld residuals are used to check that these assumptions are met and changes are made if necessary. If these assumptions cannot be met, then the results from the Cox Proportional hazard model are invalid and misleading. Martingale residuals calculate the difference between the number of expected events and the number of actual events (Liu, 2012) and are defined as

\(\begin{aligned} r_{M,i} &= d_i - \hat{H}(t_i|x_i)\\ \text{where}\\ r_{M,i} &= \text{is the excess number of events}\\ d_i &= \text{the number of events for the }i\text{th subject}\\ \hat{H}(t_i|x_i) &= \text{the expected number of events for the }i\text{th subject}\\ \end{aligned}\)

Martingale residuals are used to check the functional forms of all continuous covariates, by fitting a null cox proportional hazard model with no predictor variables, and plotting the resultant linear predictor values and residuals values for each covariate against that null model. If the functional form of the covariate is linear, then the martingale residual line will be relatively flat with a slop close to 0, and no further action will be needed. If the function form of the covariate has distinct curvature, then one or multiple higher order exponential terms will need to be added to the model, depending on how many curves the plot displays. Once necessary higher order terms are determined, a full cox proportional hazard model is then fit with all linear form variables and the additional higher order terms. Any non-significant higher order terms that are non-significant in the full model are next removed. The Martingale residual plots are re-evaluated for linearity against the full model with significant higher order terms, and the process is repeated as necessary until the residual plots display no distinct curvature.

Once a final model has been selected and all variable selection is complete, Schoenfeld residuals are used to determine whether the hazard rate for each selected variable does not change over time, satisfying the proportional hazard assumption of the Cox Model. Schoenfeld residuals are the differences between the observed and expected values of each covariate at each event time (Xue, 2017), and can be defined as

\(\begin{aligned} r_{li} &= \sum_{D(t_i)} Z_{kl} - \bar{Z}_{il}\\ \text{where} r_{li} &= \text{Schoenfeld residual for the }i \text{th subject, variable }l\\ D(t_i) &= \text{the total number of events at time }ti\\ Z_{kl} &= \text{the total number of individuals at risk at time }ti\\ Z_{li} &= \text{the expected value of the }l \text{th covariate at time }ti\\ \end{aligned}\)

In principal, Schoenfeld residuals test for independence between each variable and time (Cox Model Assumptions, 2020). In this analysis, the cox.zph() function in R is used to report the test statistics and p-values of the Schoenfeld residuals both for individual variables and for the global model. If any variables are not independent of time, measured by a p-value less than 0.05, a variable:time interaction term is added to the model and the Schoenfeld residuals are re-checked (Cox Model Assumptions, 2020).

Likelihood ratio test is used to find the Cox Proportional Hazard model that is the best fit for the data. Likelihood ratio test is a goodness of fit test that compares complex models with more variables to simpler models with less variables, to determine which model makes the data recorded most likely using the maximum partial likelihood function (Liu, 2012). The likelihood ratio test uses the partial likelihood function already described for the Cox Model to calculate the partial likelihood ratio test statistic (Xue, 2017) which can be defined as

\(\begin{aligned} Q_{LR} &= 2*(log(PL_{full}) - log(PL_{reduced}))\\ \text{where}\\ Q_{LR} &= \text{the partial likelihood ratio test statistic}\\ PL_{full} &= \text{the partial likelihood test statistic of the cox model with more variables}\\ PL_{reduced} &= \text{the partial likelihood test statistic of the Cox model with fewer variables}\\ \end{aligned}\)

The likelihood ratio test follows a chi-square distribution where the degrees of freedom are equal to \(\begin{aligned} df &= p-k\\ \text{where}\\ df &= \text{degrees of freedom}\\ p &= \text{number of parameters in the full model}\\ k &= \text{number of paramters in the reduced model}\\ \end{aligned}\)

The null hypothesis, \(H_0\) for a likelihood ratio test is always that the reduced model or simpler model equals the full, more complex model, and the reduced model should be used for analysis. The alternative hypothesis, \(H_A\) is that the reduced model and the full model are not equal and the reduced model should be rejected in favor of the full model. More specifically, the null hypothesis is rejected if the reduced model is not able to maximize the likelihood of the data being observed, as accurately as the full model (Xue, 2017). The null hypothesis is rejected if

\(\begin{aligned} Q_LR &> X_{p-k,\alpha}^2\\ \text{where}\\ X^2 &= \text{chi-square distribution}\\ p &= \text{number of parameters in the full model}\\ k &= \text{number of parameters in the reduced model}\\ \alpha &= \text{significance level}\\ \end{aligned}\)

For this analysis, the significance level = 0.05 is used. As non-significant variables are removed from the model, the likelihood ratio test is used to determine whether variable removal makes the data less likely.

Results

Univariate Analysis

Kaplan-Meier plots of each variable with a significant log rank test result. Non-significant Kaplan-Meier plots can be found in Appendix A.

Figure 3. Kaplan-Meier survival curves for blood pressure binary variable (left), and age variable (right). The survival time for individuals with high blood pressure is significantly different from individuals with normal blood pressure. Survival time for individuals aged 70 appears significantly different than all other age groups. P-values for log rank tests do not specify which group(s) are significantly different, however plots can be used to discern visual differences.

Figure 4. Kaplan-Meier survival curves for ejection fraction percentage (left) and serum creatinine level (right). The survival time for individuals with ejection fraction percentages less than 30% appear significantly different from all other groups. The group with the highest ejection fraction percentage (45-80%) does not have the highest survival probability of the four groups over time. Serum creatinine, measured in mg/dL, appears to have multiple significant differences between groups. The group with the lowest levels has the highest survival percentage over time, and the group with the highest levels has the lowest survival percentage over time.

Figure 5. Kaplan-Meier survival curves of each Serum Sodium Level (mEq/L) group. The survival time for individuals with sodium levels less than 134 mEq/L appears significantly different, and less than all other groups.

Multivariate Analysis

Figure 6. Martingale Residual plots of the functional forms of each continuous variable against the null model. Curvature in the red residual line indicates non-linearity. All continuous covariates display non-linearity against the null model.

Table 1. Coefficient values, standard error, and p-values of the Cox Proportional Hazard model with added squared terms for continuous covariates that displayed non-linearity. The squared terms for age and ejection fraction are significant and will remain in the model. The non-significant squared terms are removed and the Martingale Residuals are tested. Values are reported to three decimal places.

Figure 7. Martingale Residual plots of the functional forms of each continuous variable against a full variable model with the additional squared terms for age and ejection fraction. All variables display approximately linear form.

Table 2. Model 1: Coefficient values, standard error, and p-values of the Cox Proportional Hazard model with all variables and additional squared terms for age and ejection fraction. Variables \(age^2\), \(ejection\: fraction\), \(ejection\: fraction^2\), \(serum\: creatinine\), and \(creatinine\: phosphokinase\) are significant. All non-significant variables are removed to fit the reduced model.

Table 3. Model 2: Coefficient values, standard error, and p-values of the first reduced Cox Proportional Hazard model. Variables \(age^2\), \(ejection\: fraction\), \(ejection\: fraction^2\), \(serum\: creatinine\) are significant. \(creatinine\: phosphokinase\) is not significant in this model and is removed to fit a third model.

Table 4. Model 3: Coefficient values, standard error, and p-values of the second reduced Cox Proportional Hazard model. All four variables are significant. Raw model output can be found in Appendix B.

Table 5. Likelihood ratio test results. P-values greater than 0.05 indicate that (1) fail to reject Model 2 over Model 1, making Model 2 a preferred model (2) fail to reject Model 3 over Model 1 making Model 3 a preferred model (3) fail to reject model Model 3 over Model 2. Model 3 is the best fit model for the data.

Table 6. Schoenfeld Residuals Test Results for Model 3. The test is not significant for any of the covariates in the model, or for the global model. Proof that the proportional hazard assumption is met for Model 3 and results are valid.

Model Interpretation and Discussion

Univariate analysis of all predictor variables indicated that blood pressure, age, ejection fraction, serum creatinine, and serum sodium each significantly effected the survival of heart failure patients. Lifestyle related variables, smoking and diabetes were not significant, perhaps indicating that while lifestyle may effect a person’s chances of experiencing heart failure (Fagard, 2009), once that heart failure has occurred, smoking and diabetes do not actually effect the remaining time that a person has to live. Sex was also not significant, indicating no difference between males and females. These findings coincide with the findings of the multivariate analysis: Kaplan-Meier functions for age, ejection fraction, and serum creatinine has the smallest p-values of all the variables tested. As will be discussed, the age, ejection fraction, and serum creatinine variables are all present in the final cox proportional hazard model, which will be used for further interpretation.

Multivariate analysis found that age, ejection fraction, and serum creatinine all had a simultaneous effect on risk of mortality for heart failure patients. The final Cox Proportional Hazard model with these three variables and necessary squared terms can be written as

\(\begin{aligned} h(t)&=h_0(t)exp(\beta_1Age^2+ \beta_2ejection\:fraction + \beta_3ejection\:fraction^2+ \beta_4serum\:creatinine)\\ \text{where}\\ t&= \text{duration of the study, terminated by either death or censoring} \end{aligned}\)

The model includes two squared terms, one for age, and one for ejection fraction, which makes interpretation somewhat more complex; a one unit increase of each variable value will not result in a linear increase in the risk of death. To interpret age, I use the average age of a participant in the study: 60 years old. A one year increase in age would increase the risk of death by 4.52%. Thus a 60 year old has a risk of death that is 4.52% lower than a 61 year old. The power of the squared term becomes evident when examining the increase in risk between a 60 year old and a 70 year old participant. Based on this model, the 70 year old has a risk of death that is 49.12% higher than that of the 60 year old, based on age alone. Note that because the age variable is squared, comparing the risk of mortality between a 60 year old and a 70 year old is not the same as the risk between a 50 year old and a 60 year old because the year over year increases in risk are not linear.

Interpretation of the ejection fraction covariates is also more complex; the linear term by the effect of a one unit increase, however the squared term is also present and need to be added to the final value. The average ejection fraction for patients in the study was 38%. A decrease in ejection fraction percentage, from 38% to 37% would result in a 0.18% increase in the risk of death. The reason the difference is so small is because the linear coefficient for _2 is negative, while the squared coefficient for _3 is positive. A decrease in ejection fraction percentage, from 38% to 30% results in a 1.5% increase in risk of mortality. Perhaps these differences seem so slight because I am only looking at individuals with ejection fraction values far below the range that is considered “normal” for a healthy adult. An individual with a heart ejection percentage of 60% (which is within normal healthy range), is 58% less likely to die, than an individual with a heart ejection percentage of 38%. Again, note that because of the squared term, these calculations are specific to each example.

In contrast, serum creatinine is much simpler to interpret and the interpretation can be generalized to a one unit increase or decrease serum creatinine is a linear term in the model. A one unit increase in serum creatinine levels will result in a 30.54% increase in risk of death. Now because serum creatinine levels for all participants ranged from 0.5, to 1.9mg/dL, a one 1mg/dL increase is a significant amount; it might be more reasonable to interpret serum creatinine in 0.1mg/dL changes, which seem more likely given the range of values. A 0.1md/dL increase in serum creatinine results in a 3.05% increase in risk of mortality.

The coefficient values for the cox proportional hazard model are interpreted individually rather than as a single value representing the risk of mortality as generated by the model. This is because the risk of mortality is dependent on the measurements of the individual. Age, ejection fraction percentage, and serum creatinine levels all vary from subject to subject and rather than make generalizations such as risk of mortality increases as age and serum creatinine levels increase and ejection fraction percentage drops, interpreting each value individually provides allows more specificity.

Compared with the univariate analysis, the cox proportional hazard model contained two fewer variables. While measurements for blood pressure and serum sodium were significant in the Kaplan-Meier plots, they were not significant in the cox proportional hazard model. It’s possible that the method in which the continuous variables were split into quartiles for the Kaplan-Meier plots contributed to the discrepancy for the serum sodium variable. Have the Cox Proportional Hazard model also used a mechanism to split the serum sodium variable into multiple dummy variables, it is possible that one of those dummy variables would have been significant in the final model, even though serum sodium was not significant as a continuous variable. For the blood pressure measurement, which was already a discrete variable, it is less clear why it was not significant in the Cox Model. The addition of the squared variable terms to the Cox Model, necessary to satisfy the linear form assumption, could have had in impact on the significance of other variables in the model. Overall, the Cox Proportional Hazard Model provides a much more comprehensive approach to the analysis, by investigating the effects of all variables simultaneously, and the result of that model should be given preference in final conclusions.

Conclusion

The purpose of this analysis was to use the current available heart failure data to develop multiple models (Kaplan-Meier and Cox Proportional Hazard) to understand the effects of measured variables on mortality rate of heart failure patients. Under the Kaplan-Meier estimator, increased age, decreased ejection fraction percentage, high blood pressure, high serum creatinine levels, and low serum sodium levels were all identified as factors contributing to a decrease in survival time and an increased risk of mortality. Similarly, under the Cox Proportional Hazard Model, increasing age, decreasing ejection fraction, and increasing serum creatinine levels all contributed to an increased risk of mortality. The next possible step would be to verify the effectiveness of the Cox Proportional Hazard Model in predicting risk with new data. For instance, given a new set of patient observations, what is that patient’s risk of mortality? The predict function in R can be used in conjunction with the Cox Proportional Hazard model to predict the survival curves of new patients, based on the variable values of those patients. When conducting analysis on the existing data, it would also be possible to set aside a randomly selected portion of the data for prediction. This particular data set originated from a patient study in Pakistan, and it is unclear if the results can be extrapolated to heart failure patients in other countries, or if other factors such as the quality and availability of health care resources in Pakistan make the data unique to that regon. More data is needed to determine whether the identified risk factors would be appropriate for new data in other regions.

References:

Ahmad T, Munir A, Bhatti SH, Aftab M, Raza MA (2017) Survival analysis of heart failure patients: A case study. PLoS ONE 12(7): e0181001. https://doi.org/10.1371/journal.pone.0181001. (Accessed: 7th December 2020)

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5. (Accessed: 5th December 2020)

Cox Model Assumptions. Available at: http://www.sthda.com/english/wiki/cox-model-assumptions. (Accessed: 10th December 2020)

Fagard, R. H. Smoking amplifies cardiovascular risk in patients with hypertension and diabetes. (2009). Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2811439/. (Accessed: 10th December 2020)

Heart Failure. National Heart Lung and Blood Institute (2017). Available at: https://www.nhlbi.nih.gov/health-topics/heart-failure. (Accessed: 11th December 2020)

Heart Failure: Types, Symptoms, Causes & Treatments. Cleveland Clinic Available at: https://my.clevelandclinic.org/health/diseases/17069-heart-failure-understanding-heart-failure. (Accessed: 9th December 2020)

LaMorte, W. Comparing Survival Curves. (2016). Available at: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_survival/BS704_Survival5.html. (Accessed: 5th December 2020)

Liu, X. (2012). Survival analysis : Models and applications. ProQuest Ebook Central https://ebookcentral.proquest.com

Martinez-Amezcua, P. et al. The Upcoming Epidemic of Heart Failure in South Asia. Circulation: Heart Failure (2020). Available at: https://www.ahajournals.org/doi/10.1161/CIRCHEARTFAILURE.120.007218. (Accessed: 11th December 2020)

Medical Tests. ucsfhealth.org Available at: https://www.ucsfhealth.org/medical-tests. (Accessed: 7th December 2020)

Singh, R. & Mukhopadhyay, K. Survival analysis in clinical trials: Basics and must know areas. (2011). Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227332/. (Accessed: 13th December 2020)

What are Platelets and Why are They Important? Available at: https://www.hopkinsmedicine.org/health/conditions-and-diseases/what-are-platelets-and-why-are-they-important. (Accessed: 10th December 2020)

Xue, L. (2017). Deviance and Schoenfeld Residuals [Video lecture]. Retrieved from https://media.oregonstate.edu/media/t/0_yd5t34of (class access required).

Appendix A:

Exploratory Analysis, Correlation Coefficient Plot, Kaplan-Meier plots with log rank test.

#load relevant packages
library(survminer)
library(survival)
library(ggplot2)
library(tidyverse)
library(My.stepwise)
library(corrr)
library(gtools)

#import data
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")

#variable summary. Output not shown
variablesummary <- lapply(df[,1:10], table)

#correlation plot
res.cor <- correlate(df[,1:11])
heart_cor <- res.cor %>% stretch()
heart_cor[is.na(heart_cor)]<-0

p <- heart_cor %>%
  ggplot(aes(x, y, col=r)) + 
  geom_tile(col="white", fill="white") +
  geom_point(aes(size = abs(r)), shape=16) +
  labs(x = "", y = "", col = "Correlation \nCoefficient", title="") +
  scale_color_gradient2(low="#2903e8",high="#c7000d",mid="#fbfef9", limits=c(-1,1))+
  scale_x_discrete(expand = c(0,0),labels=c("age", "anemia", "creatinine phosphokinase", "diabetes", "ejection fraction", "blood pressure", "platelets", "serum creatinine", "serum sodium", "sex", "smoking")) +
  scale_y_discrete(expand = c(0,0), labels = c("age", "anemia", "creatinine phosphokinase", "diabetes", "ejection fraction", "blood pressure", "platelets", "serum creatinine", "serum sodium", "sex", "smoking")) +
  scale_size(range=c(1,8), guide = NULL)+
  theme(axis.line = element_line(colour = "black", 
                      size = 1, linetype = "solid"),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  ggsave("correlationplot.png", height = 6, width = 6)

#Inspiration from https://www.khstats.com/blog/corr-plots/corr-plots/

#divide continuous variables into quartiles
df$agequant <- quantcut(df$age,4)
df$creatinepquant <- quantcut(df$creatinine_phosphokinase,4)
df$efquant <- quantcut(df$ejection_fraction,4)
df$plateletsquant <- quantcut(df$platelets, 4)
df$scquant <- quantcut(df$serum_creatinine, 4)
df$ssquant <- quantcut(df$serum_sodium,4)

#Kaplan-Meier estimators for significant variables. Plots found in Results section
KM_bloodpressure <- survfit(Surv(time, DEATH_EVENT)~high_blood_pressure, data = df)
KM_efquant <- survfit(Surv(time, DEATH_EVENT)~efquant, data = df)
KM_age <- survfit(Surv(time, DEATH_EVENT)~agequant, data = df)
KM_scquant <- survfit(Surv(time, DEATH_EVENT)~scquant, data = df)
KM_ssquant <- survfit(Surv(time, DEATH_EVENT)~ssquant, data = df)

KMplots1<- ggsurvplot(fit = KM_bloodpressure, data = df, conf.int = F, title = "", ylab = "Survival Probability", xlab = "Days", legend.labs = c("normal blood pressure", "high blood pressure"), pval = T, risk.table = F, palette = c("#2903e8", "#c7000d"), ggtheme = theme_bw(),legend.title = "")

KMplots2<- ggsurvplot(fit = KM_age, data = df, conf.int = F, title = "", ylab = "Survival Probability", xlab = "Days", pval = T,legend.labs = c("40-51", "51-60", "60-70", "70-95"), legend.title = "", palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw())

KMplots3 <- ggsurvplot(fit = KM_efquant, data = df, conf.int = F, title = "", ylab = "Survival Probability", xlab = "Days", pval = T, legend.labs = c("14-30%", "30-38%", "38-45%", "45-80%"), legend.title ="", palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw())

KMplots4 <- ggsurvplot(fit = KM_scquant, data = df, conf.int = F, title = "", ylab = "Survival Probability", xlab = "Days", pval = T, legend.labs = c("0.5-0.9", "0.9-1.1", "1.1-1.4", "1.4-1.9"), legend.title = "", palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw())

KMplots5 <- ggsurvplot(fit = KM_ssquant, data = df, conf.int = F, title = "", ylab = "Survival Probability", xlab = "Days", pval = T, legend.labs = c("113-134", "134-137", "137-140", "140-148"), legend.title ="", palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw())

#Kaplain-Meier estimators and plots for non-significant variables
KM_anaemia <- survfit(Surv(time, DEATH_EVENT)~anaemia, data = df)
ggsurvplot(fit = KM_anaemia, data = df, conf.int = F, title = "Anaemia", ylab = "Survival Probability", xlab = "Days", legend.labs = c("no anemia", "anemia"), pval = T, palette = c("#2903e8", "#c7000d"), ggtheme = theme_bw(),legend.title = "")

KM_smoking <- survfit(Surv(time, DEATH_EVENT)~smoking, data = df)
ggsurvplot(fit = KM_smoking, data = df, conf.int = F, title = "Smoking", ylab = "Survival Probability", xlab = "Days", legend.labs = c("non-smoker", "smoker"), pval = T, palette = c("#2903e8", "#c7000d"), ggtheme = theme_bw(),legend.title = "")

KM_diabetes <- survfit(Surv(time, DEATH_EVENT)~diabetes, data = df)
ggsurvplot(fit = KM_diabetes, data = df, conf.int = F, title = "Diabetes", ylab = "Survival Probability", xlab = "Days", legend.labs = c("no diabetes", "diabetes"), pval = T, palette = c("#2903e8", "#c7000d"), ggtheme = theme_bw(),legend.title = "")

KM_sex <- survfit(Surv(time, DEATH_EVENT)~sex, data = df)
ggsurvplot(fit = KM_sex, data = df, conf.int = F, title = "Sex", ylab = "Survival Probability", xlab = "Days", legend.labs = c("Female", "Male"), pval = T, palette = c("#2903e8", "#c7000d"), ggtheme = theme_bw(),legend.title = "")

KM_creatinep <- survfit(Surv(time, DEATH_EVENT)~creatinepquant, data = df)
ggsurvplot(fit = KM_creatinep, data = df, conf.int = F, title = "Creatinine Phosphokinase (mcg/L)", ylab = "Survival Probability", xlab = "Days", pval = T, palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw(), legend.labs = c("23-116", "116-250", "250-582", "582-7861"), legend.title = "")

KM_plateletsq <- survfit(Surv(time, DEATH_EVENT)~plateletsquant, data = df)
ggsurvplot(fit = KM_plateletsq, data = df, conf.int = F, title = "Platelet Count", ylab = "Survival Probability", xlab = "Days", pval = T, palette = c("#d702cc","#8702e0", "#2903e8", "#c7000d"), ggtheme = theme_bw(), legend.labs = c("2.5e4-2.1e5", "2.1e5-2.6e5", "2.6e5-3.0e5", "3.0e5-8.5e5"), legend.title = "")

Appendix B:

Martingale Residuals, Cox Proportional Hazard Models, Likelihood Ratio test, Schoenfeld Residuals

#define Smooth curve function used to plot martingale residuals (source: Xue, 2017)
smoothSEcurve <- function(yy, xx) {
xx.list <- min(xx) + ((0:100)/100)*(max(xx) - min(xx))
yy.xx <- predict(loess(yy ~ xx), se=T,
newdata=data.frame(xx=xx.list))
lines(yy.xx$fit ~ xx.list, lwd=2,col=2)
lines(yy.xx$fit -
qt(0.975, yy.xx$df)*yy.xx$se.fit ~ xx.list, lty=2,col=3)
lines(yy.xx$fit +
qt(0.975, yy.xx$df)*yy.xx$se.fit ~ xx.list, lty=2,col=3)
}

#fit null model
fit0 <- coxph(formula = Surv(time, DEATH_EVENT)~1, data = df)
fit0_mar<-residuals(fit0,type = 'martingale')

#plot martingale residuals using smoothSEcurve function
par(mfrow=c(2,2))
plot(df$age, fit0_mar, xlab = "units = years", ylab = "")
smoothSEcurve(fit0_mar, df$age)
title("age")

plot(df$creatinine_phosphokinase, fit0_mar, ylab = "", xlab = "units = mcg/L")
smoothSEcurve(fit0_mar, df$creatinine_phosphokinase)
title("creatinine phosphokinase")

plot(df$ejection_fraction, fit0_mar, ylab = "", xlab = "units = %")
smoothSEcurve(fit0_mar, df$ejection_fraction)
title("ejection fraction")

plot(df$platelets, fit0_mar, ylab = "", xlab="units = count")
smoothSEcurve(fit0_mar, df$platelets)
title("platelets")

par(mfrow = c(2,2))
plot(df$serum_creatinine, fit0_mar, ylab = "", xlab = "units = mc/dL")
smoothSEcurve(fit0_mar, df$serum_creatinine)
title("serum creatinine")

plot(df$serum_sodium, fit0_mar, ylab = "", xlab = "units = mEq/L")
smoothSEcurve(fit0_mar, df$serum_sodium)
title("serum sodium")

#create squared terms and add to the data set
df$age2 <- df$age^2; df$creatinine_phosphokinase2 <- df$creatinine_phosphokinase^2; df$ejection_fraction2 <- df$ejection_fraction^2
df$platelets2 <- df$platelets^2; df$serum_creatinine2 <- df$serum_creatinine^2; df$serum_sodium2 <- df$serum_sodium^2

#fit a Cox Model with all linear variables and additional squared terms
f1 <- coxph(Surv(time, DEATH_EVENT)~age + age2 + anaemia + diabetes + ejection_fraction + ejection_fraction2 +high_blood_pressure + platelets + platelets2 + serum_creatinine +serum_creatinine2+ serum_sodium +serum_sodium2 +sex + smoking + creatinine_phosphokinase + creatinine_phosphokinase2, data = df)
summary(f1)

## Call:
## coxph(formula = Surv(time, DEATH_EVENT) ~ age + age2 + anaemia + 
##     diabetes + ejection_fraction + ejection_fraction2 + high_blood_pressure + 
##     platelets + platelets2 + serum_creatinine + serum_creatinine2 + 
##     serum_sodium + serum_sodium2 + sex + smoking + creatinine_phosphokinase + 
##     creatinine_phosphokinase2, data = df)
## 
##   n= 299, number of events= 96 
## 
##                                 coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age                       -8.902e-02  9.148e-01  7.158e-02 -1.244 0.213681    
## age2                       1.059e-03  1.001e+00  5.384e-04  1.967 0.049192 *  
## anaemia                    4.492e-01  1.567e+00  2.253e-01  1.994 0.046167 *  
## diabetes                   1.974e-01  1.218e+00  2.349e-01  0.840 0.400670    
## ejection_fraction         -1.980e-01  8.204e-01  4.751e-02 -4.167 3.08e-05 ***
## ejection_fraction2         2.001e-03  1.002e+00  5.956e-04  3.361 0.000778 ***
## high_blood_pressure        4.076e-01  1.503e+00  2.209e-01  1.845 0.064996 .  
## platelets                 -1.231e-06  1.000e+00  3.871e-06 -0.318 0.750577    
## platelets2                 2.997e-13  1.000e+00  6.135e-12  0.049 0.961039    
## serum_creatinine           6.127e-01  1.845e+00  2.653e-01  2.310 0.020910 *  
## serum_creatinine2         -4.758e-02  9.535e-01  3.255e-02 -1.462 0.143851    
## serum_sodium              -4.117e-01  6.625e-01  5.474e-01 -0.752 0.451998    
## serum_sodium2              1.425e-03  1.001e+00  2.055e-03  0.693 0.488059    
## sex                       -2.695e-01  7.638e-01  2.605e-01 -1.034 0.300974    
## smoking                    2.343e-01  1.264e+00  2.622e-01  0.894 0.371549    
## creatinine_phosphokinase  -2.562e-04  9.997e-01  2.828e-04 -0.906 0.365047    
## creatinine_phosphokinase2  7.917e-08  1.000e+00  4.081e-08  1.940 0.052391 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                           exp(coef) exp(-coef) lower .95 upper .95
## age                          0.9148     1.0931    0.7951    1.0526
## age2                         1.0011     0.9989    1.0000    1.0021
## anaemia                      1.5671     0.6381    1.0077    2.4370
## diabetes                     1.2182     0.8209    0.7688    1.9305
## ejection_fraction            0.8204     1.2189    0.7474    0.9004
## ejection_fraction2           1.0020     0.9980    1.0008    1.0032
## high_blood_pressure          1.5032     0.6652    0.9750    2.3176
## platelets                    1.0000     1.0000    1.0000    1.0000
## platelets2                   1.0000     1.0000    1.0000    1.0000
## serum_creatinine             1.8454     0.5419    1.0972    3.1040
## serum_creatinine2            0.9535     1.0487    0.8946    1.0164
## serum_sodium                 0.6625     1.5094    0.2266    1.9372
## serum_sodium2                1.0014     0.9986    0.9974    1.0055
## sex                          0.7638     1.3093    0.4584    1.2727
## smoking                      1.2640     0.7911    0.7561    2.1133
## creatinine_phosphokinase     0.9997     1.0003    0.9992    1.0003
## creatinine_phosphokinase2    1.0000     1.0000    1.0000    1.0000
## 
## Concordance= 0.78  (se = 0.024 )
## Likelihood ratio test= 101.8  on 17 df,   p=4e-14
## Wald test            = 107.5  on 17 df,   p=4e-15
## Score (logrank) test = 132.2  on 17 df,   p=<2e-16

#fit a second cox model with all linear variables and additional higher order terms that are significant. 
f2 <- coxph(formula = Surv(time, DEATH_EVENT) ~age + age2 + anaemia + diabetes + ejection_fraction + ejection_fraction2 + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + creatinine_phosphokinase, data = df)

#re-test residuals against second model
f2_mar <- residuals(f2, type = "martingale")

par(mfrow=c(2,2))

plot(df$age, f2_mar, ylab = "", xlab = "units = years")
smoothSEcurve(f2_mar, df$age)
title("age")

plot(df$creatinine_phosphokinase, f2_mar, ylab = "", xlab = "units = mcg/L")
smoothSEcurve(f2_mar, df$creatinine_phosphokinase)
title("creatinine phosphokinase")

plot(df$ejection_fraction, f2_mar, ylab = "", xlab = "units = %")
smoothSEcurve(f2_mar, df$ejection_fraction)
title("ejection fraction")

plot(df$platelets, f2_mar, ylab = "", xlab = "units = count")
smoothSEcurve(f2_mar, df$platelets)
title("platelets")

par(mfrow=c(2,2))
plot(df$serum_creatinine, f2_mar, ylab = "", xlab = "units = mc/dL")
smoothSEcurve(f2_mar, df$serum_creatinine)
title("serum creatinine")

plot(df$serum_sodium, f2_mar, ylab = "", xlab = "units = mEq/L")
smoothSEcurve(f2_mar, df$serum_sodium)
title("serum sodium")

#Model 1 with all linear variables + squared terms for age and ejection fraction
f2 <- coxph(formula = Surv(time, DEATH_EVENT) ~age + age2 + anaemia + diabetes + ejection_fraction + ejection_fraction2 + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + creatinine_phosphokinase, data = df)
summary(f2)

## Call:
## coxph(formula = Surv(time, DEATH_EVENT) ~ age + age2 + anaemia + 
##     diabetes + ejection_fraction + ejection_fraction2 + high_blood_pressure + 
##     platelets + serum_creatinine + serum_sodium + sex + smoking + 
##     creatinine_phosphokinase, data = df)
## 
##   n= 299, number of events= 96 
## 
##                                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age                      -9.208e-02  9.120e-01  7.074e-02 -1.302  0.19305    
## age2                      1.090e-03  1.001e+00  5.323e-04  2.047  0.04067 *  
## anaemia                   4.142e-01  1.513e+00  2.212e-01  1.872  0.06115 .  
## diabetes                  2.082e-01  1.231e+00  2.292e-01  0.909  0.36360    
## ejection_fraction        -1.948e-01  8.230e-01  4.675e-02 -4.167 3.08e-05 ***
## ejection_fraction2        1.881e-03  1.002e+00  5.851e-04  3.215  0.00131 ** 
## high_blood_pressure       3.909e-01  1.478e+00  2.202e-01  1.776  0.07581 .  
## platelets                -8.137e-07  1.000e+00  1.154e-06 -0.705  0.48067    
## serum_creatinine          2.329e-01  1.262e+00  7.991e-02  2.915  0.00356 ** 
## serum_sodium             -4.138e-02  9.595e-01  2.384e-02 -1.736  0.08262 .  
## sex                      -2.296e-01  7.949e-01  2.585e-01 -0.888  0.37452    
## smoking                   1.818e-01  1.199e+00  2.585e-01  0.703  0.48188    
## creatinine_phosphokinase  2.090e-04  1.000e+00  9.783e-05  2.136  0.03268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                          exp(coef) exp(-coef) lower .95 upper .95
## age                         0.9120     1.0965    0.7940    1.0477
## age2                        1.0011     0.9989    1.0000    1.0021
## anaemia                     1.5132     0.6608    0.9808    2.3346
## diabetes                    1.2315     0.8120    0.7859    1.9297
## ejection_fraction           0.8230     1.2151    0.7509    0.9019
## ejection_fraction2          1.0019     0.9981    1.0007    1.0030
## high_blood_pressure         1.4784     0.6764    0.9602    2.2762
## platelets                   1.0000     1.0000    1.0000    1.0000
## serum_creatinine            1.2622     0.7922    1.0793    1.4763
## serum_sodium                0.9595     1.0422    0.9157    1.0054
## sex                         0.7949     1.2580    0.4789    1.3193
## smoking                     1.1994     0.8338    0.7226    1.9906
## creatinine_phosphokinase    1.0002     0.9998    1.0000    1.0004
## 
## Concordance= 0.763  (se = 0.025 )
## Likelihood ratio test= 95.05  on 13 df,   p=1e-14
## Wald test            = 105.3  on 13 df,   p=<2e-16
## Score (logrank) test = 121.9  on 13 df,   p=<2e-16

#Model 2 with only significant variables from Model 1
f3 <- coxph(Surv(time, DEATH_EVENT)~age2 + ejection_fraction + ejection_fraction2+ serum_creatinine + creatinine_phosphokinase, data = df)
summary(f3)

## Call:
## coxph(formula = Surv(time, DEATH_EVENT) ~ age2 + ejection_fraction + 
##     ejection_fraction2 + serum_creatinine + creatinine_phosphokinase, 
##     data = df)
## 
##   n= 299, number of events= 96 
## 
##                                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age2                      3.795e-04  1.000e+00  6.529e-05  5.814 6.11e-09 ***
## ejection_fraction        -2.090e-01  8.114e-01  4.531e-02 -4.614 3.96e-06 ***
## ejection_fraction2        2.055e-03  1.002e+00  5.646e-04  3.639 0.000274 ***
## serum_creatinine          2.691e-01  1.309e+00  7.400e-02  3.636 0.000277 ***
## creatinine_phosphokinase  1.607e-04  1.000e+00  9.717e-05  1.654 0.098146 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                          exp(coef) exp(-coef) lower .95 upper .95
## age2                        1.0004     0.9996    1.0003    1.0005
## ejection_fraction           0.8114     1.2325    0.7424    0.8867
## ejection_fraction2          1.0021     0.9979    1.0009    1.0032
## serum_creatinine            1.3088     0.7641    1.1321    1.5131
## creatinine_phosphokinase    1.0002     0.9998    1.0000    1.0004
## 
## Concordance= 0.74  (se = 0.028 )
## Likelihood ratio test= 81.94  on 5 df,   p=3e-16
## Wald test            = 88.46  on 5 df,   p=<2e-16
## Score (logrank) test = 98.19  on 5 df,   p=<2e-16

#model 3 with only significant variables from Model 2
f4 <- coxph(Surv(time, DEATH_EVENT)~age2 + ejection_fraction+ ejection_fraction2 + serum_creatinine, data = df)
summary(f4)

## Call:
## coxph(formula = Surv(time, DEATH_EVENT) ~ age2 + ejection_fraction + 
##     ejection_fraction2 + serum_creatinine, data = df)
## 
##   n= 299, number of events= 96 
## 
##                          coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age2                3.788e-04  1.000e+00  6.568e-05  5.767 8.08e-09 ***
## ejection_fraction  -2.031e-01  8.162e-01  4.494e-02 -4.519 6.23e-06 ***
## ejection_fraction2  1.987e-03  1.002e+00  5.602e-04  3.546 0.000390 ***
## serum_creatinine    2.665e-01  1.305e+00  7.368e-02  3.616 0.000299 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                    exp(coef) exp(-coef) lower .95 upper .95
## age2                  1.0004     0.9996    1.0003    1.0005
## ejection_fraction     0.8162     1.2252    0.7474    0.8914
## ejection_fraction2    1.0020     0.9980    1.0009    1.0031
## serum_creatinine      1.3053     0.7661    1.1298    1.5081
## 
## Concordance= 0.736  (se = 0.028 )
## Likelihood ratio test= 79.68  on 4 df,   p=<2e-16
## Wald test            = 86.8  on 4 df,   p=<2e-16
## Score (logrank) test = 96.07  on 4 df,   p=<2e-16

#Likelihood Ratio tests
#Model 1 vs. Model 2
loglik2 <- 2*(f2$loglik[2] - f3$loglik[2])
pchisq(loglik2, df = 8, lower.tail = F)

## [1] 0.1081725

#Model 2 vs. Model 3
loglik3 <- 2*(f3$loglik[2] - f4$loglik[2])
pchisq(loglik3, df = 1, lower.tail = F)

## [1] 0.1326176

#Model 1 vs. Model 3. Third test may not be necessary, but I included it just to be sure. 
loglik4 <- 2*(f2$loglik[2] - f4$loglik[2])
pchisq(loglik4, df = 9, lower.tail = F)

## [1] 0.08126043

#Assess Proportional Hazard Assumption with Schoenfeld Residuals
test.ph <- cox.zph(f4)
test.ph

##                    chisq df     p
## age2               0.430  1 0.512
## ejection_fraction  3.151  1 0.076
## ejection_fraction2 1.999  1 0.157
## serum_creatinine   0.191  1 0.662
## GLOBAL             5.699  4 0.223

#interpreting serum creatinine coefficient
1-exp(0.2665)

## [1] -0.3053876

.9*0.30538

## [1] 0.274842

1*.30538

## [1] 0.30538

.30538-.2748

## [1] 0.03058

#interpreting ejection fraction coefficient
1-exp(-.2031)

## [1] 0.1838034

exp(.001987)-1

## [1] 0.001988975

-(.38*0.1838)+(((.38)^2)*.00198)

## [1] -0.06955809

-(.37*0.1838)+(((.37)^2)*.00198)

## [1] -0.06773494

-(.30*0.1838)+(((.30)^2)*.00198)

## [1] -0.0549618

-(.60*0.1838)+(((.60)^2)*.00198)

## [1] -0.1095672

.06955-0.06773

## [1] 0.00182

0.06955-0.05496

## [1] 0.01459

0.6955-0.10956

## [1] 0.58594

#interpreting age coefficient
(60^2)*0.000378

## [1] 1.3608

(61^2)*0.000378

## [1] 1.406538

(70^2)*0.000378

## [1] 1.8522

1.406-1.3608

## [1] 0.0452

1.852-1.3608

## [1] 0.4912

exp(0.000378)-1

## [1] 0.0003780715