The WHO MONICA Project

Introduction

Sponsored by the World Health Organization [1], the MONICA Project (Multinational MONItoring of trends and determinants in CArdiovascular disease) was an effort to monitor cardiovascular trends across 21 countries [2]. According to the MONICA website, over 10 million people were studied. Since we are working with only a little over 6,300 records, these data represent a small sample of the larger study population.

Question

Although this dataset is relatively small in both number of variables and total records, it could provide some insight into which factors contribute to a terminal outcome for patients. Hence, we will do a cursory analysis to see which variables look like they could warrant use in a later, more thorough analysis.

Data

First we load the data and look at its structure:

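library(plyr)     # provides revalue(), used to relabel factor levels
library(ggplot2)  # provides ggplot(), used for the plots
# stringsAsFactors = TRUE keeps text columns as factors (no longer the default since R 4.0)
MONICA <- read.csv(url("https://raw.githubusercontent.com/lysanthus/CUNYDSBridge/master/monica.csv"), stringsAsFactors = TRUE)
row.names(MONICA) <- MONICA$X
str(MONICA)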
## 'data.frame':    6367 obs. of  13 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ outcome : Factor w/ 2 levels "dead","live": 2 2 2 2 1 2 2 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "f","m": 1 2 2 2 2 1 2 1 2 1 ...
##  $ age     : int  63 59 68 46 48 55 56 68 69 64 ...
##  $ yronset : int  85 85 85 85 85 85 85 85 85 85 ...
##  $ premi   : Factor w/ 3 levels "n","nk","y": 1 3 1 1 1 1 1 3 1 1 ...
##  $ smstat  : Factor w/ 4 levels "c","n","nk","x": 4 4 2 1 2 1 4 3 2 4 ...
##  $ diabetes: Factor w/ 3 levels "n","nk","y": 1 1 1 1 3 1 1 2 1 1 ...
##  $ highbp  : Factor w/ 3 levels "n","nk","y": 3 3 3 1 1 3 3 3 3 3 ...
##  $ hichol  : Factor w/ 3 levels "n","nk","y": 3 1 1 1 1 3 1 2 3 1 ...
##  $ angina  : Factor w/ 3 levels "n","nk","y": 1 1 1 1 3 1 1 3 1 3 ...
##  $ stroke  : Factor w/ 3 levels "n","nk","y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hosp    : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...

Here we have several factors whose levels, according to the accompanying data dictionary [3], represent yes (y), no (n), or not known (nk). We also have variables representing the patient’s sex, age, year of onset, and (most importantly) the outcome variable recording whether the person survived.

Looking at the overall outcome proportions, we see that survivors make up a slight majority, at about 55% of the total.

## 
##      dead      live 
## 0.4463641 0.5536359
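The code that produced the table above is not shown. One way such proportions can be computed in base R is sketched here, using a small made-up data frame (`toy`, hypothetical values) in place of MONICA:

```r
# Hypothetical toy data standing in for MONICA; the values are made up.
toy <- data.frame(
  outcome = factor(c("live", "dead", "live", "live", "dead"))
)

# table() counts each level; prop.table() converts the counts to proportions.
outcome_props <- prop.table(table(toy$outcome))
print(outcome_props)
```

Applied to the real data, the same idea would read `prop.table(table(MONICA$outcome))`.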

Breaking the outcomes out by gender, we can see that the proportions are nearly identical for women and men:

##    
##          dead      live
##   f 0.4466515 0.5533485
##   m 0.4462541 0.5537459
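Each row of the table above sums to 1, which is what row-wise proportions of a two-way table look like. A minimal sketch of that computation, again on made-up data (`toy` and its values are hypothetical, not taken from MONICA):

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  sex     = factor(c("f", "f", "m", "m", "m", "m")),
  outcome = factor(c("dead", "live", "dead", "live", "live", "live"))
)

# margin = 1 normalizes within each row (each sex), so every row sums to 1.
by_sex <- prop.table(table(toy$sex, toy$outcome), margin = 1)
print(by_sex)
```

Using `margin = 1` rather than the default makes the two sexes directly comparable even when one group is much larger than the other.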

Outcomes

Looking at the breakouts by age group, we see nothing unusual for either gender. The prevalence of cardiovascular issues rises with age at about the same rate for both genders, and the same seems to apply to mortality.
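The age-group breakout itself is not shown. As a rough sketch of how one might build it, the following bins ages with `cut()` and summarizes by group. The 10-year bin width, the break points, and the `toy` data are all assumptions made for illustration:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  age     = c(38, 44, 52, 57, 61, 66, 68, 49),
  sex     = factor(c("f", "m", "m", "f", "m", "f", "m", "f")),
  outcome = factor(c("live", "live", "dead", "live",
                     "dead", "dead", "live", "live"))
)

# Bin ages into 10-year groups; right = FALSE gives intervals like [35,45).
toy$age_group <- cut(toy$age, breaks = seq(35, 75, by = 10), right = FALSE)

# Number of patients per age group and sex.
counts <- table(toy$age_group, toy$sex)
print(counts)

# Share of patients in each age group who died.
death_rate <- tapply(toy$outcome == "dead", toy$age_group, mean)
print(death_rate)
```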

Influences of Outcomes

Now we focus on each factor which could potentially influence whether a patient survived or not. This will help us decide what factors may be worth exploring as part of a deeper analysis.

Smoking

First we will look at whether patients are, or were, smokers. Smoking is often tied to cardiovascular disease, so it makes sense to see how it is represented in our data.

# Change labels on the smoking factor to more friendly names
MONICA <- transform(MONICA,smstat = revalue(smstat,c("c"="Current","n"="No","x"="Ex","nk"=NA)))

# Plot the relative percentages
ggplot(MONICA, aes(smstat, group = outcome)) +
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Relative Frequency") +
  xlab("Smoking") +
  facet_grid(outcome ~ .) +
  labs(title = "Patient Smoking Status",
       subtitle = expression(paste(italic("by patient outcome")))) +
  guides(fill = "none")

It appears that current, non-, and ex-smokers are similarly represented in both the group of patients that survived and the group that did not. However, there is a large proportion of missing values in the deceased group, so this variable may not be terribly useful in discovering trends.
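The imbalance in missing values can be quantified directly rather than read off the plot. The sketch below does this on made-up data (`toy` is hypothetical), and also shows a base-R alternative to `revalue()` for mapping "nk" to NA:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  outcome = factor(c("dead", "dead", "dead", "live", "live", "live")),
  smstat  = factor(c("nk", "c", "nk", "n", "x", "c"))
)

# Relabel the known levels; "nk" is left out of `levels`, so it becomes NA.
toy$smstat <- factor(toy$smstat,
                     levels = c("c", "n", "x"),
                     labels = c("Current", "No", "Ex"))

# Fraction of missing smoking status within each outcome group.
missing_by_outcome <- tapply(is.na(toy$smstat), toy$outcome, mean)
print(missing_by_outcome)
```

If the missing fraction differs sharply between the two outcome groups, comparisons based only on the known values should be treated with caution.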

High Blood Pressure

Another likely factor in the outcome of someone’s cardiovascular health is high blood pressure.

# Change labels on the high bp factor to more friendly names
MONICA <- transform(MONICA,highbp = revalue(highbp,c("y"="Yes","n"="No","nk"=NA)))

# Plot the relative percentages
ggplot(MONICA, aes(highbp, group = outcome)) +
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Relative Frequency") +
  xlab("High Blood Pressure") +
  facet_grid(outcome ~ .) +
  labs(title = "Patient High Blood Pressure",
       subtitle = expression(paste(italic("by patient outcome")))) +
  guides(fill = "none")

As with smoking, we see nearly equal proportions of patients with and without high blood pressure among both those who survived and those who did not. Also as with smoking, a large proportion of the values are unknown among patients who did not survive. This makes it difficult to say whether high blood pressure had any significant impact on survival.

Angina

Now we switch gears to more acute conditions that could have affected survival rates in the patient population. First, we look at whether a patient had angina, which the American Heart Association describes as “…chest pain or discomfort caused when your heart muscle doesn’t get enough oxygen-rich blood” [4].

# Change labels on the angina factor to more friendly names
MONICA <- transform(MONICA,angina = revalue(angina,c("y"="Yes","n"="No","nk"=NA)))

# Plot the relative percentages
ggplot(MONICA, aes(angina, group = outcome)) +
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Relative Frequency") +
  xlab("Angina") +
  facet_grid(outcome ~ .) +
  labs(title = "Patient Angina",
       subtitle = expression(paste(italic("by patient outcome")))) +
  guides(fill = "none")

Here we see that patients without angina were much more heavily represented in the group of patients that survived. This factor may be useful in a more thorough future analysis.

Stroke

Now we turn our attention to patients who have had a stroke.

# Change labels on the stroke factor to more friendly names
MONICA <- transform(MONICA,stroke = revalue(stroke,c("y"="Yes","n"="No","nk"=NA)))

# Plot the relative percentages
ggplot(MONICA, aes(stroke, group = outcome)) +
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Relative Frequency") +
  xlab("Stroke") +
  facet_grid(outcome ~ .) +
  labs(title = "Patient Stroke",
       subtitle = expression(paste(italic("by patient outcome")))) +
  guides(fill = "none")

Interestingly, the proportion of patients who had a stroke is similar among those who survived and those who did not. As with the other variables, the deceased group has a large number of missing values.

Conclusion

We can repeat the same sorts of analyses for the other variables in the dataset. This may give us a sense of which features are most important to use in a fuller analysis.
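Repeating the analysis by hand for each variable gets tedious, so a loop is one option. The sketch below computes outcome-wise proportions for several variables at once. The column names `diabetes` and `hichol` match the dataset's structure, but the `toy` values are made up, and it assumes "nk" has already been recoded to NA:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  outcome  = factor(c("dead", "dead", "live", "live", "live", "dead")),
  diabetes = factor(c("y", NA, "n", "n", "y", "n")),
  hichol   = factor(c(NA, NA, "y", "n", "n", "y"))
)

vars <- c("diabetes", "hichol")

# For each variable, compute the share of each level within each outcome.
# useNA = "ifany" keeps NA as its own column so missingness stays visible.
summaries <- lapply(vars, function(v) {
  prop.table(table(toy$outcome, toy[[v]], useNA = "ifany"), margin = 1)
})
names(summaries) <- vars
print(summaries)
```

Keeping the NA column in each table makes it easy to spot variables where missing values are concentrated in one outcome group.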


  1. http://www.who.org

  2. https://thl.fi/monica/

  3. http://vincentarelbundock.github.io/Rdatasets/doc/DAAG/monica.html

  4. http://www.heart.org/HEARTORG/Conditions/HeartAttack/DiagnosingaHeartAttack/Angina-Chest-Pain_UCM_450308_Article.jsp#.W14cYy2ZP-Y