Background and objective

The objective of this paper is to discuss the incidence and prevalence of cardiovascular disease and its risk factors. The dataset used for this is the Framingham heart disease dataset. We start by presenting some visualisations of the dataset and discussing them. Afterwards we investigate different models to predict whether a patient is at risk of developing coronary heart disease in the next ten years, based on the available information.

The data comprises 2927 observations, 15 predictor variables and one target variable. The data is available on Kaggle.

The variables are described as follows.

  1. age - Age of the individual in years
  2. education - Level of education
  3. sex - Sex of the individual
  4. is_smoking - Whether the individual currently smokes
  5. cigsPerDay - If the individual smokes, the number of cigarettes per day
  6. BPMeds - Whether or not the individual is on blood pressure medication
  7. prevalentStroke - Whether the individual has previously had a stroke
  8. prevalentHyp - Whether the individual has been diagnosed with hypertension
  9. diabetes - Whether the individual has been diagnosed with diabetes
  10. totChol - Total cholesterol level in the blood (mg/dL)
  11. sysBP - Systolic blood pressure (mmHg)
  12. diaBP - Diastolic blood pressure (mmHg)
  13. heartRate - Heart rate (bpm)
  14. BMI - Body mass index (kg/m^2)
  15. glucose - Plasma glucose concentration (glucose tolerance test)
  16. TenYearCHD - Whether the patient is at risk of coronary heart disease (CHD) within the next ten years (target variable)

The R code is not integrated in this paper, but can be downloaded separately here.

Visualisation

First, we take a brief look at the missing data. When analyzing a dataset, it is important to check the missing values and whether they follow a systematic pattern. A first plot of the missing values for each variable can be seen below.

Here we see that, of all the values in the dataset, a total of 0.9% are missing. Most of these missing values belong to the glucose variable: nearly 9% of the glucose values are missing. This is not necessarily a problem. At first glance, the missing values appear to be randomly distributed within the variables. It would be very suspicious if, for example, missing values only occurred within the first 500 observations. But we must always watch out for hidden patterns: for example, do the missing glucose values correlate with another variable?
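
A plot of this kind can be produced, for example, with the naniar package. The following is a minimal sketch, not the code behind the figure above; the data frame name `heart` and the CSV file name are assumptions.

```r
library(naniar)  # missing-data summaries and plots

# Hypothetical file name for the Kaggle download; assumed data frame name "heart"
heart <- read.csv("data_cardiovascular_risk.csv")

pct_miss(heart)                      # overall percentage of missing cells
miss_var_summary(heart)              # missing count and percentage per variable
gg_miss_var(heart, show_pct = TRUE)  # bar chart of missingness per variable
```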

Another way to represent missing values can be seen below.

This figure shows that there are 265 missing values in the variable glucose, 78 in education and 42 in BPMeds. 28 patients have missing values in totChol and glucose, 17 patients have a missing value only in cigsPerDay, and so on. For this paper, we will exclude a patient entirely if there is a missing value in any variable. There is only one patient with missing values in three variables (totChol, BPMeds and glucose). The most important aspect of this plot is how many patients have more than one missing value, and this does not seem to occur often.
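
A combination plot of this kind can be drawn, for example, with gg_miss_upset from the naniar package; it shows how often each combination of missing variables occurs. Again a sketch, assuming the data frame `heart` from above.

```r
library(naniar)

# UpSet-style plot of missing-value combinations: reveals whether patients
# tend to miss a single value or several values at once
gg_miss_upset(heart)
```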

An example of a problem that could occur is if glucose values were only missing for patients with a low level of education. We want the missing values to be randomly spread within each variable, and we want no correlation of missingness between variables. For two variables at a time, we can actually plot the missing values. As an example, we investigate the missing glucose values.

If the reader is not familiar with these plots, they may at first seem a little strange. The red points are observations with missing glucose values, and the blue points are observations with recorded glucose values. As seen before, there were no missing values for age. The missing glucose values appear to be uniformly spread across age and education level. The same holds for the fifth plot, where the missing values for both education level and glucose are shown with respect to age. To investigate these patterns in depth, such plots should be made with respect to all variables, but we will not produce all of them in this paper.
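
One way to build such a plot is with geom_miss_point from naniar, which draws observations with a missing value in a separate colour below the axis. This is a sketch and not necessarily how the original figures were produced.

```r
library(ggplot2)
library(naniar)

# Scatter of glucose against age; rows with missing glucose are shown
# in a different colour below the x-axis instead of being dropped
ggplot(heart, aes(x = age, y = glucose)) +
  geom_miss_point() +
  labs(title = "Missing glucose values across age")
```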

In this paper we remove all rows with one or more missing values. As there are no clear patterns in the missing data, we do not lose important information; we simply have fewer patients. After removing these rows, 2927 patients remain.
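
Removing the incomplete rows is a one-liner, for example (sketch, using the assumed data frame `heart`):

```r
# Keep only complete cases; every row with at least one NA is dropped
heart <- na.omit(heart)
nrow(heart)  # number of patients remaining after removal
```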

Moving on, our next step is to get an idea of the variables in the dataset.

There are four types of variables in the dataset:

  1. Categorical
  2. Ordinal
  3. Continuous
  4. Target

The only ordinal variable we have is education level, as there is a relative ordering between its values. Our ultimate goal, after visualisation, is to predict the outcome of the target variable, i.e. the risk of coronary heart disease in the upcoming ten years.
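
In R, this distinction can be made explicit by storing the categorical variables as factors and education as an ordered factor. This is a sketch; the exact recoding in the downloadable code may differ, and it assumes the column names listed in the data description.

```r
# Encode variable types explicitly (assumed data frame name: heart)
heart$sex             <- factor(heart$sex)
heart$is_smoking      <- factor(heart$is_smoking)
heart$BPMeds          <- factor(heart$BPMeds)
heart$prevalentStroke <- factor(heart$prevalentStroke)
heart$prevalentHyp    <- factor(heart$prevalentHyp)
heart$diabetes        <- factor(heart$diabetes)
heart$TenYearCHD      <- factor(heart$TenYearCHD)

# Education has a natural ordering, so store it as an ordered factor
heart$education <- factor(heart$education, ordered = TRUE)
```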

We plot histograms of the categorical, ordinal and target variables and density plots of the numeric variables.
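
Such plots could be produced along the following lines (a sketch using ggplot2; the figures in this paper may have been generated differently):

```r
library(ggplot2)

# Bar chart for a categorical variable, e.g. smoking status
ggplot(heart, aes(x = is_smoking)) +
  geom_bar()

# Density plot for a numeric variable, e.g. total cholesterol
ggplot(heart, aes(x = totChol)) +
  geom_density()
```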

Many observations may be made from these results; the goal here is merely to get an idea of the data. The use of blood pressure medication is apparently low, as is the prevalence of earlier strokes and of diagnosed diabetes.

Probably the most important observation is that the distribution of our target variable is severely skewed. This matters for the interpretation of model accuracy. Even a model that simply predicts a negative outcome for every patient technically still has a high “accuracy”. Such a model obviously has zero clinical relevance, and it is therefore important how we “rate” predictive models.
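
The class imbalance, and the accuracy of the trivial all-negative classifier, can be checked directly (sketch):

```r
# Proportion of patients in each class of the target variable
prop.table(table(heart$TenYearCHD))

# Accuracy of a model that always predicts "no CHD":
# simply the share of the majority (negative) class
mean(heart$TenYearCHD == 0)
```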

The prevalence of, for example, strokes is extremely low. So the only way such a variable can be significant for our model is if a very large share of the patients who have had a stroke are also at risk of developing CHD. A qualified guess is that this is not the case, but this will be shown later.

Following are the density plots of the numeric variables.

The first important note is that glucose has some extremely large values. These outliers increase the skewness of the density. Are these high values a sign of a heavy-tailed distribution or a measurement error?

A BMI greater than 50 is also quite extreme, but technically realistic. The same can be said about a resting heart rate above 130. For now we will not remove these “outliers”.

Age is the variable that is most symmetrically centered around its mean and therefore has the lowest skewness.

For the two blood pressure values, an interesting medical aspect may be that extremely high systolic blood pressure values occur more often than extremely high diastolic values. This naturally increases the skewness of the systolic blood pressure, but may also give the systolic values more weight in predicting the target value.

Another interesting visual construction is a correlation plot.

The correlation matrix is a very useful tool to gain insight into the correlations across the entire dataset. If two variables are highly correlated, it is often beneficial to remove one of them, since strong multicollinearity conflicts with the assumption in many models that the predictors are not strongly dependent on each other. In the correlation plot it can clearly be seen which variables are highly correlated with each other, and also how each correlates with the target variable TenYearCHD. Age seems to have the highest correlation, followed by the prevalence of hypertension, systolic blood pressure and blood glucose level. Note that this is a correlation, not a causal relationship.

The prevalence of hypertension is highly correlated with systolic and diastolic blood pressure. This makes good sense and may mean that it is beneficial to remove two of these three variables for the predictive model. The variable is_smoking, a binary indicator of whether the patient smokes, is also very highly correlated with the number of cigarettes a patient smokes per day. Again, we may remove one of these. The same holds for the prevalence of diabetes and glucose.
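
A correlation plot of this kind can be produced, for example, with the corrplot package. This is a sketch: it assumes the categorical columns were converted to factors as sketched earlier, and the original figure may have been built differently.

```r
library(corrplot)

# data.matrix() replaces factor columns by their integer codes so that
# every column can enter the correlation matrix
num_heart <- data.matrix(heart)

# Pairwise correlations and a colour-coded correlation plot
cors <- cor(num_heart, use = "pairwise.complete.obs")
corrplot(cors, method = "color", type = "upper", tl.col = "black")
```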

Another way to represent the relationship between the numeric variables and the target, using their densities, can be seen below.

In these plots, more overlap between the two density curves means that the numeric variable is less associated with the target variable; conversely, less overlap means a stronger association. These plots do not really provide new information, as we could already see the correlations in the correlation matrix. It therefore makes sense that age shows the least overlap between the two density curves, systolic blood pressure some, and cholesterol the most.
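
Such a comparison can be drawn by colouring a density plot by the target variable (sketch):

```r
library(ggplot2)

# Density of age for patients with and without a positive TenYearCHD label;
# the amount of overlap hints at how well the variable separates the classes
ggplot(heart, aes(x = age, fill = factor(TenYearCHD))) +
  geom_density(alpha = 0.4) +
  labs(fill = "TenYearCHD")
```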

We can perform the same sort of visualisation for the categorical data.

As the prevalence of TenYearCHD is low, it is easier to plot it on the x-axis and use the other variable as the colour. Here we can see the proportion of people at risk of heart disease for the different factor levels. It seems that especially education level 1, men and patients with hypertension are over-represented in the group at risk of coronary heart disease in the upcoming ten years. Hypertension in particular looks like an important variable, unlike smoking, where the difference between the groups is quite small.
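
These bar charts can be built by filling the bars by the categorical variable and normalising within each target class (sketch):

```r
library(ggplot2)

# Proportion of each sex within the two TenYearCHD classes
ggplot(heart, aes(x = factor(TenYearCHD), fill = sex)) +
  geom_bar(position = "fill") +
  labs(x = "TenYearCHD", y = "proportion")
```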

The figures for BP medication and stroke are a bit difficult to analyse, as the prevalence is low. These variables may still be significant, but it is easier to evaluate them statistically when fitting the classification model.
From these plots we can probably say that a patient who is male, smokes, has hypertension and has education level 1 has a higher chance of being in the risk group for developing CHD in the next ten years than a random patient from the population. A high age, systolic blood pressure or glucose level would probably increase this probability even further. These associations are clinically plausible.

We now have an idea, from visualisation alone, of which variables are important for classifying our target variable (TenYearCHD). The next plot shows the relationship between two important numeric variables, one categorical variable and the target variable.

Here we definitely see some association between the variables age, glucose, hypertension and the risk group. Having a high age and high glucose, together with hypertension, appears to increase the probability of being in the risk group. But the plot also shows that it is more complicated than this: there are CHD risk patients with a low age and low glucose, and there are patients with fairly high age and glucose levels and hypertension who are not classified as at risk. This is relevant for our classification model, as we clearly need more information than these three variables. One more plot of this kind is shown below.
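
A plot like this can be drawn as a scatter plot, with colour and shape encoding the categorical variables (sketch):

```r
library(ggplot2)

# Age against glucose, coloured by the target class and shaped by hypertension
ggplot(heart, aes(x = age, y = glucose,
                  colour = factor(TenYearCHD),
                  shape  = factor(prevalentHyp))) +
  geom_point(alpha = 0.6) +
  labs(colour = "TenYearCHD", shape = "prevalentHyp")
```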

Again we see an association, but there are still cases with a low age and a low systolic blood pressure who are in the risk group. It might be hard for the classification model to predict those cases correctly.

GLM model

The next chapter is about building a predictive model. There are many types of models we could build, but we will start with a generalized linear model (GLM).
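
As a starting point, a logistic regression (a GLM with a binomial family) can be fitted to the cleaned data. A minimal sketch, assuming the data frame `heart` from above; the actual model in the downloadable code may include further preprocessing or variable selection.

```r
# Logistic regression: model the probability of TenYearCHD from all predictors
fit <- glm(TenYearCHD ~ ., data = heart, family = binomial)
summary(fit)  # coefficient estimates, standard errors and p-values

# Predicted probabilities of being in the ten-year CHD risk group
pred <- predict(fit, type = "response")
```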