Cardiovascular Disease Prediction

Group 4
S2156317 Chai Kang Sheng

S2164443 SIDI LIU

S2136386 Naxin Dong

22069349 JIAJIA JIANG

S2175434 An Zhao

1.Background

Cardiovascular disease (CVD) refers to a class of diseases that involve the heart or blood vessels. It is a major cause of mortality globally. According to the World Health Organization (WHO), cardiovascular diseases take the lives of 17.9 million people every year, accounting for 31% of all global deaths. Major types of CVD include coronary artery disease (heart attacks), cerebrovascular disease (stroke), and hypertension (high blood pressure), among others.

Predicting cardiovascular disease is a significant focus of medical research and healthcare delivery. The prediction of CVD is based on identifying various risk factors that are associated with heart disease and monitoring those factors in individuals over time. Risk factors include age, sex, family history of heart disease, smoking, physical inactivity, an unhealthy diet, obesity, diabetes, hypertension, and high cholesterol levels.

Over time, the tools and methods for predicting cardiovascular disease have become more sophisticated. Health professionals now use a combination of medical tests (such as blood tests and physical examinations), personal history, and various predictive models to determine an individual’s risk of developing CVD. Some of these predictive models are based on complex statistical methods or machine learning algorithms that can take into account a large number of variables and their interactions.

In the era of big data and artificial intelligence, the ability to predict CVD has seen significant advancements. Machine learning and data mining techniques are being widely used in the healthcare sector, including for CVD prediction. These techniques can handle a large volume of data and can uncover hidden patterns that might not be evident using traditional statistical methods. This has led to the development of more accurate and personalized predictive models for CVD.

However, it’s important to note that while these predictive models can be powerful tools, they are not perfect and can’t guarantee an individual will or won’t develop CVD. They should be used as part of a broader approach to healthcare that includes regular check-ups, a healthy lifestyle, and appropriate medical interventions when necessary.

2.Introduction and Objectives

In recent years, there has been a significant increase in interest in the development of predictive models for cardiovascular disease. These models utilize advanced data analysis techniques, such as machine learning and artificial intelligence, to identify individuals at high risk of developing CVD and to provide personalized risk assessments. By leveraging large datasets and incorporating a wide range of factors such as age, gender, medical history, lifestyle habits, and biomarkers, these prediction models can generate valuable insights and assist in clinical decision-making.

The objective of this project is to create a robust and accurate predictive model for cardiovascular disease that can assist healthcare professionals in identifying individuals who are at high risk and provide timely interventions. By utilizing the power of machine learning algorithms, we aim to develop a comprehensive model that can enhance risk stratification and enable targeted preventive measures.

The specific objectives include:

-Finding a CVD dataset, which will then be standardized, cleaned and processed to ensure its quality for analysis.

-Identifying the most informative features from the collected data to build an optimized predictive model. This process will involve exploratory data analysis (EDA).

-Developing a machine learning algorithm to build a predictive model which will be trained on a labeled dataset.

3.Problem Statement

There are a few problems identified in past studies about CVD. The first is that existing risk assessment methods for CVD lack accuracy and fail to adequately predict the occurrence of CVD in individuals. Traditional risk assessment tools rely on a limited set of variables such as age, gender, and a few basic clinical measurements. These variables can lead to underestimation or overestimation of risk in different populations, resulting in missed opportunities for early detection and timely treatment. Therefore, there is an urgent need to develop a predictive model for CVD that incorporates a more comprehensive set of variables such as cholesterol levels, BMI, glucose level and more.

The second problem is that the adoption of predictive models for CVD is often hindered by a lack of interpretability and explainability. Healthcare professionals are usually skeptical about relying on machine learning algorithms that provide accurate predictions but lack transparency in their decision-making process. They need insight into the contributing factors and underlying mechanisms behind a model’s predictions in order to make informed clinical decisions. However, current predictive models for CVD often lack interpretability, which prevents their widespread adoption and integration into routine practice. Therefore, there is a need to develop a predictive model for CVD that not only demonstrates high accuracy but also provides interpretable and explainable results, allowing healthcare professionals to understand and trust its predictions and thereby supporting more accurate clinical decision-making.

4.Data Collection

Data set: Cardiovascular Disease dataset from the Kaggle website

https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset/code?datasetId=107706&language=R

The data set consists of 70,000 records of patient data.

The data set includes 13 attributes described as follows:

id : It’s just the ID number of the row; not relevant for the analysis

age : It’s the age of the person in days

gender : It’s the gender of the person (1: female, 2: male)

height : It’s the height of the person in cm

weight : It’s the weight of the person in kg

ap_hi : It’s the systolic blood pressure, i.e. the pressure exerted when blood is ejected into the arteries. Normal value: 120 mmHg or below

ap_lo : It’s the diastolic blood pressure, i.e. the pressure in the arteries between heartbeats. Normal value: 80 mmHg or below

cholesterol : It’s the cholesterol level of the blood (cholesterol is a type of fat found in the blood), coded as {1 : ‘normal’, 2 : ‘above normal’, 3 : ‘well above normal’}. For reference, in adults below 200 mg/dL is desirable, with 200-239 mg/dL borderline high; in children below 170 mg/dL is desirable, with 170-199 mg/dL borderline high

gluc : It’s the glucose level, coded as {1 : ‘normal’, 2 : ‘above normal’, 3 : ‘well above normal’}. For reference, normal blood glucose is less than 100 mg/dL after fasting for at least 8 hours and less than 140 mg/dL two hours after eating; for most people without diabetes, blood sugar before meals hovers around 70 to 80 mg/dL

smoke : It contains binary values stating whether the person is a smoker, i.e. {0 : ‘not a smoker’, 1 : ‘smoker’}

alco : It contains binary values stating whether the person drinks alcohol, i.e. {0 : ‘no alcohol intake’, 1 : ‘alcohol intake’}

active : It contains binary values stating whether the person is physically active, i.e. {0 : ‘not physically active’, 1 : ‘physically active’}

cardio : It’s our target variable, with binary values stating whether the person has cardiovascular disease (CVD), i.e. {0 : ‘does not have CVD’, 1 : ‘has CVD’}

5.Data Preprocessing

In this section we explore the collected data using descriptive statistics to better understand it. This allows us to detect anomalies and inconsistencies in the data and to preprocess it so that it is cleaner and easier to work with in the later analysis.

First, import the libraries used for data preparation and EDA.

library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(tidyverse)
library(gridExtra)
library(corrplot)

Read the “cardio_train.csv” data and show its structure.

df_cardio <- read.csv("cardio_train.csv",sep = ";")
str(df_cardio)
## 'data.frame':    70000 obs. of  13 variables:
##  $ id         : int  0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : int  18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
##  $ gender     : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: int  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : int  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...

From the structure of the data, we can see that the data frame contains 70,000 observations of 13 variables. The id feature is not used in this case study, so it will be removed.

#remove id column
df_cardio<-df_cardio[-c(1)]
head(df_cardio)
##     age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 18393      2    168     62   110    80           1    1     0    0      1
## 2 20228      1    156     85   140    90           3    1     0    0      1
## 3 18857      1    165     64   130    70           3    1     0    0      0
## 4 17623      2    169     82   150   100           1    1     0    0      1
## 5 17474      1    156     56   100    60           1    1     0    0      0
## 6 21914      1    151     67   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0

After removing the id column, we check for missing values and remove them if present.

#check missing value 
sum(is.na(df_cardio))
## [1] 0

As none of the variables have missing values, there is no need to remove any rows. Before processing the data further, we also check for duplicated rows and remove them.

sum(duplicated(df_cardio))
## [1] 24

The check finds 24 duplicated rows, which we remove.

df_cardio<-subset(df_cardio,!duplicated(df_cardio))

In the dataset the age column is given in days, so we convert it to years for easier analysis.

#converting age from days to years
df_cardio$age <- as.numeric(round(df_cardio$age / 365))

Convert gender from female “1” and male “2” to female “0” and male “1”, so that its coding matches the other binary features.

#converting female to 0 and male to 1
df_cardio$gender<-as.numeric(df_cardio$gender-1)

After adjusting those features, we add a new column for BMI (Body Mass Index), where BMI = weight (kg) / height (m)^2.

#add new column for BMI
df_cardio$BMI<-round((df_cardio$weight/(df_cardio$height/100)^2),digits = 2)

Summarize the data to find outliers in each feature and remove extreme values.

summary(df_cardio)
##       age            gender           height          weight      
##  Min.   :30.00   Min.   :0.0000   Min.   : 55.0   Min.   : 10.00  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:159.0   1st Qu.: 65.00  
##  Median :54.00   Median :0.0000   Median :165.0   Median : 72.00  
##  Mean   :53.34   Mean   :0.3496   Mean   :164.4   Mean   : 74.21  
##  3rd Qu.:58.00   3rd Qu.:1.0000   3rd Qu.:170.0   3rd Qu.: 82.00  
##  Max.   :65.00   Max.   :1.0000   Max.   :250.0   Max.   :200.00  
##      ap_hi             ap_lo           cholesterol         gluc      
##  Min.   : -150.0   Min.   :  -70.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:  120.0   1st Qu.:   80.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median :  120.0   Median :   80.00   Median :1.000   Median :1.000  
##  Mean   :  128.8   Mean   :   96.64   Mean   :1.367   Mean   :1.227  
##  3rd Qu.:  140.0   3rd Qu.:   90.00   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :16020.0   Max.   :11000.00   Max.   :3.000   Max.   :3.000  
##      smoke              alco             active           cardio      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :1.0000   Median :0.0000  
##  Mean   :0.08816   Mean   :0.05379   Mean   :0.8037   Mean   :0.4998  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##       BMI        
##  Min.   :  3.47  
##  1st Qu.: 23.88  
##  Median : 26.39  
##  Mean   : 27.56  
##  3rd Qu.: 30.22  
##  Max.   :298.67

Remove records with height below 120 cm or above 210 cm, and records with weight below 30 kg, which are implausible given the age range of 30-65 years.

df_cardio<-df_cardio[!(df_cardio$height > 210|df_cardio$height < 120),]
df_cardio<-df_cardio[!(df_cardio$weight<30),]

Checking ap_hi & ap_low value ap_hi : It’s the Systolic blood pressure,the normal value : 120mmhg or Below ap_low : It’s the Diastolic blood pressure the normal Value : 80mmhg or Below

# remove ap_hi is higher than 250 or lower than 90
df_cardio <- df_cardio[!(df_cardio$ap_hi > 250 | df_cardio$ap_hi < 90), ]
# remove ap_lo is higher than 160 or lower than 40
df_cardio <- df_cardio[!(df_cardio$ap_lo > 160 | df_cardio$ap_lo < 40), ]
# remove ap_lo is higher than 'ap_hi'
df_cardio <- df_cardio[!(df_cardio$ap_lo > df_cardio$ap_hi), ]

Check BMI and eliminate rows with abnormal BMI values.

df_cardio <- df_cardio[!(df_cardio$BMI >150),]

After cleaning the data, we add a new column “BMI_Group”: BMI below 18.5 is underweight; 18.5 to 23.9 is the normal range; 24 to 27.9 is overweight; and 28 and above is obese.

df_cardio$BMI_Group<-cut(df_cardio$BMI,breaks =c(0,18.5,23.9,27.9,Inf),labels=c("Underweight","Normal","Overweight","Obese"))

Save the cleaned Cardiovascular Disease dataset to a new CSV file.

df_cardio_clean<-df_cardio
write.csv(df_cardio_clean,file='df_cardio_clean.csv')
str(df_cardio_clean)
## 'data.frame':    68517 obs. of  14 variables:
##  $ age        : num  50 55 52 48 48 60 61 62 48 54 ...
##  $ gender     : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: int  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : int  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...
##  $ BMI        : num  22 34.9 23.5 28.7 23 ...
##  $ BMI_Group  : Factor w/ 4 levels "Underweight",..: 2 4 2 4 2 4 4 4 4 3 ...

6.EDA and Visualization

Exploratory data analysis is the process of analyzing data sets using statistical graphics and visualization tools to help us understand the main characteristics of the data set, the correlation of attributes, the distribution of data, and whether there are outliers in the data.

df<-read.csv("df_cardio_clean.csv")
set.seed(123) 
df1<-df
head(df1,5)
##   X age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 1  50      1    168     62   110    80           1    1     0    0      1
## 2 2  55      0    156     85   140    90           3    1     0    0      1
## 3 3  52      0    165     64   130    70           3    1     0    0      0
## 4 4  48      1    169     82   150   100           1    1     0    0      1
## 5 5  48      0    156     56   100    60           1    1     0    0      0
##   cardio   BMI BMI_Group
## 1      0 21.97    Normal
## 2      1 34.93     Obese
## 3      1 23.51    Normal
## 4      1 28.71     Obese
## 5      0 23.01    Normal

6.1 Descriptive Statistics

The Cardiovascular Disease dataset contains six interval variables: age, height, weight, systolic blood pressure, diastolic blood pressure and BMI. We can view descriptive statistics for these variables, such as the mean, median and quartiles.

From the statistics of the interval variables we can observe the following: the respondents are between 30 and 65 years old, with an average age of 53.33 years; the median BMI is 26.35, which is above the normal range. This means that many respondents may be overweight or obese.

Interval_variables<-df %>% select(age, height, weight, ap_hi, ap_lo ,BMI)
summary(Interval_variables) 
##       age            height          weight           ap_hi      
##  Min.   :30.00   Min.   :120.0   Min.   : 30.00   Min.   : 90.0  
##  1st Qu.:48.00   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:120.0  
##  Median :54.00   Median :165.0   Median : 72.00   Median :120.0  
##  Mean   :53.33   Mean   :164.4   Mean   : 74.13   Mean   :126.7  
##  3rd Qu.:58.00   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:140.0  
##  Max.   :65.00   Max.   :207.0   Max.   :200.00   Max.   :240.0  
##      ap_lo             BMI        
##  Min.   : 40.00   Min.   : 10.73  
##  1st Qu.: 80.00   1st Qu.: 23.88  
##  Median : 80.00   Median : 26.35  
##  Mean   : 81.33   Mean   : 27.46  
##  3rd Qu.: 90.00   3rd Qu.: 30.12  
##  Max.   :160.00   Max.   :108.17

For the categorical variables (gender, cholesterol, glucose, smoking, alcohol intake, physical activity and BMI_Group), we use frequency tables.

From the frequency tables of the categorical variables we can observe the following: the number of women in the data is roughly 1.9 times the number of men, so women are over-represented in this data set; most people have normal cholesterol and glucose levels, do not smoke, do not drink alcohol and are physically active. However, in BMI_Group the obese category is the largest, which reflects a considerable degree of overweight and obesity in this group.

gender_labels <- c("women", "men")
cholesterol_labels <- c("normal", "above normal", "well above normal")
gluc_labels <- c("normal", "above normal", "well above normal")
smoke_labels <- c("no", "yes")
alco_labels <- c("no", "yes")
active_labels <- c("no", "yes")

df$gender <- factor(df$gender, levels = c(0, 1), labels = gender_labels)
df$cholesterol <- factor(df$cholesterol, levels = 1:3, labels = cholesterol_labels)
df$gluc <- factor(df$gluc, levels = 1:3, labels = gluc_labels)
df$smoke <- factor(df$smoke, levels = 0:1, labels = smoke_labels)
df$alco <- factor(df$alco, levels = 0:1, labels = alco_labels)
df$active <- factor(df$active, levels = 0:1, labels = active_labels)

Categorical_variables<-df %>% select(gender, cholesterol, gluc, smoke, alco, active, BMI_Group) %>% lapply(table) %>% print() 
## $gender
## 
## women   men 
## 44616 23901 
## 
## $cholesterol
## 
##            normal      above normal well above normal 
##             51373              9280              7864 
## 
## $gluc
## 
##            normal      above normal well above normal 
##             58247              5057              5213 
## 
## $smoke
## 
##    no   yes 
## 62491  6026 
## 
## $alco
## 
##    no   yes 
## 64862  3655 
## 
## $active
## 
##    no   yes 
## 13468 55049 
## 
## $BMI_Group
## 
##      Normal       Obese  Overweight Underweight 
##       17455       26136       24295         631

6.2 Visualization

Draw a pie chart showing the proportion of patients with and without cardiovascular disease.

It can be seen from the figure that 31,858 people (about 49%) do not have cardiovascular disease, while 32,978 people (about 51%) do. The distribution of the target variable is therefore fairly even, with no significant class imbalance. This balanced distribution helps when building predictive models and supports accurate prediction and analysis of cardiovascular disease.

df %>%
  count(cardio) %>%
  mutate(percentage = n / sum(n)) %>%
  ggplot(aes(x = "", y = percentage, fill = factor(cardio, levels = c(1, 0)))) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(aes(y = cumsum(percentage) - percentage / 2,
                label = paste0(round(percentage * 100), "% (", n, ")")),
            color = "black") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = c("khaki", "skyblue"),
                    name = "Cardio Disease",
                    labels = c("Yes", "No")) +
  theme_void() +
  ggtitle("Pie Chart of Patients with Cardiovascular Disease") +
  theme(plot.title = element_text(size = 20))

Draw a bar chart of age group against the target variable: group ages into 5-year intervals and count the number of people with and without the disease in each group.

It can be seen from the figure that in the 30-55 age groups the number of people without the disease exceeds the number with it, while in the 55-65 age groups the number of people with the disease is higher. This suggests that the probability of disease is related to age.

age_group <- cut(df1$age, breaks = seq(30, 65, by = 5), include.lowest = TRUE, right = FALSE)
ggplot(df1, aes(x=age_group, fill=factor(cardio))) +
  geom_bar(position = "dodge") +
  scale_fill_discrete(name="Cardio", labels=c("No", "Yes")) +
  xlab("Age Group") +
  ylab("Number of Patients") +
  ggtitle("Bar Chart of Age Group by Cardiovascular Disease") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Draw the boxplots of Systolic blood pressure and Diastolic blood pressure respectively.

From the Systolic blood pressure boxplot, it’s evident that people with cardiovascular disease have a notably higher median systolic blood pressure compared to those without the disease. This implies a strong link between systolic blood pressure and the disease. Furthermore, systolic blood pressure values in those with the disease display a more dispersed pattern, which may reflect the severity of the disease and the diversity of control conditions.

By contrast, in the diastolic blood pressure boxplot the medians are comparable between people with and without the disease, but the values in the diseased group are more dispersed, with greater fluctuations. This may indicate less stable diastolic blood pressure control among people with cardiovascular disease.

remove_outliers <- function(df, variable_name){
  Q1 <- df %>% summarise(q = quantile(!!sym(variable_name), .25, na.rm = TRUE)) %>% pull(q)
  Q3 <- df %>% summarise(q = quantile(!!sym(variable_name), .75, na.rm = TRUE)) %>% pull(q)
  IQR <- Q3 - Q1
  df <- df %>% filter(!((!!sym(variable_name) < (Q1 - 1.5 * IQR)) | (!!sym(variable_name) > (Q3 + 1.5 * IQR))))
  return(df)
}

df2 <- remove_outliers(df1, "ap_hi")
df2 <- remove_outliers(df2, "ap_lo")

df2 <- df2 %>% mutate(cardio = factor(cardio, levels = c(0, 1), labels = c("No", "Yes")))

p1 <- ggplot(df2, aes(x=cardio, y=ap_hi, fill=cardio)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x="Cardio", y="Systolic blood pressure") +
  theme_minimal() +
  scale_fill_manual(values=c("skyblue", "lightpink"), name="Cardio", labels=c("No", "Yes"))

p2 <- ggplot(df2, aes(x=cardio, y=ap_lo, fill=cardio)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x="Cardio", y="Diastolic blood pressure") +
  theme_minimal() +
  scale_fill_manual(values=c("skyblue", "lightpink"), name="Cardio", labels=c("No", "Yes"))

grid.arrange(p1, p2, ncol=2)

Draw the boxplots of height and weight respectively.

It can be seen from the figure that the median height of the people with the disease is slightly lower than that of the people without the disease, but their weight is higher than that of the people without the disease. This indicates that there may be a certain correlation between body weight and cardiovascular disease.

df3 <- remove_outliers(df1, "height")
df3 <- remove_outliers(df3, "weight")

df3 <- df3 %>% mutate(cardio = factor(cardio, levels = c(0, 1), labels = c("No", "Yes")))

p1 <- ggplot(df3, aes(x=cardio, y=height, fill=cardio)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x="Cardio", y="Height") +
  theme_minimal() +
  scale_fill_manual(values=c("#E69F00", "#56B4E9"), name="Cardio", labels=c("No", "Yes"))

p2 <- ggplot(df3, aes(x=cardio, y=weight, fill=cardio)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x="Cardio", y="Weight") +
  theme_minimal() +
  scale_fill_manual(values=c("#E69F00", "#56B4E9"), name="Cardio", labels=c("No", "Yes"))

grid.arrange(p1, p2, ncol=2)

Plot a bar chart of Distribution of Cholesterol Levels among Cardiovascular Disease Categories.

It can be seen from the figure that in the normal level of cholesterol, the number of people without the disease is more than the number of people with the disease; on the contrary, in the above normal and well above normal, the number of people with the disease is more than the number of people without the disease. This may indicate that higher cholesterol levels are associated with a greater risk of cardiovascular disease.

p1 <- ggplot(df1, aes(x = factor(cholesterol), fill = factor(cardio))) +
  geom_bar(position = "stack") +
  labs(title = "Distribution of Cholesterol Levels among Cardiovascular Disease Categories",
       x = "Cholesterol", 
       y = "Count") +
  theme_minimal() +
  scale_x_discrete(labels = c("Normal", "Above Normal", "Well Above Normal")) +
  scale_fill_discrete(name = "Cardio", labels = c("No", "Yes")) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5))
p1

Plot a bar chart of Distribution of Glucose Levels among Cardiovascular Disease Categories.

It can be seen from the figure that in the normal level of Glucose, the number of people without the disease is more than the number of people with the disease; in the above normal and well above normal, the number of people with the disease is more than the number of people without the disease. This could suggest a link between glucose levels and cardiovascular disease.

p2 <- ggplot(df1, aes(x = factor(gluc), fill = factor(cardio))) +
  geom_bar(position = "stack") +
  labs(title = "Distribution of Glucose Levels among Cardiovascular Disease Categories",
       x = "Glucose", 
       y = "Count") +
  theme_minimal() +
  scale_x_discrete(labels = c("Normal", "Above Normal", "Well Above Normal")) +
  scale_fill_manual(values = c("peachpuff", "lightcoral"), name = "Cardio", labels = c("No", "Yes")) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5))
p2

Plot a bar chart of Distribution of BMI Groups among Cardiovascular Disease Categories.

It can be seen from the figure that as BMI increases, the proportion of people with cardiovascular disease also increases; in the obese group in particular, the number of people with cardiovascular disease exceeds the number without it. This suggests that obesity has a notable impact on cardiovascular disease.

df1$BMI_Group <- factor(df1$BMI_Group, levels = c("Underweight", "Normal", "Overweight", "Obese"))

ggplot(df1, aes(x = BMI_Group, fill = factor(cardio))) +
  geom_bar(position = "stack") +
  geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.5), size = 3) +
  labs(title = "Distribution of BMI Groups among Cardiovascular Disease Categories",
       x = "BMI Group", 
       y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("lavender", "thistle"), name = "Cardio", labels = c("No", "Yes"))

Plot the correlation matrix. From the correlation matrix, the relationships between the variables and their relationship with the target variable can be observed.

Diastolic and systolic blood pressure show a strong positive correlation, with a correlation coefficient of 0.73. This is likely because both are driven by blood flow and the pumping of the heart: when systolic blood pressure increases, diastolic blood pressure tends to increase as well. The attribute most strongly correlated with the target variable is systolic blood pressure, which suggests that systolic blood pressure may be an important indicator for predicting the occurrence of cardiovascular disease. Further analysis and modeling can explore the relationships between the other variables and the target variable and identify the most predictive factors.

cor_matrix <- cor(df1[c("age", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "cardio", "BMI")])

my_palette <- colorRampPalette(c("blue", "white", "red"))(100)

corrplot(cor_matrix, method = "color",
         tl.col = "black", 
         col = my_palette, 
         addCoef.col = "black")

7.Modeling

Import the libraries for modeling, read the data, and split it into training and testing sets.

library(klaR)
library(class)
library(gbm)
library(e1071)
library(ROCR)
library(glmnet)
library(rpart)
library(rpart.plot)
library(caret)
library(pROC)

Read the data and set the random seed. Then create indices for the training and test sets and split the data accordingly. After that, create an empty data frame to store the performance metrics of each model, an empty list to store the confusion matrices, and an empty list to store the ROC objects.

df <- read.csv("df_cardio_clean.csv")
df$BMI_Group <- as.factor(df$BMI_Group)
set.seed(123)

index <- createDataPartition(df$cardio, p = 0.7, list = FALSE)

train_data <- df[index,]
test_data <- df[-index,]

performance_metrics <- data.frame()
confusion_matrices <- list()
roc_objects <- list()

Decision Tree model

Decision Tree is a tree-like classification model that partitions data into different categories by sequentially splitting on features. The parameters of the Decision Tree model include the method of tree construction (in this case, “anova” method) and the maximum depth of the tree.

model <- rpart(cardio ~ ., data = train_data, method = "anova")
predictions <- predict(model, test_data)
binary_predictions <- ifelse(predictions > 0.5, 1, 0)
accuracy <- sum(binary_predictions == test_data$cardio) / length(test_data$cardio)
precision <- sum(binary_predictions == 1 & test_data$cardio == 1) / sum(binary_predictions == 1)
recall <- sum(binary_predictions == 1 & test_data$cardio == 1) / sum(test_data$cardio == 1)
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
  Model = "Decision Tree",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score
))
confusion_matrices[["Decision Tree"]] <- table(binary_predictions, test_data$cardio)
roc_objects[["Decision Tree"]] <- roc_obj

Logistic Regression model

Logistic Regression is a widely used linear model for classification problems. It maps the linear combination of input features to probabilities between 0 and 1 and performs classification based on the probabilities. The parameters of the Logistic Regression model include regularization methods, regularization strength, and others.

model <- glm(cardio ~ ., data = train_data, family = binomial)
predictions <- predict(model, newdata = test_data, type = "response")
predicted_labels <- ifelse(predictions >= 0.5, 1, 0)
accuracy <- sum(predicted_labels == test_data$cardio) / nrow(test_data)
confusion_matrix <- table(predicted_labels, test_data$cardio)
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2,])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
  Model = "Logistic Regression",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score
))
confusion_matrices[["Logistic Regression"]] <- confusion_matrix
roc_objects[["Logistic Regression"]] <- roc_obj

Gradient Boosting Trees model

Gradient Boosting Trees is an ensemble learning method that iteratively trains multiple decision trees to progressively improve the model’s performance. It optimizes the gradients of the loss function to train each individual tree and then combines them for predictions. The parameters of the Gradient Boosting Trees model include the number of trees, tree depth, learning rate, and others.

gbm_model <- gbm(cardio ~ ., data = train_data, distribution = "bernoulli", n.trees = 100, interaction.depth = 3)
gbm_pred <- predict(gbm_model, newdata = test_data, n.trees = 100, type = "response")
confusion_matrix <- table(gbm_pred > 0.5, test_data$cardio)
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2]/sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2]/sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, gbm_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
  Model = "Gradient Boosting Trees",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score
))
confusion_matrices[["Gradient Boosting Trees"]] <- confusion_matrix
roc_objects[["Gradient Boosting Trees"]] <- roc_obj

Naive Bayes model

Naive Bayes is a probabilistic model based on Bayes’ theorem and the assumption of feature independence given the class. It assumes that each feature is independent given the class and performs classification based on the conditional probabilities of the features. The parameters of the Naive Bayes model typically involve assumptions about the feature distributions.

train_data$cardio <- as.factor(train_data$cardio)
test_data$cardio <- as.factor(test_data$cardio)
naive_model <- NaiveBayes(cardio ~ ., data = train_data)

# Use the trained model to make predictions
naive_predictions <- predict(naive_model, newdata = test_data)$class

# Compute accuracy, confusion matrix, precision, recall, and F1 score
accuracy <- sum(naive_predictions == test_data$cardio) / length(test_data$cardio)
confusion_matrix <- table(naive_predictions, test_data$cardio)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, as.numeric(naive_predictions))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
  Model = "Naive Bayes",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score
))
confusion_matrices[["Naive Bayes"]] <- confusion_matrix
roc_objects[["Naive Bayes"]] <- roc_obj

K-Nearest Neighbors(KNN)

K-Nearest Neighbors is an instance-based learning method that classifies based on the similarity between samples. Given a new sample, the K-Nearest Neighbors model identifies the K nearest samples in the training set and determines the sample’s class based on majority voting among its neighbors. The parameter of the K-Nearest Neighbors model is K, the number of nearest neighbors to consider.

knn_predictions <- knn(train = train_data[-ncol(train_data)], test = test_data[-ncol(test_data)], cl = train_data$cardio, k = 3)

# Compute accuracy, confusion matrix, precision, recall, and F1 score
accuracy <- sum(knn_predictions == test_data$cardio) / length(test_data$cardio)
confusion_matrix <- table(knn_predictions, test_data$cardio)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, as.numeric(knn_predictions))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
  Model = "K-Nearest Neighbors",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score
))
confusion_matrices[["K-Nearest Neighbors"]] <- confusion_matrix
roc_objects[["K-Nearest Neighbors"]] <- roc_obj

8.Evaluation

After running all the models, print the consolidated performance metrics: accuracy, precision, recall and the F1 score (the harmonic mean of precision and recall).

print(performance_metrics)
##                     Model  Accuracy Precision    Recall  F1_Score
## 1           Decision Tree 0.7150572 0.7567600 0.6232141 0.6835252
## 2     Logistic Regression 0.7270251 0.7478698 0.6745492 0.7093198
## 3 Gradient Boosting Trees 0.7346631 0.7028279 0.7452722 0.7234280
## 4             Naive Bayes 0.7053272 0.6205537 0.7405927 0.6752801
## 5     K-Nearest Neighbors 0.6349307 0.6004532 0.6385833 0.6189315

Print the confusion matrix for each model

for (model in names(confusion_matrices)) {
  print(paste("Confusion matrix for", model))
  print(confusion_matrices[[model]])
}
## [1] "Confusion matrix for Decision Tree"
##                   
## binary_predictions    0    1
##                  0 8373 3824
##                  1 2033 6325
## [1] "Confusion matrix for Logistic Regression"
##                 
## predicted_labels    0    1
##                0 8098 3303
##                1 2308 6846
## [1] "Confusion matrix for Gradient Boosting Trees"
##        
##            0    1
##   FALSE 7968 3016
##   TRUE  2438 7133
## [1] "Confusion matrix for Naive Bayes"
##                  
## naive_predictions    0    1
##                 0 8200 3851
##                 1 2206 6298
## [1] "Confusion matrix for K-Nearest Neighbors"
##                
## knn_predictions    0    1
##               0 6957 4055
##               1 3449 6094

Plot the ROC curves for all models on the same graph

colors <- c("red", "blue", "green", "purple", "orange")   
plot(roc_objects[[1]], main = "ROC Curves", col = colors[1])
for (i in 2:length(roc_objects)) {
  lines(roc_objects[[i]], col = colors[i])
}
legend("bottomright", legend = names(roc_objects), col = colors, lwd = 2)

9.Conclusion

Based on the accuracy metrics, we can analyze the performance of each model and select the best model. Here is a detailed analysis of the accuracy for each model:

Decision Tree Model: The decision tree model achieves an accuracy of 0.715, which is relatively high. Decision trees have the advantage of being interpretable, capturing complex relationships in the data, and requiring minimal preprocessing of the features. However, decision trees are prone to overfitting and may perform poorly on complex datasets.

Logistic Regression Model: The logistic regression model achieves an accuracy of 0.727, slightly higher than the decision tree model. Logistic regression is a simple yet powerful linear classifier that is widely applicable to classification problems. It can handle large-scale datasets and offers good interpretability. However, logistic regression may have limited modeling capabilities for nonlinear relationships.

Gradient Boosting Trees Model: The gradient boosting trees model achieves an accuracy of 0.734, slightly higher than the logistic regression model. Gradient boosting trees are powerful ensemble learning methods that can handle complex nonlinear relationships. They iteratively train multiple decision trees and combine their predictions to improve performance. However, training gradient boosting trees models can be time-consuming, and tuning multiple parameters is necessary to achieve optimal performance.

Naive Bayes Model: The Naive Bayes model achieves an accuracy of 0.705, relatively low compared to the other models but higher than the K-Nearest Neighbors model. Naive Bayes assumes feature independence given the class and makes assumptions about the feature distributions. While Naive Bayes models are computationally efficient, the independence assumption may not hold for certain datasets, which can affect accuracy.

K-Nearest Neighbors Model: The K-nearest neighbors (KNN) model achieves an accuracy of 0.634, relatively low compared to the other models. KNN is an instance-based method that relies on similarity between samples and may perform poorly on datasets with high noise. Additionally, the choice of K value can impact model performance.

Considering the accuracy metrics, the Gradient Boosting Trees model performs the best with the highest accuracy. It can handle complex nonlinear relationships and achieves the highest accuracy on the given dataset. Although training time may be longer and parameter tuning is required, it can be considered the best model choice.

In addition, an ideal ROC curve bows toward the upper left, meaning the model achieves a higher true positive rate and a lower false positive rate across thresholds. The closer the curve is to the upper-left corner, the better the model performs. The ROC curves likewise show that the Gradient Boosting Trees model performs best.

10.References

1.Quer, G., Arnaout, R., Henne, M., et al. (2021). Machine Learning and the Future of Cardiovascular Care. Journal of the American College of Cardiology, 77(3), 300-313. https://doi.org/10.1016/j.jacc.2020.11.030

2.Asif, M. A. A. R., Nishat, M. M., Faisal, F., Dip, R. R., Udoy, M. H., Shikder, M. F., & Ahsan, R. (2021). Performance Evaluation and Comparative Analysis of Different Machine Learning Algorithms in Predicting Cardiovascular Disease. Engineering Letters, 29(2).

3.Ghosh, P., et al. (2021). Efficient prediction of cardiovascular disease using machine learning algorithms with relief and LASSO feature selection techniques. IEEE Access, 9, 19304-19326. https://doi.org/10.1109/ACCESS.2021.3053759

4.Pattanayak, S., & Singh, T. (2022). Cardiovascular Disease Classification Based on Machine Learning Algorithms Using GridSearchCV, Cross Validation and Stacked Ensemble Methods. In T. Ören (Ed.), Advances in Computing and Data Sciences. ICACDS 2022 (Vol. 1613). Springer, Cham. https://doi.org/10.1007/978-3-031-12638-3_19

5.Swathy, M., & Saruladha, K. (2022). A comparative study of classification and prediction of Cardio-Vascular Diseases (CVD) using Machine Learning and Deep Learning techniques. ICT Express, 8(1), 109-116. https://doi.org/10.1016/j.icte.2021.08.021.

6.Kumar, M. R., A, D. A., Saran, T. M. G., Kumar, R. J. R., Subramanyam, D. V. S. S., & T, M. N. (2023). Machine Learning based Cardiac Disease Prediction- A Comparative Analysis. In 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS) (pp. 530-534). Coimbatore, India. https://doi.org/10.1109/ICACCS57279.2023.10112914.