Group 4
S2156317 Chai Kang Sheng
S2164443 SIDI LIU
S2136386 Naxin Dong
22069349 JIAJIA JIANG
S2175434 An Zhao
Cardiovascular disease (CVD) refers to a class of diseases that involve the heart or blood vessels. It is a major cause of mortality globally. According to the World Health Organization (WHO), cardiovascular diseases take the lives of 17.9 million people every year, accounting for 31% of all global deaths. Major types of CVD include coronary artery disease (heart attacks), cerebrovascular disease (stroke), and hypertension (high blood pressure), among others.
Predicting cardiovascular disease is a significant focus of medical research and healthcare delivery. The prediction of CVD is based on identifying various risk factors that are associated with heart disease and monitoring those factors in individuals over time. Risk factors include things like age, sex, family history of heart disease, smoking, physical inactivity, unhealthy diet, obesity, diabetes, hypertension, and high cholesterol levels.
Over time, the tools and methods for predicting cardiovascular disease have become more sophisticated. Health professionals now use a combination of medical tests (such as blood tests and physical examinations), personal history, and various predictive models to determine an individual’s risk of developing CVD. Some of these predictive models are based on complex statistical methods or machine learning algorithms that can take into account a large number of variables and their interactions.
In the era of big data and artificial intelligence, the ability to predict CVD has seen significant advancements. Machine learning and data mining techniques are being widely used in the healthcare sector, including for CVD prediction. These techniques can handle a large volume of data and can uncover hidden patterns that might not be evident using traditional statistical methods. This has led to the development of more accurate and personalized predictive models for CVD.
However, it’s important to note that while these predictive models can be powerful tools, they are not perfect and can’t guarantee an individual will or won’t develop CVD. They should be used as part of a broader approach to healthcare that includes regular check-ups, a healthy lifestyle, and appropriate medical interventions when necessary.
In recent years, there has been a significant increase in interest in the development of predictive models for cardiovascular disease. These models utilize advanced data analysis techniques, such as machine learning and artificial intelligence, to identify individuals at high risk of developing CVD and to provide personalized risk assessments. By leveraging large datasets and incorporating a wide range of factors such as age, gender, medical history, lifestyle habits, and biomarkers, these prediction models can generate valuable insights and assist in clinical decision-making.
The objective of this project is to create a robust and accurate predictive model for cardiovascular disease that can assist healthcare professionals in identifying individuals who are at high risk and provide timely interventions. By utilizing the power of machine learning algorithms, we aim to develop a comprehensive model that can enhance risk stratification and enable targeted preventive measures.
The specific objectives include:
-Finding a CVD dataset, which will then be standardized, cleaned and processed to ensure its quality for analysis.
-Identifying the most informative features from the collected data to build an optimized predictive model. This process will involve exploratory data analysis (EDA).
-Developing a machine learning algorithm to build a predictive model which will be trained on a labeled dataset.
Past studies on CVD point to a few problems. The first is that existing risk assessment methods for CVD lack accuracy and fail to adequately predict the occurrence of CVD in individuals. Traditional risk assessment tools rely on limited variables such as age, gender, and a few basic and common clinical measurements. This is a problem because these limited variables lead to underestimation or overestimation of risk in different populations, resulting in missed opportunities for early detection and timely treatment. Therefore, there is an urgent need to develop a predictive model for CVD that incorporates a more comprehensive set of variables, such as cholesterol levels, BMI, and glucose level.
The second problem is that the adoption of predictive models for CVD is often hindered by a lack of interpretability and explainability. Healthcare professionals are usually skeptical about relying on machine learning algorithms that provide accurate predictions but lack transparency in their decision-making process. To make informed clinical decisions, they need insight into the contributing factors and underlying mechanisms behind a model’s predictions. However, current predictive models for CVD often lack this interpretability, which prevents their widespread adoption and integration into routine practice. Therefore, there is a need to develop a predictive model for CVD that not only demonstrates high accuracy but also provides interpretable and explainable results, allowing healthcare professionals to understand and trust the model’s predictions and supporting more accurate clinical decision-making.
Dataset: Cardiovascular Disease dataset from the Kaggle website.
The dataset consists of 70,000 patient records.
The dataset includes 13 attributes, described as follows:
id : The row ID; not relevant for analysis.
age : The age of the person, in days.
gender : The gender of the person.
height : The height of the person, in cm.
weight : The weight of the person, in kg.
ap_hi : The systolic blood pressure, i.e. the pressure exerted when blood is ejected into the arteries. Normal value: 120 mmHg or below.
ap_lo : The diastolic blood pressure, i.e. the pressure in the arteries between heartbeats. Normal value: 80 mmHg or below.
cholesterol : The cholesterol level of the blood (cholesterol is a type of fat found in the blood). In adults, below 200 mg/dL is desired, with 200 to 239 mg/dL considered borderline high. In children, below 170 mg/dL is desired, with 170 to 199 mg/dL considered borderline high.
gluc : The glucose level. Normal values are less than 100 mg/dL after not eating (fasting) for at least 8 hours, and less than 140 mg/dL 2 hours after eating. For most people without diabetes, blood sugar before meals hovers around 70 to 80 mg/dL.
smoke : A binary value stating whether the person is a smoker, i.e. {0: ‘Not a smoker’, 1: ‘Smoker’}.
alco : A binary value stating whether the person is alcoholic, i.e. {0: ‘Not alcoholic’, 1: ‘Alcoholic’}.
active : A binary value stating whether the person is involved in physical activities, i.e. {0: ‘Not involved in physical activities’, 1: ‘Involved in physical activities’}.
cardio : The target value; a binary value stating whether the person has cardiovascular disease (CVD), i.e. {0: ‘Does not have CVD’, 1: ‘Has CVD’}.
In this section we explore the collected data using descriptive statistics to better understand it. Through this, we can detect anomalies and inconsistencies in the data and then preprocess it so that it is cleaner and easier to work with in the later analysis.
First step: import the libraries for data preparation and EDA.
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(tidyverse)
library(gridExtra)
library(corrplot)
Read the “cardio_train.csv” data and show the structure of the data.
df_cardio <- read.csv("cardio_train.csv",sep = ";")
str(df_cardio)
## 'data.frame': 70000 obs. of 13 variables:
## $ id : int 0 1 2 3 4 8 9 12 13 14 ...
## $ age : int 18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
## $ gender : int 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: int 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : int 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
From the structure of the data, we can see the data frame contains 70,000 observations of 13 variables. The id feature is not used in this case study, so it is removed.
# remove id column
df_cardio<-df_cardio[-c(1)]
head(df_cardio)
## age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 18393 2 168 62 110 80 1 1 0 0 1
## 2 20228 1 156 85 140 90 3 1 0 0 1
## 3 18857 1 165 64 130 70 3 1 0 0 0
## 4 17623 2 169 82 150 100 1 1 0 0 1
## 5 17474 1 156 56 100 60 1 1 0 0 0
## 6 21914 1 151 67 120 80 2 2 0 0 0
## cardio
## 1 0
## 2 1
## 3 1
## 4 1
## 5 0
## 6 0
After removing the id column, missing values are checked for and, if present, removed.
# check for missing values
sum(is.na(df_cardio))
## [1] 0
As none of the variables have missing values, there is no need to remove any rows. Before processing the data further, we also check for duplicated rows and remove them.
sum(duplicated(df_cardio))
## [1] 24
The check finds 24 duplicated rows, which we remove.
df_cardio<-subset(df_cardio,!duplicated(df_cardio))
The age column in the dataset is given in days, so we convert it to years for easier analysis.
#converting age from days to years
df_cardio$age <- as.numeric(round(df_cardio$age / 365))
Convert female “1” and male “2” to female “0” and male “1” to match the coding of the other binary features.
#converting female to 0 and male to 1
df_cardio$gender<-as.numeric(df_cardio$gender-1)
After adjusting those features, we add a new column to the dataset: BMI (Body Mass Index) = weight (kg) / height (m)².
# add new column for BMI
df_cardio$BMI<-round((df_cardio$weight/(df_cardio$height/100)^2),digits = 2)
Summarize the data to identify each feature’s outliers and remove extreme values.
summary(df_cardio)
## age gender height weight
## Min. :30.00 Min. :0.0000 Min. : 55.0 Min. : 10.00
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:159.0 1st Qu.: 65.00
## Median :54.00 Median :0.0000 Median :165.0 Median : 72.00
## Mean :53.34 Mean :0.3496 Mean :164.4 Mean : 74.21
## 3rd Qu.:58.00 3rd Qu.:1.0000 3rd Qu.:170.0 3rd Qu.: 82.00
## Max. :65.00 Max. :1.0000 Max. :250.0 Max. :200.00
## ap_hi ap_lo cholesterol gluc
## Min. : -150.0 Min. : -70.00 Min. :1.000 Min. :1.000
## 1st Qu.: 120.0 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 120.0 Median : 80.00 Median :1.000 Median :1.000
## Mean : 128.8 Mean : 96.64 Mean :1.367 Mean :1.227
## 3rd Qu.: 140.0 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :16020.0 Max. :11000.00 Max. :3.000 Max. :3.000
## smoke alco active cardio
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.08816 Mean :0.05379 Mean :0.8037 Mean :0.4998
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## BMI
## Min. : 3.47
## 1st Qu.: 23.88
## Median : 26.39
## Mean : 27.56
## 3rd Qu.: 30.22
## Max. :298.67
Remove heights below 120 cm or above 210 cm, and weights below 30 kg, which are implausible given the 30-65 year age range.
df_cardio<-df_cardio[!(df_cardio$height > 210|df_cardio$height < 120),]
df_cardio<-df_cardio[!(df_cardio$weight<30),]
Check the ap_hi and ap_lo values. ap_hi is the systolic blood pressure (normal value: 120 mmHg or below); ap_lo is the diastolic blood pressure (normal value: 80 mmHg or below).
# remove ap_hi is higher than 250 or lower than 90
df_cardio <- df_cardio[!(df_cardio$ap_hi > 250 | df_cardio$ap_hi < 90), ]
# remove ap_lo is higher than 160 or lower than 40
df_cardio <- df_cardio[!(df_cardio$ap_lo > 160 | df_cardio$ap_lo < 40), ]
# remove ap_lo is higher than 'ap_hi'
df_cardio <- df_cardio[!(df_cardio$ap_lo > df_cardio$ap_hi), ]
Check BMI: rows with abnormal BMI values are eliminated.
df_cardio <- df_cardio[!(df_cardio$BMI >150),]
After cleaning the data, we add a new column, “BMI_Group”: BMI below 18.5 is underweight; between 18.5 and 23.9 is the normal range; between 24 and 27.9 is the overweight range; 28 and beyond is obese.
df_cardio$BMI_Group<-cut(df_cardio$BMI,breaks =c(0,18.5,23.9,27.9,Inf),labels=c("Underweight","Normal","Overweight","Obese"))
Save the cleaned Cardiovascular Disease dataset to a new csv file.
df_cardio_clean<-df_cardio
write.csv(df_cardio_clean,file='df_cardio_clean.csv')
str(df_cardio_clean)
## 'data.frame': 68517 obs. of 14 variables:
## $ age : num 50 55 52 48 48 60 61 62 48 54 ...
## $ gender : num 1 0 0 1 0 0 0 1 0 0 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: int 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : int 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
## $ BMI : num 22 34.9 23.5 28.7 23 ...
## $ BMI_Group : Factor w/ 4 levels "Underweight",..: 2 4 2 4 2 4 4 4 4 3 ...
Exploratory data analysis is the process of analyzing data sets using statistical graphics and visualization tools to help us understand the main characteristics of the data set, the correlation of attributes, the distribution of data, and whether there are outliers in the data.
df<-read.csv("df_cardio_clean.csv")
set.seed(123)
df1<-df
head(df1,5)
## X age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 1 50 1 168 62 110 80 1 1 0 0 1
## 2 2 55 0 156 85 140 90 3 1 0 0 1
## 3 3 52 0 165 64 130 70 3 1 0 0 0
## 4 4 48 1 169 82 150 100 1 1 0 0 1
## 5 5 48 0 156 56 100 60 1 1 0 0 0
## cardio BMI BMI_Group
## 1 0 21.97 Normal
## 2 1 34.93 Obese
## 3 1 23.51 Normal
## 4 1 28.71 Obese
## 5 0 23.01 Normal
For the Cardiovascular Disease dataset, there are six interval variables: age, height, weight, systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), and BMI. We can view descriptive statistics of these variables, such as the mean, median, and quartiles.
From the statistics of the interval variables we can make some observations: the age of the respondents ranges from 30 to 65 years, with an average of 53.33 years, and the median BMI is 26.35, which is above the normal range. This means some respondents may be overweight or obese.
Interval_variables<-df %>% select(age, height, weight, ap_hi, ap_lo ,BMI)
summary(Interval_variables)
## age height weight ap_hi
## Min. :30.00 Min. :120.0 Min. : 30.00 Min. : 90.0
## 1st Qu.:48.00 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.:120.0
## Median :54.00 Median :165.0 Median : 72.00 Median :120.0
## Mean :53.33 Mean :164.4 Mean : 74.13 Mean :126.7
## 3rd Qu.:58.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.:140.0
## Max. :65.00 Max. :207.0 Max. :200.00 Max. :240.0
## ap_lo BMI
## Min. : 40.00 Min. : 10.73
## 1st Qu.: 80.00 1st Qu.: 23.88
## Median : 80.00 Median : 26.35
## Mean : 81.33 Mean : 27.46
## 3rd Qu.: 90.00 3rd Qu.: 30.12
## Max. :160.00 Max. :108.17
For the categorical variables (gender, cholesterol, glucose, smoking, alcohol intake, physical activity, and BMI_Group), we use frequency tables.
From the frequency tables we can make some observations: there are nearly twice as many women as men in the dataset (44,616 vs 23,901); most people’s cholesterol and glucose levels are normal; and most people do not smoke, do not drink alcohol, and are physically active. However, in BMI_Group the obese category is the largest, reflecting a certain degree of obesity in this group.
gender_labels <- c("women", "men")
cholesterol_labels <- c("normal", "above normal", "well above normal")
gluc_labels <- c("normal", "above normal", "well above normal")
smoke_labels <- c("no", "yes")
alco_labels <- c("no", "yes")
active_labels <- c("no", "yes")
df$gender <- factor(df$gender, levels = c(0, 1), labels = gender_labels)
df$cholesterol <- factor(df$cholesterol, levels = 1:3, labels = cholesterol_labels)
df$gluc <- factor(df$gluc, levels = 1:3, labels = gluc_labels)
df$smoke <- factor(df$smoke, levels = 0:1, labels = smoke_labels)
df$alco <- factor(df$alco, levels = 0:1, labels = alco_labels)
df$active <- factor(df$active, levels = 0:1, labels = active_labels)
Categorical_variables<-df %>% select(gender, cholesterol, gluc, smoke, alco, active, BMI_Group) %>% lapply(table) %>% print()
## $gender
##
## women men
## 44616 23901
##
## $cholesterol
##
## normal above normal well above normal
## 51373 9280 7864
##
## $gluc
##
## normal above normal well above normal
## 58247 5057 5213
##
## $smoke
##
## no yes
## 62491 6026
##
## $alco
##
## no yes
## 64862 3655
##
## $active
##
## no yes
## 13468 55049
##
## $BMI_Group
##
## Normal Obese Overweight Underweight
## 17455 26136 24295 631
Draw a pie chart showing the proportion of patients with and without cardiovascular disease.
It can be seen from the figure that 31,858 people (about 49%) do not have cardiovascular disease, while 32,978 people (about 51%) do. The distribution of this dataset is fairly even, with no significant imbalance. This balanced distribution helps in building predictive models and performing accurate prediction and analysis of cardiovascular disease.
df %>%
count(cardio) %>%
mutate(percentage = n / sum(n)) %>%
ggplot(aes(x = "", y = percentage, fill = factor(cardio, levels = c(1, 0)))) +
geom_bar(width = 1, stat = "identity") +
geom_text(aes(y = cumsum(percentage) - percentage / 2,
label = paste0(round(percentage * 100), "% (", n, ")")),
color = "black") +
coord_polar("y", start = 0) +
scale_fill_manual(values = c("khaki", "skyblue"),
name = "Cardio Disease",
labels = c("Yes", "No")) +
theme_void() +
ggtitle("Pie Chart of Patients with Cardiovascular Disease") +
theme(plot.title = element_text(size = 20))
Draw a bar chart of age against the target variable: group the ages into 5-year intervals and count the number of people with and without the disease in each age group.
It can be seen from the figure that in the 30-55 age range, more people are without the disease than with it; in the 55-65 range, more people have the disease than not. This suggests that the probability of disease may be related to age.
age_group <- cut(df1$age, breaks = seq(30, 65, by = 5), include.lowest = TRUE, right = FALSE)
ggplot(df1, aes(x=age_group, fill=factor(cardio))) +
geom_bar(position = "dodge") +
scale_fill_discrete(name="Cardio", labels=c("No", "Yes")) +
xlab("Age Group") +
ylab("Number of Patients") +
ggtitle("Bar Chart of Age Group by Cardiovascular Disease") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Draw the boxplots of Systolic blood pressure and Diastolic blood pressure respectively.
From the Systolic blood pressure boxplot, it’s evident that people with cardiovascular disease have a notably higher median systolic blood pressure compared to those without the disease. This implies a strong link between systolic blood pressure and the disease. Furthermore, systolic blood pressure values in those with the disease display a more dispersed pattern, which may reflect the severity of the disease and the diversity of control conditions.
By contrast, in the diastolic blood pressure boxplot, despite a comparable median between people with and without the disease, the values in the diseased group are more widely dispersed, with greater fluctuations. This possibly indicates instability in diastolic blood pressure control among people with cardiovascular disease.
# Remove rows whose value of `variable_name` lies more than 1.5 * IQR outside the quartiles
remove_outliers <- function(df, variable_name){
Q1 <- df %>% summarise(q = quantile(!!sym(variable_name), .25, na.rm = TRUE)) %>% pull(q)
Q3 <- df %>% summarise(q = quantile(!!sym(variable_name), .75, na.rm = TRUE)) %>% pull(q)
IQR <- Q3 - Q1
# keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df <- df %>% filter(!((!!sym(variable_name) < (Q1 - 1.5 * IQR)) | (!!sym(variable_name) > (Q3 + 1.5 * IQR))))
return(df)
}
df2 <- remove_outliers(df1, "ap_hi")
df2 <- remove_outliers(df2, "ap_lo")
df2 <- df2 %>% mutate(cardio = factor(cardio, levels = c(0, 1), labels = c("No", "Yes")))
p1 <- ggplot(df2, aes(x=cardio, y=ap_hi, fill=cardio)) +
geom_boxplot(outlier.shape = NA) +
labs(x="Cardio", y="Systolic blood pressure") +
theme_minimal() +
scale_fill_manual(values=c("skyblue", "lightpink"), name="Cardio", labels=c("No", "Yes"))
p2 <- ggplot(df2, aes(x=cardio, y=ap_lo, fill=cardio)) +
geom_boxplot(outlier.shape = NA) +
labs(x="Cardio", y="Diastolic blood pressure") +
theme_minimal() +
scale_fill_manual(values=c("skyblue", "lightpink"), name="Cardio", labels=c("No", "Yes"))
grid.arrange(p1, p2, ncol=2)
Draw the boxplots of height and weight respectively.
It can be seen from the figure that the median height of people with the disease is slightly lower than that of people without it, while their weight is higher. This indicates that there may be a certain correlation between body weight and cardiovascular disease.
df3 <- remove_outliers(df1, "height")
df3 <- remove_outliers(df3, "weight")
df3 <- df3 %>% mutate(cardio = factor(cardio, levels = c(0, 1), labels = c("No", "Yes")))
p1 <- ggplot(df3, aes(x=cardio, y=height, fill=cardio)) +
geom_boxplot(outlier.shape = NA) +
labs(x="Cardio", y="Height") +
theme_minimal() +
scale_fill_manual(values=c("#E69F00", "#56B4E9"), name="Cardio", labels=c("No", "Yes"))
p2 <- ggplot(df3, aes(x=cardio, y=weight, fill=cardio)) +
geom_boxplot(outlier.shape = NA) +
labs(x="Cardio", y="Weight") +
theme_minimal() +
scale_fill_manual(values=c("#E69F00", "#56B4E9"), name="Cardio", labels=c("No", "Yes"))
grid.arrange(p1, p2, ncol=2)
Plot a bar chart of Distribution of Cholesterol Levels among Cardiovascular Disease Categories.
It can be seen from the figure that at the normal cholesterol level, more people are without the disease than with it; conversely, at the above normal and well above normal levels, more people have the disease than not. This may indicate that higher cholesterol levels are associated with a greater risk of cardiovascular disease.
p1 <- ggplot(df1, aes(x = factor(cholesterol), fill = factor(cardio))) +
geom_bar(position = "stack") +
labs(title = "Distribution of Cholesterol Levels among Cardiovascular Disease Categories",
x = "Cholesterol",
y = "Count") +
theme_minimal() +
scale_x_discrete(labels = c("Normal", "Above Normal", "Well Above Normal")) +
scale_fill_discrete(name = "Cardio", labels = c("No", "Yes")) +
geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5))
p1
Plot a bar chart of Distribution of Glucose Levels among Cardiovascular Disease Categories.
It can be seen from the figure that at the normal glucose level, more people are without the disease than with it; at the above normal and well above normal levels, more people have the disease than not. This could suggest a link between glucose levels and cardiovascular disease.
p2 <- ggplot(df1, aes(x = factor(gluc), fill = factor(cardio))) +
geom_bar(position = "stack") +
labs(title = "Distribution of Glucose Levels among Cardiovascular Disease Categories",
x = "Glucose",
y = "Count") +
theme_minimal() +
scale_x_discrete(labels = c("Normal", "Above Normal", "Well Above Normal")) +
scale_fill_manual(values = c("peachpuff", "lightcoral"), name = "Cardio", labels = c("No", "Yes")) +
geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5))
p2
Plot a bar chart of Distribution of BMI Groups among Cardiovascular Disease Categories.
It can be seen from the figure that as the BMI level increases, the proportion of people with cardiovascular disease also increases; in particular, in the obese group the number of people with cardiovascular disease exceeds the number without it. This shows that obesity has a considerable impact on cardiovascular disease.
df1$BMI_Group <- factor(df1$BMI_Group, levels = c("Underweight", "Normal", "Overweight", "Obese"))
ggplot(df1, aes(x = BMI_Group, fill = factor(cardio))) +
geom_bar(position = "stack") +
geom_text(stat='count', aes(label=after_stat(count)), position=position_stack(vjust=0.5), size = 3) +
labs(title = "Distribution of BMI Groups among Cardiovascular Disease Categories",
x = "BMI Group",
y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("lavender", "thistle"), name = "Cardio", labels = c("No", "Yes"))
Plot the correlation matrix. From the correlation matrix, the relationships among the variables and with the target variable can be observed.
Diastolic and systolic blood pressure show a strong positive correlation, with a correlation coefficient of 0.73. This may be because both variables are driven by blood flow and the heart’s pumping: when systolic blood pressure increases, diastolic blood pressure tends to increase as well. The attribute most strongly correlated with the target variable is systolic blood pressure, which means systolic blood pressure may be an important indicator for predicting the occurrence of cardiovascular disease. Further analysis and modeling can explore the relationships between the other variables and the target and identify the most predictive factors.
cor_matrix <- cor(df1[c("age", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "cardio", "BMI")])
my_palette <- colorRampPalette(c("blue", "white", "red"))(100)
corrplot(cor_matrix, method = "color",
tl.col = "black",
col = my_palette,
addCoef.col = "black")
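As a quick follow-up, the correlations with the target can also be ranked directly from the matrix computed above; a minimal sketch (illustrative only, not part of the reported analysis):
# rank the attributes by the absolute value of their correlation with cardio
cor_with_target <- cor_matrix[, "cardio"]
sort(abs(cor_with_target[names(cor_with_target) != "cardio"]), decreasing = TRUE)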
Import the libraries for modeling, read the data, and split it into training and testing sets.
library(klaR)
library(class)
library(gbm)
library(e1071)
library(ROCR)
library(glmnet)
library(rpart)
library(rpart.plot)
library(caret)
library(pROC)
Read the data and set the random seed. Then create indices for the training and test sets and split the data accordingly. After that, create an empty data frame to store the performance metrics of each model, an empty list to store the confusion matrices, and an empty list to store the ROC objects.
df <- read.csv("df_cardio_clean.csv")
df$BMI_Group <- as.factor(df$BMI_Group)
set.seed(123)
index <- createDataPartition(df$cardio, p = 0.7, list = FALSE)
train_data <- df[index,]
test_data <- df[-index,]
performance_metrics <- data.frame()
confusion_matrices <- list()
roc_objects <- list()
Decision Tree is a tree-like classification model that partitions the data into different categories by sequentially splitting on features. The parameters of the decision tree model include the method of tree construction (here the “anova” method, which treats the 0/1 target as a regression problem, so the predicted scores below are thresholded at 0.5) and the maximum depth of the tree.
model <- rpart(cardio ~ ., data = train_data, method = "anova")
predictions <- predict(model, test_data)
binary_predictions <- ifelse(predictions > 0.5, 1, 0)
accuracy <- sum(binary_predictions == test_data$cardio) / length(test_data$cardio)
precision <- sum(binary_predictions == 1 & test_data$cardio == 1) / sum(binary_predictions == 1)
recall <- sum(binary_predictions == 1 & test_data$cardio == 1) / sum(test_data$cardio == 1)
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
Model = "Decision Tree",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
))
confusion_matrices[["Decision Tree"]] <- table(binary_predictions, test_data$cardio)
roc_objects[["Decision Tree"]] <- roc_obj
Logistic Regression is a widely used linear model for classification problems. It maps the linear combination of input features to probabilities between 0 and 1 and performs classification based on the probabilities. The parameters of the Logistic Regression model include regularization methods, regularization strength, and others.
model <- glm(cardio ~ ., data = train_data, family = binomial)
predictions <- predict(model, newdata = test_data, type = "response")
predicted_labels <- ifelse(predictions >= 0.5, 1, 0)
accuracy <- sum(predicted_labels == test_data$cardio) / nrow(test_data)
confusion_matrix <- table(predicted_labels, test_data$cardio)
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2,])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
Model = "Logistic Regression",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
))
confusion_matrices[["Logistic Regression"]] <- confusion_matrix
roc_objects[["Logistic Regression"]] <- roc_obj
Gradient Boosting Trees is an ensemble learning method that iteratively trains multiple decision trees to progressively improve the model’s performance. It optimizes the gradients of the loss function to train each individual tree and then combines them for predictions. The parameters of the Gradient Boosting Trees model include the number of trees, tree depth, learning rate, and others.
gbm_model <- gbm(cardio ~ ., data = train_data, distribution = "bernoulli", n.trees = 100, interaction.depth = 3)
gbm_pred <- predict(gbm_model, newdata = test_data, n.trees = 100, type = "response")
confusion_matrix <- table(gbm_pred > 0.5, test_data$cardio)
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2]/sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2]/sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, gbm_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
Model = "Gradient Boosting Trees",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
))
confusion_matrices[["Gradient Boosting Trees"]] <- confusion_matrix
roc_objects[["Gradient Boosting Trees"]] <- roc_obj
Naive Bayes is a probabilistic model based on Bayes’ theorem and the assumption that the features are independent given the class; classification is performed using the conditional probabilities of the features. The parameters of the Naive Bayes model typically involve assumptions about the feature distributions.
train_data$cardio <- as.factor(train_data$cardio)
test_data$cardio <- as.factor(test_data$cardio)
naive_model <- NaiveBayes(cardio ~ ., data = train_data)
# Use the trained model to make predictions
naive_predictions <- predict(naive_model, newdata = test_data)$class
# Compute accuracy, confusion matrix, precision, recall, and F1 score
accuracy <- sum(naive_predictions == test_data$cardio) / length(test_data$cardio)
confusion_matrix <- table(naive_predictions, test_data$cardio)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, as.numeric(naive_predictions))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
Model = "Naive Bayes",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
))
confusion_matrices[["Naive Bayes"]] <- confusion_matrix
roc_objects[["Naive Bayes"]] <- roc_obj
K-Nearest Neighbors is an instance-based learning method that classifies based on the similarity between samples. Given a new sample, the K-Nearest Neighbors model identifies the K nearest samples in the training set and determines the sample’s class based on majority voting among its neighbors. The parameter of the K-Nearest Neighbors model is K, the number of nearest neighbors to consider.
# use only the predictor columns: exclude the target (cardio) and the non-numeric BMI_Group
predictor_cols <- setdiff(names(train_data), c("cardio", "BMI_Group"))
knn_predictions <- knn(train = train_data[predictor_cols], test = test_data[predictor_cols], cl = train_data$cardio, k = 3)
# Compute accuracy, confusion matrix, precision, recall, and F1 score
accuracy <- sum(knn_predictions == test_data$cardio) / length(test_data$cardio)
confusion_matrix <- table(knn_predictions, test_data$cardio)
# rows of the confusion matrix are predictions, columns are the actual labels
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
f1_score <- 2 * precision * recall / (precision + recall)
roc_obj <- roc(test_data$cardio, as.numeric(knn_predictions))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
performance_metrics <- rbind(performance_metrics, data.frame(
Model = "K-Nearest Neighbors",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score
))
confusion_matrices[["K-Nearest Neighbors"]] <- confusion_matrix
roc_objects[["K-Nearest Neighbors"]] <- roc_obj
After running all the models, print the consolidated performance metrics
print(performance_metrics)
## Model Accuracy Precision Recall F1_Score
## 1 Decision Tree 0.7150572 0.7567600 0.6232141 0.6835252
## 2 Logistic Regression 0.7270251 0.7478698 0.6745492 0.7093198
## 3 Gradient Boosting Trees 0.7346631 0.7452722 0.7028279 0.7234280
## 4 Naive Bayes 0.7053272 0.7405927 0.6205537 0.6752801
## 5 K-Nearest Neighbors 0.6349307 0.6385833 0.6004532 0.6189315
Print the confusion matrix for each model
for (model in names(confusion_matrices)) {
print(paste("Confusion matrix for", model))
print(confusion_matrices[[model]])
}
## [1] "Confusion matrix for Decision Tree"
##
## binary_predictions 0 1
## 0 8373 3824
## 1 2033 6325
## [1] "Confusion matrix for Logistic Regression"
##
## predicted_labels 0 1
## 0 8098 3303
## 1 2308 6846
## [1] "Confusion matrix for Gradient Boosting Trees"
##
## 0 1
## FALSE 7968 3016
## TRUE 2438 7133
## [1] "Confusion matrix for Naive Bayes"
##
## naive_predictions 0 1
## 0 8200 3851
## 1 2206 6298
## [1] "Confusion matrix for K-Nearest Neighbors"
##
## knn_predictions 0 1
## 0 6957 4055
## 1 3449 6094
Plot the ROC curves for all models on the same graph
colors <- c("red", "blue", "green", "purple", "orange")
plot(roc_objects[[1]], main = "ROC Curves", col = colors[1])
for (i in 2:length(roc_objects)) {
lines(roc_objects[[i]], col = colors[i])
}
legend("bottomright", legend = names(roc_objects), col = colors, lwd = 2)
Based on the accuracy metrics, we can analyze the performance of each model and select the best model. Here is a detailed analysis of the accuracy for each model:
Decision Tree Model: The decision tree model achieves an accuracy of 0.715, which is relatively high. Decision trees have the advantage of being interpretable, capturing complex relationships in the data, and requiring minimal preprocessing of the features. However, decision trees are prone to overfitting and may perform poorly on complex datasets.
Logistic Regression Model: The logistic regression model achieves an accuracy of 0.727, slightly higher than the decision tree model. Logistic regression is a simple yet powerful linear classifier that is widely applicable to classification problems. It can handle large-scale datasets and offers good interpretability. However, logistic regression may have limited modeling capabilities for nonlinear relationships.
Gradient Boosting Trees Model: The gradient boosting trees model achieves an accuracy of 0.734, slightly higher than the logistic regression model. Gradient boosting trees are powerful ensemble learning methods that can handle complex nonlinear relationships. They iteratively train multiple decision trees and combine their predictions to improve performance. However, training gradient boosting trees models can be time-consuming, and tuning multiple parameters is necessary to achieve optimal performance.
Naive Bayes Model: The Naive Bayes model achieves an accuracy of 0.705, lower than the other models apart from the K-Nearest Neighbors model. Naive Bayes models assume feature independence and make assumptions about feature distributions. While they are computationally efficient, the independence assumption may not hold for certain datasets, which can affect accuracy.
K-Nearest Neighbors Model: The K-nearest neighbors (KNN) model achieves an accuracy of 0.634, relatively low compared to the other models. KNN is an instance-based method that relies on similarity between samples and may perform poorly on datasets with high noise. Additionally, the choice of K value can impact model performance.
Considering the accuracy metrics, the Gradient Boosting Trees model performs the best with the highest accuracy. It can handle complex nonlinear relationships and achieves the highest accuracy on the given dataset. Although training time may be longer and parameter tuning is required, it can be considered the best model choice.
In addition, an ideal ROC curve should be convex to the upper left, implying that the model achieves a higher True Positive Rate and a lower False Positive Rate at different thresholds. The closer the curve is to the upper left corner, the better the model performance is. It can be seen that the Gradient Boosting Trees model performs the best.
1. Quer, G., Arnaout, R., Henne, M., et al. (2021). Machine Learning and the Future of Cardiovascular Care. Journal of the American College of Cardiology, 77(3), 300-313. https://doi.org/10.1016/j.jacc.2020.11.030
2. Asif, M. A. A. R., Nishat, M. M., Faisal, F., Dip, R. R., Udoy, M. H., Shikder, M. F., & Ahsan, R. (2021). Performance Evaluation and Comparative Analysis of Different Machine Learning Algorithms in Predicting Cardiovascular Disease. Engineering Letters, 29(2).
3. Ghosh, P., et al. (2021). Efficient prediction of cardiovascular disease using machine learning algorithms with relief and LASSO feature selection techniques. IEEE Access, 9, 19304-19326. https://doi.org/10.1109/ACCESS.2021.3053759
4. Pattanayak, S., & Singh, T. (2022). Cardiovascular Disease Classification Based on Machine Learning Algorithms Using GridSearchCV, Cross Validation and Stacked Ensemble Methods. In T. Ören (Ed.), Advances in Computing and Data Sciences. ICACDS 2022 (Vol. 1613). Springer, Cham. https://doi.org/10.1007/978-3-031-12638-3_19
5. Swathy, M., & Saruladha, K. (2022). A comparative study of classification and prediction of Cardio-Vascular Diseases (CVD) using Machine Learning and Deep Learning techniques. ICT Express, 8(1), 109-116. https://doi.org/10.1016/j.icte.2021.08.021
6. Kumar, M. R., A, D. A., Saran, T. M. G., Kumar, R. J. R., Subramanyam, D. V. S. S., & T, M. N. (2023). Machine Learning based Cardiac Disease Prediction: A Comparative Analysis. In 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS) (pp. 530-534). Coimbatore, India. https://doi.org/10.1109/ICACCS57279.2023.10112914