options(knitr.duplicate.label = 'allow')

Title: Predicting Heart Disease

Introduction

Cardiovascular diseases (CVDs) are the leading cause of death worldwide. Cardiovascular disease is a group of illnesses that affect the heart or blood vessels; CVDs include heart disease, stroke, hypertension, coronary artery disease (CAD), and others. According to the World Health Organization (WHO), an estimated 17.9 million deaths worldwide in 2019 were attributable to CVDs, or 32% of all fatalities. Ischaemic heart disease is the world's leading cause of mortality, accounting for 16% of all fatalities (The Top 10 Causes of Death, 2020). In Malaysia, ischaemic heart disease remained the biggest cause of death, accounting for 17.0% of the 109,155 deaths that were medically certified in 2020 (Cardiac Vascular Sentral, 2022).

For healthcare professionals, predicting and detecting heart disease has always been a crucial and difficult task. To treat cardiac disorders, hospitals and other institutions provide expensive therapies and procedures, so being able to detect cardiac disease in its early stages helps individuals all over the world take the required precautions before it becomes severe. Even though there are several forms of coronary heart disease, the majority of people first discover they have the condition only after experiencing symptoms such as chest discomfort, a heart attack, or sudden cardiac arrest. This emphasizes the significance of prevention approaches and diagnostic procedures that can reliably forecast population-wide cardiac disease before adverse outcomes like myocardial infarctions (heart attacks) occur. Machine learning has demonstrated successful outcomes in forming judgements and predictions from the vast amount of data generated by the healthcare sector over time. The dataset used here comprises 253,680 cleaned survey responses from the 2015 BRFSS, each described by 21 attributes that are either binary or ordinal plus 1 binary target attribute. 229,787 respondents do not have/have never had heart disease, whereas 23,893 have.

Problem Statement

Detecting a cardiac problem is the biggest challenge. Although there is technology that can predict heart disease, it is either costly or inefficient at estimating the likelihood of heart disease in humans. Early detection of heart disease can lower the mortality rate and the overall consequences. It is not always possible to monitor patients properly every day, and a doctor cannot consult with a patient around the clock, as that would require more intellect, time, and knowledge than is available. Modern data availability allows us to search for hidden patterns using a range of machine learning techniques, and the hidden patterns in medical data can be used to make health diagnoses.

Objective

  1. To understand the factors that cause heart attacks.
  2. To build a predictive model that can help predict heart attacks.

1 About Dataset

1.1 Context

Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year — making it the leading cause of death. The buildup of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease.

While there are different types of coronary heart disease, the majority of individuals only learn they have the disease following symptoms such as chest pain, a heart attack, or sudden cardiac arrest. This fact highlights the importance of preventative measures and tests that can accurately predict heart disease in the population prior to negative outcomes like myocardial infarctions (heart attacks) taking place.

The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.

1.2 Content

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is collected annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, I downloaded a csv of the dataset available on Kaggle for the year 2015. This original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.

This dataset contains 253,680 survey responses from the cleaned BRFSS 2015, to be used primarily for the binary classification of heart disease. Note that there is a strong class imbalance in this dataset: 229,787 respondents do not have/have not had heart disease, while 23,893 have had heart disease.

The dataset contains the variables explained below: 1 binary target variable (HeartDiseaseorAttack) and 21 feature variables that are either binary or ordinal.

Columns

  • HighBP: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional

  • HighChol: Have you EVER been told by a doctor, nurse, or other health professional that your blood cholesterol is high?

  • CholCheck: Cholesterol check within the past five years.

  • BMI: Body Mass Index (BMI)

  • Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

  • Stroke: (Ever told) you had a stroke.

  • Diabetes: 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes.

  • PhysActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.

  • Fruits: Consume fruit 1 or more times per day

  • Veggies: Consume vegetables 1 or more times per day

  • HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)

  • AnyHealthcare: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?

  • NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?

  • GenHlth: Would you say that in general your health is:

  • MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?

  • PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?

  • DiffWalk: Do you have serious difficulty walking or climbing stairs?

  • Sex: Indicate sex.

  • Age: Thirteen-level age category

  • Education: What is the highest grade or year of school you completed?

  • Income: Is your annual household income from all sources: (If the respondent refuses at any income level, code "Refused.")

2 Data preparation

2.1 Import necessary libraries and dataset.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(tidyr)
library(ggplot2)
setwd("~/作业/wqd7004/groupwork")
alldata = read.csv("data3.csv")

2.2 Data Overview

data1=alldata
attach(data1) 
opar<-par(no.readonly=TRUE) 
par(mfrow=c(3,3)) 
hist(HeartDiseaseorAttack,main="HeartDiseaseorAttack")
hist(HighBP,main="HighBP")
hist(HighChol,main="HighChol")
hist(CholCheck,main="CholCheck")
hist(BMI,main="BMI")
hist(Smoker,main="Smoker")
hist(Stroke,main="Stroke")
hist(Diabetes,main="Diabetes")
hist(PhysActivity,main="PhysActivity")

opar<-par(no.readonly=TRUE) 
par(mfrow=c(3,3))
hist(Fruits,main="Fruits")
hist(Veggies,main="Veggies")
hist(HvyAlcoholConsump,main="HvyAlcoholConsump")
hist(AnyHealthcare,main="AnyHealthcare")
hist(NoDocbcCost,main="NoDocbcCost")
hist(GenHlth,main="GenHlth")
hist(MentHlth,main="MentHlth")
hist(PhysHlth,main="PhysHlth")
hist(DiffWalk,main="DiffWalk")

opar<-par(no.readonly=TRUE) 
par(mfrow=c(2,2))  
hist(Sex,main="Sex")
hist(Age,main="Age")
hist(Education,main="Education")
hist(Income,main="Income")
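
The same overview can also be produced as a single faceted ggplot, which avoids the three separate par(mfrow) blocks above. This is a sketch using the dplyr, tidyr, and ggplot2 packages already loaded:

# Faceted histogram overview of all 22 variables (equivalent to the base-R
# panels above); pivot_longer stacks the columns into (variable, value) pairs
alldata %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")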

2.3 Data Description

Contains 22 variables (before the cleaning process) and 100,638 rows (see the dim() output below). Consists of one data type: numerical.

str(alldata)
## 'data.frame':    100638 obs. of  22 variables:
##  $ HeartDiseaseorAttack: int  0 0 0 0 0 0 0 0 1 0 ...
##  $ HighBP              : int  1 0 1 1 1 1 1 1 1 0 ...
##  $ HighChol            : int  1 0 1 0 1 1 0 1 1 0 ...
##  $ CholCheck           : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : int  40 25 28 27 24 25 30 25 30 24 ...
##  $ Smoker              : int  1 1 0 0 0 1 1 1 1 0 ...
##  $ Stroke              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diabetes            : int  0 0 0 0 0 0 0 0 2 0 ...
##  $ PhysActivity        : int  0 1 0 1 1 1 0 1 0 0 ...
##  $ Fruits              : int  0 0 1 1 1 1 0 0 1 0 ...
##  $ Veggies             : int  1 0 0 1 1 1 0 1 1 1 ...
##  $ HvyAlcoholConsump   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyHealthcare       : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ NoDocbcCost         : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ GenHlth             : int  5 3 5 2 2 2 3 3 5 2 ...
##  $ MentHlth            : int  18 0 30 0 3 0 0 0 30 0 ...
##  $ PhysHlth            : int  15 0 30 0 0 2 14 0 30 0 ...
##  $ DiffWalk            : int  1 0 1 0 0 0 0 1 1 0 ...
##  $ Sex                 : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ Age                 : int  9 7 9 11 11 10 9 11 9 8 ...
##  $ Education           : int  4 6 4 3 5 6 6 4 5 4 ...
##  $ Income              : int  3 1 8 6 4 8 7 4 1 3 ...
summary(alldata)
##  HeartDiseaseorAttack     HighBP          HighChol        CholCheck     
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000       1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000       Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.2374       Mean   :0.4812   Mean   :0.4724   Mean   :0.9673  
##  3rd Qu.:0.0000       3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke           Diabetes     
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :27.00   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :28.62   Mean   :0.4702   Mean   :0.06043   Mean   :0.3589  
##  3rd Qu.:31.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :2.0000  
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.7433   Mean   :0.6276   Mean   :0.8048   Mean   :0.05312  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9514   Mean   :0.08857   Mean   :2.648   Mean   : 3.392  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 6.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 5.026   Mean   :0.2077   Mean   :0.4627   Mean   : 8.393  
##  3rd Qu.: 4.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:5.000  
##  Median :5.000   Median :7.000  
##  Mean   :5.017   Mean   :5.937  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
head(alldata)
##   HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke Diabetes
## 1                    0      1        1         1  40      1      0        0
## 2                    0      0        0         0  25      1      0        0
## 3                    0      1        1         1  28      0      0        0
## 4                    0      1        0         1  27      0      0        0
## 5                    0      1        1         1  24      0      0        0
## 6                    0      1        1         1  25      1      0        0
##   PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1            0      0       1                 0             1           0
## 2            1      0       0                 0             0           1
## 3            0      1       0                 0             1           1
## 4            1      1       1                 0             1           0
## 5            1      1       1                 0             1           0
## 6            1      1       1                 0             1           0
##   GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1       5       18       15        1   0   9         4      3
## 2       3        0        0        0   0   7         6      1
## 3       5       30       30        1   0   9         4      8
## 4       2        0        0        0   0  11         3      6
## 5       2        3        0        0   0  11         5      4
## 6       2        0        2        0   1  10         6      8
ls(alldata)
##  [1] "Age"                  "AnyHealthcare"        "BMI"                 
##  [4] "CholCheck"            "Diabetes"             "DiffWalk"            
##  [7] "Education"            "Fruits"               "GenHlth"             
## [10] "HeartDiseaseorAttack" "HighBP"               "HighChol"            
## [13] "HvyAlcoholConsump"    "Income"               "MentHlth"            
## [16] "NoDocbcCost"          "PhysActivity"         "PhysHlth"            
## [19] "Sex"                  "Smoker"               "Stroke"              
## [22] "Veggies"
dim(alldata)
## [1] 100638     22

2.4 Missing Value

2.4.1 Check for missing values

There are no missing values, so operations related to missing-value handling are not required.

library(mice)
md.pattern(alldata)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##        HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke
## 100638                    1      1        1         1   1      1      1
##                           0      0        0         0   0      0      0
##        Diabetes PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare
## 100638        1            1      1       1                 1             1
##               0            0      0       0                 0             0
##        NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 100638           1       1        1        1        1   1   1         1      1
##                  0       0        0        0        0   0   0         0      0
##         
## 100638 0
##        0
sum(is.na(alldata))
## [1] 0

2.5 Outlier detection

2.5.1 Visualize graph distribution

First, we inspect the distributions visually: we select the numeric variables and use boxplots to view their distributions.

# Extract numeric fields
library(ggplot2)
nums <- unlist(lapply(alldata, is.numeric))
nums_data <- alldata[, nums]

# Reshape to long format: one row per (row id, variable, value)
nums_data.new <- nums_data %>%
    as.data.frame %>%
    mutate(Cell = rownames(.)) %>%
    gather(., key = colname, value = "value", -Cell)

# Draw one boxplot per variable with ggplot
ggplot(data = nums_data.new, aes(x = colname, y = value)) +
    geom_boxplot(aes(1)) +
    facet_wrap(~colname, scales = "free") +
    labs(title = "Outlier Detection On Numeric Data By Boxplot",
         x = "Numeric Columns", y = "") +
    theme_bw()

2.5.2 Z-score

# Define a function to identify outliers: x is the input data (matrix or data
# frame), zs is the critical value; |z-score| > zs is flagged as an outlier
outliers = function(x, zs) {
    temp <- abs(apply(x, 1, scale))
    return(x[temp > zs])
}
# Print values with |z-score| > 5
outliers(nums_data, 5)
## integer(0)
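
Note that apply(x, 1, scale) standardizes each respondent's row across the 22 variables rather than each variable across respondents; with only 22 values per row, |z| can never exceed sqrt(21) ≈ 4.58, so a cutoff of 5 necessarily returns integer(0). A column-wise variant (a sketch, not part of the results above) would instead flag extreme values of each variable:

# Column-wise z-scores: standardize each variable across respondents and
# report (row, column) positions where |z| exceeds the cutoff
outliers_by_column <- function(x, zs) {
    z <- abs(scale(x))              # scale() operates column-wise
    which(z > zs, arr.ind = TRUE)   # row/col indices of flagged values
}
head(outliers_by_column(nums_data, 5))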

2.6 Duplicate value handling

alldata_1<- unique(alldata)

Duplicate-value detection is performed, but in this dataset it is normal for duplicate rows to appear (many respondents give identical answers), so after detection the duplicates are not deleted and the original dataset is still used.
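
A quick count makes this concrete: the difference between the original and deduplicated row counts is the number of repeated responses (a small sketch using the objects above):

# Number of duplicated rows (kept in the analysis, as explained above)
sum(duplicated(alldata))
nrow(alldata) - nrow(alldata_1)  # same count, via the unique() result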

3 Exploratory Data Analysis

3.1 Descriptive analysis

First, in this section we attempt the most preliminary descriptive statistics; the results obtained are shown below.


library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(alldata)
##                      vars      n  mean   sd median trimmed  mad min max range
## HeartDiseaseorAttack    1 100638  0.24 0.43      0    0.17 0.00   0   1     1
## HighBP                  2 100638  0.48 0.50      0    0.48 0.00   0   1     1
## HighChol                3 100638  0.47 0.50      0    0.47 0.00   0   1     1
## CholCheck               4 100638  0.97 0.18      1    1.00 0.00   0   1     1
## BMI                     5 100638 28.62 7.23     27   27.79 5.93  12  98    86
## Smoker                  6 100638  0.47 0.50      0    0.46 0.00   0   1     1
## Stroke                  7 100638  0.06 0.24      0    0.00 0.00   0   1     1
## Diabetes                8 100638  0.36 0.75      0    0.20 0.00   0   2     2
## PhysActivity            9 100638  0.74 0.44      1    0.80 0.00   0   1     1
## Fruits                 10 100638  0.63 0.48      1    0.66 0.00   0   1     1
## Veggies                11 100638  0.80 0.40      1    0.88 0.00   0   1     1
## HvyAlcoholConsump      12 100638  0.05 0.22      0    0.00 0.00   0   1     1
## AnyHealthcare          13 100638  0.95 0.21      1    1.00 0.00   0   1     1
## NoDocbcCost            14 100638  0.09 0.28      0    0.00 0.00   0   1     1
## GenHlth                15 100638  2.65 1.12      3    2.60 1.48   1   5     4
## MentHlth               16 100638  3.39 7.68      0    1.18 0.00   0  30    30
## PhysHlth               17 100638  5.03 9.45      0    2.55 0.00   0  30    30
## DiffWalk               18 100638  0.21 0.41      0    0.13 0.00   0   1     1
## Sex                    19 100638  0.46 0.50      0    0.45 0.00   0   1     1
## Age                    20 100638  8.39 3.05      9    8.57 2.97   1  13    12
## Education              21 100638  5.02 1.01      5    5.13 1.48   1   6     5
## Income                 22 100638  5.94 2.11      7    6.22 1.48   1   8     7
##                       skew kurtosis   se
## HeartDiseaseorAttack  1.23    -0.48 0.00
## HighBP                0.08    -1.99 0.00
## HighChol              0.11    -1.99 0.00
## CholCheck            -5.26    25.66 0.00
## BMI                   2.70    15.20 0.02
## Smoker                0.12    -1.99 0.00
## Stroke                3.69    11.61 0.00
## Diabetes              1.67     0.85 0.00
## PhysActivity         -1.11    -0.76 0.00
## Fruits               -0.53    -1.72 0.00
## Veggies              -1.54     0.37 0.00
## HvyAlcoholConsump     3.99    13.88 0.00
## AnyHealthcare        -4.20    15.64 0.00
## NoDocbcCost           2.90     6.39 0.00
## GenHlth               0.33    -0.57 0.00
## MentHlth              2.59     5.66 0.02
## PhysHlth              1.91     2.14 0.03
## DiffWalk              1.44     0.08 0.00
## Sex                   0.15    -1.98 0.00
## Age                  -0.47    -0.47 0.01
## Education            -0.79     0.06 0.00
## Income               -0.80    -0.48 0.01

From these statistics we can see a lot of basic information about the different attributes; next we look more closely into the dataset.

3.2 Some simple analysis and visualisation

  1. Firstly, we look into some continuous data, such as BMI.
# BMI
# Plot the distribution of BMI, then divide it into 6 groups:
# underweight (BMI < 18.5), normal (18.5 <= BMI < 25), overweight (25 <= BMI < 30),
# mildly obese (30 <= BMI < 35), moderately obese (35 <= BMI < 40),
# severely obese (BMI >= 40)
library(ggplot2)
alldata.BMI <- alldata[["BMI"]]
data <- subset(alldata, !is.na(BMI))
#data$BMI
#data$Stroke
data$category <- apply(data, 1, function(row) {
  if (row["BMI"] < 18.5) {
    return("Underweight")
  } else if (row["BMI"] >= 18.5 && row["BMI"] < 25) {
    return("Normal")
  } else if (row["BMI"] >= 25 && row["BMI"] < 30) {
    return("Overweight")
  } else if (row["BMI"] >= 30 && row["BMI"] < 35) {
    return("Mildly obese")
  } else if (row["BMI"] >= 35 && row["BMI"] < 40) {
    return("Moderately obese")
  }else {
    return("Severely obese")
  }
})
#data$category
# Calculate the count of strokes for each category
stroke_counts <- table(data$category, data$Stroke)
# Convert counts to proportions
stroke_proportions <- prop.table(stroke_counts, margin = 1)
# Print the resulting table
stroke_proportions
##                   
##                             0          1
##   Mildly obese     0.93547934 0.06452066
##   Moderately obese 0.92803527 0.07196473
##   Normal           0.94903936 0.05096064
##   Overweight       0.94109332 0.05890668
##   Severely obese   0.92705526 0.07294474
##   Underweight      0.90104167 0.09895833
ggplot(data = NULL ,aes(x = alldata.BMI))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# After that, we look at the relationship between BMI category and the
# probability of having heart disease

heart_disease_BMI <- data %>% group_by(category)%>%summarise(HeartDiseaseorAttack = sum(HeartDiseaseorAttack)/n())
heart_disease_BMI
## # A tibble: 6 × 2
##   category         HeartDiseaseorAttack
##   <chr>                           <dbl>
## 1 Mildly obese                    0.278
## 2 Moderately obese                0.299
## 3 Normal                          0.176
## 4 Overweight                      0.237
## 5 Severely obese                  0.284
## 6 Underweight                     0.246
ggplot(heart_disease_BMI,aes(x = category, y = HeartDiseaseorAttack,fill = category))+
  geom_col()

# In this bar chart, we can clearly see that people with a normal BMI are less
# likely to have heart disease
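
As an aside, the row-wise apply() used above to bin BMI can be replaced by a vectorized cut() with the same breakpoints. In this sketch, category2 is a hypothetical extra column used only for comparison; right = FALSE reproduces the [a, b) intervals of the if/else chain:

# Vectorized equivalent of the apply() categorization above; identical values,
# returned as a factor rather than a character vector
data$category2 <- cut(data$BMI,
                      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
                      labels = c("Underweight", "Normal", "Overweight",
                                 "Mildly obese", "Moderately obese", "Severely obese"),
                      right = FALSE)
all(as.character(data$category2) == data$category)  # should be TRUE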

  2. Secondly, we look into some categorical data, such as HighBP and Sex.

library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
heart_disease_HighBP <- data %>% group_by(HighBP) %>% summarize(HeartDiseaseorAttack = sum(HeartDiseaseorAttack)/n())
#heart_disease_HighBP
ggplot(heart_disease_HighBP,aes(x = HighBP, y = HeartDiseaseorAttack,fill = HighBP))+
  geom_col()

# So in this bar chart, we can see that people with HighBP are more likely to
# have heart disease.
heart_disease_Sex <- data %>% group_by(Sex) %>% summarize(HeartDiseaseorAttack = sum(HeartDiseaseorAttack)/n())
heart_disease_Sex
## # A tibble: 2 × 2
##     Sex HeartDiseaseorAttack
##   <int>                <dbl>
## 1     0                0.189
## 2     1                0.294
ggplot(heart_disease_Sex,aes(x = Sex, y = HeartDiseaseorAttack,fill = Sex))+
  geom_col()

# This bar chart shows that the probability of having heart disease differs by
# sex: males are more likely to have heart disease, at close to 30%

3.3 Finding the relations between different attributes

data.cor <- cor(alldata)
library(corrplot)
## corrplot 0.92 loaded
corrplot(data.cor,title = "Corrplot" , method = "color")
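
To read the correlation matrix more directly, the correlations of every feature with the target can be sorted (a sketch using the data.cor matrix computed above):

# Correlations with HeartDiseaseorAttack, strongest positive first
# (-1 drops the target's self-correlation in the first column)
sort(data.cor["HeartDiseaseorAttack", -1], decreasing = TRUE)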


4.1 Modeling

# Libraries
################################################################################
library(ggplot2)
library(tree)
library(caret)
## Loading required package: lattice
library(dplyr)
library(e1071)
library(readxl)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(gbm)
## Loaded gbm 2.1.8.1
################################################################################
## Load data
# Data Processing
################################################################################
# Load data
setwd("~/作业/wqd7004/groupwork")
df <- read.csv("data3.csv", sep=",")
df[,1]<-factor(df[,1])
heart_disease_data <- read.csv("data3.csv", sep=",")
heart_disease_data[,1]<-factor(heart_disease_data[,1])

Below, we preprocess the modeling data: BMI is binned into four ordinal levels, rows with missing values are removed, and the data is partitioned into training, validation, and test sets.

# Factorize BMI data into 4 ordinal levels: <=24 -> 1, 25-27 -> 2, 28-31 -> 3, >31 -> 4
temp <- df$BMI
for(i in 1:length(temp)){
  if(temp[i]>31) {
    temp[i] <-4
  } else if(temp[i]>27 && temp[i]<32) {
    temp[i] <- 3
  } else if(temp[i]>24 && temp[i]<28) {
    temp[i] <- 2
  } else {
    temp[i] <- 1
  }
}
df$BMI <- temp
# Removing NA's
df <- na.omit(df)
################################################################################
# Partition data into training (40%), validation (30%) and test (30%) sets
################################################################################
n=dim(df)[1]
set.seed(12345)
## Training Data
id=sample(1:n, floor(n*0.4))
train=df[id,]
id1=setdiff(1:n, id)
set.seed(12345)
## Validation data
id2=sample(id1, floor(n*0.3))
valid=df[id2,]
## Testing data
id3=setdiff(id1,id2)
test=df[id3,]
################################################################################
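
A quick sanity check (a sketch) confirms the partition sizes and that the index sets are disjoint:

# Partition sizes should sum to n, and the index sets should not overlap
c(train = nrow(train), valid = nrow(valid), test = nrow(test), total = n)
length(intersect(id, id2))   # 0: training and validation indices are disjoint
length(intersect(id2, id3))  # 0: validation and test indices are disjoint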

Now, we model the data with several methods, namely GBDT, a decision tree, and logistic regression.

Models

Decision Tree

A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. At each node of a classification tree, we try to form a condition on the features that separates the labels or classes contained in the dataset as purely as possible. A tree is built by splitting the source set, which constitutes the root node, into subsets that constitute the successor children. The splitting is based on a set of splitting rules over the classification features. With the current dataset, a decision tree is learned on the training data.

# Models
################################################################################
# Decision Tree with default settings (the response is a factor, so tree()
# grows a classification tree)
dt_default <- tree(formula = HeartDiseaseorAttack~.,
                   data = train,
                   method = "class")

GBDT (Gradient Boosting Decision Tree)

GBDT is also a member of the boosting family of ensemble learning methods, but it differs considerably from traditional AdaBoost, which uses the error rate of the previous iteration's weak learner to update the weights of the training set and iterates round by round. GBDT is also iterative and uses a forward stagewise algorithm, but its weak learners are restricted to CART regression trees, and its iteration idea differs from AdaBoost's. The main advantages of GBDT are: 1) flexibility in handling various types of data, including continuous and discrete values; 2) relatively high prediction accuracy with relatively little tuning time, compared with SVM; 3) robustness to outliers when using robust loss functions such as the Huber loss and the quantile loss. Its main disadvantage is that the dependencies between weak learners make it difficult to train in parallel, although partial parallelism can be achieved with subsampled SGBT.

# GBDT with default settings
gbdt_model <- gbm(formula = HeartDiseaseorAttack~.,
                  distribution = 'gaussian',
                  data=train,
                  n.trees=1000,
                  shrinkage = 0.01)
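
Note that distribution = 'gaussian' treats the 0/1 response as a continuous quantity, so the GBDT predictions below are raw scores rather than class probabilities. For binary classification, a bernoulli fit on a numeric 0/1 response would be the more standard choice; a sketch (not used for the results below):

# Bernoulli GBDT sketch: requires a numeric 0/1 response; predictions come
# back as probabilities that can be thresholded at 0.5
train_num <- transform(train, HeartDiseaseorAttack = as.numeric(as.character(HeartDiseaseorAttack)))
gbdt_bern <- gbm(HeartDiseaseorAttack ~ ., distribution = "bernoulli",
                 data = train_num, n.trees = 1000, shrinkage = 0.01)
prob_GBDT <- predict(gbdt_bern, newdata = test, n.trees = 1000, type = "response")
table(ifelse(prob_GBDT > 0.5, 1, 0), test$HeartDiseaseorAttack)  # 2x2 confusion matrix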
################################################################################
# decision tree
pred_dt<- predict(dt_default, test, type = "class")

# GBDT
pred_GBDT = predict(gbdt_model, newdata = test)
## Using 1000 trees...
################################################################################
# Confusion Matrix
################################################################################
# decision tree
cm_dt<-table(pred_dt, test$HeartDiseaseorAttack)
# GBDT (NB: pred_GBDT is continuous under 'gaussian', so this table has one
# row per distinct predicted value rather than being a 2x2 confusion matrix)
cm_GBDT = table(pred_GBDT, test[,1])
################################################################################
# Misclassification error
################################################################################
# decision tree
mmce_dt <- 1 - sum(diag(cm_dt)) / sum(cm_dt)
# GBDT
mmce_GBDT <- 1 - sum(diag(cm_GBDT)) / sum(cm_GBDT)
################################################################################
# F1 Score
################################################################################
# Decision tree
TN_dt <- cm_dt[1,1]
TP_dt <- cm_dt[2,2]
FN_dt <- cm_dt[1,2]
FP_dt <- cm_dt[2,1]
precision_dt <- (TP_dt) / (TP_dt + FP_dt) # 0.569035
recall_score_dt <- (FP_dt) / (FP_dt + TN_dt) # 0.08836704; NB: this is the false positive rate, not recall (recall = TP / (TP + FN))
f1_score_dt <- 2 * ((precision_dt * recall_score_dt) / (precision_dt + recall_score_dt)) # uses the FPR, so it understates the true F1
# GBDT
TN_GBDT <- cm_GBDT[1,1]
TP_GBDT <- cm_GBDT[2,2]
FN_GBDT <- cm_GBDT[1,2]
FP_GBDT <- cm_GBDT[2,1]
precision_GBDT <- (TP_GBDT) / (TP_GBDT + FP_GBDT) # 0.6753752
recall_score_GBDT <- (FP_GBDT) / (FP_GBDT + TN_GBDT) # 0.05079365; same caveat: this is the FPR, not recall
f1_score_GBDT <- 2 * ((precision_GBDT * recall_score_GBDT) / (precision_GBDT + recall_score_GBDT))
f1_score_GBDT # 0.04803202
## [1] 0.04803202
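
For reference, applying the standard recall definition (TP / (TP + FN)) to the decision-tree confusion matrix shown below gives a rather different F1; a corrected sketch using the quantities already computed:

# Corrected F1 for the decision tree with the standard recall definition
recall_dt_std <- TP_dt / (TP_dt + FN_dt)  # 2683 / (2683 + 4514) ~ 0.373
f1_dt_std <- 2 * (precision_dt * recall_dt_std) / (precision_dt + recall_dt_std)
f1_dt_std  # ~ 0.45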
################################################################################
# Accuracy
################################################################################
# decision tree
accuracy_dt <- (TP_dt + TN_dt) / (TP_dt + TN_dt + FP_dt + FN_dt)
accuracy_dt # 0.7831876
## [1] 0.7831876
# GBDT
accuracy_GBDT <- (TP_GBDT + TN_GBDT) / (TP_GBDT + TN_GBDT + FP_GBDT + FN_GBDT)
accuracy_GBDT # 0.7857143
## [1] 0.7857143
################################################################################

We modeled the data using GBDT and a decision tree; below are the results from these models.

Confusion Matrix from Decision Tree

cm_dt
##        
## pred_dt     0     1
##       0 20963  4514
##       1  2032  2683

Confusion Matrix from GBDT

The GBDT confusion matrix is omitted here: under the gaussian fit the predictions are continuous, so the raw table (one row per distinct predicted value) is too large to display.

F1 Score from Decision Tree

f1_score_dt # 0.1529777
## [1] 0.1529777

F1 Score from GBDT

f1_score_GBDT # 0.04803202
## [1] 0.04803202

Accuracy from Decision Tree

accuracy_dt # 0.7831876
## [1] 0.7831876

Accuracy from GBDT

accuracy_GBDT # 0.7857143
## [1] 0.7857143

ROC Curve

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

# ROC Plot
################################################################################
# fpr_dt <- FP_dt/(FP_dt+TN_dt)
# tpr_dt <- TP_dt/(TP_dt+FN_dt)
# fpr_GBDT <- FP_GBDT/(FP_GBDT+TN_GBDT)
# tpr_GBDT <- TP_GBDT/(TP_GBDT+FN_GBDT)
roc_dt <- roc(response = test$HeartDiseaseorAttack, predictor =as.numeric(pred_dt))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_GBDT <- roc(response = test$HeartDiseaseorAttack, predictor =as.numeric(pred_GBDT))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
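
pROC can also summarize each curve by its area under the curve (AUC), where values closer to 1 indicate better discrimination; a quick sketch using the roc objects above:

# AUC for each model: 0.5 is chance level, 1.0 is perfect discrimination
auc(roc_dt)
auc(roc_GBDT)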

ROC plots for GBDT and the decision tree are plotted separately below.

par(mfrow=c(1,2))
plot(roc_GBDT, col = "green", main = c("GBDT"))
plot(roc_dt, col = "red", main = c("Decision tree"))

Now, we will prune the decision tree.

# Pruning Tree
set.seed(12345)
trainScore = rep(0, 100)
testScore = rep(0, 100)
for(i in 2:100) {
  prunedTree = prune.tree(dt_default, best=i)
  pd = predict(prunedTree, newdata = valid, type= "tree")
  trainScore[i] = deviance(prunedTree)
  testScore[i] = deviance(pd)
}
optimalLeaves <- which.min(testScore[2:100]) # NB: this indexes into testScore[2:100], so the deviance-minimizing tree size is actually optimalLeaves + 1
optimalTree = prune.tree(dt_default, best=optimalLeaves)
predOptimalTree = predict(optimalTree, newdata= valid, type="class")
cnfMatrixOptimalTree <- table(valid$HeartDiseaseorAttack, predOptimalTree)
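
Plotting the training and validation deviances against tree size (a quick sketch using the vectors above) makes the choice of the optimal size visible:

# Training vs validation deviance by number of leaves
plot(2:100, trainScore[2:100], type = "l", col = "blue",
     xlab = "Number of leaves", ylab = "Deviance",
     ylim = range(c(trainScore[2:100], testScore[2:100])))
lines(2:100, testScore[2:100], col = "red")
legend("topright", legend = c("Training", "Validation"), col = c("blue", "red"), lty = 1)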

Let's see the result from the pruned tree. We print the confusion matrix from the pruned tree.

cnfMatrixOptimalTree
##    predOptimalTree
##         0     1
##   0 20913  2101
##   1  4543  2634

Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. It is one of the simplest ML algorithms and can be applied to various classification problems. It is easy to implement and interpret, and very efficient to train. If the number of observations is smaller than the number of features, logistic regression should not be used, as it may overfit.

# Logistic Regression
pi_generator <- seq(0.05, 0.95, 0.05)
logiReg <- glm(formula = HeartDiseaseorAttack~., data = train, family = "binomial")
logiRegPred <- predict(logiReg, select(test, -c(HeartDiseaseorAttack)), type = "response")
confList <- list()
for(i in pi_generator) {
  a <- as.factor(ifelse(logiRegPred>i, 'yes', 'no'))
  b <- table(a, test$HeartDiseaseorAttack)
  confList <- c(confList, list(b))
}
tpr_logR <- c()
fpr_logR <- c()
total_loop <- length(pi_generator)-1
for (iter in 1:total_loop) {
  tpr_value <- confList[[iter]][4]/(confList[[iter]][3]+confList[[iter]][4])
  tpr_logR <- c(tpr_logR, tpr_value)
  
  fpr_value <- confList[[iter]][2]/(confList[[iter]][1]+confList[[iter]][2])
  fpr_logR <- c(fpr_logR, fpr_value)
}
# Classify test data with Optimal Tree
optimalTreePred = predict(optimalTree, newdata= test, type="vector")
confListOptTree <- list()
for(i in pi_generator) {
  k <- as.factor(ifelse(optimalTreePred[,2]>i, 'yes', 'no'))
  l <- table(k, test$HeartDiseaseorAttack)
  confListOptTree <- c(confListOptTree, list(l))
}
tprOptTree <- c()
fprOptTree <- c()
for (iter1 in 1:19) {
  TP <- confListOptTree[[iter1]][4]
  P <- (confListOptTree[[iter1]][3]+confListOptTree[[iter1]][4])
  tpr_value <- TP/P
  tprOptTree <- c(tprOptTree, tpr_value)
  
  FP <- confListOptTree[[iter1]][2]
  N <- (confListOptTree[[iter1]][1]+confListOptTree[[iter1]][2])
  fpr_value <- FP/N
  fprOptTree <- c(fprOptTree, fpr_value)
}
# At the highest thresholds nothing is classified 'yes', so the tables collapse
# to a single row; set those TPR/FPR entries to 0
tprOptTree[16:19] <- 0.0
fprOptTree[16:19] <- 0.0
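
As a cross-check of the manually computed TPR/FPR points, the same logistic-regression ROC can be obtained directly from pROC (a sketch reusing logiRegPred from above):

# pROC sweeps all thresholds automatically from the predicted probabilities
roc_logR <- roc(response = test$HeartDiseaseorAttack, predictor = logiRegPred)
plot(roc_logR, col = "blue", main = "Logistic Regression (pROC)")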

Let’s plot the ROC curve for the comparison of decision tree and Logistic regression.

plot(fpr_logR, tpr_logR, type="l", col="blue",
     xlab="FPR", ylab="TPR", xlim=c(0,0.8), ylim=c(0,1))
lines(fprOptTree, tprOptTree, col="red", type="l")
title("ROC Curve")
legend("bottomright", legend=c("Logistic Reg","Optimal Tree"),
       col=c("blue","red"), lty = 1)

Finally, we show the accuracy for all the different models.

accuracy_dt = 1-mmce_dt
accuracy_dt
## [1] 0.7831876
cat(paste("Accuracy from decision tree: ", accuracy_dt))
## Accuracy from decision tree:  0.78318759936407
cat(paste("Accuracy from GBDT: ", accuracy_GBDT))
## Accuracy from GBDT:  0.785714285714286

Since GBDT and the decision tree achieve the better accuracy, we prefer these two models.

Conclusion

In this project, classification models were built on the survey-respondent data to predict whether a respondent had heart disease, and the test results showed that the models built in this project had high classification accuracy.

In the future, the accuracy of the model will be improved in the following two ways:

  1. Enriching the dataset, so that more types of data can be collected to improve the accuracy of the model.
  2. Building better models.