Contents
1. Introduction
1.1. Dataset source and description
1.2. Response feature
2. Data Preprocessing
2.1. Loading packages
2.2. Loading dataset
2.3. Data cleaning and transformation
2.4. Missing values
3. Data Visualisation
3.1. Bar chart for gender proportion
3.2. Histogram for age frequency
3.3. Bar chart for target proportion
3.4. Exercise-induced angina vs target
3.5. Gender vs target
3.6. Slope vs target
3.7. Chest pain type vs target
3.8. Relationship between trestbps, slope and target
3.9. Relationship among age, serum cholesterol and target
3.10. Relationship among age, serum cholesterol and target by gender
3.11. Bar chart of target vs blood sugar vs resting electrocardiographic results
4. Results
5. Conclusion
6. References
Heart disease is among the most widespread diseases in the world, and many people who suffer from it do not know that they have it. Detecting the disease early helps with treatment and reduces side effects compared with detecting it later.
The aim of phase one is to examine the relationships between the features and their role in predicting the likelihood of developing heart disease, using the data provided. The analysis will lead to the formulation of a logistic regression model that predicts the likelihood of developing heart disease from a combination of several factors.
The dataset was obtained from the Kaggle website (https://www.kaggle.com/ronitf/heart-disease-uci). It includes 303 observations and 14 features, which are defined as follows:
• Age: age of the patient in years (Interval)
• Sex: sex of the patient (Nominal Categorical, where 1 refers to male and 0 refers to female)
• Cp: chest pain type (Ordinal Categorical, where 0 = no-pain, 1 = less-pain, 2 = medium-pain, 3 = strongly-pain)
• Trestbps: resting blood pressure (Interval - Continuous)
• Chol: serum cholesterol in mg/dl (Interval - Continuous)
• Fbs: fasting blood sugar greater than 120 mg/dl (Binary Categorical, where 1 = true and 0 = false)
• Restecg: resting electrocardiographic results (Categorical)
• Thalach: maximum heart rate achieved (Continuous)
• Exang: exercise-induced angina (Binary Categorical, where 1 = yes and 0 = no)
• Oldpeak: ST depression induced by exercise relative to rest (Continuous)
• Slope: the slope of the peak exercise ST segment (Categorical)
• Ca: number of major vessels (0-3) colored by fluoroscopy (Interval - Discrete)
• Thal: (Categorical, where 0 = normal, 1 = defect, 2 = fixed-defect, 3 = reversable-defect)
• Target: 1 or 0 (Binary Categorical, where 1 = Have-heart-diseases and 0 = Not-heart-diseases)
The response feature is Target, which takes the value 1 if the patient has heart disease and 0 otherwise.
The target feature has two classes, so this is a binary classification problem. The goal is to predict from the data whether or not a patient has heart disease.
The following R packages are used in this project:
library(mlr)
# Load plyr before dplyr so that plyr does not mask dplyr's functions
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggmosaic)
library(vcd)
library(googleVis)
library(knitr)
library(cowplot)
library(GGally)
library(tidyverse)
library(lubridate)
library(Hmisc)
setwd("~/Desktop/past semester/semester3/categerical/Assigmnent1.CDA/Our project1")
opts_chunk$set(tidy.opts=list(width.cutoff=55),tidy=TRUE)
The read.csv() function is used to load the dataset into R:
HD <- read.csv("group13.csv")
head(HD)
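Before going further, it can help to confirm that the file was read as expected. The sketch below is illustrative only: it writes a tiny stand-in CSV to a temporary file (the three rows are made up, and `toy` is a hypothetical name) and then inspects it the same way HD could be inspected.

```r
# Write a tiny stand-in CSV to a temporary file (illustrative values only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,sex,target", "63,1,1", "37,1,1", "41,0,0"), tmp)

toy <- read.csv(tmp)

# Basic shape and type checks before any analysis.
print(dim(toy))  # number of rows, number of columns
str(toy)         # column classes as R parsed them
```

Applied to HD, `dim(HD)` would be expected to report 303 rows and 14 columns if the file matches the description above.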
In the next step, we converted some variables to a more appropriate class, because the dataset stores several categorical variables as numbers. The codes in the following features are interpreted as follows:
sex: 0 = Female and 1 = Male.
target: 0 = Not-heart-diseases and 1 = heart-diseases.
cp: 0 = NO-pain, 1 = Less-pain, 2 = Medium-pain and 3 = strongly-pain.
fbs: 0 = Normal and 1 = non-normal (above 120 mg/dl).
restecg: 0 = Normal, 1 = hypertrophy and 2 = wave abnormality.
exang: 0 = Not-angina and 1 = Have angina.
slope: 0 = up, 1 = Flat and 2 = Down.
thal: 0 = normal, 1 = defect, 2 = fixed-defect and 3 = reversable-defect.
HD$sex<-as.factor(HD$sex)
HD$cp<-as.factor(HD$cp)
HD$fbs<-as.factor(HD$fbs)
HD$restecg<-as.factor(HD$restecg)
HD$exang<-as.factor(HD$exang)
HD$slope<-as.factor(HD$slope)
HD$thal<-as.factor(HD$thal)
HD$target<-as.factor(HD$target)
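The eight conversions above could also be written as a single step. The following is a minimal sketch on a small made-up data frame (the values in `toy` are illustrative and are not rows from the dataset), using only base R.

```r
# Toy data frame standing in for HD; the values are made up for illustration.
toy <- data.frame(
  age    = c(63, 37, 41, 56),
  sex    = c(1, 1, 0, 1),
  cp     = c(3, 2, 1, 1),
  target = c(1, 1, 0, 0)
)

# Columns that hold category codes rather than measurements.
factor_cols <- c("sex", "cp", "target")

# Convert all of them to factors in one step.
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Listing the categorical columns once in `factor_cols` keeps the choice of which features are treated as categories in one place.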
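After that, the summarizeColumns() function is used to obtain summary statistics: the mean, median, minimum and maximum for numeric features, and the number of levels and of missing values for every feature. For factor columns, min and max give the sizes of the smallest and largest categories.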
summarizeColumns(HD)%>% select(-disp, -mad) %>% knitr::kable(caption ='Features Summary')
| name | type | na | mean | median | min | max | nlevs |
|---|---|---|---|---|---|---|---|
| age | integer | 0 | 54.3663366 | 55.0 | 29 | 77.0 | 0 |
| sex | factor | 0 | NA | NA | 96 | 207.0 | 2 |
| cp | factor | 0 | NA | NA | 23 | 143.0 | 4 |
| trestbps | integer | 0 | 131.6237624 | 130.0 | 94 | 200.0 | 0 |
| chol | integer | 0 | 246.2640264 | 240.0 | 126 | 564.0 | 0 |
| fbs | factor | 0 | NA | NA | 45 | 258.0 | 2 |
| restecg | factor | 0 | NA | NA | 4 | 152.0 | 3 |
| thalach | integer | 0 | 149.6468647 | 153.0 | 71 | 202.0 | 0 |
| exang | factor | 0 | NA | NA | 99 | 204.0 | 2 |
| oldpeak | numeric | 0 | 1.0396040 | 0.8 | 0 | 6.2 | 0 |
| slope | factor | 0 | NA | NA | 21 | 142.0 | 3 |
| ca | integer | 0 | 0.7293729 | 0.0 | 0 | 4.0 | 0 |
| thal | factor | 0 | NA | NA | 2 | 166.0 | 4 |
| target | factor | 0 | NA | NA | 138 | 165.0 | 2 |
Table 1 shows that there are no missing values.
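The same check can be made directly, without a summary table. This is a small sketch on made-up data (the data frame `toy` and its deliberately missing value are hypothetical) that counts missing values per column with base R.

```r
# Toy data frame with one deliberately missing value (illustrative only).
toy <- data.frame(
  age  = c(63, 37, NA, 56),
  chol = c(233, 250, 204, 236)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only when no column contains a missing value.
no_missing <- all(na_per_column == 0)
print(no_missing)
```

For HD, `colSums(is.na(HD))` would be expected to return zero for every column, in line with the na column of the summary.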
Next, we replace the category codes with descriptive labels so that the categories are easier to read in the visualisations.
HD$sex <- factor(HD$sex,levels=c('0','1'),labels=c('Female','Male'))
HD$target <- factor(HD$target,levels=c('0','1'),labels=c('Not-heart-diseases','heart-diseases'))
HD$cp <- factor(HD$cp,levels=c('0','1','2','3'),labels=c('NO-pain','Less-pain' ,'Medium-pain','strongly-pain'))
HD$fbs <- factor(HD$fbs,levels=c('0','1'),labels=c('Normal','non-nmormal'))
HD$restecg <- factor(HD$restecg,levels=c('0','1','2'),labels=c('Normal','hypertrophy','wave abnormality'))
HD$exang <- factor(HD$exang,levels=c('0','1'),labels=c('Not-angina','Have angina'))
HD$slope <- factor(HD$slope,levels=c('0','1','2'),labels=c('up','Flat','Down'))
HD$thal <- factor(HD$thal,levels=c('0','1','2','3'),labels=c('normal','defect','fixed-defect','reversable-defect'))
summary(HD)
##       age            sex                  cp         trestbps
##  Min.   :29.00   Female: 96   NO-pain      :143   Min.   : 94.0
##  1st Qu.:47.50   Male  :207   Less-pain    : 50   1st Qu.:120.0
##  Median :55.00                Medium-pain  : 87   Median :130.0
##  Mean   :54.37                strongly-pain: 23   Mean   :131.6
##  3rd Qu.:61.00                                    3rd Qu.:140.0
##  Max.   :77.00                                    Max.   :200.0
##       chol                fbs                  restecg       thalach
##  Min.   :126.0   Normal     :258   Normal          :147   Min.   : 71.0
##  1st Qu.:211.0   non-nmormal: 45   hypertrophy     :152   1st Qu.:133.5
##  Median :240.0                     wave abnormality:  4   Median :153.0
##  Mean   :246.3                                            Mean   :149.6
##  3rd Qu.:274.5                                            3rd Qu.:166.0
##  Max.   :564.0                                            Max.   :202.0
##          exang        oldpeak      slope          ca
##  Not-angina :204   Min.   :0.00   up  : 21   Min.   :0.0000
##  Have angina: 99   1st Qu.:0.00   Flat:140   1st Qu.:0.0000
##                    Median :0.80   Down:142   Median :0.0000
##                    Mean   :1.04              Mean   :0.7294
##                    3rd Qu.:1.60              3rd Qu.:1.0000
##                    Max.   :6.20              Max.   :4.0000
##                 thal                    target
##  normal           :  2   Not-heart-diseases:138
##  defect           : 18   heart-diseases    :165
##  fixed-defect     :166
##  reversable-defect:117
##
##
The next plot shows that the proportion of men is higher than that of women: approximately 68% of the patients are male and approximately 32% are female.
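p12 <- ggplot(data = HD) +
  # ylab is not a geom_bar() argument, so the axis label is set in labs();
  # the bar colour is set with fill because no fill aesthetic is mapped
  geom_bar(mapping = aes(x = sex, y = after_stat(prop), group = 1), fill = "green") +
  labs(title = "Proportion of gender", y = "proportion")
p12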
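The percentages read from the bar chart can be checked against the counts reported in the summary above (96 female and 207 male patients). A minimal sketch in base R, where `gender_counts` and `gender_pct` are names introduced here for illustration:

```r
# Gender counts as reported in the summary of the dataset.
gender_counts <- c(Female = 96, Male = 207)

# Share of each gender in percent, rounded to one decimal place.
gender_pct <- round(100 * prop.table(gender_counts), 1)
print(gender_pct)
```

This gives roughly 31.7% female and 68.3% male. The same calculation with the target counts (138 and 165) gives the split between patients without and with heart disease.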
The histogram of age suggests that age is approximately normally distributed.
HD$age %>% hist(col="skyblue",xlim=c(20,90), xlab="age",
main="Histogram of Age")
The bar chart of the target shows that approximately 54% of the patients in the dataset have heart disease, while approximately 46% do not.
p11 <- ggplot(data = HD) +
  # ylab is not a geom_bar() argument, so the axis label is set in labs()
  geom_bar(mapping = aes(x = target, y = after_stat(prop), group = 1)) +
  labs(title = "Proportion of target", y = "proportion")
p11
Approximately 70% of the patients without exercise-induced angina have heart disease, whereas around 30% of them do not. Among patients with exercise-induced angina, nearly 77% do not have heart disease, whereas around 23% do.
exercise_induced_angina<- table(HD$exang, HD$target, dnn = c("exang", "target"))
exercise_induced_angina %>% knitr::kable(caption = "exercise induced angina on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Not-angina | 62 | 142 |
| Have angina | 76 | 23 |
exercise_induced_angina_proportion <- round(prop.table(exercise_induced_angina, 1), 2)
exercise_induced_angina_proportion %>% knitr::kable(caption = "exercise induced angina Proportion on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Not-angina | 0.30 | 0.70 |
| Have angina | 0.77 | 0.23 |
p5 <- ggplot(data = HD, aes(x = exang, fill = target))
p5 + geom_bar(position = "dodge") + labs(title = "Frequency of exercise induced angina Vs target") +
scale_fill_manual(values = c("aquamarine3", "cadetblue2"))
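To make the role of the margin argument in prop.table() explicit, the sketch below rebuilds the exang by target table from the counts shown above and computes proportions both ways. The object names (`exang_counts`, `row_prop`, `col_prop`) are introduced here for illustration.

```r
# Counts copied from the frequency table of exang by target.
exang_counts <- matrix(
  c(62, 142,
    76,  23),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    exang  = c("Not-angina", "Have angina"),
    target = c("Not-heart-diseases", "heart-diseases")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1.
row_prop <- round(prop.table(exang_counts, margin = 1), 2)
print(row_prop)

# margin = 2 divides by column totals instead (the share of each exang
# group within a target class), which answers a different question.
col_prop <- round(prop.table(exang_counts, margin = 2), 2)
print(col_prop)
```

The row proportions reproduce the table above (0.30, 0.70, 0.77, 0.23), which is why the text reads each percentage as "of the patients in this exang group".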
Among female patients, the proportion with heart disease (75%) is three times the proportion without it (25%). Among male patients, the proportion without heart disease (55%) is 10 percentage points higher than the proportion with it (45%). We speculate that the higher proportion among females may relate to the female physical makeup and to matters related to pregnancy, and that males may be more physically strong, although the data here cannot confirm either explanation.
Gender_Frequency <- table(HD$sex, HD$target, dnn = c("Gender", "target"))
Gender_Frequency %>% knitr::kable(caption = "Gender Frequency on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Female | 24 | 72 |
| Male | 114 | 93 |
Gender_Frequency_Proportion <- round(prop.table(Gender_Frequency, 1), 2)
Gender_Frequency_Proportion %>% knitr::kable(caption = "Gender Proportion on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Female | 0.25 | 0.75 |
| Male | 0.55 | 0.45 |
p4 <- ggplot(data = HD, aes(x =sex, fill = target))
p4 + geom_bar() + labs(title = "Frequency of gender on target") +
scale_fill_manual(values = c("chartreuse3", "cyan4"))
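Because the planned model is a logistic regression, the gender comparison can also be expressed as odds. The following is a rough sketch of that arithmetic using the counts from the gender by target table; the variable names are introduced here for illustration, and the odds ratio describes this sample only.

```r
# Counts copied from the gender by target frequency table.
female <- c(no_hd = 24, hd = 72)
male   <- c(no_hd = 114, hd = 93)

# Odds of heart disease = patients with the disease / patients without it.
odds_female <- female[["hd"]] / female[["no_hd"]]  # 72 / 24
odds_male   <- male[["hd"]] / male[["no_hd"]]      # 93 / 114

# Odds ratio comparing females with males in this sample.
odds_ratio <- odds_female / odds_male

cat("odds (female):", odds_female, "\n")
cat("odds (male):  ", round(odds_male, 2), "\n")
cat("odds ratio:   ", round(odds_ratio, 2), "\n")
```

The female odds of 3 correspond to the 75% versus 25% split noted in the text.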
Approximately 75% of patients with a down slope of the peak exercise ST segment have heart disease, compared with 35% of those with a flat slope and 43% of those with an up slope.
Slope_Frequency <- table(HD$slope, HD$target, dnn = c("Slope", "target"))
Slope_Frequency %>% knitr::kable(caption = "Slope Frequency on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| up | 12 | 9 |
| Flat | 91 | 49 |
| Down | 35 | 107 |
Slope_Frequency_Proportion <- round(prop.table(Slope_Frequency, 1), 2)
Slope_Frequency_Proportion %>% knitr::kable(caption = "Slope Proportion on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| up | 0.57 | 0.43 |
| Flat | 0.65 | 0.35 |
| Down | 0.25 | 0.75 |
mosaic(Slope_Frequency, pop = FALSE, legend = TRUE, shade = TRUE)
labeling_cells(text = Slope_Frequency_Proportion, margin = 0)(Slope_Frequency_Proportion)
The next graph shows that nearly 70% of patients with strong chest pain have heart disease, compared with 79% of those with medium chest pain and 82% of those with less chest pain. Lastly, 27% of patients without chest pain have heart disease.
cp_Frequency <- table(HD$cp, HD$target, dnn = c("chest pain type", "target"))
cp_Frequency %>% knitr::kable(caption = "chest pain type Frequency on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| NO-pain | 104 | 39 |
| Less-pain | 9 | 41 |
| Medium-pain | 18 | 69 |
| strongly-pain | 7 | 16 |
cp_Frequency_Proportion <- round(prop.table(cp_Frequency, 1), 2)
cp_Frequency_Proportion %>% knitr::kable(caption = "chest pain type Proportion on target")
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| NO-pain | 0.73 | 0.27 |
| Less-pain | 0.18 | 0.82 |
| Medium-pain | 0.21 | 0.79 |
| strongly-pain | 0.30 | 0.70 |
p6 <- ggplot(HD, aes(x = cp)) + geom_bar() + labs(title = "Frequency of chest pain type Vs target") + theme(axis.text.x = element_text(angle = 90, vjust = 0.3,
hjust = 1))
p7 <- ggplot(HD, aes(x = cp, fill = target)) +
geom_bar() + facet_grid(target ~ .) + theme(axis.text.x = element_text(angle = 90,
vjust = 0.3, hjust = 1))
plot_grid(p6, p7, ncol = 2)
The following figure shows the relationship between resting blood pressure, the slope of the peak exercise ST segment and the target. Among patients with an up slope, those without heart disease have blood pressure from about 120 to 150, while those with heart disease have blood pressure of about 125 to 145. Among patients with a flat slope, those without heart disease have blood pressure from approximately 122 to 145, while those with heart disease have blood pressure of about 110 to 138. Among patients with a down slope, those without heart disease have blood pressure from approximately 118 to 140, which is similar to the range for those with heart disease.
p8 <- ggplot(data = HD, aes(x = trestbps,
y = slope,
fill = target))
p8 + geom_boxplot() + labs(y = "slope", x = "trestbps") +
theme(legend.title=element_blank()) + coord_flip()
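The ranges quoted for the box plot are read from the figure by eye. Exact quartiles per group can be computed instead; the sketch below shows one way to do this in base R on made-up values (the data frame `toy` and its two groups are hypothetical).

```r
# Made-up resting blood pressure values for two hypothetical groups.
toy <- data.frame(
  trestbps = c(110, 120, 130, 140, 150,
               100, 110, 120, 130, 140),
  group    = rep(c("A", "B"), each = 5)
)

# First quartile, median and third quartile of trestbps within each group.
quartiles <- tapply(
  toy$trestbps, toy$group,
  quantile, probs = c(0.25, 0.5, 0.75)
)
print(quartiles)
```

With the real data, grouping by `list(HD$slope, HD$target)` instead of `toy$group` would give one set of quartiles for each box in the figure.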
From the next figure we note the following:
Among patients who have heart disease, the cholesterol level increases gradually from about 180 mg/dl at around 28 years of age to about 265 mg/dl at around 65 years of age, and then decreases to about 240 mg/dl at around 76 years of age.
Patients who do not have heart disease have slightly higher cholesterol than those with heart disease, except between the ages of 60 and 70, where the cholesterol level of those without heart disease is lower than that of those with heart disease.
p12 <- ggplot(data =HD, aes(x =age, y =chol , colour = target))
p12 + geom_point() + geom_smooth()+
labs(x = "Age", y = "serum cholestoral in mg/dl",
title = "The Relationship Among Age, serum cholestoral and target")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
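The curves in these figures come from a loess smoother, as the message printed by geom_smooth() indicates. To read smoothed values at particular ages rather than from the plot, a loess fit can be evaluated directly. This is a sketch on made-up numbers (the `age` and `chol` vectors below are synthetic, not taken from the dataset), using the default settings of base R's loess(), which may differ in detail from what the plot uses.

```r
# Made-up ages and cholesterol values; not taken from the dataset.
age  <- seq(30, 75, by = 3)
chol <- 200 + 0.04 * (age - 30) * (80 - age) +
  rep(c(-5, 5), length.out = length(age))

# loess() fits a local regression, the smoothing method reported by
# geom_smooth() for these plots.
fit <- loess(chol ~ age)

# Smoothed cholesterol at a few ages of interest.
new_ages <- data.frame(age = c(40, 55, 70))
smoothed <- predict(fit, newdata = new_ages)
print(round(smoothed, 1))
```

Fitting one such model per target class on HD would give the values behind each coloured curve.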
In the next figure, we observe that for males the difference in cholesterol level between those with and without heart disease is slight, while for females there is a large difference in cholesterol level between those with and without heart disease.
p13 <- ggplot(data = HD, aes(x = age, y = chol, colour = target))
p13 + geom_point() + geom_smooth() + facet_grid(. ~ sex)+
labs(x = "Age", y = "serum cholestoral in mg/dl",
title = "The Relationship Among Age, serum cholestoral and target for gender")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In the next graph, among patients with normal blood sugar and a wave abnormality, 75% do not have heart disease, while among those with normal blood sugar and hypertrophy, about 67% have heart disease. Among patients with non-normal blood sugar and hypertrophy, approximately 65% have heart disease. Among patients with normal blood sugar and normal resting electrocardiographic results, approximately 50% do not have heart disease.
restecg_Frequency <- table(HD$restecg,HD$target,dnn = c("restecg","Target"))
restecg_Frequency %>% knitr::kable( caption = 'restecg Frequency on target')
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Normal | 79 | 68 |
| hypertrophy | 56 | 96 |
| wave abnormality | 3 | 1 |
restecg_Proportion <- round(prop.table(restecg_Frequency,1),2)
restecg_Proportion %>% knitr::kable( caption = 'restecg Proportion on target')
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Normal | 0.54 | 0.46 |
| hypertrophy | 0.37 | 0.63 |
| wave abnormality | 0.75 | 0.25 |
fbs_Frequency <- table(HD$fbs ,HD$target,dnn = c("fbs","Target"))
fbs_Frequency %>% knitr::kable( caption = 'fbs Frequency on target')
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Normal | 116 | 142 |
| non-nmormal | 22 | 23 |
fbs_Proportion <- round(prop.table(fbs_Frequency,1),2)
fbs_Proportion %>% knitr::kable( caption = 'fbs Proportion on target')
| | Not-heart-diseases | heart-diseases |
|---|---|---|
| Normal | 0.45 | 0.55 |
| non-nmormal | 0.49 | 0.51 |
p8 <-ggplot(HD, aes(x = restecg, fill = target)) +
geom_bar(position = 'fill') + facet_grid( ~ fbs) +
theme(axis.text.x = element_text(angle = 90, vjust=0.3, hjust = 1)) +
labs(title = 'Proportional Bar Chart: blood sugar and resting electrocardiographic results, target') + coord_flip()
p8
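The faceted chart combines three features, whereas the tables above cross only two at a time. Proportions within each combination of blood sugar and electrocardiographic result can be tabulated with a three-way table. The sketch below uses made-up records (the data frame `toy` and its labels are hypothetical) to show the pattern.

```r
# Made-up records for illustration; not rows from the dataset.
toy <- data.frame(
  fbs     = c("Normal", "Normal", "Normal", "Normal", "Non-normal", "Non-normal"),
  restecg = c("Normal", "Normal", "hypertrophy", "hypertrophy", "Normal", "Normal"),
  target  = c("yes", "no", "yes", "yes", "no", "no")
)

# Three-way table of counts: fbs x restecg x target.
counts <- xtabs(~ fbs + restecg + target, data = toy)

# Share of each target class within every fbs/restecg combination.
# Combinations with no records give NaN (0 / 0).
within_cell <- prop.table(counts, margin = c(1, 2))

# Flatten to a readable two-dimensional layout.
print(ftable(round(within_cell, 2)))
```

Run on HD with the same formula, this would give the numbers that the bars in each facet represent.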
Analysing the data showed that most of those afflicted with heart disease are men, who are often neither old nor young. Likewise, most women who develop heart disease are neither old nor young. Approximately 54% of the patients in the dataset have heart disease, while approximately 46% do not. Among patients with exercise-induced angina, about 23% have heart disease. Furthermore, the percentage of patients with heart disease among those with an up, down and flat slope of the peak exercise ST segment is about 43%, 75% and 35%, respectively. Nearly 70% of patients with strong chest pain have heart disease, compared with 79% of those with medium chest pain, 82% of those with less chest pain and 27% of those without chest pain. Regarding resting blood pressure and the slope of the ST segment, patients with an up slope and no heart disease have blood pressure from about 120 to 150, while those with heart disease have about 125 to 145. For a flat slope, those without heart disease range from approximately 122 to 145, while those with heart disease range from about 110 to 138. For a down slope, those without heart disease range from approximately 118 to 140, which is similar to the range for those with heart disease.
Among patients who have heart disease, the cholesterol level increases gradually from about 180 mg/dl at around 28 years of age to about 265 mg/dl at around 65 years of age, and then decreases to about 240 mg/dl at around 76 years of age. Patients who do not have heart disease have slightly higher cholesterol than those with heart disease, except between the ages of 60 and 70, where the cholesterol level of those without heart disease is lower. Among patients with normal blood sugar and a wave abnormality, 75% do not have heart disease, while among those with normal blood sugar and hypertrophy, about 67% have heart disease. Among patients with non-normal blood sugar and hypertrophy, approximately 65% have heart disease. Among patients with normal blood sugar and normal resting electrocardiographic results, approximately 50% do not have heart disease.
Many features are associated with heart disease, such as age, cholesterol, diabetes and type of chest pain. The data will be analysed further to test whether there is a significant relationship between the features and the likelihood of heart disease. Furthermore, a logistic regression model can be used to predict an individual's probability of heart disease.
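As a rough preview of that next step, the sketch below fits a logistic regression with a single predictor, exang, using only the four counts from the exang by target table. It is an illustration under that simplification, not the model the project will finally use; the object names (`cells`, `fit`, `groups`, `p_hat`) are introduced here.

```r
# Counts copied from the exang by target frequency table, one row per cell.
cells <- data.frame(
  exang  = factor(c("Not-angina", "Not-angina", "Have angina", "Have angina"),
                  levels = c("Not-angina", "Have angina")),
  target = c(0, 1, 0, 1),      # 1 = heart disease, 0 = no heart disease
  n      = c(62, 142, 76, 23)  # number of patients in each cell
)

# Logistic regression of target on exang, weighting each cell by its count.
fit <- glm(target ~ exang, family = binomial, data = cells, weights = n)

# Predicted probability of heart disease for each exang group.
groups <- data.frame(
  exang = factor(c("Not-angina", "Have angina"),
                 levels = c("Not-angina", "Have angina"))
)
p_hat <- predict(fit, newdata = groups, type = "response")
print(round(p_hat, 2))
```

With one categorical predictor the fitted probabilities simply reproduce the row proportions of the table (about 0.70 and 0.23). The model becomes informative once several features are combined.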
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.
Gesmann, Markus, and Diego de Castillo. 2011. “GoogleVis: Interface Between R and the Google Visualisation Api.” The R Journal 3 (2): 40–44. https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Gesmann+de~Castillo.pdf.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2017. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.
Jeppson, Haley, Heike Hofmann, and Di Cook. 2017. Ggmosaic: Mosaic Plots in the ’Ggplot2’ Framework. https://CRAN.R-project.org/package=ggmosaic.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.