Contents

1. Introduction
  1.1. Dataset source and description
  2.1. Response Feature
2. Data Preprocessing
  1.2. Loading Packages
  2.2. Loading Dataset
  3.2. Data Cleaning and Transformation
  4.2. Missing Values
3. Data Visualisation
  1.3. Bar charts for Gender type proportion
  2.3. Histogram for age frequency
  3.3. Bar chart for target proportion
  4.3. Exercise induced angina vs Target
  5.3. Gender vs target
  6.3. Slope vs target
  7.3. chest pain type vs target
  8.3. Relationship trestbps, slope vs target
  9.3. Relationship Among Age, serum cholesterol vs Target
  10.3. Relationship Among Age, serum cholesterol and target for gender
  11.3. Bar chart of target vs sugar in blood vs resting electrocardiographic results
4. Results
5. Conclusion
6. References

1. Introduction

Heart disease is one of the most widespread diseases in the world. Many people suffer from it without knowing that they have it. Discovering the disease early helps with treatment and reduces side effects better than discovering it late.

The aim of phase one is to examine the relationships between the features and their role in predicting the likelihood of developing heart disease, using the data provided. Analysis of the data will lead to the formulation of a logistic regression model for predicting the likelihood of developing heart disease from a combination of several factors.

1.1. Dataset source and description

The dataset was obtained from the Kaggle website (https://www.kaggle.com/ronitf/heart-disease-uci). It includes 303 observations and 14 features, which are defined as follows:

• Age: age of the patient in years (Interval)

• Sex: sex of the patient (Nominal Categorical, where 1 refers to male and 0 refers to female)

• Cp: chest pain type (Ordinal Categorical, where 0 = no-pain, 1 = Less-pain, 2 = Medium-pain, 3 = strongly-pain)

• Trestbps: resting blood pressure (Interval - Continuous)

• Chol: serum cholesterol in mg/dl (Interval - Continuous)

• Fbs: fasting blood sugar greater than 120 mg/dl (Binary Categorical, where 1 = true and 0 = false)

• Restecg: resting electrocardiographic results (Categorical)

• Thalach: maximum heart rate achieved (Continuous)

• Exang: exercise induced angina (Binary Categorical, where 1 = yes and 0 = no)

• Oldpeak: ST depression induced by exercise relative to rest (Continuous)

• Slope: the slope of the peak exercise ST segment (Categorical)

• Ca: number of major vessels (0-3) colored by fluoroscopy (Interval - Discrete)

• Thal: categorical, where 0 = normal, 1 = defect, 2 = fixed-defect, 3 = reversable-defect

• Target: 1 or 0 (Binary Categorical, where 1 = Have-heart-diseases and 0 = Not-heart-diseases)

2.1. Response Feature

The response feature is Target, which is given as:

Target = 1 if the patient has heart disease, and Target = 0 otherwise.

The target feature has two classes, so this is a binary classification problem. The goal is to predict from the data whether or not a patient has heart disease.

2. Data Preprocessing

1.2. Loading Packages

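The following R packages are used in this project.

# plyr is loaded before dplyr so that it does not mask dplyr's verbs
library(plyr)
library(mlr)        # summarizeColumns()
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggmosaic)
library(vcd)        # mosaic(), labeling_cells()
library(googleVis)
library(knitr)      # kable(), opts_chunk
library(cowplot)    # plot_grid()
library(GGally)
library(tidyverse)
library(lubridate)
library(Hmisc)
# Working directory that holds the dataset file
setwd("~/Desktop/past semester/semester3/categerical/Assigmnent1.CDA/Our project1")
# Wrap long code lines in the knitted output
opts_chunk$set(tidy.opts=list(width.cutoff=55),tidy=TRUE)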

2.2. Loading Dataset

The read.csv() function is used to load the dataset into R.

HD <- read.csv("group13.csv")
head(HD)

3.2. Data Cleaning and Transformation

In the next step, some variables are converted to an appropriate class, because the dataset stores categorical variables as integer codes. The codes in the following features stand for these categories:

sex: 0 = Female and 1 = Male.

target: 0 = Not-heart-diseases and 1 = heart-diseases.

cp: 0 = NO-pain, 1 = Less-pain, 2 = Medium-pain and 3 = strongly-pain.

fbs: 0 = Normal and 1 = non-normal (fasting blood sugar above 120 mg/dl).

restecg: 0 = Normal, 1 = hypertrophy and 2 = wave abnormality.

exang: 0 = Not-angina and 1 = Have angina.

slope: 0 = up, 1 = Flat and 2 = Down.

thal: 0 = normal, 1 = defect, 2 = fixed-defect and 3 = reversable-defect.

HD$sex<-as.factor(HD$sex)
HD$cp<-as.factor(HD$cp)
HD$fbs<-as.factor(HD$fbs)
HD$restecg<-as.factor(HD$restecg)
HD$exang<-as.factor(HD$exang)
HD$slope<-as.factor(HD$slope)
HD$thal<-as.factor(HD$thal)
HD$target<-as.factor(HD$target)
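
The same kind of conversion can be written once for a whole set of columns. The following is a minimal sketch on a small made-up data frame; the names `toy` and `cat_cols` and all values are assumptions for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for HD (values are made up)
toy <- data.frame(
  age    = c(63, 37, 41, 56),
  sex    = c(1, 1, 0, 1),
  cp     = c(3, 2, 1, 1),
  target = c(1, 1, 0, 0)
)

# Columns that hold category codes rather than measurements
cat_cols <- c("sex", "cp", "target")

# Convert every categorical column to a factor in one step
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Inspect the resulting column classes
str(toy)
```

Listing the categorical columns in one vector keeps the conversion and any later relabelling in step with each other.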

After that, the summarizeColumns() function is used to obtain summary statistics: the mean and median for interval data, and the level frequencies and the number of missing values. For factor columns, min and max are the sizes of the smallest and largest levels.

summarizeColumns(HD)%>% select(-disp, -mad) %>% knitr::kable(caption ='Features Summary')
Features Summary
name type na mean median min max nlevs
age integer 0 54.3663366 55.0 29 77.0 0
sex factor 0 NA NA 96 207.0 2
cp factor 0 NA NA 23 143.0 4
trestbps integer 0 131.6237624 130.0 94 200.0 0
chol integer 0 246.2640264 240.0 126 564.0 0
fbs factor 0 NA NA 45 258.0 2
restecg factor 0 NA NA 4 152.0 3
thalach integer 0 149.6468647 153.0 71 202.0 0
exang factor 0 NA NA 99 204.0 2
oldpeak numeric 0 1.0396040 0.8 0 6.2 0
slope factor 0 NA NA 21 142.0 3
ca integer 0 0.7293729 0.0 0 4.0 0
thal factor 0 NA NA 2 166.0 4
target factor 0 NA NA 138 165.0 2

4.2. Missing Values

The Features Summary table shows that there are no missing values (the na column is 0 for every feature).

In the next step, descriptive labels are assigned to the category levels so that the categories are easier to read in the visualisations.
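
A missing-value count can also be obtained directly with base R. The sketch below uses a made-up data frame `toy` (an assumption for illustration) that deliberately contains one missing value; the same call on the real data would be colSums(is.na(HD)).

```r
# Toy data with one deliberately missing age value
toy <- data.frame(
  age  = c(63, 37, NA, 56),
  chol = c(233, 250, 204, 236)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only when the whole data frame is free of missing values
no_missing <- !anyNA(toy)
no_missing
```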

HD$sex <- factor(HD$sex,levels=c('0','1'),labels=c('Female','Male'))
HD$target <- factor(HD$target,levels=c('0','1'),labels=c('Not-heart-diseases','heart-diseases'))
HD$cp <- factor(HD$cp,levels=c('0','1','2','3'),labels=c('NO-pain','Less-pain' ,'Medium-pain','strongly-pain'))
HD$fbs <- factor(HD$fbs,levels=c('0','1'),labels=c('Normal','non-nmormal'))
HD$restecg <- factor(HD$restecg,levels=c('0','1','2'),labels=c('Normal','hypertrophy','wave abnormality'))
HD$exang <- factor(HD$exang,levels=c('0','1'),labels=c('Not-angina','Have angina'))
HD$slope <- factor(HD$slope,levels=c('0','1','2'),labels=c('up','Flat','Down'))
HD$thal <- factor(HD$thal,levels=c('0','1','2','3'),labels=c('normal','defect','fixed-defect','reversable-defect'))


summary(HD)
##       age            sex                  cp         trestbps    
##  Min.   :29.00   Female: 96   NO-pain      :143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   Less-pain    : 50   1st Qu.:120.0  
##  Median :55.00                Medium-pain  : 87   Median :130.0  
##  Mean   :54.37                strongly-pain: 23   Mean   :131.6  
##  3rd Qu.:61.00                                    3rd Qu.:140.0  
##  Max.   :77.00                                    Max.   :200.0  
##       chol                fbs                  restecg       thalach     
##  Min.   :126.0   Normal     :258   Normal          :147   Min.   : 71.0  
##  1st Qu.:211.0   non-nmormal: 45   hypertrophy     :152   1st Qu.:133.5  
##  Median :240.0                     wave abnormality:  4   Median :153.0  
##  Mean   :246.3                                            Mean   :149.6  
##  3rd Qu.:274.5                                            3rd Qu.:166.0  
##  Max.   :564.0                                            Max.   :202.0  
##          exang        oldpeak      slope           ca        
##  Not-angina :204   Min.   :0.00   up  : 21   Min.   :0.0000  
##  Have angina: 99   1st Qu.:0.00   Flat:140   1st Qu.:0.0000  
##                    Median :0.80   Down:142   Median :0.0000  
##                    Mean   :1.04              Mean   :0.7294  
##                    3rd Qu.:1.60              3rd Qu.:1.0000  
##                    Max.   :6.20              Max.   :4.0000  
##                 thal                    target   
##  normal           :  2   Not-heart-diseases:138  
##  defect           : 18   heart-diseases    :165  
##  fixed-defect     :166                           
##  reversable-defect:117                           
##                                                  
## 

3. Data Visualisation

1.3. Bar charts for Gender type proportion

In the next plot, we notice that the percentage of men is higher than the percentage of women: men make up approximately 68% of the patients, whereas women make up approximately 32%.

# Bar heights are proportions of all patients; the y label goes in labs(), not geom_bar()
p12 <- ggplot(data = HD) +
  geom_bar(mapping = aes(x = sex, y = after_stat(prop), group = 1), fill = "green") +
  labs(title = "Proportion of gender", y = "proportion")
p12
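
The gender proportions follow from the level counts reported by summary(HD) (96 female and 207 male patients). A short check, where the vector name `sex_counts` is chosen here only for illustration:

```r
# Level counts taken from the summary(HD) output
sex_counts <- c(Female = 96, Male = 207)

# Share of each gender as a percentage of all 303 patients
sex_prop <- round(100 * prop.table(sex_counts), 1)
sex_prop
# Female   Male
#   31.7   68.3
```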

2.3. Histogram for age frequency

The histogram of age suggests that the distribution of age is approximately normal (the mean of 54.4 is close to the median of 55).

HD$age %>% hist(col="skyblue",xlim=c(20,90), xlab="age",
main="Histogram of Age")

3.3. Bar chart for target proportion

The target bar chart shows that approximately 54% of the patients had heart disease (165 of 303), whereas approximately 46% did not (138 of 303).

# The y label is set in labs(); geom_bar() has no ylab argument
p11 <- ggplot(data = HD) +
  geom_bar(mapping = aes(x = target, y = after_stat(prop), group = 1)) +
  labs(title = "proportion of target", y = "proportion")

p11

4.3. Exercise induced angina vs Target

Approximately 70% of the patients without exercise-induced angina have heart disease, whereas around 30% of them do not. Nearly 77% of the patients with exercise-induced angina do not have heart disease, whereas around 23% of them do.

 exercise_induced_angina<- table(HD$exang, HD$target, dnn = c("exang", "target"))
exercise_induced_angina %>% knitr::kable(caption = "exercise induced angina on target")
exercise induced angina on target
Not-heart-diseases heart-diseases
Not-angina 62 142
Have angina 76 23
exercise_induced_angina_proportion <- round(prop.table(exercise_induced_angina, 1), 2)
exercise_induced_angina_proportion %>% knitr::kable(caption = "exercise induced angina Proportion on target")
exercise induced angina Proportion on target
Not-heart-diseases heart-diseases
Not-angina 0.30 0.70
Have angina 0.77 0.23
 p5 <- ggplot(data = HD, aes(x = exang, fill = target))
p5 + geom_bar(position = "dodge") + labs(title = "Frequency of exercise induced angina Vs target") +
scale_fill_manual(values = c("aquamarine3", "cadetblue2"))
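
The row proportions can be reproduced from the printed counts alone, and the same table can be passed to a test of independence. The chi-squared test is not part of the original analysis; it is shown only as a hedged sketch of one way to follow up, and the object names `exang_tab`, `row_prop` and `test` are assumptions for illustration.

```r
# Counts copied from the "exercise induced angina on target" table
exang_tab <- matrix(
  c(62, 142,
    76,  23),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    exang  = c("Not-angina", "Have angina"),
    target = c("Not-heart-diseases", "heart-diseases")
  )
)

# Row proportions: share of each target class within each exang group
row_prop <- round(prop.table(exang_tab, 1), 2)
row_prop

# Pearson chi-squared test of independence
# (for a 2x2 table R applies Yates' continuity correction by default)
test <- chisq.test(exang_tab)
test$p.value
```

A small p-value here would indicate that exang and target are unlikely to be independent, which is what the proportions already suggest.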

5.3. Gender vs target

We note that the percentage of females who have heart disease (75%) is three times the percentage of females who do not (25%). For males, the percentage without heart disease (55%) is 10 percentage points higher than the percentage with heart disease (45%). Possible explanations, such as physiological differences between females and males or matters related to pregnancy, are speculative and cannot be assessed from this dataset.

Gender_Frequency <- table(HD$sex, HD$target, dnn = c("Gender", "target"))
Gender_Frequency %>% knitr::kable(caption = "Gender Frequency on target")
Gender Frequency on target
Not-heart-diseases heart-diseases
Female 24 72
Male 114 93
Gender_Frequency_Proportion <- round(prop.table(Gender_Frequency, 1), 2)
Gender_Frequency_Proportion %>% knitr::kable(caption = "Gender Proportion on target")
Gender Proportion on target
Not-heart-diseases heart-diseases
Female 0.25 0.75
Male 0.55 0.45
p4 <- ggplot(data = HD, aes(x = sex, fill = target))
p4 + geom_bar() + labs(title = "Frequency of gender on target") +
scale_fill_manual(values = c("chartreuse3", "cyan4"))
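
Because the planned model is a logistic regression, it can help to express the gender table as odds and an odds ratio, which is the scale on which logistic regression coefficients are interpreted. The sketch below uses only the printed counts; the names `gender_tab`, `odds` and `odds_ratio` are assumptions for illustration.

```r
# Counts copied from the "Gender Frequency" table
gender_tab <- matrix(
  c( 24, 72,
    114, 93),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    target = c("Not-heart-diseases", "heart-diseases")
  )
)

# Odds of heart disease within each gender: diseased / not diseased
odds <- gender_tab[, "heart-diseases"] / gender_tab[, "Not-heart-diseases"]
odds
# Female = 3, Male is about 0.82

# Odds ratio of heart disease, females relative to males
odds_ratio <- odds[["Female"]] / odds[["Male"]]
odds_ratio
# about 3.68
```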

6.3. Slope vs target

Approximately 75% of patients with a downward ("Down") slope of the peak exercise ST segment have heart disease, while 35% of those with a flat slope and 43% of those with an upward ("up") slope have heart disease.

Slope_Frequency <- table(HD$slope, HD$target, dnn = c("Slope", "target"))
Slope_Frequency %>% knitr::kable(caption = "Slope Frequency on target")
Slope Frequency on target
Not-heart-diseases heart-diseases
up 12 9
Flat 91 49
Down 35 107
Slope_Frequency_Proportion <- round(prop.table(Slope_Frequency, 1), 2)
Slope_Frequency_Proportion %>% knitr::kable(caption = "Slope Proportion on target")
Slope Proportion on target
Not-heart-diseases heart-diseases
up 0.57 0.43
Flat 0.65 0.35
Down 0.25 0.75
mosaic(Slope_Frequency, pop = FALSE, legend = TRUE, shade = TRUE)
labeling_cells(text = Slope_Frequency_Proportion, margin = 0)(Slope_Frequency_Proportion)

7.3. chest pain type vs target

From the next graph, nearly 70% of patients with strong chest pain have heart disease, compared with 79% of those with medium chest pain and 82% of those with less chest pain. Lastly, 27% of patients without chest pain have heart disease.

cp_Frequency <- table(HD$cp, HD$target, dnn = c("chest pain type", "target"))
cp_Frequency %>% knitr::kable(caption = "chest pain type Frequency on target")
chest pain type Frequency on target
Not-heart-diseases heart-diseases
NO-pain 104 39
Less-pain 9 41
Medium-pain 18 69
strongly-pain 7 16
cp_Frequency_Proportion <- round(prop.table(cp_Frequency, 1), 2)
cp_Frequency_Proportion %>% knitr::kable(caption = "chest pain type Proportion on target")
chest pain type Proportion on target
Not-heart-diseases heart-diseases
NO-pain 0.73 0.27
Less-pain 0.18 0.82
Medium-pain 0.21 0.79
strongly-pain 0.30 0.70
p6 <- ggplot(HD, aes(x = cp)) + geom_bar() + labs(title = "Frequency of chest pain type Vs target") + theme(axis.text.x = element_text(angle = 90, vjust = 0.3,
hjust = 1))
p7 <- ggplot(HD, aes(x = cp, fill = target)) +
geom_bar() + facet_grid(target ~ .) + theme(axis.text.x = element_text(angle = 90,
vjust = 0.3, hjust = 1)) 
plot_grid(p6, p7, ncol = 2)

8.3. Relationship trestbps, slope vs target

The following figure shows the relationship between resting blood pressure and the slope of the peak exercise ST segment. Among patients with an up slope, those without heart disease have blood pressure from about 120 to 150, while those with heart disease have blood pressure of about 125 to 145. Among patients with a flat slope, those without heart disease have blood pressure from approximately 122 to 145, while those with heart disease have blood pressure of about 110 to 138. Among patients with a down slope, those without heart disease have blood pressure from approximately 118 to 140, which is nearly the same as the range for those with heart disease.

p8 <- ggplot(data = HD, aes(x = trestbps,
                                         y = slope,
                                         fill = target))
# Axis labels name the plotted variables
p8 + geom_boxplot() + labs(y = "slope", x = "trestbps (resting blood pressure)") +
  theme(legend.title=element_blank()) + coord_flip()
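
Ranges like the ones read off the boxes can also be computed numerically as the first and third quartiles of each group. The sketch below runs on simulated data, so its numbers are not the results of this analysis; the data frame `toy` and the simulation settings are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for HD: 20 patients per slope/target combination
toy <- data.frame(
  slope    = factor(rep(c("up", "Flat", "Down"), each = 40)),
  target   = factor(rep(c("Not-heart-diseases", "heart-diseases"), times = 60)),
  trestbps = round(rnorm(120, mean = 130, sd = 15))
)

# First and third quartile of trestbps for every slope/target group,
# i.e. the lower and upper edge of each box in a boxplot
iqr_tab <- aggregate(
  trestbps ~ slope + target,
  data = toy,
  FUN  = function(x) quantile(x, probs = c(0.25, 0.75))
)
iqr_tab
```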

9.3. Relationship Among Age, serum cholesterol vs Target

From the next figure we note the following:

Among patients who have heart disease, the cholesterol level rises gradually from about 180 mg/dl for those around 28 years old to about 265 mg/dl for those around 65 years old, and then falls to about 240 mg/dl for those around 76 years old.

Patients who do not have heart disease have slightly higher cholesterol than those with heart disease, except between the ages of 60 and 70, where the cholesterol level of those without heart disease is lower than that of those with heart disease.

p12 <- ggplot(data =HD, aes(x =age, y =chol , colour = target))
p12 + geom_point() + geom_smooth()+
  labs(x = "Age", y = "serum cholestoral in mg/dl",
       title = "The Relationship Among Age, serum cholestoral and target") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
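
The message above shows that geom_smooth() fitted a loess curve. As a rough sketch of what that smoother does, the same kind of fit can be produced with stats::loess(); the data below are simulated, and the names `age`, `chol`, `fit`, `grid` and `smoothed` are assumptions for illustration, not values from the dataset.

```r
set.seed(42)

# Simulated ages and cholesterol values (not the real data)
age  <- sample(29:77, 100, replace = TRUE)
chol <- 200 + 0.8 * (age - 29) + rnorm(100, sd = 30)

# Local polynomial regression with the default span of 0.75
fit <- loess(chol ~ age)

# Evaluate the smooth curve at five ages across the observed range
grid     <- data.frame(age = seq(min(age), max(age), length.out = 5))
smoothed <- predict(fit, newdata = grid)
cbind(grid, smoothed)
```

Reading values from the fitted curve in this way is less error-prone than estimating them by eye from the plot.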

10.3. Relationship Among Age, serum cholesterol and target for gender

In the next figure, we observe that for males the difference in cholesterol level between those who have heart disease and those who do not is slight, while for females there is a large difference in cholesterol level between those with heart disease and those without.

p13 <- ggplot(data = HD, aes(x = age, y = chol, colour = target))
p13 + geom_point() + geom_smooth() + facet_grid(. ~ sex)+
  labs(x = "Age", y = "serum cholestoral in mg/dl",
       title = "The Relationship Among Age, serum cholestoral and target for gender") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

11.3. Bar chart of target vs sugar in blood vs resting electrocardiographic results

In the next graph, about 75% of the patients with normal blood sugar and a wave abnormality do not have heart disease, while about 67% of the patients with normal blood sugar and hypertrophy have heart disease. Approximately 65% of the patients with non-normal blood sugar and hypertrophy have heart disease. Approximately 50% of the patients with normal blood sugar and normal resting electrocardiographic results do not have heart disease.

restecg_Frequency <- table(HD$restecg,HD$target,dnn = c("restecg","Target"))
restecg_Frequency %>% knitr::kable( caption =  'restecg Frequency on target')
restecg Frequency on target
Not-heart-diseases heart-diseases
Normal 79 68
hypertrophy 56 96
wave abnormality 3 1
restecg_Proportion <- round(prop.table(restecg_Frequency,1),2)
restecg_Proportion %>% knitr::kable( caption =  'restecg Proportion on target')
restecg Proportion on target
Not-heart-diseases heart-diseases
Normal 0.54 0.46
hypertrophy 0.37 0.63
wave abnormality 0.75 0.25
fbs_Frequency <- table(HD$fbs ,HD$target,dnn = c("fbs","Target"))
fbs_Frequency %>% knitr::kable( caption =  'fbs Frequency on target')
fbs Frequency on target
Not-heart-diseases heart-diseases
Normal 116 142
non-nmormal 22 23
fbs_Proportion <- round(prop.table(fbs_Frequency,1),2)
fbs_Proportion %>% knitr::kable( caption =  'fbs Proportion on target')
fbs Proportion on target
Not-heart-diseases heart-diseases
Normal 0.45 0.55
non-nmormal 0.49 0.51
p8 <-ggplot(HD, aes(x = restecg, fill = target)) + 
  geom_bar(position = 'fill') + facet_grid( ~  fbs) + 
  theme(axis.text.x = element_text(angle = 90, vjust=0.3, hjust = 1)) +
  labs(title = 'Proportional Bar Chart: sugar in blood and resting electrocardiographic results, target') + coord_flip()
p8
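
The proportions shown by a filled, faceted bar chart correspond to a three-way table normalised within each restecg and fbs combination. The sketch below shows that calculation on simulated data; the data frame `toy` and its values are assumptions for illustration, so the printed proportions are not results of this analysis.

```r
set.seed(7)

# Simulated stand-in for the three categorical columns
toy <- data.frame(
  fbs     = sample(c("Normal", "non-normal"), 200, replace = TRUE),
  restecg = sample(c("Normal", "hypertrophy"), 200, replace = TRUE),
  target  = sample(c("Not-heart-diseases", "heart-diseases"), 200, replace = TRUE)
)

# Three-way table of counts
counts <- xtabs(~ restecg + fbs + target, data = toy)

# Share of each target class within every restecg x fbs cell,
# which is what position = 'fill' draws inside each facet
props <- prop.table(counts, margin = c(1, 2))
round(ftable(props), 2)
```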


4. Results

Analysing the data showed that most of the patients with heart disease are men (93 of 165), who are mostly middle-aged. Likewise, most women who have heart disease are middle-aged. Approximately 54% of the patients in the dataset have heart disease, whereas approximately 46% do not. Among patients with exercise-induced angina, about 23% have heart disease. The percentage of patients with heart disease among those with an up, down, and flat slope of the peak exercise ST segment is about 43%, 75%, and 35%, respectively. Nearly 70% of patients with strong chest pain have heart disease, compared with 79% of those with medium chest pain, 82% of those with less chest pain and 27% of patients without chest pain.

For resting blood pressure and the slope of the peak exercise ST segment, patients with an up slope and no heart disease have blood pressure from about 120 to 150, while those with heart disease have blood pressure of about 125 to 145. Among patients with a flat slope, those without heart disease have blood pressure from approximately 122 to 145, while those with heart disease have blood pressure of about 110 to 138. Among patients with a down slope, those without heart disease have blood pressure from approximately 118 to 140, which is nearly the same as the range for those with heart disease.

Among patients who have heart disease, the cholesterol level rises gradually from about 180 mg/dl for those around 28 years old to about 265 mg/dl for those around 65 years old, and then falls to about 240 mg/dl for those around 76 years old. Patients who do not have heart disease have slightly higher cholesterol than those with heart disease, except between the ages of 60 and 70, where the cholesterol level of those without heart disease is lower.

About 75% of the patients with normal blood sugar and a wave abnormality do not have heart disease, while about 67% of the patients with normal blood sugar and hypertrophy have heart disease. Approximately 65% of the patients with non-normal blood sugar and hypertrophy have heart disease. Approximately 50% of the patients with normal blood sugar and normal resting electrocardiographic results do not have heart disease.

5. Conclusion

Many features appear to be associated with heart disease, such as age, cholesterol, diabetes, types of chest pain, etc. The data will be further analysed to see whether there is a significant relationship between the features and the likelihood of heart disease. Furthermore, a logistic regression model can be used to predict an individual's probability of heart disease.
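
A minimal sketch of the logistic regression step mentioned above is given here, fitted on simulated data. The data frame `toy`, the coefficients used to simulate the outcome, and the `new_patient` record are all assumptions for illustration; none of the numbers are results from the heart disease dataset.

```r
set.seed(123)
n <- 300

# Simulated predictors (not the real data)
toy <- data.frame(
  age   = round(runif(n, min = 29, max = 77)),
  exang = factor(
    sample(c("Not-angina", "Have angina"), n, replace = TRUE),
    levels = c("Not-angina", "Have angina")
  )
)

# Simulate a binary outcome from an assumed linear predictor
lin_pred   <- 1 - 0.01 * toy$age - 1.5 * (toy$exang == "Have angina")
toy$target <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Logistic regression: log-odds of target as a linear function of the features
fit <- glm(target ~ age + exang, data = toy, family = binomial)
summary(fit)$coefficients

# Predicted probability of heart disease for one hypothetical patient
new_patient <- data.frame(
  age   = 55,
  exang = factor("Have angina", levels = levels(toy$exang))
)
p_hat <- predict(fit, newdata = new_patient, type = "response")
p_hat
```

With type = "response", predict() returns a probability between 0 and 1 rather than the log-odds.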

6. References