Courtesy of AbdelMalek Hajjam
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Attribute Information:
1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Objective:
As per the questions asked in discussion 5, the following objectives need to be met using the dplyr and tidyr packages:
1. Rename all the columns so that their names are descriptive
2. Wrangle the data by creating categories, applying filters, etc.
3. Analyze the data
# Loading the required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
# First of all, let's import the dataset into R
# Empty strings and "NA" are both read as missing values (NA)
heartd <- read.csv("heart.csv", header=TRUE, stringsAsFactors = FALSE, na.strings=c("", "NA"))
head(heartd)
## ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0
## thal target
## 1 1 1
## 2 2 1
## 3 2 1
## 4 2 1
## 5 2 1
## 6 1 1
The data frame above shows that most of the variable names are abbreviations that are hard to interpret; from the table alone, a reader cannot tell what the values mean.
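The `ï..age` header in the output above is what a UTF-8 byte order mark (BOM) at the start of a CSV file typically looks like after import. As a minimal sketch, assuming that is the cause here, passing `fileEncoding = "UTF-8-BOM"` to `read.csv()` strips the mark so the first column arrives as `age`. The temporary file below is a stand-in built for illustration, not the real `heart.csv`.

```r
# Build a tiny stand-in CSV that starts with a UTF-8 byte order mark (EF BB BF).
# This file is hypothetical; it only mimics the first two columns of heart.csv.
bom_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("age,sex\n63,1\n37,1\n")), bom_file)

# Declaring the encoding lets R drop the BOM before parsing the header
demo <- read.csv(bom_file, header = TRUE, fileEncoding = "UTF-8-BOM")
names(demo)
```

If the import is done this way, the first column is already called `age` and needs no rename of its own.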
# Firstly, let's rename all the columns to make them easy to understand
# (sex already has a clear name, so it is left as it is)
heartd <- heartd %>%
  rename(
    age = ï..age,
    chest_paintype = cp,
    resting_bp = trestbps,
    cholesterol_in_mg = chol,
    fasting_bloodsugar = fbs,
    resting_ecg = restecg,
    max_heartrate_ach = thalach,
    exercise_induce_angina = exang,
    stdepression_induced_byexercise = oldpeak,
    slope_ofpeak_exercise = slope,
    number_majorvessels = ca,
    effect = thal
  )
head(heartd)
## age sex chest_paintype resting_bp cholesterol_in_mg fasting_bloodsugar
## 1 63 1 3 145 233 1
## 2 37 1 2 130 250 0
## 3 41 0 1 130 204 0
## 4 56 1 1 120 236 0
## 5 57 0 0 120 354 0
## 6 57 1 0 140 192 0
## resting_ecg max_heartrate_ach exercise_induce_angina
## 1 0 150 0
## 2 1 187 0
## 3 0 172 0
## 4 1 178 0
## 5 1 163 1
## 6 1 148 0
## stdepression_induced_byexercise slope_ofpeak_exercise
## 1 2.3 0
## 2 3.5 0
## 3 1.4 2
## 4 0.8 2
## 5 0.6 2
## 6 0.4 1
## number_majorvessels effect target
## 1 0 1 1
## 2 0 2 1
## 3 0 2 1
## 4 0 2 1
## 5 0 2 1
## 6 0 1 1
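When many columns are renamed at once, the mapping can also live in one named vector, which keeps the old and new names side by side. A minimal sketch, assuming dplyr 1.0 or later (for `all_of()` inside `rename()`); `name_map` and the three-column `demo` data frame are illustrative and not part of the original analysis.

```r
library(dplyr)

# Illustrative data frame with three of the original short column names
demo <- data.frame(cp = c(3, 2), trestbps = c(145, 130), chol = c(233, 250))

# Named vector: new name = "old name"
name_map <- c(
  chest_paintype    = "cp",
  resting_bp        = "trestbps",
  cholesterol_in_mg = "chol"
)

# all_of() makes rename() fail loudly if an old name is missing
demo <- demo %>% rename(all_of(name_map))
names(demo)
```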
# Now let's replace the numeric codes with descriptive labels. The file stores only codes, which makes the data hard to read, so we convert the coded columns to labeled factors
heartd1 <- heartd # Work on a copy so the renamed data frame stays intact if anything goes wrong
# Convert each coded column to a factor and attach a label to each level
heartd1$sex <- factor(heartd1$sex, levels = c(0, 1), labels = c("Female", "Male"))
heartd1$chest_paintype <- factor(heartd1$chest_paintype, levels = c(0, 1, 2, 3), labels = c("Type 1 Angina", "Type 2 Angina", "Non-Angina Pain", "Asymptomatic"))
heartd1$resting_ecg <- factor(heartd1$resting_ecg, levels = c(0, 1, 2), labels = c("Normal", "having ST-T wave abnormality", "left ventricular hypertrophy"))
heartd1$fasting_bloodsugar <- factor(heartd1$fasting_bloodsugar, levels = c(0, 1), labels = c("< 120 mg/dl", "> 120 mg/dl"))
heartd1$exercise_induce_angina <- factor(heartd1$exercise_induce_angina, levels = c(0, 1), labels = c("No", "Yes"))
# The slope codes in this file run from 0 to 2, so the levels must be 0, 1, 2; codes missing from the levels would become NA
heartd1$slope_ofpeak_exercise <- factor(heartd1$slope_ofpeak_exercise, levels = c(0, 1, 2), labels = c("Upsloping", "Flat", "Downsloping"))
head(heartd1)
## age sex chest_paintype resting_bp cholesterol_in_mg
## 1 63 Male Asymptomatic 145 233
## 2 37 Male Non-Angina Pain 130 250
## 3 41 Female Type 2 Angina 130 204
## 4 56 Male Type 2 Angina 120 236
## 5 57 Female Type 1 Angina 120 354
## 6 57 Male Type 1 Angina 140 192
## fasting_bloodsugar resting_ecg max_heartrate_ach
## 1 > 120 mg/dl Normal 150
## 2 < 120 mg/dl having ST-T wave abnormality 187
## 3 < 120 mg/dl Normal 172
## 4 < 120 mg/dl having ST-T wave abnormality 178
## 5 < 120 mg/dl having ST-T wave abnormality 163
## 6 < 120 mg/dl having ST-T wave abnormality 148
## exercise_induce_angina stdepression_induced_byexercise
## 1 No 2.3
## 2 No 3.5
## 3 No 1.4
## 4 No 0.8
## 5 Yes 0.6
## 6 No 0.4
## slope_ofpeak_exercise number_majorvessels effect target
## 1 Upsloping 0 1 1
## 2 Upsloping 0 2 1
## 3 Downsloping 0 2 1
## 4 Downsloping 0 2 1
## 5 Downsloping 0 2 1
## 6 Flat 0 1 1
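Any code that is missing from `levels` silently becomes `NA` when `factor()` is applied, so it can help to compare the codes in a column with the levels before labeling. A minimal sketch in base R; `label_codes()` is a hypothetical helper, and the example vector is made up for illustration.

```r
# Hypothetical helper: label numeric codes, but stop if a code has no level
label_codes <- function(x, levels, labels) {
  unknown <- setdiff(unique(x[!is.na(x)]), levels)
  if (length(unknown) > 0) {
    stop("codes without a level: ", paste(sort(unknown), collapse = ", "))
  }
  factor(x, levels = levels, labels = labels)
}

# Made-up codes for illustration
codes <- c(0, 1, 1, 0)
label_codes(codes, levels = c(0, 1), labels = c("No", "Yes"))

# A mismatch is reported as an error message instead of turning into NA
result <- tryCatch(
  label_codes(codes, levels = c(1, 2), labels = c("No", "Yes")),
  error = function(e) conditionMessage(e)
)
result
```

Running such a check on each coded column before calling `factor()` would surface a level mismatch at the point where it happens.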
Now our data is in good shape: all column names are updated and labels are in place to make it understandable. Let’s filter the data by different classes to explore it further. Some of the variables above, such as ‘target’, would be useful for machine learning algorithms, as this dataset is typically used for ML practice on Kaggle.
Because the dataset has many variables, we are going to select a few of them to keep the analysis easy to follow.
heartd2 <- heartd1 %>%
group_by(sex) %>%
select(age, sex, chest_paintype, resting_bp, resting_ecg, cholesterol_in_mg, max_heartrate_ach, effect) %>%
arrange(effect) %>%
ungroup()
head(heartd2)
## # A tibble: 6 x 8
## age sex chest_paintype resting_bp resting_ecg cholesterol_in_~
## <int> <fct> <fct> <int> <fct> <int>
## 1 53 Fema~ Non-Angina Pa~ 128 Normal 216
## 2 52 Male Type 1 Angina 128 having ST-~ 204
## 3 63 Male Asymptomatic 145 Normal 233
## 4 57 Male Type 1 Angina 140 having ST-~ 192
## 5 52 Male Asymptomatic 118 Normal 186
## 6 41 Male Type 2 Angina 135 having ST-~ 203
## # ... with 2 more variables: max_heartrate_ach <int>, effect <int>
ggplot(heartd2, aes(x = resting_bp, y = age, color = sex)) + geom_point()
The plot suggests that resting blood pressure tends to rise with age, i.e. the two have a positive relationship. The points do not fall on a straight line, because other factors also affect blood pressure, so we can only say that age affects resting blood pressure to some extent.
Now, let’s see the relationship between AGE and SERUM CHOLESTEROL
ggplot(heartd2, aes(x = cholesterol_in_mg, y = age, color = sex)) + geom_point()
The same holds for this plot, which appears to show a somewhat clearer upward trend: cholesterol tends to increase with age, so the relationship is again positive.
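A scatter plot is read by eye, so a correlation coefficient is one way to put a number on the direction of a trend. The sketch below uses only the six rows printed by `head(heartd)` earlier, so the values it returns describe that tiny sample and not the full dataset; on the real data the same calls would take `heartd2$age` and the other columns.

```r
# First six rows as printed by head(heartd); far too few to draw conclusions
first_rows <- data.frame(
  age               = c(63, 37, 41, 56, 57, 57),
  resting_bp        = c(145, 130, 130, 120, 120, 140),
  cholesterol_in_mg = c(233, 250, 204, 236, 354, 192)
)

# Pearson correlation: the sign gives the direction, the size the strength
r_bp   <- cor(first_rows$age, first_rows$resting_bp)
r_chol <- cor(first_rows$age, first_rows$cholesterol_in_mg)
c(age_vs_resting_bp = r_bp, age_vs_cholesterol = r_chol)
```

A value near 0 would mean the visual impression of a trend is weak, while a value closer to 1 or -1 would support it.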
ggplot(heartd2, aes(x = effect, y = cholesterol_in_mg, color = sex)) + geom_point()
The plot above suggests a relationship between cholesterol and the effect variable (the thal code from the attribute list). Higher cholesterol may be associated with more damage to the heart and, eventually, to the patient’s overall health, although a scatter plot against a coded variable cannot establish that on its own.
heartd2 %>%
group_by(sex) %>%
summarize(avg_cholesterol = mean(cholesterol_in_mg), avg_maxheartrate_achieved= mean(max_heartrate_ach))
## # A tibble: 2 x 3
## sex avg_cholesterol avg_maxheartrate_achieved
## <fct> <dbl> <dbl>
## 1 Female 261. 151.
## 2 Male 239. 149.
The overall average cholesterol of the respondents was 261.30 for females and 239.28 for males. Likewise, the average maximum heart rate achieved was 151.12 for females and 148.96 for males.
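Since tidyr is one of the packages this exercise calls for, the summary can also be reshaped into long format, which is the layout ggplot2 usually prefers when several measures are plotted together. A minimal sketch, assuming tidyr 1.0 or later (for `pivot_longer()`); the data frame re-enters the four averages quoted above, and the column names `measure` and `average` are illustrative choices.

```r
library(tidyr)

# Averages as reported in the text above
summary_wide <- data.frame(
  sex = c("Female", "Male"),
  avg_cholesterol = c(261.30, 239.28),
  avg_maxheartrate_achieved = c(151.12, 148.96)
)

# One row per sex and measure instead of one column per measure
summary_long <- pivot_longer(
  summary_wide,
  cols = c(avg_cholesterol, avg_maxheartrate_achieved),
  names_to = "measure",
  values_to = "average"
)
summary_long
```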
After going through the data cleaning process, we have seen that heart disease depends on many factors, such as cholesterol and blood pressure. For the sake of simplicity, this project looked only at how cholesterol and maximum heart rate achieved relate to adverse effects in males and females. The plots suggest that resting blood pressure and cholesterol tend to increase with age, which may eventually contribute to heart disease and severe effects. The dataset contains many other variables that were not included in this conclusion; they could be examined in the same way to study the effects of other factors.