RESEARCH QUESTION: Are people over the age of 40 more prone to heart disease, regardless of gender?
library(tidyr)   # provides drop_na()
library(ggplot2) # provides ggplot() and geom_histogram()
data <- read.table("C:/Users/Alisa/Downloads/archive/heart.csv", header=TRUE, sep=",", dec=".")
mydata <- data[ , -c(6,7,9,10,11)]
colnames(mydata) <- c("Age", "Gender", "ChestPain", "BloodPressure", "Cholesterol", "MaxHeartRate", "HeartDisease")
mydata$GenderFactor <- factor(mydata$Gender,
                              levels = c("M", "F"),
                              labels = c("Male", "Female"))
mydata$ChestPainFactor <- factor(mydata$ChestPain,
                                 levels = c("TA", "ATA", "NAP", "ASY"),
                                 labels = c("1", "2", "3", "4"))
mydata$HeartDiseaseFactor <- factor(mydata$HeartDisease,
                                    levels = c("0", "1"),
                                    labels = c("No", "Yes"))
mydata1 <- drop_na(mydata) # drops rows with missing values (NA); there were none, so all 918 rows are kept
head(mydata1)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor ChestPainFactor
## 1 40 M ATA 140 289 172 0 Male 2
## 2 49 F NAP 160 180 156 1 Female 3
## 3 37 M ATA 130 283 98 0 Male 2
## 4 48 F ASY 138 214 108 1 Female 4
## 5 54 M NAP 150 195 122 0 Male 3
## 6 39 M NAP 120 339 170 0 Male 3
## HeartDiseaseFactor
## 1 No
## 2 Yes
## 3 No
## 4 Yes
## 5 No
## 6 No
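The claim that drop_na() removed nothing can be checked by counting the missing values per column before dropping. The following is a minimal sketch of such a check; the helper name count_missing and the toy data frame are hypothetical and serve only to illustrate the idea.

```r
# Count the missing values (NA) in every column of a data frame.
# If every count is zero, drop_na() will not remove any rows.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Hypothetical toy data (not the heart dataset), used only to show the behaviour.
toy <- data.frame(Age = c(40, NA, 37), Cholesterol = c(289, 180, NA))
print(count_missing(toy))

# On the real data, assuming mydata has been created as above:
if (exists("mydata")) print(count_missing(mydata))
```

If all counts for mydata are zero, mydata1 has the same number of rows as mydata.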
Description:
Unit of observation: a patient
Sample size: 918 observations
Variables:
Age: age of the patient [years]
Gender: gender of the patient (M = Male, F = Female)
ChestPain: chest pain type (TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic)
BloodPressure: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
MaxHeartRate: maximum heart rate achieved [numeric value between 60 and 202]
HeartDisease: output class (1 = has heart disease, 0 = does not have heart disease)
Data source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download
summary(mydata1)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease
## Min. :28.00 Length:918 Length:918 Min. : 0.0 Min. : 0.0 Min. : 60.0 Min. :0.0000
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0 1st Qu.:173.2 1st Qu.:120.0 1st Qu.:0.0000
## Median :54.00 Mode :character Mode :character Median :130.0 Median :223.0 Median :138.0 Median :1.0000
## Mean :53.51 Mean :132.4 Mean :198.8 Mean :136.8 Mean :0.5534
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0 3rd Qu.:156.0 3rd Qu.:1.0000
## Max. :77.00 Max. :200.0 Max. :603.0 Max. :202.0 Max. :1.0000
## GenderFactor ChestPainFactor HeartDiseaseFactor
## Male :725 1: 46 No :410
## Female:193 2:173 Yes:508
## 3:203
## 4:496
##
##
Age: The patients in the dataset are between 28 and 77 years old. Half of them are 54 or younger and half are 54 or older (median), while the mean age is 53.51 years.
Gender: Of the 918 patients, 725 (79.0%) are male and 193 (21.0%) are female, so the large majority is male.
Blood Pressure: The average resting blood pressure of the patients is 132.4 mm Hg and the maximum is 200.0 mm Hg.
Cholesterol: The summary shows that 25% of the patients have a cholesterol level of at most 173.2 (first quartile). The mean is 198.8 and the maximum is 603.0.
Maximum heart rate: The maximum heart rate ranges from 60 (min) to 202 (max). Half of the patients have a maximum heart rate below 138 (median) and half have a higher one.
Heart Disease: 508 patients (55.3%) have heart disease and 410 (44.7%) do not, so the two groups are of comparable size.
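The group sizes are easier to compare as percentages than as raw counts. As a small worked example, the counts printed by summary() above can be converted into shares with prop.table(); the vector names below are chosen only for this illustration.

```r
# Counts copied from the summary() output above.
disease_counts <- c(No = 410, Yes = 508)
gender_counts  <- c(Male = 725, Female = 193)

# prop.table() divides each count by the total; multiply by 100 for percent.
disease_share <- round(100 * prop.table(disease_counts), 1)
gender_share  <- round(100 * prop.table(gender_counts), 1)

print(disease_share)
print(gender_share)
```

The share of patients with heart disease agrees with the mean of the 0/1 variable HeartDisease reported in the summary (0.5534).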
ggplot(mydata1, aes(x = Age, fill = GenderFactor)) +
  geom_histogram(position = position_dodge(width = 11), binwidth = 7, colour = "white") +
  facet_wrap(~HeartDiseaseFactor, ncol = 1) +
  ylab("Frequency") +
  labs(fill = "Gender") +
  ggtitle("Distribution of people who have a disease based on age, by gender")
The two histograms show that more male than female patients have heart disease, but this comparison of counts is of limited value because the majority of patients in the sample is male. Looking at shares instead, males appear to be more prone to the disease overall: a larger share of the observed males has the disease, while a larger share of the observed females does not. The histograms also suggest that females get the disease later in life than males. All of these observations apply only to the female and male patients in this sample, most of whom are male.
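The argument about shares can be made explicit with a row-wise proportion table, which removes the effect of the unequal group sizes. The sketch below is one possible way to compute it; the function name share_by_group and the toy vectors are hypothetical, and the real data are only used if mydata1 exists in the session.

```r
# Percentage of each outcome within every group (each row sums to 100).
share_by_group <- function(group, outcome) {
  round(100 * prop.table(table(group, outcome), margin = 1), 1)
}

# Hypothetical toy data (not the heart dataset), used only to show the output shape.
toy_gender  <- factor(c("Male", "Male", "Male", "Female", "Female"))
toy_disease <- factor(c("Yes", "Yes", "No", "No", "Yes"), levels = c("No", "Yes"))
toy_share   <- share_by_group(toy_gender, toy_disease)
print(toy_share)

# On the real data, assuming mydata1 has been created as above:
if (exists("mydata1")) {
  print(share_by_group(mydata1$GenderFactor, mydata1$HeartDiseaseFactor))
}
```

Comparing the "Yes" column between the two rows shows which gender has the larger share of heart disease, independent of how many males and females were observed.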
ggplot(mydata1, aes(x = Age, fill = HeartDiseaseFactor)) +
  geom_histogram(position = position_dodge(width = 3), binwidth = 7, colour = "white") +
  ylab("Frequency") +
  labs(fill = "Heart Disease") +
  ggtitle("Distribution of people who have a disease based on age")
The histogram shows the age distribution of the patients who have and who do not have heart disease. Looking only at the blue columns (patients with heart disease), far more cases occur among patients older than 40 than among younger patients. The distribution is skewed to the left. Based on this sample, the answer to our research question appears to be yes, although the histogram compares counts rather than shares and the sample contains far more patients over 40 than under 40.
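Since the histogram compares counts, a direct comparison of the share of heart disease in the two age groups would support the conclusion more firmly. The following is a minimal sketch of such a comparison, not part of the original analysis; the function name share_by_age_group, the cutoff argument, and the toy data are hypothetical.

```r
# Split patients at an age cutoff and report counts and row percentages
# of the outcome within each age group.
share_by_age_group <- function(age, disease, cutoff = 40) {
  older   <- paste("Over", cutoff)
  younger <- paste(cutoff, "or younger")
  group   <- factor(ifelse(age > cutoff, older, younger), levels = c(younger, older))
  counts  <- table(AgeGroup = group, HeartDisease = disease)
  list(counts = counts,
       share  = round(100 * prop.table(counts, margin = 1), 1))
}

# Hypothetical toy data (not the heart dataset), used only to show the output shape.
toy_age     <- c(30, 35, 38, 45, 52, 60, 63, 70)
toy_disease <- factor(c("No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"),
                      levels = c("No", "Yes"))
toy_result  <- share_by_age_group(toy_age, toy_disease)
print(toy_result$share)

# On the real data, assuming mydata1 has been created as above:
if (exists("mydata1")) {
  result <- share_by_age_group(mydata1$Age, mydata1$HeartDiseaseFactor)
  print(result$share)
  # Chi-squared test of independence between age group and heart disease.
  print(chisq.test(result$counts))
}
```

If the "Yes" share in the over-40 row is clearly larger than in the other row, the sample supports the research question. The same function could be applied separately to male and female patients to check the "regardless of gender" part.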