# Load the libraries used throughout this report.
library(dplyr)
library(ggplot2)

Key Indicators of Heart Disease

Description of Dataset

The data set is heart_2020_cleaned.csv, obtained from Kaggle. It reports a 2020 annual CDC survey of about 400k adults on their health status, centered on factors associated with the development of heart disease. It is a single 25.2 MB file.

Risk Factor	Question	Response
Heart Disease	Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)	Yes or No
BMI	Body Mass Index, a value derived from the height and weight of a person. (kg/m²)	12-94.8
Smoking	Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]	Yes or No
Alcohol	Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)	Yes or No
Stroke	Have you ever had a stroke?	Yes or No
Physical Health	How many days in the past 30 days has your physical health been not good? Not good = physical injury or illness	0-30
Mental Health	How many days in the past 30 days has your mental health been not good? Not good = depressive episodes, anxiety, stress, suicidal thoughts, etc.	0-30
Difficulty Walking	Do you have difficulty walking or climbing stairs?	Yes or No
Sex	Are you male or female?	Male or Female
Diabetic	Do you have diabetes? (type 1 or 2)	Yes or No
General Health	How would you categorize your overall health?	Excellent, Very good, Good, Fair, Poor
Physical Activity	Do you perform physical activity regularly?	Yes or No
Race	What is your race?	American Indian/Alaskan Native, Asian, Black, Hispanic, White, Other
Sleep Time	On average, how many hours of sleep do you get per night?	0-24
Asthma	Have you ever been diagnosed with asthma?	Yes or No
Kidney Disease	Kidney Disease?	Yes or No
Skin Cancer	Have you ever been diagnosed with skin cancer?	Yes or No
Age	What is your age?	25-80+

Objective

Which factors of health are most attributable to the development of heart disease? Are the most attributable factors preventable (smoking, drinking, etc.) or fixed (race, sex, age)?

Findings

Heart Disease

Below is a pie chart showing the share of respondents to the 2020 annual CDC health survey who answered ‘Yes’ to having heart disease. Of the approximately 400,000 people surveyed, about 8.56% reported having heart disease.

# Pie chart of heart disease prevalence among survey respondents

library(RColorBrewer)
library(scales)

# Colour palette for the pie slices
myPalette <- brewer.pal(5, "Accent")

# Read the data (assumes the CSV is saved in ~/Documents/module1)
setwd("~/Documents/module1")
heart_2020_cleaned <- read.csv("heart_2020_cleaned.csv")

# Count the "Yes" and "No" responses
heartposcount <- sum(heart_2020_cleaned$HeartDisease == "Yes")
heartnegcount <- sum(heart_2020_cleaned$HeartDisease == "No")

# Build slice labels of the form "Yes 8.56%"
x <- c(heartposcount, heartnegcount)
pct <- round(x / sum(x) * 100, digits = 2)
labels <- paste0(c("Yes", "No"), " ", pct, "%")
pie(x, labels, main = "2020 CDC Heart Disease Prevalence", cex = 0.5, border = "white", col = myPalette)
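
The percentages behind the pie chart can also be computed in a single dplyr pipeline. The following is a minimal sketch, assuming a hypothetical toy data frame (`toy`) in place of the real survey file; its values are made up for illustration and are not survey results.

```r
library(dplyr)

# Hypothetical toy data standing in for heart_2020_cleaned (not the real survey).
toy <- data.frame(
  HeartDisease = c("Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Count each response and convert the counts to percentages in one step.
prevalence <- toy %>%
  count(HeartDisease) %>%
  mutate(pct = round(n / sum(n) * 100, digits = 2))

prevalence
```

Replacing `toy` with `heart_2020_cleaned` should give the same counts and percentages that feed the pie chart.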

Top Four Indicators of Heart Disease

Initially, I approached the report by making graphs like the one below, filtering heart disease to show only “Yes” responses and then comparing that against each factor. The graph below shows the BMIs of the survey respondents. To better understand the interconnections in my data, I then used Power BI’s Key Influencers tool to determine the top 4 key influences on heart disease.

# Scatter plot: number of respondents at each (rounded) BMI value
bmivals <- aggregate(heart_2020_cleaned$BMI, by = list(round(heart_2020_cleaned$BMI)), FUN = length)
df <- bmivals[order(-bmivals$x), ]

# Named bmi_plot to avoid masking base R's plot() function
bmi_plot <- ggplot(data = df,
    mapping = aes(x = Group.1, y = x, colour = x)) +
    geom_point() +
    labs(title = "BMI of 2020 CDC Health Survey Respondents", x = "BMI", y = "Number of People") +
    theme(plot.title.position = 'plot', plot.title = element_text(hjust = 0.5)) +
    scale_colour_gradient2(high = "palevioletred", mid = "darkolivegreen", midpoint = 20)
bmi_plot

Below is a screenshot of Power BI’s Key Influencers visualization, which made it much easier to understand how each factor influenced heart disease. Once heart disease was set as the factor to be analyzed and the response was set to “Yes,” the visual reported how each of the other factors influences the likelihood of heart disease being “Yes.”

Stroke

The Power BI chart determined that when people have had a stroke, the likelihood of also having heart disease is 4.88x higher. This was the most influential factor. Below are two graphs: the first shows the prevalence of stroke among respondents, and the second shows the heart disease responses of only those respondents who have had a stroke.

# Count respondents with and without a history of stroke
strokeposcount <- sum(heart_2020_cleaned$Stroke == "Yes")
strokenegcount <- sum(heart_2020_cleaned$Stroke == "No")

data <- data.frame(
  Responses = c("No", "Yes"),
  value = c(strokenegcount, strokeposcount)
)

# Percentage labels, kept in the same row order as the counts
data$pctLabels <- paste0(round(data$value / sum(data$value) * 100, digits = 2), "%")

ggplot(data, aes(x = Responses, y = value)) +
  geom_bar(colour = "orchid4", fill = "lavender", stat = "identity", width = 0.4) +
  geom_text(aes(label = pctLabels), hjust = 0.5) +
  labs(title = "Prevalence of Stroke in CDC 2020 Health Survey Population", x = "Response", y = "Number of People") +
  theme(plot.title.position = 'plot', plot.title = element_text(hjust = 0.5)) +
  coord_flip()

# Among respondents who have had a stroke, count those with and without heart disease
strokeonly <- filter(heart_2020_cleaned, Stroke == "Yes")
bothcount <- sum(strokeonly$HeartDisease == "Yes")
hdnsycount <- sum(strokeonly$HeartDisease == "No")

data <- data.frame(
  Responses = c("Yes", "No"),
  value = c(bothcount, hdnsycount)
)

# Percentage labels, kept in the same row order as the counts
data$pctLabels <- paste0(round(data$value / sum(data$value) * 100, digits = 2), "%")

ggplot(data, aes(x = Responses, y = value)) +
  geom_bar(colour = "slateblue", fill = "olivedrab", stat = "identity", width = 0.4) +
  geom_text(aes(label = pctLabels), hjust = 0.5) +
  labs(title = "Heart Disease Response of Only People Who Have Had a Stroke", x = "Response", y = "Number of People") +
  theme(plot.title.position = 'plot', plot.title = element_text(hjust = 0.5)) +
  coord_flip()
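
A multiplier like the one Power BI reports for stroke can be roughly sanity-checked in R by comparing the share of heart disease in each stroke group. The sketch below is only an approximation under stated assumptions: the toy data frame and its values are hypothetical, and a simple ratio of proportions is not necessarily the calculation Power BI performs internally.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  Stroke       = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "No"),
  HeartDisease = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No")
)

# Share of respondents reporting heart disease within each stroke group.
rates <- toy %>%
  group_by(Stroke) %>%
  summarise(hd_rate = mean(HeartDisease == "Yes"))

# Ratio of the two shares: how many times more common heart disease is
# among respondents who reported a stroke than among those who did not.
ratio <- rates$hd_rate[rates$Stroke == "Yes"] / rates$hd_rate[rates$Stroke == "No"]
ratio
```

Running the same pipeline on `heart_2020_cleaned` would give a figure to compare against the Power BI result, though the two need not match exactly.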

General Health

The graph below shows respondents' general health, the next most influential factor of heart disease. Participants were asked to rate their general health as “Poor,” “Fair,” “Good,” “Very good,” or “Excellent.” When people rated their own health as “Poor,” as opposed to “Fair,” “Good,” “Very good,” or “Excellent,” the likelihood of having heart disease is 4.7x higher.

library(treemap)

# Count respondents in each general health category; taking the labels from
# the counted column keeps each label paired with its own count
ghcount <- count(heart_2020_cleaned, GenHealth)
data <- data.frame(group = ghcount$GenHealth, value = ghcount$n)

treemap(data,
        index = "group",
        vSize = "value",
        type = "index",
        palette = "RdYlBu",
        title = "General Health of 2020 CDC Survey Respondents",
        fontsize.title = 12
)
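
The treemap shows how many respondents fall in each category, but not how heart disease varies across them. One way to look at that is to treat general health as an ordered factor and compute the share of heart disease at each level. This is a minimal sketch, assuming a hypothetical toy data frame with made-up values; `health_levels` and `by_health` are illustrative names.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  GenHealth    = c("Poor", "Poor", "Fair", "Good", "Good", "Very good", "Excellent", "Excellent"),
  HeartDisease = c("Yes", "No", "Yes", "No", "No", "No", "No", "No")
)

# Order the categories from worst to best instead of alphabetically.
health_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

# Number of respondents and share reporting heart disease at each level.
by_health <- toy %>%
  mutate(GenHealth = factor(GenHealth, levels = health_levels, ordered = TRUE)) %>%
  group_by(GenHealth) %>%
  summarise(respondents = n(), hd_rate = mean(HeartDisease == "Yes"))

by_health
```

Ordering the levels explicitly also keeps any later plot or table running from “Poor” to “Excellent” rather than alphabetically.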

Kidney Disease

The graph below shows the number of participants with kidney disease. Power BI determined that the likelihood of people having heart disease if they have kidney disease is 3.64x higher than if they don’t have kidney disease.

# Count respondents with and without kidney disease
kidneyposcount <- sum(heart_2020_cleaned$KidneyDisease == "Yes")
kidneynegcount <- sum(heart_2020_cleaned$KidneyDisease == "No")

data <- data.frame(
  Responses = c("No", "Yes"),
  value = c(kidneynegcount, kidneyposcount)
)

# Percentage labels, kept in the same row order as the counts
data$pctLabels <- paste0(round(data$value / sum(data$value) * 100, digits = 2), "%")

ggplot(data, aes(x = Responses, y = value)) +
  geom_bar(colour = "lightpink4", fill = "seashell3", stat = "identity") +
  geom_text(aes(label = pctLabels), vjust = 0) +
  labs(title = "Prevalence of Kidney Disease in CDC 2020 Health Survey Population", x = "Response", y = "Number of People") +
  theme(plot.title.position = 'plot', plot.title = element_text(hjust = 0.5))
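
The bar chart shows kidney disease on its own. To see it alongside heart disease, a two-way table with row percentages is a compact option in base R. The sketch below assumes a hypothetical toy data frame with made-up values; `tab` and `row_pct` are illustrative names.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  KidneyDisease = c("Yes", "Yes", "No", "No", "No", "No"),
  HeartDisease  = c("Yes", "No", "No", "No", "Yes", "No")
)

# Cross-tabulate the two responses (rows: kidney disease, columns: heart disease).
tab <- table(KidneyDisease = toy$KidneyDisease, HeartDisease = toy$HeartDisease)

# Convert each row to percentages so the two kidney disease groups can be compared.
row_pct <- round(prop.table(tab, margin = 1) * 100, digits = 2)
row_pct
```

Each row of `row_pct` sums to 100, so the “Yes” column gives the share of heart disease within each kidney disease group.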

Physical Health

The graph below shows respondents' physical health. They responded with a number from 0 to 30, where 0 means they had no bad physical health days in the past 30 days and 30 means they had 30 bad physical health days in the past 30 days. Power BI determined that if participants responded with 18 or higher, their likelihood of having heart disease is 3.42 times higher than that of those who reported fewer than 18 bad physical health days.

# Count respondents at each number of bad physical health days, split by sex.
# Counting both columns together keeps every count paired with the correct sex.
df <- count(heart_2020_cleaned, Sex, PhysicalHealth)

ggplot(df, aes(x = PhysicalHealth, y = n, fill = Sex)) +
  labs(title = "Physical Health (0-30) by Gender", x = "Physical Health (Number of bad physical health days out of 30)", y = "Number of People", fill = "Gender") +
  theme(plot.title.position = 'plot', plot.title = element_text(hjust = 0.5)) +
  geom_area()
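
The 18-day cutoff reported by Power BI can be explored in R by splitting respondents at that threshold and comparing the share of heart disease in each group. This is a minimal sketch, assuming a hypothetical toy data frame with made-up values; the `bad_days_group` column and `by_threshold` name are illustrative.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  PhysicalHealth = c(0, 0, 2, 5, 10, 17, 18, 20, 30, 30),
  HeartDisease   = c("No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No")
)

# Split respondents at 18 bad physical health days and compare the two groups.
by_threshold <- toy %>%
  mutate(bad_days_group = if_else(PhysicalHealth >= 18, "18 or more", "Fewer than 18")) %>%
  group_by(bad_days_group) %>%
  summarise(respondents = n(), hd_rate = mean(HeartDisease == "Yes"))

by_threshold
```

Dividing the two `hd_rate` values gives a ratio that can be compared with the Power BI figure, keeping in mind that Power BI may compute its multiplier differently.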

Wrap up

The data visualized in this report suggest that the top 4 most attributable factors for heart disease are stroke, poor general health, kidney disease, and poor physical health. Because some of these are comorbidities, it is not fair to suggest that they cause heart disease. However, a patient who has any of these conditions should be evaluated for heart disease, as the correlation is evident. The data is also helpful because it can alert people with these risk factors to begin making healthier diet or lifestyle choices in order to reduce the risk of heart disease. Though the data cannot show, for instance, that someone who has had a stroke is at a higher risk of later developing heart disease, it does show that people who have had a stroke have, on average, also had more occurrences of heart disease than those who have never had a stroke.

The data also confirms that some preventable factors, such as diabetes, smoking, sleep time, and physical activity, put people at an increased risk for heart disease. From this data set alone, it cannot be determined whether preventable or non-preventable factors are more influential in the likelihood of having or developing heart disease, because many of the factors, such as kidney disease or skin cancer, could be caused by a multitude of both preventable and non-preventable factors. However, given the correlation between some of the preventable factors and the increased risk of heart disease, it is still fair to suggest that making healthy choices could reduce the risk of heart disease.

Module 1

Sydney Whitaker

2023-10-15