Week-5: Data Dive

Loading Library:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading Dataset:

HA <- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")
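The later code refers to the column as Stress.Level. A small sketch of why that can happen, assuming the CSV header is written with a space ("Stress Level"); the temporary file and its two rows are made up for illustration and are not the real dataset:

```r
# Toy CSV written to a temporary file; the header mimics a name with a space.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Sex,Stress Level", "Male,9", "Female,1"), tmp)

# With the default check.names = TRUE, read.csv() makes names syntactic,
# so the space becomes a dot.
toy <- read.csv(tmp)
loaded_names <- names(toy)
print(loaded_names)

# Keeping the header exactly as written requires check.names = FALSE.
raw_names <- names(read.csv(tmp, check.names = FALSE))
print(raw_names)
```

Running names(HA) right after loading is a quick way to confirm the names the rest of the analysis relies on.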

List of three Unclear Columns:

Several columns seem unclear to some extent. However, the three most unclear columns before reading the documentation are as follows:

a. Cholesterol

b. Stress Level

c. Diabetes

Reason behind the lack of clarity:

a. Unfamiliar Topic:

The Cholesterol column records the actual cholesterol reading. For someone unfamiliar with health science topics or terminology, a raw reading is hard to interpret without a more descriptive column name. A number displayed without any supplementary instructions can easily cause confusion.

b. Generalized Level:

In the Stress Level column, a range of 1 to 10 is used to show the level of stress for different patients. This is ambiguous because it is unclear what counts as low, medium, and high stress. For instance, a value between 3 and 7 does not convey how much better or worse the stress is. Even 1 and 10 can be confusing because the dataset does not specify whether 1 or 10 is the low end.

c. Universal Assumption:

Generally, in True or False situations, the binary values 0 and 1 are used to denote the two cases. However, if it is not stated which value means True, it is risky to make the universal assumption that 1 is always True and 0 is always False. For instance, in the Diabetes column, 1 and 0 are used to show whether the patient has diabetes. Before reading the documentation, it is unclear which value means that the patient has diabetes.
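One way to make the coding explicit is to convert the 0/1 codes into a labelled factor once the documentation has been checked. This is a minimal sketch: the vector below is made up and stands in for HA$Diabetes, and the mapping 0 = No, 1 = Yes is an assumption to confirm against the documentation.

```r
# Made-up 0/1 codes standing in for HA$Diabetes.
diabetes_code <- c(1, 0, 0, 1, 1)

# First check which codes actually occur in the column.
print(table(diabetes_code))

# Assumed coding (confirm against the documentation): 0 = No, 1 = Yes.
diabetes_label <- factor(diabetes_code,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))
print(table(diabetes_label))
```

With labels attached, tables and plot legends show No and Yes instead of bare numbers, so the reader no longer has to guess the meaning.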

Still unclear Column:

Stress level is still unclear after reading the documentation because the documentation does not clearly categorize the range (1-10) of stress levels. In other words, if a patient has a stress level of 7, it is very difficult to judge whether the patient is in the danger zone or still safely away from the consequences of severe stress. Anyone can make their own assumptions and categorize the values after reading the documentation; however, how safe those assumptions are is the real question.

For the sake of clarity, the documentation should have categorized (1-3) as low stress level, (4-6) as medium stress level, and (7-10) as high stress level.

Explore Issue In-depth:

Let’s explore the issue in depth by looking at the box plot of Gender vs. Stress Level among different countries.

# One box per country within each sex; the default fill colors are used.
ggplot(HA, aes(x = Sex,
               y = Stress.Level,
               fill = Country)) +
  geom_boxplot() +
  labs(x = "Gender",
       y = "Stress Level",
       title = "Gender vs Stress Level")

From the above visualization, we can see that male and female patients from different countries have different stress levels ranging from 1 to 10. To make the interpretation easier, if we focus only on the female data (since it is more diverse than the male data), we can see that females from different countries have different stress levels. All of them have a median stress level of at least 5, so it can be assumed that they all have some sort of stress in life.

Some females also have stress levels close to 6 or a little higher. It is hard to draw any conclusion about a stress level of 6 or 7, because we don’t know how severe 6 or 7 is in terms of negative consequences. We can compare the box plots and assume that some are at higher severity than others. However, that is not enough for a robust analysis, such as predicting whether patients are going to have heart failure.

If the documentation had categorized the range (1-10) of stress level, it would have been easy to extract some distinct information about which patient is at risk of heart failure and which one is not.

Reduce Negative Consequences:

The ambiguity can be reduced by grouping the stress levels into three categories:

a. Low Stress Level (1-3)

b. Medium Stress Level (4-6)

c. High Stress Level (7-10)

We can create the low, medium, and high stress categories by using the following code.

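# Bin the 1-10 scale: [1,3] = Low, (3,6] = Medium, (6,10] = High.
HA$Stress.Category <- cut(HA$Stress.Level,
                          breaks = c(1, 3, 6, 10),
                          labels = c("Low", "Medium", "High"),
                          right = TRUE,           # intervals are closed on the right
                          include.lowest = TRUE)  # keep the value 1 in the first bin
HA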

From the above data frame, we can see that the stress level is now clearly divided into 3 categories: (1-3) as Low stress level, (4-6) as Medium stress level, and (7-10) as High stress level. This way, we can further facilitate our analysis by tagging each individual patient with a low, medium, or high stress level.

Now, someone can understand what the different numbers (1-10) signify without even reading the documentation.
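As a quick sanity check on the binning, the same breaks can be applied to the integers 1 through 10 to see where each value lands. This is only a sketch on a toy vector, not on the dataset itself, and the label text is illustrative.

```r
# Every possible integer stress level.
levels_1_to_10 <- 1:10

# Same breaks as in the binning step: [1,3], (3,6], (6,10].
category <- cut(levels_1_to_10,
                breaks = c(1, 3, 6, 10),
                labels = c("Low", "Medium", "High"),
                right = TRUE,
                include.lowest = TRUE)

# Side-by-side view of each level and its assigned category.
check <- data.frame(Stress.Level = levels_1_to_10,
                    Stress.Category = category)
print(check)

# Without include.lowest = TRUE the first interval is (1,3], so 1 is left out.
lowest_without_flag <- cut(1, breaks = c(1, 3, 6, 10), right = TRUE)
print(is.na(lowest_without_flag))
```

The last lines show why include.lowest = TRUE matters: without it, patients with a stress level of 1 would end up with a missing category.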

Visualization on Stress Categories:

a. Box Plot by Gender:

Let’s see the visualization below, which categorizes both male and female patients by the severity of their stress levels.

ggplot(HA, aes(x = Sex, 
               y = Stress.Level, 
               fill = Stress.Category)) + 
  geom_boxplot() + 
  labs(x = "Gender", 
       y = "Stress Level", 
       title = "Gender vs Stress Level") +
  scale_fill_brewer(palette = 'Dark2')+
  theme_minimal()

Here, we can see a clear distinction between male and female patients across all three stress levels: low, medium, and high. This clear distinction helps make the analysis more robust, as we can tell whether male and female patients are in the safe zone or the danger zone of heart failure due to higher stress levels.

This can further assist in making strong predictions of heart attack, and also help patients with other parts of their lifestyle, such as finding ways to reduce stress, adopting a healthy diet, and actively participating in physical activities.
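To put numbers on how many patients of each sex fall into each category, a cross-tabulation is one option. The sketch below uses a made-up six-row data frame rather than the real dataset; only the column names Sex, Stress.Level, and Stress.Category follow the analysis above.

```r
# Made-up patients for illustration only.
toy <- data.frame(
  Sex = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Stress.Level = c(2, 8, 5, 9, 10, 4)
)

# Same binning as before: 1-3 Low, 4-6 Medium, 7-10 High.
toy$Stress.Category <- cut(toy$Stress.Level,
                           breaks = c(1, 3, 6, 10),
                           labels = c("Low", "Medium", "High"),
                           right = TRUE,
                           include.lowest = TRUE)

# Rows are sexes, columns are stress categories, cells are patient counts.
counts <- table(toy$Sex, toy$Stress.Category)
print(counts)
```

On the full data, table(HA$Sex, HA$Stress.Category) would give the corresponding counts, and the same counts could be drawn with geom_bar(position = "dodge") using x = Sex and fill = Stress.Category.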

b. Box Plot by Country:

Let’s see a box plot of patients from different countries across the different stress level categories.

ggplot(HA, aes(x = Country, 
               y = Stress.Level, 
               fill = Stress.Category)) + 
  geom_boxplot() + 
  labs(x = "Country", 
       y = "Stress Level", 
       title = "Country vs Stress Level") +
  scale_fill_brewer(palette = 'Dark2')+
  theme_minimal()

From the above box plot, we can see that patients from different countries with low and medium stress levels are very similar to each other. However, there is variability in the high stress level among different countries, which is what we want to examine. The high stress level is what makes us concerned about the risk of heart failure.

The median for the high stress level is higher in Argentina, New Zealand, Germany, and Spain than in the other countries. This is a critical piece of information because it alerts citizens of these four countries to make lifestyle changes in order to reduce the risk of heart attack. Also, we can see from the earlier visualization by gender that females have slightly higher stress levels than males; this suggests that female patients from these four countries are at higher risk of heart attack than males due to severe stress levels.
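The medians read off the box plot can also be computed directly, which makes the country comparison easier to verify. The following is a minimal sketch on made-up values; the two country names come from the text, but the stress levels are invented for illustration.

```r
# Made-up patients for illustration only.
toy <- data.frame(
  Country = c("Argentina", "Argentina", "Argentina", "Spain", "Spain", "Spain"),
  Stress.Level = c(8, 10, 2, 7, 9, 5)
)

# Same binning as before: 1-3 Low, 4-6 Medium, 7-10 High.
toy$Stress.Category <- cut(toy$Stress.Level,
                           breaks = c(1, 3, 6, 10),
                           labels = c("Low", "Medium", "High"),
                           right = TRUE,
                           include.lowest = TRUE)

# Keep only the High category and take the median stress level per country.
high <- toy[toy$Stress.Category == "High", ]
medians <- aggregate(Stress.Level ~ Country, data = high, FUN = median)

# Sort so the countries with the highest medians come first.
medians <- medians[order(-medians$Stress.Level), ]
print(medians)
```

Applying the same steps to HA would list every country's median within the High category, so the countries that stand out can be named from the numbers rather than from the plot alone.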