Week 6: Data Dive

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read only the first 250 rows of the dataset
HA <- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv", nrows = 250)

Two Pairs of Numerical Variables:

First Numerical Variables:

Let’s create the first pair of numerical variables from the columns Age and BMI. From the BMI column we also derive a new column, BMI_Category, which tags each patient as Underweight, Normal weight, Overweight, or Obesity so that patients of different ages can be compared by BMI group.

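HA <- HA |> 
  mutate(BMI_Category = case_when(
    is.na(BMI) ~ NA_character_,    # keep a missing BMI as missing, not "Obesity"
    BMI < 18.5 ~ "Underweight",
    BMI < 25   ~ "Normal weight",  # conditions are checked in order: 18.5 to < 25
    BMI < 30   ~ "Overweight",     # 25 to < 30, so 24.9-25 and 29.9-30 are no longer skipped
    TRUE       ~ "Obesity"         # 30 and above
  ))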

After this step, the dataset (HA) contains a new column called BMI_Category, which assigns every patient a BMI tag regardless of age. This makes it possible to read off a patient’s BMI group at a glance instead of interpreting the raw BMI value.
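As a cross-check on the binning, here is a minimal sketch of the same categorization using base R's cut(). The BMI values are hypothetical and chosen to sit near the cut points, and the half-open intervals [18.5, 25) and [25, 30) are an assumption about the intended boundaries.

```r
# Hypothetical BMI values near the category boundaries (not taken from the dataset)
bmi <- c(17.0, 18.5, 24.95, 25.0, 29.95, 30.0)

# right = FALSE makes every interval closed on the left and open on the right,
# so each BMI value falls into exactly one category
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right = FALSE
)

# Count how many values land in each category
table(bmi_category)
```

Because the breaks are contiguous, a value such as 24.95 or 29.95 cannot slip between two categories.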

Second Numerical Variables:

Let’s create the second pair of numerical variables from the columns Cholesterol and Heart Rate. We add a new column holding the ratio of Cholesterol to Heart Rate, which may help us relate the risk of heart attack to each patient’s cholesterol level and heart rate together.

HAB <- HA |> 
  mutate(RatioofCholesbyHR = Cholesterol / Heart.Rate)

We have added a new column named “RatioofCholesbyHR” and stored the result in a new dataset (HAB). The ratio may help us see how a change in cholesterol or heart rate relates to the risk of heart attack for each patient. However, to analyze the link between the ratio and that risk, we still need to look at cholesterol and heart rate separately: a low ratio does not necessarily mean low cholesterol, because it can equally result from a high heart rate.
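The ambiguity of the ratio can be illustrated with a small sketch. The three patients below are hypothetical; only the column names follow the dataset.

```r
# Three hypothetical patients (values invented for illustration)
patients <- data.frame(
  Cholesterol = c(200, 150, 300),
  Heart.Rate  = c(80, 60, 40)
)

# Same definition as the RatioofCholesbyHR column
patients$RatioofCholesbyHR <- patients$Cholesterol / patients$Heart.Rate

patients
```

The first two patients share a ratio of 2.5 despite different cholesterol levels, while the third reaches 7.5 mainly because of a low heart rate. The ratio alone therefore cannot tell us which component is driving it.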

Visualization on Each Pair:

First Visualization:

Let’s create a scatter plot to see how Age and Cholesterol relate within each of the four BMI categories.

ggplot(HA, aes(x = Age, y = Cholesterol)) +
  geom_point(aes(color = BMI_Category)) +
  geom_smooth(method = "lm", color = "blue")+
  facet_wrap(~ BMI_Category) +
  labs(title = "Age vs Cholesterol for different BMI Categories",
       x = "Age",
       y = "Cholesterol Level") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

From the scatter plots above, it is evident that the underweight group contains far fewer patients than the other categories. In all four BMI categories, the linear relationship between age and cholesterol is weak. The obesity category shows a correlation of almost zero, indicating no clear linear association. The normal weight category shows a weak positive correlation, while the overweight and underweight categories show weak negative correlations.

These weak correlations suggest that age is not a strong driver of cholesterol level in this sample. That is plausible, since cholesterol is likely affected more by factors such as diet, physical activity, sleep, and stress than by age alone.

Because the points in the dataset (HA) are spread fairly evenly across the plotted range, it is difficult to single out any observation as an outlier in the graph above.
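The panel-by-panel reading above can also be checked numerically. The sketch below uses synthetic data (random values and assumed group sizes, not the real HA dataset) to show one way of getting the group sizes and a separate Age and Cholesterol correlation per BMI category with base R.

```r
set.seed(1)  # fixed seed so the synthetic data are reproducible

# Synthetic stand-in for HA; column names mirror the dataset, values are invented
toy <- data.frame(
  Age          = sample(20:90, 60, replace = TRUE),
  Cholesterol  = sample(120:400, 60, replace = TRUE),
  BMI_Category = rep(c("Underweight", "Normal weight", "Overweight", "Obesity"),
                     times = c(6, 18, 18, 18))
)

# Number of patients in each category
group_n <- table(toy$BMI_Category)

# One Pearson correlation per category
group_r <- sapply(
  split(toy, toy$BMI_Category),
  function(d) cor(d$Age, d$Cholesterol)
)

group_n
round(group_r, 3)
```

A coefficient computed from a very small group, such as the underweight one, is unstable and should be read with more caution than the others.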

Second Visualization:

Let’s create a scatter plot to see how Cholesterol relates to the ratio of cholesterol to heart rate.

ggplot(HAB, aes(x = Cholesterol, 
               y = RatioofCholesbyHR)) + 
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  # labs() is now closed before the next layer is added; the stray
  # scale_color_brewer() call is dropped because no colour aesthetic is mapped
  labs(x = "Cholesterol", 
       y = "Ratio of Cholesterol by Heart Rate", 
       title = "Cholesterol vs Ratio of Cholesterol by Heart Rate") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

From the scatter plot, we can observe a strong positive correlation between cholesterol level and the ratio of cholesterol to heart rate. A dense cluster of points appears around a cholesterol level of 200 and a ratio of 2.5, which corresponds to a heart rate of about 80. Many patients therefore combine an unremarkable cholesterol level with an unremarkable ratio.

Toward the upper right of the plot the data become more spread out, with cholesterol levels between 300 and 400 corresponding to ratios above 7.5. A ratio that high at those cholesterol levels implies a heart rate below roughly 40 to 53, so the elevated ratio is driven by a low heart rate in patients with high cholesterol. If high cholesterol is expected to go along with a higher heart rate, for example through arterial blockage, this is an unusual scenario, and some of these upper-right points may be outliers where the expected relationship does not hold. The points are also widely scattered, which makes it difficult to draw firm conclusions.
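One common way to make "may represent outliers" concrete is the 1.5 × IQR rule. This rule is not used in the analysis above; it is shown here only as a hedged sketch, and the ratio values are hypothetical.

```r
# Hypothetical ratio values; the last one mimics an upper-right point
ratio <- c(2.1, 2.4, 2.5, 2.6, 2.8, 3.0, 3.3, 3.9, 8.2)

# Quartiles and interquartile range
q   <- unname(quantile(ratio, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the 1.5 * IQR rule
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
flagged <- ratio[ratio < lower_fence | ratio > upper_fence]
flagged
```

A flagged value is only a candidate for closer inspection; whether it is a data error or a genuine patient with a low heart rate has to be decided from the underlying record.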

Correlation Coefficients:

On First Numerical Variables:

Let’s calculate the correlation coefficients between the columns of the first pair.

cor(HAB$Age, HAB$BMI)

## [1] -0.03948948

cor(HAB$Age, HAB$Cholesterol)

## [1] -0.003212924

cor(HAB$Cholesterol, HAB$BMI)

## [1] 0.09193448

Correlation between Age and BMI:

We can see a very weak negative correlation between Age and BMI; a coefficient of about -0.04 is practically zero. This is plausible, because BMI is likely shaped more by an individual’s diet, sleep, and stress than by age alone.

Correlation between Age and Cholesterol:

We can see an extremely weak negative correlation between Age and Cholesterol. The value is too small to draw conclusions from. Again, cholesterol depends heavily on diet, sleep, exercise, and stress. Age may be related to each of these factors, so their combined effect could produce a correlation between age and cholesterol that is close to zero.

Correlation between Cholesterol and BMI:

We can see a weak positive correlation between Cholesterol and BMI. The positive sign means that patients with higher cholesterol tend, on average, to have a slightly higher BMI, but with a coefficient of about 0.09 the tendency is far too weak to say that BMI will increase whenever cholesterol does.
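To judge how much weight these small coefficients can bear, an approximate 95% confidence interval can be computed for each one. The sketch below uses the Fisher z-transformation, which is an added technique and not part of the original analysis. The helper r_ci() is a hypothetical name, and n = 250 assumes that none of the 250 rows had missing values.

```r
# Approximate confidence interval for a Pearson correlation (Fisher z-transformation)
r_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                      # transform r to the z scale
  se   <- 1 / sqrt(n - 3)               # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)    # normal critical value, about 1.96 for 95%
  tanh(c(lower = z - crit * se, upper = z + crit * se))  # back to the r scale
}

n <- 250  # rows read from the CSV (assumed complete)

# Coefficients as reported by cor() above
coefs <- c(
  Age_BMI         = -0.03948948,
  Age_Cholesterol = -0.003212924,
  Cholesterol_BMI =  0.09193448
)

intervals <- t(sapply(coefs, r_ci, n = n))
round(intervals, 3)
```

All three intervals contain zero, so with 250 patients none of these correlations can be distinguished from no linear association at all.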

Correlation Sense with First Visualization:

Because BMI_Category is a categorical column, it cannot be correlated directly with Age, Cholesterol, and BMI. We can, however, compare the overall Age and Cholesterol coefficient with the per-category trends in the first visualization, and the two only partly agree. The underweight and overweight panels show a negative trend, which matches the sign of the overall coefficient. The normal weight panel (weak positive trend) and the obesity panel (trend close to zero) do not.

Further investigation would be needed to explain why. Still, the first visualization can be read as an expanded version of the single Age and Cholesterol coefficient: it shows a separate trend for each BMI category instead of one coefficient for all patients.

On Second Numerical Variables:

Let’s calculate the correlation coefficients between the columns of the second pair.

# Calculate the correlation between Cholesterol and Heart.Rate
cor(HAB$Cholesterol, HAB$Heart.Rate)

## [1] -0.1255535

# Calculate the correlation between Cholesterol and RatioofCholesbyHR
cor(HAB$Cholesterol, HAB$RatioofCholesbyHR)

## [1] 0.7588446

# Calculate the correlation between Heart.Rate and RatioofCholesbyHR
cor(HAB$Heart.Rate, HAB$RatioofCholesbyHR)

## [1] -0.6809854

Correlation between Cholesterol and Heart Rate:

We can see a weak negative correlation between Cholesterol and Heart Rate. If higher cholesterol is expected to raise heart rate through arterial blockage, this sign is contradictory. It may arise because heart rate and cholesterol are both influenced by a number of other factors in the dataset, or because a few unusual observations in the 250-row sample pull the coefficient down.

Correlation between Cholesterol and Ratio of Cholesterol by Heart Rate:

We can see a strong positive correlation between Cholesterol and its ratio to heart rate. This is expected, because Cholesterol is the numerator of the ratio.

Correlation between Heart Rate and Ratio of Cholesterol by Heart Rate:

We can see a strong negative correlation between Heart Rate and the ratio of cholesterol to heart rate. This is also expected, because Heart Rate is the denominator of the ratio.
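Both strong correlations follow largely from how the ratio is built. As a sketch, the simulation below draws cholesterol and heart rate independently of each other (synthetic values; the ranges are illustrative assumptions) and still finds that the ratio correlates clearly with each of its components.

```r
set.seed(123)  # fixed seed so the simulation is reproducible

n <- 1000

# Independent synthetic draws: by construction there is no link between the two
cholesterol <- runif(n, min = 120, max = 400)
heart_rate  <- runif(n, min = 40,  max = 110)

# Same construction as RatioofCholesbyHR
ratio <- cholesterol / heart_rate

r_components  <- cor(cholesterol, heart_rate)  # close to zero by construction
r_numerator   <- cor(cholesterol, ratio)       # positive: cholesterol is the numerator
r_denominator <- cor(heart_rate, ratio)        # negative: heart rate is the denominator

round(c(components = r_components,
        numerator = r_numerator,
        denominator = r_denominator), 2)
```

The correlations of the ratio with Cholesterol and Heart Rate therefore say little about how patients' cholesterol and heart rate relate to each other; that information sits in the direct coefficient between the two.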

Correlation Sense with Second Visualization:

The correlation between Cholesterol and the ratio of cholesterol to heart rate matches the second visualization, which shows the same strong positive relationship. The visualization adds information that the coefficient cannot give: how the points are distributed and where potential outliers lie.

Confidence Intervals:

First Response Variable:

Let’s compute the 95% confidence interval (C.I.) for the response variable Cholesterol by running the code below.

t.test(HAB$Cholesterol)

## 
##  One Sample t-test
## 
## data:  HAB$Cholesterol
## t = 51.116, df = 249, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  245.5015 265.1785
## sample estimates:
## mean of x 
##    255.34

From the one-sample t-test output above, the 95% confidence interval for Cholesterol runs from 245.5015 to 265.1785. In other words, we are 95% confident that the true population mean cholesterol level lies in this range. The interval is far from zero, so the default null hypothesis of a zero mean is rejected, although a mean cholesterol level of zero was never plausible to begin with.

The output also gives a sample mean of 255.34, which is somewhat high for a general population. Because cholesterol is affected by many other factors, this result invites further research into what drives it.
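The interval can be reconstructed by hand from the numbers printed by t.test(), which shows where it comes from. The sketch below relies only on the reported mean, t statistic, and degrees of freedom; because the null value is 0, the standard error is the mean divided by t.

```r
# Values copied from the t.test() output for Cholesterol
sample_mean <- 255.34
t_stat      <- 51.116
df          <- 249

# t = (mean - 0) / se, so the standard error can be recovered from the output
se <- sample_mean / t_stat

# Critical value of the t distribution for a two-sided 95% interval
crit <- qt(0.975, df)

# mean +/- critical value * standard error
ci <- c(lower = sample_mean - crit * se,
        upper = sample_mean + crit * se)
round(ci, 2)
```

Up to rounding of the printed inputs, this reproduces the reported interval of 245.5015 to 265.1785.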

Second Response Variable:

Let’s compute the 95% confidence interval (C.I.) for the response variable Ratio of Cholesterol by Heart Rate by running the code below.

t.test(HAB$RatioofCholesbyHR)

## 
##  One Sample t-test
## 
## data:  HAB$RatioofCholesbyHR
## t = 33.934, df = 249, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.413702 3.834385
## sample estimates:
## mean of x 
##  3.624043

From the one-sample t-test output above, the 95% confidence interval for the Ratio of Cholesterol by Heart Rate runs from 3.413702 to 3.834385. In other words, we are 95% confident that the true population mean of the ratio lies in this range. This interval is also far from zero, so the default null hypothesis of a zero mean is rejected.

The sample mean of 3.62 is a somewhat high ratio for a general population. Since the ratio is composed of cholesterol and heart rate, both should be investigated separately before drawing any firm conclusions about the risk of heart attack for individual patients.
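Whether a mean is "high" can be tested more directly by comparing it with a reference value instead of zero. The sketch below uses synthetic ratio values, and the reference of 3.0 is purely a hypothetical placeholder, not a clinical threshold. The mu argument of t.test() sets the null value.

```r
set.seed(42)  # fixed seed so the synthetic sample is reproducible

# Synthetic stand-in for HAB$RatioofCholesbyHR (invented values)
ratio <- rnorm(250, mean = 3.6, sd = 1.7)

# Hypothetical reference value for a general population (assumption)
reference <- 3.0

# One-sample t-test against the reference instead of the default mu = 0
result <- t.test(ratio, mu = reference)

result$estimate   # sample mean
result$conf.int   # 95% confidence interval for the population mean
result$p.value    # evidence against "mean equals the reference"
```

The confidence interval is the same whatever mu is; only the t statistic and the p-value change, and they then answer the question of interest.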