Week3: Data Dive

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Grouping 3 different data frames from “Heart-attack Prediction” dataset:

Let’s load the first 250 rows of the Heart-attack Prediction dataset into a data frame called HA:

HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv", nrows = 250)

First Data frame:

HA |> 
  group_by(Country) |> 
  summarise(Mean_Exercise_Hours_per_week=mean(Exercise.Hours.Per.Week))

## # A tibble: 20 × 2
##    Country        Mean_Exercise_Hours_per_week
##    <chr>                                 <dbl>
##  1 Argentina                              9.68
##  2 Australia                              7.17
##  3 Brazil                                11.1 
##  4 Canada                                 8.31
##  5 China                                 12.6 
##  6 Colombia                              10.6 
##  7 France                                 9.90
##  8 Germany                               11.6 
##  9 India                                 10.6 
## 10 Italy                                 10.5 
## 11 Japan                                  9.36
## 12 New Zealand                           10.7 
## 13 Nigeria                               11.4 
## 14 South Africa                          10.7 
## 15 South Korea                            9.21
## 16 Spain                                 10.0 
## 17 Thailand                               8.21
## 18 United Kingdom                        10.8 
## 19 United States                         11.6 
## 20 Vietnam                               10.5

This data frame shows the mean exercise hours per week for patients in each country. China has the highest mean (12.6 hours) and Australia has the lowest (7.17 hours). This could suggest that the patients in China live a relatively more active life. One possible explanation is a culture that values hard work and discipline, although this data alone cannot confirm it. Note also that 250 rows spread over 20 countries leaves only a small number of patients behind each mean.

Australia’s population, by contrast, is unevenly distributed: most of the land is thinly populated and most people live in a few major cities. People in those cities might travel by bus and train instead of walking or biking, which could contribute to a less active lifestyle.
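Since each country mean rests on only a handful of patients, it may help to report the group size and spread next to the mean. The following is a minimal sketch on a made-up toy data frame (`toy` and its values are hypothetical stand-ins for HA, not real results):

```r
library(dplyr)

# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Exercise.Hours.Per.Week = c(2, 4, 6, 10, 12)
)

# Report group size, mean, standard deviation and standard error per country.
country_summary <- toy |>
  group_by(Country) |>
  summarise(
    n = n(),
    mean_hours = mean(Exercise.Hours.Per.Week),
    sd_hours = sd(Exercise.Hours.Per.Week),
    se_hours = sd_hours / sqrt(n)
  )

country_summary
```

A large standard error relative to the gap between two country means would suggest that the ranking of those countries could change with a different sample.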

Second Data frame:

HA |> 
  group_by(Sex) |> 
  summarise(Mean_Stress_Level=mean(Stress.Level))

## # A tibble: 2 × 2
##   Sex    Mean_Stress_Level
##   <chr>              <dbl>
## 1 Female              4.91
## 2 Male                5.49

This data frame shows that males have a higher mean stress level (5.49) than females (4.91). This could relate to why males die of heart failure at a higher rate than females. One possible explanation is that males generally carry more stress about family, job, children, and future security, which might be what the above data frame reflects.
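Whether a gap of this size is more than sampling noise can be checked with a two-sample test. The sketch below runs Welch's t-test (the default of `t.test()`) on a made-up toy data frame; `toy` and its values are hypothetical, and treating the stress scale as numeric is an assumption:

```r
# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 5),
  Stress.Level = c(3, 5, 4, 6, 5, 6, 4, 7, 5, 6)
)

# Welch two-sample t-test comparing mean stress level between the sexes.
result <- t.test(Stress.Level ~ Sex, data = toy)

result$estimate  # group means
result$p.value   # evidence against "no difference in means"
```

On the real data the same call would be `t.test(Stress.Level ~ Sex, data = HA)`.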

Third Data frame:

HA |> 
  group_by(Diet) |>
  summarise(Mean_BMI=mean(BMI))

## # A tibble: 3 × 2
##   Diet      Mean_BMI
##   <chr>        <dbl>
## 1 Average       28.7
## 2 Healthy       28.9
## 3 Unhealthy     29.1

This data frame relates the diet type (Average, Healthy, or Unhealthy) to the mean BMI. We can see that there is not much difference in BMI across diets: the means range only from 28.7 to 29.1. The reason behind that could be the mix of people from different countries, where the same diet label may go with quite different BMI values. If these differences balance each other out, the overall average BMI ends up similar for every diet group.

We can analyze this data frame further by visualizing it and by looking into the diet of people from different countries or continents.
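One way to look into diet by continent is to group by both columns at once. This is a minimal sketch on a made-up toy data frame (`toy` and its values are hypothetical; the real HA has `Continent`, `Diet` and `BMI` columns):

```r
library(dplyr)

# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  Diet = c("Healthy", "Healthy", "Unhealthy", "Healthy", "Unhealthy", "Unhealthy"),
  BMI = c(22, 24, 26, 27, 30, 32)
)

# Mean BMI and group size for every Continent/Diet pair.
bmi_by_group <- toy |>
  group_by(Continent, Diet) |>
  summarise(n = n(), Mean_BMI = mean(BMI), .groups = "drop")

bmi_by_group
```

If the diet effect on BMI differs by continent, it would show up here even when the overall means by diet look alike.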

Additional Assistance in Analysis:

To further assist our research, let’s count the patients in five age groups (0-20, 20-40, 40-60, 60-80, and above 80) and observe which age group holds more of the data than the others:

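HAA <- HA |> 
  # Bin Age into five groups; each bin includes its upper bound, e.g. (0, 20]
  mutate(size = cut(Age,
                    breaks = c(0, 20, 40, 60, 80, Inf),
                    labels = c('Young age',
                               'Middle age',
                               'Old age',
                               'Very Old age',
                               'About to die'))) |> 
  # Count patients per group and convert counts to shares of the sample
  count(size) |> 
  mutate(Probability = n / sum(n))
HAA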

##           size  n Probability
## 1    Young age  6       0.024
## 2   Middle age 75       0.300
## 3      Old age 66       0.264
## 4 Very Old age 64       0.256
## 5 About to die 39       0.156

This data frame shows that patients aged 0 to 20 years (Young age) are the smallest group, with a count of 6. In terms of probability, if one patient is picked at random from the 250 rows in the sample, the probability that the patient falls into the Young age category is 6/250 = 0.024, which is the lowest among all age groups. Note that this is the share of patients in each age group, not the risk of a heart attack within that group.

Intuitively, the count may be so low because younger people in general have a lower chance of heart problems than older people, for instance because of better overall health, and so appear less often in a dataset like this one.
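The binning and the probability arithmetic can be checked on a few made-up ages (the vector `ages` is hypothetical). By default `cut()` builds intervals that are closed on the right, so an age of exactly 20 falls into the first group:

```r
# Hypothetical ages chosen to sit on and around the bin edges.
ages <- c(18, 20, 21, 40, 41, 85)

# Same breaks as in the analysis, with the default interval labels.
groups <- cut(ages, breaks = c(0, 20, 40, 60, 80, Inf))
as.character(groups)

# Share of observations per group: counts divided by the total.
shares <- prop.table(table(groups))
shares

# The share reported for the Young age group in the 250-row sample.
young_share <- 6 / 250
young_share
```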

Hypothesis:

Younger people are less likely to be at risk of a heart attack.
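A direct check of this hypothesis would compare the share of at-risk patients across age groups rather than the size of each group. The sketch below assumes a 0/1 column named `Heart.Attack.Risk`; that column name and the toy values are assumptions, not taken from the analysis above:

```r
library(dplyr)

# Hypothetical toy data; Heart.Attack.Risk (0/1) is an assumed column name.
toy <- data.frame(
  Age = c(18, 19, 35, 38, 55, 58, 70, 75),
  Heart.Attack.Risk = c(0, 0, 0, 1, 1, 0, 1, 1)
)

# Share of at-risk patients within each age group.
risk_by_age <- toy |>
  mutate(age_group = cut(Age, breaks = c(0, 20, 40, 60, 80, Inf))) |>
  group_by(age_group) |>
  summarise(n = n(), risk_share = mean(Heart.Attack.Risk))

risk_by_age
```

The hypothesis would be supported if `risk_share` is clearly lower in the youngest group, keeping in mind that a group with very few patients gives an unstable estimate.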

Visualization:

Visualization on First Data frame:

# scale_color_brewer() is dropped here: it does not act on a fill aesthetic,
# and the 'Dark2' palette has only 8 colours for 20 countries.
ggplot(HA, aes(x=Country,
               y=Exercise.Hours.Per.Week,
               fill=Country))+
  geom_boxplot()+
  labs(x="Country",
       y="Exercise Hours Per Week",
       title="Country vs Exercise Hours Per Week")

This visualization shows that China has the highest exercise hours per week: its median is higher than that of the other countries. The United States has a distribution similar to China’s, but its median and first quartile lie below those of China. This means that a larger share of patients in the United States exercise fewer hours per week than patients in China.
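Comparing medians across 20 boxes is easier when the boxes are sorted. A minimal sketch, on made-up toy data (`toy` is hypothetical), reorders the country factor by median before plotting:

```r
library(ggplot2)

# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Country = rep(c("A", "B", "C"), each = 4),
  Exercise.Hours.Per.Week = c(1, 2, 3, 4, 9, 10, 11, 12, 5, 6, 7, 8)
)

# Order the countries by their median exercise hours (lowest first).
toy$Country <- reorder(toy$Country, toy$Exercise.Hours.Per.Week, FUN = median)
levels(toy$Country)

# Boxes now appear from lowest to highest median.
p <- ggplot(toy, aes(x = Country, y = Exercise.Hours.Per.Week)) +
  geom_boxplot() +
  labs(x = "Country", y = "Exercise Hours Per Week")
p
```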

Visualization on Second Data frame:

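# The scale call is removed from inside labs(), where it does not belong;
# 'Dark2' has only 8 colours, too few for a fill with 20 countries.
ggplot(HA, aes(x=Sex,
               y=Stress.Level,
               fill=Country))+
  geom_boxplot()+
  labs(x="Gender",
       y="Stress Level",
       title="Gender vs Stress Level")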

This visualization shows the stress level of males and females, split by country. We can see that males and females of the same country have different stress levels, except in China. China also has a relatively low stress level for both males and females. This suggests that, whatever other factors influence heart attack prediction, men and women in China have similar and comparatively low stress levels.

Visualization on Third Data frame:

ggplot(HA, aes(x=Diet,
               y=BMI,
               fill=Continent))+
  geom_boxplot()+
  labs(x="Diet",
       y="BMI",
       title="Diet vs BMI")+
  # The palette must be a fill scale, added outside labs(), to colour the boxes
  scale_fill_brewer(palette='Dark2')

In this visualization we can see that Asia has a comparatively lower BMI on average than most other continents. Africa appears to have the lowest BMI on average, even though its median is very high. BMI in Asia is also relatively similar for healthy and unhealthy diets, which is consistent with its lower overall average.

Analyzing the missing combinations:

HA |> 
  group_by(Country, Diet) |> 
  arrange(Country, Diet) |> 
  select(Country, Diet) |> 
  count()

## # A tibble: 60 × 3
## # Groups:   Country, Diet [60]
##    Country   Diet          n
##    <chr>     <chr>     <int>
##  1 Argentina Average       7
##  2 Argentina Healthy       8
##  3 Argentina Unhealthy     7
##  4 Australia Average      11
##  5 Australia Healthy       3
##  6 Australia Unhealthy     3
##  7 Brazil    Average       5
##  8 Brazil    Healthy       7
##  9 Brazil    Unhealthy     2
## 10 Canada    Average       4
## # ℹ 50 more rows

The above data frame does not have any missing combinations: 20 countries and 3 diet types give 20 × 3 = 60 possible combinations, and the result has 60 rows. In other words, every surveyed country has at least one patient on each type of diet (healthy, unhealthy, and average).
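Counting rows works here because every pair is present. When some pairs are absent, `tidyr::complete()` can list them explicitly. A minimal sketch on made-up toy data (`toy` is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: country B has no Healthy or Unhealthy patients.
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Diet = c("Average", "Healthy", "Unhealthy", "Average", "Average")
)

# Count observed pairs, then add every unobserved pair with n = 0.
combo_counts <- toy |>
  count(Country, Diet) |>
  complete(Country, Diet, fill = list(n = 0L))

# Pairs that never occur in the data.
missing_combos <- combo_counts |>
  filter(n == 0)

missing_combos
```

An empty `missing_combos` confirms that no combination is missing.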

Most/Least common combinations:

After observing the above data frame, we see that ‘Australia, Average diet’ is the most common combination (n = 11). This might be because Australia’s population lives mostly along the coast, where seafood is more accessible, so the typical diet is not too unhealthy.

The least common combinations are ‘South Africa, Unhealthy diet’, ‘Spain, Average diet’ and ‘United States, Healthy diet’. Possible reasons are that people in South Africa depend less on fried and other unhealthy food, that Spain promotes premium oils, foods and cheeses, and that the United States has a strong fast-food culture.
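Rather than reading the extremes off a 60-row table, they can be pulled out with `slice_max()` and `slice_min()`, which keep ties by default. A minimal sketch on made-up toy data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B", "C"),
  Diet = c("Average", "Average", "Average", "Healthy", "Healthy", "Unhealthy", "Healthy")
)

combo_counts <- toy |>
  count(Country, Diet)

# Most and least frequent Country/Diet pairs; tied rows are all returned.
most_common <- combo_counts |> slice_max(n, n = 1)
least_common <- combo_counts |> slice_min(n, n = 1)

most_common
least_common
```

Keeping ties matters for the least common combinations, where several pairs can share the same small count.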

Visualization on one of the combinations (Country vs Diet):

ggplot(HA, aes(x=Country,
               fill=Diet))+
  geom_bar()+
  labs(x="Country",
       y="Count",
       title="Diet of Different Countries")+
  # The palette must be a fill scale, added outside labs(), to colour the bars
  scale_fill_brewer(palette='Dark2')

This bar graph shows how many patients in each country follow each type of diet. Here we can see that countries like New Zealand and Nigeria have a higher count of Unhealthy diets, countries like South Korea and Australia have a higher count of Average diets, and Brazil has a higher count of Healthy diets.

As a general overview, the graph suggests that in most of the countries the Unhealthy diet is more common than the Healthy or Average diet. This might be one reason why heart attacks and diabetes are growing rapidly around the world. It also raises questions about the factors that lead people to make poor and unhealthy food choices. Future investigation could cover topics such as literacy, urbanization, poverty, food culture, religion, geographical difficulties, and social media’s influence, to determine whether any of them affect the public’s food choices.
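Because the countries have different numbers of patients, raw bar heights can mislead; comparing the share of each diet within a country is a fairer basis for the overview. A minimal sketch on made-up toy data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for HA (values are made up).
toy <- data.frame(
  Country = c("A", "A", "A", "A", "B", "B"),
  Diet = c("Unhealthy", "Unhealthy", "Healthy", "Average", "Healthy", "Average")
)

# Share of each diet type within its country.
diet_share <- toy |>
  count(Country, Diet) |>
  group_by(Country) |>
  mutate(share = n / sum(n)) |>
  ungroup()

diet_share
```

The same comparison can be drawn directly by using `geom_bar(position = "fill")` in the bar chart, which scales every country's bar to the same height.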