library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")
The objective is to draw five random samples, with replacement, from the “Heart-Attack Prediction” dataset. The code below performs the sampling.
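set.seed(555)  # fix the seed so the same samples are drawn on every run
number_subsample <- 5
size_subsample <- nrow(HA) * 0.5  # each sample is half the size of HA
list_subsample <- list()
for (i in 1:number_subsample) {
  subsample <- HA |>
    sample_n(size = size_subsample, replace = TRUE)
  assign(paste0("df_", i), subsample)  # creates df_1 ... df_5
  list_subsample[[i]] <- subsample     # also keep each sample in the list
}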
The code above creates five random samples drawn with replacement, named df_1, df_2, df_3, df_4 and df_5. Calling set.seed() first makes the sampling reproducible, so the same five samples are produced each time the code is run.
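As an alternative to creating df_1 to df_5 as separate variables, the samples can be kept in one named list. The sketch below is only an illustration under stated assumptions: `toy` is a made-up stand-in for HA, `draw_subsamples()` is a hypothetical helper, and `slice_sample()` is used in place of `sample_n()`, so the rows drawn need not match the samples produced by the loop above.

```r
library(dplyr)

# Toy stand-in for the HA data frame (hypothetical values, not the real dataset)
toy <- data.frame(
  id = 1:100,
  Triglycerides = seq(30, 800, length.out = 100)
)

# Hypothetical helper: draw n_samples subsamples with replacement,
# each containing a fraction `frac` of the rows of `data`
draw_subsamples <- function(data, n_samples = 5, frac = 0.5, seed = 555) {
  set.seed(seed)                      # same seed, same draws
  size <- floor(nrow(data) * frac)    # whole number of rows per subsample
  samples <- lapply(seq_len(n_samples), function(i) {
    slice_sample(data, n = size, replace = TRUE)
  })
  names(samples) <- paste0("df_", seq_len(n_samples))
  samples
}

subsamples <- draw_subsamples(toy)
names(subsamples)         # names of the list elements
sapply(subsamples, nrow)  # number of rows in each subsample

# Re-running with the same seed reproduces the same subsamples
identical(subsamples, draw_subsamples(toy))
```

Keeping the samples in a list makes it easy to apply the same summary or plot to all of them with `lapply()` instead of repeating the call once per data frame.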
Let’s summarize all five random samples to check the mean, median and quartiles of each sub-sample. This helps us judge the quality and distribution of the data and spot any outliers or extreme values in the sub-samples.
summary(df_1)
## Patient.ID Age Sex Cholesterol
## Length:4381 Min. :18.00 Length:4381 Min. :120.0
## Class :character 1st Qu.:35.00 Class :character 1st Qu.:188.0
## Mode :character Median :54.00 Mode :character Median :258.0
## Mean :53.85 Mean :258.7
## 3rd Qu.:72.00 3rd Qu.:330.0
## Max. :90.00 Max. :400.0
## Blood.Pressure Heart.Rate Diabetes Family.History
## Length:4381 Min. : 40.00 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 57.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 75.00 Median :1.0000 Median :0.0000
## Mean : 75.03 Mean :0.6567 Mean :0.4921
## 3rd Qu.: 93.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :110.00 Max. :1.0000 Max. :1.0000
## Smoking Obesity Alcohol.Consumption Exercise.Hours.Per.Week
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.002442
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 5.040301
## Median :1.0000 Median :0.0000 Median :1.0000 Median :10.069365
## Mean :0.9009 Mean :0.4951 Mean :0.5983 Mean :10.029454
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:15.153341
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :19.997891
## Diet Previous.Heart.Problems Medication.Use Stress.Level
## Length:4381 Min. :0.0000 Min. :0.0000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.000
## Mode :character Median :0.0000 Median :1.0000 Median : 5.000
## Mean :0.4999 Mean :0.5038 Mean : 5.479
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :1.0000 Max. :10.000
## Sedentary.Hours.Per.Day Income BMI Triglycerides
## Min. : 0.001529 Min. : 20162 Min. :18.00 Min. : 30.0
## 1st Qu.: 3.098552 1st Qu.: 86045 1st Qu.:23.27 1st Qu.:224.0
## Median : 5.962756 Median :155955 Median :28.43 Median :411.0
## Mean : 6.058531 Mean :156332 Mean :28.71 Mean :415.8
## 3rd Qu.: 9.174081 3rd Qu.:224344 3rd Qu.:34.11 3rd Qu.:610.0
## Max. :11.999313 Max. :299909 Max. :39.99 Max. :800.0
## Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day Country
## Min. :0.000 Min. : 4.000 Length:4381
## 1st Qu.:2.000 1st Qu.: 5.000 Class :character
## Median :3.000 Median : 7.000 Mode :character
## Mean :3.471 Mean : 7.063
## 3rd Qu.:5.000 3rd Qu.: 9.000
## Max. :7.000 Max. :10.000
## Continent Hemisphere Heart.Attack.Risk
## Length:4381 Length:4381 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.354
## 3rd Qu.:1.000
## Max. :1.000
summary(df_2)
## Patient.ID Age Sex Cholesterol
## Length:4381 Min. :18.00 Length:4381 Min. :120.0
## Class :character 1st Qu.:35.00 Class :character 1st Qu.:194.0
## Mode :character Median :55.00 Mode :character Median :257.0
## Mean :54.08 Mean :258.8
## 3rd Qu.:72.00 3rd Qu.:327.0
## Max. :90.00 Max. :400.0
## Blood.Pressure Heart.Rate Diabetes Family.History
## Length:4381 Min. : 40.00 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 57.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 76.00 Median :1.0000 Median :0.0000
## Mean : 75.46 Mean :0.6576 Mean :0.4926
## 3rd Qu.: 93.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :110.00 Max. :1.0000 Max. :1.0000
## Smoking Obesity Alcohol.Consumption Exercise.Hours.Per.Week
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. : 0.004443
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 4.822964
## Median :1.0000 Median :0.0000 Median :1.00 Median :10.040609
## Mean :0.9007 Mean :0.4885 Mean :0.59 Mean : 9.923605
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:15.007338
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :19.998709
## Diet Previous.Heart.Problems Medication.Use Stress.Level
## Length:4381 Min. :0.0000 Min. :0.0000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.000
## Mode :character Median :0.0000 Median :0.0000 Median : 5.000
## Mean :0.4967 Mean :0.4873 Mean : 5.417
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :1.0000 Max. :10.000
## Sedentary.Hours.Per.Day Income BMI Triglycerides
## Min. : 0.001263 Min. : 20140 Min. :18.00 Min. : 30.0
## 1st Qu.: 2.967004 1st Qu.: 88605 1st Qu.:23.55 1st Qu.:229.0
## Median : 5.936669 Median :161194 Median :28.75 Median :417.0
## Mean : 5.954485 Mean :159525 Mean :28.91 Mean :421.6
## 3rd Qu.: 8.954137 3rd Qu.:229720 3rd Qu.:34.28 3rd Qu.:619.0
## Max. :11.992341 Max. :299954 Max. :39.99 Max. :800.0
## Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day Country
## Min. :0.000 Min. : 4.000 Length:4381
## 1st Qu.:2.000 1st Qu.: 5.000 Class :character
## Median :3.000 Median : 7.000 Mode :character
## Mean :3.486 Mean : 6.999
## 3rd Qu.:5.000 3rd Qu.: 9.000
## Max. :7.000 Max. :10.000
## Continent Hemisphere Heart.Attack.Risk
## Length:4381 Length:4381 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.368
## 3rd Qu.:1.000
## Max. :1.000
summary(df_3)
## Patient.ID Age Sex Cholesterol
## Length:4381 Min. :18.00 Length:4381 Min. :120.0
## Class :character 1st Qu.:35.00 Class :character 1st Qu.:193.0
## Mode :character Median :53.00 Mode :character Median :260.0
## Mean :53.65 Mean :260.4
## 3rd Qu.:72.00 3rd Qu.:328.0
## Max. :90.00 Max. :400.0
## Blood.Pressure Heart.Rate Diabetes Family.History
## Length:4381 Min. : 40.00 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 58.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 76.00 Median :1.0000 Median :0.0000
## Mean : 75.41 Mean :0.6606 Mean :0.4855
## 3rd Qu.: 93.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :110.00 Max. :1.0000 Max. :1.0000
## Smoking Obesity Alcohol.Consumption Exercise.Hours.Per.Week
## Min. :0.0 Min. :0.0000 Min. :0.0000 Min. : 0.004443
## 1st Qu.:1.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 4.773629
## Median :1.0 Median :0.0000 Median :1.0000 Median : 9.986294
## Mean :0.9 Mean :0.4976 Mean :0.5971 Mean : 9.892954
## 3rd Qu.:1.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:15.046399
## Max. :1.0 Max. :1.0000 Max. :1.0000 Max. :19.998709
## Diet Previous.Heart.Problems Medication.Use Stress.Level
## Length:4381 Min. :0.0000 Min. :0.0000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.000
## Mode :character Median :0.0000 Median :0.0000 Median : 5.000
## Mean :0.4921 Mean :0.4981 Mean : 5.494
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :1.0000 Max. :10.000
## Sedentary.Hours.Per.Day Income BMI Triglycerides
## Min. : 0.008307 Min. : 20062 Min. :18.00 Min. : 30.0
## 1st Qu.: 2.910678 1st Qu.: 88968 1st Qu.:23.35 1st Qu.:229.0
## Median : 6.078463 Median :161265 Median :28.84 Median :415.0
## Mean : 6.033178 Mean :159739 Mean :28.93 Mean :417.9
## 3rd Qu.: 9.176924 3rd Qu.:230131 3rd Qu.:34.44 3rd Qu.:614.0
## Max. :11.985484 Max. :299954 Max. :39.99 Max. :800.0
## Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day Country
## Min. :0.000 Min. : 4.000 Length:4381
## 1st Qu.:2.000 1st Qu.: 5.000 Class :character
## Median :4.000 Median : 7.000 Mode :character
## Mean :3.534 Mean : 6.966
## 3rd Qu.:6.000 3rd Qu.: 9.000
## Max. :7.000 Max. :10.000
## Continent Hemisphere Heart.Attack.Risk
## Length:4381 Length:4381 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.375
## 3rd Qu.:1.000
## Max. :1.000
summary(df_4)
## Patient.ID Age Sex Cholesterol
## Length:4381 Min. :18.00 Length:4381 Min. :120.0
## Class :character 1st Qu.:35.00 Class :character 1st Qu.:193.0
## Mode :character Median :53.00 Mode :character Median :257.0
## Mean :53.53 Mean :259.8
## 3rd Qu.:72.00 3rd Qu.:330.0
## Max. :90.00 Max. :400.0
## Blood.Pressure Heart.Rate Diabetes Family.History
## Length:4381 Min. : 40.00 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.: 58.00 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median : 76.00 Median :1.000 Median :0.0000
## Mean : 75.35 Mean :0.638 Mean :0.4997
## 3rd Qu.: 94.00 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :110.00 Max. :1.000 Max. :1.0000
## Smoking Obesity Alcohol.Consumption Exercise.Hours.Per.Week
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.004443
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 5.084560
## Median :1.0000 Median :1.0000 Median :1.0000 Median :10.169128
## Mean :0.8982 Mean :0.5035 Mean :0.5898 Mean :10.089018
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:15.162057
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :19.997891
## Diet Previous.Heart.Problems Medication.Use Stress.Level
## Length:4381 Min. :0.0000 Min. :0.0000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.000
## Mode :character Median :0.0000 Median :0.0000 Median : 5.000
## Mean :0.4992 Mean :0.4805 Mean : 5.388
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :1.0000 Max. :10.000
## Sedentary.Hours.Per.Day Income BMI Triglycerides
## Min. : 0.001529 Min. : 20062 Min. :18.00 Min. : 30.0
## 1st Qu.: 3.029868 1st Qu.: 89742 1st Qu.:23.56 1st Qu.:224.0
## Median : 5.912014 Median :159293 Median :28.92 Median :411.0
## Mean : 6.003798 Mean :159235 Mean :28.98 Mean :415.2
## 3rd Qu.: 9.072136 3rd Qu.:227652 3rd Qu.:34.39 3rd Qu.:609.0
## Max. :11.987716 Max. :299909 Max. :40.00 Max. :800.0
## Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day Country
## Min. :0.00 Min. : 4.000 Length:4381
## 1st Qu.:1.00 1st Qu.: 5.000 Class :character
## Median :3.00 Median : 7.000 Mode :character
## Mean :3.45 Mean : 7.038
## 3rd Qu.:5.00 3rd Qu.: 9.000
## Max. :7.00 Max. :10.000
## Continent Hemisphere Heart.Attack.Risk
## Length:4381 Length:4381 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.3472
## 3rd Qu.:1.0000
## Max. :1.0000
summary(df_5)
## Patient.ID Age Sex Cholesterol
## Length:4381 Min. :18.0 Length:4381 Min. :120.0
## Class :character 1st Qu.:35.0 Class :character 1st Qu.:190.0
## Mode :character Median :53.0 Mode :character Median :257.0
## Mean :53.6 Mean :258.1
## 3rd Qu.:72.0 3rd Qu.:328.0
## Max. :90.0 Max. :400.0
## Blood.Pressure Heart.Rate Diabetes Family.History
## Length:4381 Min. : 40.00 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 57.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 75.00 Median :1.0000 Median :0.0000
## Mean : 74.77 Mean :0.6537 Mean :0.4951
## 3rd Qu.: 92.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :110.00 Max. :1.0000 Max. :1.0000
## Smoking Obesity Alcohol.Consumption Exercise.Hours.Per.Week
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.004443
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 5.089899
## Median :1.0000 Median :0.0000 Median :1.0000 Median : 9.900774
## Mean :0.8973 Mean :0.4967 Mean :0.6021 Mean : 9.933326
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:14.871307
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :19.997012
## Diet Previous.Heart.Problems Medication.Use Stress.Level
## Length:4381 Min. :0.0000 Min. :0.000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 3.000
## Mode :character Median :0.0000 Median :0.000 Median : 6.000
## Mean :0.4878 Mean :0.499 Mean : 5.516
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 8.000
## Max. :1.0000 Max. :1.000 Max. :10.000
## Sedentary.Hours.Per.Day Income BMI Triglycerides
## Min. : 0.008307 Min. : 20208 Min. :18.01 Min. : 30.0
## 1st Qu.: 3.053636 1st Qu.: 90784 1st Qu.:23.29 1st Qu.:219.0
## Median : 6.103019 Median :159628 Median :28.57 Median :409.0
## Mean : 6.071993 Mean :159834 Mean :28.84 Mean :411.4
## 3rd Qu.: 9.185967 3rd Qu.:230221 3rd Qu.:34.32 3rd Qu.:608.0
## Max. :11.999313 Max. :299850 Max. :39.99 Max. :800.0
## Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day Country
## Min. :0.000 Min. : 4.000 Length:4381
## 1st Qu.:1.000 1st Qu.: 5.000 Class :character
## Median :3.000 Median : 7.000 Mode :character
## Mean :3.437 Mean : 6.984
## 3rd Qu.:5.000 3rd Qu.: 9.000
## Max. :7.000 Max. :10.000
## Continent Hemisphere Heart.Attack.Risk
## Length:4381 Length:4381 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.3378
## 3rd Qu.:1.0000
## Max. :1.0000
The summaries show that the mean, median and quartiles of each variable are very similar across the five sub-samples. This suggests that the data are evenly distributed, so an analysis run on any one sub-sample should give results close to those from the others.
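Reading five long summary() printouts side by side is tedious, so the same comparison can be condensed into one small table. The following is a minimal sketch on made-up data: `toy`, `subsamples` and `comparison` are hypothetical names, and only one variable (Triglycerides) is summarized for brevity.

```r
library(dplyr)

set.seed(555)
# Toy stand-in for HA (hypothetical values, not the real dataset)
toy <- data.frame(
  Age = sample(18:90, 200, replace = TRUE),
  Triglycerides = sample(30:800, 200, replace = TRUE)
)

# Five subsamples of half the rows each, drawn with replacement
subsamples <- lapply(1:5, function(i) slice_sample(toy, n = 100, replace = TRUE))
names(subsamples) <- paste0("df_", 1:5)

# Stack the subsamples, label each row with its source, summarize per subsample
comparison <- bind_rows(subsamples, .id = "subsample") |>
  group_by(subsample) |>
  summarise(
    mean_tg   = mean(Triglycerides),
    median_tg = median(Triglycerides),
    q1_tg     = quantile(Triglycerides, 0.25, names = FALSE),
    q3_tg     = quantile(Triglycerides, 0.75, names = FALSE)
  )
comparison
```

Each row of `comparison` describes one sub-sample, so similar values down a column indicate consistent sub-samples.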
We can take the analysis further by comparing the average of a numeric variable between two sub-samples to see how much the random samples differ from each other. Specifically, we can compare the mean Triglycerides level by country in df_3 and df_5.
set.seed(555)
df_3 |>
  group_by(Country) |>
  summarise(mean_Triglycerides = mean(Triglycerides))
## # A tibble: 20 × 2
## Country mean_Triglycerides
## <chr> <dbl>
## 1 Argentina 417.
## 2 Australia 415.
## 3 Brazil 403.
## 4 Canada 376.
## 5 China 418.
## 6 Colombia 389.
## 7 France 444.
## 8 Germany 406.
## 9 India 395.
## 10 Italy 421.
## 11 Japan 425.
## 12 New Zealand 426.
## 13 Nigeria 420.
## 14 South Africa 419.
## 15 South Korea 428.
## 16 Spain 403.
## 17 Thailand 431.
## 18 United Kingdom 430.
## 19 United States 457.
## 20 Vietnam 437.
set.seed(555)
df_5 |>
  group_by(Country) |>
  summarise(mean_Triglycerides = mean(Triglycerides))
## # A tibble: 20 × 2
## Country mean_Triglycerides
## <chr> <dbl>
## 1 Argentina 392.
## 2 Australia 408.
## 3 Brazil 411.
## 4 Canada 373.
## 5 China 427.
## 6 Colombia 406.
## 7 France 414.
## 8 Germany 423.
## 9 India 376.
## 10 Italy 415.
## 11 Japan 421.
## 12 New Zealand 394.
## 13 Nigeria 431.
## 14 South Africa 390.
## 15 South Korea 395.
## 16 Spain 421.
## 17 Thailand 442.
## 18 United Kingdom 413.
## 19 United States 428.
## 20 Vietnam 440.
Comparing the two tables shows how the highest and lowest mean Triglycerides levels vary between df_3 and df_5. In df_3, the United States has the highest mean (457) and Canada has the lowest (376). In df_5, Thailand has the highest mean (442) and Canada again has the lowest (373). The country with the highest mean changes from one sample to the other, which shows how random sampling alone can shift the results, and how large the shift is depends on how evenly the data are distributed.
Since the original dataset is evenly distributed, the mean Triglycerides level of each country differs only slightly between the two samples. The largest gap is about 33, for South Korea (428 in df_3 versus 395 in df_5), so there are no large swings.
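The country-by-country comparison can also be computed directly instead of being read off two printed tables. The sketch below runs on made-up data under stated assumptions: `toy`, `sample_a`, `sample_b` and the helper `country_means()` are hypothetical names standing in for HA, df_3, df_5 and the group_by/summarise step used above.

```r
library(dplyr)

set.seed(555)
# Toy stand-in for HA (hypothetical values, not the real dataset)
toy <- data.frame(
  Country = sample(c("Canada", "France", "Japan"), 300, replace = TRUE),
  Triglycerides = sample(30:800, 300, replace = TRUE)
)
sample_a <- slice_sample(toy, n = 150, replace = TRUE)
sample_b <- slice_sample(toy, n = 150, replace = TRUE)

# Hypothetical helper: mean Triglycerides per country
country_means <- function(data) {
  data |>
    group_by(Country) |>
    summarise(mean_Triglycerides = mean(Triglycerides))
}

# Join the two tables on Country and compute the gap between the samples
gap <- full_join(
  country_means(sample_a),
  country_means(sample_b),
  by = "Country",
  suffix = c("_a", "_b")
) |>
  mutate(difference = mean_Triglycerides_a - mean_Triglycerides_b) |>
  arrange(desc(abs(difference)))
gap
```

Sorting by the absolute difference puts the countries whose means moved the most between the two samples at the top of the table.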
Looking back at the summaries of all the random samples, no sub-sample shows an anomaly that is absent from the others. In other words, the mean, median and quartiles are very similar across all sub-samples.
A likely reason is that the dataset contains few outliers. The values are evenly spread without extremes, which makes the dataset robust to random sampling and reliable for analysis.
Consistency among the sub-samples means that statistics such as the mean, median and quartiles are similar from one sub-sample to the next, which is what the summaries above show.
Another way to check consistency is to plot the random samples alongside the original dataset and compare whether the graphs look similar.
For simplicity, let’s plot the original dataset (HA), df_1 and df_5 using boxplots.
# Boxplot of Triglycerides by country for the original dataset.
# The default fill scale is kept because the Dark2 palette has only 8 colours,
# fewer than the 20 countries.
ggplot(HA, aes(x = Country,
               y = Triglycerides,
               fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")
# Same plot for sub-sample df_1
ggplot(df_1, aes(x = Country,
                 y = Triglycerides,
                 fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")
# Same plot for sub-sample df_5
ggplot(df_5, aes(x = Country,
                 y = Triglycerides,
                 fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")
Upon examining the boxplots for HA, df_1 and df_5, we see a high degree of similarity. This indicates that the distribution, central tendency and spread of the data are consistent across these samples. For instance, the first quartile, third quartile and median of each box in df_1 and df_5 are consistent with those in HA.
There are slight differences between the medians in df_1 and df_5 and those in HA, but these come from random sampling: each sub-sample contains only half as many rows as the original dataset, so its statistics are estimated from less data.
Overall, this consistency shows that the statistical properties of the dataset are preserved across the random samples, which reinforces the robustness of the data.
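Comparing three separate figures by eye can be awkward, so one option is to draw the original data and the sub-samples in a single plot. This is a sketch on made-up data, not the plots shown above: `toy` stands in for HA, the two `slice_sample()` draws stand in for df_1 and df_5, and `combined` and `p` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

set.seed(555)
# Toy stand-in for HA (hypothetical values, not the real dataset)
toy <- data.frame(
  Country = sample(c("Canada", "France", "Japan"), 300, replace = TRUE),
  Triglycerides = sample(30:800, 300, replace = TRUE)
)

# Stack the original data and two subsamples, labelling each row by its source
combined <- bind_rows(
  original = toy,
  df_1 = slice_sample(toy, n = 150, replace = TRUE),
  df_5 = slice_sample(toy, n = 150, replace = TRUE),
  .id = "source"
)

# One group of boxes per country, one box per data source within each group
p <- ggplot(combined, aes(x = Country, y = Triglycerides, fill = source)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level by data source")
p
```

With the boxes for each source placed next to each other, differences in medians and quartiles within a country are easier to see than across separate figures.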
The analysis above highlights a few points to keep in mind when drawing conclusions from randomly sampled data:
If the dataset is small, random sampling can change the results. For instance, in my HA dataset, the mean Triglycerides level could differ sharply between samples for a given country if that country had only a few rows.
If the dataset has missing values or missing combinations of categorical and continuous variables, the analysis of sub-samples can give poor and unreliable results. For example, in my HA dataset, no combination is missing: every country has patients with healthy, unhealthy and average diets. This kept the analysis robust after random sampling.
If the dataset contains extreme values or outliers, there is a higher chance of bias in the results. For instance, the boxplots of HA, df_1 and df_5 were highly similar box for box, which suggests that no extreme values are distorting the samples. Summarizing the sub-samples is one way to check the quality and distribution of the data.
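The checks for missing values and missing combinations mentioned above can be sketched as follows. This is an illustration on a tiny made-up table: `toy` and `missing_combos` are hypothetical names, and the Country and Diet values are invented so that some combinations are deliberately absent.

```r
library(dplyr)
library(tidyr)

# Tiny toy table (hypothetical values) with one missing measurement
toy <- data.frame(
  Country = c("Canada", "Canada", "France", "France", "Japan"),
  Diet = c("Healthy", "Unhealthy", "Healthy", "Average", "Healthy"),
  Triglycerides = c(120, 400, NA, 250, 310)
)

# Number of missing values in each column
colSums(is.na(toy))

# Country x Diet combinations that never occur in the data
missing_combos <- toy |>
  count(Country, Diet) |>                         # rows per observed combination
  complete(Country, Diet, fill = list(n = 0L)) |> # add unobserved combinations with n = 0
  filter(n == 0)
missing_combos
```

An empty `missing_combos` table would mean that every country appears with every diet, which is the situation described for the HA dataset.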