Week 4: Data Dive

Loading Library:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading Dataset:

HA <- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")

Defining Code to generate Random Samples:

The objective is to create five random samples, drawn with replacement, from the “Heart-Attack Prediction” dataset. Random sampling takes a short series of steps, and the code for each step is shown below.

For Reproducibility:

set.seed(555)
number_subsample<-5

For Defining the Sample Size:

size_subsample <- floor(nrow(HA) * 0.5)  # half of the rows, rounded down to a whole number

For Creating list:

list_subsample<- list()

For Random Sampling:

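for (i in seq_len(number_subsample)) {
  # Draw rows from the full dataset with replacement
  subsample <- HA |>
    sample_n(size = size_subsample, replace = TRUE)
  # Make the sample available as df_1 ... df_5
  assign(paste0("df_", i), subsample)
  # Also keep a copy in the list created above
  list_subsample[[i]] <- subsample
}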

The above code creates five random samples drawn with replacement, named df_1, df_2, df_3, df_4 and df_5. Setting the seed beforehand, as done above, lets us reproduce the same random samples every time the code is run.
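As a minimal sketch of the same idea, the samples can also be kept in one named list and checked for reproducibility. The data frame `toy` and its columns are made up for illustration and stand in for HA, and base R's `sample()` is used in place of `sample_n()` so that the snippet runs without extra packages.

```r
# Hypothetical toy data standing in for HA (not the real dataset)
toy <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

n_samples   <- 5
sample_size <- floor(nrow(toy) * 0.5)

# Draw n_samples samples with replacement and return them as a named list
draw_samples <- function(data, seed) {
  set.seed(seed)  # fix the random number stream
  out <- lapply(seq_len(n_samples), function(i) {
    rows <- sample(nrow(data), size = sample_size, replace = TRUE)
    data[rows, ]
  })
  names(out) <- paste0("df_", seq_len(n_samples))
  out
}

samples <- draw_samples(toy, seed = 555)
again   <- draw_samples(toy, seed = 555)

# The same seed gives back exactly the same samples
identical(samples, again)
```

Keeping the samples in a list makes it easy to loop over them later, for example with `lapply(samples, summary)`.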

Summarizing all Random Samples:

Let’s summarize all five random samples to check the mean, median and quartiles of each sub-sample. This helps us judge the quality and distribution of the data by revealing any outliers or extreme values in the sub-samples.

Random Sample (df_1):

summary(df_1)

##   Patient.ID             Age            Sex             Cholesterol   
##  Length:4381        Min.   :18.00   Length:4381        Min.   :120.0  
##  Class :character   1st Qu.:35.00   Class :character   1st Qu.:188.0  
##  Mode  :character   Median :54.00   Mode  :character   Median :258.0  
##                     Mean   :53.85                      Mean   :258.7  
##                     3rd Qu.:72.00                      3rd Qu.:330.0  
##                     Max.   :90.00                      Max.   :400.0  
##  Blood.Pressure       Heart.Rate        Diabetes      Family.History  
##  Length:4381        Min.   : 40.00   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.: 57.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median : 75.00   Median :1.0000   Median :0.0000  
##                     Mean   : 75.03   Mean   :0.6567   Mean   :0.4921  
##                     3rd Qu.: 93.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :110.00   Max.   :1.0000   Max.   :1.0000  
##     Smoking          Obesity       Alcohol.Consumption Exercise.Hours.Per.Week
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000      Min.   : 0.002442      
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000      1st Qu.: 5.040301      
##  Median :1.0000   Median :0.0000   Median :1.0000      Median :10.069365      
##  Mean   :0.9009   Mean   :0.4951   Mean   :0.5983      Mean   :10.029454      
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000      3rd Qu.:15.153341      
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000      Max.   :19.997891      
##      Diet           Previous.Heart.Problems Medication.Use    Stress.Level   
##  Length:4381        Min.   :0.0000          Min.   :0.0000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000          1st Qu.:0.0000   1st Qu.: 3.000  
##  Mode  :character   Median :0.0000          Median :1.0000   Median : 5.000  
##                     Mean   :0.4999          Mean   :0.5038   Mean   : 5.479  
##                     3rd Qu.:1.0000          3rd Qu.:1.0000   3rd Qu.: 8.000  
##                     Max.   :1.0000          Max.   :1.0000   Max.   :10.000  
##  Sedentary.Hours.Per.Day     Income            BMI        Triglycerides  
##  Min.   : 0.001529       Min.   : 20162   Min.   :18.00   Min.   : 30.0  
##  1st Qu.: 3.098552       1st Qu.: 86045   1st Qu.:23.27   1st Qu.:224.0  
##  Median : 5.962756       Median :155955   Median :28.43   Median :411.0  
##  Mean   : 6.058531       Mean   :156332   Mean   :28.71   Mean   :415.8  
##  3rd Qu.: 9.174081       3rd Qu.:224344   3rd Qu.:34.11   3rd Qu.:610.0  
##  Max.   :11.999313       Max.   :299909   Max.   :39.99   Max.   :800.0  
##  Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day   Country         
##  Min.   :0.000                   Min.   : 4.000      Length:4381       
##  1st Qu.:2.000                   1st Qu.: 5.000      Class :character  
##  Median :3.000                   Median : 7.000      Mode  :character  
##  Mean   :3.471                   Mean   : 7.063                        
##  3rd Qu.:5.000                   3rd Qu.: 9.000                        
##  Max.   :7.000                   Max.   :10.000                        
##   Continent          Hemisphere        Heart.Attack.Risk
##  Length:4381        Length:4381        Min.   :0.000    
##  Class :character   Class :character   1st Qu.:0.000    
##  Mode  :character   Mode  :character   Median :0.000    
##                                        Mean   :0.354    
##                                        3rd Qu.:1.000    
##                                        Max.   :1.000

Random Sample (df_2):

summary(df_2)

##   Patient.ID             Age            Sex             Cholesterol   
##  Length:4381        Min.   :18.00   Length:4381        Min.   :120.0  
##  Class :character   1st Qu.:35.00   Class :character   1st Qu.:194.0  
##  Mode  :character   Median :55.00   Mode  :character   Median :257.0  
##                     Mean   :54.08                      Mean   :258.8  
##                     3rd Qu.:72.00                      3rd Qu.:327.0  
##                     Max.   :90.00                      Max.   :400.0  
##  Blood.Pressure       Heart.Rate        Diabetes      Family.History  
##  Length:4381        Min.   : 40.00   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.: 57.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median : 76.00   Median :1.0000   Median :0.0000  
##                     Mean   : 75.46   Mean   :0.6576   Mean   :0.4926  
##                     3rd Qu.: 93.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :110.00   Max.   :1.0000   Max.   :1.0000  
##     Smoking          Obesity       Alcohol.Consumption Exercise.Hours.Per.Week
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00        Min.   : 0.004443      
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00        1st Qu.: 4.822964      
##  Median :1.0000   Median :0.0000   Median :1.00        Median :10.040609      
##  Mean   :0.9007   Mean   :0.4885   Mean   :0.59        Mean   : 9.923605      
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00        3rd Qu.:15.007338      
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00        Max.   :19.998709      
##      Diet           Previous.Heart.Problems Medication.Use    Stress.Level   
##  Length:4381        Min.   :0.0000          Min.   :0.0000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000          1st Qu.:0.0000   1st Qu.: 3.000  
##  Mode  :character   Median :0.0000          Median :0.0000   Median : 5.000  
##                     Mean   :0.4967          Mean   :0.4873   Mean   : 5.417  
##                     3rd Qu.:1.0000          3rd Qu.:1.0000   3rd Qu.: 8.000  
##                     Max.   :1.0000          Max.   :1.0000   Max.   :10.000  
##  Sedentary.Hours.Per.Day     Income            BMI        Triglycerides  
##  Min.   : 0.001263       Min.   : 20140   Min.   :18.00   Min.   : 30.0  
##  1st Qu.: 2.967004       1st Qu.: 88605   1st Qu.:23.55   1st Qu.:229.0  
##  Median : 5.936669       Median :161194   Median :28.75   Median :417.0  
##  Mean   : 5.954485       Mean   :159525   Mean   :28.91   Mean   :421.6  
##  3rd Qu.: 8.954137       3rd Qu.:229720   3rd Qu.:34.28   3rd Qu.:619.0  
##  Max.   :11.992341       Max.   :299954   Max.   :39.99   Max.   :800.0  
##  Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day   Country         
##  Min.   :0.000                   Min.   : 4.000      Length:4381       
##  1st Qu.:2.000                   1st Qu.: 5.000      Class :character  
##  Median :3.000                   Median : 7.000      Mode  :character  
##  Mean   :3.486                   Mean   : 6.999                        
##  3rd Qu.:5.000                   3rd Qu.: 9.000                        
##  Max.   :7.000                   Max.   :10.000                        
##   Continent          Hemisphere        Heart.Attack.Risk
##  Length:4381        Length:4381        Min.   :0.000    
##  Class :character   Class :character   1st Qu.:0.000    
##  Mode  :character   Mode  :character   Median :0.000    
##                                        Mean   :0.368    
##                                        3rd Qu.:1.000    
##                                        Max.   :1.000

Random Sample (df_3):

summary(df_3)

##   Patient.ID             Age            Sex             Cholesterol   
##  Length:4381        Min.   :18.00   Length:4381        Min.   :120.0  
##  Class :character   1st Qu.:35.00   Class :character   1st Qu.:193.0  
##  Mode  :character   Median :53.00   Mode  :character   Median :260.0  
##                     Mean   :53.65                      Mean   :260.4  
##                     3rd Qu.:72.00                      3rd Qu.:328.0  
##                     Max.   :90.00                      Max.   :400.0  
##  Blood.Pressure       Heart.Rate        Diabetes      Family.History  
##  Length:4381        Min.   : 40.00   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.: 58.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median : 76.00   Median :1.0000   Median :0.0000  
##                     Mean   : 75.41   Mean   :0.6606   Mean   :0.4855  
##                     3rd Qu.: 93.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :110.00   Max.   :1.0000   Max.   :1.0000  
##     Smoking       Obesity       Alcohol.Consumption Exercise.Hours.Per.Week
##  Min.   :0.0   Min.   :0.0000   Min.   :0.0000      Min.   : 0.004443      
##  1st Qu.:1.0   1st Qu.:0.0000   1st Qu.:0.0000      1st Qu.: 4.773629      
##  Median :1.0   Median :0.0000   Median :1.0000      Median : 9.986294      
##  Mean   :0.9   Mean   :0.4976   Mean   :0.5971      Mean   : 9.892954      
##  3rd Qu.:1.0   3rd Qu.:1.0000   3rd Qu.:1.0000      3rd Qu.:15.046399      
##  Max.   :1.0   Max.   :1.0000   Max.   :1.0000      Max.   :19.998709      
##      Diet           Previous.Heart.Problems Medication.Use    Stress.Level   
##  Length:4381        Min.   :0.0000          Min.   :0.0000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000          1st Qu.:0.0000   1st Qu.: 3.000  
##  Mode  :character   Median :0.0000          Median :0.0000   Median : 5.000  
##                     Mean   :0.4921          Mean   :0.4981   Mean   : 5.494  
##                     3rd Qu.:1.0000          3rd Qu.:1.0000   3rd Qu.: 8.000  
##                     Max.   :1.0000          Max.   :1.0000   Max.   :10.000  
##  Sedentary.Hours.Per.Day     Income            BMI        Triglycerides  
##  Min.   : 0.008307       Min.   : 20062   Min.   :18.00   Min.   : 30.0  
##  1st Qu.: 2.910678       1st Qu.: 88968   1st Qu.:23.35   1st Qu.:229.0  
##  Median : 6.078463       Median :161265   Median :28.84   Median :415.0  
##  Mean   : 6.033178       Mean   :159739   Mean   :28.93   Mean   :417.9  
##  3rd Qu.: 9.176924       3rd Qu.:230131   3rd Qu.:34.44   3rd Qu.:614.0  
##  Max.   :11.985484       Max.   :299954   Max.   :39.99   Max.   :800.0  
##  Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day   Country         
##  Min.   :0.000                   Min.   : 4.000      Length:4381       
##  1st Qu.:2.000                   1st Qu.: 5.000      Class :character  
##  Median :4.000                   Median : 7.000      Mode  :character  
##  Mean   :3.534                   Mean   : 6.966                        
##  3rd Qu.:6.000                   3rd Qu.: 9.000                        
##  Max.   :7.000                   Max.   :10.000                        
##   Continent          Hemisphere        Heart.Attack.Risk
##  Length:4381        Length:4381        Min.   :0.000    
##  Class :character   Class :character   1st Qu.:0.000    
##  Mode  :character   Mode  :character   Median :0.000    
##                                        Mean   :0.375    
##                                        3rd Qu.:1.000    
##                                        Max.   :1.000

Random Sample (df_4):

summary(df_4)

##   Patient.ID             Age            Sex             Cholesterol   
##  Length:4381        Min.   :18.00   Length:4381        Min.   :120.0  
##  Class :character   1st Qu.:35.00   Class :character   1st Qu.:193.0  
##  Mode  :character   Median :53.00   Mode  :character   Median :257.0  
##                     Mean   :53.53                      Mean   :259.8  
##                     3rd Qu.:72.00                      3rd Qu.:330.0  
##                     Max.   :90.00                      Max.   :400.0  
##  Blood.Pressure       Heart.Rate        Diabetes     Family.History  
##  Length:4381        Min.   : 40.00   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.: 58.00   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median : 76.00   Median :1.000   Median :0.0000  
##                     Mean   : 75.35   Mean   :0.638   Mean   :0.4997  
##                     3rd Qu.: 94.00   3rd Qu.:1.000   3rd Qu.:1.0000  
##                     Max.   :110.00   Max.   :1.000   Max.   :1.0000  
##     Smoking          Obesity       Alcohol.Consumption Exercise.Hours.Per.Week
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000      Min.   : 0.004443      
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000      1st Qu.: 5.084560      
##  Median :1.0000   Median :1.0000   Median :1.0000      Median :10.169128      
##  Mean   :0.8982   Mean   :0.5035   Mean   :0.5898      Mean   :10.089018      
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000      3rd Qu.:15.162057      
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000      Max.   :19.997891      
##      Diet           Previous.Heart.Problems Medication.Use    Stress.Level   
##  Length:4381        Min.   :0.0000          Min.   :0.0000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000          1st Qu.:0.0000   1st Qu.: 3.000  
##  Mode  :character   Median :0.0000          Median :0.0000   Median : 5.000  
##                     Mean   :0.4992          Mean   :0.4805   Mean   : 5.388  
##                     3rd Qu.:1.0000          3rd Qu.:1.0000   3rd Qu.: 8.000  
##                     Max.   :1.0000          Max.   :1.0000   Max.   :10.000  
##  Sedentary.Hours.Per.Day     Income            BMI        Triglycerides  
##  Min.   : 0.001529       Min.   : 20062   Min.   :18.00   Min.   : 30.0  
##  1st Qu.: 3.029868       1st Qu.: 89742   1st Qu.:23.56   1st Qu.:224.0  
##  Median : 5.912014       Median :159293   Median :28.92   Median :411.0  
##  Mean   : 6.003798       Mean   :159235   Mean   :28.98   Mean   :415.2  
##  3rd Qu.: 9.072136       3rd Qu.:227652   3rd Qu.:34.39   3rd Qu.:609.0  
##  Max.   :11.987716       Max.   :299909   Max.   :40.00   Max.   :800.0  
##  Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day   Country         
##  Min.   :0.00                    Min.   : 4.000      Length:4381       
##  1st Qu.:1.00                    1st Qu.: 5.000      Class :character  
##  Median :3.00                    Median : 7.000      Mode  :character  
##  Mean   :3.45                    Mean   : 7.038                        
##  3rd Qu.:5.00                    3rd Qu.: 9.000                        
##  Max.   :7.00                    Max.   :10.000                        
##   Continent          Hemisphere        Heart.Attack.Risk
##  Length:4381        Length:4381        Min.   :0.0000   
##  Class :character   Class :character   1st Qu.:0.0000   
##  Mode  :character   Mode  :character   Median :0.0000   
##                                        Mean   :0.3472   
##                                        3rd Qu.:1.0000   
##                                        Max.   :1.0000

Random Sample (df_5):

summary(df_5)

##   Patient.ID             Age           Sex             Cholesterol   
##  Length:4381        Min.   :18.0   Length:4381        Min.   :120.0  
##  Class :character   1st Qu.:35.0   Class :character   1st Qu.:190.0  
##  Mode  :character   Median :53.0   Mode  :character   Median :257.0  
##                     Mean   :53.6                      Mean   :258.1  
##                     3rd Qu.:72.0                      3rd Qu.:328.0  
##                     Max.   :90.0                      Max.   :400.0  
##  Blood.Pressure       Heart.Rate        Diabetes      Family.History  
##  Length:4381        Min.   : 40.00   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.: 57.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median : 75.00   Median :1.0000   Median :0.0000  
##                     Mean   : 74.77   Mean   :0.6537   Mean   :0.4951  
##                     3rd Qu.: 92.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :110.00   Max.   :1.0000   Max.   :1.0000  
##     Smoking          Obesity       Alcohol.Consumption Exercise.Hours.Per.Week
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000      Min.   : 0.004443      
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000      1st Qu.: 5.089899      
##  Median :1.0000   Median :0.0000   Median :1.0000      Median : 9.900774      
##  Mean   :0.8973   Mean   :0.4967   Mean   :0.6021      Mean   : 9.933326      
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000      3rd Qu.:14.871307      
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000      Max.   :19.997012      
##      Diet           Previous.Heart.Problems Medication.Use   Stress.Level   
##  Length:4381        Min.   :0.0000          Min.   :0.000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000          1st Qu.:0.000   1st Qu.: 3.000  
##  Mode  :character   Median :0.0000          Median :0.000   Median : 6.000  
##                     Mean   :0.4878          Mean   :0.499   Mean   : 5.516  
##                     3rd Qu.:1.0000          3rd Qu.:1.000   3rd Qu.: 8.000  
##                     Max.   :1.0000          Max.   :1.000   Max.   :10.000  
##  Sedentary.Hours.Per.Day     Income            BMI        Triglycerides  
##  Min.   : 0.008307       Min.   : 20208   Min.   :18.01   Min.   : 30.0  
##  1st Qu.: 3.053636       1st Qu.: 90784   1st Qu.:23.29   1st Qu.:219.0  
##  Median : 6.103019       Median :159628   Median :28.57   Median :409.0  
##  Mean   : 6.071993       Mean   :159834   Mean   :28.84   Mean   :411.4  
##  3rd Qu.: 9.185967       3rd Qu.:230221   3rd Qu.:34.32   3rd Qu.:608.0  
##  Max.   :11.999313       Max.   :299850   Max.   :39.99   Max.   :800.0  
##  Physical.Activity.Days.Per.Week Sleep.Hours.Per.Day   Country         
##  Min.   :0.000                   Min.   : 4.000      Length:4381       
##  1st Qu.:1.000                   1st Qu.: 5.000      Class :character  
##  Median :3.000                   Median : 7.000      Mode  :character  
##  Mean   :3.437                   Mean   : 6.984                        
##  3rd Qu.:5.000                   3rd Qu.: 9.000                        
##  Max.   :7.000                   Max.   :10.000                        
##   Continent          Hemisphere        Heart.Attack.Risk
##  Length:4381        Length:4381        Min.   :0.0000   
##  Class :character   Class :character   1st Qu.:0.0000   
##  Mode  :character   Mode  :character   Median :0.0000   
##                                        Mean   :0.3378   
##                                        3rd Qu.:1.0000   
##                                        Max.   :1.0000

Upon summarization, we can see that the mean, median, and quartiles of each sub-sample are very similar to one another. This suggests that the data are evenly distributed, so an analysis based on any one sub-sample should give much the same result as the others.
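Reading five long summaries side by side is tedious, so one option is to collect a few statistics from every sample into a single small table. The sketch below assumes made-up data (`toy`, with hypothetical Age and Cholesterol values) rather than the real HA dataset, and uses base R only.

```r
# Hypothetical toy data standing in for HA
set.seed(1)
toy <- data.frame(
  Age         = sample(18:90, 200, replace = TRUE),
  Cholesterol = sample(120:400, 200, replace = TRUE)
)

# Five samples of half the rows, drawn with replacement
set.seed(555)
samples <- lapply(1:5, function(i) toy[sample(nrow(toy), 100, replace = TRUE), ])
names(samples) <- paste0("df_", 1:5)

# One row per sample, one column per statistic
stats <- t(sapply(samples, function(d) c(
  mean_age    = mean(d$Age),
  median_age  = median(d$Age),
  mean_chol   = mean(d$Cholesterol),
  median_chol = median(d$Cholesterol)
)))
round(stats, 1)

# Range (max minus min) of each statistic across the five samples
spread <- apply(stats, 2, function(x) diff(range(x)))
spread
```

A small `spread` relative to the scale of the variable is what "very similar" means in practice.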

Difference among Random Samples:

We can take the analysis further by comparing the average of a numeric variable between two sub-samples to see how different the random samples are from each other. Specifically, we can compare the mean Triglycerides level by country in random samples df_3 and df_5 to get a clear understanding of their differences.

Triglycerides on df_3:

# Grouping and averaging involve no randomness, so no seed is needed here
df_3 |>
  group_by(Country) |>
  summarise(mean_Triglycerides = mean(Triglycerides))

## # A tibble: 20 × 2
##    Country        mean_Triglycerides
##    <chr>                       <dbl>
##  1 Argentina                    417.
##  2 Australia                    415.
##  3 Brazil                       403.
##  4 Canada                       376.
##  5 China                        418.
##  6 Colombia                     389.
##  7 France                       444.
##  8 Germany                      406.
##  9 India                        395.
## 10 Italy                        421.
## 11 Japan                        425.
## 12 New Zealand                  426.
## 13 Nigeria                      420.
## 14 South Africa                 419.
## 15 South Korea                  428.
## 16 Spain                        403.
## 17 Thailand                     431.
## 18 United Kingdom               430.
## 19 United States                457.
## 20 Vietnam                      437.

Triglycerides on df_5:

# Grouping and averaging involve no randomness, so no seed is needed here
df_5 |>
  group_by(Country) |>
  summarise(mean_Triglycerides = mean(Triglycerides))

## # A tibble: 20 × 2
##    Country        mean_Triglycerides
##    <chr>                       <dbl>
##  1 Argentina                    392.
##  2 Australia                    408.
##  3 Brazil                       411.
##  4 Canada                       373.
##  5 China                        427.
##  6 Colombia                     406.
##  7 France                       414.
##  8 Germany                      423.
##  9 India                        376.
## 10 Italy                        415.
## 11 Japan                        421.
## 12 New Zealand                  394.
## 13 Nigeria                      431.
## 14 South Africa                 390.
## 15 South Korea                  395.
## 16 Spain                        421.
## 17 Thailand                     442.
## 18 United Kingdom               413.
## 19 United States                428.
## 20 Vietnam                      440.

By observing the above data frames, we can see how the highest and lowest mean Triglycerides levels vary between df_3 and df_5. For instance, in df_3 the United States has the highest mean Triglycerides level (457) and Canada has the lowest (376). In df_5, Thailand has the highest (442) and Canada again has the lowest (373). The ranking of the countries in between also changes, which shows how easily random sampling can shift results from one data frame to another, depending on how evenly the data are distributed.

Since the original dataset is evenly distributed, the average Triglycerides level of each country differs by only a few tens of units between the two samples, which keeps the variation small.
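To make such a comparison less dependent on eyeballing two printed tables, the per-country means of two samples can be joined and subtracted. This is only a sketch on made-up data: `toy`, the three country labels, and the sample names `s1` and `s2` are hypothetical, and base R's `aggregate()` and `merge()` are used instead of dplyr so the snippet is self-contained.

```r
# Hypothetical toy data: three countries and one numeric measure
set.seed(2)
toy <- data.frame(
  Country       = rep(c("A", "B", "C"), each = 100),
  Triglycerides = sample(30:800, 300, replace = TRUE)
)

# Two samples of half the rows, drawn with replacement
set.seed(555)
s1 <- toy[sample(nrow(toy), 150, replace = TRUE), ]
s2 <- toy[sample(nrow(toy), 150, replace = TRUE), ]

# Mean Triglycerides per country in each sample
m1 <- aggregate(Triglycerides ~ Country, data = s1, FUN = mean)
m2 <- aggregate(Triglycerides ~ Country, data = s2, FUN = mean)

# Join on Country and compute the gap between the two samples
cmp <- merge(m1, m2, by = "Country", suffixes = c("_s1", "_s2"))
cmp$difference <- cmp$Triglycerides_s1 - cmp$Triglycerides_s2
cmp
```

Sorting `cmp` by `abs(difference)` would show which countries are most sensitive to the sampling.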

Anomaly in One Sub-Sample:

If we refer back to the summaries of all random samples, we can see that no sub-sample shows an anomaly that is absent from the others. In other words, the mean, median and quartiles of all sub-samples are very similar to each other.

The likely reason is that the dataset does not contain many outliers. The data are evenly distributed without extreme values, which makes the dataset robust to random sampling and reliable for analysis.
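One way to back up the claim about outliers is the 1.5 × IQR rule, the same rule a boxplot uses by default to decide which points to draw beyond the whiskers. Choosing this rule is an assumption on my part; the numbers themselves are the Triglycerides quartiles and range printed by summary(df_1) above.

```r
# Triglycerides values as printed by summary(df_1)
q1      <- 224   # 1st quartile
q3      <- 610   # 3rd quartile
min_val <- 30    # minimum
max_val <- 800   # maximum

iqr   <- q3 - q1           # interquartile range
lower <- q1 - 1.5 * iqr    # lower fence
upper <- q3 + 1.5 * iqr    # upper fence
c(lower = lower, upper = upper)

# TRUE would mean at least one value lies outside the fences
has_outliers <- min_val < lower || max_val > upper
has_outliers
```

Since both the minimum and the maximum fall inside the fences, no Triglycerides value in df_1 counts as an outlier under this rule.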

Consistency among all Sub-Samples:

Consistency among sub-samples means that statistics such as the mean, median and quartiles are similar across all of them. We can refer back to the summaries of all random samples and see that the mean, median, and quartiles are indeed very similar.

Another way to check the consistency of the data across sub-samples is to visualize the random samples alongside the original dataset and compare whether the graphs look similar.

For the sake of simplicity, let’s visualize the original dataset (HA), df_1 and df_5 using boxplots.

Visualization on Original Dataset (HA):

# One box per country, filled by country with the default colours
ggplot(HA, aes(x = Country,
               y = Triglycerides,
               fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")

Visualization on df_1:

# One box per country, filled by country with the default colours
ggplot(df_1, aes(x = Country,
                 y = Triglycerides,
                 fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")

Visualization on df_5:

# One box per country, filled by country with the default colours
ggplot(df_5, aes(x = Country,
                 y = Triglycerides,
                 fill = Country)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Triglycerides Level",
       title = "Country vs Triglycerides Level")

Upon examining the visualizations of the HA, df_1, and df_5 datasets, the boxplots exhibit a high degree of similarity. This indicates that the underlying distribution, central tendency, and spread of the data are consistent across these samples. For instance, if we look at the first quartile, third quartile and median of each box in HA, df_1 and df_5, we can clearly see that df_1 and df_5 are consistent with HA.

The medians of df_1 and df_5 do differ slightly from those of HA, but this is expected: each random sample has only half as many rows as the original dataset, so its statistics are estimated from less data.

Overall, this high level of consistency after random sampling shows that the statistical properties of the dataset are preserved across different samples, which reinforces the robustness of the data.
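The comparison of the boxes can also be done numerically by stacking the datasets with a label column and asking `boxplot()` for its statistics instead of a drawing. The sketch below assumes made-up Triglycerides values in place of the real data; only the labels HA, df_1 and df_5 are borrowed from this report.

```r
# Hypothetical values standing in for HA$Triglycerides
set.seed(3)
values <- sample(30:800, 400, replace = TRUE)

# Two samples of half the size, drawn with replacement
set.seed(555)
s1 <- sample(values, 200, replace = TRUE)
s5 <- sample(values, 200, replace = TRUE)

# Stack everything with a label saying where each value came from
stacked <- data.frame(
  source = factor(rep(c("HA", "df_1", "df_5"), times = c(400, 200, 200)),
                  levels = c("HA", "df_1", "df_5")),
  Triglycerides = c(values, s1, s5)
)

# plot = FALSE returns the box statistics without drawing anything
b <- boxplot(Triglycerides ~ source, data = stacked, plot = FALSE)
colnames(b$stats) <- b$names
rownames(b$stats) <- c("lower whisker", "lower hinge", "median",
                       "upper hinge", "upper whisker")
b$stats
```

The hinges are close to the first and third quartiles, so similar columns in `b$stats` correspond to boxes that look alike in the plots.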

Drawing Conclusions about Data in the Future:

The above analysis highlights a few points that we should keep in mind when drawing conclusions from data after random sampling. Some of them are as follows:

Size of Dataset:

If the dataset is small, random sampling can noticeably affect the results. For instance, in my HA dataset, the mean Triglycerides level of each country would vary much more from sample to sample, and the samples could even contradict each other, if each country had only a few rows of data.
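The effect of sample size can be illustrated with a small simulation: the smaller the sample, the more its mean jumps around from draw to draw (the standard error of a mean shrinks roughly in proportion to one over the square root of the sample size). The `population` vector, the helper `spread_of_mean()` and the two sample sizes below are hypothetical choices for illustration only.

```r
# Hypothetical population standing in for one country's Triglycerides values
set.seed(4)
population <- sample(30:800, 5000, replace = TRUE)

# Standard deviation of the sample mean over many repeated draws of size n
spread_of_mean <- function(n, reps = 500) {
  means <- replicate(reps, mean(sample(population, n, replace = TRUE)))
  sd(means)
}

set.seed(555)
small <- spread_of_mean(20)     # few rows per sample
large <- spread_of_mean(2000)   # many rows per sample
c(n_20 = small, n_2000 = large)
```

The value for n = 20 comes out much larger than the one for n = 2000, which is why conclusions drawn from small groups need more caution.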

Quality of Dataset:

If the dataset has missing values, or missing combinations of categorical and continuous variables, the analysis of sub-samples can give poor and unreliable results. For example, in my HA dataset, every country includes patients with healthy, unhealthy and average diets, so no combination is missing. This quality of the dataset preserved the robustness of the analysis after random sampling.
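Both checks mentioned above can be sketched in a few lines of base R. The data frame `toy` is invented for illustration and deliberately contains one missing value and one absent Country and Diet combination; the column names mirror those in HA.

```r
# Hypothetical toy data with one NA and one absent combination (B, Unhealthy)
toy <- data.frame(
  Country       = c("A", "A", "A", "B", "B"),
  Diet          = c("Healthy", "Average", "Unhealthy", "Healthy", "Average"),
  Triglycerides = c(210, NA, 400, 350, 280)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Cross-tabulate the two categorical variables
combos <- table(toy$Country, toy$Diet)
combos

# A zero cell is a combination that never occurs in the data
which(combos == 0, arr.ind = TRUE)
```

Running the same two checks on each sub-sample would show whether sampling has dropped a combination that exists in the full dataset.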

Distribution of Data:

If the dataset contains extreme values or outliers, there is a higher probability of bias in the results of the analysis. For instance, upon visualizing df_1, df_5 and HA, we saw that the corresponding boxes in the three boxplots were highly similar, so the pattern in the original dataset held in the sub-samples as well. Summarizing the sub-samples is one way to check how evenly the data in a dataset are distributed.