Introduction

The purpose of this data dive is to critically examine issues that can arise when drawing conclusions from a dataset. We will take random samples from the dataset to simulate data collection from a population, scrutinize these samples, and consider how what we find should affect future conclusions.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr is already attached by tidyverse above; this call is redundant but harmless
library(dplyr)

Data Preparation

First, we load the dataset and prepare it for sampling.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(dataset)

Sampling

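We will create five random samples from the dataset. Each sample has half as many rows as the full dataset (35,346 of 70,692) and is drawn with replacement, so a row can appear more than once within a sample. Each sample is stored in a separate data frame.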

set.seed(123) # For reproducibility

# Draw a random sample of rows (with replacement) from a data frame
create_sample <- function(data, sample_size) {
  sample_indices <- sample(seq_len(nrow(data)), size = sample_size, replace = TRUE)
  data[sample_indices, ]
}

# Define sample size as 50% of the dataset
sample_size <- floor(0.5 * nrow(dataset))

# Create 5 random samples
df_1 <- create_sample(dataset, sample_size)
df_2 <- create_sample(dataset, sample_size)
df_3 <- create_sample(dataset, sample_size)
df_4 <- create_sample(dataset, sample_size)
df_5 <- create_sample(dataset, sample_size)
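
Because `create_sample()` draws with `replace = TRUE`, a half-size sample does not contain half of the distinct rows. The following is a minimal sketch, assuming only the row count reported above (70,692) and using row indices as a stand-in for the data. The names `idx_with`, `idx_without`, and the `coverage_*` variables are illustrative and not part of the original analysis.

```r
# Hypothetical stand-in: row indices of a dataset with 70,692 rows
set.seed(123)
n_rows <- 70692
sample_size <- floor(0.5 * n_rows)

# Sampling WITH replacement can pick the same row more than once
idx_with <- sample(seq_len(n_rows), size = sample_size, replace = TRUE)
# Sampling WITHOUT replacement never repeats a row
idx_without <- sample(seq_len(n_rows), size = sample_size, replace = FALSE)

# Share of the dataset's rows that appear at least once in each sample
coverage_with <- length(unique(idx_with)) / n_rows
coverage_without <- length(unique(idx_without)) / n_rows

# Expected coverage with replacement: 1 - (1 - 1/N)^n, roughly 1 - exp(-0.5)
expected_with <- 1 - (1 - 1 / n_rows)^sample_size

print(c(with = coverage_with, without = coverage_without, expected = expected_with))
```

With replacement, roughly 39% of the distinct rows are expected to appear in each sample, compared with exactly 50% without replacement. Either choice can be reasonable; the point is that the description of the samples should match the sampling scheme.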

Analysis of Subsamples

Differences Among Subsamples

We will analyze how different these subsamples are and identify any anomalies.

# Compare summary statistics of each subsample
summary(df_1)
##  Diabetes_binary      HighBP          HighChol        CholCheck    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :1.000  
##  Mean   :0.4965   Mean   :0.5605   Mean   :0.5248   Mean   :0.975  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##       BMI           Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.0   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :29.0   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.9   Mean   :0.4702   Mean   :0.06289   Mean   :0.1474      
##  3rd Qu.:33.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.0   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.7016   Mean   :0.6126   Mean   :0.7921   Mean   :0.04142  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9548   Mean   :0.09506   Mean   :2.835   Mean   : 3.868  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 3.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 7.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 5.861   Mean   :0.2532   Mean   :0.4554   Mean   : 8.561  
##  3rd Qu.: 6.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.922   Mean   :5.688  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
summary(df_2)
##  Diabetes_binary      HighBP          HighChol       CholCheck     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :1.000   Median :1.0000  
##  Mean   :0.5038   Mean   :0.5652   Mean   :0.524   Mean   :0.9755  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :13.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :29.00   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.94   Mean   :0.4754   Mean   :0.06207   Mean   :0.1483      
##  3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.6988   Mean   :0.6121   Mean   :0.7852   Mean   :0.04317  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9552   Mean   :0.09605   Mean   :2.838   Mean   : 3.751  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 7.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 5.812   Mean   :0.2517   Mean   :0.4596   Mean   : 8.599  
##  3rd Qu.: 5.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.927   Mean   :5.698  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
summary(df_3)
##  Diabetes_binary      HighBP          HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.4958   Mean   :0.5631   Mean   :0.5247   Mean   :0.9749  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       BMI           Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :13.0   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :28.0   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.8   Mean   :0.4727   Mean   :0.06241   Mean   :0.1496      
##  3rd Qu.:33.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.0   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.7034   Mean   :0.6117   Mean   :0.7885   Mean   :0.04278  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9529   Mean   :0.09379   Mean   :2.835   Mean   : 3.713  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 7.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 5.805   Mean   :0.2512   Mean   :0.4591   Mean   : 8.559  
##  3rd Qu.: 5.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.928   Mean   :5.691  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
summary(df_4)
##  Diabetes_binary      HighBP          HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.4996   Mean   :0.5618   Mean   :0.5226   Mean   :0.9759  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :29.00   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.86   Mean   :0.4777   Mean   :0.06301   Mean   :0.1464      
##  3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000   
##  Mean   :0.7003   Mean   :0.6109   Mean   :0.7844   Mean   :0.0419   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9567   Mean   :0.09517   Mean   :2.833   Mean   : 3.749  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 7.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 5.811   Mean   :0.2536   Mean   :0.4566   Mean   : 8.574  
##  3rd Qu.: 5.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.922   Mean   :5.694  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
summary(df_5)
##  Diabetes_binary      HighBP         HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.000   Median :1.0000   Median :1.0000  
##  Mean   :0.4996   Mean   :0.563   Mean   :0.5278   Mean   :0.9748  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :13.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :29.00   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.87   Mean   :0.4698   Mean   :0.06346   Mean   :0.1489      
##  3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity        Fruits         Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.000   Median :1.0000   Median :0.00000  
##  Mean   :0.6993   Mean   :0.614   Mean   :0.7902   Mean   :0.04193  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.9554   Mean   :0.09602   Mean   :2.832   Mean   : 3.763  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age      
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.0  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 7.0  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 9.0  
##  Mean   : 5.818   Mean   :0.2496   Mean   :0.4547   Mean   : 8.6  
##  3rd Qu.: 6.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:11.0  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.0  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.928   Mean   :5.713  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
# Check for anomalies using group_by for categorical variables
analyze_anomalies <- function(df) {
  df %>%
    group_by(Sex, Age) %>%
    summarize(count = n(), mean_BMI = mean(BMI), .groups = 'drop')
}

analyze_anomalies(df_1)
analyze_anomalies(df_2)
analyze_anomalies(df_3)
analyze_anomalies(df_4)
analyze_anomalies(df_5)

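Insight: The subsamples vary slightly in both binary and continuous variables. Mean `BMI` ranges from about 29.8 to 29.9, and the mean of `Diabetes_binary` ranges from 0.4958 to 0.5038. Small shifts like these can still flip coarser statistics: the median of `Diabetes_binary` is 1 in `df_2` but 0 in the other samples, the median `BMI` is 28 in `df_3` but 29 elsewhere, and the minimum `BMI` is 12 in `df_1` and `df_4` but 13 in the rest. Conclusions based on a single subsample may therefore not represent the entire dataset, and an anomaly seen in one sample might not appear in another.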
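
Reading five `summary()` printouts side by side is error-prone, so it can help to put the per-sample means in one table and compute their spread. The sketch below is illustrative only: it uses a synthetic data frame `toy` whose column names mirror the dataset but whose values are made up, and the helper names `samples`, `means_by_sample`, and `spread` are assumptions.

```r
# Synthetic stand-in data: column names mirror the dataset, the values do not
set.seed(1)
n_rows <- 10000
toy <- data.frame(
  Diabetes_binary = rbinom(n_rows, size = 1, prob = 0.5),
  BMI = round(rnorm(n_rows, mean = 30, sd = 7))
)

# Draw five half-size samples with replacement, as in the Sampling section
sample_size <- floor(0.5 * nrow(toy))
samples <- lapply(1:5, function(i) {
  toy[sample(seq_len(nrow(toy)), size = sample_size, replace = TRUE), ]
})
names(samples) <- paste0("df_", 1:5)

# One column of means per subsample (rows are variables)
means_by_sample <- sapply(samples, colMeans)

# Spread of each variable's mean across the five subsamples
spread <- apply(means_by_sample, 1, function(x) max(x) - min(x))

print(round(means_by_sample, 4))
print(round(spread, 4))
```

Applied to the real samples, the same pattern would use `list(df_1, df_2, df_3, df_4, df_5)` in place of the synthetic `samples` list.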

Consistency Among Subsamples

Next, we identify aspects that are consistent across all subsamples.

# Check for consistent patterns in each subsample
consistent_patterns <- function(df) {
  df %>%
    group_by(HighBP, HighChol) %>%
    summarize(count = n(), mean_BMI = mean(BMI), .groups = 'drop')
}

consistent_patterns(df_1)
consistent_patterns(df_2)
consistent_patterns(df_3)
consistent_patterns(df_4)
consistent_patterns(df_5)

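Insight: Certain patterns, such as the joint distribution of `HighBP` and `HighChol`, remain consistent across all subsamples. In the summaries above, the share of respondents with `HighBP` stays between 0.5605 and 0.5652, and the share with `HighChol` between 0.5226 and 0.5278. Patterns that hold up in every subsample suggest that the underlying relationships are robust to sampling and can support general conclusions about the population.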
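
One way to make "consistent relationship" concrete is to compute the share of `HighChol` within each `HighBP` group for every subsample and compare the gap between the groups. The following is a minimal sketch on synthetic data: the association strengths (0.65 and 0.35) are invented for illustration, and `chol_rate_by_bp()`, `rates`, and `rate_gap` are hypothetical names.

```r
# Synthetic stand-in data with a built-in HighBP/HighChol association (assumed values)
set.seed(2)
n_rows <- 10000
high_bp <- rbinom(n_rows, size = 1, prob = 0.55)
# HighChol is made more likely when HighBP is 1; 0.65 and 0.35 are illustrative
high_chol <- rbinom(n_rows, size = 1, prob = ifelse(high_bp == 1, 0.65, 0.35))
toy <- data.frame(HighBP = high_bp, HighChol = high_chol)

# Share of HighChol = 1 within each HighBP group, for one data frame
chol_rate_by_bp <- function(df) {
  c(bp0 = mean(df$HighChol[df$HighBP == 0]),
    bp1 = mean(df$HighChol[df$HighBP == 1]))
}

# Repeat on five half-size samples drawn with replacement
sample_size <- floor(0.5 * nrow(toy))
rates <- sapply(1:5, function(i) {
  idx <- sample(seq_len(nrow(toy)), size = sample_size, replace = TRUE)
  chol_rate_by_bp(toy[idx, ])
})
colnames(rates) <- paste0("df_", 1:5)

# Gap in HighChol share between the HighBP = 1 and HighBP = 0 groups
rate_gap <- rates["bp1", ] - rates["bp0", ]

print(round(rates, 3))
print(round(rate_gap, 3))
```

If the gap has the same sign and a similar size in every subsample, that is some evidence the relationship is not an artifact of one particular draw.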

Monte Carlo Simulations

We use a Monte Carlo simulation to investigate variability and reliability further: the sampling step is repeated many times, and the statistic of interest is recorded for each draw.

# Example Monte Carlo simulation of the sampling distribution of mean BMI
simulate_bmi <- function(n_simulations) {
  results <- replicate(n_simulations, {
    sample_df <- create_sample(dataset, sample_size)
    mean(sample_df$BMI)
  })
  return(results)
}

bmi_simulation_results <- simulate_bmi(1000)

# Plot the results
hist(bmi_simulation_results, main = "Monte Carlo Simulation of BMI", xlab = "Mean BMI")

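Insight: Across the 1,000 simulated samples, the mean `BMI` is stable, which indicates that it is a reliable summary at this sample size. Note that the histogram shows the sampling distribution of the mean, not the distribution of individual `BMI` values, which range from 12 to 98 in the summaries above. This highlights the value of simulation for assessing variability and reliability in data analysis.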
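
The spread of the simulated means can be checked against the textbook standard error of a mean, `sd(x) / sqrt(n)`, which applies when rows are drawn independently with replacement. The sketch below uses a synthetic `bmi` vector (assumed mean 30 and standard deviation 7, not the real data) and fewer simulations to keep it quick. The names `sim_means`, `mc_sd`, and `analytic_se` are illustrative.

```r
# Synthetic stand-in for the BMI column (assumed mean 30, sd 7; not the real data)
set.seed(3)
bmi <- rnorm(10000, mean = 30, sd = 7)
sample_size <- floor(0.5 * length(bmi))

# Monte Carlo: record the mean of each with-replacement sample
n_simulations <- 500
sim_means <- replicate(n_simulations, {
  mean(sample(bmi, size = sample_size, replace = TRUE))
})

# Spread of the simulated means versus the textbook standard error
mc_sd <- sd(sim_means)
analytic_se <- sd(bmi) / sqrt(sample_size)

print(c(monte_carlo = mc_sd, analytic = analytic_se))
```

The two numbers should agree closely. Because the standard error shrinks with the square root of the sample size, very large samples produce stable means almost regardless of how widely individual values vary.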

Conclusion

Through this data dive, we have identified both differences and consistencies among subsamples. The analysis highlights potential anomalies as well as stable patterns that can inform future data-driven decisions. Open questions include investigating specific anomalies and exploring additional variables that may influence outcomes.