Introduction
The purpose of this data dive is to critically analyze issues that may arise when drawing conclusions from a dataset. We will take random samples from the dataset to simulate data collection from a population, scrutinize these samples, and consider how the findings should affect future conclusions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr is already attached by library(tidyverse); this call is redundant but harmless
library(dplyr)
Data Preparation
First, we load the dataset and prepare it for sampling.
dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset
# Display the first few rows of the dataset
head(dataset)
Sampling
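We will create 5 random samples from the dataset. Each sample is drawn with replacement and has as many rows as 50% of the dataset (35,346 of 70,692 rows), so a sample can contain the same row more than once. Each sample will be stored in a separate data frame.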
set.seed(123) # For reproducibility
# Function to create a random sample of rows, drawn with replacement
create_sample <- function(dataset, sample_size) {
  sample_indices <- sample(seq_len(nrow(dataset)), size = sample_size, replace = TRUE)
  return(dataset[sample_indices, ])
}
# Define sample size as 50% of the dataset
sample_size <- floor(0.5 * nrow(dataset))
# Create 5 random samples
df_1 <- create_sample(dataset, sample_size)
df_2 <- create_sample(dataset, sample_size)
df_3 <- create_sample(dataset, sample_size)
df_4 <- create_sample(dataset, sample_size)
df_5 <- create_sample(dataset, sample_size)
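Because `create_sample()` draws with `replace = TRUE`, a sample can hold duplicate rows. The following is a minimal sketch of that effect, not part of the original analysis: it uses a small simulated data frame (`toy`, with a hypothetical `id` column that does not exist in the real dataset) and compares the number of distinct rows in a with-replacement sample against a sample drawn without replacement.

```r
set.seed(123)

# Hypothetical stand-in for the real dataset: 1,000 rows with an id column
toy <- data.frame(id = seq_len(1000), BMI = rnorm(1000, mean = 30, sd = 7))
n_half <- floor(0.5 * nrow(toy))

# With replacement (as in create_sample): duplicates are possible
with_repl <- toy[sample(seq_len(nrow(toy)), size = n_half, replace = TRUE), ]

# Without replacement: every sampled row is distinct
without_repl <- toy[sample(seq_len(nrow(toy)), size = n_half, replace = FALSE), ]

distinct_with <- length(unique(with_repl$id))
distinct_without <- length(unique(without_repl$id))

# Expected number of distinct rows when drawing k times from N with replacement
expected_distinct <- nrow(toy) * (1 - (1 - 1 / nrow(toy))^n_half)

cat("distinct rows (with replacement):    ", distinct_with, "\n")
cat("distinct rows (without replacement): ", distinct_without, "\n")
cat("expected distinct (with replacement):", round(expected_distinct, 1), "\n")
```

When the sample size is half the number of rows, only about 39% of the dataset's rows are expected to appear at least once in a with-replacement sample. If the goal is to mimic surveying half of a population, `replace = FALSE` may be the closer analogy; `replace = TRUE` corresponds to bootstrap-style resampling.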
Analysis of Subsamples
Differences Among Subsamples
We will analyze how different these subsamples are and identify any anomalies.
# Compare summary statistics of each subsample
summary(df_1)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :1.000
## Mean :0.4965 Mean :0.5605 Mean :0.5248 Mean :0.975
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.0 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :29.0 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.9 Mean :0.4702 Mean :0.06289 Mean :0.1474
## 3rd Qu.:33.0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.0 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.00000
## Mean :0.7016 Mean :0.6126 Mean :0.7921 Mean :0.04142
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.9548 Mean :0.09506 Mean :2.835 Mean : 3.868
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 3.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 9.000
## Mean : 5.861 Mean :0.2532 Mean :0.4554 Mean : 8.561
## 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.922 Mean :5.688
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
summary(df_2)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.000 Median :1.0000
## Mean :0.5038 Mean :0.5652 Mean :0.524 Mean :0.9755
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :13.00 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :29.00 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.94 Mean :0.4754 Mean :0.06207 Mean :0.1483
## 3rd Qu.:33.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.00000
## Mean :0.6988 Mean :0.6121 Mean :0.7852 Mean :0.04317
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.9552 Mean :0.09605 Mean :2.838 Mean : 3.751
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 9.000
## Mean : 5.812 Mean :0.2517 Mean :0.4596 Mean : 8.599
## 3rd Qu.: 5.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.927 Mean :5.698
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
summary(df_3)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.4958 Mean :0.5631 Mean :0.5247 Mean :0.9749
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :13.0 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :28.0 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.8 Mean :0.4727 Mean :0.06241 Mean :0.1496
## 3rd Qu.:33.0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.0 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.00000
## Mean :0.7034 Mean :0.6117 Mean :0.7885 Mean :0.04278
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.9529 Mean :0.09379 Mean :2.835 Mean : 3.713
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 9.000
## Mean : 5.805 Mean :0.2512 Mean :0.4591 Mean : 8.559
## 3rd Qu.: 5.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.928 Mean :5.691
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
summary(df_4)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.4996 Mean :0.5618 Mean :0.5226 Mean :0.9759
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :29.00 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.86 Mean :0.4777 Mean :0.06301 Mean :0.1464
## 3rd Qu.:33.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7003 Mean :0.6109 Mean :0.7844 Mean :0.0419
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.9567 Mean :0.09517 Mean :2.833 Mean : 3.749
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 9.000
## Mean : 5.811 Mean :0.2536 Mean :0.4566 Mean : 8.574
## 3rd Qu.: 5.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.922 Mean :5.694
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
summary(df_5)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.000 Median :1.0000 Median :1.0000
## Mean :0.4996 Mean :0.563 Mean :0.5278 Mean :0.9748
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :13.00 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :29.00 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.87 Mean :0.4698 Mean :0.06346 Mean :0.1489
## 3rd Qu.:33.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.0000 Median :1.000 Median :1.0000 Median :0.00000
## Mean :0.6993 Mean :0.614 Mean :0.7902 Mean :0.04193
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.00000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.9554 Mean :0.09602 Mean :2.832 Mean : 3.763
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.0
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.0
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 9.0
## Mean : 5.818 Mean :0.2496 Mean :0.4547 Mean : 8.6
## 3rd Qu.: 6.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:11.0
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.0
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.928 Mean :5.713
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
# Count rows and compute mean BMI within each Sex x Age group
# (Age is a coded category from 1 to 13, not age in years)
analyze_anomalies <- function(df) {
  df %>%
    group_by(Sex, Age) %>%
    summarize(count = n(), mean_BMI = mean(BMI), .groups = 'drop')
}
analyze_anomalies(df_1)
analyze_anomalies(df_2)
analyze_anomalies(df_3)
analyze_anomalies(df_4)
analyze_anomalies(df_5)
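Insight: The subsamples differ only slightly. Mean `BMI` ranges from 29.8 to 29.94 and the mean of `Diabetes_binary` from 0.4958 to 0.5038 across the five samples. Statistics that sit near a threshold are less stable: the median of `Diabetes_binary` is 1 in `df_2` but 0 in the other four samples, the median `BMI` is 28 in `df_3` but 29 elsewhere, and the minimum `BMI` is 12 in `df_1` and `df_4` but 13 in the others. Conclusions based on a single subsample may therefore not represent the entire dataset, and an anomaly seen in one sample might not appear in another.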
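Reading five full `summary()` printouts side by side is error-prone. As a sketch of an alternative, the code below collects a few statistics per sample into one small table and reports the spread of each statistic across samples. It runs on simulated data: `toy` is a hypothetical stand-in whose column names mirror the real dataset, and the list `samples` replaces the separate `df_1` to `df_5` objects.

```r
set.seed(123)

# Hypothetical stand-in for the real dataset (column names mirror the original)
toy <- data.frame(
  Diabetes_binary = rbinom(2000, size = 1, prob = 0.5),
  BMI = round(rnorm(2000, mean = 30, sd = 7))
)
n_half <- floor(0.5 * nrow(toy))

# Draw five with-replacement samples and keep them in a named list
samples <- lapply(1:5, function(i) {
  toy[sample(seq_len(nrow(toy)), size = n_half, replace = TRUE), ]
})
names(samples) <- paste0("df_", 1:5)

# One row per sample, one column per statistic
comparison <- data.frame(
  sample = names(samples),
  mean_diabetes = sapply(samples, function(d) mean(d$Diabetes_binary)),
  median_diabetes = sapply(samples, function(d) median(d$Diabetes_binary)),
  mean_BMI = sapply(samples, function(d) mean(d$BMI)),
  median_BMI = sapply(samples, function(d) median(d$BMI)),
  row.names = NULL
)
print(comparison)

# Spread of each statistic across the five samples (max minus min)
spread <- sapply(comparison[-1], function(x) diff(range(x)))
print(spread)
```

A table like this makes it easier to see which statistics barely move between samples (typically the means) and which can jump (medians of binary or coarsely rounded variables).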
Consistency Among Subsamples
Next, we identify aspects that stay consistent across all subsamples.
# Check for consistent patterns in each subsample
consistent_patterns <- function(df) {
  df %>%
    group_by(HighBP, HighChol) %>%
    summarize(count = n(), mean_BMI = mean(BMI), .groups = 'drop')
}
consistent_patterns(df_1)
consistent_patterns(df_2)
consistent_patterns(df_3)
consistent_patterns(df_4)
consistent_patterns(df_5)
Insight: Certain patterns, such as the relationship between `HighBP` and `HighChol` (the group counts and mean BMI for each combination of the two), remain consistent across all subsamples. This consistency suggests that these relationships are robust to sampling variation and can support general conclusions about the population.
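One way to make "consistent" concrete is to reduce the relationship to a single number per sample and check whether its direction ever changes. The sketch below does this on simulated data, where `HighChol` is deliberately generated to be more common when `HighBP` is 1. The probabilities, the `toy` data frame, and the helper `chol_share_by_bp()` are all assumptions for illustration, not values or functions from the original analysis.

```r
set.seed(123)

# Hypothetical stand-in data: HighChol is made more likely when HighBP is 1
n <- 4000
HighBP <- rbinom(n, size = 1, prob = 0.55)
HighChol <- rbinom(n, size = 1, prob = ifelse(HighBP == 1, 0.65, 0.40))
toy <- data.frame(HighBP = HighBP, HighChol = HighChol)
n_half <- floor(0.5 * nrow(toy))

# Share of rows with high cholesterol within each HighBP group
chol_share_by_bp <- function(df) {
  tapply(df$HighChol, df$HighBP, mean)
}

# Repeat for five with-replacement samples; each column is one sample
shares <- sapply(1:5, function(i) {
  s <- toy[sample(seq_len(nrow(toy)), size = n_half, replace = TRUE), ]
  chol_share_by_bp(s)
})
colnames(shares) <- paste0("df_", 1:5)
rownames(shares) <- c("HighBP = 0", "HighBP = 1")
print(round(shares, 3))

# The pattern is consistent if the gap has the same sign in every sample
gap <- shares["HighBP = 1", ] - shares["HighBP = 0", ]
print(all(gap > 0))
```

The same check could be applied to the real `df_1` to `df_5` by passing each one to a helper like `chol_share_by_bp()`.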
Monte Carlo Simulations
We use a Monte Carlo simulation to investigate further how much a sample statistic varies from one random sample to the next.
# Monte Carlo simulation of the sampling distribution of mean BMI:
# draw n_simulations samples and record the mean BMI of each
simulate_bmi <- function(n_simulations) {
  results <- replicate(n_simulations, {
    sample_df <- create_sample(dataset, sample_size)
    mean(sample_df$BMI)
  })
  return(results)
}
bmi_simulation_results <- simulate_bmi(1000)
# Plot the results
hist(bmi_simulation_results, main = "Monte Carlo Simulation of BMI", xlab = "Mean BMI")
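Insight: The histogram shows the sampling distribution of mean `BMI` over 1,000 simulated samples, not the distribution of individual BMI values. The simulated means are stable from one simulation to the next, indicating that mean BMI is a reliable metric at this sample size (35,346 rows per sample). This highlights the importance of using simulations to assess variability and reliability in data analysis.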
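The stability seen in the histogram can be quantified. The sketch below, run on a simulated `bmi` vector (a hypothetical stand-in for the real `BMI` column), computes the standard deviation of the simulated means and compares it with the textbook standard error, the standard deviation of the data divided by the square root of the sample size.

```r
set.seed(123)

# Hypothetical stand-in for the BMI column of the real dataset
bmi <- rnorm(5000, mean = 30, sd = 7)
n_half <- floor(0.5 * length(bmi))

# Monte Carlo: mean BMI of 1,000 with-replacement samples
sim_means <- replicate(1000, mean(sample(bmi, size = n_half, replace = TRUE)))

# Empirical spread of the simulated means
mc_se <- sd(sim_means)
mc_interval <- quantile(sim_means, probs = c(0.025, 0.975))

# Theoretical standard error of a mean of n_half independent draws
theory_se <- sd(bmi) / sqrt(n_half)

cat("Monte Carlo SE:", round(mc_se, 4), "\n")
cat("Theoretical SE:", round(theory_se, 4), "\n")
cat("Central 95% of simulated means:", round(mc_interval, 3), "\n")
```

The two standard errors should agree closely. Because the standard error shrinks with the square root of the sample size, very large samples such as those used here are expected to produce tightly clustered means, so a narrow histogram reflects sample size as much as any property of BMI itself.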
Conclusion
Through this data dive, we have identified both differences and consistencies among subsamples. The analysis highlights which statistics can shift between samples and which patterns hold across all of them, and both findings can inform future data-driven decisions. Further questions include investigating specific anomalies and exploring additional variables that may influence outcomes.