Data Dive — Sampling and Drawing Conclusions

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

# Load dataset 
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\obesity.csv")
# View the first few rows of the dataset
head(obesity)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

# Summary of the dataset
str(obesity)

## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

summary(obesity)

##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Random sampling with replacement

# Define sample size (50% of the total data)
sample_size <- floor(0.5 * nrow(obesity))

# Create 5 random subsamples with replacement
df_1 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_2 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_3 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_4 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_5 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]

# Check the size of each sample
list(Sample_1 = nrow(df_1), Sample_2 = nrow(df_2), Sample_3 = nrow(df_3), Sample_4 = nrow(df_4), Sample_5 = nrow(df_5))

## $Sample_1
## [1] 1055
## 
## $Sample_2
## [1] 1055
## 
## $Sample_3
## [1] 1055
## 
## $Sample_4
## [1] 1055
## 
## $Sample_5
## [1] 1055

Insights and summary

Random Subsampling Each subsample (df_1, df_2, etc.) contains 50% of the total data and was created by randomly selecting rows with replacement. This means that some rows from the original dataset may appear multiple times in a given sample, while others may not appear at all.

Significance: Sampling Variability: This reflects how real-world data collection often involves random sampling, and different subsets of the same population can yield slightly different results. Monte Carlo Simulation: These subsamples allow you to conduct further analysis to assess the stability and reliability of statistics like the mean, variance, or other measures across different samples. The provided code generates five random subsamples (df_1 to df_5), each containing 50% of the total rows from the obesity_data dataset, using sampling with replacement. Let’s break down the insights and their significance:

Sample Size Consistency The output checks the size of each subsample using nrow(). Since each subsample was intended to represent 50% of the dataset, the size of each subsample should be approximately half of the original dataset. This ensures that the sampling process has been correctly executed.

Significance: Consistency of Sample Size: The consistency in sample size confirms that the method is correctly implemented and ensures each sample can be meaningfully compared to one another. Even though the samples are drawn randomly, they all maintain the same size, ensuring fair comparison.

Random Variation Since the samples are generated randomly with replacement, each subsample can potentially have different characteristics. For example:
- One subsample may overrepresent certain classes (e.g., more instances of a particular obesity level).
- Certain extreme values (outliers) could appear more frequently in some samples than in others.

Significance: Assessing Randomness: By generating multiple subsamples, you can explore how random variation affects your results. This is particularly important when making inferences about population characteristics or when building models. For example, if one subsample shows a significant deviation in BMI values, this might be due to random sampling, not a true population trend. - Impact on Conclusions: It highlights how conclusions drawn from a single sample might be affected by the specific composition of that sample. By examining multiple random samples, you get a better understanding of how robust your conclusions are against random fluctuations in the data.

Insight: Preparation for Comparison These subsamples allow you to compare how different aspects of the data, such as categorical and continuous variables, behave in different samples. For example:
- You could compare the mean BMI across the subsamples.
- You might check the distribution of obesity levels in each subsample.

Significance: Variation Across Subsamples: Differences between subsamples (e.g., different averages or distributions) can reveal insights into how much variation exists within the dataset. This might highlight trends that hold consistently or identify potential anomalies that are present in one subsample but not others. Understanding Data Stability: Exploring how different subsamples reflect the population helps to assess the stability of the dataset. This is crucial when building predictive models because stable patterns across samples suggest more reliable models.

Continuous Variables comparison

# Plot the distribution of BMI across the five subsamples
ggplot() + 
  geom_density(aes(x = df_1$Age, color = "Sample 1"), size = 1) +
  geom_density(aes(x = df_2$Age, color = "Sample 2"), size = 1) +
  geom_density(aes(x = df_3$Age, color = "Sample 3"), size = 1) +
  geom_density(aes(x = df_4$Age, color = "Sample 4"), size = 1) +
  geom_density(aes(x = df_5$Age, color = "Sample 5"), size = 1) +
  labs(title = "age Distribution Across 5 Subsamples", x = "Age", y = "density") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

insights

A density plot is useful to visualize the distribution of a continuous variable (in this case, Age) for each of the five subsamples. It shows how the Age values are distributed within each sample, giving insights into the overall shape (e.g., whether they are skewed, multimodal, etc.) and whether there are significant differences between the subsamples.

categorical variables comparison

library(knitr)
library(dplyr)

# Combine the gender tables for all samples
gender_distribution <- data.frame(
  Sample1 = table(df_1$Gender),
  Sample2 = table(df_2$Gender),
  Sample3 = table(df_3$Gender),
  Sample4 = table(df_4$Gender),
  Sample5 = table(df_5$Gender)
)
# Display the table
kable(gender_distribution, caption = "Gender Distribution Across Samples")

Gender Distribution Across Samples
Sample1.Var1	Sample1.Freq	Sample2.Var1	Sample2.Freq	Sample3.Var1	Sample3.Freq	Sample4.Var1	Sample4.Freq	Sample5.Var1	Sample5.Freq
Female	514	Female	541	Female	505	Female	510	Female	519
Male	541	Male	514	Male	550	Male	545	Male	536

Insights

Comparing Gender Balance: By looking at the frequency of males and females (or other gender categories, if available) across subsamples, you can determine if your groups are well-balanced or if one gender is overrepresented. This is important for ensuring that any findings or conclusions drawn from these samples are not biased by gender imbalances.
Identifying Patterns Across Subsamples: If the gender distribution is roughly the same across all subsamples, it indicates that gender won’t significantly influence any differences found between the samples. If there are major differences in gender distribution, it could suggest the need for stratified analysis, where results are split by gender to ensure fair comparisons.
Potential Need for Weighting: If some samples have a significant imbalance, you might want to apply statistical weighting or adjust your models to account for this imbalance, especially if gender could affect the outcome variables.

Scrutinize the Subsamples

# Function to summarize data
summarize_data <- function(df) {
  df %>%
    group_by(Gender) %>%
    summarise(
      mean_Age = mean(Age, na.rm = TRUE),
      mean_Weight = mean(Weight, na.rm = TRUE),
      count = n()
    )
}

summary_1 <- summarize_data(df_1)
summary_2 <- summarize_data(df_2)
summary_3 <- summarize_data(df_3)
summary_4 <- summarize_data(df_4)
summary_5 <- summarize_data(df_5)

# Display summaries
summary_1

## # A tibble: 2 × 4
##   Gender mean_Age mean_Weight count
##   <chr>     <dbl>       <dbl> <int>
## 1 Female     24.1        83.4   514
## 2 Male       24.6        90.8   541

summary_2

## # A tibble: 2 × 4
##   Gender mean_Age mean_Weight count
##   <chr>     <dbl>       <dbl> <int>
## 1 Female     24.0        83.1   541
## 2 Male       24.7        91.3   514

summary_3

## # A tibble: 2 × 4
##   Gender mean_Age mean_Weight count
##   <chr>     <dbl>       <dbl> <int>
## 1 Female     23.7        82.8   505
## 2 Male       24.9        92.9   550

summary_4

## # A tibble: 2 × 4
##   Gender mean_Age mean_Weight count
##   <chr>     <dbl>       <dbl> <int>
## 1 Female     23.8        82.2   510
## 2 Male       24.9        91.2   545

summary_5

## # A tibble: 2 × 4
##   Gender mean_Age mean_Weight count
##   <chr>     <dbl>       <dbl> <int>
## 1 Female     23.9        84.2   519
## 2 Male       24.6        90.3   536

Insights

Function Purpose: The summarize_data function groups data by Gender and calculates the mean age, mean weight, and counts the number of entries for each gender in the samples. Benefits: Reusability: The function can be applied to multiple data frames without repeating code. Simplicity: It keeps your code organized and easier to read.
Key Statistics: Mean Age: Helps identify age trends between genders across samples. Mean Weight: Provides insights into weight differences among genders. Count: Indicates gender representation within each sample. Interpreting Results: Look for gender balance: Analyzing the counts can reveal imbalances that may affect results. Compare mean values: Differences in mean age and weight between samples can indicate demographic variations.

How Different Are They?

# Combine summaries into one data frame for comparison
summary_combined <- bind_rows(
  mutate(summary_1, Sample = "Sample 1"),
  mutate(summary_2, Sample = "Sample 2"),
  mutate(summary_3, Sample = "Sample 3"),
  mutate(summary_4, Sample = "Sample 4"),
  mutate(summary_5, Sample = "Sample 5")
)

# View combined summary
summary_combined

## # A tibble: 10 × 5
##    Gender mean_Age mean_Weight count Sample  
##    <chr>     <dbl>       <dbl> <int> <chr>   
##  1 Female     24.1        83.4   514 Sample 1
##  2 Male       24.6        90.8   541 Sample 1
##  3 Female     24.0        83.1   541 Sample 2
##  4 Male       24.7        91.3   514 Sample 2
##  5 Female     23.7        82.8   505 Sample 3
##  6 Male       24.9        92.9   550 Sample 3
##  7 Female     23.8        82.2   510 Sample 4
##  8 Male       24.9        91.2   545 Sample 4
##  9 Female     23.9        84.2   519 Sample 5
## 10 Male       24.6        90.3   536 Sample 5

Insights

Comparison Across Samples: By combining the summaries into summary_combined, you can easily compare mean age, mean weight, and count of each gender across all five samples in a single view.
Identifying Trends: Look for patterns or differences in mean age and weight between genders across samples. For example, if one sample shows a significantly higher mean weight for females compared to males, this could indicate differing health or lifestyle factors.
Assessing Gender Representation: The count of individuals per gender in each sample allows you to assess gender balance. Noticing significant disparities in counts could suggest biases or limitations in the sample collection process.

Visualizing Differences

# Load ggplot2 for visualization
library(ggplot2)

# Plot mean weight by sample
ggplot(summary_combined, aes(x = Sample, y = mean_Weight, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Mean Weight by Sample and Gender", y = "Mean Weight (kg)", x = "Sample")

### Insights - Visual Comparison: The bar plot allows for a clear visual comparison of mean weight across different samples for each gender, facilitating quick insights into any differences. - Weight Trends: You can identify trends such as whether one gender consistently weighs more across samples or if there are significant variations in mean weight between genders in specific samples. - Sample Differences: Any notable differences in mean weight between samples (e.g., Sample 1 vs. Sample 5) can indicate varying demographic or lifestyle factors that might influence weight. - Gender Imbalances: If one gender has higher mean weights across most samples, it may suggest underlying health or social factors that should be explored further.

Anomalies in Subsamples

Identify Anomalies

library(dplyr)

# Function to identify outliers based on Weight
identify_anomalies <- function(df) {
  mean_weight <- mean(df$Weight, na.rm = TRUE)
  sd_weight <- sd(df$Weight, na.rm = TRUE)
  
  df %>% 
    filter(Weight > (mean_weight + 3 * sd_weight) | 
           Weight < (mean_weight - 3 * sd_weight))
}

# List of data frames
dfs <- list(df_1, df_2, df_3, df_4, df_5)

# Initialize a list to store anomalies
anomalies_list <- list()

# Loop through each data frame to identify anomalies
for (i in 1:length(dfs)) {
  anomalies_list[[i]] <- identify_anomalies(dfs[[i]]) %>%
    mutate(Sample = paste("Sample", i))  # Add sample identifier
}

# Combine all anomalies into one data frame for easier viewing
anomalies_combined <- bind_rows(anomalies_list)

# Display combined anomalies
anomalies_combined

##  [1] Gender                         Age                           
##  [3] Height                         Weight                        
##  [5] family_history_with_overweight FAVC                          
##  [7] FCVC                           NCP                           
##  [9] CAEC                           SMOKE                         
## [11] CH2O                           SCC                           
## [13] FAF                            TUE                           
## [15] CALC                           MTRANS                        
## [17] NObeyesdad                     Sample                        
## <0 rows> (or 0-length row.names)

Insights

Outlier Identification: The function identify_anomalies detects outliers in weight using the mean and standard deviation, flagging values that fall outside three standard deviations from the mean. This helps identify unusually high or low weights.
Loop Efficiency: Using a loop to process multiple data frames (df_1 to df_5) reduces redundancy and makes the code cleaner, enabling easy scalability if more data frames are added later.
Sample Context: By tagging anomalies with a sample identifier, it becomes easier to track which samples have outliers, facilitating further investigation into potential causes or data quality issues.
Combined View: The combined data frame (anomalies_combined) provides a holistic view of outliers across all samples, simplifying analysis and comparison.

Consistency Among Subsamples

# Check distribution of Gender across all samples
gender_distribution <- bind_rows(
  df_1 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 1"),
  df_2 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 2"),
  df_3 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 3"),
  df_4 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 4"),
  df_5 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 5")
)

# Visualize the distribution of Gender
ggplot(gender_distribution, aes(x = Sample, y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Gender Distribution Across Samples", y = "Count", x = "Sample")

### Insights - Gender Distribution: The code aggregates the count of each gender across the five samples, providing a clear view of the gender composition in each sample. - Visual Comparison: The bar plot visualizes the gender distribution, allowing for immediate comparison of how many males and females are present in each sample. This can highlight gender imbalances. - Sample Variation: By observing the heights of the bars, you can identify any significant variations in gender representation across samples. For example, one sample might have predominantly one gender, which could indicate a sampling bias. - Data Quality Assessment: Analyzing the gender distribution can help assess the quality and diversity of the samples, informing decisions on data validity and potential adjustments needed for balanced representation.

Monte Carlo Simulations

# Monte Carlo simulation to estimate mean Weight across 1000 samples
set.seed(123)
weight_simulation <- replicate(1000, {
  sample_data <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
  mean(sample_data$Weight, na.rm = TRUE)
})

# Plot the distribution of simulated weights
hist(weight_simulation, breaks = 30, main = "Monte Carlo Simulation of Mean Weight", xlab = "Mean Weight (kg)")

### Insights - Simulation Purpose: The Monte Carlo simulation estimates the mean weight by drawing 1,000 random samples from the obesity_data. This method provides a robust way to assess the variability and distribution of the mean weight estimate. - Random Sampling: By sampling with replacement, the simulation captures the potential variability in mean weight that could arise from different random selections, which helps to understand how representative the mean is. - Distribution Visualization: The histogram displays the distribution of the simulated mean weights, allowing you to visualize the range and frequency of mean weight estimates. A normal or bell-shaped distribution would indicate that the sample means are clustered around a central value. - Estimate Reliability: The spread of the simulated means provides insight into the reliability of the estimated mean weight. A narrow distribution suggests more confidence in the mean, while a wider distribution indicates greater uncertainty.

Conclusion and Future Implications of the Investigation

The investigation of the obesity dataset through random subsampling and analysis has several implications for drawing conclusions about the data in the future. Here are some key considerations:

Understanding Variability - Implication: The differences observed among the five subsamples highlight the inherent variability within the dataset. Each sample may yield different insights, suggesting that a single analysis might not capture the complete picture. - Future Consideration: When interpreting results, it’s essential to consider multiple subsamples and the potential variability they represent. Conclusions drawn from one sample may not generalize well to the entire population.

Anomaly Recognition - Implication: Identifying anomalies in one subsample that may not appear in others emphasizes the importance of context when labeling outliers. A weight that is considered unusual in one sample may be typical in another due to differing sample characteristics. - Future Consideration: Establishing a standardized approach to identify and assess anomalies is crucial. This includes understanding the distribution and context of the data to avoid misclassifying valid observations as outliers.

Consistency Across Subsamples - Implication: The consistent patterns observed across subsamples, such as gender distribution, reinforce certain trends within the data set. These trends may indicate broader population characteristics. - Future Consideration: Consistent findings should be further investigated and validated. Future studies can focus on exploring the underlying factors contributing to these consistent patterns and assess their significance in the context of obesity.

Monte Carlo Simulations and Robustness - Implication: The use of Monte Carlo simulations to estimate mean weight demonstrates the importance of robustness in statistical analysis. The distribution of simulated means provides insights into the uncertainty and reliability of estimates. - Future Consideration: Incorporating simulations in future analyses can enhance understanding of variability and help in making more informed decisions regarding sample size and data collection methods. It encourages the exploration of how different sampling strategies affect outcomes.

Further Questions for Investigation

1 How do lifestyle factors (such as dietary habits, physical activity levels, and smoking status) correlate with obesity levels across different demographics? 2 Is there a measurable impact of family history on the effectiveness of different dietary or physical activity interventions? 3 Is there a link between stress levels and obesity that can be explored further?

Data Dive — Sampling and Drawing Conclusions

Saisree mucharla

2024-09-24

Random sampling with replacement

Insights and summary

Continuous Variables comparison

insights

categorical variables comparison

Insights

Scrutinize the Subsamples

Insights

How Different Are They?

Insights

Visualizing Differences

Anomalies in Subsamples

Insights

Consistency Among Subsamples

Monte Carlo Simulations

Conclusion and Future Implications of the Investigation

Further Questions for Investigation