# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
# Load dataset
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\obesity.csv")
# View the first few rows of the dataset
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
# Summary of the dataset
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(obesity)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
# Define sample size (50% of the total data)
sample_size <- floor(0.5 * nrow(obesity))
# Create 5 random subsamples with replacement
df_1 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_2 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_3 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_4 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
df_5 <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
# Check the size of each sample
list(Sample_1 = nrow(df_1), Sample_2 = nrow(df_2), Sample_3 = nrow(df_3), Sample_4 = nrow(df_4), Sample_5 = nrow(df_5))
## $Sample_1
## [1] 1055
##
## $Sample_2
## [1] 1055
##
## $Sample_3
## [1] 1055
##
## $Sample_4
## [1] 1055
##
## $Sample_5
## [1] 1055
Significance: Sampling Variability: This reflects how real-world data
collection often involves random sampling, and different subsets of the
same population can yield slightly different results. Monte Carlo
Simulation: These subsamples allow you to conduct further analysis to
assess the stability and reliability of statistics like the mean,
variance, or other measures across different samples. The provided code
generates five random subsamples (df_1
to
df_5
), each containing 50% of the total rows from the
obesity_data
dataset, using sampling with replacement.
Let’s break down the insights and their significance:
nrow()
. Since each subsample was
intended to represent 50% of the dataset, the size of each subsample
should be approximately half of the original dataset. This ensures that
the sampling process has been correctly executed.Significance: Consistency of Sample Size: The consistency in sample size confirms that the method is correctly implemented and ensures each sample can be meaningfully compared to one another. Even though the samples are drawn randomly, they all maintain the same size, ensuring fair comparison.
Significance: Assessing Randomness: By generating multiple subsamples, you can explore how random variation affects your results. This is particularly important when making inferences about population characteristics or when building models. For example, if one subsample shows a significant deviation in BMI values, this might be due to random sampling, not a true population trend. - Impact on Conclusions: It highlights how conclusions drawn from a single sample might be affected by the specific composition of that sample. By examining multiple random samples, you get a better understanding of how robust your conclusions are against random fluctuations in the data.
Significance: Variation Across Subsamples: Differences between subsamples (e.g., different averages or distributions) can reveal insights into how much variation exists within the dataset. This might highlight trends that hold consistently or identify potential anomalies that are present in one subsample but not others. Understanding Data Stability: Exploring how different subsamples reflect the population helps to assess the stability of the dataset. This is crucial when building predictive models because stable patterns across samples suggest more reliable models.
# Plot the distribution of BMI across the five subsamples
ggplot() +
geom_density(aes(x = df_1$Age, color = "Sample 1"), size = 1) +
geom_density(aes(x = df_2$Age, color = "Sample 2"), size = 1) +
geom_density(aes(x = df_3$Age, color = "Sample 3"), size = 1) +
geom_density(aes(x = df_4$Age, color = "Sample 4"), size = 1) +
geom_density(aes(x = df_5$Age, color = "Sample 5"), size = 1) +
labs(title = "age Distribution Across 5 Subsamples", x = "Age", y = "density") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(knitr)
library(dplyr)
# Combine the gender tables for all samples
gender_distribution <- data.frame(
Sample1 = table(df_1$Gender),
Sample2 = table(df_2$Gender),
Sample3 = table(df_3$Gender),
Sample4 = table(df_4$Gender),
Sample5 = table(df_5$Gender)
)
# Display the table
kable(gender_distribution, caption = "Gender Distribution Across Samples")
Sample1.Var1 | Sample1.Freq | Sample2.Var1 | Sample2.Freq | Sample3.Var1 | Sample3.Freq | Sample4.Var1 | Sample4.Freq | Sample5.Var1 | Sample5.Freq |
---|---|---|---|---|---|---|---|---|---|
Female | 514 | Female | 541 | Female | 505 | Female | 510 | Female | 519 |
Male | 541 | Male | 514 | Male | 550 | Male | 545 | Male | 536 |
# Function to summarize data
summarize_data <- function(df) {
df %>%
group_by(Gender) %>%
summarise(
mean_Age = mean(Age, na.rm = TRUE),
mean_Weight = mean(Weight, na.rm = TRUE),
count = n()
)
}
summary_1 <- summarize_data(df_1)
summary_2 <- summarize_data(df_2)
summary_3 <- summarize_data(df_3)
summary_4 <- summarize_data(df_4)
summary_5 <- summarize_data(df_5)
# Display summaries
summary_1
## # A tibble: 2 × 4
## Gender mean_Age mean_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 24.1 83.4 514
## 2 Male 24.6 90.8 541
summary_2
## # A tibble: 2 × 4
## Gender mean_Age mean_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 24.0 83.1 541
## 2 Male 24.7 91.3 514
summary_3
## # A tibble: 2 × 4
## Gender mean_Age mean_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 23.7 82.8 505
## 2 Male 24.9 92.9 550
summary_4
## # A tibble: 2 × 4
## Gender mean_Age mean_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 23.8 82.2 510
## 2 Male 24.9 91.2 545
summary_5
## # A tibble: 2 × 4
## Gender mean_Age mean_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 23.9 84.2 519
## 2 Male 24.6 90.3 536
# Combine summaries into one data frame for comparison
summary_combined <- bind_rows(
mutate(summary_1, Sample = "Sample 1"),
mutate(summary_2, Sample = "Sample 2"),
mutate(summary_3, Sample = "Sample 3"),
mutate(summary_4, Sample = "Sample 4"),
mutate(summary_5, Sample = "Sample 5")
)
# View combined summary
summary_combined
## # A tibble: 10 × 5
## Gender mean_Age mean_Weight count Sample
## <chr> <dbl> <dbl> <int> <chr>
## 1 Female 24.1 83.4 514 Sample 1
## 2 Male 24.6 90.8 541 Sample 1
## 3 Female 24.0 83.1 541 Sample 2
## 4 Male 24.7 91.3 514 Sample 2
## 5 Female 23.7 82.8 505 Sample 3
## 6 Male 24.9 92.9 550 Sample 3
## 7 Female 23.8 82.2 510 Sample 4
## 8 Male 24.9 91.2 545 Sample 4
## 9 Female 23.9 84.2 519 Sample 5
## 10 Male 24.6 90.3 536 Sample 5
# Load ggplot2 for visualization
library(ggplot2)
# Plot mean weight by sample
ggplot(summary_combined, aes(x = Sample, y = mean_Weight, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Mean Weight by Sample and Gender", y = "Mean Weight (kg)", x = "Sample")
### Insights - Visual Comparison: The bar plot allows for a clear visual
comparison of mean weight across different samples for each gender,
facilitating quick insights into any differences. - Weight Trends: You
can identify trends such as whether one gender consistently weighs more
across samples or if there are significant variations in mean weight
between genders in specific samples. - Sample Differences: Any notable
differences in mean weight between samples (e.g., Sample 1 vs. Sample 5)
can indicate varying demographic or lifestyle factors that might
influence weight. - Gender Imbalances: If one gender has higher mean
weights across most samples, it may suggest underlying health or social
factors that should be explored further.
Identify Anomalies
library(dplyr)
# Function to identify outliers based on Weight
identify_anomalies <- function(df) {
mean_weight <- mean(df$Weight, na.rm = TRUE)
sd_weight <- sd(df$Weight, na.rm = TRUE)
df %>%
filter(Weight > (mean_weight + 3 * sd_weight) |
Weight < (mean_weight - 3 * sd_weight))
}
# List of data frames
dfs <- list(df_1, df_2, df_3, df_4, df_5)
# Initialize a list to store anomalies
anomalies_list <- list()
# Loop through each data frame to identify anomalies
for (i in 1:length(dfs)) {
anomalies_list[[i]] <- identify_anomalies(dfs[[i]]) %>%
mutate(Sample = paste("Sample", i)) # Add sample identifier
}
# Combine all anomalies into one data frame for easier viewing
anomalies_combined <- bind_rows(anomalies_list)
# Display combined anomalies
anomalies_combined
## [1] Gender Age
## [3] Height Weight
## [5] family_history_with_overweight FAVC
## [7] FCVC NCP
## [9] CAEC SMOKE
## [11] CH2O SCC
## [13] FAF TUE
## [15] CALC MTRANS
## [17] NObeyesdad Sample
## <0 rows> (or 0-length row.names)
# Check distribution of Gender across all samples
gender_distribution <- bind_rows(
df_1 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 1"),
df_2 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 2"),
df_3 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 3"),
df_4 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 4"),
df_5 %>% group_by(Gender) %>% summarise(count = n(), Sample = "Sample 5")
)
# Visualize the distribution of Gender
ggplot(gender_distribution, aes(x = Sample, y = count, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Gender Distribution Across Samples", y = "Count", x = "Sample")
### Insights - Gender Distribution: The code aggregates the count of
each gender across the five samples, providing a clear view of the
gender composition in each sample. - Visual Comparison: The bar plot
visualizes the gender distribution, allowing for immediate comparison of
how many males and females are present in each sample. This can
highlight gender imbalances. - Sample Variation: By observing the
heights of the bars, you can identify any significant variations in
gender representation across samples. For example, one sample might have
predominantly one gender, which could indicate a sampling bias. - Data
Quality Assessment: Analyzing the gender distribution can help assess
the quality and diversity of the samples, informing decisions on data
validity and potential adjustments needed for balanced
representation.
# Monte Carlo simulation to estimate mean Weight across 1000 samples
set.seed(123)
weight_simulation <- replicate(1000, {
sample_data <- obesity[sample(nrow(obesity), sample_size, replace = TRUE), ]
mean(sample_data$Weight, na.rm = TRUE)
})
# Plot the distribution of simulated weights
hist(weight_simulation, breaks = 30, main = "Monte Carlo Simulation of Mean Weight", xlab = "Mean Weight (kg)")
### Insights - Simulation Purpose: The Monte Carlo simulation estimates
the mean weight by drawing 1,000 random samples from the obesity_data.
This method provides a robust way to assess the variability and
distribution of the mean weight estimate. - Random Sampling: By sampling
with replacement, the simulation captures the potential variability in
mean weight that could arise from different random selections, which
helps to understand how representative the mean is. - Distribution
Visualization: The histogram displays the distribution of the simulated
mean weights, allowing you to visualize the range and frequency of mean
weight estimates. A normal or bell-shaped distribution would indicate
that the sample means are clustered around a central value. - Estimate
Reliability: The spread of the simulated means provides insight into the
reliability of the estimated mean weight. A narrow distribution suggests
more confidence in the mean, while a wider distribution indicates
greater uncertainty.
The investigation of the obesity dataset through random subsampling and analysis has several implications for drawing conclusions about the data in the future. Here are some key considerations:
Understanding Variability - Implication: The differences observed among the five subsamples highlight the inherent variability within the dataset. Each sample may yield different insights, suggesting that a single analysis might not capture the complete picture. - Future Consideration: When interpreting results, it’s essential to consider multiple subsamples and the potential variability they represent. Conclusions drawn from one sample may not generalize well to the entire population.
Anomaly Recognition - Implication: Identifying anomalies in one subsample that may not appear in others emphasizes the importance of context when labeling outliers. A weight that is considered unusual in one sample may be typical in another due to differing sample characteristics. - Future Consideration: Establishing a standardized approach to identify and assess anomalies is crucial. This includes understanding the distribution and context of the data to avoid misclassifying valid observations as outliers.
Consistency Across Subsamples - Implication: The consistent patterns observed across subsamples, such as gender distribution, reinforce certain trends within the data set. These trends may indicate broader population characteristics. - Future Consideration: Consistent findings should be further investigated and validated. Future studies can focus on exploring the underlying factors contributing to these consistent patterns and assess their significance in the context of obesity.
Monte Carlo Simulations and Robustness - Implication: The use of Monte Carlo simulations to estimate mean weight demonstrates the importance of robustness in statistical analysis. The distribution of simulated means provides insights into the uncertainty and reliability of estimates. - Future Consideration: Incorporating simulations in future analyses can enhance understanding of variability and help in making more informed decisions regarding sample size and data collection methods. It encourages the exploration of how different sampling strategies affect outcomes.
1 How do lifestyle factors (such as dietary habits, physical activity levels, and smoking status) correlate with obesity levels across different demographics? 2 Is there a measurable impact of family history on the effectiveness of different dietary or physical activity interventions? 3 Is there a link between stress levels and obesity that can be explored further?