Amirreza Saeidinik 3932838, Golnaz Akbari 3852157
Last updated: 28/05/2023
Question: What is the distribution of Body Mass Index (BMI) in a randomly sampled dataset, and is there a significant difference in BMI between males and females?
We randomly sample 100 rows from the original dataframe, df, using the sample function and store them in sample_df. For each sampled row we calculate the Body Mass Index (BMI) as Weight / (Height / 100)^2, with Weight in kilograms and Height in centimetres, and add the result to sample_df as a new BMI column. We then split sample_df by the Gender column into male_data (rows where Gender is "Male") and female_data (rows where Gender is "Female").
We create a boxplot of the BMI values in sample_df to visualize their distribution in the sampled dataset. We use the aggregate function to calculate summary statistics for BMI by Gender: the minimum, first quartile (Q1), median, third quartile (Q3), maximum, mean, standard deviation (SD), number of observations (n), and count of missing values for each gender.
We perform a two-sample t-test with the t.test function, comparing the BMI values of males (male_data$BMI) and females (female_data$BMI), and store the result in the t_test_result variable. Because t.test does not assume equal variances by default, this is a Welch two-sample t-test.
Finally, we calculate a 95% confidence interval for the mean BMI of sample_df using the qnorm function. The interval is based on the normal distribution: its half-width is the 0.975 normal quantile multiplied by the standard error, that is, the standard deviation of the BMI values divided by the square root of the sample size.
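The normal-based interval described above can be sketched as follows. The BMI values here are hypothetical and chosen only for illustration; they are not taken from the dataset. The t-based variant is shown for comparison as an alternative that is often preferred when the standard deviation is estimated from the sample; it is not part of the original analysis.

```r
# Hypothetical BMI values, for illustration only (not from the dataset)
bmi <- c(22.5, 31.0, 27.4, 35.8, 19.9, 41.2, 29.3, 24.6)

n <- length(bmi)
mean_bmi <- mean(bmi)
standard_error <- sd(bmi) / sqrt(n)

# Normal-based 95% interval: mean +/- z * standard error
margin_normal <- qnorm(0.975) * standard_error
ci_normal <- c(lower = mean_bmi - margin_normal, upper = mean_bmi + margin_normal)

# t-based 95% interval (alternative, slightly wider for small samples)
margin_t <- qt(0.975, df = n - 1) * standard_error
ci_t <- c(lower = mean_bmi - margin_t, upper = mean_bmi + margin_t)

print(ci_normal)
print(ci_t)
```

With 100 observations, as in this report, the two intervals would be nearly identical; the gap only matters for small samples.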
This data frame contains the following columns:
Gender : Male / Female
Height : Number (cm)
Weight : Number (Kg)
Index : 0 - Extremely Weak 1 - Weak 2 - Normal 3 - Overweight 4 - Obesity 5 - Extreme Obesity
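Because Height is recorded in centimetres and Weight in kilograms, the height has to be converted to metres before BMI is computed. A minimal sketch of that conversion is given below; the helper name bmi_from_cm_kg and the example person (70 kg, 175 cm) are hypothetical and used only for illustration.

```r
# Hypothetical helper: BMI from weight in kg and height in cm
bmi_from_cm_kg <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # convert centimetres to metres
  weight_kg / height_m^2        # BMI = kg / m^2
}

# Illustrative person: 70 kg, 175 cm
example_bmi <- bmi_from_cm_kg(70, 175)
print(round(example_bmi, 2))    # about 22.86

# The function is vectorised, so it also works on whole columns
print(bmi_from_cm_kg(c(70, 90), c(175, 160)))
```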
# Assuming the dataset is stored in a CSV file
df <- read.csv("/Users/amirrezasaeidi/Downloads/500_Person_Gender_Height_Weight_Index.csv")
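# Fix the random seed so the same sample is drawn on every run
set.seed(42)
# Randomly sample 100 rows (without replacement) from df
sample_size <- 100
sample_df <- df[sample(nrow(df), sample_size), ]
# BMI = weight (kg) / height (m)^2; Height is stored in cm, so divide by 100
sample_df$BMI <- sample_df$Weight / ((sample_df$Height / 100)^2)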
# Create two new dataframes, male_data and female_data, by subsetting
# sample_df based on the Gender column
male_data <- sample_df[sample_df$Gender == "Male", ]
female_data <- sample_df[sample_df$Gender == "Female", ]
# Create a boxplot of the BMI values from the sample_df dataframe
boxplot(sample_df$BMI, main = "Boxplot of BMI", ylab = "BMI")
# Calculate summary statistics for BMI based on the Gender column
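# Note: the formula interface of aggregate() drops rows with NA values
# (na.action = na.omit) before FUN is applied, so missing_values is always 0 here
summary_stats <- aggregate(BMI ~ Gender, data = sample_df, FUN = function(x) {
  c(min = min(x), Q1 = quantile(x, 0.25), median = median(x),
    Q3 = quantile(x, 0.75), max = max(x), mean = mean(x),
    SD = sd(x), n = length(x), missing_values = sum(is.na(x)))
})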
# Perform a two-sample t-test using the t.test function
t_test_result <- t.test(male_data$BMI, female_data$BMI)
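# Calculate the 95% confidence interval for the mean BMI of the sample_df dataframe
mean_bmi <- mean(sample_df$BMI)
# Half-width of the interval: normal quantile times the standard error of the mean
margin_of_error <- qnorm(0.975) * sd(sample_df$BMI) / sqrt(length(sample_df$BMI))
confidence_interval <- c(lower = mean_bmi - margin_of_error, upper = mean_bmi + margin_of_error)
# Print the summary statistics and the t-test result
print(summary_stats)
print(t_test_result)
##   Gender  BMI.min BMI.Q1.25% BMI.median BMI.Q3.75%  BMI.max BMI.mean   BMI.SD
## 1 Female 13.85042   24.40718   32.24790   42.86241 78.85340 34.31082 13.08805
## 2   Male 13.85042   29.64108   36.21465   45.02039 70.76333 39.14627 13.42628
##      BMI.n BMI.missing_values
## 1 56.00000            0.00000
## 2 44.00000            0.00000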
##
## Welch Two Sample t-test
##
## data: male_data$BMI and female_data$BMI
## t = 1.8076, df = 91.362, p-value = 0.07396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.477898 10.148788
## sample estimates:
## mean of x mean of y
## 39.14627 34.31082
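As a cross-check, the Welch statistics can be recomputed by hand from the group means, standard deviations, and sizes reported in the summary output. This is only a sketch of the standard Welch formulas applied to those rounded figures, so the results agree with the t.test output up to rounding; the variable names are chosen here for illustration.

```r
# Group summaries as reported in the output (rounded values)
mean_m <- 39.14627; sd_m <- 13.42628; n_m <- 44   # males
mean_f <- 34.31082; sd_f <- 13.08805; n_f <- 56   # females

# Per-group variance of the mean
var_term_m <- sd_m^2 / n_m
var_term_f <- sd_f^2 / n_f

# Standard error of the difference in means and the t statistic
se_diff <- sqrt(var_term_m + var_term_f)
t_stat <- (mean_m - mean_f) / se_diff

# Welch-Satterthwaite degrees of freedom
welch_df <- (var_term_m + var_term_f)^2 /
  (var_term_m^2 / (n_m - 1) + var_term_f^2 / (n_f - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = welch_df)
ci_diff <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, df = welch_df) * se_diff

print(c(t = t_stat, df = welch_df, p = p_value))
print(ci_diff)
```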
Discuss the major findings of your investigation. The goal of the investigation was to describe the distribution of BMI in a random sample of 100 people and to compare the BMI of males and females. BMI in the sample ranged from 13.85 to 78.85. Males had a higher mean BMI (39.15) than females (34.31), but the Welch two-sample t-test shows that this difference is not statistically significant at the 5% level (t = 1.8076, df = 91.362, p-value = 0.07396). The 95% confidence interval for the difference in mean BMI (males minus females) runs from -0.477898 to 10.148788. Because the interval includes 0, the data are compatible with no difference as well as with a difference of about 10 BMI units, so the estimate is quite uncertain. The main finding is that although there may be a difference in BMI between males and females, this sample does not provide enough evidence to conclude that one exists.
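Discuss any strengths and limitations. A strength of the analysis is that the rows were sampled at random with a fixed seed, which reduces selection bias and makes the results reproducible, and that the comparison relies on a formal statistical test rather than on visual inspection alone. Limitations include the relatively small sample (100 rows: 56 females and 44 males) and the representativeness of the dataset, which may limit how broadly the results can be applied. Additionally, the approach does not examine outliers (the maximum BMI in the sample is 78.85) and does not handle missing values explicitly, although the summary reports none in this sample. These limitations should be taken into consideration when evaluating the findings.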
Propose directions for future investigations. Based on the findings and limitations, we suggest the following directions for further research. Increase the sample size: a larger sample would give the test more statistical power and make the results more generalizable. Handle missing values: apply suitable methods, such as imputation techniques or sensitivity analyses, to determine the effect of missing data on the outcomes.
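To give a rough sense of how much larger a future sample might need to be, the sketch below uses power.t.test from base R. It rests on strong assumptions that are not part of the original analysis: the observed difference in means is treated as the true effect, the two groups are equal in size and share a common (pooled) standard deviation, and the target is 80% power at the 5% significance level. The result should be read as an order-of-magnitude guide only.

```r
# Group summaries as reported in the output (rounded values)
mean_m <- 39.14627; sd_m <- 13.42628; n_m <- 44   # males
mean_f <- 34.31082; sd_f <- 13.08805; n_f <- 56   # females

# Assumption: the observed difference is the true effect size
observed_diff <- mean_m - mean_f

# Assumption: a common standard deviation, estimated by pooling
pooled_sd <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / (n_m + n_f - 2))

# Solve for the per-group sample size giving 80% power (assumed target)
power_result <- power.t.test(delta = observed_diff, sd = pooled_sd,
                             sig.level = 0.05, power = 0.80)
required_n_per_group <- ceiling(power_result$n)

print(required_n_per_group)
```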
The goal of the investigation was to compare the Body Mass Index (BMI) of males and females in a random sample from the provided dataset. The key finding is that the difference in mean BMI between the genders is not statistically significant (p-value = 0.07396). The 95% confidence interval for the difference in mean BMI (-0.477898 to 10.148788) includes 0 and is wide, which shows that the estimate is uncertain. A difference in BMI between males and females may therefore exist, but this sample does not provide statistically significant evidence for it.
M YASSER H (2022), BMI Dataset, Kaggle, accessed 28/05/2023, https://www.kaggle.com/datasets/yasserh/bmidataset