BMI Males and Females

Is there a significant difference in BMI between males and females?

Amirreza Saeidinik 3932838, Golnaz Akbari 3852157

Last updated: 28/05/2023

Problem Statement

Question: What is the distribution of Body Mass Index (BMI) in a randomly sampled dataset, and is there a significant difference in BMI between males and females?

I randomly sample 100 rows from the original dataframe, df, using the sample function, and store them in the sample_df dataframe.

I calculate the BMI (Body Mass Index) for each row of sample_df using the formula Weight / (Height / 100)^2, where Weight is in kilograms and Height is in centimetres, so that Height / 100 is the height in metres. The BMI values are added as a new column in sample_df.

I create two new dataframes, male_data and female_data, by splitting sample_df on the Gender column. male_data contains the rows where Gender is “Male”, and female_data contains the rows where Gender is “Female”.

I create a boxplot of the BMI values from the sample_df dataframe.

The boxplot allows us to visualize the distribution of BMI values in the sampled dataset. I use the aggregate function to calculate summary statistics for BMI based on the Gender column in the sample_df dataframe.

The calculated statistics include the minimum, first quartile (Q1), median, third quartile (Q3), maximum, mean, standard deviation (SD), number of observations (n), and count of missing values for each gender.

I perform a two-sample t-test using the t.test function. The t-test compares the BMI values of males (male_data$BMI) with those of females (female_data$BMI). Because t.test does not assume equal variances by default, this is Welch's two-sample t-test.

The result of the t-test is stored in the t_test_result variable. I calculate the margin of error of a 95% confidence interval for the mean BMI of the sample_df dataframe using the qnorm function. The interval itself is the sample mean plus or minus this margin.

The confidence interval is based on the normal distribution and is calculated using the standard deviation and sample size of the BMI values.
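Written out, and assuming the usual large-sample normal approximation that the qnorm call implies, the interval described above has the form below. Here \bar{x} is the sample mean BMI, s is the sample standard deviation, n is the sample size, and z_{0.975} is the 97.5th percentile of the standard normal distribution (about 1.96).

$$
\bar{x} \pm z_{0.975}\,\frac{s}{\sqrt{n}}
$$

With n = 100 the square root in the denominator is 10, so the margin of error is roughly 0.196 times the sample standard deviation.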

Data

The data come from the BMI Dataset published on Kaggle (M Yasser H, 2022). The data frame contains the following columns:

Gender: Male / Female

Height: number (cm)

Weight: number (kg)

Index: 0 - Extremely Weak, 1 - Weak, 2 - Normal, 3 - Overweight, 4 - Obesity, 5 - Extreme Obesity
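To make the column layout concrete, here is a minimal sketch on a few made-up rows (they are not taken from the dataset). It computes BMI from Height and Weight and attaches the category names to the numeric Index codes. The names toy, index_labels, and IndexLabel are hypothetical and do not appear in the analysis itself.

```r
# Made-up rows for illustration only; they are NOT taken from the dataset.
toy <- data.frame(
  Gender = c("Male", "Female", "Female"),
  Height = c(200, 160, 180),   # cm
  Weight = c(80, 64, 81),      # kg
  Index  = c(2, 3, 3)          # arbitrary codes chosen for the example
)

# BMI = weight in kg divided by the square of height in metres
toy$BMI <- toy$Weight / (toy$Height / 100)^2

# Attach the category names listed above to the numeric Index codes
index_labels <- c("Extremely Weak", "Weak", "Normal",
                  "Overweight", "Obesity", "Extreme Obesity")
toy$IndexLabel <- factor(toy$Index, levels = 0:5, labels = index_labels)

print(toy)
```

Storing the codes as a factor keeps all six categories as levels even when some of them do not occur in a given sample.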

Descriptive Statistics and Visualisation

# Read the dataset from a local CSV file (adjust the path to your own copy)
df <- read.csv("/Users/amirrezasaeidi/Downloads/500_Person_Gender_Height_Weight_Index.csv")

set.seed(42)

sample_size <- 100
sample_df <- df[sample(nrow(df), sample_size), ]

sample_df$BMI <- sample_df$Weight / ((sample_df$Height / 100)^2)

# Create two new dataframes, male_data and female_data, by subsetting
# sample_df based on the Gender column
male_data <- sample_df[sample_df$Gender == "Male", ]
female_data <- sample_df[sample_df$Gender == "Female", ]
# Create a boxplot of the BMI values from the sample_df dataframe
boxplot(sample_df$BMI, main = "Boxplot of BMI", ylab = "BMI")

# Calculate summary statistics for BMI based on the Gender column
summary_stats <- aggregate(BMI ~ Gender, data = sample_df, FUN = function(x) {
  c(min = min(x), Q1 = quantile(x, 0.25), median = median(x), 
    Q3 = quantile(x, 0.75), max = max(x), mean = mean(x),
    SD = sd(x), n = length(x), missing_values = sum(is.na(x)))
})

# Perform a two-sample t-test using the t.test function
t_test_result <- t.test(male_data$BMI, female_data$BMI)

# Calculate the margin of error (half-width) of the 95% confidence interval
# for the mean BMI of sample_df; the interval is mean_bmi +/- this value
confidence_interval <- qnorm(0.975) * sd(sample_df$BMI) / sqrt(length(sample_df$BMI))
mean_bmi <- mean(sample_df$BMI)
# Print the results
print(summary_stats)
##   Gender  BMI.min BMI.Q1.25% BMI.median BMI.Q3.75%  BMI.max BMI.mean   BMI.SD
## 1 Female 13.85042   24.40718   32.24790   42.86241 78.85340 34.31082 13.08805
## 2   Male 13.85042   29.64108   36.21465   45.02039 70.76333 39.14627 13.42628
##      BMI.n BMI.missing_values
## 1 56.00000            0.00000
## 2 44.00000            0.00000
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  male_data$BMI and female_data$BMI
## t = 1.8076, df = 91.362, p-value = 0.07396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.477898 10.148788
## sample estimates:
## mean of x mean of y 
##  39.14627  34.31082
print(confidence_interval)
## [1] 2.624313
print(mean_bmi)
## [1] 36.43842
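
Note that the value printed as confidence_interval is the half-width of the interval, not the interval itself. The following is a minimal sketch of how the two end points could be obtained, using the mean and margin copied from the output above. The names margin_z, ci_z, sd_bmi, margin_t, and ci_t are hypothetical. The sketch also shows a t-based margin, an alternative that is often preferred when the standard deviation is estimated from the sample. The sample SD is recovered from the printed margin here, so it is subject to rounding.

```r
mean_bmi <- 36.43842   # printed above
margin_z <- 2.624313   # printed above as confidence_interval
n <- 100               # sample size

# Normal-based 95% interval: mean plus or minus the margin of error
ci_z <- c(lower = mean_bmi - margin_z, upper = mean_bmi + margin_z)

# Recover the sample SD implied by the printed margin
sd_bmi <- margin_z * sqrt(n) / qnorm(0.975)

# t-based alternative with n - 1 degrees of freedom
margin_t <- qt(0.975, df = n - 1) * sd_bmi / sqrt(n)
ci_t <- c(lower = mean_bmi - margin_t, upper = mean_bmi + margin_t)

print(round(ci_z, 2))
print(round(ci_t, 2))
```

The normal-based interval runs from about 33.81 to 39.06; the t-based interval is slightly wider.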

Discussion

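The goal of this investigation was to compare Body Mass Index (BMI) between males and females in a random sample of 100 people from the dataset. In the sample, males had a higher mean BMI (39.15, n = 44) than females (34.31, n = 56), a difference of about 4.84. However, Welch's two-sample t-test gives t = 1.8076 on 91.362 degrees of freedom with a p-value of 0.07396, which is above the conventional 0.05 significance level, so the null hypothesis of equal mean BMI cannot be rejected. The 95% confidence interval for the difference in means (-0.477898 to 10.148788) includes zero, which is consistent with this result, and its width shows that the estimate is quite uncertain. A difference in mean BMI between males and females may therefore exist, but this sample does not provide statistically significant evidence of one.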
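
As a consistency check, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be recomputed from the per-gender means, standard deviations, and counts in the summary table. The sketch below copies those rounded values from the printed output, so agreement is only up to rounding. All variable names in it are hypothetical.

```r
# Summary statistics copied from the printed summary table
m_male   <- 39.14627; s_male   <- 13.42628; n_male   <- 44
m_female <- 34.31082; s_female <- 13.08805; n_female <- 56

# Variance of each group mean
v_male   <- s_male^2 / n_male
v_female <- s_female^2 / n_female

# Welch t statistic: difference in means over its standard error
diff_means <- m_male - m_female
se_diff    <- sqrt(v_male + v_female)
t_stat     <- diff_means / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_male + v_female)^2 /
  (v_male^2 / (n_male - 1) + v_female^2 / (n_female - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci_diff <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(round(c(t = t_stat, df = df_welch, p = p_value), 4))
print(round(ci_diff, 3))
```

The results should agree with the t.test output (t = 1.8076, df = 91.362, p-value = 0.07396) up to rounding, and the recomputed interval again contains zero.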

References

M Yasser H (2022), BMI Dataset, Kaggle, accessed 28/05/2023, https://www.kaggle.com/datasets/yasserh/bmidataset