library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)
df$BMI <- df$Weight / df$Height^2  # new column: Body Mass Index = weight / height^2
cor(df$BMI, df$FAF)
## [1] -0.1775373
cor(df$BMI, df$TUE)
## [1] -0.09972039
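As a quick sanity check on the BMI column, the formula can be applied to a couple of made-up rows. This is only a sketch: the `toy` data frame and its values are hypothetical, and it assumes weight is recorded in kilograms and height in metres, as the BMI definition requires.

```r
# Hypothetical toy values, not rows from the obesity dataset
toy <- data.frame(Weight = c(70, 95), Height = c(1.75, 1.60))

# Same formula as the BMI column: weight (kg) divided by height (m) squared
toy$BMI <- toy$Weight / toy$Height^2

print(round(toy$BMI, 2))
# → 22.86 37.11
```

If the computed values fell far outside the usual 15 to 50 range, that would suggest the height column is in centimetres rather than metres.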
Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)
# Scatter plot of FAF against BMI
ggplot(df, aes(x = BMI, y = FAF, color = NObeyesdad)) + geom_point() + theme_minimal()
Outliers: The points in the top left of the plot can be considered outliers. These
individuals have the lowest BMI, yet they exercise at a frequency close to 3.
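The visual judgement above could be backed up with a numeric rule. One common option, which is not the method used in this notebook, is Tukey's 1.5 × IQR fences. The sketch below applies it to a simulated numeric vector `x` with two extreme values planted by hand; on the real data it could be applied to FAF within a low-BMI group, since the points identified above are unusual only in combination.

```r
set.seed(1)
# Simulated stand-in for a numeric column; the values 80 and 95 are planted by hand
x <- c(rnorm(200, mean = 30, sd = 5), 80, 95)

# Tukey's fences: anything beyond 1.5 * IQR from the quartiles is flagged
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

flagged <- x[x < lower | x > upper]
print(flagged)
```

The rule is univariate, so it complements the scatter plot rather than replacing it.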
ggplot(df, aes(x = BMI, y = TUE, colour = NObeyesdad)) +
  geom_point(alpha = 0.4) +
  theme_minimal()
cat('Correlation between BMI and FAF',cor(df$BMI, df$FAF),'\n')
## Correlation between BMI and FAF -0.1775373
cat("Correlation between BMI and TUE",cor(df$BMI, df$TUE),"\n")
## Correlation between BMI and TUE -0.09972039
BMI vs FAF: Individuals with obesity level III are clearly less likely to engage in regular physical activity. The remaining categories are distributed across the whole range from 0 to 3, with most observations below 2. This pattern is consistent with the weak negative correlation of -0.1775373.
BMI vs TUE: Individuals with obesity level III are clearly less likely to spend their time on electronics such as TVs. The remaining categories are distributed across the range from 0 to 2 with no visible pattern. This is consistent with the very weak negative correlation of -0.09972039.
Both plots look very similar. This may be because 77% of the dataset is synthetic.
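To judge whether correlations this small are distinguishable from zero, it can help to look at a confidence interval as well as the point estimate. The sketch below uses simulated stand-ins (`bmi` and `activity` are hypothetical, not columns of `df`) to show that Pearson's r is the covariance scaled by both standard deviations, and that `cor.test()` from base R reports a 95% interval around it.

```r
set.seed(42)
# Simulated stand-ins for BMI and an activity score; not the real dataset
n <- 500
bmi <- rnorm(n, mean = 30, sd = 8)
activity <- 1.5 - 0.02 * bmi + rnorm(n, sd = 0.8)

# Pearson r from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(bmi, activity) / (sd(bmi) * sd(activity))
r_builtin <- cor(bmi, activity)

# cor.test() adds a 95% confidence interval and a p-value for H0: r = 0
ct <- cor.test(bmi, activity)
print(c(manual = r_manual, builtin = r_builtin))
print(ct$conf.int)
```

On the real data the equivalent call would be `cor.test(df$BMI, df$FAF)`. With a large sample, even a weak correlation can have an interval that excludes zero.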
# Statistic passed to boot(): the mean of the resampled observations
mean_bmi <- function(data, indices) {
  mean(data[indices])
}
# Perform bootstrapping (1000 resamples in this case)
bootstrap_results <- boot(data = df$BMI, statistic = mean_bmi, R = 1000)
# Calculating the 95% confidence interval using the percentile method
ci <- boot.ci(bootstrap_results, type = "perc")
cat("Bootstrap Confidence Interval for BMI (95%):\n")
## Bootstrap Confidence Interval for BMI (95%):
cat("Lower Bound:", ci$perc[4], "\n")
## Lower Bound: 29.38044
cat("Upper Bound:", ci$perc[5], "\n")
## Upper Bound: 30.04013
bootstrap_means <- bootstrap_results$t
ggplot(data.frame(bootstrap_means), aes(x = bootstrap_means)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  geom_vline(xintercept = mean(bootstrap_means), color = "red", linetype = "dashed", linewidth = 1) + # Mean line
  geom_vline(xintercept = ci$perc[4], color = "yellow", linetype = "dashed", linewidth = 1) + # Lower bound of CI
  geom_vline(xintercept = ci$perc[5], color = "yellow", linetype = "dashed", linewidth = 1) + # Upper bound of CI
  labs(title = "Bootstrap Distribution of BMI Means with 95% Confidence Interval",
       x = "Mean BMI",
       y = "Density") +
  theme_minimal()
mean(df$BMI)
## [1] 29.70016
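The 95% bootstrap percentile confidence interval for mean BMI, based on 1000 resamples, is approximately [29.38, 30.04]. The exact bounds vary slightly from run to run because the resampling is random.
The sample mean, 29.70016, lies inside this interval, as expected. The interval estimates the range that plausibly contains the true population mean, which itself remains unknown.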
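The percentile method can also be reproduced by hand in base R, which makes the procedure explicit and shows how fixing a seed keeps the bounds stable between runs. This is a sketch on simulated data: `x` is a hypothetical stand-in for `df$BMI`, and the bounds from `quantile()` may differ slightly from `boot.ci()`, which handles interpolation in its own way.

```r
set.seed(123)  # fixing the seed makes the resamples, and so the interval, repeatable
# Simulated stand-in for df$BMI; not the real data
x <- rnorm(500, mean = 30, sd = 8)

# Resample with replacement R times and record the mean of each resample
R <- 2000
boot_means <- replicate(R, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap means
ci_manual <- unname(quantile(boot_means, c(0.025, 0.975)))
print(ci_manual)
print(mean(x))
```

Calling `set.seed()` before `boot()` in the notebook would have the same effect, so the bounds quoted in the text would match the printed output on every run.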