library(ggplot2)
library(dplyr)
library(boot)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

Build at least two pairs of numeric variables

For each pair of variables, include at least one column that you created (i.e., calculated based on others)

  1. BMI - Body Mass Index (response variable, calculated from weight and height) and FAF - Physical Activity Frequency (explanatory variable)
df$BMI <- df$Weight / df$Height^2  # new column: BMI (Body Mass Index) = weight / height^2
cor(df$BMI, df$FAF)
## [1] -0.1775373
  2. BMI - Body Mass Index (response variable, calculated from weight and height) and TUE - Time on Electronics (explanatory variable)
cor(df$BMI, df$TUE)
## [1] -0.09972039
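As a quick sanity check on the derived column, the BMI formula (weight divided by height squared) can be worked through on a tiny hand-made table. The data frame `toy` and its values are hypothetical and are not taken from the obesity dataset; the sketch assumes weight in kilograms and height in metres.

```r
# Hypothetical toy data (assumed units: kilograms and metres)
toy <- data.frame(
  Weight = c(64, 80, 100),
  Height = c(1.60, 2.00, 1.60)
)

# Same formula as the BMI column above: weight / height^2
toy$BMI <- toy$Weight / toy$Height^2

print(toy)
```

If the resulting values fall in a plausible BMI range, the units are most likely what the formula expects.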

Plot a visualization for each relationship, and draw some conclusions based on the plot

Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)

# Scatter plot of BMI against FAF
ggplot(df, aes(x = BMI, y = FAF, color = NObeyesdad)) + geom_point() + theme_minimal()

Outliers: The points in the top left of the plot can be considered outliers. These individuals have the lowest BMI values, yet they report a physical activity frequency close to 3.
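The outliers above were picked out by eye. One common way to back up a visual judgement is the 1.5 x IQR rule, sketched below on made-up values. The helper `flag_outliers` and the vector `toy_bmi` are hypothetical names, and the rule itself is an assumption, not something the analysis above used. It also looks at one variable at a time, so it would not by itself catch an unusual combination such as low BMI together with high FAF.

```r
# Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up BMI-like values with one obviously extreme point
toy_bmi <- c(22, 24, 25, 26, 27, 28, 30, 55)

# Logical vector: TRUE marks a flagged value
flag_outliers(toy_bmi)
```

On the real data, the same helper could be applied to `df$BMI`, assuming that column has been created as above.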

  1. BMI vs TUE
ggplot(df, aes(x = BMI, y = TUE, colour = NObeyesdad)) +
  geom_point(alpha = 0.4) +
  theme_minimal() 

Calculate the appropriate correlation coefficient for each of these combinations

cat('Correlation between BMI and FAF',cor(df$BMI, df$FAF),'\n')
## Correlation between BMI and FAF -0.1775373
cat("Correlation between BMI and TUE",cor(df$BMI, df$TUE),"\n")
## Correlation between BMI and TUE -0.09972039
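`cor()` uses the Pearson coefficient by default. Since FAF and TUE sit on short bounded scales, a rank-based coefficient such as Spearman's may be worth comparing, and `cor.test()` adds a confidence interval for the Pearson estimate. The sketch below uses simulated data, so the vectors `x` and `y` and the seed are assumptions and the numbers will not match the results above.

```r
# Simulated data with a weak negative linear relationship (hypothetical)
set.seed(1)
x <- rnorm(200)
y <- -0.2 * x + rnorm(200)

# Pearson measures linear association; Spearman measures monotonic association
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# cor.test() reports the Pearson estimate with a 95% confidence interval
test <- cor.test(x, y)

cat("Pearson: ", pearson, "\n")
cat("Spearman:", spearman, "\n")
cat("95% CI for Pearson:", test$conf.int[1], "to", test$conf.int[2], "\n")
```

If the two coefficients are close, the choice between them matters little for the conclusion.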

Explain why the value makes sense (or doesn’t) based on the visualization(s)

  1. BMI vs FAF: Individuals with obesity level III appear less likely to engage in regular physical activity. The remaining categories are distributed across the whole 0 to 3 range, with more of them below 2. A weak negative correlation of -0.1775373 is therefore consistent with the plot.

  2. BMI vs TUE: Individuals with obesity level III appear less likely to spend time on electronics such as TVs. The remaining categories are distributed across 0 to 2 with no clear pattern. A very weak negative correlation of -0.09972039 is therefore consistent with the plot.

  3. The two plots look very similar. This might be because 77% of the dataset is synthesized.

Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

  1. Confidence interval for the mean BMI (BMI is the response variable in both variable pairs).
mean_bmi <- function(data, indices) {
  return(mean(data[indices]))
}

# Perform bootstrapping (1000 resamples in this case)
bootstrap_results <- boot(data = df$BMI, statistic = mean_bmi, R = 1000)

# Calculating the 95% confidence interval using the percentile method
ci <- boot.ci(bootstrap_results, type = "perc")


cat("Bootstrap Confidence Interval for BMI (95%):\n")
## Bootstrap Confidence Interval for BMI (95%):
cat("Lower Bound:", ci$perc[4], "\n")
## Lower Bound: 29.38044
cat("Upper Bound:", ci$perc[5], "\n")
## Upper Bound: 30.04013
bootstrap_means <- bootstrap_results$t

ggplot(data.frame(bootstrap_means), aes(x = bootstrap_means)) +
  geom_density(fill = "lightblue", alpha = 0.5) + 
  geom_vline(aes(xintercept = mean(bootstrap_means)), color = "red", linetype = "dashed", linewidth = 1) +  # Mean line
  geom_vline(aes(xintercept = ci$perc[4]), color = "yellow", linetype = "dashed", linewidth = 1) +  # Lower bound of CI
  geom_vline(aes(xintercept = ci$perc[5]), color = "yellow", linetype = "dashed", linewidth = 1) +  # Upper bound of CI
  labs(title = "Bootstrap Distribution of BMI Means with 95% Confidence Interval",
       x = "Mean BMI",
       y = "Density") +
  theme_minimal()
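To make the percentile method less of a black box, the same idea can be sketched in base R without the boot package: resample with replacement, take the mean of each resample, and read off the 2.5% and 97.5% quantiles. The vector `toy_bmi`, its parameters, and the seed are hypothetical. `boot.ci()` interpolates its percentiles slightly differently from `quantile()`, so the two approaches would agree only approximately.

```r
# Simulated BMI-like sample (hypothetical, not the real dataset)
set.seed(42)
toy_bmi <- rnorm(500, mean = 30, sd = 8)

# Draw 1000 resamples with replacement and store the mean of each
boot_means <- replicate(1000, mean(sample(toy_bmi, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci_manual <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)

cat("Sample mean:", mean(toy_bmi), "\n")
cat("95% percentile interval:", ci_manual[1], "to", ci_manual[2], "\n")
```

Setting a seed as in this sketch makes the bounds reproducible from one run to the next.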

Conclusion

mean(df$BMI)
## [1] 29.70016
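  1. The 95% bootstrap confidence interval for the mean BMI is [29.38044, 30.04013], based on 1000 bootstrap replicates. Because no random seed is set, the bounds change slightly from run to run.

  2. The sample mean BMI (29.70016) lies inside this interval, and we can be 95% confident that the true mean BMI of the population lies within it as well.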
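As a cross-check on a bootstrap interval, the classical t-based interval for a mean can be computed with `t.test()` or by hand. With a large sample the two kinds of interval are usually close. The sketch below uses simulated data, so `toy_bmi` and the seed are assumptions and the output will not reproduce the interval reported above.

```r
# Simulated BMI-like sample (hypothetical, not the real dataset)
set.seed(42)
toy_bmi <- rnorm(500, mean = 30, sd = 8)

# t-based 95% confidence interval for the mean via t.test()
t_ci <- t.test(toy_bmi, conf.level = 0.95)$conf.int

# The same interval by hand: mean +/- t quantile * standard error
n <- length(toy_bmi)
se <- sd(toy_bmi) / sqrt(n)
manual_ci <- mean(toy_bmi) + c(-1, 1) * qt(0.975, df = n - 1) * se

cat("t.test interval:", t_ci[1], "to", t_ci[2], "\n")
cat("Manual interval:", manual_ci[1], "to", manual_ci[2], "\n")
```

On the real data, `t.test(df$BMI)$conf.int` would give the corresponding interval, assuming the BMI column exists as created earlier.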