Data Dive 6 - Confidence Intervals

Load in data set

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## Warning: package 'lubridate' was built under R version 4.1.3

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.3.5     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

First pair of numeric variables

BMI (calculated) and obesity level (NObeyesdad)

# Add a BMI column: weight divided by height squared
bmi <- obesity |>
  mutate(BMI = Weight / (Height^2))

I chose to calculate BMI as my column I created, and compare it to the obesity level so I could see the range of BMIs that define each obesity level (ordered from insufficient weight, normal weight, overweight I, overweight II, obesity I, obesity II, obesity III).
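As a quick sanity check on the formula, here is a minimal worked example for one hypothetical person. It assumes, as the calculation above does, that Height is recorded in metres and Weight in kilograms; the values themselves are made up for illustration and do not come from the data set.

```r
# Hypothetical person (not a row from the data set): 1.75 m tall, 70 kg
height_m <- 1.75
weight_kg <- 70

# Same formula as the mutate() call: weight divided by height squared
bmi_value <- weight_kg / height_m^2
round(bmi_value, 2)
```

If the real columns were in other units (for example centimetres), the resulting BMI values would be far outside the usual range, which is an easy way to spot a units problem.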

Visualization

library(ggplot2)

ggplot(bmi, aes(x = NObeyesdad, y = BMI)) +
  geom_boxplot() +
  labs(title = 'BMI vs. Obesity Levels') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Looking at this plot, insufficient weight, overweight I, and obesity III each appear to have a few outliers. Insufficient weight and overweight levels I and II have the smallest ranges of BMI, followed by obesity I, and obesity III has the largest range. There also appears to be some overlap between adjacent weight categories, which is interesting to me. I had assumed the BMI ranges would be disjoint (for example, insufficient weight at 17-24 and normal weight at 25-27, or something like that), but it seems the ranges meet or overlap at their boundaries (for example, insufficient weight at 17-24 and normal weight at 24-27).

This visualization and BMI column, along with reading the paper linked in the documentation of my data set, finally help me to understand how the obesity levels are categorized, which can now help me make more informed decisions about the analysis of my data set.

Correlation coefficient

cor(bmi$BMI, as.numeric(as.factor(bmi$NObeyesdad)), method = 'spearman')

## [1] 0.4237097

# Using as.numeric() alone kept giving me error messages and I couldn't find anything online, so I asked ChatGPT about the error; wrapping the column in as.factor() first gave me a result

Based on this value, I would conclude that there is not a strong relationship between obesity level and BMI. One caveat about the code: as.factor() orders the levels alphabetically, so the integer codes do not follow the severity order listed above (for example, a label starting with ‘Obesity’ sorts before one starting with ‘Overweight’), and that alone could pull the coefficient down. Beyond that, the result makes sense for a few reasons. The first and most apparent is that this is not a real data set. Because I’m using a data mining/machine learning training data set for this project, and the aim is to place people into an obesity level by their lifestyle habits, it makes sense that it may be “off” at certain points. The second reason is that BMI has generally been “debunked” as a way to categorize people, since it looks at height and weight exclusively and does not take into account muscle mass or other aspects of physical body composition.
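Because as.factor() sorts factor levels alphabetically, the integer codes passed to cor() may not follow the severity order of the obesity levels. The sketch below shows the difference on toy data. The level labels are an assumption about how NObeyesdad is spelled in the file (worth checking with unique(obesity$NObeyesdad)), and the BMI values are invented for illustration.

```r
# Severity order of the levels. These labels are an ASSUMPTION about the
# spelling used in the file; confirm with unique(obesity$NObeyesdad).
severity <- c("Insufficient_Weight", "Normal_Weight",
              "Overweight_Level_I", "Overweight_Level_II",
              "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Toy data (hypothetical): one row per level, BMI rising with severity
toy <- data.frame(
  NObeyesdad = severity,
  BMI = c(17, 22, 26, 28, 32, 37, 42)
)

# Default coding: as.factor() sorts the levels alphabetically
alpha_code <- as.numeric(as.factor(toy$NObeyesdad))

# Explicit coding: the integer codes follow the severity order
ordered_code <- as.numeric(factor(toy$NObeyesdad, levels = severity))

rho_alpha <- cor(toy$BMI, alpha_code, method = "spearman")
rho_ordered <- cor(toy$BMI, ordered_code, method = "spearman")

rho_alpha    # below 1, even though BMI rises with every level
rho_ordered  # 1 for this toy data
```

On the real data, the same idea would be to pass factor(bmi$NObeyesdad, levels = severity) to as.numeric() in place of as.factor(bmi$NObeyesdad), assuming the labels match.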

Second pair of numeric variables

No. of meals per week (calculated) and frequency of physical activity

# Convert daily meals (NCP) to meals per week
meals <- obesity |>
  mutate(weekly_meals = NCP * 7)

I chose to calculate the number of meals people are consuming weekly (since NCP is number of meals daily) and compare it with how much physical activity people are doing weekly to explore the relationship between exercise frequency and consumption of full meals.

Visualization

ggplot(meals, aes(x = weekly_meals, y = FAF)) +
  geom_point() +
  labs(title = 'Weekly Meals vs. Physical Activity Frequency') +
  theme_minimal() +
  geom_smooth(color = 'pink')

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This scatter plot does not show much, because it is very busy. There is a very dense line at 21 weekly meals, regardless of frequency of physical activity. I expected this, since I think the majority of people tend to stick to the ‘three meals a day’ paradigm, and if they are more active and therefore hungrier, they might just supplement with snacks instead of adding a whole other meal. I don’t think much can be drawn from this graph. The trend line hints that individuals who eat more than 21 meals a week tend to have a higher physical activity frequency, but no trend is visible in the points themselves.

Correlation coefficient

cor(meals$weekly_meals, meals$FAF, method = 'spearman')

## [1] 0.1449121

This, as expected based on the graph, suggests that there is at most a weak positive relationship between frequency of physical activity and the number of meals individuals eat during the week. Looking just at myself and people I know who play or used to play competitive sports, I could have guessed this. As I said in the analysis of my graph, I think it’s much more common for average people, or even non-professional athletes, to eat three meals and then have some snacks. It would be interesting to factor in snack consumption in some way, but I’m unsure how I would go about that, since the variable for whether or not people eat between meals is categorical (not ordered).

Confidence Intervals

library(boot)

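# Percentile bootstrap confidence interval for a statistic (median by default)
confidence_interval <- function(v, func = median, conf = 0.95, n_iter = 1000) {
  # boot() passes the data and a vector of resampled indices
  boot_func <- \(x, i) func(x[i])
  b <- boot(v, boot_func, R = n_iter)

  boot.ci(b, conf = conf, type = "perc")
}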


bmi_interval <- confidence_interval(bmi$BMI)
meal_interval <- confidence_interval(meals$weekly_meals)

## [1] "All values of t are equal to  21 \n Cannot calculate confidence intervals"

bmi_interval

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (28.32, 29.36 )  
## Calculations and Intervals on Original Scale

meal_interval

## NULL

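For BMI, the result is a 95% bootstrap confidence interval for the median: I can be about 95% confident that the median BMI of the population this data represents lies between 28.32 and 29.36. It does not mean that 95% of the data has a BMI in that range. Looking at the general BMI categorizations, these values fall into the ‘overweight’ category, more specifically overweight II in this data set.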
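A confidence interval for the median and the range that covers 95% of the observations are different quantities, and the second is usually much wider. The sketch below contrasts the two on simulated values; toy_bmi is a hypothetical stand-in drawn from a normal distribution, not the real BMI column.

```r
library(boot)

set.seed(42)  # make the simulation and resampling reproducible

# Hypothetical BMI-like values (NOT the real data)
toy_bmi <- rnorm(500, mean = 29, sd = 8)

# Range that holds the middle 95% of the observations themselves
data_range <- unname(quantile(toy_bmi, c(0.025, 0.975)))

# 95% percentile bootstrap interval for the median
b <- boot(toy_bmi, \(x, i) median(x[i]), R = 1000)
median_ci <- boot.ci(b, conf = 0.95, type = "perc")$percent[4:5]

diff(data_range)  # width of the range of the data: wide
diff(median_ci)   # width of the interval for the median: much narrower
```

On the real data, quantile(bmi$BMI, c(0.025, 0.975)) would give the range holding the middle 95% of observed BMIs, which is the quantity to report when describing the spread of the data itself.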

For the weekly meals confidence interval, the result is NULL. The message printed by boot.ci() explains why: every bootstrap median was exactly 21. So many values are ‘21’ (3 meals a day for 7 days) that every resample has the same median, which leaves no variation to build an interval from. I think some further investigation is necessary, for example with a different statistic, since it would be interesting to compare this with other columns.
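The NULL result can be reproduced on a small example. The sketch below uses toy_meals, a hypothetical vector in which most values are exactly 21, to show that the bootstrap medians have no spread while the bootstrap means do. Switching to the mean is one possible workaround, not the only one.

```r
library(boot)

set.seed(1)  # make the resampling reproducible

# Toy stand-in for weekly_meals (hypothetical values, NOT the real data):
# most entries are exactly 21, with a few smaller and larger ones
toy_meals <- c(rep(7, 5), rep(21, 90), rep(28, 5))

# Bootstrap the median: every resample has median 21, so there is no spread
b_median <- boot(toy_meals, \(x, i) median(x[i]), R = 1000)
length(unique(b_median$t))

# Bootstrap the mean instead: the resampled means vary, so boot.ci()
# can return a percentile interval
b_mean <- boot(toy_meals, \(x, i) mean(x[i]), R = 1000)
mean_ci <- boot.ci(b_mean, conf = 0.95, type = "perc")
mean_ci$percent[4:5]  # lower and upper endpoints
```

With the function defined earlier, the equivalent call on the real data would be confidence_interval(meals$weekly_meals, func = mean).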

Kylie Heagy

2024-10-07
