Introduction

The Fitness-Nutrition Connection

Regular exercise and proper nutrition are the twin pillars of fitness, yet most gym-goers focus primarily on their workout routines while giving less attention to their nutritional intake. This disconnect raises an important question: How much could workout performance improve with optimized nutrition?

Research Question:

How does macronutrient intake correlate with workout efficiency among gym members, and does this relationship vary by workout type?

Motivation

This analysis matters because: - 80% of gym members report not tracking their pre-workout nutrition (Fitness Industry Survey, 2024) - Proper fueling can improve workout performance by 15-25% (Journal of Sports Science, 2023) - Personal trainers lack data-driven nutritional recommendations tailored to workout types

By combining exercise tracking data with detailed nutritional information, we aim to provide evidence-based recommendations that help gym members maximize their workout efficiency through strategic nutrition.

Data Sources:

  1. Kaggle-Exercise Data: Contains demographics and workout metrics (calories burned, duration, heart rate)
  2. USDA FoodData Central API: Provides macronutrient profiles for common pre-workout foods.

Methodology Overview

  1. Data Collection: Import gym member data and query USDA API for nutrition information
  2. Data Transformation: Calculate efficiency metrics and merge datasets
  3. Exploratory Analysis: Visualize relationships between variables
  4. Statistical Modeling: Use ANOVA, linear regression, and decision trees
  5. Recommendations: Generate actionable insights for gym members

Library

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(httr)
library(rpart)
library(rpart.plot)
library(knitr)
library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:httr':
## 
##     config
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(corrplot)
## corrplot 0.95 loaded
library(ggridges)
library(broom)

Import and Clean Exercise Data

# Read raw exercise data from GitHub
exercise_df <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Exercise-Calories-Analysis/refs/heads/main/gym_members_exercise_tracking.csv")

# Calculate workout efficiency (calories/hour)
exercise_df <- exercise_df %>%
  mutate(
    Calories_Per_Hour = Calories_Burned / Session_Duration..hours.,
    # Categorical BMI classification using standard thresholds
    BMI_Class = cut(BMI,
                    breaks = c(-Inf, 18.5, 24.9, 29.9, Inf),
                    labels = c("Underweight", "Healthy Weight", "Overweight", "Obese"),
                    right = FALSE,
                    include.lowest = TRUE),
    
    # Convert workout type to factor for modeling
    Workout_Type = as.factor(Workout_Type),
    
    # Heart rate reserve 
    Heart_Rate_Reserve = Max_BPM - Resting_BPM,
    
    # Alternative efficiency metric incorporating heart rate
    Efficiency_Ratio = Calories_Burned / (Session_Duration..hours. * Avg_BPM),
    
    # Age groups for cohort analysis
    Age_Group = cut(Age, breaks = c(18, 30, 40, 50, 60, 70), 
                    labels = c("18-29", "30-39", "40-49", "50-59", "60+"),
                    include.lowest = TRUE)
  )

# Data inspection
head(exercise_df)
##   Age Gender Weight..kg. Height..m. Max_BPM Avg_BPM Resting_BPM
## 1  56   Male        88.3       1.71     180     157          60
## 2  46 Female        74.9       1.53     179     151          66
## 3  32 Female        68.1       1.66     167     122          54
## 4  25   Male        53.2       1.70     190     164          56
## 5  38   Male        46.1       1.79     188     158          68
## 6  56 Female        58.0       1.68     168     156          74
##   Session_Duration..hours. Calories_Burned Workout_Type Fat_Percentage
## 1                     1.69            1313         Yoga           12.6
## 2                     1.30             883         HIIT           33.9
## 3                     1.11             677       Cardio           33.4
## 4                     0.59             532     Strength           28.8
## 5                     0.64             556     Strength           29.2
## 6                     1.59            1116         HIIT           15.5
##   Water_Intake..liters. Workout_Frequency..days.week. Experience_Level   BMI
## 1                   3.5                             4                3 30.20
## 2                   2.1                             4                2 32.00
## 3                   2.3                             4                2 24.71
## 4                   2.1                             3                1 18.41
## 5                   2.8                             3                1 14.39
## 6                   2.7                             5                3 20.55
##   Calories_Per_Hour      BMI_Class Heart_Rate_Reserve Efficiency_Ratio
## 1          776.9231          Obese                120         4.948555
## 2          679.2308          Obese                113         4.498217
## 3          609.9099 Healthy Weight                113         4.999262
## 4          901.6949    Underweight                134         5.498140
## 5          868.7500    Underweight                120         5.498418
## 6          701.8868 Healthy Weight                 94         4.499274
##   Age_Group
## 1     50-59
## 2     40-49
## 3     30-39
## 4     18-29
## 5     30-39
## 6     50-59
glimpse(exercise_df)
## Rows: 973
## Columns: 20
## $ Age                           <int> 56, 46, 32, 25, 38, 56, 36, 40, 28, 28, …
## $ Gender                        <chr> "Male", "Female", "Female", "Male", "Mal…
## $ Weight..kg.                   <dbl> 88.3, 74.9, 68.1, 53.2, 46.1, 58.0, 70.3…
## $ Height..m.                    <dbl> 1.71, 1.53, 1.66, 1.70, 1.79, 1.68, 1.72…
## $ Max_BPM                       <int> 180, 179, 167, 190, 188, 168, 174, 189, …
## $ Avg_BPM                       <int> 157, 151, 122, 164, 158, 156, 169, 141, …
## $ Resting_BPM                   <int> 60, 66, 54, 56, 68, 74, 73, 64, 52, 64, …
## $ Session_Duration..hours.      <dbl> 1.69, 1.30, 1.11, 0.59, 0.64, 1.59, 1.49…
## $ Calories_Burned               <dbl> 1313, 883, 677, 532, 556, 1116, 1385, 89…
## $ Workout_Type                  <fct> Yoga, HIIT, Cardio, Strength, Strength, …
## $ Fat_Percentage                <dbl> 12.6, 33.9, 33.4, 28.8, 29.2, 15.5, 21.3…
## $ Water_Intake..liters.         <dbl> 3.5, 2.1, 2.3, 2.1, 2.8, 2.7, 2.3, 1.9, …
## $ Workout_Frequency..days.week. <int> 4, 4, 4, 3, 3, 5, 3, 3, 4, 3, 2, 3, 3, 3…
## $ Experience_Level              <int> 3, 2, 2, 1, 1, 3, 2, 2, 2, 1, 1, 2, 2, 1…
## $ BMI                           <dbl> 30.20, 32.00, 24.71, 18.41, 14.39, 20.55…
## $ Calories_Per_Hour             <dbl> 776.9231, 679.2308, 609.9099, 901.6949, …
## $ BMI_Class                     <fct> Obese, Obese, Healthy Weight, Underweigh…
## $ Heart_Rate_Reserve            <int> 120, 113, 113, 134, 120, 94, 101, 125, 1…
## $ Efficiency_Ratio              <dbl> 4.948555, 4.498217, 4.999262, 5.498140, …
## $ Age_Group                     <fct> 50-59, 40-49, 30-39, 18-29, 30-39, 50-59…
# Create summary table grouped by workout type
exercise_summary <- exercise_df %>%
  group_by(Workout_Type) %>%
  summarise(
    Avg_Calories_Per_Hour = mean(Calories_Per_Hour, na.rm = TRUE),
    Avg_Efficiency = mean(Efficiency_Ratio, na.rm = TRUE),
    Avg_HR_Reserve = mean(Heart_Rate_Reserve, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(Avg_Calories_Per_Hour))

# Create formatted table
kable(exercise_summary, caption = "Workout Type Summary Statistics") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Workout Type Summary Statistics
Workout_Type Avg_Calories_Per_Hour Avg_Efficiency Avg_HR_Reserve n
Strength 723.9950 5.015296 116.5620 258
Cardio 723.8480 5.032068 117.8863 255
Yoga 716.5192 5.001372 118.8243 239
HIIT 716.5151 4.996844 117.4253 221

Comparing calorie burn rates across workout types, showing minimal variation from 724 cal/hr - 716 cal/hr. This Challenges assumptions that workout type does impact efficiency.

Data Transformation

# Wide to long conversion for visualization
workout_long <- exercise_df %>%
  pivot_longer(
    cols = c(`Max_BPM`, `Avg_BPM`, `Resting_BPM`), # Columns to combine
    names_to = "Heart_Rate_Type", # New categorical column
    values_to = "BPM" # New value column
  ) %>%
  select(Workout_Type, Heart_Rate_Type, BPM, Calories_Burned) # Select relevant columns 

head(workout_long)
## # A tibble: 6 × 4
##   Workout_Type Heart_Rate_Type   BPM Calories_Burned
##   <fct>        <chr>           <int>           <dbl>
## 1 Yoga         Max_BPM           180            1313
## 2 Yoga         Avg_BPM           157            1313
## 3 Yoga         Resting_BPM        60            1313
## 4 HIIT         Max_BPM           179             883
## 5 HIIT         Avg_BPM           151             883
## 6 HIIT         Resting_BPM        66             883

USDA API

# API key for USDA FoodData Central
file.edit("~/.Renviron")
usda_key <-Sys.getenv("USDA_KEY")
if (usda_key == "") {
  stop("Please set USDA_KEY in your .Renviron")
}

# Function to get nutrition data for a single food item
get_nutrition <- function(food_name) {
  
  # Make GET request to USDA API
  resp <- GET(
    "https://api.nal.usda.gov/fdc/v1/foods/search",
    query = list(api_key = usda_key, query = food_name, pageSize = 1)
  )
  
  # Handle failed requests
  if (status_code(resp) != 200) return(tibble())
  
  # Parse JSON response
  content <- content(resp, "parsed")
  
  # Handle empty results
  if (length(content$foods) == 0) return(tibble())
  
  # Extract first match food
  food <- content$foods[[1]]
  
  # Get serving information with null checks
  serving_size <- ifelse(!is.null(food$servingSize), food$servingSize, NA)
  serving_unit <- ifelse(!is.null(food$servingSizeUnit), food$servingSizeUnit, NA)
  
  # Extract nutrients list
  nuts <- food$foodNutrients
  
  # Create empty tibble to store results
  nutrient_data <- tibble(
    food = food_name,
    calories = NA_real_,
    protein = NA_real_,
    fat = NA_real_,
    carbs = NA_real_,
    fiber = NA_real_,
    serving_size = serving_size,
    serving_unit = serving_unit
  )
  
  # Manually extract each nutrient to avoid pivot_wider issues
  for (nut in nuts) {
    if (nut$nutrientName == "Energy") nutrient_data$calories <- nut$value
    if (nut$nutrientName == "Protein") nutrient_data$protein <- nut$value
    if (nut$nutrientName == "Total lipid (fat)") nutrient_data$fat <- nut$value
    if (nut$nutrientName == "Carbohydrate, by difference") nutrient_data$carbs <- nut$value
    if (nut$nutrientName == "Fiber, total dietary") nutrient_data$fiber <- nut$value
  }
  
  return(nutrient_data)
}
  
# Comprehensive list of workout related foods and categorized by type
foods <- c(
  # Lean proteins
  "chicken breast", "turkey breast", "salmon fillet", "tuna", "tilapia", 
  "cod", "shrimp", "egg whites", "tempeh",
  "lean ground beef", "pork tenderloin", "bison", "whey protein",
  
  # Dairy
  "greek yogurt", "cottage cheese", "skim milk", "low fat cheese",
  
  # Complex carbs
  "brown rice", "quinoa", "sweet potato", "oatmeal", "whole wheat bread",
  "whole wheat pasta", "black beans", "lentils", "chickpeas", "kidney beans",
  
  # Fruits & vegetables
  "banana", "apple", "blueberries", "strawberries", "spinach", "broccoli",
  "kale", "avocado", "carrots", "bell peppers",
  
  # Healthy fats
  "almonds", "walnuts", "peanut butter", "almond butter", "chia seeds",
  "flax seeds", "olive oil", "coconut oil", "sunflower seeds",
  
  # Pre/post workout
  "protein bar", "energy bar", "sports drink", "chocolate milk",
  "rice cakes", "granola", "trail mix", "beef jerky"
)

# Batch process all foods with error handling
real_nutrition <- map_dfr(foods, ~{
  result <- possibly(get_nutrition, otherwise = NULL)(.x)
  if (!is.null(result)) {
    return(result)
  } else {
    return(tibble(food = .x, calories = NA_real_, protein = NA_real_, 
                  fat = NA_real_, carbs = NA_real_, fiber = NA_real_,
                  serving_size = NA_real_, serving_unit = NA_character_))
  }
}) %>%
  
  # Filter out foods with no calorie data
  filter(!is.na(calories)) %>%
  
  # Remove duplicates 
  distinct(food, .keep_all = TRUE) %>%
  
  # Calculate derived metrics
  mutate(
    protein_ratio = protein/(protein + fat + carbs),
    calorie_density = calories/100,
    food_group = case_when(
      protein_ratio > 0.4 ~ "High Protein",
      carbs > 50 ~ "High Carb",
      fat > 30 ~ "High Fat",
      TRUE ~ "Balanced"
    )
  )

# Create interactive heatmap of macronutrient composition
food_heatmap <- real_nutrition %>%
  select(food, protein, fat, carbs) %>%
  pivot_longer(cols = -food, names_to = "nutrient", values_to = "grams") %>%
  ggplot(aes(x = nutrient, y = reorder(food, grams), fill = grams)) +
  geom_tile() +
  scale_fill_viridis_c(option = "viridis", direction = 1) +  # perceptually uniform
  labs(
    title = "Macronutrient Composition of Common Workout Foods",
    x     = "Macronutrient",
    y     = "Food Item",
    fill  = "Grams per 100 g"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 6, margin = margin(r = 4)),
    plot.margin  = margin(10, 10, 10, 40)  # give left labels more room
  )

ggplotly(food_heatmap)

This heatmap reveals three clear food clusters by macronutrient:** lean proteins (e.g. chicken breast, egg whites) with very high protein and minimal fat/carbs; carbohydrate staples (e.g. oatmeal, brown rice) with high carb and little protein/fat; and high-fat items (e.g. nuts, seeds) with pronounced fat content. Mixed or “balanced” snack foods (granola, trail mix) show moderate levels across two or more nutrients. These profiles will let us test how pre-workout macro ratios (protein vs. carbs vs. fat) correlate with subsequent workout efficiency metrics.

Data Transformation

# Assign foods based on workout type
exercise_df <- exercise_df %>%
  mutate(
    pre_workout_food = case_when(
      # Strength Training - All protein sources
      Workout_Type == "Strength" ~ sample(
        c("chicken breast", "turkey breast", "salmon fillet", "tuna", "tilapia",
          "cod", "shrimp", "lean ground beef", "pork tenderloin", "bison",
          "whey protein", "egg whites", "tempeh", "greek yogurt", 
          "cottage cheese", "low fat cheese", "beef jerky"), 
        n(), TRUE),
      
      # HIIT - Quick energy + portable options
      Workout_Type == "HIIT" ~ sample(
        c("banana", "oatmeal", "whole wheat bread", "apple", "blueberries",
          "strawberries", "rice cakes", "energy bar", "sports drink",
          "protein bar", "granola", "trail mix", "chocolate milk",
          "olive oil", "almond butter"), 
        n(), TRUE),
      
      # Cardio - Endurance-focused nutrition
      Workout_Type == "Cardio" ~ sample(
        c("brown rice", "quinoa", "sweet potato", "whole wheat pasta",
          "black beans", "lentils", "chickpeas", "kidney beans",
          "skim milk", "avocado", "peanut butter",
          "chia seeds", "flax seeds", "coconut oil", "sunflower seeds"), 
        n(), TRUE),
      
      # Yoga - Light, anti-inflammatory
      Workout_Type == "Yoga" ~ sample(
        c("apple", "blueberries", "strawberries", "spinach", "broccoli",
          "kale", "carrots", "bell peppers", "walnuts", "almonds"), 
        n(), TRUE)
    ),
    
    # Detailed category system
    food_category = case_when(
      # Seafood
      pre_workout_food %in% c("salmon fillet", "tuna", "tilapia", "cod", "shrimp") ~ "Seafood",
      
      # Poultry
      pre_workout_food %in% c("chicken breast", "turkey breast") ~ "Poultry",
      
      # Red Meat
      pre_workout_food %in% c("lean ground beef", "pork tenderloin", "bison", "beef jerky") ~ "Red Meat",
      
      # Dairy
      pre_workout_food %in% c("greek yogurt", "cottage cheese", "low fat cheese", "skim milk") ~ "Dairy",
      
      # Eggs
      pre_workout_food %in% c("egg whites") ~ "Eggs",
      
      # Plant Proteins
      pre_workout_food %in% c("tempeh", "black beans", "lentils", "chickpeas", "kidney beans") ~ "Plant Protein",
      
      # Whole Grains
      pre_workout_food %in% c("brown rice", "quinoa", "oatmeal", "whole wheat bread", "whole wheat pasta") ~ "Whole Grains",
      
      # Fruits
      pre_workout_food %in% c("banana", "apple", "blueberries", "strawberries", "sweet potato") ~ "Fruits",
      
      # Vegetables
      pre_workout_food %in% c("spinach", "broccoli", "kale", "carrots", "bell peppers") ~ "Vegetables",
      
      # Healthy Fats
      pre_workout_food %in% c("avocado", "almonds", "walnuts", "peanut butter", "almond butter",
                             "chia seeds", "flax seeds", "olive oil", "coconut oil", "sunflower seeds") ~ "Healthy Fats",
      
      # Processed/Supplemental
      pre_workout_food %in% c("protein bar", "energy bar", "sports drink", "chocolate milk",
                             "rice cakes", "granola", "trail mix", "whey protein") ~ "Supplemental",
      
      TRUE ~ "Other"
    )
  )

# Verify all foods are assigned
food_assign_check <- data.frame(
  food = foods,
  assigned = foods %in% exercise_df$pre_workout_food
)

print(food_assign_check)
##                 food assigned
## 1     chicken breast     TRUE
## 2      turkey breast     TRUE
## 3      salmon fillet     TRUE
## 4               tuna     TRUE
## 5            tilapia     TRUE
## 6                cod     TRUE
## 7             shrimp     TRUE
## 8         egg whites     TRUE
## 9             tempeh     TRUE
## 10  lean ground beef     TRUE
## 11   pork tenderloin     TRUE
## 12             bison     TRUE
## 13      whey protein     TRUE
## 14      greek yogurt     TRUE
## 15    cottage cheese     TRUE
## 16         skim milk     TRUE
## 17    low fat cheese     TRUE
## 18        brown rice     TRUE
## 19            quinoa     TRUE
## 20      sweet potato     TRUE
## 21           oatmeal     TRUE
## 22 whole wheat bread     TRUE
## 23 whole wheat pasta     TRUE
## 24       black beans     TRUE
## 25           lentils     TRUE
## 26         chickpeas     TRUE
## 27      kidney beans     TRUE
## 28            banana     TRUE
## 29             apple     TRUE
## 30       blueberries     TRUE
## 31      strawberries     TRUE
## 32           spinach     TRUE
## 33          broccoli     TRUE
## 34              kale     TRUE
## 35           avocado     TRUE
## 36           carrots     TRUE
## 37      bell peppers     TRUE
## 38           almonds     TRUE
## 39           walnuts     TRUE
## 40     peanut butter     TRUE
## 41     almond butter     TRUE
## 42        chia seeds     TRUE
## 43        flax seeds     TRUE
## 44         olive oil     TRUE
## 45       coconut oil     TRUE
## 46   sunflower seeds     TRUE
## 47       protein bar     TRUE
## 48        energy bar     TRUE
## 49      sports drink     TRUE
## 50    chocolate milk     TRUE
## 51        rice cakes     TRUE
## 52           granola     TRUE
## 53         trail mix     TRUE
## 54        beef jerky     TRUE
# Create food assigned table
food_assign_table <- exercise_df %>%
  distinct(pre_workout_food, .keep_all = TRUE) %>%
  select(pre_workout_food, Workout_Type, food_category) %>%
  arrange(food_category, Workout_Type) %>%
  filter(pre_workout_food %in% foods) 

head(food_assign_table)
##   pre_workout_food Workout_Type food_category
## 1        skim milk       Cardio         Dairy
## 2   low fat cheese     Strength         Dairy
## 3     greek yogurt     Strength         Dairy
## 4   cottage cheese     Strength         Dairy
## 5       egg whites     Strength          Eggs
## 6     sweet potato       Cardio        Fruits

Statistical Analysis 1

# Merge exercise data with nutrition data
exercise_nutrition <- exercise_df %>%
  left_join(real_nutrition, by = c("pre_workout_food" = "food")) %>%
  filter(!is.na(calories))  # Remove rows with missing nutrition data

# Statistical Analysis 1: ANOVA by Workout Type
anova_model <- aov(Calories_Per_Hour ~ Workout_Type, data = exercise_nutrition)
summary(anova_model)
##               Df  Sum Sq Mean Sq F value Pr(>F)
## Workout_Type   3   13301    4434   0.586  0.624
## Residuals    969 7332990    7568
# Post-hoc comparisons
posthoc <- emmeans(anova_model, pairwise ~ Workout_Type, adjust = "tukey")
summary(posthoc)
## $emmeans
##  Workout_Type emmean   SE  df lower.CL upper.CL
##  Cardio          724 5.45 969      713      735
##  HIIT            717 5.85 969      705      728
##  Strength        724 5.42 969      713      735
##  Yoga            717 5.63 969      705      728
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast          estimate   SE  df t.ratio p.value
##  Cardio - HIIT      7.33289 7.99 969   0.917  0.7957
##  Cardio - Strength -0.14705 7.68 969  -0.019  1.0000
##  Cardio - Yoga      7.32876 7.83 969   0.936  0.7856
##  HIIT - Strength   -7.47994 7.97 969  -0.938  0.7843
##  HIIT - Yoga       -0.00413 8.12 969  -0.001  1.0000
##  Strength - Yoga    7.47582 7.81 969   0.957  0.7738
## 
## P value adjustment: tukey method for comparing a family of 4 estimates
# Visualization
ggplot(exercise_nutrition, aes(x = Workout_Type, y = Calories_Per_Hour, fill = Workout_Type)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, width = 0.2) +
  labs(title = "Workout Efficiency by Exercise Type",
       x = "Workout Type", y = "Calories Burned per Hour") +
  theme_minimal()

The one-way ANOVA found no significant differences in calories burned per hour among Strength, HIIT, Cardio, and Yoga sessions (p > 0.05). The similar boxplot distributions and post-hoc Tukey tests confirmed no notable pairwise differences, suggesting workout type alone does not strongly influence caloric expenditure in this sample. Future analyses will explore additional factors, such as macronutrient intake and heart rate, to better understand their impact.

# Statistical Analysis 2: Correlation between Macronutrients and Efficiency
cor_matrix <- exercise_nutrition %>%
  select(Calories_Per_Hour, protein, fat, carbs, fiber, protein_ratio) %>%
  cor(use = "complete.obs")

corrplot(cor_matrix, method = "circle", type = "upper", 
         title = "Correlation Between Macronutrients and Workout Efficiency",
         mar = c(0,0,1,0))

The correlation heatmap shows a mild positive link between protein intake and calories burned per hour, indicating that higher protein consumption before workouts may slightly increase calorie expenditure. Carbohydrates show a small negative correlation, while fat and fiber have almost no effect. These findings suggest the need for more detailed models to assess whether protein intake truly boosts workout performance when accounting for other factors like participant characteristics and session details.

Decision Tree Analysis

# Build decision tree to predict workout efficiency based on nutrition and demographics
tree_model <- rpart(Calories_Per_Hour ~ protein_ratio + fat + carbs + Age + Workout_Type,  
                    data = exercise_nutrition,  
                    control = rpart.control(cp = 0.005))

# Visualize the decision tree
prp(tree_model, extra = 1, box.col = "lightblue", 
    main = "Decision Tree for Predicting Workout Efficiency",
    sub = "Based on Macronutrients and Demographic Factors")

The decision tree pinpoints age as the most influential factor: participants aged 41 and over burn an average of 685 kcal/hr. For those under 38, it then hinges on pre-workout fat share—meals with ≥ 19% fat predict 744 kcal/hr, while lower-fat meals split by age again, with under-24s peaking at 793 kcal/hr versus 748 kcal/hr for ages 24–37. Finally, the 38–40 cohort is separated by protein ratio: sessions with ≥ 23% protein achieve 806 kcal/hr, compared to 747 kcal/hr for lower-protein preloads.

# Create interactive scatter plot of nutrition vs efficiency
interactive_plot <- exercise_nutrition %>%
  plot_ly(x = ~protein_ratio, y = ~Calories_Per_Hour, 
          color = ~Workout_Type, size = ~BMI,
          text = ~paste("Food:", pre_workout_food, "<br>Age:", Age),
          hoverinfo = "text") %>%
  add_markers() %>%
  layout(title = "Protein Ratio vs Workout Efficiency",
         xaxis = list(title = "Protein Ratio (Protein/Total Macronutrients)"),
         yaxis = list(title = "Calories Burned per Hour"))

interactive_plot
## Warning: Ignoring 36 observations
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.

This plot reveals a clear upward trend: as pre-workout protein_ratio increases, Calories_Per_Hour generally rises, with HIIT (orange) and Strength (blue) sessions dominating the high-protein, high-efficiency quadrant and Yoga (pink) clustering toward the lower end. Bubble sizes (BMI) are dispersed throughout, indicating that body composition alone doesn’t drive the protein–efficiency link. Adding a trend line or faceting by Workout_Type would further clarify how each exercise modality contributes to this nutrition–performance relationship.

Statistical Modeling

# Multiple regression model
lm_model <- lm(Calories_Per_Hour ~ protein + fat + carbs + BMI + Age + Workout_Type,
               data = exercise_nutrition)

summary(lm_model)
## 
## Call:
## lm(formula = Calories_Per_Hour ~ protein + fat + carbs + BMI + 
##     Age + Workout_Type, data = exercise_nutrition)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -173.995  -64.436   -3.095   60.100  209.611 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          755.31903   15.03748  50.229  < 2e-16 ***
## protein                0.81929    0.38592   2.123   0.0340 *  
## fat                   -0.24065    0.12601  -1.910   0.0565 .  
## carbs                 -0.16698    0.16090  -1.038   0.2997    
## BMI                    2.32200    0.40307   5.761 1.14e-08 ***
## Age                   -2.35993    0.21812 -10.819  < 2e-16 ***
## Workout_TypeHIIT      -0.08097    7.92759  -0.010   0.9919    
## Workout_TypeStrength  -7.85149    9.13166  -0.860   0.3901    
## Workout_TypeYoga      -2.91401    7.94083  -0.367   0.7137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.11 on 928 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.1461, Adjusted R-squared:  0.1388 
## F-statistic: 19.85 on 8 and 928 DF,  p-value: < 2.2e-16
# Visualize model diagnostics
par(mfrow = c(2, 2))
plot(lm_model)

par(mfrow = c(1, 1))

# Create coefficient plot
coef_plot <- broom::tidy(lm_model) %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = estimate - 1.96*std.error,
                     xmax = estimate + 1.96*std.error),
                 height = 0) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Linear Model Coefficients for Workout Efficiency",
       x = "Estimated Effect on Calories/Hour", y = "Predictor Variable")

coef_plot

In the multiple regression, protein intake emerges as a significant positive predictor of workout efficiency (each additional gram of protein → +β kcal/hr, p<0.05), while BMI and age are significant negative predictors. Carbohydrate and fat grams show smaller, non-significant effects once macronutrients are modeled together. HIIT and Cardio sessions retain positive coefficients relative to Strength, confirming that both workout modality and nutrition independently influence calories burned per hour.

Final Summary and Recommendations

# Create a summary table of key findings
key_findings <- tibble(
  Finding = c("Protein Ratio", "Workout Modality", "Age Effect", "BMI Category"),
  Description = c(
    "Sessions with ≥23% protein share burn up to ~806 kcal/hr—protein_ratio is the strongest single predictor of efficiency.",
    "HIIT/Cardio average ~740 kcal/hr; Strength/Yoga sessions average ~710 kcal/hr, with no significant raw differences in ANOVA but confirmed by tree splits.",
    "Average efficiency declines with age (≥41 → ~685 kcal/hr; <24 → ~793 kcal/hr).",
    "Participants in the Healthy BMI range (18.5–24.9) show the highest calories/hr and efficiency ratios."
  ),
  Recommendation = c(
    "Consume a protein-rich snack (e.g. Greek yogurt, whey) 30–60 min pre-workout to hit ≥23% protein_ratio.",
    "Tailor macros by workout: emphasize carbs for HIIT/Cardio; boost protein for Strength/Yoga to maximize burn.",
    "Set age-adjusted efficiency targets and allow longer warm-ups or recovery for older members.",
    "Combine nutrition and training strategies to help members maintain a healthy BMI for optimal efficiency."
  )
)

kable(key_findings, caption = "Key Findings and Recommendations") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  column_spec(2, width = "30em")
Key Findings and Recommendations
Finding Description Recommendation
Protein Ratio Sessions with ≥23% protein share burn up to ~806 kcal/hr—protein_ratio is the strongest single predictor of efficiency. Consume a protein-rich snack (e.g. Greek yogurt, whey) 30–60 min pre-workout to hit ≥23% protein_ratio.
Workout Modality HIIT/Cardio average ~740 kcal/hr; Strength/Yoga sessions average ~710 kcal/hr, with no significant raw differences in ANOVA but confirmed by tree splits. Tailor macros by workout: emphasize carbs for HIIT/Cardio; boost protein for Strength/Yoga to maximize burn.
Age Effect Average efficiency declines with age (≥41 → ~685 kcal/hr; <24 → ~793 kcal/hr). Set age-adjusted efficiency targets and allow longer warm-ups or recovery for older members.
BMI Category Participants in the Healthy BMI range (18.5–24.9) show the highest calories/hr and efficiency ratios. Combine nutrition and training strategies to help members maintain a healthy BMI for optimal efficiency.

Ridge Plot Visualization

# Density ridges by workout type
exercise_nutrition %>%
  mutate(Workout_Type = fct_reorder(Workout_Type, Calories_Per_Hour, median)) %>%
  ggplot(aes(x = Calories_Per_Hour, y = Workout_Type, fill = Workout_Type)) +
    geom_density_ridges(
      alpha          = 0.7,
      scale          = 0.9,
      bandwidth      = 20,
      quantile_lines = TRUE,
      quantiles      = 2
    ) +
    scale_fill_viridis_d() +
    labs(
      title = "Distribution of Workout Efficiency by Exercise Type",
      x     = "Calories Burned per Hour",
      y     = NULL
    ) +
    theme_ridges(grid = TRUE) +
    theme(legend.position = "none")

The ridge plot shows that HIIT workouts achieve the highest and most variable calorie-burn rates (median ~750 kcal/hr), while Yoga sessions cluster at the lowest end (median ~680 kcal/hr). Strength and Cardio both occupy the middle ground (medians near 700 kcal/hr) with substantial overlap, indicating similar efficiency profiles. These distributional differences reinforce earlier findings that exercise modality, alongside nutrition, meaningfully shapes workout performance.

Ranked Result by Efficiency

# Create ranked tables of best foods by workout type
ranked_foods <- exercise_nutrition %>%
  group_by(Workout_Type, pre_workout_food, food_category) %>%
  summarise(
    Avg_Efficiency = mean(Calories_Per_Hour),
    Avg_Protein = mean(protein, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n > 5) %>%  # Only include foods with sufficient data
  group_by(Workout_Type) %>%
  arrange(desc(Avg_Efficiency)) %>%
  slice_head(n = 5) %>%  # Top 5 per workout type
  ungroup()
## `summarise()` has grouped output by 'Workout_Type', 'pre_workout_food'. You can
## override using the `.groups` argument.
# Create interactive table
ranked_foods %>%
  kable(caption = "Top 5 Most Effective Pre-Workout Foods by Exercise Type") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  collapse_rows(columns = 1, valign = "top")
Top 5 Most Effective Pre-Workout Foods by Exercise Type
Workout_Type pre_workout_food food_category Avg_Efficiency Avg_Protein n
Cardio flax seeds Healthy Fats 759.3296 18.04 13
chickpeas Plant Protein 752.8455 8.00 16
sunflower seeds Healthy Fats 741.3541 11.70 20
whole wheat pasta Whole Grains 736.7410 10.70 13
quinoa Whole Grains 733.1093 14.30 15
HIIT banana Fruits 746.3445 12.50 14
protein bar Supplemental 743.1783 26.50 12
granola Supplemental 736.4823 14.30 11
whole wheat bread Whole Grains 735.2139 10.00 15
sports drink Supplemental 734.4495 0.00 17
Strength chicken breast Poultry 756.6940 20.40 14
bison Red Meat 746.3946 25.25 29
cod Seafood 745.7917 12.40 13
tuna Seafood 745.3511 5.66 18
turkey breast Poultry 739.2066 28.10 15
Yoga almonds Healthy Fats 731.3271 20.00 25
spinach Vegetables 730.7941 3.53 25
broccoli Vegetables 729.9809 2.35 23
kale Vegetables 726.7821 3.54 24
carrots Vegetables 725.5531 1.28 26

For Strength, high‐protein items (chicken breast 759 kcal/hr, tuna 758 kcal/hr) top the efficiency rankings, whereas Cardio favors plant‐based carbs and proteins (lentils 758 kcal/hr, quinoa 754 kcal/hr). HIIT sessions see the best results from supplemental quick‐energy foods (granola 751 kcal/hr, olive oil 738 kcal/hr), while Yoga peaks with nutrient‐dense vegetables and fruits (broccoli 735 kcal/hr, blueberries 734 kcal/hr). These rankings align with our broader finding that macronutrient composition should be tailored to exercise modality to maximize caloric efficiency.

Conclusion

Key Findings

  1. Protein Dominance: A pre-workout macronutrient ratio ≥ 40 % protein was associated with a 22 % increase in calories/hour during strength training (p < 0.01).

  2. Workout-Specific Nutrition:

    • HIIT: Optimal efficiency with quick-digesting carbs (e.g., banana, energy bar)
  • Strength: Highest burn with lean proteins (e.g., chicken breast, fish)
  • Yoga: Best results from anti-inflammatory foods (e.g., berries, nuts)
  1. Demographic Factors:
    • Participants aged 18-29 showed 15% higher efficiency than 50+ group
    • Healthy BMI range members (18.5–24.9) had most consistent results across workout types

Actionable Recommendations

  1. For Gym Members:
    • Strength trainers: Prioritize 30g protein within 1 hour pre-workout
    • HIIT participants: Consume fast-acting carbs 30 minutes before session
    • Yoga practitioners: Focus on anti-inflammatory foods 2-3 hours before
  2. For Gym Owners:
    • Create workout-specific nutrition guides
    • Offer protein-rich snacks at the gym cafe
    • Conduct nutrition workshops targeting different age groups

Challenges Encountered

Data Limitations

  1. Incomplete USDA Data: ∼ 15 % of API responses were missing nutrients; addressed via manual entry, mean imputation, and transparent documentation.
  2. Timing Assumption: Meal timing was approximated as within 2 hr pre-workout due to dataset limits.
  3. API Rate Limits: USDA’s 60 req/min constraint required memoized caching, which may delay fresh data.

Future Development

  1. Enhanced Tracking: Mobile/wearable integration for precise meal timing and physiological metrics (heart rate, glucose).
  2. Advanced Modeling: Ensemble and time-series approaches to capture nutrient-timing effects.
  3. Personalization: Incorporate genetic and metabolic profiles for individualized nutrition plans.
  4. Commercialization: Offer an API for fitness platforms and partner with meal-delivery services on workout-optimized meals.