Project2 Mehreen Ali Gillani

Exercise Schedule & Meal Plan is a structured dataset (80,000 rows, 5 columns) that maps three basic user inputs (Gender, Goal, and BMI Category) to a recommended Exercise Schedule and Meal Plan. It is suited to recommendation systems, multi-output classification, and rule-based baselines in health and fitness applications. The dataset is available on Kaggle: https://www.kaggle.com/datasets/kavindavimukthi/meal-plan-and-exercise-schedule-gender-goal-bmi

Step 1: Import libraries and read the CSV from GitHub

# import libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(forcats)
# Raw CSV hosted on GitHub
url <- 'https://raw.githubusercontent.com/mehreengillani/DATA607/refs/heads/main/GYM.csv'
gym_data <- read_csv(url)

## Rows: 80000 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Goal, BMI Category, Exercise Schedule, Meal Plan
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
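
The column-specification message above can be avoided by declaring the column types up front. The following is a minimal sketch, assuming all five columns should stay as character. The inline CSV text is made-up sample data standing in for GYM.csv so that the snippet runs without network access.

```r
library(readr)

# Made-up two-row sample in the same layout as GYM.csv (not real rows)
sample_csv <- I("Gender,Goal,BMI Category,Exercise Schedule,Meal Plan
Female,muscle_gain,Normal weight,\"Cardio, strength work\",Balanced diet
Male,fat_burn,Underweight,\"Light weights, yoga\",High-calorie diet
")

# Declaring every column as character means readr has nothing to guess,
# so no column-specification message is printed
gym_sample <- read_csv(sample_csv, col_types = cols(.default = col_character()))

dim(gym_sample)
```

The same `col_types` argument can be passed when reading from the GitHub URL.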

Step 2: Initial data exploration

cat("Dataset dimensions:", dim(gym_data), "\n")

## Dataset dimensions: 80000 5

cat("Column names:", names(gym_data), "\n")

## Column names: Gender Goal BMI Category Exercise Schedule Meal Plan

cat("\nFirst look at the data:\n")

## 
## First look at the data:

gym_data

## # A tibble: 80,000 × 5
##    Gender Goal        `BMI Category` `Exercise Schedule`             `Meal Plan`
##    <chr>  <chr>       <chr>          <chr>                           <chr>      
##  1 Female muscle_gain Normal weight  Moderate cardio, Strength trai… Balanced d…
##  2 Male   fat_burn    Underweight    Light weightlifting, Yoga, and… High-calor…
##  3 Male   muscle_gain Normal weight  Moderate cardio, Strength trai… Balanced d…
##  4 Male   muscle_gain Overweight     High-intensity interval traini… Low-carb, …
##  5 Female muscle_gain Normal weight  Moderate cardio, Strength trai… Balanced d…
##  6 Male   muscle_gain Underweight    Light weightlifting, Yoga, and… High-calor…
##  7 Female fat_burn    Overweight     High-intensity interval traini… Low-carb, …
##  8 Male   muscle_gain Overweight     High-intensity interval traini… Low-carb, …
##  9 Female fat_burn    Obesity        Low-impact cardio, Swimming, a… Low-calori…
## 10 Female muscle_gain Underweight    Light weightlifting, Yoga, and… High-calor…
## # ℹ 79,990 more rows

Step 3: Check for missing values and data types

cat("Missing values by column:\n")

## Missing values by column:

colSums(is.na(gym_data))

##            Gender              Goal      BMI Category Exercise Schedule 
##                 0                 0                 0                 0 
##         Meal Plan 
##                 0

cat("\nData types:\n")

## 
## Data types:

glimpse(gym_data)

## Rows: 80,000
## Columns: 5
## $ Gender              <chr> "Female", "Male", "Male", "Male", "Female", "Male"…
## $ Goal                <chr> "muscle_gain", "fat_burn", "muscle_gain", "muscle_…
## $ `BMI Category`      <chr> "Normal weight", "Underweight", "Normal weight", "…
## $ `Exercise Schedule` <chr> "Moderate cardio, Strength training, and 5000 step…
## $ `Meal Plan`         <chr> "Balanced diet with moderate protein and carbohydr…
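
Since every column is text, it can help to see how many distinct labels each column holds before standardizing them. A small sketch on a hypothetical stand-in tibble (`toy` is not the real data):

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for gym_data with the same kind of columns
toy <- tibble(
  Gender = c("Female", "Male", "Male", "Female"),
  Goal = c("muscle_gain", "fat_burn", "muscle_gain", "fat_burn"),
  `BMI Category` = c("Normal weight", "Underweight", "Normal weight", "Obesity")
)

# Number of distinct values in each column
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# The actual labels in one column, with their frequencies
goal_levels <- toy %>%
  count(Goal, sort = TRUE)

distinct_counts
goal_levels
```

Running the same two calls on `gym_data` would list the exact spellings that the recoding in the next step has to handle.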

Step 4: Standardize all categorical variables

clean_data <- gym_data %>%
  rename_with(~ tolower(gsub(" ", "_", .x))) %>%
  mutate(
    gender = factor(gender, levels = c("Female", "Male")),
    
    # Standardize goal values
    goal = case_when(
      tolower(goal) == "muscle_gain" ~ "Muscle Gain",
      tolower(goal) == "fat_burn" ~ "Fat Burn",
      TRUE ~ goal
    ),
    goal = factor(goal, levels = c("Muscle Gain", "Fat Burn")),
    
    # Standardize BMI categories
    bmi_category = case_when(
      tolower(bmi_category) == "normal weight" ~ "Normal",
      tolower(bmi_category) == "underweight" ~ "Underweight",
      tolower(bmi_category) == "overweight" ~ "Overweight", 
      tolower(bmi_category) == "obesity" ~ "Obese",
      TRUE ~ bmi_category
    ),
    bmi_category = factor(bmi_category, 
                         levels = c("Underweight", "Normal", "Overweight", "Obese")),
    
    # Categorize exercise intensity based on schedule descriptions
    exercise_intensity = case_when(
      str_detect(tolower(exercise_schedule), "high-intensity|hiit|intense") ~ "High",
      str_detect(tolower(exercise_schedule), "moderate|strength training") ~ "Moderate",
      str_detect(tolower(exercise_schedule), "light|low-impact|yoga") ~ "Low",
      TRUE ~ "Unknown"
    ),
    exercise_intensity = factor(exercise_intensity, 
                               levels = c("Low", "Moderate", "High")),
    
    # Extract step count from exercise schedule
    steps = as.numeric(str_extract(exercise_schedule, "\\d+")),
    
    # Categorize meal plan types
    meal_plan_type = case_when(
      str_detect(tolower(meal_plan), "balanced|moderate") ~ "Balanced",
      str_detect(tolower(meal_plan), "high-calorie|protein-rich|whole milk") ~ "High Calorie",
      str_detect(tolower(meal_plan), "low-carb|high-fiber") ~ "Low Carb",
      str_detect(tolower(meal_plan), "low-calorie|portion control") ~ "Low Calorie",
      TRUE ~ "Other"
    ),
    meal_plan_type = factor(meal_plan_type)
  )

cat("Cleaned data structure:\n")

## Cleaned data structure:

head(clean_data)

## # A tibble: 6 × 8
##   gender goal  bmi_category exercise_schedule meal_plan exercise_intensity steps
##   <fct>  <fct> <fct>        <chr>             <chr>     <fct>              <dbl>
## 1 Female Musc… Normal       Moderate cardio,… Balanced… Moderate            5000
## 2 Male   Fat … Underweight  Light weightlift… High-cal… Low                 2000
## 3 Male   Musc… Normal       Moderate cardio,… Balanced… Moderate            5000
## 4 Male   Musc… Overweight   High-intensity i… Low-carb… High                8000
## 5 Female Musc… Normal       Moderate cardio,… Balanced… Moderate            5000
## 6 Male   Musc… Underweight  Light weightlift… High-cal… Low                 2000
## # ℹ 1 more variable: meal_plan_type <fct>

colnames(clean_data)

## [1] "gender"             "goal"               "bmi_category"      
## [4] "exercise_schedule"  "meal_plan"          "exercise_intensity"
## [7] "steps"              "meal_plan_type"
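
Two details of the pipeline above are worth checking on real data. First, `exercise_intensity` can take the value "Unknown", but that value is not among the factor levels, so any such row would silently become NA. Second, `"\\d+"` extracts the first run of digits, which is only the step count if no other number appears earlier in the text. The sketch below illustrates both points with made-up strings; the lookahead pattern is one possible alternative, not the method used in this report.

```r
library(stringr)

# 1. A value missing from `levels` is silently converted to NA
intensity <- c("Low", "Unknown", "High")
dropped <- factor(intensity, levels = c("Low", "Moderate", "High"))
kept <- factor(intensity, levels = c("Low", "Moderate", "High", "Unknown"))

sum(is.na(dropped))  # number of values lost to the missing level
sum(is.na(kept))     # number of values lost when "Unknown" is a level

# 2. "\\d+" returns the first run of digits in the string
schedule <- "3 sets of squats, then 8000 steps"  # made-up example string
first_number <- as.numeric(str_extract(schedule, "\\d+"))

# A lookahead keeps only digits that are followed by the word "steps"
step_count <- as.numeric(str_extract(schedule, "\\d+(?=\\s*steps)"))
```

In this dataset the Step 4.1 tables show no missing intensity values, so the first point is a safeguard rather than an observed problem.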

Step 4.1: Verify categorical variable standardization

#Verify categorical variable standardization
cat("Gender distribution:\n")

## Gender distribution:

table(clean_data$gender)

## 
## Female   Male 
##  40680  39320

cat("\nGoal distribution:\n")

## 
## Goal distribution:

table(clean_data$goal)

## 
## Muscle Gain    Fat Burn 
##       41020       38980

cat("\nBMI Category distribution:\n")

## 
## BMI Category distribution:

table(clean_data$bmi_category)

## 
## Underweight      Normal  Overweight       Obese 
##       20940       19920       19840       19300

cat("\nExercise Intensity distribution:\n")

## 
## Exercise Intensity distribution:

table(clean_data$exercise_intensity)

## 
##      Low Moderate     High 
##    40240    19920    19840

cat("\nMeal Plan Type distribution:\n")

## 
## Meal Plan Type distribution:

table(clean_data$meal_plan_type)

## 
##     Balanced High Calorie  Low Calorie     Low Carb 
##        19920        20940        19300        19840
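
The intensity and meal plan counts above line up with the BMI category counts (for example, Moderate, Balanced, and Normal all have 19,920 rows), which suggests that each BMI category may map to a single recommendation. A cross-tabulation can confirm or refute this. The sketch below uses a hypothetical miniature of `clean_data`; `is_deterministic` is an illustrative name.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature of clean_data
toy <- tibble(
  bmi_category = c("Underweight", "Normal", "Overweight", "Obese", "Normal", "Obese"),
  exercise_intensity = c("Low", "Moderate", "High", "Low", "Moderate", "Low")
)

# Cross-tabulate the input against the derived label
mapping <- toy %>%
  count(bmi_category, exercise_intensity)

# How many distinct intensity labels each BMI category receives
labels_per_bmi <- mapping %>%
  count(bmi_category, name = "n_labels")

# TRUE when every BMI category maps to exactly one intensity
is_deterministic <- all(labels_per_bmi$n_labels == 1)

mapping
is_deterministic
```

If the same check returns TRUE on the full data, differences in intensity or meal plan between groups are a restatement of differences in BMI mix.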

Step 4.2: Create comprehensive summary tables

# Summary 1: Goals by Gender and BMI - Percentage of Total
goal_summary <- clean_data %>%
  count(gender, bmi_category, goal) %>%
  ungroup() %>%  # Remove previous grouping
  mutate(percentage_of_total = round(n / sum(n) * 100, 2)) %>% # Percentage of overall total
  arrange(gender, bmi_category, desc(n))
  
cat("Goals by Gender and BMI Category (Percentage of Total):\n")

## Goals by Gender and BMI Category (Percentage of Total):

print(goal_summary, n = Inf)

## # A tibble: 16 × 5
##    gender bmi_category goal            n percentage_of_total
##    <fct>  <fct>        <fct>       <int>               <dbl>
##  1 Female Underweight  Fat Burn     5480                6.85
##  2 Female Underweight  Muscle Gain  5200                6.5 
##  3 Female Normal       Muscle Gain  5440                6.8 
##  4 Female Normal       Fat Burn     4800                6   
##  5 Female Overweight   Muscle Gain  4940                6.18
##  6 Female Overweight   Fat Burn     4860                6.08
##  7 Female Obese        Muscle Gain  5020                6.28
##  8 Female Obese        Fat Burn     4940                6.18
##  9 Male   Underweight  Muscle Gain  5260                6.58
## 10 Male   Underweight  Fat Burn     5000                6.25
## 11 Male   Normal       Muscle Gain  4880                6.1 
## 12 Male   Normal       Fat Burn     4800                6   
## 13 Male   Overweight   Muscle Gain  5460                6.82
## 14 Male   Overweight   Fat Burn     4580                5.73
## 15 Male   Obese        Muscle Gain  4820                6.02
## 16 Male   Obese        Fat Burn     4520                5.65
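
Percentage of total answers the question "how large is each cell?", but comparing the two goals within a BMI category is easier in wide form. The sketch below reshapes the female rows copied from the table above; `pct_muscle_gain` is an illustrative column name.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Female rows copied from the goal_summary output above
female_counts <- tribble(
  ~bmi_category, ~goal,         ~n,
  "Underweight", "Fat Burn",    5480,
  "Underweight", "Muscle Gain", 5200,
  "Normal",      "Muscle Gain", 5440,
  "Normal",      "Fat Burn",    4800,
  "Overweight",  "Muscle Gain", 4940,
  "Overweight",  "Fat Burn",    4860,
  "Obese",       "Muscle Gain", 5020,
  "Obese",       "Fat Burn",    4940
)

# One row per BMI category, one column per goal,
# plus the share of muscle gain within each category
goal_wide <- female_counts %>%
  pivot_wider(names_from = goal, values_from = n) %>%
  mutate(pct_muscle_gain = round(`Muscle Gain` / (`Muscle Gain` + `Fat Burn`) * 100, 1))

goal_wide
```

The same reshaping applied to `goal_summary` (after dropping `percentage_of_total`) would cover both genders.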

Step 5: Visualizations for categorical data analysis

# Visualization 1: Goals by Gender
ggplot(clean_data, aes(x = gender, fill = goal)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  geom_text(stat = 'count', aes(label = after_stat(count)), 
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Fitness Goals by Gender",
       x = "Gender", 
       y = "Count",
       fill = "Goal") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Step 5.1: Exercise Intensity by Goal

# Visualization 2: Exercise Intensity by Goal, faceted by gender
ggplot(clean_data, aes(x = goal, fill = exercise_intensity)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  labs(title = "Exercise Intensity by Fitness Goal",
       x = "Goal", 
       y = "Count",
       fill = "Intensity") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~gender)

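Muscle Gain Recommendations: Among muscle-gain rows, high intensity is recommended more often than moderate for men, while moderate strength training is recommended more often than high intensity for women
Overall Intensity: Women receive low-intensity recommendations (yoga, swimming) more often than men
Most Common Pattern: Low-intensity exercise predominates across both genders
Note: The intensity counts in Step 4.1 equal the BMI category counts (Moderate = Normal = 19,920; High = Overweight = 19,840; Low = Underweight + Obese = 40,240), which suggests these differences reflect the BMI mix of each group rather than individual preference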

Step 5.2: BMI Category Distribution by Gender

# Load scales package properly
library(scales)

# Calculate percentages first, then plot
bmi_gender_percentage <- clean_data %>%
  count(gender, bmi_category) %>%
  group_by(gender) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup()

# Now create the plot with percentages
ggplot(bmi_gender_percentage, aes(x = gender, y = percentage, fill = bmi_category)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(title = "BMI Category Distribution by Gender",
       x = "Gender", 
       y = "Percentage",
       fill = "BMI Category") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = percent, limits = c(0, 1)) +  # 0% to 100%
  geom_text(aes(label = percent(percentage)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3)

Underweight: Nearly identical rates (Female: 26.25%, Male: 26.09%)
Normal: Similar rates (Female: 25.17%, Male: 24.62%)
Overweight: Comparable percentages (Female: 24.09%, Male: 25.53%)
Obese: Small gender difference (Female: 24.48%, Male: 23.75%)
Overall: The BMI distributions of the two genders differ by at most about 1.5 percentage points in any category
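
The similarity between the two distributions can also be quantified. The sketch below rebuilds the gender by BMI table by summing the two goals in the Step 4.2 output, then computes row percentages, a chi-squared test of independence, and Cramér's V as an effect size. With 80,000 rows, even small differences can reach statistical significance, so the effect size is the more informative number. This is an illustrative check and not part of the original analysis.

```r
# Gender x BMI counts, summed over goal from the Step 4.2 table
bmi_counts <- matrix(
  c(10680, 10240,  9800, 9960,   # Female
    10260,  9680, 10040, 9340),  # Male
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    bmi_category = c("Underweight", "Normal", "Overweight", "Obese")
  )
)

# Row percentages: BMI mix within each gender
row_pct <- round(prop.table(bmi_counts, margin = 1) * 100, 2)

# Chi-squared test of independence between gender and BMI category
test <- chisq.test(bmi_counts)

# Cramer's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(unname(test$statistic) /
                    (sum(bmi_counts) * (min(dim(bmi_counts)) - 1)))

row_pct
test
cramers_v
```

A Cramér's V close to zero indicates a weak association between gender and BMI category, whatever the p-value.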

Step 5.3: Meal Plan Types by Goal and Gender

# Calculate percentages first
meal_gender_goal_percentage <- clean_data %>%
  count(gender, goal, meal_plan_type) %>%
  group_by(gender) %>%
  mutate(percentage = round(n / sum(n), 3)) %>%  # share within each gender
  ungroup()

# Now create the plot with percentages
ggplot(meal_gender_goal_percentage, aes(x = meal_plan_type, y = percentage, fill = meal_plan_type)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(title = "Meal Plan Types by Goal and Gender",
       x = "Meal Plan Type",
       y = "Percentage",
       fill = "Meal Plan Type") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = percent, limits = c(0, 1)) +  # 0% to 100%
  facet_grid(goal ~ gender) +
  geom_text(aes(label = percent(percentage)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

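Muscle Gain: Low-carb plans are the most common recommendation for men; balanced diets are the most common for women
Fat Burn: High-calorie plans are the most common recommendation for both genders
Overall: The most common meal plan differs by gender for muscle gain but not for fat burn
Note: Meal plan type counts equal BMI category counts in Step 4.1 (for example, High Calorie = Underweight = 20,940), which suggests these patterns reflect the BMI mix of each group rather than dietary preference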
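
The percentages in this plot are computed within gender, so the two goal panels for a gender together sum to 100%. If each facet panel should sum to 100% on its own, the grouping needs to include `goal` as well. A sketch of the difference on hypothetical counts (`toy_counts` is made up):

```r
library(dplyr)
library(tibble)

# Hypothetical counts in the shape produced by count(gender, goal, meal_plan_type)
toy_counts <- tribble(
  ~gender,  ~goal,         ~meal_plan_type, ~n,
  "Female", "Muscle Gain", "Balanced",      60,
  "Female", "Muscle Gain", "Low Carb",      40,
  "Female", "Fat Burn",    "Balanced",      30,
  "Female", "Fat Burn",    "Low Carb",      70
)

# Share within gender only: both goal panels together sum to 1
by_gender <- toy_counts %>%
  group_by(gender) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup()

# Share within gender and goal: each facet panel sums to 1 on its own
by_panel <- toy_counts %>%
  group_by(gender, goal) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup()

by_gender
by_panel
```

Which normalization is appropriate depends on the question: the first compares cells across the whole gender, the second compares meal plans within one goal.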

Step 5.4: Step Counts by Goal and Intensity

step_heatmap <- clean_data %>%
  group_by(exercise_intensity, goal) %>%
  summarise(mean_steps = mean(steps, na.rm = TRUE), .groups = 'drop')

ggplot(step_heatmap, aes(x = exercise_intensity, y = goal, fill = mean_steps)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = round(mean_steps)), color = "white", fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      name = "Average Steps") +
  labs(title = "Average Step Count by Exercise Intensity and Goal",
       x = "Exercise Intensity", 
       y = "Goal") +
  theme_minimal()

Summary

This analysis of the Exercise Schedule & Meal Plan dataset revealed several key insights into fitness patterns across gender, BMI categories, and fitness goals:

Key Findings:

Gender-Based Exercise Patterns: Recommended exercise intensity differed somewhat by gender. Among muscle-gain rows, high-intensity workouts were more common than moderate ones for men, while moderate strength training was more common than high-intensity work for women. Low-intensity activities such as yoga and swimming were the most common recommendation for both genders.

BMI Distribution Consistency: The analysis revealed remarkably similar BMI distributions across genders, with nearly identical proportions in underweight (~26%), overweight (~24-25%), and obese (~24%) categories for both males and females.

Goal-Oriented Meal Planning: For muscle gain, low-carb plans were the most common recommendation for men and balanced diets for women. For fat burn, high-calorie plans were the most common recommendation for both genders.

Methodological Strengths:

Standardized and cleaned categorical variables from raw text data
Derived exercise intensity and step counts from the free-text exercise schedule
Created comprehensive visualizations showing proportional relationships
Maintained data integrity through systematic transformation pipelines

Future Work

1. Advanced Analytics

Machine Learning Applications: Develop recommendation systems using multi-output classification to suggest personalized exercise and meal plans
Cluster Analysis: Identify distinct user segments based on gender, BMI, and goal combinations for targeted fitness programs
Predictive Modeling: Build models to predict optimal exercise intensity and meal plans for new users

2. Data Enhancement

Temporal Analysis: Incorporate time-series data to track fitness progress and plan effectiveness over time
Nutritional Deep Dive: Expand meal plan analysis with detailed nutritional information (macronutrients, calories)
Exercise Specificity: Categorize exercises by type (cardio, strength, flexibility) and muscle groups targeted

3. Expanded Research Questions

Cultural & Geographic Factors: Investigate how exercise and diet preferences vary across different demographics
Seasonal Patterns: Analyze how fitness recommendations change based on seasonal variations
Age Group Analysis: Extend the dataset to include age demographics for life-stage specific recommendations

4. Practical Applications

Personalized Fitness Apps: Use findings to develop AI-driven fitness coaching applications
Healthcare Integration: Partner with healthcare providers for obesity prevention and weight management programs
Corporate Wellness: Adapt insights for workplace wellness program development

This analysis provides a foundation for building fitness recommendation systems and for further data-driven work on health and wellness.