BANA 4090 - Final Project Part 1

Introduction

Nutrition has become a growing concern as more people pay attention not only to their appearance but also to potential underlying health issues.

This project examines a dietary journal shared on Kaggle, using time series analysis to explore how different nutrient groups were proportioned in daily meals over one year. It then looks at the calorie consumption pattern, a key determinant of success for anyone trying to lose weight, applies four basic forecasting models, and draws insights about calorie intake patterns from the models' performance.

Calorie Trend

Data Preparation & Exploratory Data Analysis

Data Preparation

Prior to visualizing the data and building the models, I installed and loaded the following packages:

# Libraries
library(tidyverse)   # For data wrangling and visualization
library(tsibble)     # For tidy time series data structures
library(fable)       # For forecasting models
library(fabletools)  # For model evaluation and accuracy metrics
library(lubridate)   # For working with and manipulating date/time data
library(readr)       # For reading CSV and other delimited files
library(feasts)      # For time series decomposition and visualization tools
library(ggtime)      # For time-aware ggplot extensions
library(readxl)      # For reading Excel files
library(interactions) # For visualizing and probing interaction effects in regression models
library(pROC)        # For ROC curves and AUC metrics in classification models
library(glmnet)      # For regularized regression models
library(forecast)    # For traditional time series models and tools
library(here)        # For managing relative file paths in a reproducible way
library(RKaggle)     # For downloading datasets from Kaggle

Exploratory Data Analysis

I inspected the data, removed rows with N/As, adjusted data types, and applied dplyr data-manipulation verbs before converting the result into a tsibble for forecasting.

# Introduce dataset
food_dataset <- RKaggle::get_dataset("sangramphadke/food-and-health-data-for-one-year-day-to-day")

# Examining the dataset
dim(food_dataset)

## [1] 610  10

head(food_dataset)

# Data Cleaning
# Removing N/As
cln_food_dataset <- food_dataset %>%
  drop_na()

# Converting Date to the Date type, then deriving new columns with dplyr verbs (mutate, group_by, summarise, arrange)
cln_food_dataset$Date <- as.Date(cln_food_dataset$Date)

monthly_food_csmpt <- cln_food_dataset %>%
  mutate(Month = yearmonth(Date), 
         total_nutr_csmpt = rowSums(across(c(`Protein\r\nin grams`,`Fibers\r\nin grams`,`Carbs\r\nin grams`, `Fats\r\nin grams`  ))),
         protein_csmpt_pctg = `Protein\r\nin grams`/total_nutr_csmpt*100,
         fiber_csmpt_pctg = `Fibers\r\nin grams`/total_nutr_csmpt*100,
         carb_csmpt_pctg = `Carbs\r\nin grams`/total_nutr_csmpt*100,
         fat_csmpt_pctg = `Fats\r\nin grams`/total_nutr_csmpt*100,
         status = if_else(`Calories Consumed` - `Calories Burned` > 0, "Deficit Failed", "Deficit Succeeded")
)
monthly_food_csmpt

# Aggregating the data by month; this summary is used to create the tsibble later on
monthly_food_csmpt_smr <- monthly_food_csmpt %>%
  group_by(Month) %>%
  summarise(
    avg_calories_csmpt = mean(`Calories Consumed`, na.rm = TRUE),
    avg_protein_csmpt_pctg = mean(protein_csmpt_pctg, na.rm = TRUE),
    avg_fiber_csmpt_pctg = mean(fiber_csmpt_pctg, na.rm = TRUE),
    avg_carb_csmpt_pctg = mean(carb_csmpt_pctg, na.rm = TRUE),
    avg_fat_csmpt_pctg = mean(fat_csmpt_pctg, na.rm = TRUE)
  ) %>%
  arrange(Month)

monthly_food_csmpt_smr
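
To make the derived columns and the monthly aggregation concrete, here is a minimal sketch on made-up numbers. The data frame `toy`, its short column names, and every value in it are hypothetical stand-ins for the Kaggle columns. Base R is used only so that the sketch runs without the packages loaded earlier.

```r
# Hypothetical data: four days across two months (values invented for illustration)
toy <- data.frame(
  date     = as.Date(c("2023-01-30", "2023-01-31", "2023-02-01", "2023-02-02")),
  protein  = c(50, 80, 60, 40),
  fiber    = c(10, 20, 20, 10),
  carbs    = c(100, 60, 80, 110),
  fats     = c(40, 40, 40, 40),
  consumed = c(2000, 1600, 1900, 2100),
  burned   = c(1800, 1900, 1900, 2000)
)

# Share of each nutrient in the day's total grams, in percent
macro_cols <- c("protein", "fiber", "carbs", "fats")
macros <- as.matrix(toy[macro_cols])
shares <- macros / rowSums(macros) * 100

# A day is a failed deficit only when intake exceeds calories burned
toy$status <- ifelse(toy$consumed - toy$burned > 0,
                     "Deficit Failed", "Deficit Succeeded")

# Monthly average of daily calories consumed
toy$month <- format(toy$date, "%Y-%m")
monthly_avg <- aggregate(consumed ~ month, data = toy, FUN = mean)

print(round(shares, 1))
print(toy[c("date", "status")])
print(monthly_avg)
```

Under this rule, a day on which intake exactly equals calories burned (the third toy row) is labeled "Deficit Succeeded", which mirrors the `> 0` condition used in the cleaning code.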

Forecasting

I fit four benchmark models (mean, naive, seasonal naive, and drift) to the monthly average calorie series, generated 12-month forecasts from each, and compared the models on their training-set error metrics (RMSE, MAE, and MAPE). I then examined the autocorrelation (ACF) and partial autocorrelation (PACF) plots to assess whether the series shows seasonality.

Tsibble

# Converting the data frame to a tsibble
monthly_food_tsibble <- monthly_food_csmpt_smr %>%
  as_tsibble(index = Month)

monthly_food_tsibble
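
Methods such as the seasonal naive model assume a regularly spaced series, so it can be worth confirming that no month disappeared when rows with N/As were dropped. The following is a minimal base R sketch of such a check. `missing_months()` is a hypothetical helper, not part of any package, and the dates are made up.

```r
# Hypothetical helper: list calendar months that have no observations
missing_months <- function(dates) {
  observed <- unique(format(dates, "%Y-%m"))
  first <- as.Date(paste0(min(observed), "-01"))
  last  <- as.Date(paste0(max(observed), "-01"))
  # Every month between the first and last observed month
  expected <- format(seq(first, last, by = "month"), "%Y-%m")
  setdiff(expected, observed)
}

# Made-up dates with March missing
dates <- as.Date(c("2023-01-15", "2023-02-03", "2023-04-20"))
print(missing_months(dates))
```

On the tsibble itself, `has_gaps()` and `fill_gaps()` from the tsibble package serve a similar purpose.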

Forecast Models

# Defining 4 models
food_models <- monthly_food_tsibble %>%
  # Fit models
  model(
    mean_model = MEAN(avg_calories_csmpt),
    naive_model = NAIVE(avg_calories_csmpt),
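    snaive_model = SNAIVE(avg_calories_csmpt), # Seasonal naive; the lag defaults to 12 for monthly data
    # NOTE: RW() without a drift() term is identical to NAIVE();
    # a true drift model would be RW(avg_calories_csmpt ~ drift())
    drift_model = RW(avg_calories_csmpt)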
  )

# Generating the forecasts
food_fc <- food_models %>%
  forecast(h = 12, level = c(80, 95))

# Plotting the forecasts (point forecasts only; level = NULL hides the intervals)
food_fc %>%
  autoplot(
    monthly_food_tsibble,
    level = NULL
  ) +
  labs(
    title = "Calorie Consumption Forecasts (12 Months)",
    y = "Average Daily Calories Consumed",
    x = "Month"
  ) +
  theme_minimal()

# The drift_model forecast coincides exactly with the naive_model forecast because
# RW() was specified without a drift() term, so only one of the two lines is visible
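
The four benchmark methods are simple enough to compute by hand, which helps when reading the plot. The sketch below is a base R illustration on a made-up series, not the fable implementation: `benchmark_forecasts()` is a hypothetical helper, and the series `y` and seasonal period `m = 4` are assumptions chosen to keep the example short. It also shows that the drift forecast differs from the naive one only through the slope term, so a random walk with no drift term reproduces the naive forecast.

```r
# Hypothetical helper: point forecasts for the four benchmark methods
# y: numeric series, h: forecast horizon, m: seasonal period
benchmark_forecasts <- function(y, h, m) {
  n <- length(y)
  steps <- seq_len(h)
  # Average change per period between the first and last observation
  slope <- (y[n] - y[1]) / (n - 1)
  data.frame(
    h      = steps,
    mean   = rep(mean(y), h),                    # historical average
    naive  = rep(y[n], h),                       # last observed value
    snaive = y[n - m + ((steps - 1) %% m) + 1],  # same season, last cycle
    drift  = y[n] + steps * slope                # last value plus trend
  )
}

y  <- c(10, 12, 11, 13, 12, 14, 13, 16)  # made-up series
fc <- benchmark_forecasts(y, h = 4, m = 4)
print(fc)
```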

Error Metrics

# Inspect the structure of the output
food_models_errors <- accuracy(food_models)
glimpse(food_models_errors)  # or use names(food_models_errors)

## Rows: 4
## Columns: 10
## $ .model <chr> "mean_model", "naive_model", "snaive_model", "drift_model"
## $ .type  <chr> "Training", "Training", "Training", "Training"
## $ ME     <dbl> 5.684342e-14, 1.948727e+01, 3.242438e+02, 1.948727e+01
## $ RMSE   <dbl> 236.9865, 241.5370, 389.7010, 241.5370
## $ MAE    <dbl> 151.9816, 158.5750, 324.2438, 158.5750
## $ MPE    <dbl> -1.7717347, 0.3487289, 16.5692377, 0.3487289
## $ MAPE   <dbl> 8.851065, 8.945477, 16.569238, 8.945477
## $ MASE   <dbl> 0.4687263, 0.4890612, 1.0000000, 0.4890612
## $ RMSSE  <dbl> 0.6081239, 0.6198009, 1.0000000, 0.6198009
## $ ACF1   <dbl> 0.4760323, -0.4949918, -0.4560629, -0.4949918

# Comparing key error metrics, sorted by RMSE (any_of() skips any column that is absent)
food_models_errors %>%
  select(any_of(c(".model", "RMSE", "MAE", "MAPE"))) %>%
  arrange(RMSE)
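
For reference, the main metrics in the table can be reproduced from their definitions. The sketch below is a base R illustration on a made-up series with an assumed seasonal period of 4, and `error_metrics()` is a hypothetical helper rather than the fabletools code. It scales MASE by the in-sample seasonal naive error, which, as far as I understand, is how `accuracy()` scales it and would explain why the seasonal naive model shows a MASE of exactly 1 in the output above.

```r
# Hypothetical helper: training-set error metrics from actual and fitted values
# m: seasonal period used to scale MASE
error_metrics <- function(actual, fitted, m = 1) {
  e  <- actual - fitted
  ok <- !is.na(e)                               # skip periods with no fitted value
  scale <- mean(abs(diff(actual, lag = m)))     # in-sample seasonal naive MAE
  c(
    RMSE = sqrt(mean(e[ok]^2)),
    MAE  = mean(abs(e[ok])),
    MAPE = mean(abs(e[ok] / actual[ok])) * 100,
    MASE = mean(abs(e[ok])) / scale
  )
}

y <- c(10, 12, 11, 13, 12, 14, 13, 16)          # made-up series

# Fitted values: the overall mean, and the value from four periods earlier
mean_fit   <- rep(mean(y), length(y))
snaive_fit <- c(rep(NA, 4), y[1:4])

m_mean   <- error_metrics(y, mean_fit, m = 4)
m_snaive <- error_metrics(y, snaive_fit, m = 4)
print(round(rbind(mean = m_mean, snaive = m_snaive), 3))
```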

# Assessing autocorrelation to check whether the data show seasonality
ts_data <- ts(monthly_food_tsibble$avg_calories_csmpt, frequency = 12)
acf(ts_data, main = "ACF: Autocorrelation")

pacf(ts_data, main = "PACF: Partial Autocorrelation")

# The autocorrelation weakens as the lag increases, indicating that seasonality is weak
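
As a reading aid for the ACF plot, the sketch below computes the lag-k sample autocorrelation from its definition and checks it against base R's `acf()`. The helper `sample_acf()` and the series `x` are hypothetical. The last line reproduces the approximate 95% bounds that `acf()` draws by default; bars that stay inside these bounds are not distinguishable from white noise.

```r
# Hypothetical helper: lag-k sample autocorrelation from its definition
sample_acf <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)                               # deviations from the mean
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

x <- c(10, 12, 11, 13, 12, 14, 13, 16)           # made-up series

by_hand  <- sapply(1:3, function(k) sample_acf(x, k))
built_in <- acf(x, lag.max = 3, plot = FALSE)$acf[2:4]  # element 1 is lag 0

# Approximate 95% bounds under the white-noise assumption
bound <- qnorm(0.975) / sqrt(length(x))

print(round(by_hand, 3))
print(round(bound, 3))
```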

Conclusion

This project shows that simple forecasting models can yield useful insights. The mean model had the lowest RMSE, MAE, and MAPE on the training data, which suggests that average monthly calorie intake fluctuated around a stable level with no strong trend or seasonality. That kind of consistency in daily calorie intake is an important part of the weight-loss process. By applying time series analysis, the project shows how predictive tools can support the formation of good habits, not just by anticipating behavior but by informing it with clarity and purpose.

BANA 4090 - Final Project Part 1 - Linh Le