Nutrition has become a growing concern as more people pay attention
not only to their appearance but also to potential underlying health
issues.
This project therefore examines a dietary journey shared on Kaggle,
using time series analysis to explore how different nutrient groups
were proportioned in daily meals throughout one year. It then looks at
the calorie consumption pattern, a key determinant of success for
anyone targeting weight loss, applies four basic forecasting models,
and derives insights into calorie intake from the models' performance.
Prior to visualizing the data and building the models, I installed
and loaded the following packages:
# Libraries
library(tidyverse) # For data wrangling and visualization
library(tsibble) # For tidy time series data structures
library(fable) # For forecasting models
library(fabletools) # For model evaluation and accuracy metrics
library(lubridate) # For working with and manipulating date/time data
library(readr) # For reading CSV and other delimited files
library(feasts) # For time series decomposition and visualization tools
library(ggtime) # For time-aware ggplot extensions
library(readxl) # For reading Excel files
library(interactions) # For visualizing and probing interaction effects in regression models
library(pROC) # For ROC curves and AUC metrics in classification models
library(glmnet) # For regularized regression models
library(forecast) # For traditional time series models and tools
library(here) # For managing relative file paths in a reproducible way
library(RKaggle) # For downloading datasets from Kaggle
I inspected the data, removed rows with missing values (NAs), adjusted
data types, and applied dplyr data manipulation verbs before turning it
into a tsibble for forecasting.
# Introduce dataset
food_dataset <- RKaggle::get_dataset("sangramphadke/food-and-health-data-for-one-year-day-to-day")
# Examining the dataset
dim(food_dataset)
## [1] 610 10
head(food_dataset)
# Data Cleaning
# Removing N/As
cln_food_dataset <- food_dataset %>%
drop_na()
# Converting data types and deriving variables with dplyr verbs (mutate, group_by, summarise, arrange)
cln_food_dataset$Date <- as.Date(cln_food_dataset$Date)
monthly_food_csmpt <- cln_food_dataset %>%
mutate(Month = yearmonth(Date),
total_nutr_csmpt = rowSums(across(c(`Protein\r\nin grams`,`Fibers\r\nin grams`,`Carbs\r\nin grams`, `Fats\r\nin grams` ))),
protein_csmpt_pctg = `Protein\r\nin grams`/total_nutr_csmpt*100,
fiber_csmpt_pctg = `Fibers\r\nin grams`/total_nutr_csmpt*100,
carb_csmpt_pctg = `Carbs\r\nin grams`/total_nutr_csmpt*100,
fat_csmpt_pctg = `Fats\r\nin grams`/total_nutr_csmpt*100,
status = if_else(`Calories Consumed` - `Calories Burned` > 0, "Deficit Failed", "Deficit Succeeded")
)
monthly_food_csmpt
# Aggregating the data by month; this summary is used to create the tsibble later on
monthly_food_csmpt_smr <- monthly_food_csmpt %>%
group_by(Month) %>%
summarise(
avg_calories_csmpt = mean(`Calories Consumed`, na.rm = TRUE),
avg_protein_csmpt_pctg = mean(protein_csmpt_pctg, na.rm = TRUE),
avg_fiber_csmpt_pctg = mean(fiber_csmpt_pctg, na.rm = TRUE),
avg_carb_csmpt_pctg = mean(carb_csmpt_pctg, na.rm = TRUE),
avg_fat_csmpt_pctg = mean(fat_csmpt_pctg, na.rm = TRUE)
) %>%
arrange(Month)
monthly_food_csmpt_smr
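The raw column names contain embedded line breaks (`\r\n`), which is why they must be back-quoted in every call above. As an optional convenience, a minimal sketch of renaming them up front is shown here. The `toy` tibble and its values are made up purely for illustration and only mimic the column names; they are not taken from the Kaggle dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row sample that mimics the raw column names (not real data)
toy <- tibble(
  `Protein\r\nin grams` = c(60, 75),
  `Carbs\r\nin grams` = c(200, 180)
)

# Replace each run of whitespace (including "\r\n") with "_" and lowercase the result
toy_clean <- toy %>%
  rename_with(~ str_to_lower(str_replace_all(.x, "\\s+", "_")))

names(toy_clean)
```

With names like these, later steps could refer to `protein_in_grams` directly instead of the back-quoted originals.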
With the monthly summary in hand, I converted it to a tsibble and fit
four benchmark forecasting models to the average calories consumed:
mean, naive, seasonal naive, and drift. I then generated 12-month
forecasts, plotted them against the observed series, and compared the
models on their training-set errors (RMSE, MAE, and MAPE). Finally, I
checked the ACF and PACF of the series to gauge how much seasonality is
present.
# Converting the data frame to a tsibble
monthly_food_tsibble <- monthly_food_csmpt_smr %>%
as_tsibble(index = Month)
monthly_food_tsibble
# Defining 4 models
food_models <- monthly_food_tsibble %>%
# Fit models
model(
mean_model = MEAN(avg_calories_csmpt),
naive_model = NAIVE(avg_calories_csmpt),
snaive_model = SNAIVE(avg_calories_csmpt), # See note below
drift_model = RW(avg_calories_csmpt)
)
# Generating the forecasts
food_fc <- food_models %>%
forecast(h = 12) # fable returns full forecast distributions; use hilo() to extract 80% and 95% intervals
# Plotting the forecasts
food_fc %>%
autoplot(
monthly_food_tsibble,
level = NULL
) +
labs(
title = "Calories Consumption Forecasts (12 Months)",
y = "Calories Consumption (units)",
x = "Month"
) +
theme_minimal()
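# The drift_model forecast coincides with the naive_model forecast, so it is hidden in
# this plot. RW() without a drift() term is the same model as NAIVE(); a true drift model
# would be specified as RW(avg_calories_csmpt ~ drift()).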
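To illustrate why the two forecasts overlap, here is a small sketch comparing `NAIVE()`, `RW()` without a drift term, and `RW()` with `drift()`. The series `toy_ts` is synthetic and exists only for this illustration; the numbers are not from the Kaggle data.

```r
library(dplyr)
library(tsibble)
library(fable)

# Synthetic 12-month series with an upward trend (illustration only)
toy_ts <- tibble(
  Month = yearmonth("2023 Jan") + 0:11,
  avg_calories_csmpt = c(2000, 2050, 1980, 2100, 2150, 2080,
                         2200, 2230, 2190, 2300, 2280, 2350)
) %>%
  as_tsibble(index = Month)

toy_fit <- toy_ts %>%
  model(
    naive_model = NAIVE(avg_calories_csmpt),
    rw_model    = RW(avg_calories_csmpt),            # no drift term: same as NAIVE
    drift_model = RW(avg_calories_csmpt ~ drift())   # random walk with drift
  )

toy_fc <- toy_fit %>% forecast(h = 3)

# Point forecasts: naive_model and rw_model match, drift_model keeps trending
toy_fc %>%
  as_tibble() %>%
  select(.model, Month, .mean)
```

The drift forecast extends the straight line between the first and last observations, so it only separates from the naive forecast once the `drift()` special is included.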
# Inspect the structure of the output
food_models_errors <- accuracy(food_models)
glimpse(food_models_errors) # or use names(food_models_errors)
## Rows: 4
## Columns: 10
## $ .model <chr> "mean_model", "naive_model", "snaive_model", "drift_model"
## $ .type <chr> "Training", "Training", "Training", "Training"
## $ ME <dbl> 5.684342e-14, 1.948727e+01, 3.242438e+02, 1.948727e+01
## $ RMSE <dbl> 236.9865, 241.5370, 389.7010, 241.5370
## $ MAE <dbl> 151.9816, 158.5750, 324.2438, 158.5750
## $ MPE <dbl> -1.7717347, 0.3487289, 16.5692377, 0.3487289
## $ MAPE <dbl> 8.851065, 8.945477, 16.569238, 8.945477
## $ MASE <dbl> 0.4687263, 0.4890612, 1.0000000, 0.4890612
## $ RMSSE <dbl> 0.6081239, 0.6198009, 1.0000000, 0.6198009
## $ ACF1 <dbl> 0.4760323, -0.4949918, -0.4560629, -0.4949918
# If .model is missing, skip selecting it
food_models_errors %>%
select(any_of(c(".model", "RMSE", "MAE", "MAPE"))) %>%
arrange(RMSE)
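The errors above are training-set errors, since `accuracy()` was called on the fitted models. A complementary check would be to hold out the last few months and score the forecasts against them. The sketch below does this on a synthetic series; `toy_ts`, the 18/6 month split, and the values are all assumptions made for illustration and do not describe the Kaggle data.

```r
library(dplyr)
library(tsibble)
library(fable)

# Synthetic 24-month series (illustration only)
toy_ts <- tibble(
  Month = yearmonth("2023 Jan") + 0:23,
  avg_calories_csmpt = 2000 + 50 * sin(1:24)
) %>%
  as_tsibble(index = Month)

# Keep the first 18 months for training; the last 6 act as the test set
train <- toy_ts %>%
  filter(Month <= yearmonth("2024 Jun"))

toy_fit <- train %>%
  model(
    mean_model  = MEAN(avg_calories_csmpt),
    naive_model = NAIVE(avg_calories_csmpt)
  )

# Forecast the held-out months and score against the full series
test_acc <- toy_fit %>%
  forecast(h = 6) %>%
  accuracy(toy_ts) %>%
  select(.model, .type, RMSE, MAE, MAPE) %>%
  arrange(RMSE)

test_acc
```

Here `.type` reads "Test" rather than "Training", which makes explicit that the scores come from data the models did not see.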
# Assessing autocorrelation to check for stability and the absence of seasonality in the data
ts_data <- ts(monthly_food_tsibble$avg_calories_csmpt, frequency = 12)
acf(ts_data, main = "ACF: Autocorrelation")
pacf(ts_data, main = "PACF: Partial Autocorrelation")
# The autocorrelation shrinks as the lag increases, indicating that seasonality is weak
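The same check can be done without leaving the tsibble workflow, since feasts provides an `ACF()` function that works directly on a tsibble. The sketch below uses a synthetic series (`toy_ts`, made up for illustration) rather than the Kaggle data.

```r
library(dplyr)
library(tsibble)
library(feasts)

# Synthetic 24-month series (illustration only)
toy_ts <- tibble(
  Month = yearmonth("2023 Jan") + 0:23,
  avg_calories_csmpt = 2000 + 50 * sin(1:24)
) %>%
  as_tsibble(index = Month)

# Autocorrelations for lags 1 to 12, returned as a tidy table
toy_acf <- toy_ts %>%
  ACF(avg_calories_csmpt, lag_max = 12)

toy_acf
```

A pronounced spike at lag 12 would point to yearly seasonality, and the absence of one is consistent with a weakly seasonal series.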
This project shows that even simple forecasting models can yield
useful insights. The mean model had the lowest RMSE, MAE, and MAPE on
the training data, which suggests that monthly average calorie intake
stayed close to a stable level, the kind of consistency that supports
a weight loss process. By applying time series analysis, the project
highlights how predictive tools can support the forming of good habits,
not just by anticipating behavior but by informing it with clarity and
purpose.