library(readxl)
library(dplyr)
library(lubridate)
library(ggplot2)

path <- "C:/Users/raiss/Downloads/cereal.xlsx"

Introduction

This report analyzes a dataset of American cereals, including nutritional information, manufacturer codes, serving sizes, and product shelf-life. The objective is to clean, enrich, normalize, and analyze the data to extract business-driven insights and identify potential data quality issues.

## Task 1: Importing data and removing duplicates
  
cereal <- read_excel(
  path,
  sheet = "cereal",
  skip  = 1,
  col_names = c(
    "name", "mfr", "type",
    "calories", "protein", "fat",
    "sodium", "fiber", "carbo",
    "sugars", "potass", "vitamins",
    "cups", "manufacturing_date", "expiration_date"
  )
)

mfr_map <- read_excel(path, sheet = "mfr_map")
# Remove a linha de cabeçalho que entrou como dado
cereal <- cereal %>%
  filter(name != "name")

# Convert datas (mm/dd/yyyy)
cereal <- cereal %>%
  mutate(
    manufacturing_date = mdy(manufacturing_date),
    expiration_date    = mdy(expiration_date)
  )
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `manufacturing_date = mdy(manufacturing_date)`.
## Caused by warning:
## ! All formats failed to parse. No formats found.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
cereal  <- distinct(cereal)
mfr_map <- distinct(mfr_map)

head(cereal)
head(mfr_map)

After importing the dataset, an extra header row was removed and both manufacturing and expiration dates were converted from text strings into proper date formats. This procedure ensures data consistency and enables subsequent transformations and temporal analyses. The presence of non-standard date formats indicates a lack of structure in the original Excel file, a common characteristic of operational datasets.

## Task 2: Adding manufatured names

cereal <- cereal %>%
left_join(
mfr_map,
by = c("mfr" = "Manufacturer")
)

head(cereal)

The manufacturer column originally contained single-letter codes (e.g., K, G, N), which are not directly interpretable. A left join was applied to merge the dataset with the manufacturer reference table, adding the full manufacturer names. This enhances clarity and enables more meaningful grouping and aggregation in subsequent analyses.

## Task 3: Normalizing nutrition per cup 
  
num_cols <- c(
"calories", "protein", "fat", "sodium",
"fiber", "carbo", "sugars", "potass", "vitamins"
)

cereal_norm <- cereal %>%
mutate(
across(all_of(num_cols), as.numeric),
cups = as.numeric(cups)
) %>%
mutate(
across(all_of(num_cols), ~ .x / cups)
)

head(cereal_norm)

Different cereals are reported using different serving sizes, with some servings as small as 0.33 cups. Normalizing all nutritional values to a standard 1-cup basis enables fair and consistent comparisons across products. Without this adjustment, a cereal may appear to have lower calorie content simply because its serving size is smaller.

## Task 4 (Elective): Number of cereals not expired

today_date <- Sys.Date()

num_not_expired <- cereal_norm %>%
mutate(expiration_date = as.Date(expiration_date)) %>%
filter(expiration_date >= today_date) %>%
nrow()

num_not_expired
## [1] 0

By comparing expiration dates with the current system date, it was possible to identify which cereals remain within their valid shelf-life period. In operational contexts such as supply chain or retail management, this type of information is essential for inventory rotation, waste reduction, and effective shelf-life control.

## Task 5 (Elective): Shelf life categories

cereal_shelf <- cereal_norm %>%
mutate(
manufacturing_date = as.Date(manufacturing_date),
expiration_date    = as.Date(expiration_date),
shelf_life_days    = as.numeric(expiration_date - manufacturing_date),
shelf_life_cat     = case_when(
shelf_life_days %in% c(180, 270, 360, 500) ~ paste0(shelf_life_days, " days"),
is.na(shelf_life_days)                     ~ "missing",
TRUE                                       ~ "other"
)
)

shelf_counts <- cereal_shelf %>% count(shelf_life_cat, sort = TRUE)

shelf_counts

Shelf-life values are generally expected to fall within predefined ranges (180, 270, 360, or 500 days). Values outside these intervals may indicate data entry errors or operational inconsistencies. Entries with atypical shelf-life values or missing dates were identified, suggesting potential data quality issues. Such cases would require verification with the data engineering or manufacturing teams to ensure accuracy and resolve inconsistencies.

## Task 6 (Elective): Manufacturer with highest average calories per cup

avg_cal_by_mfr <- cereal_norm %>%
group_by(`Manufacturer Name`) %>%
summarise(mean_calories_per_cup = mean(calories, na.rm = TRUE)) %>%
arrange(desc(mean_calories_per_cup))

avg_cal_by_mfr
avg_cal_by_mfr %>% slice(1)

Aggregating nutritional information by manufacturer highlights differences in product profiles across brands. A higher average calorie density may indicate a product portfolio oriented toward sweeter or more energy-dense cereals. This metric provides useful insights into brand positioning and product strategy.

## Task 7 (Elective): Best cereal based on fiber + vitamins

cereal_scores <- cereal_norm %>%
mutate(
fiber_scaled = (fiber - min(fiber, na.rm = TRUE)) /
(max(fiber, na.rm = TRUE) - min(fiber, na.rm = TRUE)),
vitamins_scaled = (vitamins - min(vitamins, na.rm = TRUE)) /
(max(vitamins, na.rm = TRUE) - min(vitamins, na.rm = TRUE)),
score = fiber_scaled + vitamins_scaled
) %>%
arrange(desc(score))

head(cereal_scores, 10)

Fiber and vitamin values were scaled into comparable ranges and combined into a single score to produce a simplified nutritional index. This approach facilitates the identification of cereals with the highest nutrient density on a per-cup basis. The cereal ranked highest in this analysis (commonly “100% Bran”) distinguishes itself by presenting substantially higher fiber content, making it the most nutritionally advantageous option under this criterion.

## Task 8 (Elective): Manufacturer with highest calorie variation

var_cal_by_mfr <- cereal_norm %>%
group_by(`Manufacturer Name`) %>%
summarise(sd_calories_per_cup = sd(calories, na.rm = TRUE)) %>%
arrange(desc(sd_calories_per_cup))

var_cal_by_mfr
var_cal_by_mfr %>% slice(1)

The standard deviation of calorie content reflects the degree of variability within a manufacturer’s product line. High variability suggests that the brand offers a wide range of products, including both more indulgent and more health-oriented options. Low variability indicates a more homogeneous nutritional profile across the manufacturer’s portfolio.

## Task 9 (Elective): % zero fat per manufactur

pct_zero_fat <- cereal_norm %>%
group_by(`Manufacturer Name`) %>%
summarise(
total    = n(),
zero_fat = sum(fat == 0, na.rm = TRUE),
pct_zero_fat = 100 * zero_fat / total
) %>%
arrange(desc(pct_zero_fat))

pct_zero_fat

The percentage of zero-fat cereals produced by each manufacturer provides an indication of how product lines are positioned with respect to “healthy” or “light” offerings. This type of analysis supports the identification of marketing positioning strategies and product category focus within each brand’s portfolio.

## Task 10 (Elective): Distribution of type by manufacturer

type_dist <- cereal_norm %>% count(`Manufacturer Name`, type)

type_dist
ggplot(type_dist, aes(x = `Manufacturer Name`, y = n, fill = type)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(title = "Distribution of Cold/Hot Cereal Types by Manufacturer") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The final visualization compares the number of cold (C) and hot (H) cereals produced by each manufacturer. Most major brands show a strong concentration in the cold cereal segment, with relatively few products categorized as hot cereals. This distribution indicates a market predominantly oriented toward ready-to-eat products rather than cookable alternatives.

Final Considerations

This analysis demonstrates a complete workflow for transforming a raw operational dataset into structured, interpretable, and actionable information. The process encompassed data importation, cleaning, enrichment through reference tables, normalization of nutritional values, descriptive analytics, and visualization of key patterns.

The results highlight differences in nutritional profiles across manufacturers, identify products with higher nutrient density, reveal variability within brand portfolios, and expose inconsistencies in shelf-life data that may require further validation.

Overall, the workflow reflects standard practices in data management and analytical processing, illustrating how structured methodologies can support informed decision-making in product evaluation, supply-chain monitoring, and market assessment.