Data sources

FAO and World Bank (2025), using data and methods from Bai et al. (2024)

This project examines global grocery inflation in 2025–2026 using a dataset that compares food prices across cities, countries, and regions. The topic is meaningful to me because food prices affect everyday life, and inflation changes what people can afford from place to place. The dataset includes both categorical and quantitative variables, which makes it useful for comparing prices across groups and over time. The categorical variables I plan to use are City, Country, Region, and Food Item. The quantitative variables are Price (USD) and Basket Cost, and Month serves as the time variable (it is stored as text, for example "2025-10"). These variables will help me explore how grocery prices differ by location, by type of item, and across time. According to the dataset description, the data tracks 14 staples across 122 cities in 80 countries over a 6-month time series and was compiled from sources including Numbeo, FAO, USDA, World Bank/IMF, and ICO. Before visualization and modeling, I will clean the data by checking variable names, correcting data types, removing rows with missing values only in the variables I use, standardizing text labels, and selecting a smaller analytic subset.
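As a quick consistency check on the dataset description: if each of the 14 items is recorded in each of the 122 cities for each of the 6 months, the file should contain 14 × 122 × 6 rows. The sketch below does that arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Figures quoted from the dataset description
n_items  <- 14   # staple food items
n_cities <- 122  # cities
n_months <- 6    # months in the time series

# If every city reports every item in every month, the file should have
# one row per city-item-month combination.
expected_rows <- n_items * n_cities * n_months
expected_rows  # 10248
```

This matches the 10248 rows that read_csv() reports when the file is loaded, which suggests the file has no missing city-item-month combinations.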

# These libraries help me load, clean, graph, and interact with the data
library(readr) #imports the CSV file
library(dplyr) #helps clean and organize the data
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #creates graphs
library(plotly) #adds interactivity
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(maps) #provides the world map outlines used by map_data()
library(stringr) #helps clean text.

# Load the grocery inflation dataset
setwd("/Users/precious/Downloads/DATASETS") # folder where dataset is stored
grocery <- read_csv("global_grocery_inflation_2025_2026.csv")
## Rows: 10248 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (15): City, Country, ISO_Country_Code, Region, Continent, Month, Month_...
## dbl  (11): Quantity, Price_Local, Price_USD, Exchange_Rate, YoY_Inflation_Es...
## date  (1): Data_Collection_Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#This code helps me understand the dataset before cleaning it. I can see the first few rows, the names of the variables, and whether each variable is categorical or quantitative.

# View the first few rows
head(grocery)
## # A tibble: 6 × 27
##   City     Country      ISO_Country_Code Region Continent Month Month_Name Item 
##   <chr>    <chr>        <chr>            <chr>  <chr>     <chr> <chr>      <chr>
## 1 New York United Stat… USA              North… North Am… 2025… October 2… Milk…
## 2 New York United Stat… USA              North… North Am… 2025… November … Milk…
## 3 New York United Stat… USA              North… North Am… 2025… December … Milk…
## 4 New York United Stat… USA              North… North Am… 2026… January 2… Milk…
## 5 New York United Stat… USA              North… North Am… 2026… February … Milk…
## 6 New York United Stat… USA              North… North Am… 2026… March 2026 Milk…
## # ℹ 19 more variables: Item_Key <chr>, Item_Category <chr>, Quantity <dbl>,
## #   Unit <chr>, Price_Local <dbl>, Currency_Local <chr>, Price_USD <dbl>,
## #   Exchange_Rate <dbl>, YoY_Inflation_Estimate_Pct <dbl>,
## #   Inflation_Source <chr>, FAO_Index_Value <dbl>, FAO_Index_Date <chr>,
## #   FAO_YoY_Change_Pct <dbl>, USDA_All_Food_Forecast_Pct <dbl>,
## #   USDA_Food_At_Home_Pct <dbl>, Data_Collection_Date <date>, Source_URL <chr>,
## #   Population_Estimate <dbl>, Breakfast_Basket_USD <dbl>
# Check the variable names
names(grocery)
##  [1] "City"                       "Country"                   
##  [3] "ISO_Country_Code"           "Region"                    
##  [5] "Continent"                  "Month"                     
##  [7] "Month_Name"                 "Item"                      
##  [9] "Item_Key"                   "Item_Category"             
## [11] "Quantity"                   "Unit"                      
## [13] "Price_Local"                "Currency_Local"            
## [15] "Price_USD"                  "Exchange_Rate"             
## [17] "YoY_Inflation_Estimate_Pct" "Inflation_Source"          
## [19] "FAO_Index_Value"            "FAO_Index_Date"            
## [21] "FAO_YoY_Change_Pct"         "USDA_All_Food_Forecast_Pct"
## [23] "USDA_Food_At_Home_Pct"      "Data_Collection_Date"      
## [25] "Source_URL"                 "Population_Estimate"       
## [27] "Breakfast_Basket_USD"
# Check the structure of the dataset
str(grocery)
## spc_tbl_ [10,248 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ City                      : chr [1:10248] "New York" "New York" "New York" "New York" ...
##  $ Country                   : chr [1:10248] "United States" "United States" "United States" "United States" ...
##  $ ISO_Country_Code          : chr [1:10248] "USA" "USA" "USA" "USA" ...
##  $ Region                    : chr [1:10248] "North America" "North America" "North America" "North America" ...
##  $ Continent                 : chr [1:10248] "North America" "North America" "North America" "North America" ...
##  $ Month                     : chr [1:10248] "2025-10" "2025-11" "2025-12" "2026-01" ...
##  $ Month_Name                : chr [1:10248] "October 2025" "November 2025" "December 2025" "January 2026" ...
##  $ Item                      : chr [1:10248] "Milk (1 Liter)" "Milk (1 Liter)" "Milk (1 Liter)" "Milk (1 Liter)" ...
##  $ Item_Key                  : chr [1:10248] "Milk_1L" "Milk_1L" "Milk_1L" "Milk_1L" ...
##  $ Item_Category             : chr [1:10248] "Dairy" "Dairy" "Dairy" "Dairy" ...
##  $ Quantity                  : num [1:10248] 1 1 1 1 1 1 500 500 500 500 ...
##  $ Unit                      : chr [1:10248] "liter" "liter" "liter" "liter" ...
##  $ Price_Local               : num [1:10248] 1.32 1.33 1.34 1.3 1.35 1.33 4.38 4.41 4.53 4.62 ...
##  $ Currency_Local            : chr [1:10248] "USD" "USD" "USD" "USD" ...
##  $ Price_USD                 : num [1:10248] 1.32 1.33 1.34 1.3 1.35 1.33 4.38 4.41 4.53 4.62 ...
##  $ Exchange_Rate             : num [1:10248] 1 1 1 1 1 1 1 1 1 1 ...
##  $ YoY_Inflation_Estimate_Pct: num [1:10248] 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 ...
##  $ Inflation_Source          : chr [1:10248] "USDA Food Price Outlook / IMF WEO 2026" "USDA Food Price Outlook / IMF WEO 2026" "USDA Food Price Outlook / IMF WEO 2026" "USDA Food Price Outlook / IMF WEO 2026" ...
##  $ FAO_Index_Value           : num [1:10248] 127 126 126 124 125 ...
##  $ FAO_Index_Date            : chr [1:10248] "October 2025" "November 2025" "December 2025" "January 2026" ...
##  $ FAO_YoY_Change_Pct        : num [1:10248] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ USDA_All_Food_Forecast_Pct: num [1:10248] 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 ...
##  $ USDA_Food_At_Home_Pct     : num [1:10248] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Data_Collection_Date      : Date[1:10248], format: "2026-03-20" "2026-03-20" ...
##  $ Source_URL                : chr [1:10248] "https://www.numbeo.com/food-prices/in/New-York" "https://www.numbeo.com/food-prices/in/New-York" "https://www.numbeo.com/food-prices/in/New-York" "https://www.numbeo.com/food-prices/in/New-York" ...
##  $ Population_Estimate       : num [1:10248] 8336817 8336817 8336817 8336817 8336817 ...
##  $ Breakfast_Basket_USD      : num [1:10248] 16.2 16.4 17 16.8 16.8 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   City = col_character(),
##   ..   Country = col_character(),
##   ..   ISO_Country_Code = col_character(),
##   ..   Region = col_character(),
##   ..   Continent = col_character(),
##   ..   Month = col_character(),
##   ..   Month_Name = col_character(),
##   ..   Item = col_character(),
##   ..   Item_Key = col_character(),
##   ..   Item_Category = col_character(),
##   ..   Quantity = col_double(),
##   ..   Unit = col_character(),
##   ..   Price_Local = col_double(),
##   ..   Currency_Local = col_character(),
##   ..   Price_USD = col_double(),
##   ..   Exchange_Rate = col_double(),
##   ..   YoY_Inflation_Estimate_Pct = col_double(),
##   ..   Inflation_Source = col_character(),
##   ..   FAO_Index_Value = col_double(),
##   ..   FAO_Index_Date = col_character(),
##   ..   FAO_YoY_Change_Pct = col_double(),
##   ..   USDA_All_Food_Forecast_Pct = col_double(),
##   ..   USDA_Food_At_Home_Pct = col_double(),
##   ..   Data_Collection_Date = col_date(format = ""),
##   ..   Source_URL = col_character(),
##   ..   Population_Estimate = col_double(),
##   ..   Breakfast_Basket_USD = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#This code removes rows with missing values in the variables I need and limits the dataset to 800 observations or fewer. This creates a cleaner analytic dataset and follows the project requirement.

# Clean the data, keep rows with complete values, and create a smaller dataset for analysis
grocery_small <- grocery %>%
  mutate(
    City = str_trim(as.character(City)),
    Country = str_trim(as.character(Country)),
    Region = str_trim(as.character(Region)),
    Continent = str_trim(as.character(Continent)),
    Item = str_trim(as.character(Item)),
    Item_Category = str_trim(as.character(Item_Category)),
    Month = as.character(Month),
    Price_USD = as.numeric(Price_USD),
    Breakfast_Basket_USD = as.numeric(Breakfast_Basket_USD),
    YoY_Inflation_Estimate_Pct = as.numeric(YoY_Inflation_Estimate_Pct)
  ) %>%
  filter(
    !is.na(City),
    !is.na(Country),
    !is.na(Region),
    !is.na(Continent),
    !is.na(Item),
    !is.na(Item_Category),
    !is.na(Month),
    !is.na(Price_USD),
    !is.na(Breakfast_Basket_USD),
    !is.na(YoY_Inflation_Estimate_Pct)
  ) %>%
  slice_head(n = 800)

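#This code cleans and prepares the dataset for analysis. The mutate() function sets the variable types explicitly and removes extra spaces from the text variables. The filter() function removes rows with missing values in the key variables. Finally, slice_head(n = 800) limits the dataset to 800 observations, which makes the data easier to manage. Because slice_head() keeps the first 800 rows in file order, the subset reflects whichever cities come first in the file rather than a random sample of all 122 cities.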
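A possible alternative to taking the first rows is to draw the rows at random. The sketch below uses a small made-up table (the cities "A" to "E" and their prices are hypothetical) to contrast dplyr's slice_head() with slice_sample(). It is not what this report used: the graphs and regression here were produced with slice_head(), so switching would change the numbers shown.

```r
library(dplyr)

# Toy data standing in for the grocery file: 5 hypothetical cities,
# 4 rows each, stored city by city.
toy <- tibble(
  City = rep(c("A", "B", "C", "D", "E"), each = 4),
  Price_USD = seq(1, 20)
)

# slice_head() keeps the first rows in file order, so only the first
# cities are represented.
first_rows <- toy %>% slice_head(n = 8)
unique(first_rows$City)  # "A" "B"

# slice_sample() draws rows at random; set.seed() makes the draw repeatable.
set.seed(123)
random_rows <- toy %>% slice_sample(n = 8)
nrow(random_rows)  # 8
```

A random draw gives every city the same chance of appearing, at the cost of results that depend on the chosen seed.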
# First exploratory graph: how prices vary by item category
ggplot(grocery_small, aes(x = Item_Category, y = Price_USD, fill = Item_Category)) +
  geom_boxplot() +
  labs(
    title = "Price Distribution by Item Category",
    subtitle = "Comparing variation across food groups",
    x = "Item Category",
    y = "Price (USD)",
    caption = "Source: Global Grocery Inflation dataset"
  ) +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#fde2e4", color = NA),
    panel.background = element_rect(fill = "#fff8fb", color = NA),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

#This graph shows how prices differ across item categories. A boxplot is helpful because it shows the spread, middle values, and possible outliers for each category. I used this graph to explore differences in grocery prices across product types before choosing my final visualization. Categories such as meat tend to have higher median prices, while others like vegetables and grains are generally lower. Taller boxes indicate categories whose prices vary more across the cities, items, and months in the subset. This suggests that price levels, and how much they vary, are not the same for all food types.

#I plotted item categories instead of individual food items because plotting all 14 items made the graph crowded and hard to interpret.
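To make the boxplot elements concrete, the sketch below computes the median (the middle line), the interquartile range (the height of the box), and the outliers for two categories. The prices are made up for illustration and do not come from the dataset. It uses base R's boxplot.stats(), which applies the usual 1.5 × IQR rule.

```r
# Hypothetical prices (USD) for two item categories
prices <- data.frame(
  Item_Category = rep(c("Meat", "Grains"), each = 5),
  Price_USD = c(6, 8, 9, 11, 20, 1, 2, 2, 3, 4)
)

# Median and interquartile range (IQR) for each category
medians <- tapply(prices$Price_USD, prices$Item_Category, median)
iqrs    <- tapply(prices$Price_USD, prices$Item_Category, IQR)
medians  # Grains 2, Meat 9
iqrs     # Grains 1, Meat 3

# Points more than 1.5 * IQR beyond the box are drawn as outliers
meat_prices   <- prices$Price_USD[prices$Item_Category == "Meat"]
meat_outliers <- boxplot.stats(meat_prices)$out
meat_outliers  # 20
```

In this toy example the meat box sits higher (median 9 versus 2) and is taller (IQR 3 versus 1), and the value 20 would be drawn as a separate point.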
# Find average basket cost by region

region_summary <- grocery_small %>%
  group_by(Region) %>%
  summarize(avg_basket = mean(Breakfast_Basket_USD, na.rm = TRUE))

# Plot it
ggplot(region_summary, aes(x = Region, y = avg_basket, fill = Region)) +
  geom_col() +
  labs(
    title = "Average Breakfast Basket Cost by Region",
    x = "Region",
    y = "Average Basket Cost (USD)",
    caption = "Source: Global Grocery Inflation dataset"
  ) +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#fde2e4", color = NA),
    panel.background = element_rect(fill = "#fff8fb", color = NA)
  )

#This code groups the data by region, calculates the average breakfast basket cost with the summarize() function, and displays the results in a bar chart. I used this graph to compare grocery costs across broader geographic areas and to see whether some regions were more expensive than others. It reveals clear differences in basket cost between regions, with some regions such as North America having higher average costs than others. This indicates that geographic location is related to food pricing. These differences may be influenced by factors such as income levels, supply chains, and inflation rates across regions.
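An average based on only a few rows is less reliable than one based on many, so it can help to report the number of rows next to each regional mean. The sketch below shows one way to do that with made-up values. The column n_rows and the numbers are illustrative assumptions, not results from the dataset.

```r
library(dplyr)

# Toy data: three rows for one region, one row for another
toy <- tibble(
  Region = c("North America", "North America", "North America", "Latin America"),
  Breakfast_Basket_USD = c(16, 17, 18, 9)
)

# Report the row count alongside the mean for each region
region_check <- toy %>%
  group_by(Region) %>%
  summarize(
    n_rows = n(),
    avg_basket = mean(Breakfast_Basket_USD),
    .groups = "drop"
  )
region_check
```

Here the Latin America average rests on a single row, which is a signal to interpret that bar with caution.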
# Multiple linear regression model
model1 <- lm(Breakfast_Basket_USD ~ Price_USD + YoY_Inflation_Estimate_Pct + FAO_Index_Value,
             data = grocery_small)

# Show regression results
summary(model1)
## 
## Call:
## lm(formula = Breakfast_Basket_USD ~ Price_USD + YoY_Inflation_Estimate_Pct + 
##     FAO_Index_Value, data = grocery_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4509 -1.5923 -0.7948  2.9279  4.8046 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                45.57045   15.14076   3.010 0.002697 ** 
## Price_USD                   0.06356    0.01675   3.795 0.000159 ***
## YoY_Inflation_Estimate_Pct -2.89352    0.14838 -19.500  < 2e-16 ***
## FAO_Index_Value            -0.13879    0.12021  -1.155 0.248627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.008 on 796 degrees of freedom
## Multiple R-squared:  0.3519, Adjusted R-squared:  0.3495 
## F-statistic: 144.1 on 3 and 796 DF,  p-value: < 2.2e-16
#I chose basket cost as the response variable because it represents overall grocery cost. I chose item price, the year-over-year inflation estimate, and the FAO index value as predictors because they are all quantitative variables that may help explain differences in basket cost.

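The regression model examines how breakfast basket cost is related to item price, the inflation estimate, and the FAO index. I selected these variables because they are quantitative and logically related to inflation trends. I evaluated the model using the p-values for each predictor and the adjusted R² value. Small p-values suggest that a predictor is significantly related to basket cost, while adjusted R² shows how much of the variation in basket cost is explained by the model. In the output above, Price_USD (p = 0.000159) and YoY_Inflation_Estimate_Pct (p < 2e-16) are statistically significant, while FAO_Index_Value is not (p = 0.249). The adjusted R² of 0.3495 means the model explains about 35% of the variation in basket cost. The inflation coefficient is negative (-2.89), so within this subset, rows with higher inflation estimates tend to have lower basket costs.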
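The p-values and adjusted R² can also be pulled out of the model summary directly instead of being read off the printed table. The sketch below shows how, using the built-in mtcars data as a stand-in so it runs without the grocery file. The model and variable names in it are illustrative, not part of this analysis.

```r
# mtcars ships with base R and stands in for the grocery data here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- fit_summary$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Predictors with p < 0.05 (dropping the intercept)
significant <- names(p_values)[p_values < 0.05]
significant <- setdiff(significant, "(Intercept)")

# Share of variation explained, adjusted for the number of predictors
adj_r2 <- fit_summary$adj.r.squared
```

The same two lookups, `summary(model1)$coefficients` and `summary(model1)$adj.r.squared`, would apply to the grocery model.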

# Find average price by month and region
final_summary <- grocery_small %>%
  group_by(Month, Region) %>%
  summarize(avg_price = mean(Price_USD, na.rm = TRUE), .groups = "drop")
# Final graph
ggplot(final_summary, aes(x = Month, y = avg_price, color = Region, group = Region)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3) +
  labs(
    title = "Trends in Grocery Prices Across Regions",
    subtitle = "Comparison of Average Prices Over Time (2025–2026)",
    x = "Month",
    y = "Average Price (USD)",
    color = "Region",
    caption = "Source: Global Grocery Inflation dataset"
  ) +
  theme_light() +
  scale_color_brewer(palette = "Set2") +
  theme(
    plot.background = element_rect(fill = "#fde2e4", color = NA),
    panel.background = element_rect(fill = "#fff8fb", color = NA),
    legend.position = "bottom"
  )

#This graph shows how average grocery prices change over time across regions. Different colors are used to represent different regions. I chose this as my final visualization because it clearly shows changes over time while also comparing multiple groups in the same graph. A clear pattern is that North America and Western Europe consistently have higher average prices than Latin America. 
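The Month values are stored as text such as "2025-10". Text in this format already sorts in time order, so the line graph is drawn correctly. Converting it to a real date would additionally allow date arithmetic and date axes such as ggplot2's scale_x_date(). The sketch below shows one way to convert; using the first day of each month is an assumption made only so that as.Date() can parse the value.

```r
# Month labels in the same "YYYY-MM" text format as the dataset
month_text <- c("2025-10", "2025-11", "2025-12", "2026-01", "2026-02", "2026-03")

# Append a day so as.Date() can parse the value as the first of each month
month_date <- as.Date(paste0(month_text, "-01"))

# Text in this format already sorts in time order
already_sorted <- identical(sort(month_text), month_text)

# Real dates also support date arithmetic, e.g. the number of days spanned
days_spanned <- as.numeric(diff(range(month_date)))
```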
# Turn the final graph into an interactive plot
p <- ggplot(final_summary, aes(
  x = Month,
  y = avg_price,
  color = Region,
  group = Region,
  text = paste(
    "Region:", Region,
    "<br>Month:", Month,
    "<br>Average Price:", round(avg_price, 2)
  )
)) +
  geom_line() +
  geom_point(size = 2.5) +
  labs(
    title = "Interactive Grocery Price Trends",
    x = "Month",
    y = "Average Price (USD)"
  ) +
  theme_minimal()

ggplotly(p, tooltip = "text")
#This code turns a ggplot graph into an interactive Plotly graph so that the viewer can move the mouse over the points and see extra details. I included this graph to add mouseover interactivity and to make the visualization easier to explore.

#This graph shows how average grocery prices change over time across different regions. Each line represents a region, allowing for easy comparison of trends. One clear pattern is that some regions consistently have higher prices than others, while others remain lower throughout the time period. Another observation is that prices generally change gradually rather than sharply, suggesting steady inflation rather than sudden spikes. A limitation of this graph is that it only shows averages, so it does not capture differences between specific food items. If I could improve it, I would include separate lines for different food categories to better understand what is driving these trends.
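# Average breakfast basket cost by country
# map_data("world") labels some countries differently (e.g. "USA", "UK"),
# so rename them first; otherwise the join leaves those countries blank
country_summary <- grocery_small %>%
  mutate(Country = case_when(
    Country == "United States" ~ "USA",
    Country == "United Kingdom" ~ "UK",
    TRUE ~ Country
  )) %>%
  group_by(Country) %>%
  summarize(avg_basket = mean(Breakfast_Basket_USD, na.rm = TRUE), .groups = "drop")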

# Load world map data
world_map <- map_data("world")

# Join map data with grocery data
map_joined <- left_join(world_map, country_summary, by = c("region" = "Country"))

# Create the map
ggplot(map_joined, aes(x = long, y = lat, group = group, fill = avg_basket)) +
  geom_polygon(color = "white", linewidth = 0.2) +
  coord_fixed(1.3) +
  labs(
    title = "Average Breakfast Basket Cost by Country",
    subtitle = "Global Grocery Inflation, 2025–2026",
    fill = "Basket Cost",
    caption = "Source: Global Grocery Inflation dataset"
  ) +
  theme_void() +
  scale_fill_gradient(low = "lavender", high = "purple", na.value = "gray90")

#This map shows average breakfast basket cost by country using color shading. Darker colors represent higher costs, while lighter colors represent lower costs. I included a map because the dataset is global, so a map helps show geographic patterns more clearly than a table.

This project shows how grocery prices vary across regions and over time. One clear pattern is that North America and Western Europe tend to have higher average prices than Latin America. Prices also change gradually over time, suggesting steady inflation rather than sudden spikes. The regression model further showed that item price and the inflation estimate were significantly associated with overall basket cost, while the FAO index was not, which helps describe patterns in grocery affordability.

The map adds another perspective by showing geographic differences in grocery basket costs. Darker colors represent higher costs, while lighter colors represent lower costs. It shows that countries such as the United States and Canada have higher basket costs than many other countries, so grocery prices are not evenly distributed across the world. These price differences affect how easily people can afford food, although the dataset does not include incomes, so affordability itself is not measured here. This highlights how inflation is not just an economic concept but a real-world issue that impacts daily life.

A limitation of this analysis is that the dataset does not include all countries and may contain inconsistencies in country naming, which affected the map visualization. If I could improve this project, I would include more detailed comparisons of individual food items within each country to better understand what drives these price differences.
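One way to find the naming inconsistencies mentioned above is to list the country names in the data that have no match in the map data. The sketch below does this with small hand-typed vectors. The names in data_countries are illustrative, and map_regions is a stand-in for the region column returned by map_data("world"), which labels the United States as "USA" and the United Kingdom as "UK".

```r
# Country names as they might appear in the grocery data (illustrative)
data_countries <- c("United States", "Canada", "United Kingdom", "Brazil")

# Hand-typed stand-in for the region names in the map data
map_regions <- c("USA", "Canada", "UK", "Brazil", "Mexico")

# Countries in the data that would find no match in the map
unmatched <- setdiff(data_countries, map_regions)
unmatched  # "United States" "United Kingdom"

# A lookup vector that renames only the mismatched names
fixes <- c("United States" = "USA", "United Kingdom" = "UK")
needs_fix <- data_countries %in% names(fixes)
data_countries[needs_fix] <- unname(fixes[data_countries[needs_fix]])

# After renaming, every country in the data has a match
still_unmatched <- setdiff(data_countries, map_regions)
still_unmatched  # character(0)
```

Running the same setdiff() on the real Country column and the real map regions would show which countries currently appear gray on the map.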

Overall, this project demonstrates that grocery inflation varies significantly across regions and categories, and understanding these patterns is important for analyzing global economic inequality and food accessibility.