Research Question: How are the costs of different food-group components (such as staples, animal-source foods, vegetables, and fruits) related to the total cost of a healthy diet across countries?

The data for this project come from the Cost and Affordability of a Healthy Diet (CoAHD) database published by the Food and Agriculture Organization (FAO). The raw file contains 7,067 observations of 37 variables covering many indicators. For this project, I filtered the data to focus on five cost variables: the total cost of a healthy diet and the costs of staples, animal foods, vegetables, and fruits. After cleaning and reshaping, the working dataset contains 166 observations and 7 variables. Each row represents one country, with costs labeled in PPP dollars per person per day.

Source: https://data360.worldbank.org/en/dataset/FAO_CAHD.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("C:/Users/mezni/OneDrive/Desktop/project 3 dataset")
nutrition <- read.csv("nutrition.csv")

names(nutrition)
##  [1] "STRUCTURE"              "STRUCTURE_ID"           "ACTION"                
##  [4] "FREQ"                   "FREQ_LABEL"             "REF_AREA"              
##  [7] "REF_AREA_LABEL"         "INDICATOR"              "INDICATOR_LABEL"       
## [10] "SEX"                    "SEX_LABEL"              "AGE"                   
## [13] "AGE_LABEL"              "URBANISATION"           "URBANISATION_LABEL"    
## [16] "UNIT_MEASURE"           "UNIT_MEASURE_LABEL"     "COMP_BREAKDOWN_1"      
## [19] "COMP_BREAKDOWN_1_LABEL" "COMP_BREAKDOWN_2"       "COMP_BREAKDOWN_2_LABEL"
## [22] "COMP_BREAKDOWN_3"       "COMP_BREAKDOWN_3_LABEL" "TIME_PERIOD"           
## [25] "OBS_VALUE"              "DATABASE_ID"            "DATABASE_ID_LABEL"     
## [28] "UNIT_MULT"              "UNIT_MULT_LABEL"        "UNIT_TYPE"             
## [31] "UNIT_TYPE_LABEL"        "TIME_FORMAT"            "TIME_FORMAT_LABEL"     
## [34] "OBS_STATUS"             "OBS_STATUS_LABEL"       "OBS_CONF"              
## [37] "OBS_CONF_LABEL"
head(nutrition)
##       STRUCTURE               STRUCTURE_ID ACTION FREQ FREQ_LABEL REF_AREA
## 1 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
## 2 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
## 3 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
## 4 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
## 5 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
## 6 datastructure WB.DATA360:DS_DATA360(1.2)      I    A     Annual      PAN
##   REF_AREA_LABEL     INDICATOR
## 1         Panama FAO_CAHD_7004
## 2         Panama FAO_CAHD_7004
## 3         Panama FAO_CAHD_7004
## 4         Panama FAO_CAHD_7004
## 5         Panama FAO_CAHD_7004
## 6         Panama FAO_CAHD_7004
##                                          INDICATOR_LABEL SEX SEX_LABEL AGE
## 1 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
## 2 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
## 3 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
## 4 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
## 5 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
## 6 Cost of a healthy diet (PPP dollar per person per day)  _T     Total  _T
##                               AGE_LABEL URBANISATION URBANISATION_LABEL
## 1 All age ranges or no breakdown by age           _T              Total
## 2 All age ranges or no breakdown by age           _T              Total
## 3 All age ranges or no breakdown by age           _T              Total
## 4 All age ranges or no breakdown by age           _T              Total
## 5 All age ranges or no breakdown by age           _T              Total
## 6 All age ranges or no breakdown by age           _T              Total
##   UNIT_MEASURE             UNIT_MEASURE_LABEL COMP_BREAKDOWN_1
## 1     PPP_PS_D PPP dollars per person per day               _Z
## 2     PPP_PS_D PPP dollars per person per day               _Z
## 3     PPP_PS_D PPP dollars per person per day               _Z
## 4     PPP_PS_D PPP dollars per person per day               _Z
## 5     PPP_PS_D PPP dollars per person per day               _Z
## 6     PPP_PS_D PPP dollars per person per day               _Z
##   COMP_BREAKDOWN_1_LABEL COMP_BREAKDOWN_2 COMP_BREAKDOWN_2_LABEL
## 1         Not Applicable               _Z         Not Applicable
## 2         Not Applicable               _Z         Not Applicable
## 3         Not Applicable               _Z         Not Applicable
## 4         Not Applicable               _Z         Not Applicable
## 5         Not Applicable               _Z         Not Applicable
## 6         Not Applicable               _Z         Not Applicable
##   COMP_BREAKDOWN_3 COMP_BREAKDOWN_3_LABEL TIME_PERIOD OBS_VALUE DATABASE_ID
## 1               _Z         Not Applicable        2017      4.19    FAO_CAHD
## 2               _Z         Not Applicable        2018      3.88    FAO_CAHD
## 3               _Z         Not Applicable        2019      3.78    FAO_CAHD
## 4               _Z         Not Applicable        2020      3.65    FAO_CAHD
## 5               _Z         Not Applicable        2021      3.63    FAO_CAHD
## 6               _Z         Not Applicable        2022      3.96    FAO_CAHD
##                                  DATABASE_ID_LABEL UNIT_MULT UNIT_MULT_LABEL
## 1 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
## 2 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
## 3 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
## 4 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
## 5 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
## 6 Cost and Affordability of a Healthy Diet (CoAHD)         0           Units
##   UNIT_TYPE      UNIT_TYPE_LABEL TIME_FORMAT TIME_FORMAT_LABEL OBS_STATUS
## 1    NUMBER Number (real number)         602              CCYY          A
## 2    NUMBER Number (real number)         602              CCYY          A
## 3    NUMBER Number (real number)         602              CCYY          A
## 4    NUMBER Number (real number)         602              CCYY          A
## 5    NUMBER Number (real number)         602              CCYY          A
## 6    NUMBER Number (real number)         602              CCYY          A
##   OBS_STATUS_LABEL OBS_CONF OBS_CONF_LABEL
## 1     Normal value       PU         Public
## 2     Normal value       PU         Public
## 3     Normal value       PU         Public
## 4     Normal value       PU         Public
## 5     Normal value       PU         Public
## 6     Normal value       PU         Public
str(nutrition)
## 'data.frame':    7067 obs. of  37 variables:
##  $ STRUCTURE             : chr  "datastructure" "datastructure" "datastructure" "datastructure" ...
##  $ STRUCTURE_ID          : chr  "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" ...
##  $ ACTION                : chr  "I" "I" "I" "I" ...
##  $ FREQ                  : chr  "A" "A" "A" "A" ...
##  $ FREQ_LABEL            : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ REF_AREA              : chr  "PAN" "PAN" "PAN" "PAN" ...
##  $ REF_AREA_LABEL        : chr  "Panama" "Panama" "Panama" "Panama" ...
##  $ INDICATOR             : chr  "FAO_CAHD_7004" "FAO_CAHD_7004" "FAO_CAHD_7004" "FAO_CAHD_7004" ...
##  $ INDICATOR_LABEL       : chr  "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" ...
##  $ SEX                   : chr  "_T" "_T" "_T" "_T" ...
##  $ SEX_LABEL             : chr  "Total" "Total" "Total" "Total" ...
##  $ AGE                   : chr  "_T" "_T" "_T" "_T" ...
##  $ AGE_LABEL             : chr  "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" ...
##  $ URBANISATION          : chr  "_T" "_T" "_T" "_T" ...
##  $ URBANISATION_LABEL    : chr  "Total" "Total" "Total" "Total" ...
##  $ UNIT_MEASURE          : chr  "PPP_PS_D" "PPP_PS_D" "PPP_PS_D" "PPP_PS_D" ...
##  $ UNIT_MEASURE_LABEL    : chr  "PPP dollars per person per day" "PPP dollars per person per day" "PPP dollars per person per day" "PPP dollars per person per day" ...
##  $ COMP_BREAKDOWN_1      : chr  "_Z" "_Z" "_Z" "_Z" ...
##  $ COMP_BREAKDOWN_1_LABEL: chr  "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
##  $ COMP_BREAKDOWN_2      : chr  "_Z" "_Z" "_Z" "_Z" ...
##  $ COMP_BREAKDOWN_2_LABEL: chr  "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
##  $ COMP_BREAKDOWN_3      : chr  "_Z" "_Z" "_Z" "_Z" ...
##  $ COMP_BREAKDOWN_3_LABEL: chr  "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
##  $ TIME_PERIOD           : int  2017 2018 2019 2020 2021 2022 2023 2024 2017 2019 ...
##  $ OBS_VALUE             : num  4.19 3.88 3.78 3.65 3.63 3.96 4.2 4.34 3.74 3.71 ...
##  $ DATABASE_ID           : chr  "FAO_CAHD" "FAO_CAHD" "FAO_CAHD" "FAO_CAHD" ...
##  $ DATABASE_ID_LABEL     : chr  "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" ...
##  $ UNIT_MULT             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UNIT_MULT_LABEL       : chr  "Units" "Units" "Units" "Units" ...
##  $ UNIT_TYPE             : chr  "NUMBER" "NUMBER" "NUMBER" "NUMBER" ...
##  $ UNIT_TYPE_LABEL       : chr  "Number (real number)" "Number (real number)" "Number (real number)" "Number (real number)" ...
##  $ TIME_FORMAT           : int  602 602 602 602 602 602 602 602 602 602 ...
##  $ TIME_FORMAT_LABEL     : chr  "CCYY" "CCYY" "CCYY" "CCYY" ...
##  $ OBS_STATUS            : chr  "A" "A" "A" "A" ...
##  $ OBS_STATUS_LABEL      : chr  "Normal value" "Normal value" "Normal value" "Normal value" ...
##  $ OBS_CONF              : chr  "PU" "PU" "PU" "PU" ...
##  $ OBS_CONF_LABEL        : chr  "Public" "Public" "Public" "Public" ...
summary(nutrition)
##   STRUCTURE         STRUCTURE_ID          ACTION              FREQ          
##  Length:7067        Length:7067        Length:7067        Length:7067       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   FREQ_LABEL          REF_AREA         REF_AREA_LABEL      INDICATOR        
##  Length:7067        Length:7067        Length:7067        Length:7067       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  INDICATOR_LABEL        SEX             SEX_LABEL             AGE           
##  Length:7067        Length:7067        Length:7067        Length:7067       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   AGE_LABEL         URBANISATION       URBANISATION_LABEL UNIT_MEASURE      
##  Length:7067        Length:7067        Length:7067        Length:7067       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  UNIT_MEASURE_LABEL COMP_BREAKDOWN_1   COMP_BREAKDOWN_1_LABEL
##  Length:7067        Length:7067        Length:7067           
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##  COMP_BREAKDOWN_2   COMP_BREAKDOWN_2_LABEL COMP_BREAKDOWN_3  
##  Length:7067        Length:7067            Length:7067       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##  COMP_BREAKDOWN_3_LABEL  TIME_PERIOD     OBS_VALUE         DATABASE_ID       
##  Length:7067            Min.   :2017   Min.   :     0.02   Length:7067       
##  Class :character       1st Qu.:2019   1st Qu.:     1.20   Class :character  
##  Mode  :character       Median :2021   Median :     3.93   Mode  :character  
##                         Mean   :2021   Mean   :   567.56                     
##                         3rd Qu.:2022   3rd Qu.:    25.86                     
##                         Max.   :2024   Max.   :500466.00                     
##  DATABASE_ID_LABEL    UNIT_MULT      UNIT_MULT_LABEL     UNIT_TYPE        
##  Length:7067        Min.   :0.0000   Length:7067        Length:7067       
##  Class :character   1st Qu.:0.0000   Class :character   Class :character  
##  Mode  :character   Median :0.0000   Mode  :character   Mode  :character  
##                     Mean   :0.9526                                        
##                     3rd Qu.:0.0000                                        
##                     Max.   :6.0000                                        
##  UNIT_TYPE_LABEL     TIME_FORMAT  TIME_FORMAT_LABEL   OBS_STATUS       
##  Length:7067        Min.   :602   Length:7067        Length:7067       
##  Class :character   1st Qu.:602   Class :character   Class :character  
##  Mode  :character   Median :602   Mode  :character   Mode  :character  
##                     Mean   :602                                        
##                     3rd Qu.:602                                        
##                     Max.   :602                                        
##  OBS_STATUS_LABEL     OBS_CONF         OBS_CONF_LABEL    
##  Length:7067        Length:7067        Length:7067       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
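# Keep the total-cost indicator and the four food-group cost indicators
nutrition_small <- nutrition |>
  filter(INDICATOR %in% c(
    "FAO_CAHD_7004",  # total cost of a healthy diet
    "FAO_CAHD_7007",  # staples
    "FAO_CAHD_7008",  # animal-source foods
    "FAO_CAHD_7010",  # vegetables
    "FAO_CAHD_7011"   # fruits
  ))

# One row per country-year, one column per indicator;
# values_fn averages any rows that share the same country, year, and indicator
nutrition_wide <- nutrition_small |>
  pivot_wider(
    id_cols     = c(REF_AREA_LABEL, TIME_PERIOD),
    names_from  = INDICATOR,
    values_from = OBS_VALUE,
    values_fn   = ~mean(.x, na.rm = TRUE)
  )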

head(nutrition_wide)
## # A tibble: 6 × 7
##   REF_AREA_LABEL TIME_PERIOD FAO_CAHD_7004 FAO_CAHD_7007 FAO_CAHD_7008
##   <chr>                <int>         <dbl>         <dbl>         <dbl>
## 1 Panama                2017          3.11        NA             NA   
## 2 Panama                2018          2.94        NA             NA   
## 3 Panama                2019          2.9         NA             NA   
## 4 Panama                2020          2.83        NA             NA   
## 5 Panama                2021          2.84         0.485          0.78
## 6 Panama                2022          3.04        NA             NA   
## # ℹ 2 more variables: FAO_CAHD_7010 <dbl>, FAO_CAHD_7011 <dbl>
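
One thing worth checking at this step: the reshaped healthy diet cost for Panama in 2017 (3.11) differs from the raw value printed earlier by head(nutrition) (4.19). This suggests that more than one raw row per country, year, and indicator was averaged during the reshape. The sketch below uses a small made-up data frame (the `toy` object, its values, and the `OTHER_UNIT` code are hypothetical, not taken from the FAO file) to show how such duplicates could be counted before pivoting, and how filtering to one unit leaves a single row per cell. Whether `UNIT_MEASURE` is the column that separates the duplicate rows in the real data is an assumption that would need to be confirmed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: country "A" has the same indicator reported in two units.
toy <- data.frame(
  REF_AREA_LABEL = c("A", "A", "B"),
  TIME_PERIOD    = c(2021L, 2021L, 2021L),
  INDICATOR      = c("X1", "X1", "X1"),
  UNIT_MEASURE   = c("PPP_PS_D", "OTHER_UNIT", "PPP_PS_D"),
  OBS_VALUE      = c(4, 400, 3)
)

# Count rows per country-year-indicator cell; any n > 1 would be
# collapsed into a single number by values_fn in pivot_wider().
dupes <- toy |>
  count(REF_AREA_LABEL, TIME_PERIOD, INDICATOR) |>
  filter(n > 1)

# Keeping a single unit before pivoting leaves one row per cell,
# so no averaging function is needed.
toy_wide <- toy |>
  filter(UNIT_MEASURE == "PPP_PS_D") |>
  pivot_wider(
    id_cols     = c(REF_AREA_LABEL, TIME_PERIOD),
    names_from  = INDICATOR,
    values_from = OBS_VALUE
  )

dupes
toy_wide
```

If the same check on `nutrition_small` returned rows with n > 1, it would be worth inspecting which columns differ between them before deciding whether averaging is appropriate.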
nutrition_clean <- nutrition_wide |>
  rename(
    Country     = REF_AREA_LABEL,
    Year        = TIME_PERIOD,
    HealthyDiet = FAO_CAHD_7004,
    Staples     = FAO_CAHD_7007,
    AnimalFoods = FAO_CAHD_7008,
    Vegetables  = FAO_CAHD_7010,
    Fruits      = FAO_CAHD_7011
  ) |>
  filter(
    !is.na(HealthyDiet),
    !is.na(Staples),
    !is.na(AnimalFoods),
    !is.na(Vegetables),
    !is.na(Fruits)
  ) |>
  mutate(
    Year = as.numeric(Year)
  )

str(nutrition_clean)
## tibble [166 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Country    : chr [1:166] "Panama" "Pakistan" "West Bank and Gaza" "Philippines" ...
##  $ Year       : num [1:166] 2021 2021 2021 2021 2021 ...
##  $ HealthyDiet: num [1:166] 2.83 71.82 4.09 40.51 4.88 ...
##  $ Staples    : num [1:166] 0.485 9.8 0.89 7.18 0.58 ...
##  $ AnimalFoods: num [1:166] 0.78 23.59 1.16 8.42 1.14 ...
##  $ Vegetables : num [1:166] 0.72 8.48 0.68 10.8 1.5 ...
##  $ Fruits     : num [1:166] 0.485 16.77 0.725 8.07 0.92 ...
summary(nutrition_clean)
##    Country               Year       HealthyDiet           Staples         
##  Length:166         Min.   :2021   Min.   :    1.465   Min.   :   0.1300  
##  Class :character   1st Qu.:2021   1st Qu.:    3.954   1st Qu.:   0.5687  
##  Mode  :character   Median :2021   Median :   17.285   Median :   2.8950  
##                     Mean   :2021   Mean   :  589.209   Mean   : 100.0541  
##                     3rd Qu.:2021   3rd Qu.:  303.670   3rd Qu.:  50.9900  
##                     Max.   :2021   Max.   :14572.240   Max.   :2717.4350  
##   AnimalFoods          Vegetables            Fruits         
##  Min.   :   0.2550   Min.   :   0.2650   Min.   :   0.2450  
##  1st Qu.:   0.9325   1st Qu.:   0.8387   1st Qu.:   0.8075  
##  Median :   4.5000   Median :   3.6575   Median :   2.9775  
##  Mean   : 187.0556   Mean   : 100.0974   Mean   : 105.7008  
##  3rd Qu.:  80.9025   3rd Qu.:  45.3375   3rd Qu.:  32.3062  
##  Max.   :4897.3700   Max.   :2442.8050   Max.   :2348.8400
head(nutrition_clean)
## # A tibble: 6 × 7
##   Country             Year HealthyDiet Staples AnimalFoods Vegetables  Fruits
##   <chr>              <dbl>       <dbl>   <dbl>       <dbl>      <dbl>   <dbl>
## 1 Panama              2021        2.84   0.485        0.78       0.72   0.485
## 2 Pakistan            2021       71.8    9.8         23.6        8.48  16.8  
## 3 West Bank and Gaza  2021        4.09   0.89         1.16       0.68   0.725
## 4 Philippines         2021       40.5    7.18         8.42      10.8    8.07 
## 5 Poland              2021        4.88   0.58         1.14       1.50   0.92 
## 6 Paraguay            2021     5439.   593         1106.      1333.   967.

To answer my research question, I use multiple linear regression to see how the costs of staples, animal foods, vegetables, and fruits explain the total cost of a healthy diet. The data were cleaned, reshaped, and summarized above before fitting the model. To check whether the model assumptions are reasonable, I generate the standard diagnostic plots from R: Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage. These plots help me check linearity, normality, equal variance, and any unusual observations.

model1 <- lm(HealthyDiet ~ Staples + AnimalFoods + Vegetables + Fruits,
             data = nutrition_clean)

summary(model1)
## 
## Call:
## lm(formula = HealthyDiet ~ Staples + AnimalFoods + Vegetables + 
##     Fruits, data = nutrition_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -331.57   -1.81   -1.27   -0.70  557.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.19006    5.41885   0.220    0.826    
## Staples      0.40241    0.08746   4.601 8.48e-06 ***
## AnimalFoods  1.53805    0.04281  35.926  < 2e-16 ***
## Vegetables   1.34195    0.05906  22.721  < 2e-16 ***
## Fruits       1.18949    0.04419  26.916  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.42 on 161 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9987 
## F-statistic: 3.24e+04 on 4 and 161 DF,  p-value: < 2.2e-16

The intercept (≈ 1.19) is the predicted healthy diet cost when all four food-group costs are zero. That scenario is not realistic, and the intercept is not statistically distinguishable from zero (p = 0.826), so it serves only as the mathematical baseline of the model. A 1-PPP-dollar increase in the cost of starchy staples is associated with an increase of about 0.40 PPP dollars in healthy diet cost, holding the other costs constant. Animal-source foods have the largest coefficient: a 1-PPP-dollar increase is associated with an increase of about 1.54 PPP dollars in total diet cost. The coefficients for vegetables (≈ 1.34) and fruits (≈ 1.19) are also positive and substantial. All four predictors have very small p-values (p < 0.001), so they are statistically significant. The adjusted R² (≈ 0.9987) indicates that about 99.9% of the variation in healthy diet cost is explained by the four food-group costs, suggesting a very strong model fit.
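
As a worked example of how the coefficients combine, the sketch below plugs Panama's food-group costs from head(nutrition_clean) into the fitted equation. The inputs are the rounded values as printed, so the result is approximate; the object names `b`, `panama`, `fitted_panama`, and `residual_panama` are introduced here only for illustration.

```r
# Coefficients as printed in summary(model1)
b <- c(Intercept = 1.19006, Staples = 0.40241, AnimalFoods = 1.53805,
       Vegetables = 1.34195, Fruits = 1.18949)

# Panama's food-group costs as printed by head(nutrition_clean) (rounded)
panama <- c(Staples = 0.485, AnimalFoods = 0.78,
            Vegetables = 0.72, Fruits = 0.485)

# Fitted value = intercept + sum of (coefficient * cost) over the four groups
fitted_panama <- unname(b["Intercept"] + sum(b[names(panama)] * panama))

# Residual = observed healthy diet cost (2.84) minus fitted value
residual_panama <- 2.84 - fitted_panama

round(fitted_panama, 2)    # about 4.13
round(residual_panama, 2)  # about -1.29
```

The fitted value of about 4.13 overshoots Panama's observed cost of 2.84. The resulting residual of roughly -1.29 sits between the first and third quartiles of the residuals reported by summary(model1) (-1.81 and -0.70).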

par(mfrow = c(2,2))
plot(model1)

par(mfrow = c(1,1))

The Residuals vs Fitted plot shows a mostly flat pattern without a strong curve, which suggests that the linearity assumption is generally reasonable. The Q–Q plot shows small deviations from the straight line, indicating that most residuals are close to normally distributed, with skewness caused by a few extreme observations. The Scale–Location plot shows slightly uneven spread for higher fitted values, which suggests some mild heteroscedasticity. The Residuals vs Leverage plot shows a few observations with higher leverage, but none appear to be highly influential. That said, the residual summary (quartiles between about -1.8 and -0.7, but extremes of -331.57 and 557.78) shows that a few countries have very large residuals, so those observations deserve a closer look. Overall, the model assumptions are mostly satisfied, and some departures are expected given real cross-country differences in costs.
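
Visual checks can be complemented with numerical ones. The sketch below is not run on the project data: it simulates a small heteroscedastic dataset (the variables `x`, `y`, and the model `fit` are made up for illustration) and shows two base-R tools, `shapiro.test()` for residual normality and `cooks.distance()` for influence. The cutoff of 4/n for Cook's distance is a common rule of thumb, not a formal test.

```r
set.seed(42)

# Simulated data whose error spread grows with x (heteroscedastic by design)
n <- 120
x <- runif(n, min = 1, max = 100)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.1 * x)

fit <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))

# Cook's distance: flag observations above the 4/n rule-of-thumb cutoff
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)

sw$p.value
flagged
```

Applying the same two calls to `model1` would give a numerical counterpart to the Q–Q and Residuals vs Leverage plots.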

library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model1)
##     Staples AnimalFoods  Vegetables      Fruits 
##   29.668541   24.790242   12.806226    9.457067

The variance inflation factors for Staples, AnimalFoods, Vegetables, and Fruits are all above 9, with Staples and AnimalFoods being the highest. These values show strong multicollinearity among the predictors, meaning the food-group costs tend to move together across countries. Multicollinearity inflates the standard errors of the individual coefficients, so their estimates and p-values are less stable, even though the model as a whole still explains healthy diet cost very well. Since the predictors represent different parts of a country’s food system and are naturally related, this level of multicollinearity is expected and does not prevent the model from being useful for understanding how food prices contribute to total diet cost.
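
To make the VIF values more concrete: the VIF for predictor j equals 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the others. A VIF of 29.67 for Staples therefore implies that about 96.6% of the variation in staple costs is explained by the other three food groups. The sketch below reproduces the calculation by hand on simulated data (the predictors `p1`, `p2`, `p3` are hypothetical and share a common component by construction), and compares it with the diagonal of the inverse correlation matrix, which gives the same numbers.

```r
# Implied R^2 behind the Staples VIF reported by vif(model1)
r2_staples <- 1 - 1 / 29.668541
round(r2_staples, 3)  # 0.966

set.seed(123)
n <- 200
common <- rnorm(n)  # shared component that makes the predictors correlated
X <- data.frame(
  p1 = common + rnorm(n, sd = 0.3),
  p2 = common + rnorm(n, sd = 0.3),
  p3 = common + rnorm(n, sd = 0.3)
)

# VIF by definition: regress each predictor on the remaining ones
vif_manual <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse of the correlation matrix
vif_inverse <- diag(solve(cor(X)))

vif_manual
vif_inverse
```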

residuals_model1 <- resid(model1)
rmse_model1 <- sqrt(mean(residuals_model1^2))
rmse_model1
## [1] 65.41529

The RMSE is approximately 65.4, which summarizes the typical size of the prediction error in the same units as the response (cost per person per day). This is large relative to the median healthy diet cost of about 17.3, but it reflects the wide range of costs across countries, including very high-cost observations that dominate the squared errors. RMSE complements the adjusted R² by showing how far predictions tend to be from the observed values, and in this case the level of error is acceptable given the real economic diversity in the dataset.
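
The RMSE and the residual standard error from summary(model1) are closely related: both are based on the residual sum of squares, but the RMSE divides by the number of observations n, while the residual standard error divides by the residual degrees of freedom. The short check below recovers the RMSE from the printed values; the variable names are introduced here for illustration, and the small discrepancy comes from the residual standard error being printed to two decimals.

```r
rse      <- 66.42  # residual standard error printed by summary(model1)
df_resid <- 161    # residual degrees of freedom (166 observations - 5 coefficients)
n        <- 166    # number of observations

# RMSE = sqrt(RSS / n) and RSE = sqrt(RSS / df), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n)

round(rmse_from_rse, 2)  # 65.41
```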

E. Conclusion and Future Directions

This analysis shows that the cost of a healthy diet is strongly related to the costs of staples, animal foods, vegetables, and fruits. All predictors were statistically significant, and a higher cost in any food group is associated with a higher overall diet cost. Animal-source foods had the largest coefficient, followed by vegetables and fruits. The adjusted R² of about 0.9987 shows that the model explains nearly all of the variation in healthy diet cost, indicating a very strong fit.

There are some limitations, mainly the high multicollinearity among predictors and mild heteroscedasticity caused by large cost differences between countries. Future analysis could explore additional predictors such as income, regional effects, or inflation differences, or test regularized regression methods to reduce multicollinearity. Using data from multiple years could also help track changes in diet affordability over time.
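
As one possible starting point for the regularized regression mentioned above, the sketch below fits a ridge regression with `lm.ridge()` from the MASS package on simulated, deliberately collinear data. It is not run on the project data. The data frame `sim`, the lambda grid, and the choice of generalized cross-validation (GCV) to pick lambda are all assumptions made for illustration. `MASS::lm.ridge` is called with the namespace prefix rather than `library(MASS)` so that `dplyr::select` is not masked.

```r
set.seed(1)

# Simulated predictors that share a common component (collinear by design)
n <- 100
base_level <- rnorm(n)
sim <- data.frame(
  x1 = base_level + rnorm(n, sd = 0.2),
  x2 = base_level + rnorm(n, sd = 0.2),
  x3 = base_level + rnorm(n, sd = 0.2)
)
sim$y <- 1 + sim$x1 + sim$x2 + sim$x3 + rnorm(n)

# Fit ridge regression over a grid of penalty values
lambdas <- seq(0, 20, by = 0.5)
ridge_fit <- MASS::lm.ridge(y ~ x1 + x2 + x3, data = sim, lambda = lambdas)

# Pick the lambda with the smallest generalized cross-validation score
best_index  <- which.min(ridge_fit$GCV)
best_lambda <- lambdas[best_index]

# Coefficients (intercept first) on the original scale at the chosen lambda
best_coefs <- coef(ridge_fit)[best_index, ]

best_lambda
best_coefs
```

With lambda = 0 the fit reduces to ordinary least squares, so comparing the first row of `coef(ridge_fit)` with the chosen row shows how much the penalty shrinks the coefficients.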