Research Question: How are the costs of different food-group components (such as staples, animal-source foods, vegetables, and fruits) related to the total cost of a healthy diet across countries?
The data for this project come from the Cost and Affordability of a Healthy Diet (CoAHD) database published by the Food and Agriculture Organization (FAO). The raw file contains 7,067 observations of 37 variables and covers many indicators. For this project, I filtered the data to five cost indicators: the total cost of a healthy diet and the costs of staples, animal-source foods, vegetables, and fruits. After cleaning and reshaping, the working dataset contains 166 observations and 7 variables. Each row represents one country in 2021, the only year with complete data on all five indicators, with prices measured in PPP dollars per person per day.
Source: https://data360.worldbank.org/en/dataset/FAO_CAHD.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Assumes nutrition.csv is in the current working directory (for example, the project folder)
nutrition <- read.csv("nutrition.csv")
names(nutrition)
## [1] "STRUCTURE" "STRUCTURE_ID" "ACTION"
## [4] "FREQ" "FREQ_LABEL" "REF_AREA"
## [7] "REF_AREA_LABEL" "INDICATOR" "INDICATOR_LABEL"
## [10] "SEX" "SEX_LABEL" "AGE"
## [13] "AGE_LABEL" "URBANISATION" "URBANISATION_LABEL"
## [16] "UNIT_MEASURE" "UNIT_MEASURE_LABEL" "COMP_BREAKDOWN_1"
## [19] "COMP_BREAKDOWN_1_LABEL" "COMP_BREAKDOWN_2" "COMP_BREAKDOWN_2_LABEL"
## [22] "COMP_BREAKDOWN_3" "COMP_BREAKDOWN_3_LABEL" "TIME_PERIOD"
## [25] "OBS_VALUE" "DATABASE_ID" "DATABASE_ID_LABEL"
## [28] "UNIT_MULT" "UNIT_MULT_LABEL" "UNIT_TYPE"
## [31] "UNIT_TYPE_LABEL" "TIME_FORMAT" "TIME_FORMAT_LABEL"
## [34] "OBS_STATUS" "OBS_STATUS_LABEL" "OBS_CONF"
## [37] "OBS_CONF_LABEL"
head(nutrition)
## STRUCTURE STRUCTURE_ID ACTION FREQ FREQ_LABEL REF_AREA
## 1 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## 2 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## 3 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## 4 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## 5 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## 6 datastructure WB.DATA360:DS_DATA360(1.2) I A Annual PAN
## REF_AREA_LABEL INDICATOR
## 1 Panama FAO_CAHD_7004
## 2 Panama FAO_CAHD_7004
## 3 Panama FAO_CAHD_7004
## 4 Panama FAO_CAHD_7004
## 5 Panama FAO_CAHD_7004
## 6 Panama FAO_CAHD_7004
## INDICATOR_LABEL SEX SEX_LABEL AGE
## 1 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## 2 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## 3 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## 4 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## 5 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## 6 Cost of a healthy diet (PPP dollar per person per day) _T Total _T
## AGE_LABEL URBANISATION URBANISATION_LABEL
## 1 All age ranges or no breakdown by age _T Total
## 2 All age ranges or no breakdown by age _T Total
## 3 All age ranges or no breakdown by age _T Total
## 4 All age ranges or no breakdown by age _T Total
## 5 All age ranges or no breakdown by age _T Total
## 6 All age ranges or no breakdown by age _T Total
## UNIT_MEASURE UNIT_MEASURE_LABEL COMP_BREAKDOWN_1
## 1 PPP_PS_D PPP dollars per person per day _Z
## 2 PPP_PS_D PPP dollars per person per day _Z
## 3 PPP_PS_D PPP dollars per person per day _Z
## 4 PPP_PS_D PPP dollars per person per day _Z
## 5 PPP_PS_D PPP dollars per person per day _Z
## 6 PPP_PS_D PPP dollars per person per day _Z
## COMP_BREAKDOWN_1_LABEL COMP_BREAKDOWN_2 COMP_BREAKDOWN_2_LABEL
## 1 Not Applicable _Z Not Applicable
## 2 Not Applicable _Z Not Applicable
## 3 Not Applicable _Z Not Applicable
## 4 Not Applicable _Z Not Applicable
## 5 Not Applicable _Z Not Applicable
## 6 Not Applicable _Z Not Applicable
## COMP_BREAKDOWN_3 COMP_BREAKDOWN_3_LABEL TIME_PERIOD OBS_VALUE DATABASE_ID
## 1 _Z Not Applicable 2017 4.19 FAO_CAHD
## 2 _Z Not Applicable 2018 3.88 FAO_CAHD
## 3 _Z Not Applicable 2019 3.78 FAO_CAHD
## 4 _Z Not Applicable 2020 3.65 FAO_CAHD
## 5 _Z Not Applicable 2021 3.63 FAO_CAHD
## 6 _Z Not Applicable 2022 3.96 FAO_CAHD
## DATABASE_ID_LABEL UNIT_MULT UNIT_MULT_LABEL
## 1 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## 2 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## 3 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## 4 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## 5 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## 6 Cost and Affordability of a Healthy Diet (CoAHD) 0 Units
## UNIT_TYPE UNIT_TYPE_LABEL TIME_FORMAT TIME_FORMAT_LABEL OBS_STATUS
## 1 NUMBER Number (real number) 602 CCYY A
## 2 NUMBER Number (real number) 602 CCYY A
## 3 NUMBER Number (real number) 602 CCYY A
## 4 NUMBER Number (real number) 602 CCYY A
## 5 NUMBER Number (real number) 602 CCYY A
## 6 NUMBER Number (real number) 602 CCYY A
## OBS_STATUS_LABEL OBS_CONF OBS_CONF_LABEL
## 1 Normal value PU Public
## 2 Normal value PU Public
## 3 Normal value PU Public
## 4 Normal value PU Public
## 5 Normal value PU Public
## 6 Normal value PU Public
str(nutrition)
## 'data.frame': 7067 obs. of 37 variables:
## $ STRUCTURE : chr "datastructure" "datastructure" "datastructure" "datastructure" ...
## $ STRUCTURE_ID : chr "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" "WB.DATA360:DS_DATA360(1.2)" ...
## $ ACTION : chr "I" "I" "I" "I" ...
## $ FREQ : chr "A" "A" "A" "A" ...
## $ FREQ_LABEL : chr "Annual" "Annual" "Annual" "Annual" ...
## $ REF_AREA : chr "PAN" "PAN" "PAN" "PAN" ...
## $ REF_AREA_LABEL : chr "Panama" "Panama" "Panama" "Panama" ...
## $ INDICATOR : chr "FAO_CAHD_7004" "FAO_CAHD_7004" "FAO_CAHD_7004" "FAO_CAHD_7004" ...
## $ INDICATOR_LABEL : chr "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" "Cost of a healthy diet (PPP dollar per person per day)" ...
## $ SEX : chr "_T" "_T" "_T" "_T" ...
## $ SEX_LABEL : chr "Total" "Total" "Total" "Total" ...
## $ AGE : chr "_T" "_T" "_T" "_T" ...
## $ AGE_LABEL : chr "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" "All age ranges or no breakdown by age" ...
## $ URBANISATION : chr "_T" "_T" "_T" "_T" ...
## $ URBANISATION_LABEL : chr "Total" "Total" "Total" "Total" ...
## $ UNIT_MEASURE : chr "PPP_PS_D" "PPP_PS_D" "PPP_PS_D" "PPP_PS_D" ...
## $ UNIT_MEASURE_LABEL : chr "PPP dollars per person per day" "PPP dollars per person per day" "PPP dollars per person per day" "PPP dollars per person per day" ...
## $ COMP_BREAKDOWN_1 : chr "_Z" "_Z" "_Z" "_Z" ...
## $ COMP_BREAKDOWN_1_LABEL: chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ COMP_BREAKDOWN_2 : chr "_Z" "_Z" "_Z" "_Z" ...
## $ COMP_BREAKDOWN_2_LABEL: chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ COMP_BREAKDOWN_3 : chr "_Z" "_Z" "_Z" "_Z" ...
## $ COMP_BREAKDOWN_3_LABEL: chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ TIME_PERIOD : int 2017 2018 2019 2020 2021 2022 2023 2024 2017 2019 ...
## $ OBS_VALUE : num 4.19 3.88 3.78 3.65 3.63 3.96 4.2 4.34 3.74 3.71 ...
## $ DATABASE_ID : chr "FAO_CAHD" "FAO_CAHD" "FAO_CAHD" "FAO_CAHD" ...
## $ DATABASE_ID_LABEL : chr "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" "Cost and Affordability of a Healthy Diet (CoAHD)" ...
## $ UNIT_MULT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UNIT_MULT_LABEL : chr "Units" "Units" "Units" "Units" ...
## $ UNIT_TYPE : chr "NUMBER" "NUMBER" "NUMBER" "NUMBER" ...
## $ UNIT_TYPE_LABEL : chr "Number (real number)" "Number (real number)" "Number (real number)" "Number (real number)" ...
## $ TIME_FORMAT : int 602 602 602 602 602 602 602 602 602 602 ...
## $ TIME_FORMAT_LABEL : chr "CCYY" "CCYY" "CCYY" "CCYY" ...
## $ OBS_STATUS : chr "A" "A" "A" "A" ...
## $ OBS_STATUS_LABEL : chr "Normal value" "Normal value" "Normal value" "Normal value" ...
## $ OBS_CONF : chr "PU" "PU" "PU" "PU" ...
## $ OBS_CONF_LABEL : chr "Public" "Public" "Public" "Public" ...
summary(nutrition)
## STRUCTURE STRUCTURE_ID ACTION FREQ
## Length:7067 Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## FREQ_LABEL REF_AREA REF_AREA_LABEL INDICATOR
## Length:7067 Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## INDICATOR_LABEL SEX SEX_LABEL AGE
## Length:7067 Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGE_LABEL URBANISATION URBANISATION_LABEL UNIT_MEASURE
## Length:7067 Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## UNIT_MEASURE_LABEL COMP_BREAKDOWN_1 COMP_BREAKDOWN_1_LABEL
## Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## COMP_BREAKDOWN_2 COMP_BREAKDOWN_2_LABEL COMP_BREAKDOWN_3
## Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## COMP_BREAKDOWN_3_LABEL TIME_PERIOD OBS_VALUE DATABASE_ID
## Length:7067 Min. :2017 Min. : 0.02 Length:7067
## Class :character 1st Qu.:2019 1st Qu.: 1.20 Class :character
## Mode :character Median :2021 Median : 3.93 Mode :character
## Mean :2021 Mean : 567.56
## 3rd Qu.:2022 3rd Qu.: 25.86
## Max. :2024 Max. :500466.00
## DATABASE_ID_LABEL UNIT_MULT UNIT_MULT_LABEL UNIT_TYPE
## Length:7067 Min. :0.0000 Length:7067 Length:7067
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :0.0000 Mode :character Mode :character
## Mean :0.9526
## 3rd Qu.:0.0000
## Max. :6.0000
## UNIT_TYPE_LABEL TIME_FORMAT TIME_FORMAT_LABEL OBS_STATUS
## Length:7067 Min. :602 Length:7067 Length:7067
## Class :character 1st Qu.:602 Class :character Class :character
## Mode :character Median :602 Mode :character Mode :character
## Mean :602
## 3rd Qu.:602
## Max. :602
## OBS_STATUS_LABEL OBS_CONF OBS_CONF_LABEL
## Length:7067 Length:7067 Length:7067
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# Keep the five cost indicators: healthy diet total, staples, animal foods, vegetables, fruits
nutrition_small <- nutrition |>
  filter(INDICATOR %in% c(
    "FAO_CAHD_7004",
    "FAO_CAHD_7007",
    "FAO_CAHD_7008",
    "FAO_CAHD_7010",
    "FAO_CAHD_7011"
  ))
# One row per country-year, one column per indicator; duplicate records are averaged
nutrition_wide <- nutrition_small |>
  pivot_wider(
    id_cols = c(REF_AREA_LABEL, TIME_PERIOD),
    names_from = INDICATOR,
    values_from = OBS_VALUE,
    values_fn = ~mean(.x, na.rm = TRUE)
  )
head(nutrition_wide)
## # A tibble: 6 × 7
## REF_AREA_LABEL TIME_PERIOD FAO_CAHD_7004 FAO_CAHD_7007 FAO_CAHD_7008
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Panama 2017 3.11 NA NA
## 2 Panama 2018 2.94 NA NA
## 3 Panama 2019 2.9 NA NA
## 4 Panama 2020 2.83 NA NA
## 5 Panama 2021 2.84 0.485 0.78
## 6 Panama 2022 3.04 NA NA
## # ℹ 2 more variables: FAO_CAHD_7010 <dbl>, FAO_CAHD_7011 <dbl>
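The wide table shows 2.84 for Panama in 2021, while the raw rows printed earlier show 3.63 for the same country, year, and indicator, so `values_fn` must have averaged more than one record. A minimal sketch of how such duplicates could be inspected before averaging, written in base R on a made-up toy table. The indicator codes `X_7004`/`X_7007` and the unit code `LCU_PS_D` are hypothetical, and whether the real duplicates differ by unit is an assumption to verify against the raw file:

```r
# Toy data (made up): two records share the same country, year, and indicator
toy <- data.frame(
  REF_AREA_LABEL = c("A", "A", "A", "B"),
  TIME_PERIOD    = 2021,
  INDICATOR      = c("X_7004", "X_7004", "X_7007", "X_7004"),
  UNIT_MEASURE   = c("PPP_PS_D", "LCU_PS_D", "PPP_PS_D", "PPP_PS_D"),
  OBS_VALUE      = c(3, 300, 0.5, 4)
)

# One key per cell of the wide table (country x year x indicator)
key <- interaction(toy$REF_AREA_LABEL, toy$TIME_PERIOD, toy$INDICATOR, drop = TRUE)

rows_per_cell  <- table(key)                                   # records per cell
units_per_cell <- tapply(toy$UNIT_MEASURE, key,
                         function(u) length(unique(u)))        # distinct units per cell
cell_means     <- tapply(toy$OBS_VALUE, key, mean)             # what averaging would produce

# Cells that would be silently averaged
duplicated_cells <- names(rows_per_cell)[as.vector(rows_per_cell > 1)]

print(duplicated_cells)
print(units_per_cell[duplicated_cells])
print(cell_means[duplicated_cells])
```

In the toy table the duplicated cell mixes two units, so its mean (151.5) is not meaningful in either unit. Running the same count on `nutrition_small` would show whether the real duplicates are of this kind.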
# Give the indicator columns readable names
nutrition_clean <- nutrition_wide |>
  rename(
    Country = REF_AREA_LABEL,
    Year = TIME_PERIOD,
    HealthyDiet = FAO_CAHD_7004,
    Staples = FAO_CAHD_7007,
    AnimalFoods = FAO_CAHD_7008,
    Vegetables = FAO_CAHD_7010,
    Fruits = FAO_CAHD_7011
  ) |>
  # Keep only country-years with all five costs observed
  filter(
    !is.na(HealthyDiet),
    !is.na(Staples),
    !is.na(AnimalFoods),
    !is.na(Vegetables),
    !is.na(Fruits)
  ) |>
  mutate(
    Year = as.numeric(Year)
  )
str(nutrition_clean)
## tibble [166 × 7] (S3: tbl_df/tbl/data.frame)
## $ Country : chr [1:166] "Panama" "Pakistan" "West Bank and Gaza" "Philippines" ...
## $ Year : num [1:166] 2021 2021 2021 2021 2021 ...
## $ HealthyDiet: num [1:166] 2.83 71.82 4.09 40.51 4.88 ...
## $ Staples : num [1:166] 0.485 9.8 0.89 7.18 0.58 ...
## $ AnimalFoods: num [1:166] 0.78 23.59 1.16 8.42 1.14 ...
## $ Vegetables : num [1:166] 0.72 8.48 0.68 10.8 1.5 ...
## $ Fruits : num [1:166] 0.485 16.77 0.725 8.07 0.92 ...
summary(nutrition_clean)
## Country Year HealthyDiet Staples
## Length:166 Min. :2021 Min. : 1.465 Min. : 0.1300
## Class :character 1st Qu.:2021 1st Qu.: 3.954 1st Qu.: 0.5687
## Mode :character Median :2021 Median : 17.285 Median : 2.8950
## Mean :2021 Mean : 589.209 Mean : 100.0541
## 3rd Qu.:2021 3rd Qu.: 303.670 3rd Qu.: 50.9900
## Max. :2021 Max. :14572.240 Max. :2717.4350
## AnimalFoods Vegetables Fruits
## Min. : 0.2550 Min. : 0.2650 Min. : 0.2450
## 1st Qu.: 0.9325 1st Qu.: 0.8387 1st Qu.: 0.8075
## Median : 4.5000 Median : 3.6575 Median : 2.9775
## Mean : 187.0556 Mean : 100.0974 Mean : 105.7008
## 3rd Qu.: 80.9025 3rd Qu.: 45.3375 3rd Qu.: 32.3062
## Max. :4897.3700 Max. :2442.8050 Max. :2348.8400
head(nutrition_clean)
## # A tibble: 6 × 7
## Country Year HealthyDiet Staples AnimalFoods Vegetables Fruits
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Panama 2021 2.84 0.485 0.78 0.72 0.485
## 2 Pakistan 2021 71.8 9.8 23.6 8.48 16.8
## 3 West Bank and Gaza 2021 4.09 0.89 1.16 0.68 0.725
## 4 Philippines 2021 40.5 7.18 8.42 10.8 8.07
## 5 Poland 2021 4.88 0.58 1.14 1.50 0.92
## 6 Paraguay 2021 5439. 593 1106. 1333. 967.
To answer my research question, I use multiple linear regression to estimate how the costs of staples, animal foods, vegetables, and fruits relate to the total cost of a healthy diet. With the data cleaned, reshaped, and summarized above, I fit the model and then generate R's standard diagnostic plots: Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage. These plots help me check linearity, normality of the residuals, equal variance, and any unusual or influential observations.
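Written out, the model has the following form, where i indexes the rows of the working dataset. The β and ε symbols are notation introduced here for illustration and do not appear in the R output:

```latex
\mathrm{HealthyDiet}_i = \beta_0
  + \beta_1\,\mathrm{Staples}_i
  + \beta_2\,\mathrm{AnimalFoods}_i
  + \beta_3\,\mathrm{Vegetables}_i
  + \beta_4\,\mathrm{Fruits}_i
  + \varepsilon_i
```

Each slope is the expected change in the healthy diet cost for a one-unit change in that food-group cost with the other three held fixed, and the diagnostic plots examine the estimated errors.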
model1 <- lm(HealthyDiet ~ Staples + AnimalFoods + Vegetables + Fruits,
data = nutrition_clean)
summary(model1)
##
## Call:
## lm(formula = HealthyDiet ~ Staples + AnimalFoods + Vegetables +
## Fruits, data = nutrition_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -331.57 -1.81 -1.27 -0.70 557.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.19006 5.41885 0.220 0.826
## Staples 0.40241 0.08746 4.601 8.48e-06 ***
## AnimalFoods 1.53805 0.04281 35.926 < 2e-16 ***
## Vegetables 1.34195 0.05906 22.721 < 2e-16 ***
## Fruits 1.18949 0.04419 26.916 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.42 on 161 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9987
## F-statistic: 3.24e+04 on 4 and 161 DF, p-value: < 2.2e-16
The intercept (≈ 1.19) is the predicted healthy diet cost when all four food-group costs are zero. This is not a realistic scenario, and the estimate is not statistically distinguishable from zero (p = 0.826); it is simply the mathematical baseline of the model. A 1-PPP-dollar increase in the cost of starchy staples is associated with an increase of about 0.40 PPP dollars in the healthy diet cost, holding the other costs constant. Animal-source foods have the largest coefficient: a 1-PPP-dollar increase is associated with an increase of about 1.54 PPP dollars in total diet cost. Vegetables (≈ 1.34) and fruits (≈ 1.19) also have coefficients above 1. All four predictors have extremely small p-values (p < 0.001), meaning they are statistically significant. The adjusted R² (≈ 0.9987) indicates that about 99.9% of the variation in healthy diet cost is explained by the four food-group costs. Such a strong fit is expected to some degree, because these food groups are components of the healthy diet itself.
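As a worked example of reading these coefficients, the sketch below recomputes the fitted value for Panama by hand. It uses the coefficient estimates printed by `summary(model1)` and the rounded costs displayed by `head(nutrition_clean)`, so the result is approximate:

```r
# Coefficient estimates copied from summary(model1) above
coefs <- c(
  Intercept   = 1.19006,
  Staples     = 0.40241,
  AnimalFoods = 1.53805,
  Vegetables  = 1.34195,
  Fruits      = 1.18949
)

# Panama's food-group costs as displayed by head(nutrition_clean) (rounded values)
panama <- c(Staples = 0.485, AnimalFoods = 0.78, Vegetables = 0.72, Fruits = 0.485)

# Fitted value = intercept + sum of (coefficient * cost)
predict_cost <- function(costs) {
  unname(coefs["Intercept"] + sum(coefs[names(costs)] * costs))
}
fitted_panama <- predict_cost(panama)

# Residual = observed - fitted; 2.84 is the displayed HealthyDiet value for Panama
residual_panama <- 2.84 - fitted_panama

# Change in the fitted value when AnimalFoods rises by 1 and the rest stay fixed
panama_plus <- panama
panama_plus["AnimalFoods"] <- panama_plus["AnimalFoods"] + 1
delta <- predict_cost(panama_plus) - fitted_panama

print(round(c(fitted = fitted_panama, residual = residual_panama, delta = delta), 3))
```

The fitted value is about 4.13 against an observed 2.84, a residual of about -1.29, which is close to the median residual of -1.27 in the model summary. The change `delta` equals the AnimalFoods coefficient, which is exactly what "holding other costs constant" means.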
par(mfrow = c(2,2))
plot(model1)
par(mfrow = c(1,1))
The Residuals vs Fitted plot shows a mostly flat pattern without a strong curve, which suggests that the linearity assumption is generally reasonable. The Q–Q plot follows the straight line over most of its range, with departures in the tails caused by a few extreme observations. The residual summary is consistent with this: the middle half of the residuals lies between -1.81 and -0.70, while the extremes reach -331.57 and 557.78. The Scale–Location plot shows slightly uneven spread for higher fitted values, which suggests some mild heteroscedasticity. The Residuals vs Leverage plot shows a few observations with higher leverage, but none appear to be highly influential or problematic. Overall, the model assumptions appear reasonable for most countries, and the departures are concentrated in a small number of high-cost observations, which is expected given real cross-country cost differences.
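The visual checks can be complemented with numeric ones. The sketch below is an illustration on simulated, made-up data (not the FAO values) with right-skewed costs and error variance that grows with cost. The cutoffs `4 / n` and `2 * p / n` are common rules of thumb rather than strict tests, and applying the same calls to `model1` is left as an assumption-free next step for the reader:

```r
set.seed(1)

# Simulated stand-in for nutrition_clean (made-up data)
n <- 60
toy <- data.frame(
  Staples     = rlnorm(n, meanlog = 1, sdlog = 1.5),
  AnimalFoods = rlnorm(n, meanlog = 1, sdlog = 1.5)
)
# Error standard deviation grows with cost, which creates heteroscedasticity
toy$HealthyDiet <- 1 + 0.5 * toy$Staples + 1.5 * toy$AnimalFoods +
  rnorm(n, sd = 0.1 * (toy$Staples + toy$AnimalFoods))

fit <- lm(HealthyDiet ~ Staples + AnimalFoods, data = toy)

# Influence measures: Cook's distance and leverage (hat values)
cooks    <- cooks.distance(fit)
leverage <- hatvalues(fit)
p        <- length(coef(fit))

# Flag observations above the rule-of-thumb cutoffs
high_cook     <- which(cooks > 4 / n)
high_leverage <- which(leverage > 2 * p / n)

# Shapiro-Wilk test of residual normality (small p-value suggests non-normality)
sw <- shapiro.test(resid(fit))

print(summary(resid(fit)))
print(high_cook)
print(high_leverage)
print(sw$p.value)
```

`cooks.distance()`, `hatvalues()`, and `shapiro.test()` are all part of base R's stats package, so the same checks run on any `lm` object without extra libraries.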
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model1)
## Staples AnimalFoods Vegetables Fruits
## 29.668541 24.790242 12.806226 9.457067
The variance inflation factors for Staples, AnimalFoods, Vegetables, and Fruits are all above 9, with Staples (≈ 29.7) and AnimalFoods (≈ 24.8) being the highest. These values show strong multicollinearity among the predictors, meaning the food-group costs tend to move together across countries. Multicollinearity inflates the standard errors of the individual coefficients, which makes the estimates and their p-values less stable, even though the model as a whole still explains healthy diet cost very well. Since the predictors represent different parts of a country’s food system and are naturally related, this level of multicollinearity is expected and does not prevent the model from being useful for understanding how food prices contribute to total diet cost.
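To make the VIF values concrete: each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below inverts the reported values and then checks the definition with a small hand-written helper on simulated, made-up data. The helper `manual_vif` is hypothetical and is not the `car` implementation:

```r
# VIF values copied from vif(model1) above
vifs <- c(Staples = 29.668541, AnimalFoods = 24.790242,
          Vegetables = 12.806226, Fruits = 9.457067)

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j
r2_from_others <- 1 - 1 / vifs

# sqrt(VIF) is the factor by which a coefficient's standard error is inflated
se_inflation <- sqrt(vifs)

print(round(r2_from_others, 3))
print(round(se_inflation, 2))

# Hand-rolled VIF (hypothetical helper): regress each predictor on the others
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data: x1 and x2 share a common factor z, x3 is independent
set.seed(42)
z <- rnorm(100)
sim <- data.frame(x1 = z + rnorm(100, sd = 0.3),
                  x2 = z + rnorm(100, sd = 0.3),
                  x3 = rnorm(100))
sim_vif <- manual_vif(sim, c("x1", "x2", "x3"))
print(round(sim_vif, 2))
```

For Staples, the implied R² is about 0.966, so roughly 96.6% of the variation in staple costs is explained by the other three food groups, and its standard error is about 5.4 times larger than it would be with uncorrelated predictors.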
residuals_model1 <- resid(model1)
rmse_model1 <- sqrt(mean(residuals_model1^2))
rmse_model1
## [1] 65.41529
The RMSE is approximately 65.4, expressed in the same units as the healthy diet cost (PPP dollars per person per day). It measures the typical size of a prediction error, with large errors weighted more heavily than small ones. This number is large in absolute terms, and it is also large relative to the median healthy diet cost of about 17.3. It reflects the wide range of costs across countries (from about 1.5 to about 14,572), where a few very high-cost observations dominate the squared errors. RMSE complements the adjusted R² by showing how far predictions tend to be from the true values on the original scale, and in this case the level of error is understandable given the real economic diversity in the dataset.
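The RMSE differs slightly from the residual standard error of 66.42 reported by `summary(model1)` because the two divide the residual sum of squares by different quantities. A short arithmetic check, using only numbers copied from the output above:

```r
# Values copied from the output above
rse         <- 66.42     # residual standard error from summary(model1)
n           <- 166       # number of observations
k           <- 5         # estimated coefficients, including the intercept
rmse        <- 65.41529  # RMSE computed from the residuals
median_cost <- 17.285    # median HealthyDiet from summary(nutrition_clean)

# RSE divides the residual sum of squares by n - k; RMSE divides by n
rmse_from_rse <- rse * sqrt((n - k) / n)

# RMSE relative to the typical diet cost
rmse_to_median <- rmse / median_cost

print(round(rmse_from_rse, 2))
print(round(rmse_to_median, 2))
```

The rescaled residual standard error comes out at about 65.41, matching the RMSE up to rounding of the printed 66.42. The RMSE is also almost four times the median healthy diet cost, which shows how strongly the high-cost observations drive it.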
E. Conclusion and Future Directions
This analysis shows that the cost of a healthy diet is strongly related to the costs of staples, animal foods, vegetables, and fruits. All four predictors were statistically significant, and a higher cost in any food group is associated with a higher overall diet cost. Animal-source foods had the largest coefficient (≈ 1.54), followed by vegetables (≈ 1.34), fruits (≈ 1.19), and staples (≈ 0.40). The adjusted R² of about 0.9987 shows that the model explains nearly all of the variation in healthy diet cost, indicating a very strong fit.
There are some limitations, mainly the high multicollinearity among predictors and mild heteroscedasticity caused by large cost differences between countries. Future analysis could explore additional predictors such as income, regional effects, or inflation differences, or test regularized regression methods to reduce multicollinearity. Using data from multiple years could also help track changes in diet affordability over time.
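As one illustration of the regularized methods mentioned above, the sketch below implements ridge regression in closed form on simulated, made-up data with two strongly correlated predictors. Ridge is only one possible choice, the helper `ridge_coef` is hypothetical, and the penalty `lambda = 10` is arbitrary; in practice it would be chosen by cross-validation:

```r
set.seed(7)

# Simulated data (made up): x1 and x2 share a common factor z
n  <- 80
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.2)
x2 <- z + rnorm(n, sd = 0.2)
y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)

# Standardize the predictors and center the response so no intercept is needed
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, yc, lambda = 0)   # lambda = 0 reproduces least squares
ridge <- ridge_coef(X, yc, lambda = 10)  # arbitrary penalty for illustration

print(round(rbind(ols, ridge), 3))
```

The penalty shrinks the coefficient vector toward zero, which trades a small amount of bias for more stable estimates when predictors are highly correlated. The coefficients here are on the standardized scale, so they are not directly comparable to the `model1` estimates.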