Agricultural productivity is one of the most critical factors in global food security. As climate patterns shift and farming technologies evolve, understanding how crop yields have changed over time — and whether that change differs by crop type — has significant implications for policy, economics, and sustainability planning.
This project investigates the following research question: Does crop type and year (time) significantly predict global crop yield (in tonnes per hectare), and is there an interaction between crop type and year over the 1981–2016 period?
The dataset used is the Global Dataset of Historical Yields (GDHY) v1.2 & v1.3, sourced from PANGAEA Data Publisher for Earth & Environmental Science (Iizumi, 2019). The dataset contains global gridded crop yield estimates from 1981 to 2016, aggregated at the annual level across three major crops.
The variables used in this analysis are:
The multiple linear regression model tested is:
\[\hat{Y} = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{crop}) + \beta_3(\text{year} \times \text{crop}) + \varepsilon\]
For the overall model:
For the interaction term specifically:
Significance level: α = 0.05
library(tidyverse)
# Load the dataset — place gdhy_crop_yield.csv in the same folder as this Rmd
crop_data <- read_csv("gdhy_crop_yield.csv")
# Preview the data
glimpse(crop_data)
## Rows: 108
## Columns: 3
## $ year <dbl> 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, …
## $ crop <chr> "maize", "maize", "maize", "maize", "maize", "maize", "maize…
## $ yield_tha <dbl> 2.4217, 2.8218, 2.5354, 2.8033, 2.9371, 2.9800, 2.9722, 2.77…
head(crop_data)
## # A tibble: 6 × 3
## year crop yield_tha
## <dbl> <chr> <dbl>
## 1 1981 maize 2.42
## 2 1982 maize 2.82
## 3 1983 maize 2.54
## 4 1984 maize 2.80
## 5 1985 maize 2.94
## 6 1986 maize 2.98
# Ensure crop is a factor and remove any missing values
crop_clean <- crop_data %>%
select(year, crop, yield_tha) %>%
filter(!is.na(yield_tha), !is.na(year), !is.na(crop)) %>%
mutate(crop = factor(crop))
# Confirm structure
summary(crop_clean)
## year crop yield_tha
## Min. :1981 maize :36 Min. :1.544
## 1st Qu.:1990 soybean:36 1st Qu.:2.297
## Median :1998 wheat :36 Median :3.032
## Mean :1998 Mean :3.051
## 3rd Qu.:2007 3rd Qu.:3.654
## Max. :2016 Max. :5.288
cat("\nNumber of observations:", nrow(crop_clean), "\n")
##
## Number of observations: 108
cat("Crop levels:", levels(crop_clean$crop), "\n")
## Crop levels: maize soybean wheat
cat("Year range:", min(crop_clean$year), "to", max(crop_clean$year), "\n")
## Year range: 1981 to 2016
# Summary statistics by crop type
crop_clean %>%
group_by(crop) %>%
summarise(
n = n(),
mean_yield = round(mean(yield_tha), 3),
sd_yield = round(sd(yield_tha), 3),
min_yield = round(min(yield_tha), 3),
max_yield = round(max(yield_tha), 3)
)
## # A tibble: 3 × 6
## crop n mean_yield sd_yield min_yield max_yield
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 maize 36 3.71 0.777 2.42 5.29
## 2 soybean 36 2.12 0.278 1.54 2.61
## 3 wheat 36 3.33 0.473 1.68 4.13
# Yield over time by crop type with linear trend lines
ggplot(crop_clean, aes(x = year, y = yield_tha, color = crop)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Global Crop Yield Over Time by Crop Type (1981-2016)",
subtitle = "Linear trend lines with 95% confidence intervals",
x = "Year",
y = "Yield (tonnes per hectare)",
color = "Crop Type"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
The scatter plot reveals distinct yield levels and trend slopes across the three crops. Maize shows the steepest upward trend over the 35-year period, while soybean and wheat show more moderate increases. Differences in slope across crops are a visual indication that the interaction term may be statistically significant.
# Fit the multiple linear regression model with interaction
# year * crop expands to: year + crop + year:crop
model <- lm(yield_tha ~ year * crop, data = crop_clean)
# Full model summary
summary(model)
##
## Call:
## lm(formula = yield_tha ~ year * crop, data = crop_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93208 -0.07281 -0.01033 0.08802 0.38770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.408e+02 4.846e+00 -29.054 < 2e-16 ***
## year 7.231e-02 2.425e-03 29.819 < 2e-16 ***
## cropsoybean 9.195e+01 6.854e+00 13.416 < 2e-16 ***
## cropwheat 6.271e+01 6.854e+00 9.150 6.17e-15 ***
## year:cropsoybean -4.680e-02 3.429e-03 -13.648 < 2e-16 ***
## year:cropwheat -3.157e-02 3.429e-03 -9.206 4.66e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1512 on 102 degrees of freedom
## Multiple R-squared: 0.9713, Adjusted R-squared: 0.9699
## F-statistic: 690.1 on 5 and 102 DF, p-value: < 2.2e-16
# ANOVA table — tests the significance of each term in the model
anova(model)
## Analysis of Variance Table
##
## Response: yield_tha
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 24.863 24.8626 1088.253 < 2.2e-16 ***
## crop 2 49.542 24.7709 1084.241 < 2.2e-16 ***
## year:crop 2 4.428 2.2142 96.915 < 2.2e-16 ***
## Residuals 102 2.330 0.0228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
The four diagnostic plots assess whether the assumptions of linear regression are:
library(broom)
# Tidy coefficient table
tidy(model) %>%
mutate(across(where(is.numeric), ~ round(., 4)))
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -141. 4.85 -29.1 0
## 2 year 0.0723 0.0024 29.8 0
## 3 cropsoybean 92.0 6.85 13.4 0
## 4 cropwheat 62.7 6.85 9.15 0
## 5 year:cropsoybean -0.0468 0.0034 -13.6 0
## 6 year:cropwheat -0.0316 0.0034 -9.21 0
# Model fit statistics
glance(model) %>%
select(r.squared, adj.r.squared, statistic, p.value, df) %>%
mutate(across(where(is.numeric), ~ round(., 4)))
## # A tibble: 1 × 5
## r.squared adj.r.squared statistic p.value df
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.971 0.970 690. 0 5
Interpretation of coefficients:
# Overlay model predictions on observed data
crop_clean <- crop_clean %>%
mutate(predicted = fitted(model))
ggplot(crop_clean, aes(x = year, color = crop)) +
geom_point(aes(y = yield_tha), alpha = 0.3, size = 1.5) +
geom_line(aes(y = predicted), linewidth = 1.2) +
labs(
title = "Observed vs. Predicted Crop Yield (1981-2016)",
subtitle = "Solid lines = model predictions from MLR with interaction term",
x = "Year",
y = "Yield (tonnes per hectare)",
color = "Crop Type"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
Multiple linear regression was used to analyze whether year, crop type, or an interaction of the two could be used to explain global crop yields for the years 1981 – 2016.
The overall test result for the F statistic gives a measure of how well, statistically speaking, the regression model affected the amount of variation in the yield variable. The adjusted r-square provides an estimate of how much of the yield variance can be explained by independent variables.
If a statistically significant interaction term is found in the full regression model, then the difference in yields over time depends on the type of crop being produced so the type of crop will moderate the effects of time on yield. If there is no statistically significant interaction term then all crops would follow the same yield growth trajectory over time, and only the type of crop would affect the yield levels of each crop. This means the type of crop would not influence the trajectory of yield growth for all crops.
Iizumi, T. (2019). Global dataset of historical yields v1.2 and v1.3 aligned version [Data set]. PANGAEA. https://doi.pangaea.de/10.1594/PANGAEA.909132