Introduction

Agricultural productivity is one of the most critical factors in global food security. As climate patterns shift and farming technologies evolve, understanding how crop yields have changed over time — and whether that change differs by crop type — has significant implications for policy, economics, and sustainability planning.

This project investigates the following research question: Does crop type and year (time) significantly predict global crop yield (in tonnes per hectare), and is there an interaction between crop type and year over the 1981–2016 period?

The dataset used is the Global Dataset of Historical Yields (GDHY) v1.2 & v1.3, sourced from PANGAEA Data Publisher for Earth & Environmental Science (Iizumi, 2019). The dataset contains global gridded crop yield estimates from 1981 to 2016, aggregated at the annual level across three major crops.

The variables used in this analysis are:


Hypotheses

The multiple linear regression model tested is:

\[\hat{Y} = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{crop}) + \beta_3(\text{year} \times \text{crop}) + \varepsilon\]

For the overall model:

For the interaction term specifically:

Significance level: α = 0.05


Data Preparation

library(tidyverse)

# Load the dataset — place gdhy_crop_yield.csv in the same folder as this Rmd
crop_data <- read_csv("gdhy_crop_yield.csv")

# Preview the data
glimpse(crop_data)
## Rows: 108
## Columns: 3
## $ year      <dbl> 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, …
## $ crop      <chr> "maize", "maize", "maize", "maize", "maize", "maize", "maize…
## $ yield_tha <dbl> 2.4217, 2.8218, 2.5354, 2.8033, 2.9371, 2.9800, 2.9722, 2.77…
head(crop_data)
## # A tibble: 6 × 3
##    year crop  yield_tha
##   <dbl> <chr>     <dbl>
## 1  1981 maize      2.42
## 2  1982 maize      2.82
## 3  1983 maize      2.54
## 4  1984 maize      2.80
## 5  1985 maize      2.94
## 6  1986 maize      2.98
# Ensure crop is a factor and remove any missing values
crop_clean <- crop_data %>%
  select(year, crop, yield_tha) %>%
  filter(!is.na(yield_tha), !is.na(year), !is.na(crop)) %>%
  mutate(crop = factor(crop))

# Confirm structure
summary(crop_clean)
##       year           crop      yield_tha    
##  Min.   :1981   maize  :36   Min.   :1.544  
##  1st Qu.:1990   soybean:36   1st Qu.:2.297  
##  Median :1998   wheat  :36   Median :3.032  
##  Mean   :1998                Mean   :3.051  
##  3rd Qu.:2007                3rd Qu.:3.654  
##  Max.   :2016                Max.   :5.288
cat("\nNumber of observations:", nrow(crop_clean), "\n")
## 
## Number of observations: 108
cat("Crop levels:", levels(crop_clean$crop), "\n")
## Crop levels: maize soybean wheat
cat("Year range:", min(crop_clean$year), "to", max(crop_clean$year), "\n")
## Year range: 1981 to 2016

Exploratory Data Analysis

# Summary statistics by crop type
crop_clean %>%
  group_by(crop) %>%
  summarise(
    n          = n(),
    mean_yield = round(mean(yield_tha), 3),
    sd_yield   = round(sd(yield_tha), 3),
    min_yield  = round(min(yield_tha), 3),
    max_yield  = round(max(yield_tha), 3)
  )
## # A tibble: 3 × 6
##   crop        n mean_yield sd_yield min_yield max_yield
##   <fct>   <int>      <dbl>    <dbl>     <dbl>     <dbl>
## 1 maize      36       3.71    0.777      2.42      5.29
## 2 soybean    36       2.12    0.278      1.54      2.61
## 3 wheat      36       3.33    0.473      1.68      4.13
# Yield over time by crop type with linear trend lines
ggplot(crop_clean, aes(x = year, y = yield_tha, color = crop)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title    = "Global Crop Yield Over Time by Crop Type (1981-2016)",
    subtitle = "Linear trend lines with 95% confidence intervals",
    x        = "Year",
    y        = "Yield (tonnes per hectare)",
    color    = "Crop Type"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

The scatter plot reveals distinct yield levels and trend slopes across the three crops. Maize shows the steepest upward trend over the 35-year period, while soybean and wheat show more moderate increases. Differences in slope across crops are a visual indication that the interaction term may be statistically significant.


Multiple Linear Regression

# Fit the multiple linear regression model with interaction
# year * crop expands to: year + crop + year:crop
model <- lm(yield_tha ~ year * crop, data = crop_clean)

# Full model summary
summary(model)
## 
## Call:
## lm(formula = yield_tha ~ year * crop, data = crop_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93208 -0.07281 -0.01033  0.08802  0.38770 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.408e+02  4.846e+00 -29.054  < 2e-16 ***
## year              7.231e-02  2.425e-03  29.819  < 2e-16 ***
## cropsoybean       9.195e+01  6.854e+00  13.416  < 2e-16 ***
## cropwheat         6.271e+01  6.854e+00   9.150 6.17e-15 ***
## year:cropsoybean -4.680e-02  3.429e-03 -13.648  < 2e-16 ***
## year:cropwheat   -3.157e-02  3.429e-03  -9.206 4.66e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1512 on 102 degrees of freedom
## Multiple R-squared:  0.9713, Adjusted R-squared:  0.9699 
## F-statistic: 690.1 on 5 and 102 DF,  p-value: < 2.2e-16
# ANOVA table — tests the significance of each term in the model
anova(model)
## Analysis of Variance Table
## 
## Response: yield_tha
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## year        1 24.863 24.8626 1088.253 < 2.2e-16 ***
## crop        2 49.542 24.7709 1084.241 < 2.2e-16 ***
## year:crop   2  4.428  2.2142   96.915 < 2.2e-16 ***
## Residuals 102  2.330  0.0228                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Diagnostics

par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

The four diagnostic plots assess whether the assumptions of linear regression are:


Results and Interpretation

library(broom)

# Tidy coefficient table
tidy(model) %>%
  mutate(across(where(is.numeric), ~ round(., 4)))
## # A tibble: 6 × 5
##   term              estimate std.error statistic p.value
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)      -141.        4.85      -29.1        0
## 2 year                0.0723    0.0024     29.8        0
## 3 cropsoybean        92.0       6.85       13.4        0
## 4 cropwheat          62.7       6.85        9.15       0
## 5 year:cropsoybean   -0.0468    0.0034    -13.6        0
## 6 year:cropwheat     -0.0316    0.0034     -9.21       0
# Model fit statistics
glance(model) %>%
  select(r.squared, adj.r.squared, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~ round(., 4)))
## # A tibble: 1 × 5
##   r.squared adj.r.squared statistic p.value    df
##       <dbl>         <dbl>     <dbl>   <dbl> <dbl>
## 1     0.971         0.970      690.       0     5

Interpretation of coefficients:


Conclusion

# Overlay model predictions on observed data
crop_clean <- crop_clean %>%
  mutate(predicted = fitted(model))

ggplot(crop_clean, aes(x = year, color = crop)) +
  geom_point(aes(y = yield_tha), alpha = 0.3, size = 1.5) +
  geom_line(aes(y = predicted), linewidth = 1.2) +
  labs(
    title    = "Observed vs. Predicted Crop Yield (1981-2016)",
    subtitle = "Solid lines = model predictions from MLR with interaction term",
    x        = "Year",
    y        = "Yield (tonnes per hectare)",
    color    = "Crop Type"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Multiple linear regression was used to analyze whether year, crop type, or an interaction of the two could be used to explain global crop yields for the years 1981 – 2016.

The overall test result for the F statistic gives a measure of how well, statistically speaking, the regression model affected the amount of variation in the yield variable. The adjusted r-square provides an estimate of how much of the yield variance can be explained by independent variables.

If a statistically significant interaction term is found in the full regression model, then the difference in yields over time depends on the type of crop being produced so the type of crop will moderate the effects of time on yield. If there is no statistically significant interaction term then all crops would follow the same yield growth trajectory over time, and only the type of crop would affect the yield levels of each crop. This means the type of crop would not influence the trajectory of yield growth for all crops.


References

Iizumi, T. (2019). Global dataset of historical yields v1.2 and v1.3 aligned version [Data set]. PANGAEA. https://doi.pangaea.de/10.1594/PANGAEA.909132