Inflation Forecasting Competition by the Economic Growth & Forecasting Lab at IBA-SESS

Intro to Competition

This is part of the Inflation Forecasting Competition organized by the Economic Growth & Forecasting Lab at IBA-SESS. The objective is to develop accurate models for predicting inflation using historical economic indicators and related variables. The competition promotes the use of various forecasting techniques to enhance predictive accuracy, with a focus on regression-based models.

Methodology Overview

We utilize a linear regression model as the primary method for forecasting inflation. The model is trained on historical data, where predictors (independent variables) include factors such as consumer confidence, credit to the private sector, interest rates, oil prices, and real exchange rates. Below is an outline of the modeling process:

Data Preparation

  • The dataset is cleaned and prepared for analysis. Key steps include:
    • Renaming columns for better understanding.
    • Converting date formats to time-series-compatible formats.
    • Creating seasonal dummy variables to capture monthly variations.
  • Variables are refined based on statistical significance (p-value < 0.05) to ensure only the most relevant predictors are included in the model.

Modeling Process

The forecasting process involves the following steps:

  1. Linear Regression Model:
    • We apply a linear regression model to establish the relationship between the selected predictors and inflation.
    • The data is split into training and testing sets based on the specified time frame to validate model performance.
  2. Rolling Window Techniques (a minimal index sketch follows this list):
    • Fixed Rolling Window:
      • A fixed window of 100 observations is maintained to forecast 12 months ahead.
    • Expanding Rolling Window:
      • The model starts with an initial window of 100 observations and expands by one observation at each step to forecast 12 months ahead.
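
As a rough sketch of how the two window types differ, the snippet below shows the training indices each approach would use at a hypothetical step. The names `window_size`, `horizon`, and `i` are illustrative, matching the settings used in the code later in this document.

# Minimal sketch of the two windowing schemes (illustrative step i = 5)
window_size <- 100   # observations per training window, as used later
horizon <- 12        # months forecast ahead
i <- 5               # hypothetical step counter

# Fixed rolling window: 100 consecutive observations that slide forward with i
fixed_idx <- i:(i + window_size - 1)       # observations 5 to 104

# Expanding rolling window: starts at 100 observations, grows by one each step
expanding_idx <- 1:(window_size + i - 1)   # observations 1 to 104

# In both cases, the fitted model then forecasts the next `horizon` months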

Evaluation Metrics

  • We assess model accuracy using Root Mean Square Error (RMSE), which quantifies the typical gap between actual and predicted inflation rates (a one-line computation, sketched after this list).
  • We also plot actual vs. predicted inflation to visually evaluate the model’s performance across different forecasting methods.
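
For reference, RMSE reduces to a one-line computation in R; a minimal sketch with hypothetical `actual` and `predicted` vectors:

# RMSE: square root of the mean squared gap between actual and predicted values
actual <- c(24.0, 24.6, 27.6)      # hypothetical inflation values
predicted <- c(24.1, 23.6, 25.1)   # hypothetical model predictions
rmse <- sqrt(mean((actual - predicted)^2))
rmse  # single accuracy figure in the same units as inflation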

Key Insights

  • The linear regression model identifies and uses the most significant predictors for inflation forecasting.
  • Both fixed and expanding rolling windows capture time-varying dynamics in inflation.
  • The results provide insights into effective strategies for real-world inflation forecasting.
  • The use of rolling window techniques adds robustness to the model, allowing for better adaptation to shifting economic patterns over time.

The full code is as follows:

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Importing and saving the dataset
data <- read.csv("C:/Users/Abdul Qudoos/Desktop/kaggle-EGF.csv", skip = 2, header = TRUE)
# Checking the structure of the data
str(data)
## 'data.frame':    142 obs. of  25 variables:
##  $ Date: chr  "1/1/2012" "2/1/2012" "3/1/2012" "4/1/2012" ...
##  $ V1  : num  10.1 11.2 10.8 11.3 12.3 ...
##  $ V2  : num  3492649 3567035 3606844 3602125 3626809 ...
##  $ V3  : num  15722 24172 23286 21337 23713 ...
##  $ V4  : num  29.5 29.9 30.4 29.2 28 ...
##  $ V5  : num  10.9 11.5 11.3 11.3 10.8 ...
##  $ V6  : num  136 140 140 140 134 ...
##  $ V7  : num  79.8 80.8 81.6 82.4 83.8 ...
##  $ V8  : num  3017343 2973088 2959743 2966976 2924351 ...
##  $ V9  : num  5682167 5716810 5920093 5935537 6021033 ...
##  $ V10 : num  13.2 13.1 12.8 12.8 12.9 ...
##  $ V11 : num  90.3 90.7 90.8 90.7 91.5 ...
##  $ V12 : num  71.8 95.1 67.6 81.3 91.3 ...
##  $ V13 : num  602 603 603 604 605 ...
##  $ V14 : num  11875 12878 13762 13990 13787 ...
##  $ V15 : num  11.8 11.7 11.8 11.8 11.9 ...
##  $ V16 : num  11.6 11.6 11.7 11.8 11.8 ...
##  $ V17 : num  8226 8947 9650 9813 9714 ...
##  $ V18 : num  5.88 5.85 5.74 5.76 5.88 5.82 5.88 5.87 5.75 5.64 ...
##  $ V19 : num  107 113 118 114 104 ...
##  $ V20 : num  12 12 12 12 12 12 12 10.5 10.5 10 ...
##  $ V21 : num  3402132 3496691 3534391 3602893 3691756 ...
##  $ V22 : num  87.5 87.6 87.7 87.9 88.1 ...
##  $ V23 : num  105 103 104 106 107 ...
##  $ V24 : num  11 11.6 11.3 11.3 11.1 ...
head(data)
##       Date    V1      V2    V3    V4    V5     V6    V7      V8      V9   V10
## 1 1/1/2012 10.11 3492649 15722 29.48 10.90 136.29 79.84 3017343 5682167 13.18
## 2 2/1/2012 11.16 3567035 24172 29.94 11.52 140.04 80.80 2973088 5716810 13.14
## 3 3/1/2012 10.84 3606844 23286 30.39 11.27 140.11 81.58 2959743 5920093 12.80
## 4 4/1/2012 11.26 3602125 21337 29.18 11.33 139.73 82.43 2966976 5935537 12.83
## 5 5/1/2012 12.29 3626809 23713 27.96 10.85 134.00 83.78 2924351 6021033 12.94
## 6 6/1/2012 11.22 3739942 27917 30.07 11.10 131.24 83.76 2921997 6402735 13.13
##     V11   V12    V13      V14   V15   V16     V17  V18    V19 V20     V21   V22
## 1 90.27 71.78 601.96 11874.89 11.75 11.65 8225.97 5.88 106.71  12 3402132 87.50
## 2 90.74 95.06 602.73 12877.88 11.74 11.65 8946.64 5.85 112.71  12 3496691 87.63
## 3 90.79 67.61 603.49 13761.76 11.81 11.70 9649.95 5.74 117.83  12 3534391 87.74
## 4 90.71 81.32 604.26 13990.38 11.85 11.75 9812.71 5.76 113.76  12 3602893 87.88
## 5 91.48 91.31 605.02 13786.62 11.87 11.76 9714.33 5.88 104.34  12 3691756 88.10
## 6 94.26 97.20 605.78 13801.41 11.88 11.78 9708.31 5.82  90.91  12 3799917 88.35
##      V23   V24
## 1 104.74 10.98
## 2 102.94 11.65
## 3 104.09 11.26
## 4 105.76 11.31
## 5 107.39 11.11
## 6 105.47 11.48
sum(is.na(data))
## [1] 0

We removed the first two rows during import because they contained only metadata and redundant headers. As a result, the columns were imported with generic names like ‘V1’, ‘V2’, etc.; we now rename them to meaningful, interpretable names based on the original header row, so that each column has a clear, useful name for analysis and modeling.

colnames(data) <- c("Date", "Inflation", "ADV", "AUTOS_F", "CCI", "CMR_E", "COMM",
                    "CPI", "CPS", "DEP", "DR", "E", "EPU2", "FERTS_F", "K100", "K1Y",
                    "K6M", "KALL", "LR", "OIL", "PR", "PSB", "QIM", "REER", "WAONR_A")

Since there are no missing values in the dataset, we can proceed without any imputation. However, the ‘Date’ column is currently stored as strings, which is not suitable for time series forecasting. Converting it to a proper date format lets us treat ‘Date’ as a time index, enabling accurate modeling and forecasting.

data$Date <- as.Date(data$Date, format="%m/%d/%Y")

Accounting for Seasonality in Inflation Forecasting

Inflation often shows seasonal patterns throughout the year, which can affect forecasts. To capture these patterns, we convert months into dummy variables, representing each month as a binary indicator (1 or 0). This helps the model understand month-specific effects on inflation, making predictions more accurate by accounting for recurring trends.

# Adding seasonal dummies for months
data$Month <- as.factor(format(data$Date, "%m"))  # Converting month to a factor variable

# Creating dummy variables for each month
month_dummies <- model.matrix(~ Month - 1, data = data) %>% as.data.frame()

# Combining month dummies with the original data (excluding 'Inflation', 'Date', and 'Month')
X <- cbind(data %>% select(-Inflation, -Date, -Month), month_dummies)
# Including these dummies lets the model account for month-specific variation in the target variable (inflation), capturing seasonal trends that may affect the predictions.
y <- data$Inflation

Initial Linear Regression Model

To begin the forecasting process, we start by creating an initial linear regression model using all the variables available in the dataset. The primary goal is to understand which variables significantly impact the inflation rate. The model summary gives p-values for each variable, helping us identify which variables have a statistically significant effect (commonly considered as p < 0.05). Variables that meet this criterion will be used to build a refined model, ensuring that we only include relevant predictors, making the model both efficient and interpretable.

lm_model_initial <- lm(y ~ ., data = X)

# Displaying the summary of the model
summary(lm_model_initial)
## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4033 -0.7168 -0.0509  0.6682  5.5302 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.006e+00  6.490e+01  -0.139 0.889893    
## ADV         -1.274e-06  1.304e-06  -0.977 0.330917    
## AUTOS_F     -8.808e-05  4.513e-05  -1.952 0.053583 .  
## CCI         -2.090e-01  6.913e-02  -3.023 0.003134 ** 
## CMR_E       -2.981e-01  6.298e-01  -0.473 0.636963    
## COMM        -5.061e-02  3.500e-02  -1.446 0.151042    
## CPI          1.575e-01  8.207e-02   1.919 0.057685 .  
## CPS          4.219e-06  1.899e-06   2.222 0.028398 *  
## DEP         -8.097e-07  5.374e-07  -1.507 0.134851    
## DR           1.240e+00  4.894e-01   2.534 0.012722 *  
## E            1.254e-01  5.790e-02   2.166 0.032531 *  
## EPU2        -3.072e-03  4.792e-03  -0.641 0.522845    
## FERTS_F     -3.893e-02  1.020e-01  -0.382 0.703415    
## K100         2.475e-04  2.444e-04   1.013 0.313525    
## K1Y         -1.953e+00  1.860e+00  -1.050 0.296142    
## K6M          5.269e-01  2.106e+00   0.250 0.802926    
## KALL        -5.075e-04  3.511e-04  -1.445 0.151275    
## LR          -4.493e+00  7.205e-01  -6.236 9.14e-09 ***
## OIL          7.813e-02  2.277e-02   3.431 0.000855 ***
## PR           2.595e+00  7.706e-01   3.368 0.001054 ** 
## PSB         -4.206e-07  5.891e-07  -0.714 0.476808    
## QIM          2.252e-02  5.599e-02   0.402 0.688339    
## REER         2.083e-01  8.616e-02   2.418 0.017292 *  
## WAONR_A      1.426e-01  8.750e-01   0.163 0.870837    
## Month01      3.819e-01  6.991e-01   0.546 0.586067    
## Month02      2.017e-01  7.343e-01   0.275 0.784117    
## Month03      3.125e-01  7.490e-01   0.417 0.677331    
## Month04     -3.489e-01  7.620e-01  -0.458 0.648010    
## Month05      1.975e-01  7.761e-01   0.254 0.799623    
## Month06      5.903e-02  7.635e-01   0.077 0.938521    
## Month07     -3.736e-01  7.853e-01  -0.476 0.635190    
## Month08      3.598e-01  7.851e-01   0.458 0.647683    
## Month09      3.531e-01  7.832e-01   0.451 0.653036    
## Month10     -2.383e-01  8.018e-01  -0.297 0.766911    
## Month11     -2.294e-01  7.414e-01  -0.309 0.757605    
## Month12             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.476 on 107 degrees of freedom
## Multiple R-squared:  0.9718, Adjusted R-squared:  0.9629 
## F-statistic: 108.5 on 34 and 107 DF,  p-value: < 2.2e-16

We will keep only the variables with p-values < 0.05 from the initial model, since these are statistically significant. Focusing on these key predictors of inflation keeps the model parsimonious and interpretable.

significant_vars <- c("CCI", "CPS", "DR", "E", "LR", "OIL", "PR", "REER")

# Filtering to keep only these variables; this filter applies to the independent variables only
X_refined <- X %>% select(all_of(significant_vars))

Data Preparation for Training and Testing

To prepare the data for training and testing, we split it into two sets based on specified date ranges:

#Creating indices for train and test sets based on the date range
train_indices <- which(data$Date >= "2011-01-01" & data$Date <= "2022-10-01")
test_indices <- which(data$Date >= "2022-11-01" & data$Date <= "2023-10-01")

# Filtering the refined predictors and the target variable for training and testing
X_train <- X_refined[train_indices, ]
y_train <- y[train_indices]
X_test <- X_refined[test_indices, ]
y_test <- y[test_indices]

Building the Refined Linear Regression Model

After selecting the significant variables, we now train a linear regression model using only those variables. The refined model aims to predict inflation more accurately by focusing on the most relevant factors.

lm_model_refined <- lm(y_train ~ ., data = X_train)
summary(lm_model_refined)
## 
## Call:
## lm(formula = y_train ~ ., data = X_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3080 -0.5734  0.1358  0.7200  3.5277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.981e+01  3.699e+00  -8.060 6.17e-13 ***
## CCI         -2.332e-01  4.005e-02  -5.823 4.85e-08 ***
## CPS         -6.601e-07  3.881e-07  -1.701 0.091516 .  
## DR           2.085e-01  4.135e-01   0.504 0.615003    
## E            1.436e-01  1.960e-02   7.330 2.84e-11 ***
## LR          -1.846e+00  5.948e-01  -3.104 0.002379 ** 
## OIL          3.987e-02  8.098e-03   4.923 2.73e-06 ***
## PR           1.341e+00  3.623e-01   3.701 0.000325 ***
## REER         2.297e-01  2.414e-02   9.513 2.33e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 121 degrees of freedom
## Multiple R-squared:  0.9224, Adjusted R-squared:  0.9173 
## F-statistic: 179.9 on 8 and 121 DF,  p-value: < 2.2e-16

Making Predictions on the Test Data

After training the refined model, we use it to predict inflation on the test dataset. The accuracy of these predictions is measured using the Root Mean Square Error (RMSE), which provides a sense of how well the model forecasts inflation out-of-sample.

The comparison between the actual inflation values and the predicted inflation values is displayed in a table for better visualization.

# Making predictions on the test data using the refined model
y_pred_refined <- predict(lm_model_refined, X_test)

# Calculate RMSE for the out-of-sample predictions
rmse_refined <- sqrt(mean((y_test - y_pred_refined)^2))
print(paste("Refined Out-of-Sample RMSE is:", rmse_refined))
## [1] "Refined Out-of-Sample RMSE is: 4.96129962204243"
# Creating a comparison table of original vs. predicted inflation
comparison_table <- data.frame(
  Date = data$Date[test_indices],
  Actual_Inflation = y_test,
  Predicted_Inflation = y_pred_refined)

# Displaying the comparison table
print(comparison_table)
##           Date Actual_Inflation Predicted_Inflation
## 131 2022-11-01            23.99            24.10274
## 132 2022-12-01            24.60            23.62223
## 133 2023-01-01            27.55            25.14890
## 134 2023-02-01            31.59            29.10232
## 135 2023-03-01            35.18            34.72483
## 136 2023-04-01            36.30            36.05355
## 137 2023-05-01            38.12            34.68895
## 138 2023-06-01            29.28            33.90944
## 139 2023-07-01            28.16            34.20883
## 140 2023-08-01            27.25            36.33808
## 141 2023-09-01            31.49            38.61699
## 142 2023-10-01            26.86            35.74842

Visualizing the Final Results: Actual vs. Predicted Inflation

The following plot compares the actual inflation values with the predicted values from the refined model over the test period.

# Plot actual vs. predicted inflation for the refined model
par(mar = c(5, 5, 4, 2) + 0.1) # Adjusting margins for better visualization

plot(data$Date[test_indices], y_test, type = "o", col = "blue", 
     xlab = "Date", ylab = "Inflation",
     main = "Final Results: Actual vs. Predicted Inflation", 
     ylim = range(c(y_test, y_pred_refined), na.rm = TRUE))

# Adding the line for predicted inflation
lines(data$Date[test_indices], y_pred_refined, type = "o", col = "red")

# Adding a legend to the plot to distinguish between actual and predicted values
legend("topright", legend = c("Actual", "Predicted"), 
       col = c("blue", "red"), lty = 1, bty = "n")

Implementing the Fixed Rolling Window Approach

In this section, we apply a fixed rolling window method to forecast inflation using a linear regression model. A fixed window size of 100 observations is used: at each step the model trains on a block of 100 consecutive observations, the window slides forward by one observation, and the model predicts the next 12 months.

# Setting the window size for the fixed rolling window approach
window_size <- 100
fixed_rmse <- c()  #This will store the RMSEs for each step
fixed_forecasts <- rep(NA, length(y_test))  # To store the 12-month forecasts

# Loop to perform fixed rolling window forecasts
# Note: with a 12-observation test set, only one full 12-month window fits,
# so the loop bound (nrow(X_test) - 11) evaluates to 1 and the loop runs once.
for (i in 1:(nrow(X_test) - 11)) {
  # Defining the current window of data (100 observations)
  X_window <- X_train[i:(i + window_size - 1), ]
  y_window <- y_train[i:(i + window_size - 1)]
  
  # Training the linear model using the current window of data
  lm_model_window <- lm(y_window ~ ., data = X_window)
  
  # Making 12-month-ahead predictions using the trained model
  X_12 <- X_test[i:(i + 11), ]
  y_pred_12 <- predict(lm_model_window, X_12)
  
  # Storing the 12-month forecasts to compare with actual values later
  fixed_forecasts[i:(i + 11)] <- y_pred_12
  
  # Calculating RMSE for the current window and storing it
  fixed_rmse[i] <- sqrt(mean((y_test[i:(i + 11)] - y_pred_12)^2))
}

# Calculate average RMSE for the fixed rolling window
avg_fixed_rmse <- mean(fixed_rmse, na.rm = TRUE)
print(paste("Average RMSE for Fixed Rolling Window is:", avg_fixed_rmse))
## [1] "Average RMSE for Fixed Rolling Window is: 6.90804822347697"

Plotting Actual vs. 12-Month Forecasts (Fixed Rolling Window)

Here we create a plot to visualize the actual inflation values against the 12-month forecasts generated using the fixed rolling window approach. This plot helps us understand how well the model’s forecasts align with the actual inflation data over time.

# Adjusting the margins for better visualization
par(mar = c(5, 5, 4, 2) + 0.1)

# Plotting the actual vs. 12-month forecasts from the fixed rolling window
plot(data$Date[test_indices], y_test, type = "o", col = "blue", xlab = "Date", ylab = "Inflation",
     main = "Actual vs. Fixed Rolling Window 12-Month Forecasts", ylim = range(c(y_test, fixed_forecasts), na.rm = TRUE))

# Adding the line for 12-month forecasts
lines(data$Date[test_indices], fixed_forecasts, type = "o", col = "red")

# Adding a legend to the plot to distinguish between actual values and 12-month forecasts
legend("topright", legend = c("Actual", "12-Month Forecast"), col = c("blue", "red"), lty = 1, bty = "n")

Expanding Rolling Window Forecast

Now, we implement an expanding rolling window approach to predict inflation. Unlike the fixed window, this method expands the training data with each iteration, adding one observation at a time. This allows the model to use more historical data as it makes predictions, potentially improving accuracy over time.

# Initializing variables for the expanding rolling window
expanding_rmse <- c()  # To store the RMSE for each window
expanding_forecasts <- rep(NA, length(y_test))  # To store the 12-month forecasts

# Loop for expanding rolling window forecast
# Note: as above, the bound (nrow(X_test) - 11) gives a single iteration here,
# and the first expanding window (observations 1 to 100) coincides with the
# first fixed window, which is why the two average RMSEs match exactly.
for (i in 1:(nrow(X_test) - 11)) {
  # Defining the expanding window (grows by one observation per step)
  X_window <- X_train[1:(window_size + i - 1), ]
  y_window <- y_train[1:(window_size + i - 1)]
  
  # Training the model on the expanding window
  lm_model_expanding <- lm(y_window ~ ., data = X_window)
  
  # Making 12-month-ahead predictions using the trained model
  X_12 <- X_test[i:(i + 11), ]
  y_pred_12 <- predict(lm_model_expanding, X_12)
  
  # Storing the 12-month forecasts to compare with actual values later
  expanding_forecasts[i:(i + 11)] <- y_pred_12
  
  # Calculating RMSE for the current window and storing it
  expanding_rmse[i] <- sqrt(mean((y_test[i:(i + 11)] - y_pred_12)^2))
}

# Calculate average RMSE for the expanding rolling window
avg_expanding_rmse <- mean(expanding_rmse, na.rm = TRUE)
print(paste("Average RMSE for Expanding Rolling Window:", avg_expanding_rmse))
## [1] "Average RMSE for Expanding Rolling Window: 6.90804822347697"

Expanding Rolling Window: Actual vs. 12-Month Forecasts

The plot below shows the comparison between actual inflation and the 12-month forecasts generated using the expanding rolling window method.

# Adjusting the margins for better visualization (must be set before plotting)
par(mar = c(5, 5, 4, 2) + 0.1)

# Plot actual vs. 12-month forecasts from the expanding rolling window
plot(data$Date[test_indices], y_test, type = "o", col = "blue", xlab = "Date", ylab = "Inflation",
     main = "Actual vs. Expanding Rolling Window 12-Month Forecasts", 
     ylim = range(c(y_test, expanding_forecasts), na.rm = TRUE))

# Adding the line for 12-month forecasts
lines(data$Date[test_indices], expanding_forecasts, type = "o", col = "red")

# Adding a legend to the plot to distinguish between actual values and 12-month forecasts
legend("topright", legend = c("Actual", "12-Month Forecast"), col = c("blue", "red"), lty = 1, bty = "n")

Final Results Summary

Based on the three different forecasting approaches, here are the final results:

Refined Model

  • Refined Out-of-Sample RMSE: 4.96
    • The refined linear regression model, which uses only the statistically significant predictors, achieved a root mean square error (RMSE) of 4.96, i.e. a typical deviation of about 4.96 percentage points between actual and predicted inflation over the test period. This makes it the most accurate of the three approaches.

Fixed Rolling Window

  • Average RMSE for Fixed Rolling Window: 6.91
    • This approach slid a fixed 100-observation window across the training data to generate 12-month-ahead forecasts. Its average RMSE was markedly higher than the refined model’s, indicating less accurate forecasts.

Expanding Rolling Window

  • Average RMSE for Expanding Rolling Window: 6.91
    • This approach began with an initial 100-observation window and expanded it by one observation at each step. Its RMSE is identical to the fixed rolling window’s: with only 12 test observations, a single 12-month forecasting window fits, and the first expanding window (observations 1 to 100) coincides with the first fixed window, so both methods estimate the same model here.

Conclusion

  • The refined model had the lowest RMSE, indicating it tracked actual inflation most accurately over the test period.
  • The fixed and expanding rolling window methods produced identical RMSEs because the short test period allowed only one forecasting window; over a longer evaluation horizon, where the two windowing schemes actually diverge, their relative performance could differ.