Wholesale Price Index Forecasting Using ARIMA Model

Set working directory

Read Dataset

wpi <- read.csv("Wholesal_price_index.csv")

Load Libraries

library(tseries)
library(Hmisc)
library(psych)
library(ggplot2)

View the Structure of the dataset

str(wpi)

## 'data.frame':    113 obs. of  2 variables:
##  $ Monthly              : chr  "2012M04" "2012M05" "2012M06" "2012M07" ...
##  $ wholesale.price.index: num  105 105 105 106 107 ...

View(wpi)

Check for missing values

Find the rows with missing values.

missing_row_index <- which(is.na(wpi$wholesale.price.index))
print(missing_row_index)

## [1] 101

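Calculate the median and impute the missing value

The median is a robust estimate of central tendency that is not distorted by outliers or extreme values. Note, however, that in a trending series the overall median can sit far from the neighbouring observations: here the imputed value of 114.3 falls between 121.0 and 122.9.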

median_wpi <- median(wpi$wholesale.price.index, na.rm = TRUE)

wpi$wholesale.price.index[is.na(wpi$wholesale.price.index)] <- median_wpi

print(wpi$wholesale.price.index)

##   [1] 104.7 105.3 105.3 106.2 106.9 107.6 107.4 107.3 107.1 108.0 108.4 108.6
##  [13] 108.6 108.6 110.1 111.2 112.9 114.3 114.6 114.3 113.4 113.6 113.6 114.3
##  [25] 114.1 114.8 115.2 116.7 117.2 116.4 115.6 114.1 112.1 110.8 109.6 109.9
##  [37] 110.2 111.4 111.8 111.1 110.0 109.9 110.1 109.9 109.4 108.0 107.1 107.7
##  [49] 109.0 110.4 111.7 111.8 111.2 111.4 111.5 111.9 111.7 112.6 113.0 113.2
##  [61] 113.2 112.9 112.7 113.9 114.8 114.9 115.6 116.4 115.7 116.0 116.1 116.3
##  [73] 117.3 118.3 119.1 119.9 120.1 120.9 122.0 121.6 119.7 119.2 119.5 119.9
##  [85] 121.1 121.6 121.5 121.3 121.5 121.3 122.0 122.3 123.0 123.4 122.2 120.4
##  [97] 119.2 117.5 119.3 121.0 114.3 122.9 123.6 125.1 125.4 126.5 128.1 129.9
## [109] 132.0 132.9 133.7 134.5 135.9
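Because the index trends over time, interpolating between the neighbouring months is one possible alternative to the median. The following is a minimal sketch, not the method used in this analysis. It applies base R's approx() to the five values around position 101, with the gap restored as NA; the names v, idx and filled are made up for the example.

```r
# Five observations around the gap (positions 99 to 103), with the
# missing value restored as NA. Names here are illustrative only.
v <- c(119.3, 121.0, NA, 122.9, 123.6)
idx <- seq_along(v)

# Linear interpolation over the observed points, evaluated at every index
filled <- approx(x = idx[!is.na(v)], y = v[!is.na(v)], xout = idx)$y
print(filled)
```

The interpolated value lies midway between its two neighbours, so it would not introduce an artificial dip into the series.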

Create a Time series plot

Convert the column to a numeric vector
Create a time series object
Plot the time series

wpi_numeric <- as.numeric(wpi$wholesale.price.index)

x <- ts(wpi_numeric, start = c(2012, 4), frequency = 12, end = c(2021, 8))

ts.plot(x, main = "Wholesale Price Index", xlab = "Year", ylab = "Percentage")

From 2012 to 2021, the wholesale price index went through three distinct phases, each with implications for businesses. It rose from 104.7 in April 2012 to 117.2 in August 2014, indicating higher costs for products and services. It then fell to 107.1 by February 2016, a period of relative affordability for firms. From 2016 onwards the index climbed again, with the steepest rise in late 2020 and 2021, reaching 135.9 in August 2021. Rising wholesale costs of this kind can squeeze profitability and force changes to pricing strategy, so businesses should monitor the index regularly and adjust their plans as market conditions change.

Identify outliers using a boxplot

boxplot(x)

boxplot.stats(x)

## $stats
## [1] 104.7 110.8 114.3 120.1 133.7
## 
## $n
## [1] 113
## 
## $conf
## [1] 112.9177 115.6823
## 
## $out
## [1] 134.5 135.9

The boxplot and summary statistics describe the distribution of the wholesale price index:
- The median, 114.3, is the central reference point for the distribution of prices over the period.
- The lower and upper hinges, 110.8 and 120.1, bound the middle 50% of the data, giving an interquartile range (IQR) of 9.3 index points.
- The values 134.5 and 135.9 are flagged as outliers. They are the last two observations in the series (July and August 2021), so they reflect the recent upward trend and do not point to data errors. They do not have a significant impact on the analysis.
A summary like this can help businesses in the following ways:
- Knowing the typical range of wholesale prices helps in developing competitive pricing strategies and in assessing price variations against industry standards.
- Monitoring outliers helps to identify unexpected market situations or interruptions, allowing rapid adjustments to supply chain or pricing plans.
- Understanding the central tendency and spread makes it easier to estimate future price fluctuations, which supports better decisions about inventory management, production planning and financial projections.
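To see why 134.5 and 135.9 are flagged while 133.7 is not, the outlier fences can be reproduced from the printed hinges. This is a small sketch that assumes the default coef = 1.5 of boxplot.stats(); the variable names are illustrative.

```r
# Five-number summary copied from the boxplot.stats() output
stats <- c(104.7, 110.8, 114.3, 120.1, 133.7)
lower_hinge <- stats[2]
upper_hinge <- stats[4]
iqr <- upper_hinge - lower_hinge

# Points further than 1.5 * IQR from the hinges are flagged (default coef = 1.5)
lower_fence <- lower_hinge - 1.5 * iqr
upper_fence <- upper_hinge + 1.5 * iqr
print(c(lower_fence, upper_fence))

# Compare the three largest observations with the upper fence
flagged <- c(133.7, 134.5, 135.9) > upper_fence
print(flagged)
```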

Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) Plots

acf(x, lag.max = 200, type = "correlation")

The ACF plot shows that recent values of the wholesale price index are strongly and positively correlated with the values immediately before them. This means that firms can use recent movements to anticipate what will happen in the near future. The correlation weakens as we look further back in time, so relying on older data to forecast prices is likely to be less effective. A gradual decay of this kind is also typical of a non-stationary series, which is tested formally below.
The term “lag” refers to how far back in time we look when comparing earlier observations with recent ones. A lag of 0 compares each observation with itself, a lag of 1 compares each observation with the one right before it, and so on. The autocorrelation function measures how strongly each observation in a time series is correlated with earlier values at each lag. An initial positive correlation therefore means that recent observations move in the same direction as the values just before them.
As the lag grows, this correlation declines, implying that older data is less useful for predicting present prices.
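The lag-1 autocorrelation described above can be computed by hand and checked against acf(). The sketch below uses only the first eight index values as a small example, so the number it prints is not the one shown in the plot; the variable names are illustrative.

```r
# First eight index values from the series, used only as a small example
v <- c(104.7, 105.3, 105.3, 106.2, 106.9, 107.6, 107.4, 107.3)
n <- length(v)
d <- v - mean(v)  # deviations from the sample mean

# Lag-1 autocorrelation: cross-products one step apart over the total sum of squares
r1_manual <- sum(d[1:(n - 1)] * d[2:n]) / sum(d^2)

# The same quantity from acf(); element 1 is lag 0, element 2 is lag 1
r1_acf <- acf(v, lag.max = 1, plot = FALSE)$acf[2]

print(c(manual = r1_manual, acf = r1_acf))
```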

pacf(x, lag.max = 50)

The PACF shows which lags have a direct effect on the current value once the influence of the shorter lags has been removed, which helps in choosing the order of the autoregressive part of a model. Here the PACF cuts off after lag 1, but we cannot yet use it to choose a model because the series is not stationary.

adf.test(x)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  x
## Dickey-Fuller = -1.1302, Lag order = 4, p-value = 0.9142
## alternative hypothesis: stationary

The Augmented Dickey-Fuller (ADF) test checks whether a time series is stationary, that is, whether its statistical properties such as mean and variance stay constant over time. The null hypothesis is that the series is non-stationary (it has a unit root). Here the Dickey-Fuller statistic is -1.1302 and the p-value is 0.9142, well above the usual 0.05 significance level, so we cannot reject the null hypothesis: the series shows no evidence of stationarity and appears to contain a changing trend. The series therefore needs to be transformed before an ARIMA model is fitted.

Convert the non-stationary series to a stationary one by differencing

dx <- diff(x, 1)
ts.plot(dx)

The diff() function in R calculates the differences between consecutive observations, which is a standard way to turn a non-stationary series into a stationary one. The plot shows these month-to-month changes from May 2012 to August 2021. A difference of -5 means that the index fell by 5 points compared with the previous observation, a difference of 0 means no change, and a difference of 5 means it rose by 5 points.
The differences appear to fluctuate around a roughly constant level instead of trending, which makes the series easier to model. The largest swings, a drop of 6.7 followed by a rise of 8.6, occur around observation 101 and are produced by the median-imputed value of 114.3, not by a real price movement.
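As a small illustration of what diff() computes, the sketch below applies it to the first five index values and compares the result with a manual subtraction. The variable names are illustrative.

```r
# First five index values from the series, used only as a small example
v <- c(104.7, 105.3, 105.3, 106.2, 106.9)

# diff(v, 1) subtracts each value from the one that follows it
d_builtin <- diff(v, 1)
d_manual <- v[-1] - v[-length(v)]

print(d_builtin)

# Differencing shortens the series by one observation
length(v) - length(d_builtin)
```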

adf.test(dx)

## Warning in adf.test(dx): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  dx
## Dickey-Fuller = -4.4256, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

The Augmented Dickey-Fuller test determines whether a time series is stationary or non-stationary. A stationary series has consistent statistical properties over time; a non-stationary series does not.
In these test results:

-   *The Dickey-Fuller statistic (-4.4256) is calculated using the data differences (dx).*
-   *The lag order, or the number of lags utilized in the test, is four.*
-   *The p-value for the test is reported as 0.01.*

The warning indicates that the true p-value is smaller than the printed value of 0.01. This happens when the test statistic falls outside the range of the table of critical values used by adf.test(), and it signals strong evidence against the null hypothesis. Because the p-value is below the usual 0.05 significance level, we reject the null hypothesis that the series is non-stationary and conclude that the differenced series is stationary, meaning that it keeps consistent statistical properties over time.
For organisations, this result means that the differenced series (dx) behaves consistently, which makes it suitable for modelling and forecasting. One round of differencing is therefore enough, which corresponds to d = 1 in the ARIMA model.

Create training dataset

x1 <- x[1:110]
x1

##   [1] 104.7 105.3 105.3 106.2 106.9 107.6 107.4 107.3 107.1 108.0 108.4 108.6
##  [13] 108.6 108.6 110.1 111.2 112.9 114.3 114.6 114.3 113.4 113.6 113.6 114.3
##  [25] 114.1 114.8 115.2 116.7 117.2 116.4 115.6 114.1 112.1 110.8 109.6 109.9
##  [37] 110.2 111.4 111.8 111.1 110.0 109.9 110.1 109.9 109.4 108.0 107.1 107.7
##  [49] 109.0 110.4 111.7 111.8 111.2 111.4 111.5 111.9 111.7 112.6 113.0 113.2
##  [61] 113.2 112.9 112.7 113.9 114.8 114.9 115.6 116.4 115.7 116.0 116.1 116.3
##  [73] 117.3 118.3 119.1 119.9 120.1 120.9 122.0 121.6 119.7 119.2 119.5 119.9
##  [85] 121.1 121.6 121.5 121.3 121.5 121.3 122.0 122.3 123.0 123.4 122.2 120.4
##  [97] 119.2 117.5 119.3 121.0 114.3 122.9 123.6 125.1 125.4 126.5 128.1 129.9
## [109] 132.0 132.9
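Subsetting with x[1:110] returns a plain numeric vector, which is why the forecasts later in this document are indexed 111 to 113 with a frequency of 1. As a hedged alternative, window() keeps the monthly time series attributes. The sketch below uses a placeholder series (x_demo, holding the numbers 1 to 113) with the same start date and frequency as x; the names x_demo and train are illustrative and not part of the original analysis.

```r
# Placeholder monthly series with the same time span as x (April 2012 to August 2021)
x_demo <- ts(seq_len(113), start = c(2012, 4), frequency = 12)

# Keep everything up to May 2021, i.e. the first 110 observations
train <- window(x_demo, end = c(2021, 5))

length(train)     # number of observations kept
frequency(train)  # still monthly
end(train)        # last period in the training set
```

A model fitted to a series split this way would report its forecasts by calendar month instead of by position.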

ARIMA Model

result <- arima(x1, order = c(1,1,0))
tsdiag(result)

The ARIMA(1,1,0) model describes each month-to-month change in the index in terms of the previous change. The diagnostic plots from tsdiag() tell us the following:

- Standardized residuals: these are the scaled differences between the actual prices and the values fitted by the model. Ideally they should be random and close to zero. Most of them are, but some lie far from zero, which suggests the model does not fit every period well.
- ACF of residuals: this checks whether any pattern remains in the residuals. We found a strong pattern about every five months, indicating that the model is not capturing all of the relevant information.
- Ljung-Box statistic: this tests whether the residual autocorrelations up to a given lag are jointly zero. The test pointed to remaining autocorrelation at some lags, indicating that the model may be missing some key components.

While the ARIMA model gives a fair overall picture of how wholesale prices vary over time, these plots show that it misses some of the structure in the data. The model may therefore need to be modified to produce more accurate forecasts.
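The Ljung-Box check shown in the diagnostic plot can also be run directly with Box.test(). The following is a rough sketch on a simulated series, since the data file is not reproduced here: sim, fit and lb are illustrative names, and the lag of 10 is an arbitrary choice, not a value taken from the analysis above.

```r
# Simulated ARIMA(1,1,0) series standing in for the training data (an assumption)
set.seed(42)
sim <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 110)

fit <- arima(sim, order = c(1, 1, 0))

# Ljung-Box test on the residuals; fitdf = 1 accounts for the one estimated AR coefficient
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)

# A small p-value would point to autocorrelation left in the residuals
lb$p.value
```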

Testing the model

predict(result, 3)

## $pred
## Time Series:
## Start = 111 
## End = 113 
## Frequency = 1 
## [1] 132.8450 132.8483 132.8481
## 
## $se
## Time Series:
## Start = 111 
## End = 113 
## Frequency = 1 
## [1] 1.373933 1.884582 2.286675

The ARIMA model forecasts the wholesale price index for the next three periods (observations 111 to 113) and gives a standard error for each forecast. The forecasts are 132.85, 132.85 and 132.85, with standard errors of 1.37, 1.88 and 2.29 respectively. The forecasts are almost flat, and the standard errors grow with the horizon, so the uncertainty increases the further ahead we predict. For businesses, such forecasts support decisions about manufacturing, pricing, inventory management and budgeting, and the standard errors indicate how much confidence to place in each forecast.
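The standard errors can be turned into approximate 95% intervals around each forecast. This sketch assumes the forecast errors are roughly normal, which the analysis above does not verify; the numbers are copied from the predict() output and the variable names are illustrative.

```r
# Forecasts and standard errors copied from the predict() output
pred <- c(132.8450, 132.8483, 132.8481)
se <- c(1.373933, 1.884582, 2.286675)

# Approximate 95% interval: forecast plus or minus about 1.96 standard errors
z <- qnorm(0.975)
intervals <- data.frame(
  period = 111:113,
  lower = pred - z * se,
  forecast = pred,
  upper = pred + z * se
)
print(round(intervals, 2))
```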

x[111:113]

## [1] 133.7 134.5 135.9

The actual wholesale price index values for the three held-out periods (June, July and August 2021) are 133.7, 134.5 and 135.9. The model predicted about 132.85 for each, so it under-predicts all three, and the error grows from about 0.9 to about 3.1 index points as prices keep rising while the forecast stays flat. Each actual value is still within two standard errors of its forecast. This comparison gives businesses a way to judge the reliability of the forecasting model, and it shows why forecasting models should be checked against new data and refined regularly.
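The comparison between actual and predicted values can be summarised with standard error measures. The sketch below computes the mean absolute error and root mean squared error from the numbers printed above; the variable names are illustrative.

```r
# Held-out actual values and the corresponding forecasts from the output above
actual <- c(133.7, 134.5, 135.9)
pred <- c(132.8450, 132.8483, 132.8481)

err <- actual - pred          # positive values mean the forecast was too low
mae <- mean(abs(err))         # mean absolute error
rmse <- sqrt(mean(err^2))     # root mean squared error

print(round(err, 2))
print(c(MAE = mae, RMSE = rmse))
```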

Forecasting Wholesale Price Index

result_1 <- arima(x, order = c(1,1,0))
predict(result_1, 3)

## $pred
##           Sep      Oct      Nov
## 2021 135.8319 135.8352 135.8350
## 
## $se
##           Sep      Oct      Nov
## 2021 1.366955 1.886703 2.293286

The forecast wholesale price index values for September, October and November 2021 are 135.83, 135.84 and 135.84, with standard errors of 1.37, 1.89 and 2.29. The forecasts stay close to the last observed value of 135.9 because arima() does not include a drift term when the series is differenced, so the model does not extrapolate the recent upward trend. Together with the standard errors, these projections give decision-makers a baseline for adjusting prices, inventory levels and resource allocation, but they should be read alongside the diagnostics above, which suggest that the model can be improved.
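One possible refinement, offered only as a sketch and not as part of the original analysis, is to add a drift term by passing a linear time index through the xreg argument of arima(). The example runs on a simulated upward-trending series, not on the WPI data, and the names y, t_idx, fit_drift and fc are illustrative.

```r
# Simulated random walk with drift, standing in for an upward-trending index (an assumption)
set.seed(123)
y <- ts(100 + cumsum(rnorm(113, mean = 0.3, sd = 1)), start = c(2012, 4), frequency = 12)

# A linear time index as regressor; after differencing it becomes a constant, i.e. a drift
t_idx <- seq_along(y)
fit_drift <- arima(y, order = c(1, 1, 0), xreg = t_idx)
print(coef(fit_drift))

# Forecast three steps ahead; newxreg continues the time index
fc <- predict(fit_drift, n.ahead = 3, newxreg = length(y) + 1:3)
print(fc$pred)
```

Whether a drift term is appropriate for the wholesale price index would need to be checked against the residual diagnostics and held-out data.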

x2 <- x[1:113]
forecast <- predict(result_1, n.ahead = 5)
# Extend the x-axis to 118 so that all five forecast points are visible
plot(x2, xlim = c(1, 118), main = "Wholesale Price Index", xlab = "Month index", ylab = "Percentages")
lines(114:118, forecast$pred, type = "o", col = "red")

The plot shows the wholesale price index over time, with values ranging from about 105 to 136. The x-axis is the observation index, from 1 (April 2012) to 113 (August 2021), with each point representing one month. The index starts near 105 and rises to a peak of 117.2 at observation 29 (August 2014). It then declines to 107.1 at observation 47 (February 2016) before climbing, with some setbacks, to 135.9 at the end of the series. The isolated dip at observation 101 is the median-imputed value. The red points are the forecasts for the next five months, which extend the plot beyond the observed data and stay close to the last observed level.