Research Question

Can we forecast the hospital's revenue for the next six months based on the provided data? This information can be used for budgeting purposes for the next fiscal year.

The objective of this model is to forecast the hospital’s revenue for the next six months by analyzing the daily revenue from the past two years. This process will identify overall trends, seasonality, and other factors. The data will be split into testing and training data sets so we can compare the accuracy of our predictions with the actual data.

Method Justification

We can assume the time series can be made stationary. A stationary time series has a mean and variance that are constant over time. The observations are assumed to be evenly spaced and to follow a discrete-time observation index (Matteson, n.d.).

Stationarity is a property of a time series in which statistical properties, such as mean, variance, and autocorrelation, remain constant over time. This fundamental assumption simplifies complex data, making the series easier to analyze, model, and forecast (Tate, 2023). The correlation between values in a time series and their own past values is referred to as autocorrelation (Smith, 2023).
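To make this concrete, below is a minimal illustrative sketch (not part of the original analysis) contrasting a stationary series with a non-stationary one; the halves of the stationary series have similar means, while the random walk drifts.

set.seed(42)
white_noise <- ts(rnorm(200))          # Stationary: constant mean and variance
random_walk <- ts(cumsum(rnorm(200)))  # Non-stationary: mean and variance drift over time
mean(white_noise[1:100]); mean(white_noise[101:200])  # Similar means in each half
mean(random_walk[1:100]); mean(random_walk[101:200])  # Typically very different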

Data Preparation

# Libraries
library(tseries)
library(forecast)
library(dplyr)
library(caTools)
library(LSTS)
library(astsa)
library(TSstudio)
# Load medical time series data
data <- read.csv(file.choose())
summary(data)
##       Day           Revenue      
##  Min.   :  1.0   Min.   :-4.423  
##  1st Qu.:183.5   1st Qu.:11.122  
##  Median :366.0   Median :15.952  
##  Mean   :366.0   Mean   :14.180  
##  3rd Qu.:548.5   3rd Qu.:19.294  
##  Max.   :731.0   Max.   :24.792
plot(data$Day, data$Revenue, type = "l", xlab = "Day", ylab = "Daily Revenue (in millions)", main = "Time Series Plot")

The data set provided is a series of 731 days with the daily revenue associated with each day. There are no gaps, nulls, missing values, or duplicates in the data. I used is.data.frame() to verify the data set was a data frame, converted the Revenue column to a vector, and then converted the vector into a time series.

glimpse(data)
colSums(is.na(data)) # Check for missing values
sum(duplicated(data)) # Check for duplicates
is.null(data) # Check for null values 
is.data.frame(data) # Verify data frame
data_vector <- data[['Revenue']] # Convert to vector
is.vector(data_vector) # Verify vector
time_series <- ts(data_vector, frequency = 731/24) # Convert to time series; 731/24 ≈ 30.5 gives a roughly monthly cycle
is.ts(time_series) # Confirm time series

I ran an Augmented Dickey-Fuller (ADF) test to check for stationarity. The ADF test is a hypothesis test that assesses whether a series is stationary. If the p-value from the ADF test is less than 0.05, we can reject the null hypothesis of non-stationarity in favor of the alternative hypothesis that the time series is stationary. My ADF test showed a p-value of approximately 0.33. This is not significant, so we must difference the data (Tate, 2023).

Differencing is a technique that removes non-stationarity by subtracting the previous observation from the current observation (Brownlee, 2020). I used the diff() function to difference the data and ran the ADF test again. The p-value is now less than 0.05, so the time series is stationary. The Dickey-Fuller statistic is -7.28; the more negative the statistic, the stronger the evidence against the null hypothesis. The lag order represents the number of lags used in the test (Tate, 2023).

adf_test <- adf.test(time_series)
print(adf_test)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  time_series
## Dickey-Fuller = -2.5906, Lag order = 9, p-value = 0.3283
## alternative hypothesis: stationary
time_series_diff <- diff(time_series)
adf_test_diff <- adf.test(time_series_diff)
## Warning in adf.test(time_series_diff): p-value smaller than printed p-value
print(adf_test_diff)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  time_series_diff
## Dickey-Fuller = -7.2794, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

After differencing to make the data stationary, I split the data into training and testing sets. I used the ts_split() function with the parameter 'sample.out = 183', which reserves the final 183 observations (six months) as the testing set and leaves the first 548 as the training set.
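For reference, an equivalent split can be done in base R with window(); a minimal sketch assuming the time_series object from above:

n <- length(time_series)
times <- time(time_series)
training_alt <- window(time_series, end = times[n - 183])       # First 548 observations
testing_alt <- window(time_series, start = times[n - 183 + 1])  # Final 183 observations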

The data has been successfully loaded into RStudio, converted to a time series, made stationary through differencing, and split into training and testing sets. Copies of the cleaned data and the training and testing sets are attached with the submission. The data is ready for analysis.

split <- ts_split(time_series, sample.out = 183)
training <- split$train
testing <- split$test

length(time_series_diff)
## [1] 730
length(testing)
## [1] 183
length(training)
## [1] 548

Model Identification and Analysis

Seasonality

This plot visualizes patterns that repeat in the time series. Peaks and troughs indicate the presence of seasonal patterns.

tsdata <- ts(data$Revenue, frequency = 30)
datadecomp <- decompose(tsdata)
plot(datadecomp$seasonal, main = "Seasonality")

Trends

The trend plot shows the overall upward or downward movement, indicating the time series' long-term direction. This is important for understanding the underlying behavior of the time series (Ozen, 2021).

plot(datadecomp$trend, main = "Trend")

Autocorrelation

The autocorrelation function (ACF) plot displays the autocorrelation of the time series at different lags. This identifies the direction and strength of the relationship between the observations and their lagged values, and can suggest the order of a moving average (MA) model (Ahmed, 2023).

The partial autocorrelation function (PACF) plot shows the correlation of the time series at different lags after removing the effects of the intervening lags. This is used to identify the order of an autoregressive (AR) model (Ahmed, 2023).

acf(time_series, lag.max = 30, main = "ACF Before Differencing")

pacf(time_series, lag.max = 30, main = "PACF Before Differencing")

acf(time_series_diff, lag.max = 30, main = "ACF After Differencing")

pacf(time_series_diff, lag.max = 30, main = "PACF After Differencing")

Spectral Density

The spectral density plot graphs, in the frequency domain, the same information the autocovariance function describes in the time domain. Autocovariance refers to the covariance between two elements in the series. The peaks correspond to the frequencies that contribute most to the variability (Elleh, 2022).

spectrum(time_series_diff)
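To identify the dominant cycle numerically, the frequency with the largest spectral estimate can be pulled from the object returned by spectrum(); a minimal sketch:

spec <- spectrum(time_series_diff, plot = FALSE)
peak_freq <- spec$freq[which.max(spec$spec)] # Frequency with the largest contribution to variance
1 / peak_freq                                # Corresponding period, in the units of the series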

Decomposed Time Series

The decomposition plot splits the time series into four panels: the observed data, the seasonal component, the trend, and the remainder (residual).

stl_fit <- stl(time_series_diff, s.window = "periodic")
autoplot(stl_fit)

Residuals of the Decomposed Times Series

The residuals in a time series are what is left over after the model is fitted. A good forecasting method will have residuals that are uncorrelated and have a mean of zero (Hyndman and Athanasopoulos, 2018).

seasonal_comp <- stl_fit$time.series[, "seasonal"]
trend_comp <- stl_fit$time.series[, "trend"]
remainder_comp <- stl_fit$time.series[, "remainder"]
stl_residuals <- time_series_diff - seasonal_comp - trend_comp # Equivalent to remainder_comp
residuals_ts <- ts(stl_residuals, frequency = frequency(time_series_diff))
plot(residuals_ts)
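As a quick check of these properties (mean near zero, no autocorrelation), a minimal sketch using the residuals_ts object from above:

mean(residuals_ts)                                    # Should be close to zero
Box.test(residuals_ts, lag = 20, type = "Ljung-Box")  # Null hypothesis: residuals are white noise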

I used the auto.arima() function to identify the best seasonal order for the ARIMA model. ARIMA stands for Auto-Regressive Integrated Moving Average. auto.arima() searches various AR and MA combinations to find the best fit for the time series data. Setting the seasonal argument to TRUE indicates the function should consider seasonal components. The ARIMA model assumes the series is stationary and univariate. The autoregressive component is defined by the parameter 'p', which the PACF plot can suggest. The integrated component, represented by the parameter 'd', is the number of times the time series is differenced to achieve stationarity. The moving average component is the parameter 'q', which the ACF plot can suggest (Singh, 2023).
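To illustrate this order-selection logic, a few candidate non-seasonal orders can be hand-fitted and compared by AIC (lower is better); a minimal sketch using the training set, not part of the original model search:

candidates <- list(c(1, 1, 0), c(0, 1, 1), c(1, 1, 1))  # Candidate (p, d, q) orders
aics <- sapply(candidates, function(ord) AIC(Arima(training, order = ord)))
names(aics) <- sapply(candidates, paste, collapse = ",")
aics  # The order with the lowest AIC is preferred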

The training data was passed to the auto.arima() function, and the results showed the best-fit parameters were p = 1, d = 1, q = 0, P = 1, D = 0, Q = 1, and m = 30. The non-seasonal orders are p, d, and q; the seasonal orders are P, D, and Q. These orders are applied to the ARIMA model. The result is stored in 'seasonal_arima_model', and the summary and residuals are reviewed after the model has run.

auto.arima(training, seasonal = T)
## Series: training 
## ARIMA(1,1,0)(1,0,1)[30] 
## 
## Coefficients:
##         ar1    sar1     sma1
##       0.408  0.4690  -0.4460
## s.e.  0.039  0.7764   0.7823
## 
## sigma^2 = 0.1986:  log likelihood = -332.62
## AIC=673.24   AICc=673.32   BIC=690.46
seasonal_arima_model <- arima(training, order = c(1, 1, 0), seasonal = list(order = c(1, 0, 1), period = 30))
summary(seasonal_arima_model)
## 
## Call:
## arima(x = training, order = c(1, 1, 0), seasonal = list(order = c(1, 0, 1), 
##     period = 30))
## 
## Coefficients:
##         ar1    sar1     sma1
##       0.408  0.4690  -0.4460
## s.e.  0.039  0.7764   0.7823
## 
## sigma^2 estimated as 0.1975:  log likelihood = -332.62,  aic = 673.24
## 
## Training set error measures:
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.01234072 0.4439883 0.3586636 -0.2775491 9.281389 0.9232299
##                     ACF1
## Training set -0.01264159
checkresiduals(seasonal_arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)(1,0,1)[30]
## Q* = 46.197, df = 57.917, p-value = 0.8664
## 
## Model df: 3.   Total lags used: 60.9166666666667

I used the forecast() function to forecast the next six months (183 days) of revenue. I passed the fitted ARIMA model ('seasonal_arima_model') to forecast() and set h = 183. The prediction intervals are shown in the plot as a cone-shaped region above and below the point forecast, representing the uncertainty associated with the forecast; the wider the cone, the higher the uncertainty. By default, forecast() produces 80% and 95% prediction intervals, meaning the true value of a forecasted observation is expected to fall within the 95% interval 95% of the time.
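The interval levels can also be requested explicitly through the level argument of forecast(); a minimal sketch showing both the 80% and 95% prediction intervals:

fc <- forecast(seasonal_arima_model, h = 183, level = c(80, 95))
head(fc$lower)  # Lower bounds: one column per interval level
head(fc$upper)  # Upper bounds; the 95% interval forms the wider cone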

A Ljung-Box test was performed on the forecast residuals to assess whether they are correlated at different lags and to visualize the p-values by lag. All p-values are below 0.05, indicating significant autocorrelation in the residuals. This implies the residuals do not behave like white noise and that there is a pattern the model has not captured.

forecast <- forecast(seasonal_arima_model, h = 183) # Forecast 183 days (6 months) out
plot(forecast, main = "Forecasting Six Months of Total Daily Revenue", xlab = "Months", ylab = "Revenue (in millions)")
lines(testing, col = "green")
legend("topleft", legend = c("Training", "Testing", "Forecast"), col = c("Black", "Green", "Blue"), lty = c(1,1,1))

Box.Ljung.Test(residuals(forecast)) # Plot Ljung-Box p-values of the forecast residuals by lag

Output of the calculations for analysis is provided above. Below is a list of the calculations and outputs produced:

* Differencing to achieve stationarity, confirmed by the Augmented Dickey-Fuller test
* ACF and PACF plots
* auto.arima() used to identify the best seasonal order
* Seasonal ARIMA model
* Predictions from the ARIMA model

Data Summary and Implications

To find the most suitable ARIMA model I examined the ACF and PACF plots. These plots suggest values for the autoregressive (AR) and moving average (MA) orders. A gradual decline in the ACF paired with a sharp cutoff in the PACF suggests an AR component, with the order given by the lag at which the PACF cuts off; the reverse pattern, a sharp ACF cutoff with a gradual PACF decline, suggests an MA component. auto.arima() provided the non-seasonal components (1,1,0) and the seasonal components (1,0,1)[30]. The model selected by auto.arima() has the lowest AIC score among the candidates searched.
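As a sanity check on these heuristics, one can simulate an AR(1) process and confirm the expected patterns; a minimal sketch, not part of the original analysis:

set.seed(1)
ar1_sim <- arima.sim(model = list(ar = 0.7), n = 500)  # Simulated AR(1) with coefficient 0.7
acf(ar1_sim, main = "Simulated AR(1): gradual ACF decay")
pacf(ar1_sim, main = "Simulated AR(1): single PACF spike at lag 1")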

The forecast is produced at a daily interval, one step per day. In my forecast() function I set h = 183, so it predicts the next 183 days (six months).

I felt a six-month revenue forecast would be an appropriate length of time. The purpose of my research question relates to budgeting for the hospital system, and longer forecasting horizons become less accurate. This data set includes only two years of historical data; attempting to forecast six months of revenue from 24 months of data is already ambitious.

As discussed previously, I used auto.arima() to find the best-fit seasonal order. The training data was used with auto.arima(), and the best-fit parameters were (1,1,0)(1,0,1)[30].

The error metrics for the forecasting model are found using the accuracy() function on the forecast object. The mean error (ME) is the average forecast error. On the training set the ME is very close to zero; a low value generally means the forecasts are reasonably accurate, although it could also reflect the limited number of observations in the data set. The ME on the testing data suggests a positive bias in the forecasts. The root mean squared error (RMSE) measures the average magnitude of the errors. The RMSE for the training set is relatively low, suggesting the model's predictions align well with the training set values. The mean absolute error (MAE) is also low on the training set; a low MAE indicates that the absolute forecast errors are small.

As indicated by the low ME, RMSE, and MAE values on the training set, the model seems to perform well in-sample. The test set shows evidence of a positive bias and much higher error metrics, suggesting the model is less accurate when predicting new, unseen data.

accuracy_metrics <- accuracy(forecast, testing)
accuracy_metrics
##                      ME      RMSE       MAE        MPE      MAPE      MASE
## Training set 0.01234072 0.4439883 0.3586636 -0.2775491  9.281389 0.1019376
## Test set     4.72343591 5.1837291 4.7628363 27.0507334 27.399794 1.3536698
##                     ACF1 Theil's U
## Training set -0.01264159        NA
## Test set      0.96507664  9.752533
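These metrics can be reproduced by hand from the forecast errors on the test set; a minimal sketch using the forecast and testing objects from above:

errors <- as.numeric(testing) - as.numeric(forecast$mean)  # Forecast errors on the test set
c(ME = mean(errors), RMSE = sqrt(mean(errors^2)), MAE = mean(abs(errors)))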

This model shows room for improvement. I would first recommend that the analytics team evaluate the accuracy of the forecasts and identify any biases to expose areas for improvement. It would also be wise to involve stakeholders in the interpretation of the results and to rely on their domain knowledge of their departments. Refreshing the model when new data becomes available can help improve its accuracy. Once the model is found satisfactory, the results of this analysis can be given to department heads in the hospital system to budget for the next two quarters.
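As a sketch of the refresh step, the chosen orders could be re-fitted on an extended series with forecast's Arima() function; 'updated_revenue' below is a hypothetical placeholder for the revenue series with new data appended:

# updated_revenue is hypothetical: the original series extended with new observations
refreshed_fit <- Arima(updated_revenue, order = c(1, 1, 0),
                       seasonal = list(order = c(1, 0, 1), period = 30))
refreshed_forecast <- forecast(refreshed_fit, h = 183) # Re-forecast the next six months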