D213 - Medical Revenue Time Series Analysis

Research Question

What is the projected revenue for the upcoming quarter?

The goal of the data analysis is to create a model that can predict future revenue data using the previous two years of revenue data.

Method Justification

The forecasting method applied is ARIMA as the revenue data is time series data and there is expected autocorrelation in subsequent steps.

The assumptions of a time series model include stationarity, such that the mean and variance do not change as a function of time, absence of anomalies, and data are univariate. The autocorrelation assumption holds that there is a relationship between current and previous values in the data. Model parameters and error terms must be constant.

Data Preparation

The medical revenue data is loaded and inspected. A total of two years (731 days) are present, and the time step of the realization is prepared and plotted.

library(tidyverse)
library(janitor)
library(astsa)
library(aTSA)
library(forecast)
library(MLmetrics)

#Import data
medical_time_series <- read.csv("./D213/medical_time_series.csv",
                                row.names = "Day")
medical_time_series <- clean_names(medical_time_series)

#Convert to time series & plot
med_ts <- ts(medical_time_series, frequency = 365, start = 2020)
ts.plot(med_ts, xlab = "Year", ylab = "Revenue (Million USD)")

No missing values or anomalies are observed:

colSums(is.na(medical_time_series))

## revenue 
##       0

The time series appears to be nonstationary. The augmented Dickey-Fuller (DF) test is used to investigate stationarity, with α = 0.05. H0 is that the time series is nonstationary; H0 is not rejected when observing the DF test results (DataCamp, nd.).

adf.test(med_ts)

## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0  0.210   0.704
## [2,]   1 -0.233   0.577
## [3,]   2 -0.243   0.574
## [4,]   3 -0.178   0.593
## [5,]   4 -0.226   0.579
## [6,]   5 -0.280   0.563
## [7,]   6 -0.346   0.544
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -2.03   0.316
## [2,]   1 -2.22   0.241
## [3,]   2 -2.22   0.241
## [4,]   3 -2.18   0.258
## [5,]   4 -2.18   0.256
## [6,]   5 -2.31   0.204
## [7,]   6 -2.49   0.131
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -1.26   0.890
## [2,]   1 -2.00   0.577
## [3,]   2 -2.01   0.571
## [4,]   3 -1.90   0.621
## [5,]   4 -1.97   0.589
## [6,]   5 -2.13   0.521
## [7,]   6 -2.34   0.432
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

The Dickey-Fuller test confirms the data is nonstationary. Differencing is applied and plotted to confirm stationarity.

med_diff <- diff(med_ts)
plot.ts(cbind(med_ts, med_diff))

The time series is split into 75/25 train/test split at the end of the time series, as opposed to stochastically, to retain the order of the data when producing a model.

med_train <- window(med_ts, end = 2021.5)
med_test <- window(med_ts, start = 2020)

write.csv(med_train, "D213/med_train.csv", row.names = FALSE)
write.csv(med_test, "D213/med_test.csv", row.names = FALSE)
write.csv(med_ts, "D213/med_ts_complete.csv", row.names = FALSE)

Model Identification and Analysis

The time series is decomposed to observe the components.

med_components <- decompose(med_ts)
plot(med_components, xlab = "Year")

When observing the time series components, an upward trend is identified. To investigate seasonality, a periodogram and smoothed periodogram are generated to investigate spectral density (Pennsylvania State University, 2021).

spectrum(med_ts)

kern <- kernel("daniell", 1)
mvspec(med_ts, kern, log = "n")

ACF and PACF plots are generated from the differenced time series.

When investigating the plots for seasonality, a strong seasonal trend is not identified.

The ACF and PACF are investigated to determine the appropriate arima model parameters. Since the ACF tails off and the PACF cuts off, an AR(1) model is generated and residual analysis is performed:

med_arima.1 <- sarima(med_train, 1, 1, 0)

## initial  value -0.720109 
## iter   2 value -0.810613
## iter   3 value -0.810614
## iter   4 value -0.810614
## iter   5 value -0.810614
## iter   6 value -0.810614
## iter   7 value -0.810614
## iter   7 value -0.810614
## iter   7 value -0.810614
## final  value -0.810614 
## converged
## initial  value -0.810979 
## iter   2 value -0.810979
## iter   3 value -0.810980
## iter   4 value -0.810980
## iter   5 value -0.810980
## iter   6 value -0.810981
## iter   6 value -0.810981
## final  value -0.810981 
## converged

A lack of a trend in the standardized residuals is observed, the ACF shows likeness where all of the values are between the thresholds, the Q-Q plot shows normality, and finally, the p-values of the Q statistic are at or above the indicated threshold, indicating variation is due to white noise.

The AR(1) model is compared to an AR(2) model to inspect for an improvement in the model.

## initial  value -0.719205 
## iter   2 value -0.793888
## iter   3 value -0.809437
## iter   4 value -0.810139
## iter   5 value -0.810147
## iter   6 value -0.810147
## iter   7 value -0.810148
## iter   7 value -0.810148
## iter   7 value -0.810148
## final  value -0.810148 
## converged
## initial  value -0.811405 
## iter   2 value -0.811405
## iter   3 value -0.811405
## iter   4 value -0.811406
## iter   5 value -0.811406
## iter   6 value -0.811406
## iter   7 value -0.811406
## iter   7 value -0.811406
## iter   7 value -0.811406
## final  value -0.811406 
## converged

When comparing the two plots, the p values for the AR(2) model appear to be closer to 0. Additionally, AR(1) model achieves 1.226885 and 1.2504925 BIC, whereas AR(2) achieves 1.2296898 AIC and 1.2611665 BIC. The lower AIC of the AR(1) model indicates smaller error with a smaller number of parameters, indicating the AR(1) model provides a better fit for the data.

The AR(1) model is selected and used to create a forecast for 90 days:

par(mfrow = c(1,1))
fc <- med_arima.for <- sarima.for(med_train, 1, 1, 0, n.ahead = 182, xlab = "Year",
           main = "Revenue Projection (Millions USD")

Forecast Output:

## $pred
## Time Series:
## Start = c(2021, 184) 
## End = c(2021, 365) 
## Frequency = 365 
##   [1] 12.24289 12.28136 12.31003 12.33473 12.35782 12.38024 12.40240 12.42445
##   [9] 12.44646 12.46845 12.49043 12.51241 12.53438 12.55636 12.57834 12.60031
##  [17] 12.62229 12.64427 12.66624 12.68822 12.71020 12.73217 12.75415 12.77613
##  [25] 12.79810 12.82008 12.84205 12.86403 12.88601 12.90798 12.92996 12.95194
##  [33] 12.97391 12.99589 13.01787 13.03984 13.06182 13.08379 13.10577 13.12775
##  [41] 13.14972 13.17170 13.19368 13.21565 13.23763 13.25961 13.28158 13.30356
##  [49] 13.32554 13.34751 13.36949 13.39146 13.41344 13.43542 13.45739 13.47937
##  [57] 13.50135 13.52332 13.54530 13.56728 13.58925 13.61123 13.63320 13.65518
##  [65] 13.67716 13.69913 13.72111 13.74309 13.76506 13.78704 13.80902 13.83099
##  [73] 13.85297 13.87495 13.89692 13.91890 13.94087 13.96285 13.98483 14.00680
##  [81] 14.02878 14.05076 14.07273 14.09471 14.11669 14.13866 14.16064 14.18261
##  [89] 14.20459 14.22657 14.24854 14.27052 14.29250 14.31447 14.33645 14.35843
##  [97] 14.38040 14.40238 14.42436 14.44633 14.46831 14.49028 14.51226 14.53424
## [105] 14.55621 14.57819 14.60017 14.62214 14.64412 14.66610 14.68807 14.71005
## [113] 14.73202 14.75400 14.77598 14.79795 14.81993 14.84191 14.86388 14.88586
## [121] 14.90784 14.92981 14.95179 14.97377 14.99574 15.01772 15.03969 15.06167
## [129] 15.08365 15.10562 15.12760 15.14958 15.17155 15.19353 15.21551 15.23748
## [137] 15.25946 15.28144 15.30341 15.32539 15.34736 15.36934 15.39132 15.41329
## [145] 15.43527 15.45725 15.47922 15.50120 15.52318 15.54515 15.56713 15.58910
## [153] 15.61108 15.63306 15.65503 15.67701 15.69899 15.72096 15.74294 15.76492
## [161] 15.78689 15.80887 15.83085 15.85282 15.87480 15.89677 15.91875 15.94073
## [169] 15.96270 15.98468 16.00666 16.02863 16.05061 16.07259 16.09456 16.11654
## [177] 16.13851 16.16049 16.18247 16.20444 16.22642 16.24840
## 
## $se
## Time Series:
## Start = c(2021, 184) 
## End = c(2021, 365) 
## Frequency = 365 
##   [1]  0.4443488  0.7667746  1.0370677  1.2671195  1.4674664  1.6457924
##   [7]  1.8074269  1.9560587  2.0942859  2.2239833  2.3465404  2.4630135
##  [13]  2.5742245  2.6808271  2.7833502  2.8822290  2.9778263  3.0704487
##  [19]  3.1603578  3.2477788  3.3329076  3.4159156  3.4969537  3.5761559
##  [25]  3.6536416  3.7295178  3.8038807  3.8768176  3.9484073  4.0187220
##  [31]  4.0878273  4.1557837  4.2226466  4.2884671  4.3532926  4.4171667
##  [37]  4.4801304  4.5422213  4.6034748  4.6639239  4.7235994  4.7825305
##  [43]  4.8407441  4.8982660  4.9551201  5.0113293  5.0669150  5.1218974
##  [49]  5.1762959  5.2301286  5.2834128  5.3361650  5.3884008  5.4401350
##  [55]  5.4913819  5.5421549  5.5924670  5.6423304  5.6917571  5.7407582
##  [61]  5.7893446  5.8375266  5.8853141  5.9327168  5.9797436  6.0264036
##  [67]  6.0727050  6.1186560  6.1642646  6.2095381  6.2544840  6.2991091
##  [73]  6.3434203  6.3874242  6.4311269  6.4745347  6.5176534  6.5604886
##  [79]  6.6030460  6.6453309  6.6873484  6.7291036  6.7706012  6.8118461
##  [85]  6.8528427  6.8935955  6.9341088  6.9743867  7.0144334  7.0542528
##  [91]  7.0938486  7.1332247  7.1723845  7.2113318  7.2500698  7.2886019
##  [97]  7.3269314  7.3650614  7.4029950  7.4407353  7.4782851  7.5156472
## [103]  7.5528246  7.5898198  7.6266356  7.6632746  7.6997392  7.7360319
## [109]  7.7721551  7.8081112  7.8439025  7.8795313  7.9149996  7.9503097
## [115]  7.9854637  8.0204636  8.0553114  8.0900092  8.1245587  8.1589620
## [121]  8.1932207  8.2273369  8.2613121  8.2951482  8.3288469  8.3624097
## [127]  8.3958384  8.4291345  8.4622995  8.4953352  8.5282428  8.5610240
## [133]  8.5936801  8.6262126  8.6586228  8.6909122  8.7230821  8.7551337
## [139]  8.7870685  8.8188876  8.8505923  8.8821839  8.9136635  8.9450323
## [145]  8.9762915  9.0074422  9.0384855  9.0694226  9.1002545  9.1309823
## [151]  9.1616071  9.1921298  9.2225515  9.2528732  9.2830958  9.3132204
## [157]  9.3432478  9.3731791  9.4030150  9.4327566  9.4624047  9.4919602
## [163]  9.5214240  9.5507969  9.5800797  9.6092733  9.6383784  9.6673960
## [169]  9.6963267  9.7251713  9.7539306  9.7826054  9.8111964  9.8397043
## [175]  9.8681298  9.8964737  9.9247367  9.9529194  9.9810225 10.0090467
## [181] 10.0369927 10.0648610

Data Summary and Implications

The forecast length provides an adequate amount of time for planning, however, extending beyond 90 days, the confidence interval widens substantially and may not prove useful in prediction.

The selected model is used to produce a daily interval forecast to compare performance toretained test data with +/- 1 and 2 prediction error bounds. The line indicates test data, suggesting that the model is consistent with test data. Error is calculated.

sarima.for(med_train, 1, 1, 0, n.ahead = 182, xlab = "Year", 
          main = "Revenue Projection (Millions USD & Test Partition Comparison")

## $pred
## Time Series:
## Start = c(2021, 184) 
## End = c(2021, 365) 
## Frequency = 365 
##   [1] 12.24289 12.28136 12.31003 12.33473 12.35782 12.38024 12.40240 12.42445
##   [9] 12.44646 12.46845 12.49043 12.51241 12.53438 12.55636 12.57834 12.60031
##  [17] 12.62229 12.64427 12.66624 12.68822 12.71020 12.73217 12.75415 12.77613
##  [25] 12.79810 12.82008 12.84205 12.86403 12.88601 12.90798 12.92996 12.95194
##  [33] 12.97391 12.99589 13.01787 13.03984 13.06182 13.08379 13.10577 13.12775
##  [41] 13.14972 13.17170 13.19368 13.21565 13.23763 13.25961 13.28158 13.30356
##  [49] 13.32554 13.34751 13.36949 13.39146 13.41344 13.43542 13.45739 13.47937
##  [57] 13.50135 13.52332 13.54530 13.56728 13.58925 13.61123 13.63320 13.65518
##  [65] 13.67716 13.69913 13.72111 13.74309 13.76506 13.78704 13.80902 13.83099
##  [73] 13.85297 13.87495 13.89692 13.91890 13.94087 13.96285 13.98483 14.00680
##  [81] 14.02878 14.05076 14.07273 14.09471 14.11669 14.13866 14.16064 14.18261
##  [89] 14.20459 14.22657 14.24854 14.27052 14.29250 14.31447 14.33645 14.35843
##  [97] 14.38040 14.40238 14.42436 14.44633 14.46831 14.49028 14.51226 14.53424
## [105] 14.55621 14.57819 14.60017 14.62214 14.64412 14.66610 14.68807 14.71005
## [113] 14.73202 14.75400 14.77598 14.79795 14.81993 14.84191 14.86388 14.88586
## [121] 14.90784 14.92981 14.95179 14.97377 14.99574 15.01772 15.03969 15.06167
## [129] 15.08365 15.10562 15.12760 15.14958 15.17155 15.19353 15.21551 15.23748
## [137] 15.25946 15.28144 15.30341 15.32539 15.34736 15.36934 15.39132 15.41329
## [145] 15.43527 15.45725 15.47922 15.50120 15.52318 15.54515 15.56713 15.58910
## [153] 15.61108 15.63306 15.65503 15.67701 15.69899 15.72096 15.74294 15.76492
## [161] 15.78689 15.80887 15.83085 15.85282 15.87480 15.89677 15.91875 15.94073
## [169] 15.96270 15.98468 16.00666 16.02863 16.05061 16.07259 16.09456 16.11654
## [177] 16.13851 16.16049 16.18247 16.20444 16.22642 16.24840
## 
## $se
## Time Series:
## Start = c(2021, 184) 
## End = c(2021, 365) 
## Frequency = 365 
##   [1]  0.4443488  0.7667746  1.0370677  1.2671195  1.4674664  1.6457924
##   [7]  1.8074269  1.9560587  2.0942859  2.2239833  2.3465404  2.4630135
##  [13]  2.5742245  2.6808271  2.7833502  2.8822290  2.9778263  3.0704487
##  [19]  3.1603578  3.2477788  3.3329076  3.4159156  3.4969537  3.5761559
##  [25]  3.6536416  3.7295178  3.8038807  3.8768176  3.9484073  4.0187220
##  [31]  4.0878273  4.1557837  4.2226466  4.2884671  4.3532926  4.4171667
##  [37]  4.4801304  4.5422213  4.6034748  4.6639239  4.7235994  4.7825305
##  [43]  4.8407441  4.8982660  4.9551201  5.0113293  5.0669150  5.1218974
##  [49]  5.1762959  5.2301286  5.2834128  5.3361650  5.3884008  5.4401350
##  [55]  5.4913819  5.5421549  5.5924670  5.6423304  5.6917571  5.7407582
##  [61]  5.7893446  5.8375266  5.8853141  5.9327168  5.9797436  6.0264036
##  [67]  6.0727050  6.1186560  6.1642646  6.2095381  6.2544840  6.2991091
##  [73]  6.3434203  6.3874242  6.4311269  6.4745347  6.5176534  6.5604886
##  [79]  6.6030460  6.6453309  6.6873484  6.7291036  6.7706012  6.8118461
##  [85]  6.8528427  6.8935955  6.9341088  6.9743867  7.0144334  7.0542528
##  [91]  7.0938486  7.1332247  7.1723845  7.2113318  7.2500698  7.2886019
##  [97]  7.3269314  7.3650614  7.4029950  7.4407353  7.4782851  7.5156472
## [103]  7.5528246  7.5898198  7.6266356  7.6632746  7.6997392  7.7360319
## [109]  7.7721551  7.8081112  7.8439025  7.8795313  7.9149996  7.9503097
## [115]  7.9854637  8.0204636  8.0553114  8.0900092  8.1245587  8.1589620
## [121]  8.1932207  8.2273369  8.2613121  8.2951482  8.3288469  8.3624097
## [127]  8.3958384  8.4291345  8.4622995  8.4953352  8.5282428  8.5610240
## [133]  8.5936801  8.6262126  8.6586228  8.6909122  8.7230821  8.7551337
## [139]  8.7870685  8.8188876  8.8505923  8.8821839  8.9136635  8.9450323
## [145]  8.9762915  9.0074422  9.0384855  9.0694226  9.1002545  9.1309823
## [151]  9.1616071  9.1921298  9.2225515  9.2528732  9.2830958  9.3132204
## [157]  9.3432478  9.3731791  9.4030150  9.4327566  9.4624047  9.4919602
## [163]  9.5214240  9.5507969  9.5800797  9.6092733  9.6383784  9.6673960
## [169]  9.6963267  9.7251713  9.7539306  9.7826054  9.8111964  9.8397043
## [175]  9.8681298  9.8964737  9.9247367  9.9529194  9.9810225 10.0090467
## [181] 10.0369927 10.0648610

lines(med_ts)

#Calculate error
MAPE(fc$pred, med_test)

## [1] 0.1555573

RMSE(fc$pred, med_test)

## [1] 3.385255

The daily forecasted data follows the test data and most of the data lies within one prediction error.

The recommended course of action is to implement the final ARIMA model (1,1,0) with a 90 day forecast window to predict revenue for the next quarter.

Citations

DataCamp. (n.d.). Adf.test: Augmented dickey-fuller test. RDocumentation. Retrieved December 17, 2021, from https://www.rdocumentation.org/packages/aTSA/versions/3.1.2/topics/adf.test

Pennsylvania State University. (2021). 12.1 estimating the Spectral Density: Stat 510. Estimating the Spectral Density. Retrieved December 17, 2021, from https://online.stat.psu.edu/stat510/lesson/12/12.1