RESEARCH QUESTION

A time series of daily revenue was provided, covering the first two years of operations. Can meaningful patterns within the data be modeled? The contracting agency wishes to understand the impact of readmissions on daily revenue.

OBJECTIVES

We will prepare the data and perform a time series analysis to find any identifiable patterns in revenue over the first two years of operations. The patterns found can then be correlated with readmission rates to make recommendations for future policies.

METHOD JUSTIFICATION

ASSUMPTIONS

For a time-series analysis, the data must be stationary, meaning that the statistical properties of the series, such as its mean and variance, do not change over time. If the data is not already stationary, for instance if it exhibits a trend or exponential growth, it must be transformed, typically by differencing or by applying a logarithm.
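
As a brief simulated illustration (not the revenue data), a random walk is non-stationary because its variance grows over time, while its first difference behaves like white noise:

set.seed(42)                          # simulated example only
walk <- cumsum(rnorm(200))            # random walk: non-stationary
par(mfrow = c(1, 2))
plot.ts(walk, main = "Random Walk (Non-Stationary)")
plot.ts(diff(walk), main = "First Difference (Stationary)")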

The residuals of the final model must also not be significantly autocorrelated, meaning the model's errors at one point must be independent of the errors at nearby points in the past or future. If significant autocorrelation remains in the residuals, or if seasonality in the data is left unmodeled, any forecast made will be misleading.
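
As another simulated illustration, autocorrelation can be made visible with the acf() function: an independent series shows no significant bars beyond lag zero, while an autoregressive series shows a slow decay.

set.seed(7)                                   # simulated example only
noise <- rnorm(200)                           # independent observations
ar1 <- arima.sim(list(ar = 0.8), n = 200)     # strongly autocorrelated AR(1) series
par(mfrow = c(1, 2))
acf(noise, main = "White Noise")              # bars stay inside the confidence bounds
acf(ar1, main = "AR(1)")                      # slow decay indicates autocorrelation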

PREPARATION

The first step will be to import data and check for missing values or missing samples.

library(readr)
## Warning: package 'readr' was built under R version 4.1.1
med <- read_csv("medical_time_series .csv")
## Rows: 731 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): Day, Revenue
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
med[!complete.cases(med), ]           # rows containing NA values (none returned)
FALSE %in% (med$Day == 1:nrow(med))   # FALSE means every day is present, in order, with no gaps
## [1] FALSE

The provided data does not contain any NA values, and the Day column contains every day between the beginning and the end of the data without any gaps. No data will need to be imputed for the analysis.

The data frame will now be converted into a time series object.

time <- ts(med$Revenue, start = 1, frequency = 30)   # treat each 30-day block as one seasonal cycle
head(time)
## [1]  0.0000000 -0.2923555 -0.3277718 -0.3399871 -0.1248875 -0.4915896

A line graph can be created using the ts.plot() function.

ts.plot(time, xlab = "Day of Operation", ylab = "Revenue", main = "Revenue Time Series")

No gaps are present in the data. The data is continuous and steps are taken daily corresponding to the maximum level of granularity available. The length of the sequence is 731 days or two years.

The stationarity of the time series can be evaluated using the Augmented Dickey-Fuller test, available through the adf.test() function in the aTSA package.

library(aTSA)
## Warning: package 'aTSA' was built under R version 4.1.1
## 
## Attaching package: 'aTSA'
## The following object is masked from 'package:graphics':
## 
##     identify
adf.test(med$Revenue)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0  0.210   0.704
## [2,]   1 -0.233   0.577
## [3,]   2 -0.243   0.574
## [4,]   3 -0.178   0.593
## [5,]   4 -0.226   0.579
## [6,]   5 -0.280   0.563
## [7,]   6 -0.346   0.544
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -2.03   0.316
## [2,]   1 -2.22   0.241
## [3,]   2 -2.22   0.241
## [4,]   3 -2.18   0.258
## [5,]   4 -2.18   0.256
## [6,]   5 -2.31   0.204
## [7,]   6 -2.49   0.131
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -1.26   0.890
## [2,]   1 -2.00   0.577
## [3,]   2 -2.01   0.571
## [4,]   3 -1.90   0.621
## [5,]   4 -1.97   0.589
## [6,]   5 -2.13   0.521
## [7,]   6 -2.34   0.432
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Because the p-values are greater than 0.05 for every lag and test type, there is not enough evidence to reject the null hypothesis of non-stationarity, so the series is treated as non-stationary. This is consistent with the line graph, which shows a significant upward trend in the first third of the series before leveling off after roughly day 200.

The non-stationarity will be corrected using first-order differencing, which replaces each observation with the difference between it and the previous observation (y_t - y_{t-1}).

difftime <- diff(med$Revenue)
adf.test(difftime)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -17.36    0.01
## [2,]   1 -14.43    0.01
## [3,]   2 -13.53    0.01
## [4,]   3 -11.50    0.01
## [5,]   4 -10.03    0.01
## [6,]   5  -8.85    0.01
## [7,]   6  -8.29    0.01
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0 -17.37    0.01
## [2,]   1 -14.45    0.01
## [3,]   2 -13.56    0.01
## [4,]   3 -11.53    0.01
## [5,]   4 -10.06    0.01
## [6,]   5  -8.88    0.01
## [7,]   6  -8.33    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0 -17.41    0.01
## [2,]   1 -14.49    0.01
## [3,]   2 -13.61    0.01
## [4,]   3 -11.58    0.01
## [5,]   4 -10.12    0.01
## [6,]   5  -8.95    0.01
## [7,]   6  -8.41    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01
detach(package:aTSA, unload = TRUE)

The differenced data is now stationary and can be used to create a new time series object and plot.

diffts <- ts(difftime, start = 1, frequency = 30)
ts.plot(diffts, xlab = "Day of Operation", ylab = "Differenced Revenue", main = "Differenced Revenue Time Series")

Because time-series data is sequential, random sampling should not be used to partition it. Instead, the data will be split into two consecutive segments: the training set will use the first 60% of the series and the testing set will use the remainder. Both sets and the cleaned data will be written to file.

n_train <- floor(length(difftime) * 0.6)                      # first 60% of observations
train <- ts(difftime[1:n_train], start = 1, frequency = 30)
test <- ts(difftime[-(1:n_train)], start = 1, frequency = 30)
write_csv(as.data.frame(difftime), "cleanedTimeSeries.csv")
capture.output(summary(test), file = "test.csv")
capture.output(summary(train), file = "train.csv")

MODEL IDENTIFICATION AND ANALYSIS

There is a seasonal component as well as a trend, though the trend does not appear substantial after differencing. Both components can be examined with decomposition.

decomposed <- decompose(diffts, type = "additive")
plot(decomposed)

There is some seasonality and trend in the data, but neither is of high magnitude. These patterns can be captured by a seasonal ARIMA model.

Autocorrelation can be tested with the acf() function.

acf(diffts, lag.max = 10)

Autocorrelation may be significant at lags 1, 2, and 6. To test whether there is significant evidence of non-zero autocorrelation at these lags, we can use the Ljung-Box test.

Box.test(diffts, lag=10, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  diffts
## X-squared = 172.53, df = 10, p-value < 2.2e-16

The p-value is far below 0.05, so the null hypothesis of independence is rejected: significant autocorrelation is present and must be captured by the model. The frequency content of the series can be analyzed with the spectrum() function.

spec <- spectrum(diffts, log="no", span = 30, plot=TRUE)

The periodogram shows the estimated spectral density at each frequency. The greatest density appears at lower frequencies, suggesting that long-period cycles dominate the series.
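
The dominant cycle can also be read directly from the object returned by spectrum(), which stores the estimated density for each frequency. In this sketch, frequencies are in cycles per 30-day period because of how the time series object was constructed:

spec$freq[which.max(spec$spec)]       # frequency with the greatest spectral density
1 / spec$freq[which.max(spec$spec)]   # the corresponding period, in 30-day units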

ARIMA

Some seasonality exists in the data. Because differencing has already been performed manually, no further differencing should be specified in the ARIMA model (d = 0). Because two lags appeared significant in the ACF, the moving average order is set to 2; the non-seasonal and seasonal autoregressive orders are likewise set to 2, giving an ARIMA(2,0,2)(2,0,0)[30] model.

fit <- arima(train, order = c(2, 0, 2), seasonal = c(2, 0, 0))
fit
## 
## Call:
## arima(x = train, order = c(2, 0, 2), seasonal = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2    sar1    sar2  intercept
##       0.3901  -0.2957  -0.0174  0.3466  0.0049  0.0351     0.0309
## s.e.  0.2989   0.1668   0.2961  0.1108  0.0503  0.0523     0.0319
## 
## sigma^2 estimated as 0.1926:  log likelihood = -260.89,  aic = 537.78
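
As a cross-check on the manually chosen orders, the auto.arima() function from the forecast package can search candidate orders automatically by AICc. The sketch below is illustrative only; the model it selects may differ from the manual fit above.

auto_fit <- forecast::auto.arima(train)   # stepwise search over candidate (p, d, q) orders
auto_fit                                  # compare its AIC against the manual fit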

Once the model is built, the residuals can be examined with checkresiduals(), which plots them and runs a Ljung-Box test on them.

library(forecast)
## Warning: package 'forecast' was built under R version 4.1.1
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(2,0,0)[30] with non-zero mean
## Q* = 40.897, df = 53, p-value = 0.8875
## 
## Model df: 7.   Total lags used: 60

The Ljung-Box p-value of 0.8875 indicates no significant autocorrelation in the residuals, which appear random.

FORECAST

The forecast() function can be used to generate a forecast from the fitted model, which can then be compared against the testing set.

forecast <- forecast(fit, h = 10, level =c(99.5))
plot(forecast)
lines(diffts, col = rgb(.2,.2,.2,.5))

The shaded gray area denotes a 99.5 percent prediction interval. The green line is the forecast itself, and the light gray line is the observed differenced series, overlaid for comparison. The forecast follows the test data initially but quickly settles to a relatively constant value. An approximate 95% prediction interval for an ARIMA model is the point forecast plus or minus 1.96 standard deviations of the residuals (ŷ_t ± 1.96σ).

forecast$mean[10]                                   # point forecast at day 10
## [1] 0.02441288
forecast$mean[10] + 1.96*sd(forecast$residuals)     # approximate upper 95% bound
## [1] 0.8855207
forecast$mean[10] - 1.96*sd(forecast$residuals)     # approximate lower 95% bound
## [1] -0.8366949

At day ten of the forecast, the approximate 95% prediction interval runs from about -0.84 to 0.89. Accuracy across multiple statistics can be calculated with the accuracy() function.

accuracy(forecast$fitted, test)
##                  ME      RMSE       MAE      MPE     MAPE      ACF1 Theil's U
## Test set -0.0396775 0.5415429 0.4291089 152.3816 195.6382 0.4638863  1.991252

RESULTS

After analyzing the presence of seasonality and trends, an ARIMA model was selected with order (p, d, q) = (2, 0, 2) and seasonal order (2, 0, 0). Differencing was required even after initial cleaning due to the lack of stationarity in the data. This combination produced an AIC of 537.78 for the chosen ARIMA model.

The prediction interval for the forecast was plotted along with the forecast and the test data. There is only about a 5% likelihood that a real observation will fall outside the approximate interval of -0.84 to 0.89 on the differenced data.

Moderate- and short-length forecasts provide some actionable insight for decision makers. Two years of data were available for analysis; however, there was a significant upward trend corresponding to the early operation of the hospital chain. After approximately a year, the trend disappeared while significant seasonality remained. Because of this, the model loses accuracy with increasing range. A horizon of 10 days was selected; longer forecast lengths were not useful.

The mean absolute percentage error (MAPE) and the root mean squared error (RMSE) of the ARIMA model measure its success. The RMSE was reported as approximately 0.54 and the MAPE as approximately 196.
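
For reference, both metrics can be reproduced from their definitions. The sketch below uses hypothetical helper functions, assuming actual and predicted are numeric vectors of equal length:

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))               # root mean squared error
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))   # mean absolute percentage error

Because the differenced revenue values are centered near zero, MAPE divides by near-zero actual values, which inflates it dramatically; RMSE is therefore the more meaningful measure for this series.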

VISUALIZATION OF FORECAST

forecast <- forecast(fit, h = 10, level =c(99.5))
plot(forecast)
lines(diffts, col = rgb(.2,.2,.2,.5))

The forecast is given by the colored line. It follows the test data initially but degrades toward a constant value as the horizon increases.

RECOMMENDED COURSE OF ACTION

The model produced was unable to achieve a high degree of accuracy, which may mean that the data exhibits a significant level of randomness. However, two distinct trends were visible in the data: one belonging to the first year and another belonging to the second. If the change in growth rate was related to readmission rates, it could provide insight into the hospital's financial situation. Additional data beyond the second year may produce a more accurate model.

REFERENCES

CRAN. (2021). R [Computer software]. Retrieved 16 November 2021 from https://mirror.las.iastate.edu/CRAN/

Wickham, H. (2021). readr [Computer software]. Retrieved 16 November 2021 from https://cran.r-project.org/web/packages/readr/index.html

Wickham, H. (2021). dplyr [Computer software]. Retrieved 16 November 2021 from https://cran.r-project.org/web/packages/dplyr/index.html

Hyndman, R. (2021). forecast [Computer software]. Retrieved 16 November 2021 from https://cran.r-project.org/web/packages/forecast/index.html