Table of Contents
Synopsis
Introduction
1) Exploring the Medical Center Traffic Data
2) Statistical Modelling / Analysis
2a) Autocorrelation / Partial Autocorrelation
2b) Moving Average
2c) Baseline Naive Forecast
2d) Linear Regression
2e) Holt-Winters Seasonal Decomposition
2f) ARIMA
3) Results Summary
4) Conclusions
5) Discussion
Appendix
The manager of a local medical walk-in centre has no experience of time series analysis or forecasting, and has asked for help analyzing the number of patients that attend the walk-in centre. The objective is to predict how many patients will come into the walk-in centre each day over the next week. The centre is open 7 days a week.
The ability to predict future daily volumes of visits to a medical walk-in centre allows for efficient scheduling of medical centre staff rotations. The previous years of visits are a basic predictor of future demand, and therefore should be examined first. Predicting future office visits is the first step in staff rotation scheduling, before a balance can be planned between employer staffing requirements and employees' ability to schedule child care, education, and other obligations.
With the ability to predict future demand for staff, employers can offer employees work flexibility that is highly valued, while further reducing the cost of staffing. Employees gain health, well-being, quality of life, job satisfaction, and productivity benefits from flexible scheduling, which begins with demand prediction. Employers gain productivity, retention, consistency, and budget management benefits from demand prediction. By understanding the fluctuations in future client visits, employers can determine the right balance of full-time and part-time staff. Time series analysis thus assists business owners in creating a strategy for business growth.
The objective of this document is to examine the data for characteristics that could increase the accuracy of demand forecasting. This begins with methods for detecting the sources of fluctuations in time series data. One possible source of fluctuation is autocorrelation; another is a moving average. Seasonal changes over the 4 years of data are also a consideration. After the characteristics of the time series variations are generalized, different time series predictive analytics methods are compared against a baseline method. Determining an effective method of time series analysis will allow for the subsequent identification of the phenomena responsible for the events within the time series. Basic time series analysis is the foundation for developing more complex demand prediction capabilities, and therefore requires optimization for effectiveness and efficiency.
A time series is a sequence of values ordered in time. The changes within time series data usually represent peaks, troughs, recessions, and expansions of an observed category. The values of the data points within this document’s time series represent the total daily number of patient visits to the medical centre office.
This analysis follows the Data Science regimen of:
1) exploring the characteristics of the dataset in the form of basic statistical summary and visualization
2) reformatting the data for algorithmic processing
3) using advanced techniques to find complex statistical phenomena in the dataset
4) separating the data into training and test datasets for verification of the accuracy of predictions of future data
5) processing the dataset through trial predictive analytics modeling algorithms
6) examining the results of the predictive analytics trials in order to find the most effective method of time series forecasting
This section presents numerical summaries which describe the general statistical content of the data. Graphical summaries of the dataset are presented in the form of a line plot, a scatter plot, and a time series plot.
We can see from the line plot that there is substantial seasonal variation in the volume of patient visits over the 4-year time series span. The scatter plot demonstrates that the seasonal extremes of patient visits are usually outliers in the data, and that volumes near the median occur most often. The time series plot confirms that the Date column of the dataset increases linearly, i.e. the observations are evenly spaced in time.
Table 1. The first six rows of the dataset

| Date | Number.of.patients |
|---|---|
| 2015-04-01 | 49 |
| 2015-04-02 | 58 |
| 2015-04-03 | 36 |
| 2015-04-04 | 44 |
| 2015-04-05 | 64 |
| 2015-04-06 | 46 |
Table 2. Number of Patient Visits - Basic Statistics

| mean | median | max | min | standard_deviation |
|---|---|---|---|---|
| 46.72485 | 45 | 97 | 16 | 12.9109 |
Table 3. The days of maximum office visits

| | Date | Number.of.patients |
|---|---|---|
| 1001 | 2017-12-26 | 97 |
| 1007 | 2018-01-01 | 97 |
Table 4. The day of minimum office visits

| | Date | Number.of.patients |
|---|---|---|
| 490 | 2016-08-02 | 16 |
This section of the document seeks to statistically model and analyze the patient visit data, in order to predict the next 7 days of office visit volume. Several different methods of statistical analysis are applied to the predictive analytics problem, for greater understanding of the content of the dataset.
The first analysis is of the two variables of the project’s dataset. These variables are “Number.of.patients” (the number of patients that visit the local walk-in centre on one day) and “Date” (the dates of attendance). These variables are examined for patterns that might indicate self-correlation. Any correlations within the dataset could assist in predictive analytics. In addition to dataset self-correlation, the changing number of patient visits per day is examined for seasonal changes and for changing groups of averages (moving averages).
After the dataset has been examined for patterns within the data, a baseline predictive analytics model is created to compare with the other statistical models implemented by this document. The baseline model serves as the minimum level of accuracy required of the other regression models. The naive forecast was chosen as the baseline model because it embodies the simplest use of previous patient visit volumes to predict the next day’s patient visit total: the most recent observed value is carried forward.
To find the most accurate predictions of the next 7 days of patient visits, several other statistical modeling methods were attempted. The first of these alternate methods is Linear Regression Modeling, which presumes a linear relationship in the time series data. Then, Holt-Winters Seasonal Decomposition is applied in an attempt to increase the accuracy of prediction by factoring in seasonal changes in the 4 years of patient visits. Lastly, AutoRegressive Integrated Moving Average (ARIMA) analysis is applied to estimate future data, with autocorrelation of the dataset and moving averages within the time series as considerations.
The first stage of the analysis for patterns within the data is an examination of autocorrelation and partial autocorrelation (acf/pacf). Autocorrelation correlates a time series with lagged copies of itself, to find whether the previous states (lagged observations) influence the current state. If the autocorrelation at a given lag crosses the dashed blue significance line, that lag is significantly correlated with the current series. For a stationary time series the autocorrelation falls to zero quickly; for a non-stationary series it falls off gradually. Partial autocorrelation is also the correlation of the time series with a lag of itself, but with the linear dependence on the intervening lags removed.
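For reference, the sample autocorrelation at lag $k$ displayed in the acf plot is the standard textbook estimate (not specific to this dataset):

$$ r_k = \frac{\sum_{t=1}^{T-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2} $$

where $y_t$ is the number of patient visits on day $t$, $\bar{y}$ is the series mean, and $T$ is the length of the series.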
The acf/pacf plots show a high level of autocorrelation in both the Date column and the “Number.of.patients” column. The pacf of the Date column also shows that the first lag carries essentially all of the correlation with subsequent dates, consistent with the dates forming a linear sequence.
## Figure 4. Autocorrelation (acf) Plots
## Figure 5. Partial Autocorrelation (pacf) Plots
After the detection of autocorrelation in the dataset’s variables, the data is examined for moving averages. The graph of the 20-day moving average of the patient visit data demonstrates extreme fluctuations of the average values within 20-day blocks of the time series.
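As a point of reference, the centered moving average of order $k$ (the form computed by forecast::ma() in the appendix code) replaces each observation with the average of a window around it; for odd $k$,

$$ \widehat{T}_t = \frac{1}{k} \sum_{j=-m}^{m} y_{t+j}, \qquad m = \frac{k-1}{2} $$

and for an even order such as 20, ma() centres the result by averaging two adjacent moving averages.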
The first extrapolation model is created as a baseline model for comparison with the other predictive analytics models in this document. The baseline model chosen is the naive forecast of the dataset. The naive method is a basic way of using prior patient visit volumes to determine future ones: the most recent observed value is simply carried forward as the forecast for every future day.
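Formally, the naive forecast for any horizon $h$ is just the last observed value,

$$ \hat{y}_{T+h \mid T} = y_T $$

which is why the 7-day baseline prediction in Table 5 is a single repeated value.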
A training set consisting of 70% of the time series observations is used to fit each model, and the remaining 30% of the time series is used as the test set to verify model accuracy.
Mean Absolute Percentage Error (MAPE) and Mean Squared Error (MSE) are used to calculate the accuracy of the naive forecasts, and also of the predictions from the other statistical regression methods.
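These error measures, matching the mape() and mse() helper functions defined in the appendix, are

$$ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \qquad \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$

where the $y_i$ are the observed patient counts and the $\hat{y}_i$ the predicted counts.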
Table 5. Next 7 days Prediction - Naive Forecast

| Forecast |
|---|
| 44 |
| 44 |
| 44 |
| 44 |
| 44 |
| 44 |
| 44 |
Table 6. Verification of the 70%/30% train/test split

| nrow(bayes_train) | nrow(bayes_test) |
|---|---|
| 1025 | 436 |
Table 7. The Mean Squared Error (MSE) accuracy of the naive forecast

| MSE |
|---|
| 172.9954 |
Table 8. The Mean Absolute Percentage Error (MAPE) accuracy of the naive forecast

| MAPE |
|---|
| 21.13426 |
The second extrapolation model is basic linear regression of the office visit time series. This method of predictive analytics presumes a basic linear relationship between time and patient visits, in order to determine the coefficients of regression and predict future values. After the linear modelling, MAPE/MSE accuracy is checked and a quantile-quantile (Q-Q) plot is created to examine the dataset’s data points for outliers.
## Table 9. Summary of the linear regression model
##
## Call:
## lm(formula = Number.of.patients ~ ., data = bayes_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.215 -9.082 -1.465 7.219 48.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.148e+01 1.596e+01 -5.106 3.93e-07 ***
## Date 7.433e-03 9.249e-04 8.037 2.52e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.54 on 1023 degrees of freedom
## Multiple R-squared: 0.05939, Adjusted R-squared: 0.05848
## F-statistic: 64.6 on 1 and 1023 DF, p-value: 2.52e-15
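From the coefficients in Table 9, the fitted regression line is approximately

$$ \widehat{\text{Number.of.patients}} \approx -81.48 + 0.00743 \times \text{Date} $$

where Date enters the model as R’s internal numeric representation (days since 1970-01-01), so the estimated trend is an increase of roughly 0.0074 patients per day, or about 2.7 patients per year.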
Table 10. Next 7 days Prediction - Linear Modeling

| Forecast |
|---|
| 40.73862 |
| 40.74682 |
| 40.75502 |
| 40.76322 |
| 40.77142 |
| 40.77962 |
| 40.78782 |
Table 11. The Mean Squared Error (MSE) accuracy of linear regression

| MSE |
|---|
| 149.4329 |
Table 12. The Mean Absolute Percentage Error (MAPE) accuracy of linear regression

| MAPE |
|---|
| 21.8036 |
## Figure 8. Q-Q Plots of Linear Regression
The second stage of analysis of the office visit data, seasonal decomposition, is applied to reveal the seasonal trends within the time series data. The decomposition is applied via the third extrapolation model, Holt-Winters Seasonal Decomposition. Holt-Winters relies on seasonal trends to predict future time series values. Because of the extreme seasonal variation in the patient visits, simple exponential smoothing (SES, which smooths extreme variations in data values) and Holt’s two-parameter model (Holt Linear, which applies linear trend analysis for forecasting) were presumed to lack the capability to analyze the patient visit data that Holt-Winters provides.
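For reference, the additive Holt-Winters model (the default form fitted by R’s HoltWinters() function) maintains a level $\ell_t$, a trend $b_t$, and a seasonal component $s_t$ with period $m$:

$$ \begin{aligned} \ell_t &= \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\ b_t &= \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} \\ s_t &= \gamma (y_t - \ell_t) + (1 - \gamma) s_{t-m} \\ \hat{y}_{t+h} &= \ell_t + h b_t + s_{t+h-m} \end{aligned} $$

where the final line gives the $h$-step-ahead forecast for horizons up to one season.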
Table 13. Next 7 days Prediction - Holt-Winters

| Forecast |
|---|
| 43.41539 |
| 43.06111 |
| 43.02394 |
| 42.24325 |
| 40.74508 |
| 41.03956 |
| 40.27167 |
Table 14. The Mean Squared Error (MSE) accuracy of Holt-Winters Filtering

| MSE |
|---|
| 346.1506 |
Table 15. The Mean Absolute Percentage Error (MAPE) accuracy of Holt-Winters Filtering

| MAPE |
|---|
| 33.11214 |
The fourth and final extrapolation model is AutoRegressive Integrated Moving Average (ARIMA). ARIMA presumes time series data is affected by periodic averages that are traceable over time. Autoregression (the assumption that the daily number of patient visits can be regressed on the previous days’ visit values) is also presumed in ARIMA modeling. Because the document’s dataset has displayed autocorrelation, partial autocorrelation, and moving averages, ARIMA seems like a good candidate for predictive analytics.
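In general, an ARIMA($p$, $d$, $q$) model combines $p$ autoregressive terms and $q$ moving-average (lagged error) terms applied to the $d$-times differenced series. The ARIMA(5,0,1) model with non-zero mean that auto.arima() selects here (see Table 17) has the form

$$ y_t - \mu = \sum_{i=1}^{5} \phi_i (y_{t-i} - \mu) + \theta_1 \varepsilon_{t-1} + \varepsilon_t $$

with the $\phi_i$, $\theta_1$, and $\mu$ estimates reported in the model summary.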
Table 16. Next 7 days Prediction - ARIMA

| Forecast |
|---|
| 39.19743 |
| 39.64116 |
| 39.21236 |
| 39.38330 |
| 39.98562 |
| 40.05228 |
| 40.14752 |
## Table 17. Summary of the ARIMA model
## Series: bayes_ts
## ARIMA(5,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 mean
## 0.8268 -0.0168 0.0756 -0.0018 0.0823 -0.6764 45.2867
## s.e. 0.0576 0.0351 0.0345 0.0349 0.0332 0.0528 2.2551
##
## sigma^2 = 85.56: log likelihood = -5316.68
## AIC=10649.35 AICc=10649.45 BIC=10691.64
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01125438 9.227413 7.287569 -4.395787 17.34164 0.4546427
## ACF1
## Training set -0.001095945
Table 18. The Mean Squared Error (MSE) accuracy of ARIMA

| MSE |
|---|
| 214.7216 |
Table 19. The Mean Absolute Percentage Error (MAPE) accuracy of ARIMA

| MAPE |
|---|
| 21.37096 |
Table 20. Next 7 Days Forecasts

| | Naive Forecast | Linear Modeling | Holt-Winters | ARIMA |
|---|---|---|---|---|
| April 1, 2019 | 44 | 40.73862 | 43.41539 | 39.19743 |
| April 2, 2019 | 44 | 40.74682 | 43.06111 | 39.64116 |
| April 3, 2019 | 44 | 40.75502 | 43.02394 | 39.21236 |
| April 4, 2019 | 44 | 40.76322 | 42.24325 | 39.38330 |
| April 5, 2019 | 44 | 40.77142 | 40.74508 | 39.98562 |
| April 6, 2019 | 44 | 40.77962 | 41.03956 | 40.05228 |
| April 7, 2019 | 44 | 40.78782 | 40.27167 | 40.14752 |
The naive forecast is the simplest of the models considered: it carries the last observed value forward as the prediction for every future day. It works with only a few data points or with many, whether the values are continuous or discrete. It is not processing-time intensive, and is therefore useful for real-time prediction. The naive method loses accuracy when the future data follow trends or patterns that a flat, last-value forecast cannot track.
Linear Regression performs well if the variables in the dataset have a linear relationship with each other. Predictive analytics with linear modeling is fast enough for real-time analytics. If the data observations are not linearly related, linear regression loses accuracy.
Holt-Winters Seasonal Decomposition presumes seasonal changes in time series data in order to forecast future values. If the fluctuations in the time series are not seasonally driven, Holt-Winters loses effectiveness.
The ARIMA method of time series modeling is effective with data where the sequence of values depends on past values, or where the averages of subsets of the time series follow a predictable pattern. However, ARIMA is less effective with a limited amount of time series data.
The above analysis has demonstrated that the extreme seasonal fluctuation in the office visit data requires a compensation strategy before accurate prediction of future office visit volumes is possible. Algorithms that regress on the past values of the time series, without considering seasonal fluctuation, performed best when predicting future values.
To summarize the 7-day forecasts, the baseline naive method produced a MAPE of 21.13%, linear regression 21.80%, Holt-Winters 33.11%, and ARIMA 21.37% (Tables 8, 12, 15, and 19).
The baseline naive forecast, interestingly, had the highest accuracy of all the models considered. Basic linear regression modeling of the past office visit data closely matched the accuracy of the naive method. Holt-Winters seasonal decomposition (which factors in the extreme seasonal variation of the data) demonstrated the effectiveness of disregarding seasonal changes by producing more than 10 percentage points more error than the other methods. ARIMA regression (which factors in autocorrelation and moving averages) performed about as well as basic regression analysis at predicting future levels of patient visits.
Therefore, the models that provide a realistic fit to the data and believable projections are the baseline naive, basic linear regression, and ARIMA models. Considering that autocorrelation and moving averages were found in the dataset, ARIMA gives the greatest confidence for analysis of the medical centre office visit time series data, without overfitting or underfitting.
The greatest difficulty in fitting the models used for this project (naive, linear, seasonal decomposition, and ARIMA) was relying on basic univariate time series data to predict an outcome that very possibly involves many different influencing factors. Additional data categories related to the office visits could possibly reveal the hidden factors. A basic data visualization examination of the data suggested that seasonal changes had the greatest effect. However, subsequent analysis and the application of the Holt-Winters method of seasonal time series analysis demonstrated that the greatest factors affecting the prediction of future office visit levels were independent of seasonal changes. These hidden factors in daily patient visit levels were best captured by methods that presume the patient visit level on any given day is affected by the previous days’ office visit levels.
# Set the working directory to the folder with the data files
# i.e. setwd("C:/Users/...")
# Load R packages required for the project's algorithms
library(readxl)
library(tidyverse)
library(dplyr)
library(caret)
library(caTools)
library(lpSolve)
library(xts)
library(zoo)
library(tseries)
library(fpp2)
library(TTR)
library(forecast)
library(ggplot2)
library(knitr)
# Load project dataset
office_data <- read.csv("The Data.csv")
# Convert character format dates to date format
office_data$Date <- as.Date(office_data$Date, format="%d/%m/%Y")
# View the first six rows of the "office_data" dataset
kable(head(office_data), caption = "Table 1. The first six rows of the dataset")
# Summary Statistics
kable(data.frame(mean=mean(office_data$Number.of.patients),
median=median(office_data$Number.of.patients),
max=max(office_data$Number.of.patients),
min=min(office_data$Number.of.patients),
standard_deviation=sd(office_data$Number.of.patients)), caption = "Table 2. Number of Patient Visits - Basic Statistics")
kable(office_data[office_data$Number.of.patients==max(office_data$Number.of.patients),], caption = "Table 3. The days of maximum office visits")
kable(office_data[office_data$Number.of.patients==min(office_data$Number.of.patients),], caption = "Table 4. The day of minimum office visits")
par(mfrow = c(1,3))
# View line plot of the number of patients over time
ggplot(office_data, aes(Date, Number.of.patients)) +
geom_line() +
ggtitle("Figure 1. Line Plot of Patient Visits Over Time")
# View scatter plot of the number of patients over time
ggplot(office_data, aes(Date, Number.of.patients)) +
geom_point() +
ggtitle("Figure 2. Scatter Plot of Patient Visits Over Time")
# Create a time series formatted dataset
office_data_ts <- ts(office_data, frequency = 12)
# Plot the Time Series dataset
plot.ts(office_data_ts, col="blue", main="Figure 3. Time Series of Patient Volume")
par(mfrow = c(1,1))
par(mfrow = c(1,2))
cat("Figure 4. Autocorrelation (acf) Plots")
acfRes <- acf(office_data_ts)
cat("Figure 5. Partial Autocorrelation (pacf) Plots")
pacfRes <- pacf(office_data_ts)
par(mfrow = c(1,1))
# Format a new dataframe with moving average data
moving_average_plot <- office_data
moving_average_plot$moving_average <- ma(office_data$Number.of.patients, 20)
# Plot Office visits with 20-day moving average
ggplot(moving_average_plot, aes(Date, Number.of.patients)) +
geom_line() +
geom_line(aes(Date, moving_average, color=moving_average)) +
ggtitle("Figure 6. 20-Day Moving Average of Patient Visits")
# Set the random number generator seed so results are reproducible
set.seed(101)
# 70/30 train/test split of the dataset (random split via caTools::sample.split)
sample <- sample.split(office_data$Number.of.patients,
SplitRatio = .70)
bayes_train <- subset(office_data, sample == TRUE)
bayes_test <- subset(office_data, sample == FALSE)
# Mean Absolute Percentage Error accuracy testing function
mape <- function(actual, pred){
mape <- mean(abs((actual - pred)/actual))*100
return(mape)
}
# Mean Square Error accuracy testing function
mse <- function(actual, pred){
mse <- mean((actual - pred)^2)
return(mse)
}
# Format a daily time series (frequency 365) from the training data
bayes_ts <- ts(bayes_train$Number.of.patients, start = c(2015, 4),
end = c(2019, 3), frequency = 365)
# Fit the naive forecast model (the last observed value is carried forward)
naive_mod <- naive(bayes_ts, h = 365)
# Extract the naive point forecasts and prediction intervals
naive_bayes_prediction <- predict(naive_mod, bayes_test)
# The naive point forecast is constant, so repeat the midpoint of the
# first prediction interval for each of the next 7 days
Next7days_nb <- rep(mean(c(naive_bayes_prediction$upper[1],
                           naive_bayes_prediction$lower[1])), 7)
kable(Next7days_nb, caption = "Table 5. Next 7 days Prediction - Naive Forecast")
kable(data.frame(nrow(bayes_train),nrow(bayes_test)), caption = "Table 6. Verification of the 70%/30% train/test split")
kable(mse(bayes_test$Number.of.patients, mean(naive_mod$mean)), caption = "Table 7. The Mean Squared Error (MSE) accuracy of the naive forecast")
kable(mape(bayes_test$Number.of.patients, mean(naive_mod$mean)), caption = "Table 8. The Mean Absolute Percentage Error (MAPE) accuracy of the naive forecast")
# Plot the naive forecast of future values
plot(naive_bayes_prediction, main = "Figure 7. Naive Forecasts")
# Create a linear regression model from the training data
linear_model <- lm(Number.of.patients ~ ., data = bayes_train)
# Predict the test set values
linear_predictions <- predict(linear_model, bayes_test)
# Linear Model of entire time series
linear_model_7days <- lm(Number.of.patients ~ ., data = office_data)
cat("Table 9. Summary of the linear regression model")
summary(linear_model)
# Fitted values from the full-series linear model; the first 7 are
# reported as the 7-day forecast in Table 10
Next7days_lm <- predict(linear_model_7days)
kable(Next7days_lm[1:7], caption = "Table 10. Next 7 days Prediction - Linear Modeling")
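# Note: predict(linear_model_7days) with no newdata returns fitted values
# for the observed dates, so Next7days_lm[1:7] are the fitted values for
# the first week of the series. A sketch of a true 7-day-ahead prediction
# (hypothetical variable name future_dates) would pass the future dates
# explicitly instead:
# future_dates <- data.frame(Date = seq(as.Date("2019-04-01"),
#                                       by = "day", length.out = 7))
# predict(linear_model_7days, newdata = future_dates)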
kable(mse(bayes_test$Number.of.patients, linear_predictions), caption = "Table 11. The Mean Squared Error (MSE) accuracy of linear regression")
kable(mape(bayes_test$Number.of.patients, linear_predictions), caption ="Table 12. The Mean Absolute Percentage Error (MAPE) accuracy of linear regression")
cat("Figure 8. Q-Q Plots of Linear Regression")
par(mfrow = c(2,2))
plot(linear_model)
par(mfrow = c(1,1))
# Seasonal decomposition
fit <- stl(office_data_ts[,'Number.of.patients'], "periodic")
# Decomposition Extrapolation Model, 7 Day Forecasting of office visits
hw_fit <- HoltWinters(office_data_ts[,'Number.of.patients'])
# Predict future values with Holt-Winters Filtering
hw_pred <- forecast(hw_fit, h = 7)
Next7days_hw <- hw_pred$mean[1:7]
kable(hw_pred$mean[1:7], caption = "Table 13. Next 7 days Prediction - Holt-Winters")
kable(mse(bayes_test$Number.of.patients, hw_pred$x), caption = "Table 14. The Mean Squared Error (MSE) accuracy of Holt-Winters Filtering")
kable(mape(bayes_test$Number.of.patients, hw_pred$x), caption = "Table 15. The Mean Absolute Percentage Error (MAPE) accuracy of Holt-Winters Filtering")
plot(hw_fit, main="Figure 9. Seasonal Decomposition of Patient Volume")
monthplot(fit, col="green", main="Figure 10. Month Plot of Patient Volume, Seasonal Adjustment")
# Plot the Holt-Winters Extrapolation Model
plot(hw_pred, main="Figure 11. Holt-Winters Extrapolation")
lines(hw_pred$fitted, lty=2, col="blue")
# Create the ARIMA regression model
arima_model <- auto.arima(bayes_ts)
# Forecast the next 7 days of office visit values
fore_arima <- forecast::forecast(arima_model, h = 7)
# Format the ARIMA forecasts as a dataframe
df_arima <- as.data.frame(fore_arima)
Next7days_ar <- fore_arima$mean
kable(fore_arima$mean, caption = "Table 16. Next 7 days Prediction - ARIMA")
cat("Table 17. Summary of the ARIMA model")
summary(arima_model)
kable(mse(bayes_test$Number.of.patients, df_arima$`Point Forecast`), caption = "Table 18. The Mean Squared Error (MSE) accuracy of ARIMA")
kable(mape(bayes_test$Number.of.patients, df_arima$`Point Forecast`), caption = "Table 19. The Mean Absolute Percentage Error (MAPE) accuracy of ARIMA")
plot(fore_arima, main = "Figure 12. Forecasts from ARIMA(5,0,1) with non-zero mean")
# Format a matrix of results
results_table <- matrix(nrow = 7, ncol = 4)
rownames(results_table) <- c("April 1, 2019", "April 2, 2019",
"April 3, 2019","April 4, 2019",
"April 5, 2019","April 6, 2019",
"April 7, 2019")
colnames(results_table) <- c("Naive Forecast", "Linear Modeling",
                             "Holt-Winters", "ARIMA")
# Fill results matrix with forecast results
results_table[,1] <- Next7days_nb
results_table[,2] <- Next7days_lm[1:7]
results_table[,3] <- Next7days_hw
results_table[,4] <- Next7days_ar
kable(results_table, caption = "Table 20. Next 7 Days Forecasts")